Detecting Pneumonia in Chest X-Rays Using Various Deep Neural Nets

Simoni Maniar
May 13, 2021


Figure 1: Various Chest X-Rays showing (a) Normal Lungs, (b) Lungs with Bacterial Pneumonia, (c) Lungs with Viral Pneumonia, and (d) Lungs with COVID-19 Pneumonia [1]

By: William Avery, Ethan Golla, Matthew Jiang, Simoni Maniar

ABSTRACT

Diagnosing pneumonia from clinical symptoms alone is unreliable, so chest X-rays are often used to confirm a diagnosis. However, chest X-rays require a trained radiologist to interpret, which can be a limiting factor in diagnosing pneumonia this way. Our project assesses whether some common machine learning image processing techniques can be used to assist radiologists in detecting pneumonia in chest X-rays. We focus on convolutional neural networks, and find that machine learning techniques can approach the accuracy of radiologists in this specific application.

INTRODUCTION

X-rays are typically used by radiologists to determine whether a patient has pneumonia, because X-rays of pneumonia patients reveal tiny white spots (known as infiltrates) in the lungs or fluid surrounding the lungs, which make the infection easier to identify.

A pneumonia-detecting model would be extremely useful for the medical industry because there is currently too much data for the limited number of radiologists to process. There is a large burnout problem within healthcare, and the demand for radiologists far exceeds the supply. If a tool were developed that matched the performance of radiologists in identifying pneumonia, healthcare as a whole would be more efficient and accurate. Furthermore, if this proves successful, similar models could be applied to other diagnoses and images requiring the consultation of a radiologist, such as MRIs or CT scans.

Thus, the task is as follows: given a dataset of chest X-ray images, build a binary classifier that detects the presence of pneumonia in an image. The primary metric used to measure model performance is sensitivity; we also report the AUC and accuracy of our models. Sensitivity and specificity are metrics often used in medical image processing. Sensitivity is the true positive rate (true positives / total positives), which measures how well a model returns positive for all positive cases. Specificity is the true negative rate (true negatives / total negatives), which measures how well the model identifies negative cases. Maximizing both is preferred; however, higher sensitivity typically comes at the cost of lower specificity and vice versa. With pneumonia classification, failing to flag a positive case can be detrimental to the patient, so we prioritize maximizing the model's sensitivity.
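To make the two metrics concrete, here is a small Python sketch computing them from confusion-matrix counts; the counts below are hypothetical, not our model's.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """True positive rate and true negative rate from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of actual positives flagged
    specificity = tn / (tn + fp)  # fraction of actual negatives cleared
    return sensitivity, specificity

# Hypothetical counts for illustration only
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.90, 0.80
```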

For our approach, we decided to build a convolutional neural network (CNN) because CNNs are well documented and popular for image classification. Choosing an established image classification technique allowed us to implement the model and tune the hyperparameters more quickly and easily than a newer technique would, letting us reach a high level of performance more efficiently. We also considered alternative approaches, such as capsule neural networks and transfer learning.

EXPLORATORY DATA ANALYSIS

A publicly available Kaggle dataset of chest X-rays, from a study by Daniel Kermany, Kang Zhang, and Michael Goldbaum, was used to train the model [2]. The training set consists of 1341 normal chest X-rays and 3875 chest X-rays containing pneumonia. The testing set consists of 234 normal chest X-rays and 390 X-rays containing pneumonia. Lastly, the validation set consists of 8 normal chest X-rays and 8 X-rays containing pneumonia [2]. Examples of a normal chest X-ray and an X-ray containing pneumonia are provided in Figure 2 below.

Figure 2. Example images from the dataset used for model training.

The images in the dataset include both anterior (from the front) and posterior (from the back) X-rays from the Guangzhou Women and Children's Medical Center. The images were obtained during routine clinical care from pediatric patients between one and five years of age, meaning the sample is limited to children [2]. All images were first screened to remove low-quality scans. Two medical experts then graded the remaining scans and cleared them for model training, and the evaluation set was additionally checked by a third expert [2].

The dataset we chose influenced a couple of modeling decisions. First, we used stratified k-fold cross-validation to measure model performance, because the dataset contains significantly more positive pneumonia images than negative ones and we wanted every training and validation split to preserve that class balance rather than skew it further; a sketch of the split appears below. Second, since the dataset is relatively small, we used data augmentation techniques to increase the diversity of the data. The specific techniques are covered later in the article.
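A minimal sketch of that stratified split using scikit-learn's StratifiedKFold; the arrays here are random stand-ins for our images and labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data: X holds images, y holds binary labels (1 = pneumonia)
X = np.random.rand(100, 160, 160, 1)
y = np.random.randint(0, 2, size=100)

# Each fold preserves the overall positive/negative ratio of the dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    print(f"fold {fold}: {y_train.mean():.2f} positive in train, {y_val.mean():.2f} in val")
```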

INTRODUCTION TO CONVOLUTIONAL NEURAL NETWORKS (CNNS)

CNNs are a widely used model architecture for computer vision tasks such as image classification. They also have well-documented libraries such as PyTorch, Keras, and TensorFlow, which make them easier for beginners like us to navigate. In general, CNNs are especially useful for image processing because they keep track of spatial dependencies and identify image features. At a high level, as an image passes through the layers of a CNN, the network captures colors and edges, then progressively larger elements like shapes, until it identifies the object. An example of the arrangement of a CNN's layers can be seen in Figure 3. Additionally, CNNs can identify relevant features anywhere in the image, which is necessary for the pneumonia detection model since pneumonia can appear in multiple places in a lung.

Figure 3. High-level overview of Convolutional Neural Networks, including Convolutional, Pooling, and Fully-Connected Layers [3]

CNNs are named for their extensive convolutional and pooling layers, which allow a large image (e.g., 7680×4320, the size of an 8K image) to be reduced to a much smaller matrix without losing its spatial dependencies, decreasing the total computational power required to train the network. The convolutional, pooling, and final fully-connected layers are discussed in further detail in the following subsections.

All of the models we tested build upon convolutional neural networks in various forms.

Convolutional Layers

A convolutional layer groups similar input values of an image together to help identify particular features, and it is typically the first layer inside a CNN. The layer requires an N×N filter (typically 3×3) that shifts across the image. After every shift, the dot product between the filter and the area of the image it covers is computed. Figure 4 shows how a filter is used in a convolutional layer.

Figure 4. Convolutional Layer example showing effect of applying a filter. [4]

The filter begins operating at the top left of the input image. The dot product between the filter and the portion of the image that the filter covers is computed, resulting in a value of 16 which is placed in the output array, also known as the feature map. The filter is then shifted and the process is repeated to find the dot product for the rest of the feature map.

After the feature map is generated, a ReLU activation function is applied to introduce nonlinearity to the model. This is necessary because the dot product is a linear operation, and neural networks require nonlinear activation functions. The layer is tuned by adjusting its hyperparameters, which include the number of filters, the stride of the filter (how far it shifts each step), and the amount of padding (how image edges are handled), as sketched below.
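In Keras, these hyperparameters map directly onto the Conv2D layer. A small illustrative sketch (the specific values are ours, not the final model's):

```python
import tensorflow as tf
from tensorflow.keras import layers

# 32 filters of size 3x3, stride 1, zero-padded so the feature map
# keeps the same height and width as the input
conv = layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1),
                     padding="same", activation="relu")

x = tf.random.normal((1, 160, 160, 1))  # dummy grayscale image batch
print(conv(x).shape)                    # (1, 160, 160, 32): one feature map per filter
```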

Pooling Layers

Pooling layers are similar to, but less powerful than, convolutional layers, and are used to reduce complexity and improve efficiency. Pooling layers also shift an N×N filter across the input. However, pooling filters have no weights; instead, an aggregate function is applied to the area the filter covers. The two popular aggregate functions are max pooling and average pooling, shown in Figure 5.

Figure 5. Example of Max and Average Pooling [5]

Max pooling finds the maximum value over the area that the filter covers. In contrast, average pooling finds the average value of the area.
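A quick numerical demonstration of the difference on a toy 4×4 input:

```python
import numpy as np
import tensorflow as tf

# Toy 4x4 single-channel image, shaped (batch, height, width, channels)
x = tf.constant(np.arange(16, dtype="float32").reshape(1, 4, 4, 1))

max_pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(x)      # largest value per 2x2 window
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=2)(x)  # mean value per 2x2 window

print(tf.squeeze(max_pooled).numpy())  # [[ 5.  7.] [13. 15.]]
print(tf.squeeze(avg_pooled).numpy())  # [[ 2.5  4.5] [10.5 12.5]]
```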

Fully-Connected Layers

A fully-connected layer is the standard dense layer found in multilayer perceptrons. There may be several fully-connected layers at the end of a CNN, but for classification problems the last layer uses either a sigmoid or a softmax activation function. This layer outputs probabilities that allow the model to decide which class to choose. Detecting the presence of pneumonia in chest X-rays is a binary classification problem, so a sigmoid activation function is used in the final layer of our model architecture.
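As a sketch, the two kinds of classification heads look like this in Keras (the class count of 4 is just for illustration):

```python
from tensorflow.keras import layers

# Binary classification (our case): one unit whose sigmoid output is P(pneumonia)
binary_head = layers.Dense(1, activation="sigmoid")

# Multi-class alternative: one unit per class, softmax yields a probability distribution
multiclass_head = layers.Dense(4, activation="softmax")
```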

APPLYING CONVOLUTIONAL NEURAL NETWORK TO DATASET

Data Augmentation / Preprocessing

Before training our model, we used data augmentation to increase the diversity of the data. Data augmentation includes transformations such as translating, rotating, flipping, and zooming on the images. Data augmentation does not increase the sample size. Instead, each training image is transformed differently each epoch during training. This makes the classifier more robust to unseen data. The parameters for the input image transformations to the model can be seen in Figure 6. The application of the transformations to the model is shown in Figure 7.

Figure 6. Showing potential transformations such as Rotations, Zoom, Shift, Flips
Figure 7. Input to model is transformed input images
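Since Figure 6 is a screenshot, here is a sketch of the kind of Keras ImageDataGenerator configuration it describes; the specific ranges below are assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each epoch, every training image is randomly transformed within these ranges
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # normalize pixel values to [0, 1]
    rotation_range=20,       # random rotations up to 20 degrees
    zoom_range=0.2,          # random zoom in/out up to 20%
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # random left-right flips
)
```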

An example of the transformed images can be seen in Figure 8. The transformed images appear vastly different from their original forms shown earlier in Figure 2.

Figure 8. Example of Transformed Images

Model Architecture

The model architecture was heavily based upon previous Kaggle submissions and is shown in Figures 9 and 10 below. The model consists of 22 total layers: 5 convolutional layers, 5 max pooling layers, and 2 fully-connected layers, written in Keras as Dense(). The final fully-connected layer uses a sigmoid activation function since pneumonia classification is a binary problem.

There are additional layers in the model: Flatten(), Dropout(), and BatchNormalization(). The single flatten layer reshapes its input (initially 5x5x256) into a single array of 6400 values, connecting the outputs of the last max pooling layer to the fully-connected layers. The dropout layers randomly set inputs to zero based on the rate parameter; a dropout layer with a rate of 0.1 will randomly zero 10% of its inputs. This helps prevent overfitting and thus ultimately improves model performance. Lastly, the batch normalization layers scale their inputs to maintain a mean of 0 and a variance of 1, which allows a much higher learning rate to be used and thus reduces total training time. Dropout and batch normalization layers were added after each convolutional layer for these benefits.

Figure 9. Code Snippet showing information about each layer, including filter size and activation functions
Figure 10. Outputs of model.summary() function. This function provides a detailed description of each layer in the model, including the parameters (trainable weights) and the output shape.
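Because Figures 9 and 10 are screenshots, the sketch below reconstructs an architecture consistent with the description above. The filter counts, dropout rates, and 160x160 input size are our assumptions, chosen so that the final pooling output is 5x5x256 and flattens to 6400 values:

```python
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Input(shape=(160, 160, 1)))

# Five conv blocks; each convolution is followed by batch normalization,
# max pooling, and dropout, with the filter count doubling each block
for filters in (16, 32, 64, 128, 256):
    model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Dropout(0.1))

model.add(layers.Flatten())                       # 5x5x256 -> 6400 values
model.add(layers.Dense(128, activation="relu"))   # first fully-connected layer
model.add(layers.Dense(1, activation="sigmoid"))  # binary pneumonia output
```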

Stochastic gradient descent (SGD) was used as the optimizer. The model trains for a fixed number of epochs; during training, the learning rate is multiplied by 0.3 whenever the validation accuracy has not improved in the last 2 epochs. Decreasing the learning rate over time allows for fine-tuning the model, and the fixed epoch budget helps us avoid overfitting. Since model performance is reasonable, we do not believe we need to add layers or increase the capacity of the model, steps that could be taken to combat underfitting. A sketch of this training setup follows.
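This setup maps onto Keras's built-in ReduceLROnPlateau callback. A sketch, where the initial learning rate, epoch count, and generator names are assumptions:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01),  # initial rate is an assumption
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Cut the learning rate to 30% if validation accuracy stalls for 2 epochs
reduce_lr = ReduceLROnPlateau(monitor="val_accuracy", factor=0.3,
                              patience=2, verbose=1)

# train_generator / val_generator are placeholders for our augmented data pipelines
history = model.fit(train_generator, validation_data=val_generator,
                    epochs=25, callbacks=[reduce_lr])  # fixed epoch budget
```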

Model Performance with Metrics

Model accuracy was evaluated using 5-fold stratified cross-validation on the same model architecture. The resulting accuracies are given in the box-and-whisker plot in Figure 11 below. The best model accuracy was 91.03%; that model was used to calculate the following metrics.

Figure 11. Box and Whisker Plot of Model Accuracy
Figure 12. ROC Curve for best performing model

The ROC curve is provided in Figure 12 above. The AUC is 0.96. As this is an aggregate metric for model performance, we can conclude that our model performed reasonably well.

Figure 13. Confusion Matrix for best performing model. 0 represents positive pneumonia cases, while 1 represents negative pneumonia cases.

Lastly, we generated the confusion matrix shown in Figure 13, which can be used to calculate the sensitivity and specificity of our model: 91.1% and 78.8%, respectively. This further adds to our confidence that the model performed well. We also compared these numbers against the sensitivity and specificity achieved by radiologists detecting pneumonia [6][7]. Notably, radiologists were able to detect pneumonia from chest X-rays with a sensitivity of 93.1%, while doctors detected pneumonia from clinical symptoms alone with a sensitivity of only 47–69%. This gives us reasonable confidence that our model is comparable to radiologists, and could be useful to doctors who may not have access to a radiologist due to demand.

Model Performance with iNNvestigate

To better understand what the model is doing, we used the iNNvestigate library, which provides numerous analysis methods for complicated neural networks [8]. These methods offer valuable insight into how the model interprets the dataset, which is useful for medical experts.

Some reference patient images can be seen in Figure 14. Each image is labeled with its true class, either normal or pneumonia, and the prediction calculated by the model. A prediction of 0 indicates pneumonia and a prediction of 1 indicates normal.

Figure 14. Subset of Original Images

We first created an iNNvestigate guided backpropagation analyzer for the model and applied it to the dataset. The idea behind guided backpropagation is to zero out the negative gradients during the backward pass; this is useful because we are interested in what each neuron is detecting, not what it is suppressing. Figure 15 shows the results of applying the analyzer to the original images. Noticeably, the rib cage and lungs appear particularly important for the detection of pneumonia. We are not medical experts; however, it is interesting that the normal images produce a clear lung outline while the images of the pneumonia patients have no noticeable lung outline.

Figure 15. Guided Backpropagation Images

We then created an iNNvestigate deep Taylor analyzer, which is based on Taylor expansions: for each neuron, a root point is used to recursively estimate that neuron's attribution. Figure 16 shows the results of applying the analyzer to the original images. Again, we are not medical experts, but it appears that any image differing from the appearance of the top row gets labeled as pneumonia. Since the bottom-left image is closest in appearance to the healthy images, the model is less certain that pneumonia is present. A sketch of how both analyzers are constructed follows Figure 16.

Figure 16. Deep Taylor Images
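Both analyzers follow the same two-step pattern in the iNNvestigate API. In this sketch, model is the trained Keras model and images is a batch of preprocessed X-rays:

```python
import innvestigate

# Create analyzers for the trained Keras model (for softmax outputs, one would
# typically strip the final activation first with innvestigate.utils.model_wo_softmax)
gb_analyzer = innvestigate.create_analyzer("guided_backprop", model)
dt_analyzer = innvestigate.create_analyzer("deep_taylor", model)

# Each analyzer returns one relevance map per input image, same shape as the input
gb_maps = gb_analyzer.analyze(images)   # positive-gradient-only saliency
dt_maps = dt_analyzer.analyze(images)   # recursive root-point attributions
```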

CONVOLUTIONAL NEURAL NETWORK WITH TRANSFER LEARNING

The next model we will look at is a CNN that uses a concept known as transfer learning. The idea is to transfer the weights of a pretrained model to a new problem. The early layers of a CNN are found to detect lines and shapes, the basic building blocks of any image, so why waste time and compute power training them from scratch [9]?

The following transfer learning approach was implemented using PyTorch and ResNet50 following a Kaggle tutorial by Lau Teyang [9]. ResNet50 is a deep residual network, with 50 layers, that is pre-trained on the ImageNet dataset [10].

Similar to the previous CNN model, data preprocessing involves a data augmentation step in which images are resized, cropped, flipped, and rotated differently every epoch. From there, setting up the model architecture is quite simple. As seen in Figure 17, following Teyang's tutorial, we load ResNet50, freeze all layers except the final one, and replace that final layer so it outputs two classes instead of the one thousand ImageNet classes [9].

Figure 17: Designing the model architecture [9]
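The gist of Figure 17 can be sketched in a few lines of PyTorch, following the same recipe as Teyang's tutorial [9]:

```python
import torch.nn as nn
from torchvision import models

# Load ResNet50 pre-trained on ImageNet and freeze all of its layers
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer: 2 classes instead of ImageNet's 1000.
# Only this new layer's weights will be trained.
model.fc = nn.Linear(model.fc.in_features, 2)
```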

After setting up the architecture, the model is fit to the training data and evaluated on the test set. The test accuracy after twenty epochs was 89.06%, and a confusion matrix is provided in Figure 18 below to compare this model to the basic CNN model from before. Both the accuracy score and the confusion matrix are comparable to the previous CNN model; however, we only had to train one layer to achieve that level of performance, a testament to the power of transfer learning!

Figure 18: Confusion matrix for ResNet50 model

CAPSNET

Another image classification method the team attempted is the capsule neural network, or CapsNet. A CapsNet consists of capsules, groups of neurons responsible for detecting whether an entity is present in an image. For example, when classifying whether a tree is present, some capsules look for the trunk, others look for branches, and so on, as shown in Figure 19. Capsule neural networks are appealing because they output instantiation parameters along with a probability, meaning the model can report the location, size, hue, and more about where pneumonia is detected in the image.

Figure 19: CapsNet Visualization
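One defining ingredient of a capsule layer is the squash nonlinearity, which rescales a capsule's output vector so its length falls in [0, 1) and can be read as the probability that the entity is present, while its direction encodes the instantiation parameters. A minimal NumPy sketch:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Rescale capsule vector s to length in [0, 1) while preserving its direction."""
    squared_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = squared_norm / (1.0 + squared_norm)      # length -> probability-like value
    return scale * s / np.sqrt(squared_norm + eps)   # unit direction preserved

capsule_output = np.array([3.0, 4.0])   # toy 2-D capsule vector, length 5
v = squash(capsule_output)
print(np.linalg.norm(v))                # ~0.96: high probability the entity is present
```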

There is currently no preprocessing for the CapsNet model. The CapsNet implementation was built upon an existing VGG16 Keras model, incorporating a capsule class from GitHub [11][12]; the model architecture was taken from Kaggle [13]. VGG models are growing in popularity because they use smaller receptive fields, provide more nonlinearity in the decision function through 1x1 convolutional layers, and can handle a large number of weight layers, which improves performance [14]. On its own, the VGG model obtained an accuracy of 78.85%.

With the capsule layers added, our model achieved an accuracy of 90.06%. The CapsNet approach appears promising; however, further research and effort are needed to classify pneumonia and localize it in the image.

CONCLUSION

A model that can accurately detect the presence of pneumonia would be extremely helpful to radiologists and medical specialists, who are currently far too understaffed to process the volume of images awaiting review. Using well-known image classification techniques, we were able to build models capable of greater than 90% accuracy. Our main approach was the convolutional neural network, a deep learning technique that applies convolutional layers over image pixels. An alternative approach we briefly explored was the capsule neural network, which identifies entities in an image along with parameters such as position, size, and color. While the models created by the team were not accurate enough to replace medical experts, they are capable of providing key aid to experts in identifying pneumonia patients for treatment, which will save lives.

REFERENCES

[1] https://www.news-medical.net/news/20201218/Transfer-learning-exploits-chest-Xray-to-diagnose-COVID-19-pneumonia.aspx

[2] https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia/code?datasetId=17810&sortBy=voteCount

[3] https://www.hindawi.com/journals/scn/2019/8431074/

[4] https://www.ibm.com/cloud/learn/convolutional-neural-networks

[5] https://towardsdatascience.com/beginners-guide-to-understanding-convolutional-neural-networks-ae9ed58bb17d

[6] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4608340

[7] http://ezproxy.lib.utexas.edu/login?url=http://search.ebscohost.com/login.aspx?direct=true&db=pbh&AN=9058596&site=ehost-live

[8] Alber, M., Lapuschkin, S., Seegerer, P., Hägele, M., Schütt, K. T., Montavon, G., Samek, W., Müller, K. R., Dähne, S., & Kindermans, P. J. (2019). iNNvestigate neural networks! Journal of Machine Learning Research, 20.

[9] https://www.kaggle.com/teyang/pneumonia-detection-resnets-pytorch

[10] https://pytorch.org/hub/pytorch_vision_resnet/

[11] https://github.com/bojone/Capsule/

[12] https://keras.io/api/applications/vgg/

[13] https://www.kaggle.com/rodcardoso92/using-vgg-capsnet-to-diagnose-pneumonia/comments

[14] https://towardsdatascience.com/vgg-neural-networks-the-next-step-after-alexnet-3f91fa9ffe2c
