Experimental Setup & Results

The team trained the models on eight NVIDIA GeForce RTX 2080 GPUs. An in-house curated dataset comprising 270,000 chest X-ray images was passed through the CheXpert labeler to generate the ground-truth labels. Approximately 190,000 of the images correspond to normal chest X-rays, and the remaining ~80,000 contain one or more abnormalities.

3.1 AI-based Labeler

Table 1 summarizes the results of the AI-based report labeler.

Table 1. Report Labeler result of Top 5 classes

3.2 Multi-Label Classification

Since the classification model also serves as the base model for report generation, it is extremely important that the best-performing model is used here. The team therefore experimented extensively with this model. Table 2 (one table per category, reporting different metrics for each class under varying parameters) summarizes our observations in this regard. Note that during training we optimize the AUC metric, which is considered a standard optimization metric for imbalanced datasets [16].
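To make the per-class AUC concrete, the following is a minimal sketch of how AUC can be computed independently for each label in a multi-label setting. It uses the rank-based formulation (the probability that a randomly chosen positive image scores higher than a randomly chosen negative one). The function names and the toy labels and scores are illustrative assumptions and are not taken from our training code or results.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined without both positive and negative samples")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def per_class_auc(y_true, y_score, class_names):
    """Compute one AUC per class from row-per-image label and score matrices."""
    result = {}
    for k, name in enumerate(class_names):
        labels = [row[k] for row in y_true]
        scores = [row[k] for row in y_score]
        result[name] = binary_auc(labels, scores)
    return result


# Toy example (hypothetical values): 4 images, 2 classes.
classes = ["Cardiomegaly", "Edema"]
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.7], [0.6, 0.4], [0.4, 0.5]]
aucs = per_class_auc(y_true, y_score, classes)
```

Because the score is computed from rankings within each class rather than from a fixed decision threshold, it is not dominated by the majority (normal) class, which is why it suits imbalanced data. The pairwise loop is quadratic and is meant only for illustration; a sort-based implementation would be preferable at the scale of our dataset.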

As is evident from the table, model Mi performs best for class Ci but not for class Cj. This is because each class corresponds to a different pathology. Some pathologies are confined to a particular region of the image (Cardiomegaly in the heart region, Pleural Effusion at the bottom), whereas others are diffuse radiological findings (Edema) that cover the entire lung region of the image and cannot be localized to a given region.
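Since no single model is best for every class, one model can be selected per class. The snippet below is a minimal sketch of such a selection, assuming the validation AUCs are stored in a nested dictionary. The model names, the data layout and the AUC values are hypothetical placeholders, not reported results.

```python
def best_model_per_class(auc_table):
    """Return {class_name: (model_name, auc)} with the highest AUC per class.

    auc_table maps model_name -> {class_name: auc}.
    """
    best = {}
    for model_name, class_aucs in auc_table.items():
        for class_name, auc in class_aucs.items():
            # Keep the first model seen, or replace it when a higher AUC appears.
            if class_name not in best or auc > best[class_name][1]:
                best[class_name] = (model_name, auc)
    return best


# Hypothetical validation AUCs for two models (placeholder numbers only).
auc_table = {
    "M1": {"Cardiomegaly": 0.88, "Edema": 0.81},
    "M2": {"Cardiomegaly": 0.85, "Edema": 0.86},
}
selection = best_model_per_class(auc_table)
```

In this toy table, `selection` maps Cardiomegaly to M1 and Edema to M2, mirroring the situation where a model that is best for one class is not best for another.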

Table 3 compares the performance of our model with that of other existing models on different chest X-ray datasets.

Table 2. Multilabel classification on CheXpert datasets

Table 3. Multilabel classification comparison with SOTA

3.3 Report Generation

The team initially considered training the proposed decoder model on the Indiana University dataset, as it is freely and publicly available. However, each category contained fewer than 7470 images (Consolidation had the most, with 496 images). The team conducted a small experiment and observed that the model was clearly underfitting (the maximum BLEU score obtained was 0.2196, for Consolidation). For this reason, we decided to use the in-house dataset, which has many more images per label.

The team took the best-performing classification model for each class as the encoder. [Parameters used for the attention-based LSTM to be added.] While training the decoder, the weights of the encoder were frozen to ensure that there is no drop in classification accuracy. In addition, the decoder was trained only on the sentence corresponding to the label, not on the entire impression.

This was deemed correct because the encoded features contain information only about the given label and not about the other labels present in the image. The models corresponding to the other labels generate the sentences for those labels, and the final impression is the concatenation of all such sentences.
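The assembly of the final impression from per-label sentences can be sketched as follows. This is only an illustration under stated assumptions: the 0.5 decision threshold, the function name `assemble_impression`, and the stub generators standing in for the per-label decoders are all hypothetical, and the example sentences are invented for the demonstration.

```python
def assemble_impression(label_probs, sentence_generators, threshold=0.5):
    """Concatenate the sentences generated for every label predicted as present.

    label_probs: {label: probability from that label's classifier}.
    sentence_generators: {label: zero-argument callable returning a sentence};
        each callable stands in for the decoder trained for that label.
    threshold: assumed decision threshold (not specified in the text).
    """
    sentences = []
    for label, generate in sentence_generators.items():
        if label_probs.get(label, 0.0) >= threshold:
            sentences.append(generate().strip())
    return " ".join(sentences)


# Stub decoders with invented sentences, for illustration only.
generators = {
    "Cardiomegaly": lambda: "The heart is enlarged.",
    "Edema": lambda: "There is pulmonary edema.",
    "Pleural Effusion": lambda: "There is a pleural effusion.",
}
probs = {"Cardiomegaly": 0.9, "Edema": 0.2, "Pleural Effusion": 0.7}
impression = assemble_impression(probs, generators)
```

Each decoder contributes a sentence only when its own label is predicted, so the sentences stay independent of one another and the impression grows with the number of findings in the image.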

Table 4. Report Generation Model Comparison with SOTA

We also compare our results with other existing approaches. This is not an exact comparison, however, because the datasets differ (we did not have access to their datasets, and vice versa). Still, the comparison shows that our approach gives similar results in terms of traditional metrics.

Table 5. Model Evaluation on in-house dataset for classification, localization and report generation

(BLEU: Bilingual Evaluation Understudy, AUC: Area Under Curve, CB-POI: Center-based Pole of Inaccessibility, Dataset: 3000)

3.4 RFQI

Finally, we calculated the RFQI values for each class. The team used a weight of 0.75 for the radiological finding and 0.25 for the localization parameter, because identifying the correct radiological finding is more important than localizing it correctly. Table 6 summarizes the results obtained for each class.
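As an illustration of how the two weights interact, the sketch below combines a finding score and a localization score linearly. Only the weights 0.75 and 0.25 come from the text. The linear form, the assumption that both scores lie in [0, 1], and the function and parameter names are assumptions made for this example and should not be read as the exact RFQI definition.

```python
W_FINDING = 0.75       # weight for the radiological finding (from the text)
W_LOCALIZATION = 0.25  # weight for the localization parameter (from the text)


def rfqi(finding_score, localization_score,
         w_finding=W_FINDING, w_localization=W_LOCALIZATION):
    """Weighted combination of two scores (assumed linear form, scores in [0, 1])."""
    for value in (finding_score, localization_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    return w_finding * finding_score + w_localization * localization_score


# Hypothetical scores: a correct finding that is poorly localized still
# scores higher than a well-localized but incorrect finding.
good_finding_poor_location = rfqi(1.0, 0.0)
poor_finding_good_location = rfqi(0.0, 1.0)
```

With these weights, a perfectly identified but unlocalized finding yields 0.75, whereas a perfectly localized but misidentified finding yields only 0.25, which reflects the stated priority of the finding over its location.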

Table 6. Report Generation Model using RFQI

Figure 4. RFQI: Novel Scoring Mechanism

Figure 5. RFQI Detailed Schematic