
Vision Text Informed Autocomplete Language (ViTAL) Model: A Smart Sentence Completion Tool for X-ray Report Writing

Abstract


According to a study conducted by Andreas Otto Josef Zabel et al., the turnaround time for radiology reporting at major hospitals averages 245 to 288 minutes.[1] This process is lengthy and costly. Our research intends to provide a supplementary solution to this problem through automatic X-ray annotation and reporting with a vision-language model. This research focuses on annotating chest X-rays from the MIMIC-CXR dataset with a vision-language model we refer to as the Vision Text Informed Autocomplete Language (ViTAL) model, which leverages a pretrained Swin Transformer as the vision model and pretrained GPT-2 as the language model. The model takes X-ray images as input and outputs a short descriptive narrative of findings to make radiologists' diagnosis and reporting more efficient, effective, and less burdensome. As an extension of this model, we also develop an auto-completion tool to help radiologists write X-ray reports effectively and efficiently.


Introduction


Radiologists in the U.S. interpret, on average, over 70 million chest X-rays each year.[2] Furthermore, the cost of imaging plus professional reading/interpretation is between \$150 and \$1,200.[3] Beyond being expensive, the process is slow: according to a study conducted by Andreas Otto Josef Zabel et al., the turnaround time for radiology reporting at major hospitals averages 245 to 288 minutes.[1] Simply increasing radiologists' reading speed without better tooling, however, has a significant impact on report accuracy: Evgeniya Sokolovskaya et al. found that the rate of major misses was 26% among radiologists reporting at a faster speed, compared with 10% at normal speed.[4] In the following sections, we describe a Vision Text Informed Autocomplete Language (ViTAL) model whose disease-classification performance is on par with a radiologist. The ViTAL model takes a chest X-ray image as input and outputs a summary/narrative of medical findings that incorporates the patient's past and present medical history, using the Microsoft Swin Transformer as the vision transformer for object recognition and GPT-2 as the language model for medical-finding generation. We intend this model to be an assistive tool that helps radiologists annotate X-rays at an increased reporting speed while maintaining or improving reporting accuracy.



Data

Microsoft COCO

To pretrain the ViTAL model, we use Microsoft Common Objects in Context (MS COCO).[5] MS COCO is a “large-scale object detection, segmentation, and captioning” dataset that contains more than 330K images, over 200K of which are labeled. It contains 1.5 million object instances and about five captions per image. It suits our purpose of giving the ViTAL model a baseline ability to recognize common objects and generate simple captions. Figure 1 below shows an example image along with an associated caption from MS COCO.
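For reference, these image-caption pairs can be loaded directly from the standard COCO captions annotations. The following is a minimal sketch using torchvision; the file paths are placeholders, and our actual preprocessing pipeline may differ.

    # Minimal sketch: loading MS COCO image-caption pairs with torchvision.
    # Paths are placeholders; pycocotools must be installed.
    from torchvision import datasets, transforms

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),          # match the Swin input resolution
        transforms.ToTensor(),
    ])

    coco = datasets.CocoCaptions(
        root="coco/train2017",                  # directory of training images (placeholder)
        annFile="coco/annotations/captions_train2017.json",
        transform=preprocess,
    )

    image, captions = coco[0]                   # captions is a list of ~5 strings
    print(image.shape, captions[0])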

Figure 1: MS COCO example images [5]

MIMIC-CXR

To fine-tune ViTAL, we use the MIMIC-CXR database. It contains 377,110 chest X-ray images corresponding to radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA. Besides X-ray images, the dataset also contains radiology reports written by clinicians trained in interpreting imaging studies, who summarize their findings in a free-text format (Figure 2). Moreover, the dataset provides labels for the radiology reports generated with NegBio, an open-source rule-based tool for detecting medical findings in radiology reports. NegBio categorizes each mention of a medical finding as 1) a negated mention, 2) an uncertain mention, or 3) a positive mention, marking them as zero, negative one, or one, respectively.
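As an illustration, the NegBio labels distributed with MIMIC-CXR-JPG can be joined to the study metadata with a few lines of pandas. The file and column names below follow the public release but are given here as an assumed layout; adjust to your local copy.

    # Sketch: join MIMIC-CXR-JPG NegBio labels to study metadata (assumed file/column names).
    import pandas as pd

    meta   = pd.read_csv("mimic-cxr-2.0.0-metadata.csv.gz")
    negbio = pd.read_csv("mimic-cxr-2.0.0-negbio.csv.gz")

    # Keep posterior-anterior views only, as used for fine-tuning.
    pa = meta[meta["ViewPosition"] == "PA"]

    # One row per (subject, study) with the NegBio finding columns attached:
    # 1.0 = positive mention, 0.0 = negated, -1.0 = uncertain, NaN = not mentioned.
    labeled = pa.merge(negbio, on=["subject_id", "study_id"], how="inner")
    print(labeled[["subject_id", "study_id", "Edema", "Cardiomegaly"]].head())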

As previously mentioned, the MIMIC-CXR data is collected from a single hospital in Boston, MA. The diversity of the data is therefore restricted to the demographics of the area surrounding the hospital. Furthermore, the data is de-identified and does not allow for a statistical analysis of its demographics. We hence restrict the claims made in this project with these constraints in mind.

Figure 2: MIMIC Data Schema

Although radiology reports follow a general template, we observed that radiologists are not restricted to narrating only the specifics asked about by patients or physicians; they sometimes note every observation they see in the X-ray images. Each report's findings section contains 25 words on average, while some contain more than 100 words (Figure 3).

Each patient in this dataset has multiple X-ray images, 3.69 on average (Figure 4). We use up to three chest X-ray images per patient to provide a holistic view of the patient and to better mimic the resources a radiologist has when evaluating the X-ray images.


Figure 3: Distribution of Report Finding Word Counts Per Patient
Figure 4: Distribution of Images Count Per Patient

Models and Methods

ViTAL Model

The ViTAL model has two parts, as shown in Figure 5: a vision encoder and a language decoder. ViTAL relies on the Swin Transformer to convert X-ray images into embeddings. Any text provided as context is tokenized and converted into embeddings using the input embedding layer of the GPT-2 model, then concatenated with the output hidden states of the image. The concatenated vision and text hidden states form the input to the GPT-2 model, which generates the finding embeddings that are decoded into the radiology report findings.
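A minimal sketch of this wiring using the HuggingFace transformers API is shown below. Names and details are illustrative rather than the exact training code; it relies on the Swin base hidden size matching the gpt2-medium hidden size (1024), so no projection layer is shown, and it assumes batch size 1 for brevity.

    # Sketch: concatenate Swin image hidden states with GPT-2 token embeddings
    # and train GPT-2 to generate the findings (illustrative, not the exact code).
    import torch
    from transformers import SwinModel, GPT2LMHeadModel, GPT2TokenizerFast

    vision = SwinModel.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")
    decoder = GPT2LMHeadModel.from_pretrained("gpt2-medium")   # hidden size 1024
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
    tokenizer.pad_token = tokenizer.eos_token

    def forward(pixel_values, indication, findings):
        # 1) image (1, 3, 224, 224) -> 49 hidden states of dimension 1024
        img_states = vision(pixel_values=pixel_values).last_hidden_state   # (1, 49, 1024)

        # 2) context text and target findings -> token embeddings
        ctx_ids = tokenizer(indication, return_tensors="pt").input_ids
        tgt_ids = tokenizer(findings, return_tensors="pt").input_ids
        wte = decoder.transformer.wte
        inputs_embeds = torch.cat([img_states, wte(ctx_ids), wte(tgt_ids)], dim=1)

        # 3) language-model loss only on the findings tokens (-100 is ignored by the loss)
        ignore = torch.full(
            (1, img_states.size(1) + ctx_ids.size(1)), -100, dtype=torch.long
        )
        labels = torch.cat([ignore, tgt_ids], dim=1)
        return decoder(inputs_embeds=inputs_embeds, labels=labels).loss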

Figure 5: ViTAL Model Architecture

We choose the Microsoft Swin Transformer as our vision encoder. Images exhibit large variations in the scale of visual entities and a high pixel resolution, which makes them challenging to model with typical transformer architectures. The Swin Transformer, introduced by Ze Liu et al.[6] (Figure 6), constructs hierarchical feature maps and has computational complexity linear in image size. It achieves this by building a hierarchical representation from small patches that are merged in deeper layers and by computing self-attention within non-overlapping windows, or partitions, of the image.

Figure 6: Swin Transformer Architecture
Note. (a) Architecture of the Swin Transformer. (b) Two successive Swin Transformer blocks. W-MSA and SW-MSA are multi-head self-attention modules with regular and shifted windowing configurations, respectively.[6]

With this architecture, which is more efficient than typical transformers, we are able to perform the image recognition and captioning tasks in our ViTAL model. We use the Swin Transformer base model pretrained on the ImageNet-22K classification task; specifically, the Huggingface checkpoint microsoft/swin-base-patch4-window7-224-in22k. As the input image has shape $224 \times 224 \times 3$ ($h \times w \times \text{channels}$), the number of output embeddings is $\frac{h}{32} \times \frac{w}{32} = 7 \times 7 = 49$, each with embedding dimension $8C = 1024$ (where $C = 128$ for the base model).
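These shapes can be verified directly against the checkpoint; a quick sanity check, assuming the HuggingFace transformers interface, is sketched below.

    # Sanity check: the Swin base checkpoint maps a 224x224x3 input to 49 tokens of dim 1024.
    import torch
    from transformers import SwinModel

    model = SwinModel.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")
    dummy = torch.zeros(1, 3, 224, 224)      # a preprocessed (normalized) image would go here
    with torch.no_grad():
        out = model(pixel_values=dummy).last_hidden_state
    print(out.shape)                         # torch.Size([1, 49, 1024]) = (batch, 7*7, 8*C)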

We use the OpenAI GPT-2 model as the second part of the ViTAL model. Trained as a language model on a large English corpus, GPT-2 generates coherent text; we use the Huggingface gpt2-medium checkpoint. GPT-2 was introduced by Alec Radford et al. in 2019. In their paper “Language Models are Unsupervised Multitask Learners,” they demonstrated that language models can perform natural language processing tasks such as question answering, translation, and summarization without explicit supervision.[7] The essence of the approach is unsupervised estimation of the data distribution from a set of examples, which enables GPT-2 to perform zero-shot natural language processing tasks via transfer learning.

Model Training

The two models used in the ViTAL architecture were trained with independent objectives, so combined end-to-end training is required for them to work together. The MS COCO dataset was used for this end-to-end pretraining: MS COCO provides 330K image-caption pairs, compared with the limited, filtered MIMIC-CXR dataset of roughly 81K pairs of posterior-anterior chest X-ray images and report-findings text.

Once the model had been end-to-end pretrained on MS COCO, we fine-tuned it on the MIMIC-CXR dataset. It is important to note that the Swin Transformer expects 3-channel image inputs, which is compatible with the color images in MS COCO but not with the single-channel (grayscale) X-ray images in MIMIC-CXR. To address this, we employed the following strategy to make our chest X-ray images compatible with the ViTAL model setup while potentially enhancing its capability by providing historical information for a given patient (a minimal sketch follows the list below):

  1. If the patient has only one PA (posterior-anterior) X-ray image, i.e., the X-ray image of the current study, we copy the same image into all three channels.
  2. If the patient has an X-ray image from one previous visit, we use the current X-ray as channel 0 and the previous X-ray as channels 1 and 2.
  3. If the patient has X-ray images from two or more previous visits, we use the current X-ray as channel 0, the most recent past X-ray as channel 1, and the second most recent past X-ray as channel 2. Any X-ray images older than the second most recent visit are ignored.
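A minimal sketch of this channel-stacking logic is given below; it assumes the patient's PA images are already available as grayscale tensors sorted from most recent to oldest.

    # Sketch: build a 3-channel input from the current X-ray and up to two prior X-rays.
    # `images` is a list of (H, W) grayscale tensors sorted newest-first (current study first).
    import torch

    def stack_history(images):
        current = images[0]
        if len(images) == 1:                     # no history: repeat the current study
            channels = [current, current, current]
        elif len(images) == 2:                   # one prior study: use it for channels 1 and 2
            channels = [current, images[1], images[1]]
        else:                                    # two or more priors: keep the two most recent
            channels = [current, images[1], images[2]]
        return torch.stack(channels, dim=0)      # shape (3, H, W)

    x = stack_history([torch.rand(224, 224)])    # patient with a single PA image
    print(x.shape)                               # torch.Size([3, 224, 224])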

Each X-ray image is also independently subjected to an affine transformation with a 50% probability. The applied transformation consists of up to ±10° of rotation and up to ±5% translation (shift) along both the x- and y-axes.
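With torchvision, this augmentation can be expressed roughly as follows; this is a sketch, and the exact interpolation and fill settings used in training may differ.

    # Sketch: apply the affine augmentation to 50% of images.
    from torchvision import transforms

    augment = transforms.RandomApply(
        [transforms.RandomAffine(degrees=10, translate=(0.05, 0.05))],  # ±10° rotation, ±5% shift
        p=0.5,                                                          # applied to half the images
    )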

We apply a similar treatment to the doctor's indications: indications from the current study are concatenated to the chest X-ray image embeddings produced by the Swin Transformer. It is important to note that when generating a report, we do not include any images or indications from the future relative to the X-ray being analyzed.

Results and Discussion

Saliency Maps

Because the report is meant to be a textual summarization of an X-ray, the words in the report should correspond to the appropriate locations in the X-ray, and the model's attention weights should reflect this. To extract the saliency map, we inspect the cross-attention weights of the first layer of the decoder (Figure 7). Specifically, we inspect the model's relative attention weights over the 49 image embeddings discussed in the model section. As several tokens are generated, the maximum attention value across those tokens is plotted in the saliency map (Figure 8). Note that the saliency map is an inter-area interpolation of these 49 points.
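A sketch of this extraction is shown below, assuming the concatenated-input setup described above, so that the attention onto the image is read from the first GPT-2 layer's attention over the 49 image positions; tensor names and the aggregation over heads are illustrative.

    # Sketch: build a saliency map from first-layer attention over the 49 image positions.
    # `attentions` is the tuple returned by the decoder with output_attentions=True;
    # attentions[0] has shape (batch, heads, seq_len, seq_len) for the first layer.
    import cv2
    import numpy as np
    import torch

    def saliency_map(attentions, n_img_tokens=49, text_slice=slice(49, None), size=(224, 224)):
        attn = attentions[0][0]                                    # first layer, first batch element
        text_to_img = attn[:, text_slice, :n_img_tokens]           # (heads, n_text_tokens, 49)
        weights = text_to_img.max(dim=1).values.mean(dim=0)        # max over tokens, mean over heads
        grid = weights.reshape(7, 7).detach().cpu().numpy().astype(np.float32)
        grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)
        return cv2.resize(grid, size, interpolation=cv2.INTER_AREA)  # upsample to image size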

Figure 7: Cross Attention Weights Inspection
Note. Cross-attention weights of the first decoder (GPT-2) layer are inspected.
Figure 8: Saliency Maps Examples
Note. Titles are the text generated by the model. On the left is the original image with the expected location of the finding marked in blue; on the right is the saliency map superimposed on the image. It can be seen that the model is indeed attending to the appropriate locations in the image when generating the text.

NegBio Evaluation

As mentioned above, MIMIC-CXR provides labels for the radiologist-triaged reports generated with NegBio, an open-source rule-based tool for detecting medical findings in radiology reports. NegBio categorizes each mention of a medical finding as 1) a negated mention, 2) an uncertain mention, or 3) a positive mention, marking them as zero, negative one, or one, respectively. Findings that are not mentioned are left blank; we ignore these when computing the metrics shown below.

To evaluate whether the ViTAL model identifies medical findings accurately and narrates them in a human-like, comprehensible manner, we run NegBio on the model-generated findings.[8] [9] This evaluation measures three aspects of ViTAL's performance: 1) whether the model identified the correct medical finding, 2) whether the model made the correct diagnosis for that finding, and 3) whether the model generated natural language that is consumable.
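Once NegBio has labeled both the generated and the ground-truth reports, per-class metrics like those in Table 2 can be computed with scikit-learn. The sketch below uses illustrative file and column names and treats the positive label (1) against the rest, which is one reasonable reading of the binary per-class metrics reported; the exact handling of uncertain mentions may differ.

    # Sketch: per-class precision/recall/F1 comparing NegBio labels of generated findings
    # against ground-truth report labels (file and column names illustrative).
    import pandas as pd
    from sklearn.metrics import precision_recall_fscore_support

    generated = pd.read_csv("negbio_labels_generated.csv")     # one column per finding, values -1/0/1
    reference = pd.read_csv("negbio_labels_ground_truth.csv")

    for finding in ["Lung Opacity", "Pleural Effusion", "Atelectasis", "Edema"]:
        mask = reference[finding].notna()                      # drop "no mention" rows
        y_true = (reference.loc[mask, finding] == 1).astype(int)
        y_pred = (generated.loc[mask, finding] == 1).astype(int)
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="binary", zero_division=0
        )
        print(f"{finding}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={int(y_true.sum())}")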

Table 2 summarizes the results, comparing the NegBio labels of the ViTAL-generated findings with those of the ground-truth (radiologist-triaged) reports provided by MIMIC-CXR. The model identifies “Lung Opacity” best, with high recall and precision, while performing less well on Edema. Edema is swelling caused by fluid trapped in the lungs. One possible explanation is that Edema does not have a large sample size; recall is high in classes with a reasonably large sample size in the test data, as shown below. Another possibility is that edema is relatively difficult to detect from X-ray imaging, where patients are commonly diagnosed from an observation of minor vascular engorgement.

Table 2: ViTAL Model Results
Class             F1-score  Precision  Recall  Support
Lung Opacity      0.83      0.76       0.91    43
Pleural Effusion  0.67      0.70       0.65    162
Atelectasis       0.60      0.43       1.00    49
Support Devices   0.57      0.40       0.98    125
Cardiomegaly      0.51      0.47       0.55    119
Edema             0.30      0.33       0.28    36

Future Work


The vision and language models used in this project are relatively small. A bigger model, especially for vision, could provide several advantages: higher resolution for detecting findings and a higher-resolution saliency map. Another improvement to the modeling could be to treat historical X-rays as independent images and 'unwrap' them (Figure 9) into a sequence of embeddings at the input to the decoder. This would make the input sequences longer, but it would provide a setup closer to the pretraining objective of the vision model (a sketch of this unwrapping is given below).
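A sketch of this unwrapping, in which each historical X-ray is encoded independently and the resulting token sequences are concatenated before the decoder, is shown below; it reuses the SwinModel interface from earlier and assumes each image has already been expanded to three channels.

    # Sketch: encode each X-ray separately and concatenate the token sequences
    # (3 images -> 3 * 49 = 147 image tokens at the decoder input).
    import torch
    from transformers import SwinModel

    vision = SwinModel.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")

    def unwrap(images):                      # images: (num_images, 3, 224, 224)
        with torch.no_grad():
            states = vision(pixel_values=images).last_hidden_state   # (num_images, 49, 1024)
        return states.reshape(1, -1, states.size(-1))                # (1, num_images*49, 1024)

    print(unwrap(torch.zeros(3, 3, 224, 224)).shape)   # torch.Size([1, 147, 1024])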

Figure 9: Sequential Embedding Illustration
Note. Unwrapping multiple X-rays to make a sequence of embeddings for better comparison with historical data

Conclusion


In this work, we have demonstrated that vision and language models trained on independent objectives can be combined and, with minimal changes, adapted for X-ray report generation. We have shown through saliency maps that the model indeed attends to the appropriate locations in the image when generating text. We have extended this to a practical application in which the model and the saliency maps guide report writing by a radiologist.

References


[1] Zabel, A.O.J., Leschka, S., Wildermuth, S. et al. Subspecialized radiological reporting reduces radiology report turnaround time. Insights Imaging 11, 114 (2020). https://doi.org/10.1186/s13244-020-00917-z

[2] Iyeke L, Moss R, Hall R, et al. (October 01, 2022) Reducing Unnecessary ‘Admission’ Chest X-rays: An Initiative to Minimize Low-Value Care. Cureus 14(10): e29817. doi:10.7759/cureus.29817

[3] Zabel, A.O.J., Leschka, S., Wildermuth, S. et al. Subspecialized radiological reporting reduces radiology report turnaround time. Insights Imaging 11, 114 (2020). https://doi.org/10.1186/s13244-020-00917-z

[4] Sokolovskaya E, Shinde T, Ruchman RB, Kwak AJ, Lu S, Shariff YK, Wiggins EF, Talangbayan L. The Effect of Faster Reporting Speed for Imaging Studies on the Number of Misses and Interpretation Errors: A Pilot Study. J Am Coll Radiol. 2015 Jul;12(7):683-8. doi: 10.1016/j.jacr.2015.03.040. Epub 2015 May 21. PMID: 26003588.

[5] Lin, Tsung-Yi, et al. "Microsoft coco: Common objects in context." Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer International Publishing, 2014.

[6] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

[7] Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.

[8] Peng Y, Wang X, Lu L, Bagheri M, Summers RM, Lu Z. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA 2018 Informatics Summit. 2018, 188-196.

[9] Wang X, Peng Y, Lu L, Bagheri M, Lu Z, Summers R. ChestX-ray8: Hospital-scale Chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 2097-2106.