Abstract

Chest X-ray Differential Medical Visual Question Answering (Diff-MedVQA) is a novel multi-modal task designed to answer questions about diseases, especially their differences, based on a main image and a reference image. Compared to the widely explored visual question answering in the general domain, Diff-MedVQA presents two unique challenges: (1) variations in medical images are often subtle, and (2) two chest X-rays taken at different times are never captured from exactly the same view. These issues significantly hinder accurate answering of questions about medical image differences. To address this, we introduce a two-stage framework featuring Coarse-to-Fine Granularity Contrastive Learning. Specifically, our method first employs an anatomical encoder and a disease classifier to obtain fine-grained visual features of the main and reference images. It then integrates an anatomical knowledge graph to strengthen the relationship between anatomical and disease regions, while Multi-Change Captioning transformers identify the subtle differences between main and reference features. During pre-training, Coarse-to-Fine Granularity Contrastive Learning aligns the knowledge-enhanced visual differences with keyword features such as anatomical parts, symptoms, and diseases. During Diff-MedVQA fine-tuning, the model treats the differential features as context-grounded queries, with Language Modeling guiding answer generation. Extensive experiments on the MIMIC-CXR-Diff dataset validate the effectiveness of our proposed method.
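
The alignment objective is not spelled out on this page; the following is a minimal, hypothetical PyTorch sketch of how a coarse-to-fine contrastive alignment between pooled visual difference features and keyword features (anatomy, symptom, and disease terms) could look. The InfoNCE-style loss, the temperature, and the weighting factor alpha are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def info_nce(img_feats, txt_feats, temperature=0.07):
        # img_feats: (B, D) pooled knowledge-enhanced visual difference features
        # txt_feats: (B, D) pooled text features for the matching keywords/sentences
        img_feats = F.normalize(img_feats, dim=-1)
        txt_feats = F.normalize(txt_feats, dim=-1)
        logits = img_feats @ txt_feats.t() / temperature          # (B, B) cosine similarities
        targets = torch.arange(img_feats.size(0), device=img_feats.device)
        # Symmetric loss: match each image difference to its text and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def coarse_to_fine_loss(coarse_img, coarse_txt, fine_img, fine_txt, alpha=0.5):
        # One term on coarse (e.g., sentence-level) features and one on fine (keyword-level)
        # features such as anatomical parts, symptoms, and diseases; alpha is hypothetical.
        return alpha * info_nce(coarse_img, coarse_txt) + (1 - alpha) * info_nce(fine_img, fine_txt)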

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1957_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/big-white-rabbit/Coarse-to-Fine-Grained-Contrastive-Learning

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Lia_Leveraging_MICCAI2024,
        author = { Liang, Xiao and Wang, Yin and Wang, Di and Jiao, Zhicheng and Zhong, Haodi and Yang, Mengyu and Wang, Quan},
        title = { { Leveraging Coarse-to-Fine Grained Representations in Contrastive Learning for Differential Medical Visual Question Answering } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a novel two-stage training framework for Diff-MedVQA with coarse-to-fine granularity contrastive learning, aimed primarily at detecting and answering questions about differences in paired medical images. The method extracts visual features of image pairs using an anatomical encoder and a disease classifier. These features are then fed to a graph CNN to extract node features. Subsequently, the three sets of features are fed to an MCCFormer to identify visual differences between the image pairs. Additionally, to align the visual differences with text, Coarse-to-Fine Granularity Contrastive Learning is applied in the pre-training stage using the visual difference features and keyword textual features. The authors use the MIMIC-CXR-Diff dataset to compare the performance of their method to other approaches.
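
    To make the data flow in this summary concrete, the following is a hypothetical, much-simplified PyTorch skeleton of the described pipeline: per-region features from the anatomical encoder and disease classifier, a graph convolution over the anatomical knowledge graph, and a transformer encoder standing in for the MCCFormer difference module. Module choices, names, and dimensions are placeholders, not the authors' implementation.

        import torch
        import torch.nn as nn

        class DiffPipelineSketch(nn.Module):
            # Illustrative data flow only; all components are stand-ins.
            def __init__(self, feat_dim=512):
                super().__init__()
                self.gcn = nn.Linear(feat_dim, feat_dim)        # stand-in for a graph conv layer
                self.diff_encoder = nn.TransformerEncoder(      # stand-in for the MCCFormer module
                    nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
                    num_layers=2,
                )

            def forward(self, main_regions, ref_regions, adj):
                # main_regions, ref_regions: (B, R, feat_dim) per-region anatomical+disease features
                # adj: (R, R) adjacency of the anatomical knowledge graph
                main_nodes = torch.relu(adj @ self.gcn(main_regions))   # propagate over the graph
                ref_nodes = torch.relu(adj @ self.gcn(ref_regions))
                pair = torch.cat([main_nodes, ref_nodes], dim=1)        # concatenate both studies
                return self.diff_encoder(pair)                          # difference-aware features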

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Relevance: In the medical field, the difference between scans taken at two timepoints is crucial for quantifying the change in a patient's disease state. With the added challenge of variation between scans taken at different timepoints (due to scanner differences, contrast, brightness, resolution, etc.), automating this process is very difficult.
    Tailored to ROIs: The authors first identify relevant ROIs and then detect ROI-specific changes, which may help the model learn better, since the changes may be subtle and locally confined (e.g., present only in the left lung).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is a lack of detail regarding the data, methods, and results:
    o What are the different anatomical labels and disease labels in the dataset? (Additionally, did the authors note any differences in performance in severe disease cases?)
    o How was the data split into training/validation/testing (8:1:1)? Was there any stratification based on disease severity or demographics?
    o The authors have provided virtually no details (method or results) about the FasterRCNN (other than that it was pre-trained on Chest ImaGenome data). It is unclear how the ‘anatomical features’ are extracted from the RCNN, i.e., are they taken from an intermediate layer of the FasterRCNN?
    o How is the performance of this model tested on the MIMIC-CXR-Diff dataset?
    o How does the size of the predicted bounding box affect the subsequent framework?
    o How does the performance of the MCCFormer and the FasterRCNN models change with respect to disease severity?
    Use of older architectures:
    o Why are older architectures like FasterRCNN (2015) and ResNet (2015) used? Were any changes made to these original architectures? Why not use more recent architectures like YOLOv9 or vision transformers?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    o It would be helpful to have a table describing the data used to train and test the model, i.e., the number of images per disease stage. Additional details on how the data stratification was done would also be helpful.
    o Some comments on the performance of the FasterRCNN and on how errors in its predictions may propagate through the framework, affecting the final text prediction, would help highlight potential limitations. (For example, if a collapsed lung is accidentally identified as ‘heart’ by the FasterRCNN, how does this affect the performance of the overall framework?)
    o Comment on the smallest detectable change: the authors mention at the beginning that the challenge with medical datasets is often that the differences between image pairs may be subtle. Can the authors comment on how small a change would be detectable by this approach? For example, if the only difference between the main and reference image is a small change in tumor volume in the left lung, would the model pick up this difference or would it identify the pair as ‘no change’?
    For future work, I would recommend:
    o Upgrading older architectures like FasterRCNN/ResNet to YOLOv9, vision transformers, or other recent architectures.
    o Exploring the performance of the overall pipeline across different disease severities.
    o Exploring whether the model can pick up subtle changes in image pairs.
    o Exploring performance on data where the two images come from different scanners, with differences in resolution, contrast, brightness, etc.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The task that the authors are trying to accomplish is a very relevant challenge often encountered in the medical domain. The approach detailed in the paper is novel and specifically tailored to the task. However, the lack of detail regarding data, methods, and results needs to be addressed, at least concerning the data distribution and training/validation stratification. Additionally, a justification for the use of older architectures would be helpful.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision
    • Although the authors have provided a few additional details regarding the anatomical/disease labels, they still have not clarified whether there was any kind of stratification in the 8:1:1 train/val/test split.
    • Also, although the authors have provided some details on the FasterRCNN and how its intermediate layers were used for feature extraction, they still have not justified the use of older architectures from 2015 (ResNet/FasterRCNN) other than that ‘these are the commonly used architectures for VQA’.



Review #2

  • Please describe the contribution of the paper

    The paper introduces a novel framework for Differential Medical Visual Question Answering (Diff-MedVQA) that employs Coarse-to-Fine Granularity Contrastive Learning to integrate fine-grained visual features from chest X-rays with textual features for improved question-answering capabilities. The method effectively leverages an anatomical knowledge graph to enhance the relationship between visual features and corresponding diseases, enabling more accurate identification of subtle changes in medical images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novelty in Framework Design: The paper’s proposal of a two-stage training framework with Coarse-to-Fine Granularity Contrastive Learning is novel. It addresses the challenge of aligning fine-grained image features with textual descriptions in medical VQA, which is crucial for precise medical diagnosis and has not been extensively explored in previous research.
    2. Anatomical Knowledge Graph Integration: The use of an anatomical knowledge graph to enhance visual feature representation is a strong aspect of the work. It demonstrates a significant improvement in model performance metrics, such as BLEU and ROUGE scores, indicating a robust approach to understanding medical images.
    3. Clinical Feasibility and Performance: The application of this framework in a clinical setting appears feasible, with state-of-the-art performance demonstrated on the MIMIC-CXR-Diff dataset. This indicates both practical and academic contributions to the field of medical imaging and diagnostics.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of Comparative Analysis: While the paper claims improvements over existing methods, it lacks a detailed comparative analysis with other state-of-the-art approaches. Including more comparative insights could strengthen the case for the proposed method’s superiority.
    2. Limited Validation: The paper primarily focuses on one dataset. Expanding the validation to include diverse datasets could help in understanding the generalizability of the proposed method across different types of medical imaging data.
    3. Algorithm Specificity and Complexity: The complex nature of the proposed algorithms may limit reproducibility and practical deployment. A clearer explanation or simplification of the algorithms could make the paper more accessible to a broader audience.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Enhance Comparative Analysis: Consider adding a more comprehensive comparison with existing methods, highlighting specific scenarios where the proposed method outperforms others.
    2. Simplify Methodology Description: To reach a broader audience, simplify the technical descriptions and provide more intuitive explanations or visual aids for complex concepts.
    3. Expand Dataset Validation: To strengthen the claims of generalizability, validate the proposed method across a wider range of medical imaging datasets.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The high rating is justified by the novel approach and significant improvements shown in performance metrics. The main factors influencing this score are the innovative use of an anatomical knowledge graph and the strong empirical results demonstrated on the challenging MIMIC-CXR-Diff dataset. Enhancements in comparative analysis could further solidify the paper’s impact and applicability.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors’ detailed response. The paper deserves a high rating due to its innovative approach and the substantial improvements observed in performance metrics. Notably, the use of an anatomical knowledge graph stands out as a groundbreaking aspect, and the empirical results on the challenging MIMIC-CXR-Diff dataset are impressive. To further enhance the paper’s impact and applicability, a more in-depth comparative analysis would be beneficial.




Author Feedback

We thank both reviewers for their professional comments and for acknowledging the innovation and practicality of our work. Below we clarify aspects of the paper that were not clearly articulated, together with brief summaries:

#Reviewer 1:

  1. Insufficient detail in data, methods, and results: 1) Anatomical labels are assigned based on anatomical knowledge of chest X-rays, such as “right lower lung”. Disease labels are based on observable abnormalities and inferred diseases in chest X-rays, such as “pneumothorax”. 2) To facilitate performance comparisons with existing methods, our train, validation, and test sets are split in the same 8:1:1 ratio used by existing methods such as EKAID. The MIMIC-CXR-Diff dataset is now available on PhysioNet. 3) The FasterRCNN is pre-trained on Chest ImaGenome data (anatomical labels and bounding boxes). We use features from its intermediate layers (before the ROI head) as anatomical features. Since MIMIC-CXR-Diff does not provide ground-truth bounding boxes, the FasterRCNN was not separately evaluated on MIMIC-CXR-Diff. 4) Because of the ResNet input size constraint, all regions cropped by the FasterRCNN bounding boxes are resized to 224x224 to obtain disease features (an illustrative sketch of this two-stream feature extraction is given after this list).

  2. Why not use the latest YOLOv9 or ViT for feature extraction: FasterRCNN+ResNet feature extraction is the most widely used approach in general VQA methods, and we followed this pipeline. In the future, we will consider more advanced feature extractors, especially cross-modal pre-training such as BioMedCLIP.

  3. Lack of analysis on disease severity: We agree that disease severity can impact model performance. However, there is currently no good method for quantitatively analyzing “severity.” This requires further organization and analysis of the dataset and is a direction for future research.
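
For clarity on points 3) and 4) of item 1 above, the following is an illustrative sketch (not the authors' code) of the two feature streams described: backbone/FPN activations taken before the ROI head as anatomical features, and FasterRCNN box crops resized to 224x224 and encoded by a ResNet as disease features. It uses torchvision with randomly initialized weights purely to illustrate shapes; in the paper the detector is pre-trained on Chest ImaGenome and the crops come from its predicted anatomical boxes.

    import torch
    import torch.nn.functional as F
    import torchvision

    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None).eval()
    resnet = torchvision.models.resnet50(weights=None).eval()
    resnet.fc = torch.nn.Identity()                  # keep pooled 2048-d features, drop the classifier

    image = torch.rand(3, 512, 512)                  # one chest X-ray, replicated to 3 channels

    with torch.no_grad():
        # (1) Anatomical stream: multi-scale feature maps from the backbone/FPN,
        #     i.e. activations before the ROI head.
        fpn_feats = detector.backbone(image.unsqueeze(0))    # OrderedDict of feature maps
        # (2) Disease stream: crop each predicted anatomical box, resize to 224x224,
        #     and encode with the ResNet to obtain a per-region disease feature.
        boxes = detector([image])[0]["boxes"]
        region_feats = []
        for x1, y1, x2, y2 in boxes.round().int().tolist():
            crop = image[:, y1:y2, x1:x2].unsqueeze(0)
            if crop.numel() == 0:
                continue
            crop = F.interpolate(crop, size=(224, 224), mode="bilinear", align_corners=False)
            region_feats.append(resnet(crop))        # (1, 2048) disease feature for this region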

#Reviewer 3:

  1. Insufficient analysis of existing methods: Among the three baseline methods we compared, MCCFormer is a simple encoder-decoder structure designed for change captioning tasks and does not account for viewpoint changes in chest X-rays. As a result, it performed very poorly across all metrics, almost failing to capture differences in chest X-rays. IDCPCL introduces contrastive learning pre-training on top of the encoder-decoder structure, resulting in a slight performance improvement. EKAID, although considering viewpoint changes, lacks medical vision-language pre-training, and its text generation performance is not as good as our proposed method, especially in terms of the CIDEr and METEOR metrics. Further analysis will be included in the revised version.

  2. Simplify method description: Thank you for your suggestion. We will simplify the technical descriptions and provide more intuitive explanations or visual aids for complex concepts in the final version.

  3. Validation on a broader dataset: Collecting image data from the same patient at different times and having it annotated by experts is a very challenging task. We are actively seeking collaborations to apply our proposed method to other medical imaging contexts.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewer who initially rated the paper as an accept later changed to reject based on the unsatisfactory rebuttal. The authors do not seem to be aware of ROI generation methods that form scene graphs, as used in the Chest ImaGenome dataset.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a novel approach and strong empirical validation on the MIMIC-CXR-Diff dataset. The rebuttal addressed most concerns effectively, though some aspects, such as stratification, the choice of architectures and detailed performance breakdowns could be further elaborated.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although there is controversy between the meta-reviewers and the reviewers as well, I would like to support this paper, as the novelty of adding a knowledge graph into learning is worth discussing.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



