Abstract

Visual Question Answering (VQA) has advanced in recent years, inspiring adaptations to radiology for medical diagnosis. Longitudinal VQA, which requires an understanding of changes in images over time, can further support patient monitoring and treatment decision making. This work introduces RegioMix, a retrieval-augmented paradigm for longitudinal VQA, formulating a novel approach that generates retrieval objects through a mix-and-match technique, utilizing different regions from various retrieved images. Furthermore, this process generates a pseudo-difference description based on the retrieved pair, by leveraging the reports available from each retrieved region. To align such statements with both the posed question and the input image pair, we introduce a Dual Alignment module. Experiments on the MIMIC-Diff-VQA X-ray dataset demonstrate our method’s superiority, outperforming the state-of-the-art by 77.7 in CIDEr score and 8.3% in BLEU-4, while relying solely on the training dataset for retrieval, showcasing the effectiveness of our approach. Code is available at https://github.com/KawaiYung/RegioMix

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2219_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2219_supp.zip

Link to the Code Repository

https://github.com/KawaiYung/RegioMix

Link to the Dataset(s)

https://physionet.org/content/medical-diff-vqa/1.0.0/

BibTex

@InProceedings{Yun_RegionSpecific_MICCAI2024,
        author = { Yung, Ka-Wai and Sivaraj, Jayaram and Stoyanov, Danail and Loukogeorgakis, Stavros and Mazomenos, Evangelos B.},
        title = { { Region-Specific Retrieval Augmentation for Longitudinal Visual Question Answering: A Mix-and-Match Paradigm } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a novel visual question answering model for assessing longitudinal differences in chest X-rays, trained and evaluated on the MIMIC-Diff-VQA dataset. They introduce a region-level retrieval augmented generation approach that retrieves corresponding pairs from the training set, extracts pseudo-descriptions of their differences, and further refines them in an alignment module to generate the answers. Experiments demonstrate impressive improvements over state-of-the-art methods across all metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and has a clear structure. It includes an extensive evaluation and comparison with state-of-the-art methods, outperforming them in all metrics. The authors propose a novel anatomy region-based retrieval augmentation of pairs, which is a strength of the work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The evaluation focuses primarily on NLP metrics, which are suboptimal for evaluating clinical correctness. A stronger focus on accuracy would be preferable. The qualitative results appear to be very cherry-picked, with the generated outputs perfectly matching the ground truth. Assuming a generative model, this seems very unlikely. Except for a single space, all samples provided in the supplementary material perfectly match the ground truth, not just in content but also in syntax. The authors should provide a justification for this and evidence that no test leakage has occurred during training.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The text in Figures 1 and 2 is hard to read due to its small size.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main concern with this paper is that the model has only been evaluated on a single dataset, and on that dataset, all qualitative results perfectly match the ground truth (including the video in the supplementary material). This indicates that either the samples have been cherry-picked, there is test set leakage, or the task at hand is too easy to model a realistic clinical setting. I suggest evaluating the model on another VQA dataset for chest x-rays to address these concerns.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My main concerns have been addressed by the authors, and the addition of more critical examples will strengthen the quality of the paper.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a method for longitudinal VQA on chest X-ray images. The main part of the method is RS-RAG, a region-based RAG approach that enables retrieval augmentation without the need for an external knowledge database. The authors also propose a dual alignment module to produce more accurate answers, as well as the application of a PairNCE loss.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well organized and structured
    • The proposed formulation has some novelty
    • The results show benefits of the different components proposed by the authors
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper relies heavily on the architecture (EKAID) and dataset (Medical-DiffVQA) introduced in [8]
    • Figure 1 is very hard to read. It contains too much information, and the location of the blocks is confusing.
    • The structure of the paper is very compressed, making it hard to read. Paragraphs are very long and there is almost no separation between paragraphs in some cases (e.g., Sec. 3)
    • The qualitative examples in Fig. 3 show answers that perfectly resemble the ground truth. This suggests the selected samples (at least the answers) are common to the training and testing set. This, however, is not good evidence for the generalization power of the method. How does the model perform on other samples? Related to the previous point, the text could be shortened so that the presented samples are not only those that make the paper look good. I would appreciate an explanation from the authors as to why RegioMix produces exactly the same answers as the GT.
    • Some minor comments: Page 2, paragraph before Sec. 2: “posted question” -> “posed question”. In the caption of Fig. 2, “pathology” should be plural.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I would recommend:

    • Improve Figure 1 so that it is easier to understand for readers.
    • Make the long paragraphs more reader-friendly. Use sub-sections wherever possible. I understand there is a lot to explain, but the paper turns dull with so much compressed text.
    • Include at least some error case in Fig. 3.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper offers an interesting method that sets a new state of the art for longitudinal VQA.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I thank the authors for the rebuttal. Although some minor concerns remain, I believe the paper has enough merits for acceptance. I therefore keep my recommendation. I strongly recommend that the authors improve the readability of the paper, as suggested in my comments.



Review #3

  • Please describe the contribution of the paper

    The paper introduces an approach that expands upon EKAID for performing longitudinal VQA on two reduced quality (8-bit JPG-based) CXR images. The approach adds region-specific RAG, an alignment module, and a paired noise contrastive estimation. The final quantitative results are demonstrated to be higher than the previous EKAID on artificial metrics but statistical significance is not evaluated.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The key novel elements of the paper are the orchestration of RS-RAG, DA, and PairNCE into a system that demonstrates improved results on synthetic metrics. The ablation study of the individual architectural components is a great evaluation of the proposed system and demonstrates the individual benefits of each component.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The clarity of the paper’s terminology could be improved and architectural and data details could be more explicitly mentioned. In addition, more thorough analysis would make the paper more interesting and impactful.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The use of MIMIC-Diff-VQA as done in EKAID makes use of inferior quality JPG images as opposed to the original DICOM images. This limitation should be mentioned in the paper if it is the case.

    • When applying Faster-RCNN and downstream components, image resampling is not mentioned. Is the reader to assume that a 2K x 3K image is passed through the network?

    • Figure 3 shows a successful case from the approach. A subsequent less successful case should also be given.

    • The Analysis gives overall metrics on the dataset but does not delve into individual sub-conditions. This would be of great interest as well as allow for a finer comparison of ablations and approaches. How does it do on fractures specifically?

    • Please compute confidence intervals for the scores provided.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper improves upon an existing approach with novel aspects while providing more conclusive comparisons and ablation analysis. The topic itself is rather novel as well.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I feel the authors have sufficiently addressed points brought up by my review and other reviewers. Addressing these comments as stated in the rebuttal will improve the final manuscript.




Author Feedback

We thank the AC and reviewers for their time. We are encouraged that reviewers recognize the novelty (R4,6,7), rigor in validation (R4,6), structure (R4,7) and improvements (R4,6,7) of our work.

1) R4,R7–Is there a test set leakage? Why do results match GT? Are results cherry-picked?

We categorically confirm there is no test leakage. As stated in Section 2, RS-RAG uses only the training split, excluding validation/testing images. The splits faithfully follow EKAID [8], with no contamination across them. The RAG component contains no information beyond the training data, making test leakage impossible. The performance improvement comes purely from RS-RAG.

The matching outcome is expected. As discussed in Section 3, the Language Model (LM) performs well at learning and reproducing general response syntax; this is also seen in EKAID, but with incorrect pathology changes. The key value of RegioMix lies in accurately identifying pathology and severity changes. Thus, generated responses are expected to match the GT when the predicted pathology changes are also correct. We also observe occasional ordering differences, further supporting that this matching is not due to overfitting or leakage.

Results are definitively not cherry-picked. Out of 70,070 testing samples (same as EKAID), RegioMix generated 36,719 exact matches. For difference questions, RegioMix produced 1,512 exact matches vs. EKAID’s 874 (+73%). The presented samples are not a special case; they were randomly chosen from these matches to highlight the method’s validity. We have added this to the paper.

2) R4–Task too easy to model realistic clinical setting?

We acknowledge R4’s concern but disagree that the task is easy. Although the LM does well on response syntax, identifying and describing pathological differences is very challenging, as seen in EKAID’s errors. The dataset has 31 pathologies, resulting in 3^31 (approx. 10^14) response combinations for difference questions. While multi-label classification with a text template might seem sufficient, it lacks flexibility for other question types. Adapting multiple prediction heads for various questions is not scalable. In contrast, a VQA model with generative output can adapt to any question.
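The combinatorial claim above can be checked with a few lines of arithmetic. The sketch below assumes, as the rebuttal implies, that each of the 31 pathologies can take one of 3 states in a difference answer (e.g. worsened / improved / unchanged):

```python
# Each of the 31 pathologies takes one of 3 states in a difference
# answer -- an assumption used only to illustrate the rebuttal's scale claim.
num_pathologies = 31
states_per_pathology = 3

combinations = states_per_pathology ** num_pathologies
print(f"{combinations:,}")                 # 617,673,396,283,947
print(f"order of magnitude: 10^{len(str(combinations)) - 1}")  # 10^14
```

So 3^31 is roughly 6 x 10^14, consistent with the "approx 10^14" figure quoted in the rebuttal.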

3) R4–More focus on accuracy metrics

The use of NLP metrics is standard for assessing generative VQA models and follows EKAID. We agree accuracy is crucial for evaluating clinical correctness, which is the reason for Table 3, examining accuracy (RegioMix +7.7%) on non-difference questions including abnormality (pathology). CIDEr (RegioMix +77.7) also evaluates clinical correctness by giving higher weights to pathology words.

4) R4,R6,R7–Less successful examples, structuring of figures and paragraphs

We have included two less successful examples in Fig. 3, improved the clarity of Figs. 1 and 2, and addressed the use of the JPG format. Long paragraphs were broken down and rephrased.

5) R4–Evaluation on another dataset.

MIMIC-Diff-VQA is, to our knowledge, the only available dataset for medical longitudinal VQA. Also see 6). Additional experiments are not allowed by the rebuttal rules, but as future work we aim to test RegioMix on neonatal X-rays and are collecting a dataset.

6) R6–Confidence Intervals (CI), fractures comparison.

We conducted experiments with the same seed and hyperparameters as EKAID, with the sole addition of our modules, ensuring improvements stem from these modules rather than from random variation. From experiments with 3 seeds, the 95% CIs in CIDEr [234.0-237.0] and BLEU-4 [48.6-51.1] are significantly higher than EKAID’s 189.3 and 42.2. RegioMix shows a slight advantage, with 45.6 vs. EKAID’s 44.9 in BLEU-4 for fracture questions. However, fractures account for only 1.5% of the test set, so this result should be interpreted with caution. We have added this.
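The rebuttal does not give the per-seed scores, but a 95% CI from three seed runs is commonly computed with the Student's t distribution. The sketch below uses hypothetical CIDEr values (not the authors' actual numbers) chosen only to illustrate the procedure:

```python
import statistics

# Hypothetical per-seed CIDEr scores -- NOT the authors' actual values.
scores = [234.9, 235.5, 236.1]

n = len(scores)
mean = statistics.mean(scores)
sem = statistics.stdev(scores) / n ** 0.5   # standard error of the mean

# Two-sided 95% critical value of Student's t with n - 1 = 2 dof.
t_crit = 4.303

lo, hi = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI: [{lo:.1f}, {hi:.1f}]")      # 95% CI: [234.0, 237.0]
```

With only 3 seeds the t critical value is large (4.303), so the interval is wide relative to the spread of the scores; a non-overlapping interval vs. the baseline is therefore a reasonably conservative indication of improvement.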

7) R7–Reliance on EKAID and dataset

We acknowledge that RegioMix builds on EKAID and uses the same dataset. Our novelty lies in integrating RS-RAG, DA, and PairNCE, showing significant improvements over EKAID alone, as noted by R4 and R6.

8) R6–Faster-RCNN Image size

The original size imposes a high computational cost; we follow EKAID and downsample to 1K, keeping the aspect ratio.
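The rebuttal does not specify the exact resize rule; a common interpretation (assumed here) is to scale the longer side to 1024 pixels while preserving the aspect ratio:

```python
def downsample_size(width: int, height: int, target_long_side: int = 1024):
    """Scale so the longer side equals target_long_side, keeping aspect ratio."""
    scale = target_long_side / max(width, height)
    return round(width * scale), round(height * scale)

# A typical ~2K x 3K chest X-ray, as mentioned by the reviewer:
print(downsample_size(2048, 3072))  # (683, 1024)
```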




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces significant advancements in medical VQA by focusing on longitudinal changes in chest X-rays. The rebuttal has addressed most major concerns and all reviewers are positive about this work’s contributions.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper is an interesting demonstration of RAG in a longitudinal context. All reviewers are in agreement to accept this paper as are the ACs.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



