Abstract

Difference Visual Question Answering (DiffVQA) is a new task aimed at understanding and answering questions about the disparities observed between two images. Unlike traditional medical VQA tasks, DiffVQA closely mirrors the diagnostic procedure of radiologists, who frequently conduct longitudinal comparisons of images taken at different time points for a given patient. The task accentuates the discrepancies between images captured at distinct temporal intervals. To better address these variations, this paper proposes a novel Residual Alignment model (ReAl) tailored for DiffVQA. ReAl is designed to produce flexible and accurate answers by analyzing the discrepancies in chest X-ray images of the same patient across different time points. Compared with previous methods, ReAl additionally adopts a residual input branch, into which the residual of the two images is fed. Furthermore, a Residual Feature Alignment (RFA) module is introduced to ensure that ReAl effectively captures and learns the disparities between corresponding images. Experimental evaluations conducted on the MIMIC-Diff-VQA dataset demonstrate the superiority of ReAl over previous state-of-the-art methods, consistently achieving better performance. Ablation experiments further validate the effectiveness of the RFA module in enhancing the model’s attention to differences. The code implementation of the proposed approach will be made available.
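This page gives no implementation details beyond the description above, but the residual-branch and alignment ideas can be illustrated with a minimal PyTorch-style sketch. Everything below is an assumption for illustration only — the use of three ResNet-50 encoders, the cosine-similarity form of the alignment term, and all names are not the authors' implementation.

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet50

    class ResidualBranchSketch(nn.Module):
        """Hypothetical sketch of the residual-branch idea: separate image encoders
        for the main image, the reference image, and their pixel-level residual,
        plus a consistency term that aligns the residual feature with the feature
        difference (the RFA intuition). Not the authors' implementation."""

        def __init__(self):
            super().__init__()
            self.main_enc = nn.Sequential(*list(resnet50().children())[:-1])
            self.ref_enc = nn.Sequential(*list(resnet50().children())[:-1])
            self.res_enc = nn.Sequential(*list(resnet50().children())[:-1])

        def forward(self, main_img, ref_img):
            residual = main_img - ref_img                  # pixel-level residual input
            f_main = self.main_enc(main_img).flatten(1)    # (B, 2048)
            f_ref = self.ref_enc(ref_img).flatten(1)
            f_res = self.res_enc(residual).flatten(1)
            # Assumed alignment loss: the residual feature should agree with the
            # difference of the reference and main features.
            align_loss = 1.0 - F.cosine_similarity(f_res, f_ref - f_main, dim=-1).mean()
            return f_main, f_ref, f_res, align_loss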

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2957_paper.pdf

SharedIt Link: https://rdcu.be/dV19i

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72086-4_61

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Lu_Spot_MICCAI2024,
        author = { Lu, Zilin and Xie, Yutong and Zeng, Qingjie and Lu, Mengkang and Wu, Qi and Xia, Yong},
        title = { { Spot the Difference: Difference Visual Question Answering with Residual Alignment } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {649--658}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    A novel model for DiffVQA is proposed for analyzing the discrepancies in longitudinal chest X-ray images; it incorporates a residual image encoder and a residual feature alignment module to better capture image discrepancies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. An explicit residual image encoder is adopted to better capture image discrepancies.
    2. Residual feature alignment is proposed to better understand the image differences.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    More implementation details are needed, e.g., details of projection module.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. As there may be certain misalignment (e.g., in position), is there any preprocessing for the main and reference images (maybe registration) before they are fed into the Residual Encoder?
    2. Other than the proposed Residual Encoder and residual consistency loss, what about explicitly using the differences (f^ref - f^main) as input to the multi-modal decoder?
    3. How will the trade-off between cls loss and con loss affect the performance?
    4. Some example results obtained by other methods will be helpful in Fig. 3.
    5. Is it possible to visualize where the DiffVQA model actually looked at to generate the results?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A novel model for medical DiffVQA is proposed for analyzing the discrepancies in longitudinal chest X-ray images; it incorporates a residual image encoder and a residual feature alignment module to better capture image discrepancies.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper addresses an interesting problem, i.e., Difference Visual Question Answering (DiffVQA), which contributes to medical image analysis and treatment planning. The proposed Residual Alignment framework (ReAl) is simple yet effective and achieves good performance on a publicly available dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The framework design is simple yet effective: visual information from the reference and main images, along with textual prompt information, is taken as input. The integration of temporal information is a good idea to guide the model’s focus on differences.
    2. Taking a pre-trained Large Language Model (LLM), i.e., GPT-2, as the decoder is helpful for text generation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. I wonder if the main and reference images are strictly registered. If not, directly using the pixel-level subtraction result as the input of the residual encoder will cause misalignment problems [1, 2].
    2. More details about the textual encoder should be provided. Is it a pre-trained language model like GPT or BERT?
    3. More experiments on the decoder should be included. Would it be helpful to replace GPT-2 with a medical vision-language model or a medical VQA model, such as Clinical-BERT [3] or MMBERT [4]?

    [1] Mok T C W, Chung A. Affine medical image registration with coarse-to-fine vision transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 20835-20844.
    [2] Kong L, Lian C, Huang D, et al. Breaking the dilemma of medical image-to-image translation[J]. Advances in Neural Information Processing Systems, 2021, 34: 1964-1978.
    [3] Yan B, Pei M. Clinical-BERT: Vision-language pre-training for radiograph diagnosis and reports generation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(3): 2982-2990.
    [4] Khare Y, Bagal V, Mathew M, et al. MMBERT: Multimodal BERT pretraining for improved medical VQA[C]//2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE, 2021: 1033-1036.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The core design of the textual encoder is unclear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. The visual encoders take the classical ResNet-50 as the backbone; I wonder why the emerging ViTs were not applied, since transformers are more suitable for vision-language modeling.
    2. The textual encoder design is unclear. Is it a transformer-based model? Will there be gaps between the image features and the textual embeddings, since they use different architectures as backbones?
    3. Why not employ a medical large language model or a powerful vision-language large model as the decoder?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall framework is simple yet effective, which I find appealing. However, I think the performance could be further improved by several refinements, e.g., replacing the image encoders and the multi-modal decoder with stronger models.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a novel Residual Alignment model for the difference VQA (diffVQA) task. The proposed method outperformed all the previous approaches on the MIMIC-Diff-VQA dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The key strength of the paper is that it proposes a novel ReAl model, along with a Residual Feature Alignment (RFA) module, to solve the diffVQA problem, and achieved SOTA results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    NA

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    There are multiple typos, e.g., the last paragraph of the Introduction should have “, we” (not “Finally, We”) and “SOTA” (not “SOAT”).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • DiffVQA is an important real-world problem.
    • The paper proposes a novel approach, and was able to outperform the SOTA approaches.
  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thank you for your insightful feedback and constructive comments on our paper. Here are our clarifications that address the major concerns raised.

Q1-Preprocessing for Image Alignment (R3, R4) We appreciate your concern regarding the potential misalignment between the main and reference images. In our current implementation, we assume that the images are approximately aligned, as they are derived from the same patient and typically captured under standardized conditions. However, we acknowledge that minor misalignments can occur. To address this, we will include an image registration step in our preprocessing pipeline to ensure better alignment of the images before they are fed into the Residual Encoder. This enhancement will be described in a future version.
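The feedback only names such a registration step; as a rough illustration, a 2D rigid registration of the reference image to the main image could be sketched with SimpleITK as below. The choice of library, transform model, metric, optimizer settings, and file names are all assumptions, not part of the authors' pipeline.

    import SimpleITK as sitk

    # Hypothetical preprocessing sketch: rigidly align the reference image to the
    # main image before computing the pixel-level residual.
    fixed = sitk.ReadImage("main_image.png", sitk.sitkFloat32)    # placeholder path
    moving = sitk.ReadImage("ref_image.png", sitk.sitkFloat32)    # placeholder path

    registration = sitk.ImageRegistrationMethod()
    registration.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    registration.SetOptimizerAsRegularStepGradientDescent(
        learningRate=2.0, minStep=1e-4, numberOfIterations=200)
    registration.SetInitialTransform(
        sitk.CenteredTransformInitializer(
            fixed, moving, sitk.Euler2DTransform(),
            sitk.CenteredTransformInitializerFilter.GEOMETRY))
    registration.SetInterpolator(sitk.sitkLinear)

    transform = registration.Execute(fixed, moving)
    aligned_ref = sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0,
                                moving.GetPixelID())
    residual = sitk.Subtract(fixed, aligned_ref)   # residual fed to the residual branch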

Q2-Text Encoder (R4) We appreciate your request for more details regarding the textual encoder. In our implementation, we utilize the “get_input_embeddings()” function of GPT-2.
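For readers unfamiliar with this function: in the Hugging Face Transformers API, get_input_embeddings() returns GPT-2's token-embedding layer, which can then embed a tokenized question. A minimal usage sketch follows; the checkpoint name and question string are placeholders, not details from the paper.

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder checkpoint
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # GPT-2's learned token-embedding layer, reused here as the text encoder
    embed = model.get_input_embeddings()

    question = "what has changed compared with the previous study?"   # placeholder
    ids = tokenizer(question, return_tensors="pt").input_ids
    question_embeds = embed(ids)    # shape: (1, seq_len, 768) for base GPT-2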

Q3-Decoder (R4) We will replace GPT-2 with more powerful models in the future.

Q4-Gaps between multi-modal embeddings (R4) The projection module is designed to address these gaps.

Q5-Other comments (R1, R3, R4) We appreciate your additional comments and will carefully consider them to make the corresponding improvements.




Meta-Review

Meta-review not available, early accepted paper.
