Abstract

Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single-image tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, and co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs’ multi-image understanding capabilities in the medical domain. The Med-MIM instruction dataset, benchmark, and fine-tuned models are available at \href{https://github.com/Xikai97/Med-MIM}{Med-MIM}.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0784_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Xikai97/Med-MIM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YanXik_Medical_MICCAI2025,
        author = { Yang, Xikai and Miao, Juzheng and Yuan, Yuchen and Wang, Jiaze and Dou, Qi and Li, Jinpeng and Heng, Pheng-Ann},
        title = { { Medical Large Vision Language Models with Multi-Image Visual Ability } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {403--413}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution of this paper is the introduction of the Med-MIM instruction dataset and benchmark, aimed at enhancing the ability of medical large vision-language models to understand multi-image scenarios. Using this dataset, the authors fine-tuned Mantis and LLaVA-Med, resulting in MIM-LLaVA-Med and Med-Mantis, which perform strongly on multi-image analysis tasks. Additionally, the paper introduces the Med-MIM benchmark to comprehensively evaluate LVLMs’ capabilities in medical multi-image understanding. Experimental results show that the fine-tuned models significantly improve multi-image visual abilities.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1 One of the starting points of this paper is a genuine requirement of existing medical tasks, namely time-based (temporal) medical image tasks.

    2 The proposed benchmark covers both in-domain and out-of-domain settings, making it closer to real medical scenarios.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1 On page 3, the authors claim that “Subsequently, GPT-4o, combined with iterative refinement, is employed to generate the inherent Med-MIM instruction dataset and the held-in benchmark.” Why do LLaVA-Med and Mantis, which are trained on data generated by GPT-4o, significantly outperform GPT-4o itself in the held-in part of Table 1?

    2 On page 3, the author claims that “we construct a composed multi-image dataset by manually grouping multiple images from the LLaVA-Med VQA.” How exactly was this grouping operation carried out?

    3 In Figure 2c, why isn’t the proportion of the number of samples from the LLaVA-Med VQA dataset shown?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The author has addressed most of my concerns, and I agree to accept.



Review #2

  • Please describe the contribution of the paper

    The paper presents a specialized dataset of QA pairs over multiple images to enable temporal and cross-modality reasoning.

    The data is then used to fine-tune two specialized VLMs covering four capabilities.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and contributes towards an important application of VLMs for multi-image understanding in the medical imaging domain. The comparisons and ablation studies are well designed to establish the effectiveness of the proposed method.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    It would be helpful to discuss why the model performance on the reasoning task drops as more data is used.

    Also, how do the fine-tuned models perform on data such as RAD-VQA?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well presented paper, and a good contribution towards clinical utility of VLM models.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The author feedback has helped address some of the concerns.



Review #3

  • Please describe the contribution of the paper

    The authors collected and constructed a new dataset that enables large VLMs to handle multi-image applications such as temporal, multi-view, comparison, and co-reference tasks.

    The authors also conducted thorough experiments that demonstrate the usefulness of this dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This is a much-needed area for large medical VLMs; frankly, I am quite surprised that this is the first work doing this. I did a brief literature search and only found MAIRA-2 (https://arxiv.org/abs/2406.04449), in addition to what the authors mention in the related works, doing this for the CXR domain. So I am very happy that this work addresses it.
    • The comparison and data ratio experiments are thorough and demonstrate the usefulness of such a dataset.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Not major weaknesses per se, but for future work (not for this paper) I would recommend also looking into at least two areas:

    • Multi-view and multi-timepoint for CXR
    • Looking into RL-style approaches instead of SFT
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Address a much needed domain for large medical VLMs, and with thorough experiments demonstrating the usefulness of the dataset.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Address a much needed domain for large medical VLMs, and with thorough experiments demonstrating the usefulness of the dataset.




Author Feedback

To R1: We thank R1 for the positive overall feedback and thoughtful suggestions. Below, we reply to the two specific points raised:

  • Discuss performance on reasoning set: While the overall performance improves with the increasing size of the whole instruction dataset, the drop in reasoning performance may result from the limited diversity of the reasoning subset, which is entirely derived from the EMBED dataset (Fig. 2(c)). In contrast, the other subsets are constructed from more diverse sources. As the overall dataset grows, the model may prioritize learning sub-tasks with more complex data compositions, potentially weakening its focus on the reasoning task and further limiting its reasoning ability. In the future, we plan to collect reasoning-related data from a wider range of sources to address this limitation and validate our hypothesis.
  • Performance on RAD-VQA: While we regret that MICCAI guidelines do not allow the inclusion of additional experimental results at this stage, we would like to emphasize that this work primarily focuses on the multi-image setting, and we evaluate our models on the multi-image version of the RAD-VQA benchmark. The corresponding results are reported in Table 1 (MIM-RAD column). Importantly, our fine-tuned models demonstrate improved performance compared to their original versions, which aligns with our expectations and highlights the effectiveness of our approach.

To R2: We sincerely thank R2 for the high praise of our work. As highlighted in our paper, the multi-image visual capabilities of VLMs are indeed critical in clinical applications. We also deeply appreciate R2’s thoughtful suggestions for future work. Exploring additional multi-image scenarios, such as multi-view and multi-timepoint analysis for CXR, as well as integrating advanced training techniques, holds great potential for further improving the model’s performance in our future studies.

To R3: We thank R3 for the valuable and insightful comments. Below are our detailed responses:

  • GPT-4o performance: In our framework, GPT-4o is only utilized for its text-organization capabilities during instruction dataset generation, rather than being distilled or directly used to generate QA pairs from images. For instance, GPT-4o summarizes descriptions like “pneumothorax is stable” (first visit) and “pneumothorax is unchanged” (second visit) into “no significant changes observed between two visits.” This step primarily involves text processing (a minimal sketch of this step is given after this list). In contrast, our models are directly fine-tuned on both visual and textual information, requiring advanced visual understanding across multiple medical images, which is significantly more challenging than text organization.
  • Group operation for composed dataset: For the composed subset, the primary goal is to equip the model with fundamental medical data analysis capabilities while enhancing its co-reference visual ability (e.g., identifying the target image based on a given index and describing its content). For the grouping operation itself, we first select one image from the LLaVA-Med image dataset as the starting point of the input multi-image sequence. The remaining images in the sequence are then selected from the same dataset, ensuring their corresponding text information differs significantly from that of the first image. Finally, we shuffle the multi-image sequence to construct the final sample (a second sketch of this grouping procedure is given after this list). This design increases the complexity of describing the target image, providing a greater challenge to the model’s co-reference visual ability.
  • Display in Fig. 2(c): In Fig. 2(c), we focused on illustrating the distribution of the inherent multi-image dataset, as stated in the figure caption. The inherent instruction dataset is specifically designed for a variety of practical tasks, and our intention was to emphasize this aspect. In the final version, we will also include the proportions of samples from the composed dataset to provide a more comprehensive representation.
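A minimal Python sketch (not the authors' code) of the text-organization step described in the GPT-4o bullet above, assuming the OpenAI Python SDK; the prompt wording and the function name summarize_visit_findings are hypothetical illustrations only.

    # Hedged sketch: GPT-4o used purely as a text organizer, not as an image QA generator.
    # Assumes the openai>=1.0 Python SDK and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    def summarize_visit_findings(first_visit: str, second_visit: str) -> str:
        """Merge two per-visit textual findings into one change description."""
        prompt = (
            "Combine the following two radiology findings into a single sentence "
            "describing the change between visits.\n"
            f"First visit: {first_visit}\n"
            f"Second visit: {second_visit}"
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Example from the rebuttal: the two per-visit descriptions would be organized into
    # something like "no significant changes observed between two visits."
    # summarize_visit_findings("pneumothorax is stable", "pneumothorax is unchanged")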
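A second Python sketch (again not the authors' code) of the grouping procedure described in the composed-dataset bullet above: pick an anchor image, add images whose paired text differs substantially from the anchor's, then shuffle. The dissimilarity test (word overlap below a threshold) and all names here are assumptions for illustration; the rebuttal does not specify the exact criterion.

    import random

    def text_dissimilar(caption_a: str, caption_b: str, max_overlap: float = 0.2) -> bool:
        """Treat two captions as 'significantly different' if their word overlap is low."""
        words_a, words_b = set(caption_a.lower().split()), set(caption_b.lower().split())
        overlap = len(words_a & words_b) / max(1, min(len(words_a), len(words_b)))
        return overlap < max_overlap

    def build_composed_sample(pool, group_size=4, seed=None):
        """pool: list of (image_path, caption) pairs from a single-image VQA dataset.
        Assumes the pool holds enough dissimilar candidates for the requested size."""
        rng = random.Random(seed)
        anchor = rng.choice(pool)                      # starting point of the sequence
        candidates = [item for item in pool
                      if item is not anchor and text_dissimilar(anchor[1], item[1])]
        group = [anchor] + rng.sample(candidates, group_size - 1)
        rng.shuffle(group)                             # hide the anchor's position
        target_index = group.index(anchor)             # co-reference question targets this image
        return group, target_index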




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Med-MIM introduces an 83.2K‐sample instruction dataset and benchmark that enable medical vision-language models to perform multi-image reasoning, comparison, temporal tracking, and co-reference tasks. All three reviewers ultimately recommended acceptance: R1 and R2 praised the clear presentation, novel dataset, and strong ablations, and R3’s initial concerns—about GPT-4o’s role, dataset grouping, and figure details—were satisfactorily addressed in the rebuttal. Given the paper’s important contribution, thorough validation, and successful rebuttal, I recommend acceptance.


