Abstract
Clinical decision-making relies heavily on understanding the relative positions of anatomical structures and anomalies. For Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions in medical images is therefore a fundamental prerequisite. Despite its importance, this capability remains largely underexplored. To address this gap, we evaluate state-of-the-art VLMs (GPT-4o, Llama 3.2, Pixtral, and JanusPro) and find that all of them fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can improve performance. While these markers provide moderate improvements, performance on medical images remains well below that observed on natural images. Our evaluations suggest that, when answering relative-position questions on medical images, VLMs rely more on prior anatomical knowledge than on the actual image content, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the ability to identify relative positions in medical images. Dataset and code are available at https://wolfda95.github.io/your_other_left/.
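For illustration only, here is a minimal sketch, assuming PIL and using hypothetical file names, marker coordinates, and prompt wording (none of this is the paper's code), of how such a marker-based visual prompt could be constructed:

```python
# Minimal sketch (not the authors' code) of the visual-prompting setup the
# abstract describes: colored, alphanumeric markers are drawn on the image
# and a relative-position question refers to them.
from PIL import Image, ImageDraw

def add_marker(draw: ImageDraw.ImageDraw, xy: tuple[int, int],
               label: str, color: str, radius: int = 12) -> None:
    """Draw a filled circle with an alphanumeric label at pixel position xy."""
    x, y = xy
    draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill=color)
    draw.text((x - 4, y - 8), label, fill="white")

img = Image.open("ct_slice.png").convert("RGB")  # hypothetical CT slice
draw = ImageDraw.Draw(img)
add_marker(draw, (90, 128), "A", "red")    # e.g., placed on the liver
add_marker(draw, (160, 128), "B", "blue")  # e.g., placed on the spleen
img.save("ct_slice_marked.png")

question = ("In this image, is marker A to the left or to the right of "
            "marker B? Answer with 'left' or 'right'.")
# The marked image plus `question` would then be sent to a VLM such as GPT-4o.
```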
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0530_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://wolfda95.github.io/your_other_left/
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WolDan_Your_MICCAI2025,
author = { Wolf, Daniel and Hillenhagen, Heiko and Taskin, Billurvan and Bäuerle, Alex and Beer, Meinrad and Götz, Michael and Ropinski, Timo},
title = { { Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {694--704}
}
Reviews
Review #1
- Please describe the contribution of the paper
The main contribution of this paper is a systematic and novel evaluation of the spatial reasoning capabilities of VLMs in medical imaging. The paper (1) introduces MIRP, a benchmark dataset targeting relative anatomical positioning; (2) demonstrates that state-of-the-art VLMs fail to reliably answer spatial questions based on image content; (3) shows that performance improves when anatomical labels are removed, suggesting models rely more on priors than on visual evidence; and (4) explores the use of visual markers as a method to improve interpretability and performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel Task Focus: Understanding relative positions specifically for anatomical structures is an underexplored capability in clinical AI. This seems like an important prerequisite for real-world deployment.
- Benchmark dataset: The MIRP dataset is carefully constructed, including controls such as image rotations and random flips to isolate visual reasoning from memorized anatomical knowledge.
- Experiments: The experiments are structured around clear research questions and ablation studies.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Unsubstantiated Assumptions on Clinical Prerequisites: The claim that spatial reasoning is a “prerequisite” for clinical performance is intuitive but unvalidated in the paper. No downstream task assesses whether spatial failures translate to poor diagnostic performance.
- Missing Evaluation of Anatomical Knowledge: The models’ reliance on anatomical priors is hypothesized but not fully disentangled from their lack of visual reasoning. Furthermore, the paper states that “State-of-the-art VLMs already possess strong prior anatomical knowledge”, referencing [35]. However, that paper explicitly states that VLMs were not evaluated on images: “All questions containing clinical images were excluded”. A control experiment that explicitly tests whether models know basic anatomy would strengthen this argument, e.g., “Is there a liver in this scan?”
- [Minor] Clarity and Motivation of Marker Experiment: While the use of visual markers is inspired by natural image work, its clinical motivation is less clear. Would such markers be feasible or desirable in real-world workflows?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While this paper offers a valuable medical benchmark for assessing localization ability, its clinical significance remains uncertain for two main reasons: 1) It is quantitatively ambiguous how crucial the localization task is for subsequent decision-making processes; observing a correlation in the experiments would strengthen the claim that this is an important task. 2) The assertion that VLMs have robust anatomical priors lacks evidence (the cited reference does not directly evaluate VLMs), making it essential for the MIRP benchmark to examine anatomical priors before addressing localization questions, as poor localization outcomes might stem from insufficient anatomical priors.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Although interesting, the paper in its current state lacks any meaningful analysis of the underlying hypothesis that “VLMs already possess strong prior anatomical knowledge” (the cited paper does not even make this assertion) and does not provide any downstream tasks to demonstrate the truth of the claims. There are no experiments that show whether the VLMs that perform poorly do so because they lack anatomical prior knowledge or because they have poor spatial reasoning. Although I thank the authors for communicating that they performed these evaluations, this manuscript does not contain them.
Review #2
- Please describe the contribution of the paper
The authors propose the MIRP benchmark, with the aim of evaluating the performance of current MLLMs in recognizing relative anatomical positions within medical images. Based on the constructed benchmark, the authors further conduct a comprehensive analysis with visual marker strategies to determine factors that affect MLLMs’ awareness of positions. The findings provide inspiring insights for developing more spatially aware medical AI systems.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Evaluating MLLMs’ limitations in understanding anatomical position is significant, as it lays the foundation for improving downstream-task performance and interpretability.
- The procedure of gradually introducing visual markers is reasonable, demonstrating current MLLMs’ inability to associate prior medical knowledge with the given medical scan.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Lack of comparison with MLLMs fine-tuned on medical domain knowledge, which could further demonstrate whether medical vision-language data improve the understanding of relative anatomical regions.
- Limited discussion of how to integrate visual marker strategies into existing MLLM systems to assist clinical practice.
- The authors claim that the constructed data include flipped/rotated images. This might raise the concern that position prediction errors could result from the images being incorrectly oriented. If the images were in proper orientation while only organ positions varied, how would the models perform?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- As a fundamental task, the ultimate goal of correctly recognizing anatomical position is to enhance downstream tasks, such as disease diagnosis. It is worth exploring the model’s performance and the effectiveness of visual markers in clinical decision-making or diagnostic accuracy related to anatomical position.
- What do the authors think is the better strategy to improve the model’s understanding of anatomical position: automatically providing extra visual markers, or improving the MLLMs’ awareness of positions within the visual inputs?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces a novel benchmark dataset and conducts a systematic evaluation of MLLMs’ ability to understand relative anatomical positions. However, the conclusions are currently limited by the lack of comparison with medically fine-tuned models and the questionable validity of artificially manipulated images.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have resolved most of my concerns. I believe this paper could be accepted.
Review #3
- Please describe the contribution of the paper
The authors investigate how well SOTA VLMs can answer spatial reasoning questions on medical images. They first establish a baseline with unedited images, then expand to including landmarks in the images, and finally test reasoning over landmarks only. They find that leading models perform poorly and rely on memorized prior medical knowledge instead of actually analysing the image, but perform well when tasked to use only the landmarks. They make their benchmark public to aid further research.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Well motivated research questions
- Good experimental design in which the authors control and modify various factors in the questions and images, allowing them to investigate why models fail, not just that they fail
- Repeated experiments with standard deviations reported
- Measured and well reasoned discussion
- Open code and benchmarks
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- When investigating the influence of human anatomy priors, the authors should run an experiment where the anatomy prior and the orientation in the image are aligned, and evaluate performance. This should then be compared to a cohort where all priors and orientations are misaligned. This is similar to what the authors have done, but framing the question as aligned vs. misaligned, instead of simply evaluating prior vs. image, would further illuminate RQ3.
- Typo in Fig. 1: “plane” should be “plain”
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is well motivated, well executed, and well measured in its discussion. I can think of no glaring weaknesses.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The initial paper was already of high enough quality to be accepted, and I believe the authors’ rebuttal to R1 and R2 adequately addresses their concerns.
Author Feedback
We thank all reviewers for their constructive feedback. We appreciate the recognition that our submission addresses an “underexplored capability” (R1), is “significant” (R2), has a “good experimental design” (R1), and that the “findings provide inspiring insights” (R2). R3 summarizes: “The paper is well motivated, well executed, and well measured in its discussion.”
We appreciate R1’s request for a clearer justification of localization tasks as a clinical prerequisite. Our research was motivated by the vision that VLMs could support complex tasks such as radiological report generation or surgical planning. These tasks are currently performed by radiologists, and localization mistakes can have severe consequences: wrong-level spine surgeries [A], wrong-side surgeries [B], or missing that a tumor lies near vessels [C]. Just as radiologists require spatial understanding, VLMs equally require it to be clinically applicable. Without this ability, they cannot reliably describe localizations in reports, which could lead to similarly severe outcomes. Our work does not specifically address these concrete downstream cases, as it is intended as foundational work investigating general relative-positioning tasks. As a next step, models should be evaluated on such concrete cases.
R1 is correct that [35] does not evaluate image-based tasks. Our intention was to show that the language components of VLMs have strong prior medical knowledge, as demonstrated in [35] via text-based evaluations. We therefore hypothesized that VLMs rely more on this prior knowledge than on image content. We tested this in RQ3: GPT and Pixtral answered the majority of questions correctly with respect to human anatomy, but incorrectly with respect to the actual flipped/rotated image. A correct anatomical answer likely stems from prior knowledge within the language part. This indicates that both models weigh language priors more heavily than visual input. Our wording in this section was unclear and will be revised. R1 suggested testing visual prior anatomical knowledge. We conducted the suggested experiment early on. As the outcome was consistent with our main findings, we focused the manuscript on the core experiments, namely the introduction of markers. With markers, even a model with poor visual prior knowledge could answer correctly by comparing marker positions.
R1/R2: Markers in clinical practice could be auto-placed via segmentation models. Their use may be clinically valuable if they help VLMs in complex tasks. We will detail this in the Discussion.
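As a hedged illustration of this idea (everything below, including file names and the mask source, is an assumption rather than the paper's pipeline), a marker could be placed at the centroid of an organ segmentation mask:

```python
# Hedged sketch of auto-placing a marker from a segmentation mask, as the
# rebuttal suggests. File names are hypothetical; any segmentation model
# producing a binary HxW organ mask would work as the source.
import numpy as np
from PIL import Image, ImageDraw

def mask_centroid(mask: np.ndarray) -> tuple[int, int]:
    """Return the (x, y) pixel centroid of a binary organ mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.mean()), int(ys.mean())

image = Image.open("ct_slice.png").convert("RGB")  # hypothetical input
liver_mask = np.load("liver_mask.npy")             # hypothetical binary mask
cx, cy = mask_centroid(liver_mask)

draw = ImageDraw.Draw(image)
draw.ellipse((cx - 10, cy - 10, cx + 10, cy + 10), fill="red")
draw.text((cx - 3, cy - 6), "A", fill="white")     # marker label at centroid
image.save("ct_slice_auto_marked.png")
```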
R2: We agree that evaluating medical fine-tuned VLMs is an important next step, building on our foundational work with base models. We already have such models prepared for future evaluation on our benchmark dataset.
R2 points out that position errors might stem from a non-standard image view. Radiological images in standard view mirror the anatomical definition (anatomical right appears on the image’s left). Thus, VLMs relying on prior anatomical knowledge instead of visual input produce wrong answers. Since the left-right swap is a well-defined convention, VLMs might learn to compensate for it. To remove this potential bias, we applied flips/rotations. We also analyzed the existing results for each flip/rot variant separately, but excluded this from the paper, as it aligned with our findings. Even with no flip/rot (i.e., standard view), performance remained poor. We now see the importance of clarifying that position errors are not due to non-standard views. Thank you for highlighting this. R3: In the two cases where the view aligns with the anatomical definition (Flip + NoRot and NoFlip + 180° Rot), GPT-4o and Pixtral performed well, supporting our RQ3 finding.
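For concreteness, here is a minimal sketch of how such variants can be generated, under the assumption (for illustration only) that just horizontal flips and 180° rotations are combined; the benchmark itself may include further rotations:

```python
# Illustrative sketch (not the authors' code) of flip/rotation variants like
# those described above, which decouple the image content from the standard
# radiological viewing convention.
from PIL import Image

def make_variants(img: Image.Image) -> dict[str, Image.Image]:
    """Generate the four flip/180-degree-rotation variants of a slice."""
    flipped = img.transpose(Image.FLIP_LEFT_RIGHT)
    return {
        "noflip_norot": img,                 # standard radiological view
        "noflip_rot180": img.rotate(180),    # left/right matches anatomy
        "flip_norot": flipped,               # left/right matches anatomy
        "flip_rot180": flipped.rotate(180),  # mirrors anatomy again
    }
```

Consistent with the rebuttal, flip_norot and noflip_rot180 are the two variants whose left/right orientation agrees with the anatomical definition.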
We thank all reviewers for helping improve the manuscript and hope our findings and open benchmark will foster safe, clinically useful VLMs.
[A] DOI: 10.1007/978-3-031-61601-3_1
[B] Joint Commission: Sentinel Event Data 2023 Annual Review
[C] DOI: 10.1007/s00330-020-07307-5
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors propose an important research question: to investigate whether MLLMs are able to examine the image and reason about the location of an anatomical structure. The authors constructed a dataset for this purpose and tested state-of-the-art MLLMs. While this addresses an important clinical research question, I would like to see how this affects downstream AI tasks, which unfortunately isn’t discussed. In its current state, it’s a well-written paper that answers the proposed question.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The reviewers agree that the paper addresses a novel and interesting aspect of clinical AI, with well-designed experiments and a good benchmark. While concerns were raised regarding the clinical relevance of the task and the need for clearer validation of anatomical priors, I believe the rebuttal adequately clarified these points.