Abstract

Medical Vision-Language Models (Med-VLMs) have demonstrated strong capabilities in clinical tasks. However, they often struggle with understanding anatomical structures and spatial positioning, which are crucial for medical reasoning. To address this, we propose a localization-aware enhancement to the Med-VLM pipeline, introducing improvements at three levels: data, architecture, and alignment. First, we introduce the localization lens, a set of expert-validated representations that provide richer anatomical and positional context. Because these representations increase input complexity, we integrate pixel shuffle within the model architecture to filter and refine representations, enhancing spatial information processing while preserving anatomical continuity. Lastly, to effectively align the localization lens representations with textual features, we incorporate decoupled contrastive loss (DCL) alongside the standard loss function. This ensures better feature discrimination and robustness, particularly in data-limited medical settings. Through extensive evaluations on medical visual question answering (Med-VQA) datasets, we show that our methodology improves localization-driven performance across different Med-VLM architectures. Our analysis of localization-based questions further reveals that improvements in anatomical and spatial reasoning directly enhance overall Med-VQA accuracy by up to 6.2%. The proposed approach is model-agnostic and can be seamlessly integrated into existing Med-VLM pipelines. The dataset, code, and trained models will be made publicly available at URL
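As an illustration of the data-level enhancement, the sketch below shows one plausible way to construct the four lens views (original, single-color segmented, multi-color coded, and masked representations, as enumerated in Review #1) from an image and its organ segmentation mask. This is a minimal sketch for intuition only; the function name, color choices, and array conventions are assumptions, not the authors' released code.

    import numpy as np

    def build_localization_lens(image, seg, seed=0):
        # image: (H, W, 3) uint8 RGB scan; seg: (H, W) organ-label map, 0 = background.
        rng = np.random.default_rng(seed)

        # (b) single-color segmented view: all organ pixels painted one color.
        single = image.copy()
        single[seg > 0] = (0, 255, 0)

        # (c) multi-color coded view: each organ label gets its own color.
        multi = image.copy()
        for label in np.unique(seg):
            if label != 0:
                multi[seg == label] = rng.integers(0, 256, size=3)

        # (d) masked view: background suppressed, organ pixels kept.
        masked = np.where((seg > 0)[..., None], image, 0).astype(image.dtype)

        # (a)-(d): the four complementary representations forming the lens.
        return [image, single, multi, masked]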

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2723_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/CVLABLUMS/localizationlens

Link to the Dataset(s)

https://github.com/CVLABLUMS/localizationlens

BibTex

@InProceedings{FarHas_Localization_MICCAI2025,
        author = { Farooq, Hasan and Taj, Murtaza and Nasim, Mehwish and Mahmood, Arif},
        title = { { Localization Lens for Improving Medical Vision-Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {342--351}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • The manuscript explores the localization aspect of Med-VLMs by introducing a localization-aware enhancement that improves anatomical and spatial reasoning.
    • The approach consists of clinically meaningful representations, architectural modifications, and alignment refinements, making it model-agnostic and easily integrable into existing Med-VLM architectures.
    • To enhance the performance of Med-VLMs (assessed on medical VQA), the manuscript proposes a pixel-shuffle mechanism within the architecture for filtering relevant context, disease-aware contrastive learning, and vision-text alignment with decoupled contrastive loss.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript's proposed methodology improves localization-driven performance across different Med-VLM architectures (tested on Med-VQA).

    • The manuscript proposes clinically meaningful representations which act as a localization lens to enhance anatomical and positional understanding in medical images (augmenting each clinical image with several complementary representations: (a) the original image, (b) a single-color segmented representation of the original image, (c) a multi-color coded segmented representation of the original image, and (d) a masked representation of the original image).
    • The manuscript integrates a pixel-shuffle mechanism within the model architecture to effectively handle the increased input complexity from the localization lens, improving the capture and refinement of spatial and anatomical details.
    • The manuscript proposes a vision-language alignment pipeline that first aligns the localization lens representations using decoupled contrastive learning, followed by their integration into Med-VQA tasks.
    • The manuscript conducts a localization analysis to evaluate how the proposed localization-aware enhancements impact Med-VQA performance, demonstrating significant improvements in spatial reasoning and anatomical understanding across different Med-VLM architectures.
    • The results show that small VLMs are capable of performance improvements with effective training pipelines and architectural design.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The manuscript does not address computational complexity.
    • The manuscript should present the main algorithms to facilitate easy reproduction of results.
    • It is necessary to clarify where these contributions lie within the four main modules of typical VLM architecture (image encoder, text encoder, fusion module, pre-training objectives), rather than merely listing events or phenomena without revealing the essence of the issue.
    • Clearly identify the set of mathematical expressions that control the system’s operation and specify what improvements have been made in them.
    • Is there a zero-shot prediction or few-shot learning assessment?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The manuscript's proposed methodology improves localization-driven performance across different Med-VLM architectures (tested on Med-VQA).
    • The manuscript does not address computational complexity.
    • The manuscript should present the main algorithms to facilitate easy reproduction of results.
    • It is necessary to clarify where these contributions lie within the four main modules of typical VLM architecture (image encoder, text encoder, fusion module, pre-training objectives), rather than merely listing events or phenomena without revealing the essence of the issue.
    • Is there a zero-shot prediction or few-shot learning assessment?
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The new contributions are acceptable, and the responses to the reviewers are acceptable.



Review #2

  • Please describe the contribution of the paper

    The paper proposes: (1) a localization lens to enhance the model’s localization awareness in the medical context; (2) a pixel-shuffle mechanism to reduce computational complexity while preserving spatial relationships; (3) a novel two-stage training strategy to integrate the localization lens into vision-text alignment.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper focuses on a crucial limitation of medical VLMs by enhancing their ability to perceive localization within medical scans through the proposed localization lens.
    2. The paper navigates the trade-off between extra complexity and improved spatial information by introducing novel training schemes, including pixel shuffling and decoupled contrastive loss.
    3. The proposed method is model-agnostic, demonstrating improved localization capabilities across multiple base models.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of a comprehensive ablation study to validate the effectiveness of the proposed pixel-shuffle mechanism. While the authors claim that this architectural change could reduce complexity and preserve critical features, the quantitative evidence for this argument is insufficient.
    2. Inadequate validation of the advantages of decoupled contrastive loss (DCL) over traditional InfoNCE in the medical imaging context. It would be interesting to see how much the model performance could improve when augmented views are introduced in the contrastive loss.
    3. It would be better to compare the localization lens approach with existing spatial enhancement methods. Some works have already attempted to enhance regional understanding with bounding-box-based region features, such as RGRG [1] and CHEX [2], in the report generation task. Have the authors compared with these spatial enhancements applied to visual question-answering tasks?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The authors claim that the mask representation has been validated by experts. It is unclear (1) how medical experts were engaged in the validation process of the mask representations, and (2) what percentage of cases required expert modifications during post-processing.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces a localization lens for improving positional awareness in the medical context, along with a pixel-shuffle mechanism to reduce computational complexity, and a two-stage training strategy for better alignment. The paper lacks comprehensive ablation studies to validate the effectiveness of the proposed pixel-shuffle mechanism and decoupled contrastive loss (DCL). Additionally, the paper would benefit from comparing the localization lens approach with existing spatial enhancement methods based on bounding boxes.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The initial concern about the effectiveness of each proposed module has been addressed with the reported performance.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a localization-aware enhancement to medical vision-language models (Med-VLMs) through a combination of data augmentation, architectural modifications, and contrastive learning objectives. Specifically, the authors introduce (1) the localization lens, a set of representations (segmentations and masked images) that inject anatomical and spatial context into model training; (2) a pixel-shuffle architectural modification to handle the increased input complexity; (3) a first training phase that includes the decoupled contrastive loss to avoid repulsion from negative samples; and (4) demonstrated improvements over multiple VLMs.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Public release of code and dataset. Meaningful data with clinically feasible representations allows for improved reproducibility.
    2. Model-agnostic and scalable approach. Their approach can be used with both large and small VLMs.
    3. Pixel shuffle is an architectural novelty and a creative adaptation in the Med-VLM space.
    4. Strong empirical results. Every technique the authors use has a corresponding empirical increase, which bolsters support for their approach. Their ablation study (Table 3) is a big plus.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    In no particular order:
    1. There is no evaluation of improved segmentation for Med-VLMs, which I would assume this pretraining strategy would improve beyond just Med-QA content.
    2. It would be interesting to compare the patch shuffle strategy with more conventional strategies to get quantitative insights into patch partition methods. Also, the visual in Fig. 1 for pixel shuffle is slightly confusing. It would be nice to describe the strategy visually a bit more.
    3. What are the training costs of this approach versus standard vision-language pretraining?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper systematically tackles the challenges in current Med-VLMs through a well-motivated framework by introducing anatomy and additional spatial awareness during training. The integration of the localization lens, architectural filtering via pixel shuffle, and alignment improvements using decoupled contrastive loss make for a compelling pipeline that is model-agnostic and yields measurable performance gains. While there are some limitations around figure clarity and compute trade-offs, the novelty and empirical results justify acceptance. This work is likely to be of interest to both the Med-VLM and broader MLLM communities.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I had an accept status before, and the authors’ comments to me in the rebuttal did not help or hurt them. I will leave it as accept.




Author Feedback

We thank the reviewers for appreciating the work and providing valuable feedback. In particular, we are grateful for the acceptance of the work (R3) and for the acknowledgment of the contributions in the method (R1, R2, R3), the improvements across multiple baselines (R1, R2), the novel training strategy (R2), the novelty of the pixel shuffle integration (R1, R2, R3), the strong empirical results and ablation (R3), and the overall clarity (R1, R2, R3).

Reviewer 1:

Computational Complexity: Our method adds the computational cost of preparing the augmentations, so the overall complexity is the one-time cost of the augmentations (localization lens) plus the VLM training cost, which depends on the number of epochs. When trained on an NVIDIA A100 (40 GB) GPU for 10 epochs on the combined VQA-RAD and SLAKE dataset used in Table 1, SmolVLM [23], with 1.7B parameters, took ~3.0 hours, while SmolVLM (with Lens) took ~3.5 hours. During testing, our method does not incur any additional cost, as the augmentations are used only to improve the training of the VLM.

Reproducibility: We have elaborated the method in Fig. 1. To facilitate easy reproduction of results, our code, trained models, and dataset will be made publicly available. Pseudocode will also be added to the same GitHub page.

Contributions in VLM Architecture: In the VLM architecture, we employ a novel fusion module with a pixel-shuffle algorithm that improves both the image and text encodings. In pre-training, organs are first aligned with the lens; then organs and lens are aligned with text tokens during training (Sec. 2.2).
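For intuition, pixel shuffle in VLM fusion modules is commonly a space-to-depth rearrangement: each r x r neighbourhood of visual tokens is folded into a single token with r^2 times the channel dimension, cutting the token count by r^2 while keeping spatially adjacent features together. The following is a minimal sketch under that common formulation; the function name and exact reshape order are illustrative assumptions, not necessarily the authors' implementation.

    import torch

    def pixel_shuffle_tokens(x, grid, r=2):
        # x: (B, N, C) visual tokens arranged as a grid x grid map, N = grid * grid.
        # Folds every r x r neighbourhood of tokens into one token of
        # dimension C * r * r, so nearby spatial features are merged
        # rather than discarded.
        b, n, c = x.shape
        assert n == grid * grid and grid % r == 0
        x = x.view(b, grid, grid, c)
        x = x.view(b, grid, grid // r, c * r)            # fold r columns into channels
        x = x.permute(0, 2, 1, 3).contiguous()           # bring the row axis forward
        x = x.view(b, grid // r, grid // r, c * r * r)   # fold r rows into channels
        return x.view(b, (grid // r) ** 2, c * r * r)    # (B, N / r^2, C * r^2)

    # Example: 576 tokens (24 x 24 grid) of dim 768 -> 144 tokens of dim 3072.
    tokens = torch.randn(1, 576, 768)
    print(pixel_shuffle_tokens(tokens, grid=24).shape)   # torch.Size([1, 144, 3072])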

Mathematical Expressions: In the paper, we have given equations for each part of the proposed system separately (see Eq. 1-5, 7). These equations specify the improvements made over the existing VLM-based system.

Few-Shot: We have not performed a few-shot assessment; instead, we have performed an extensive assessment of how the methodology improves localization in downstream tasks such as VQA. This includes an analysis of anatomy and positioning for all the models presented in the paper (Table 2).

Reviewer 2:

Pixel Shuffle and Loss Ablation: In our methodology, pixel shuffle and the DCL loss are used to handle the input representation of the augmentations (localization lens) during training. With our configurations of the localization lens, ablation results are shown in Table 3 on the VQA-RAD dataset. The following results, with the best lens configuration for the pixel-shuffle and loss configurations marked with an asterisk (*), can be added to Table 3 for the SmolVLM [23] 1.7B model:

• InfoNCE loss: 55.4% (standard, Table 3)
• *InfoNCE loss + lens: 57.3%
• *Pixel shuffle + InfoNCE loss + lens: 58.8%
• *DCL + lens: 57.7%
• Pixel shuffle + DCL + lens: 61.6% (proposed, Table 3)

Thus, our approach improved the accuracy from 55.4% to 61.6%, of which 3.9% is contributed by pixel shuffle.
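For reference, the operative difference between the two losses in this ablation is that decoupled contrastive loss (DCL; Yeh et al., 2022) removes the positive pair from the InfoNCE denominator, decoupling the positive and negative terms. A minimal one-directional sketch on paired, L2-normalised embeddings follows; it is illustrative, not the authors' exact formulation.

    import torch

    def contrastive_losses(z_img, z_txt, tau=0.07):
        # z_img, z_txt: (B, D) L2-normalised embeddings of matched pairs.
        logits = z_img @ z_txt.t() / tau          # (B, B) similarity matrix
        pos = logits.diag()                       # matched image-text pairs

        # InfoNCE: the positive pair is included in the denominator.
        info_nce = (-pos + torch.logsumexp(logits, dim=1)).mean()

        # DCL: mask the positive out of the denominator before the log-sum-exp.
        diag = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
        neg = logits.masked_fill(diag, float('-inf'))
        dcl = (-pos + torch.logsumexp(neg, dim=1)).mean()
        return info_nce, dcl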

Bounding Box and Segmentation Comparison: RGRG (CVPR 2023) and CHEX (ECCV 2024) will be included in the related work, as these works improve localization via bounding boxes. We have reproduced the results using bounding boxes instead of segmentations. Our ablation results show that segmentation gives better results within the proposed methodology. We believe the reason is that bounding boxes provide coarse localization, whereas the proposed segmentation-based localization provides fine-grained spatial enhancement. The results on the VQA-RAD dataset are as follows:

• SmolVLM 1.7B [23] (segmentation): 61.6% (Table 3)
• SmolVLM 1.7B [23] (bounding box): 57.8%

Reviewer 3

Improved Segmentation: Yes, with the added augmentations (localization lens), segmentation should also improve. We thank the reviewer for this excellent suggestion and will explore this direction on tasks other than Med-QA in future research.

Pre-training Costs: Please see the discussion on computational complexity in response to reviewer 1.

Clarity in Figure: We will improve the visualization of pixel shuffle in Fig. 1 by clearly mentioning the position of text and image tokens before and after the pixel shuffle step.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers have acknowledged the potential and merit of the work, and based on this, I vote for acceptance.


