Abstract
Accurate segmentation of lung infection regions is critical for early diagnosis and quantitative assessment of disease severity. However, existing segmentation methods largely depend on high-quality, manually annotated data. Although some approaches have attempted to alleviate the reliance on detailed annotations by leveraging radiology reports, their complex model architectures often hinder practical training and widespread clinical deployment. With the advent of large-scale pretrained foundation models, efficient and lightweight segmentation frameworks have become feasible. In this work, we propose a novel segmentation framework that utilizes CLIP to generate high-quality multimodal prompts, including coarse mask, point, and text prompts, which are subsequently fed into the Segment Anything Model 2 (SAM2) to produce the final segmentation results. To fully exploit the informative content of medical reports, we introduce a localization loss that extracts positional cues from the text to guide the model in localizing potential lesion regions. Experiments on the CT dataset MosMedData+ and the X-ray dataset QaTa-COV19 demonstrate that our method achieves state-of-the-art performance while requiring only minimal parameter fine-tuning. These results highlight the effectiveness and clinical potential of our approach for pulmonary infection segmentation.
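To make the prompt pipeline described in the abstract concrete, below is a minimal Python sketch of the general idea (hypothetical shapes, thresholds, and function names; the actual implementation is in the linked code repository): CLIP patch and text embeddings are compared to form a similarity map, which is thresholded into a coarse mask prompt and peak-picked into point prompts for SAM2.

import torch
import torch.nn.functional as F

def coarse_prompts_from_clip(patch_feats, text_feat, image_size=224, top_k=3, thresh=0.5):
    """patch_feats: (N, D) CLIP patch embeddings on an HxW grid (N = H*W).
    text_feat: (D,) CLIP embedding of the report sentence.
    Returns a coarse binary mask and top-k point prompts."""
    h = w = int(patch_feats.shape[0] ** 0.5)
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=-1)  # (N,) patch-text similarity
    sim = sim.view(1, 1, h, w)
    sim = F.interpolate(sim, size=(image_size, image_size), mode="bilinear",
                        align_corners=False)[0, 0]
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6)                 # rescale to [0, 1]
    coarse_mask = (sim > thresh).float()                                     # coarse mask prompt
    flat_idx = sim.flatten().topk(top_k).indices                             # strongest responses
    points = torch.stack([flat_idx % image_size, flat_idx // image_size], dim=1)  # (x, y) point prompts
    return coarse_mask, points

# Toy usage with random tensors standing in for CLIP outputs.
mask, pts = coarse_prompts_from_clip(torch.randn(196, 512), torch.randn(512))
print(mask.shape, pts)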
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2608_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/Dukerious/MedSeg
Link to the Dataset(s)
N/A
BibTex
@InProceedings{GaoSic_LocationAware_MICCAI2025,
author = { Gao, Sicong and Pagnucco, Maurice and Song, Yang},
title = { { Location-Aware Parameter Fine-Tuning for Multimodal Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15960},
month = {September},
pages = {303 -- 313}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents a lightweight, location-aware framework for multimodal medical image segmentation, leveraging the pre-trained CLIP and SAM2 models. The authors introduce a bridge module called QGAF to connect CLIP-generated prompts (including coarse mask, point, and text) with SAM2. Moreover, they propose a Global Positional Feature Classification (GPFC) module to extract and utilize location information from medical reports.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper creatively bridges CLIP and SAM2 through a dedicated QGAF module, enabling the use of multimodal prompts in a segmentation task.
- By freezing the backbone and only tuning the bridge module, the method achieves strong performance with significantly reduced trainable parameters, which is very practical for medical applications.
- Extracting spatial cues from text and integrating them via a novel GPFC loss is a smart and underexplored direction, highlighting the potential of language supervision in medical imaging.
- The method is evaluated on two datasets and shows consistent improvements.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Although two datasets are used, both are focused on the same domain (COVID-19 lung CT). Evaluating the method on more diverse anatomical sites or modalities (e.g., brain MRI, cardiac MR) would strengthen the generalizability claims.
- The claim of “lightweight fine-tuning” is well-supported, but comparisons to other parameter-efficient tuning methods (e.g., adapters, LoRA) would be valuable.
- The paper does not mention whether the code or trained models will be released. Providing these would facilitate adoption by the community.
- Figure 1 does not intuitively show how the loss is evaluated.
- There are several grammatical and formatting inconsistencies throughout the paper. For example, in Figure 1, the notation Fcv is used, while in Section 2.2, Fvc appears with inconsistent use of subscripts and superscripts. It is recommended to revise the notations for clarity and consistency.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Figure 2 includes both the Zoom-In and MHSA modules; however, there does not appear to be a corresponding explanation or description of them in the Method section
- Figure 1 needs to be revised to more intuitively illustrate how the loss is computed and evaluated within the proposed framework.
- The authors integrate the CLIP image and text encoders to directly generate a pseudo mask and point prompts. However, the accuracy of the generated pseudo mask remains unclear. While Table 5 presents an ablation study involving point prompts, pseudo masks, and text input, the performance of using text alone is not reported. Moreover, based on the manuscript, the generation of both the pseudo mask and the point prompts appears to be highly dependent on the text input. Providing clarification on this dependency, as well as reporting the standalone effectiveness of the text prompt, would significantly strengthen the analysis.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The quality of the method introduction, the experimental comparison, and the degree of novelty.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ rebuttal addressed my question.
Review #2
- Please describe the contribution of the paper
This study aims to enhance medical image segmentation performance by generating high-quality prompts using CLIP and applying them to SAM2 models, with guidance from localization cues extracted from medical reports. The key contributions of the paper are summarized as follows:
- The introduction of a Global Positional Feature Classification (GPFC) component for accurately extracting lesion localization cues from medical reports.
- The proposal of a Query-Guided Attention Fusion (QGAF) module for effective fusion of visual and textual features.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- A clear and detailed specification of the datasets used in the experiments.
- Comparative analysis between text-prompt-based vision-language models and vision-only segmentation models.
- Ablation studies that demonstrate the effectiveness of the proposed modules.
- Visualization results that highlight the performance gap between the proposed method and state-of-the-art (SOTA) models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper rightly emphasizes that CNN- and Transformer-based segmentation approaches are highly dependent on manual annotations, which presents a challenge for scalability. Given this context, reducing annotation dependency should be considered one of the core objectives of the proposed method. However, it remains unclear whether the proposed approach effectively mitigates this limitation. As a result, readers may question whether the method still relies on manual annotations to a similar extent as the models it seeks to improve upon.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- It would be helpful to clarify in the abstract the distinction between medical reports and natural image descriptions. Rather than stating vaguely that positional cues are extracted from text, the authors should explicitly mention that these cues are derived from medical reports. This clarification is important, as the central claim of the paper hinges on the notion that, in contrast to general textual descriptions, medical reports inherently provide localization cues.
- In the contribution section of the introduction, the role of QGAF should be described more clearly. The current phrasing—“for fine-tuning”—is too vague. The authors are encouraged to emphasize its necessity by, for example, stating that it “enables lightweight fine-tuning through learnable queries.”
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper demonstrates clear novelty, presents a well-explained training procedure, and provides strong empirical validation through ablation studies, the logical flow from the problem statement in the introduction to the final conclusion lacks coherence.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors propose a novel segmentation framework that utilizes CLIP to generate high-quality multimodal prompts, including coarse mask, point, and text prompts, which are subsequently fed into the Segment Anything Model 2 (SAM2) to produce the final segmentation results. In addition, the authors introduce a localization loss that extracts positional cues from the text to guide the model in localizing potential lesion regions. Compared to other segmentation methods, the proposed method is more efficient and lightweight. The method is evaluated on the CT dataset MosMedData+ and the X-ray dataset QaTa-COV19, and the experimental results demonstrate that it achieves state-of-the-art performance while requiring only minimal parameter fine-tuning.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novelty: using CLIP to generate high-quality multimodal prompts that are then fed into SAM2 to produce the final segmentation results.
- Efficiency: the proposed method is lightweight, with few trainable parameters, while achieving the best performance on multiple datasets compared to other segmentation methods.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- No statistical evaluation of results: paired tests would give statistical weight to the argument of “superiority” of the proposed method.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Please release the code to facilitate reproduction.
- Please do a paired t-test to demonstrate the significant improvement brought by the proposed method.
- Please provide attention maps to demonstrate the effectiveness of the GPFC localization loss.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed method is novel, and the experiments are complete. The writing is logical and easy to read.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
My question was answered accordingly.
Author Feedback
We sincerely thank all reviewers and ACs for their time and constructive feedback.
R1-7, Reducing annotation dependency: Unlike previous methods that rely heavily on costly pixel-level annotations, our approach leverages radiology reports, which are typically provided alongside medical images in clinical practice. Our model is capable of directly extracting lesion location cues from radiology reports without requiring any additional annotations, thereby substantially reducing the need for costly pixel-level supervision. We will add more explanation in the introduction.
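As a purely illustrative example of this idea (not the actual GPFC implementation; the keyword list and labels below are hypothetical), a coarse location label can be derived from a report by simple keyword matching and then used as the classification target for the localization loss:

LOCATION_TERMS = {
    "left": ["left lung", "left lobe", "left-sided"],
    "right": ["right lung", "right lobe", "right-sided"],
    "bilateral": ["bilateral", "both lungs"],
}

def location_label(report: str) -> str:
    """Map a radiology report sentence to a coarse location class."""
    text = report.lower()
    for label, keywords in LOCATION_TERMS.items():
        if any(k in text for k in keywords):
            return label
    return "unspecified"

print(location_label("Bilateral ground-glass opacities are observed."))  # -> bilateral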
R1-10.1&10.2 Clearly written: Thanks, and we will follow your suggestion to revise.
R2-7&10.2 Paired t-tests: We have now conducted paired t-tests to assess the statistical significance of our results. Specifically, we performed paired t-tests on the per-sample Dice and mIoU scores across the test set, comparing our method with the recent SOTA method MMI-UNet [2]. The p-values for both metrics are below 0.05, confirming that the performance improvements achieved by our method are statistically significant.
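For reference, a small sketch of the paired test described above, using SciPy's ttest_rel on per-sample scores (the arrays here are random placeholders, not the paper's actual results):

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
dice_ours = rng.uniform(0.70, 0.95, size=100)                    # placeholder per-sample Dice, our method
dice_baseline = dice_ours - rng.uniform(0.0, 0.05, size=100)     # placeholder per-sample Dice, baseline

t_stat, p_value = ttest_rel(dice_ours, dice_baseline)            # paired t-test over the same test cases
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")                    # p < 0.05 -> improvement is significant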
R2-10.1&R3-7.3 Reproducibility: We will release the code after the paper is accepted.
R2-10.3 Attention map: Thanks, that’s a good point. We will add this in supplementary.
R3-7.1 More datasets: Our research primarily focuses on COVID-19 lung imaging, where radiology reports are readily available and clinically relevant. Although both datasets involve the lungs, they differ in imaging modality (CT vs. X-ray), data sources, and annotation protocols, which pose non-trivial generalization challenges. We agree that evaluating our method on other anatomical regions would further demonstrate its generalizability. For example, the polyp segmentation dataset used in MAdapter and the nuclei segmentation dataset adopted in LViT [14] both provide corresponding textual prompts, making them suitable candidates for future evaluation. However, due to insufficient time to run these experiments, we plan to explore this in future work.
R3-7.2 Comparison: MAdapter utilizes Adapter modules, and MedUniSeg adopts LoRA for fine-tuning. According to the results reported in their papers, both underperform compared to our method. Notably, on QaTa-COV19, MedUniSeg's Dice score is 12.16% lower than that of our approach, further highlighting the effectiveness of our proposed method. MAdapter: A Better Interaction Between Image and Language for Medical Image Segmentation (2024). MedUniSeg: 2D and 3D Medical Image Segmentation via a Prompt-driven Universal Model (2024).
R3-7.4&7.5&10.2 Fig.1 and clearly written: Thanks, and we will follow your suggestion to revise.
R3-10.1 Fig. 2: This detail is mentioned in the Query-guided Attention Fusion Module section: the zoom-in and zoom-out operations are implemented as linear transformations that reduce the feature dimension to 64 in order to save memory and improve efficiency, and MHSA denotes multi-head self-attention, which extracts meaningful information from the features. We will revise the text to make this clearer.
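A rough sketch of the zoom-in / MHSA behaviour described in this response, assuming only what the rebuttal states (a linear projection down to 64 channels followed by multi-head self-attention); all module names and dimensions below are hypothetical:

import torch
import torch.nn as nn

class ZoomAttention(nn.Module):
    def __init__(self, in_dim=768, zoom_dim=64, num_heads=4):
        super().__init__()
        self.zoom_in = nn.Linear(in_dim, zoom_dim)    # reduce channels to save memory
        self.mhsa = nn.MultiheadAttention(zoom_dim, num_heads, batch_first=True)
        self.zoom_out = nn.Linear(zoom_dim, in_dim)   # restore the original dimension

    def forward(self, tokens):                        # tokens: (B, N, in_dim)
        x = self.zoom_in(tokens)
        x, _ = self.mhsa(x, x, x)                     # self-attention over the reduced features
        return self.zoom_out(x)

feats = torch.randn(2, 196, 768)                      # e.g. a batch of patch tokens
print(ZoomAttention()(feats).shape)                   # torch.Size([2, 196, 768])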
R3-10.3 Table 5, part 1: As shown in the second row of Table 5, incorporating the coarse mask substantially improves performance, by 23.2% in Dice and 23.45% in mIoU on MosMedData+, and by 18.65% in Dice and 23.03% in mIoU on QaTa-COV19. This demonstrates the high quality of the coarse mask and its effective transfer to MedSAM2. Due to space constraints, we did not elaborate on the performance of using text prompts alone; however, using the textual prompt alone yields around a 10% improvement over the zero-prompt setting in Dice and mIoU on both datasets, verifying the effectiveness of textual cues.
R3-10.3 Table 5, part 2: The text plays a critical role in guiding the vision-language model (CLIP) to generate high-quality prompts for MedSAM2. Specifically, we utilize the localization information embedded in the text to generate accurate and relevant prompts. This is one of the key motivations behind our use of text-guided segmentation. We will add more explanation of this.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All three reviewers agree to accept this paper after rebuttal.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A