Paper Info Reviews Author Feedback Meta-Review Back to top

List of Papers Browse by Subject Areas Author List

Abstract

With the advancement of Large Language Model (LLM) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans. Our in-depth analysis reveals the potential of transferring LLM’s semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek. Code is available at: https://github.com/FengheTan9/LLM4Seg.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0627_paper.pdf

SharedIt Link: https://rdcu.be/eHw3N

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05127-1_39

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/FengheTan9/LLM4Seg

Link to the Dataset(s)

N/A

BibTex

@InProceedings{TanFen_PreTrained_MICCAI2025,
        author = { Tang, Fenghe AND Ma, Wenxin AND He, Zhiyang AND Tao, Xiaodong AND Jiang, Zihang AND Zhou, S. Kevin},
        title = { { Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        page = {402 -- 412}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). And the model improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) This paper proposes a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). (2) The proposed method achieves SOTA results across multiple medical imaging domains.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. In general, LLMs have billions of parameters and require a significant amount of GPU memory resources. Although the proposed method freezes LLM parameters and reduces the number of training parameters, large memory cost still exists. However, the application scenarios of medical images are generally limited to edge computing, which leads to the limitations of the application of the proposed method. The author should compare the memory usage analysis of the model during the inference phase, as such parameter comparisons are more convincing.
2. The methods compared in the experiments are not the latest methods, so the proposed method in the paper cannot achieve convincing “SOTA results”.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed method does not show significant performance improvement compared to TransUnet, and the improved performance comes at the cost of increasing the overall parameter count of the model.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The author’s rebuttle was able to give a partially reasonable explanation for my problem. Hopefully the author will make changes based on rebuttle’s response.

Review #2

Please describe the contribution of the paper

This paper presents a novel and empirically validated approach that successfully demonstrates the feasibility of integrating large language models (LLMs) into the bottleneck layer of U-Net architectures for medical image segmentation.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This paper is the first work to effectively leverage LLMs for processing visual tokens in U-Net’s bottleneck, enabling global contextual understanding while improving segmentation performance.
2. Achieves SOTA performance across diverse medical imaging modalities (2D and 3D).
3. This paper establishes a promising direction for combining vision-language models with classical segmentation architectures in medical image analysis.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The authors should more explicitly demonstrate the methodological novelty beyond simply inserting a frozen LLM into U-Net’s bottleneck, particularly clarifying whether the frozen LLM exhibits any medical-domain reasoning patterns during visual token processing.
2. The manuscript should explicitly state the input dimensions (e.g., 224×224 or 512×512) for all datasets. This is critical for reproducibility and FLOPs calculation.
3. While the LLM remains frozen during training, the computational graph still includes its full operations, so I recommend adding FLOPs analysis for the LLM-integrated bottleneck in Tables 1-2.
4. The FLOPs in Table 1 and 2 should by GFLOPs
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the paper presents a logically coherent methodology with clear experimental design and interesting conclusions, the authors fail to adequately explain why the model generates superior activation maps. In particular, the paper would benefit from a more thorough analysis of whether the LLM develops higher-level semantic understanding (i.e., a ‘reasoning process’) during feature encoding. This key mechanism requires deeper investigation and discussion. Therefore, I currently rate this submission as ‘weak accept’.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

All concerns have been adequately addressed - I recommend acceptance.

Review #3

Please describe the contribution of the paper

It introduces a novel hybrid framework that integrates a frozen pre-trained Large Language Model (LLM) layer into a CNN-based encoder-decoder segmentation architecture. The LLM layer, without fine-tuning, enhances medical image segmentation by semantically enriching visual tokens through its linguistic priors. The method yields consistent performance gains across multiple imaging modalities—including ultrasound, dermoscopy, polyp, and CT—achieving new state-of-the-art results with only a minor increase in trainable parameters. Detailed activation, statistical, and structural analyses reveal that the LLM contributes to semantic refinement, improving foreground-background separation and enhancing both global context and local detail modeling within the CNN. The framework shows strong generalizability, maintaining effectiveness across different LLMs (e.g., LLaMA, DeepSeek) and layers within them. Moreover, the added computational cost is minimal, limited to lightweight projection layers, making the method resource-efficient.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. LLM4Seg leverages the semantic priors of LLMs—originally trained on large-scale text data—to enhance visual understanding in medical imaging, challenging the traditional view that LLMs are limited to NLP tasks. By transferring semantic knowledge from the textual to the visual domain, the approach reduces reliance on large labeled medical datasets, offering a data-efficient alternative.
2. The method demonstrates strong clinical relevance, achieving state-of-the-art performance across diverse modalities including ultrasound, dermoscopy, polyp detection, and CT scans. It represents a novel application of LLMs in a space historically dominated by CNNs and Vision Transformers, highlighting their versatility in multimodal contexts. 3.Backed by thorough evaluations across multiple datasets and baselines, the results provide solid evidence of both effectiveness and generalizability.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The paper evaluates the method on several benchmark datasets, but some of these datasets are relatively small in size, especially compared to the large-scale datasets used for pre-training LLMs.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well-written and clearly structured, with in-depth experimental analysis supported by effective visualization and interpretability.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors’ rebuttal has partially resolved my concerns. I recommend acceptance after revisions.

Author Feedback

We appreciate the reviewers’ comments and insightful suggestions. We thank the reviewers for acknowledging the interest(R2,R3), novelty(R2,R3), organization(R1,R2,R3), and impressive experimental results(R2,R3) in our paper. Our responses are as follows:

PERFORMANCE & DATASET R1-Q2: In the original paper, we have compared with recent methods, including TinyU-Net (MICCAI 2024 Oral) and UniRepLK (CVPR 2024). Comparing with TransUnet, CMUNeXt+LLaMA outperforms it by ≥1.7 IoU (p<0.01) on BUSI and ≥1.1 (p<0.01) on Kvasir, while having fewer than 3/5 of the parameters and GFLOPs. On the larger TNSCUI dataset, U-Net+LLaMA improves IoU by 1.8 over the baseline and 0.17 (p<0.01) over TransUnet with fewer parameters. On ISIC, U-Net+LLaMA(T) also achieves the best performance than TransUnet. Moreover, our goal is to explore the potential of LLMs in medical vision tasks. LLM4Seg can be integrated into various encoder-decoder architectures and enhance their performance (become new SOTA by SOTA + LLM4Seg). We firmly believe that improvements can be observed by incorporating LLM4Seg into TransUnet, nnUNet, and other strong models. R3-Q1: Thanks for the reviewer’s thoughtful feedback. While it is common for medical domain to have small-scale annotation datasets (e.g. 0.5K in BUSI, 1K in Kvasir), our method aims to leverage LLMs’ pre-trained knowledge to boost such small-scale but challenging tasks, and the strong performance on BUSI and Kvasir confirms its effectiveness. In addition, consistent gains on larger datasets (3.4K in TNSCUI and 2.5K in ISIC 2018) demonstrate the scalability of our approach.

COMPUTATION R1-Q1: Thank you for pointing out practicality. Unlike previous LLM-based vision methods such as LLaVA, our LLM4Seg uses only a single intermediate layer of the LLM, significantly reducing both parameter and computation than full LLMs. For instance, in LLaMA3.2-1B and DeepSeek-R1-Distill (Qwen-1.5B), the effective parameters involved are just 60M (1/16 of the full LLaMA) and 46M (1/28 of the full DeepSeek), respectively, making our method suitable for edge-computing scenarios typical in clinical settings. Inference runtime increases only slightly compared to without inserting this layer (from 5.0ms to 5.9ms), demonstrating the method’s practicality under real-world constraints. R2-Q2,3,4: Thanks for pointing out the unclarity of resolution and FLOPs calculation. The input resolution of all datasets is 256x256. We clarify that the +LLAMA(T) row in Tables 1 and 2 reports forward GFLOPs including the frozen LLM during inference. Since the LLM layer is frozen, its grad are not updated during training, which could reduce the overall backward computation cost. To avoid confusion, we will revise the table to report forward GFLOPs explicitly and provide a more detailed explanation in the manuscript.

PERSPECTIVE R2-Q1: Thanks for the insightful comment. Our approach goes beyond simply inserting a frozen LLM into the U-Net bottleneck. The key novelty lies in leveraging the LLM’s pretrained representations to enhance semantic alignment in a new domain (visual) without any fine-tuning. Activation analyses show that models with the LLM layer produce activation maps with clearer background separation and consistently higher effective rank, indicating more structured and discriminative representations. Because the encoder is trainable, it learns to project features into the input space of the frozen LLM layer, effectively tapping into the semantic priors learned from large-scale language pretraining. While this transfer is implicit, the observed better foreground-background separation suggests a form of semantic abstraction or reasoning during visual token processing. We hope our findings will inspire more work in our community to unlock the full potential of frozen LLMs in vision tasks and pave new directions for multi-modal representation learning.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

All reviewers agreed to accept the manuscript, and the rebuttal addressed most reviewers’ doubts.

back to top

Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster

Author(s):