Abstract

In addressing the unique challenges of medical image segmentation, foundation models such as the Segment Anything Model (SAM), originally developed for natural images, often falter due to the distinct nature of medical images. This study introduces the Language Guided Adapter (LGA), a parameter-efficient fine-tuning approach that extends SAM’s utility to medical segmentation tasks. LGA encodes textual data from medical reports into embeddings with a pretrained BERT model and combines these embeddings with the image features in SAM’s image encoder using Feature Fusion Modules (FFM). Our method significantly enhances model performance and reduces computational overhead by freezing most parameters during fine-tuning. Evaluated on the CT-based MosMedData+ dataset and the X-ray dataset QaTa-COV19, LGA demonstrates its effectiveness and adaptability, achieving competitive results with a significant reduction in the number of fine-tuned parameters compared to state-of-the-art medical segmentation models. These results underscore the potential of foundation models, leveraging the integration of multimodal knowledge, as a pivotal approach for specialized medical tasks, charting a course towards more precise and adaptable diagnostic methodologies.
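
As a rough illustration of the fusion mechanism described above, the sketch below shows a minimal PyTorch cross-attention adapter in the spirit of an FFM; the class name, dimensions, and single-block structure are illustrative assumptions, not the authors' implementation.

    import torch.nn as nn

    class FeatureFusionModule(nn.Module):
        """Illustrative FFM sketch (not the authors' code): injects BERT
        text embeddings into SAM image features via cross-attention."""
        def __init__(self, img_dim=768, txt_dim=768, num_heads=8):
            super().__init__()
            self.norm1 = nn.LayerNorm(img_dim)
            self.norm2 = nn.LayerNorm(img_dim)
            # Image tokens (queries) attend to text tokens (keys/values).
            self.cross_attn = nn.MultiheadAttention(
                embed_dim=img_dim, kdim=txt_dim, vdim=txt_dim,
                num_heads=num_heads, batch_first=True)
            self.mlp = nn.Sequential(
                nn.Linear(img_dim, 4 * img_dim), nn.GELU(),
                nn.Linear(4 * img_dim, img_dim))

        def forward(self, img_tokens, txt_tokens):
            # img_tokens: (B, N_img, img_dim); txt_tokens: (B, N_txt, txt_dim)
            fused, _ = self.cross_attn(self.norm1(img_tokens),
                                       txt_tokens, txt_tokens)
            x = img_tokens + fused              # residual injection of text cues
            return x + self.mlp(self.norm2(x))  # feed-forward refinement

In such a setup only the adapter parameters would be trained; the SAM encoder and BERT would stay frozen, for example via p.requires_grad_(False) on their parameters.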

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3350_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3350_supp.pdf

Link to the Code Repository

https://github.com/JiHooooo/LGA/

Link to the Dataset(s)

https://github.com/HUANGLIZI/LViT

BibTex

@InProceedings{Hu_LGA_MICCAI2024,
        author = { Hu, Jihong and Li, Yinhao and Sun, Hao and Song, Yu and Zhang, Chujie and Lin, Lanfen and Chen, Yen-Wei},
        title = { { LGA: A Language Guide Adapter for Advancing the SAM Model’s Capabilities in Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a Language Guided Adapter (LGA) that is built from a stack of feature fusion modules based on cross-attention. The authors adopt the LGA to achieve parameter-efficient fine-tuning for medical image segmentation, using medical text as additional input.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The proposed method is compared with existing medical image segmentation models both with and without text inputs. It outperforms nnUNet by 4% Dice, and the state-of-the-art text-guided model by 1%, while training fewer parameters; the method is effective and competitive.

    2) The authors conducted ablation studies to show the effectiveness of the LGA and of the text input, respectively. These are meaningful experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) My major concern is the limited application scenario of the proposed method. If the method requires a medical report as input, the diagnosis must already have been completed by radiologists before the segmentation model is run. If so, the method cannot be used for computer-aided diagnosis. The authors should therefore discuss the application scenario of the proposed method in the Introduction and Conclusion sections, as well as in the rebuttal.

    2) To better demonstrate the method's strength, the authors should consider reporting results obtained with generated medical text, rather than real medical text, at the inference stage. For example, the authors could train an image-captioning model to produce medical text for a test image during inference. How do the results change when generated medical text is used at inference? That experiment would be meaningful.

    3) In my opinion, the use of ‘SOTA’ is unprofessional and informal.

    4) Some important details are missing: on which datasets, and how, were the parameters of the SAM pretrained?

    5) Will the authors release the trained model weights and the training and inference code? A response is expected in the rebuttal.

    6) In Table 3, the design of ‘Dual Cross’ could be analyzed further; for example, what happens if ‘L-to-V-attn’ is replaced with ‘L-self-attn’ (see the sketch below)?
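
    For concreteness, a minimal sketch of the two configurations this question contrasts, under the assumption that the dual-cross design first lets text attend to image features before guiding them (illustrative names and wiring, not the paper's code):

        import torch.nn as nn

        dim, heads = 768, 8
        l_branch = nn.MultiheadAttention(dim, heads, batch_first=True)
        v_branch = nn.MultiheadAttention(dim, heads, batch_first=True)

        def dual_cross(img, txt):
            txt2, _ = l_branch(txt, img, img)    # 'L-to-V-attn': text attends to image
            img2, _ = v_branch(img, txt2, txt2)  # image attends to the updated text
            return img2

        def ablated_variant(img, txt):
            txt2, _ = l_branch(txt, txt, txt)    # 'L-self-attn': text self-attention instead
            img2, _ = v_branch(img, txt2, txt2)
            return img2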

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    1) It is not clear which datasets were used to pretrain the SAM, or how the pretraining was performed.

    2) It is suggested to release the code and trained model weights.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please check the weaknesses of the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the proposed LGA is acceptable. The LGA is simple yet effective. The overall method achieves very competitive results compared with previous works while tuning very few parameters.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces the Language Guide Adapter (LGA), an innovative approach that extends the Segment Anything Model (SAM) to medical image segmentation by integrating textual data from medical reports with image features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The multi-modal approach, which integrates the encoder of a foundation model (SAM), is innovative and may address the performance degradation commonly seen in medical segmentation tasks when using such models. This concept is worth further exploration.
    • The proposed method surpasses many competing approaches in terms of performance.
    • The authors offer sufficient technical details, making their methods straightforward and easy to understand.
    • They conduct thorough and intriguing ablation studies to assess the effectiveness of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The report text seems highly structured, resembling a concatenation of different structured labels. This raises two concerns: (i) At the testing stage, using such structured text as input might lead to label leakage, allowing the model to infer information it shouldn’t, thereby compromising its accuracy and fairness. (ii) During the training stage, this level of structured text could lead to degenerate representations in text embeddings. This could risk collapsing the representation learning and reduce the overall performance of the segmentation tasks.

    2. The gap between image and text features seems significant. What is the outcome of aligning these two types of features? The authors should discuss this aspect further to clarify the rationale behind their approach and underscore the motivation for their work.

    3. The authors mention MedSAM [1] several times throughout the paper, treating it as a competitive baseline. From my experience and perspective, the original SAM is designed to handle general segmentation tasks elegantly. However, MedSAM just fine-tunes the decoder for each specific domain or medical dataset, which can unfortunately turn a general-purpose model back into a domain-specific one. Although MedSAM’s fine-tuned performance surpasses that of the original SAM (which, of course, was designed for natural images), it lacks technical innovation and clinical application value compared to other well-established domain-specific methods that outperform MedSAM.

    A similar concern arises with this work. I noticed that the authors incorporated the Language Guide Adapter into the framework, suggesting a more flexible approach compared to MedSAM. I recommend that the authors explain the rationale behind their approach and highlight the motivation for their work in the paper. Without this clarification, the innovative aspects of their method might be limited, much like what happened with MedSAM.

    [1] Ma, J., et al.: Segment anything in medical images. arXiv preprint arXiv:2304.12306 (2023).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors provide enough technical detail to make their methods easy to understand and follow.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the comments regarding the weaknesses.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty of the multi-modal method and convincing results. Considering the previous comments, the paper’s score could improve if these issues are addressed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper uses a language guide adapter with parameter-efficient fine-tuning to integrate textual information into the segmentation process, supported by SAM and BERT backbones. To combine the embeddings from the two models, the authors use a feature fusion module. They evaluate on two multimodal datasets covering both CT and X-ray modalities and demonstrate SoTA performance on both.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-written and easy to follow.

    The architecture design is intuitive, well-motivated and quite parameter efficient. The method is also not tightly coupled to the text encoder, so this can easily be replaced as more powerful embedding models become available.

    The proposed method outperforms all baselines across two good sized datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It isn’t clear how much of the superior performance comes from the FFM component versus from using the SAM architecture. So far as I can tell, the other text+image models use a different segmentation backbone, which may put them at a disadvantage given SAM’s extensive pre-training. In the case of LViT-T, the underlying architecture is still ViT-based like SAM, so in principle the LViT-T method could also be used with SAM, so far as I am aware.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    A well-written paper with a clear contribution. Although SAM was originally trained with a text prompt encoder, so far as I am aware this has not been released; this contribution should therefore be broadly of interest to anyone looking to use text with SAM in a segmentation task. To support this, the team also appears to plan to release the code.

    I am somewhat surprised at how well BERT embeddings perform without fine-tuning, especially as medical reports may be out of distribution relative to the Wikipedia pre-training data. Some previous work finds BERT embeddings perform quite poorly out of the box [1]. I understand the aim is to keep the number of fine-tuning parameters low; however, BERT can be integrated with off-the-shelf parameter-efficient fine-tuning methods [2] (see the sketch after the references below). Did you investigate at all whether fine-tuning BERT improved performance?

    I have some concerns about the comparability of the other image-text methods, particularly regarding what sort of segmentation backbone they used, which the team could resolve in the rebuttal.

    [1] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: EMNLP (2019).

    [2] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models. In: ICLR (2021).
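
    As a concrete illustration of [2], wrapping the BERT text encoder with LoRA via the Hugging Face peft library could look like the sketch below; the checkpoint name and hyperparameters are assumptions for illustration only:

        from transformers import BertModel
        from peft import LoraConfig, get_peft_model

        # Load the text encoder used for report embeddings (assumed checkpoint).
        bert = BertModel.from_pretrained("bert-base-uncased")

        # Inject low-rank adapters into the attention projections only,
        # keeping the count of fine-tuned parameters small.
        config = LoraConfig(
            r=8,              # rank of the low-rank update matrices
            lora_alpha=16,    # scaling factor for the LoRA update
            target_modules=["query", "value"],
            lora_dropout=0.1,
        )
        bert = get_peft_model(bert, config)
        bert.print_trainable_parameters()  # typically well under 1% of BERT's weights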

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Weak accept, as the paper overall makes a clear contribution; I just have some previously stated concerns regarding baselines that could be clarified in the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thank you for your valuable suggestions, which are greatly helpful for our future research. Below, we address the main concerns raised:

  1. Limited Application Scenario and Information Leakage: We acknowledge the concern regarding the practical application of our method and the potential for information leakage. In future work, we will explore strategies to reduce the dependency on structured text during training, aiming to prevent label leakage and improve model fairness and accuracy.
  2. Performance Comparison: To ensure fair comparisons in our future research, we will (i) standardize backbones, using consistent segmentation backbones across the compared methods, and (ii) use generated text, reporting results with generated medical text during inference, as suggested, to evaluate model robustness.
  3. Language Model Improvements: We will compare the performance of models pre-trained on medical data, such as BioClinicalBERT, with our current approach. Additionally, we will assess the impact of parameter-efficient fine-tuning methods like LoRA on our task.
  4. Releasing Resources: We will release our code to support open science and reproducibility.




Meta-Review

Meta-review not available, early accepted paper.


