Abstract

Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease research, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactivity. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models such as CLIP and the Segment Anything Model (SAM), with comprehensive cross-domain representations, has opened the door for interactive and universal image segmentation. However, exploration of these models for data-efficient medical image segmentation is still limited but highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM, that combines CLIP and SAM models to generate segmentations of clinical scans using text prompts in both zero-shot and weakly supervised settings. To achieve this, we employed a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model and used the recent gScoreCAM to generate prompts for obtaining segmentation masks from SAM in a zero-shot setting. Additionally, we explored using the zero-shot segmentation labels in a weakly supervised paradigm to further improve segmentation quality. Through extensive testing on three diverse segmentation tasks and medical image modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework demonstrates excellent accuracy. Code is available at https://github.com/HealthX-Lab/MedCLIP-SAM.
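
As a rough illustration of the DHN-NCE objective mentioned above, below is a minimal PyTorch sketch of a decoupled, hard-negative-weighted contrastive loss: the positive pair is excluded from the denominator, and negatives closer to the anchor are up-weighted. The weighting scheme and the hyperparameters tau and beta are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

import torch
import torch.nn.functional as F

def dhn_nce_loss(img_emb, txt_emb, tau=0.07, beta=0.15):
    # img_emb, txt_emb: (B, D) batches from the image and text encoders.
    # tau (temperature) and beta (hard-negative weight) are assumed values.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau        # (B, B) cosine similarities over temperature
    pos = sim.diag()                 # matched image-text pairs (same for both directions)

    def directional(s):
        b = s.size(0)
        off_diag = ~torch.eye(b, dtype=torch.bool, device=s.device)
        neg = s[off_diag].view(b, b - 1)   # "decoupled": positive excluded from denominator
        # Hard-negative weighting: negatives closer to the anchor count more.
        w = (b - 1) * F.softmax(beta * neg.detach(), dim=1)
        return -pos + torch.logsumexp(neg + w.log(), dim=1)

    # Symmetric objective: image-to-text plus text-to-image.
    return (directional(sim) + directional(sim.t())).mean()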

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2311_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2311_supp.pdf

Link to the Code Repository

https://github.com/HealthX-Lab/MedCLIP-SAM

Link to the Dataset(s)

https://github.com/razorx89/roco-dataset
https://drive.google.com/file/d/1qY_LLYRM7akV50_wOn-ItNKU5rGpfjya/view?usp=drive_link
https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset
https://drive.google.com/file/d/1txsA6eNFZciIrbqzwS3uOcnnkiEh3Pt4/view?usp=drive_link
https://www.kaggle.com/datasets/anasmohammedtahir/covidqu
https://www.kaggle.com/datasets/ashkhagan/figshare-brain-tumor-dataset

BibTex

@InProceedings{Kol_MedCLIPSAM_MICCAI2024,
        author = { Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
        title = { { MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes MedCLIP-SAM, which fine-tunes a BiomedCLIP model with a DHN-NCE loss and uses gScoreCAM to generate saliency maps. The saliency maps are then used as prompts for SAM to segment the medical images. Optionally, the SAM output can further serve as weak supervision to train a segmentation network.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well organized and easy to follow. The experiments are conducted on three datasets to validate the effectiveness of the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The entire complex BiomedCLIP tuning + gScoreCAM pipeline aims to provide prompts for SAM. However, why don’t the authors directly tune SAM, or directly use MedSAM? Since the authors already use BiomedCLIP, a specialized medical CLIP model, using MedSAM seems like an intuitive solution.
    • Another question: if the goal is to use text prompts with SAM, there are also many works, such as Grounded-SAM, that support text prompts. What is the advantage of the proposed framework?
    • Regarding effectiveness, DHN-NCE surpasses InfoNCE by only a negligible margin (Table 1, e.g., 85.73 to 85.99 in text-to-image retrieval). For segmentation performance, the model works well only on the breast dataset and shows very poor performance on lung X-ray. Strangely, with weakly supervised training the performance becomes worse (-16% IoU on breast ultrasound); why does this happen? These results do not support the effectiveness of the model.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The BiomedCLIP fine-tuning details are lacking: is it full fine-tuning or PEFT? Also, what are the details of how gScoreCAM generates the saliency maps, and how are they used as box prompts for SAM? More details should be given.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Thanks for the authors’ feedback. Although the authors have made the workflow clearer, I still think the method is too complex a way to use text for SAM segmentation: the text is fed into BiomedCLIP (fine-tuned by the authors), and gScoreCAM is used just to generate box prompts for SAM.



Review #2

  • Please describe the contribution of the paper

    The paper proposes the MedCLIP-SAM framework, which automatically generates bounding-box prompts for SAM using a fine-tuned BiomedCLIP model to produce pixel-wise scores. The experimental results show the effectiveness of the model on two of the three datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Fine-tuning a BiomedCLIP model using the proposed decoupled hard negative noise contrastive estimation loss seems effective for improving downstream tasks. The experimental results show that the fine-tuned BiomedCLIP leads to improvements for gScoreCAM. The method should be useful for other applications.

    Combining bounding-box generation with SAM seems to provide a framework that can segment tumor regions well.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The description of the model can be improved. For example, since the model uses a BiomedCLIP model fine-tuned on the MedPix dataset, it becomes unclear whether the model is a zero-shot universal medical image segmentation solution. That the proposed model works well on breast tumors and brain tumors but not as well on lung chest X-rays compared to state-of-the-art models suggests the dataset used for fine-tuning may play an important role.

    It would be very valuable if the authors could conduct detailed error analyses comparing the proposed model with the one proposed in [34] for segmenting lung X-ray images.

    The paper trains a weakly supervised model for segmentation, as shown in Figure 1 and described in Section 2.2. However, it is not clear how the segmentation network is used. Tables 2 and 3 use a weakly supervised model, which is from [34]. Including the results in Tables 2 and 3 and explaining them clearly would improve the readability of the paper. It seems the network is used to generate the results in Figure 2; the examples are helpful, but statistics over the entire datasets would be more informative.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The provided descriptions are helpful, but I think they are not sufficient to reproduce the reported results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Note that the entire framework is not zero-shot for medical image segmentation, even when using SAM, because BiomedCLIP is fine-tuned using images from MedPix. It should be clarified what images were included in MedPix and how they affect model performance.

    The paper should state directly that the proposed model does not work as well on lung X-ray as the methods developed in [34].

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed framework consists of novel components, and the resulting model is effective on the breast ultrasound and brain MRI datasets.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper
    1. Introducing a novel CLIP training/fine-tuning method called Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE).
    2. Proposing a zero-shot medical segmentation approach by combining CLIP and SAM for radiological tasks.
    3. Exploring a weakly-supervised strategy to further refine zero-shot segmentation results.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and easy to follow. The visualizations properly illustrate the concept of the proposed architecture.
    2. The proposed novel hard negative noise contrastive estimation loss is quite general and can be used in different medical imaging applications, even when the batch size is small.
    3. The segmentation performance is promising, even when compared with a fully supervised learning method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The method section needs to be substantially expanded, as some of the method details remain unclear. Specifically, in Section 2.2: 1) after post-processing the gScoreCAM map, it is unclear what method was used to obtain the bounding box; 2) details regarding the resulting pseudo-masks used to train the segmentation network in the weakly supervised setting are lacking.
    2. There is a lack of experiments comparing the model with other models in zero-shot or weakly supervised settings.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Explain more details of each stage and all the settings, especially the weakly supervised segmentation stage.
    2. Please include an explanation for all the components shown in Figure 1. For instance, how was the Dice loss used?
    3. For the zero-shot to weak-supervision stage, what is the data flow?
    4. Compare the model to state-of-the-art zero-shot segmentation models such as ZegFormer [1]. [1] Ding, J., Xue, N., Xia, G.S. and Dai, D., 2022. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11583-11592).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed loss is novel and has the potential to be used in more applications.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all reviewers for their valuable input. Our code will be made public upon acceptance. We address the main concerns and misunderstandings below:

  1. Clarify the rationale of our method & the zero-shot setting (R1, R4): We proposed a novel method for text-based interactive and universal medical image segmentation. Compared with MedSAM, our method doesn’t require the user to precisely locate the target structure with bounding boxes/points. Unlike Grounded-SAM, our CLIP-based target localization module doesn’t need bounding boxes (which are unavailable for medical text-image data) during training for a localization loss. Similarly, ZegFormer needs ground-truth segmentation (mask loss) during training, which isn’t feasible with our data. Thus, we fine-tuned BiomedCLIP with a novel DHN-NCE loss in combination with SAM to obtain zero-shot segmentation.
    [R4] MedSAM’s training data includes our test databases; adopting it for our framework would invalidate the “zero-shot” setting, so we used SAM instead. [R1] The text-image data used in the original training and fine-tuning of BiomedCLIP, including MedPix, does not include the databases used for the segmentation tasks, which have no matching clinical reports. Also, as in our case, CLIP-based pre-training is commonly used in zero-shot segmentation (Lüddecke et al., arxiv.org/abs/2112.10003; Zhang et al., arxiv.org/abs/2212.03588; Cao et al., arxiv.org/abs/2401.12665), and SAM was trained on natural images, so our method is indeed zero-shot.
  2. Clarify method workflow details (R1, R3, R4): Our method’s “zero-shot” workflow has four steps: 1) given the input text prompt and image, a saliency map is produced from the activations of the text and image encoder layers of BiomedCLIP using gScoreCAM; 2) the CRF filter produces an initial discrete mask from the gScoreCAM saliency map; 3) a bounding box is obtained by fitting the biggest rectangle that contains the initial discrete mask, represented by the coordinates of its four corners (more details in the revision); 4) the bounding box is used as the prompt for SAM to obtain the refined segmentation (a condensed code sketch of this workflow follows this list). For “weakly supervised segmentation”: for each dataset (Section 2.3), “zero-shot” masks were curated for the training data as weak ground truths, which were used to train a task-specific ResUNet [34] with the DiceCE loss. [R3] gScoreCAM [6] combines the top CAMs of gradient-ranked channels into saliency maps; we used it without modification. [R4] We fine-tuned all layers of BiomedCLIP, which gives better results than PEFT. The difference between DHN-NCE and InfoNCE (and other SOTA losses) is statistically significant (p<0.05).
  3. Explain the segmentation results (R1, R3, R4): Table 3 compares the segmentation accuracy of the zero-shot, weakly supervised (see Q2), and supervised results. The trends vary across the three tasks. Although the “zero-shot” results are lower than the fully and weakly supervised results for lung X-rays, this doesn’t mean our method fails in this case. It can be explained by the facts that 1) lung X-ray has a much larger training set (16,280 images) than breast ultrasound (600) and brain MRI (400), and 2) the segmentation task is easier (a larger structure). For smaller datasets and harder tasks, the quality of the pseudo-labels becomes important in model training, so weak supervision gave worse results than zero-shot. Note that we train from scratch.
  4. SOTA segmentation method comparison (R1, R3, R4): [R3, R4] In this first study, our method primarily focuses on universal zero-shot medical segmentation. As mentioned in Q1, many zero-shot methods for natural images require refined/coarse object segmentation or multi-class presence in one image for training, which isn’t feasible with public medical text-image datasets, precluding such comparisons. As weakly supervised segmentation is task-specific, we will perform a SOTA comparison in future studies. [R1] As stated in its caption, Table 2 compares “zero-shot segmentation” based on the pre-trained and fine-tuned BiomedCLIP models using gScoreCAM vs. GradCAM.
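
To make the four-step “zero-shot” workflow in point 2 concrete, below is a condensed sketch of steps 2-4, assuming a gScoreCAM saliency map has already been computed from BiomedCLIP. A plain threshold stands in for the CRF post-processing, and only the SamPredictor calls follow the public segment-anything API; everything else is illustrative.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def zero_shot_mask(image, saliency, predictor, thresh=0.5):
    # image: (H, W, 3) uint8 RGB array; saliency: (H, W) gScoreCAM map in [0, 1].
    initial_mask = saliency > thresh                           # step 2: threshold stands in for the CRF
    ys, xs = np.nonzero(initial_mask)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])   # step 3: bounding box, XYXY corner format
    predictor.set_image(image)
    masks, _, _ = predictor.predict(box=box, multimask_output=False)  # step 4: box-prompted SAM
    return masks[0]

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

The weak-supervision stage described in point 2 can likewise be sketched: the curated “zero-shot” masks serve as pseudo ground truths for training a segmentation network with a combined Dice + cross-entropy loss. MONAI’s DiceCELoss and residual UNet are used here purely for illustration, and train_loader is an assumed data loader; the paper trains the ResUNet of [34].

import torch
from monai.losses import DiceCELoss
from monai.networks.nets import UNet

model = UNet(spatial_dims=2, in_channels=1, out_channels=1,
             channels=(16, 32, 64, 128), strides=(2, 2, 2),
             num_res_units=2)                      # residual U-Net as a stand-in for ResUNet [34]
loss_fn = DiceCELoss(sigmoid=True)                 # the DiceCE loss mentioned in point 2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for images, pseudo_masks in train_loader:          # assumed loader of images and curated zero-shot masks
    optimizer.zero_grad()
    loss = loss_fn(model(images), pseudo_masks)    # pseudo-masks act as weak ground truths
    loss.backward()
    optimizer.step()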




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper is strong in terms of novelty.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes an approach to medical image segmentation by leveraging the BiomedCLIP model and the Segment Anything Model (SAM). Reviewers raised several concerns: the comparison with other state-of-the-art models in zero-shot or weakly supervised settings is insufficient; the effectiveness of the proposed DHN-NCE loss function is questionable, as it only marginally improves upon the InfoNCE loss; and the methodology lacks clarity and requires further elaboration.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes an interesting method for medical image segmentation by combining CLIP and SAM models. The reviewers are mostly positive, with one reviewer improving their score post-rebuttal. The method is novel, and experiments with several datasets are presented. The topic is of interest and can lead to fruitful discussions.



