Abstract

Weakly-supervised medical image segmentation is a challenging task that aims to reduce annotation cost while maintaining segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and simultaneously studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks, colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance. Source code is available at: https://github.com/xyx1024/SimTxtSeg.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0802_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/xyx1024/SimTxtSeg

Link to the Dataset(s)

https://www.synapse.org/Synapse:syn26376615/wiki/613312

https://github.com/DengPingFan/PraNet

https://www.kaggle.com/datasets/mateuszbuda/lgg-mri-segmentation

BibTex

@InProceedings{Xie_SimTxtSeg_MICCAI2024,
        author = { Xie, Yuxin and Zhou, Tao and Zhou, Yi and Chen, Geng},
        title = { { SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors studied weakly supervised medical image segmentation using text prompts. Specifically, they trained a converter that takes text prompts and an input image and generates bounding-box visual prompts for SAM. The SAM-generated pseudo masks are used as segmentation labels. The segmentation decoder also integrates text embeddings from the converter.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed methods achieved results stronger than or similar to those of fully supervised methods, using only bounding box information.

    The ablation studies on text prompt categories and SAM versions are complete.

    The writing is smooth and relatively easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is not clear how the text prompts in the dataset were generated: by human experts, provided with the data, etc.

    The performance gap between the models with and without TVHA is large. It would be clearer if the authors reported the model size of the different versions. This would help to understand whether the advantage comes from the proposed architecture, since UNet is often the most robust network architecture.

    Since the quality of the SAM-generated pseudo labels is good, it would be interesting to know why the visual prompts (bounding boxes) are not integrated into the decoder.

    Also, the datasets used are relatively small, and it is not clear why these two datasets were chosen instead of others.

    There is also a lack of discussion on other works using text prompts for segmentation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    As the authors proposed new architectures and a new training format, reproduction would be very challenging if the authors do not release data and code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It would make the paper stronger if the authors explained more about the text data generation and the network comparison.

    The description of TVHA is relatively long. It can easily be shortened, and releasing code can help explain most of the details.

    As the pseudo labels are of relatively high quality, it would be interesting to analyse the quality of the bounding boxes. This can be helpful for understanding the SAM ablation and also for planning future research: if the bounding boxes are nearly perfect, the pseudo label quality is limited by SAM; if the bounding boxes are of low quality, a better converter is needed.
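
    The bounding-box analysis suggested above could start from a plain intersection-over-union between each predicted box and its reference box. The following is a minimal sketch only, not taken from the paper or its repository; the function names and the (x_min, y_min, x_max, y_max) box format are assumptions made for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_box_iou(predicted, reference):
    """Average IoU over paired predicted and reference boxes (one pair per image)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need the same non-zero number of boxes on both sides")
    return sum(box_iou(p, r) for p, r in zip(predicted, reference)) / len(predicted)
```

    A high mean box IoU together with mediocre pseudo masks would point to SAM as the limiting factor, whereas a low mean box IoU would point to the converter.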

    A more thorough literature review is recommended.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is overall decent with good results; however, the reproducibility is very limited. The lack of clarity about the data generation is one of the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors' feedback addresses many small details, from data generation to architecture comparisons.

    One potential limitation remains, namely the small dataset size, but this is potentially beyond the scope of this paper.



Review #2

  • Please describe the contribution of the paper

    The paper puts forth a method for segmentation of medical images using weak supervision from simple text cues. It demonstrates the adaptability of Grounded SAM for medical imaging. It also proposes a novel ‘Text-Vision Hybrid Decoder’ that involves self-attention of the image embeddings followed by cross-attention of text with image and image with text, which enables text-aware segmentation.
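
    To make the attention ordering described above concrete, here is a minimal sketch, assuming plain scaled dot-product attention with residual connections and no learned projections. It is not the authors' TVHA implementation: the function names are hypothetical, and the channel attention mentioned elsewhere in the reviews is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def add(a, b):
    """Element-wise sum, used here as a residual (skip) connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hybrid_attention(image_tokens, text_tokens):
    """Self-attention on image tokens, then cross-attention in both directions."""
    # Step 1: image tokens attend to each other.
    img = add(image_tokens, attention(image_tokens, image_tokens, image_tokens))
    # Step 2: image queries attend to text keys/values.
    img_out = add(img, attention(img, text_tokens, text_tokens))
    # Step 3: text queries attend to image keys/values.
    txt_out = add(text_tokens, attention(text_tokens, img, img))
    return img_out, txt_out


# Toy usage: 3 image tokens and 2 text tokens, each of dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]
img_out, txt_out = hybrid_attention(image_tokens, text_tokens)
```

    Both outputs keep the shape of their inputs, so a block like this can be stacked or dropped into a decoder stage without changing the surrounding tensor sizes.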

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Validates the benefits of Grounding Dino + SAM for medical images over other forms of weak supervision.
    2. Combining text and image embeddings isn’t always straightforward, but the proposed TVHA module (seemingly influenced by the feature enhancer in Dinos) offers a logical approach to tackle this challenge.
    3. The performance comparisons were made with other weakly supervised and fully supervised techniques, providing a comprehensive view of the landscape.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed TVCC (Grounding Dino) + SAM is not entirely novel. Refer to https://arxiv.org/abs/2401.14159.
    2. The improvements resulting from the TVHA module seem to be minimal when compared to TVCC + SAM. Moreover, it’s hard to assess the significance as no measure of spread is provided.
    3. Not enough information about the diversity of the prompts used for fine-tuning the Grounding Dino.
    4. Ambiguous information about the experiments (Table 1, row 9)
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Great work fine-tuning Grounding Dino with medical domain data. Please add more information about the diversity of prompts used. For example, how many different types of sentences were used to describe a polyp? How were these phrases created? Also consider adding details about the data used in the fine-tuning process and whether it is the same set as the one on which the segmentation network is later trained.
    2. The benefits from text supervision in TVHA would be more convincing when compared with an attention based decoder as opposed to a regular UNet.
    3. It is vital to know if the performance gains are statistically significant. Consider showing at least the standard deviations for mDice and mIOU.
    4. SimTxtSeg-w/o-TVHA is very confusing. If it’s just a UNet, please edit row 9 of Table 1 to show that it does not use any text supervision.
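
    As an illustration of the request in point 3, per-image Dice and IoU scores can be reported together with their spread. This is a minimal sketch under stated assumptions, not the paper's evaluation code: masks are flat 0/1 sequences, two empty masks are treated as a perfect match, and the sample standard deviation is used (so at least two images are needed).

```python
from statistics import mean, stdev


def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    if pred_area + target_area == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_area + target_area)
    iou = inter / (pred_area + target_area - inter)
    return dice, iou


def summarize(pairs):
    """Mean and sample standard deviation of per-image Dice and IoU."""
    scores = [dice_and_iou(p, t) for p, t in pairs]
    dices = [d for d, _ in scores]
    ious = [i for _, i in scores]
    return {
        "mDice": mean(dices),
        "Dice_std": stdev(dices),
        "mIoU": mean(ious),
        "IoU_std": stdev(ious),
    }


# Toy usage with two 4-pixel masks (prediction, ground truth).
toy_pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 0, 0], [1, 1, 0, 0]),
]
report = summarize(toy_pairs)
```

    Reporting the standard deviation next to each mean makes it possible to judge whether a gap between two rows of a results table exceeds the per-image variation.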
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the formulation isn’t novel, the validation of the Grounding Dino + SAM module for medical imaging is an important result. More information is needed to assess the significance of the results from the TVHA module. Also, some minor corrections are needed in the results table (Table 1).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed the concerns raised.



Review #3

  • Please describe the contribution of the paper

    This paper studies weakly-supervised medical image segmentation by harnessing both the vision foundation model SAM and text prompts. Its key idea is to use the text representation to aid the generation of the pseudo label. The generated pseudo label in turn supervises the learning from SAM. Experiments on two datasets show its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper is well-written and easy-to-follow.

    • The overall idea is moderately novel, and deserves further study.

    • The joint use of vision foundation model and text cue for weakly-supervised medical image segmentation is rarely studied, and should be encouraged in the community.

    • The proposed method shows the state-of-the-art performance when compared with prior weakly-supervised methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The experimental baselines are not very reasonable and need to be clarified. For example: (1) The fully-supervised baselines listed in Table 1 are not reasonable and are made artificially low. As the proposed method harnesses a VFM that is ViT-B based, the fully-supervised baseline should report the results of a ViT-B based encoder. Currently, the reported UNet results are artificially low, as the UNet representation is inherently inferior to ViT-B. (2) For the four compared weakly-supervised segmentation methods, what are their backbones? Are they of the same type as that of the proposed method, for a fair evaluation?

    • The two proposed components, namely the textual-to-visual cue converter and text-vision hybrid attention, are somewhat complicated. They are a pile-up of multiple cross-attention and channel attention blocks and skip connections. How each component and each step impacts the overall performance needs more in-depth study.

    • Besides, inside both components, the key ideas are cross-attention, channel attention and skip connections. All of them are common in computer vision and medical imaging. The module design is somewhat incremental and ordinary.

    Minor comments:

    • The mathematical definitions and notations are not clear. Please separate tensors, scalars, operations and losses by different font sizes.

    • Fig.1 and Fig. 2 can be better polished with higher resolution.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the specific weakness points to clarify during the rebuttal stage.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Image-text based weakly-supervised medical image segmentation deserves further study, and the paper is easy to follow.

    However, the module design is somewhat incremental. The method comparison and baselines need to be better clarified. In addition, more in-depth analysis of the incremental module design may be needed. Finally, some presentation issues remain.

    Therefore, the reviewer feels that the weaknesses weigh about as much as the strengths, and awaits the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my concerns. I lean to Reviewer#1 to accept this paper.




Author Feedback

Thanks for the support and insightful comments of all reviewers. To R1: Q1: Text prompts and fine-tuning data A1:

  1. We adopt class names and short sentences as text prompts for their simplicity and effectiveness. We tried longer texts, but they did not bring any further improvement. To avoid handcrafted prompting costs, we use GPT-4 to generate a concise sentence within 20 words.
  2. We used the same data in training TVCC and segmentation model with TVHA.

Q2: TVHA’s benefits A2: For comparison, during our initial study we tried a U-shaped decoder with self-attention only, which performed similarly to a regular UNet (+<1%). Moreover, we emphasize that our aim is to train a lightweight, weakly-supervised, text-driven segmentation model, rather than to directly deploy the large model (TVCC+SAM), whose role is to generate high-quality pseudo masks. It is unnecessary to prove that the weakly-supervised model with TVHA achieves marginally better results than TVCC+SAM.

Q3: Standard deviations A3: Standard deviations for mDice and mIOU will be added in the revised paper.

Q4: Ambiguous row 9 of Tab.1 A4: The U-shaped decoder doesn’t need text supervision. We’ll edit row 9 in the revision.

To R2: Q1: Generation of text prompts A1: To avoid prompt design costs and stay true to our aim of simple yet effective text supervision, we use GPT-4 to generate a concise sentence within 20 words. We tried longer texts, but they did not bring any further improvement.

Q2: Model size of different versions A2: The model with TVHA has 43M parameters, while the model without TVHA has 34.91M. We enlarged the baseline with a UNet decoder to 40M parameters (i.e., made it much deeper), but this did not bring as large an increase as TVHA, showing that our design is effective.

Q3: Code releasing A3: Due to the anonymous policy, we are unable to include the code link here, but will add it in revision. We’ll shorten the TVHA description and add more literature review in our paper.

Q4: Visual prompts integration A4: Our target segmentation network is a light CNN-based framework with ConvNeXt as its backbone, instead of a SAM-based foundation model. We hope to both maintain its effectiveness and simplicity. So, the bounding box info. can only be used as supervision rather than visual prompt inputs for our target decoder.

Q5: Datasets chosen A5: We chose these two datasets because endoscopy and MRI are two very common and representative medical imaging modalities, and the findings can be easily described in simple texts. We also ran experiments on skin image data and obtained improvements.

Q6: Other works with text prompts A6: Ariadne’s Thread, mentioned in Sec. 3.2 and Tab. 1 of our paper, is a text-guided, fully supervised segmentation network. Our model’s performance exceeds it.

Q7: The quality of the bounding boxes A7: The TVCC generates good enough pseudo boxes, achieving around 0.80 mAP for both polyp and Brain MRI data. We’ll add corresponding analysis in the revised paper.

To R3: Q1: Clarification about baseline A1:

  1. First, our target segmentation backbone is ConvNeXt-Tiny (also CNN-based). Moreover, the fully-supervised methods are included only to show that our weakly-supervised approach is competitive even with fully-supervised ones (seen as an upper bound). Our primary comparison is with weakly-supervised methods.
  2. As for the weakly-supervised baselines, we keep the backbones from their original papers (Res2Net50 for BoxPolyp and WeakPolyp, VGG16 for Boxshrink, while S2ME has its own framework), because those configurations were tuned for their best results. We tried using the same backbone, but the results of some baselines declined, e.g. Boxshrink, -2.54% in mDice on the polyp dataset. Moreover, the model sizes of Res2Net50 and ConvNeXt-Tiny (ours) are 25.70M and 28.13M, respectively, which are on almost the same scale.

Q2: Study about component design A2: We did in-depth studies during the initial design. Compared with the UNet decoder, adding dual-way cross-modal attention increases mDice by 3.42% on average, and adding channel attention increases mDice by 3.89% on average. The details will be added to the supplementary file due to the rebuttal policy.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces a method for segmenting medical images using weak supervision from simple text cues, demonstrating the adaptability of grounded SAM for medical imaging. It also presents a novel ‘Text-Vision Hybrid Decoder,’ which employs self-attention on image embeddings followed by cross-attention between text and image, enabling text-aware segmentation.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces a method for weakly-supervised medical image segmentation using simple text cues, demonstrating the adaptability of Grounding Dino and SAM for medical imaging. It proposes a novel Text-Vision Hybrid Decoder (TVHA), integrating self-attention on image embeddings and cross-attention between text and image embeddings, enabling text-aware segmentation. Strengths include validating the benefits of combining text and image embeddings, achieving competitive results with minimal supervision, and comprehensive performance comparisons. Weaknesses include the limited novelty of the proposed modules, insufficient details on text prompt generation, and the lack of information for reproducibility.

    After the rebuttal, the reviewers reached a consensus to accept the paper. However, some aspects, such as experiment details and the significance of improvements, need further clarification in the next version.



