Abstract

Semi-supervised learning, a paradigm involving training models with limited labeled data alongside abundant unlabeled images, has significantly advanced medical image segmentation. However, the absence of label supervision introduces noise during training, posing a challenge in achieving a well-clustered feature space essential for acquiring discriminative representations in segmentation tasks. In this context, the emergence of vision-language (VL) models in natural image processing has showcased promising capabilities in aiding object localization through the utilization of text prompts, demonstrating potential as an effective solution for addressing annotation scarcity. Building upon this insight, we present Textmatch, a novel framework that leverages text prompts to enhance segmentation performance in semi-supervised medical image segmentation. Specifically, our approach introduces a Bilateral Prompt Decoder (BPD) to address modal discrepancies between visual and linguistic features, facilitating the extraction of complementary information from multi-modal data. Then, we propose the Multi-views Consistency Regularization (MCR) strategy to ensure consistency among multiple views derived from perturbations in both image and text domains, reducing the impact of noise and generating more reliable pseudo-labels. Furthermore, we leverage these pseudo-labels and conduct Pseudo-Label Guided Contrastive Learning (PGCL) in the feature space to encourage intra-class aggregation and inter-class separation between features and prototypes, thus enhancing the generation of more discriminative representations for segmentation. Extensive experiments on two publicly available datasets demonstrate that our framework outperforms previous methods employing image-only and multi-modal approaches, establishing a new state-of-the-art performance.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0960_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_Textmatch_MICCAI2024,
        author = { Li, Aibing and Zeng, Xinyi and Zeng, Pinxian and Ding, Sixian and Wang, Peng and Wang, Chengdi and Wang, Yan},
        title = { { Textmatch: Using Text Prompts to Improve Semi-supervised Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper applies multiple strategies to address semi-supervised learning in medical image segmentation. The main contributions are: 1) a Bilateral Prompt Decoder (BPD) alignment module that fuses visual and text features for joint modality learning, 2) augmentation of unlabeled data in both the image and text domains to achieve more robust pseudo-labeling, and 3) pseudo-label guided contrastive learning between foreground and background feature spaces. Experiments on two public lung-infection datasets are used to validate the effectiveness of the learning strategy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation is well addressed, and the paper is well written with clear delineation of the methodologies. Qualitative and quantitative results are consistent with the conclusions.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A few areas remain vague to the reviewer: 1) When do the proposed strategies plateau as the ratio of labeled data increases? 2) How sensitive is the method to the relative loss weights? Compared to the supervised loss, the weights of the semi-supervised losses are relatively small and vary across datasets. 3) Experiments are limited to two small datasets related to lung infection only; the generality of the algorithm remains unknown.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The methodology and insights are clear to understand. There might be minor difficulties to replicate the work exactly due to lack of details and code in particular.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The term “Multi-View” is easily confused with images captured from different perspectives. In fact, these “views” are merely augmentations of the original image-text pairs.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology for semi-supervised learning on fine-grained tasks such as image segmentation is well motivated. Overall, this is a well-written paper with sufficient innovation, although the experiments are limited in scale. The decision balances the strengths and weaknesses of the paper described above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper brings the text-prompt technique into the semi-supervised medical image segmentation task. The authors propose a framework, called Textmatch, that leverages the promising capabilities of text prompts in aiding object localization. The main idea is easy to understand and reasonable.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is well-written and makes it easy to understand the main techniques.
    2. The idea of leveraging the promising capabilities of text prompts in semi-supervised medical image segmentation tasks is interesting.
    3. The experiment results show the superiority of the proposed methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The main components, i.e., the Bilateral Prompt Decoder, Multi-views Consistency Regularization, and Pseudo-label Guided Contrastive Learning, are derived from existing works with few improvements.
    2. The overall framework may be hard to follow, as it relies on a generative model for text generation without considering cases where the generated text is of poor quality.
    3. The submission does not mention open access to source code.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I believe poor cases of text generation should be considered for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is well-written, and the idea of leveraging the promising capabilities of text prompts in semi-supervised medical image segmentation tasks is interesting. Although the proposed methods are adapted from existing work, I believe this is an interesting work for medical tasks.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors have proposed a semi-supervised segmentation framework to leverage text prompts to improve segmentation accuracy. The proposed method is novel, and the way the authors use visual and linguistic features is very interesting (Bilateral prompt decoder and multi-view consistency regularization). They have used two publicly available datasets for validation and compared the proposed method’s accuracy with some state-of-the-art techniques in the literature. The results show a significant improvement in segmentation accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Technical novelty is strong.
    • The authors used two publicly available datasets for the validation.
    • Comparison with multiple state-of-the-art methods.
    • Substantial improvement in segmentation quality.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The fact that the authors have used different hyper-parameters for different datasets indicates that some sort of hyper-parameter optimization was involved. However, the authors do not describe this process.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The hyper-parameter optimization should be explained.
    • I am not sure I agree with the authors’ conclusion regarding figure 3.
    • The authors should add appropriate statistical analysis for all comparisons.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The proposed method is novel and interesting
    • The improvement as compared to some of the state-of-the-art methods is promising.
    • Extensive ablation studies that show the impact of different parts of the proposed algorithm.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for their constructive comments, which have been carefully addressed as follows:

Q1: Hyper-parameter optimization (R1, R3) A1: In our experiments, the hyper-parameters λ1 and λ3, which control the consistency loss and the contrastive learning loss respectively, are set to small values (0.1) following existing semi-supervised works, so that their gradient scales align with the supervised segmentation loss. For λ2, which controls the pseudo-label supervision loss, we varied the value from 0 to 1.0 in steps of 0.05. We found that a smaller value of 0.1 achieved the best performance on the first dataset, while a moderate value of 0.5 yielded better results on the second dataset due to its higher segmentation difficulty, which required a larger weight to learn more from the unlabeled data.
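As a rough illustration of this weighting scheme (a minimal sketch, not the authors' code; the loss terms and names are placeholders assumed to be computed elsewhere):

```python
import torch

def total_loss(l_sup: torch.Tensor, l_cons: torch.Tensor,
               l_pseudo: torch.Tensor, l_contrast: torch.Tensor,
               lam1: float = 0.1, lam2: float = 0.1, lam3: float = 0.1) -> torch.Tensor:
    """Weighted sum of the four training objectives described above.

    l_sup      -- supervised segmentation loss on labeled data
    l_cons     -- multi-view consistency loss (weight lambda_1)
    l_pseudo   -- pseudo-label supervision loss (weight lambda_2, 0.1 or 0.5 per dataset)
    l_contrast -- pseudo-label guided contrastive loss (weight lambda_3)
    """
    return l_sup + lam1 * l_cons + lam2 * l_pseudo + lam3 * l_contrast
```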

Q2: Statistical analysis for all comparisons (R1) A2: We conducted paired t-tests to verify the significance of our improvements. Results on two datasets indicate that p-values on both Dice and MIoU are less than 0.05, demonstrating that the improvements achieved by our model are statistically significant.
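For reference, a paired t-test over per-case scores can be run as below (a sketch with hypothetical numbers, not the authors' actual results):

```python
from scipy import stats

# Hypothetical per-case Dice scores for our model and a baseline.
dice_ours = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]
dice_base = [0.87, 0.85, 0.90, 0.86, 0.84, 0.89]

t_stat, p_value = stats.ttest_rel(dice_ours, dice_base)  # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 => significant improvement
```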

Q3: Conclusion regarding Figure 3 (R1) A3: Sorry for any confusion. Figure 3 shows a t-SNE decomposition of the representation space with and without PGCL. Red points represent foreground classes, and blue points represent background classes. With PGCL, training shows better intra-class compactness and inter-class separability, forming two clusters despite some boundary confusion. Without PGCL, red and blue points remain scattered with no clear boundaries. We have updated the manuscript with a more detailed explanation to support our conclusion.
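A figure of this kind can be reproduced along the following lines (a sketch with random placeholder features, not the authors' data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.random.randn(500, 64)         # placeholder (N, D) feature vectors
labels = np.random.randint(0, 2, 500)    # placeholder labels: 1=foreground, 0=background

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
plt.scatter(emb[labels == 1, 0], emb[labels == 1, 1], c="red", s=5, label="foreground")
plt.scatter(emb[labels == 0, 0], emb[labels == 0, 1], c="blue", s=5, label="background")
plt.legend()
plt.show()
```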

Q4: Ratio of the labeled data (R3) A4: In our experiments, we examined the ratio of labeled data from 0.05 to 0.95 in increments of 0.1 under a semi-supervised setting, as well as a fully-supervised setting at 1.0. We noted significant improvements in our evaluation metrics from 0.05 to 0.75, beyond which the performance plateaued up to 1.0.

Q5: Generality of the algorithms (R3) A5: While we conducted experiments on two datasets related to lung infections, these datasets encompass both X-ray and CT modalities, which partially demonstrates the generality of our proposed method. In the future, we plan to extend experiments to more datasets to further validate the effectiveness of Textmatch.

Q6: Confusion of “Multi-View” (R3) A6: Sorry for any confusion. We referenced the naming conventions used in previous works for different augmented forms of images. We believe it is reasonable to transfer this concept to the context of generating different augmented forms of image-text pairs. In our revised manuscript, we have added further explanations regarding the term “multi-view” to clarify its specific meaning and avoid misunderstandings.

Q7: Poor case of text generation (R4) A7: We utilize a large language model (LLM), specifically GPT, for text generation. The specific task involves generating text that is structurally different from, but semantically similar to, a given original text. We achieve this by providing strict and precise prompts, and by setting contextually appropriate guidelines and filters to maintain high-quality output.
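A prompt of the kind described might look like the following sketch (the wording and the example sentence are ours, not the authors'; the LLM call itself is omitted):

```python
def build_paraphrase_prompt(original_text: str) -> str:
    """Instruct an LLM to rewrite a description: different structure, same meaning."""
    return (
        "Rewrite the following medical image description so that its sentence "
        "structure differs from the original while its meaning is preserved. "
        "Do not add or remove any clinical findings.\n\n"
        f"Original: {original_text}\nRewritten:"
    )

# Hypothetical usage:
prompt = build_paraphrase_prompt("Bilateral ground-glass opacities in the lower lobes.")
```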

Q8: Novelty of main components (R4) A8: While some components of our method are inspired by existing works, our key contribution lies in the innovative integration of text prompts into these components and the corresponding improvements. We designed a Bilateral Prompt Decoder to harmonize visual and linguistic features, enabling comprehensive multi-modal representations. Our Multi-views Consistency Regularization uses both image and text perturbations to reduce noise and produce high-quality pseudo labels. Additionally, our Pseudo-label Guided Contrastive Learning strategy refines the feature space, enhancing class-discriminative feature learning.
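To make the PGCL idea concrete, a prototype-based contrastive loss could be sketched as follows (our own minimal illustration assuming binary foreground/background pseudo-labels with both classes present in the batch, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feats: torch.Tensor,
                               pseudo_labels: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Pull features toward their pseudo-class prototype, push away from the other.

    feats         -- (N, D) feature vectors
    pseudo_labels -- (N,) tensor with 0 (background) or 1 (foreground)
    """
    feats = F.normalize(feats, dim=1)
    # Prototypes: mean feature of each pseudo-class, re-normalized to unit length.
    protos = torch.stack([feats[pseudo_labels == c].mean(dim=0) for c in (0, 1)])
    protos = F.normalize(protos, dim=1)
    logits = feats @ protos.t() / temperature  # (N, 2) feature-prototype similarities
    return F.cross_entropy(logits, pseudo_labels.long())
```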

Q9: Open access to source code (R1, R3, R4) A9: Source code will be released upon formal acceptance.




Meta-Review

Meta-review not available; early accepted paper.


