Abstract

Semi-supervised learning has emerged as a critical approach for addressing medical image segmentation with limited annotation, and pseudo labeling-based methods have made significant progress on this task. However, the varying quality of pseudo labels poses a challenge to model generalization. In this paper, we propose a Voxel-wise CLIP-enhanced model for semi-supervised medical image Segmentation (VCLIPSeg). Our model incorporates three modules: a Voxel-Wise Prompts Module (VWPM), a Vision-Text Consistency Module (VTCM), and a Dynamic Labeling Branch (DLB). The VWPM integrates CLIP embeddings in a voxel-wise manner, learning the semantic relationships among voxels. The VTCM constrains the image prototype features, reducing the impact of noisy data. The DLB adaptively generates pseudo-labels, effectively leveraging the unlabeled data. Experimental results on the Left Atrial (LA) and Pancreas-CT datasets demonstrate the superiority of our method over state-of-the-art approaches in terms of the Dice score. For instance, it achieves a Dice score of 88.51% using only 5% labeled data from the LA dataset.
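
To make the voxel-wise CLIP idea concrete, below is a minimal PyTorch sketch of per-voxel vision-text fusion in the spirit of the VWPM. The module name, projection MLP, and tensor shapes are illustrative assumptions rather than the authors' implementation; in the paper, the text embeddings come from a frozen CLIP text encoder fed with class prompts.

    import torch
    import torch.nn as nn

    class VoxelWisePromptSketch(nn.Module):
        """Sketch: project per-voxel visual features into the text-embedding
        space and score each voxel against frozen CLIP class-prompt embeddings."""

        def __init__(self, vis_dim: int, txt_dim: int):
            super().__init__()
            # Hypothetical projection MLP; the paper's VWPM may differ.
            self.proj = nn.Sequential(
                nn.Linear(vis_dim, txt_dim), nn.ReLU(), nn.Linear(txt_dim, txt_dim)
            )

        def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            # feats:    (B, C, D, H, W) visual features from the 3D encoder
            # text_emb: (K, T) frozen CLIP embeddings of K class prompts
            b, c, d, h, w = feats.shape
            v = feats.permute(0, 2, 3, 4, 1).reshape(b, -1, c)  # (B, N, C), N voxels
            v = self.proj(v)                                    # (B, N, T)
            v = v / v.norm(dim=-1, keepdim=True)                # cosine similarity
            t = text_emb / text_emb.norm(dim=-1, keepdim=True)
            sim = v @ t.t()                                     # (B, N, K)
            return sim.reshape(b, d, h, w, -1).permute(0, 4, 1, 2, 3)  # (B, K, D, H, W)

For example, with feats of shape (1, 16, 8, 8, 8) and two prompt embeddings of dimension 512, the output is a (1, 2, 8, 8, 8) per-voxel similarity map that can be fused with the visual features downstream.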

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1949_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

https://www.cardiacatlas.org/atriaseg2018-challenge/
https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT

BibTex

@InProceedings{Li_VCLIPSeg_MICCAI2024,
        author = { Li, Lei and Lian, Sheng and Luo, Zhiming and Wang, Beizhan and Li, Shaozi},
        title = { { VCLIPSeg: Voxel-wise CLIP-Enhanced model for Semi-Supervised Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a CLIP-enhanced segmentation model for 3D CT image data. The core design of their approach, VCLIPSeg, draws inspiration from the original CLIP architecture. They propose the incorporation of voxel-wise prompt modules specifically tailored for processing 3D data. Essentially, three new modules as well as their interconnection are proposed: the first module creates joint vision-text prompts (VWPM), the second serves as a regularization to avoid overfitting to noise (VTCM), and the third creates pseudo-labels on the fly based on a two-decoder setup. The quantitative results are extensively compared to those of other state-of-the-art methods and are superior. Ablation studies are performed that indicate the proposed improvements are beneficial.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written, has nice illustrations, and nicely demonstrates that the results obtained with the semi-supervised model are almost on par with a fully-supervised one, which is quite useful given the usually very limited amount of manually annotated training data in the medical domain.
    • Using voxel-wise prompt modules by converting the 3D voxel data to 2D is very interesting, as is exploring the CLIP training scheme with medical image data.
    • Implementing the Dynamic Labeling Branch (DLB) for the unlabeled data is an innovative approach.
    • The authors also performed a thorough quantitative analysis against SOTA methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper is generally very well-written and the presented methods are convincing, there are a few weaknesses that would need clarification:

    • It is not fully clear to me at which level the prompting is performed. The text prompt itself reads as if it were applied on a per-pixel basis, but I doubt that this is the case. Is it based on 3D patches instead? I think the statement that “this module aims to explore the semantic relationships between different pixels” (page 3) should be elaborated a bit more.

    • I think the ablation study showing that VWPM is superior does not really demonstrate that the CLIP features provide a valuable addition. Couldn’t it be that just the additional learnable parameters used to transform f by a few MLPs in the VWPM are the reason for the slightly improved performance? As you essentially use only two different text embeddings, “A pixel of background.” and “A pixel of left atrial.”, I’m not fully convinced that this additional information is indeed responsible for the performance gain. An additional ablation study without the text encoder would be interesting to confirm this.

    • Although the text encoder is completely frozen, using only a simple prompt might not be optimal. Ablation studies on different text prompts, or even prompts tailored to medical data, are missing; for instance, the text prompt used in CXR-CLIP (https://arxiv.org/abs/2310.13292) could potentially further boost the performance.

    • In the Voxel-Wise Prompts Module (VWPM), Equation 1, v_s, the vision semantic map, is not very clear and is also not consistent with Equation 2. In Equation 2, it is likewise unclear whether the authors use voxel-wise multiplication or some other operation.

    • The implementation of the Dynamic Labeling Branch (DLB) is not very clear, especially Equations 7 and 8. The DLB is mainly designed for generating the pseudo-labels for unlabeled data; however, in Equations 7 and 8, w_1 and w_2 contain the GT. Please comment on why they contain the GT despite the argument that the branch uses unlabeled data (one plausible reading is sketched after this list). Furthermore, the coefficients \lambda_u and \lambda_r in Equation 11 are not explained there and are only mentioned in the experiments part.

    • The use of two decoders makes sense once the Dynamic Labeling Branch is introduced. However, some motivation for the two decoders at the beginning would be good, along with comments on how they are initialized / trained differently to ensure they produce different / complementary results, such that the pseudo-labels are mutually beneficial. In Section 2.4 you also mention the two decoders; however, up to this point, the reader does not even know of them (unless he/she noticed them in Figure 1).

    • If the pseudo labels are actually created by one of the decoder networks, does this imply that this decoder network is then trained using its own prediction? (i.e., it would effectively not be trained in that iteration, as it would have a zero loss, right?)
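
    One plausible reading of the DLB, sketched below under stated assumptions, is referenced above: the GT enters only through scalar decoder weights computed on the labeled batch, while the pseudo-labels themselves are produced for the unlabeled batch. The function name and the softmax weighting are illustrative, not the paper's exact Equations 7 and 8.

        import torch
        import torch.nn.functional as F

        def dynamic_pseudo_labels(logits_u1, logits_u2, logits_l1, logits_l2, gt):
            # logits_u*: (B, K, D, H, W) decoder outputs on the unlabeled batch
            # logits_l*: decoder outputs on the labeled batch; gt: (B, D, H, W) labels
            loss_1 = F.cross_entropy(logits_l1, gt)
            loss_2 = F.cross_entropy(logits_l2, gt)
            # GT appears only in these scalar weights, which may be why w_1, w_2
            # in Eqs. 7-8 contain GT although pseudo-labels target unlabeled data.
            w = torch.softmax(-torch.stack([loss_1, loss_2]), dim=0)
            probs = w[0] * logits_u1.softmax(dim=1) + w[1] * logits_u2.softmax(dim=1)
            return probs.argmax(dim=1)  # (B, D, H, W) pseudo-labels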

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I would highly suggest also making the code available upon acceptance. Otherwise, it will be hard for follow-up studies to compare to your method and to perform further ablations, etc.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See comments on the weaknesses of the paper and try to clarify the raised concerns in the rebuttal and/or a slightly revised version of the manuscript.

    Other comments:

    • Is the V-Net encoder pretrained in some way? If yes, which dataset was used for the pretraining?
    • In Table 3, it is not clear which metric is used (probably Dice?). The beneficial effects of your method seem to vanish as the dataset size increases (e.g., at 20% it is only 0.16 better than the baseline without any additional modules, and even worse than the baseline with only VWPM). Any explanation/hypothesis for this behavior?

    Minor typo:

    • Page 2: y^l should also have a subscript _i as each image obviously has its own corresponding ground truth.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written, works on a very important topic, and provides a semi-supervised method that is almost on par with a supervised method using only 20% of the training data. I think the proposed methods are reasonable for the most part and nicely validated. However, as noted in the section on weaknesses, there are several concerns and unclear points that should be addressed in the rebuttal letter and definitely need to be resolved before publication.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a new semi-supervised segmentation method using CLIP (text encoder) at the voxel scale. The proposed method is applied to the Left Atrial (LA) and Pancreas-CT datasets and demonstrates state-of-the-art results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is overall well written and the authors’ motivations are well explained. The choice of compared state-of-the-art methods is adequate, and the performance gains seem significant.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper could significantly improve clarity by providing a more detailed description of the method and of the ablation study. Furthermore, a final discussion on the crucial parameters of the method and on the bottlenecks to increasing accuracy would be highly valuable.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • How many decoders are used? Are there only two, or could more potentially be utilized?
    • Have the results been subjected to statistical analysis, such as a t-test, to determine whether they are significantly better than state-of-the-art methods? (A minimal sketch of such a test follows this list.)
    • Before the ablation study, it is unclear that the baseline, which forms the core of the method, is actually MC-Net [20] with additional modules added (VWPM, VTCM, and DLB). This should be explicitly stated in the implementation details. Additionally, the ablation study could be enhanced by including the DLB ablation results in Table 3.
    • A significant performance gap is observed in the ablation study when applying DLB to other approaches. The notable improvements seen with MC-Net prompt the question: could DLB potentially be applied to other state-of-the-art methods such as CAML or MC-Net+ and even surpass the proposed method? Additionally, CPS is not included in the comparison in Table 1.
    • It would be beneficial to include a qualitative example of pancreas segmentation for better understanding.
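
    A minimal sketch of the paired significance test suggested above, comparing per-case Dice scores of two methods on the same test cases. The score arrays are hypothetical placeholders, not results from the paper.

        import numpy as np
        from scipy import stats

        # Hypothetical per-case Dice scores on the same test cases.
        dice_ours = np.array([0.89, 0.91, 0.87, 0.90, 0.88, 0.92])
        dice_base = np.array([0.87, 0.90, 0.85, 0.88, 0.86, 0.91])

        t_stat, p_val = stats.ttest_rel(dice_ours, dice_base)  # paired t-test
        w_stat, p_wil = stats.wilcoxon(dice_ours, dice_base)   # non-parametric alternative
        print(f"paired t-test p = {p_val:.4f}, Wilcoxon p = {p_wil:.4f}")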
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The clarity and novelty of the paper and the validation experiments (different datasets, SOTA methods) seem sufficient to recommend acceptance.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel semi-supervised learning method for medical image segmentation that combines one text and one vision encoder with two vision decoders. The text encoder is based on a frozen CLIP model while the vision encoder uses a V-Net encoder. Moreover, a module is proposed (Voxel-Wise Prompts Module) to generate text-vision prompts, and another module (Vision-Text Consistency Module) is used to regularize those prompts. Finally, the model includes a dynamic labeling branch to generate pseudo labels on the fly based on the loss values.

    Experiments are performed on two public datasets (LA and Pancreas-CT) and evaluated using common segmentation metrics. The proposed approach shows competitive results at different annotation levels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is novel and has been evaluated on two public datasets against state-of-the-art alternatives showing equal or superior performance.

    An ablation study is performed to better understand the contributions of each component of the proposed model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The only serious weakness I observe is the lack of significance in the ablation study of the VWPM and VTCM modules. This suggests to me that the text prompts are not really helpful for the task compared to the DLB module, but further experiments would be needed to elucidate this aspect.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors do not mention the range of hyper-parameters considered nor the method to select the best hyper-parameter configuration.

    There is no analysis of situations in which the method failed.

    There is no description of the memory footprint nor an average runtime for each result, or estimated energy cost.

    There is no analysis of statistical significance of reported differences in performance between methods.

    The results are not described with central tendency (e.g. mean) & variation (e.g. error bars).

    The specific evaluation metrics and/or statistics used to report results are correctly referenced.

    There are details of train / validation / test splits but no details on how the methods were implemented and tuned.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    An effort should be made with regard to reproducibility and evaluation. More specifically, the authors should provide a better description of the range of hyper-parameters (if any) considered for every method (including those of the state of the art), the number of training and evaluation runs, validation results, etc. In that sense, I recommend following the good practices proposed by Dodge et al. (“Show Your Work: Improved Reporting of Experimental Results”, 2019).

    As I mentioned before, the need for the VWPM and VTCM modules is not as clear as that for the DLB module. I recommend more experiments in that direction to check whether those modules are really needed and whether there is any significant gain from their use.

    Minor comments:

    • On page 2, the embeddings are mentioned at the “pixel level”; shouldn’t they be at the “voxel” level?
    • In Table 2, there is an error where it reads “62 (20%)”, it should read “62 (100%)”.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is novel and has been properly evaluated against the state of the art on public datasets.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available, early accepted paper.


