Abstract

Semi-supervised medical image segmentation plays a critical role in mitigating the high cost of data annotation. When labeled data is limited, textual information can provide additional context to enhance visual semantic understanding. However, research exploring the use of textual data to enhance visual semantic embeddings in 3D medical imaging tasks remains scarce. In this paper, we propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg), which consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA). Specifically, TMR facilitates text-visual interaction through planar mapping, thereby enhancing the category awareness of visual features. CSA performs cross-modal semantic alignment between text features, augmented with learnable variables, and intermediate-layer visual features. DCA reduces the distribution discrepancy between labeled and unlabeled data through their interaction, thus improving the model’s robustness. Finally, experiments on three public datasets demonstrate that our model effectively enhances visual features with textual information and outperforms other methods. Our code is available at https://github.com/taozh2017/Text-SemiSeg.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0335_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HuaKai_Textdriven_MICCAI2025,
        author = { Huang, Kaiwen and Zhou, Yi and Fu, Huazhu and Zhang, Yizhe and Gong, Chen and Zhou, Tao},
        title = { { Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {606--616}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a deep learning framework called Text-SemiSeg for semi-supervised segmentation of medical images. Text-SemiSeg utilizes related textual information to enhance the understanding of visual features in 3D scans. The core of the method involves a Text-enhanced Multiplanar Representation (TMR) to integrate text and visual data across different image planes, a Category-aware Semantic Alignment (CSA) module to align textual and visual embeddings, and a Dynamic Cognitive Augmentation (DCA) strategy to reduce the differences between labeled and unlabeled data, ultimately leading to more robust segmentation results across various medical datasets.

    Using pre-trained VLMs, such as CLIP, to facilitate medical image segmentation has been proposed before. However, CLIP was trained on 2D data. The proposed framework focuses on the adaptation of a semi-supervised text-visual model to 3D medical image segmentation. The proposed method was demonstrated on three different datasets, namely Pancreas-CT, BraTS-2019, and MSD-Lung tumor, and compared favorably with ten different methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Utilizing text to enhance semi-supervised segmentation is an interesting direction.
    2. The proposed framework has novel methodological aspects, including the adaptation to 3D and the introduction of the TMR, CSA, and DCA modules.
    3. The proposed framework outperforms ten existing methods on three datasets in all but one case.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The proposed method relies on the pre-trained CLIP model, which is primarily trained on 2D natural images and corresponding text. The extension to 3D medical images might be limiting.
    2. The framework consists of three main modules (TMR, CSA, and DCA) and utilizes two decoders for consistency learning. This modular design, while contributing to its performance, might also increase the complexity of the model in terms of implementation and the number of hyperparameters to tune.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In your rebuttal please refer to the weaknesses section.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    Segmentation of medical images is the topic of the paper.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • good evaluation - see table 1
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • only already-known datasets have been used
    • the methodological description is not very deep, consisting of just some obvious equations
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    an interesting application is considered. the numerical analysis is fine

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper describes an approach that couples a vision model with a language model and forces it to leverage textual information in the best possible manner. The paper presents results on three datasets, exceeding previous performance by a substantial margin on two of the three datasets, namely Pancreas and MSD-Lung.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is easy to read and well-motivated.
    2. The methodology is easy to understand and follow. The prompt-learning outlook is most certainly a better approach to solving the problem.
    3. The results and ablation studies cover everything that could be required within reason & under the page limit.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    This reviewer does not consider the following two questions major weaknesses; they are rather requests for answers to better understand the thinking behind the paper.

    1. When distributing the views along the axes (i.e., coronal, axial, and sagittal), is there a chance of encountering scans in which the views and the in-scan orientation are not really aligned, so that the network thinks it is looking at a coronal view when it is actually looking at a sagittal one? This reviewer is not saying this is the case for the datasets used in this study, but there is a fair chance of situations arising (AbdomenAtlas/FLARE23) in which the metadata orientation (let's assume RAS) and the actual scan orientation when read in as a NumPy array are not aligned. Does this confuse the workflow? And if so, how can it be resolved?
    2. Has a single-decoder configuration been tried, given the obvious advantage posed by integrating a VLM? This may reduce the training workload and increase the quality of auto-segmentations, as more scans can be accumulated within a mini-batch.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written, easy to read, motivated and solves a problem at hand. The authors have also thought of the experiment design and the ablation studies.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper presents a text-driven semi-supervised medical image segmentation method. It is rooted in the CLIP family of methods, and builds upon VCLIPSeg a semi-supervised 3D medical segmentation method. The method relies on 3 modules enhancing the CLIP architecture. The first module is the Text-enhanced Multiplanar Representation (TMR), the second is the Category-aware Semantic Alignment (CSA), and the third is the Dynamic Cognitive Augmentation (DCA). The first two modules aim to better align and fuse visual with textual information. The third module aims to address the distributional difference of labeled vs unlabeled data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall this is a strong contribution.

    First of all, the evaluation is thorough and compares the proposed method to many other techniques, showing consistently higher performance than others, especially in the low-labeled-data regime. The ablation studies show that all three components contribute to the performance, although somewhat surprisingly the DCA module is most impactful. In terms of methodology, the TMR incorporates a smart 2.5D design for a richer representation. I am less clear about the contribution of the learnable variables added to the text. The CSA masks the features from the visual encoder with the decoder output before aligning with text, thereby adding semantic context. The DCA mixes labeled and unlabeled data, which reduces distributional differences between them.

    Overall, any one of these three contributions could be enough of an improvement to warrant publication, and this paper represents a significant amount of work. The experiments are also very thorough.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I did not see any major weaknesses in this paper, but here are some general comments and minor points. One point that could have been better illustrated in the experiments and results is the learned variables from the text. If the text encoder is fixed, which text was learned? Do you have examples? If the image/segmentation pair is a brain tumor, does the learned text say “image of a tumor in the left hemisphere” or something completely different?

    Some of the design choices are not always fully justified. For example, why is average pooling the best strategy in the TMR module?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See above. This is a strong paper with three significant contributions, each well justified. The authors show, through extensive experiments, that the modules contribute to performance improvement. The paper is well written and has the potential to be adapted by or to inspire new methods in this field.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their constructive comments.

To R1: Q1: 2D-to-3D domain mismatch A1: The main objective of our model is to enhance the association of visual features through text. By interacting with CLIP on the three planes of the 3D volume, the responses of the foreground classes across these planes are aggregated, which effectively reflects the intensity of each foreground voxel in the 3D features.
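For illustration, a minimal editorial sketch of this kind of multiplanar text-visual interaction: a 3D feature volume is collapsed along each axis by average pooling and compared against CLIP text embeddings to obtain per-class response maps on each plane. This is not the authors' released code; the tensor shapes, function name, and cosine-similarity formulation are assumptions.

```python
# Editorial sketch (not the authors' code): per-class text responses on the
# three planes of a 3D feature volume. Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def multiplanar_text_response(feat3d, text_emb):
    """feat3d: (B, C, D, H, W) visual features; text_emb: (K, C) CLIP text embeddings."""
    txt = F.normalize(text_emb, dim=1)                # unit-norm class embeddings
    responses = []
    for axis in (2, 3, 4):                            # collapse depth, height, width in turn
        plane = F.normalize(feat3d.mean(dim=axis), dim=1)  # (B, C, *, *) planar features
        # (B, K, *, *): cosine-similarity map between each class and each location
        responses.append(torch.einsum('bchw,kc->bkhw', plane, txt))
    return responses                                  # one per-class response map per plane
```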

Q2: Module complexity A2: The use of two decoders is a common setup in semi-supervised consistency-learning strategies [24]. Moreover, TMR interacts through 2D planes, introducing only a minimal increase in computational cost. CSA is a loss function, and DCA is a data augmentation method, neither of which increases the model’s parameter count.

To R2: Q1: Textual feature interpretability A1: Although the text encoder is fixed, the input text contains learnable variables. For example, when segmenting a brain tumor, the input text would be “[V_1][V_2]…[V_n] brain tumor,” where [V_1][V_2]…[V_n] represent the learnable variables.
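The "[V_1][V_2]…[V_n] brain tumor" construction matches the CoOp-style prompt-learning pattern; a minimal sketch follows, in which the context-token count, embedding dimension, and module interface are assumptions rather than the paper's exact design.

```python
# Sketch of CoOp-style learnable prompt tokens; dimensions are assumptions.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Learnable context tokens [V_1]...[V_n] prepended to the class-name embedding."""
    def __init__(self, n_ctx=4, ctx_dim=512):
        super().__init__()
        # Optimized end-to-end while the CLIP text encoder stays frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)

    def forward(self, class_token_emb):
        # class_token_emb: (L, ctx_dim) token embeddings of, e.g., "brain tumor"
        return torch.cat([self.ctx, class_token_emb], dim=0)  # input to the frozen encoder
```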

Q2: Average pooling A2: Based on our experiments, average pooling is a simple and effective operation. It allows for a quick reduction in the dimensionality of visual features without introducing additional computational overhead, while also preserving the information from each channel.
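As a concrete illustration of the parameter-free reduction described here (an editorial example; the tensor shape is arbitrary):

```python
# Global average pooling: collapses spatial dimensions, keeps one value per
# channel, and adds no learnable parameters.
import torch

feat3d = torch.randn(2, 256, 16, 96, 96)   # (B, C, D, H, W), illustrative shape
pooled = feat3d.mean(dim=(2, 3, 4))        # (B, C): per-channel average
```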

To R3: Q1: Datasets and methodological description A1: We have validated the effectiveness of our method on multiple publicly available datasets. A detailed description of the methodology has already been provided in the methods section. Additionally, we will take your feedback into account for the final version.

To R4: Q1: View-axis misalignment A1: Our method relies only on the three planar mappings of the input 3D data, without requiring alignment to any specific anatomical plane. This is because we only use the category information from the text to enhance the response intensity of the relevant category in each view.

Q2: Feasibility of a single decoder A2: Since the focus of this paper is on a consistency learning-based semi-supervised strategy, the design of two decoders is essential, as it is a common paradigm in semi-supervised learning [24].
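For reference, the two-decoder consistency paradigm invoked here typically regularizes two decoding branches of a shared encoder to agree on unlabeled data. A minimal sketch follows; the MSE choice and the module names are assumptions, not the paper's exact loss.

```python
# Sketch of two-decoder consistency regularization; details are assumptions.
import torch
import torch.nn.functional as F

def consistency_loss(encoder, decoder_a, decoder_b, unlabeled):
    """Encourage two decoders sharing one encoder to agree on unlabeled inputs."""
    z = encoder(unlabeled)                      # shared encoder features
    p_a = torch.softmax(decoder_a(z), dim=1)    # decoder A class probabilities
    p_b = torch.softmax(decoder_b(z), dim=1)    # decoder B class probabilities
    return F.mse_loss(p_a, p_b)                 # agreement term added to the supervised loss
```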




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


