Abstract

Referring medical image segmentation targets delineating lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Despite not being trained on medical data, we enforce CLIP’s rich semantic space onto the medical domain by a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
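To make the text-to-pixel alignment concrete, here is a minimal sketch of the general idea, under our own assumptions (function and tensor names are illustrative and not taken from the authors' implementation): the CLIP sentence embedding is compared with per-pixel visual features by cosine similarity to yield segmentation logits.

    # Minimal sketch of CLIP-style text-to-pixel alignment. All names are
    # illustrative assumptions; this is not the paper's actual decoder.
    import torch
    import torch.nn.functional as F

    def text_to_pixel_logits(pixel_feats, text_feat, scale=10.0):
        """pixel_feats: (B, C, H, W) visual features projected into CLIP's space.
        text_feat: (B, C) sentence embedding from CLIP's text encoder.
        Returns (B, 1, H, W) segmentation logits via per-pixel cosine similarity."""
        pixel_feats = F.normalize(pixel_feats, dim=1)  # unit-norm channel vectors
        text_feat = F.normalize(text_feat, dim=1)      # unit-norm text embedding
        logits = torch.einsum("bchw,bc->bhw", pixel_feats, text_feat)
        return scale * logits.unsqueeze(1)             # temperature-scaled logits

In such a scheme, the logits would be upsampled to the input resolution and thresholded to obtain the binary lesion mask.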

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3127_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/WUTCM-Lab/CausalCLIPSeg

Link to the Dataset(s)

https://github.com/HUANGLIZI/LViT

BibTex

@InProceedings{Che_CausalCLIPSeg_MICCAI2024,
        author = { Chen, Yaxiong and Wei, Minghong and Zheng, Zixuan and Hu, Jingliang and Shi, Yilei and Xiong, Shengwu and Zhu, Xiao Xiang and Mou, Lichao},
        title = { { CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present CausalCLIPSeg, which integrates the CLIP vision-language model into the domain of medical image segmentation, particularly for tasks involving referring expressions. By adapting CLIP’s semantic understanding to medical imagery, the framework addresses the challenge of text-to-pixel alignment without the need for retraining on medical-specific data. Additionally, it introduces a causal intervention module that enhances the reliability of segmentation results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed CLIP-based medical referring segmentation is quite interesting.

    2. The proposed method achieves promising results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Major Concerns:

    1. Lack of Clarity and Excessive Jargon: The authors have used substantial jargon which makes the text difficult to understand. For example, in the abstract, the sentence proposing the method states: “Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments.” Here: a) What is confounding bias? b) What are spurious correlations? c) What are meaningful causal relationships? d) What does it mean to self-annotate confounders? e) What is a causal intervention module? f) What does it mean to excavate causal features? Overall, it is overwhelming and challenging to understand the proposed method.

    Continuing in the introduction, the last paragraph reiterates the same terms without clearly describing them. After reading through the abstract and introduction, it is understandable that the authors are trying to leverage CLIP for referring segmentation, but it is unclear why and how they are addressing it.

    2. Lack of Ablation Study: a) AM Module: The proposed AM module incorporates causal and confounding features; however, there are no ablation experiments to clarify its impact on the model’s performance.

    b) Adversarial Loss: Similarly, for the adversarial loss, it is unclear how it affects performance.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Will the dataset’s train and test splits, including GT annotations, be released?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Firstly, there’s a lack of clarity and an overuse of technical jargon, making it difficult to understand key concepts such as “confounding bias” and “causal features.” Secondly, the paper lacks ablation studies for the AM module and the adversarial loss, leaving it unclear how these elements affect the model’s performance.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Reject — must be rejected due to major flaws (1)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I am leaning towards rejecting this work due to the unclear explanation of the proposed method and the insufficient experimental validation of key model components.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper uses CLIP to investigate referring segmentation on the QaTa-COV19 dataset and introduces a causal intervention module to address confounding bias.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Proposes a means of incorporating CLIP into a segmentation process driven by textual descriptions. Develops a masking approach to try to separate useful features from confounding image properties.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The segmentation seems to be binary in all cases, which makes the “referring” part of the segmentation less convincing. The causal part of the segmentation was not presented clearly enough. It is unclear what happens at runtime - are the causal masks still there?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The dataset is public, but the source code will not be released (or its release is not mentioned). Most aspects are reasonably detailed to a point of reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Fig. 1 – the difference in color between causal and confounding features is quite subtle; a stronger color contrast might be useful.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The lack of clarity about how the causal part of the network works at inference time is a key factor in this rating. Overall, the paper seems to present two potentially interesting ideas in limited detail as opposed to one idea with a stronger presentation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The release of source code alleviates some reproducibility concerns and raises my rating.



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors use a vision-language model to segment chest X-ray images using textual prompts. Their method relies on CLIP, which they adapt to medical images for a specific application. They also implement a causal intervention module to reduce confounders that arise from the general heterogeneity of medical images. Finally, they perform an ablation study to understand the significance of each part of their algorithm.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper aims to mitigate the problem of reliable ground-truth creation for segmentation studies, which is still a challenge, especially in the medical imaging domain. The main strength is the novel approach to the problem, which relies on textual descriptions of image particulars and findings. Moreover, the causal module is an interesting and apparently effective method for mitigating the confounding factors of this kind of problem. Finally, the paper is clear and well written, and the figures are adequate.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this paper is the absence of error bars and uncertainty estimates in the performance evaluation. Moreover, it is not clear whether the algorithm was evaluated on a separate test set. To assess significance, it is very important to quantify uncertainty, especially when comparing with existing methods and in the presented ablation study.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Despite a clear explanation of the algorithm, reproducibility cannot be ensured since the authors did not share their code. Moreover, even though the dataset used is publicly available, the authors did not specify the proportions of the training, validation, and test sets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As written previously, the main problem of this manuscript is the lack of confidence levels and/or uncertainties in the performance measures. This is a key improvement that the authors should implement, as it allows for a fairer comparison with existing methods. Furthermore, the authors should specify how many patients were used for the training, validation, and test sets, both for reproducibility and to understand whether the number of test samples is sufficient to claim an improvement in performance. Finally, the authors could extend the cited literature to better position their work (e.g., comparing their algorithm with others based on the Segment Anything Model (SAM) approach).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I suggest weak accept for this paper since building reliable ground truth in segmentation studies is an important and still challenging problem. Moreover, the work is based on state-of-the-art segmentation methods, it is well written, and it also considers the causality of predictions. However, details about the division into training, validation, and test sets are required, as is the computation of errors/confidence/uncertainties on the test set to fairly compare performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank the reviewers for scrutinizing our work and providing detailed and constructive comments.

Q1 Source code and dataset information for reproducibility issues. (R1&R3&R4) We will make our code publicly available. The QaTa-COV19 dataset is a public dataset consisting of 5716 training images, 1429 validation images, and 2113 test images. The dataset split, along with the corresponding texts and ground truth annotations for the images, is publicly available [3].

Q2 The segmentation seems to be binary in all cases, which makes the “referring” part of the segmentation less convincing. (R1) Referring image segmentation aims to segment specific regions based on natural language prompts. It is applicable to both binary and multi-class segmentation problems. For instance, some peer-reviewed works on referring image segmentation have focused on binary segmentation tasks [a,b].

[a] RRSIS, IEEE TGRS, 2024. [b] Rotated multi-scale interaction network for referring remote sensing image segmentation, CVPR’24.

Q3 The causal part of the segmentation wasn’t clearly presented. Are the causal masks still there at runtime? (R1) Yes, causal masks are generated during both the training and inference stages. These masks are learned and input-specific. We will clarify this in Section 2.3 of the final version of the paper.
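To make the answer to Q3 concrete, the following is a hedged sketch of an input-specific masking module that behaves identically at training and inference time; the module structure and names are our assumptions, not the paper's code.

    # Hedged sketch of input-specific causal masking applied at both training
    # and inference. Module and variable names are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CausalMasker(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)

        def forward(self, feats):
            m = torch.sigmoid(self.mask_head(feats))  # soft mask in [0, 1]
            causal = feats * m                        # features used for prediction
            confound = feats * (1.0 - m)              # complement, penalized in training
            return causal, confound, m

Because the mask is produced by a learned head from the input features, it is available at inference without any extra supervision.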

Q4 Fig 1 - stronger color difference might be useful. (R1) We will change the colors accordingly.

Q5 Lack of clarity and excessive jargon. (R3) Confounding bias, spurious correlation, causal relationship, and causal intervention are fundamental terms in the field of causality in machine learning. For rigor’s sake, we cannot modify these terms arbitrarily. To address the reviewer’s concerns, we intend to add a PRELIMINARIES section providing background knowledge on causality and explaining these terms in the context of our task.

(a) Confounding bias refers to the interference of background factors and other extraneous variables in the model’s segmentation of lesion areas. (b) Spurious correlation is a concept closely related to confounding bias. (c) A meaningful causal relationship is the ideal scenario where the model relies solely on visual features of lesion areas to generate lesion masks, without being influenced by confounding factors. (d) Self-annotating confounders refers to our model’s ability to adaptively extract confounders. (e) The causal intervention module is a network module we designed to mitigate confounding bias. (f) Excavating causal features refers to our module extracting visual features that satisfy (c).

Q6 Lack of ablation study. (R3) Adversarial masking (AM) is an indispensable component of our causal intervention module, and the adversarial loss is specifically designed to guide this module’s learning. We have conducted an ablation study on the entire module (cf. Table 2). However, it is important to note that we cannot perform an ablation study solely on AM or the adversarial loss, due to their interdependence within the module’s design. We will clarify this in Section 3.4.
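The interdependence described in this answer can be illustrated with a schematic min-max objective; the formulation below is only an assumption consistent with the abstract's description of optimizing causal features while penalizing confounding ones, not the paper's exact loss.

    # Schematic sketch of an adversarial objective coupling the two branches.
    # In practice such objectives are often optimized with alternating updates
    # or a gradient-reversal layer; this simplified form is an assumption.
    def adversarial_objective(seg_loss_fn, pred_causal, pred_confound, target, lam=0.1):
        l_causal = seg_loss_fn(pred_causal, target)      # minimize: causal branch should segment well
        l_confound = seg_loss_fn(pred_confound, target)  # penalize: confounding branch should fail
        return l_causal - lam * l_confound

Under this view, removing either the masking or the adversarial term leaves the other without a training signal, which matches the authors' point about their interdependence.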

Q7 Lack of errors and uncertainties in the evaluation. (R4) In fact, we conducted multiple experimental runs and observed minimal fluctuations in results. For instance, the standard deviation of Dice for our model did not exceed 0.002. We will report these uncertainty measures for all models in the final version of the paper.
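Reporting such run-to-run variability is inexpensive; the snippet below shows one standard way to compute mean ± standard deviation over repeated runs (the Dice values are placeholders, not the paper's numbers).

    # Mean and standard deviation of Dice across repeated runs.
    # The values below are placeholders for illustration only.
    import statistics

    dice_per_run = [0.831, 0.829, 0.832, 0.830]  # hypothetical repeated-run scores
    mean = statistics.mean(dice_per_run)
    std = statistics.stdev(dice_per_run)         # sample standard deviation
    print(f"Dice = {mean:.3f} ± {std:.3f}")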

Q8 It is not clear if the algorithm has been evaluated on a separate test set. (R4) The dataset we use has predefined training, validation, and test splits. We performed model evaluation on the test set.

Q9 The authors could extend the cited literature to better position their work. (R4) In the final version of the paper, we will cite and discuss the latest medical image segmentation models based on SAM. It is important to note that according to SAM’s code, it only supports point and bounding box prompts, not text prompts. In contrast, we chose CLIP, as it supports text prompts, meeting our needs.
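The distinction drawn here is that CLIP ships with a text encoder, so free-form prompts can be embedded directly. A quick illustration with the openai/CLIP package (the prompt string is an invented QaTa-COV19-style example, not taken from the dataset):

    # Encoding a free-form text prompt with CLIP's text encoder.
    import clip
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize(["bilateral pulmonary infection, two infected areas"]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(tokens)  # (1, 512) prompt embedding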




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I think this paper deserves acceptance even though it received mixed reviews. The reviewer who gave a strong reject did not reply after the rebuttal, and I feel the points brought up by that reviewer were addressed. The paper proposes an interesting end-to-end framework for referring medical image segmentation using CLIP while promoting causal relationships, and it deserves to be included at MICCAI.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper uses CLIP to investigate referring segmentation and proposes a causal intervention module to address confounding bias. In general, there is a lack of clarity regarding the causal intervention module, along with excessive jargon. The ablation study is incomplete and the results are not convincing.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a promising research direction, and the method can be potentially valuable for large-scale annotation. The rebuttal provided helpful clarifications. Technical clarity should be carefully revised.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



