Abstract

Multimodal learning leverages complementary information derived from different modalities, thereby enhancing performance in medical image segmentation. However, prevailing multimodal learning methods heavily rely on extensive well-annotated data from various modalities to achieve accurate segmentation performance. This dependence often poses a challenge in clinical settings due to limited availability of such data. Moreover, the inherent anatomical misalignment between different imaging modalities further complicates the endeavor to enhance segmentation performance. To address this problem, we propose a novel semi-supervised multimodal segmentation framework that is robust to scarce labeled data and misaligned modalities. Our framework employs a novel cross modality collaboration strategy to distill modality-independent knowledge, which is inherently associated with each modality, and integrates this information into a unified fusion layer for feature amalgamation. With a channel-wise semantic consistency loss, our framework ensures alignment of modality-independent information from a feature-wise perspective across modalities, thereby fortifying it against misalignments in multimodal scenarios. Furthermore, our framework effectively integrates contrastive consistent learning to regulate anatomical structures, facilitating anatomical-wise prediction alignment on unlabeled data in semi-supervised segmentation tasks. Our method achieves competitive performance compared to other multimodal methods across three tasks: cardiac, abdominal multi-organ, and thyroid-associated orbitopathy segmentations. It also demonstrates outstanding robustness in scenarios involving scarce labeled data and misaligned modalities.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2001_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2001_supp.pdf

Link to the Code Repository

https://github.com/med-air/CMC

Link to the Dataset(s)

MS-CMRSeg dataset: https://zmiclab.github.io/zxh/0/mscmrseg19/

AMOS dataset: https://amos22.grand-challenge.org/

BibTex

@InProceedings{Zho_Robust_MICCAI2024,
        author = { Zhou, Xiaogen and Sun, Yiyou and Deng, Min and Chu, Winnie Chiu Wing and Dou, Qi},
        title = { { Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors proposed a semi-supervised multimodal segmentation framework to deal with scarce labeled data and misaligned modalities. The method is a combination of cross-modality collaboration and contrastive consistent learning. Results on three datasets showed the effectiveness of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method employed fine-tuned modality-specific encoders from the SAM-Med3D encoder to extract the initial modality-independent features.
    2. Cross-modality collaboration strategy and contrastive consistent learning module were employed for feature alignment and harmonizing anatomical structure.
    3. Extensive experiments on three datasets were conducted to show the effectiveness.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Novelty is limited. The CSC loss and CAC loss seem to be the same as the contrastive loss in ref. [2] and the area-similarity contrastive loss in ref. [13], respectively. The differences between this paper and SOTA methods need to be clarified, and a comparison between the proposed method and that of [2] is not provided in the results section.
    2. In Fig. 1, the CSC loss is calculated at the feature level, but in formula (1) it is calculated between two predictions. The supervised losses for the two modalities are binary cross-entropy and Dice loss, respectively, and no explanation is given for why they differ.
    3. This work only shows results on a subset of categories of the AMOS and TAO datasets.
    4. The method requires paired multi-modal images for training, so why is only one modality used for evaluation? For the compared methods, it would be more convincing to show the results of typical SSL methods such as CPS when using all modalities for training/testing. If the authors assume that only one modality is available at inference, a comparison between multi-modal and single-modality training is necessary to show the benefit of multi-modal training.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. As mentioned above, it would be good to clarify the differences between the CSC/CAC losses and those of SOTA methods.
    2. The results only cover a fraction of the categories in the AMOS and TAO datasets. It would be better to provide results for all classes.
    3. Writing precision needs improvement, e.g., regarding the supervised loss and the motivation of the MIA module. A statistical significance analysis would also be appreciated.
    4. Comparison with multi-modal methods for testing is not shown, and the benefit of using multi-modal training over single-modal training is not demonstrated.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Multi-modal semi-supervised learning is an interesting topic, but the novelty of the method is limited, and the experiments should be improved.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a semi-supervised multi-modality segmentation framework based on the pre-trained SAM-Med3D model that aligns information across modalities at the feature level and fuses features from different modalities through a cross-modality collaboration strategy. Meanwhile, contrastive consistent learning, similar to Cross Pseudo Supervision, is used to align the predictions of different modalities and improve segmentation performance. The proposed method shows competitive performance on three datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper constructs a multi-modal semi-supervised medical segmentation framework using the currently popular SAM-Med3D model, and adds an adapter to the encoder to increase its adaptability to the segmentation task.
    2. This paper proposes a Channel-wise Semantic Consistency (CSC) loss to regularize the underlying anatomical structure at the feature level. A Modality-Independent Awareness (MIA) module is also proposed to obtain modality-independent knowledge and to optimize feature fusion.
    3. A Contrastive Anatomical-similar Consistency (CAC) loss is applied to the unlabeled outputs of the different modality decoders in a manner similar to Cross Pseudo Supervision.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. When comparing this method with others, the paper does not consider the influence of model structural complexity on the results. For example, the backbone of CML is U-Net, while this paper uses the pre-trained SAM-Med3D. Obviously, this paper relies on a much heavier model with an additional pre-training process. In addition, the compared supervised method uses the lightweight V-Net as its backbone. This suggests the comparison is carried out under unfair conditions, and good results obtained at such high computational complexity may not reflect the true performance of the model.
    2. The paper only explains dual-modality collaborative training, yet single-modality MS-CMRSeg data are used for training in the ablation experiment. The specific training process should be briefly explained.
    3. The title of the paper mentions “Multimodal”, but the framework is only discussed in the dual-modality case, and given its very high computational complexity it seems difficult to generalize to more modalities.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    None.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The computational complexity of this framework is much higher than that of the other methods, so some performance improvement seems inevitable. However, the paper does not discuss this issue in much depth, and without such an analysis the true performance of the model cannot be assessed. In addition, much key information is missing from Fig. 1, which hampers understanding of the paper; the authors are encouraged to add it.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In this paper, a novel multi-modality semi-supervised medical image segmentation framework is constructed using Cross Modality Collaboration (CMC) and Contrastive Consistent Learning, and excellent segmentation performance is achieved. However, the comparison of methods is carried out under unfair conditions. Moreover, the framework is only discussed in the dual-modality case, and given its very high computational complexity it seems difficult to generalize to more modalities.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors’ feedback resolved some of my doubts. Evaluating the computational complexity of the model would be valuable. Considering that no major changes can be made to the article at this stage, the previous score is maintained.



Review #3

  • Please describe the contribution of the paper

    The paper addresses the challenge that multi-modal medical segmentation typically requires a substantial amount of labeled data across modalities. To address this, the authors propose a cross-modality collaboration strategy to obtain modality-independent knowledge and integrate it into a fusion layer.

    Specifically, the approach uses a channel-wise semantic consistency loss to ensure alignment of the modality-independent features learned across the different modalities.

    The main contributions are a cross-modality collaboration strategy to leverage modality-independent knowledge for multi-modal segmentation, and a channel-wise semantic consistency loss to align the modality-independent features. Experimental results show the proposed method can achieve comparable performance to existing fully-supervised multi-modal segmentation approaches, but with significantly less labeled data required.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The primary strength of this paper is its innovative approach to addressing the data scarcity challenge in multi-modal medical image segmentation. By proposing a cross-modality collaboration strategy to extract modality-independent features, the authors demonstrate how to effectively leverage unlabeled data across different imaging modalities. The use of a channel-wise semantic consistency loss is an interesting technique to ensure these modality-independent features are well-aligned, enabling robust fusion for accurate segmentation. This is a significant advancement over fully-supervised methods that require extensive labeled data for each individual modality. Additionally, the authors provide extensive experimental results using several datasets showing their semi-supervised approach can achieve comparable performance to existing supervised techniques, despite using far less labeled training data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper presents an innovative semi-supervised approach that achieves comparable results to fully-supervised methods, there are a few potential weaknesses to consider. First, the authors claim their framework exhibits “exceptional robustness”, but this is not immediately evident from the experimental results provided. More detailed analysis or comparison to existing methods would be needed to substantiate this assertion. Additionally, although the semi-supervised technique reduces the data annotation burden, the reported performance is still on par with previously published fully-supervised methods. This suggests the proposed approach, while a step forward, may not represent a significant leap in multi-modal segmentation capabilities compared to existing state-of-the-art techniques.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    While the authors have presented an innovative semi-supervised approach for multi-modal medical image segmentation, there are a few areas where additional work could strengthen the contribution. First, it would be helpful for the authors to provide more detailed analysis and quantification of the “exceptional robustness” claimed for their framework. The authors claim that Figure 2 shows their method “significantly outperforms” existing techniques, but this is not immediately evident from the provided visual comparison. It would be helpful for the authors to conduct and report the results of statistical significance tests to quantify the performance gains. Additionally, the authors frequently use vague descriptors like “shows promising results” and “considerably surpasses” when discussing their results, but these terms lack specificity. Finally, the authors could consider expanding the evaluation to include practical deployment considerations, such as inference/training resources.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a technically sound semi-supervised approach to address the data scarcity challenge in multi-modal medical image segmentation. However, the evaluation and communication of the results need to be strengthened to better demonstrate the merits of this work. The authors claim their method “significantly outperforms” existing techniques, but the evidence provided does not clearly show statistically significant improvements. The authors should conduct and report the results of appropriate significance tests to quantify the performance gains. Additionally, the language used to describe the experimental findings is often vague, lacking concrete, quantitative comparisons to state-of-the-art fully-supervised methods. To better support the assertions of the method’s advantages, the authors should present a more rigorous and transparent evaluation, including analyses of practical deployment considerations like data efficiency and inference speed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the area chair and reviewers for their time and supportive comments highlighting our “novel” method with “well-supported contributions” to tackle the “data scarcity challenge” in multi-modal medical image segmentation. We are also glad to see that reviewers recognize the “well-aligned” and “robust fusion” of our work, which addresses the emerging topic of semi-supervised multi-modal learning with high Dice scores demonstrated in “extensive experiments on three datasets”.

As we include extensive experiments, our paper had to omit some implementation details due to limited space, which might have caused minor confusion among reviewers. Although the issues raised are minor or stem from misunderstandings, we will release all our code in the final version to ensure clarity and reproducibility.

R1

Q: Unfair comparison due to heavy-weight model A: To preserve the integrity of each method’s original network structure, CML [13] remains a 2D network while our model is a 3D network. Regarding fairness, although we use a heavy-weight model, comparison methods such as mmFor, EFCD, and UMML also rely on heavy-weight models.

Q: Explanation for single-modality training A: First, single-modality images are input into two encoders for feature extraction. Then, a cross-modality collaboration ensures channel-wise consistency using the CSC loss. Finally, a contrastive consistency learning module aligns prediction maps from unlabeled data with the CAC loss.
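
To make this description concrete, here is a minimal, hypothetical PyTorch-style sketch of the training flow as stated in the rebuttal; the encoder, decoder, fusion, and loss objects are placeholders passed in by the caller, not the authors' actual implementation.

```python
def train_step(x_a, x_b, y, labeled, enc_a, enc_b, fuse, dec_a, dec_b,
               csc_loss, cac_loss, sup_loss):
    """One hypothetical training step following the rebuttal's description.

    x_a, x_b: images for the two branches (the same single-modality image can
              be fed to both encoders in the single-modality ablation setting).
    y: ground-truth mask, used only when `labeled` is True.
    """
    # Step 1: each image is passed through its own encoder for feature extraction.
    f_a, f_b = enc_a(x_a), enc_b(x_b)

    # Step 2: cross-modality collaboration enforces channel-wise consistency
    # between the two feature maps via the CSC loss.
    loss = csc_loss(f_a, f_b)

    # Features are merged and decoded into per-branch prediction maps.
    fused = fuse(f_a, f_b)
    p_a, p_b = dec_a(fused), dec_b(fused)

    if labeled:
        # Supervised term on labeled data (e.g. Dice + cross-entropy against y).
        loss = loss + sup_loss(p_a, y) + sup_loss(p_b, y)
    else:
        # Step 3: the CAC loss aligns the two prediction maps on unlabeled data.
        loss = loss + cac_loss(p_a, p_b)
    return loss
```

Labeled and unlabeled mini-batches would simply route through the two branches of the `if` above, as in standard semi-supervised training loops.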

Q: Generalization to more modalities A: Our method, evaluated on three datasets encompassing six modalities, exhibits good generalizability, as evidenced by the qualitative and quantitative results in Figs. 1 and 3 and Table 1. Moreover, it can also be extended to other modalities.

R3

Q: Limited novelty
A: Our method’s novelty lies in addressing the data scarcity challenge in multi-modal segmentation by leveraging the fine-tuned SAM-Med3D model, which is effective in scenarios with limited labeled data and misaligned modalities. We also introduce a CSC loss to align channel-wise features and a CAC loss to regularize predictions on unlabeled data.
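
As a rough illustration only: Reviewer #2 describes the CAC loss as operating “in a manner similar to Cross Pseudo Supervision”, so a generic CPS-style consistency term on unlabeled data might look like the sketch below. This is an assumed stand-in for intuition, not the paper’s actual CAC formulation.

```python
import torch.nn.functional as F

def cps_style_consistency(logits_a, logits_b):
    """Generic cross pseudo-supervision: each branch is supervised by the
    other branch's hard pseudo-label (no ground truth is involved)."""
    pseudo_a = logits_a.argmax(dim=1).detach()  # pseudo-labels from branch A
    pseudo_b = logits_b.argmax(dim=1).detach()  # pseudo-labels from branch B
    return F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)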

Q: Difference between the CSC loss and the supervised loss  A: Our CSC loss uses cosine similarity to align channel-wise features, ensuring channel consistency without relying on ground truth (GT). In contrast, our supervised loss is applied only to labeled data and requires GT to minimize prediction errors. Thus, they have fundamental differences.
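
For concreteness, a minimal sketch of the distinction drawn here, assuming the CSC loss is a channel-wise cosine-similarity term between feature maps and the supervised loss is a standard Dice plus cross-entropy term on labeled data (function names and tensor shapes are illustrative assumptions, not the authors’ code):

```python
import torch
import torch.nn.functional as F

def csc_loss(feat_a, feat_b):
    """Channel-wise semantic consistency: pull corresponding channels of the
    two modality feature maps together via cosine similarity (no GT needed)."""
    # feat_*: (B, C, D, H, W) -> flatten spatial dims to (B, C, D*H*W)
    a = feat_a.flatten(2)
    b = feat_b.flatten(2)
    cos = F.cosine_similarity(a, b, dim=-1)  # (B, C): one value per channel
    return 1.0 - cos.mean()                  # small when channels agree

def supervised_loss(logits, target, eps=1e-6):
    """Supervised term on labeled data only: cross-entropy + soft Dice vs. GT."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).movedim(-1, 1).float()
    inter = (probs * onehot).sum(dim=(2, 3, 4))
    denom = probs.sum(dim=(2, 3, 4)) + onehot.sum(dim=(2, 3, 4))
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    return ce + dice
```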

Q: Results for a subset of categories in the AMOS and TAO datasets A: Our study focuses on paired modality data. For the AMOS dataset, we use 40 CT/MRI pairs for training and 20 pairs for testing, despite it containing 200 CT and 40 MRI images for training, and 100 CT and 20 MRI images for testing. For the TAO dataset, we use all 100 T1/T1c pairs for both training and testing.

Q: Why only using one modality for evaluation A: Following the study of ref.[13], our study aims to develop a more clinically applicable model that excels under the practical constraints of real-world medical imaging, where multi-modal data may not always be available.

R4

Q: Experimental support for “exceptional robustness” A: Extensive experiments on three datasets demonstrate our model’s robustness. Visual comparisons with CML show that our model produces more robust predictions with fewer semantic errors and better alignment with the ground truth, as shown in Fig. 3. Quantitative results with 10% labeled data reveal performance comparable to the CML method (the SOTA), as presented in Table 1, further highlighting our model’s robustness.

Q: Reported result is still on par with fully-supervised methods A: Our method significantly outperforms the fully-supervised method[8] on three datasets with 10% labeled data, as shown in Fig. 2 and Table 1. Our model achieves Dice score improvements of 15.0% and 8.1% for CT/MRI (#Liver), and 19.7% and 18.8% for T1/T1c (#SOM), respectively.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This cross-modality combination topic is not new, and I do not think the authors contribute much in terms of methodological novelty, as R3 said. Also, all reviewers had concerns about the experiments, which, however, were not resolved very well during the rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    This cross-modality combination topic is not new, and I do not think the authors contribute much in terms of methodological novelty, as R3 said. Also, all reviewers had concerns about the experiments, which, however, were not resolved very well during the rebuttal.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a semi-supervised multimodal segmentation framework to deal with scarce labeled data and misaligned modalities. The reviewers are generally in favor of the paper. The authors shall carefully address the remaining concerns in their final version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    This paper proposes a semi-supervised multimodal segmentation framework to deal with scarce labeled data and misaligned modalities. The reviewers are generally in favor of the paper. The authors shall carefully address the remaining concerns in their final version.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two area chairs have different opinions. I lean toward accepting the paper since (1) most of the reviewers gave a positive recommendation, and (2) even if not perfect, the idea of Cross Modality Collaboration might raise interesting discussions during the MICCAI conference.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Two area chairs have different opinions. I lean toward accepting the paper since (1) most of the reviewers gave a positive recommendation, and (2) even if not perfect, the idea of Cross Modality Collaboration might raise interesting discussions during the MICCAI conference.


