Abstract

Semi-supervised medical image segmentation is a practical but challenging problem, in which only limited pixel-wise annotations are available for training. While most existing methods train a segmentation model by using the labeled and unlabeled data separately, the learning paradigm solely based on unlabeled data is less reliable due to the possible incorrectness of pseudo labels. In this paper, we propose a novel method, namely pair shuffle consistency (PSC) learning, for semi-supervised medical image segmentation. The pair shuffle operation splits an image pair into patches and then randomly shuffles them to obtain mixed images. With the shuffled images for training, local information is better interpreted for pixel-wise predictions. The consistency learning of labeled-unlabeled image pairs becomes more reliable, since predictions of the unlabeled data can be learned from those of the labeled data with ground truth. To enhance model robustness, the consistency constraint on unlabeled-unlabeled image pairs serves as a regularization term, thereby further improving the segmentation performance. Experiments on three benchmarks demonstrate that our method outperforms the state of the art for semi-supervised medical image segmentation.
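
For concreteness, below is a minimal sketch of a pair shuffle operation as described in the abstract, assuming 2D tensors of shape (C, H, W) with H and W divisible by the grid size n; the function and variable names are illustrative and not taken from the paper's code (which is not released).

```python
import torch

def pair_shuffle(x_a: torch.Tensor, x_b: torch.Tensor, n: int = 4, perm=None):
    """Split two images of shape (C, H, W) into an n x n grid of patches,
    randomly redistribute the 2*n*n patches, and reassemble two mixed images.
    Passing a fixed `perm` lets the same shuffle be applied to label maps."""
    assert x_a.shape == x_b.shape
    c, h, w = x_a.shape
    ph, pw = h // n, w // n
    # Flatten both images into one list of patches: (2*n*n, C, ph, pw)
    patches = torch.stack([
        img[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
        for img in (x_a, x_b) for i in range(n) for j in range(n)
    ])
    if perm is None:
        perm = torch.randperm(patches.shape[0])
    patches = patches[perm]

    def assemble(p):  # reassemble an n x n grid of patches into one image
        rows = [torch.cat(list(p[i * n:(i + 1) * n]), dim=2) for i in range(n)]
        return torch.cat(rows, dim=1)

    # The first n*n patches form one mixed image, the remaining ones the other
    return assemble(patches[:n * n]), assemble(patches[n * n:]), perm
```

Applying the same permutation to the corresponding ground-truth and pseudo-label maps keeps the pixel-wise supervision aligned with the mixed images.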

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0942_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0942_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{He_Pair_MICCAI2024,
        author = { He, Jianjun and Cai, Chenyu and Li, Qiong and Ma, Andy J},
        title = { { Pair Shuffle Consistency for Semi-supervised Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The article presents a novel semi-supervised approach for medical image segmentation. The approach consists of a teacher-student framework coupled with a patch-based shuffling strategy to mix labeled and unlabeled images. The student model is trained in a supervised way (for labeled examples) and in a semi-supervised way with pseudo-labels (for unlabeled examples). The teacher model is updated through EMA updates of the student model. The method is evaluated on three image segmentation datasets (CT and MRI, in a 2D setting) and compared with several semi-supervised segmentation methods.
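
The teacher update mentioned here is a standard exponential moving average of the student's weights; a minimal PyTorch-style sketch is given below (the momentum value and function name are assumptions, not values from the paper).

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.99) -> None:
    """Move each teacher parameter towards the corresponding student parameter."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```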

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The article is well-written. It is clear and easy to follow, with complete methodological explanations and precise mathematical notations.

    • The proposed approach is simple and relatively sound. The idea of making the model focus on local information by making it rely less on external context can be interesting, in particular when combined with a semi-supervised framework using pseudo labels. However, it raises key additional questions about the spatial consistency of segmentations (see weaknesses).

    • Experimental validations seem to improve upon other semi-supervised approaches on three segmentation datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • While one can understand the appeal of a focus on local information, the shuffling operation could introduce artefacts or segmentation issues at patch boundaries. In particular, the approach has been validated on datasets where segmentation is limited to a few objects with relatively simple borders (one or two small objects in Figs. 1, 2 and in the supplementary). In more complex cases with many objects (e.g. whole-brain segmentation or similar tasks) and/or more complex shapes (e.g. regions spanning several patches), it is not clear how the approach will handle patch shuffling.

    • The framing of the article in the introduction can be a bit misleading. In the 1st and 2nd paragraphs, it sounds as if semi-supervised methods use either supervised or unsupervised learning, while, by definition, they use both (otherwise they would not be called semi-supervised). This might be just a writing issue, but it makes the contributions of the paper stand out more than they should, as the proposed approach is definitely not the first to use both ground truth and pseudo labels (see indeed some comparison methods cited in the introduction). In the same spirit, the proposed shuffling strategy seems closely related to data augmentation strategies such as mixup [1] and variants. While here mixed images are not used as data augmentation but as a way to enforce consistency between labeled and unlabeled examples, this field of literature should be addressed for completeness.

    [1] “mixup: Beyond Empirical Risk Minimization”, ICLR 2018

    • Experimental validations should be more detailed and/or extended to better understand the benefits of the approach. In particular, very few details are given about the state-of-the-art comparison methods of Tables 1, 2 and 3. The ablation study seems to have been conducted only on the ACDC dataset, which is limited to 100 examples (70 for training). This is a bit unconvincing, especially because one of the main appeals of SSL is to leverage the abundance of unlabeled data.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The authors should reformulate the introduction to better frame the paper according to semi-supervised learning literature and mixup related approaches.

    • More details about the impact of shuffling on segmentation results should be included, especially in the case of objects spanning several patches. At least some qualitative results and comments about this potential issue should be included.

    • More details about some of the comparison methods should be included, in particular the closest ones to the approach, to better understand their differences. Similarly, the ablation study should be extended to a larger dataset to assess the robustness of the approach.

    • It would be interesting to apply the approach in a fully patch-based setting, i.e. by using stacked patches in the batch dimension as input to the model, then reconstructing, instead of using mixed images as input. This comparison would be useful to assess the impact of shuffling on the segmentation results. Another interesting experiment would be to use standard mixup between labeled/unlabeled examples (with GT/pseudo-labels) to compare with patch shuffling.

    Minor comment: performance improvements should be expressed in percentage point unit, not %.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The article has potential, as the proposed approach is interesting and the results seem to improve over other methods. However, experiments are a bit limited and some key elements about the behavior of the approach should be clarified.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed most of the concerns raised by the reviewers. However, I am still unconvinced by the answer about boundaries of patches. The authors claim that “it does not degrade the overall performance” but the approach has only been validated on datasets with few and small objects, where this issue will clearly be less important. It remains unclear how the approach will perform on denser segmentation tasks, with larger adjacent regions. If the article is accepted, it would be honest to at least comment on this issue as a potential limitation in the final version.



Review #2

  • Please describe the contribution of the paper

    The authors proposed an image augmentation method similar to CutMix augmentation. They designed a PSC method to align the outputs of the teacher and student models based on image patches. They evaluated the performance on three different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is simple and straightforward, yet it yields impressive results. The comparison involves novel methods. It performs exceptionally well on three different datasets across various tasks, including tumor and organ segmentation, in both binary and multi-class scenarios.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The novelty is limited. Please compare to other mixup-based methods such as ClassMix, MixMatch, and so on. 2) There are many loss terms in the PSC module, which requires more ablation studies for each term.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) Please compare to other semi-supervised methods that use mixup-based augmentation. 2) There are many loss terms in the PSC module, which requires more ablation studies for each term. 3) Please provide more details about the novelty of the augmentation module. Why is it different from existing CutMix methods?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the detailed and constructive comments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel semi-supervised learning method for medical image segmentation. The proposed approach involves mixing patches from both labeled and unlabeled data. The authors suggest that this strategy, which includes spatial mixing of patches, improves accuracy. Three validation datasets are considered.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-written and well-structured, making it easy to read. The results appear to significantly outperform those of existing state-of-the-art methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The significance of spatial patch mixing remains to be demonstrated in my opinion.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The method proposes to mix patches from both unlabeled and labeled data during training. While this approach, akin to patch-based mixup, is already intriguing, the authors extend it by also shuffling spatial positions, which appears counterintuitive. While one might anticipate this approach working primarily for lesion segmentation, the authors conduct experiments with multiple labels. In the current setup, even with N ≫ 4, there is no constraint on the final position of a patch in the intermediate images X. Could the ablation study explore mixing unlabeled and labeled data without random position shifting? The proposed setup thus seems to involve two different shuffling operations.

    • Authors only utilize 2 unlabeled images. Could more be considered, especially in 2D?

    • It is unclear if the MLT dataset was originally considered in 3D and then used with 2D slices for convenience. Is the Promise12 dataset in 3D? The authors mention volumes for Promise12 but later claim that all images are resized to 256x256. Are the results of state-of-the-art methods from the original paper, or did the authors reproduce the results for 2D slices? This should be clearly stated.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach seems novel and the results on three datasets are significantly higher than those of state-of-the-art methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have answered to my questions in this rebuttal.




Author Feedback

We thank all the reviewers for their comments. We are sorry that additional experiments cannot be included in this response, following the rebuttal guidelines.

To Reviewer1:

  1. Framing the introduction by comparing with Mixup and CutMix: Sorry for the confusion in the introduction. This work follows the semi-supervised learning literature to make use of both labeled and unlabeled data. The last few lines of paragraph 1 highlight that consistency regularization based on only unlabeled images in most existing methods may be suboptimal due to possibly noisy pseudo labels. We do not claim that this is the first work to mix labeled and unlabeled images. Mixup or CutMix can be simply extended to improve consistency regularization. Nevertheless, Mixup overlays two images, resulting in ambiguous pixel semantics, while CutMix cuts a random patch and pastes it into another image at the same position. Our method instead shuffles patches randomly to change their relative positions, so that more meaningful cues can be learned by optimizing the harder consistency regularization objective.
  2. Impact of shuffling on complex objects spanning several patches: It is true that our method may split objects into different patches and introduce artefacts at patch boundaries in certain cases, but this does not degrade the overall performance. This is because the complete image is input to the teacher. The consistency constraint encourages the student network to learn to segment entire objects from a partial patch view, thereby improving its ability to extract local information (see the sketch after this list).
  3. Details about compared methods and the ablation study: Due to the page limit, details about the compared methods and datasets are omitted from the manuscript. ICT and BCP are two of the most closely related recent works, developed based on Mixup and CutMix respectively for consistency learning. Ablation studies are conducted on ACDC, one of the most widely used benchmarks for evaluation. Although it consists of only 100 3D volumes, thousands of 2D slices are extracted for the experiments, as in existing works such as BCP.
  4. Stacked patches as input and labeled/unlabeled Mixup: Because patches and images have different sizes, it is unsuitable to directly input patches to the student network, since the teacher, which takes full images, shares the same architecture as the student. Resizing patches to the image size significantly increases the computational complexity, as there are n*n patches per image. Thus, we do not use patches as input. Regarding the comparison with Mixup, our method surpasses ICT, which is based on Mixup of only unlabeled images. We will study labeled/unlabeled Mixup in future work.
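
To make responses 1 and 2 concrete, below is a rough, non-authoritative sketch of one labeled-unlabeled consistency step as described: the teacher sees complete images, the student sees the shuffled mix, and the teacher's pseudo-labels are shuffled with the same permutation so that supervision stays pixel-aligned. It reuses the hypothetical pair_shuffle helper sketched after the abstract; the cross-entropy-only objective and all names are simplifying assumptions (the paper also uses a Dice term), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def labeled_unlabeled_step(student: nn.Module, teacher: nn.Module,
                           x_l: torch.Tensor, y_l: torch.Tensor,
                           x_u: torch.Tensor, n: int = 4) -> torch.Tensor:
    """One L_{l-u} consistency step (illustrative sketch).

    x_l, x_u: (C, H, W) images; y_l: (H, W) integer ground-truth mask.
    """
    # The complete unlabeled image is input to the teacher to get pseudo-labels.
    with torch.no_grad():
        pseudo_u = teacher(x_u.unsqueeze(0)).argmax(dim=1)[0]      # (H, W)

    # Shuffle the labeled/unlabeled pair, then apply the same permutation to
    # the targets so every pixel keeps its ground-truth or pseudo label.
    x_m1, x_m2, perm = pair_shuffle(x_l, x_u, n)
    y_m1, y_m2, _ = pair_shuffle(y_l[None].float(), pseudo_u[None].float(),
                                 n, perm=perm)

    # The student segments the mixed images from a partial patch view, with
    # supervision that mixes ground truth and pseudo-labels patch by patch.
    logits = student(torch.stack([x_m1, x_m2]))                    # (2, K, H, W)
    targets = torch.cat([y_m1, y_m2]).long()                       # (2, H, W)
    return F.cross_entropy(logits, targets)
```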

To Reviewer3:

  1. Patch-based mixup without position shifting: One implementation similar to patch-based mixup without position shifting is CutMix, and our method outperforms the CutMix-based BCP. Due to the page limit, ablation studies along this line are not available in the manuscript but will be explored in the future. For more details on the rationale of our method, please refer to responses 1-2 to Reviewer1.
  2. Patch shuffle with more images: Thanks for this inspiring suggestion. Our future work will investigate patch shuffling with n>2 images.
  3. Dataset details: All datasets are 3D, and we convert them into 2D slices as in existing works based on 2D images. Except for BCP and CL on ACDC, all other results are our reproductions. We maintain the same protocol for the train/val/test and labeled/unlabeled partitions for a fair comparison.

To Reviewer4:

  1. Novelty compared to mixup-based methods: ClassMix and MixMatch are similar to Mixup and CutMix. Please refer to responses 1 and 3 to Reviewer1 and response 1 to Reviewer3 for details.
  2. Loss ablation: The PSC loss in Eq. 1 consists of only two terms, L_{l-u} (Eq. 4) and L_{u-u} (Eq. 5), with ablation results in Table 1. The four terms in Eq. 4 or Eq. 5 are the widely used Cross Entropy and Dice losses for a pair of images. We simply set equal weights without further ablation (see the illustrative sketch below).
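
As a hedged illustration of the "four terms" mentioned above, an equally weighted Cross Entropy plus soft Dice objective over the two images of a pair could look like the following; the soft Dice formulation and all names are assumptions rather than the paper's exact equations.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits: torch.Tensor, target: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """Multi-class soft Dice loss. logits: (N, K, H, W); target: (N, H, W) ints."""
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def pair_loss(logits_1, target_1, logits_2, target_2):
    """Equally weighted CE + Dice over both images of a shuffled pair
    (four terms in total, mirroring the description above)."""
    return (F.cross_entropy(logits_1, target_1) + soft_dice_loss(logits_1, target_1)
            + F.cross_entropy(logits_2, target_2) + soft_dice_loss(logits_2, target_2))
```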




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The article introduces a novel semi-supervised approach for medical image segmentation using a teacher-student framework combined with a patch-based shuffling strategy to mix labeled and unlabeled images. The student model undergoes supervised training on labeled examples and semi-supervised training with pseudo-labels on unlabeled examples. The teacher model is updated through Exponential Moving Average (EMA) updates of the student model. This method is evaluated on three image segmentation datasets (CT and MRI, in a 2D setting) and compared with several existing semi-supervised segmentation methods.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The article introduces a novel semi-supervised method for medical image segmentation using a teacher-student framework and patch-based shuffling of labeled and unlabeled images. This approach aims to enhance segmentation accuracy by leveraging both ground truth and pseudo-labels. Evaluated on three datasets, it outperforms existing methods. While well-written and methodologically sound, there are concerns about potential artifacts at patch boundaries and the limited scope of the validation datasets. The method’s novelty is noted, but it needs more comprehensive comparisons and a broader ablation study to confirm its robustness across complex tasks. Overall, it shows great promise.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



