Abstract

Semi-supervised medical image segmentation aims to leverage limited annotated data and abundant unlabeled data to perform accurate segmentation. However, existing semi-supervised methods are highly dependent on the quality of self-generated pseudo labels, which are prone to incorrect supervision and confirmation bias. Meanwhile, they are insufficient in capturing the label distributions in latent space and suffer from limited generalization to unlabeled data. To address these issues, we propose a Latent Diffusion Label Rectification Model (DiffRect) for semi-supervised medical image segmentation. DiffRect first utilizes a Label Context Calibration Module (LCC) to calibrate the biased relationships between classes by learning the category-wise correlation in pseudo labels, then applies a Latent Feature Rectification Module (LFR) in the latent space to formulate and align the pseudo label distributions of different levels via latent diffusion. It utilizes a denoising network to learn the consecutive coarse-to-fine and fine-to-precise distribution transportations. We evaluate DiffRect on three public datasets: ACDC, MS-CMRSEG 2019, and Decathlon Prostate. Experimental results demonstrate the effectiveness of DiffRect, e.g., it achieves an 82.40% Dice score on ACDC with only 1% of labeled scans available, outperforms the previous state of the art by 4.60% in Dice, and even rivals fully supervised performance. Code will be made publicly available.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3391_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3391_supp.pdf

Link to the Code Repository

https://github.com/CUHK-AIM-Group/DiffRect

Link to the Dataset(s)

https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html
https://zmiclab.github.io/zxh/0/mscmrseg19/
https://medicaldecathlon.com/

BibTex

@InProceedings{Liu_DiffRect_MICCAI2024,
        author = { Liu, Xinyu and Li, Wuyang and Yuan, Yixuan},
        title = { { DiffRect: Latent Diffusion Label Rectification for Semi-supervised Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a novel approach to address the challenges of incorrect supervision and limited generalization in semi-supervised medical image segmentation. Its approach to rectifying pseudo labels, leveraging latent diffusion for label refinement, and improving segmentation accuracy through category-wise correlation and latent feature rectification may advance the field of semi-supervised medical image segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. DiffRect tackles the issue of incorrect supervision by introducing a novel approach to rectify biased pseudo-labels. By leveraging a latent diffusion process, the model can refine pseudo labels towards more accurate ground truth labels.
    2. The proposed model includes a Latent Feature Rectification Module (LFR) that enables the refinement of latent features to align with ground truth labels. This process enhances the model’s ability to capture semantic information in the latent space.
    3. DiffRect utilizes category-wise correlation in pseudo labels to enhance the quality of supervision during training. By considering the relationships between different classes in the segmentation task, the model can better generalize to unseen data and improve overall segmentation accuracy.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Insufficient explanation of why consistent mask predictions do not fully grasp the semantic content.
    2. Lack of clarity on the calculation of the noise schedule variable.
    3. Inadequate justification for choosing a diffusion model over a Flow-based model.
    4. Absence of the calculation method for the Dice score.
    5. Lack of a concise overview or diagram of the framework.
    6. Missing formal problem definition and clarity on the model’s input and output.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Since the input and output of the model are not obvious, please provide a problem definition of this paper for the targeted setting (i.e., give a formal definition of the problem and the expected goal).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The manuscript suggests that merely achieving consistent mask predictions fails to capture the full semantic content within the latent space and neglects details about label distribution. A detailed explanation is necessary on page 2 to elucidate why this approach falls short.

    2. The method for calculating the noise schedule variable in Equation (1) lacks clarity. Providing explicit details on how this variable is determined would enhance understanding. Additionally, linking this calculation to the implementation of LRR with the “category-wise correlation” would improve the manuscript’s readability.

    3. The choice to utilize a diffusion model within the LFR module requires justification, particularly given that a Flow-based model, noted for its reversible nature, might also be suitable for refining label distributions. An explanation of the preference for a diffusion model over a Flow-based model would be valuable.

    4. Including the calculation of the Dice score within the manuscript would provide clarity on how this critical metric is computed.

    5. Given the manuscript features multiple modules and predefined models, a concise overview or diagram illustrating the framework would aid in understanding the model’s structure and function.

    6. The input and output of the model are not readily apparent. It would be beneficial to include a formal problem definition for the targeted setting, outlining the expected goals of the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I appreciate the innovative methods in this paper, like the DiffRect approach for addressing biased pseudo-labels and the Latent Feature Rectification Module (LFR) for improving accuracy. The paper’s advancements in refining pseudo labels and using category-wise correlations for better training supervision are notable contributions to the field. However, it is tempered by the need for more detailed justifications and explanations, particularly regarding the choice of a diffusion model over a Flow-based one and a clearer depiction of the methodology and framework.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I remain unconvinced by the author’s justification for preferring a diffusion model over a flow-based model. The rationale provided seems to be derived from empirical results rather than being grounded in a scientific motivation. Therefore, my opinion is reject.



Review #2

  • Please describe the contribution of the paper

    The paper introduces the DiffRect model, a novel approach for semi-supervised medical image segmentation that addresses the inherent issues in pseudo-label quality through a Latent Diffusion Label Rectification process. The model combines a Label Context Calibration Module (LCC) to adjust the biased class relationships in pseudo labels and a Latent Feature Rectification Module (LFR) to align label distributions in latent space. This approach claims to significantly outperform existing semi-supervised methods and even rivals fully supervised methods on specific metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The introduction of the LCC and LFR within the DiffRect framework represents a significant innovation in handling the quality of pseudo labels and their distribution in latent space, a known challenge in semi-supervised learning.
    • The model achieves impressive results on three public datasets, with a notable increase in Dice score compared to both semi-supervised and fully supervised baselines, demonstrating the effectiveness of the proposed method.
    • The thorough evaluation, including ablation studies, provides clear evidence of the efficacy of individual components of the DiffRect model, enhancing the credibility of the results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper does not discuss the potential limitations or future directions of the proposed model. Such discussions are critical for understanding the scope of the model’s applicability and for guiding future research efforts.
    • Details on the hardware used for training and the computational complexity of the model are missing, which are important for assessing the practical applicability of the method in real-world settings.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Elaboration on the choice of U-Net as the segmentation network would be beneficial. Discussing why U-Net was chosen over other potential architectures could provide deeper insight into the model design decisions.
    • Providing specifics about the hardware and computational requirements would help in evaluating the feasibility of implementing the DiffRect model in clinical environments where computational resources might be limited.
    • Consider underlining or otherwise highlighting the second-best results in your tables to facilitate easier comparison for the reader, enhancing the readability and interpretative value of the comparative analysis.
    • It is recommended to include or reference qualitative results within the main paper, not just in supplementary material. These results are crucial for readers to assess the practical performance implications of the proposed model.
    • A section discussing the limitations of your approach would provide a balanced view and help in setting realistic expectations for the application of your model. Additionally, outlining potential future research directions based on these limitations could be very insightful.
    • Ensure that any claims of statistical significance are supported by appropriate statistical tests and evidence. This will strengthen the credibility of the results presented.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s innovative approach to addressing the challenges of semi-supervised medical image segmentation through latent space manipulation and its performance improvements are compelling. However, the lack of a limitations discussion, detailed computational insights, and qualitative results within the main text are notable gaps. My recommendation is for acceptance, contingent upon the revision of the paper to include these missing elements. Addressing these issues would not only strengthen the paper’s contribution but also enhance its practical and academic relevance, providing a more complete and informative resource for the community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed all my questions and concerns. Therefore I vote for accepting this work.



Review #3

  • Please describe the contribution of the paper

    This article proposes a new diffusion model-based semi-supervised medical image segmentation method that models the transition process between predictions of different quality in the latent space. Experiments show that it obtains competitive results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • It is interesting to model the transition process between different perturbation predictions.
    • The experiments in the article show that the proposed method achieves good results, especially when using 1% of labeled data on the ACDC data.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The details and motivations of some modules are not clearly described, such as the RGB mapping process of SCS. Are colors assigned arbitrarily?
    • Although it is interesting to use diffusion models to model the transition process of sample predictions under different perturbations, it is not clear to me whether the transition can actually be modeled as the authors claim, especially between very similar predictions.
    • The author claims that both LCC and LFR have calibration or correction functions, but the article lacks evidence that the modules have a “correction” function. Comparisons between metrics do not prove that a module can correct errors.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) It is recommended to show the visualization of features before and after correction by the LCC and LFR modules to prove that the modules work as intended.
    2) The article does not mention how many iterations are needed for the diffusion process in the LFR module, and it lacks analysis of the training time, computing consumption, and memory usage of the proposed DiffRect method.
    3) What is the platform and configuration of DiffRect training?
    4) Should the cell to the right of S2W in Table 4 be LFR? Because S2W exists in LFR, right?
    5) Considering that the article involves the problem of confirmation bias rectification in semi-supervised medical segmentation, the introduction lacks a description of relevant literature, such as:
    [a] MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation, CVPR 2023
    [b] Error-Correcting Mean-Teacher: Corrections instead of consistency-targets applied to semi-supervised medical image segmentation, CIBM 2023
    [c] Self-supervised correction learning for semi-supervised biomedical image segmentation, MICCAI 2021

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The details of some modules are insufficient, but the article is generally clearly written and the experiments presented demonstrate that the method achieves competitive results. Although the idea overall is interesting, I also have some confusions and concerns about it. My main concern is whether the LCC and LFR modules really work as the authors claim. And it’s not clear to me whether using diffusion models to model distribution transformations between similar predictions (with different perturbations) is really effective. In addition, considering that the article does not analyze calculation consumption, training time, etc., I cannot give a higher score.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my concerns in rebuttal. I would like to retain my positive rating.




Author Feedback

We thank all reviewers for their invaluable comments and for acknowledging that our method is novel and effective. Our code will be released to ensure reproducibility. Common questions are answered first, followed by responses to individual comments.

R3&R4: Hardware and computational cost A: DiffRect is trained on a single NVIDIA RTX 4090 GPU (24 GB) and an Intel Xeon Gold 5418Y CPU. The training time, GPU memory consumption, and Dice of DiffRect, MCNetV2, and INCL are: DiffRect (0.24 s/iter, 5612 MB, 71.85% Dice); MCNetV2 (0.25 s/iter, 3582 MB, 49.92% Dice); INCL (0.16 s/iter, 19682 MB, 67.01% Dice). DiffRect shows superior performance with negligible training overhead. The low computational cost of DiffRect is attributed to (1) LFR being conducted in a low-dimensional latent space with a lightweight U-Net, and (2) the denoising process requiring only 10 steps to achieve favorable performance, resulting in shorter training times. During inference, only a U-Net is needed, so DiffRect incurs no extra inference cost.

R1&R3&R4: Figure/table clarity, qualitative results, more relevant works A: We will address the format and clarity issues in revision, and add discussion of relevant literature.

R1: Why consistent mask predictions do not fully grasp semantic content A: Latent-space representations contain complex semantic content and class-wise relations, which may not be reflected in the output masks. Moreover, consistency on masks may cause overfitting to a single mode of the mask distribution, neglecting modes or variations expressed in the latent space.

R1: Noise schedule and relation to category-wise correlation A: The noise schedule is $\bar{\alpha}_t = \prod_{i=1}^{t}(1-\beta_i)$, where $\beta_t$ is computed from a cosine scheduler $\beta_t = \cos\!\left(\frac{t+0.008}{1.008}\cdot\frac{\pi}{2}\right)^{2}$. It enables the model to learn underlying patterns and category-wise correlations in a progressive and robust manner.
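
For illustration, below is a minimal sketch of this schedule in Python. It is a direct transcription of the formula above; the normalization of the timestep to [0, 1], the clipping, and the function and variable names are assumptions rather than details taken from the paper.

```python
import numpy as np

def cosine_noise_schedule(num_steps: int, s: float = 0.008):
    """Transcription of the rebuttal's schedule:
    beta_t = cos((t + s) / (1 + s) * pi / 2) ** 2,  alpha_bar_t = prod_{i<=t} (1 - beta_i).
    Assumes t is the timestep normalized to [0, 1]. Note that common implementations
    (e.g., Nichol & Dhariwal, 2021) instead derive beta_t from ratios of consecutive
    alpha_bar values; the exact variant used in the paper may differ.
    """
    t = np.linspace(0.0, 1.0, num_steps)                      # normalized timesteps
    betas = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2    # cosine scheduler
    betas = np.clip(betas, 0.0, 0.999)                        # keep betas in a valid range
    alpha_bars = np.cumprod(1.0 - betas)                      # \bar{alpha}_t
    return betas, alpha_bars

# e.g., 10 denoising steps, as reported in the rebuttal
betas, alpha_bars = cosine_noise_schedule(num_steps=10)
```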

R1: Use diffusion model over flow-based model A: Although flow-based models can estimate data distributions like diffusion models, they are not suitable for medical image segmentation as they (1) use deterministic trajectories, which have limited expressiveness; (2) require invertible architectures with topological constraints that cannot adequately capture complex medical data distributions.

R1: Dice score A: $\mathrm{DS} = \frac{2\,|P \cap G|}{|P| + |G|}$, where $P$ and $G$ denote the predicted and ground-truth segmentation masks.
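
For completeness, a minimal sketch of this computation on binary masks; the averaging over foreground classes for multi-class labels is our assumption and is not stated in the rebuttal.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """DS = 2 * |P intersect G| / (|P| + |G|) for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

def mean_dice(pred_labels: np.ndarray, gt_labels: np.ndarray, num_classes: int) -> float:
    """Average Dice over foreground classes (class 0 treated as background)."""
    return float(np.mean([dice_score(pred_labels == c, gt_labels == c)
                          for c in range(1, num_classes)]))
```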

R3: Limitations and future directions A: Although DiffRect is efficient, it may be slow for extremely large inputs. It is also designed only for modeling data within the same domain. Future directions include exploring faster ODE solvers to improve sampling speed and investigating its robustness to out-of-distribution data.

R3: Choice of U-Net A: (1) For fair comparison, we follow [26][34] to use U-Net as the segmentation model. Due to the structured layout of medical images, both high-level semantics and low-level features are important. The skip connections in U-Net can integrate multi-level features, making it suitable for medical image segmentation. (2) Our method can feasibly integrate with other models, e.g., on UNeXt, our model also shows a higher Dice score on ACDC with 10% labels: Ours: 86.87%, FixMatch: 79.34%.

R3: Statistical significance A: The p-values of the improvement over INCL on ACDC and MS-CMRSEG19 are 0.0086 < 0.01 and 0.033 < 0.05, demonstrating statistical significance.
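
The rebuttal does not state which test produced these p-values. Purely as a hedged illustration, the sketch below assumes a paired test over per-case Dice scores (here a Wilcoxon signed-rank test); the data in it are synthetic and the test choice is an assumption.

```python
import numpy as np
from scipy import stats

# Synthetic per-case Dice scores for illustration; real values would come from the test set.
rng = np.random.default_rng(0)
dice_ours = rng.uniform(0.80, 0.95, size=20)
dice_baseline = dice_ours - rng.uniform(0.01, 0.05, size=20)

# Paired, non-parametric test over per-case scores.
statistic, p_value = stats.wilcoxon(dice_ours, dice_baseline)
print(f"Wilcoxon signed-rank p-value: {p_value:.4f}")
```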

R4: RGB mapping A: We maximize the color difference between each encoded category to avoid semantic confusion, e.g., on ACDC with three foreground classes, we use red/green/blue to encode them.
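
A minimal sketch of such a category-to-color encoding, assuming background maps to black and the three ACDC foreground classes map to pure red, green, and blue; the actual palette and mapping code in the paper may differ.

```python
import numpy as np

# Category-to-RGB palette with maximally separated colors for the foreground classes.
PALETTE = np.array([
    [0, 0, 0],      # 0: background (black)
    [255, 0, 0],    # 1: foreground class 1 (red)
    [0, 255, 0],    # 2: foreground class 2 (green)
    [0, 0, 255],    # 3: foreground class 3 (blue)
], dtype=np.uint8)

def encode_label_as_rgb(label_map: np.ndarray) -> np.ndarray:
    """Map an H x W integer label map to an H x W x 3 RGB image via the palette."""
    return PALETTE[label_map]
```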

R4: Module effect and feature visualization A: We visualize features before LCC, after LCC, and after LFR with t-SNE to validate the distribution transition of each module. The averaged class-wise variance decreases from 2.61 to 1.08 and then to 0.29, indicating that LCC and LFR effectively learn the underlying class distributions.
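
As a rough sketch of how such a class-wise variance could be computed, assuming `features` is an N x D array of latent vectors and `labels` holds the corresponding class indices; the 2-D t-SNE projection and this particular variance definition are assumptions about the exact protocol.

```python
import numpy as np
from sklearn.manifold import TSNE

def classwise_variance(features: np.ndarray, labels: np.ndarray) -> float:
    """Average per-class variance of 2-D t-SNE embeddings (lower = tighter class clusters)."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(features)
    per_class = [emb[labels == c].var(axis=0).mean() for c in np.unique(labels)]
    return float(np.mean(per_class))
```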

R4: Diffusion steps A: We ablate the number of diffusion steps in Fig. 1 of the supplementary material and set it to 10 for the best speed-accuracy trade-off. DiffRect is trained for 30k iterations, and training completes within 2 GPU hours.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


