Abstract

Retinal image segmentation plays a critical role in the rapid and early detection of disease, for example by assisting in the observation of abnormal structures and in structural quantification. However, acquiring semantic segmentation labels is both expensive and time-consuming. To improve label utilization efficiency in semantic segmentation models, we propose Diffusion-Enhanced Transformation Consistency Learning (termed DiffTCL), a semi-supervised segmentation approach. Initially, the model undergoes self-supervised diffusion pre-training, establishing a reasonable initial model that improves the accuracy of early pseudo-labels in the subsequent consistency training, thereby preventing error accumulation. Furthermore, we develop a Transformation Consistency Learning (TCL) method for retinal images that effectively utilizes unlabeled data. In TCL, the prediction for an affine-transformed image acts as supervision for both elastic and pixel-level transformations of that image. We carry out evaluations on the REFUGE2 and MS datasets, covering segmentation in two modalities: optic disc/cup segmentation in color fundus photography and layer segmentation in optical coherence tomography. The results for the two tasks demonstrate that DiffTCL achieves relative improvements of 5.0% and 2.3%, respectively, over other state-of-the-art semi-supervised methods. The code is available at: https://github.com/lixiang007666/DiffTCL.
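As a rough illustration of the TCL idea described above, here is a minimal, hypothetical sketch (PyTorch-style; `model`, `affine`, `elastic`, and `pixel_aug` are assumed placeholders for a segmentation network and the three transformation families, not the authors' implementation):

    import torch
    import torch.nn.functional as F

    def tcl_consistency_loss(model, x, affine, elastic, pixel_aug):
        # Hypothetical sketch, not the authors' code. The transformation
        # callables are assumed to have their random parameters sampled
        # once beforehand, so repeated calls apply identical transforms.
        x_aff = affine(x)
        with torch.no_grad():
            # Teacher branch: the prediction on the affine view serves
            # as the pseudo-label (gradients are stopped).
            pseudo = model(x_aff).softmax(dim=1)      # [N, C, H, W]
        pred_elastic = model(elastic(x_aff))          # student 1: elastic warp
        pred_pixel = model(pixel_aug(x_aff))          # student 2: photometric
        # Warp the pseudo-label with the same elastic field so it stays
        # aligned with pred_elastic; photometric transforms need no alignment.
        loss = F.cross_entropy(pred_elastic, elastic(pseudo).argmax(dim=1)) \
             + F.cross_entropy(pred_pixel, pseudo.argmax(dim=1))
        return loss

Per the abstract, the affine branch acts as the teacher while the elastic and pixel-level branches are treated as students; only the geometric (elastic) branch requires re-aligning the pseudo-label.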

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1467_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/lixiang007666/DiffTCL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_DiffusionEnhanced_MICCAI2024,
        author = { Li, Xiang and Fang, Huihui and Liu, Mingsi and Xu, Yanwu and Duan, Lixin},
        title = { { Diffusion-Enhanced Transformation Consistency Learning for Retinal Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel semi-supervised segmentation method named Diffusion-Enhanced Transformation Consistency Learning (DiffTCL). A transformation-based consistency loss is proposed to utilize unlabeled data, and self-supervised diffusion pre-training is further proposed to enhance model learning. DiffTCL has been evaluated on datasets of two modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) A new transformation-based consistency loss is proposed to utilize unlabeled data. (2) A new denoising pre-training strategy is proposed. (3) The description of the method is very clear. (4) The comparison and ablation experiments are sufficient.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The technical contributions need to be further clarified. The main contributions of the authors are the denoising pre-training and transformation-based consistency learning. However, some previous studies in the literature have already involved self-training or self-supervised strategies [1][2], and there are also studies employing transformation consistency [3][4]. These are not discussed in the manuscript.

    [1] R. Zhang, et al., “Self-supervised correction learning for semi-supervised biomedical image segmentation,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, 2021, pp. 134–144.
    [2] L. Yang, et al., “ST++: Make self-training work better for semi-supervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4268–4277.
    [3] X. Li, et al., “Transformation-consistent self-ensembling model for semisupervised medical image segmentation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 523–534, 2020.
    [4] Z. Zhao, et al., “Augmentation matters: A simple-yet-effective approach to semi-supervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11350–11359.

    (2) The motivation for introducing self-supervised diffusion pre-training and Transformation Consistency Learning is unclear. These techniques appear to be general rather than specifically designed for retinal images. Many pre-training strategies have been proposed, yet the authors do not explain why denoising pre-training in particular is necessary. The motivation for using transformation consistency learning likewise needs to be clarified.

    (3) The theory of Diffusion-Enhanced pre-training may be incorrect. The authors claim that “DDPMs indicate that predicting ε is more effective than predicting x, leading us to adopt this strategy as well”. However, DDPM uses a multi-step noising process: it is hard to predict all of the accumulated noise at once, so predicting the noise added at each step is easier. The proposed Diffusion-Enhanced pre-training has only a single noising step, so there may be no theoretical difference between predicting the added noise and predicting the original image; I am not sure the proposed strategy differs from simple denoising. The authors also claim that “traditional denoising autoencoders are generally designed to eliminate Gaussian noise of constant variance” whereas DDPM uses “noise originating from a Gaussian distribution of varying variances”. However, the designed pre-training appears to use Gaussian noise of constant variance, just like a traditional denoising autoencoder.
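    For reference, in the standard DDPM formulation (Ho et al., 2020; not the paper under review), the forward process has the closed form

    \[
      q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),
      \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s),
    \]

    i.e., $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0,\mathbf{I})$. Because $t$ is sampled per training example, the effective noise variance $1-\bar{\alpha}_t$ varies across samples. With a single fixed noise level, predicting $\varepsilon$ and predicting $x_0$ are related, given the input $x_t$, by an affine rescaling of the target, so the two objectives differ only in loss weighting.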

    (4) The effectiveness of numerous transformation operations involved in TCL is not verified.

    (5) The standard deviations are not reported.

    (6) Incorrect use of ophthalmic terminology: RNFL, GCIP, INL, OPL, ONL, IS, OS, and RPE refer to the layer surfaces, not the region names used in Table 2 and Fig. 3.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The main issues are inaccurate contribution statements and unclear technical motivations. The authors should carefully revise the relevant descriptions based on the suggestions above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the authors introduce or design some new methods and prove their effectiveness through experiments, the unclear motivations and technical errors should be addressed.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a semi-supervised approach for retinal image segmentation called DiffTCL, which comprises two stages: diffusion-enhanced pre-training and transformation consistency learning. Using DeepLabV3+ with a ResNet50 backbone as the baseline, DiffTCL, which combines pre-training and image transformation methods, is fundamentally a feature-enhancement method. The authors evaluate the approach on two retinal datasets, demonstrating 2.3%–5.0% better performance than existing baseline methods in the literature.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) This paper designs a two-stage image segmentation method. In the first stage, self-supervised diffusion pre-training is used to improve the accuracy of early pseudo-labels in TCL and avoid error accumulation. In the second stage, three image transformation methods are used to enhance the model’s robustness. (2) Although the paper focuses only on retinal image segmentation, the diffusion-enhanced pre-training and the three image transformation methods could be promoted as plugins in other image segmentation fields.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) While the model demonstrates 2.3%–5.0% better performance than the baseline methods reported in the literature, the methodological innovation appears limited. Additionally, the baseline (DeepLabV3+) used in this paper may not be universally applicable in medical image segmentation. Would basic baselines like U-Net and FCN be equally effective?

    (2) The experimental comparison with state-of-the-art methods is inadequate, especially on the REFUGE2 dataset, which is central to the REFUGE2 challenge and has numerous comparative methods. Most of the methods compared in this paper are not specifically retinal image segmentation methods; in particular, probabilistic diffusion model methods, which contribute significantly to this paper, are missing from the comparison.

    (3) Figure 1 needs a fundamental explanation of the model. As the core and focal point of the paper, it requires clarification to improve readability and comprehensibility.

    (4) In Figure 1, there is a discrepancy between the figure and the text: the main text applies the three image transformation methods to the model simultaneously, whereas the diagram illustrates a branching structure.

    (5) The citation format is incorrect. Please arrange citations in the order in which they appear in the text and adhere strictly to the MICCAI citation format guidelines.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (Same as the main weaknesses listed above.)

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The model design is reasonable and has generalizability.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors contribute a new method for retinal segmentation that can be used for semi-supervised learning, achieving SOTA performance. The method is novel, building on diffusion techniques and various transformations to learn consistency among model predictions given the same input.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has the following strengths: a) The idea of using diffusion for pre-training is interesting and effective. This is important for retinal images, where the objects to segment are relatively small and suffer from noise around structures.

    b) To leverage unlabeled data, the authors suggest transformation consistency learning, where different output segmentations of the same image are aligned so that discrepancies can be penalized.

    c) The paper conducts diverse experiments, both self-supervised and semi-supervised, compared with several baselines, including some of the latest ones like UniMatch[23]. The performance is good and consistent across settings. The author also provides ablation studies for important components and visualization results.

    d) The paper is well written and easy to follow. Code is provided anonymously.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Generally, the paper is good work and interesting to the MICCAI community. I have no significant concerns.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Some extensions that might strengthen the paper include (a) pre-training the diffusion model with multi-modal data, which could help the model adapt better under domain adaptation settings, and (b) leveraging current foundation models trained on medical images, which could further improve performance with less effort, e.g., using [1,2].

    [1] Nguyen, Duy M. H., et al. “LVM-Med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching.” NeurIPS, 2023.
    [2] Zhou, Yukun, et al. “A foundation model for generalizable disease detection from retinal images.” Nature, 2023.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My decision is based on the several strengths listed above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thank you for your constructive comments and recognition of our strengths, such as “DiffTCL can be promoted as a plugin” (R1), “sufficient ablation studies” (R3), and “the proposed method is novel” (R4).

To Reviewer1: Our method is not tied to a particular baseline and is effective with any backbone. We will compare more methods specifically designed for retinal image segmentation in future work. Additionally, we will enhance the readability of Fig. 1 and correct the citation format.

To Reviewer3: Thank you for your suggestion. We will clarify the technical contributions and motivations in the revised version. Additionally, we will rewrite the theoretical section on diffusion-enhanced pre-training to make it easier to understand. Finally, we will correct the OCT layer region names in Table 2 and Fig. 3.

To Reviewer4: We will explore extensions such as pre-training diffusion with multi-modal data and using foundation models trained on medical images as a base.

Thank you once again to the reviewers for taking the time to review our manuscript, recognizing its value, and providing suggestions that are highly instructive for our future research.




Meta-Review

Meta-review not available, early accepted paper.


