Abstract

Acquiring a comprehensive segmentation map of the retinal image serves as the preliminary step in developing an interpretable diagnostic tool for retinopathy. However, the inherent complexity of retinal anatomical structures and lesions, along with data heterogeneity and annotation scarcity, poses challenges to the development of accurate and generalizable models. Denoising diffusion probabilistic models (DDPM) have recently shown promise in various medical image applications. In this paper, driven by the motivation to leverage a strong pre-trained DDPM, we introduce a novel framework, named DiffDGSS, to exploit the latent representations from diffusion models for Domain Generalizable Semantic Segmentation (DGSS). In particular, we demonstrate that the deterministic inversion of diffusion models yields robust representations that allow for strong out-of-domain generalization. Subsequently, we develop an adaptive semantic feature interpreter for projecting these representations into an accurate segmentation map. Extensive experiments across various tasks (retinal lesion and vessel segmentation) and settings (cross-domain and cross-modality) demonstrate the superiority of our DiffDGSS over state-of-the-art methods.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1173_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xie_DiffDGSS_MICCAI2024,
        author = { Xie, Yingpeng and Qu, Junlong and Xie, Hai and Wang, Tianfu and Lei, Baiying},
        title = { { DiffDGSS: Generalizable Retinal Image Segmentation with Deterministic Representation from Diffusion Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose DiffDGSS, a diffusion-based framework for domain-generalizable semantic segmentation. They exploit the latent representation of a DDIM, and propose a timestep-dependent feature interpreter. They evaluate their method on different datasets of retinal fundus images for vessel segmentation and lesion segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is very well written, and the formulations are easy to follow. The introduction is clear and motivates the work well. The authors implement a fair number of comparison methods to evaluate their method on different datasets. The small ablation study in Table 1 is nice, as are the cross-domain and cross-modality experiments in Table 2. Figure 1 gives a nice overview. In general, the idea of using diffusion models to extract good data representations is important for robust and generalizable models for medical downstream tasks. This paper explores this direction for the task of image segmentation on fundus images.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In equation 5 (top line) some bars are missing above the alphas. What is written is the prediction of x_{t-1}, not x_0.

    2. Section 2.3 is unclear. The transition from section 2.2 to section 2.3 is out of the blue. How are the multiscale representations F_t obtained, how are they defined? What is h and how do we extract it from the diffusion model? How is the adaptive feature interpreter trained? What is meant by “majority voting” during inference? Much more detail is needed in this section. A pseudo-algorithm can also help here.

    3. It is stated that deterministic inversion is chosen for this approach. In which part of the method is this described, how does it differ from probabilistic inversion, and how does it affect the output segmentation? More explanation is needed in section 2.

    4. What part of the method deals with domain generalizability? This is mentioned in the abstract, but what mechanism ensures better generalization performance? How does it adapt to new domains or modalities?

    5. In the ablation study in Table 1, the results are presented without deterministic inversion or adaptive interpreter. However, it is not clear what part of the method was omitted. Further explanation is needed. What did you change in the method to compare the deterministic vs. stochastic diffusion process?

    6. The following related work covers a similar topic and should also be mentioned in the discussion: Rousseau, Jérémy, et al. “Pre-training with Diffusion Models for Dental Radiography Segmentation”.

    7. In Figure 3, I think there is some confusion about the reconstruction quality of a DDIM. Even a completely untrained diffusion model is capable of reconstructing input images of any modality using DDIM inversion. As mentioned in Section 2.2, DDIM inversion is the forward and backward solution of an ODE. The only source of inaccuracy is the approximation of the solver. This is independent of how well the diffusion model is trained. In this sense, Figure 3 does not add any information about out-of-domain generalization.

    8. What scores are reported in Table 1 and 2?
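
    The DDIM inversion and reconstruction round trip discussed in item 7 can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the linear beta schedule, the function names, and the two stand-in noise predictors (`zero_eps`, `toy_eps`) are all hypothetical. Inversion evaluates the noise predictor at (x_t, t) to move to t+1, while reconstruction evaluates it at (x_{t+1}, t+1) to move back to t, so the round trip is exact only when those two evaluations agree.

```python
import numpy as np

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar_0..alpha_bar_T with alpha_bar_0 = 1 (assumed linear beta schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def ddim_step(x, eps, ab_from, ab_to):
    """Deterministic DDIM update (eta = 0) from noise level ab_from to ab_to."""
    x0_pred = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0_pred + np.sqrt(1.0 - ab_to) * eps

def invert(x0, eps_fn, alpha_bars):
    """Run the deterministic chain forward: x_0 -> x_T."""
    x = x0
    for t in range(len(alpha_bars) - 1):
        x = ddim_step(x, eps_fn(x, t), alpha_bars[t], alpha_bars[t + 1])
    return x

def reconstruct(x_T, eps_fn, alpha_bars):
    """Run the deterministic chain backward: x_T -> x_0."""
    x = x_T
    for t in range(len(alpha_bars) - 1, 0, -1):
        x = ddim_step(x, eps_fn(x, t), alpha_bars[t], alpha_bars[t - 1])
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))      # stand-in for an image
alpha_bars = make_alpha_bars(50)

# Hypothetical noise predictors standing in for epsilon_theta(x_t, t).
zero_eps = lambda x, t: np.zeros_like(x)   # identical output at every step
toy_eps = lambda x, t: 0.1 * np.tanh(x)    # output depends on the current state

recon_zero = reconstruct(invert(x0, zero_eps, alpha_bars), zero_eps, alpha_bars)
recon_toy = reconstruct(invert(x0, toy_eps, alpha_bars), toy_eps, alpha_bars)

err_zero = np.abs(recon_zero - x0).max()   # round-off only
err_toy = np.abs(recon_toy - x0).max()     # includes the discretisation mismatch
print(err_zero, err_toy)
```

    With `zero_eps` the round trip returns the input up to floating-point error, because the predictor output is the same at every step. With a state-dependent predictor such as `toy_eps`, the forward and backward evaluations differ, so any residual comes from the finite step size rather than from how well the predictor matches the data.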

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please comment on all the points listed under “weaknesses”. In general, I suggest that much more emphasis be placed on a detailed description of the method and a more detailed ablation study of each component to better assess the impact of the proposed approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the paper tackles the interesting challenge of extracting good and robust image representations to improve generalizability of downstream tasks, the method is not described in enough detail to assess its benefit.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a new segmentation method using diffusion models. It utilizes the latent codes from the learned representations in diffusion models to help the segmentation network. Results show strong performance on various datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The proposed method leverages the latent codes from the learned representations in diffusion models. (2) It also incorporates Mamba to boost segmentation performance. (3) Through extensive experiments, it achieves impressive performance metrics that surpass existing methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Lack of novelty: the integration of latent codes from diffusion models has been explored previously in various applications.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Availability should be stated.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should provide a clear differentiation of their method from existing techniques that also employ latent codes from diffusion models.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposed a new approach for segmenting retinal images, but the technique is not new and has been explored in various applications.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have provided detailed technical insights into their novel contributions, particularly the deterministic inversion for representation extraction and the enhanced domain generalizability of learned representations. This also demonstrated their novelty.



Review #3

  • Please describe the contribution of the paper

    The authors present DiffDGSS, a diffusion model framework that parameterises predictions of x_{0} produced via DDIM’s deterministic denoising scheme in order to produce representations that are useful for segmenting eye scans. They demonstrate that their approach outperforms existing methods for retinal image segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of this paper is the novel parameterisation of x_{0} prediction through DDIM’s deterministic denoising scheme via another neural network, G_{\theta}(x_{t}, t). This is a strength because it introduces a new form of representation learning within diffusion models, and thus may inspire future work using similar techniques. Another key strength is that the technique reliably outperforms existing methods on a variety of segmentation tasks, which further demonstrates that this particular form of representation learning in diffusion models is promising.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Certain aspects of how G_{\theta}(x_{t}, t) is learnt are unclear from the paper. Equation 5 states what it is, but do we minimise an MSE between the RHS of equation 5 and G_{\theta}(x_{t}, t)? Or is G_{\theta}(x_{t}, t) trained directly on the original x_{0}? It’s not clear from the paper. Also, the paper focuses a lot on DDIM inversion results, but the high fidelity of reconstructions from DDIM inversion is a well-known result, so it’s unclear why so much space is dedicated to it.

    The authors should have also compared the features produced by G_{\theta}(x_{t}, t) to those of the original denoising network \epsilon_{\theta}(x_{t}, t).
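
    For reference, the two quantities contrasted in the question above can be written in standard DDPM/DDIM notation. This is an assumption about notation, not a copy of the paper's equation 5: the first line is the usual forward noising relation, the second is the x_{0} estimate implied by a noise predictor, and the third is a direct x_{0} regression objective for G_{\theta}.

```latex
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I),
  \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \\
\hat{x}_0 &= \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \\
\mathcal{L}_{x_0} &= \mathbb{E}_{x_0,\epsilon,t}\,\bigl\| G_\theta(x_t, t) - x_0 \bigr\|_2^2
\end{aligned}
```

    The second line uses the cumulative product \bar{\alpha}_t in both square roots; replacing it with the per-step \alpha_t would describe a one-step quantity rather than an estimate of x_{0}.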

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper is overall solid due to the strong improvements over other methods and the elegant parameterisation of x_{0} predictions via an auxiliary network G_{\theta}(x_{t}, t). However, it would be interesting to see how the raw representations in \epsilon_{\theta}(x_{t}, t) compare. Also, there are a few other representation learning approaches for diffusion models which could have been acknowledged, such as the diffusion autoencoder framework (https://diff-ae.github.io/) and the H-Space representations framework (https://arxiv.org/abs/2210.10960). With that being said, it’s understandable that not everything gets a mention or is benchmarked against, given the length of the paper. However, I think less space could have been devoted to showing the effectiveness of DDIM inversion for reconstruction, and more space to delving into these other representation learning avenues.

    Below are comments on how the clarity of the paper can be improved.

    Paragraph 1 on page 2 should be changed to ‘very accurate labels for real images and not suffer from error-prone GAN inversions.’ instead of GANs inversions.

    Paragraph 2 on page 2: ‘comprehending the latent space of diffusion models is crucial but challenging’ instead of ‘latent spaces’.

    End of paragraph 2: ‘Motivated by this insight, in this paper, we delve into the intermediate representations that are derived from this process, with a particular focus on Domain-Generalizable Semantic Segmentation’ instead of ‘that derived from’.

    The first contribution bullet point on page 3 needs to be reworded to something along the lines of ‘We present DiffDGSS, an innovative representation-based algorithm’ as opposed to just ‘innovative representation-based.’

    Figure 3 has the cross-domain examples labelled as ‘cross momain’ instead of ‘cross domain’.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper produces a novel representation learning scheme for diffusion models which shows promise due to its solid results. This may be useful for other researchers working both in the representation learning space and segmentation space.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We are pleased that the reviewers find our DiffDGSS to be novel and overall solid (R4), the paper to be very well-written with formulations that are easy to follow (R1), and the experiments to be comprehensive and to demonstrate impressive performance (R1,R3,R4). Below is our feedback on the major weaknesses:

  1. Implementation Details (R1): Due to the page limit, some implementation details are only roughly depicted in Fig.1. In particular, the F_t are obtained from intermediate outputs of the UNet’s decoder, while h is the intermediate feature map processed in the adaptive segblock. The adaptive feature interpreter is trained with a cross-entropy loss to predict the true segmentation masks given the different F_t. The majority voting mechanism assigns each pixel to the class that is most frequently predicted across all the predicted masks derived from the different F_t. Complete code will be released upon acceptance.
  2. DDIM deterministic inversion (R1,R3): As a class of latent-variable models, diffusion models naturally yield the latent variables $x_{1:T}$ through their Markovian forward diffusion process. Noticing this property, Baranchuk et al.[1] suggested an approach to extract pixel-wise representations of the image for the segmentation task by denoising the latent variables at a specific timestep. However, owing to the stochastic nature of DDPM, the obtained representations may not correspond to the semantic information of the original image (the stochastic reverse diffusion gives a stochastic source sample in Fig.3), with this discrepancy becoming more pronounced at the later steps [1]. Therefore, we propose leveraging the intermediate representations that emerge during the deterministic inversion process.
  3. DDIM reconstruction and domain generalizability (R1,R3,R4): While it is theoretically possible for an untrained diffusion model to reconstruct arbitrary images through DDIM inversion, practical limitations arise from ODE approximation inaccuracies and discontinuities (T is finite). In particular, the inversion relies on \epsilon_\theta(x_{t-1}, t-1) approximating \epsilon_\theta(x_t, t) over small time steps, which assumes output continuity. Untrained or poor models lack this continuity, leading to reconstruction errors. Our key observation is that a DDPM pre-trained on retinal images reconstructs retinal images across different domains and modalities better than the ImageNet-pretrained model in Fig.3. Therefore, we hypothesize that the diffusion model has inherently learned domain-specific information of the unseen subspace via pretraining, and thus holds a certain generalizability along the deterministic inversion chain.
  4. Ablation study (R1): Our DiffDGSS builds upon DDPM-Seg [1], which employs an ensemble of MLP classifiers to explore the semantic linear separability of representations derived from the stochastic reverse diffusion process, i.e., DiffDGSS = DDPM-Seg + Deterministic Inversion + Adaptive Interpreter. The scores reported in Tables 1 and 2 are introduced in Sec.3 Metrics.
  5. Lack of novelty (R3): Our approach introduces several novel contributions that distinguish it from prior work [1]: 1) deterministic inversion for representation extraction; 2) an adaptive feature interpreter; 3) a focus on the domain generalizability of the learned representations.
  6. Representation of G_{\theta}(x_{t}, t) (R4): G_{\theta}(x_{t}, t) can be viewed as a denoising autoencoder (DAE) with varying denoising scales (it minimises the MSE between G_{\theta}(x_{t}, t) and x_0). For the raw representations in \epsilon_{\theta}(x_{t}, t), Baranchuk et al.[1] have conducted visualizations similar to our Fig.4. Comparing the two shows that our G_{\theta}(x_{t}, t), combined with DDIM inversion, retains more image features at different timesteps and blocks. More exploration and comparisons will be considered in future work.
  7. Content and grammar issues (R1,R4): We thank all the reviewers; the final version will be revised accordingly.
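
The majority voting described in item 1 can be illustrated with a small NumPy sketch. This is a hedged example under assumptions, not the authors' code: the function name `majority_vote`, the (K, H, W) layout of the stacked per-F_t masks, and the tie-breaking rule (lowest class index wins) are all hypothetical choices made for illustration.

```python
import numpy as np

def majority_vote(masks, num_classes):
    """Fuse K predicted label maps of shape (K, H, W) into one (H, W) map.

    Each pixel gets the class predicted most often across the K masks.
    Ties go to the lowest class index (an assumption of this sketch).
    """
    masks = np.asarray(masks)
    # votes[c, i, j] = number of masks that predict class c at pixel (i, j)
    votes = np.stack([(masks == c).sum(axis=0) for c in range(num_classes)], axis=0)
    return votes.argmax(axis=0)

# Three hypothetical 2x2 masks, e.g. one per timestep-specific representation F_t.
masks = np.array([
    [[0, 1], [2, 2]],
    [[0, 1], [1, 2]],
    [[1, 1], [1, 0]],
])
fused = majority_vote(masks, num_classes=3)
print(fused)
```

In this toy input, the bottom-left pixel receives votes 2, 1, 1 and is therefore assigned class 1, while the bottom-right pixel receives 2, 2, 0 and is assigned class 2.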




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors sufficiently addressed all reviewers’ questions.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The authors sufficiently addressed all reviewers’ questions.


