Abstract

Diffusion models have been used extensively for high quality image and video generation tasks. In this paper, we propose a novel conditional diffusion model with spatial attention and latent embedding (cDAL) for medical image segmentation. In cDAL, a convolutional neural network (CNN) based discriminator is used at every time-step of the diffusion process to distinguish between the generated labels and the real ones. A spatial attention map is computed based on the features learned by the discriminator to help cDAL generate more accurate segmentation of discriminative regions in an input image. Additionally, we incorporated a random latent embedding into each layer of our model to significantly reduce the number of training and sampling time-steps, thereby making it much faster than other diffusion models for image segmentation. We applied cDAL on 3 publicly available medical image segmentation datasets (MoNuSeg, Chest X-ray and Hippocampus) and observed significant qualitative and quantitative improvements with higher Dice scores and mIoU over the state-of-the-art algorithms. The source code is publicly available at https://github.com/Hejrati/cDAL/.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3622_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3622_supp.pdf

Link to the Code Repository

https://github.com/Hejrati/cDAL/

Link to the Dataset(s)

https://monuseg.grand-challenge.org/
https://www.kaggle.com/code/nikhilpandey360/lung-segmentation-from-chest-x-ray-dataset
http://medicaldecathlon.com/

BibTex

@InProceedings{Hej_Conditional_MICCAI2024,
        author = { Hejrati, Behzad and Banerjee, Soumyanil and Glide-Hurst, Carri and Dong, Ming},
        title = { { Conditional diffusion model with spatial attention and latent embedding for medical image segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose the use of a diffusion model for image segmentation, conditioning the denoising process on the input image. On top of the base denoising-diffusion model, the authors add a discriminator to assist the denoising model in focusing on the challenging areas of the image. Finally, a random latent variable is introduced into the denoising model to enable the modelling of more complex, multi-modal denoising distributions (in contrast to the usual unimodal distributions of standard denoising models). The authors claim that this technique allows the denoising process to reach the data distribution in a very small number of steps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    S1. The paper addresses the challenging problem of image segmentation using denoising diffusion probabilistic models (DDPM). While previous literature in this area exists, it is scarce due to the lack of real-world problems that require multiple segmentation predictions per input image (thus justifying the problem of learning a complex conditional distribution) and the inherent challenges of applying DDPMs with discrete/categorical distributions.

    S2. The quantitative results are competitive, beating previous SOTA by small but consistent margins.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    W1. My first concern with this paper is conceptual. DDPMs’ main goal is to model complex, multi-modal distributions. However, the segmentation problems considered in this paper are deterministic, and therefore can be represented by simple unimodal distributions that can be learned by a standard segmentation model. The authors claim that “image segmentation with diffusion models is a challenging task due to the deterministic nature of image segmentation”. Learning a deterministic (or delta) distribution is not a big challenge for DDPMs. In fact, they can easily do so, as evidenced by the experimental results. The crucial question is whether there is any advantage to using powerful DDPMs, which are designed to learn complex distributions, to learn a deterministic process. In a way, using a DDPM to learn a deterministic process is similar to using a U-Net to learn a classification problem: it is not inherently wrong, but there is no clear benefit and a lot of unnecessary complexity. The same model that authors propose could be trained in a deterministic manner, without the DDPM training procedure, and achieve similar or potentially better results. In fact, as I will detail below, it is not clear to me that what the authors are proposing constitutes an actual DDPM. Many of the design decisions in this paper appear to be removing its capacity to learn complex distributions.

    Note that I’m not claiming that DDPMs cannot be used for image segmentation. Previous work (see [P1, P2] listed below) successfully applied DDPMs to this problem. However, they focus on particular problems where the input images feature inherent ambiguities and, as a consequence, the segmentation problem becomes stochastic, with multiple potential segmentation maps for the same input image. In this context, DDPMs make sense, at least conceptually.

    W2. One of the key challenges when using DDPMs for segmentation is how to deal with the categorical nature of the segmentation variables. [P1] and [P2] address this problem by modelling the conditional distributions of the DDPM as categorical distributions. It is unclear to me how the authors are addressing this problem in this paper. I assume that, following [11], classes are assigned a real value (0 and 1 in the case of binary segmentation) and then a standard DDPM with real variables and Gaussian distributions is used. However, this class-to-real-value assignment is completely arbitrary and introduces an order among labels that does not exist naturally. Why didn’t the authors use a DDPM with categorical distributions instead?

    Beyond the fact that using Gaussian distributions to model categorical variables is questionable, the specific assignment of labels to real values raises more concerns. First, the proposed model does not seem invariant to the chosen assignment. For example, given that label maps are modulated by the attention maps from the discriminator (as explained in section 2.2, more on this later), it is unclear to me how the denoising model distinguishes between pixels that were assigned the value 0 and pixels that were zeroed out by the attention map. Do the authors explicitly prevent the use of value 0 when mapping classes to real values?

    It is also unclear how the model x_\theta predicts the labels. Does it produce a single real value per pixel that is rounded to the nearest integer (i.e., regression)? Or does it produce a pixel-wise probability for each class (i.e., classification)? From the way the loss functions are defined, I assume it is the former, but this should be clarified.

    If the model x_\theta is a regressor, how does the proposed approach deal with multi-class problems? According to section 3.1, it seems one-hot encoding is applied, and each class is treated as a binary problem. Does this mean that a single pixel can be predicted as belonging to more than one class? What is the final class assigned in that case? Are the authors using the regressed values as probabilities to select a final class? This should also be clarified.
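    To make the question concrete, the following is a minimal sketch (my notation and names, not the authors' code) of one possible reading: one-hot ground-truth channels, a regression-style per-channel output, and a per-pixel argmax to resolve overlapping predictions.

```python
# Illustrative sketch only (not the paper's implementation): one-hot encoding of
# labels and one way a per-channel, regression-style output could be collapsed
# to a single class per pixel.
import numpy as np

rng = np.random.default_rng(0)
num_classes, H, W = 3, 4, 4
labels = rng.integers(0, num_classes, size=(H, W))

# One-hot encode the ground-truth labels: one binary channel per class.
one_hot = np.eye(num_classes)[labels].transpose(2, 0, 1)      # (C, H, W)

# Suppose the model regresses one real value per class channel; the simplest
# resolution of a pixel predicted in several channels is an argmax over channels.
pred = one_hot + 0.1 * rng.normal(size=(num_classes, H, W))   # fake prediction
final_class = pred.argmax(axis=0)                              # (H, W) class map
```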

    W3. Following [18], the authors propose introducing a so-called “discriminator” to improve the quality of the denoising model x_\theta. I have a number of concerns regarding this module:

    W3.1. The name “discriminator” might be a bit misleading. Indeed, as a discriminator, this module is trained to distinguish synthetic noisy samples from generated denoised samples. However, unlike actual discriminators, the output of this module is not used to guide the training of the generator x_\theta in a GAN-style training; only one of its feature maps is used to modulate its inputs.

    W3.2. The attention maps are used at training time to modulate the inputs to x_\theta. However, as the authors claim, the attention map “highlights the spatial regions in the labels which are essential to generate labels that are close to the ground-truth”. If we assume this is correct, the attention maps A_D will tend to zero as training progresses and x_\theta learns to produce realistic samples. Given that the authors are modulating x_0 with A_D to produce x_0^att, most of the signal present in x_0 will be progressively zeroed out. Therefore, the input x_t^att passed to the model x_\theta will contain mostly noise and almost no signal about the original x_0. How can x_\theta learn to take into account the information received via x_t^att if the actual information is destroyed when it is modulated by the attention maps? My guess here is that x_\theta ends up simply ignoring its input x_t and relies only on the image I to produce its output. That is, the distribution p(x_{t-1} | x_t, I) that x_\theta is learning to model actually becomes p(x_{t-1} | I), as most of the information in x_t is destroyed by modulation, and therefore x_\theta becomes simply a segmentation model. This idea is also supported by the following observation.

    W3.3. During training, the denoising model x_\theta is fed with samples x_t^att after modulation with the attention maps. At inference time, the discriminator is discarded and x_\theta is fed samples x_t without any modulation. The distribution of x_t^att at training time and the distribution of x_t at inference time are, therefore, markedly disparate. How does x_\theta manage to perform well at inference time when it receives vastly different inputs from the ones used at training time? Again, this reinforces my idea that x_\theta is in fact ignoring its input x_t and just using the image I to produce its output. In that case, the proposed approach would not be an actual DDPM, but just an iterated segmentation model.

    W3.4. The previous issues would have been solved if the discriminator had been used to train the model in GAN style, as done in [18]. Why didn’t the authors follow this straightforward approach? The proposed technique of input modulation is unusual, seems to follow a number of arbitrary design choices, raises concerns about its actual behavior, and does not seem to bring any advantage over the straightforward usage of a discriminator. The authors should clarify the rationale behind this decision.

    W3.5. The “discriminator” is used to compute the so-called “attention maps”. As in W3.1, this “attention” name is misleading, as no attention mechanism is carried out. The “attention map” is just the average of the channels of a selected feature map of the discriminator model and is, therefore, closer to the concept of “average feature map” or “activation map” than to “attention map”. I suggest also updating the name for clarity.
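    For concreteness, here is a minimal sketch (hypothetical names and shapes; not the authors' implementation) of the channel-averaged map and the input modulation discussed in W3.2 and W3.5:

```python
# Sketch of this reading of Sec. 2.2 (assumed, not taken from the paper's code):
# the "attention map" is a channel average of a discriminator feature map,
# rescaled to [0, 1] and used to modulate the noisy label x_t element-wise.
import torch

def channel_average_map(disc_features: torch.Tensor) -> torch.Tensor:
    a = disc_features.mean(dim=1, keepdim=True)           # (B, 1, H, W)
    a_min = a.amin(dim=(2, 3), keepdim=True)
    a_max = a.amax(dim=(2, 3), keepdim=True)
    return (a - a_min) / (a_max - a_min + 1e-8)            # rescale per sample

feats = torch.randn(2, 64, 32, 32)     # stand-in discriminator feature map
x_t = torch.randn(2, 1, 32, 32)        # noisy segmentation label at step t
x_t_att = channel_average_map(feats) * x_t  # if the map tends to 0, so does the label signal
```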

    W4.

    W4.1. Following [18], the authors also introduce a latent variable z in order to enable x_\theta to model more complex, multi-modal distributions p(x_{t-1} | x_t). As stated above, the paper deals with deterministic segmentation problems and, by definition, deterministic processes are unimodal. Therefore, it is unclear to me why enabling multi-modal distributions is necessary in this work, as unimodal distributions can be modeled using simple unimodal transition distributions p(x_{t-1} | x_t). The authors should clarify why this is necessary. In any case, I suspect no multi-modal distribution is actually being learned, as suggested by the following observation.

    W4.2. The latent variable z is introduced to allow for multimodal distributions. However, the authors then propose an MSE loss function (Eq. (6)) that marginalizes over the latent variable, promoting a unimodal distribution and thus rendering the latent variable useless. Do the authors take additional measures to enforce a multimodal distribution? It should be noted that simply passing noise as input to a model is not sufficient for the model to produce multimodal distributions (otherwise methods such as VAEs and GANs would not be necessary). Therefore, the training procedure must be adapted accordingly. For example, in [18] a GAN loss is used, which does not prevent a unimodal distribution (mode collapse can still occur), but at least does not encourage it. As noted in W3.4, the authors could have solved most of these problems by simply adopting the GAN-style training method proposed in [18].
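    The marginalization argument can be seen with a toy calculation (my own illustration, unrelated to the paper's data): when the target has two modes and the latent z is independent of the target, the MSE-optimal predictor outputs the mean of the modes rather than either mode.

```python
# Toy illustration of the argument above (not from the paper): with a bimodal
# target (0 or 1) and a latent z independent of the target, the least-squares
# (MSE-optimal) linear predictor ignores z and collapses to the target mean.
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([0.0, 1.0], size=100_000)   # two equally likely modes
z = rng.normal(size=100_000)               # random latent input

w = np.cov(z, y)[0, 1] / np.var(z)         # least-squares slope, approximately 0
b = y.mean() - w * z.mean()                # least-squares intercept, approximately 0.5
print(f"w = {w:.3f}, b = {b:.3f}")         # prediction sits between the modes, not on one
```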

    Considering the elements discussed in W3 and W4, it seems like the authors took interesting ideas from [18] but arbitrarily modified some components for some unclear reason. I suspect that these modifications limited the representational power of the DDPM in practice. However, since the model is used in the context of deterministic image segmentation, which does not require learning complex multimodal distributions, these limitations are not noticeable in the experimental results. This is also supported by the fact that the method is capable of producing good results in less than T=4 steps.

    [P1] Hoogeboom et al., Argmax flows and multinomial diffusion: Learning categorical distributions, Advances in Neural Information Processing Systems, 2021.
    [P2] Zbinden et al., Stochastic segmentation with conditional categorical diffusion models, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    First of all, I suggest the authors clarify the need to use DDPMs for deterministic processes (W1) or, alternatively, reframe the method within an alternative framework. Perhaps the whole method could be reformulated in terms of auto-context or as an “iterative segmentation model” that refines previously proposed segmentation maps instead of a DDPM. If the authors justify that DDPMs are indeed relevant for this task, I suggest that they clarify whether there is an advantage to using continuous vs. categorical DDPMs for the problem of segmentation, which is inherently categorical (W2). Additionally, it would be helpful to describe the methodology employed for the arbitrary class-to-value assignment in a multiclass context, the manner in which the model x_\theta predicts the classes, and the measures implemented to mitigate potential undesired effects, given that the model does not appear to be invariant to this arbitrary assignment. Including categorical DDPMs [P1, P2] as baselines in the experiments would also provide additional support in the comparison of continuous vs. categorical DDPMs.

    As secondary suggestions, it would be beneficial for the paper to address the concerns regarding the discriminator and the latent embedding (W3, W4). Why did the authors choose to move away from the standard GAN-style training proposed by [18]? How does modulation by the attention maps not force the model to simply ignore x_t? Why does the model perform well at inference time, given that it receives very different input distributions during training and inference? Why is a latent variable (or even DDPMs) necessary if the paper is dealing with deterministic processes? How does the MSE loss enforce a multi-modal distribution instead of mode collapse?

    Finally, I also suggest that authors consider alternative terminology for the “discriminator” module and, in particular, for the “attention” maps, in order to avoid confusion with GAN-style training and attention models, respectively.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While attaining SOTA performance is a noteworthy accomplishment, I believe that there are several significant conceptual and methodological concerns that must be addressed before the paper can be published. In its current form, the paper does not adequately explain the ideas that led to this SOTA performance, which I find crucial. Clarifying these points would be of great utility from a theoretical standpoint and would also increase confidence in the reported experimental results.

    Nevertheless, I strongly encourage the authors to explore their approach further, as there may be a very valuable insight in their results. The performance demonstrated in the experimental section is commendable, and I believe that the proposed method, when properly analyzed and clarified, can be a significant contribution to the MICCAI community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision
    1. The authors agree that image segmentation is stochastic. However, the proposed method is still applied and evaluated as a deterministic model. Compare this to the evaluation protocol of categorical DDPMs, where the metrics compare distributions, not single predictions. Using DDPMs in the context of this work is still unjustified.

    2. Sect. 3.1 only describes one-hot encoding for more than 2 classes. For 2 classes, it is unclear. If no one-hot encoding is used, an arbitrary order is introduced between the two classes. But even with one-hot encoding there is an arbitrary assignment of discrete values (present/not present) to real values.

    Even accepting the suitability of real DDPMs to model categorical variables, categorical DDPMs are very relevant models designed for segmentation that should be compared as a baseline to demonstrate superiority.

    1. I didn’t claim that “discriminator” and “attention” are inappropriate terms, only that they are misleading. This was a minor suggestion. I concede that GANs are not suitable for discrete variables.

    2. I am very perplexed by the assertion that there is no difference between the distributions of the inputs x_t^{att} and x_t passed to the network during training and inference. According to the paper, they should be markedly different.

    3. The authors claim again that, by introducing z, “the denoising distribution becomes multimodal”, but fail to explain how this is possible when training is done with the MSE loss. The fact that Tab. 1 shows better performance when z is present is hard to explain theoretically and makes this result suspicious. If the authors can show how multimodal distributions can be learned using the MSE loss with a random z, this would constitute a remarkable contribution, as it would render models such as GANs, VAEs and DDPMs unnecessary.

    In conclusion, while the paper introduces interesting ideas, there are still many unclear points and missing important baselines that need to be addressed for it to be useful to the MICCAI community.



Review #2

  • Please describe the contribution of the paper
    1. The authors propose a nice conditional diffusion model with spatial attention and latent embedding for medical image segmentation.

    2. A CNN-based discriminator is used at each step of the diffusion process to distinguish between the generated labels and the real ones.

    3. A spatial attention map is computed based on the features learned by the discriminator to help cDAL generate more accurate segmentation of discriminative regions in an input image.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written. 2. The paper is well organized. 3. The message is clear.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This seems to be a powerful approach for image segmentation, but it has a few limitations that the authors need to address:

    1. The method can be computationally intensive, especially for large images or high-resolution datasets. The spatial attention mechanism adds additional computational overhead, which can slow down training and inference.

    2. Training the proposed model may require significant memory resources, especially when processing high-dimensional image data. This can pose challenges for training on GPUs with limited memory capacity.

    3. Like many deep learning models, the method’s performance can be sensitive to hyperparameters such as learning rate, batch size, and network architecture. Tuning these hyperparameters effectively can require significant trial and error.

    4. The method may struggle to generalize to unseen data or data from different distributions. If the training data is not representative of the entire population of images the model may encounter, it could fail to segment certain types of images accurately.

    5. The method may be sensitive to noise and artifacts present in the input images. Preprocessing steps or additional regularization techniques may be necessary to improve the model’s robustness to such issues. The authors may refer to the following papers while addressing this: “A Lightweight Neural Network with Multiscale Feature Enhancement for Liver CT Segmentation,” Scientific Reports, vol. 12, no. 14153, pp. 1-12, 2022; “Re-routing drugs to blood brain barrier: A comprehensive analysis of Machine Learning approaches with fingerprint amalgamation and data balancing,” IEEE Access, vol. 11, pp. 9890-9906, 2023; “Dense-PSP-UNet: A Neural Network for Fast Inference Liver Ultrasound Segmentation,” Computers in Biology and Medicine, vol. 153, pp. 106478, 2023.

    6. What is the risk of recurrence if the segmentation is not proper? The following paper can be referred to: “Risk Assessment of Computer-aided Diagnostic Software for Hepatic Resection,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 6, no. 6, pp. 667-677, 2022.

    7. Please mention the potential limitations of the paper.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors have presented a nice problem; however, they need to address the comments above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has proposed a novel method.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have improved the quality of the paper. Thus, should there be space to accommodate more papers, this can well be considered.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel conditional diffusion model for medical image segmentation which uses both a discriminator and a spatial attention map to guide the training of the diffusion model. The authors demonstrate the value of their proposed additions to the image segmentation diffusion model through an ablation study which proves that both the introduction of the latent embedding and spatial attention maps improve performance. More than that, the authors compare their proposed model with a number of SOTA segmentation models, including a diffusion-based model, and demonstrate superior performance on 3 medical image segmentation datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. One of the main contributions of this paper is the proposed overall image segmentation through diffusion architecture, which, to the best of my knowledge, is novel, and improves over the current SOTA, while also reducing the time-steps required during training and sampling.

    2. The experimental results show improvements both quantitatively and qualitatively.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. I think the authors could reduce the methodology section (specifically, Section 2.1 which only describes the method’s main contributions at the very end of the section), in favour of a lengthier discussion on the benefits of their proposed method from a clinical perspective.

    2. The authors should provide some information on inference times (specifically compared to SegDiff) to aid their justification that their proposed model is better, as their quantitative results are sometimes only marginally better. That being said, the paper could also benefit from a statistical test on whether their model is significantly better than the others.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The paper could benefit from a lengthier discussion of the main clinical contributions.

    2. Please provide inference times for your proposed model vs SegDiff.

    3. Please provide statistical tests on the reported improvements (Dice scores, etc.).
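    For item 3, a paired test over per-image scores is one standard option; below is a minimal sketch with random placeholder numbers (hypothetical, not results from the paper).

```python
# Hypothetical example of the statistical test requested above: a paired t-test
# over per-image Dice scores of two models evaluated on the same test images.
# The scores below are random placeholders, not values reported in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dice_model_a = rng.normal(0.80, 0.03, size=30)   # per-image Dice, model A
dice_model_b = rng.normal(0.78, 0.03, size=30)   # per-image Dice, model B

t_stat, p_value = stats.ttest_rel(dice_model_a, dice_model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```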

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is novel in terms of the proposed method, but I think at the moment it reads more as an independent technical / deep learning paper regardless of application. My only concern is that the improvements brought by this novel architecture may not benefit the MICCAI main audience. However, upon introducing more arguments as to why this method would be clinically more helpful than others, I think this paper should be accepted.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I appreciate the time spent by the authors of this paper to write a well-organised and thoughtful rebuttal, and I think the authors have answered all of my concerns. For this reason, I have changed my opinion to “5. Accept” from my original “4. Weak Accept”.




Author Feedback

We appreciate the reviewers’ comments & insightful suggestions.

R1: Advantage of DDPM for segmentation: R: There is inherent ambiguity in medical image segmentation, as the delineation of the same image differs among experts. Generally, the ground-truth (GT) label for an image is obtained by a consensus among experts. Here, we utilized the stochastic nature of DDPM to approximate this process & generate multiple predictions during inference, then take their mean & threshold to obtain MORE accurate masks compared to deterministic models such as U-Net (Table 2). We will add this discussion.

R1: Categorical nature of segmentation variables: R: For categorical distributions (Table 3), we use one-hot encoding for GT labels (Sec 3.1). Thus, we DID NOT introduce any order among labels. The prediction from our model is a probabilistic map for each channel of the GT labels, & we used thresholding to convert it to binary labels for each output channel, like standard segmentation models. For a given pixel, we considered the class with the maximum probability value. We will clarify this.

R1: Misleading “discriminator” term: R: As we use the “generated (fake)” & “real” noisy samples at each time step to train the discriminator to classify between real & fake samples, the term “discriminator” IS appropriate. GAN-based models are generally not used for segmentation, as adversarial learning is NOT well suited for segmentation tasks.

R1: Zero “attention” map: R: If the attention map had been 0, the discriminator would predict only fake, which is clearly not the case. We have also visualized the NON-ZERO attention map in the architecture figure, which shows dependency on both x_t & I. The attention maps A_D indicate the relative importance of pixels [20]; hence the term “attention” IS appropriate.

R1: Disparate inference distribution of x_t: R: We utilized the attention maps during training to learn a better mapping of the reverse denoising process & identify the importance of different parts of the segmentation labels (x). Inference just involves sampling from noise. Cross-validation results show there is NO distribution mismatch between training & inference.

R1: No multi-modal distribution learned with z: R: We introduced the latent variable z to reduce the training & inference time-steps. When larger steps are used, the reverse denoising distribution becomes multimodal [18]. Ablation studies (Table 1) show that omitting z leads to a significant performance drop when extremely few time steps are used.

R3: Computational cost: R: Our method is not computationally expensive, as in inference we remove the discriminator and perform sampling with a much smaller number of steps. Additionally, compared to SegDiff, we observe a 95% decrease in trainable parameters. We will clarify this.

R3: Hyperparameters & generalization to unseen data: R: We chose the standard hyperparameters used in the literature for diffusion models & performed an ablation study to choose the best setting. Cross-validations are performed to avoid overfitting.

R3: Preprocessing & risk of recurrence: R: We follow standard preprocessing steps such as normalization, random scaling & rotation. We didn’t consider the risk of recurrence but will add the discussion to the paper and add the suggested references.

R3: Limitations of our model: R: One limitation is that our model can only be applied to 2D slices; we are working on a 3D version. We will add this discussion.

R4: Clinical contributions: R: By introducing our diffusion model for medical image segmentation, we are modelling an important clinical process, as addressed in the first point to R1. We will update Sec 2.1.

R4: Inference time: R: The inference time for our model is 1 sec (T=4) vs 60 secs for SegDiff (T=100).

R4: Statistical tests: R: We had performed a statistical t-test before & found p-value < 0.0001 for all metrics in Table 2 & Table 3; therefore our results are statistically significant. We will update this.
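
The inference-time averaging described in the rebuttal (multiple stochastic predictions, mean, then threshold) can be sketched as follows; sample_mask is a hypothetical stand-in for one pass of the diffusion sampler, not a function from the released code.

```python
# Sketch of the ensembling described in the rebuttal (hypothetical helper names):
# draw several stochastic segmentations, average the probability maps, threshold.
import numpy as np

def ensemble_segmentation(sample_mask, image, n_samples=5, threshold=0.5):
    # sample_mask(image) is assumed to return one (H, W) probability map.
    probs = np.stack([sample_mask(image) for _ in range(n_samples)], axis=0)
    return (probs.mean(axis=0) >= threshold).astype(np.uint8)

# Toy usage with a random stand-in for the diffusion sampler.
rng = np.random.default_rng(0)
dummy_sampler = lambda img: rng.random(img.shape[:2])
mask = ensemble_segmentation(dummy_sampler, np.zeros((64, 64)))
```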




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received highly contrasting scores from the three reviewers. After carefully reading the rebuttal, the authors have tried their best to address the concerns of Reviewer #1. Although that reviewer was still not convinced, they may have a systematic bias against this method. The current form is of interest to the MICCAI community.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Accepts

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Accepts


