Abstract

Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model’s stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1007_paper.pdf

SharedIt Link: https://rdcu.be/dZxer

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72111-3_62

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1007_supp.pdf

Link to the Code Repository

https://github.com/lin-tianyu/Stable-Diffusion-Seg

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Lin_Stable_MICCAI2024,
        author = { Lin, Tianyu and Chen, Zhiguang and Yan, Zhonghao and Yu, Weijiang and Zheng, Fudan},
        title = { { Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {656 -- 666}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. This paper proposes a segmentation framework called SDSeg, which is computationally friendly.
    2. This paper introduces a simple latent estimation loss and a concatenate latent fusion technique.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is rich in experiments, which account for nearly half of its length: comparison with state-of-the-art methods, comparison of computing resources and time efficiency, stability evaluation, and an ablation study.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. I am unsure about the third contribution and whether it counts as a valid contribution point. The “Trainable Vision Encoder” does not appear to be special.
    2. Many statements in the paper are vague (possibly because of the length limit), for example those on Concatenate Latent Fusion and the Trainable Vision Encoder.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Some adjustments may need to be made to the structure of the article. The contribution points of the article need to be reorganized. The core part of the method should reflect the core contribution of the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are plenty of experiments in this paper, and the experimental results also show the superiority of the method. However, this article does not seem to let the reader grasp its main point quickly in the introduction and method section.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thanks for the authors’ feedback. I have changed my rating to Weak Accept.



Review #2

  • Please describe the contribution of the paper

    The authors propose a new Stable Diffusion segmentation method for biomedical images with a single-step reverse process.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall a good paper: novel method, easy to read and follow, valid experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors could provide more motivation for their method.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall a good paper: novel method, easy to read and follow, valid experiments.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall a good paper: novel method, easy to read and follow, valid experiments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    SDSeg is built on top of latent diffusion models (LDMs). They proposed a straightforward approach to generate segmentation results on a single-step reverse process using a latent estimation loss. They condition the denoising U-Net based on the input image features taken from a vision encoder. They achieved SOTA results on different benchmarks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Their experiments and ablation studies are exhaustive, and the paper is well-motivated and easy to follow. They evaluated their method on three 2D segmentation datasets and two 3D datasets. The proposed approach is computationally efficient and has significantly better inference speed compared to other diffusion samplers like DDIM.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The abstract states that they “introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD),” however, there has already been some research on using LDMs for the medical segmentation task [1] with an architecture similar to the proposed one.

    • Lack of qualitative comparison.

    [1] Vu Quoc, Hung, et al. “LSegDiff: A Latent Diffusion Model for Medical Image Segmentation.” Proceedings of the 12th International Symposium on Information and Communication Technology. 2023.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please look at the weaknesses section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method improves computational efficiency and inference speed in segmentation tasks, tested across both 2D and 3D datasets. There are some weaknesses that can be resolved, but overall, the good performance improvements and comprehensive testing justify acceptance. They have also provided the code.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I increase my score. The authors addressed my concerns, but Reviewer #1 mentioned a valid point about the information on the “Trainable Vision Encoder.” I understand the page limit, but the authors could have included the figure of the encoder architecture in the supplementary file to make it clearer.




Author Feedback

Thanks for all the constructive comments.

Reviewer #1 Comment 1 —I’m not sure if the third contribution can count as a valid point. The mentioned “Trainable Vision Encoder” looks not special?

Reply: We want to clarify that “Trainable Vision Encoder” merits being counted as a valid contribution point for the following reasons:

  1. The proposed “vision encoder” is different from the vision encoders provided by LDM. In LDM, there are only “ClassEmbedder” and “SpatialRescaler”, neither of which has the appropriate network structure or output type for the segmentation task. Therefore, we had to build a novel vision encoder to learn semantic features for segmentation.
  2. The term “trainable” is essential because it differentiates our vision encoder from the original LDM design and brings over 10% Dice improvement to SDSeg.
  3. Due to the 8-page limit, we have not explored the choice of this vision encoder. Its specific architecture can be future work for us and other researchers to explore.

Reviewer #1 Comment 2 —Many of the statements in this article are vague (possibly limited by length), such as Concatenate Latent Fusion and Trainable Vision Encoder.

Reply: Concatenate Latent Fusion concatenates the noised latent representation with the image feature representation along the channel dimension. The Trainable Vision Encoder is a feature extractor that learns semantic features for segmentation from the original image used as the condition. Many statements are indeed limited by length; we have open-sourced the code and hope it provides the technical details.
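The channel-wise concatenation described in this reply can be illustrated with a minimal NumPy sketch. It is not taken from the SDSeg repository: the function name `concat_latent_fusion`, the `(batch, channels, height, width)` layout, and the tensor sizes are assumptions chosen only for illustration.

```python
import numpy as np


def concat_latent_fusion(noised_latent, image_features):
    """Concatenate two (batch, channels, height, width) arrays along channels.

    Hypothetical helper: the name and array layout are illustrative
    assumptions, not the SDSeg implementation.
    """
    same_batch = noised_latent.shape[0] == image_features.shape[0]
    same_spatial = noised_latent.shape[2:] == image_features.shape[2:]
    if not (same_batch and same_spatial):
        raise ValueError("batch and spatial dimensions must match")
    # axis=1 is the channel axis in the assumed (B, C, H, W) layout.
    return np.concatenate([noised_latent, image_features], axis=1)


rng = np.random.default_rng(0)
# Assumed sizes: 4-channel latents on a 32x32 grid, batch of 2.
noised_latent = rng.standard_normal((2, 4, 32, 32))
image_features = rng.standard_normal((2, 4, 32, 32))

fused = concat_latent_fusion(noised_latent, image_features)
print(fused.shape)  # (2, 8, 32, 32)
```

Under these assumptions, the denoising network would receive a single 8-channel input whose first four channels are the noised mask latent and whose last four are the image features, so one forward pass sees both.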

Reviewer #1 Comment 3 —Some adjustments may need to be made to the structure of the article. The contribution points of the article need to be reorganized.

Reply: For clarification, the 2nd contribution point refers to Sections 2.1 and 2.2; the 3rd contribution point refers to Section 2.3. Thank you for your valuable comment; we will follow this advice in our future paper writing.

Reviewer #3 Comment 1 —Authors could provide more motivations of their methods.

Reply: We would like to further explain the motivations of our method:

  1. Image-level diffusion segmentation methods are time-consuming and demand substantial computing resources. We therefore want to develop a latent-level diffusion segmentation method, which conducts the diffusion process in a relatively small latent space to reduce the time and computing resources needed.
  2. Also, diffusion models typically need dozens of reverse steps to generate a reasonable result. We choose to simplify the lengthy reverse process for acceleration.
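To make the second motivation concrete, the sketch below shows the standard DDPM identity that recovers a clean latent from a noised one in a single step once the noise is known. Whether SDSeg's latent estimation uses exactly this form is an assumption here; the function names, the value of `alpha_bar_t`, and the "oracle" noise predictor (which simply returns the true noise) are all hypothetical.

```python
import numpy as np


def add_noise(z0, noise, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(a) * z0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise


def estimate_clean_latent(z_t, predicted_noise, alpha_bar_t):
    """Invert the forward step in one shot, given a noise prediction."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_bar_t)


rng = np.random.default_rng(0)
z0 = rng.standard_normal((1, 4, 32, 32))  # assumed latent shape
noise = rng.standard_normal(z0.shape)
alpha_bar_t = 0.3                         # hypothetical cumulative schedule value

z_t = add_noise(z0, noise, alpha_bar_t)
# Oracle predictor: pass the true noise, standing in for a trained network.
z0_hat = estimate_clean_latent(z_t, noise, alpha_bar_t)
recovered = np.allclose(z0_hat, z0)
```

With a perfect noise prediction the clean latent is recovered exactly, so the quality of a single-step estimate depends entirely on how accurate the predicted noise is; a multi-step sampler instead spreads that error over many smaller updates.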

Reviewer #4 Comment 1 —The authors claimed to “introduce the first latent diffusion segmentation model” however, there has been already some research on using LDMs for the medical segmentation task [1] (LSegDiff) similar to the proposed architecture.

Reply: Possibly because that work [1] was contemporaneous with ours, and also due to our negligence, we did not find that [1] had used LDMs for medical segmentation when we were writing the paper. In addition, the main differences between SDSeg and LSegDiff are:

  1. LSegDiff is trained in two stages, whereas SDSeg is trained end-to-end with a frozen mask compression model.
  2. Compared to LSegDiff, SDSeg maintains a simple method design and needs no post-processing.
  3. Compared to LSegDiff, SDSeg was evaluated on more data modalities and against both diffusion and conventional segmentation methods.
  4. The inference speed of SDSeg is much faster than that of LSegDiff.
  5. Compared to LSegDiff, SDSeg needs only a single-step reverse process to generate the predicted latent representation.

Reviewer #4 Comment 2 —Lack of qualitative comparison.

Reply: Due to the page limit, we showcased only the quantitative comparison, which demonstrates the superiority of our method (Tables 2 and 3), and the most important qualitative results, such as visualizations of the latent representation and the reverse process (in the supplementary materials), to better explain our work. If allowed, we will add a qualitative comparison.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After rebuttal, all reviews are positive.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    After rebuttal, all reviews are positive.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a straightforward approach to generate segmentation results on a single-step reverse process using a latent estimation loss. After carefully reading the comments and the authors’ rebuttal, I will vote for acceptance, given the major concerns raised by reviewers were well-addressed.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    This paper proposes a straightforward approach to generate segmentation results on a single-step reverse process using a latent estimation loss. After carefully reading the comments and the authors’ rebuttal, I will vote for acceptance, given the major concerns raised by reviewers were well-addressed.


