Abstract

Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our method (“SegGuidedDiff”) to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that our method reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space. SegGuidedDiff has many applications, including cross-modality translation, and the generation of paired or counterfactual data. Our code is available at https://github.com/mazurowski-lab/segmentation-guided-diffusion.
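The abstract's core idea, conditioning each denoising step on a segmentation mask, can be sketched as follows. This is a simplified, assumed DDPM-style loop, not the paper's exact sampler: `denoiser` is a hypothetical stand-in for the trained network, and the update rule is a toy simplification of the true reverse-diffusion math.

```python
import torch

@torch.no_grad()
def sample_with_mask(denoiser, mask: torch.Tensor, timesteps: int = 50) -> torch.Tensor:
    """Generate an image guided by `mask` (shape (B, 1, H, W)) at every step."""
    b, _, h, w = mask.shape
    x = torch.randn(b, 1, h, w)  # start from pure noise
    for t in reversed(range(timesteps)):
        # the mask is concatenated to the noisy image before every denoiser call,
        # so anatomical guidance is applied at each sampling step
        eps = denoiser(torch.cat([x, mask], dim=1), t)
        x = x - eps / timesteps  # toy update, not the true DDPM posterior step
    return x
```

The key point is only the conditioning mechanism: the mask rides along as an extra input channel on every call, rather than being applied once at the start.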

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0584_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0584_supp.pdf

Link to the Code Repository

https://github.com/mazurowski-lab/segmentation-guided-diffusion

Link to the Dataset(s)

https://www.cancerimagingarchive.net/collection/duke-breast-cancer-mri/

https://www.cancerimagingarchive.net/collection/ct-org/

BibTex

@InProceedings{Kon_AnatomicallyControllable_MICCAI2024,
        author = { Konz, Nicholas and Chen, Yuwen and Dong, Haoyu and Mazurowski, Maciej A.},
        title = { { Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper describes a mask-conditional approach to medical image generation. A diffusion model is trained to generate images conditional on masks, but the training process (mask-ablated training) considers that certain mask labels may be missing.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed mask-ablated training with mask dropout is novel and can be applicable to a variety of medical imaging tasks.

    Authors report results on two public datasets (breast MRI and CT Organ) and compare the proposed approach to several existing techniques. While visual comparisons (and Fréchet Inception Distance) don’t reveal significant differences, results on patient datasets (Table 1) show that when applied within the domain of the dataset, the method can generate images more faithful to the original masks, by a large margin. However, the authors should clearly note that these results are obtained by stratifying an existing dataset, and performance may be lower when the model is applied to another dataset (out of domain).

    An analysis of real vs synthetic data for segmentation model training is reported in Table 2, demonstrating that replacing the real data with synthetic data generated using the proposed method does not result in a large performance drop.

    Finally, various ablation studies are reported: image generation with mask classes removed (Figure 3) and image interpolation (Figure 4).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Obtaining segmentations in medical imaging is very time-consuming and often requires specialized expertise, so there are significantly fewer datasets that contain segmentations vs image-level labels. The authors did not describe how they envision their approach will be used in practice.

    Results in Tables 1 and 2 are reported within the same datasets and may not generalize to other images. In addition, the authors should report uncertainty (e.g., standard deviation) along with mean result values.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors include details of the algorithm and implementation in the manuscript and supplementary material and promise to release the code upon acceptance. The approach should be easily reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please clarify how the proposed method will be used in practice (see weaknesses).

    Performance should also be reported with and without mask ablated training to demonstrate the effect of this step.

    Please also clarify how results will be affected by imprecise masks delineations.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach is compelling, but practical utility may be limited, and the reported results may not generalize.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The rebuttal addressed my concerns.



Review #2

  • Please describe the contribution of the paper

    This paper proposes Seg-Diff, a diffusion model-based approach, for anatomically-controllable medical image generation. The authors introduce a random mask ablation training strategy, enabling the model to generate images from masks representing selected anatomical constraints while preserving flexibility in other areas. Experimental results on breast MRI and abdominal/neck-to-pelvis CT datasets demonstrate superior fidelity to input anatomical masks compared to existing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors introduce an innovative random mask ablation training strategy, allowing the model to generate images even when certain classes are missing from the masks.
    2. The paper is well-written and easy to follow.
    3. The experiments are comprehensive, which strengthens the validity and generalizability of the proposed Seg-Diff.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. This paper lacks an ablation study. The proposed Mask-Ablated-Training essentially performs random dropout on anatomical structure masks. It is worth discussing whether there are better implementation methods for this operation.
    2. Whether Mask-Ablated-Training can be generalized to other methods for generating CT images based on masks is a worthwhile discussion. Addressing this aspect would significantly enhance the paper’s quality.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness section for further details. Additional discussions regarding the implementation and generalization capability of Mask-Ablated-Training are encouraged.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose an intriguing training strategy that allows the generative model to flexibly handle missing anatomical structure masks. However, this strategy appears somewhat naive, and the paper lacks comparisons with multiple strategies as well as validation of the method’s generalizability.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors’ rebuttal addressed my concerns satisfactorily. Therefore, I am inclined to accept the paper and have also increased my score.



Review #3

  • Please describe the contribution of the paper

    They propose a diffusion model-based method, “Seg-Diff,” for anatomically controllable medical image generation, conditioned on a multi-class anatomical segmentation mask. They introduce a random mask ablation training algorithm to enable conditioning on selected anatomical constraints while maintaining flexibility. Evaluation is done on breast MRI and abdominal CT datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and easy to follow. With the exception of a few points noted under ‘weaknesses’, the method is described in sufficient detail. Implementation details are provided in the supplementary material. The proposed mask-based training strategy allows more flexibility with respect to more combinations of segmentation labels. The outlook and possible applications of this work are nicely outlined in the conclusion.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The name “Seg-Diff” is not optimal, as another method already has the same name: Amit, Tomer, et al. “Segdiff: Image segmentation with diffusion probabilistic models” arXiv preprint arXiv:2112.00390 (2021).

    2. In Section 1.1, try not to have line breaks in the middle of equations.

    3. In Section 1.2, the notation of the addition of an extra input channel is strange. I think you meant the transition from c to c+1 and not the other way around? Why is the segmentation mask not one-hot encoded (i.e., C channels for C segmentation classes)? Furthermore, it needs to be stated that noise is only added to the image channels, not to the segmentation channel.

    4. In Section 2, the authors state that “All generative models are trained on the training sets, and the auxiliary segmentation network, introduced next, is trained on the held-out training sets.” What is meant by held-out training sets?

    5. Regarding the evaluation metrics, it would still be nice to report the FID scores for all comparing methods on both datasets, to get an idea of image quality. Furthermore, one could report MSE and SSIM for a hold-out test set, where the segmentation masks are provided.

    6. In Tables 1 and 2, the proposed method is called “ours”. Is it the MAT or the STD implementation?

    7. The proposed interpolation in the latent space is not new, it was already introduced in “Dhariwal, Prafulla, and Alexander Nichol. “Diffusion models beat gans on image synthesis.” Advances in neural information processing systems 34 (2021): 8780-8794.”. In this sense, Figure 4 does not add any novelty to this work.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please add the comments listed under “weaknesses”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well motivated and proposes a flexible method for anatomically-controlled image generation. Apart from some issues pointed out, the method is described in enough detail, and evaluated on two distinct datasets.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their helpful feedback. We will address your comments as follows and pledge to update the camera-ready version accordingly if the paper is accepted.

On our method’s practical uses (for R3): one is that it can constrain specific organs/objects to have uncommon characteristics in generated images (e.g., an unusually large liver in an abdominal CT), or similarly generate counterfactuals by modifying one organ while keeping others fixed. This could be used to controllably generate synthetic data to augment a training set for some downstream task, e.g., to generate more of certain rarer cases to mitigate dataset imbalance. Another related usage is cross-modality anatomy translation: given images+masks from one scanner sequence (e.g., T2 MRI), you can train our model on these images+masks, and then use masks from images of another sequence (e.g., T1 MRI) to create new T2 images from the T1 masks.

Regarding R3’s concerns about our model’s generalizability to other data, we note that generating images from segmentation masks of new datasets within the same modality should work in principle. Segmentation masks capture intrinsic anatomical content, the characteristics of which should not be significantly affected by any distribution shift due to new image acquisition sites or datasets, so such new data should still be usable by our model. Similarly, for their question about our model’s generalizability to imprecise mask delineations: our mask-ablated training (MAT) algorithm helps the model learn to handle cases of incomplete masks, so this learned flexibility should similarly make the model work with imprecise/imperfect masks. Finally, we compared performance with and without MAT in the paragraph right before Section 3.2.

Also, for R3: while we aren’t allowed to add new uncertainty results to the tables in MICCAI rebuttals, we saw little variation in Dice score between evaluation batches when we ran the experiments for the submitted paper, so we expect the uncertainties to be small.

R4 had two questions about our MAT strategy: (1) alternative implementations and (2) its usability for other mask-conditional generative models. For (1), MAT can be easily modified, such as using different Bernoulli probabilities in Algorithm 1 for varied mask class removal. For example, a lower value than 0.5 results in training masks with typically fewer removed mask classes, and probabilities can also be varied by class, such as using a low removal chance for the breast class in breast MRI, but a high one for dense tissue. For (2), MAT can be applied to any mask-conditional generative model (e.g., SPADE [17]) without modification by just modifying/ablating training masks according to Algorithm 1 during training. Overall, we designed MAT to be simple and interpretable to allow for such customizability and generalizability to any generative model.
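The per-class Bernoulli removal described above can be illustrated roughly as follows. This is a hypothetical sketch, not the paper's exact Algorithm 1: the helper name, the `keep_probs` dictionary, and the default probability of 0.5 are assumptions for illustration.

```python
import torch

def mask_ablated_training_step(mask: torch.Tensor,
                               keep_probs: dict,
                               background: int = 0) -> torch.Tensor:
    """Randomly remove classes from an integer label map `mask` (shape (H, W)).

    Each non-background class is kept with its Bernoulli probability in
    `keep_probs` (default 0.5); removed classes are set to background.
    """
    ablated = mask.clone()
    for cls in mask.unique().tolist():
        if cls == background:
            continue
        p_keep = keep_probs.get(cls, 0.5)
        # drop this class with probability 1 - p_keep
        if torch.rand(()) >= p_keep:
            ablated[ablated == cls] = background
    return ablated
```

Per-class probabilities then fall out naturally, e.g. `{breast_label: 0.95, dense_tissue_label: 0.3}` to rarely ablate the breast class but frequently ablate dense tissue, as the rebuttal suggests.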

For R1’s numbered minor weaknesses: for point (1), we’ll change the model name to “SegGuidedDiff”; we will fix (2) and explain (4) and (6) more clearly at camera-ready. For (3), we note that the network maps from c+1 to c channels because it takes in the noised image + the mask as input, and outputs the denoised image. We tried using one-hot mask encoding, but this makes the network scale poorly for large numbers of mask classes. For (5), we initially found that our model outperformed competing methods on both datasets by FID, but chose not to add this to the paper because FID uses features learned from natural images, and so may be inappropriate for medical images; we also can’t add new MSE or SSIM results, as this is not allowed in the MICCAI rebuttal. Finally, for (7), the novelty of the latent space interpolation is that it has not yet been explored for medical image models, especially anatomy-constrained ones, and so traversing this latent space has a new meaning: the fine-tuning of the anatomies specified by the input masks.
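The c+1 → c channel mapping discussed for point (3) can be sketched as follows. This is an assumed, minimal stand-in (a single convolution in place of the actual U-Net) just to show the input/output channel arithmetic; per the rebuttal, the mask occupies one extra integer-valued channel rather than a one-hot encoding, and noise is added to the image channels only.

```python
import torch
import torch.nn as nn

class MaskConditionedDenoiser(nn.Module):
    """Toy denoiser mapping c+1 input channels (noisy image + mask) to c output channels."""

    def __init__(self, image_channels: int = 1):
        super().__init__()
        # stand-in for the U-Net: c+1 -> c channels
        self.net = nn.Conv2d(image_channels + 1, image_channels,
                             kernel_size=3, padding=1)

    def forward(self, noisy_image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([noisy_image, mask], dim=1)  # (B, c+1, H, W)
        return self.net(x)                          # (B, c, H, W)
```

A single label-map channel keeps the input width fixed as the number of mask classes grows, which matches the scaling argument given against one-hot encoding.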




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces a novel approach to medical image synthesis that leverages a mask-ablated training strategy, enabling it to generate images conditioned on anatomical segmentation masks with the flexibility to handle missing labels. Initially, the reviews were mixed with one weak reject, but all reviewers shifted to accept post-rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


