Abstract

Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. Code is available at https://github.com/Qiukunpeng/ADC.
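To make the described training recipe concrete, here is a minimal sketch of the dual-model distillation step as the abstract describes it: a frozen teacher conditioned on mask-image pairs supervises a mask-only student through noise-prediction alignment, with lesion pixels upweighted from the lesion-background ratio. All names, signatures, and the exact weighting scheme are illustrative assumptions rather than the authors' released code (see the repository above for the reference implementation).

```python
# Illustrative sketch only; `teacher`, `student`, and their conditioning
# interface are hypothetical stand-ins, not the authors' released code.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

def adaptive_weight_map(mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Pixel-wise weights from the lesion-to-background ratio.

    `mask` is a binary lesion mask (B, 1, H, W), assumed resized to latent
    resolution. Lesion pixels are upweighted by the background/lesion pixel
    ratio so small lesions are not drowned out -- one plausible reading of
    the paper's adaptive regularization, assumed here for illustration.
    """
    fg = mask.flatten(1).mean(dim=1).clamp(min=eps).view(-1, 1, 1, 1)
    w_lesion = (1.0 - fg) / fg
    return torch.where(mask > 0.5, w_lesion, torch.ones_like(mask))

def distillation_step(teacher, student, x0, mask, image_cond, scheduler, lam=1.0):
    """One training step: standard denoising loss + adaptive distillation loss."""
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.size(0),), device=x0.device)
    x_t = scheduler.add_noise(x0, noise, t)

    with torch.no_grad():  # teacher sees the mask-image pair
        eps_t = teacher(x_t, t, cond=torch.cat([mask, image_cond], dim=1))
    eps_s = student(x_t, t, cond=mask)  # student sees the mask only

    w = adaptive_weight_map(mask)
    loss_denoise = F.mse_loss(eps_s, noise)            # DDPM noise-prediction objective
    loss_distill = (w * (eps_s - eps_t) ** 2).mean()   # adaptive noise alignment
    return loss_denoise + lam * loss_distill

scheduler = DDPMScheduler(num_train_timesteps=1000)
```

At sampling time only `student` is invoked, conditioned on a mask alone, which is what makes the pipeline privacy-preserving: no real images are needed as conditions.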

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1831_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Qiukunpeng/ADC

Link to the Dataset(s)

N/A

BibTex

@InProceedings{QiuKun_Adaptively_MICCAI2025,
        author = { Qiu, Kunpeng and Zhou, Zhiying and Guo, Yongxin},
        title = { { Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {53--63}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce Adaptively Distilled ControlNet, a task-agnostic diffusion model designed to improve medical image synthesis by enabling precise lesion-mask alignment while maintaining privacy. The core idea is a dual-model distillation strategy where a teacher model (conditioned on mask-image pairs) regularizes a mask-only student model via predicted noise alignment in parameter space. An adaptive weight regularization further enhances lesion representation based on lesion-background ratios. During inference, only the student model is used.

    Evaluations on KiTS19 (CT) and Polyps datasets show that the synthesized images, when used with their corresponding real masks as synthetic training data, lead to state-of-the-art segmentation performance. TransUNet achieves improvements of 2.4% in mDice and 4.2% in mIoU on KiTS19, while SANet reports gains of 2.6% in mDice and 3.5% in mIoU on Polyps. In addition, qualitative comparisons further confirm that the proposed method outperforms existing models in synthetic image quality in terms of FID and CLIP-I scores.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Distillation Framework for Diffusion Models: The paper proposes a teacher-student paradigm, where the teacher (conditioned on mask-image pairs) regularizes the student (conditioned only on masks) through noise prediction alignment. During training, this learning paradigm enables rich supervision, whilst during inference, it enables privacy preservation.
    • Task-Agnostic and Privacy-Preserving Design: The method follows a task-agnostic principle, making it easily adaptable across different datasets and modalities without requiring major architectural changes.
    • Lesion-Mask Alignment: The proposed Adaptive Distillation Loss incorporates lesion-to-background ratios, which are computed from mask statistics. This idea addresses the imbalance between lesion and background regions and improves the learning of lesion-specific features in the student model.
    • Benchmarking and SOTA Results: The paper provides comprehensive quantitative and qualitative evaluations on multiple datasets (KiTS19, Polyps), segmentation baselines (TransUNet, nnUNet, SANet, Polyp-PVT), and metrics (mDice, mIoU, Accuracy, Recall, FID, CLIP-I). Especially, in segmentation tasks using synthetic datasets, the proposed model outperforms competing baselines for data generation, e.g. Copy-Paste, SinGAN, ArSDM, T2I-Adapter, ControlNet.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • FID Improvements: The reported FID scores, particularly on KiTS19, show limited improvement or are slightly worse than ControlNet (e.g., 70.786 vs. 69.240). Additional perceptual or task-specific metrics (e.g., SSIM, PSNR, or human perceptual studies) could further validate the image quality.
    • Comparison to Medical-Specific Diffusion Models: The reported results are based on comparisons against general-purpose generative baselines (e.g., SinGAN, T2I-Adapter, ControlNet). However, it does not compare to recent medical diffusion models (e.g., SegGuidedDiff) designed for lesion-aware synthesis. This analysis can be useful to further understand the benefits of the proposed approach.
    • Visual Artifacts in Generated Images: Although the proposed method shows quantitative improvements, qualitative results in Figures 2 and 3 might show artifacts in synthesized images. For instance, in Figure 2, the second row (Ours) contains “holes” or inconsistencies in tissue structure, especially at early training steps (e.g., 100–500 steps). Similarly, in Figure 3 (column d), the “Ours” samples display subtle structural artifacts or textural inconsistencies (e.g., Cases 088, 108, 163) that could affect clinical plausibility. A potential clinical validation or expert assessment could address this observation.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed Adaptively Distilled ControlNet introduces a distillation framework that adopts a teacher-student paradigm and a spatially adaptive loss based on lesion-to-background ratios. Designed to address privacy concerns in clinical settings, the model enables image synthesis using mask-only inputs at inference time. Experimental evaluations on two diverse datasets (KiTS19 and polyp-based datasets), across multiple segmentation backbones and metrics, demonstrate improved segmentation performance when using the generated synthetic data. However, missing comparisons against domain-specific medical diffusion models (e.g., SegGuidedDiff), the absence of clinical validation, visual artifacts in some generated samples, and FID results similar to the baselines could limit the overall impact of the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed the points that were raised.



Review #2

  • Please describe the contribution of the paper
    • The paper proposes a mask-controllable diffusion model based on the teacher-student paradigm, aiming to address the issues of inaccurate lesion-mask alignment and low training efficiency in medical image synthesis.
    • The paper designs a dual-model distillation architecture and combines it with an adaptive distillation loss to adjust the training weights of lesion and background regions dynamically.
    • Moreover, this framework is task-agnostic and can seamlessly adapt to multi-modal datasets. In the sampling stage, only the student model is required, which balances efficiency and privacy protection.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • An innovative distillation framework is adopted. Through the noise-prediction alignment mechanism between the teacher and student models, the supervision signal of the image-mask pair is implicitly integrated into the student model conditioned only on the mask, effectively addressing the blurred lesion-region alignment seen in traditional diffusion models.
    • The adaptive distillation loss adjusts the weights dynamically according to the ratio of lesion-background pixels, which specifically alleviates the severe class imbalance problem in medical images.
    • Experiments verify both the image quality of the generated data and its effectiveness for downstream tasks, and demonstrate the model's multi-modal generalization ability across multiple datasets of different modalities.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The computational cost and complexity of the training stage are not clear. Although the sampling stage is efficient, training requires optimizing both the teacher and student branches simultaneously, which undoubtedly introduces additional computational costs (for example, using 8 NVIDIA 4090 GPUs for training). However, the paper does not compare training efficiency with single-model methods (such as ControlNet), which may limit the application of this model in resource-limited scenarios.
    • The authors should conduct the downstream experiments using only the images generated by the proposed method, rather than combining them with the real data. Mixing makes the training set twice as large as that of the baselines trained on real data only.
    • The results would be more convincing if a clinician's evaluation or judgment of the generated images were included.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper conducts a successful exploration of the efficiency of ControlNet in medical image generation tasks. The distillation method is used to effectively avoid the high computational cost during sampling. Although there are still questions about the computational complexity introduced in the training process, overall, this research is of positive significance to the medical image generation community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This paper presents a novel method for medical image synthesis. The authors have largely addressed the issues I raised during the rebuttal phase. Therefore, I am in favor of accepting this paper.



Review #3

  • Please describe the contribution of the paper

    This paper proposed an Adaptively Distilled ControlNet that employed a teacher model with both image and mask conditional inputs to regularize the student model, enabling mask-only conditioned medical image synthesis. To improve lesion-mask alignment, the authors introduced an Adaptive Distillation Loss to generate high-quality data for downstream segmentation tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper exhibits several strengths:

    1. The proposed framework demonstrates superior performance in high-fidelity mask-to-image synthesis and serves as an efficient augmentation tool for two segmentation tasks.
    2. The distillation paradigm uses mask-image pairs as implicit regularizers to ensure high-quality generation and stable optimization.
    3. The introduced Adaptive Distillation Loss enhances the consistency between lesion masks and corresponding generated regions.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The motivation for utilizing a distillation framework in this synthesis task is unclear. While the image priors embedded in the teacher model likely introduce rich visual concepts for high-quality generation, there is a lack of an ablation experiment to prove the effectiveness of this design (i.e., the loss term L_T).
    2. The conditions (i.e., mask-image pairs) were derived from real training data. It is unclear whether the synthesized images are truly novel. Providing evidence, such as a case study, would help demonstrate that the synthesizer is not merely replicating the real dataset.
    3. How many images are generated from a single real training mask? Describing this would help others reproduce the work.
    4. The authors should discuss the limitations of their method.
    5. Some experimental tables and figures are placed in different sections of the paper, which makes it somewhat hard to read. I would suggest the authors reorganize the structure for better flow.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would recommend weak acceptance based on the performance and novelty of the proposed framework. But I have a few major concerns (refer to the weakness) that I hope the authors could address properly.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my major concerns, providing clarifications regarding the work’s motivation and implementation details. However, I would encourage the authors to include feature visualizations comparing the generated and real data to better demonstrate feature-level novelty and diversity in future versions. Overall, I believe the quality of this paper is fair enough for the MICCAI community.




Author Feedback

We thank all reviewers for their constructive feedback and for recognizing the strengths of our work, including the novel and superior distillation framework (@R3, R4), the effective adaptive distillation loss (@R2, R3, R4), comprehensive evaluations demonstrating strong task-agnostic generalization across diverse domains (@R2, R3, R4), and the clear, reproducible algorithm description (@R3, R4).

Q1. Quality Evaluation Metrics (@R2): SSIM and PSNR are designed for reconstruction and may have limitations in generative tasks. SegGuidedDiff and LeFusion highlight that FID fails to capture fine medical structures. While CLIP-I offers better generalization, it shares similar limitations. These metrics should be treated as auxiliary rather than primary evaluation criteria. In contrast, image diversity and lesion-mask alignment are more critical for segmentation performance.

Q2. Clinician Evaluation (@R2, R3): We agree that clinician evaluation is the best way to assess image quality. Unfortunately, validating generalizability across two modalities and organs complicates clinical support, as each organ typically requires at least three experts.

Q3. Visual Artifacts (@R2): The “Original” (real) images in Fig. 3 contain holes, meaning these holes represent valid structures and can therefore be synthesized. Fig. 2 visualizes lesion-mask alignment across eight training stages, illustrating the trend that our method achieves faster and more stable alignment than ControlNet, effectively mitigating the sudden convergence phenomenon noted in ControlNet. The final model is trained for 3,000 steps, with SOTA segmentation performance reflecting high image quality.

Q4. Medical Diffusion Models (@R2): We appreciate the reminder. ArSDM (1,450 RGB images, 18,000 steps) and SegGuidedDiff (12,000 CT images, 75,000 steps) are pixel-space models trained from scratch, with ArSDM performing better. Given the limited annotated data in our setting, comparison with SegGuidedDiff might offer limited additional insight relevant to our core findings. In contrast, general-purpose models like ControlNet (jointly trained with the UNet decoder) leverage pretrained weights and achieve better FID with only 3,000 steps. Moreover, strong segmentation performance can be achieved with 12,000 real image-mask pairs alone, which may explain why SegGuidedDiff was not tested on mixed synthetic and real data. SinGAN is task-specific for polyp data.

Q5. Training Cost (@R3, R4): All distillation-based methods, including ours, inherently suffer from increased training costs (approximately 860M additional parameters), but this may be partly mitigated by separately pretraining the mask-image (teacher) model. Compared to ControlNet, we trade training cost for significantly better lesion-mask alignment, which is key to improved segmentation performance. Training is feasible on two 4090 GPUs with ZeRO-2.
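For reference, a ZeRO stage-2 setup of the kind mentioned above can be enabled with DeepSpeed's standard configuration keys. The sketch below is a hedged illustration: the batch sizes and precision flag are assumptions, and `student` is a hypothetical placeholder for the trainable model.

```python
# Hedged sketch of a ZeRO stage-2 setup with DeepSpeed; the values are
# illustrative assumptions and `student` is a hypothetical placeholder.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=student,
    model_parameters=student.parameters(),
    config=ds_config,
)
# In the training loop, engine.backward(loss) and engine.step() replace
# the usual loss.backward() and optimizer.step().
```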

Q6. Segmentation Training Set (@R3, R4): SegGuidedDiff shows that synthetic images alone cannot replace real samples. Following ArSDM, we double the training set by generating one image per real mask and mixing them with real data. The Copy-Paste baseline in Tables 2 and 3 duplicates real samples to isolate the effect of data quantity.
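As a concrete reading of this protocol, the sketch below generates one synthetic image per real mask and pools it with the real pairs; `load_real_pairs` and `sample_from_student` are hypothetical placeholders for the dataset loader and the student model's sampling loop.

```python
# Sketch of the mixed-training protocol: one synthetic image per real mask,
# pooled with the real data, which doubles the training set.
from torch.utils.data import ConcatDataset, Dataset

class PairDataset(Dataset):
    """Minimal image-mask pair dataset for segmentation training."""
    def __init__(self, images, masks):
        self.images, self.masks = images, masks
    def __len__(self):
        return len(self.images)
    def __getitem__(self, i):
        return self.images[i], self.masks[i]

real_images, real_masks = load_real_pairs()                    # hypothetical loader
synth_images = [sample_from_student(m) for m in real_masks]    # one image per mask

train_set = ConcatDataset([
    PairDataset(real_images, real_masks),
    PairDataset(synth_images, real_masks),  # synthetic images reuse the real masks
])
```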

Q7. Motivation (@R4): Improving lesion-mask alignment requires stronger constraints on lesion regions. Our adaptive distillation uses the teacher model to enforce this while avoiding the morphological degradation in lesion-free areas seen in ArSDM.

Q8. Feature Novelty (Diversity) (@R4): The mask-image prior is used as regularization only during training; during sampling, only the mask guides generation, preventing mode collapse. Figs. 3 and 4, especially case-61, show strong lesion-mask alignment with diverse features.

Q9. Layout (@R4): Thanks for the suggestion. We will revise the layout upon acceptance.

Thank you for your valuable feedback! We hope our clarifications address your concerns and look forward to your continued support.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


