Abstract
Recent advancements in deep learning for medical image segmentation are often limited by the scarcity of high-quality training data. While diffusion models provide a potential solution by generating synthetic images, their effectiveness in medical imaging remains constrained by their reliance on large-scale medical datasets and the need for higher image quality. To address these challenges, we present MedDiff-FT, a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner. During inference, a dynamic adaptive guiding mask enforces spatial constraints to ensure anatomically coherent synthesis, while a lightweight stochastic mask generator enhances diversity through hierarchical randomness injection. Additionally, an automated quality assessment protocol filters suboptimal outputs using feature-space metrics, followed by mask corrosion to refine fidelity. Evaluated on five medical segmentation datasets, MedDiff-FT's synthetic image-mask pairs improve SOTA methods' segmentation performance by an average of 1% in Dice score. The framework effectively balances generation quality, diversity, and computational efficiency, offering a practical solution for medical data augmentation.
The code is available at https://github.com/JianhaoXie1/MedDiff-FT.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4183_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/JianhaoXie1/MedDiff-FT
Link to the Dataset(s)
N/A
BibTex
@InProceedings{XieJia_MedDiffFT_MICCAI2025,
author = { Xie, Jianhao and Zhang, Ziang and Weng, Zhenyu and Zhu, YueSheng and Luo, Guibo},
title = { { MedDiff-FT: Data-Efficient Diffusion Model Fine-tuning with Structural Guidance for Controllable Medical Image Synthesis } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces MedDiff-FT, a method for generating synthetic medical images and corresponding segmentation masks by fine-tuning a pre-trained diffusion model. The approach addresses challenges in medical image segmentation caused by limited datasets through a data-efficient generation process.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The method is computationally efficient, requiring minimal resources (30 minutes of training, 24 GB of memory) and only a small number of training examples (around 30 image-mask pairs).
- The approach generates both images and their corresponding segmentation masks simultaneously, which is particularly valuable for medical image segmentation tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The purpose of fine-tuning the diffusion model only on masked lesion images requires clarification. The paper doesn't explain why image-to-image generation approaches as in [1] were not considered, which could eliminate the need for both a mask generator and non-lesion image generation and would potentially reduce trainable parameters and complexity. Additionally, the paper fails to address scenarios where masks aren't available, whereas image-to-image methods could still function effectively. A comparative analysis with these alternatives would strengthen the paper.
- While the paper mentions a "lightweight stochastic mask generator" for improving diversity, critical details about its architecture, implementation, and training process are missing. Similarly, the parameters and architecture of the non-lesion image generator aren't discussed, making reproducibility challenging and leaving questions about the overall model complexity unanswered.
- The experimental validation is limited to dermoscopic images, which restricts the generalizability claims. Including more diverse and challenging clinical image datasets such as Derm7pt [2] and Fitzpatrick17k [3] would provide a more comprehensive evaluation of the method's effectiveness across different real-world scenarios with varying image qualities, skin tones, and clinical presentations.
- The quality assessment protocol using DINOv2 similarity scores lacks specific thresholds and detailed implementation information, making it difficult to assess its effectiveness or reproduce this critical filtering component of the pipeline.
[1] Wang, J., et al., "From Majority to Minority: A Diffusion-based Augmentation for Underrepresented Groups in Skin Lesion Analysis," MICCAI ISIC Workshop, 2024.
[2] Kawahara, J., et al., "Seven-Point Checklist and Skin Lesion Classification Using Multitask Multimodal Neural Nets," IEEE Journal of Biomedical and Health Informatics, 2019.
[3] Groh, M., et al., "Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset," CVPR Workshop, 2021.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While MedDiff‑FT’s promise of generating paired images + masks from only ~30 examples is attractive, the paper leaves too many gaps to recommend acceptance. (1) No baseline comparison: The choice to fine‑tune on masked lesions is not justified against simpler image‑to‑image diffusion augmenters that work without masks; we cannot tell if MedDiff‑FT is truly more data‑ or parameter‑efficient. (2) Missing implementation details: Key components—the “stochastic mask generator,” non‑lesion generator, and DINOv2 quality‑filter thresholds—are undocumented, making reproduction and complexity assessment impossible. (3) Narrow validation: All experiments use dermoscopic images; claims of broader applicability remain untested on diverse sets such as Derm7pt or Fitzpatrick17k. (4) Marginal gains vs. added complexity: A ~3 pp Dice boost may not justify the extra generators, filters, and masking steps.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed all my major concerns.
Review #2
- Please describe the contribution of the paper
The authors propose to fine-tune a Stable Diffusion model using limited data, and to construct realistic mask-conditioned images during inference by combining a generated mask and a healthy image in the latent space of the diffusion model. They demonstrate model validity with improved segmentation scores on a variety of datasets with a variety of baseline segmentation models.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors use multiple segmentation models and multiple datasets to show that images generated with their model can improve segmentation performance. The most convincing evidence is Table 1.
- The authors use few images (<50) to fine-tune their diffusion model, and it is interesting to see that this works well. This is relevant for many medical modalities where the number of available (labeled) scans is small.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Abstract/intro "3% improvement": this is misleading in my opinion. Considering, for instance, only the nnU-Net row (which I believe is the most stable baseline included, and which also shows the best performance on 4/5 datasets), I calculate a mean improvement of 1.25%.
- p4: Textual inversion trains a new embedding vector for each new 'concept' introduced to the model. The authors do not explain here how they approach this during training. Do they train a new embedding for each type of scan?
- p5: This section is unclear, because extra diffusion models are suddenly introduced. Are those also Stable Diffusion models? And if not, what is their architecture?
- It is not completely clear whether the authors have reimplemented the baseline models themselves or use existing pipelines. For instance, for nnU-Net, do the authors use the full framework, or only the model? In addition, it would be useful if the authors could provide context for the segmentation scores by citing previous work on the datasets used.
- p6, Table 2: the differences are very slight, and there are no error margins given. I doubt whether the differences between ControlNet/T2I/ours are significant; they are, for instance, less than a percent for nnU-Net. The authors write in the text that ControlNet and T2I produced suboptimal results, but Table 2 does not convince me of that. It would have been nice to see error margins on these numbers, for instance by repeating all experiments 5-10 times and showing the standard deviation, or by using 5-fold cross-validation (see the sketch after this list).
- p8, "significantly": the authors never mention statistical testing between model variants, so using the word 'significant' is not appropriate.
- Why do the authors generate diseased images only from existing healthy 'background' images? Would it not be more useful to freely generate diseased images conditioned on segmentation masks?
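As a concrete illustration of the error-margin suggestion above, here is a minimal sketch of reporting mean ± standard deviation over repeated runs or cross-validation folds; the Dice values in the usage line are placeholders, not results from the paper:

```python
# Illustrative only: reporting error margins over repeated runs or CV folds,
# as suggested in the weaknesses above. The example scores are placeholders.
import statistics

def summarize_dice(scores: list[float]) -> str:
    """Report mean +/- sample standard deviation (n-1 denominator)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{mean:.2f} +/- {std:.2f} (n={len(scores)})"

# e.g., five hypothetical folds of a 5-fold cross-validation:
print(summarize_dice([78.1, 79.0, 78.5, 78.9, 78.3]))
```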
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- p4, Section 2.1: Textual Inversion is cited, but the authors then state that they unfreeze the U-Net during training. In my recollection this is closer to the 'DreamBooth' approach. Could the authors check that and perhaps refer to that instead?
- The header shows 'title suppressed due to length'. There should be an option in LaTeX to specify a short title for the header.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors introduce a method to generate mask-conditioned images from existing healthy images using a fine-tuned Stable Diffusion model. The Stable Diffusion model can be fine-tuned with fewer than 50 scans, which is especially relevant for medical imaging modalities with few scans available.
In general, I am convinced enough by the presented results that the authors' method has added benefit for training segmentation models. The reasons for not recommending stronger than 'Weak Accept' are:
- The paper is not very clearly written, and some details regarding implementation and validation are missing or hard to understand.
- While I am convinced the proposed method works, I am not convinced it works better than existing alternatives, like ControlNet or T2I-Adapter. It is good to see, though, that the authors test against these themselves.
- There are a lot of results, but no statistics or error margins. This leaves me with some doubt as to whether the reported increases in performance would hold in a more rigorous validation approach.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes a diffusion-based framework for medical image-mask pair generation in a data-efficient manner, introducing a lightweight stochastic mask generator to enhance diversity. Moreover, the authors devise a quality assessment protocol to ensure the fidelity of the generated data, thus better supporting downstream segmentation tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper exhibits several strengths:
- The proposed framework for image-mask pair generation is resource-efficient and serves as an efficient augmentation tool for downstream segmentation tasks.
- The diversity of the synthetic image-mask pairs can be enhanced through the proposed mask generator and lesion-free background generator.
- The filtering scheme for the synthetic samples connects the generation to downstream tasks well.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The original experiments used only 30 image-mask pairs, while the number of generated images reached 1,500 or 2,750. Despite this significant increase, the observed Dice improvements are relatively marginal (mostly around 1%). It remains unclear whether the model genuinely enhances generalization by increasing intrinsic data diversity or merely mitigates overfitting. Moreover, the downstream experiments lack a baseline trained on a duplicated dataset (i.e., a "copy-paste" version of the real dataset matching the synthetic dataset in size).
- The detailed pipeline of the condition image restoration process for non-lesion background generation needs further clarification. Specifically, how is the inverted mask incorporated into the diffusion model to enable effective inpainting?
- It is unclear whether the proposed model generates more structurally faithful images compared to ControlNet and T2I-Adapter. A more in-depth discussion or evidence of its superiority over these competitors is warranted.
- Detailed configurations of the model should be described, such as the down-sampling factor of VAE, sampling steps, etc.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I would lean to weak acceptance based on the novelty of the proposed framework. But I have a few major concerns (refer to the weakness) that I hope the authors could explain in the rebuttal.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed my major concerns, and I believe the quality of this paper is fair enough for the MICCAI community.
Author Feedback
More details on Reviewer 1's Q3, Reviewer 2's Q2, and Reviewer 3's Q2. Mask Generator: it is designed to produce highly diverse masks for guidance purposes, and is implemented as a DDPM with a U-Net backbone trained on mask images. The network consists of four hierarchical layers with progressively increasing channel dimensions [64, 128, 256, 512]. Non-lesion Image Generator: built upon the Stable Diffusion 1.5 architecture. In the training phase, we use data pairs of the lesion image and the inverted mask (swapping 0 and 1 values), then perform the original fine-tuning; the inverted mask allows the model to learn how to generate healthy regions. In the inference phase, the input is the lesion image with the original mask (see the inference phase in Fig. 1), and the model repairs the lesion region inside the mask into a healthy region.
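To make the rebuttal's description concrete, here is a minimal sketch of such a mask generator using the diffusers library. The four-level channel widths [64, 128, 256, 512] follow the rebuttal; the mask resolution, noise schedule, and optimizer settings are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of the stochastic mask generator: a DDPM with a four-level
# U-Net, per the rebuttal. Resolution, schedule, and LR are assumptions.
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

model = UNet2DModel(
    sample_size=256,                          # assumed mask resolution
    in_channels=1,                            # binary masks are single-channel
    out_channels=1,
    block_out_channels=(64, 128, 256, 512),   # four hierarchical levels (rebuttal)
    down_block_types=("DownBlock2D",) * 4,
    up_block_types=("UpBlock2D",) * 4,
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(masks: torch.Tensor) -> torch.Tensor:
    """One DDPM denoising step on a batch of masks in [-1, 1], shape (B, 1, H, W)."""
    noise = torch.randn_like(masks)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (masks.shape[0],))
    noisy = scheduler.add_noise(masks, noise, t)
    pred = model(noisy, t).sample             # U-Net predicts the added noise
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

At inference, iterating scheduler.step over the timesteps from pure noise yields new masks, which would then be binarized before use.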
Reviewer 1 Q2: whether the method trains new embeddings per scan type. We use fixed, predefined trigger words (e.g., "hta") as lightweight semantic cues to guide lesion generation. These prompts are not tied to specific scan types; instead, the model learns to associate them with anatomical patterns through U-Net updates, eliminating the need for scan-type-specific embeddings. Q4: do the baseline models use the full framework? For nnU-Net, we used the official nnU-Net v2 full framework, including its preprocessing operations. Q5: 5-fold cross-validation. On BUSI, the average Dice of our 5-fold cross-validation is 78.76, still exceeding 78.25 for T2I-Adapter and 78.11 for ControlNet. Q7: constraining lesion synthesis to healthy backgrounds vs. full image generation from mask conditioning. Freely generating diseased images conditioned on segmentation masks is difficult when using the SD model for medical image generation because of the domain-shift problem (the pre-training data of the SD model are natural images, while our generation targets are medical images). We focus on generating diseased regions on healthy background images, which reduces the difficulty and improves generation quality.
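The trigger-word mechanism can also be illustrated with a short inference sketch. This is not the repository's code: the checkpoint path is hypothetical, and a standard Stable Diffusion inpainting pipeline stands in for the paper's guided-mask inference; the fixed prompt "hta" and the 50 sampling steps are taken from the rebuttal:

```python
# Hedged inference sketch: paint a lesion into the masked region of a healthy
# background using a fine-tuned SD 1.5 checkpoint (path is hypothetical).
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "path/to/meddiff-ft-checkpoint",   # hypothetical fine-tuned weights
    torch_dtype=torch.float16,
).to("cuda")

background = Image.open("healthy.png").convert("RGB").resize((512, 512))
mask = Image.open("sampled_mask.png").convert("L").resize((512, 512))

image = pipe(
    prompt="hta",              # fixed trigger word learned during fine-tuning
    image=background,
    mask_image=mask,           # lesion is synthesized only inside the white region
    num_inference_steps=50,    # sampling steps reported in the rebuttal
).images[0]
image.save("synthetic_lesion.png")
```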
Reviewer 2 Q1: true generalization vs. overfitting reduction. In the nnU-Net framework, a 1% improvement is already considerable, and, as in Reviewer 1's Q5, we also conducted a 5-fold cross-validation to verify the effectiveness of the method. The improvements stem not just from data volume but from structured diversity (mask variations + background images), as validated in Table 4 (performance grows with background diversity). Q3: further discussion needed to demonstrate the proposed model's structural fidelity via comparative analysis with ControlNet and T2I-Adapter. MedDiff-FT ensures structural faithfulness through adaptive mask guidance, where a dynamic mask enforces spatial constraints during training/inference (Eq. 1-2) and ensures pixel-level alignment between synthetic images and masks, and through post-processing, where mask corrosion (Fig. 3) and DINOv2-based filtering refine structural fidelity. Q4: more details about Stable Diffusion. The down-sampling factor of the VAE is 8, and the number of sampling steps is 50.
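The mask-corrosion step mentioned above reads naturally as a morphological erosion that pulls the mask boundary inside the synthesized lesion. The sketch below assumes OpenCV; the kernel size and iteration count are guesses, not values from the paper:

```python
# Hedged sketch of mask corrosion as morphological erosion (OpenCV assumed).
import cv2
import numpy as np

def corrode_mask(mask: np.ndarray, kernel_size: int = 5, iterations: int = 1) -> np.ndarray:
    """Shrink the binary mask boundary so it sits inside the generated lesion."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.erode(mask, kernel, iterations=iterations)

mask = cv2.imread("sampled_mask.png", cv2.IMREAD_GRAYSCALE)
cv2.imwrite("refined_mask.png", corrode_mask(mask))
```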
Reviewer 3: the code will be made public for reproducibility. Q1: about the image-to-image approaches. Current image-to-image methods ([1]) cannot easily produce corresponding masks, so they are generally applied to downstream classification tasks. We focus on the more difficult segmentation task, where mask-conditioned lesion generation is more appropriate. We will add a description of the image-to-image methods to clarify this. Q3: lack of tests on clinical datasets. We agree that broader clinical validation (e.g., varied skin tones) is critical. Our validation covers not only dermoscopic images but also ultrasound (Tables 1 and 2). While Derm7pt/Fitzpatrick17k are valuable for classification, they lack the mask annotations required for our task. Q4: the thresholds for the DINOv2 similarity scores are [0.2, 0.8].
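Given the [0.2, 0.8] thresholds stated here, the quality filter can be sketched as a cosine-similarity gate in DINOv2 feature space: samples too dissimilar from real data are likely artifacts, while near-duplicates add no diversity. The model variant (facebook/dinov2-base via transformers) and the use of the pooled CLS feature are assumptions:

```python
# Hedged sketch of the DINOv2 quality filter; keeps a synthetic image only if
# its similarity to a real reference lies in [0.2, 0.8] (thresholds from the
# rebuttal). Model choice and pooling are assumptions, not from the paper.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_embedding(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    return model(**inputs).pooler_output.squeeze(0)   # pooled CLS descriptor

def keep(synthetic_path: str, reference_path: str,
         lo: float = 0.2, hi: float = 0.8) -> bool:
    sim = torch.nn.functional.cosine_similarity(
        dino_embedding(synthetic_path), dino_embedding(reference_path), dim=0
    ).item()
    return lo <= sim <= hi    # reject both unrealistic and near-duplicate samples
```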
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A