Abstract

Diabetic retinopathy (DR) is a major cause of vision impairment, with early detection playing a crucial role in preventing irreversible blindness. While deep learning-based automated DR grading has improved diagnostic efficiency, class imbalance in public datasets hinders reliable performance evaluation, particularly for underrepresented DR stages. Current state-of-the-art classifiers achieve high overall accuracy but suffer from poor balanced accuracy, limiting their real-world applicability. Inspired by recent advancements in diffusion models, we propose to mitigate class imbalance by generating synthetic fundus images. Unlike prior methods prioritizing visual quality, we introduce a semantic quality metric based on classifier-predicted likelihood to selectively filter synthetic samples that enhance classification performance. Furthermore, we incorporate explicit class constraints during diffusion model finetuning to generate more semantically relevant data. Experimental results demonstrate a significant improvement in balanced classification accuracy from 66.84% to 74.20%, highlighting the effectiveness of our approach in improving DR diagnosis.
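For concreteness, the filtering idea described in the abstract can be sketched as follows: score each synthetic image by the ensemble-averaged probability that a set of pretrained DR classifiers assigns to the intended grade, and keep only confident samples. This is a minimal illustration with assumed helper names and an illustrative threshold, not the authors' released implementation (see the code repository below for that).

```python
# Minimal sketch (not the authors' code): filter synthetic images by the
# ensemble-averaged probability that pretrained classifiers assign to the
# intended DR grade. Threshold value and argument names are illustrative;
# the rebuttal reports grid-searching thresholds in [0.7, 0.9].
import torch

@torch.no_grad()
def semantic_filter(images, target_class, classifiers, threshold=0.8):
    """Keep synthetic images whose ensemble-predicted likelihood of the
    intended class is at least `threshold`."""
    probs = torch.stack([
        torch.softmax(clf(images), dim=1) for clf in classifiers
    ]).mean(dim=0)                      # (N, num_classes) ensemble average
    scores = probs[:, target_class]     # semantic quality score per image
    keep = scores >= threshold
    return images[keep], scores[keep]
```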

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4449_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/AlanZhang1995/ECC_DM_for_DR.git

Link to the Dataset(s)

DDR: https://github.com/nkicsl/DDR-dataset

EyePACS: https://www.kaggle.com/c/diabetic-retinopathy-detection/

APTOS: https://www.kaggle.com/competitions/aptos2019-blindness-detection

BibTex

@InProceedings{ZhaHao_ClassConditioned_MICCAI2025,
        author = { Zhang, Haochen and Heinke, Anna and Nagel, Ines D. and Bartsch, Dirk-Uwe G. and Freeman, William R. and Nguyen, Truong Q. and An, Cheolhong},
        title = { { Class-Conditioned Image Synthesis with Diffusion for Imbalanced Diabetic Retinopathy Grading } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution of this paper is a method to address class imbalance in diabetic retinopathy (DR) grading datasets using diffusion models. The key contributions are: (1) the paper introduces a semantic quality metric based on classifier-predicted likelihood to selectively filter synthetic samples that enhance classification performance; (2) it proposes incorporating explicit class constraints during diffusion model finetuning to generate more semantically relevant data; (3) the method significantly improves balanced classification accuracy from 66.84% to 74.20%, demonstrating the effectiveness of the approach in improving DR diagnosis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a new evaluation method based on an ensemble of pre-trained classifiers to select synthetic data for imbalanced datasets, focusing on semantic quality rather than visual quality and thereby mitigating the poor performance on underrepresented classes in DR grading.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The methodological innovation is limited. The backbone is essentially a text-to-image diffusion model. The claimed novelty lies in the semantic quality evaluation; however, this evaluation approach is essentially an ensemble scoring model.
    2. The analysis of the generated data is insufficient. For example, there is a lack of discussion on the class distribution of the training samples after generation, and how the threshold for selecting samples affects the results.
    3. The performance advantage of the proposed method is not clearly demonstrated. The evaluation metrics used in the experiments are too limited, including only balanced accuracy, unbalanced accuracy, and quadratic weighted kappa. More evaluation metrics are needed to comprehensively demonstrate the superiority of the method from multiple perspectives.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. The dataset description should be made clearer. Since the method is mainly applied to imbalanced datasets, it is necessary to explicitly provide the number of samples in the training, validation, and test sets.
    2. Among the three evaluation metrics reported in the paper, the proposed method leads to a decrease in two of them to varying degrees. How does this demonstrate the superiority of the method? It is recommended to explain the importance of the improvement in balanced accuracy in the context of the specific task or sample distribution.
    3. In addition to balanced accuracy, kappa, and overall accuracy, various other metrics can be used to evaluate multi-class classification performance. Comparative studies should include additional metrics such as Macro-F1, Micro-F1, and MCC (see the computation sketch after this list).
    4. What is the impact of the filtering threshold [0.7,0.9] on performance improvement?
    5. In Table 1, the balanced accuracy of “ECC DM w/ filter” decreases, suggesting that the semantic quality filter did not have the desired effect. Why is that the case?
    6. How is the class distribution of the added synthetic samples determined? What is the resulting class distribution in the training set after augmentation?
    7. Lack of comparison with other state-of-the-art methods that also address class imbalance in medical imaging.
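    For reference, all of the metrics named in point 3, together with those already reported in the paper, are available in scikit-learn. A minimal sketch on hypothetical labels (illustrative values, not results from the paper):

```python
# Multi-class metrics suggested in the review, computed with scikit-learn
# on small hypothetical ground-truth and predicted DR grades.
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef)

y_true = [0, 1, 2, 2, 3, 4]   # illustrative labels only
y_pred = [0, 1, 2, 1, 3, 4]

print("Balanced Acc.:", balanced_accuracy_score(y_true, y_pred))
print("Macro-F1:     ", f1_score(y_true, y_pred, average="macro"))
print("Micro-F1:     ", f1_score(y_true, y_pred, average="micro"))
print("MCC:          ", matthews_corrcoef(y_true, y_pred))
print("QWK:          ", cohen_kappa_score(y_true, y_pred, weights="quadratic"))
```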
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodological novelty is limited, and the experimental section is insufficient. The descriptions of sample distributions (the original data, synthetic data, and the newly formed training set) are unclear. Moreover, the lack of comparative methods makes it difficult to demonstrate the advancement of the proposed method.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The revised manuscript has few improvements.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a conditional diffusion-based solution for improving balanced accuracy in diabetic retinopathy (DR) classification. The authors propose generating synthetic samples of underrepresented DR classes with a diffusion model and evaluating their quality with a semantics-based metric instead of classical visual quality. Specifically, a classifier-ensemble method filters out all but the semantically relevant generated samples, which are then used to balance the training set and improve overall balanced classification accuracy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The explicit conditioning of the diffusion model to generate semantically relevant samples from different classes helps with imbalanced datasets, as in DR grading.
    • The idea of focusing on semantic image quality instead of pure visual quality, using an ensemble of pre-trained classifiers to select the most relevant generated samples.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • There are no details on how the generated samples are used for training the proposed solution, and the class distribution of the validation and test sets is not described, which limits the interpretation and evaluation of the results.
    • Description of implementation choices is high-level and, in some cases, lacks motivation or discussion.
    • No comparison is made with solutions exploiting vision quality-based generated samples for data balancing.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Addressing data imbalance in medical imaging is relevant to the MICCAI community, as it is a problem shared across pathologies. The paper is generally well written, even if various parts of the methodology are not described in enough detail to fully appreciate the proposed solution and support reproducibility. Although the reviewer agrees that diffusion model finetuning normally requires large datasets, which well motivates the use of finetuning strategies that need few samples, it is also true that there is a considerable number of open-source fundus datasets with high sample counts, so an analysis of the semantic quality of the generated images as a function of finetuning dataset size could be explored.

    Additionally, the motivation behind choosing DDR as the primary dataset and EyePACS and APTOS as external ones is unclear, as is how changing this decision would impact overall performance; a better description should be provided. Moreover, no explanation is given for the choice of classifier backbones, or for the impact of their pretraining on bias propagation in the ensemble selection of relevant generated samples; a more in-depth discussion would greatly benefit the manuscript.

    The primary concern regards the missing details on the class distribution in the validation and test sets, how it was handled, how the synthetic samples were integrated into the training procedure, and how oversampling was performed in the LANet setting for the baseline comparison; these omissions limit the interpretability and generalization of the reported results. Finally, although balanced accuracy improves, there is no discussion of its impact in clinical practice; indeed, as shown in Figure 2(b), reaching such results strongly degrades accuracy on the middle "moderate" class, which should be discussed in terms of clinical implications.

    Minors:

    • In Section 1, FID and IS (Fréchet Inception Distance and Inception Score) are used without being introduced.
    • The fonts in Figure 1 and Figure 2 are too small.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method description lacks the details needed to evaluate the quality of the proposed solution, and the motivation and discussion of fundamental choices remain at too high a level.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a diffusion model-based synthetic data generation method to address class imbalance in diabetic retinopathy (DR) grading. By introducing semantic quality filtering and explicit class constraints, the authors significantly improve the balanced classification accuracy (from 66.84% to 74.20%) on the DDR dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Targets the critical issue of model bias caused by class imbalance in DR grading, aligning with real-world needs in medical image analysis.

    2) Combines DreamBooth and LoRA strategies to address the challenges of finetuning diffusion models on limited medical data, demonstrating effective adaptation of existing techniques (a brief sketch of this combination follows).
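    To illustrate the strength noted in point 2, the following hedged sketch shows how DreamBooth-style finetuning with LoRA adapters might be set up using Hugging Face diffusers and peft; the base model id, LoRA rank, target modules, and prompt wording are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch (not the authors' code): DreamBooth-style finetuning of a
# Stable Diffusion UNet with LoRA adapters via diffusers + peft.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

lora_cfg = LoraConfig(
    r=8, lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
pipe.unet = get_peft_model(pipe.unet, lora_cfg)  # only LoRA weights are trained

# DreamBooth binds a rare identifier token to the target concept; a
# class-conditioned prompt could then name the DR grade explicitly, e.g.:
prompt = "a fundus photograph of sks retina with severe diabetic retinopathy"
```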

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Performance plateaus with increased synthetic data volume, suggesting limited diversity in generated samples. However, no solutions are proposed.

    2) Semantic filtering relies on pretrained classifiers, which may propagate biases (e.g., overfitting to specific classes) into synthetic data quality. Mitigation strategies are unexplored.

    3) Experiments are limited to the DDR dataset; generalizability to other DR datasets (e.g., Kaggle EyePACS) remains unverified.

    4) The pipeline (diffusion generation + iterative finetuning + filtering) may incur high computational costs, limiting accessibility for resource-constrained settings.

    ​5) No comparison with alternative generative methods (e.g., GANs, VAEs) to demonstrate the superiority of diffusion models in this task.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1) Performance plateaus with increased synthetic data volume, suggesting limited diversity in generated samples. However, no solutions are proposed.

    2) Semantic filtering relies on pretrained classifiers, which may propagate biases (e.g., overfitting to specific classes) into synthetic data quality. Mitigation strategies are unexplored.

    3) Experiments are limited to the DDR dataset; generalizability to other DR datasets (e.g., Kaggle EyePACS) remains unverified.

    4) The pipeline (diffusion generation + iterative finetuning + filtering) may incur high computational costs, limiting accessibility for resource-constrained settings.

    ​5) No comparison with alternative generative methods (e.g., GANs, VAEs) to demonstrate the superiority of diffusion models in this task.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my questions, and I recommend accepting this paper.



Review #4

  • Please describe the contribution of the paper

    Experimental results show that the semantic-oriented metric effectively filters the synthetic data, and the proposed finetuning strategy improves balanced classification accuracy from 66.84% to 74.20%.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    not bad

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Figures 1 and 3 are of very poor resolution, which is not acceptable. Use high-resolution images.
    2. Clearly specify the innovative aspects of the proposed method.
    3. What is special about the VGG-16, Inception-v3, and DenseNet-121 backbones? Why were CNN-based models chosen for training?
    4. The table titles should be revised; they are a bit confusing.
    5. The English language should be improved.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Recommendation: Accept

    This paper presents a well-structured, insightful, and impactful investigation into the semantic robustness of deep learning models for diabetic retinopathy (DR) grading, focusing on worst-case evaluation using optimized perturbations. The work is timely and significant, especially given the increasing deployment of AI systems in safety-critical domains like healthcare.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their thoughtful feedback. We are encouraged by the recognition of our contributions, e.g., the novel use of semantic quality filtering and class-conditioned finetuning with limited data.

[R1Q1&6, R4] – Dataset splits and synthetic data. We followed the official DDR split: training (3133, 315, 2238, 118, 456 samples per class) and test (1880, 189, 1344, 71, 275). For validation, we randomly selected 47 samples from each class in the official validation split to ensure balance. Synthetic samples were added only to the mild and severe DR classes. At the 4K setting in Fig. 2(a), 2K samples were added to each, resulting in a training set of 3133, 2315, 2238, 2118, 456. Validation and test sets remained untouched. Oversampling in LANet followed normalized inverse class frequency.
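As a worked illustration of the oversampling rule mentioned above, the following sketch (an assumption about what "normalized inverse class frequency" means here, not the authors' code) derives per-class sampling weights from the quoted DDR training counts:

```python
# Per-class oversampling weights from normalized inverse class frequency,
# using the DDR training counts quoted in the rebuttal. Illustrative only.
train_counts = [3133, 315, 2238, 118, 456]

inv_freq = [1.0 / c for c in train_counts]
norm = sum(inv_freq)
class_weights = [w / norm for w in inv_freq]  # sums to 1; rare classes weigh more

print([round(w, 4) for w in class_weights])
# Rare classes (mild: 315, severe: 118) receive the largest weights, so a
# weighted sampler would draw them far more often than their raw frequency.
```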

[R1, R3Q2] – Novelty. Our work shifts the focus from visual realism to semantic usefulness in medical image synthesis, an aspect that is often overlooked. Instead of assuming that realistic images are helpful, we assess them based on classifier-predicted semantic confidence. We adapted DreamBooth for low-data finetuning, introduced a semantic scoring metric, and implemented ECC as self-supervised refinement. [R4] Table 1 demonstrates gains over the visual-only baseline (Basic w/o filter). We hope this shift in perspective encourages further development in medically meaningful generation.

[R2Q3, R3Q3, R4] – Implementation choices. As our work focuses on diffusion model (DM) development, the choices of datasets and classifiers were made to support fair and efficient DM evaluation. We selected DDR as the primary dataset due to its defined public test split and the availability of strong open-source baselines; EyePACS lacks public test labels. Even if used as the main dataset, EyePACS would only reduce the gain from DreamBooth, while our core innovations, semantic filtering and ECC, should remain beneficial. For classifiers, we prioritized established methods rather than custom networks to better isolate the impact of our DM. LANet [10] itself offers SOTA accuracy with 5 diverse backbones, making it ideal for integration. Our method improves performance across the top 3 backbones, demonstrating its backbone independence. [R2Q2] To reduce potential bias from any single classifier, we use a diverse ensemble, which works well in our experiments. In more challenging cases where classifier quality is limited, we consider human-in-the-loop filtering a promising direction.

[R1Q2, R4 Clinical] – Metric drop and clinical impact. Given the imbalance of the test set, it is critical to use balanced accuracy as the primary metric. While overall accuracy and kappa may decrease, this reflects a trade-off: our method reduces bias toward dominant classes in favor of more equitable classification. As shown in Fig. 2(b), the left model can mask poor performance on mild/severe DR. Clinically, this is crucial: previous models inflated moderate DR accuracy by misassigning mild/severe cases. Our method reduces this harmful shortcut.
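For reference, balanced accuracy is the unweighted mean of per-class recall over the C = 5 DR grades, so each grade contributes equally regardless of its prevalence in the test set:

```latex
\mathrm{BalancedAcc} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FN_c}
```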

Experimental details. [R1Q4] For the threshold range [0.7, 0.9], we performed a grid search; the average balanced accuracy across the 9 options was 73.7 with a standard deviation of 0.6. [R1Q5] In Table 1, the ECC diffusion models are trained with the proposed filtering and thus learn to generate semantically meaningful data directly. This indicates that the proposed filtering strategy remains effective through implicit guidance, rather than being unhelpful (see Page 6, last paragraph). [R2Q4] Our model runs on a single RTX 4090 (24 GB), supporting accessibility. [R4] We also tested augmenting DDR with real EyePACS data (w/ vs. w/o filtering: 73.59 vs. 71.38 balanced accuracy). As DDR and EyePACS differ in country and instrument, our semantic filter helped exclude domain-mismatched samples and improved performance on DDR.

[R2Q1] – Diversity. While our current focus is on semantic quality, we plan to incorporate diversity-aware sampling (e.g., clustering-based) in future work. However, note that with finite real data it is not feasible to generate datasets of unlimited diversity, regardless of the method.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The topic is relevant, and this has been recognized by the reviewers. The main concerns shared by all reviewers regard missing details on the data distribution across subsets, how generated images are introduced into training, and the motivation/discussion of implementation choices.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a diffusion model-based synthetic data generation method to address class imbalance in diabetic retinopathy (DR) grading. The rebuttal appropriately addresses most of the reviewers' concerns. As the reviewers who hold a "weak reject" stance do not provide specific reasons for rejection, I recommend acceptance.


