Abstract

Deep learning models have made significant advances in histological prediction tasks in recent years. However, their lack of robustness to varying conditions such as staining, scanner, hospital, and demographics remains a limiting factor for adoption in clinical practice: models trained on overrepresented subpopulations regularly struggle with less frequent patterns, leading to shortcut learning and biased predictions. Large-scale foundation models have not fully eliminated this issue. We therefore propose a novel approach that explicitly models such metadata in a Metadata-guided generative Diffusion model framework (MeDi). MeDi enables targeted augmentation of underrepresented subpopulations with synthetic data, which balances limited training data and mitigates biases in downstream models. We experimentally show that MeDi generates high-quality histopathology images for unseen subpopulations in TCGA, boosts the overall fidelity of the generated images, and improves the performance of downstream classifiers on datasets with subpopulation shifts. Our work is a proof of concept for better mitigating data biases with generative models.
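
For illustration only (MeDi's released implementation is linked below), the core idea of conditioning a diffusion model jointly on class labels and metadata can be sketched as follows. This is a minimal, hypothetical PyTorch sketch, not the authors' code; all names and the conditioning interface are assumptions:

```python
import torch
import torch.nn as nn

class MetadataConditioning(nn.Module):
    """Embed the class label and a metadata attribute (e.g. tissue
    source site) into one conditioning vector for a diffusion denoiser."""
    def __init__(self, num_classes: int, num_sites: int, dim: int = 256):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, dim)
        self.site_emb = nn.Embedding(num_sites, dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, class_id: torch.Tensor, site_id: torch.Tensor) -> torch.Tensor:
        # Concatenate both embeddings and project back to the model width.
        c = torch.cat([self.class_emb(class_id), self.site_emb(site_id)], dim=-1)
        return self.proj(c)

# TCGA-UT spans 32 cancer types and 184 tissue source sites (per the paper).
cond = MetadataConditioning(num_classes=32, num_sites=184)
c = cond(torch.tensor([5]), torch.tensor([120]))  # one (class, site) pair
# `c` would then condition the denoiser, e.g. eps = unet(x_t, t, cond=c).
```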

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1935_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/David-Drexlin/MeDi

Link to the Dataset(s)

TCGA-UT: https://zenodo.org/records/5889558 (used in this paper, not introduced by it)

BibTex

@InProceedings{DreDav_MeDi_MICCAI2025,
        author = { Drexlin, David Jacob and Dippel, Jonas and Hense, Julius and Prenißl, Niklas and Montavon, Grégoire and Klauschen, Frederick and Müller, Klaus-Robert},
        title = { { MeDi: Metadata-Guided Diffusion Models for Mitigating Biases in Tumor Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {388--398}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper explores the challenge of distribution and population shifts in biomedical image analysis. It proposes a synthetic data generation approach that conditions diffusion models on metadata associated with the images, thereby guiding the synthesis process to produce data that can enhance classifier generalization.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem addressed in the paper is highly relevant and important for biomedical image analysis. The characteristics of the dataset, as well as the domain shifts, are clearly described and well-motivated. The paper is well-written, easy to follow, and the proposed method is clearly presented.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Novelty: While the problem addressed is important, the novelty of the proposed approach is somewhat unclear to me. Synthetic image generation for data augmentation has been previously explored, as the authors themselves acknowledge. Is the main contribution the application of this approach to datasets with richer metadata and more pronounced distribution shifts?

    Evaluation: The downstream evaluation is based solely on linear probing. Why not also include full fine-tuning, which is a more standard and informative way to assess adaptation performance? It would be more compelling to demonstrate whether the gains persist after full model adaptation.
    Additionally, while FID scores serve as useful sanity checks to verify the quality of generated data, the most tangible measure of effectiveness is the final classification performance. It would strengthen the paper to include more comprehensive classification evaluations.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My main concern lies in the novelty and scope of the proposed application. Please refer to the Weaknesses section for more details.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of the paper is the introduction of a generative framework that conditions image synthesis not only on class labels (e.g., cancer type) but also on metadata attributes such as the medical center (Tissue Source Site). This allows for targeted generation of synthetic histopathology images representing underrepresented or unseen subpopulations, thereby addressing dataset imbalances and mitigating biases in downstream classification tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The study has a valid motivation with clinical relevance: it tackles a known problem in digital pathology, namely bias and poor generalization of AI models due to demographic and institutional imbalance. It moves beyond class-only conditioning by embedding multiple metadata fields into the diffusion process, and it highlights the ability to interpolate and extrapolate metadata combinations, something that is difficult with conventional augmentation.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Although MeDi is proposed as a general framework, only the Tissue Source Site (TSS) is used for conditioning in this study. Broader metadata such as race, scanner type, or age are mentioned but not tested. Despite the claim of higher visual fidelity, the evaluation relies solely on FID, which may not fully reflect the clinical realism or interpretability of the synthetic images. A human evaluation of diagnostic image quality would strengthen the proposed methodology. Furthermore, the performance improvements shown in Table 1 do not appear to be highly significant and may require further validation, as results could vary across different datasets.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is technically sound, the experiments are carefully designed to simulate subpopulation shift, and the improvements are quantitatively compelling.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose an approach to improve out-of-domain generalization of histopathology classifiers through the use of synthetic data augmentation with a class- and tissue source-conditional diffusion model. The primary contribution of this paper, compared to existing work, is the extension of the method to a dataset with many more classes and sources.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Major strengths include: The paper is well-written, includes a good amount of detail on the method used (repositories for model architecture, data, etc.), and is well-evaluated (a convincing demonstration of the approach's efficacy for out-of-domain generalization).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. The advance over prior work is pretty marginal/incremental, but the authors include a good review of related work and situate their work well within that corpus of existing work, so this is not a critical/disqualifying weakness.
    2. It would be nice to have more detail about why/how the particular sub-typing tasks shown in Table 1 were chosen (i.e., NSCLC/RCC/Uterine). For example, why not Glioblastoma vs. lower-grade glioma, differentiating the GI adenocarcinomas, or Sarcoma vs. poorly differentiated adenocarcinoma?
    3. Should include source code for running experiments (see below), but again, not a critical weakness, as the description of experiments in the paper itself seems adequate.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Should include source code for implementing and running the models/experiments

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the contribution seems a bit incremental, every part of the paper is well-written and well-justified, and it is a nice, novel contribution to a growing body of literature demonstrating the benefits of synthetic data for OOD generalization in pathology AI models. Acceptable as is, nice paper!

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all three Reviewers for their thoughtful feedback and their constructive criticism.

R1: We used TSS as our sole conditioning variable as a proof of concept and are currently extending MeDi to additional metadata, with promising initial results. We agree that FID alone does not fully capture clinical realism. Although this submission did not include a formal reader study, Niehues et al. [1] demonstrated that diffusion models with lower FID scores receive higher realism and diagnostic-utility ratings from expert pathologists across nine colorectal tissue classes, suggesting that FID may serve as a useful proxy, although this requires further validation. A systematic evaluation of the correlation between FID and expert ratings, especially under metadata shifts, remains an important avenue for future work; we will note this in the conclusion.
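
For reference, FID compares Gaussian fits to the feature distributions of real and generated images. Below is a minimal sketch of the standard computation (a hypothetical helper; the paper's exact implementation and feature extractor may differ):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between Gaussians fitted to real (r) and
    generated (g) image features:
    ||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 * sqrt(sigma_r @ sigma_g))."""
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny
        covmean = covmean.real     # imaginary parts; discard them
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```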

R2: Many thanks for pointing out the missing justification for the choice of datasets in our experiments. We selected the NSCLC, RCC, and Uterine subtyping tasks to reflect three different imbalance regimes and metadata–class relationships within TCGA-UT. NSCLC pairs two common subtypes (16,460 LUAD vs. 16,560 LUSC patches) that are similarly distributed across dozens of centers, testing MeDi's ability to rebalance two well-populated classes under moderate site bias. RCC introduces an imbalanced multi-class setting (11,650 clear cell vs. 6,790 papillary vs. 2,460 chromophobe patches) where less common subtypes are at risk of being overshadowed. The Uterine task contrasts an infrequent with a frequent class (2,120 carcinosarcoma vs. 12,480 endometrial carcinoma patches), examining whether metadata conditioning can generate realistic examples for scarce subpopulations. We will add this rationale to Section 4.1 of the revised manuscript. As discussed in the conclusion, expanding to larger-scale training regimes and additional benchmark datasets is a key direction, and we plan to include more extensive experiments in future work. Finally, to ensure full reproducibility, we will release a GitHub repository and include a link in our manuscript.
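
As a hypothetical illustration of such targeted rebalancing (not taken from the paper), one could select (class, site) combinations for synthesis inversely proportional to their frequency in the training set, so that generation concentrates on scarce subpopulations:

```python
from collections import Counter
import random

def sample_rare_combinations(train_metadata, n_samples):
    """Draw (class, site) pairs with probability inversely proportional
    to their training frequency, so rare subgroups are favored."""
    counts = Counter(train_metadata)            # {(class, site): count}
    pairs = list(counts)
    weights = [1.0 / counts[p] for p in pairs]  # rare pairs weigh more
    return random.choices(pairs, weights=weights, k=n_samples)

# e.g. train_metadata = [("LUAD", "TSS-A"), ("LUAD", "TSS-A"), ("LUSC", "TSS-B")]
# Each sampled pair would then condition the diffusion sampler.
```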

R3: While synthetic image generation for data augmentation has indeed been explored before, our work is the first to systematically enable the embedding of arbitrary metadata at scale, covering TCGA-UT's 32 cancer types and 184 medical centers, and demonstrates that MeDi can faithfully interpolate underrepresented and highly imbalanced subgroups and even extrapolate to unseen metadata–class combinations. This scalability, combined with a clear, modular conditioning framework, constitutes our novel contribution. We also appreciate your point about downstream evaluation. Linear probing is a common evaluation strategy for pathology foundation models, as the community strives towards general models that can be adapted to new downstream tasks as cheaply as possible. At the same time, we agree that full model fine-tuning would offer additional insight, and we will note this as a relevant research direction for future work in the conclusion of our manuscript. Finally, we agree that FID should be regarded only as a global sanity check, as it primarily quantifies the closeness between synthetic and real data. We emphasize that our primary measure of success remains downstream classification performance. While we think that our initial results on three datasets, as presented in Table 1, successfully demonstrate the potential of our method, we plan to run more extensive experiments in future work.
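
For context, linear probing trains only a linear classifier on features from a frozen encoder; the backbone is never updated. A minimal sketch with scikit-learn (hypothetical names, not the paper's exact protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(feats_train, y_train, feats_test, y_test):
    """Fit a linear head on frozen foundation-model features and
    report test accuracy; the feature extractor stays fixed."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(feats_train, y_train)
    return clf.score(feats_test, y_test)

# feats_* would come from a frozen pathology encoder applied to real
# patches plus MeDi-generated synthetic patches for rare subgroups.
```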

[1] Niehues, J.M., Müller-Franzes, G., Schirris, Y., Wagner, S.J., Jendrusch, M., Kloor, M., Pearson, A.T., Muti, H.S., Hewitt, K.J., Veldhuizen, G.P., et al.: Using histopathology latent diffusion models as privacy-preserving dataset augmenters improves downstream classification performance. Computers in Biology and Medicine 175, 108410 (2024)




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


