Abstract
Over the past decades, computer-aided diagnosis tools for breast cancer have been developed to enhance screening procedures, yet their clinical adoption remains challenged by data variability and inherent biases. Although foundation models (FMs) have recently demonstrated impressive generalizability and transfer learning capabilities by leveraging vast and diverse datasets, their performance can be undermined by spurious correlations that arise from variations in image quality, labeling uncertainty, and sensitive patient attributes. In this work, we explore the fairness and bias of FMs for breast mammography classification by leveraging a large pool of datasets from diverse sources—including data from underrepresented regions and an in-house dataset. Our extensive experiments show that while modality-specific pre-training of FMs enhances performance, classifiers trained on features from individual datasets fail to generalize across domains. Aggregating datasets improves overall performance, yet does not fully mitigate biases, leading to significant disparities across under-represented subgroups such as extreme breast densities and age groups. Furthermore, while domain-adaptation strategies can reduce these disparities, they often incur a performance trade-off. In contrast, fairness-aware techniques yield more stable and equitable performance across subgroups. These findings underscore the necessity of incorporating rigorous fairness evaluations and mitigation strategies into FM-based models to foster inclusive and generalizable AI.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3881_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{GerElo_Bias_MICCAI2025,
author = { Germani, Elodie and Selin-Türk, Ilayda and Zeineddine, Fatima and Mourad, Charbel and Albarqouni, Shadi},
title = {{Bias and Generalizability of Foundation Models across Datasets in Breast Mammography}},
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {24--34}
}
Reviews
Review #1
- Please describe the contribution of the paper
- A broad study of the generalization of foundation models on breast mammography classification, which provides the useful insight that pre-training on large, diverse datasets does not guarantee generalizable and equitable performance. The demonstration of the utility of domain-specific features, and of the impact of underlying shortcuts and dataset fingerprinting, is useful for model development in clinical contexts.
- Demonstrates the need for fairness-/shortcut-aware optimization on top of aggregation, with results from 10 datasets including an in-house dataset.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Comprehensive analysis: The paper presents a strong evaluation of domain shift and debiasing/shortcut-removal techniques across 10 breast mammography datasets from various underrepresented regions, and provides a broad set of results that are useful for clinical deployment.
- Real-world clinical relevance: The individual vs. unified settings reflect a real-life scenario and issue faced in training clinical models for real-world use. The finding that additional bias mitigation on top of aggregation is necessary is a useful result for a community that often treats scale/aggregation as a catch-all solution.
- In-house data benchmarks: The addition of the in-house LBMD dataset provides a real-world setting for comparison and shows generalization across subgroups vs. in-distribution performance. Sidebar, out of curiosity: I noticed that there are almost no malignant tumours. Is this closer to a random screening-clinic setting, as opposed to the other datasets? It would be worthwhile to mention this if that is the case.
- Fig 1 does a good job of communicating the fact that while strong in-domain foundation models provide good embeddings, dataset fingerprints still exist within the embeddings.
- Orthogonal to existing benchmarks such as SubPopBench (https://arxiv.org/pdf/2302.12254), which look at single datasets and more methods; this is useful from a clinical angle.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited evaluation: While the paper offers a broad range of results using existing methods, it does not mention whether any of these models are clinically useful (this would be worth a one-liner in the discussion). Threshold-independent metrics such as AUROC and AUPRC would help in making that determination (see the sketch after this list).
- Subgroup definition clarity: The paper could benefit from a few edits for clarity and readability. For instance: (1) it would be good to state clearly what constitutes the subgroups (e.g., density × age extremes) up front, along with the definitions in the methodology section; (2) the small font sizes and organization of Fig. 2 make the results a bit difficult to follow.
- Density and age affect the accuracy and sensitivity of mammography (https://www.cancercareontario.ca/en/guidelines-advice/cancer-continuum/screening/breast-density-provider-information, https://ajronline.org/doi/10.2214/AJR.10.5442). This means there may be images for which, even if the biopsy was positive or negative, a diagnosis cannot be made from the imaging alone. Since this work targets clinical application, it would be useful to add a few lines to the discussion about how to disentangle this effect from shortcut learning / dataset-shift issues.
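For illustration, a minimal sketch of the suggested threshold-independent metrics, assuming scikit-learn is available; the label and score arrays below are hypothetical stand-ins for the classifier's per-dataset outputs, not results from the paper:

```python
# Minimal sketch (assumes scikit-learn); y_true / y_score are toy data.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                # binary biopsy labels
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(200), 0.0, 1.0)  # toy scores

# Both metrics integrate over all decision thresholds, so they complement
# threshold-dependent measures such as accuracy or sensitivity.
print(f"AUROC: {roc_auc_score(y_true, y_score):.3f}")
print(f"AUPRC: {average_precision_score(y_true, y_score):.3f}")
```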
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, a good paper with a broad set of results providing guidance for AI model development in a niche clinical space. However, a few items (noted under weaknesses) require additional clarity, and a broader discussion of them would benefit the community.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper investigates the bias and generalizability of five foundation models (MammoCLIP, MedCLIP, GLORIA, CLIP and DINOv2) for breast cancer detection and breast density classification across a wide range of mammography datasets, with a focus on under-represented regions. To this end, fairness-aware techniques (DANN, FairDisCO, FADES, GroupDRO and MoE) are benchmarked. The results indicate that modality-specific FMs like MammoCLIP outperform the others. Significant disparities remain across demographic subgroups and datasets; fairness-aware methods, particularly GroupDRO, are shown to reduce these disparities effectively, though partly at the cost of accuracy.
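For context, a minimal sketch of the generic online GroupDRO reweighting (Sagawa et al., 2020) that is benchmarked here, assuming PyTorch; this illustrates the technique in general, not the authors' implementation, and the function and variable names are hypothetical:

```python
# Minimal sketch of the generic online GroupDRO update (an assumption,
# not the paper's code); `groupdro_step` and its arguments are hypothetical.
import torch

def groupdro_step(per_sample_loss, group_ids, q, eta=0.01):
    """Return a loss reweighted toward the worst-performing group.

    per_sample_loss: (N,) losses for one batch
    group_ids:       (N,) integer subgroup index per sample
    q:               (G,) current group weights (a probability vector)
    """
    num_groups = q.numel()
    # Mean loss per group; groups absent from the batch contribute zero.
    group_losses = torch.stack([
        per_sample_loss[group_ids == g].mean()
        if (group_ids == g).any() else per_sample_loss.new_zeros(())
        for g in range(num_groups)
    ])
    # Exponentiated-gradient ascent on q: up-weight the high-loss groups.
    q = q * torch.exp(eta * group_losses.detach())
    q = q / q.sum()
    return (q * group_losses).sum(), q
```

At each training step the caller backpropagates the returned loss and carries the updated q into the next batch, so subgroups with persistently high loss accumulate weight, which is how the method targets worst-group (e.g., extreme density or age) performance.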
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper addresses an important and timely problem – bias and generalizability of foundation models in the medical domain – by providing a rigorous evaluation across ten diverse datasets. It contributes a valuable empirical analysis using publicly available and under-represented regional datasets. The paper is well-organized and the methodology is clearly described, including implementation details and statistical testing.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper’s primary limitation lies in the relatively shallow analysis of why certain FMs and bias mitigation methods underperform – particularly in the diagnosis task. While GroupDRO stands out as effective for breast density classification, the discussion on why domain-adaptation strategies fail to yield benefits could be deepened. Moreover, MedCLIP’s poor performance is noted but not investigated in detail. Generally, the paper lists results rather superficially and reads more as a status report than a detailed analysis of the underlying causes behind model performance.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper offers an empirical contribution to the field of fairness in medical AI, more specifically for mammography, and is of interest to the MICCAI community. Although the novelty lies only in the systematic benchmarking rather than in a new methodology, the inclusion of underrepresented datasets and the comparison of bias mitigation strategies provide important insights. With increased detail in the discussion of the causes of bias and the limits in performance, this work could be elevated further. Nonetheless, it represents a well-motivated and practically important study that is worthy of publication.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper discusses the recent and interesting topic of bias and generalizability of foundation models (FMs), specifically in the context of mammograms. It comprehensively investigates the biases inherent in FMs across ten diverse datasets from different sources. Furthermore, the study explores different bias-mitigation strategies, such as domain adaptation and fairness-aware techniques, and evaluates their effectiveness in reducing disparities across subgroups (e.g., extreme breast densities and age groups).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper leverages a wide array of datasets, including data from underrepresented regions, ensuring a more globally representative analysis.
It offers a detailed evaluation of how FMs perform across datasets and of how well domain adaptation works.
The exploration of domain adaptation and fairness-aware strategies is particularly valuable. The paper highlights the trade-offs between improved fairness and model performance.
The paper is well-written and easy to follow, with insightful discussions that help the reader understand the key findings and implications of the study.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
To my understanding, the unified method in the study is trained on all datasets, meaning that all datasets belong to the internal (seen) set. As a result, the study does not evaluate how these bias and domain-adaptation methods perform on external (unseen) datasets, which could be important for assessing the generalizability of the methods.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper provides valuable insights into the biases and generalizability of foundation models in mammograms, especially when applied to diverse datasets. The authors’ exploration of fairness and domain-adaptation strategies is crucial for making AI more equitable and usable in clinical practice.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We would like to thank the reviewers for their positive feedback and thoughtful comments. To strengthen our discussion, we will enrich the manuscript with insights on the limitations of domain-adaptation techniques and on the clinical impact of these biased classifiers.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A