Abstract

Deep neural networks (DNNs) have demonstrated remarkable success in medical imaging, yet their real-world deployment remains challenging due to spurious correlations, where models can learn non-clinical features instead of meaningful medical patterns. Existing medical imaging datasets are not designed to systematically study this issue, largely due to restrictive licensing and limited supplementary patient data. To address this gap, we introduce SpurBreast, a curated breast MRI dataset that intentionally incorporates spurious correlations to evaluate their impact on model performance. Analyzing over 100 features involving patient, device, and imaging protocol, we identify two dominant spurious signals: magnetic field strength (a global feature influencing the entire image) and image orientation (a local feature affecting spatial alignment). Through controlled dataset splits, we demonstrate that DNNs can exploit these non-clinical signals, achieving high validation accuracy while failing to generalize to unbiased test data. Alongside these two datasets containing spurious correlations, we also provide benchmark datasets without spurious correlations, allowing researchers to systematically investigate clinically relevant and irrelevant features, uncertainty estimation, adversarial robustness, and generalization strategies. Models and datasets are available at https://github.com/utkuozbulak/spurbreast.
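The "controlled dataset splits" mentioned above can be illustrated with a minimal sketch: in the biased training set, the class label is perfectly aligned with a non-clinical attribute (e.g., magnetic field strength), while the test set breaks that alignment so the shortcut no longer helps. The field names below (`label`, `field_strength`) are illustrative assumptions, not the actual SpurBreast schema.

```python
def make_biased_split(records):
    """Put (positive, 1.5T) and (negative, 3.0T) cases in the training set,
    and the remaining combinations in the test set, where the shortcut fails."""
    train, test = [], []
    for r in records:
        pos = r["label"] == 1
        strong = r["field_strength"] == 3.0
        if pos != strong:        # label aligned with the spurious attribute
            train.append(r)
        else:                    # alignment broken: the shortcut is useless here
            test.append(r)
    return train, test

# Tiny illustrative metadata table (hypothetical values)
records = [
    {"label": 1, "field_strength": 1.5},
    {"label": 0, "field_strength": 3.0},
    {"label": 1, "field_strength": 3.0},
    {"label": 0, "field_strength": 1.5},
]
train, test = make_biased_split(records)
```

A model trained on `train` can reach high accuracy by reading the field-strength signature alone, yet will fail on `test`, which is the failure mode the dataset is designed to expose.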

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0408_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/utkuozbulak/spurbreast

Link to the Dataset(s)

https://github.com/utkuozbulak/spurbreast

BibTex

@InProceedings{WonJon_SpurBreast_MICCAI2025,
        author = { Won, Jong Bum and De Neve, Wesley and Vankerschaver, Joris and Ozbulak, Utku},
        title = { { SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {562 -- 571}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors study the phenomenon of erroneous correlations that a deep learning algorithm can extract when training for a target task. More specifically, they focus on breast cancer detection in MRI and study the influence of features such as ethnicity, surgery type, and others. The authors evaluate state-of-the-art classification networks (ViT and ResNet50) and report classification metrics. They propose to release the dataset to the community, enabling reproducible research.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors focus on the relevant issue of underlying correlations learned during deep learning training. Such work has the potential to contribute to explainability research, providing more relevant hints to clinicians and engineers working with deep learning models.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the authors address a clinically relevant phenomenon, the study lacks clarity and the insights appear to have limited impact. That is, the authors perform several experiments, training deep learning algorithms with and without introducing false correlations. However, the number of samples differs across experiments, making the reported numbers difficult to compare. Moreover, among the suggested features, only a few (magnetic field strength and image orientation) hint at erroneous correlations, which appears rather obvious. This may leave the reader puzzled about the outcome of the study.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I would suggest the authors revise the experimental plan to allow for a more comprehensive setup. In 3.1, the authors state that they select tumor-positive images from Caucasian patients and tumor-negative images from Asian patients. Hence, each feature appears to be studied in a single direction. For a better understanding of the correlation, it might be more relevant to perform the experiments in several directions (e.g., negative from Caucasian and positive from Asian or African). In such a scenario, the average metric with its standard deviation could be reported. Moreover, the different combinations could be pointed out more explicitly, showing which pairs are more likely to be erroneously correlated.

    On a minor note, I would suggest revising the writing. There is some verbosity and repetitiveness; e.g., the term “spurious” appears too many times, making the text harder to read.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the addressed topic is relevant, the experiments as carried out prevent a good understanding of the phenomenon under study, which prevents me from accepting the paper. However, if the authors revise the experiments and results, the rating may be revised.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This article proposes a curated dataset containing 3D MRI scans from 900 patients, acquired on either 1.5T or 3T scanners. The authors also propose a study of spurious correlations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed dataset comprises many patients of different ages and ethnicities, scanned with different acquisition systems, making it diverse. The idea of analysing the influence of spurious correlations is also very interesting and informative.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There is a mention of the dataset licence but nothing on how to obtain the data. There is no statistical analysis in the model performance evaluation. This is a weakness for two reasons: first, a conclusion based on quantitative results should be supported by a statistical analysis; second, it would be interesting to see how the different features change the robustness of a model by also studying the standard deviation/variance.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Some illustrations are hard to understand. The buffer zone is not used for the analysis, but it may contain the most interesting slices; the authors should perform a dedicated analysis on them.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This kind of article, presenting a new annotated dataset, is of interest to the community. However, there are some limitations, mainly on the analysis side.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper describes a new public dataset specifically aimed at facilitating studies of spurious correlations (a.k.a. shortcut learning) in AI models for medical imaging. Whilst there is increasing evidence that AI models are susceptible to shortcut learning there is currently a lack of realistic data to enable well-grounded studies into this phenomenon. This paper describes such a dataset based on breast MR images as well as results for training AI models using it that demonstrate shortcut learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and clear. The problem of shortcut learning/spurious correlations is one of the key challenges facing the MIC community at the moment. The dataset will be of great value to other researchers in the field.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There are no major weaknesses IMO, but below I provide a few suggestions for widening the scope of the literature review

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It is great that the data will be made available. Will the trained models also be part of the public release?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think that this paper addresses one of the key challenges facing the MIC community at the moment, namely shortcuts/spurious correlations preventing generalisable and fair performance when models are used in the clinic, so I think that the paper makes a very useful contribution which deserves to be presented at MICCAI to help raise awareness of this issue and promote further research into it.

    The paper is well written, and I have no strong criticisms to make.

    The only relatively minor point I would raise is that I think the paper would be strengthened by a broader discussion of shortcut learning/spurious correlations in the Introduction to make a stronger case that this is something the community should be seeking to address. For example, some further references to act as evidence for the importance of shortcuts/spurious correlations could include:

    • Correlation between surgical skin markings and lesion malignancy in dermatology images: https://doi.org/10.1001/jamadermatol.2019.1735
    • Correlation between chest drains/ECG wires and diagnostic labels in chest X-rays: https://doi.org/10.1007/978-3-031-72787-0_1
    • General correlations between (sex/race) demographics and target labels: https://doi.org/10.1073/pnas.1919012117, https://doi.org/10.1038/s41591-024-02885-z, https://doi.org/10.1007/978-3-031-45249-9_22 (The last one is interesting as it is for MRI/breast cancer and so has similarities with the ethnicity experiment the authors perform.)

    Also, when mentioning the availability of synthetic datasets for investigating spurious correlations this might be a good reference: https://doi.org/10.1016/j.ebiom.2024.105501

    Finally, one minor suggestion with regard to wording. In Section 3.1, Training and Validation Datasets: I think the phrase “per patient” is a bit vague and potentially misleading. Do the authors mean to say that the splits were done at the patient level and not the image level? If so, then I suggest replacing this text with “at the patient level”.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers and the meta-reviewer for their constructive and thoughtful feedback. We are pleased that our work was recognized as timely and relevant to the MICCAI community. Below, we respond to key concerns raised across the reviews and clarify the scope and intent of our contribution.

– Response to meta reviewer

We agree with the meta-reviewer’s suggestion and will revise the Introduction and Conclusion to clearly highlight that the main contribution of this work is the release of a new, realistic, and diverse breast MRI dataset, curated specifically to support the study of shortcut learning (spurious correlations) in clinical AI. The experiments included in the paper are intended as illustrative examples of how such biases can manifest in practice. We will explicitly state that the dataset is designed to support broader and more detailed future investigations by the community.

– Response to reviewer feedback

We will take steps to improve the manuscript in light of the reviewers’ valuable comments.

In response to Reviewer 1, we will include additional citations to expand the discussion of shortcut learning and strengthen the motivation in the Introduction. We will also revise the wording in Section 3.1 to avoid ambiguity, replacing ‘per patient’ with ‘at the patient level’. The reviewer’s suggestion regarding the public release of model weights is appreciated, and we confirm that we will share both the dataset and the trained model checkpoints along with supporting documentation.

In response to Reviewer 2, we will clarify that the use of buffer zones was a deliberate design choice to align with prior literature that uses the same type of breast MRI data, where tumors often appear in central slices. This ensures consistency with established practices while placing spurious features in spatially distinct regions. We will also present a straightforward method for obtaining the dataset for future experiments.

In response to Reviewer 3, although we experimented with most candidate features that could plausibly lead to spurious correlations, we were only able to highlight a few illustrative examples (e.g., device type, orientation, ethnicity) to maintain focus. We will revise the text to better explain this trade-off and more explicitly describe which combinations led to the strongest spurious effects. We will also streamline the writing by reducing repetitive phrasing and improving clarity, as suggested.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    I advise the authors to address the comments from Reviewer #3, in particular revising the writing and emphasizing the problem this dataset tries to tackle. In addition, the authors should clearly outline that the main contribution of their paper is the dataset and that the experiments provide an overview of potential investigations that could be run on it. Larger, more detailed studies should be performed by other researchers.


