Abstract
Machine learning (ML) models may suffer from significant performance disparities between patient groups.
Identifying such disparities by monitoring performance at a granular level is crucial for deploying ML safely for every patient.
Traditional subgroup analysis based on metadata can expose performance disparities only if the available metadata (e.g., patient sex) sufficiently reflects the main reasons for performance variability, which is rarely the case.
Subgroup discovery techniques that identify cohesive subgroups based on learned feature representations appear as a potential solution: They could expose hidden stratifications and provide more granular subgroup performance reports.
However, subgroup discovery is challenging to evaluate even as a standalone task, as ground-truth stratification labels do not exist in real data. Subgroup discovery has thus neither been applied nor evaluated for the application of subgroup performance monitoring. Here, we apply subgroup discovery for performance monitoring in chest X-ray and skin lesion classification. We propose novel evaluation strategies and show that a simplified subgroup discovery method, without access to classification labels or metadata, can expose larger performance disparities than traditional metadata-based subgroup analysis. We provide the first compelling evidence that subgroup discovery can serve as an important tool for comprehensive performance validation and monitoring of trustworthy AI in medicine.
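For readers who want a concrete picture of the approach, the following is a minimal sketch of a simplified subgroup discovery pipeline of the kind the abstract describes: embed images with a pretrained encoder, reduce the embeddings with PCA, cluster them with a Gaussian mixture model, and report the classifier's performance within each discovered subgroup. The dimensionality, the number of subgroups, and all names below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of subgroup discovery for performance monitoring.
# Assumptions (not the authors' exact setup): embeddings are precomputed,
# PCA to 32 dimensions, 15 subgroups, accuracy as the per-subgroup metric.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def discover_subgroups(embeddings, n_dims=32, n_subgroups=15, seed=0):
    """Cluster image embeddings into candidate subgroups."""
    reduced = PCA(n_components=n_dims, random_state=seed).fit_transform(embeddings)
    gmm = GaussianMixture(n_components=n_subgroups, random_state=seed)
    return gmm.fit_predict(reduced)  # one subgroup id per sample

def per_subgroup_accuracy(subgroups, y_true, y_pred):
    """Accuracy of a fixed classifier within each discovered subgroup."""
    return {int(g): float((y_true[subgroups == g] == y_pred[subgroups == g]).mean())
            for g in np.unique(subgroups)}
```

Note that the clustering step never sees the classification labels or any metadata; only the performance report does, which is what allows the method to surface stratifications that metadata-based analysis would miss.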
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0459_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/alceubissoto/hidden-subgroup-perf
Link to the Dataset(s)
N/A
BibTeX
@InProceedings{BisAlc_Subgroup_MICCAI2025,
author = { Bissoto, Alceu and Hoang, Trung-Dung and Flühmann, Tim and Sun, Susu and Baumgartner, Christian F. and Koch, Lisa M.},
title = { { Subgroup Performance Analysis in Hidden Stratifications } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {606--615}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper evaluates subgroup discovery methods in medical applications. Its contributions include the experimental and evaluation-metric designs used to validate the utility of DOMINO. The authors present interesting results, showing that DOMINO effectively discovers subgroups aligned with the underlying true subgroups.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well-written and well-organized.
- I enjoyed reading it, as it addresses an important and overlooked topic.
- The empirical results are solid and clearly presented.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The abstract is somewhat vague and does not fully capture the manuscript’s contributions. Specifically, the phrase “novel evaluation strategies” could be clarified; presumably, it refers to the experimental design.
- There is limited technical novelty, as the work is primarily an empirical study.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I think this paper is already very well organized and easy to follow. However, if the authors want to further improve it, they could consider the following:
- The paper’s exact contributions are not entirely clear. While it states, “propose novel evaluation metrics and provide the first… results…”, this phrasing feels somewhat vague. After reading, I got the impression that the empirical results and observations are the main contributions, but I’m not entirely sure what unique insights this paper provides. To make this crystal clear, I suggest summarizing the contributions in bullet points and referencing the relevant sections.
- I may have missed this, but I struggled to find details on the exact subgroups discovered by different methods. Specifically, the authors divide samples into 15 groups—how many samples are in each group? If some subgroups have very few samples, their significance might be limited. The current conclusions are reasonable, but a deeper analysis would be helpful.
- I’m slightly puzzled by Fig. 3. If I understand correctly, it shows that the discovered subgroups align with hidden subgroups. However, why are the discovered subgroups represented as dots rather than lines? Do the horizontal positions of the dots carry any meaning?
- Most of the analysis relies on CLIP. I wonder how tightly the conclusions are tied to this method—would they hold with other image feature extractors?
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This work applies a subgroup discovery method to image data for chest X-ray and skin lesion classification to identify performance disparities that are greater than those associated with conventional subgroups based on patient demographics such as age or sex.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is eloquently written with a sound explanation of why traditional subgroup analysis based on patient metadata can fail to identify patient groups with the largest performance disparities.
- The strategy to inject synthetic artifacts correlated with disease labels into the images in order to test the capability of the subgroup discovery method is novel and intelligent.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The real-world application of this work is not laid out convincingly. The paper stops short of making suggestions on how to address performance disparities in subgroups once they have been discovered. Because the subgroups are based on potentially complex image features rather than patient metadata, it seems particularly challenging to address the disparities e.g. by increasing the number of training samples for an underperforming subgroup.
- The paper is missing a solid explanation of why BiomedCLIP and PCA were chosen for image encoding and dimensionality reduction respectively. The choice of image encoder and also the dimensionality reduction technique will have a large impact on the type of features which are extracted from the images and therefore on the subgroups which are identified. This is partly addressed in Section 3.6, but subgroup purity and performance disparity do not directly measure the similarity of subgroups identified by CLIP and BiomedCLIP. It would be nice to see a more extensive comparison between different encoders and dimensionality reduction methods in the results.
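One direct way to quantify the similarity the reviewer asks about would be a clustering-agreement score between the subgroup assignments produced by the two encoders, such as the adjusted Rand index. This is an illustrative suggestion, not an analysis the paper reports.

```python
# Hypothetical agreement check between subgroup assignments from two
# encoders (e.g. CLIP vs. BiomedCLIP): 1.0 means identical partitions,
# values near 0 mean chance-level agreement.
from sklearn.metrics import adjusted_rand_score

def encoder_subgroup_agreement(subgroups_a, subgroups_b):
    return adjusted_rand_score(subgroups_a, subgroups_b)
```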
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The application is interesting and the quality of writing, method and analysis are good. However, the utility of the method is left in question given that there are no suggestions made on how to address the discovered performance disparities.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper addresses the issue of subgroup discovery to highlight performance disparities. The authors adapt a Gaussian Mixture Model to cluster subgroups, which they apply to both a synthetic dataset and a real-world dataset. They find that artificial subgroups in the synthetic dataset can be discovered well, whereas discovered subgroups do not align with labelled metadata for the real-world data, particularly for CheXpert-Plus.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Well-formulated problem, motivation and conclusions. The work proposes a novel metric for measuring subgroup performance (purity), which provides a clear measurement of subgroup discovery. The comparison and evaluation of CLIP and BiomedCLIP resulted in the interesting finding that models trained on natural images can discover subgroups as well as models trained on biomedical images. The methods also highlighted the fact that metadata may not align with the actual subgroups that have disparate performance.
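For concreteness, here is one plausible way such a purity score could be computed: the size-weighted share of each discovered subgroup that belongs to its most frequent ground-truth stratum. This is an assumed definition for illustration; the paper's exact formula may differ.

```python
# Illustrative purity score (assumed definition, may differ from the paper):
# for each discovered subgroup, the fraction of members belonging to its
# most frequent hidden stratum, averaged with subgroup-size weights.
import numpy as np

def subgroup_purity(discovered, hidden):
    """discovered, hidden: non-negative integer label arrays of equal length."""
    purities, sizes = [], []
    for g in np.unique(discovered):
        members = hidden[discovered == g]
        purities.append(np.bincount(members).max() / len(members))
        sizes.append(len(members))
    return float(np.average(purities, weights=sizes))
```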
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Figure 3 with the greyscale dots is not very clear; the dots would be easier to interpret with a colour scale. The authors do not include any statistical tests to determine whether subgroup discovery results in significantly different performance from known subgroups. [Minor] Although the performance gap may not have been used in this context before, it is not a novel metric and has been used in previous work in the field of fairness.
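As a point of reference, the performance gap the reviewer alludes to is commonly computed in the fairness literature as the difference between the best- and worst-performing subgroup on a chosen metric. A sketch under that assumption (using AUC, which requires both classes to be present in every subgroup):

```python
# Assumed definition of the performance gap: max minus min subgroup AUC.
# Each subgroup must contain both classes for roc_auc_score to be defined.
import numpy as np
from sklearn.metrics import roc_auc_score

def performance_gap(subgroups, y_true, y_score):
    aucs = [roc_auc_score(y_true[subgroups == g], y_score[subgroups == g])
            for g in np.unique(subgroups)]
    return max(aucs) - min(aucs)
```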
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper provides a good method for identifying subgroups that do not align with known dataset subgroups. This may be very useful in evaluating the fairness of models once deployed and can be used in fairness monitoring.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank all reviewers for their thorough evaluation of our paper, which provided valuable insights to enhance the manuscript’s clarity and rigor. We appreciate the reviewers’ praise of our work’s “clear, eloquent writing with strong motivation” (R1, R2, R3), “novel methodological contributions: purity metric and innovative use of synthetic artifacts” (R2, R3), and our “solid empirical results and insightful comparative evaluations between CLIP and BiomedCLIP” (R1, R2). Below, we summarize our responses to the reviewers’ main comments.

A. Contributions (R1): To clarify our contributions, we will summarize them in the introduction and will revise the abstract to more explicitly highlight our novel evaluation strategies.

B. Detailed Subgroup Analysis (R1, R2): We will include subgroup sizes to clarify subgroup significance and sample distributions. As suggested, further statistical tests can strengthen our claims and will be conducted in follow-up work.

C. Choice of Feature Extractor (CLIP/BiomedCLIP) (R1, R3): Our reasons for using CLIP models were twofold: 1) we stay within a setting where DOMINO is known to perform well, and 2) it enables cleaner comparisons with domain-specific models, as we fix the architecture and focus on the effects of domain-specific training data. The interesting fact that we reached similar conclusions using both generalist and medical CLIP models motivated our current investigations into two key aspects, aligned with the reviewers’ suggestions: the impact of the feature extractor on the embeddings used in our analysis, and a more detailed comparison of the models themselves. These investigations are ongoing and will be presented in follow-up work.

D. Real-world Application and Mitigation Strategies (R3): We will expand the discussion section, addressing the practical utility of our findings.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The paper presents a novel method/metric for subgroup discovery to identify performance disparities for subgroups not defined by available metadata. Although the technical novelty is somewhat limited (mostly an application of a slightly modified version of DOMINO), the topic is important, the paper is easy to read, and the evaluation is carried out nicely. The authors are encouraged to better clarify their exact contributions, as suggested by the reviewers.