Abstract

Segmentation foundation models, e.g., the Segment Anything Model (SAM), have attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance in terms of overall accuracy and efficiency, yet little attention has been given to fairness considerations. This oversight raises questions about potential performance biases that could mirror those found in task-specific deep learning models such as nnU-Net. In this paper, we explore the fairness dilemma concerning large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans of organs including the liver, kidneys, spleen, lungs, and aorta from a total of 1054 healthy subjects with expert segmentations. Crucially, we document demographic details such as gender, age, and body mass index (BMI) for each subject to facilitate a nuanced fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM, and SAT models, to evaluate segmentation efficacy across different demographic groups and identify disparities. Our comprehensive analysis, which accounts for various confounding factors, reveals significant fairness concerns within these foundation models. Moreover, our findings highlight not only disparities in overall segmentation metrics, such as the Dice Similarity Coefficient, but also significant variations in the spatial distribution of segmentation errors, offering empirical evidence of the nuanced challenges in ensuring fairness in medical image segmentation.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1289_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1289_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_An_MICCAI2024,
        author = { Li, Qing and Zhang, Yizhe and Li, Yan and Lyu, Jun and Liu, Meng and Sun, Longyu and Sun, Mengting and Li, Qirong and Mao, Wenyue and Wu, Xinran and Zhang, Yajing and Chu, Yinghua and Wang, Shuo and Wang, Chengyan},
        title = { { An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study investigates the fairness of three segmentation foundation models—SAM, medical SAM, and SAT—in terms of age, gender, and BMI variables. The authors evaluate publicly available trained models on two large in-house datasets, including abdominal MRI and thorax CT scans. Additionally, they train an nnU-Net model on a subset of their data to establish an upper bound. Segmentation performance is assessed using Dice scores, followed by subgroup comparisons via t-tests and Pearson correlation. The findings reveal significant fairness issues across most organs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Addresses an important topic with potential value to the community - fairness of segmentation foundation models with respect to the variables age, gender and BMI.
    • Utilizes a large and diverse clinical dataset, which makes the study potentially clinically relevant
    • Examines fairness within sub-regions of organs
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Sole reliance on the Dice score as the primary segmentation metric may limit a comprehensive evaluation of fairness issues. The Dice score is known to be biased towards larger structures, potentially leading to skewed results. This limitation could obscure the true extent of fairness concerns within the models evaluated (see the detailed comments for more).
    • The absence of information regarding the distribution of subjects across age and BMI groups within the dataset hinders the interpretation of segmentation performance results
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • The study evaluates pre-trained foundation models on the authors’ own in-house dataset; therefore, the exact results cannot be reproduced, but the evaluation procedures are clearly described.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The authors should introduce the SAT abbreviation in the abstract for clarity
    • In the introduction (page 2), the authors could incorporate citations to studies documenting biases in task-specific segmentation models, such as U-Net, to strengthen their claim

    • The authors provide considerable detail about their test dataset acquired from a local hospital. Additionally, it would be interesting to know how many subjects fall into each of the different age and BMI groups that are later evaluated. For example, the better Dice scores observed for females on most organs might be due to the training data being 60% female.

    • If the authors have any knowledge about the demographics of the training data for Med SAM and SAT, this information would also be interesting.

    • In the BMI experiment, the authors note that nnU-Net shows decreased fairness for kidney segmentation of overweight patients. Interestingly, the foundation models show better performance for the overweight group. Can the authors elaborate on this? Is the overweight group underrepresented in the nnU-Net training data?

    • The authors should specify the BMI ranges for underweight, healthy, and overweight groups for clarity.

    • My main concern with this study, as highlighted in the weaknesses, is its sole reliance on the Dice score for segmentation performance evaluation. This score’s bias towards larger structures implies potential advantages for certain demographics, such as males and those with higher BMIs, while disadvantaging others like females and underweight individuals due to organ size variations. For instance, liver volume typically decreases with age, increases with BMI, and is larger in males. In section 3.2 on page 8, the authors note poorer segmentation performance for underweight females using nnUNet. Understanding if it reflects underweight females’ smaller liver sizes or their underrepresentation in the training data is crucial for interpreting segmentation disparities accurately. Additional information on the BMI distribution in the training data would help address this uncertainty. To mitigate this bias and offer a more thorough evaluation, the authors should consider incorporating additional segmentation metrics. For example, boundary metrics like average symmetric surface distance or Hausdorff distance, along with assessments of connected components and segmentation holes. The “Metrics Reloaded” paper offers a comprehensive overview of such metrics. [1]

    [1] Maier-Hein, Lena, et al. “Metrics reloaded: recommendations for image analysis validation.” Nature methods (2024): 1-18.
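    As a concrete illustration of the boundary metrics suggested above, here is a minimal sketch (not taken from the paper) of ASSD and HD95 for binary 3D masks using only NumPy and SciPy; the function names, mask arguments, and default voxel spacing are placeholder assumptions.

        import numpy as np
        from scipy.ndimage import binary_erosion, distance_transform_edt

        def _surface_dists(gt, pred, spacing=(1.0, 1.0, 1.0)):
            # one-voxel-thick surfaces of each binary mask
            gt, pred = gt.astype(bool), pred.astype(bool)
            gt_surf = gt ^ binary_erosion(gt)
            pred_surf = pred ^ binary_erosion(pred)
            # distance of every voxel to the nearest surface voxel of the other mask
            d_to_gt = distance_transform_edt(~gt_surf, sampling=spacing)
            d_to_pred = distance_transform_edt(~pred_surf, sampling=spacing)
            return d_to_gt[pred_surf], d_to_pred[gt_surf]

        def assd(gt, pred, spacing=(1.0, 1.0, 1.0)):
            # average symmetric surface distance (assumes both masks are non-empty)
            d_pg, d_gp = _surface_dists(gt, pred, spacing)
            return (d_pg.sum() + d_gp.sum()) / (d_pg.size + d_gp.size)

        def hd95(gt, pred, spacing=(1.0, 1.0, 1.0)):
            # 95th-percentile (robust) Hausdorff distance
            d_pg, d_gp = _surface_dists(gt, pred, spacing)
            return max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))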

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I’m not sure how much of the observed fairness problem can be attributed to the choice of the Dice score as the only metric. Therefore, I’m leaning towards rejection.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors’ engagement with my feedback and their efforts to clarify their perspective on the Dice score. However, I still maintain my concern regarding its bias towards larger objects in segmentation tasks. While resizing both the segmentation and ground truth objects to half their original size may not alter the Dice score, the impact of misclassified pixels differs for smaller and larger organs. A single misclassified pixel can disproportionately affect the Dice score for smaller organs compared to larger ones. Therefore, employing multiple metrics would provide a more comprehensive evaluation in segmentation studies.
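    A small numerical sketch of this point (purely illustrative, with hypothetical voxel counts rather than numbers from the paper): the same fixed number of misclassified voxels costs a small organ far more Dice than a large one.

        import numpy as np

        def dice(gt, pred):
            inter = np.logical_and(gt, pred).sum()
            return 2.0 * inter / (gt.sum() + pred.sum())

        for n_fg in (1_000, 100_000):          # "small" vs. "large" organ (foreground voxel count)
            gt = np.zeros(1_000_000, dtype=bool)
            gt[:n_fg] = True
            pred = gt.copy()
            pred[n_fg:n_fg + 50] = True        # identical error in both cases: 50 false-positive voxels
            print(n_fg, round(dice(gt, pred), 4))
        # small organ: Dice ~0.9756; large organ: Dice ~0.9998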

    Nevertheless, I appreciate the additional information provided on the dataset demographics in the final version, as well as the decision to release the code. This paper addresses an important topic regarding the fairness of foundation models, and despite my concerns, I believe it offers valuable insights to the community.



Review #2

  • Please describe the contribution of the paper

    The authors evaluated 3 state-of-the-art foundation models for medical image segmentation, SAM, MedSAM, and SAT, as well as an in-house-trained nnU-Net, on fairness regarding sex, BMI, and age. The models were evaluated on a dataset of >1000 subjects for multi-organ segmentation in MRI and CT. They found significant performance differences between male and female subjects and across age and BMI groups, mostly for SAM and medical SAM, raising serious concerns about the fairness of these models. They further created a heat map for each patient group (e.g. female) to visualize sub-regions in organs for which the models typically give unsatisfactory results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Relevant topic: Foundation models are a very active field of research. Fairness is a major concern for deep learning models, especially in the medical field, and must be tackled to bring these models into clinical practice.
    2) Comprehensive evaluation: The experiments are convincing and support the point of the study well. The authors used 3 popular foundation models and trained a broadly used task-specific model, the nnU-Net. They evaluated on a large dataset of >1000 subjects, for multi-organ segmentation in MRI and CT, and assessed fairness with respect to sex, age and BMI.
    3) Important results: The authors found significant differences in the performance of the models between the patient groups (e.g. male vs. female), which is important for the community to know, as these models are widely used.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    0) While the paper is overall interesting, the main dataset for evaluation is private and therefore only allows a very narrow, non-reproducible snapshot of the current fairness of the evaluated foundation models.
    1) Limited clarification of the novelty of the study: Please state clearly whether this is the first study to evaluate fairness regarding the demographic features sex, age and BMI in these segmentation models, and discuss your study within the literature. Few related studies were mentioned, but not discussed.
    2) Limited information on the dataset: Please give the number of subjects in each group, as it is only given for male and female, but not for the different age and BMI groups. This is important, as sample sizes affect significance and the dataset is not public. The same holds for the dataset nnU-Net was trained on.
    3) Limited discussion of the results: Why might these significant performance differences exist in the foundation models, and why are they less present in the nnU-Net? How do your results fit within the current literature? What should readers take from the paper?
    4) It is surprising to see that Medical SAM performs worse than SAM. I would have expected some discussion of this performance gap.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Training details of nnUNet missing.
    • Models are underspecified. Which specifications are used?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting insight, but not reproducible due to the use of a private dataset for evaluation.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors’ response. More details regarding the dataset will be provided. However, I still see the private dataset as a concern for studies like this one. My score remains unchanged.



Review #3

  • Please describe the contribution of the paper

    The authors compare fairness in segmentation performance on multiple organs from an in-house dataset across gender, age, and BMI of various models: nn-UNet, SAM, Medical SAM, and SAT. To assess fairness, they use statistical tests to examine significant differences between Dice scores of demographic groups. They also generate distance maps between ground truth and segmentation to qualitatively assess quality between groups.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Writing quality – clear and easy to follow.
    • Detailed description of data collection.
    • Figure 1 is a great representation of experimental pipeline.
    • Distance map visualization is a clever visual interpretability method for segmentation that I haven’t seen before.
    • Comparison between BMI subgroups is also something I haven’t seen before in the context of fairness in segmentation, but I like that the authors include it since it’s intuitive that this attribute would likely have impacts on segmentation performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Missing some important details regarding methods – 1) training data and hyperparameters for nn-UNet, 2) “settings” for foundation models, 3) population statistics for dataset.
    • Potentially unsound statistical test for assessing significant differences between age and BMI groups.

    See comments below for more details.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    • Training set and training hyperparameters for the nn-UNET baseline are not described.
    • Settings for SAM/medical SAM/SAT are not explicitly stated.
    • Imaging protocols are described in detail, but data is not publicly available, limiting direct reproducibility of the study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Major concerns:
    • The training data for the nn-UNet baseline is extremely vague. From Sec. 2.2: “The nnU-Net, on the other hand, is trained on a dataset collected following the same pipeline but during a distinct temporal interval. It is reasonable to assume that the data utilized in training nnU-Net is similar to the test samples, but not precisely from the same data distribution.” What exactly is the size and composition of this training set? What does “temporally distinct” mean? Furthermore, there is no information on how this model was trained (hyperparameters, optimization procedure). This seems important since this is the baseline model that the SAM models are compared to.
    • The authors write that each of the SAM-variant models was tested using the “recommended settings”. Ideally, I think these settings should be specified somewhere in the paper (even supplementary material is fine). At minimum, the authors should provide footnote links to where these models were downloaded from and where to find these “recommended settings”.
    • I would like to see a population statistics table somewhere, especially since the joint attribute analysis (sec. 3.2) only examines two attributes at a time. For example, “females with underweight BMI levels receive significantly worse segmentation results than their male counterparts”, but the reader has no way of knowing whether age is confounding this analysis (e.g., are all underweight females in a specific age range?) since it is not controlled for and there is no table of population statistics. The authors also don’t mention this as a limitation when interpreting this result.
    • Can the authors justify the reason for using Pearson correlation to assess significance between age/BMI and Dice score, instead of using ANOVA? And provide a reference for justifying this if possible. My interpretation is that Pearson correlation captures whether there is a significant linear relationship between segmentation performance and age/BMI group, but NOT necessarily significant differences between groups that do not follow a linear relationship (e.g., comparing aorta segmentation from nn-UNet between the 20-30 and 40-50 age groups seems like it could be significant – if true, ANOVA would pick this up, but since 40-50 is more of an outlier it seems unlikely that Pearson correlation would). Therefore, I don’t think this is the best test to use for assessing “fairness” as defined by significant differences in performance between groups.

    Suggestions:
    • I think that the authors should highlight the joint attribute analysis more (in the intro and/or discussion) because to me, it’s the most interesting result and one that I think others in the fairness community would also be very interested in. I would also encourage the authors to refer to it as “intersectional analysis”, since that is the term that researchers in the community typically use when studying how combinations of demographic attributes impact fairness.
    • I also think that the authors should highlight their distance measure visualization more as a novel interpretability method for segmentation. I haven’t seen this used as an explainability method for segmentation before, and I think it is very intuitive and clever!
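    One plausible reading of the distance-map visualization praised above (a sketch only; the paper's exact construction and any cross-subject alignment step are not specified here) is to weight every mis-segmented voxel by its distance to the ground-truth boundary and then average such maps within a demographic group:

        import numpy as np
        from scipy.ndimage import binary_erosion, distance_transform_edt

        def error_distance_map(gt, pred, spacing=(1.0, 1.0, 1.0)):
            gt, pred = gt.astype(bool), pred.astype(bool)
            gt_surf = gt ^ binary_erosion(gt)                        # ground-truth boundary voxels
            d_to_surf = distance_transform_edt(~gt_surf, sampling=spacing)
            errors = gt ^ pred                                       # false positives and false negatives
            return np.where(errors, d_to_surf, 0.0)                  # distance-weighted error map

        # Averaging these maps over the (spatially aligned) subjects of one group
        # would yield a per-group heat map of where errors tend to occur.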

    Minor comments:
    • SAM and medical SAM are introduced, but not SAT – should briefly explain this model in the introduction as well/how it relates to SAM/medical SAM.
    • Add a legend for colour in box plots (Fig. 2 and supplementary material) to make it easier to interpret.
    • Authors use the term “significantly worse” in sec. 3.2 but don’t do any statistical significance tests for the joint attribute analysis.
    • Authors don’t clarify the threshold for significance or the reasoning for bolding/underlining numbers in the tables.
    • Since the authors discuss segmentation performance in the results, they should specify in sec 2.3 that higher Dice = better.
    • For future reference, I think box plots for all the results would be easier for interpretation compared to tables.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I like this paper as it is easy to read, comprehensive, and important for consideration in the community, especially due to the rise of foundation model studies. I especially like the joint attribute and visualization analysis. It is an empirical comparison study, so not methodologically novel, but I think it is of value to the community. However, I think it is very important for the authors to clarify the training procedure for the baseline UNet, and to justify or modify their choice of using Pearson correlation for assessing significant differences between age/BMI groups.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed my concerns in the rebuttal and have said that they will make the required revisions in the final manuscript.




Author Feedback

We thank all the reviewers for their consensus on the value of this study and their valuable feedback. Below, we provide our responses to their comments.

More data information (R1, R2, R3): The data used in this study was acquired from volunteers without major diseases over a span of 3 years (2019–2022). It was originally collected for studying the baseline statistics of multi-organ phenotypes. Later, we found this data useful for studying segmentation model fairness, which led to this work. In this dataset, a total of 1,056 volunteers participated, comprising 635 females and 421 males (a 6:4 ratio is considered normal in population studies, as females are generally more willing to volunteer). We will include the full population information in the revision.

Data being private (R2, R3):
A major part of the data used to train foundation models comes from public sources. Using private data for model testing ensures that the testing samples were never seen by the foundation models, allowing an accurate assessment of their generalization capability and fairness. In the near future, we plan to set up a server system where researchers can upload their models and test them on our private data for fairness evaluation.

Object size & Dice coefficient (R1): The Dice coefficient (DSC), by its design, does not inherently favor larger objects. For example, if both the ground truth object and the segmentation object are resized by the same factor (e.g., 0.5, half of the original size), the DSC score will remain unchanged. When comparing objects of different shapes and topologies, the DSC might show a preference for one over another. However, this does not apply to our case, as the objects in our study are of the same type, shape, and topology but belong to different demographic groups. For instance, although females generally have smaller livers than males, the livers of both genders are consistent in shape and topology. Consequently, the DSC should introduce minimal to no bias when evaluating liver segmentation across genders. The main finding of this paper is that foundation models exhibit more fairness issues than specialized models. For this purpose, the DSC is more than adequate for the organ types we studied.
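As a toy illustration of the scale-invariance argument above (a hedged sketch with synthetic masks, not an experiment from the paper), downsampling both the ground truth and the prediction by the same factor leaves the DSC essentially unchanged:

    import numpy as np

    def dice(a, b):
        return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

    gt = np.zeros((128, 128), dtype=bool)
    gt[32:96, 32:96] = True                       # a toy "organ"
    pred = np.zeros_like(gt)
    pred[36:100, 32:96] = True                    # the same mask, shifted by 4 voxels

    small_gt, small_pred = gt[::2, ::2], pred[::2, ::2]   # downsample BOTH masks by 2
    print(dice(gt, pred), dice(small_gt, small_pred))     # both print 0.9375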

Comparing to existing studies (R2): To the best of our knowledge, studying the effect of different BMI groups on segmentation performance is novel, and evaluating the fairness of emerging segmentation foundation models is unprecedented. Additionally, our study stands out by covering multiple organs and vessels, which is rarely addressed in existing research. Our findings corroborate previous studies indicating that UNet has fairness issues. Notably, our research reveals that foundation models exhibit more severe fairness problems compared to a specialized model.

On Pearson correlation (R3): Indeed, Pearson correlation measures linear correlation, which can be more stringent than a gentler analysis. Following the suggestion, we used ANOVA to re-evaluate the fairness results for BMI and age in Table 2 and Table 3. We found that, most of the time, the significance (as indicated by p-values) obtained from Pearson correlation and ANOVA closely matched. We will include this information in the revision. As a side note, previous work has utilized Pearson correlation in fairness studies, e.g., Xu Z. et al., 2022.
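For readers who want to run this kind of check on their own results, a minimal SciPy sketch follows; the per-subject Dice scores and group labels below are synthetic placeholders (the study's per-subject data is not public), and only the two tests themselves correspond to what is described above.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group = rng.integers(0, 3, size=300)                    # hypothetical BMI/age bin per subject
    dice = 0.90 - 0.02 * group + rng.normal(0, 0.03, 300)   # hypothetical per-subject Dice scores

    r, p_pearson = stats.pearsonr(group, dice)              # tests a linear trend across groups
    f, p_anova = stats.f_oneway(*(dice[group == g] for g in np.unique(group)))  # tests any group difference
    print(p_pearson, p_anova)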

BMI ranges (R1): Underweight (<18.5), Healthy (18.5–24), Overweight (>24).

SAM and SAT were trained using a large collection of public datasets, most of which did not originally include demographic information. The original SAM generalizes better than Medical SAM on unseen data (i.e., the private data in this study). We will include references to papers that studied bias in UNet. We will place greater emphasis on the joint attribute analysis and the distance measure visualization. The setups of the foundation models and the nnU-Net training details will be further described in the revision. Code will be released to the public upon publication.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper investigates fairness in medical segmentation when using (mainly) foundation models. Interesting, understudied topic. All reviewers are recommending acceptance after the rebuttal phase. Authors should clarify as many things brought up by the reviewers as possible in the final version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper investigates the fairness of foundation models for segmentation. This is an interesting topic. The evaluation is quite comprehensive and reveals some interesting results which are useful to present to the community. However, the authors need to address the concerns regarding clarity and data description in the final version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


