Abstract

In the era of Foundation Models’ (FMs) rising prominence in AI, our study addresses the challenge of biases in medical images when the model operates as a black box (e.g., via an FM API), particularly spurious correlations between pixels and sensitive attributes. Traditional bias-mitigation methods face limitations here due to restricted access to web-hosted FMs and the difficulty of addressing the underlying bias encoded within the FM API. We propose a D(ebiased) N(oise) E(diting) strategy, termed DNE, which generates debiased noise to mask such spurious correlations. DNE is capable of mitigating bias both within the FM API embeddings and within the images themselves. Furthermore, DNE is applicable to both white-box and black-box FM APIs; for the latter, where gradients are inaccessible, we introduce G(reedy) Z(eroth-order) O(ptimization) (GeZO). Our pipeline enables fairness-aware image editing that can be applied across various medical contexts without requiring direct model manipulation or significant computational resources. Our empirical results demonstrate the method’s effectiveness in maintaining fairness and utility across different patient groups and diseases. In the era of AI-driven medicine, this work contributes to making healthcare diagnostics more equitable, showcasing a practical solution for bias mitigation in pre-trained image FMs.
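
The paper contains the authoritative formulation; as a rough illustration of the white-box idea described above (one shared noise tensor optimized to confuse a sensitive-attribute head while preserving disease-classification utility), here is a minimal PyTorch sketch. All names (fm_encoder, sa_head, disease_head, lam) and the tiny stand-in encoder are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

# Tiny stand-ins: in the paper's setting the encoder would be a frozen medical FM.
fm_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 128))
sa_head = nn.Linear(128, 2)       # sensitive-attribute (SA) classifier
disease_head = nn.Linear(128, 2)  # downstream disease classifier
for p in fm_encoder.parameters():
    p.requires_grad_(False)

delta = torch.zeros(1, 3, 224, 224, requires_grad=True)  # one noise shared by all images
optimizer = torch.optim.Adam([delta], lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 1.0  # assumed trade-off weight between SA confusion and utility

def dne_step(x, y_disease, a_sensitive):
    # One step: keep the disease signal, destroy the SA signal (negated term).
    z = fm_encoder(x + delta)
    loss = ce(disease_head(z), y_disease) - lam * ce(sa_head(z), a_sensitive)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy batch of random "images" with binary disease labels and sensitive attributes.
x = torch.randn(8, 3, 224, 224)
dne_step(x, torch.randint(0, 2, (8,)), torch.randint(0, 2, (8,)))

Minimizing the disease loss while maximizing the negated SA loss pushes the shared noise toward masking pixel patterns spuriously correlated with the sensitive attribute.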

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2733_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2733_supp.pdf

Link to the Code Repository

https://github.com/ubc-tea/DNE-foundation-model-fairness

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Jin_Debiased_MICCAI2024,
        author = { Jin, Ruinan and Deng, Wenlong and Chen, Minghui and Li, Xiaoxiao},
        title = { { Debiased Noise Editing on Foundation Models for Fair Medical Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper studies the fairness of predictions for foundation models. Arguing that some foundation models may contain biases against certain groups, the paper proposes a perturbation-based debiasing method to alleviate these concerns.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper studies a relevant and understudied field of research. The direction of the research is clear and the reasoning is justified appropriately.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In the introduction, the authors provide a number of examples of medical foundation model APIs (such as Google MedLM, Voyage.ai, and ChatGPT) and justify the research based on those frameworks. Having read this, I expected the authors to conduct experiments on these frameworks for a well-designed paper. Unfortunately, the authors use standard pretrained ViTs. Like much other research that uses pretrained models, they discard the final layer and retrain it on the selected dataset for their experiments. This means that the justification for the research and the research methodology/output conflict substantially, and the claims made by the authors are weak at best and incorrect at worst, since they do not use any of the API services that are available.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should use one of the APIs they cite in the introduction for their experiments and report results based on it.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a method for bias mitigation when using foundation models to classify medical images. Specifically, the authors proposed a universal debiased editing strategy that adds noise to the training data to deceive the sensitive-attribute classifier and to train a fair disease classifier. The authors also proposed an optimization strategy for black-box foundation models. The proposed method is evaluated on the CheXpert dataset for classifying three diseases.
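
For the black-box case the paper proposes GeZO; as a hedged illustration of how a greedy zeroth-order scheme can update the noise using forward queries only, consider the sketch below. This is a generic greedy random-search step, not the paper's exact algorithm; `loss_fn` stands in for a full forward pass through the black-box FM API and the downstream heads.

import torch

def gezo_step(delta, loss_fn, n_candidates=8, step=1e-2):
    # Greedy zeroth-order step (generic sketch, not the paper's exact GeZO):
    # propose random perturbations, query the black-box loss, keep the best.
    best_delta, best_loss = delta, loss_fn(delta)
    for _ in range(n_candidates):
        cand = delta + step * torch.randn_like(delta)
        cand_loss = loss_fn(cand)  # forward queries only; no gradients needed
        if cand_loss < best_loss:
            best_delta, best_loss = cand, cand_loss
    return best_delta, best_loss

# Toy stand-in objective for "loss of the downstream classifier given edited inputs".
target = torch.randn(16)
loss_fn = lambda d: float(((d - target) ** 2).sum())
delta = torch.zeros(16)
for _ in range(200):
    delta, loss = gezo_step(delta, loss_fn)  # loss shrinks as delta approaches target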

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Debiasing foundation models is an interesting topic. Adding noise to the training data to debias the classifier is a novel idea.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The use of the term “universal” in describing the method may lead to misconceptions about its scope. Although the authors clarify its meaning (in terms of applicability across various classification tasks) in a footnote, it is recommended that they reconsider the terminology to prevent potential confusion.

    The evaluation of the proposed method is limited to the Chexpert dataset, which may restrict its generalizability across different datasets.

    The performance of the proposed methods, UDE and UDEZeGO, is not consistently superior across all conditions. While UDE performs best for Pleural Effusion, UDEZeGO excels in Pneumonia classification. Published methods achieved the best performance in classifying Edema. The paper should offer guidance on choosing the appropriate method based on specific circumstances or disease characteristics.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper specifically addresses spurious correlation bias. Can the method mitigate other types of biases such as underrepresentation bias?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Experiments were conducted on only one dataset and three diseases, raising doubts about the model’s performance on the remaining diseases in the same dataset. Results were mixed compared to state-of-the-art methods.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    As a foundation model, it is expected to show generalization across multiple tasks, but the authors only experimented on a few classification tasks. The “universal” in the authors’ definition refers to a certain noise that works for many patients. Without validation on other datasets, it is hard to believe the method is universal. So I keep my original score.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a universal debiased editing method, which aims to mitigate unfairness issues in chest X-ray classification foundation models. Additional optimization methods are extended for FM with inaccessible gradients.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper focuses on a novel research area, i.e. unfairness mitigation in classification FMs, as most previous studies were conducted on CNNs / Transformers. The use of a consistent noise vector on different images is also interesting, as shown in Appendix Fig. 4, which helps us find potential spurious relations between input and output. Besides, the consideration of FMs where the gradient is inaccessible is more in line with reality, and the improved version of UDE seems useful for handling this situation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    First, some of the descriptions in the manuscript are different from the common usage, please see below for details.

    Second, the authors use “Universal” to refer to their experiment on three illnesses in only one dataset, i.e. CheXpert. AFAIK, the “universal” in medical image analysis is usually used to describe one model on multiple datasets (such as [1-3]). Thus I think the word is not that suitable (although the footnote on page 2 explains it).

    Moreover, as FMs are usually large-scale models trained on large numbers of datasets, I am not sure whether a Medical MAE trained only on CheXpert can be regarded as an FM (current FMs for chest X-ray usually use a combination of datasets including CheXpert, ChestX-ray14, MIMIC-CXR, etc.).

    Lastly, I have some doubts about the experimental results in Table 1. As EO is a metric that measures the disparity in TPR and FPR between the two groups, a ΔTPR larger than 20% is infrequent for DL models. Do the results in Table 1 use the correct unit?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The dataset used in this paper is publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In the introduction part, page 1. Google MedLM, Voyage.ai, ChatGPT are FMs that mostly focus on language, so it is suggested to add another online image FM.
    2. In the introduction part, page 2. A space should be added after “1)”, and the word “pretrain” should have a consistent form (it appears as “pre-train” elsewhere).
    3. In the introduction part, page 2. The authors compare edits on image space and on latent space, could you give a rough result about the computation cost between the two choices?
    4. Figure 1. Although mentioned in the manuscript, I think the meanings of Y_1, \hat{Y_1}, X_a, “X_t”, the dashed line, etc. should be explained in the caption.
    5. Sec 2.1, page 4. “i.e. A =2 for gender with male and female” is a bit confusing. And the bold “fairness issue:” seems improper.
    6. Sec 2.2, page 4. Why do the authors use gradient ascent instead of gradient descent? And should \lambda be a negative number or not?
    7. Sec 3.1, page 6. “sampling subsets from the original training dataset and increase each subgroup’s bias gap (more positive sample in a subgroup).” I do not understand the purpose of doing so. The numbers in Appendix Table 2 should be 5,000 instead of 5000.
    8. Sec 3.1, Evaluation metrics, page 6. Should the author use a different index “j” for a^j and y^j? Besides, the formula of EO should add an absolute operator, otherwise, the authors should define that Y=y^1 is the privileged group with higher performance.
    9. Sec 3.2, universal debiased editing, page 6. Why use Pleural Effusion to finetune the SA classifier? Are there any ablation studies on using different illnesses to train SA? (perhaps training the SA classifier with “No finding” will be more precise?)
    10. Table 1. I think replacing DI with 1 - DI is better.
    11. Sec 3.4, page 8. Do the authors examine \lambda=0.2, 0.5? Will the method have better EO and ACC with a different \lambda?
    12. In my opinion, Fig.4 in the Appendix is important and interesting, so I suggest moving it to the manuscript if there is space.

    [1] Liu, Xinwen, et al. “Universal undersampled MRI reconstruction.” Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Part VI. Springer International Publishing, 2021.
    [2] Zhu, Heqin, et al. “You only learn once: Universal anatomical landmark detection.” Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Part V. Springer International Publishing, 2021.
    [3] Liu, Jie, et al. “CLIP-driven universal model for organ segmentation and tumor detection.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although there are some improper statements in the manuscript, I think the method focused on mitigating the unfairness of FMs is important for MICCAI society and gives insights to further research. So I am willing to accept this paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors explained most of my questions. Besides, although the work does have limitations (especially the gap between the traditional definition of “FM” and their Med-MAE used on one X-ray dataset), I appreciate their idea of using GeZO for black-box model mitigation and believe in its utility in real situations.




Author Feedback

We sincerely thank the reviewers for their valuable comments. We are grateful that reviewers acknowledged that our method is interesting (R4), novel (R4 and R6), helps find potential spurious relations (R6), and is in line with reality (R6). We address the concerns of the reviewers below.

FM API (R3)-As stated in the first sentence of our intro, our focus is on pretrained models that convert complex input data into vectors for downstream prediction tasks. The references to commercial APIs were intended as background context. At no point did we state that our research would be based on these commercial APIs. An API is simply a way for components to communicate [3], and our approach of using vectorized outputs from a pretrained medical model as input to another ML model aligns with this definition. We modeled the key API properties mentioned in the second-to-last paragraph of the intro (white- and black-box) and proposed corresponding debiasing methods. We intentionally simulated the API with a publicly available FM for reproducibility, as direct experimentation with commercial APIs was impractical due to access limitations and costs. This practice is common in academic research.
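
For context on what "simulating the API" can mean in code, the sketch below (illustrative names only, not the authors' codebase) wraps a frozen stand-in encoder behind a function that returns embeddings without exposing weights or gradients, which is the black-box property the rebuttal describes modeling.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 128))  # stand-in for a public FM

@torch.no_grad()
def fm_api_embed(images: torch.Tensor) -> torch.Tensor:
    # Simulated black-box API: callers receive embeddings only,
    # with no access to model weights or gradients.
    return encoder(images)

emb = fm_api_embed(torch.randn(4, 3, 224, 224))  # -> shape (4, 128)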

The term “Universal” (R4, R6)-We appreciate the feedback. We used the term “universal” because a single UDE noise can be applied universally across all patients to de-bias images for multiple diseases simultaneously. As demonstrated in Table 1, the same UDE noise was applied to different patients’ images across three different diseases, consistently improving fairness. The term also aligns with its use in other work [4], where it denotes fair representation learning for various tasks.

Evaluation on CheXpert (R4)-Our study focused on CheXpert for several reasons. First, CheXpert is a large-scale, representative dataset widely used in both medical fairness studies and medical FM training. Second, it contains multiple disease labels, which is ideal for validating the “universal” property of UDE; CheXpert therefore enabled us to perform extensive and systematic studies. Lastly, given MICCAI’s 8-page limit, we prioritized depth over breadth to thoroughly explore our method on a large-scale, representative dataset. This aligns with similar works in the field, which typically focus on a single dataset for in-depth analysis [1][2].

Underrepresentation bias (R4)-The UDE training objective (Eq. 1) masks sensitive-attribute-related information while maintaining utility. Thus, it is able to mitigate multiple types of bias.

Performance (R4)-We want to humbly point out that an effective de-biasing strategy balances utility (accuracy) and fairness (EO, DI). Sketch achieves 0.5 lower EOp and 1.7 lower DI on Edema, gains that are very small compared to its 8.1% accuracy drop. UDE and UDEZeGO provide the best overall trade-off, excelling in both fairness and utility, making them ideal when both are prioritized. Additionally, UDE offers good interpretability by visualizing the UDE noise vector (see Appendix).

MedMAE Training Data (R6)-The Medical MAE was trained on NIH ChestX-ray14, CheXpert, and MIMIC-CXR, totaling over 500,000 X-rays.

EO Magnitude (R6)-We use E(qual) O(pportunity) to quantify the disparity in true positive rates (TPR) between sensitive groups. Our data distribution is strongly biased by design; we confirm that this results in a large disparity in TPR.
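
For reference, the standard equal-opportunity gap, written with the absolute value that R6 and Meta-review #2 ask for, is:

\Delta \mathrm{TPR} \;=\; \bigl|\Pr(\hat{Y}=1 \mid Y=1, A=0) \;-\; \Pr(\hat{Y}=1 \mid Y=1, A=1)\bigr| \;\in\; [0, 1]

On this scale a 20% gap is 0.2; a value that large is unusual for DL models but consistent with the deliberately biased subgroup sampling described above.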

Writing Improvement (R6)-We appreciate the careful review and will incorporate your suggestions into our final paper improvements.

Reproducibility-Our algorithm is detailed in the appendix, and hyper-parameters are detailed in the experiments (Sec 3.2). Code will be open-sourced upon acceptance.

[1] On Fairness of Medical Image Classification with Sensitive Attributes via Learning Orthogonal Representations
[2] Fairness in Cardiac MR Image Analysis: An Investigation of Bias Due to Data Imbalance in DL Based Segmentation
[3] https://en.wikipedia.org/wiki/API
[4] Generating Fair Universal Representations Using Adversarial Models




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents an interesting methodology for bias mitigation for medical imaging applications. However, the claims of debiasing foundation models and being a universal debiasing editor cannot be supported by the current manuscript. The models used are pre-trained ViTs and the results are showcased on one dataset, CheXpert. Moreover, the paper is only focused on 3 diseases and 1 sensitive attribute (gender). In future iterations of the manuscript, I would recommend either adjusting the claims to match the experiments or performing more extensive evaluations on more datasets, tasks, and sensitive attributes.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper has diverging reviewer recommendations. The main weaknesses raised by Revs 3 and 4 concern the use of the term “universal”, the lack of experimentation with very large models (such as those provided by Google’s API), and the use of a single dataset. While I acknowledge these weaknesses, given the otherwise unanimous opinion that the methodology presented is innovative and interesting, I am recommending acceptance. I deem the noted strengths of the paper superior to the weaknesses. As a note to the authors, I recommend they make sufficient clarifications in their revised version regarding the terminology used (e.g., what “universal” means, as I agree that it is not standard). Furthermore, I strongly recommend revising their notation for the fairness violation metrics: these are (absolute-value?) differences of probabilities, so they should be in [0,1], or otherwise carefully noted.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This work claims to present a universal debiasing editor; however, the limited experiments do not provide enough evidence to support this claim.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #4

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers agree that this is an interesting work with sufficient novelty. Concerns have been raised regarding the validation of the method on only one dataset. Also, the use of the term “universal” should be reconsidered in the context of this paper to avoid confusion (e.g., with very large models). However, I agree with meta-reviewer #3 that the strengths of the paper overshadow its weaknesses.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



