Abstract
Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
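The abstract's "adversarial feature intervention" can be pictured with a short PyTorch-style sketch: a discriminator tries to recover a sensitive attribute (e.g., race or gender) from the joint image-text features, and a gradient-reversal layer couples it adversarially to the encoders. This is a generic illustration in the style of domain-adversarial training, not the authors' released code; all names below (GradReverse, SensitiveAttributeDiscriminator) are hypothetical.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SensitiveAttributeDiscriminator(nn.Module):
    """Predicts a sensitive attribute from the concatenated image/text embeddings.
    With gradient reversal, minimizing its cross-entropy trains the discriminator
    while pushing the encoders toward attribute-invariant features."""
    def __init__(self, feat_dim: int, n_groups: int):
        super().__init__()
        # feat_dim = image embedding dim + text embedding dim
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_groups))

    def forward(self, img_feat, txt_feat, lambd: float = 1.0):
        z = torch.cat([img_feat, txt_feat], dim=-1)
        return self.net(GradReverse.apply(z, lambd))
```

During pre-training, one would add a weighted cross-entropy of this discriminator to the contrastive loss; thanks to the reversal layer, minimizing that sum trains the discriminator while encouraging the encoders to discard attribute information. The paper formulates the objective as an explicit min-max (see the reviews and rebuttal below), which this sketch only approximates.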
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2088_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{YiChe_AdFairCLIP_MICCAI2025,
author = { Yi, Chenlang and Xiong, Zizhan and Qi, Qi and Wei, Xiyuan and Bathla, Girish and Lin, Ching-Long and Mortazavi, Bobak J. and Yang, Tianbao},
title = { { AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {13 -- 23}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes an adversarial fairness framework, AdFair-CLIP, which employs adversarial learning to simultaneously suppress the impact of sensitive attributes on features while the model learns to classify. This approach addresses the fairness issue of the CLIP model in CXR images. Through evaluations on the CheXpert Plus, MIMIC-CXR, and Fair Test datasets, this method demonstrates improvements in both fairness and accuracy compared to existing baselines, including two fairness-aware CLIP methods: FairCLIP and DebiasCLIP.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper explores the fairness issues associated with the application of the CLIP model in the CXR modality, which is an important problem.
- The framework proposed by the authors is well-defined and logical.
- Experimental results on two CXR datasets confirm that the proposed method enhances diagnostic fairness in most cases.
- An ablation study examines whether bias is propagated during the image-text pairing process and further shows that the joint optimization strategy over multimodal features is essential for suppressing biases arising from image-text pairing.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The min-max adversarial training proposed in this paper is not particularly novel, as similar concepts have already been introduced in other studies. It is necessary to justify the use of adversarial techniques in CLIP by comparing them to other types of methods, such as regularization-based approaches. Why should adversarial techniques be employed in CLIP algorithms instead of simply adding a loss constraint?
- In Eq. (1), the terms w_u and w_v are missing, and the structure of Eq. (3) is unclear. Based on the paper’s description, could minimizing L_GCL - L_Fair achieve a similar effect? Min-max techniques employed for fairness typically focus on maximizing the minimum utility across all sensitive groups; however, the equations presented in the paper do not appear to follow this formulation (see the sketch after this list).
- The experimental section lacks sufficient details regarding the data splitting and preprocessing methods. The explanation in Sec. 3.1 is not sufficiently detailed, which undermines the credibility of the experimental results.
- The experiments do not consider k-fold or other methods for re-partitioning the datasets, which is essential for ensuring the robustness of the results. Consequently, the reliability of the method cannot be confidently asserted.
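To make the min-max structure under discussion concrete, here is one plausible form of such an adversarial objective, written in the notation used in this review (w_u and w_v for the text and vision encoder parameters, w_d for a sensitive-attribute discriminator, and alpha as a trade-off weight). This is a sketch of the general technique; the paper's actual Eq. (1) and Eq. (3) may differ in detail.

```latex
% One plausible adversarial fairness objective (a sketch, not the paper's exact Eq. (3)):
% the encoders minimize the contrastive loss while making the sensitive attribute hard to
% predict, and the discriminator is optimized in the inner maximization.
\min_{w_u,\, w_v} \; \max_{w_d} \;
  \Big[ \mathcal{L}_{\mathrm{GCL}}(w_u, w_v)
        \;-\; \alpha \, \mathcal{L}_{\mathrm{Fair}}(w_d;\, w_u, w_v) \Big]
```

Here L_Fair would be the discriminator's cross-entropy on the sensitive attribute. Because w_d is optimized in the inner maximization, this differs from jointly minimizing L_GCL - L_Fair over all parameters, which would also degrade the discriminator itself (this is the distinction drawn in the author feedback below).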
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The novelty of this method is somewhat limited. While the topic is important, applying adversarial learning to issues of fairness is a relatively common approach. Furthermore, the explanation of the method lacks sufficient detail, and the experimental setup is not clearly presented. Therefore, I recommend a weak reject.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The concerns I previously raised have been thoroughly addressed by the authors. While leveraging adversarial learning for bias mitigation is not a novel approach in itself, applying it in the context of CLIP models remains a valuable and meaningful contribution. Given the soundness of the methodology and its relevance to current research directions, I recommend acceptance of this paper.
Review #2
- Please describe the contribution of the paper
This paper proposes an unfairness mitigation method for CLIP-based foundation models, using adversarial training strategies to remove sensitive-attribute-related features encoded in the extracted text and image features. Experiments on the FCXP 5x90 dataset show that the proposed AdFair-CLIP outperforms CLIP and FairCLIP on both overall utility and fairness metrics.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The topic of this paper is interesting and important, and it has not been widely explored in the MICCAI community. The proposed method is simple but effective. The description of the method is clear and easy to follow.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The related work on unfairness mitigation for foundation models is insufficiently covered.
- The text in Fig. 2 and Fig. 3 is too small to read. Besides, the three subfigures in Fig. 2 seem to use a different scale for each metric, which makes it difficult to compare across settings. Titles should be added to the subfigures.
- In Fig. 2, for MedCLIP, why is GAUC_gender higher in (b) than in (a)? Similarly, DPD_race for CLIP is lower in (a) than in (b). Could the authors explain this phenomenon?
- In Fig. 3, it is recommended to change the y-axis range to 54-62 for (a) and 83-87 for (b) to better present the differences among methods.
- Standard deviations for each experiment should be presented in Tables 1-3, rather than just stating “with standard deviations ranging from 0.22% to 5.18%”.
- The experiments did not compare against enough state-of-the-art algorithms, for example, the unfairness mitigation methods presented in [1].
- The authors set alpha = 0.3 for the fairness constraint, but no ablation experiments on this parameter are conducted. I am also curious whether the fairness metrics improve if a higher alpha is used.
- Regarding Table 4, I am curious about the performance when using text-only features for adversarial learning. The convergence speed when using vision-only versus multimodal features would also be an interesting topic.
- Why do the authors use the concatenated feature vector for attribute prediction with only one discriminator, instead of a separate discriminator for each type of feature vector? (See the sketch after this list.)
- The CheXpert dataset also contains an age attribute; why do the authors only choose gender and race for evaluation?
[1] FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models (NeurIPS 2024)
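To make the single- vs. two-discriminator question above concrete, here is a minimal sketch of the two designs. The names (disc, disc_img, disc_txt) are illustrative placeholders for ordinary attribute classifiers (e.g., small MLPs), not identifiers from the paper.

```python
import torch
import torch.nn as nn

def attribute_logits_joint(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                           disc: nn.Module) -> torch.Tensor:
    """Single-discriminator design described in the review above: one classifier
    sees the concatenated image/text features, so cross-modal combinations of
    attribute cues are also penalized."""
    return disc(torch.cat([img_feat, txt_feat], dim=-1))

def attribute_logits_per_modality(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                                  disc_img: nn.Module, disc_txt: nn.Module):
    """Alternative raised in the review: a separate discriminator per modality.
    Sketched only to make the question concrete; not what the paper evaluates."""
    return disc_img(img_feat), disc_txt(txt_feat)
```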
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although the authors explore an interesting area for MICCAI, the presentation of the experimental results is not satisfactory. In addition, some key experiments needed to evaluate the value of the proposed method are missing.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have answered my questions well. I hope they will improve the visualization of the figures in the camera-ready version, and I support the acceptance of this article.
Review #3
- Please describe the contribution of the paper
The authors propose AdFair-CLIP, an adversarial framework to reduce the reliance on sensitive attributes in contrastive language-image pre-training models when predicting diagnostic outcomes. The paper also evaluates the fairness of existing models (CLIP, GLoRIA, MedCLIP). The idea is demonstrated on CXR datasets. The model is trained on CheXpert Plus / MIMIC-CXR and tested on a small stratified dataset, FCXP 5x90, well balanced w.r.t. sensitive attributes. The method is compared to FairCLIP and DebiasCLIP.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Fairness is an important topic and the paper is overall interesting.
- The methodology is simple and clearly explained.
- The results in terms of debiasing seem compelling.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The method is very much inspired by [8] and a number of adversarial methods for fairness, which limits its novelty. However, I did not find a paper implementing the exact same idea, specifically in the context of fairness and/or CLIP models.
- Some fairness evaluation metrics, such as GAUC, Inter-AUC, and Intra-AUC, are only defined at a high level in the paper, and I could not find their exact definitions in the given references. From the results, they do not behave as one would expect from AUC-like scores (e.g., higher is better). This makes it somewhat more difficult to judge whether the improvements are meaningful.
- If there is a relevant statistical association between sensitive attributes and diagnostic outcome, the method may encourage the removal of task-relevant features. Does this partially explain the (small) drop in performance of fair models in Fig. 3?
- The improvement over the baselines in terms of AUC and accuracy (Tables 1,2,3) is likely not statistically significant. Given that the test dataset is balanced with respect to sensitive attributes, and that the fairness metrics are better for the proposed method, how do the authors explain this?
- The fairness soft constraint is enforced during contrastive pre-training but not during fine-tuning, potentially reintroducing the bias at fine-tuning time.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Can you clarify whether, for the considered applications, disease prevalence is affected by ethnicity? It is not clear from the sentence “In the CheXpert Plus [6] dataset, Cardiomegaly is more prevalent in Asian patients than in Black and White patients” whether this is a dataset-specific bias or a relevant association in the general population.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The strengths outweigh the weaknesses at this stage in the review process.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I am satisfied with the answers provided by the authors in their rebuttal to concerns and questions raised in the reviews.
The methodological novelty is limited but not absent as this specific approach to fairness in a vision-language pretraining context is novel.
The paper is interesting to the MICCAI community.
Author Feedback
Reviewer #1: Thank you! Q1: Definitions of GAUC, Inter-AUC, and Intra-AUC. A: Their definitions can be found in Definition 2 and Appendix A of [25]; lower values indicate better fairness. Q2: Will the method remove task-relevant features, causing a slight performance drop? A: Indeed, our method outperforms the baseline CLIP in accuracy, AUC, and fairness metrics. While our method may remove some task-relevant features, it also reduces spurious correlations, encouraging the model to focus on more meaningful clinical features; this is a trade-off. While our method might be worse in accuracy and AUC than some SOTA methods such as GLoRIA (Fig. 3), those are not fairness-aware methods and have much worse fairness metrics (Fig. 2). Q3: AUC and accuracy improvement not significant? A: We have much better fairness metrics while remaining competitive in AUC and accuracy. Q4: Is bias reintroduced at fine-tuning? A: No, our pretraining method learns fair feature representations. During fine-tuning, the backbone is frozen and only a linear classifier is trained, preserving the fixed, fair embeddings. Q5: Dataset-specific bias or relevant association in the general population? A: Dataset-specific bias.
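The frozen-backbone fine-tuning described in the answer to Q4 above corresponds to a standard linear probe. A minimal sketch under that assumption follows; the names and the assumption that the backbone outputs a flat feature vector of size feat_dim are illustrative.

```python
import torch.nn as nn

def linear_probe(backbone: nn.Module, feat_dim: int, n_classes: int) -> nn.Module:
    """Freeze the pretrained (fairness-aware) encoder and train only a linear
    classification head on top of its fixed embeddings."""
    for p in backbone.parameters():
        p.requires_grad_(False)            # encoder weights stay fixed during fine-tuning
    head = nn.Linear(feat_dim, n_classes)  # the only trainable parameters
    return nn.Sequential(backbone, head)
```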
Reviewer #2: Thank you! Q1: Why use adversarial techniques instead of loss constraints in CLIP? A: Traditional fairness losses depend on downstream tasks and labeled data, while our adversarial approach learns fair representations, making it more flexible across tasks and metrics. Q2: Could minimizing L_GCL - L_Fair achieve a similar effect? A: No. Simply minimizing L_GCL - L_Fair lacks the adversarial feedback, as it does not explicitly optimize the classifier, reducing its effectiveness in bias mitigation. We acknowledge the omission of w_u and w_v in Eq. (1), which represent the text and vision encoder parameters, respectively; this will be corrected in the final version. Q3: Why not maximize the minimum utility? A: Maximizing the minimum utility is ineffective here because the contrastive loss compares each data point with all others. In cases where one sensitive group is much smaller, minimizing this loss (by maximizing positive pairs within the group and minimizing negative pairs across groups) can unintentionally capture sensitive attributes, leading to unfair representations. Q4: Details of experiments.
A: We will provide supplementary material (online) with more details.
Reviewer #3: Thank you! We will improve our writing and presentation in the revision. Q1: Difference between (a) and (b). A: It is normal that the metrics in (a) and (b) differ, because they are evaluated differently: (a) uses zero-shot evaluation, while (b) uses few-shot evaluation with a separate linear head learned on the frozen feature representation. It is generally difficult to argue which one is better in terms of fairness. Q2: Comparison with the SOTA fairness mitigation methods in [1]. A: Thank you for pointing out [1]; we will cite it. Our comparison focuses on baselines for improving the fairness of CLIP models, including FairCLIP and DebiasCLIP, which, to our knowledge, are the current SOTA methods for fairness in CLIP models. Q3: Do fairness metrics improve with a higher alpha? A: Not necessarily. Increasing alpha can affect the contrastive loss optimization, potentially leading to worse representations; it does not necessarily make the representation fairer. In our experiments, α = 0.3 was chosen to balance utility and fairness. Preliminary results showed that increasing α generally improved both up to α = 0.3, but beyond this point performance degraded and fairness gains became negligible or even reversed. Q4: Why use one discriminator instead of two? A: Two is also possible, but our single-discriminator design is both simpler and effective. Q5: Why not use age? A: Age is highly clinically relevant and less sensitive than race and gender. The improvements on race and gender already validate our approach.
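The distinction drawn in the feedback to Reviewer #2 (Q2) between adversarial feedback and simply minimizing L_GCL - L_Fair can be sketched as alternating updates: the discriminator is explicitly trained, then the encoders are updated against it. Everything below (vision_enc, text_enc, disc, contrastive_loss, the optimizers) is a hypothetical placeholder; this is a generic adversarial-training skeleton, not the authors' code.

```python
import torch
import torch.nn as nn

def adversarial_step(images, texts, attrs, vision_enc, text_enc, disc,
                     contrastive_loss, enc_opt, d_opt, alpha=0.3):
    """One alternating min-max step: first the discriminator learns to predict the
    sensitive attribute, then the encoders minimize the contrastive loss while
    fooling the (fixed) discriminator. alpha is the trade-off weight (the rebuttal
    reports alpha = 0.3)."""
    ce = nn.CrossEntropyLoss()

    # (1) Discriminator step: minimize its cross-entropy on detached features,
    #     so only the discriminator parameters are updated here.
    img_f, txt_f = vision_enc(images), text_enc(texts)
    d_loss = ce(disc(torch.cat([img_f.detach(), txt_f.detach()], dim=-1)), attrs)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # (2) Encoder step: contrastive loss minus alpha times the fairness term,
    #     which pushes the encoders to make the attribute unpredictable.
    img_f, txt_f = vision_enc(images), text_enc(texts)
    fair = ce(disc(torch.cat([img_f, txt_f], dim=-1)), attrs)
    enc_loss = contrastive_loss(img_f, txt_f) - alpha * fair
    enc_opt.zero_grad()
    enc_loss.backward()
    enc_opt.step()
    return enc_loss.item(), d_loss.item()
```

Jointly minimizing L_GCL - L_Fair over all parameters at once would instead degrade the discriminator itself and provide no meaningful adversarial signal, which is the point made in the rebuttal.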
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Overall an interesting application-oriented paper in the area of fairness. While the methodological novelty is limited (adversarial training applied to CLIP models), the work is well executed and all reviewers are in favor of the paper.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A