Abstract
Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Code will be made available at ***.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1032_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/Roypic/Aligner
Link to the Dataset(s)
VinDr-CXR: https://physionet.org/content/vindr-cxr/1.0.0/
ChestX-ray14: https://www.kaggle.com/datasets/nih-chest-xrays/data
MIMIC-CXR: https://physionet.org/content/mimic-cxr-jpg/2.1.0/
BibTex
@InProceedings{LuoHao_On_MICCAI2025,
author = { Luo, Haozhe and Zhou, Ziyu and Shu, Shelley Zixin and Mortanges, Aurélie Pahud de and Berke, Robert and Reyes, Mauricio},
title = { { On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {431 -- 441}
}
Reviews
Review #1
- Please describe the contribution of the paper
The main contribution is the investigation of human intervention to address gender- and age-related bias in a medical image (X-ray) classification task.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- An experimental design considering different levels of expert intervention was evaluated.
- Discussion of the trade-off between efficiency and fairness.
- More than one dataset was included in the experiments, supporting reliability.
- Reproducible methodology, with comprehensive information on the loss function and hyperparameters used.
- Figures and captions are excellent and a key contribution to the paper.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is not clear which fairness metrics were used: just one, or multiple? How was the fairness gap computed (see the sketch after this list)? "systematically analyze fairness with respect to two subgroups (gender and age), using multiple group fairness metrics."; "Following [15], we considered AUC performance disparity as most relevant given that the positive and negative ratio of samples across all conditions is imbalanced."
- Another key aspect is the level of hit rate used to evaluate the Human-AI alignment. Please briefly describe what it is, as it is key to the paper's contribution, and avoid generic sentences such as "as proposed in the XAI literature": "Finally, to assess the degree of Human-AI alignment, we assessed the level of hit rate, as proposed in the XAI literature".
- What are the age groups? Is age a binary feature based on a threshold (above or below a specific age)?
- Great analysis of the results, evaluating fairness and the alignment between expert and AI model. In practice, a key component to evaluate is also the "cost" or "time" required for the human intervention, i.e., evaluating efficiency, fairness, and time (especially the expert's).
- It is not clear whether sex or gender is being evaluated.
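For clarity on the AUC-disparity point above, here is a minimal sketch of how a subgroup AUC fairness gap is commonly computed. This is illustrative only and not taken from the paper's code; the function name and the scikit-learn usage are assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_fairness_gap(y_true, y_score, group_a):
    """Absolute AUC disparity between two demographic subgroups
    (e.g., male vs. female, or below vs. above an age threshold).
    Illustrative sketch; not the paper's implementation."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    group_a = np.asarray(group_a, dtype=bool)
    auc_a = roc_auc_score(y_true[group_a], y_score[group_a])
    auc_b = roc_auc_score(y_true[~group_a], y_score[~group_a])
    return abs(auc_a - auc_b)
```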
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper has strong motivation, further investigating human intervention in medical imaging classification and evaluating not only efficiency but also fairness. The experimental design considers different levels of intervention as well. However, key methodological details are missing: which fairness metric is used, and how is the fairness gap computed?
- Further description of the data is also required: how was age evaluated? As a binary feature? Also, it is not clear whether sex or gender is being evaluated.
- The discussion is limited; based on the proposed experiments, the conclusion highlights fairness improvement with human intervention. "We systematically analyze fairness with respect to two subgroups (gender and age)": what are these subgroups, sex or gender? What age ranges are evaluated? What level of human intervention should be used (best trade-off between efficiency and fairness)?
- How can this framework be applied to other medical imaging problems (given that the authors aim to make the code publicly available)?
- What are the study limitations?
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper proposes a method for incorporating human knowledge in chest X-ray classification to investigate human-AI alignment. The authors employ a vision-language model with adapted loss functions. They find that human-AI alignment can reduce fairness gaps across a variety of metrics, but that too much alignment can lead to decreases in performance. Interestingly, randomised alignment also decreases fairness gaps, indicating a correlation between labels and demographic data.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper demonstrates well that human-AI alignment can improve both fairness and performance. The authors use a variety of datasets during training and out-of-distribution datasets during testing to demonstrate that the method works on unseen data. They use five different performance metrics, which helps to compare fairness in different contexts.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
They do not explain how they generate the random shapes/locations of the attention maps in their ablation study. Although the figures display error bars, there is no discussion of whether the differences found are significant. In addition, the figures are not presented in the same order as they are discussed, which makes the paper slightly unclear. They do not discuss any limitations of the study. [Minor] The caption of Figure 5 could be made clearer to highlight that it considers fairness gaps, rather than overall accuracy, F1, TPR, and AUC.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed method is novel. The results of the experiments are strong and highlight the possibility of developing fairer models. However, some sections of the paper could be made clearer, e.g., the description of the ablation study and the discussion of Figure 5.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper uses a proposed method for aligning model attention with human expert-based annotations to show that human-AI alignment can decrease fairness gaps between subgroups and improve OOD performance. This was evaluated on three classification tasks using a large CXR database.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
To my knowledge, the connection between human-AI alignment methods and fairness in medical imaging AI tasks is novel, and I think it is an important step towards interdisciplinary development of responsible AI tools for healthcare. The experiments are comprehensive (it is impressive to fit this many results into a MICCAI paper) and, although the error bars are fairly large, the results show some promise for improving both performance and fairness alongside each other. Overall, the paper is clear and generally well-written.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- More details should be given on the expert annotations that were used. Currently, there is only one sentence describing them ("These training datasets come equipped with expert-based annotations reflecting human-based attention areas a radiologist uses for diagnosis."). My understanding is that the expert annotations are image-based, but are they segmentation masks or bounding boxes? How consistent are these annotations across all of the datasets used? I assume there would at least be some inconsistencies in the annotation process across the different training datasets – how would this affect results?
- Random alignment experiments: Why change the attention areas at each epoch? Why not just use one random attention area and train with it the whole time – wouldn't that be more "comparable" to the standard aligned training? (A purely illustrative sketch of such random masks follows this list.) Also, I am not sure I agree with the interpretation of the last section of results ("This trade-off aligns with fairness-aware modeling literature, where reducing bias can sometimes come at the cost of lower performance"). While some literature has shown that certain bias mitigation methods can come at the expense of performance for certain subgroups, I do not think this is a valid comparison to make in this randomization experiment. Here, the reduced bias seems to be a side effect of intentionally lowering performance (making the model worse simply makes it worse for all groups), i.e., this lower performance is not really a "trade-off" being made in order to have a fairer model. I am also a bit surprised that performance remains quite high even with randomized alignment – which experimental setup was this performed with (% of train data and % of human alignment during training)? This is not specified in the paper.
- Other minor constructive comments are below.
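For concreteness on the random-alignment point above, here is one plausible way a per-epoch random attention region could be constructed. This is a minimal sketch assuming rectangular regions and NumPy; the paper does not specify its actual procedure, and all names and size ranges here are illustrative assumptions:

```python
import numpy as np

def random_attention_mask(h, w, rng):
    """Sample a random rectangular 'attention' region covering
    roughly 10-40% of the image area (an illustrative choice,
    not the paper's procedure)."""
    area = rng.uniform(0.1, 0.4) * h * w
    rh = int(np.clip(np.sqrt(area * rng.uniform(0.5, 2.0)), 1, h))
    rw = int(np.clip(area / rh, 1, w))
    top = rng.integers(0, h - rh + 1)
    left = rng.integers(0, w - rw + 1)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:top + rh, left:left + rw] = 1.0
    return mask

# Re-sampling each epoch (as the paper describes) vs. fixing one mask:
rng = np.random.default_rng(0)
masks_per_epoch = [random_attention_mask(224, 224, rng) for _ in range(10)]
fixed_mask = random_attention_mask(224, 224, np.random.default_rng(1))
```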
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- I would recommend adding a sentence to the introduction (and perhaps also the abstract) explicitly stating what is meant by alignment in this paper (i.e., attention-based alignment, in a technical sense), because it could, for instance, be initially confused with the broader definition of AI alignment, which generally involves incorporating human values, ethics, etc.
- Flow/clarity would be improved if Section 2.2 (description of alignment methods) came before Section 2.1 (experimental setup).
- The error bars in the figures are not specified – do they represent standard deviation?
- I would recommend reordering the figures so that their numbers correspond to the order in which they are mentioned in the text.
- Ref [21] is cited twice for the same sentence in the first paragraph.
- In Section 2.2, I think the word "derive" should be changed to "direct".
- Fig. 2 has LP_Dice labelled in the attention alignment head. Is this supposed to be L_AL?
- Page 7: "exacerbating the alignment" – this phrasing does not really make sense to me; should it be "increasing" or "strengthening" the alignment?
- Ref [10] is now published in a journal; the arXiv citation should be updated.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper has comprehensive, seemingly sound experiments, with a novel framing connecting the important research areas of human-AI alignment and fairness. However, more details should be provided regarding the annotations that were used. Some limitations should probably be mentioned with respect to any inconsistencies across annotations belonging to different datasets, as well as whether the presented results may be specific to the VLM used here (currently, the language implies that human-AI alignment could benefit any model).
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
Dear Reviewers and Meta-Reviewer,

Thank you very much for your positive and constructive feedback on our paper. We sincerely appreciate the time and effort you have invested in reviewing our work. We will carefully address each of the unclear points you raised and revise the manuscript accordingly to strengthen our presentation. Thank you again for your invaluable comments.

Sincerely,
The Authors
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
All reviewers have provided positive evaluations of this submission and have offered many constructive recommendations. Authors are kindly requested to incorporate these suggestions into their final manuscript to the fullest extent possible.