Abstract
AI-based systems have achieved high accuracy in skin disease diagnostics but often exhibit biases across demographic groups, leading to inequitable healthcare outcomes and diminished patient trust. Most existing bias mitigation methods attempt to eliminate the correlation between sensitive attributes and diagnostic predictions, but this often degrades performance due to the loss of clinically relevant diagnostic cues. In this work, we propose an alternative approach that incorporates sensitive attributes to achieve fairness. We introduce FairMoE, a framework that employs layer-wise mixture-of-experts modules to serve as group-specific learners. Unlike traditional methods that rigidly assign data based on group labels, FairMoE dynamically routes data to the most suitable expert, making it particularly effective for handling cases near group boundaries. Experimental results show that, unlike previous fairness approaches that reduce performance, FairMoE achieves substantial accuracy improvements while preserving comparable fairness metrics.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1069_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/Gracellgg/FairMoE
Link to the Dataset(s)
Fitzpatrick17k dataset: https://github.com/mattgroh/fitzpatrick17k
ISIC 2019 dataset: https://challenge.isic-archive.com/landing/2019/
BibTex
@InProceedings{XuGel_Incorporating_MICCAI2025,
author = { Xu, Gelei and Duan, Yuying and Liu, Zheyuan and Li, Xueyang and Jiang, Meng and Lemmon, Michael and Jin, Wei and Shi, Yiyu},
title = { { Incorporating Rather Than Eliminating: Achieving Fairness for Skin Disease Diagnosis Through Group-Specific Experts } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {291--301}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a mixture-of-experts (MoE) model structure utilizing group-specific ‘experts’ per layer to achieve improved demographic fairness without incurring accuracy losses. The method is evaluated on two dermatological datasets, Fitzpatrick-17k and ISIC 2019, achieving favorable results compared to several competitor methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The method is novel and appears intuitively appealing, if one subscribes (for a particular application) to the idea that different models should be used for different demographic groups.
The proposed approach does not require demographic labels during inference time and can handle instances “falling between groups” (think nonbinary sex or mixed ethnicities), both of which are appealing properties that many previously proposed methods do not provide.
Several relevant baseline methods are included in the analysis, and the results are generally promising.
The paper is generally well-written and arguments are presented clearly and coherently.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
First, I am uncertain whether the comparison to the baseline methods is “fair”: this is not presently specified in the paper (it should be), but I would expect the MoE model to have many more parameters as well as to be slower to train and heavier during inference. Numbers on this should be provided in the paper, and possibly a baseline method with a similar parameter count (a standard “non-fair” MoE? a larger resnet?) should be included for comparison. (Any observed improvements in performance or fairness might just be due to increased model size.) In addition, the resnet-18 baseline seems rather weak to me - most studies I know of in this area use at least a resnet-50 or a densenet-201.
Second, the authors consider the difference in F1 between sensitive groups as one key outcome. The meaning of the F1 score depends on the base rate of the population in question, meaning that a comparison of F1 between groups with different malignancy rates is meaningless. Various appropriately normalized versions of F1 have been proposed to address this issue [1-4] and such a metric should be used instead of “raw” F1 for comparing model performance between groups. (I would expect that different demographic groups have different malignancy rates in the test set, no? Otherwise, this would be a non-issue since groups would be comparable using F1.)
Third, the ISIC datasets including ISIC 2019 are known to contain excessive duplicates and suffer from a high risk of data leakage and invalid performance assessments [5]. Has this appropriately been taken care of in the present analysis?
Fourth, the authors make it sound a bit as if there were no or little prior methods that utilize demographic information instead of suppressing it. This is not at all the case; see e.g. [6-9] for a few selected examples. The selection of cited methods and implemented baselines seems rather skewed towards the prior work of one particular research group.
[1] Precision-Recall-Gain Curves: PR Analysis Done Right, https://papers.nips.cc/paper_files/paper/2015/file/33e8075e9970de0cfea955afd4644bb2-Paper.pdf
[2] Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation, https://icml.cc/2012/papers/349.pdf
[3] Class imbalance on medical image classification: towards better evaluation practices for discrimination and calibration performance, https://link.springer.com/article/10.1007/s00330-024-10834-0
[4] On (assessing) the fairness of risk score models, https://dl.acm.org/doi/10.1145/3593013.3594045
[5] Analysis of the ISIC image datasets: Usage, benchmarks and recommendations, https://doi.org/10.1016/j.media.2021.102305
[6] Equality of Opportunity in Supervised Learning, https://proceedings.neurips.cc/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.html
[7] Fairness without Harm: Decoupled Classifiers with Preference Guarantees, https://proceedings.mlr.press/v97/ustun19a
[8] An Algorithmic Framework for Bias Bounties, https://dl.acm.org/doi/abs/10.1145/3531146.3533172
[9] Blind Pareto Fairness and Subgroup Robustness, https://proceedings.mlr.press/v139/martinez21a.html
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Will the code be made publicly available?
“Title Suppressed Due to Excessive Length”.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The method is novel and interesting and the paper well-written, but some of my concerns regarding the evaluation methodology and baseline methods should be addressed before acceptance can be recommended.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper proposes FairMoE, a fairness‑aware skin‑disease classifier that embeds layer‑wise mixture‑of‑experts (MoE) modules inside a CNN backbone. Each expert is intended to specialise in a demographic group (e.g., light vs dark skin), enforced by a mutual‑information loss that couples routing decisions to group labels. A soft, probability‑based router then allows boundary cases to draw on multiple experts, mitigating data scarcity and hard assignment problems. Experiments on Fitzpatrick‑17k (skin‑tone bias) and ISIC‑2019 (age bias) show that FairMoE raises average F1 while matching or improving equalised‑odds / equalised‑opportunity gaps relative to recent pruning‑, quantisation‑ and BN‑based debiasing methods.
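The layer-wise MoE block summarised above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, and the cross-entropy routing loss stands in for the paper's mutual-information regulariser, which the review describes but whose exact form is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEConvBlock(nn.Module):
    """One layer-wise MoE block: a soft router blends group-specific
    expert convolutions. Illustrative sketch, not the authors' code."""
    def __init__(self, in_ch, out_ch, num_experts=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(num_experts)
        )
        # Router maps globally pooled features to expert probabilities.
        self.router = nn.Linear(in_ch, num_experts)

    def forward(self, x):
        # Soft routing: every expert contributes, weighted by its
        # probability, so boundary cases can draw on multiple experts.
        p = F.softmax(self.router(x.mean(dim=(2, 3))), dim=-1)   # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, C, H, W)
        y = (p[:, :, None, None, None] * outs).sum(dim=1)        # (B, C, H, W)
        return y, p

def routing_group_loss(p, group):
    """Couples routing to group labels by encouraging expert k to fire
    for group k (a cross-entropy surrogate for the mutual-information
    regulariser described in the review; an assumption, not the exact loss)."""
    return F.nll_loss(torch.log(p + 1e-8), group)
```

Because the router output is a probability vector rather than a hard assignment, an input near a group boundary receives a blend of experts, which is the property the review highlights for mitigating hard-assignment problems.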
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This incorporate‑rather‑than‑remove paradigm is well-motivated. The work convincingly argues that sensitive attributes (skin tone, age) carry diagnostic signal, so learning group‑specific modules is more appropriate than adversarial removal in dermatology.
- Novel MoE formulation — mutual‑information regulariser + size‑rebalanced soft routing yields clear expert–group specialisation without starving minority groups of data.
- Empirical gains for all groups — unlike many fairness baselines that equalise by hurting the advantaged cohort, FairMoE lifts F1 for both privileged and unprivileged groups while reducing disparity.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Binary grouping only. All experiments split skin tone (light/dark) and age (≤55/>55) into two experts. It is unclear whether performance or fairness would improve with finer buckets (e.g., Fitzpatrick 1‑2/3‑4/5‑6 or 10‑year age bands). An ablation varying the number of experts/groups is needed to show scalability and robustness of the routing scheme.
- Limited attribute diversity. The method is demonstrated on a single binary attribute at a time. Real deployments must juggle intersectional biases (skin tone × sex × age). How does FairMoE extend to multiple sensitive variables or to continuous attributes where ground‑truth labels are noisy?
- Routing overhead & latency. Adding MoE at every layer increases parameters and FLOPs; no wall‑clock or GPU‑memory analysis is given. A fairness technique that is too heavy for portable or point‑of‑care devices may be impractical.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Borderline accept. FairMoE offers a principled and effective shift from eliminating sensitive cues to leveraging them through group‑specific experts, and the empirical results are the first to show simultaneous accuracy and fairness gains across two dermatology benchmarks. The idea is simple enough to attract adoption, yet novel among medical‑AI fairness works. The main reservation is the coarse binary grouping: without experiments on finer or multiple attributes, it is hard to judge whether the approach generalises beyond the two‑expert toy setting. Addressing this (even in supplementary) and adding computational‑cost numbers would move the paper from promising to strong.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The main motivation of the paper is to incorporate skin tone information in the classification of skin disease data instead of removing it as it is important for diagnosis. The paper employs a MoE model to route data to the most appropriate expert. They find that their method surpasses results from other fairness models and improves the fairness of the models.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper proposes a novel method of improving fairness in skin disease classification. They do this while retaining important information on skin type which can be considered a sensitive attribute. Their method also achieves more fair results using Eodd, Eopp and FATE scores.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
They do not discuss the limitations of their work, e.g. whether it will be applicable to other imaging modalities and tasks where sensitive attribute information is also important. There are some grammatical errors in the paper. Given that the model performs better on dark-skin images than light-skin images despite there being more light-skin images, it would be useful to see a breakdown of the number of images in each class. They do not mention making their code publicly available.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a novel method of classifying skin disease data while considering sensitive attributes and maintaining fairness between groups. This may be a useful model in other applications where sensitive attribute information is important.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank all reviewers for their valuable feedback. Following the meta-reviewer’s suggestions, we focus here on addressing the key concerns raised by Reviewers #1 and #3 regarding evaluation metrics, attribute diversity, and the training and inference cost of FairMoE.
Evaluation Metrics (R1 Q2): We appreciate the reviewer’s observation regarding the limitations of using raw F1 scores to compare groups with differing base rates. In our current analysis, we reported per-group precision, recall, and F1 scores to provide direct insight into the model’s predictive performance across demographic groups. In addition, we included standard group fairness metrics such as Eopp0, Eopp1, and Equalized Odds to more formally assess fairness. Nevertheless, we acknowledge that raw F1 scores may be misleading in the presence of base rate differences, and we will consider incorporating normalized F1 metrics in future work to better account for such distributional disparities.
Attribute Diversity (R3): We thank the reviewer for pointing out the limitations of FairMoE in addressing attribute diversity. Real-world deployments must contend with intersectional biases involving multiple sensitive attributes. While FairMoE does not explicitly model this complexity, one promising extension is to apply clustering in the sensitive attribute space. Although the number of attribute intersections can grow exponentially, we expect that only a subset of dimensions meaningfully impacts diagnostic outcomes. Less influential intersections may be grouped together, allowing the use of fewer experts through subgroup clustering.
Training and Inference Cost: Multiple reviewers raised concerns about the training and inference costs of FairMoE. While additional experiments cannot be added due to submission constraints, we are happy to provide further clarification. FairMoE incurs higher training cost relative to baseline models, primarily dependent on the number of experts (two in this study). At inference, however, the overhead is comparable to the baseline, since only one expert is activated per input. To further reduce computational cost, future work will explore incorporating parameter-efficient modules as alternatives to full convolutional layers.
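The claim that inference cost stays close to a single-expert baseline can be illustrated with a minimal top-1 gating sketch. All names here are hypothetical, and this is not the authors' implementation; it only shows that with argmax routing, exactly one expert convolution runs per sample regardless of how many experts exist.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-expert setup: a router picks one expert per input,
# so per-sample FLOPs match a single-expert baseline.
experts = nn.ModuleList(nn.Conv2d(8, 16, 3, padding=1) for _ in range(2))
router = nn.Linear(8, 2)

def infer(x):
    # Route on globally pooled features, then run only the argmax expert.
    p = F.softmax(router(x.mean(dim=(2, 3))), dim=-1)  # (B, E)
    idx = p.argmax(dim=-1)                             # one expert per sample
    y = torch.cat([experts[int(i)](x[b:b + 1]) for b, i in enumerate(idx)])
    return y, idx

x = torch.randn(3, 8, 10, 10)
y, idx = infer(x)
```

During training, the soft (probability-weighted) routing would run all experts, which is where the extra cost the rebuttal acknowledges comes from.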
We will make every effort to incorporate the reviewer’s suggestions and will include the implementation code in the camera-ready version.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
All reviewers are in favour of this paper and like the presented idea of using a mixture-of-experts approach to achieve fairer classifiers. The method appears novel and the results are promising (see next comment regarding the evaluation, though). The authors are encouraged to especially incorporate the suggestions made by R#1 and R#3 regarding their evaluation and (additional) existing prior work when preparing their final version.