Abstract

Fairness is an important principle in medical ethics. Vision Language Models (VLMs) have shown significant potential in the medical field due to their ability to leverage both visual and linguistic contexts, reducing the need for large datasets and enabling them to perform complex tasks. However, the exploration of fairness within VLM applications remains limited. Applying VLMs without a comprehensive analysis of fairness could lead to concerns about equal treatment opportunities and diminish public trust in medical deep learning models. To build trust in medical VLMs, we propose Fair-MoE, a model specifically designed to ensure both fairness and effectiveness. Fair-MoE comprises two key components: the Fairness-Oriented Mixture of Experts (FO-MoE) and the Fairness-Oriented Loss (FOL). FO-MoE is designed to leverage the expertise of various specialists to filter out biased patch embeddings and use an ensemble approach to extract more equitable information relevant to specific tasks. FOL is a novel fairness-oriented loss function that not only minimizes the distances between different attributes but also optimizes the differences in the dispersion of various attributes’ distributions. Extensive experiments show that Fair-MoE improves both fairness and accuracy across all four attributes. Code is made publicly available at https://github.com/LinjieT/Fair-MoE-Medical-Fairness-Oriented-Mixture-of-Experts-in-Vision-Language-Models.
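The two-term objective the abstract describes can be sketched with simple proxies: a gap between group mean embeddings stands in for the distributional-distance term, and an absolute variance gap stands in for the dispersion term. The function name `fol_proxy_loss` and the weight `lam` are illustrative assumptions for this sketch, not the paper's actual implementation (which uses an optimal-transport distance).

```python
import numpy as np

def fol_proxy_loss(feats_a, feats_b, lam=1.0):
    """Illustrative two-term fairness objective (a proxy sketch).

    Term 1: distance between the two groups' mean embeddings
            (a stand-in for the paper's optimal-transport distance).
    Term 2: difference in the groups' dispersion (feature variance),
            the novel ingredient the abstract highlights.
    """
    dist = np.linalg.norm(feats_a.mean(axis=0) - feats_b.mean(axis=0))
    disp = abs(feats_a.var() - feats_b.var())
    return dist + lam * disp
```

Identical group distributions drive both terms to zero, so the loss rewards matching groups in location and in spread, not location alone.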

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1048_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/LinjieT/Fair-MoE-Medical-Fairness-Oriented-Mixture-of-Experts-in-Vision-Language-Models

Link to the Dataset(s)

N/A

BibTex

@InProceedings{WanPei_FairMoE_MICCAI2025,
        author = { Wang, Peiran and Tong, Linjie and Wu, Jian and Liu, Jiaxiang and Liu, Zuozhu},
        title = { { Fair-MoE: Medical Fairness-Oriented Mixture of Experts in Vision-Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {186 -- 196}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • The authors pioneer an MoE-based approach tailored for fairness in the medical VLM domain.
    • Development of the Fairness-Oriented Mixture of Experts (FO-MoE) architecture, which leverages expert specialization to filter out biased patch embeddings and extract more equitable, task-relevant information. This component deals with a fundamental challenge in medical image analysis where biases can be embedded in visual features.
    • Fairness-Oriented Loss (FOL) function that considers the distance between protected attribute distributions and their dispersion differences. Unlike previous fairness losses that focus solely on distributional distances, this approach helps ensure consistent model behavior across demographic groups.
    • Comprehensive empirical validation demonstrating that Fair-MoE improves both fairness metrics (DPD, EOD) and diagnostic performance (AUC) across multiple protected attributes (race, gender, ethnicity, language).
    • Detailed ablation studies that rigorously validate each component’s contribution, providing insights into the mechanisms through which the model achieves fairness improvements, which could inform future research in this important area.
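The DPD and EOD metrics referenced above are standard fairness measures; one common formulation (a sketch with binary predictions and one group label per sample, not the paper's exact implementation) is:

```python
import numpy as np

def dpd(y_pred, group):
    """Demographic parity difference: gap in positive-prediction rates
    between the best- and worst-treated groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

def eod(y_true, y_pred, group):
    """Equalized odds difference: worst gap in TPR or FPR across groups.
    Assumes every group contains both positive and negative samples."""
    tprs, fprs = [], []
    for g in np.unique(group):
        m = group == g
        tprs.append(y_pred[m & (y_true == 1)].mean())
        fprs.append(y_pred[m & (y_true == 0)].mean())
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

Both metrics are zero for a perfectly group-blind classifier and grow toward one as treatment of the groups diverges.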
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Formulation of fairness. The paper makes a conceptual breakthrough by reconceptualizing fairness in terms of both distributional distance and dispersion consistency. This dual consideration recognizes that true fairness requires not only similar average performance across demographic groups but also similar consistency in predictions - a nuanced perspective that advances our theoretical understanding of algorithmic fairness.

    • Combination of MoE and fairness objectives. The authors cleverly repurpose the variance-optimization mechanism traditionally used for load balancing in MoE architectures to serve fairness goals. This is an elegant technical solution that bridges two previously separate research directions.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Inadequate justification of fairness objectives in medical context. The authors did not clearly articulate what constitutes bias versus legitimate clinical factors. Some medical conditions naturally have different prevalence rates across demographic groups (e.g., higher lung cancer rates in men, higher breast cancer rates in women). The authors do not sufficiently address when demographic attributes should inform diagnosis versus when they represent harmful bias, raising questions about whether the model might be suppressing clinically relevant signals in pursuit of statistical fairness.
    • Limited theoretical foundation for the dispersion-based fairness approach. While the authors argue that optimizing dispersion differences between protected attributes enhances fairness, they provide insufficient theoretical analysis justifying why this approach is superior to alternatives. The article lacks a rigorous mathematical argument for why variance consistency should be a fundamental fairness criterion.
    • Inconsistent experimental results. In Table 1, the authors claim superior performance for Fair-MoE, but careful examination reveals inconsistencies. For instance, in the Ethnicity attribute, CLIP/b16 shows a DPD of 7.53±2.96, which appears lower than Fair-MoE/b16’s 8.52±3.19, contradicting the paper’s conclusions. Such discrepancies undermine confidence in the reported findings.
    • Questionable component effectiveness: Table 2 reveals concerning patterns where adding FO-MoE to FairCLIP/L14 actually decreases performance for certain attributes. For instance, ES-AUC and AUC metrics drop for race, and similar inconsistencies appear across language attributes. These results cast doubt on the consistent effectiveness of the FO-MoE component across different model configurations and protected attributes.
    • Ambiguous ablation results for FOL: Table 3’s ablation studies show that for certain configurations (e.g., race attribute with Fair-MoE/b16), removing FOL results in better DPD values (3.19±2.04) compared to the full model (7.25±5.13). Many similar examples throughout the ablation studies suggest that FOL’s contribution to fairness improvements is not as clear-cut as the authors claim.
    • Limited Dataset Diversity: The evaluation is restricted to a single dataset narrowly focused on glaucoma diagnosis. It would strengthen the paper if the authors validated across multiple medical conditions and diverse imaging modalities (X-rays, MRIs, CT scans) to substantiate claims of generalizability within medical domains. The authors should also consider testing their approach on synthetic or augmented datasets with controlled bias levels, which would enable more precise quantification of the model’s bias-mitigation capabilities across different scenarios.
    • Lack of Computational Efficiency Analysis. The article notably lacks any analysis of computational overhead introduced by the architecture. Given that MoE models are well-established as computationally intensive, the addition of fairness components likely exacerbates this burden. Without concrete experiments on runtime metrics, memory requirements, and inference latency against baseline models, it is impossible to assess the clinical feasibility and deployment potential of Fair-MoE in resource-constrained healthcare settings. These efficiency considerations are critical for translating research innovations into practical clinical applications.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The model’s performance against baseline methods is somewhat underwhelming. In several experiments, the gains are marginal or inconsistent across different evaluation metrics. This raises questions about the practical utility of the proposed approach compared to existing solutions.
    • The ablation study presented does not convincingly demonstrate the value of each proposed module. In fact, some results suggest that certain components contribute minimally to the overall performance, which undermines the justification for the model’s complexity.
    • While the approach has theoretical merit, the empirical evidence does not strongly support the claimed contributions. The paper would benefit from a clearer articulation of the novel aspects and their specific impact on performance improvements.
    • Some experimental settings appear tailored to favor the proposed method, potentially limiting generalizability. A more diverse set of evaluation scenarios would strengthen the paper’s claims.
  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel framework, Fair-MoE, specifically designed to enhance fairness in medical VLMs. By incorporating the FO-MoE component, which leverages expert capacity to filter bias in patch embeddings, and by introducing a new fairness loss function, FOL, that considers both distance and dispersion among sensitive attributes, the framework effectively addresses the issue of human bias in VLMs. Experiments conducted on the FairVLMed dataset, comparing this method with various approaches using multiple evaluation metrics, demonstrate that it improves both fairness and accuracy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper addresses the unfairness problem in medical VLM, which is an important issue.
    2. The proposed method is novel, combining the MoE architecture with a new loss function design.
    3. Experimental results demonstrate the ability to simultaneously enhance both fairness and accuracy, while the ES-AUC metric effectively measures the trade-off between the two.
    4. The ablation study offers a comprehensive analysis of how different components, particularly those within the FOL, impact the final results, providing readers with a more detailed comparison.
    5. The writing is clear and easy to follow.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The MoE architecture may increase model complexity and pose training challenges. Additionally, the paper lacks implementation details and code; it does not provide information on how to construct the MoE model. A discussion of hardware resource requirements, along with related experiments concerning the MoE, would significantly enhance the quality of the paper.
    2. Due to the large number of parameters in the FOL, the effectiveness of this function may be significantly influenced by hyperparameter tuning and the specific characteristics of the dataset.
    3. The experimental results are not comprehensive. Experiments were conducted on only one dataset, and the comparison with existing state-of-the-art fairness methods is limited in scope.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method in this paper demonstrates sufficient novelty and effectively addresses the fairness issue in medical VLMs. Furthermore, the experimental setup is well-designed and thoroughly explained, considering the trade-off between accuracy and fairness. The ablation study further validates the effectiveness of the proposed modules. However, since the validation is conducted on only one dataset and the comparison with other methods is limited, I recommend a weak accept. During the rebuttal phase, the authors should provide a reason for using only one dataset to validate their method, as well as more details regarding hyperparameter tuning in FOL.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This work tackles the important problem of fairness in medical Vision-Language Models (VLMs). The authors propose a new model, Fair-MoE, aiming to improve both fairness and accuracy. The key ideas are introducing a “Fairness-Oriented Mixture of Experts” (FO-MoE) system within the VLM encoders – using MoE layers to filter patch embeddings and refine final features – and a novel “Fairness-Oriented Loss” (FOL). This loss function is interesting because it goes beyond just minimizing feature distance between groups (using Sinkhorn distance) and adds a term to explicitly reduce differences in the dispersion of group distributions, using MoE gate weight variance as a proxy. They test this on the Harvard-FairVLMed dataset and show it outperforms standard CLIP and the prior FairCLIP method on several metrics.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Focus on an Important Problem: Tackling fairness in medical AI, especially VLMs, is crucial work with significant real-world implications. The focus here is timely and relevant. Novelty in Approach: Applying MoE specifically for fairness in this VLM context (the FO-MoE design) feels like a fresh approach. The idea of filtering at both patch and feature levels is conceptually appealing.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Rationale for Some Design Choices: While the FO-MoE idea is neat, I was left wondering why the patch-level MoE was placed specifically in the last attention block. Is the assumption that bias is most prominent or best filtered there? More justification would help. Similarly, the link in FOL between minimizing gate weight variance differences and achieving fairer feature distribution dispersion feels a bit indirect. It’s a creative proxy, but the paper could benefit from a clearer explanation or theoretical argument connecting the two.
    • FOL Mechanics & Sensitivity: The FOL loss uses gate weight variance. How stable is this metric during training, especially with the Top-K routing and capacity limits? It wasn’t clear how the MoE hyperparameters (K1, K2, capacity C) were chosen, or how sensitive the results are to these settings. Is there a risk the capacity limits disproportionately affect smaller subgroups?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation leans towards acceptance. The paper tackles a highly significant problem (fairness in medical VLMs) with genuinely novel ideas in both the FO-MoE architecture and the dispersion-aware FOL loss. The empirical results on the key benchmark dataset are convincing, showing clear advantages over prior work, and the thorough ablations back up the design choices effectively.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their constructive feedback and thoughtful suggestions. We appreciate the opportunity to clarify our work and address the raised concerns. Below, we respond to each reviewer’s comments point by point.

Reviewer 1: W1: The fairness objective of our work is to ensure the model achieves consistent accuracy across different demographic groups. In cases where a disease has varying prevalence, fairness issues arise not from the prevalence itself, but when the model learns biased features related to demographic attributes that are not clinically relevant to the disease. Our approach aims to mitigate such disparities while preserving valid clinical signals. W2: Alternative methods focus on minimizing the distance between distributions across different demographic groups. Our method explicitly incorporates the dispersion within each group’s distribution, providing a complementary perspective to distributional fairness. Its effectiveness has been demonstrated through empirical evaluation. W3: There is an inherent trade-off between fairness and effectiveness. Our method, Fair-MoE, is designed to balance both, as reflected in its strong ES-AUC performance, which accounts for both effectiveness and fairness. Furthermore, in most cases, Fair-MoE still outperforms other methods in AUC, DPD, and EOD. W4: Using FO-MoE without the load-balance term in FOL reduces training stability, which can explain isolated drops in performance during ablation. However, in most cases, FO-MoE still improves both effectiveness and fairness. W5: The trade-off between fairness and effectiveness exists. ES-AUC captures both aspects, and adding FOL consistently improves ES-AUC across settings, demonstrating its overall effectiveness. W6: At the time of our experiments, the Harvard-FairVLMed dataset was the first and only public medical dataset focused on fairness in VLMs. W7: Since we use a sparse MoE design for Fair-MoE, where only a subset of experts is activated, the additional computational cost remains minimal.

Reviewer 2: W1: As Fair-MoE adopts a sparse MoE architecture that activates only a limited number of experts, the added computational overhead is minimal. W2: We acknowledge that some hyperparameter tuning is required to train Fair-MoE. W3: The Harvard-FairVLMed dataset is the first and, at the time of our experiments, the only publicly available dataset focused on fairness in VLMs. We compare Fair-MoE against FairCLIP, the state-of-the-art model for fairness in medical VLMs, and the original CLIP model. Fair-MoE consistently outperforms both FairCLIP and CLIP in terms of effectiveness and fairness.

Reviewer 3: W1: We place the patch-level MoE in the last block because it can filter out bias information more effectively there. To promote fairness, we aim to reduce distributional disparities across demographic attributes; minimizing both distance and dispersion serve as complementary objectives for aligning feature distributions. W2: We choose K=3, capacity=0.8, and number of experts=10.
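The sparse Top-K routing behind the authors' efficiency argument (K=3, capacity 0.8, 10 experts, per the rebuttal) can be sketched as follows; the function `topk_gate` and the exact capacity formula are assumptions for this illustration, not the paper's implementation.

```python
import numpy as np

def topk_gate(logits, k=3, capacity_factor=0.8):
    """Illustrative sparse Top-K routing with per-expert capacity.

    Each token keeps only its k highest-scoring experts, and a
    capacity limit caps how many tokens any single expert receives,
    so only a small subset of experts runs per token.
    """
    n_tokens, n_experts = logits.shape
    # Capacity as a fraction of the average load under uniform routing.
    capacity = int(capacity_factor * n_tokens * k / n_experts)
    weights = np.zeros_like(logits)
    load = np.zeros(n_experts, dtype=int)
    top = np.argsort(-logits, axis=1)[:, :k]  # each token's k best experts
    for t in range(n_tokens):
        kept = [e for e in top[t] if load[e] < capacity]
        for e in kept:
            load[e] += 1
        if kept:
            w = np.exp(logits[t, kept])
            weights[t, kept] = w / w.sum()  # renormalise over surviving experts
    return weights, load
```

The returned `load` vector is the quantity whose variance a load-balancing (and, per the paper, fairness-oriented) term can regularize: uneven loads mean some experts dominate routing.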




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


