Abstract

Numerous studies have demonstrated the effectiveness of deep learning models in medical image analysis. However, these models often exhibit performance disparities across different demographic cohorts, undermining their trustworthiness in clinical settings. While previous efforts have focused on bias mitigation techniques for traditional encoders, the increasing use of transformers in the medical domain calls for novel fairness enhancement methods. Additionally, the efficacy of explainability methods in improving model fairness remains unexplored. To address these gaps, we introduce XTranPrune, a bias mitigation method tailored for vision transformers. Leveraging state-of-the-art explainability techniques, XTranPrune generates a pruning mask to remove discriminatory modules while preserving performance-critical ones. Our experiments on two skin lesion datasets demonstrate the superior performance of XTranPrune across multiple fairness metrics. The code can be found at https://github.com/AliGhadirii/XTranPrune.
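The core mechanism described in the abstract (scoring modules by attribution and pruning those that drive sensitive-attribute predictions while sparing performance-critical ones) can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the names `pruning_mask`, `perf_attr`, `sa_attr`, and `prune_ratio` are illustrative, and the real method operates on transformer attention heads/nodes with explainability-derived attributions.

```python
def pruning_mask(perf_attr, sa_attr, prune_ratio=0.05):
    """Illustrative sketch (not the authors' API).

    perf_attr: per-node attribution scores for the main (disease) task.
    sa_attr:   per-node attribution scores for the sensitive attribute (SA).
    Returns a binary mask of the same length: 0 = prune the node, 1 = keep it.
    """
    n = len(perf_attr)
    # Rank nodes by how much more they contribute to predicting the
    # sensitive attribute than to the main task.
    disparity = [s - p for p, s in zip(perf_attr, sa_attr)]
    k = max(1, int(n * prune_ratio))
    # Prune the k most "discriminatory" nodes; keep the rest.
    to_prune = sorted(range(n), key=lambda i: disparity[i], reverse=True)[:k]
    mask = [1] * n
    for i in to_prune:
        mask[i] = 0
    return mask
```

Under this sketch, a node with high SA attribution but low task attribution is pruned first, which is the intuition behind removing discriminatory modules while preserving performance-critical ones.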

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1201_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1201_supp.pdf

Link to the Code Repository

https://github.com/AliGhadirii/XTranPrune

Link to the Dataset(s)

https://github.com/mattgroh/fitzpatrick17k/tree/main

https://data.mendeley.com/datasets/zr7vgbcyr2/1


BibTex

@InProceedings{Gha_XTranPrune_MICCAI2024,
        author = { Ghadiri, Ali and Pagnucco, Maurice and Song, Yang},
        title = { { XTranPrune: eXplainability-aware Transformer Pruning for Bias Mitigation in Dermatological Disease Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces XTranPrune, a novel bias mitigation method for vision transformers in dermatological disease classification. It leverages explainability techniques to generate a pruning mask that removes discriminatory modules while preserving essential ones for performance. The method is evaluated on two skin lesion datasets, demonstrating superior performance across multiple fairness metrics. XTranPrune’s approach is significant as it addresses the challenge of performance disparities in medical image analysis due to demographic biases, enhancing trustworthiness in clinical settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. XTranPrune’s use of explainability methods for pruning is a novel approach that specifically targets discriminatory nodes, which is more precise than general modifications. 2. The method outperforms existing bias mitigation techniques in fairness evaluation, showcasing its effectiveness in creating more equitable models. 3. By addressing bias in medical imaging, XTranPrune contributes to the development of trustworthy AI tools that can be reliably used in clinical practice.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The complexity of the method might pose challenges for integration into clinical workflows, where simplicity and interpretability are crucial. 2. The evaluation is limited to skin lesion datasets; further research is needed to validate the method across diverse medical imaging datasets.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. The proposed method is not compared with existing approaches such as [1-3]. Adding comparisons with these methods could strengthen this paper. 2. In Table 1, why are all the results much higher than those reported by [4] on the same dataset?

    [1] Tzeng, E., Hoffman, J., Darrell, T., & Saenko, K. (2015). Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4068-4076).

    [2] Wang, Z., Qinami, K., Karakozis, I. C., Genova, K., Nair, P., Hata, K., & Russakovsky, O. (2020). Towards fairness in visual recognition: Effective strategies for bias mitigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8919-8928).

    [3] Zhang, B. H., Lemoine, B., & Mitchell, M. (2018, December). Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (pp. 335-340).

    [4] Wu, Y., Zeng, D., Xu, X., Shi, Y., & Hu, J. (2022, September). FairPrune: Achieving fairness through pruning for dermatological disease diagnosis. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 743-753). Cham: Springer Nature Switzerland.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Some established baselines are missing in the reported results to fully understand the effectiveness of this paper. Please refer to detailed and constructive comments.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a method called XTranPrune that generates a pruning mask to remove bias-related modules while preserving performance-critical ones. They conducted experiments on two dermatological datasets for the disease classification task and show that their method performs better than others. They also propose a new evaluation metric for fairness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors used an explainability-aware method on a transformer-based classification network to mitigate skin-tone bias.
    2. The authors used both group fairness metrics and a newly proposed fairness metric, NFR, to measure performance thoroughly, comparing against 6 SOTA methods and 2 baselines on two datasets.
    3. Good structure with fluent language. A clear Figure 1 shows the main idea.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Performance_mask and the SA attributions are not well defined with formulas in the methodology section, and the caption of Figure 1 does not explain them very well.
    2. The authors only use F1 as the evaluation metric for classification accuracy. As dermatological datasets are usually imbalanced (especially PAD-UFES-20), plain accuracy is uninformative; introducing ROC-AUC might better analyse false positive and false negative examples.
    3. Training details, such as the loss function, are not given anywhere in the paper.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, this is good work that demonstrates some explainability of ViT-based skin disease classification, with a lot of experiments. The authors need to give much more detailed training information, such as specifying the loss function. Moreover, some definitions, such as Performance_mask, are not well defined with formulas. In addition, as the proposed method is a post-processing method, how it works on out-of-distribution datasets is not well analysed. In real-world practice the training dataset may not be accessible after training; how to deal with that in practice needs to be considered.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper reads well and addresses the motivation and method clearly. There are some minor problems that need to be fixed in the rebuttal session. Overall it is a fine work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The author justified and answered the questions and I am satisfied with them.



Review #3

  • Please describe the contribution of the paper

    The paper introduces XTranPrune, a novel approach to reducing bias in vision transformers. This technique employs advanced explainability methods to devise a pruning mask that eliminates components promoting bias without sacrificing those essential for maintaining model performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. XTranPrune demonstrates strong explainability capabilities, providing insights into the model’s decision-making process.
    2. The research includes comprehensive experiments to evaluate the method’s effectiveness.
    3. The use of an attribution-based approach for retaining and pruning is shown to be an efficient strategy for bias mitigation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper could benefit from a comparison with a similar method, [1], which also involves selecting a pruning mask but utilizes Parameter-Efficient Fine-Tuning to fine-tune a subset of a pre-trained model. Including this method in the comparative analysis would provide a more robust evaluation of XTranPrune’s performance.

    [1] Dutt, Raman, et al. “Fairtune: Optimizing parameter efficient fine tuning for fairness in medical image analysis.” arXiv preprint arXiv:2310.05055 (2023).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper could benefit from a comparison with a similar method, [1], which also involves selecting a pruning mask but utilizes Parameter-Efficient Fine-Tuning to fine-tune a subset of a pre-trained model. Including this method in the comparative analysis would provide a more robust evaluation of XTranPrune’s performance.

    [1] Dutt, Raman, et al. “Fairtune: Optimizing parameter efficient fine tuning for fairness in medical image analysis.” arXiv preprint arXiv:2310.05055 (2023).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    XTranPrune demonstrates strong explainability and the experiments are comprehensive.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Thank you for your response and adding the experiments. As I already give an accept, I will keep my score unchanged.




Author Feedback

We sincerely thank all reviewers for their time and constructive feedback.

R1-6.1, Contrary to concerns, our method prioritises simplicity and interpretability: it uses explainability methods and a simple procedure to identify discriminatory nodes and enhance fairness. Our method is modular, minimally resource-intensive, and applicable to any deep learning classification model in the medical domain. We were able to significantly improve fairness metrics for large ViT models with just a few rounds of pruning.

R1-6.2, Skin lesion datasets are most commonly used in fairness studies and are easier for us to conduct benchmarking with existing methods. We will extend to other modalities in future work.

R1-10.1, Among the suggested studies [1-3], [2] is the most recent one (in 2020), so we have now added the comparison with [2], which shows the superiority of our method. On Fitzpatrick17k, XTranPrune achieved F1-score 73.51 vs DomainInd’s 69.06, with worst-case scores of 69.13 vs 64.26, DPM 0.586 vs 0.571, EOM 0.790 vs 0.714, EOpp0 0.086 vs 0.055, EOpp1 0.066 vs 0.128, Eodd 0.095 vs 0.139, NFR 0.114 vs 0.119. For PAD-UFES-20, the results are 62.01 vs 62.51, 57.03 vs 41.58, 0.009 vs 0.005, 0.624 vs 0.462, 0.389 vs 0.764, 1.141 vs 1.440, 0.909 vs 1.796, 0.587 vs 1.578. Our method gives better results except for EOpp0 on Fitzpatrick and F1-score on PAD-UFES-20. In addition, since FairPrune [4] (in 2022) could outperform the earlier studies [1, 3] and we could substantially improve the fairness metrics compared to FairPrune, we are indirectly showing the superiority of our method over [1, 3].
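For readers unfamiliar with the fairness gaps quoted above, the following is a minimal binary-classification, two-group sketch of one common convention (EOpp1 as the true-positive-rate gap, EOpp0 as the true-negative-rate gap, Eodd as the combined equalized-odds gap); the paper's exact multi-class definitions may differ, and the function name `fairness_gaps` is illustrative.

```python
def fairness_gaps(y_true, y_pred, group):
    """Binary-case sketch of EOpp0 / EOpp1 / Eodd between groups 0 and 1."""
    def rates(pairs):
        tp = sum(t == 1 and p == 1 for t, p in pairs)
        fn = sum(t == 1 and p == 0 for t, p in pairs)
        tn = sum(t == 0 and p == 0 for t, p in pairs)
        fp = sum(t == 0 and p == 1 for t, p in pairs)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        tnr = tn / (tn + fp) if tn + fp else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        return tpr, tnr, fpr

    g0 = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 0]
    g1 = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 1]
    tpr0, tnr0, fpr0 = rates(g0)
    tpr1, tnr1, fpr1 = rates(g1)
    return {
        "EOpp1": abs(tpr0 - tpr1),                      # TPR gap
        "EOpp0": abs(tnr0 - tnr1),                      # TNR gap
        "Eodd": abs(tpr0 - tpr1) + abs(fpr0 - fpr1),    # equalized-odds gap
    }
```

Lower values mean the two demographic groups receive more similar error rates, which is why the rebuttal treats smaller EOpp/Eodd numbers as better.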

R1-10.2, Since FairPrune [4] has not published its code, it was challenging to reproduce their data augmentation, metrics, and method implementation. We implemented their method based on the paper, following the conventional implementation of fairness metrics, and conducted comprehensive experiments with their suggested hyper-parameters, reporting the best results we could achieve. The results they report (e.g., EOpp0 = 0.0008) differ significantly from those in other papers such as FairME [5] (EOpp0 = 0.006). In addition, for consistency with our other experiments, we used ResNet-18 as the backbone, which typically performs better and thus shows inferior fairness metrics compared to the simpler VGG-11 used in their paper. Another reason for this mismatch could be differences in the train/test splits.

R3, Thanks for the suggestion. On Fitzpatrick17k, XTranPrune achieved F1-score 73.51 vs FairTune’s 66.80, with worst-case scores of 69.13 vs 54.44, DPM 0.586 vs 0.538, EOM 0.790 vs 0.686, EOpp0 0.086 vs 0.114, EOpp1 0.066 vs 0.104, Eodd 0.095 vs 0.195, NFR 0.114 vs 0.238. For PAD-UFES-20, the results are 62.01 vs 62.51, 57.03 vs 41.58, 0.009 vs 0.005, 0.624 vs 0.462, 0.389 vs 0.764, 1.141 vs 1.440, 0.909 vs 1.796, 0.587 vs 1.578. Our method outperforms in all the metrics.

R4-6.1, Due to the limited number of pages in the main paper, we kept the descriptions and captions concise, while including a more precise formulation of how we generate the masks through a pseudo-code in the supplementary materials.

R4-6.2, Please note that we used the Macro-averaged F1 score to properly evaluate the model performance on highly imbalanced datasets, as it accounts for both precision and recall for each class independently before averaging, ensuring a more balanced evaluation across all classes. We can add AUC measures in the next version.
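The macro-averaged F1 described above can be made concrete with a short sketch: per-class precision and recall are computed independently, turned into per-class F1 scores, and averaged with equal weight per class, so a majority-class predictor scores poorly on imbalanced data (the helper name `macro_f1` is illustrative).

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

For example, on a 3:1 imbalanced set, always predicting the majority class yields a high micro-level accuracy but a macro-F1 of only about 0.43, since the minority class contributes an F1 of zero.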

R4-6.3, Our loss function was Cross Entropy. Comprehensive training details are provided in our git repository, making the reproduction of our method easy.

R4-10, Our study has followed the commonly used experimental setups and focused on improving performance with a simple approach. We will investigate out-of-distribution cases in future work.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper is well-written and the topic interesting to the MICCAI community. The results are convincing and the rebuttal clarified the reviewers’ comments. I would recommend adding some clarification, like the loss function and the use of macro-F1 score in the paper for the camera-ready version. I would also recommend slightly expanding the caption of Figure 1 to enhance its comprehension.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



