Abstract

As Artificial Intelligence (AI) increasingly integrates into our daily lives, fairness has emerged as a critical concern, particularly in medical AI, where datasets often reflect inherent biases due to social factors like the underrepresentation of marginalized communities and socioeconomic barriers to data collection. Traditional approaches to mitigating these biases have focused on data augmentation and the development of fairness-aware training algorithms. However, this paper argues that the architecture of neural networks, a core component of Machine Learning (ML), plays a crucial role in ensuring fairness. We demonstrate that addressing fairness effectively requires a holistic approach that simultaneously considers data, algorithms, and architecture. Utilizing Automated ML (AutoML) technology, specifically Neural Architecture Search (NAS), we introduce a novel framework, BiaslessNAS, designed to achieve fair outcomes in analyzing skin lesion datasets. BiaslessNAS incorporates fairness considerations at every stage of the NAS process, leading to the identification of neural networks that are not only more accurate but also significantly fairer. Our experiments show that BiaslessNAS achieves a 2.55% increase in accuracy and a 65.50% improvement in fairness compared to traditional NAS methods, underscoring the importance of integrating fairness into neural network architecture for better outcomes in medical AI applications.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1123_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1123_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{She_DataAlgorithmArchitecture_MICCAI2024,
        author = { Sheng, Yi and Yang, Junhuan and Li, Jinyang and Alaina, James and Xu, Xiaowei and Shi, Yiyu and Hu, Jingtong and Jiang, Weiwen and Yang, Lei},
        title = { { Data-Algorithm-Architecture Co-Optimization for Fair Neural Networks on Skin Lesion Dataset } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a fairness-aware neural architecture search method and benchmarks it against common existing architectures.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    S1. The area of NAS in fairness is relatively underexplored and is a fresh perspective for the MICCAI community.

    S2. The paper is well-presented and has great figures.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    W1. The compute budget used by the proposed method is not mentioned. Given that the proposed method involves a large co-optimisation framework, it seems safe to assume that it required substantially more compute to train than the baselines, which are all tiny models. It’s unclear whether the proposed method will still achieve strong results compared to more reasonable baselines with similar compute budgets.

    W2. Similarly, I am surprised that there are no baselines included from the fairness literature. For example, adversarial training and disentanglement methods are quite common but are not discussed here.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    C1. I think it would be helpful to make the compute requirements and limitations of the proposed method more clear. The baselines should include some larger models, as well as some other bias mitigation methods from the literature.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a tough paper for me to review because I am not particularly familiar with the NAS literature. I am recommending a weak reject because I am not too convinced by the choices of baselines and I think the paper should have engaged more with the existing fairness literature. I will be willing to change my score in discussion with the authors and other reviewers.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I thank the authors for their rebuttal. My main concerns about the computational cost of the proposed method and the questionable choice of baselines were shared by the other reviewers. I appreciate the authors addressing these points in the rebuttal, although I am surprised that the authors did not agree to make changes to the paper to clarify these two common sources of confusion.

    I am still slightly suspicious of the computational cost. The rebuttal states that the method takes one day to run. In this time, it’s likely that a substantially larger, better model could have been trained. If model size at inference time is a concern, the larger model could likely even be distilled into a smaller one before the NAS has finished running. This is why I think that using tiny models as baselines may not be appropriate – the compute budget used for the proposed method is orders of magnitude greater than what was used to train the baselines.

    I am not too familiar with the NAS literature, so perhaps this drawback is accepted in the field? Either way, NAS is less common in the fairness and medical imaging communities, so I think it is important to be explicit about the compute costs and tradeoffs in the paper, especially given the paper’s position as advocating for using co-optimisation in these fields.

    I will maintain my score due to these concerns.



Review #2

  • Please describe the contribution of the paper

    This paper argues that data, training algorithms, and neural architecture are coupled in fairness optimization. It then introduces BiaslessNAS, a comprehensive framework that leverages NAS for the co-optimization of data, training algorithms, and neural architecture, improving accuracy and fairness simultaneously.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper considers the unfairness in learning caused by skin tone in skin lesion datasets and proposes corresponding solutions.

    It summarizes that data, training algorithms, and neural architecture are the biggest factors affecting unfairness, and that they are coupled, necessitating joint optimization.

    It proposes that NAS can be used to improve fairness by incorporating fairness-related design into each NAS component, and introduces BiaslessNAS, which can simultaneously improve accuracy and fairness, or greatly increase fairness with a limited sacrifice in accuracy.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    For certain conclusion-oriented statements, such as “we’ve found that neural architectures and training algorithms, alongside data, also influence fairness. Interestingly, these factors are interconnected, suggesting that optimizing them in isolation may not yield the most equitable outcomes,” and “More interestingly, the three factors N, f′, and D are coupled with each other, which indicates that optimizing them simultaneously is the best to minimize the unfairness score,” the article fails to provide adequate references, discussions, or empirical evidence to support such claims, some of which are fundamental to the research.
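
    For readers unfamiliar with the quoted notation, the co-optimization referred to in these statements can be written schematically as a joint search over the architecture N, the training algorithm f′, and the data D (a sketch of the form only; the paper defines the actual unfairness score):

    \[
      \min_{N,\, f',\, D} \mathrm{Unfair}(N, f', D) \;\le\; \min_{N} \mathrm{Unfair}(N, f'_0, D_0),
    \]

    i.e., a joint search over all three factors can never do worse than fixing a training algorithm f′₀ and dataset D₀ and searching the architecture alone; the coupling claim is that this gap is significant in practice, which is what the review asks the authors to substantiate.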

    Regarding the experimental configurations BiaslessNAS-Fair and BiaslessNAS-Acc, the article does not offer sufficient explanation.

    It is well known that NAS training consumes significant computational resources and time. This cost is not mentioned in the text, which prevents a comprehensive assessment of the algorithm’s advantages.

    In Eq. 2, the selection of the parameters alpha and beta is not discussed or supported by experiments.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Provide stronger arguments for the importance of data, training algorithms, and neural architecture to fairness, and for their intercoupling, to make this research more solid.

    For the final selection of α and β in Eq. 2 and the selection process, provide more discussion and experimental evidence.

    For the calculation of the fairness-aware loss function in Eq. 4, more explanation is needed to demonstrate how it accounts for fairness.

    The experimental settings for BiaslessNAS-Fair and BiaslessNAS-Acc need to be explained; they are completely absent from the paper.

    An experiment on training efficiency should be added.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Strong arguments for the importance of data, training algorithms, and neural architecture to fairness, and for their intercoupling, are missing from the paper, even though they form part of its core motivation.

    However, the innovation in NAS and the fairness-related designs in each NAS component are meaningful for improving fairness. Experiments on training efficiency are also missing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    For the two most significant issues, the choice of hyperparameters and the training cost, no satisfactory answers have been provided. Comparative experiments under different hyperparameters are still missing: the rebuttal only states the hyperparameter settings, which may not be sufficient, without explaining why those choices were made or providing relevant empirical evidence. The discussion of training costs is not convincing. The fact that other modules can improve efficiency does not imply that this module itself performs well, and comparative experiments under the same training cost are also lacking. A general training framework should not solely target small-size architectures. Therefore, I maintain a ‘weak accept’ stance, leaving the final decision to the meta-reviewer.



Review #3

  • Please describe the contribution of the paper

    The authors propose a method named BiaslessNAS that utilises NAS to co-optimize three fairness-related factors (i.e., data, training algorithms, and neural architecture). They incorporate fairness awareness into each phase of NAS, enabling the three factors to be optimized simultaneously. They conduct experiments on the ESFair dataset for the skin lesion classification task and show that the method achieves improvements in both accuracy and fairness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors incorporated fairness constraints into each phase of NAS, ensuring mitigation of bias from all three sources.
    2. The authors conducted multiple ablation studies to show the importance of incorporating fairness into different phases of NAS.
    3. Well structured, with fluent English.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors did not compare their method with other NAS-based fairness methods.
    2. The authors only use accuracy as the evaluation metric for classification; since the ISIC 2019 and Atlas datasets are quite imbalanced, introducing ROC-AUC could give a better evaluation (see the sketch after this list).
    3. NAS can incur high GPU and training-time costs; it would be better to clarify these costs.
    4. Captions in the figures are not very clear; for example, the caption of Figure 2 is too brief to convey the idea of the proposed system, and the different structures in the system are not described well.
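
    A minimal sketch of the per-group evaluation suggested in item 2, assuming integer class labels, a predicted-probability matrix, and a skin-tone group array (the variable names are illustrative, not from the paper):

    ```python
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def per_group_metrics(y_true, y_prob, groups):
        """Accuracy and macro one-vs-rest ROC-AUC per skin-tone group; more
        informative than overall accuracy on imbalanced skin lesion data.
        Assumes every class appears in every group."""
        results = {}
        for g in np.unique(groups):
            mask = groups == g
            acc = (y_prob[mask].argmax(axis=1) == y_true[mask]).mean()
            auc = roc_auc_score(y_true[mask], y_prob[mask],
                                multi_class="ovr", average="macro")
            results[g] = {"accuracy": float(acc), "roc_auc": float(auc)}
        return results
    ```
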
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, this is good work, introducing an NAS-based fairness method for skin lesion classification that optimizes data, training algorithms, and neural architecture simultaneously. However, the authors need to provide comparisons with other SoTA NAS-based methods. The evaluation metrics for classification may not be sufficient, given the imbalanced nature of skin lesion datasets. Also, the captions of the system design figure should be clearer.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper reads well and introduces the motivation and method clearly, with multiple experiments and results. There are some minor problems that need to be solved during the rebuttal session. Overall, it is good work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The author answered the questions and I am satisfied with them.




Author Feedback

We appreciate the reviewers’ valuable time and comments. In this rebuttal, we summarize the common questions with their sources (e.g., R4 C1 indicates comment 1 from reviewer 4, under item 6) and respond below.

Q1. Is a baseline method lacking for comparison? (R5 C1, R6 C2) In fact, we compared BiaslessNAS with a state-of-the-art NAS in the original paper, but we missed the citation. Specifically, in Table 2 of this paper, the results of FairNAS in line 4 are from paper [N1], Table 3, line 12, named FaHaNa-Fair. In the revision, we will clarify the baseline and add the citation. For R6, we are proposing a general framework with three components for data/algorithm/model co-optimization. Fairness-related methods, such as the mentioned adversarial training, can be integrated into the algorithm component, while the mentioned disentanglement methods can be integrated into the data component. This paper employs the widely used FairBatch [N2] and FairLoss [N3] in the data and algorithm components, respectively.
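
As a hedged illustration of how a fairness-oriented method slots into the data component, the sketch below shows a simplified group-balanced batch sampler in PyTorch. It is a stand-in in the spirit of FairBatch [N2], not its actual implementation (FairBatch adaptively updates group sampling rates during training); the dataset and integer group labels are assumed:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def group_balanced_loader(dataset, group_labels, batch_size=64):
    """Simplified group-balanced sampler: draw examples with probability inversely
    proportional to their group's frequency, so each skin-tone group is expected to
    contribute equally to every batch. Illustrative stand-in for the data component
    only; the actual FairBatch algorithm adapts sampling rates during training."""
    groups = torch.as_tensor(group_labels, dtype=torch.long)
    counts = torch.bincount(groups).float()
    weights = 1.0 / counts[groups]          # rarer group -> higher sampling weight
    sampler = WeightedRandomSampler(weights, num_samples=len(groups), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```
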

Q2. What is the cost of the proposed method? (R4 C3, R5 C3, R6 C1) We agree that NAS’s high search cost is a consideration. The search process for all experiments can be completed within one GPU day. This is achieved by two designs: (1) we use a block-wise structure to unify parameters across multiple layers, so the search space is reduced compared with layer-wise NAS; (2) considering efficient, real-time inference for medical applications, we target small-size architectures (for R6 C1). As such, the training time during the search process is reduced. For example, the baseline MobileNetV2 has 3.4 million parameters; Figure 3 shows that BiaslessNAS-Fair has only 162,851 parameters and BiaslessNAS-Acc has 3.1 million parameters. We want to note two things: first, the search process is a one-time offline effort for a given application; second, the main contribution of this work is not to develop an efficient NAS but to build a holistic framework for data/algorithm/model co-optimization. Different efficiency-improving techniques (such as zero-shot NAS, single-path NAS, and hot-start NAS) can be incorporated to further reduce the search cost.
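
To make the block-wise argument concrete, here is a minimal, hypothetical enumeration of a block-wise search space: each block picks one setting per option that all of its layers share, so the number of decisions grows with the number of blocks rather than the number of layers. The block names and option lists below are invented for illustration and are not the paper’s actual search space:

```python
import itertools

# Hypothetical block-wise search space: one choice per option per block,
# shared by all layers inside that block.
BLOCKS = {
    "stem":   {"channels": [16, 24, 32]},
    "stage1": {"kernel": [3, 5], "expansion": [3, 6], "depth": [2, 3]},
    "stage2": {"kernel": [3, 5], "expansion": [3, 6], "depth": [2, 3, 4]},
    "head":   {"channels": [128, 256]},
}

def enumerate_architectures(blocks):
    """Yield every block-wise configuration as a dict of (block, option) -> value."""
    names, option_lists = [], []
    for block, options in blocks.items():
        for key, values in options.items():
            names.append((block, key))
            option_lists.append(values)
    for combo in itertools.product(*option_lists):
        yield dict(zip(names, combo))

print(sum(1 for _ in enumerate_architectures(BLOCKS)))  # 3 * 8 * 12 * 2 = 576 candidates
```
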

Q3. What are the arguments of data/algorithm/model intercoupling on fairness? (R4 C1) We agree that intercoupling is the motivation behind this paper. Demonstrating the importance of considering the intercoupling for co-optimization is one contribution of this paper. In the existing literature, most research works perform fairness optimization on one dimension. However, in this work, we reveal that all these factors affect fairness, as shown by the result in Figure 1(ii). Then, we showcase that co-optimizing these factors yields the best performance from the results in Table 2. In the revision, we will rewrite these conclusion-oriented statements to support them using results from Table 2.

Q4. Does the work explore hyperparameters, and what is the difference between BiaslessNAS-Fair and BiaslessNAS-Acc? (R4 C2, R4 C4) There are two hyperparameters in the framework: (1) alpha is the weighting parameter for accuracy, and (2) beta is the weighting parameter for fairness. We explore two settings: BiaslessNAS-Fair uses a larger beta (0.8) and a smaller alpha (0.2), while BiaslessNAS-Acc uses a larger alpha (0.8) and a smaller beta (0.2). We will clarify this in the revision.
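
A minimal sketch of how such a weighted objective could look, using the alpha/beta settings described above; the fairness term here penalizes the loss gap between skin-tone groups, which is an assumption for illustration and not the paper’s exact Eq. 2 or Eq. 4:

```python
import torch
import torch.nn.functional as F

def weighted_fairness_objective(logits, labels, groups, alpha=0.2, beta=0.8):
    """Illustrative combination of an accuracy (task) term weighted by alpha and a
    fairness term weighted by beta. alpha=0.2, beta=0.8 mirrors BiaslessNAS-Fair;
    alpha=0.8, beta=0.2 mirrors BiaslessNAS-Acc. The fairness term below is the
    worst-vs-best group loss gap (an assumption, not the paper's formulation)."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    task_loss = per_sample.mean()
    group_losses = torch.stack([per_sample[groups == g].mean()
                                for g in torch.unique(groups)])
    fairness_loss = group_losses.max() - group_losses.min()
    return alpha * task_loss + beta * fairness_loss
```
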

Q5. Why is accuracy used as the evaluation metric? (R5 C2) We follow fairness research [15, N1] and the fairness-related competition [3] in using accuracy, the commonly used evaluation metric for classification. We agree with R5 that adding other metrics, such as ROC-AUC, would further strengthen the evaluation.

New References: [N1] “The larger the fairer? small neural networks can achieve fairness for edge devices.” Proc. of DAC 2022. [N2] “FairBatch: Batch Selection for Model Fairness.” Proc. of ICLR 2021. [N3] “Fair loss: Margin-aware reinforcement learning for deep face recognition.” Proc. of CVPR 2019.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The topic is interesting and the proposed method provides improvements in terms of fairness across various metrics. I would suggest that the authors expand the captions of the figures for the camera-ready version and that they should at least discuss NAS’s high search cost as a limitation in the paper.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed reviews, and reviewers still have questions on the chosen baselines that haven’t been completely addressed by the authors. However, the studied problem, and the new approach, is interesting, and the AC believes it will contribute to the further discussions of these ideas in the MICCAI community.



