Abstract

Deep learning has achieved impressive performance across various medical imaging tasks. However, its inherent bias against specific groups hinders its clinical applicability in equitable healthcare systems. A recently discovered phenomenon, Neural Collapse (NC), has shown potential in improving the generalization of state-of-the-art deep learning models. Nonetheless, its implications for bias in medical imaging remain unexplored. Our study investigates deep learning fairness through the lens of NC. We analyze the training dynamics of models as they approach NC when trained on biased datasets, and examine the subsequent impact on test performance, specifically focusing on label bias. We find that biased training initially results in different NC configurations across subgroups, before converging to a final NC solution by memorizing all data samples. Through extensive experiments on three medical imaging datasets—PAPILA, HAM10000, and CheXpert—we find that in biased settings, NC can lead to a significant drop in F1 score across all subgroups. Our code is available at https://gitlab.com/radiology/neuro/neural-collapse-fairness.
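For readers unfamiliar with NC, within-class variability collapse (NC1) is commonly summarized as tr(S_W pinv(S_B))/K over penultimate-layer features. Below is a minimal sketch of this metric, computed separately per sensitive subgroup in the spirit of the analysis described above; the function names and the per-group breakdown are illustrative assumptions, not the authors' released code:

```python
import numpy as np

def nc1_variability(features, labels):
    """One common NC1 summary: tr(S_W @ pinv(S_B)) / K, where S_W and S_B
    are the within-class and between-class scatter matrices of the
    penultimate-layer features and K is the number of classes."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    d = features.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        class_feats = features[labels == c]
        mu_c = class_feats.mean(axis=0)
        centered = class_feats - mu_c
        S_W += centered.T @ centered / len(features)   # within-class scatter
        diff = (mu_c - global_mean)[:, None]
        S_B += (diff @ diff.T) / len(classes)          # between-class scatter
    return np.trace(S_W @ np.linalg.pinv(S_B)) / len(classes)

def nc1_per_group(features, labels, groups):
    """Evaluate NC1 separately per sensitive subgroup (e.g., sex) to
    compare collapse dynamics between groups during biased training."""
    return {g: nc1_variability(features[groups == g], labels[groups == g])
            for g in np.unique(groups)}
```

Tracking such per-group NC1 values across training epochs is one way to observe whether subgroups approach collapse at different rates, as the abstract describes.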

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3085_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3085_supp.pdf

Link to the Code Repository

https://gitlab.com/radiology/neuro/neural-collapse-fairness

Link to the Dataset(s)

https://stanfordmlgroup.github.io/competitions/chexpert/
https://www.nature.com/articles/sdata2018161#Sec10
https://www.nature.com/articles/s41597-022-01388-1#Sec6



BibTex

@InProceedings{Mou_Evaluating_MICCAI2024,
        author = { Mouheb, Kaouther and Elbatel, Marawan and Klein, Stefan and Bron, Esther E.},
        title = { { Evaluating the Fairness of Neural Collapse in Medical Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper investigates the impact of Neural Collapse (NC) on label noise in deep learning models for medical imaging. It examines one label-noise setting and analyzes how NC affects model optimization across subgroups under biased training scenarios.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It examines one label-noise setting and analyzes how NC affects model optimization across subgroups under biased training scenarios.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of this paper’s contribution is limited:

    1. Section 3 of the paper examines training with label noise. Prior research has already shown that models tend to learn accurate features during the initial stages of training; in later stages, models may memorize noisy labels, degrading performance [1, 2].
    2. If the paper intends to be theoretical, it should provide a more comprehensive account of how varying degrees of label noise within each group influence the optimization process as networks collapse, along with the corresponding theoretical connections.
    3. If the paper focuses on applications, real-world training may not necessarily reach neural collapse. Even when it does, a recent study [3] suggests that neural collapse often occurs on the training set but not on the test set, limiting the paper’s practical recommendations.
    4. Finally, the evaluation is not comprehensive. First, the optimization process and performance under different levels of label noise among sensitive groups are not explored. Second, the paper should also explore situations where labels are superficially correlated with sensitive attributes. For instance, if a disease’s positive labels are superficially associated with white people due to richer medical resources, the class means might also embody attributes of white people. In such cases, it is inappropriate to claim that the model encodes less sensitive-group information without analyzing the underlying correlation between labels and sensitive attributes.

    [1] Early Stopping Against Label Noise Without Validation Data. [2] Learning from Noisy Labels with Deep Neural Networks: A Survey. [3] Limitations of Neural Collapse for Understanding Generalization in Deep Learning.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The novelty of this paper’s contribution is limited:

    1. Section 3 of the paper examines training with label noise. Prior research has already shown that models tend to learn accurate features during the initial stages of training; in later stages, models may memorize noisy labels, degrading performance [1, 2].
    2. If the paper intends to be theoretical, it should provide a more comprehensive account of how varying degrees of label noise within each group influence the optimization process as networks collapse, along with the corresponding theoretical connections.
    3. If the paper focuses on applications, real-world training may not necessarily reach neural collapse. Even when it does, a recent study [3] suggests that neural collapse often occurs on the training set but not on the test set, limiting the paper’s practical recommendations.
    4. Finally, the evaluation is not comprehensive. First, the optimization process and performance under different levels of label noise among sensitive groups are not explored. Second, the paper should also explore situations where labels are superficially correlated with sensitive attributes. For instance, if a disease’s positive labels are superficially associated with white people due to richer medical resources, the class means might also embody attributes of white people. In such cases, it is inappropriate to claim that the model encodes less sensitive-group information without analyzing the underlying correlation between labels and sensitive attributes.

    [1] Early Stopping Against Label Noise Without Validation Data. [2] Learning from Noisy Labels with Deep Neural Networks: A Survey. [3] Limitations of Neural Collapse for Understanding Generalization in Deep Learning.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of this paper’s contribution is limited.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Reject — must be rejected due to major flaws (1)

  • [Post rebuttal] Please justify your decision

    Thank you for your response. Although some of my concerns are resolved, there are still important issues to be tackled.

    1. “Prior research on learning under label noise limits the paper’s novelty”: Although the authors study label noise within each group and propose three conclusions, the novelty of extending label noise from a single group to multiple groups is limited. Moreover, the second conclusion is not accurate.
    2. “By injecting underdiagnosis into one group, we intentionally correlated labels with sensitive attributes. Example: flipping 25% of “positive” labels in the “female” group increases the correlation of positive labels with the male attribute.” I still believe that Conclusion (ii) of the paper, “as models approach NC, they encode less sensitive subgroup information”, is inaccurate, and I maintain my original review comment: “The positive labels for diseases are superficially associated with white people due to their richer medical resources, and the class means might also embody attributes of white people.” The main reason is that I do not think 25% noise amounts to a superficial correlation. Additionally, there is a contradiction in Section 4.4 of the paper, where the authors state that in the case of CheXpert the model maintained good performance for the white population due to the large number of samples (77.9%): “In this case, the model’s NC configuration was predominantly defined by samples belonging to this group.” This indicates that the model does encode sensitive subgroup information; the white group even defines the class means. Therefore, Conclusion (ii) is not correct.
    3. “Exploring various noise levels requires training 60 models per level, a task currently being executed for extending this work.” As we can see, different levels of noise or superficial correlation lead to different NC configurations, which becomes even more important to study after this discussion. More time should be spent to make this a concrete piece of work. Thank you for the authors’ efforts, but I have to decrease my score.



Review #2

  • Please describe the contribution of the paper

    Explores neural collapse to improve fairness in medical imaging tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Studying the relation between neural collapse and fairness is interesting.

    Analysis is clear and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors simulate misdiagnosis by randomly changing some of the ground-truth positive samples to negative samples. I am not sure this is the ideal way to reflect real-world scenarios. Typical misdiagnosis occurs because of the closeness in pathology of samples. The authors should make a more principled selection. For example, they could instead change the labels of “hard” positive samples, i.e., positive samples that are close to negative samples in embedding space (see the sketch following this answer).

    There is limited methodological novelty and insights. It would be good if the authors can further discuss how the results can be used in practice. They mentioned that “models approaching NC encode less sensitive information about the subgroups in the extracted features, suggesting a mitigation of the model’s bias.” Are they suggesting that models should be trained to NC for reducing bias? How does this work relate to current works in fairness and model debiasing in terms of fairness and performance trade-offs? In addition, there is a closely related line of work that studies learning from noisy labels. Would models trained in such settings exhibit more bias?
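    To make the hard-sample suggestion above concrete, here is a minimal sketch of one way to flip “hard” positives, defined here as the positives whose embeddings lie closest to the negative-class mean. It assumes binary labels and embeddings from a separately pretrained encoder; the function name and distance criterion are illustrative, not something proposed in the paper:

```python
import numpy as np

def flip_hard_positives(embeddings, labels, flip_fraction=0.25):
    """Flip the positives closest to the negative-class mean in embedding
    space ("hard" positives), simulating pathology-driven misdiagnosis.
    Assumes binary labels (1 = positive) and embeddings from a separately
    pretrained encoder, since the model under study is not yet trained."""
    pos_idx = np.flatnonzero(labels == 1)
    neg_mean = embeddings[labels == 0].mean(axis=0)
    # Smaller distance to the negative mean = harder positive sample.
    dists = np.linalg.norm(embeddings[pos_idx] - neg_mean, axis=1)
    n_flip = int(flip_fraction * len(pos_idx))
    hardest = pos_idx[np.argsort(dists)[:n_flip]]
    noisy = labels.copy()
    noisy[hardest] = 0  # positive -> negative (missed diagnosis)
    return noisy
```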

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Analysis is clear and easy to follow. However, in terms of methodological novelty, this paper is limited.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Authors did not address the following concerns from my initial review. I maintain my score.

    1) How their results/findings can be used in practice. They mentioned that “models approaching NC encode less sensitive information about the subgroups in the extracted features, suggesting a mitigation of the model’s bias.” Are they suggesting that models should be trained to NC for reducing bias?

    2) How does this work relate to current works in fairness and model debiasing in terms of fairness and performance trade-offs?



Review #3

  • Please describe the contribution of the paper

    This paper studied the fairness (label bias in particular) of neural collapse in classifying medical images. Both theoretical analysis and experiments were performed to explain the mechanism behind the emergence of bias in deep learning models. The experiments were performed on three public medical imaging datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Neural collapse is an interesting phenomenon, and this paper investigated its fairness when training with biased datasets. The description of the method is clear and the writing is easy to follow. Results on the correlation between model fairness and the NC1-4 attributes are well presented.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The phenomenon of neural collapse is not easy to understand unless readers have read the previous work. In Section 3, the authors assume the largest impact of bias is on NC1; the reasoning behind this assumption is unclear to me.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The underdiagnosis problem can also be thought of as a special case of inaccurate labels. Consider citing and further exploring the phenomenon in the setting of inaccurate labels [1]. [1] Inaccurate Labels in Weakly-Supervised Deep Learning: Automatic Identification and Correction and Their Impact on Classification Performance.

    Based on the findings on NC and model fairness, providing more indications for model training would enhance the value of this study.

    An additional definition of NC in the Introduction could provide prior knowledge to readers and make the paper easier to read.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The mathematical rigor of the proposed method and the thorough experiments presented in this paper have informed my rating.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    This paper presents a theory to explain the emergence and mitigation of bias in medical image classification. The rebuttal has satisfactorily addressed my concerns.




Author Feedback

We thank the reviewers for their positive feedback. This work examines the fairness of Neural Collapse (NC) when training with biased datasets, which the reviewers find an interesting topic (R1, R3). The reviewers find the paper clear and easy to follow (R1, R3). The results derived from 3 public datasets on the correlation between fairness and NC are well presented (R3). The main novelty of this work is the evaluation of model training dynamics under group bias through the phenomenon of NC. Our results can help researchers better understand model bias and build NC-inspired fair methods.

Concerns and our reply:

1) Prior research on learning under label noise limits the paper’s novelty (R4): We want to emphasize that our focus is not on label noise in general, but on label bias (noise affecting a single group). We are the first to examine the influence of label bias on deep classification models through NC in the context of group fairness, building on the literature on label noise and NC [1-2].

2) The paper should explore setups where labels correlate with sensitive attributes (R4): If we understand the reviewer’s comment correctly, we did exactly what is suggested. By injecting underdiagnosis into one group, we intentionally correlated labels with sensitive attributes. Example: flipping 25% of “positive” labels in the “female” group increases the correlation of positive labels with the male attribute.

3) The paper provides limited practical indications on model training and the performance-fairness trade-off (R1, R3, R4): Firm conclusions on performance-fairness trade-offs in general cannot be drawn due to the lack of good fairness metrics [3]. Our work aims to examine the dynamics of biased training. The main novel indication is that NC reduces the sensitive information encoded in the model’s features (Section 4.3). Sections 3 and 4.4 show that different NC configurations between groups indicate a biased model. We will clarify this in the paper.

4) The paper does not consider prior work on learning from noisy labels (R1, R3, R4): This work aims to study the impact of biased datasets on standard training methods, rather than to develop a novel learning method for noisy labels. Prior work on learning from noisy labels is therefore outside the scope of this work and is a future effort.

5) NC is not achieved in practice, and train NC does not guarantee test NC (R4): We studied models as they “approach” NC and not necessarily after reaching it. This will be underlined in Section 2. We do not make any assumptions about test collapse. We show that while both clean and biased models approach train NC (Section 4.1), the clean model achieves better test NC (Section 4.4), confirming that test collapse depends on the quality of the training set, as shown in the literature.

6) Random choice of biased samples is not realistic (R1): We agree that bias occurs in hard samples, but datasets lack “easy vs. hard” labels. It is not clear how to inject bias into hard samples before training, since obtaining the embeddings requires prior training. Note that we followed the random bias-injection process of [4].

7) The paper only includes one level of label noise (R4): The study is designed assuming label bias in general is sparse, with 25% as an extreme case. Thus, methods designed based on our results are expected to work for real-world bias levels. Exploring various noise levels requires training 60 models per level, a task currently being executed as an extension of this work.

8) Choice of NC1 (R3): Unlike NC2-4, which involve shared model weights and class means, NC1 relates to individual sample embeddings, allowing better group comparisons.

[1] Liu 2020: Early-learning regularization prevents memorization. [2] Nguyen 2022: Memorization-Dilation: Modeling neural collapse under noise. [3] Mbakwe 2023: Fairness metrics for health AI: we have a long way to go. [4] Jones 2023: The Role of Subgroup Separability in Group-Fair Medical Image Classification.
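For concreteness, point 2 above describes the bias-injection protocol: randomly flipping a fraction of positive labels within a single sensitive subgroup. A minimal sketch of that protocol, assuming NumPy arrays for labels and group attributes; the function name and defaults are illustrative, not the authors' released code:

```python
import numpy as np

def inject_label_bias(labels, groups, target_group="female",
                      flip_rate=0.25, seed=0):
    """Simulate underdiagnosis: randomly flip a fraction of positive
    labels to negative, but only within one sensitive subgroup."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    # Only positives in the target group are candidates for flipping.
    candidates = np.flatnonzero((labels == 1) & (groups == target_group))
    n_flip = int(flip_rate * len(candidates))
    flipped = rng.choice(candidates, size=n_flip, replace=False)
    noisy[flipped] = 0  # positive -> negative within the target group
    return noisy
```

Because only one group's positives are flipped, the noisy positive label becomes correlated with the complementary group attribute, which is the intentional correlation the rebuttal refers to.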




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper is well-written and easy to follow. The analysis is interesting and relevant for the community, and the results provide useful insights on three datasets. I would recommend increasing the resolution of Figure 1 and Figure 2 for the camera-ready version. Please also consider adding some of the clarifications from the rebuttal about the motivation and positioning of the paper in comparison to the related work (specifically points 1 and 5).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Borderline paper that explores the effect of neural collapse (training beyond 100% accuracy) on fairness/bias mitigation (here: label bias) in medical image classification tasks. While I agree with R#4 that the paper has many problems regarding the design of the experiments and the reported conclusions, I still believe that the analysis is good enough to be presented at MICCAI. The authors should, however, tone down the conclusions if the paper is accepted and also acknowledge the limitations regarding the generalizability of their findings.



