Abstract

Recent studies have demonstrated that deep learning (DL) models for medical image classification may exhibit biases with respect to demographic attributes such as race, gender, and age. Existing bias mitigation strategies often require sensitive attributes at inference time, which may not always be available, or achieve only moderate fairness gains at the cost of a significant drop in accuracy. To overcome these obstacles, we propose FairQuantize, a novel approach that improves fairness by quantizing model weights. We reveal that quantization can be used not only as a tool for model compression but also as a means to improve model fairness. The approach builds on the observation that different weights in a model affect performance on different demographic groups differently. FairQuantize selectively quantizes certain weights to enhance fairness while only marginally impacting accuracy. In addition, the resulting quantized models do not require sensitive attributes as input. Experimental results on two skin disease datasets demonstrate that FairQuantize can significantly enhance fairness across sensitive attributes while minimizing the impact on overall performance.
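As a rough illustration of the idea in the abstract, here is a minimal sketch of selective power-of-two quantization. This is our own reconstruction, not the authors' code; how the per-weight scores are computed is left abstract, and the function names are illustrative.

    import torch

    def power_of_two_quantize(w: torch.Tensor) -> torch.Tensor:
        """Snap each weight to the nearest power of two, preserving its sign."""
        exponent = torch.round(torch.log2(w.abs().clamp_min(1e-12)))
        return torch.sign(w) * torch.pow(2.0, exponent)

    def quantize_top_fraction(w: torch.Tensor, scores: torch.Tensor, frac: float) -> torch.Tensor:
        """Quantize only the fraction `frac` of weights with the highest scores,
        i.e., those judged (by some group-aware criterion) to matter more for
        the privileged group than for the unprivileged one."""
        k = int(frac * w.numel())
        flat = w.clone().flatten()
        idx = torch.topk(scores.flatten(), k).indices
        flat[idx] = power_of_two_quantize(flat[idx])
        return flat.view_as(w)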

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3697_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3697_supp.pdf

Link to the Code Repository

https://github.com/guoyb17/FairQuantize

Link to the Dataset(s)

https://github.com/mattgroh/fitzpatrick17k
https://challenge.isic-archive.com/landing/2019/

BibTex

@InProceedings{Guo_FairQuantize_MICCAI2024,
        author = { Guo, Yuanbo and Jia, Zhenge and Hu, Jingtong and Shi, Yiyu},
        title = { { FairQuantize: Achieving Fairness Through Weight Quantization for Dermatological Disease Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper mitigates unfairness in dermatological disease diagnosis by quantizing part of the weights of a pre-trained neural network and achieves significant improvement in both fairness scores and performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    To the best of my knowledge, this is the first attempt at unfairness mitigation using quantization methods. Their code is available, which makes their work easy to follow. FairQuantize achieves better fairness while preserving performance, which is valuable in current research. There are plenty of comparison methods in the results section, and I appreciate their statement: “A more balanced approach would be moderating, instead of eliminating, certain information to maintain both fairness and performance, allowing for partial data flow.”

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although I acknowledge that this might be the first attempt at unfairness mitigation using quantization, I think the novelty of this work is insufficient. AFAIK, the use of the Taylor series to evaluate the importance or score of each parameter in a neural network is a common strategy that has been widely used in the pruning literature, e.g., FairPrune. However, I do not see a citation in this part, and I think it is unnecessary to spend so much space describing what is already consensus in this area.

    Besides, there are too many counterintuitive definitions and mathematically misleading expressions in the manuscript, which makes it difficult to follow. Please see the detailed comments.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The datasets used in this paper are publicly available, so I think this paper is easy to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In the introduction part, the authors use “privileged and unprivileged” without explanation. I think the explanation in Sec 2.1 should be moved before this statement.
    2. In the Introduction part, the authors state “However, such suppression of information about sensitive attributes increases the potential to miss useful features, greatly degrading the prediction performance.”. AFAIK, in [16], the performance does not degrade.
    3. In the Introduction part, the authors state “FairQuantize addresses this by adjusting the computation precision of these pivotal weights through quantization, thereby balancing accuracy and fairness between different demographic groups.” I do not understand the logic between these two statements (why use “thereby”?).
    4. In Sec 2.1, it is better to replace “y_0_i” and “y_i” with “y_i” and “\hat{y}_i”, which is more commonly used.
    5. In Sec 2.1, “y = F (Θ, x) is a pre-trained classification model”. This is wrong, as y is the prediction, and F is the model.
    6. In “Weights hold different importance for a model.”: “where log2(abs(θi)) is the integer that represents the power-of-2 quantized version of the weight θi.” Although explained, I believe a round operation should be added outside “log2(abs(θi))”, since in standard mathematical notation “log2(abs(θi))” is not required to be an integer (see the sketch after this list).
    7. I do not see any special differences between Equ. 2 and Eq.3~4 in the FairPrune (MICCAI 2021).
    8. The authors use two scoring sets for model inference, but I do not find a description of these sets (number of samples? class composition? attribute composition?).
    9. I think there are mistakes in Algorithm 1, from line 4 to line 9. I think H^u and H^p should be computed over the whole S^u and S^p, but in this algorithm, these values are computed on each sample pair in S^u and S^p. Besides, I guess the meaning of “for {su, sp} in {Su(unprivileged), Sp(privileged)} do” is directly borrowed from the Python enumerate function and should be explained to avoid misleading readers. Moreover, “the arbitrarily small numbers” is not a proper expression; perhaps an “isQuantized” flag would be better here.
    10. The font in Table 1 and Table 2 is strange, and I guess it is not the official Times New Roman font. Besides, footnotes should be added to state that the results of Vanilla, MFD, FairPrune, and ME-FairPrune are copied from their original papers (FairPrune and ME-FairPrune).
    11. Can the authors give some insight into why FairQuantize increases the precision on the ISIC 2019 dataset by about 10%, while only by about 3% on Fitzpatrick-17k? I think an improvement of about 10% is unusual for quantization algorithms.
    12. Could the authors explain how the hyper-parameter beta is selected? I guess a grid search is applied to find a better beta, but the values 0.556 and 0.778 are strange numbers (not simple decimals). Is the grid search implemented with a step of 0.001?
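On point 6, the rounding the reviewer asks for would look like the following scalar example (a hypothetical sketch of ours, not the paper's notation):

    import math

    def power_of_two(theta: float) -> float:
        """Nearest power-of-two quantization of a scalar weight. The round()
        makes the exponent an integer, which is the reviewer's point about
        log2(abs(theta)) not being an integer in general."""
        if theta == 0.0:
            return 0.0
        return math.copysign(2.0 ** round(math.log2(abs(theta))), theta)

    print(power_of_two(0.3))  # 0.25, since round(log2(0.3)) == -2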
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the novelty of this paper is insufficient, especially compared to FairPrune, and there are plenty of mathematically misleading statements in the manuscript. So I have to reject this paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    After the rebuttal, I still cannot recognize enough disparity between this paper and FairPrune, not only in the equations but also in the figures. I think the improvement is marginal, so I choose to reject this paper.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a model weight quantization strategy to improve fairness. The key idea is to identify weights that are less “important” for the unprivileged class but more “important” for the privileged class, and to quantize those weights to the nearest power of 2. “Importance” is essentially how much the model output is expected to change given a small change in the weight; for a pretrained model that is assumed to have converged, it is the second derivative of the model output with respect to the model weight. Weights can then be scored by the difference between two Hessians, one from the unprivileged input and another from a privileged input (with a scalar tuning multiplier). The algorithm proceeds in steps: the fraction of weights to be quantized is gradually increased and, crucially, the model is retrained after each increase. The reported results show some benefit: little degradation in performance for the privileged class, but more fairness.
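A minimal sketch of the staged procedure as summarized above; this is our reconstruction under stated assumptions, not the authors' implementation. The `score_fn` callback (returning a per-weight Hessian-difference saliency) and the `retrain_fn` fine-tuning step are hypothetical names.

    import torch

    def p2_quantize(w: torch.Tensor) -> torch.Tensor:
        """Nearest power-of-two quantizer, sign preserved."""
        return torch.sign(w) * 2.0 ** torch.round(torch.log2(w.abs().clamp_min(1e-12)))

    def fairquantize(model, score_fn, final_frac, retrain_fn, steps=10):
        """Gradually raise the fraction of quantized weights; at each step,
        quantize the weights whose score (e.g., H^u - beta * H^p) says they
        matter more to the privileged group, then retrain before the next
        increase."""
        for step in range(1, steps + 1):
            frac = final_frac * step / steps
            with torch.no_grad():
                for _, w in model.named_parameters():
                    k = int(frac * w.numel())
                    if k == 0:
                        continue
                    idx = torch.topk(score_fn(w).flatten(), k).indices
                    flat = w.flatten()
                    flat[idx] = p2_quantize(flat[idx])
                    w.copy_(flat.view_as(w))
            retrain_fn(model)  # recover accuracy after each increase in quantized fraction
        return model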

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel idea: I like the idea of model weight quantization; it allows finer-grained control than model pruning (the earlier method)
    2. Good experiments and comparisons with baselines
    3. Definitely interesting to the MICCAI community
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In equation 2, the score is defined by the difference of hessians multiplied by the square of the change in weights, but in Algorithm 1, it is defined as the difference of the hessians (I think the latter is more appropriate)
    2. I am confused about Table 1. First, it is not clear to me what task the metrics are being evaluated on. EOpp1 is defined to be the difference between the true positive rates of the two groups. So for the vanilla method, it should be recall_light - recall_dark = 0.086, not 0.361 as reported. Please clarify. Note that true positive rate, sensitivity, and recall all mean the same thing.
    3. The same comment as above holds for Table 2
    4. The metric EOdd seems to take the mean of EOpp0 and EOpp1, without taking into consideration the class imbalance. Not sure if that makes sense.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I think the authors need to better describe the experiments, including what the specific tasks are on each dataset, how the metrics such as precision, recall as well as the fairness metrics are calculated. Currently, it is very confusing and detracts from the merits of the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is interesting. But the data presented in the tables seems inconsistent. I am assuming that those inconsistencies can be explained in the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper is very well written and clear and provides extensive experimental results. It is interesting to use the strategy of quantization to measure the node importance of a network. FairQuantize outperforms previous SOTAs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is very well written and clear and provides extensive experimental results. It is interesting to use the strategy of quantization to measure the node importance of a network. The methodology part is clear and easy to understand. FairQuantize outperforms previous SOTAs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is unclear whether FairQuantize outperforms the previous ME-FairPrune on the Fitzpatrick-17k dataset. While FairQuantize has lower EOpp1 and EOdd, ME-FairPrune has a smaller accuracy difference and better average accuracy. The accuracy difference should also demonstrate the fairness performance of models [1]. [1] Caton, Simon, and Christian Haas. “Fairness in machine learning: A survey.” ACM Computing Surveys (2020).

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper has provided the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) It would be great if the paper could extend to sensitive attributes with multiple values in a future version of this work. (2) How did the author define the hyperparameters? While this paper provided code, it would be good to have implementation details in the supplementary material.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very well written and tackles the issue of unfairness. The usage of quantization in the fairness task is interesting. However, the performance comparison between FairQuantize and ME-FairPrune on the Fitzpatrick-17k dataset is vague.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my main concerns. This paper has good experiments and some good observations (rather than directly pruning weights, it makes a more fine-grained adjustment through quantization), which will be interesting for the community. I am sticking to my original score for this paper.




Author Feedback

The main concerns are 1) minor writing issues and 2) the novelty of our work. For 1), we respond to each point individually; for 2), we argue that our method is a substantial improvement with sufficient novelty. Abbreviations: R3 denotes Reviewer #3; Q6.1 denotes the 1st inquiry under Question 6.

R3

Q6.1 The score definition differs between Eq. 2 and Algorithm 1. This is a typo in Alg. 1; we will correct it.

Q6.{2-4} Tasks for metrics unclear; is EOpp1 the difference between true positive rates (TPR) of the two groups of the fairness attribute? Is EOdd the mean of EOpp0 and EOpp1? Precision, recall, and F1-score are accuracy metrics; EOpp0, EOpp1, and EOdd are fairness metrics. No, EOpp1 is obtained by computing the TPR difference for each target class and taking the average. No, EOdd is obtained by taking the differences of TPRs and FPRs between the two groups for each class and averaging. A sketch of this averaging follows.
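The sketch below is one plausible implementation of the averaging described above; the one-vs-rest TPR/FPR computation and the use of absolute gaps are our assumptions, so the paper's exact aggregation may differ.

    import numpy as np

    def rates(y_true, y_pred, cls):
        """One-vs-rest TPR and FPR for a single target class (numpy arrays)."""
        pos = y_true == cls
        tpr = (y_pred[pos] == cls).mean() if pos.any() else 0.0
        fpr = (y_pred[~pos] == cls).mean() if (~pos).any() else 0.0
        return tpr, fpr

    def eopp1_eodd(y_true, y_pred, group, classes):
        """EOpp1: per-class TPR gap between the two groups, averaged over classes.
        EOdd: per-class TPR gap plus FPR gap, averaged over classes."""
        tpr_gaps, odd_gaps = [], []
        for c in classes:
            t0, f0 = rates(y_true[group == 0], y_pred[group == 0], c)
            t1, f1 = rates(y_true[group == 1], y_pred[group == 1], c)
            tpr_gaps.append(abs(t1 - t0))
            odd_gaps.append(abs(t1 - t0) + abs(f1 - f0))
        return float(np.mean(tpr_gaps)), float(np.mean(odd_gaps))

This per-class averaging explains why EOpp1 can differ from a single overall recall gap such as recall_light - recall_dark.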

R4

Q6 FairQuantize vs. ME-FairPrune on Fitzpatrick-17k; accuracy difference should also demonstrate fairness. Fitzpatrick-17k is unbalanced, and accuracy does not reflect performance well. This is why we report precision, recall, and F1 instead.

Q10.1 Extend to sensitive attributes with multiple values? Future work will address multi-value sensitive attributes.

Q10.2 How were hyper-parameters defined? The hyper-parameter β arises from a weighted sum. E.g., a score may be 3 * importance^u - 2 * importance^p; normalizing by the first coefficient gives β = 2/3 for the second term.
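In symbols (our rendering of this example, with I^u and I^p denoting the two group importances):

    s_i \;=\; a\,I_i^u - b\,I_i^p \;=\; a\Bigl(I_i^u - \tfrac{b}{a}\,I_i^p\Bigr), \qquad \beta = \tfrac{b}{a},

so a = 3, b = 2 gives β = 2/3; only the ratio of the two coefficients matters.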

R5

Q10.{1,4-6,8-9} “Privileged and unprivileged” was explained in Sec 2.1 but first used in the Introduction. We will optimize the notation and clarify the model description in Sec 2.1, add the round operation for log2 in Sec 2.2, add the missing definition of “scoring sets”, and clarify the computations and flags in Algorithm 1. H^u and H^p are calculated on each pair in S^u and S^p; this is why line 10 takes an average. This is not a mistake. Two subsets of the training set, with the same number of data points for the two groups of the sensitive attribute, are randomly chosen and used for calculating the Hessian matrices; these are the scoring sets. We will add the corresponding definition.

Q10.2 “However, such suppression of information… greatly degrading the prediction performance.” conflicts with [16]. Section 5.1.1 and Table 2 in [16] show that the proposed methods do not outperform “weighted loss” on the exclusive test split and have a small performance drop compared to “standard” on the co-occurring split. We will tone down our statement as suggested.

Q10.3 Why “thereby” in “FairQuantize addresses this by adjusting the computation precision of these pivotal weights through quantization, thereby balancing accuracy and fairness…” We will change it to “FairQuantize addresses this by adjusting weight precision instead of directly getting rid of them, so as to provide better balance…”

Q10.11 Explain improvement difference between datasets. Fitzpatrick-17k is a 114-class skin disease dataset while ISIC 2019 is an 8-class dataset, making Fitzpatrick-17k more complex and difficult to improve.

Q10.12 How β is selected. We did a grid search starting from 1.0 with a step of -2/9, approximating the resulting fractions as decimals. The candidate values implied by that grid are enumerated below.
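For concreteness, our enumeration of that grid (the range of five values is an assumption; the rebuttal only states the start and step):

    from fractions import Fraction

    # Grid search over beta starting at 1.0 with step -2/9, as stated above;
    # 7/9 ~= 0.778 and 5/9 ~= 0.556 match the values the reviewer asked about.
    betas = [Fraction(1) - k * Fraction(2, 9) for k in range(5)]
    print([round(float(b), 3) for b in betas])  # [1.0, 0.778, 0.556, 0.333, 0.111]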

Q{10.7;6,12} Any special difference in Eq. 2-4 from FairPrune? Novelty of the work. In Eq. 2-4, we use the quantization error instead of the weight magnitude (as in FairPrune) when computing the importance score, because quantization only changes weights to quantized values, while pruning changes weights to zero. Our work shares the same motivation as FairPrune but improves upon it by offering finer-grained optimization through quantization. Specifically, quantization allows a weight to be quantized to different levels, while pruning either leaves a weight unchanged or sets it to zero. Besides, quantization can be used alone or after pruning, and studies show that, though strongly backbone-dependent, it generally outperforms pruning in maintaining accuracy. In conclusion, our method is theoretically and empirically superior to FairPrune.
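In symbols, our reading of the distinction (a reconstruction from the description above, not the paper's exact Eq. 2):

    s_i^{\text{FairPrune}} = \bigl(H^u_{ii} - \beta H^p_{ii}\bigr)\,\theta_i^2, \qquad
    s_i^{\text{FairQuantize}} = \bigl(H^u_{ii} - \beta H^p_{ii}\bigr)\,\bigl(\theta_i - \hat\theta_i\bigr)^2,
    \quad \hat\theta_i = \operatorname{sign}(\theta_i)\,2^{\operatorname{round}(\log_2\lvert\theta_i\rvert)}

Pruning sets θ_i to zero, so the perturbation it scores is θ_i itself; quantization only moves θ_i to the nearest power of two, so the scored perturbation is the (typically much smaller) quantization error.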




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper is well-written and presents an interesting and relevant methodology for fairness improvement in medical image classification. The rebuttal answered the comments of the reviewers sufficiently. For the camera-ready version, I would recommend highlighting the difference of the proposed method from FairPrune and include the experimental settings regarding the hyperparameter selection and the metric calculation in the paper or the supplementary.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed reviews. The major concerns of Rev #5 are a lack of novelty and similarities with the method FairPrune. After reading the other reviewers’ comments, and going over the paper, while I agree that the motivation and general framework is similar, the method is fundamentally different: here, a model is quantized instead of pruned to achieve better fairness trade-offs. To this AC, this is a novel methodology worthy of publishing.



