Abstract

Deep learning models have shown considerable promise in the classification of skin lesions. However, a notable challenge arises from their inherent bias towards dominant skin tones and the issue of imbalanced class representation. This study introduces a novel data augmentation technique designed to address these limitations. Our approach harnesses contextual information from the prevalent class to synthesize various samples representing minority classes. Using a mixup-based algorithm guided by an adaptive sampler, our method effectively tackles bias and class imbalance issues. The adaptive sampler dynamically adjusts sampling probabilities based on the network’s meta-set performance, enhancing overall accuracy. Our research demonstrates the efficacy of this approach in mitigating skin tone bias and achieving robust lesion classification across a spectrum of diverse skin colors from two distinct benchmark datasets, offering promising implications for improving dermatological diagnostic systems.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2178_paper.pdf

SharedIt Link: https://rdcu.be/dVZeC

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72378-0_35

Supplementary Material: N/A

Link to the Code Repository

https://github.com/fa-submit/Submission_M

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Ans_Algorithmic_MICCAI2024,
        author = { Ansari, Faizanuddin and Chakraborti, Tapabrata and Das, Swagatam},
        title = { { Algorithmic Fairness in Lesion Classification by Mitigating Class Imbalance and Skin Tone Bias } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        page = {373 -- 382}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This manuscript proposes a new sampling strategy to mitigate issues caused by class imbalance and skin tone bias in skin lesion classification. The proposed sampling strategy combines instance sampling, adaptive sampling and heuristic augmentation. A combined dataset consisting of ISIC 2018 and the Asan dataset was used for evaluation. The proposed method was compared against various existing sampling methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and prepared, which makes it easy to read and follow.
    2. The proposed method aims to address two current challenges in skin lesion classification: class imbalance and skin tone bias. Both problems are critical.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors claim that the proposed method simultaneously addresses the limitations of class imbalance and skin tone bias. However, it is not clear how the proposed method addresses skin tone bias.
    2. It is also relatively difficult to understand the technical contributions. The proposed sampling is a combination of several simple existing sampling methods.
    3. The experimental results are worse or incremental when compared to the current methods. Therefore, it is less convincing that the proposed method is addressing the problem.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Although the proposed method claims to address the limitations of both class imbalance and skin tone bias, I can only find methodology that addresses the class imbalance problem. It is not clear how the skin tone bias problem is solved.
    2. Many deep learning based methods have been proposed to address the problem of imbalanced classes and long tail. However, none of these methods has been included for comparison or discussion.
    3. There is also quite a bit of repetition, e.g., the training dataset is defined multiple times.
    4. The input images have been resized to 100 by 100 pixels, which seems to be too small to make an accurate prediction.
    5. On page 6, the manuscript states that the Asan dataset has around 12 classes. What is the exact number of classes?
    6. The implementation details are missing, e.g., how lambda is calculated.
    7. From the experimental results, it seems that the overall accuracy has improved; however, it is difficult to assess whether the class imbalance problem has been minimized.
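    One way to look past overall accuracy when judging class imbalance is to report per-class recall together with summary scores such as balanced accuracy (Bacc) and the geometric mean (GM), the two scores quoted in the author feedback. The sketch below is illustrative only: the function names are hypothetical, and it assumes Bacc is the arithmetic mean and GM the geometric mean of per-class recall, which may differ from the exact definitions used in the paper.

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Recall for each class; NaN for classes absent from y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = np.full(num_classes, np.nan)
    for k in range(num_classes):
        mask = y_true == k
        if mask.any():
            recalls[k] = np.mean(y_pred[mask] == k)
    return recalls

def balanced_accuracy(recalls):
    """Assumed definition: arithmetic mean of per-class recall."""
    return float(np.nanmean(recalls))

def geometric_mean(recalls):
    """Assumed definition: geometric mean of per-class recall."""
    valid = recalls[~np.isnan(recalls)]
    return float(np.prod(valid) ** (1.0 / len(valid)))

# Toy example: class 0 dominates, class 1 is partly misclassified as class 0.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 2]
recalls = per_class_recall(y_true, y_pred, num_classes=3)
print(recalls, balanced_accuracy(recalls), geometric_mean(recalls))
```

    In the toy example the overall accuracy is 7/8, while the per-class recalls expose that only the minority class 1 is affected; GM penalises a weak class more strongly than Bacc does.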
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper aims to introduce a new sampling strategy to address the problems of imbalanced class distribution and skin tone bias. However, it is not clear how the skin tone bias problem has been fixed. The problem of imbalanced class distribution has been well investigated in the field. The proposed method seems to be a simple implementation of existing methods. The experimental results are also not convincing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work presents a data augmentation technique and utilizes contextual information from the dominant class to generate diverse samples representing minority classes. The authors address bias and class imbalance issues by employing a mixup-based algorithm, which is also guided by an adaptive sampler. The adaptive sampler adjusts sampling probabilities dynamically based on the network’s meta-set performance to improve the overall accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The researchers have attempted to solve a critical issue in the medical community regarding the large-scale deployment of systems by addressing the data bias present in AI-based algorithms. The proposed approach has been evaluated on the ASAN and ISIC-2018 datasets. Due to the limited availability of SOTA methods, the authors have used existing cost-sensitive methods for comparative analysis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of the proposed approach is limited since the augmentation approach closely resembles existing approaches:

    1. Yun S, Han D, Oh SJ, Chun S, Choe J, Yoo Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision 2019 (pp. 6023-6032).

    Moreover, the resampling method used is also an application of an existing resampling approach, which is generally used to solve class imbalance issues.

    The latest versions of ISIC datasets are available, but the motivation behind using old datasets still needs to be determined. The authors should add specific reasons why they have used the ASAN and ISIC-2018 datasets.

    Particularly due to skin-tone bias, the Fitzpatrick-17k dataset is very common and can be used to demonstrate the effectiveness of the proposed approach.

    Also, the authors are encouraged to use fairness evaluation metrics like Eodd, Eopp0, Eopp1, etc., to determine the fairness of the proposed approach.
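    For readers unfamiliar with these metrics, the following is a minimal sketch for a binary classifier and two subgroups. It follows the author feedback in taking EOpp1 as the TPR gap and EOpp0 as the TNR gap between subgroups. The Eodd form used here (absolute value of the summed TPR and FPR gaps) is an assumption, since several variants exist, and all function names are hypothetical.

```python
import numpy as np

def rates(y_true, y_pred):
    """Return (TPR, TNR) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return float(tpr), float(tnr)

def fairness_gaps(y_true, y_pred, group):
    """Gaps between subgroup 0 and subgroup 1 (e.g. two skin tone groups)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr0, tnr0 = rates(y_true[group == 0], y_pred[group == 0])
    tpr1, tnr1 = rates(y_true[group == 1], y_pred[group == 1])
    fpr0, fpr1 = 1.0 - tnr0, 1.0 - tnr1
    return {
        "EOpp0": abs(tnr1 - tnr0),  # TNR difference
        "EOpp1": abs(tpr1 - tpr0),  # TPR difference
        # Assumed variant of equalized odds: |dTPR + dFPR|.
        "Eodd": abs((tpr1 - tpr0) + (fpr1 - fpr0)),
    }

# Toy example: group 1 misses one of its two positives.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
print(fairness_gaps(y_true, y_pred, group))
```

    Lower values indicate smaller disparities between subgroups. A multi-class version would typically aggregate these gaps over classes.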

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Make use of recent dermatology datasets in the proposed study.
    2. To determine the effectiveness of the proposed approach in addressing fairness, the authors should use metrics like Eodd, Eopp0, Eopp1, etc.
    3. The authors should emphasize how the proposed augmentation approach differs from existing works like CutMix, CopyPaste, and MixUp; the resemblance to these works weakens the novelty of the proposed work.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The authors have not used the recent benchmark datasets to demonstrate the effectiveness of the proposed approach in solving the fairness issue.
    2. Ablation studies for related implementation parameters need to be thoroughly detailed and discussed in the paper.
    3. Fairness evaluation metrics like Eodd, Eopp0, and Eopp1 have not been used.
    4. The proposed augmentation approach is similar to existing augmentation approaches such as CutMix, CutOut, and MixUp. Overall novelty is therefore limited, since the sampler used in the study also has limited novelty.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    While your paper addresses an interesting topic, the contribution to the field does not sufficiently advance the current state of knowledge. The manuscript lacks significant novelty compared to existing literature. After reading the rebuttal, I am keeping my previous decision unchanged.



Review #3

  • Please describe the contribution of the paper

    The paper proposes an innovative combination of data augmentation and adaptive sampling for images of skin lesions that compensates for class imbalance by emphasising underrepresented classes, with the aim of countering skin tone bias. New lesion images are generated by blending features of existing images, and model training is guided by sampling images whose representation is more difficult for the model to capture (difficulty is estimated after every epoch). The method is evaluated on two publicly available datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method for countering skin tone bias during model training is innovative. The method generates synthetic images by linearly mixing existing images obtained by uniform sampling, and by adaptive sampling based on a heuristic function that determines how much difficulty the model has in classifying data instances of specific classes. On average, the method outperforms the other benchmarked methods on the chosen datasets.
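    To make the idea of difficulty-driven sampling concrete, here is a hedged sketch, not the authors' implementation. It assumes that class difficulty is measured as one minus the per-class recall on a held-out meta-set after each epoch, and that classes are then drawn in proportion to that difficulty. The function names, the smoothing term `eps`, and the difficulty measure itself are all assumptions made for illustration.

```python
import numpy as np

def class_sampling_probs(meta_recall, eps=0.05):
    """Map per-class meta-set recall to class sampling probabilities.

    Harder classes (lower recall) receive higher probability. `eps`
    keeps well-learned classes from dropping to zero probability.
    """
    difficulty = 1.0 - np.asarray(meta_recall, dtype=float) + eps
    return difficulty / difficulty.sum()

def sample_indices(labels, class_probs, n, rng):
    """Draw n training indices: pick a class, then an instance uniformly.

    Assumes every class has at least one instance in `labels`.
    """
    labels = np.asarray(labels)
    classes = rng.choice(len(class_probs), size=n, p=class_probs)
    return np.array([rng.choice(np.flatnonzero(labels == c)) for c in classes])

# After an epoch: class 2 is poorly recognised on the meta-set.
probs = class_sampling_probs([0.9, 0.5, 0.1])
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 0, 1, 1, 2])
batch = sample_indices(labels, probs, n=8, rng=rng)
print(probs, labels[batch])
```

    Recomputing `probs` after every epoch is what would make such a sampler adaptive; the drawn instances could then be paired with uniformly sampled ones for mixing.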

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Because only one CNN model architecture was used, and under a single training regime (i.e. hyperparameter values were set in advance) it is difficult to assess if the proposed method would outperform other benchmarked methods if the circumstances changed. Thus, the conclusions of this work are to some extent limited.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minor remarks and concerns:

    • “Beta distribution B(α, α).” – is this correct?
    • “We utilize convolutional neural networks (CNNs) like ResNeXt-50” – because only that model was used, I suggest rephrasing, i.e. “We utilize the convolutional neural network (CNN) ResNeXt-50”. The authors should explain the reasons for using this exact model architecture, as well as other hyperparameter values set in their experiments.
    • I am not sure that linear blending of images always results in plausible skin lesion images. This should be discussed.
    • The numbers presented in Tables are percentages.
    • I suggest using the term “instance” or “example” instead of “sample” when addressing individual cases/data points. A sample (in statistics, primarily) is a set of instances. Moreover, “sample”/“sampling” is also used as a verb. Using the term “sample” for one “instance” makes the text more difficult to follow. Using correct terminology is especially important in a paper that discusses sampling strategies - such as this one.
    • I suggest proofreading the paper for English grammar and syntax, either with the help of a native speaker or by using an online service.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is an interesting read and is easy to follow. The proposed method is clearly presented, the experimental setup for validating and benchmarking the method is adequate for demonstrating its usefulness, and the conclusions are supported by the results. Overall, I believe it will be interesting to the MICCAI audience.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

[R1,3] We address two challenges in skin lesion classification simultaneously, data imbalance and limited diversity, which is a gap in the literature. Existing approaches mitigate imbalance but don’t adapt to diverse skin tones, making direct comparison difficult. Our approach can integrate with contrastive learning frameworks, which we plan to explore in future work.

[R1] For consistency in analysis and comparison, we resized ISIC-2018 images to 100x100 pixels, matching the ASAN dataset. ASAN has 12 classes, 5 of which are shared with ISIC-2018; experiments were done on both class sets. After addressing imbalance, we obtained improvements of 1.63% for GM and 2.54% for Bacc for ASAN and improvements of 4.42% for GM and 3.49% for Bacc for ISIC, showing efficacy across different ethnic groups. Similarly, consistent patterns emerged when training on ISIC and testing on the same dataset or on ASAN with varying ethnicities.

[R1,4] Implementation detail for lambda is provided in the Preliminaries section. Lambda is sampled from the beta distribution B(α, α), with α set to 0.2 in the experiments.

[R3] The Fitzpatrick-17k dataset includes both pale and dark skin tones but cannot directly demonstrate our addressed problem due to mixed pale and darker skin images. Also, ISIC-2018 was used instead of ISIC-2019 since it has data from Argentina and Turkey (Wen et al., 2022), while we only want the Caucasian sources of ISIC-2018 in order to contrast with non-white ASAN data.

[R1,3] When training with ASAN and testing on a combined test set of ASAN and ISIC, we found that our method’s Eodd is 15.2%, nearly 3% lower than that of CBS and Mixup, which is approximately 18%. EOpp1 (the TPR difference for different subgroups) is 31.7% for our method, compared with around 35% for the nearest methods, CBS and focal loss. Additionally, the value of EOpp0 (the TNR difference) for our method is 7.5%, closest to CBS at 7.8%. On the ISIC test set, the Bacc of our method is 11% higher than that of CBS. These metrics demonstrate predictive fairness of the proposed method.

[R3] Unlike existing methods (CutMix, CopyPaste, MixUp), our adaptive sampler dynamically adjusts sampling probabilities based on the network’s learning state and a meta-set, focusing on informative minority class examples and preventing overfitting. Instance sampling enhances minority class representations by leveraging context from majority class samples. Our heuristic-driven augmentation generates new training data, improving the model’s understanding of minority samples and adjusting decision boundaries for different ethnicities, unlike the static nature of the 3 competitors. CutMix replaces parts of an image with patches from other images; this improves performance in standard image tasks like CIFAR & ImageNet (Yun et al, 2019). However, applying such methods to medical datasets like those for skin cancer detection yields less favourable results (Rao et al, 2023). Altering or removing image regions can lead to loss of diagnostic information, reducing the model’s accuracy in identifying and classifying skin lesions.

[R4] ResNeXt-50, with its modular multi-branch architecture and improved cardinality over ResNet-50, was our primary model for simplifying image classification. We also experimented with EfficientNet. When training on the ASAN dataset, we observed an improvement of about 4% in Bacc compared to the closest methods, CBRW and Cbrt. Our Bacc was around 73%, compared to their 69%. When tested on the ISIC dataset, our Bacc was 42.3%, while the closest methods (CBS, Cbrt) achieved around 40.2%. Due to space limitations, detailed results and discussions on EfficientNet were omitted. We will include them in the final manuscript.

The blending acts as a form of regularisation, reducing overfitting and enabling the model to learn more robust features that are not specific to individual images but general across the dataset.
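
The lambda sampling and linear blending mentioned in the feedback can be illustrated with standard mixup. The sketch below is plain mixup with lambda drawn from Beta(α, α) and α = 0.2 as stated above; it is not the authors' heuristic-driven variant, and the function name and array shapes (100x100 RGB images, one-hot labels) are assumptions for illustration.

```python
import numpy as np

def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend two images and their one-hot labels with a Beta-sampled weight."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # lambda ~ Beta(alpha, alpha)
    x = lam * x_a + (1.0 - lam) * x_b     # pixel-wise linear blend
    y = lam * y_a + (1.0 - lam) * y_b     # matching soft label
    return x, y, lam

rng = np.random.default_rng(0)
x_major = rng.random((100, 100, 3))       # stand-in for a majority-class image
x_minor = rng.random((100, 100, 3))       # stand-in for a minority-class image
y_major = np.array([1.0, 0.0])
y_minor = np.array([0.0, 1.0])
x_mix, y_mix, lam = mixup_pair(x_major, y_major, x_minor, y_minor, rng=rng)
print(x_mix.shape, y_mix, lam)
```

With α = 0.2 the Beta distribution is U-shaped, so most draws of lambda fall close to 0 or 1 and a blended image usually stays visually close to one of its two sources.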




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper “Algorithmic Fairness in Lesion Classification by Mitigating Class Imbalance and Skin Tone Bias” has several key weaknesses. The proposed method claims to address both class imbalance and skin tone bias, but it is unclear how it effectively mitigates skin tone bias. The novelty of the approach is limited, as it combines existing sampling strategies without significant new contributions. Additionally, the use of old datasets like ISIC-2018, instead of newer versions or more relevant datasets like Fitzpatrick-17k, weakens the study’s relevance.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper is well-written and the topic is interesting to the community. However, the used evaluation metrics and setup focus on overall classification accuracy and do not present fairness metrics and subgroup analysis. Moreover, the chosen baselines are focused on class imbalance, even though the title of the paper contains “Algorithmic Fairness”. Baselines, such as GroupDRO or other fairness-focused methods should also be included to showcase the improvements achieved by the proposed method in terms of worst-group performance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed reviews and the criticism relates to novelty and insufficient benchmarking. This meta reviewer argues that the paper makes a valuable contribution despite its limitations. In particular, the paper methodology is generally sound, it presents a new and interesting sampling strategy, and it addresses an underrepresented area of research. Thus, the paper makes a good starting point for further research and is well in scope for the health equity track. The authors should highlight limitations in their discussion.


