Abstract

Deep learning has been widely utilized in medical diagnosis. Convolutional neural networks and transformers can achieve high predictive accuracy, on par with or even exceeding human performance. However, uncertainty quantification remains an unresolved issue, impeding the deployment of deep learning models in practical settings. Conformal analysis can, in principle, estimate the uncertainty of each diagnostic prediction, but doing so effectively requires extensive human annotations to characterize the underlying empirical distributions. This has been difficult in the past because instance-level class distribution data have been unavailable: collecting massive ground truth labels is already challenging, and obtaining the class distribution of each instance is even more difficult. Here, we provide a large skin cancer instance-level class distribution dataset, SkinCON, that contains 25,331 skin cancer images from the ISIC 2019 challenge dataset. SkinCON is built upon over 937,167 diagnostic judgments from 10,509 participants. Using SkinCON, we propose the distribution regularized adaptive predictive sets (DRAPS) method for skin cancer diagnosis. We also provide a new evaluation metric based on SkinCON. Experimental results demonstrate the quality of our proposed DRAPS method and the variation of uncertainty with respect to patient age and sex from a health equity and fairness perspective. The dataset and code are available at https://skincon.github.io.
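For readers unfamiliar with conformal prediction sets, the following is a minimal, generic sketch of split-conformal adaptive prediction sets (APS-style). It is not the authors' DRAPS implementation; the function names, the conformity score, and the 8-class setup are assumptions used only to illustrate how a calibration set turns softmax outputs into prediction sets with a marginal coverage guarantee.

```python
# Generic sketch of split-conformal adaptive prediction sets (APS-style).
# NOT the authors' DRAPS method; all names and details here are illustrative assumptions.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the conformal threshold from a held-out calibration set.

    cal_probs:  (n, K) softmax outputs on calibration images
    cal_labels: (n,)   integer ground-truth diagnoses
    alpha:      target miscoverage, e.g. 0.1 for ~90% coverage
    """
    n = len(cal_labels)
    # APS conformity score: cumulative probability mass needed to include the true class.
    order = np.argsort(-cal_probs, axis=1)                      # classes sorted by descending prob
    sorted_probs = np.take_along_axis(cal_probs, order, axis=1)
    cum = np.cumsum(sorted_probs, axis=1)
    true_rank = np.argmax(order == cal_labels[:, None], axis=1) # rank of the true class per image
    scores = cum[np.arange(n), true_rank]
    # Finite-sample corrected quantile gives the marginal coverage guarantee.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_set(probs, q):
    """Return the smallest set of classes whose cumulative probability reaches q."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, q)) + 1
    return order[:min(k, len(probs))]
```

With, say, alpha = 0.1 and an 8-class softmax output, prediction_set returns the smallest set of diagnoses whose cumulative probability reaches the calibrated threshold; across many test images such sets contain the true diagnosis roughly 90% of the time.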

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/4090_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://skincon.github.io

Link to the Dataset(s)

https://skincon.github.io

BibTex

@InProceedings{Ren_SkinCON_MICCAI2024,
        author = { Ren, Zhihang and Li, Yunqi and Li, Xinyu and Xie, Xinrong and Duhaime, Erik P. and Fang, Kathy and Chakraborty, Tapabrata and Guo, Yunhui and Yu, Stella X. and Whitney, David},
        title = { { SkinCON: Towards consensus for the uncertainty of skin cancer sub-typing through distribution regularized adaptive predictive sets (DRAPS) } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors center on an 8-class skin cancer classification task, introducing a novel dataset, SkinCON, derived from the ISIC 2019 challenge dataset, enriched with 937,167 diagnostic annotations by medical students and residents. The empirical diagnosis distribution from these annotations guided the training of classifiers with a focus on minimizing the KL divergence. The authors propose distribution regularized adaptive prediction sets (DRAPS) and Hit Rate (HR) for model evaluation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • New dataset collection: The authors gathered 937,167 diagnostic trials from 10,509 participants on 25,330 skin cancer images from the ISIC 2019 challenge dataset. If this dataset could be made publicly available later, it would be considered a strength.
    • Methodological Novelty: The integration of KL divergence into the loss function to align model predictions with empirical diagnosis distributions represents a creative approach to improving prediction set generation for skin cancer classification tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Limited extendibility: The DRAPS and HR proposed by the authors are generic, but the empirical diagnosis distribution is limited to the 8-class skin cancer classification task and is not suitable for other dermatological diagnostic tasks.
    • Dataset quality: The heavy reliance on the empirical diagnosis distribution for model training underscores the critical importance of SkinCON’s data quality. Given the apparent discrepancies highlighted in Figures 2, 3, and 4, where many accuracy measures fall below 0.5, the feasibility of utilizing this dataset without addressing potential labeling inaccuracies needs further clarification. The authors need to explain how SkinCON’s data quality is guaranteed and evaluated.
    • Some critical results are missing. Related to the previous concern, Table 1 needs an ablation study that excludes the KL divergence from the loss function.
    • Questionable results. In Figure 3, benign keratosis shows a more significant p-value than vascular lesions; visualizing the error bars would be helpful.
    • Unclear writing. For example, in Section 3, K is not defined.
    • Naming conflict: The choice of the abbreviation SkinCON needs a clearer delineation, as the name almost collides with another paper: https://skincon-dataset.github.io
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Limited extendibility: The DRAPS and HR proposed by the authors are generic, but the empirical diagnosis distribution is limited to the 8-class skin cancer classification task and is not suitable for other dermatological diagnostic tasks.
    • Dataset quality: The heavy reliance on the empirical diagnosis distribution for model training underscores the critical importance of SkinCON’s data quality. Given the apparent discrepancies highlighted in Figures 2, 3, and 4, where many accuracy measures fall below 0.5, the feasibility of utilizing this dataset without addressing potential labeling inaccuracies needs further clarification. The authors need to explain how SkinCON’s data quality is guaranteed and evaluated.
    • Some critical results are missing. Related to the previous concern, Table 1 needs an ablation study that excludes the KL divergence from the loss function.
    • Questionable results. In Figure 3, benign keratosis shows a more significant p-value than vascular lesions; visualizing the error bars would be helpful.
    • Unclear writing. For example, in Section 3, K is not defined.
    • Naming conflict: The choice of the abbreviation SkinCON needs a clearer delineation, as the name almost collides with another paper: https://skincon-dataset.github.io
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    New dataset collection

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose distribution regularized adaptive predictive sets to achieve uncertainty quantification for skin cancer sub-typing. Specifically, they provide SkinCON, a skin cancer instance-level class distribution dataset built from 10,509 participants who made diagnostic judgments on ISIC 2019. They also propose using the Hit Rate to evaluate the quality of the resulting prediction sets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors have defined the problem well. In skin cancer sub-typing, the output of uncertainty quantification provides a set of predictions that provably covers the true diagnosis with high probability, which has high practical clinical value.
    2. An instance-level class distribution dataset, SkinCON, is built to provide the underlying empirical distributions.
    3. The authors also provide a new evaluation metric, Hit Rate.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The clarity of the method and the organization of the paper should be improved. For example, in Section 3, it is unclear how the “Distribution Regularized Adaptive” part of DRAPS is integrated into the method. Is it achieved through the use of an additional Kullback-Leibler (KL) divergence loss for distribution regularization? Furthermore, how is the empirical distribution matched in the KL loss? I suggest that the authors provide a detailed description of this aspect to avoid confusion.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    In general, the training parameters have been explained. However, the authors have not provided details on how the dataset was partitioned.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In addition to my above remarks, here are my specific comments:

    1. How do the authors ensure that the distribution of annotations collected aligns with the distribution observed in actual clinical practice? Based on the statement in the paper, “The participants were mostly composed of medical students, with some medical residents,” does the higher proportion of medical students introduce a certain bias?
    2. I suggest that the authors appropriately condense the description of the dataset section, as this may lead to an overly brief section on the method, potentially making it difficult for readers to gain a clear understanding.
    3. Are the authors interested in open-sourcing the dataset in this paper, as this could greatly aid progress in the field?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors provided SkinCON, a large skin cancer instance-level class distribution dataset for decision consensus building. The topic of uncertainty quantification in skin cancer diagnosis is quite new and it has a high practical clinical value.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents the new SkinCON dataset, which is based on ISIC 2019 and contains over 900,000 diagnostic judgments from medical professionals. This dataset’s purpose is to model clinicians’ inherent uncertainty when diagnosing skin lesions. The paper then presents an exploration of this dataset, taking into account metadata biases across lesion types. The paper also introduces a new method for distribution regularisation (Distribution Regularised Adaptive Prediction Sets), which enables a neural network model to learn to represent the distribution from the dataset. Experiments are then presented across a variety of neural network architectures, with results reported using a newly proposed metric, Hit Rate.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation for the dataset and the proposed method is strong, and it clearly explains why such a dataset is needed and how it fills a gap in the current literature.

    The authors’ methodology for creating the dataset is clearly presented with detailed annotation procedure, and all decisions are well motivated.

    The paper provides a detailed analysis of fairness across demographics for each lesion classification, as well as an analysis of diagnostic bias based on the diagnostic labels collected from the dataset.

    The method for Distribution Regularised Adaptive Prediction Sets is well-motivated and extends previous distribution regularisation methods. It is well presented in two sets of pseudo-code, making it easy to reproduce and understand.

    The experiments used multiple CNN backbones to demonstrate that the method is applicable to a variety of neural network architectures.

    The experiments show that the proposed method outperforms the standard baseline and RAPS, achieving higher metrics on average, including the newly introduced hit rate.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The chosen neural network architectures could have included vision transformers, which are becoming increasingly popular in the field of medical imaging.

    The experiments could have been repeated multiple times, with the average and standard deviations reported to determine how robust the methods are.

    The presentation of some tables could have been improved to highlight the strongest results.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The dataset annotation procedure has been well explained, enough that others could repeat their process to recreate the dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See Weaknesses for main points.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a strong paper that presents interesting data and fills a gap in the current literature. It is well presented, and the algorithm presented complements the dataset nicely, demonstrating the need for further investigation into both this dataset and the algorithm.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all reviewers for their valuable comments! Here are some clarifications and responses. For convenience, we use W for weaknesses and C for additional comments.

Reviewer #1. W1: Yes, we are considering adding those advanced models, but we expect that they would benefit less from DRAPS since their baseline performance is higher. W2: Sure, we will add details on robustness in the camera-ready version. W3: Thanks for your suggestion! We have highlighted our best results in bold in the edited version.

Reviewer #3. W1: The method proposed in this work is generalizable. As specified in the algorithm description, we do not limit the choice of K, i.e., the number of possible classes; K does not have to be 8. It is 8 only for this dataset, and for other datasets K can be adapted accordingly. As this is a supervised method that needs data for training, for other dermatological diagnostic tasks one only needs to plug in the data and set the algorithm's parameters appropriately. W2: We acknowledge that the individual responses are noisy. However, the consensus accuracy (after popularity voting) is 61.86% (vs. a chance level of 12.5%), which means that, distribution-wise (the focus of this paper), our data capture important information. We will add details of the data analysis to the camera-ready version. W3: The Naive results (Naive columns in Table 1) are trained without the KL divergence in the loss function; the second sentence of Sec. 4 describes the Naive experiment. W4: Yes, we can show error bars in the camera-ready version. W5: We have added the corresponding definition to the text. W6: Thanks for pointing this out! The two works focus on different aspects. We will clarify this in the camera-ready version and cite the paper.

Reviewer #4. W1: Yes, the regularization is enforced by adding the KL divergence loss during training. We are working on making the illustration clearer in the camera-ready version. W2: The KL divergence measures how different two distributions are; the larger the KL divergence loss, the more different the two distributions. During training, the model parameters are optimized to minimize the KL divergence loss, which forces the learned distribution to be similar to the empirical distribution. We will add details of the KL divergence concept to the text. C1: First, the medical students in our sample were not significantly worse in overall performance: according to self-reported demographics, medical students' accuracy is 56.20% while non-student participants' accuracy is 51.39%. Second, board-certified dermatologists are uncommon, so recruiting a large sample of such participants is infeasible, particularly for large-scale datasets (~940k trials, ~11k participants). The similarly high overall performance of the different groups reassures us that the data are representative of typically trained observers. C2: Thanks for your suggestion! We are working on it for the camera-ready version. C3: Yes, we will release the dataset as well as the baseline model repository along with the camera-ready paper.
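Based on the authors' description above (standard supervised training with an added KL divergence term that pulls the predicted distribution toward the per-image empirical diagnosis distribution), the following is a minimal PyTorch-style sketch of what such a distribution-regularized loss could look like. The function name, the weighting hyperparameter lam, and the exact way the terms are combined are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of a distribution-regularized training loss (cross-entropy +
# KL divergence to the empirical human-diagnosis distribution). Not the paper's
# exact formulation; `lam` and the function name are assumed for illustration.
import torch
import torch.nn.functional as F

def distribution_regularized_loss(logits, targets, empirical_dist, lam=1.0):
    """
    logits:         (B, K) raw model outputs (K = number of diagnostic classes, 8 for SkinCON)
    targets:        (B,)   ground-truth class indices from ISIC 2019
    empirical_dist: (B, K) per-image class distribution aggregated from SkinCON annotators
    lam:            weight on the distribution-regularization term (assumed hyperparameter)
    """
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=1)
    # KL(empirical || predicted); minimizing it pushes the softmax toward the human distribution.
    kl = F.kl_div(log_probs, empirical_dist, reduction="batchmean")
    return ce + lam * kl
```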




Meta-Review

Meta-review not available, early accepted paper.
