Abstract

The robustness of supervised deep learning-based medical image classification is significantly undermined by label noise in the training data. Although several methods have been proposed to enhance classification performance in the presence of noisy labels, they face two key challenges: 1) they struggle with class-imbalanced datasets, frequently mistaking minority-class samples for noisy ones; 2) they focus solely on maximizing performance with noisy datasets, without bringing experts into the loop to actively clean the noisy labels. To mitigate these challenges, we propose a two-phase approach that combines Learning with Noisy Labels (LNL) and active learning. This approach not only improves the robustness of medical image classification in the presence of noisy labels, but also iteratively improves the quality of the dataset by relabeling the important incorrect labels under a limited annotation budget. Furthermore, we introduce a novel Variance of Gradients (VoG) approach in the LNL phase, which complements the loss-based sample selection by also sampling under-represented examples. Using two imbalanced, noisy medical classification datasets, we demonstrate that our proposed technique handles class imbalance better than its predecessors, largely avoiding the misidentification of clean minority-class samples as noisy. Code available at: https://github.com/Bidur-Khanal/imbalanced-medical-active-label-cleaning.git
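
To make the two-phase idea concrete, the following minimal sketch (NumPy only, with toy scores and hypothetical names; not the released implementation linked above) illustrates the loop: an LNL phase that separates likely-clean from likely-noisy samples using both loss and VoG, and an active phase that spends a fixed annotation budget on the most suspicious remaining labels.

    import numpy as np

    rng = np.random.default_rng(0)
    n, budget_per_round, rounds = 1000, 50, 4

    loss = rng.exponential(1.0, n)        # per-sample loss from the LNL phase (toy values)
    vog = rng.exponential(1.0, n)         # per-sample Variance of Gradients (toy values)
    relabeled = np.zeros(n, dtype=bool)   # labels an expert has already corrected

    for r in range(rounds):
        # Phase 1 (LNL): treat low-loss samples as clean, but also keep high-VoG
        # samples so hard, under-represented (yet clean) examples are not discarded.
        clean = (loss < np.quantile(loss, 0.6)) | (vog > np.quantile(vog, 0.9))

        # Phase 2 (active cleaning): rank the remaining suspected-noisy samples
        # and spend the annotation budget on the highest-priority ones.
        candidates = np.where(~clean & ~relabeled)[0]
        picked = candidates[np.argsort(-loss[candidates])][:budget_per_round]
        relabeled[picked] = True
        print(f"round {r}: relabeled {picked.size} samples "
              f"({relabeled.sum()} cleaned so far)")
        # In the real pipeline, the model is retrained on the updated labels and
        # the loss/VoG statistics are recomputed before the next round.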

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3963_paper.pdf

SharedIt Link: https://rdcu.be/dV57Z

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72120-5_4

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3963_supp.pdf

Link to the Code Repository

https://github.com/Bidur-Khanal/imbalanced-medical-active-label-cleaning.git

Link to the Dataset(s)

https://challenge.isic-archive.com/landing/2019/

https://zenodo.org/records/1214456

BibTex

@InProceedings{Kha_Active_MICCAI2024,
        author = { Khanal, Bidur and Dai, Tianhong and Bhattarai, Binod and Linte, Cristian},
        title = { { Active Label Refinement for Robust Training of Imbalanced Medical Image Classification Tasks in the Presence of High Label Noise } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {37 -- 47}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a two-phase approach merging learning with noisy labels (LNL) and active learning. The approach enhances the robustness of medical image classification against noisy labels and iteratively refines the dataset by correcting mislabels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The authors study a valuable and challenging problem, i.e., learning with noisy labels in imbalanced datasets for medical image classification.

    (2) The proposed method is complete and several concepts are used to compose the final pipeline.

    (3) Experimental results illustrate the consistent improvements of the proposed algorithm over baseline methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Novelty is limited. For example, noisy sample selection serves as a core contribution of the proposed LNL approach; however, the authors seem to directly adopt the algorithms proposed in [1] and [20] for medical image classification, lacking data-specific design and refinement.

    (2) Writing could be improved for better understanding. For example, (i) The introduction could benefit from a detailed explanation of the sample selection methods, specifically regarding the active label cleaning. (ii) The motivation and explanation for utilizing VoG to regularize sample selection in imbalanced datasets are insufficiently presented.

    (3) Both Co-training and CoreSet are not new in LNL and active learning. The absence of comparisons with more recent techniques weakens the persuasiveness of the proposed pipeline.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) The author can further improve clarity by providing pseudocode for the pipeline.

    (2) Some experimental results can be better explored and discussed. For example, the preference of different methods in the selection of noise samples can further validate the proposed pipeline to cope with imbalanced datasets.

    (3) Symbols and grammar require further inspection. For example, sample x_i or x_{ij} on Page 4; “accurately label set” on Page 2.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose a complete solution to a challenging noise learning problem and validate its effectiveness through sufficient experiments. However, the proposed method is more like a combination of existing techniques and lacks specific improvement for medical images. Additionally, the manuscript would benefit from improved organization and experimental comparisons with more recent algorithms.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors’ response addressed some of my concerns. However, this paper’s core contribution is to introduce methods from other fields into a new scenario. Personally, I do not consider this a strong innovation. I would leave this to the meta-reviewers to make the final decision.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a two-phase approach that combines Learning with Noisy Labels (LNL) and active learning. This approach not only improves the robustness of medical image classification in the presence of noisy labels, but also iteratively improves the quality of the dataset by relabeling the important incorrect labels under a limited annotation budget.

    Furthermore, this paper introduces a novel Variance of Gradients approach in the LNL phase, which complements the loss-based sample selection by also sampling under-represented examples.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Innovative approach: This study proposes a new method that integrates the Co-teaching strategy with Variance of Gradients (VOG) to address noisy labels in medical image datasets. Utilizing VOG as a regularizer to enhance the sample selection process, it takes into account the variation in gradients across multiple training epochs rather than relying solely on single-loss values.

    A novel formulation: The paper establishes a formula for the Variance of Gradients, building upon the foundation of the co-teaching model, which is an interesting point of innovation.

    Demonstration of clinical feasibility: Given the common issues of class imbalance and label noise in medical image datasets, this approach has the potential to improve algorithm robustness and practicality under imperfect data conditions, giving it significant clinical relevance. The use of Learning with Noisy Labels (LNL) to tackle noisy labels in medical image classification, combined with VOG for sample selection and relabeling, is original and particularly applicable in practice, especially when dealing with imbalanced datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limitations of the F1 Score: Although the F1 score combines precision and recall, it does not always fully reflect a model's performance, especially in cases of extreme class imbalance. Additionally, the F1 score may not reveal the performance differences of a model across categories.

    Assumptions about the Noise Rate: The paper assumes a specific noise rate, which may not fully represent the complexity of real-world data. In reality, the nature and distribution of noise might be more complex and not necessarily uniform or random.

    Lack of a Discussion Section: The absence of a discussion section may limit readers' deeper understanding of the paper, its self-criticism, and its directions for future research.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The research focuses on the impact of image noise on model accuracy in medical contexts, with particular attention to the significant effect of uneven sample volumes. While the paper discusses the division of samples, a significant difference in the number of images across the eight skin diseases could be a factor affecting model accuracy, and it is unclear whether this has been considered in the study. Therefore, the paper should consider providing evidence of the varying quantities of pathological image samples, as well as the differences in model outcomes. Additionally, this paper provides a limited introduction to previous studies and lacks a discussion section with detailed comparisons to past research, making it difficult for readers to assess the validity of this study's contributions. It is recommended to further include specific limitations and result data of previous research models.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Lack of Comprehensive Literature Review: The paper does not sufficiently cover the existing literature. This oversight makes it challenging to ascertain the uniqueness of the proposed method and its contributions in comparison to existing solutions.

    Insufficient Discussion on Methodology Comparisons: There is no detailed discussion comparing the proposed method with past research. This lack of comparative analysis makes it difficult to evaluate the true innovation and effectiveness of the proposed method, particularly how it advances beyond the current state of the art.

    Absence of a Discussion Section: The lack of a discussion section restricts the depth of analysis regarding the implications of the findings, the limitations of the study, and potential areas for future research. A well-articulated discussion is essential for contextualizing the study's impact and guiding ongoing research efforts.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The changes to the F1 score and the Discussion part are satisfactory, but doubts about the novelty of the work remain. Undeniably, this is a meaningful and clinically effective piece of research.



Review #3

  • Please describe the contribution of the paper

    In this study, the authors propose a novel Variance of Gradients approach for Learning with Noisy Labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The Variance of Gradients approach in Learning with Noisy Labels has novelty.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Sounds good

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    What is the difference between the two “noisy labels” blocks in Figure 1? In the second paragraph of Section 3.3, why “Co-teaching and Co-teaching VOG”?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The Variance of Gradients approach in Learning with Noisy Labels has novelty.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for their constructive feedback. The reviewers find our paper novel (R3, R4), addressing a pertinent and challenging problem (R1, R4), clinically feasible (R4), complete (R1), and consistently better than the baselines (R1). We address some concerns of the reviewers below:

i) Novelty (R1): We respectfully argue that our method does make novel contributions. As R3 and R4 noted, this is the first work to employ VOG in an active label-cleaning setup for a noisy imbalanced dataset. [1] and [20] presented studies on VOG, but didn’t use VOG for the LNL approach and, most importantly, they are out of scope for our setting.

ii) Motivation of VOG (R1): We emphasize that our method is well-motivated in Section 2.3, par. 1 and 2, detailing why VOG is crucial as a regularizer for imbalanced datasets. As explained in “ … underrepresented samples tend to exhibit high loss values because training is dominated by overrepresented samples, leading to their likely mis-selection as noisy samples …to avoid any potential bias, VOG estimates the change in gradients over epochs rather than making…”

iii) Improved Organization (R1): We decided to move some content from Section 2 to the introduction to ensure better clarity up front.

iv) Comparison with Recent Techniques (R1, R4): We agree that Co-teaching and CoreSet on their own may not be novel LNL and active learning approaches, respectively. However, our contribution lies in a unique solution that combines these two complementary approaches (LNL and active learning) for active label cleaning. Our approach is not directly comparable to LNL-only or active-learning-only methods; the closest state-of-the-art method for comparison is [2], which we have chosen as a baseline. We highlight that our approach is modular: the clean-sample selection and active sampling functions can each be replaced with improved alternatives, which could further boost performance.
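
For context, CoreSet-style active sampling is typically implemented as greedy k-center selection in feature space; a minimal sketch follows (illustrative only — the paper's acquisition step, and how it is combined with noisy-sample scores, may differ).

    import numpy as np

    def k_center_greedy(features, labeled_idx, budget):
        # Greedy k-center (CoreSet-style) selection: repeatedly pick the point
        # farthest from the current labeled/selected set in feature space.
        dist = np.full(len(features), np.inf)
        for i in labeled_idx:
            dist = np.minimum(dist, np.linalg.norm(features - features[i], axis=1))
        selected = []
        for _ in range(budget):
            j = int(np.argmax(dist))
            selected.append(j)
            dist = np.minimum(dist, np.linalg.norm(features - features[j], axis=1))
        return selected

    # Example: pick 10 samples to send for relabeling from 2-D toy features.
    feats = np.random.default_rng(0).normal(size=(500, 2))
    picked = k_center_greedy(feats, labeled_idx=[0, 1, 2], budget=10)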

v) Pseudocode (R1): We agree that pseudocode could enhance understanding. However, we use text, equations, Fig. 1, and implementation details to explain our method, as the formatting space constraints hinder the inclusion of pseudocode.

vi) Grammar (R1, R3): We will address these in the camera-ready version. We will also incorporate R3's suggestion in the figure for better quality and organization, i.e., use only a single “noisy label” block.

vii) F1-score (R4): Please note that we computed the macro-average F1-score (see the italic text on page 6), which averages the per-class F1-scores to account for class imbalance. Balanced accuracy, the macro-F1 score, and MCC are typical metrics used to evaluate imbalanced classification. We computed all of these metrics but reported only the macro-F1 score in the graphs for brevity and to avoid clutter. Additionally, in Sec. 4.2, we analyzed three extreme classes and showed that the VOG approach does not ignore extremely under-sampled classes, as indicated by the recall and correct-guess percentages.
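
For clarity, the macro-average simply computes the F1 score per class and averages with equal class weight, so minority classes contribute as much as majority ones; a toy example (scikit-learn, labels invented purely for illustration):

    from sklearn.metrics import f1_score

    y_true = [0, 0, 0, 1, 1, 2]
    y_pred = [0, 0, 1, 1, 1, 0]
    # Per-class F1 is computed independently and then averaged with equal weight,
    # so the rare class 2 influences the score as much as the common class 0.
    print(f1_score(y_true, y_pred, average="macro"))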

viii) Assumption about Noise (R4): Without loss of generality, we followed common standard protocols in LNL. For reference, we kindly point you to [2], which also uses uniform noise distribution. We experimented with different rates within the uniform distribution (0.4 & 0.5 for ISIC and 0.7 & 0.8 for LT-NCT-CRC-HE-100K) to cover various noise ranges. While we agree about the importance of other noise distributions, we had to limit the study due to page constraints. We will investigate other forms of noise in future work.
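
For reference, the uniform (symmetric) noise protocol referred to here is typically simulated as below — flipping a fraction of labels to a different class chosen uniformly at random; the rate and class count are examples only, and the paper's exact injection code may differ.

    import numpy as np

    def inject_uniform_noise(labels, noise_rate, num_classes, seed=0):
        # Flip a fraction `noise_rate` of labels to a different class chosen
        # uniformly at random (symmetric label noise, common in LNL benchmarks).
        rng = np.random.default_rng(seed)
        noisy = labels.copy()
        flip = rng.random(len(labels)) < noise_rate
        for i in np.where(flip)[0]:
            choices = [c for c in range(num_classes) if c != labels[i]]
            noisy[i] = rng.choice(choices)
        return noisy

    labels = np.random.default_rng(1).integers(0, 8, size=100)   # e.g. 8 ISIC classes
    noisy_labels = inject_uniform_noise(labels, noise_rate=0.4, num_classes=8)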

ix) Varying class samples (R4): We will investigate varying quantities of pathological image samples as an additional ablation study in our extended future work.

x) Discussion section (R1, R4): Thank you for the suggestion regarding a more detailed discussion. We unfortunately had to limit our discussion to the Results section due to the original page constraints, but we will expand it into a stand-alone Discussion section in the camera-ready version, given the additional space allowed.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers agree to accept this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    All reviewers agree to accept this paper.


