Abstract

Machine learning models experience deteriorated performance when trained in the presence of noisy labels. This is particularly problematic for medical tasks, such as survival prediction, which typically face high label noise complexity with few clear-cut solutions. Inspired by the large fluctuations across folds in the cross-validation performance of survival analyses, we design Monte-Carlo experiments to show that such fluctuations could be caused by label noise. We propose two novel and straightforward label noise detection algorithms that effectively identify noisy examples by pinpointing the samples that more frequently contribute to inferior cross-validation results. We first introduce Repeated Cross-Validation (ReCoV), a parameter-free label noise detection algorithm that is robust to model choice. We further develop fastReCoV, a less robust but more tractable and efficient variant of ReCoV suitable for deep learning applications. Through extensive experiments, we show that ReCoV and fastReCoV achieve state-of-the-art label noise detection performance in a wide range of modalities, models and tasks, including survival analysis, which has yet to be addressed in the literature. Our code and data are publicly available at https://github.com/GJiananChen/ReCoV.
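
To make the core idea concrete, here is a minimal, hypothetical Python sketch of the ReCoV intuition: repeat N-fold cross-validation many times with freshly shuffled splits, tally how often each sample lands in the worst-scoring fold, and flag samples whose tallies are outliers. The function name, the choice of classifier, and the z-score threshold are illustrative assumptions, not the exact published algorithm; see the linked repository for the real implementation.

    # Minimal sketch of the ReCoV idea (illustrative; not the authors' exact code).
    # Clean samples land in the worst fold roughly n_runs / n_folds times by
    # chance; noisy samples land there noticeably more often.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    def recov_sketch(X, y, n_runs=1000, n_folds=10, seed=0):
        rng = np.random.RandomState(seed)
        counts = np.zeros(len(y))  # worst-fold membership tally per sample
        for _ in range(n_runs):
            kf = KFold(n_splits=n_folds, shuffle=True,
                       random_state=rng.randint(2**31 - 1))
            fold_scores, fold_members = [], []
            for train_idx, val_idx in kf.split(X):
                model = LogisticRegression(max_iter=1000)
                model.fit(X[train_idx], y[train_idx])
                fold_scores.append(model.score(X[val_idx], y[val_idx]))
                fold_members.append(val_idx)
            counts[fold_members[int(np.argmin(fold_scores))]] += 1
        # Flag outliers in the tally distribution (threshold is illustrative).
        z = (counts - counts.mean()) / (counts.std() + 1e-12)
        return np.where(z > 3.0)[0]

Nothing in the sketch depends on the model class: any estimator with fit/score methods works, which matches the paper's claim that ReCoV is robust to model choice.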

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0372_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0372_supp.pdf

Link to the Code Repository

https://github.com/GJiananChen/ReCoV

Link to the Dataset(s)

https://github.com/GJiananChen/ReCoV

BibTex

@InProceedings{Che_Detecting_MICCAI2024,
        author = { Chen, Jianan and Ramanathan, Vishwesh and Xu, Tony and Martel, Anne L.},
        title = { { Detecting noisy labels with repeated cross-validations } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a noisy label learning method, which adopts cross-validation for noisy sample detection. Noisy label learning is an important topic in the community, but the presentation of the method and the design of the experiments need improvement.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The topic of this paper is interesting, as noisy labels are common in the medical domain, and the authors have evaluated their method on four datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • No comparison with SOTA noisy label methods.
    • In the CIFAR-10N experiments, the features were extracted by supervised and self-supervised encoders, where the supervised encoder could already have overfit to the noise. It is not clear how the aggre and worst noise were selected, and whether this happened before or after feature extraction.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see section 6.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Reject — must be rejected due to major flaws (1)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    No ablation study, no comparison with SOTA, and the experiments are not well designed.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I found the ablation study in the supplementary material, and I think the topic is important. Thank you for the thorough rebuttal; I appreciate the clarifications provided and am satisfied with the responses to the concerns raised. I would like to accept the paper for publication.



Review #2

  • Please describe the contribution of the paper

    This paper aims to detect noisy labels in training datasets, which is a very practical and important problem in developing medical AI models. The authors present a repeated cross-validation (ReCoV) method and a fastReCoV method. The experiments have shown their effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The problem this paper aims to solve is practical and well-defined.

    2. The methods are straightforward and well-described.

    3. The experiments are supportive.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main problem is that the selected datasets cannot fully reflect real medical problems.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Open-source code would be more supportive.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This reviewer suggests the authors study more practical medical problems instead of constructing natural image datasets.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the strength and weakness parts.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Although the experiments are not perfect, the presented method is useful in practice.



Review #3

  • Please describe the contribution of the paper

    This paper presents a novel method for identifying examples with noisy labels within a given dataset, without requiring prior knowledge of the percentage of noisy examples. The authors propose an approach based on repeated N-fold cross-validation, which aims to identify examples that consistently lead to worse performance across different folds. By pinpointing the examples that more frequently contribute to inferior cross-validation results, the method effectively identifies the noisy examples within the dataset.

    Furthermore, the authors introduce a computationally efficient variant of their method, inspired by evolutionary computing optimization techniques. This alternative approach offers a slightly reduced performance compared to the original method but significantly speeds up the identification process of noisy examples. This faster version makes the method more suitable for deep learning scenarios, where cross-validation can be computationally expensive.
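
    For intuition on how such a speedup can work, here is one plausible, entirely hypothetical sketch: instead of tallying worst-fold membership over a very large number of runs, maintain a running per-sample suspicion score that is updated across a small number of cross-validation runs with an exponential moving average, then rank samples by that score. This only illustrates the general flavor of a faster variant; it is not the paper's fastReCoV algorithm, and all names and parameters below are assumptions.

        # Hypothetical sketch of a "fast" variant (not the paper's fastReCoV).
        # A per-sample suspicion score is accumulated across a few CV runs with
        # an exponential moving average. Assumes integer class labels 0..K-1,
        # with every class present in each training fold.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import KFold

        def fast_variant_sketch(X, y, n_runs=20, n_folds=5, alpha=0.3, seed=0):
            rng = np.random.RandomState(seed)
            scores = np.zeros(len(y))  # running suspicion score per sample
            for _ in range(n_runs):
                run_scores = np.zeros(len(y))
                kf = KFold(n_splits=n_folds, shuffle=True,
                           random_state=rng.randint(2**31 - 1))
                for train_idx, val_idx in kf.split(X):
                    model = LogisticRegression(max_iter=1000)
                    model.fit(X[train_idx], y[train_idx])
                    proba = model.predict_proba(X[val_idx])
                    # Suspicion = probability the model assigns to wrong classes.
                    run_scores[val_idx] = 1.0 - proba[np.arange(len(val_idx)), y[val_idx]]
                scores = (1 - alpha) * scores + alpha * run_scores  # EMA update
            return np.argsort(scores)[::-1]  # ranked most suspicious first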

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes a very solid approach for detecting label noise, grounded in robust mathematical foundations and backed up by extensive experimentation. The method’s effectiveness is thoroughly tested across multiple datasets and base models, considering both simulated and real-world label noise scenarios, which strengthens the credibility of the results.

    The proposed method consistently outperforms existing techniques for identifying noisy labels, demonstrating superior performance in the comprehensive experimental results presented in the paper.

    While the initial version of the method may appear computationally expensive due to the repeated cross-validation process, the authors address this concern by introducing a faster variant specifically designed for deep learning settings. This adaptation significantly reduces the computational overhead, making the method more feasible for practical application. Despite the inherent computational complexity, the authors report that their longest experiment only required 2 GPU days, indicating that the computational demands are manageable, considering that label noise identification is typically a one-time process for a given dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper reports several metrics to determine whether a model performance has improved before and after the application of the noise removal procedure, it would have been great to see some results for the performance of the noise removal process itself when possible, and not only through proxy metrics. The authors do artificially add noise to the mushroom dataset, which is a relatively simple dataset, and assess the method’s ability to identify and remove the introduced noisy labels. However, it would have been more comprehensive and insightful to extend this evaluation to real-life datasets as well.

    By applying the same approach of introducing artificial noise to more complex, real-world datasets and assessing the method’s performance, the authors could have provided a more robust demonstration of their method’s effectiveness in practical scenarios. This additional evaluation would have strengthened the paper’s claims and increased confidence in the method’s applicability to real-life datasets, which often exhibit more intricate noise patterns and challenges compared to simpler datasets like mushroom.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors provide extensive pseudocode, making it seemingly easier to reproduce the work in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Consider changing the equations on page 2 from inline to full width for better readability.

    The word “ImageNet” on page 5 is taking up too much space and extending into the border. Please adjust the formatting to ensure it fits within the designated area.

    To emphasize the practicality of the method, I strongly recommend including the time required for each dataset and model. Time complexity is a major concern for readers, so clearly stating the computational time needed for each experimental setup is crucial.

    Figure 1 appears too small, making it difficult to discern the details. Consider increasing its size to improve visibility and readability.

    Please note that “HECTOR” is misspelled on page 7 at the bottom. Make sure to correct this for consistency.

    Regarding the frequent references to the supplementary material in the main paper, I suggest minimizing such occurrences. The supplementary material should be optional and not essential for understanding the core concepts and results. Consider incorporating some of the key results from the supplementary material into the main text or restructuring the paper to reduce reliance on the supplementary material, making it more self-contained.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present a novel and well-grounded approach for detecting label noise, supported by extensive experimentation across multiple datasets and base models. The method demonstrates superior performance compared to existing techniques and offers a computationally efficient variant suitable for deep learning settings. While the paper could have benefited from a more comprehensive evaluation of the noise removal process itself on real-life datasets, the overall contributions and the thorough testing on both simulated and real-world label noise scenarios make this work a valuable addition to the field. The authors have also addressed the practicality and feasibility of their method, making it a promising tool for identifying and handling label noise in various applications.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed most of my concerns.




Author Feedback

We are grateful to the reviewers for their time and constructive comments. We are glad to see that all reviewers agree on the importance of noisy label detection (NLD) for developing medical AI models and the strength of our evaluations.

Clarifications on noise settings for CIFAR10N (R3) We would like to clarify that CIFAR10N is a variant of CIFAR10 with multiple sets of real-world noisy labels obtained from crowdsourced annotations [25]. We selected the “aggre” and “worst” label sets to reflect different noise-level settings. R3 also asked whether the supervised feature extractor used in our CIFAR10N ablation study overfitted to label noise. We would like to clarify that the encoder is pretrained on the independent ImageNet dataset and then frozen, and is therefore agnostic to any labels in CIFAR10N.

Datasets and evaluations (R1, R4) R4 suggested that we introduce artificial noise to other datasets, as in Mushroom, to further strengthen our evaluation. We agree with R4 and will add the following results and discussion, previously omitted due to space constraints. For CIFAR10N, we believe that detecting real-world annotation errors is a better demonstration of fastReCoV’s performance (sensitivity of 93% and specificity of 99% for the ‘aggre’ noise level; sensitivity of 92% and specificity of 98% for the ‘worst’ noise level) as opposed to artificially introduced random noise. As for the real-world medical datasets PANDA and HECKTOR, estimating the detection accuracy for intrinsic noise (inter-observer variability and noisy survival labels) is inherently challenging: without confidence in the correctness of the existing labels, it is unclear how artificially introduced noise would interact with the intrinsic noise. We seek to address this gap in future work by presenting the identified noisy samples to domain specialists for verification and by investigating the best ways to introduce artificial noise, especially in time-to-event datasets. We hope this also addresses R1’s concern regarding the selection of datasets. We included Mushroom and CIFAR10N as sandboxes to lay out the mathematical foundations and quantitatively evaluate NLD performance, and we then included PANDA and HECKTOR to ensure that our method is practical and robust on large-scale real-world medical imaging datasets.
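
For context, symmetric random label noise of the kind discussed above is typically injected with a few lines of code. The following is a generic, hypothetical sketch (uniform flips to a different class), not the exact protocol used for Mushroom or proposed here for future work:

    # Generic sketch of injecting symmetric random label noise (illustrative).
    # Each selected sample's label is replaced by a uniformly drawn *different*
    # class; the corrupted indices are returned as ground truth for evaluating
    # a noisy-label detector's sensitivity and specificity.
    import numpy as np

    def inject_symmetric_noise(y, noise_rate=0.1, n_classes=10, seed=0):
        rng = np.random.RandomState(seed)
        y_noisy = y.copy()
        flip = rng.rand(len(y)) < noise_rate
        offsets = rng.randint(1, n_classes, size=int(flip.sum()))
        y_noisy[flip] = (y[flip] + offsets) % n_classes  # always a different class
        return y_noisy, np.where(flip)[0]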

Comparison with SOTA, and ablation experiments (R3) We would like to clarify that we compared against SOTA for each dataset analyzed. On Mushroom, we showed that our algorithm achieved 100% accuracy and outperformed state-of-the-art general ML NLD algorithms (Table S1). Our method could also potentially rank 2nd on the public NLD leaderboard for CIFAR10N, but we left this information out because the leaderboard algorithms fully fine-tune the embedding model whereas fastReCoV freezes it. In our manuscript, we described ‘Naive’ detection as the SOTA method for the PANDA challenge, and we mentioned that our method is the first explicit NLD algorithm for survival analysis (HECKTOR). We would also like to clarify that we evaluated the effect of different noise levels, different feature extractors, and tau in the experiments section.

Details of runtime (R4) FastReCoV (all required runs) took 3.5 minutes for CIFAR10N, 3.5 hours for HECKTOR, and 40.6 hours for PANDA on a single Nvidia Titan Xp GPU. The algorithms are fold-independent and can be further accelerated by parallelizing across multiple GPUs.

We thank the reviewers for their valuable feedback. In our revised manuscript, we will implement the suggested layout and formatting changes to better highlight key results, include runtime information, and reduce reliance on the supplementary material. As mentioned in our manuscript, we will make our code publicly available (the datasets are already public). We look forward to the community applying our method as a plug-and-play NLD tool on various datasets and tasks. We are also excited about the synergy between our algorithm and recent foundation models.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper sets out to address an important problem in medical image analysis: prediction under noisy labels.

    However, the presentation of the methods is unclear and probably erroneous. For example, at the beginning of Section 2 they define “class-conditional” label noise but actually provide the definition of class-independent label noise. Similarly, for the claim on page 3 (“With a large number of runs, the occurrence … “), the authors do not provide any justification for why this should hold. The method itself is quite simple, based on the assumption that erroneous predictions in cross-validation experiments are likely incorrect labels. This is a highly simplistic assumption, which may very well be wrong in many applications depending on the label noise distribution and the classifier; far more sophisticated methods exist.

    A large section of experiments are devoted to non-medical data, which makes no sense given the limited MICCAI paper length.

    I am surprised and disappointed that two reviewers have registered “weak accept” recommendations. I think the paper should be rejected because (1) the method is highly simplistic and lacks novelty, (2) the presentation is poor, unclear, and possibly erroneous, and (3) the experimental results are not convincing.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers found the topic relevant and important to the MICCAI community, and the authors apparently did a successful job in the rebuttal. Regarding weaknesses, I do think that the basic design and assumptions are over-simplified and may not address the label noise of medical data, and that the natural image experiments can be irrelevant, so the space would be better used for clinical data. Still, in my opinion the topic and method are worth some discussion at MICCAI.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



