Abstract

The release of nnU-Net marked a paradigm shift in 3D medical image segmentation, demonstrating that a properly configured U-Net architecture could still achieve state-of-the-art results. Despite this, the pursuit of novel architectures, and the respective claims of superior performance over the U-Net baseline, continued. In this study, we demonstrate that many of these recent claims fail to hold up when scrutinized for common validation shortcomings, such as the use of inadequate baselines, insufficient datasets, and neglected computational resources. By meticulously avoiding these pitfalls, we conduct a thorough and comprehensive benchmarking of current segmentation methods including CNN-based, Transformer-based, and Mamba-based approaches. In contrast to current beliefs, we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models, including ResNet and ConvNeXt variants, 2) using the nnU-Net framework, and 3) scaling models to modern hardware resources. These results indicate an ongoing innovation bias towards novel architectures in the field and underscore the need for more stringent validation standards in the quest for scientific progress.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2847_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2847_supp.pdf

Link to the Code Repository

https://github.com/MIC-DKFZ/nnUNet

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Ise_nnUNet_MICCAI2024,
        author = { Isensee, Fabian and Wald, Tassilo and Ulrich, Constantin and Baumgartner, Michael and Roy, Saikat and Maier-Hein, Klaus and Jäger, Paul F.},
        title = { { nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors have reevaluated the performance of recently proposed 3D medical segmentation methods, showing that many of them are not superior to the U-Net. They have explained the validation pitfalls of current practice and proposed recommendations to tackle these issues.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • I really liked the discussion regarding the validation pitfalls.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The message of the paper is not new.
    • Many of the pitfalls are well-known.
    • The provided benchmark will not solve the issues.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • I don’t believe providing a new benchmarking dataset will solve the issues raised by the authors.
    • While many of the proposed recommendations are valid, they are vague and very general. So, it is not clear how useful they are.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The paper is not novel and fails to provide a satisfactory solution for the raised problems.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    The authors generally agree with the main comments.



Review #2

  • Please describe the contribution of the paper

    This paper systematically identified validation pitfalls in the field and provided recommendations for how to avoid them. At the same time, it proposed a strategy for measuring the suitability of datasets for method benchmarking.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) This paper identifies validation pitfalls in 3D medical image segmentation and gives recommendations for avoiding them, covering baseline-related and dataset-related pitfalls. (2) This paper proposes a strategy for measuring the suitability of datasets for method benchmarking based on the ratio of inter-method versus intra-method SD.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) This paper did not propose an original method. (2) The six datasets used in the experiments cover an incomplete range of segmentation target shapes and do not involve targets of other shapes, such as tubular blood vessels. (3) Popular diffusion-based segmentation methods are not among the compared methods. (4) A large number of experimental results are in the supplementary materials, while the paper contains only one table, which is inappropriate.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) Propose a novel CNN-based U-Net method. (2) Use more comprehensive datasets and comparison methods in experiments. (3) Show as many experimental results as possible in the paper rather than in the supplementary material.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    (1) This paper did not propose an original method. (2) The datasets and comparison methods used in the experiments are not comprehensive. (3) There are many experimental results in the supplementary materials, but very few in the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper provides a systematic critique of recent 3D medical image segmentation methods in terms of their validation procedures and presents guidelines for avoiding these pitfalls. The authors further present benchmark evaluations that abide by these guidelines and support the robustness of the nnU-Net framework.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    S1. This paper raises the issue of fair comparisons with baselines, which is particularly important for medical imaging, where large-scale benchmark datasets are lacking. S2. This paper summarizes best-practice measures, such as keeping pretraining, hardware, ensembling, training dataset, train/test splits, etc. constant across methods, to ensure fair comparative evaluations. S3. This paper presents a metric to measure the suitability of a dataset for comparisons between network structures based on intra- and inter-method performance variance. S4. Benchmark evaluation results, following the proposed guidelines, are presented.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    W1. The issues raised seem to be already somewhat known. Perhaps researchers have been pretending to fool each other in exchange for getting submissions accepted; this work will hinder that recipe for publications. (I am not sure if this is a weakness, and have mixed feelings on this point.) W2. The presentation clarity could be improved. For example, the pitfalls could be made clearer by a summary table or by comparing the results reported in the original articles with the results obtained after correcting the pitfall practice. (This may have been difficult due to the page limitation.) W3. A qualitative comparison between the results of different network architectures is not provided. (This may have been difficult due to the page limitation.) W4. The effect of dataset size or pre-training on benchmark performance is overlooked. I suspect the superior performance of CNNs might have something to do with the limited size of all the datasets, and this advantage may shrink with pre-training on a large dataset, for which transformers have been shown to be extremely powerful. The authors should at least qualify their claim that “CNN-based U-Nets yield best performance” with “when trained from scratch.”

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    While the experimental details seem to be sufficient for reproducibility, the authors’ mention of their intent to “release a series of updated standardized baselines for 3D medical segmentation at github.com/*****” cannot yet be verified.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The presentation clarity could be improved. For example, the pitfalls could be made clearer by a summary table or by comparing the results reported in the original articles with the results obtained after correcting the pitfall practice.
    2. Qualitative comparison between the results of different network architectures could enlighten the reader on the characteristics of the different architectures.
    3. The effect of dataset size or pre-training should be investigated. I suspect when initialized from a model pre-trained on a large dataset, the performances of transformer-based models might improve. (This investigation may be out-of-scope of this particular manuscript.)
    4. Minor corrections: consistency of dataset names (KITS vs. KiTS).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Very important but often overlooked (maybe intentionally) issues are clearly raised, supported by a comprehensive benchmark evaluation. This work is highly relevant to recent research trends and will enable researchers to better assess the potential and limitations of developing new network architectures. The only concern is the omission of the effect of pre-training.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Accept — must be accepted due to excellence (6)

  • [Post rebuttal] Please justify your decision

    I agree very much with the authors’ points in their rebuttal, as I have described in my original review, and I also believe that the reviews of my fellow reviewers all the more support the need for accepting this work to MICCAI. I had not anticipated the poor reviews from my fellow reviewers, and I have upgraded my score in case the final decision is made based on average scores.




Author Feedback

We thank all reviewers for their valuable time and feedback.

Our study makes an alarming and novel discovery: most 3D segmentation methods introduced in recent years fail to surpass a simple 6-year-old U-Net baseline. This suggests a deeply flawed state of the research field, where supposed innovation/novelty is valued more than the rigorous validation that would ensure genuine methodological progress.

We believe our contributions align well with the MICCAI guidelines, which explicitly support “new insights into existing methods” and encourage accepting “a paper [with] a good contribution if you think that others in the community would want to know it”. If the alarming discovery described above does not qualify as important “insights into existing methods” that “others in the community would want to know”, then what would?

Response to Reviewer 4

R4 recommends rejecting our work because it “did not propose an original method” and instead suggests that we “Propose a novel CNN-based U-Net method”. There is a certain irony in these comments, as they provide real-time evidence of the novelty bias discussed in our work. It is this exact bias that has led the field of 3D medical segmentation into a state where the latest methods do not surpass a simple 6-year-old baseline. Our work’s message is that overcoming this state requires a cultural shift in the community and a re-definition of the term “novelty”, as excellently argued in Michael Black’s “guide to reviewers” [1]. Furthermore, the MICCAI reviewing guidelines state: “Please remember that a novel algorithm is only one of many ways to contribute”.

Response to Reviewer 1

R1 recommends rejecting the paper, because “the message of the paper is not new” and “many of the pitfalls are well-known.” We kindly ask R1 to reconsider their recommendation based on the following:

While we agree that our study does not invent or discover a novel type of validation pitfall, it is dangerous to assume that the discussed pitfalls are common knowledge and do not need to be explicitly studied. Our study provides empirical evidence that these pitfalls are not universally acknowledged and demonstrates how they hinder the field’s progress. Although the pitfalls may be generally known, method validation is severely neglected in current practice.

R1 also assumes that “the provided benchmark will not solve the issues”. While no study can guarantee a change in common practice, we argue that our work is a much-needed first step towards better validation practices. We are the first to systematically describe current pitfalls and empirically show their severity, highlighting the crucial need for action. Besides the public benchmark, our work further provides two concrete tools facilitating meaningful validation in the future: 1) a novel set of standardized state-of-the-art segmentation baselines, and 2) a strategy for measuring the suitability of datasets for method benchmarking.
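
As an illustrative sketch of this suitability strategy: assuming per-method Dice scores from repeated runs or cross-validation folds, the ratio of inter-method to intra-method standard deviation could be computed as below. The function name, aggregation choices, and example numbers are hypothetical and not taken from the paper; a large ratio would indicate that differences between methods exceed run-to-run noise, i.e. the dataset can meaningfully separate methods.

    # Hypothetical sketch, not the authors' exact formulation.
    import numpy as np

    def benchmark_suitability(scores):
        """scores: dict mapping method name -> list of Dice scores from repeated runs/folds."""
        method_means = np.array([np.mean(v) for v in scores.values()])
        inter_method_sd = np.std(method_means, ddof=1)                           # spread between methods
        intra_method_sd = np.mean([np.std(v, ddof=1) for v in scores.values()])  # average run-to-run noise
        return inter_method_sd / intra_method_sd                                 # >> 1: dataset separates methods

    # Illustrative numbers only.
    scores = {"nnU-Net":       [88.1, 88.4, 87.9, 88.3, 88.0],
              "Transformer A": [85.2, 86.0, 85.5, 85.8, 85.1],
              "Mamba B":       [86.7, 86.3, 86.9, 86.5, 86.6]}
    print(f"Suitability ratio: {benchmark_suitability(scores):.2f}")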

General Concerns

A similar situation occurred when the nnU-Net paper revealed that many presumed architectural advancements at the time did not surpass a simple U-Net baseline. As publicly discussed, the nnU-Net paper was rejected at MICCAI back then due to a perceived “lack of novelty” and later became a milestone for biomedical segmentation. We are genuinely concerned that MICCAI will repeat this history and reject other critical messages regarding a flawed state of research due to a perceived lack of novelty.

Thank you for considering our responses. We hope they clarify the importance of our findings and the need for a cultural shift in the community.

[1] https://medium.com/@black_51980/novelty-in-science-8f1fd1a0a143




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    none

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    none



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper addresses a crucial problem in the field and underscores the importance of thorough evaluation of AI-based decision support tools for medical image analysis. It is especially critical in today’s era when integrating AI-based technology into routine clinical practice is essential. As Reviewer 3 mentioned, this study raised very important but often overlooked issues and supported the claims by a comprehensive benchmark evaluation. Presenting such works at major conferences like MICCAI would be very useful in striking a balance between technical novelty and clinical-grade performance. Therefore, I recommend this work for acceptance and oral presentation at MICCAI 2024.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



