Abstract
Scribble supervision has emerged as a promising approach for reducing annotation costs in 3D medical segmentation by leveraging sparse annotations instead of voxel-wise labels. While existing methods report strong performance, a closer analysis reveals that the majority of research is confined to the cardiac domain, predominantly using the ACDC and MSCMR datasets. This over-specialization may contribute to overfitting, overly optimistic performance claims, and limited generalization across broader segmentation tasks. In this work, we formulate a set of key requirements for practical scribble supervision and introduce ScribbleBench, a comprehensive benchmark spanning seven diverse medical imaging datasets, to systematically evaluate the fulfillment of these requirements. We uncover a general failure of methods to generalize across tasks and find that many widely used novelties degrade performance outside of the cardiac domain, whereas simpler, overlooked approaches achieve superior generalization. Finally, we raise awareness of a strong yet overlooked baseline, nnU-Net coupled with a partial loss, which consistently outperforms specialized methods across a diverse range of tasks.
By identifying fundamental limitations in existing research and establishing a new benchmark-driven evaluation standard, this work aims to steer scribble supervision toward more practical, robust, and generalizable methodologies for medical image segmentation.
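For context, the "partial loss" mentioned above is a cross-entropy restricted to scribble-annotated voxels, so that unannotated voxels contribute neither loss nor gradient. The sketch below is an illustrative approximation rather than the authors' implementation; the function name partial_cross_entropy and the convention of marking unannotated voxels with the label 255 are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

UNLABELED = 255  # assumed label value for voxels without a scribble

def partial_cross_entropy(logits: torch.Tensor, scribbles: torch.Tensor) -> torch.Tensor:
    """Cross-entropy evaluated only on scribble-annotated voxels.

    logits:    (B, C, D, H, W) raw network outputs
    scribbles: (B, D, H, W) integer class labels, UNLABELED where no scribble exists
    """
    # ignore_index drops unannotated voxels from the loss and its gradient,
    # so supervision comes exclusively from the sparse scribbles.
    return F.cross_entropy(logits, scribbles, ignore_index=UNLABELED)

# Toy usage: two classes, a tiny 3D patch, almost entirely unlabeled.
logits = torch.randn(1, 2, 4, 8, 8, requires_grad=True)
scribbles = torch.full((1, 4, 8, 8), UNLABELED, dtype=torch.long)
scribbles[0, 2, 4, :3] = 1  # short foreground scribble
scribbles[0, 0, 1, :3] = 0  # short background scribble
partial_cross_entropy(logits, scribbles).backward()
```

The same effect can be obtained by multiplying a per-voxel loss with a binary scribble mask before averaging; paired with a strong training pipeline such as nnU-Net, this minimal form of supervision is the baseline highlighted in the abstract.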
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4424_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/MIC-DKFZ/ScribbleBench
Link to the Dataset(s)
https://syncandshare.desy.de/index.php/s/DJ4KBZrZScFbTei
BibTex
@InProceedings{GotKar_Revisiting_MICCAI2025,
author = {Gotkowski, Karol and Maier-Hein, Klaus H. and Isensee, Fabian},
title = {{Revisiting 3D Medical Scribble Supervision: Benchmarking Beyond Cardiac Segmentation}},
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This work proposes a benchmark to assess scribble-based methods, following the observation that the field lacks a well-established benchmark for assessing the generalizability of such techniques. As part of the benchmark, the paper introduces a technique based on nnU-Net (a strong baseline) that performs best among the methods considered in the benchmark.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
It is very positive that this work introduces a benchmark to evaluate scribble-based methods. The MICCAI community has been “slow” in considering benchmarks and datasets as relevant and novel contributions to the field. As a result, many works of that nature (covering MICCAI applications) have migrated to other venues (e.g., CVPR, NeurIPS), while their most natural venue should be MICCAI. I am very supportive of this initiative.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The message of the paper is confusing. Is this about a benchmark or a new method?
- Code is lacking, which is a requirement for a “benchmark contribution”. The paper claims that the code will be released if accepted. However, for a paper of this nature, I would expect to see the code (as supplementary material or in an anonymous GitHub repository).
- Previous simpler methods beyond cardiac imaging are not considered. Examples include ITK-SNAP tools or a random-based approach for the placenta [1].
- ScribblePrompt [2] is a recently published work (ECCV 2024) that claims and demonstrates generalizability. Some of the claims here may need to be repositioned with respect to this work.
[1] https://doi.org/10.1016/j.media.2016.04.009
[2] https://arxiv.org/abs/2312.07381
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I am very supportive of this initiative. Nonetheless, I currently lean towards rejection due to: 1) the lack of clarity in the paper (combining a benchmark with a method is a mixture that is not necessarily good); and 2) the lack of code as part of the submission. While MICCAI does not require this, I consider it a must for a benchmark contribution.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I am confident the authors will address my remarks about the main contribution’s lack of clarity. I am very supportive of expanding the submission & publishing of benchmarks at MICCAI.
Review #2
- Please describe the contribution of the paper
This paper critiques the over-specialization of scribble supervision methods for medical image segmentation on cardiac datasets (ACDC, MSCMR). It argues that this focus leads to overfitting, misleading performance claims, and poor generalization. The authors contribute:
A set of requirements (R1-R5) for practical scribble supervision (generalization, benchmarking, avoiding over-specialization, maximizing performance through established practices, open-source implementation).
ScribbleBench, a diverse benchmark of seven 3D medical image segmentation datasets (LiTS, BraTS2020, AMOS2022, KiTS2023, WORD, MSCMR, ACDC).
Identification of three validation pitfalls: 1) overfitting to cardiac datasets, 2) counterproductive novelties, and 3) neglect of simple, generalizing methods.
Highlighting nnU-Net with a partial cross-entropy loss as a strong, overlooked baseline that generalizes well.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Addressing a critical problem: The paper identifies a crucial flaw in the current scribble supervision literature – the lack of generalization due to dataset bias. This is a significant contribution, as it highlights the need for more rigorous evaluation in the field.
Comprehensive Benchmark: ScribbleBench is a valuable resource for the community, providing a diverse set of datasets for evaluating scribble supervision methods.
Clear Requirements: The proposed requirements (R1-R5) provide a useful framework for designing and evaluating scribble supervision methods.
Strong Baseline: Highlighting the performance of nnU-Net with a simple partial loss is a valuable contribution, as it shows that complex methods are not always necessary for good performance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Limited Comparison to Other Benchmarking Studies: While ScribbleBench is presented as a novel contribution, the paper does not adequately compare it to other existing benchmarking initiatives in medical image segmentation, such as the Medical Segmentation Decathlon (https://arxiv.org/abs/1902.09063) or the work on federated learning benchmarks (https://arxiv.org/abs/2406.04845). A discussion of how ScribbleBench complements or differs from these initiatives would strengthen the paper.
Lack of Quantitative Analysis of “Novelties”: The paper claims that many novelties in scribble supervision methods are counterproductive. However, this claim is not always supported by rigorous quantitative analysis. The ablation studies are somewhat limited, and it is not always clear why specific novelties degrade performance.
Limited Exploration of Alternative Baseline Methods: While the paper highlights nnU-Net with a partial loss as a strong baseline, it does not explore other potential baseline methods in sufficient detail. For example, methods based on variational autoencoders (https://arxiv.org/abs/1312.6114) or generative adversarial networks (https://arxiv.org/abs/1406.2661) could also serve as strong baselines and should be considered.
The Benchmark only contains public datasets: The absence of proprietary data is a significant oversight that will likely limit the usefulness of the benchmark for real-world clinical applications.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper addresses a critical issue in scribble supervision and presents a valuable new benchmark. The results are promising, but the limitations in comparison to existing methods and the lack of more thorough analysis of “novelties” and baselines could be improved.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper tackles the topic of segmentation methods relying on “scribbles” rather than “dense” segmentations for supervision. They suggest that the MIC field has inappropriately focused on very specialised datasets (particularly CMR) to draw generalisable conclusions about scribble-based methods, conclusions which may not hold when the same techniques are evaluated on other datasets.
To this end, they propose a set of criteria that scribble-based segmentation methods should adhere to before being accepted as viable alternatives to dense-based methods, particularly in the sense that they are expected to generalise well as methods, onto other datasets.
They also propose a structured approach to “benchmark” such methods across a number of relevant datasets, and use these to demonstrate ‘pitfalls’ in reasoning when developing and evaluating scribble-based methods.
Finally, they propose a ‘baseline’ scribble-based algorithm which they claim generalises well, and obtains good performance in their proposed structured benchmark, and claim that this may even serve as a dense ‘validation’ target to train further segmentation methods on new datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors are addressing a very important problem, meriting discussion in the field, by challenging claims (or at least expectations) of generalisability of a certain class of algorithms (scribble-based medical image segmentation) across medical datasets.
They address this problem by devising a systematic, structured approach for evaluating performance across a number of different datasets, and using this to demonstrate potential pitfalls in validation, and then providing recommendations for evaluating the generalisability of these methods, as well as software for scribble generation as part of this process.
Furthermore, their approach uses freely available datasets, enabling replicability / easy adoption by the community.
Finally, the proposed novel algorithm, which serves as a good baseline as indicated by their benchmark, is a simple improvement over nnU-Net, which could help adoption in scribble-based contexts.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The problem of methods that are domain-specific and don’t generalise well to other domains is not specific to scribbles. E.g., segmentation methods developed for left ventricle segmentation often fail when applied to brain segmentation. This is not really a problem, except when the method explicitly claims to be generalisable. It is not made clear in the paper if the methods investigated were of this nature or expected to be generalisable methods. Similarly, their recommendations do not really make sense as universal recommendations (since there’s no real reason to avoid domain-specific solutions), but seem to be only applicable to methods that claim to be generalisable.
- One problem in the paper, which is not necessarily specific to this paper but endemic to the whole field of validation, is the lack of consideration for “specifications”. E.g., even in classical, dense ground truth labels, the specifications to which these were segmented / labelled may have been different to the specification of the segmentation output desired by the algorithm. Or worse, no specification at all may have been present on either side, or even from one image to the next (e.g. if different radiologists with different labelling ‘personalities’ were involved during the labelling). In this paper, the ‘specifications’ according to which the scribbles were generated are not made explicit. The authors defend the appropriateness of their scribbles by saying they obtain similar performance to another automated scribble generation method. However, no attempt was made to independently ascertain that the scribbles share similar characteristics. If the characteristics of the two automated methods (and also the expert scribbles) were found to differ, the conclusion might instead be that the methods were invariant to these characteristics, not that the scribbles are qualitatively equivalent or ‘good’ as such.
E.g., the lack of generalisation to other datasets might stem from the fact that the generated scribbles aren’t appropriate or ‘regular’ in the other datasets and that the novel features rely on this ‘regularity’, whereas simpler methods don’t. This is not a condemnation of your method, but is simply a consideration worth explicitly mentioning as a limitation / avenue for future work.
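One concrete way to act on this point would be to report simple descriptive statistics of the scribbles themselves (e.g., the fraction of annotated voxels and the number of scribbled voxels per class) and compare them between automated generators and expert scribbles. The sketch below is purely hypothetical and not part of the paper; the helper name, the convention of 255 for unannotated voxels, and the chosen statistics are assumptions for illustration.

```python
import numpy as np

UNLABELED = 255  # assumed label value for voxels without a scribble

def scribble_stats(scribbles: np.ndarray, num_classes: int) -> dict:
    """Basic characteristics of one scribble map."""
    annotated = scribbles != UNLABELED
    stats = {"coverage": float(annotated.mean())}  # fraction of voxels with any scribble
    for c in range(num_classes):
        stats[f"class_{c}_voxels"] = int((scribbles == c).sum())
    return stats

# Usage: compare two scribble sets for the same case (e.g., two generators).
gen_a = np.full((4, 8, 8), UNLABELED)
gen_a[2, 4, :3] = 1
gen_b = np.full((4, 8, 8), UNLABELED)
gen_b[2, 3:6, 4] = 1
print(scribble_stats(gen_a, num_classes=2))
print(scribble_stats(gen_b, num_classes=2))
```

If such statistics differed substantially between generators while segmentation performance stayed similar, that would support the reviewer’s alternative reading: the methods are invariant to those characteristics, rather than the scribbles being equivalent.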
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The placement of the Fig. 1 reference in the text doesn’t make sense where it is placed. It should be placed in the next paragraph, since it is a graphical summary of your contributions.
- Note: Fig. 2 is upside down. Axial scans are typically presented in the supine view, i.e. kidneys appearing at the bottom. I believe this is a case of y-axis inversion rather than rotation, since the shadow on the left is likely the liver, but it might be worth confirming with a radiologist.
- An additional useful ‘requirement’ to consider for scribble supervision to be practical would be the need for a ‘specification’, and ensuring that any scribble-based method makes the same ‘specification’ assumptions as for the generated scribbles. The equivalent point can be made regarding ‘dense’ segmentations, and even clinician-based segmentations; i.e. results are better in the presence of a specification, and methods perform better when the produced segmentation aims for the same ‘specification’ in the output (e.g. what tissues to include and what to exclude, how smooth or wavy the boundaries should be, etc.), and the guarantee that the ground truth also conforms to that specification. A useful reference for this point is Niek H. Prakken, Birgitta K. Velthuis, E.J. Vonken, Willem P. Mali, and M.J. Cramer, “Cardiac MRI: standardized right and left ventricular quantification by briefly coaching inexperienced personnel”, Open Magn Reson J 1 (2008), pp. 104–111, which demonstrated that adherence to arbitrary but clearly defined segmentation protocols by itself improves the reliability of segmentations, both by automated and human agents.
- I would strongly urge you to consider this as an additional criterion / requirement, but it should at least be mentioned in your limitations / discussion.
- Overall, the tone of the article is unnecessarily ‘assertive’, despite the fact that some assertions rely on untested assumptions. It would be more appropriate to rephrase some of these strongly worded assertions / section titles into more ‘statistical’ language, e.g. “may” or “likely to be”, which leaves room for the reader to draw their own conclusions from the findings.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The issue discussed in this paper is an important issue, meriting discussion, and the authors have done a good job of systematically addressing it. Some small issues need fixing, and some limitations made explicit before acceptance, but these should be rather straightforward to address.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors claim to have adequately addressed the minor points identified in the review. Regarding the main point about domain-specific vs. generalisable methods, their rebuttal that the work is mainly meant to address generalisable methods is sensible, though it is not clear whether any changes have been made to the manuscript to clarify this point as a result (since we do not seem to be given access to the revised manuscript, I cannot confirm whether this is the case).
Author Feedback
We thank all reviewers for their constructive feedback, which helped improve our work. We appreciate the recognition of the impact of our work and address their concerns below.
Domain-specific Methods (R1) We fully agree that methods that clearly limit their scope to a specific domain should also be scrutinized only within that domain. However, all baselines we evaluated in this work explicitly or implicitly claim generality, whether through their titles, stated goals, or evaluation setup. Our benchmark reveals that these claims often do not hold up under broader scrutiny, highlighting the need for rigorous and transparent evaluation. Our recommendations are based on the trend towards generalizable, domain-agnostic methods such as nnU-Net, nnDetection, and Auto3DSeg. In the context of scribble supervision, requiring task-specific adaptation would undermine its main appeal: lower annotation cost.
Benchmark contribution & Non-Interactive Method Evaluation (R2) We thank the reviewer for pointing out the need to communicate our message more clearly. Our work is a benchmark paper for fully automated segmentation methods that learn from scribbles and can subsequently be applied to unseen images without further supervision. Methods requiring user interaction at inference, such as ITK-SNAP (requires force constraints for prediction), Slic-Seg, and ScribblePrompt (require user scribbles for prediction), do not fit this task and are therefore not included. We added this clarification to the manuscript.
Benchmark Benefits (R3) We fully support the idea that benchmarks should add a clear benefit to the landscape of benchmarks, with each being tailored to a specific task, such as MSD for generalizable segmentation and FedLLM-Bench for federated learning. Learning fully automated segmentation methods from scribble annotations has so far not been sufficiently covered. The purpose of our benchmark is to fill this gap and provide a standardized way for researchers to benchmark their methods in this field. We revised the manuscript to clearly communicate this benefit.
Novelty Analysis (R3) As detailed in Section 4.2, all evaluated methods build on a U-Net with pCE and add specific novelties to boost performance. Our benchmark provides a quantitative assessment of these additions across diverse tasks. While the novelties yield improvements on cardiac datasets, they fail to generalize and often degrade performance elsewhere. This suggests that the methods overfitted to the cardiac benchmark rather than introducing broadly useful improvements, highlighting the importance of cross-domain evaluation.
Baseline Method Exploration (R3) We appreciate the reviewer’s suggestion to include other techniques such as VAE or GAN-based models. However, we focused on recent methods that claim strong performance and general applicability as baselines. These include nnU-Net, which remains the dominant segmentation algorithm as evidenced by recent MICCAI benchmarks and [1], as well as six additional methods representing the current state-of-the-art in scribble supervision. [1] Isensee et al., “nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation”.
Other Comments (R1, R2, R3) (R1) We agree that the specifications for generating our automated scribbles are highly relevant and will publish the specifications as documentation with our code repository. Further, we fixed the Figure 1 reference and the Figure 2 issue and softened overly assertive phrasing. (R2) We sincerely apologize for not publishing the code during the submission phase. We intended to publish a well-documented and polished version of our code after acceptance, which requires additional effort. (R3) Public datasets have become increasingly large and diverse over recent years and are frequently used for new benchmarks with adapted task formulations. Leveraging publicly available datasets that have been rigorously vetted by the community ensures a high-quality standard for the benchmark.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
Though the reviewers pointed out the merit of setting up a benchmark for scribble-based segmentation, they have several concerns about this work. The authors are invited to clarify them in the rebuttal: 1) the problem of “domain-specific” methods is not specific to scribble-based learning; 2) lack of consideration for “specifications”; 3) several writing issues, such as errors in the figures and unnecessarily ‘assertive’ descriptions; 4) lack of clarity on whether the contribution is a benchmark or a method; 5) insufficient discussion of previous benchmarks and alternative methods; 6) limited clinical application due to the use of public datasets; 7) lack of quantitative analysis of “novelties”.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Although the initial reviews were mixed, the rebuttal provided by the authors convinced Reviewer 1, and all reviewers now agree on acceptance of this work. I thus recommend Accept, and strongly encourage the authors to consider in the final version several of the points raised during the review process, which may indeed improve the manuscript.