Abstract
In clinical decisions, trusting erroneous information can be as harmful as discarding crucial data.
Without accurate quality assessment of medical image segmentation, both can occur. In current segmentation quality control, any segmentation with a Dice Similarity Coefficient (DSC) above a set threshold is considered “good enough”, while segmentations below the threshold are discarded. However, these global thresholds ignore input-specific factors, increasing the risk of accepting inaccurate segmentations into clinical workflows or discarding valuable information. To address this, we introduce a new paradigm for segmentation quality control: image-specific segmentation quality thresholds based on inter-observer agreement prediction. We illustrate this on a multi-annotator COVID-19 lesion segmentation dataset. To better understand what contributes to segmentation difficulty, we categorize radiomic features into four distinct groups (imaging, texture, border, and geometrical) and identify the factors influencing expert disagreement, finding that lesion texture and geometry were most influential. In a simulated clinical setting, our proposed ensemble regressor, using automated segmentations and uncertainty maps, achieved a mean absolute error (MAE) of 5.6% when predicting the mean annotator DSC, improving precision by a factor of two compared to case-invariant global thresholding. By shifting to image-specific segmentation quality levels, our approach not only reduces the likelihood of accepting erroneous segmentations but also increases the chances of including accurate ones in clinical decision-making.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4042_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{FouJor_Difficulty_MICCAI2025,
  author    = {Fournel, Joris and Bartoli, Axel and Marchi, Baptiste and Maurin, Arnaud and Bigdeli, Siavash Arjomand and Jacquier, Alexis and Feragen, Aasa},
  title     = {{Difficulty Estimation for Image-Specific Medical Image Segmentation Quality Control}},
  booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
  year      = {2025},
  publisher = {Springer Nature Switzerland},
  volume    = {LNCS 15972},
  month     = {September},
  pages     = {117--126}
}
Reviews
Review #1
- Please describe the contribution of the paper
- The authors perform a statistical study of what determines observer variability in COVID-19 lesion segmentation.
- The authors propose a method based on machine learning, radiomics, and uncertainty for estimating observer variability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Difficulty estimation is an important topic that has received limited attention.
- Studying, in a statistical way, the specific factors that influence difficulty in a particular segmentation task is laudable, and the authors have done an excellent job there.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The authors' first contribution, demonstrating variations in segmentation difficulty, is well known and can be observed in most segmentation problems in medical imaging. I think the authors are overstating this contribution.
- I think the introduction is misleading to the reader, as the authors start by discussing quality control of AI segmentations, whereas their work focuses rather on manual segmentation quality control. By developing a model to predict DSCLesion, they are modelling inter-observer variability and only that; there is no guarantee that an AI will struggle in the same cases as humans do (in fact, there is anecdotal evidence of the opposite, for example in lung nodule detection, where AI often misses the larger nodules but not the smaller ones). It would be good to restructure the introduction to guide the reader accordingly.
- Similarly, this point should be included in the discussion, and comparisons to previous work on UQ (particularly when stating that this work goes beyond previous studies) should be made carefully, as UQ tries to model AI uncertainty rather than observer uncertainty.
- Methodological details are scarce regarding the third part (Sec. 3.2, perhaps the most important section) and also in the results, and I struggle to understand the third and fourth columns of Table 2 (to the point where I do not feel confident commenting on their meaning). This could be improved at the expense of the first contribution and the very large figures (e.g., Fig. 1).
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
I had trouble rendering Fig. 3 for some reason in Adobe's PDF viewer and could only see it in a browser (Chrome).
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The topic is interesting but there is some confusion regarding the methodological aspects of the study and how it could then be used in practice
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper addresses the critical need for automated quality control in medical image segmentation. It introduces a paradigm shift from using global, task-invariant Dice Similarity Coefficient (DSC) thresholds to image-specific thresholds based on inter-observer agreement prediction. The main contributions are:
- Demonstration of significant variations in segmentation difficulty within the same task (COVID-19 lesion segmentation), highlighting the limitations of global thresholds.
- Statistical evidence linking segmentation difficulty to specific input properties (imaging, texture, border, and geometry).
- A novel method for dynamically predicting segmentation difficulty, achieving a two-fold increase in precision compared to global thresholding in a simulated clinical setting. This method uses an ensemble regressor that combines radiomics and uncertainty maps derived from AI segmentations.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Addressing a critical gap: The paper tackles a crucial issue in clinical translation of AI segmentation: the need for reliable quality control. Current reliance on global thresholds can lead to acceptance of inaccurate segmentations or rejection of useful ones.
Data-driven approach: The study provides a rigorous analysis of factors influencing segmentation difficulty based on a multi-annotator COVID-19 lesion segmentation dataset. This allows for a more nuanced understanding of the problem.
Improved Performance: The proposed ensemble regressor significantly improves the accuracy of quality control by predicting image-specific thresholds.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Limited comparison to state-of-the-art quality control methods: The paper focuses on contrasting with global thresholding but lacks sufficient comparison to more advanced, existing automatic quality control methods for segmentation. Specifically, it does not consider methods that leverage prediction uncertainty or other image quality metrics directly for quality assessment.
Lack of generalizability analysis: The study is performed on COVID-19 lesion segmentation, which has specific characteristics. It is unclear how well the findings and the proposed method would generalize to other segmentation tasks (e.g., different anatomies, imaging modalities, or lesion types).
Reliance on radiomic features: While radiomic features provide valuable information, they can be sensitive to image acquisition parameters and segmentation quality. The paper could explore the use of more robust feature representations, such as deep learning-based features learned directly from the images.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper addresses an important problem in medical image segmentation and presents a novel approach to quality control. The results are promising, but the limitations in comparison to existing methods and the lack of generalizability analysis need to be addressed before the paper can be fully accepted.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper advances quality control in medical image segmentation by introducing image-specific segmentation quality thresholds based on predicted inter-observer agreement. The key contributions include demonstrating how segmentation difficulty is influenced by variations in input images, and proposing a novel method for predicting the difficulty of segmenting individual images.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper explores a rarely addressed yet critical topic in medical image analysis—quality control in segmentation—through a novel and well-structured approach. The use of visual features is particularly compelling, offering a comprehensive summary of distinct input descriptors grouped by region and purpose. Notably, the inclusion of metrics such as signal-to-noise ratio (SNR) and the positional orthodoxy score—which quantifies how typical a lesion’s position is within the lung based on dataset-wide frequency—is a thoughtful and innovative contribution. These descriptors enable a nuanced assessment of segmentation difficulty and observer disagreement. Overall, the methodology is clearly presented with logical structure and strong writing, making this a well-executed contribution to the field of medical image segmentation quality assurance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The concept presented in the paper is relatively complex and may be difficult for some readers to fully comprehend. Additionally, while the paper introduces image-specific segmentation quality thresholds, it focuses exclusively on a single case study—COVID-19—which limits the generalizability of the proposed method to other medical imaging contexts or diseases. The paper also lacks a comparison to previous work in the field, or at least does not address whether such comparisons were considered.
Moreover, the study does not account for the human component, such as the skill level, fatigue, and adherence to annotation protocols, which undoubtedly influence segmentation quality but are more challenging to quantify. While the paper demonstrates the utility of image-specific quality control for COVID-19 lesions, future research should explore the applicability of this paradigm to other medical imaging tasks, particularly those with varying degrees of complexity and inter-observer variability.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper makes a novel contribution to medical image segmentation quality control through image-specific segmentation thresholds and inter-observer agreement prediction. The authors introduce visual features and descriptors to assess segmentation quality. While the focus on a single case study (COVID-19) and the lack of comparison to previous works are limitations, the paper provides a strong foundation for future research. Its methodology holds significant potential for broader application in medical imaging quality control, making it a valuable addition to the field.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers and AC for their time and feedback.
Here, we just want to provide some clarifications for R1; the final version will be modified accordingly.
== Contribution 1 ==
Segmentation quality is typically measured using the Dice Similarity Coefficient (DSC), which ranges from 0 (poor segmentation) to 1 (perfect segmentation). For a given input, a quality threshold α is defined such that any segmentation with DSC > α is considered good. Traditionally, this threshold is set globally—e.g., any segmentation of the left-ventricle myocardium with DSC above 0.7 is deemed acceptable.
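For concreteness, here is a minimal Python sketch of this global-threshold rule (the toy masks and the 0.7 cutoff are illustrative, not taken from the paper):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

# Toy prediction and reference masks that overlap imperfectly.
pred = np.zeros((64, 64), dtype=bool); pred[20:40, 20:40] = True
ref  = np.zeros((64, 64), dtype=bool); ref[24:44, 24:44] = True

ALPHA_GLOBAL = 0.7  # one case-invariant cutoff for every image
print(dice(pred, ref))                 # 0.64 for these toy masks
print(dice(pred, ref) > ALPHA_GLOBAL)  # False: rejected under the global rule
```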
Our first contribution is to highlight the variation in segmentation difficulty within a single task, illustrating the limitations of using a global, fixed quality threshold. Through statistical analysis, we quantify how many misclassifications (i.e., poor segmentations labeled as “good” and vice versa) can result from this global approach.
To our knowledge, this limitation of global thresholds has not been explicitly described in prior work.
== Quality control for human or AI segmentations? ==
Our framework is specifically designed for quality control of automated (not manual) segmentations.
We address the intermediate question: For this specific image, what would be the expected DSC between two expert segmentations? If the automated segmentation achieves this DSC, it can be considered acceptable.
Thus, while we do model inter-observer variability, it is only as a means to better assess the quality of automated segmentations.
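In code terms, the acceptance rule reads as follows (a minimal Python sketch; `toy_regressor` and the `lesion_volume_ml` feature are hypothetical stand-ins for the paper's ensemble regressor and its radiomic inputs):

```python
from typing import Callable, Mapping

def image_specific_accept(
    auto_dsc: float,
    features: Mapping[str, float],
    predict_expert_dsc: Callable[[Mapping[str, float]], float],
) -> bool:
    """Accept an automated segmentation iff it reaches the DSC one would
    expect between two experts on this particular image."""
    alpha_image = predict_expert_dsc(features)  # predicted inter-observer DSC
    return auto_dsc >= alpha_image

# Hypothetical stand-in for the ensemble regressor: harder images
# (here, small lesions) get a lower expected inter-observer DSC.
def toy_regressor(feats: Mapping[str, float]) -> float:
    return 0.6 if feats["lesion_volume_ml"] < 5 else 0.8

print(image_specific_accept(0.72, {"lesion_volume_ml": 3.0}, toy_regressor))   # True
print(image_specific_accept(0.72, {"lesion_volume_ml": 20.0}, toy_regressor))  # False
```

The same automated DSC of 0.72 is accepted on the hard image and rejected on the easy one, which is exactly the behavior a single global threshold cannot express.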
== Relation with uncertainty quantification approaches ==
The reviewer is correct: some uncertainty quantification (UQ) approaches model epistemic (model-based) uncertainty. However, others address aleatoric uncertainty, which is closely related to inter-observer variability—so much so that some works treat them as equivalent. Therefore, discussing how our work relates to these methods is important.
== Factors of segmentation difficulty ==
We thank the reviewer for asking about the third and fourth columns of Table 2. To identify key factors influencing segmentation difficulty, we extracted four groups of descriptors: imaging, texture, border, and geometrical features. The third and fourth columns report the prediction error of a Ridge regression model trained to estimate segmentation difficulty using only one of these feature groups. A lower error indicates that the corresponding image properties are more influential in determining segmentation quality.
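A minimal sketch of this per-group analysis (the feature matrices and target below are random placeholders; in the paper they would be the extracted radiomic descriptors and the observed mean annotator DSC):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 100
groups = {  # one matrix of descriptors per feature group
    "imaging":     rng.normal(size=(n, 5)),
    "texture":     rng.normal(size=(n, 8)),
    "border":      rng.normal(size=(n, 4)),
    "geometrical": rng.normal(size=(n, 6)),
}
y = rng.uniform(0.4, 0.95, size=n)  # difficulty target: mean annotator DSC per image

for name, X in groups.items():
    mae = -cross_val_score(Ridge(alpha=1.0), X, y,
                           scoring="neg_mean_absolute_error", cv=5).mean()
    print(f"{name:12s} cross-validated MAE = {mae:.3f}")  # lower => more influential group
```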
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
This work proposes a method for quality control of segmentation results, taking into account inter-observer disagreement and segmentation uncertainty. Its major strength is that it addresses the important practical problem of assessing segmentation quality.
Reviewers mention (1) some sections that might confuse a reader, (2) a lack of comparison to state-of-the-art uncertainty-based methods, and (3) limited generalizability due to showing the method on only a single dataset. I think the first weakness can be addressed by revising the final paper according to the reviewer comments. The second argument is valid, but the novelty and uniqueness of the contribution outweigh this drawback. Finally, I think argument 3 is not an issue, since showing the method on a single dataset is sufficient as a proof of concept.
Overall, I think this paper tackles an important problem that is often overlooked and it will generate discussions, thus leading to my decision.