Abstract

Foundation models for volumetric medical image segmentation have emerged as powerful tools in clinical workflows, enabling radiologists to delineate regions of interest through intuitive clicks. While these models demonstrate promising capabilities in segmenting previously unseen anatomical structures, their performance is strongly influenced by prompt quality. In clinical settings, radiologists often provide suboptimal prompts, which affects segmentation reliability and accuracy. To address this limitation, we present SafeClick, an error-tolerant interactive segmentation approach for medical volumes based on hierarchical expert consensus. SafeClick operates as a plug-and-play module compatible with foundation models including SAM 2 and MedSAM 2. The framework consists of two key components: a collaborative expert layer (CEL) that generates diverse feature representations through specialized transformer modules, and a consensus reasoning layer (CRL) that performs cross-referencing and adaptive integration of these features. This architecture transforms the segmentation process from a prompt-dependent operation to a robust framework capable of producing accurate results despite imperfect user inputs. Extensive experiments across 15 public datasets demonstrate that our plug-and-play approach consistently improves the performance of base foundation models, with particularly significant gains when working with imperfect prompts. The source code is available at https://anonymous.4open.science/r/SafeClick/.
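
The repository linked above presumably holds the actual implementation; as a rough, hypothetical sketch of the plug-and-play wiring this abstract describes (all class and attribute names are illustrative, not the released code):

```python
import torch.nn as nn

class SafeClickSketch(nn.Module):
    """Hypothetical wiring of the plug-and-play consensus module.

    `image_encoder`, `prompt_encoder`, and `mask_decoder` stand in for a
    SAM 2- or MedSAM 2-style backbone; `cel` and `crl` are placeholders
    for the Collaborative Expert Layer and Consensus Reasoning Layer.
    """

    def __init__(self, image_encoder, prompt_encoder, cel, crl, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder    # foundation-model encoder
        self.prompt_encoder = prompt_encoder  # encodes (possibly imperfect) clicks/boxes
        self.cel = cel                        # generates diverse expert features
        self.crl = crl                        # cross-references and fuses them
        self.mask_decoder = mask_decoder

    def forward(self, volume, prompts):
        image_feats = self.image_encoder(volume)
        prompt_feats = self.prompt_encoder(prompts)
        expert_feats = self.cel(image_feats, prompt_feats)  # multi-expert views
        fused = self.crl(expert_feats)                      # consensus features
        return self.mask_decoder(fused)
```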

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4193_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{GaoYif_SafeClick_MICCAI2025,
        author = { Gao, Yifan and Sheng, Jiaxi and Wu, Wenbin and Li, Haoyue and Dong, Yaoxian and Ge, Chaoyang and Yuan, Feng and Gao, Xin},
        title = { { SafeClick: Error-Tolerant Interactive Segmentation of Any Medical Volumes via Hierarchical Expert Consensus } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes SafeClick, a novel plug-and-play module designed to enhance the robustness of SAM2 and MedSAM2 for medical image segmentation in the presence of imperfect user prompts. The method introduces two main components: (1) a Collaborative Expert Layer (CEL) that processes image and prompt features through specialized transformer modules, and (2) a Consensus Reasoning Layer (CRL) that integrates these features via a fusion strategy. SafeClick is evaluated across 15 public datasets and shows performance improvements over foundation model baselines, especially in settings with suboptimal prompt inputs.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Addresses a practical challenge - The paper targets the realistic scenario where user-provided prompts (clicks, boxes, etc.) may be imperfect - a common issue in real-world clinical settings.

    Plug-and-play compatibility - The proposed method is compatible with widely used segmentation foundation models like SAM2 and MedSAM2, which increases its potential for broad adoption.

    Comprehensive experimental scope - The evaluation spans a large number of public datasets and includes comparisons with both ideal and imperfect prompts, demonstrating the practical utility of the approach.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Unclear problem framing - The paper frequently refers to “imperfect” and “ideal” prompts, but does not provide clear definitions or examples of what constitutes each. Quantifying how often imperfect prompts occur in practice would strengthen the motivation.

    Vague terminology and architectural descriptions - The reasoning behind the naming "Collaborative Expert Layer" is not well justified; it appears to be a multiscale feature fusion mechanism rather than a true expert module. Moreover, even a prompt provided by an expert is expected to be "imperfect".

    Lack of clarity in method details - The paper mentions adopting the “same encoder” as “recent foundation models” but does not specify which encoder or cite a concrete model. This affects reproducibility. The hierarchical nature of the CRL module is also not well illustrated or elaborated upon. There is no loss function defined.

    Dataset and metric issues - While the use of 15 public datasets is commendable, the paper fails to list them in the dataset description or provide references. There is no information on whether official training/testing partitions of these datasets were used. Additionally, the definitions of "perfect point" and "perfect bounding box" are vague. The evaluation is limited to Dice and lacks other common segmentation metrics such as IoU or Hausdorff Distance (HD).

    Incomplete ablation and analysis - The ablation study omits key aspects, such as the effect of the Key-Value input from the final feature layer (E3), whose contribution is not analysed.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the concept of SafeClick is relevant and potentially impactful, especially in making interactive segmentation more robust to imperfect prompts, the paper falls short in several key areas. The problem definition is vague, critical architectural components are insufficiently explained, and reproducibility is hindered by a lack of detail around datasets, model components, and evaluation metrics. The evaluation design and ablation study are also incomplete, leaving important questions unanswered about how specific design choices contribute to the performance gains. With a clearer exposition, better-defined problem framing, and more rigorous experimental detail, this work could make a stronger contribution in future.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thank you for the rebuttal. While the proposed SafeClick framework targets a relevant challenge—robust segmentation under imperfect prompts—several concerns remain unresolved. The framing of the problem and definitions of “ideal” vs. “imperfect” prompts remain underdeveloped without concrete examples or frequency analysis. Architectural descriptions are still vague, and terms like “expert” and “collaborative” remain conceptually unclear. The ablation study is limited in scope and granularity—particularly in isolating the contributions of key components such as E3. While the authors argue that removing E3 disables prompt processing, standard ablation strategies (e.g., substituting or degrading the module rather than removing it entirely) would still allow for meaningful analysis. Therefore, I maintain a score of reject.



Review #2

  • Please describe the contribution of the paper

    This paper proposes SafeClick, an error-tolerant interactive segmentation framework designed to address the sensitivity of existing foundation models (e.g., SAM 2/MedSAM 2) to prompt quality. The framework features a hierarchical expert consensus mechanism comprising: (1) A Collaborative Expert Layer (CEL) with three specialized modules (E1, E2 and E3); (2) A Consensus Reasoning Layer (CRL) to dynamically fuse expert-derived features.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The motivation is clearly presented and significant, as inaccurate prompts lead to severe performance degradation in medical image segmentation.
    2. The proposed method is technically sound, introducing a novel error-tolerant interactive segmentation approach that transforms traditional single-path prompt-driven segmentation into multi-expert collaboration.
    3. Comprehensive experimental validation demonstrates substantial performance improvements across all 15 benchmark datasets compared to baseline methods.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The comparative evaluation is somewhat limited. The proposed method is compared against only two baseline models and does not include recent advanced SAM variants. Additionally, it omits comparisons with other methods that also address the challenge of inaccurate prompts, such as [1][2]. Including these comparisons is crucial to demonstrate the advantages of the proposed method over existing solutions.

    [1] DeSAM: Decoupled Segment Anything Model for Generalizable Medical Image Segmentation
    [2] Customizing Segmentation Foundation Model via Prompt Learning for Instance Segmentation

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The specific meaning of the symbols in Equation 6 is not explained.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. This work introduces an innovative error-tolerant prompting strategy.
    2. The experimental validation remains incomplete due to insufficient comparison with state-of-the-art approaches.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This work proposes a new plug-and-play error-tolerant prompting strategy for interactive volume segmentation. The authors addressed the reviewers' concerns well. Thus, I recommend acceptance.



Review #3

  • Please describe the contribution of the paper

    The authors introduce a plug-and-play consensus module with expert attention and reasoning layers that enhances the robustness of segmentation foundation models to suboptimal prompts.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is clear and well written. The state of the art covers the main relevant work, and the problem statement is evident.
    • The method builds on solid foundations, and the work has strong potential for clinical adoption.
    • Extensive validation across various body regions on 15+ public datasets, consistently outperforming base foundation models.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The paper provides comprehensive qualitative and quantitative results using box prompts. However, while point prompts (marked with a red star) are mentioned, they are not shown in the figures.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clinically relevant, and the framework introduces novel aspects. Its architecture-agnostic design enables easy adoption.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The work has significant clinical importance, and the framework introduces novel elements. Its architecture-agnostic design enables easy adoption; the authors have carefully addressed the concerns and corrected the identified errors in the manuscript.




Author Feedback

We sincerely thank the reviewers for their positive and constructive comments. We are greatly encouraged that they found our work to address a “practical challenge” (R1), be of “significant clinical importance” (R2, R3), and involve “comprehensive experimental validation” (R1, R2, R3). We have carefully considered all comments and will revise the manuscript accordingly. Here, we address the main points in their reviews.

  1. Problem Framing and Datasets (R1): Ideal point prompts are placed at the target centroid, and ideal box prompts use the minimal enclosing box (see the first sketch after this list). Quantifying how often imperfect prompts occur in clinical practice is difficult, and little quantitative analysis exists. Our interactive annotation experiments with clinicians investigated the impact of imperfect prompts and confirmed their prevalence. SafeClick handles these suboptimal prompts, boosting efficiency and cutting annotation time; the full user-study analysis is deferred to the journal version due to page limits. Refs [9, 10] show that “even minor deviations in prompt placement can lead to significant performance degradation,” indirectly confirming the impact of imperfect prompts. Table 1 lists all 15 public datasets ([12]-[23]). Because some lack uniform official splits, and for cross-dataset consistency, every dataset was randomly split at the patient level (7:1:2 train/val/test).
  2. Terminology and Architectural Descriptions (R1): The CEL uses three Transformer modules (E1, E2, E3) to analyze different aspects of the input features: E1 applies cross-attention between intermediate and final image features; E2 applies self-attention on the final features for prompt-independent analysis; E3 integrates prompt and final image features (see the second sketch after this list). This design helps the model focus on image features, reduces over-sensitivity to prompts, and handles imperfect prompts. The methodology section will further detail the “expert” roles and their “collaboration.” On the point that expert-provided prompts are also imperfect: SafeClick is designed precisely for imperfect prompts, and the CEL is the main component that mitigates their negative impact via multi-faceted analysis.
  3. Method Details and Evaluation Metrics (R1): The encoder is the MAE-pretrained Hiera, the same as in SAM 2/MedSAM 2. The CRL computes cross-attention and self-attention via a contrastive mechanism and dynamically fuses the CEL features with a weight α. As a plug-and-play module, SafeClick is trained with a compound Dice and cross-entropy loss (see the third sketch after this list). We also calculated IoU and HD; since they trend with Dice and page limits are tight, we focus on Dice here and will add these results in the journal version.
  4. Ablation Study (R1): E3, which follows SAM 2’s original mask-decoder transformer, fuses prompt and image features. Removing E3 entirely would disable prompt processing and change how prompts interact with the model. E3’s role will be discussed further in the ablation study.
  5. Comparative Experiments (R2): As a plug-and-play module, SafeClick aims to enhance existing foundation models under imperfect prompts, so our comparisons focus on the improvements SafeClick brings to those models (zero-shot vs. fine-tuned vs. SafeClick). The suggested methods target different tasks rather than our specific objective of robustness to imperfect manual interactive inputs. Furthermore, we focus on 3D segmentation, whereas most SAM-based variants such as DeSAM and PLM+PMM primarily target 2D. Nevertheless, comparing SafeClick with other architectures that address inaccurate prompts is a valid point for broader context, and we will discuss it in the updated version.
  6. Formula Explanation (R2): Φ1 and Φ2 are the outputs of E1 and E2 after a channel-first reshape (Eq. 5). Max denotes the matrix maximum, and 𝟙(H) is an all-ones matrix that broadcasts the maximum so it can be subtracted elementwise (one plausible implementation appears in the third sketch after this list).
  7. Visualization Results (R3): Thanks for the reminder. We initially planned to show both point- and box-prompt results, but page limits allowed only box prompts in the main paper. We will update the captions (removing the point-prompt descriptions) and add point-prompt results to the appendix.
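
First sketch (rebuttal point 1). A minimal NumPy illustration of the stated prompt protocol: the ideal point is the target centroid and the ideal box is the minimal enclosing box. The uniform jitter used to simulate imperfect prompts is an illustrative assumption, not necessarily the paper's perturbation scheme.

```python
import numpy as np

def ideal_prompts(mask):
    """Ideal point = target centroid; ideal box = minimal enclosing box.

    `mask` is a 2D binary ground-truth array (one slice of a volume).
    """
    ys, xs = np.nonzero(mask)
    point = (ys.mean(), xs.mean())                  # centroid click
    box = (ys.min(), xs.min(), ys.max(), xs.max())  # tight bounding box
    return point, box

def imperfect_prompts(mask, shift=10, rng=None):
    """Simulate suboptimal prompts by jittering the ideal ones.

    The uniform jitter of up to `shift` pixels is an assumption made
    for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    (cy, cx), (y0, x0, y1, x1) = ideal_prompts(mask)
    jitter = lambda: int(rng.integers(-shift, shift + 1))
    point = (cy + jitter(), cx + jitter())
    box = (y0 + jitter(), x0 + jitter(), y1 + jitter(), x1 + jitter())
    return point, box
```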
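
Second sketch (rebuttal point 2). A hypothetical PyTorch rendering of the three CEL experts as described; layer choices, dimensions, and names are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class CELSketch(nn.Module):
    """Collaborative Expert Layer as described in the rebuttal (illustrative)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.e1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.e2 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.e3 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mid_feats, final_feats, prompt_feats):
        # E1: cross-attention between intermediate and final image features.
        phi1, _ = self.e1(query=mid_feats, key=final_feats, value=final_feats)
        # E2: self-attention on final image features (prompt-independent path).
        phi2 = self.e2(final_feats)
        # E3: integrates prompt tokens with final image features, in the
        # spirit of SAM 2's mask-decoder transformer.
        phi3, _ = self.e3(query=prompt_feats, key=final_feats, value=final_feats)
        return phi1, phi2, phi3
```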
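
Third sketch (rebuttal points 3 and 6). One plausible reading of the contrastive mechanism and the compound loss; the similarity-matrix form of Eq. 6 and the equal Dice/CE weighting are assumptions inferred from the rebuttal's wording, not the verified formulation.

```python
import torch
import torch.nn.functional as F

def contrast(phi1, phi2):
    """Cross-reference the E1/E2 outputs and subtract the similarity matrix
    from its broadcast maximum, per the description of Eq. 6 (assumed form).
    """
    sim = phi1 @ phi2.transpose(-2, -1)  # cross-reference Phi_1 and Phi_2
    ones = torch.ones_like(sim)          # the all-ones matrix 1(H)
    return sim.max() * ones - sim        # Max(.) broadcast, then subtracted

def compound_loss(logits, target, w_dice=0.5):
    """Dice + cross-entropy compound loss (rebuttal point 3); the 50/50
    weighting `w_dice` is an illustrative assumption.
    """
    ce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + 1e-6) / (prob.sum() + target.sum() + 1e-6)
    return w_dice * dice + (1 - w_dice) * ce
```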

We would like to express our gratitude to the reviewers for their valuable feedback and suggestions!




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


