Abstract

Interactive segmentation tools are necessary to achieve the desired segmentation accuracy for complex target structures, such as vessels in medical images. But existing interactive methods, including those pre-trained on large internet-scale datasets, offer limited mechanisms for users to provide prompts that effectively control segmentation outcomes. In particular, one-at-a-time point or text prompts are often insufficient for correcting errors in vascular segmentation masks. To address these limitations, we propose a novel interactive medical image segmentation method tailored for complex vascular structures. Our approach learns to interpret sequences of multimodal prompts that combine text and point inputs. By enabling dual-mode prompting, the method allows users to add semantic meaning to point-based interactions. Furthermore, by learning from aggregated sequences of prompts, the method captures inter-prompt relationships, enhancing its understanding of and response to user input. Quantitative evaluations on six vascular datasets demonstrate that our method outperforms existing approaches. Additionally, it avoids critical failure cases and consistently generates improved segmentation masks across diverse imaging modalities and vascular anatomies.
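To make the idea of a multimodal prompt sequence concrete, the following is a minimal sketch of how paired point and text prompts might be accumulated over an interactive session. It is not the authors' implementation: the class names `Prompt` and `PromptSequence` and their methods are hypothetical, and the text values "extend" and "remove" are borrowed from the examples mentioned in the reviews.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Prompt:
    """One user interaction: a clicked pixel plus a short text label (hypothetical structure)."""
    point: Tuple[int, int]  # (row, col) of the clicked pixel
    text: str               # semantic label for the click, e.g. "extend" or "remove"


@dataclass
class PromptSequence:
    """Ordered history of prompts given during one interactive session."""
    prompts: List[Prompt] = field(default_factory=list)

    def add(self, point: Tuple[int, int], text: str) -> "PromptSequence":
        # Append the new prompt; earlier prompts are kept so that a model
        # could, in principle, reason over the whole sequence.
        self.prompts.append(Prompt(point, text))
        return self

    def history(self) -> List[Tuple[int, str, Tuple[int, int]]]:
        # Return (step index, text, point) triples in the order they were given.
        return [(i, p.text, p.point) for i, p in enumerate(self.prompts)]


session = PromptSequence().add((10, 12), "extend").add((40, 7), "remove")
```

Keeping the full history, rather than only the latest click, is what would allow a downstream module to model relationships between prompts.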

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5230_paper.pdf

SharedIt Link: https://rdcu.be/eHaVO

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04965-0_32

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LimJon_Multimodal_MICCAI2025,
        author = { Lim, Jongsoo AND Lee, Soochahn},
        title = { { Multimodal Prompt Sequence Learning for Interactive Segmentation of Vascular Structures } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},
        pages = {337--347}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes a method that learns sets of prompts for interactive vessel segmentation. The method is based on SAM. Several datasets are used to illustrate the results on different image types (OCTA, fundus, angiography…).
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well written. The method is evaluated using multiple datasets of various modalities. The idea is interesting.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- There is no statistical analysis of the results.
- The only metric used is mIoU (for vascular segmentation, at least clDice should be added).
- In the table, the contribution of the PSM module is questionable (without statistical analysis, it is hard to say whether the PSM module really influences the results).
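The kind of analysis requested in these points can be sketched as follows: per-image IoU, its mean (mIoU), and a paired sign-flip permutation test comparing two methods on the same images. This is an illustrative sketch only. The function names are hypothetical, masks are assumed to be nested lists of 0/1 values, and the permutation test is one reasonable choice of paired test, not something the paper or the reviewer specifies.

```python
import random
from statistics import mean


def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, targets):
    """Mean of per-image IoU values."""
    return mean(iou(p, t) for p, t in zip(preds, targets))


def paired_permutation_pvalue(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-image score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Applied to per-image scores with and without a given module, a small p-value would indicate that the observed difference is unlikely to come from image-to-image variation alone.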
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

For me, there is a problem with reproducibility and with the results section. After reading, I am not sure that the method improves the segmentation, mainly because only mIoU is used, without any statistical analysis. I am also quite surprised by the nnU-Net results.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This paper offers a method for interactive medical image segmentation for vascular structures. The proposed method learns sequences of multimodal text and point prompts enabling the user to add meaning to point prompts. Experimental results on several vascular datasets show improvements over previous methods.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Being interactive helps it maintain robust performance in noisy environments.
- It uses dual modes of prompts to add meaning to point prompts.
- It addresses the accurate segmentation of complex and slender structures such as vessels in medical images.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The method assumes that an initial segmentation mask is available.
- Pipeline algorithms with many manual steps may be more interpretable but can limit model performance in particular situations.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The novelty in this paper may be limited, but it presents new perspectives for some applications that require human interactions.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

This paper proposes a multimodal interactive segmentation method for vascular structures, which learns sequences of both text and point prompts to refine segmentation results iteratively. The model introduces a Prompt Sequence Module (PSM) to aggregate prompt history and is evaluated on six diverse vascular datasets.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The integration of sequential multimodal prompts (text + point) is a thoughtful extension of existing point-based methods, offering better control and interpretability during segmentation refinement.

The design of synthetic training sequences and the text prompt prediction strategy allows the model to simulate user correction behavior effectively.

Extensive experiments across six datasets of various modalities demonstrate consistent performance improvements over existing interactive methods.

The inclusion of an ablation study helps analyze the architectural choices of the prompt sequence module.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

Figure 2 is visually unbalanced—the module blocks appear overly large and inconsistent with the visual style of Figures 1 and 3, affecting readability. It is recommended to refine Figure 2 to maintain a consistent visual style.

The difference between text prompts and learnable class prompts remains unclear in terms of practical utility. The experimental results suggest limited gain from using multimodal prompt sequences compared to simple prompt concatenation (e.g., “Ours w/o PSM” already performs strongly). A more thorough quantitative comparison isolating the added value of textual semantics is needed.

The novelty of the method is moderate, as it builds largely on existing prompt learning and SAM-HQ structures. The contribution lies more in the integration and application than in algorithmic innovation.

There is no discussion on inference latency or efficiency, which may be critical for interactive segmentation systems where responsiveness is essential.
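A latency report of the kind asked for here could be produced with a small timing harness such as the sketch below. It is an assumption-laden illustration: `measure_latency` and its `step_fn` argument are hypothetical names, and `step_fn` stands in for whatever single-interaction inference call a given system exposes.

```python
import time
from statistics import median


def measure_latency(step_fn, inputs, warmup=2):
    """Time one call of step_fn per input and summarize the results in milliseconds."""
    # Warm-up calls are not timed; they absorb one-off costs such as caching.
    for x in inputs[:warmup]:
        step_fn(x)

    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        step_fn(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": median(timings_ms),
        "max_ms": max(timings_ms),
        "n": len(timings_ms),
    }


# Stand-in workload; a real measurement would call the model's per-prompt update.
report = measure_latency(lambda n: sum(range(n)), [1000] * 5)
```

Reporting the median together with the maximum is useful for interactive tools, since occasional slow responses affect usability even when typical latency is low.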
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

The proposed method is a promising direction for enhancing user-guided segmentation. However, to further improve clarity and impact, I suggest:

Redesigning Figure 2 for better visual clarity—current components appear disproportionately large and stylistically inconsistent with Figures 1 and 3.

Providing additional analysis comparing the effect of using natural language text prompts vs. learnable class prompts. It remains unclear whether the use of natural text (e.g., “extend”, “remove”) brings significant advantages over simpler learned class tokens. A comparative experiment or ablation would strengthen the justification for using CLIP-based text encoders.

Briefly discussing the computational or latency impact of sequential prompt modeling, especially under interactive usage constraints.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The main reasons for my weak accept recommendation are: (1) the method is well-motivated and shows consistent improvements across multiple datasets, (2) the integration of multimodal sequential prompts is novel and practical, but (3) the performance gain from text prompts over learnable prompts is not clearly justified, and (4) some figures and explanations need refinement.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

N/A

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A


Multimodal Prompt Sequence Learning for Interactive Segmentation of Vascular Structures

Author(s):