Abstract

The explosive development of large-scale model technology has provided strong support for achieving more intelligent, robust, and precise segmentation techniques. However, owing to the unique challenges posed by medical domain data, the 3D medical image-text alignment model, 3D CLIP, struggles to match the performance of its natural scene counterpart. This limitation hinders the application of CLIP-based text-image reasoning in medical segmentation tasks. Furthermore, CLIP has been shown to rely on high-level semantic alignment between vision and text, lacking effective support for the local visual features that are crucial for dense prediction tasks. Existing reasoning segmentation methods often adopt a redundant design with two visual encoders: one from CLIP and another from a large vision model for downstream dense tasks. This redundancy adversely affects model efficiency and complicates the training process. To address these challenges, we propose a novel framework, R1Seg-3D, built around a single, unified visual encoder. Our approach achieves three-way alignment of dense visual, text reasoning, and mask decoding features within a shared latent space. Compared with previous methods, R1Seg-3D implicitly incorporates more detailed spatial features into the reasoning path, strengthening reasoning with additional visual spatial detail and directly enhancing the mask decoding process. The R1Seg-3D architecture is also more concise and easier to train. Extensive evaluations on 25 diverse datasets demonstrate that R1Seg-3D outperforms state-of-the-art methods in both performance and stability. This work advances intelligent medical imaging and lays a foundation for future research in inference-driven segmentation. Our code and models are available at https://github.com/lihaoqin168/R1Seg-3D.
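
For illustration only, the sketch below (PyTorch-style, with hypothetical module and attribute names such as `seg_token_hidden`) shows how a single shared 3D ViT encoder could feed both the LLM reasoning path and the mask decoder, approximating the three-way alignment described in the abstract; it is not the released R1Seg-3D implementation.

```python
# Hypothetical sketch of a single-encoder reasoning segmentation model.
# Module names (encoder, llm, mask_decoder) and output attributes are
# placeholders, not the released R1Seg-3D API.
import torch.nn as nn

class UnifiedEncoderSegModel(nn.Module):
    def __init__(self, encoder, llm, mask_decoder, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.encoder = encoder            # shared 3D ViT: one encoder for all paths
        self.llm = llm                    # LoRA-tuned LLM for text reasoning
        self.mask_decoder = mask_decoder  # lightweight decoder conditioned on the [SEG] embedding
        self.vis_to_llm = nn.Linear(vis_dim, llm_dim)  # align dense visual tokens to the LLM space
        self.seg_to_vis = nn.Linear(llm_dim, vis_dim)  # project the [SEG] embedding back to visual space

    def forward(self, volume, prompt_ids):
        vis_tokens = self.encoder(volume)                      # (B, N, vis_dim) dense 3D patch tokens
        llm_out = self.llm(prompt_ids,
                           visual_embeds=self.vis_to_llm(vis_tokens))
        seg_embed = llm_out.seg_token_hidden                   # hidden state of the emitted [SEG] token
        mask = self.mask_decoder(vis_tokens,                   # same dense tokens reused for decoding
                                 self.seg_to_vis(seg_embed))
        return mask, llm_out.text
```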

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1982_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/lihaoqin168/R1Seg-3D

Link to the Dataset(s)

https://github.com/BAAI-DCAI/SegVol

BibTex

@InProceedings{HaoQin_R1Seg3D_MICCAI2025,
        author = { Hao, Qin and Yu, Long and Tian, Shengwei and Ye, Xujiong and Zhang, Lei},
        title = { { R1Seg-3D: Rethinking Reasoning Segmentation for Medical 3D CTs } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15967},
        month = {September},
        pages = {416 -- 426}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper addresses a relevant challenge in applying LLM-based reasoning to 3D medical segmentation; however, the proposed method appears to be an incremental architectural refinement.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a reasoning segmentation model for 3D medical image segmentation that uses a single encoder to reduce computational cost. The method is novel and interesting.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The prompt descriptions are unclear. Are the prompts structured texts? Can they be free-form? Additionally, it is not specified how the model behaves when it fails to understand the prompt information: does it produce a fallback output or fail entirely?
    2. The authors claim their method has an advantage over LISA by using a unimodal vision encoder, but this claim is not supported by direct comparative experiments. Moreover, the overall comparison experiments are limited in scope.
    3. The authors are recommended to perform an ablation study comparing the unified vision encoder against multiple encoders.
    4. Can the authors provide a comparison of time efficiency between their method and other baseline approaches?
    5. The authors appear to violate the double-anonymity policy by releasing code that includes identifying information such as names and institutional affiliations.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a method with some novelty, particularly in its use of a unified encoder, which distinguishes it from approaches like LISA. However, the experimental validation is insufficient. The work could be considered for acceptance if the authors provide comparative experiments evaluating the performance of unified versus multiple encoders, as this is a key contribution of the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The paper lacks sufficient experimental validation to convincingly support the proposed method. Concerns regarding the design of prompts and temporal aspects remain inadequately addressed. While the integration into a unified framework is noted, the overall novelty of the approach appears limited.



Review #2

  • Please describe the contribution of the paper

    (1) A single-vision-encoder architecture is proposed to replace the dual-encoder design (such as CLIP + SAM) used in traditional reasoning segmentation methods. (2) The framework achieves tripartite alignment of visual, textual, and mask features in a shared latent space. These features are also used for mask decoding to ensure that the segmentation results conform to the semantic description while preserving the exact anatomy. (3) A progressive training strategy is adopted to optimize the encoder, multimodal alignment, and LLM reasoning modules in stages. In addition, a sliding-window inference strategy and an iterative decoding mechanism are used to further improve the segmentation of small targets (such as lesions), providing a reliable solution for clinical open-vocabulary reasoning segmentation.
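
    As an illustrative aside, the sketch below shows a generic sliding-window inference loop of the kind mentioned in point (3). It assumes a `predict_patch(patch)` callable that returns a mask of the same shape as its input; edge handling is simplified, and this is not the authors' implementation.

```python
# Generic sliding-window inference over a 3D CT volume with overlap averaging.
import numpy as np

def sliding_window_3d(volume, predict_patch, window=(32, 256, 256), stride=(16, 128, 128)):
    D, H, W = volume.shape
    out = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    for z in range(0, max(D - window[0], 0) + 1, stride[0]):
        for y in range(0, max(H - window[1], 0) + 1, stride[1]):
            for x in range(0, max(W - window[2], 0) + 1, stride[2]):
                patch = volume[z:z + window[0], y:y + window[1], x:x + window[2]]
                pred = predict_patch(patch)               # mask patch, same shape as `patch`
                out[z:z + window[0], y:y + window[1], x:x + window[2]] += pred
                count[z:z + window[0], y:y + window[1], x:x + window[2]] += 1
    return out / np.maximum(count, 1)                     # average overlapping predictions
```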

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study proposes a single-encoder, three-modal alignment architecture for the first time, breaking through the design limitations of traditional dual-encoder approaches. Through a shared 3D ViT encoder, global semantic understanding and local feature extraction are realized simultaneously, addressing the key problem of semantic and spatial information being separated in medical imaging. With the introduction of the reasoning module, the total number of false-positive samples across the 25 datasets was significantly reduced from 2146 to 592, and specificity was increased to 94.63%, which is close to the actual needs of clinical diagnosis. Open-vocabulary reasoning segmentation of 3D medical images is realized, and specific targets can be segmented using natural language instructions, breaking the dependence of traditional methods on predefined categories.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    This paper only compares with M3D-LaMed, but does not include other recent advanced medical 3D reasoning segmentation models, which weakens the persuasiveness of the performance advantage.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The R1Seg-3D framework proposed in this paper has carried out extensive experiments on 25 datasets in the field of medical 3D CT inference segmentation and achieved significant improvements, which is the core basis for suggesting weak acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Although some of the rebuttal was not convincing, if the authors are able to make the revisions promised in the rebuttal, it is recommended that the paper be accepted.



Review #3

  • Please describe the contribution of the paper

    This paper introduces R1Seg-3D, a unified framework for reasoning-based open-vocabulary 3D CT segmentation. The proposed architecture leverages a single unified visual encoder, integrates LLMs for reasoning, and uses a multimodal fusion strategy with iterative mask refinement. The authors argue that their design improves both segmentation accuracy and efficiency by avoiding redundant dual-encoder setups seen in prior work (e.g., CLIP + SAM). Extensive evaluations on 25 medical datasets demonstrate consistent improvements over the state-of-the-art (notably M3D-LaMed), particularly in precision and false positive reduction.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    i) Open-vocabulary segmentation with LLMs in 3D medical imaging is an emerging field. The authors target a real challenge: effective multimodal reasoning in 3D medical CTs, where current models like CLIP or dual-encoder designs fall short.

    ii) Unlike M3D-LaMed and other prior work that use two separate visual encoders (e.g., CLIP + SAM), R1Seg-3D simplifies the pipeline with a single ViT-based encoder, boosting training efficiency and architectural simplicity.

    iii) The shared latent space alignment between visual and textual features is well-motivated. The use of a [SEG] token with loop-based refinement (inspired by autoregressive reasoning) adds practical value.

    iv) The proposed method is tested on 25 public datasets with multiple large language models (Phi, LLaMA, Qwen, LLaVA-Med), ablation studies for each module, and comparison against the top-performing baseline (M3D-LaMed) in both performance and specificity.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    i) The proposed architecture mainly combines existing components: pretrained ViTs, LLMs (e.g., Qwen, LLaMA), LoRA tuning, and common multimodal fusion. No new algorithms, loss functions, or training objectives are introduced. The innovation lies in integration, not in method or theory, making it incremental work.

    ii) Both visual and language modules are largely reused from existing work, with minimal custom adaptation. LoRA-based tuning is well-known and does not offer methodological novelty. This weakens the technical contribution of the paper.

    iii) The [SEG] token mechanism is introduced as a key element, but there is no ablation, visualization, or interpretability analysis to justify its effectiveness. Its role in actual segmentation guidance remains unclear and unsupported.

    iv) All experiments are run on open-source datasets. No testing on OOD data or feedback from clinicians. This weakens claims of generalization and real-world applicability.

    v) No visual examples of successful or failed segmentations. No evidence of how reasoning improves predictions (e.g., on ambiguous cases). Without this, the reasoning module’s value remains unproven.

    vi) Results are extensive but limited to pixel-level metrics (F1, precision, etc.). No boundary-aware metrics (e.g., Hausdorff Distance), instance metrics, or analysis of small lesion performance. Focused more on benchmark performance than clinical relevance.

    vii) The use of sliding windows for 3D CT volumes is computationally expensive. The model’s actual inference time, GPU load, and memory footprint are not reported. This challenges its claim of being a lightweight, scalable solution.

    viii) The model requires four separate training stages with different frozen/unfrozen modules. This adds engineering burden and tuning overhead, which reduces reproducibility and accessibility in clinical settings.

    ix) While 25 datasets are used, there is no demographic analysis, imaging protocol variation, or data source diversity evaluation. It’s unclear how well the method would work across heterogeneous or unseen populations.

    x) Improvements over M3D-LaMed could stem from better LLMs or more training data, not architectural superiority. No fairness in comparison (e.g., using same backbone or data volume) is ensured.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes incremental but meaningful architectural innovation by unifying and simplifying the reasoning-based segmentation pipeline in 3D medical CTs. While not groundbreaking in theory, it represents a good engineering contribution, addressing real-world inefficiencies and offering a practical, reproducible, and extensible framework. Given the growing interest in vision-language reasoning in medical imaging, the paper merits inclusion.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely appreciate the reviewers’ time and valuable comments on our manuscript. These constructive suggestions have been immensely helpful to our current work.

1) M3D-LaMed as the Sole Comparable Baseline: Unified vs. Multiple Encoders (Reviewers 1, 2 and 3): While LLM-based reasoning segmentation methods have achieved remarkable success in natural scenes since LISA (2024), the 3D medical imaging domain remains largely unexplored. Through an extensive literature review, we confirm that M3D-LaMed is currently the only published LLM-based medical reasoning segmentation method available for comparison, a fact further supported by the absence of comparable methodologies in the M3D-LaMed publication itself. Traditional 3D medical segmentation approaches (e.g., UNETR) and prompt-based SAM variants are not directly comparable: the former cannot support joint training across 25 datasets, while neither approach possesses reasoning capability for implicit targets.

All compared models shared the same Phi-3 LLM and were trained on identical datasets (25 in total) under equivalent conditions.

M3D-LaMed adopts a multiple-encoder design inspired by LISA. Our comparison with M3D-LaMed in Section 3 therefore evaluates unified versus multiple encoders, highlighting the key contribution of our R1Seg-3D.

For greater clarity, we will revise the original statement—”M3D-LaMed [23] is currently the state-of-the-art reference expression segmentation model in medical 3D imaging”—by adding the specification “based on LISA with a dual-encoder architecture.”

2) Innovative Contributions and Future Work (Reviewer 3): While our architecture builds upon established components like ViTs, SAM, LLMs (Phi/Qwen/LLaMA), and LoRA tuning, we emphasize that our work introduces significant advancements. We establish a novel framework and create an important benchmark for future research in medical reasoning segmentation. This benchmark opens broad research prospects, including: prompt-LLM-ViT joint optimization, prompt-reasoning segmentation consistency, and integrated reasoning-mask decoding co-optimization—each presenting substantial potential for future exploration.

Currently, we are collaborating with hospitals to conduct out-of-distribution (OOD) testing, adaptive optimization, and clinical evaluation on their pulmonary disease datasets. Additionally, we will further investigate the interpretability of medical LVMs (e.g., through [SEG] token analysis) and prompt-LVM reasoning consistency.

3) Dataset Link and Open-Source Repository (Reviewer 2 and 3): While our manuscript omits detailed dataset information, the imaging protocols and acquisition parameters are available through the dataset link provided in Footnote 1 (Non-team resources, ensuring double-anonymity).

Upon acceptance, we will release our complete open-source repository to ensure reproducibility and practical deployment. This will include: dataset preprocessing procedures, QA templates, R1Seg-3D implementation code, baseline models, training scripts, inference demos, visual examples and comprehensive performance metrics (e.g., HD95, model parameters, GPUs, FLOPs).

We would like to clarify: while staged training may appear complex, it represents the standard approach for reasoning segmentation tasks (e.g., [16][17][32]) and enables progressive learning unmatched by end-to-end methods. Batch sizes may need adjustment, but the core architecture remains unchanged. We are actively developing solutions to streamline clinical deployment while maintaining the performance advantages of this approach.

Our structured prompt templates (e.g., “Does the radiograph contain the target, which {}?”) consist of two critical components: implicit target description and existence verification. Successful recognition generates the “[SEG]” token in the output text, while failed recognition produces a negative response (e.g., “Sorry, there is no {}”) along with a zero-filled mask.
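
For concreteness, a minimal sketch of the prompt/response protocol described above is given below; `model.generate` and its return format are assumptions for illustration, not the released interface.

```python
# Hypothetical prompt construction and output handling for reasoning segmentation.
import numpy as np

PROMPT_TEMPLATE = "Does the radiograph contain the target, which {}?"  # template from the rebuttal

def reasoning_segment(model, volume, target_description):
    prompt = PROMPT_TEMPLATE.format(target_description)   # implicit target description + existence check
    text, mask = model.generate(volume, prompt)            # assumed to return (answer text, mask or None)
    if "[SEG]" in text and mask is not None:
        return text, mask                                   # target recognized: decoded mask is returned
    # Failed recognition: negative answer plus a zero-filled mask.
    return text, np.zeros(volume.shape, dtype=np.uint8)
```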




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


