Abstract

2D biomedical foundation models (FMs) have demonstrated remarkable capabilities in 2D medical image segmentation across various modalities, with text-prompted approaches offering scalable analysis that facilitates integration with LLMs and clinical applications. Adapting these models for 3D medical image segmentation can leverage their rich visual features while enabling text-prompted volumetric segmentation. However, efficient adaptation poses significant challenges due to the substantial disparity between 2D and 3D medical images and the necessity of establishing text-volume alignment. To address these limitations, we propose \textbf{Bio2Vol}, a novel adaptation framework that enables text-prompted 2D biomedical FMs to effectively handle volumetric data. Specifically, (1) to bridge the dimensional disparity, we propose a Dual-Rate Sampling (DRS) strategy that processes slices within a volume at both sparse and dense intervals, capturing global context and local detail; (2) to enhance volumetric feature representation, a Cross-slice Dual-head Attention (CSDHA) is built upon the intra-slice features by repurposing existing pre-trained attention modules for parameter-efficient inter-slice information fusion; and (3) to establish text-volume understanding, a Semantic Text-Visual Alignment (SAT) loss extends the existing 2D text-visual alignment to the volumetric domain. Using BiomedParse as a demonstration case, extensive evaluation on 11 medical datasets spanning diverse anatomical regions and modalities shows that Bio2Vol significantly improves 3D medical image segmentation, enhancing DSC by 4.72\% on the Amos22 dataset with substantial improvements across MSD tasks. Code will be available at \url{https://github.com/JiaxinZhuang/Bio2Vol}.
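The dual-rate sampling idea described in the abstract (sparse intervals for global context, dense intervals for local detail) can be illustrated with a minimal sketch. This is a hedged toy illustration, not the paper's implementation: the function name, interval parameters, and the use of a centered dense window are all assumptions.

```python
def dual_rate_sample(num_slices, sparse_step=8, dense_step=1,
                     dense_window=4, center=None):
    """Toy illustration of dual-rate slice sampling: sparse indices span the
    whole volume for global context, while dense indices cover a local window
    for fine detail. All parameter names and defaults are illustrative."""
    if center is None:
        center = num_slices // 2
    # sparse pass: every sparse_step-th slice across the volume
    sparse = list(range(0, num_slices, sparse_step))
    # dense pass: contiguous slices around a window center
    lo = max(0, center - dense_window)
    hi = min(num_slices, center + dense_window + 1)
    dense = list(range(lo, hi, dense_step))
    # merge both rates into one ordered, de-duplicated index list
    return sorted(set(sparse + dense))
```

For a 32-slice volume with `sparse_step=8` and a dense window of 2 around slice 16, this yields the sparse indices 0, 8, 16, 24 interleaved with the dense run 14 through 18.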

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1852_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhuJia_Bio2Vol_MICCAI2025,
        author = { Zhuang, Jiaxin and Wu, Linshan and Ni, Xuefeng and Wang, Xi and Wang, Liansheng and Chen, Hao},
        title = { { Bio2Vol: Adapting 2D Biomedical Foundation Models for Volumetric Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {24 -- 34}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work extends BiomedParse to 3D through three key designs: (1) DRS, a dual-rate sampler that selects slices at two intervals; (2) CSDHA, which uses information from other slices by modifying the Transformer, achieving 3D decoding without additional parameters; (3) SAT, which extends text-visual alignment to 3D.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. 3D biomedical foundation models, especially text-prompted ones, are under-explored. This work proposes a simple way to extend a 2D text-driven FM, and experimental results suggest its effectiveness.
    2. The paper is well-written, and the experiments are solid.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The work might be weak in novelty. As stated in the paper, CSDHA is an adaptation of prior work on video rather than an innovation; SAT is a straightforward design with minor novelty; DRS and the overall design could be seen as the main innovation. However:
    2. The ablation studies are not described clearly enough, e.g., how DRS is removed.
    3. The “Generalization Analysis” on MSD is confusing. Under what setting, e.g., zero-shot transfer? And why is only AIM compared in this part?
    4. I have found several possible typos. Table 1: “mUnet”; Table 3: “90.32±0.33”?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is of great significance and provides a new solution to this topic. The performance is satisfactory. However, as stated in the previous sections, several issues need to be clarified.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have clarified the issues about experimental settings in their rebuttal. The experimental results are strong enough.



Review #2

  • Please describe the contribution of the paper

    -This paper introduces Bio2Vol, a novel framework that adapts text-prompted 2D biomedical foundation models (FMs), specifically BiomedParse, for volumetric (3D) medical image segmentation.

    -A Cross-slice Dual-head Attention (CSDHA) mechanism that repurposes pre-trained 2D attention heads for inter-slice modeling without introducing new parameters.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -The paper presents a strategy to adapt a 2D foundation model trained with text prompts to volumetric data without introducing significant overhead. The DRS+CSDHA+SAT combination is a parameter-efficient solution to 3D modeling.

    -CSDHA uses intra-slice attention heads for inter-slice communication, avoiding additional parameters while still enabling effective volumetric modeling.

    -The authors provide a detailed ablation study to quantify the contribution of each component.

    -11 public datasets are used

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    -While the paper compares against a wide range of visual-prompted and SAM-based methods, it lacks direct comparisons to other recent or concurrent text-prompted 3D FM-based segmentation methods for the medical domain.

    -There is no further discussion of whether the model is sensitive to how the prompt is phrased. In addition, there is no ablation or robustness test showing what happens if the prompt is changed, simplified, or made more ambiguous.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper lacks comparisons to dedicated 3D medical foundation models and does not evaluate prompt robustness; both are important for clinical applicability.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a method for adapting SAM to include 3D volumes using slice-wise attention and a text prompt for improved class recognition. Extensive experiments show the effectiveness of the method.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper has extensive experimentation and shows good results on many relevant datasets, indicating the generalizability of the method. 2) The paper proposes the Dual-Rate Sampling strategy, which generates local and global cues for the cross-attention that follows. 3) An auxiliary loss called SAT is proposed that enhances class-wise features.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The method combines multiple existing modules. Thus, the components are not, by themselves, novel as claimed by the paper.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-rounded paper, with enough experiments, comparisons, and ablations, and adequate novelty. While the components themselves are not novel, the authors have developed a network combining these techniques that outperforms the existing state of the art on multiple datasets.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I did not have major questions. The experiments are extensive. While the novelty is slightly limited, I believe that the strengths outweigh the weaknesses.




Author Feedback

We thank all reviewers for their valuable comments. We appreciate that Reviewer 1 (#R1) valued our 2D-to-3D adaptation strategy, Reviewer 2 (#R2) commended our approach’s novelty and thoroughness, and Reviewer 3 (#R3) acknowledged our solution’s significance and performance. We address the concerns below:

#R1: Comparison with text-prompted 3D foundation models (FM): We compared our approach against two text-prompted 3D FMs: M3D [2] and SegVol [7]. M3D’s [2] LLM-based decoder, designed for text rather than pixel-level prediction, struggles with fine-grained segmentation and loses visual cues [1]. While SegVol [7] works best with combined bbox and text inputs, we used text-only prompts for fair comparison. Our approach consistently outperforms both models (due to rebuttal policies, we cannot provide exact experimental results). Our method uniquely leverages existing 2D medical foundation models—pretrained on extensive 2D medical datasets—for effective 3D image analysis. [1] LISA: Reasoning segmentation via large language model. CVPR 2024

#R1: Prompt robustness analysis: We utilized BiomedParse [31] datasets with GPT-4-generated synonymous descriptions (~8.28 prompts per object type) to enhance prompt robustness. BiomedParse [31] shows that well-curated training datasets yield models robust to prompt variations at inference. On Amos22 [13], we ran inference with three prompt formulations (taking the text for Liver as an example) against the baseline (official text prompts): minimal (“Liver”), standard (“Liver in abdominal CT scan”), and detailed (“Segment liver, large organ in right upper abdomen on CT”). Results showed minimal performance variation across all formulations, with a slight but consistent trend of improving performance as prompt detail increased from minimal to detailed descriptions. This confirms Bio2Vol’s stability regardless of prompt phrasing.

#R2, #R3: Novelty of components: Our work introduces a new approach by addressing three critical gaps when transferring 2D foundation models to 3D volumes: a) Dimensional Disparity Gap: Our DRS resolves 2D models’ inability to understand volumetric relationships through strategic cross-slice sampling; b) Feature Representation Gap: Our CSDHA enables cross-slice reasoning by repurposing pre-trained modules without additional parameters; c) Semantic Understanding Gap: Our SAT extends text-visual alignment to the 3D domain. These targeted solutions explain why, as #R2 noted, our approach “outperforms existing state of the art on multiple datasets.”
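The head-repurposing idea behind CSDHA (point b above) can be sketched as follows. This is a hedged toy illustration with assumed shapes and numpy in place of the actual framework; the 50/50 head split, projection reuse, and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_head_attention(x, Wq, Wk, Wv, n_heads=4):
    """Toy cross-slice dual-head attention: x has shape (S, N, D) for S slices
    of N tokens each. The first half of the heads attends within each slice
    (intra-slice); the second half attends across slices at each token
    position (inter-slice). The same pre-trained projections Wq/Wk/Wv are
    reused for both groups, so no new parameters are introduced."""
    S, N, D = x.shape
    dh = D // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # each (S, N, D)
    q = q.reshape(S, N, n_heads, dh)
    k = k.reshape(S, N, n_heads, dh)
    v = v.reshape(S, N, n_heads, dh)
    out = np.empty_like(q)
    half = n_heads // 2
    # intra-slice heads: attention over the N tokens of each slice
    for h in range(half):
        att = softmax(q[:, :, h] @ k[:, :, h].transpose(0, 2, 1) / np.sqrt(dh))
        out[:, :, h] = att @ v[:, :, h]
    # inter-slice heads: attention over the S slices at each token position
    for h in range(half, n_heads):
        qh = q[:, :, h].transpose(1, 0, 2)    # (N, S, dh)
        kh = k[:, :, h].transpose(1, 0, 2)
        vh = v[:, :, h].transpose(1, 0, 2)
        att = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(dh))
        out[:, :, h] = (att @ vh).transpose(1, 0, 2)
    return out.reshape(S, N, D)
```

The key design point this sketch captures is that no new projection matrices are added: the pre-trained Wq/Wk/Wv are simply applied over a different axis for the inter-slice head group.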

#R3: Clarity of ablation studies: Our ablation studies (Table 3) systematically evaluated four configurations: a) Base: Original BiomedParse[31] with uniform slice sampling; b) Base+CSDHA: Added cross-slice attention while maintaining uniform sampling; c) Base+CSDHA+DRS: Implemented dual-rate sampling with cross-slice attention; d) Full model: Complete Bio2Vol with all components. For configurations without DRS, we used uniform sampling at a fixed rate (rd=1).

#R3: Generalization analysis setting: We evaluated our method’s generalizability by fine-tuning on the MSD dataset [1], which encompasses diverse anatomical structures. In Table 2, we specifically compared with AIM [30] as it was the strongest BiomedParse [31] adaptation baseline in Table 1. Our method consistently outperformed AIM [30] across all metrics. While additional comparisons with other Table 1 methods (Ensemble [33], ST-Adapter [18]) showed similar trends, we focus on AIM [30] as the most competitive baseline to highlight Bio2Vol’s generalizability to various anatomical structures.

#R3: Typos: We’ll correct “mUnet” to “nnUNet” in Table 1 and fix the NSD score from “90.32±0.33” to “80.32±0.33” in Table 3. We’ll thoroughly review for all typographical errors in our revision.

We will carefully update our manuscript based on the reviewers’ valuable suggestions! Sincerely, authors




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


