Abstract

Autonomous bronchoscopic navigation is vital for pulmonary disease diagnosis and treatment, but it remains challenged by subtle anatomical variations and open-set bronchial variants. Current vision-language foundation models enable open-set recognition but struggle to capture fine-grained spatial features and to disentangle class-specific attributes. We propose a structure-aware cross-modal prompt tuning framework that combines the contrastive language-image pre-training (CLIP) model and the efficient segment anything model (EfficientSAM) to address these limitations. Specifically, EfficientSAM extracts structure-aware features that, through cross-modal attention with learnable textual prompts, enrich the visual embeddings in CLIP, while a base-unknown decoupled head disentangles shared anatomical knowledge from class-specific features in the latent space, enhancing separability for both base and open-set classes. Moreover, a unified optimization aligns the multi-modal distributions using an image-text matching loss and a base-unknown decoupled loss. We evaluate our method on clinical bronchoscopic data; the experimental results show that it outperforms state-of-the-art approaches in both recognition and open-set identification (88.94% and 87.00%, respectively).
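As a minimal sketch of the fusion the abstract describes (no code is released, so everything here is an assumption): learnable textual prompt tokens query structure-aware EfficientSAM features via cross-modal attention, and the pooled result enriches the CLIP visual embedding. All module names, dimensions, and the residual design are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalPromptFusion(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, n_prompts: int = 16):
        super().__init__()
        # Learnable textual prompt tokens act as attention queries.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, sam_feats: torch.Tensor, clip_visual: torch.Tensor):
        # sam_feats:   (B, N, dim) patch features from a frozen EfficientSAM encoder
        # clip_visual: (B, dim)    global CLIP image embedding
        q = self.prompts.unsqueeze(0).expand(sam_feats.size(0), -1, -1)
        # Prompts attend to the structure-aware SAM features.
        fused, _ = self.attn(q, sam_feats, sam_feats)
        # Pool the fused prompt tokens and enrich the CLIP embedding residually.
        return clip_visual + self.proj(fused.mean(dim=1))
```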

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3311_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{FanHao_StructureAware_MICCAI2025,
        author = { Fang, Hao and Zeng, Zhuo and Yang, Jianwei and Fan, Wenkang and Luo, Xiongbiao},
        title = { { Structure-Aware Cross-Modal Prompt Tuning for Autonomous Bronchoscopic Navigation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {567--577}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a structure-aware cross-modal prompt tuning framework combining CLIP and EfficientSAM to enhance fine-grained bronchial bifurcation recognition, achieving state-of-the-art performance in both base and open-set classification tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors combine the strengths of CLIP for high-level semantic understanding and EfficientSAM for fine-grained regional feature extraction, addressing limitations in representing intricate bronchial bifurcation features.
    2. The proposed method achieves SOTA performance on a bronchial bifurcation dataset. The authors also include comparisons with multiple baseline methods, ablation studies, and t-SNE visualizations, demonstrating the effectiveness of each proposed component.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The proposed framework is not directly relevant to the autonomous navigation task; as formulated and evaluated, it is an open-set classification task.
    2. Evaluation on a single dataset cannot fully demonstrate the effectiveness of the proposed method.
    3. The major component of BUKD is from [7], and cross-attention is already widely used in vision-language embedding tasks. I would say it is interesting to combine point prompts with SAM for additional prompt information, but a similar setup has been proposed in a MICCAI 2024 paper, “Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Pattern”. This paper therefore does not offer a strong technical contribution to open-set classification methodology. From the application perspective, the paper claims to propose an autonomous navigation solution, but the task itself is only open-set bronchoscopic classification. I would say the paper does not make sufficient contributions either to MICCAI methodology or to medical applications.
    4. The dataset details are missing.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to point 7 (the major weaknesses listed above).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper develops a novel prompt tuning approach for open-set automatic recognition of bronchial bifurcations. It uses the CLIP and frozen EfficientSAM models to encode textual and visual features, fuses those features with a cross-attention mechanism, and designs a specific loss function based on feature compression to address the long-tailed recognition problem.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Foundation models are well utilized for bronchial bifurcation recognition, combined with automatic point prompt generation, and the experimental results look promising.
    • The proposed feature compression design is interesting.
    • The paper is well structured and easy to read and follow.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Unclear experimental setting. In the ablation study, the difference between the w/ and w/o EfficientSAM settings is not clearly stated. Which backbone is used to replace EfficientSAM?

    • Inconsistent definition and missing discussion of the parameter α. In Sec. 3, the authors state that L_CE = L_ITM when α = 1, but Eq. 11 implies the opposite (a hedged sketch of the weighted objective follows this list). More importantly, the curve in Fig. 5 looks strange: there are multiple peaks, the accuracy drops to its lowest at α = 0.5, and it almost reaches its highest at α = 0.9. Does that mean either L_ITM or L_BUKD is not actually necessary? Could the authors provide further illustration, for example by including more indicators?

    • Lack of dataset details. The authors only provide an overall class distribution of the whole dataset, but the details of the test set are not provided. Also, if the test set is highly imbalanced, it would be better to display per-class recognition results.
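    To make the α discussion concrete, here is a minimal sketch of the weighted objective as Sec. 3 reads, i.e. L = α·L_ITM + (1 − α)·L_BUKD, so that α = 1 recovers pure image-text matching. Since the review notes that Eq. 11 states the opposite weighting, this is only one possible reading; everything below is an assumption, not the authors' code.

```python
import torch

def total_loss(l_itm: torch.Tensor, l_bukd: torch.Tensor,
               alpha: float = 0.7) -> torch.Tensor:
    """Weighted objective as assumed from Sec. 3: alpha = 1 keeps only the
    image-text matching loss, alpha = 0 keeps only the base-unknown
    decoupled loss. The rebuttal reports alpha = 0.7 works best."""
    return alpha * l_itm + (1.0 - alpha) * l_bukd
```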

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Can the standard deviations be provided in Table 1?

    • What is the impact of the number of base classes?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although there are unclear parts in the experiments, the idea of applying foundation models to the bronchial bifurcation recognition problem is an interesting contribution.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper
    1. The paper presents a creative integration of CLIP and EfficientSAM via a cross-modal prompt tuning framework tailored for bronchoscopic navigation.
    2. The paper explicitly tackles open-set recognition in bronchoscopy, which is clinically relevant but rarely addressed in prior work.
    3. The base-unknown knowledge decoupling head (BUKD) is well-motivated and seems effective in enhancing open-set generalization.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper introduces a new prompt tuning framework that integrates EfficientSAM-based region-aware features with CLIP’s textual prompts via a cross-attention (T2RAF) mechanism, enriching the visual branch of CLIP with fine-grained anatomical information. This paper uniquely uses SAM’s fine-grained anatomical cues (via point prompts and attention maps) to guide cross-modal fusion—effectively combining structure and semantics at the feature level. This framework supports better alignment for open-set anatomical structures—critical for safe autonomous navigation.
    2. EfficientSAM is used in an innovative way to extract localized structural cues (airway contours, orifice boundaries) from bronchoscopic images, using point prompts automatically generated through brightness/contrast adjustment and HSV-based segmentation (a hypothetical sketch follows this list). Instead of using segmentation masks for downstream supervision (as is typical), the paper uses SAM-derived attention as a dynamic visual prompt source. This allows low-level morphology to influence high-level semantic matching in CLIP, a non-trivial cross-level and cross-modal interaction.
    3. Prior work on foundation models and prompt tuning in bronchoscopic navigation focuses on closed-set bifurcation classification or segmentation; this paper extends the setting to open-set recognition.
    4. This paper introduces class-dependent modulation to widen decision boundaries for open-set robustness.
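    As a hypothetical illustration of the automatic point prompt generation described in point 2 above (not the authors' code), the sketch below adjusts brightness/contrast, segments dark low-value regions in HSV space as candidate airway orifices, and returns region centroids as point prompts for EfficientSAM. All thresholds and parameter values are assumptions.

```python
import cv2
import numpy as np

def generate_point_prompts(bgr_image: np.ndarray, max_points: int = 3):
    # Brightness/contrast adjustment (alpha: contrast gain, beta: brightness offset).
    adjusted = cv2.convertScaleAbs(bgr_image, alpha=1.5, beta=20)
    hsv = cv2.cvtColor(adjusted, cv2.COLOR_BGR2HSV)
    # Airway lumina appear as dark regions; threshold the V channel to
    # isolate candidate orifices (the upper bound of 60 is an assumed value).
    mask = cv2.inRange(hsv, (0, 0, 0), (180, 255, 60))
    n_labels, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    # Rank foreground components by area (label 0 is the background) and
    # return the largest components' centroids as (x, y) point prompts.
    order = np.argsort(stats[1:, cv2.CC_STAT_AREA])[::-1] + 1
    return [tuple(centroids[i]) for i in order[:max_points]]
```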
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Missing comparison with segmentation-based navigation methods, even though the focus is classification.
    2. The authors should explain the EfficientSAM operation.
    3. The authors should discuss the dataset in more detail, such as the source of the data and its characteristics.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The writing is good. The article describes the proposed methods in detail and provides experimental comparisons. While some components are incremental and the dataset is relatively small, the performance improvements and ablation studies are convincing. Stronger justification of clinical impact and better articulation of novelty would further strengthen the paper. Moreover, the vague description of the source and quality of the dataset makes it difficult to judge the effectiveness of the algorithm.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their valuable comments. Our responses are as follows. (Note: the reviewer labels below are aligned with the review numbering on this page.)

Reviewer #2:
Q1: Clarification of the w/ and w/o EfficientSAM settings. A1: We apologize for the confusion. The w/ and w/o EfficientSAM settings indicate whether SCPT integrates the spatial features extracted by EfficientSAM into the CLIP visual branch. We will clarify this in the final version.
Q2: Further illustration of the parameter α and the test dataset distribution. A2: Thank you for the constructive comments. We provide an intuitive metric (overall accuracy) to compare performance under different values of α. In fact, the per-class recognition accuracy varies, and α = 0.7 achieves the best overall and per-class accuracy, which illustrates the effectiveness of combining L_ITM and L_BUKD. Due to the different examination purposes in clinical practice, the training and test distributions of bronchial bifurcations are imbalanced. As recommended, we will provide the per-class recognition accuracy on the test dataset, include additional evaluation metrics such as F1-score and recall, and correct the inconsistent statement regarding Eq. 11 in the final version.

Reviewer #3:
Q1: Missing comparison with segmentation-based navigation methods while the focus is classification. A1: Thank you for the valuable comment. As our current goal is to validate the effectiveness of bronchial bifurcation recognition for navigation, we did not include direct comparisons with segmentation-based approaches. We consider them complementary and will explore their integration in future work.
Q2: The authors should explain the EfficientSAM operation. A2: Thank you for the valuable comment. We will explain the details of the EfficientSAM spatial feature extraction in the final version.
Q3: The details of the bronchial bifurcation dataset. A3: The bronchial bifurcation dataset was collected from 1356 clinical bronchoscopy reports across multiple medical centers. The bifurcations were annotated during the bronchoscopic procedures by experienced clinical surgeons. The dataset therefore ensures clinical relevance and captures the morphological variability of bronchial bifurcations encountered in practical bronchoscopic navigation.

Reviewer #1:
Q1: Relevance to the autonomous navigation task. A1: We apologize for the confusion. In autonomous navigation, recognizing the current location is essential for determining subsequent movements. In bronchoscopic navigation in particular, the examination and navigation process is often suspended due to intense patient reactions, and the view is lost. Recognizing the different bronchial bifurcations allows the current position of the bronchoscope to be quickly re-established after the view is lost. SCPT thus provides the ability to recognize the current location and status within the bronchial tree, which is critical for autonomous bronchoscopic navigation.
Q2: The technical contributions of the proposed method. A2: Our work proposes an automatic point prompt generation strategy that guides EfficientSAM to extract fine-grained anatomical features from bronchial bifurcations, greatly enriching the visual branch of CLIP with detailed structural information. A cross-attention mechanism allows these fine-grained morphological features to guide and enhance the high-level semantic alignment in CLIP. The BUKD head is explicitly designed to widen decision boundaries and improve open-set bronchial bifurcation recognition (a hedged sketch follows this feedback), which is significant for recognizing both base and unknown bronchial bifurcations during autonomous bronchoscopic navigation.
Q3: The details of the bronchial bifurcation evaluation dataset. A3: Please refer to A3 for Reviewer #3.

Meta-Reviewer #2: We sincerely appreciate the constructive comments. We have carefully read all reviewer comments and will revise the final version based on these valuable suggestions.
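As a hedged sketch of the BUKD head discussed in A2 above, the following shows one way a base-unknown decoupled head with class-dependent margin modulation could look. The branch structure, margin formulation, and all names and dimensions are assumptions for illustration, not the paper's implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class BaseUnknownDecoupledHead(nn.Module):
    def __init__(self, dim: int = 512, n_base_classes: int = 10, margin: float = 0.1):
        super().__init__()
        self.shared = nn.Linear(dim, dim)    # shared anatomical subspace
        self.specific = nn.Linear(dim, dim)  # class-specific subspace
        self.classifier = nn.Linear(dim, n_base_classes)
        self.margin = margin

    def forward(self, feats, labels=None):
        shared = self.shared(feats)
        specific = self.specific(feats)
        logits = self.classifier(specific)
        if labels is not None:
            # Class-dependent modulation: shrink the true-class logit so base
            # classes must clear a wider decision boundary, leaving latent
            # room for unknown (open-set) samples.
            logits = logits - self.margin * F.one_hot(labels, logits.size(-1)).float()
        return logits, shared, specific
```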




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    This paper introduces a cross-modal prompt tuning framework integrating CLIP and EfficientSAM for open-set bronchial bifurcation recognition, which the reviewers recognize as a clinically relevant and technically innovative contribution. Reviewer #2 and Reviewer #3 highlight the method’s strengths, including its novel use of SAM-derived attention for dynamic visual prompts, class-dependent modulation for open-set robustness, and state-of-the-art performance supported by ablation studies. While Reviewer #3 notes the lack of comparison with segmentation-based navigation methods (a valid but addressable limitation), this critique does not diminish the paper’s primary focus on classification, which is rigorously evaluated. Reviewer #1’s concerns about technical novelty are outweighed by the consensus on the framework’s originality (e.g., cross-modal fusion for bronchoscopy) and its empirical validation. Based on these considerations and the paper’s demonstrated contributions, I recommend acceptance and encourage the authors to address the minor comments raised during the review process in the final version.


