Abstract
Accurate segmentation of pulmonary structures is crucial in clinical diagnosis, disease study, and treatment planning. Significant progress has been made in deep learning-based segmentation techniques, but most methods require large amounts of labeled data for training. Consequently, developing precise segmentation methods that demand less labeled data is paramount in medical image analysis. We constructed PAV-Seg3D, the largest Pulmonary Arteriovenous 3D Segmentation Dataset to date (718 scans). The emergence of pre-trained vision-language foundation models, such as CLIP, has recently opened the door to universal computer vision tasks. However, the exploration of these models for pulmonary artery-vein segmentation remains limited. This paper proposes a novel framework called LA-CAF, which adopts pre-trained CLIP as a strong feature extractor for generating segmentations of 3D CT scans, while adaptively aggregating cross-modal text and image representations. We propose a specially designed adapter module to fine-tune pre-trained CLIP with a self-adaptive learning strategy that effectively fuses the two modalities of embeddings. We validate LA-CAF on two datasets: PAV-Seg3D and the public PARSE2022 dataset. The experiments show that our method outperforms other state-of-the-art methods by a large margin. The dataset and code will be made publicly available upon publication.
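For intuition, here is a minimal sketch of how class prompts for the two vessel types might be embedded with a frozen CLIP text encoder before fusion; the prompt wording, the checkpoint, and the use of Hugging Face `transformers` are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical prompts and checkpoint -- not necessarily the paper's choices.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

prompts = [
    "a computed tomography scan of the pulmonary artery",
    "a computed tomography scan of the pulmonary vein",
]
with torch.no_grad():  # CLIP stays frozen; only an adapter would be trained
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    text_emb = text_encoder(**tokens).pooler_output  # shape: (2, 512)
```

These text embeddings would then be fused with 3D image features inside the adapter module described in the abstract.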
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2333_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/zhuji423/LA-CAF-MICCAI2025
Link to the Dataset(s)
https://github.com/zhuji423/LA-CAF-MICCAI2025
BibTex
@InProceedings{GuoXia_Selfadaptive_MICCAI2025,
author = { Guo, Xiaotong and Yang, Deqian and Wang, Dan and Zhu, Ying and Zhao, Haochen and Li, Yuan and Sui, Zhilin and Zhou, Tao and Zhang, Lijun and Meng, Hui and Meng, Yanda},
title = { { Self-adaptive Vision-Language Model for 3D Segmentation of Pulmonary Artery and Vein } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {491 -- 501}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors introduce a novel segmentation framework using vision-text model fusion and a self-adaptive learning strategy to effectively fuse the embeddings from the text and image modalities. The paper also introduces the PAV-Seg3D dataset, comprising 718 annotated pulmonary arteriovenous CT scans (79 fully labeled and 639 half-labeled).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Introducing a new dataset
- Proposing a text-vision model fusion to enhance the segmentation performance
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
As shown in Fig. 1, the vision model follows a traditional encoder-decoder architecture. While the encoder’s output (feature maps) is passed to the Adapter module—similar to the output from the text encoder—there is no mention of the decoder output from the vision model, which typically produces the predicted segmentation mask. If the final prediction is derived from the outputs of both encoders, it remains unclear what role the vision model’s decoder plays in the overall architecture.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
I recommend clarifying the role of the vision model’s decoder, both in Fig. 1 and in the main text. Additionally, incorporating the decoder’s predictions into the loss function could potentially enhance segmentation performance.
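For concreteness, a minimal sketch of the suggestion above: supervising the vision decoder with an auxiliary loss alongside the fused prediction, so the decoder receives gradients and is not redundant. The function signature and the 0.4 weight are hypothetical, not taken from the paper.

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def combined_loss(fused_logits, decoder_logits, target, aux_weight=0.4):
    # fused_logits:   prediction from the adapter (text + image fusion)
    # decoder_logits: prediction from the vision model's own decoder
    # target:         ground-truth segmentation labels
    return ce(fused_logits, target) + aux_weight * ce(decoder_logits, target)
```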
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
In traditional segmentation tasks and hybrid models, the decoder typically receives gradients from the loss calculated at the model’s output and is updated accordingly. However, the role of the vision model’s decoder in this work remains unclear and requires clarification. If the final prediction is solely based on the combined outputs of the text and vision encoders, then the Adapter module effectively assumes the role of the decoder. In that case, the vision model’s decoder appears redundant and should be reconsidered or removed from the architecture.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The main contribution of the paper is the introduction of a new framework called LA-CAF for segmenting pulmonary arteries and veins in 3D CT scans. This framework uses a pre-trained vision-language model, CLIP, reinforced with a Self-Adaptive Learning Pipeline and a Cross-Attention Fusion to improve segmentation accuracy without needing vast amounts of labeled data.
The authors also created a dataset for this task, called PAV-Seg3D, which helps train and validate their model. By combining text and image data, the framework outperforms existing methods, making it a promising tool for medical image analysis.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- By integrating a pre-trained vision-language model like CLIP with a self-adaptive learning strategy, the authors have created a novel approach that effectively addresses the challenges of segmenting pulmonary arteries and veins in 3D CT scans.
- The authors constructed a dataset for Pulmonary Arteriovenous 3D Segmentation, comprising 718 annotated CT scans. This dataset provides a substantial amount of data for training and validating segmentation models, which is often a bottleneck in medical imaging research.
- They also integrate a tailored augmentation designed specifically for the characteristics of pulmonary vascular structures, focusing on enhancing the features relevant to segmentation. This specialized augmentation helps the model generalize better and achieve improved performance on the task.
- The suggested solution is validated qualitatively and quantitatively on both the PAV-Seg3D and PARSE2022 datasets, showing significant accuracy improvements compared to existing solutions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While the authors have combined existing solutions to develop a new adaptive strategy, the novelty of their approach is limited and primarily application-specific. Many of the techniques, such as using adapter modules and attention mechanisms, are common in the computer vision field. The methodological novelty in this work is incremental and tailored to a very specific application, as it largely revolves around combining existing methodologies in a new domain. To enhance the novelty, the authors could have introduced additional tasks, such as image captioning for interpretability or mechanisms for uncertainty estimation, which would have broadened the scope and impact of their framework.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Given the strengths and weaknesses, the paper presents a valuable contribution to the specific task of pulmonary artery and vein segmentation using vision-language models. However, the limited methodological novelty and application-specific focus are significant drawbacks.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
There are two main contributions of the paper. Firstly, a large dataset of annotated CT scans of pulmonary arteries and veins will be released. Secondly, the paper proposes an adaptive module that effectively enhances the fusion of text and vision features based on pre-trained CLIP vision-language models.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strength is the detailed description of the adaptive module. In particular, the multi-head cross-attention component used to fuse the image feature embeddings with the text feature embeddings is my personal highlight. In my opinion, this is a genuinely good idea for bridging the domain gap between language and vision. Additionally, the experimental results are very promising.
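To make the highlighted mechanism concrete, here is a minimal PyTorch sketch of multi-head cross-attention fusion in the spirit described above; the class name, dimensions, and residual design are assumptions for illustration, not the authors' exact module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion: image tokens attend to CLIP text embeddings."""
    def __init__(self, img_dim=256, txt_dim=512, heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)  # bridge text/image dims
        self.mhca = nn.MultiheadAttention(img_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_tokens, txt_emb):
        # img_tokens: (B, N, img_dim) flattened 3D feature tokens
        # txt_emb:    (B, T, txt_dim) frozen CLIP text embeddings
        txt = self.txt_proj(txt_emb)
        fused, _ = self.mhca(query=img_tokens, key=txt, value=txt)
        return self.norm(img_tokens + fused)  # residual keeps image evidence

fusion = CrossAttentionFusion()
img = torch.randn(2, 4096, 256)   # e.g. a flattened 16x16x16 feature volume
txt = torch.randn(2, 2, 512)      # embeddings for "artery"/"vein" prompts
out = fusion(img, txt)            # (2, 4096, 256)
```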
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The major weakness is the description of the data augmentation. Although it is highlighted as part of one of the major contributions and is included in the main Figure 1, its description consists of only two sentences in the implementation details on page 6. This data augmentation might be important for this specific task - it remains unclear why it helps for the partly annotated images - but it seems less central to the proposed method of effectively fusing text and vision based on a pretrained model. However, if it is part of a main contribution, it should be explained properly.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Additionally, I would like to stress the necessity of stringent notation: (a) Please avoid * when describing the dimensions of a tensor, as done frequently on pages 4 and 5. This operator is already used for convolutions, as you state yourself on page 5. Please use the \times symbol instead, as done in the description of the 1x1x1 kernels. (b) The variable H^a_v is defined twice with different dimensions. Please use different variables for different things.
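As a small LaTeX illustration of point (a), where the variable names mirror the review's example and are placeholders:

```latex
% Ambiguous: * also denotes convolution elsewhere in the paper
% H^{a}_{v} \in \mathbb{R}^{C * D * H * W}
% Preferred: \times for tensor dimensions
H^{a}_{v} \in \mathbb{R}^{C \times D \times H \times W}
```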
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is clearly written and the method is very well described in detail with the exception of the data augmentation. The idea of a multi-head cross-attention module to overcome the domain gap between text and vision is a good one. The results are promising and the release of a large annotated data set is a benefit for the community.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
N/A
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A