Abstract
Recent advancements in medical image segmentation have leveraged multi-modal learning, incorporating textual descriptions to enhance segmentation accuracy. However, existing approaches suffer from high computational costs and inefficient text-vision fusion mechanisms, necessitating a more accurate yet computationally efficient solution. To address this, we propose ViTexNet, a novel vision-language segmentation model that introduces Text-Guided Dynamic Convolution (TGDC) for effective and lightweight fusion of medical visual features and textual cues. Unlike standard cross-attention mechanisms, which impose high parameter complexity, TGDC dynamically refines image features by leveraging relevant textual semantics at each decoder stage, ensuring efficient feature modulation without excessive overhead. By adaptively emphasizing clinically significant regions based on textual descriptions, TGDC enhances segmentation performance while maintaining computational efficiency. Extensive evaluations on QaTa-COV19 and MosMedData+ datasets demonstrate ViTexNet’s state-of-the-art performance, achieving 90.76% Dice and 83.25% mIoU on QaTa-COV19, and 78.19% Dice and 64.04% mIoU on MosMedData+, while operating at just 11.5G FLOPs, substantially lower than competing models. Ablation studies confirm TGDC’s superiority over cross-attention-based methods, highlighting its effectiveness in optimizing segmentation accuracy without computational trade-offs. The source code is made publicly available at: https://github.com/bhardwaj-rahul-rb/vitexnet
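As a rough illustration of the pipeline the abstract describes, the following sketch (PyTorch-style; class and argument names such as ViTexNetSketch are hypothetical placeholders, not the released implementation linked above) wires an image encoder, a text encoder, and a per-decoder-stage text-guided fusion block into a segmentation head.

```python
# Minimal sketch of the described vision-language segmentation pipeline.
# Assumes PyTorch; all module names here are illustrative placeholders,
# not the authors' actual classes (see the linked repository for those).
import torch
import torch.nn as nn


class ViTexNetSketch(nn.Module):
    def __init__(self, image_encoder, text_encoder, tgdc_blocks, decoder_stages, seg_head):
        super().__init__()
        self.image_encoder = image_encoder                    # e.g. a Swin-style backbone returning multi-scale features
        self.text_encoder = text_encoder                      # e.g. a clinical BERT returning token embeddings
        self.tgdc_blocks = nn.ModuleList(tgdc_blocks)         # one text-guided fusion block per decoder stage
        self.decoder_stages = nn.ModuleList(decoder_stages)   # standard upsampling + skip-connection stages
        self.seg_head = seg_head                              # final per-pixel classifier

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.image_encoder(image)                     # list of multi-scale visual features, coarsest last
        text_emb = self.text_encoder(text_tokens)             # (B, L, C_text) textual token embeddings
        x = feats[-1]
        for tgdc, stage, skip in zip(self.tgdc_blocks, self.decoder_stages, reversed(feats[:-1])):
            x = tgdc(x, text_emb)                             # refine visual features with textual cues at this stage
            x = stage(x, skip)                                # upsample and merge with the encoder skip connection
        return self.seg_head(x)                               # segmentation logits
```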
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3769_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/bhardwaj-rahul-rb/vitexnet
Link to the Dataset(s)
https://github.com/HUANGLIZI/LViT
BibTex
@InProceedings{BhaRah_ViTexNet_MICCAI2025,
author = { Bhardwaj, Rahul and Tambe, Utkarsh Yashwant and Neog, Debanga Raj},
title = { { ViTexNet: Vision-Text Guided Dynamic Convolution Network for Medical Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {691--701}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces ViTexNet, a vision-language segmentation network for medical images. The main contribution is the Text-Guided Dynamic Convolution (TGDC) module. The authors evaluate the method on two common medical datasets (QaTa-COV19 and MosMedData+), reporting improved segmentation metrics.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed TGDC module is simple and efficient and avoids the high computational overhead associated with cross-attention.
- Experimental results show performance gains in both segmentation accuracy and FLOPs reduction.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The technical novelty is limited. The TGDC module is conceptually similar to existing dynamic convolution and feature modulation approaches. The work lacks clear justification on how it differs substantially from prior dynamic or text-guided convolution methods.
- Compared with HCFNet, the accuracy improvement is limited.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Please list the differences from HCFNet.
HCFNet. Zhou X, Song Q, Nie J, et al. Hybrid cross-modality fusion network for medical image segmentation with contrastive learning[J]. Engineering Applications of Artificial Intelligence, 2025, 144: 110073.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The method and related information appear to draw heavily from the HCFNet study.
HCFNet. Zhou X, Song Q, Nie J, et al. Hybrid cross-modality fusion network for medical image segmentation with contrastive learning[J]. Engineering Applications of Artificial Intelligence, 2025, 144: 110073.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ rebuttal resolved my concerns.
Review #2
- Please describe the contribution of the paper
This paper presents a novel vision-language segmentation model that leverages Text-Guided Dynamic Convolution (TGDC) to efficiently fuse medical image features with textual cues in a lightweight framework. Compared with existing methods, it achieves competitive performance while improving computational efficiency.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper obtains good experimental results and achieves a balance between segmentation performance and algorithmic efficiency.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
(1) Table 1 compares the proposed method with other methods. The authors should clearly indicate the backbone network used by each method and compare methods that share the same backbone; otherwise it is difficult to attribute the performance gains to the proposed innovations rather than to differences in backbone networks.
(2) Consider adding one or two additional datasets to the comparison. The paper currently reports experiments on only two datasets, which seems insufficient.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper’s innovation, performance, and readability.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper proposes the ViTexNet model, which efficiently fuses medical image visual features and text cues through the Text-Guided Dynamic Convolution (TGDC) module. It outperforms existing uni-modal and multi-modal methods on multiple datasets, achieving high segmentation accuracy with low computational cost. Ablation experiments verify the advantages of the TGDC module, and the source code is publicly available.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The model was extensively evaluated on the QaTa-COV19 and MosMedData+ datasets, where ViTexNet achieved state-of-the-art performance. The paper conducts ablation studies to compare TGDC with cross-attention and a combined self- and cross-attention pipeline.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Figure 1 should be captioned as the network architecture rather than as comparative analysis results.
2. The mathematical formulas must be accompanied by detailed explanatory captions to ensure computational reproducibility.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The study’s robust methodology and validated results are commendable, though the presentation would benefit from enhanced adherence to manuscript preparation guidelines regarding typographical standardization.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have adequately addressed my concerns. I recommend acceptance.
Author Feedback
We sincerely thank all reviewers for their valuable comments.
To R1: [Backbone Networks and Comparison] We grouped the methods in Table 1 only by their modalities due to limited space. In the final manuscript, we will also indicate in Table 1 the backbone architecture of each method for clarity, as follows: (a) Unimodal methods (image only): CNN-based: U-Net [14], U-Net++ [21], nnUNet [9]; SAM-based: MedSA [18]; hybrid CNN-Transformer: Swin-Unet [2], UCTransNet [16]. (b) Multi-modal methods (image + text): CNN-based: TGANet [15]; SAM-based: LGA [6]; hybrid CNN-Transformer: GLoRIA [7], ViLT [10], LAVT [19], Ariadne's Thread [20], LViT [11], RefSegformer [17], RecLMIS [8]. ViTexNet falls under the hybrid category: it uses a Swin-V2 Transformer-based image encoder and a CXR-BERT text encoder, with a CNN-based Text-Guided Dynamic Convolution (TGDC) module for lightweight cross-modal fusion. [Additional Datasets for Comparison] Following MICCAI submission policies, we are unable to include additional datasets in the final version. However, we recognize the value of broader validation and plan to evaluate ViTexNet on additional datasets in future work.
To R2: [Figure 1 Labeling] We acknowledge this oversight and will revise the Figure 1 caption to include a brief summary of the network architecture. [Mathematical Formula Explanation] Following the reviewer's suggestion, we will provide more detailed captions for the formulas in the final version to ensure computational reproducibility. We also plan to document these details in the code release and related documentation.
To R3: [Comparison with Dynamic Convolution [4] (Dyn. Conv.)] a. [Text-Driven Fusion, Not Image-Driven] Unlike Dyn. Conv., which generates attention weights from image features to modulate image-based convolution operations, TGDC derives its attention weights entirely from globally pooled text embeddings (e.g., "Bilateral pulmonary infection, two infected areas, lower left lung and lower right lung."). The guidance therefore comes from linguistic context rather than visual content, enabling semantic fusion across modalities. b. [1D Depthwise Convolutions, Not 2D] Unlike Dyn. Conv. methods that apply 2D kernels over spatial feature maps, TGDC uses a more efficient 1D depthwise convolution over flattened token sequences. Since the Swin encoder already encodes 2D spatial structure into the token embeddings, TGDC operates along the sequence (token) dimension to refine features. c. [Iterative Refinement, Not Single-Pass Filtering] Unlike Dyn. Conv., which typically applies filter selection only once, TGDC performs iterative refinement by reapplying the same set of depthwise filters within each decoder stage. Both passes are guided by the same global text-derived weights, allowing TGDC to progressively enhance image features using semantic guidance from the text. This behavior is not present in standard Dyn. Conv. [Comparison with HCFNet: Architecture and Efficiency] ViTexNet is a lightweight model designed to balance segmentation performance with computational efficiency. In contrast to HCFNet (Hybrid cross-modality fusion network for medical image segmentation with contrastive learning, Engineering Applications of Artificial Intelligence, 2025), which employs a hybrid decoder integrating multi-head cross-attention (MHCA), a feature modulation block (LCFM), and a multi-stage contrastive loss, ViTexNet introduces TGDC, which uses text-guided weighting over multiple depthwise convolution filters to integrate visual and textual information.
TGDC avoids reliance on cross-attention and contrastive supervision while maintaining competitive or superior segmentation performance. While HCFNet introduces significantly higher computational overhead (102M parameters, 16.44 GFLOPs), ViTexNet uses only 37.7M parameters and 11.5 GFLOPs. ViTexNet achieves state-of-the-art Dice and IoU scores on the QaTa-COV19 dataset and remains highly competitive on MosMedData+, with performance close to HCFNet.
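To make the fusion mechanism described above concrete, the sketch below captures the three properties stated in the rebuttal: mixing weights derived from a globally pooled text embedding, a bank of 1D depthwise convolutions applied along the flattened token dimension, and two refinement passes reusing the same filters and weights. It is a minimal PyTorch illustration under these assumptions; names such as TGDCSketch, num_filters, and passes are hypothetical, and the authors' released code should be consulted for the actual implementation.

```python
# Minimal sketch of the TGDC idea, assuming PyTorch.
# Illustrative only; not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TGDCSketch(nn.Module):
    def __init__(self, dim: int, text_dim: int, num_filters: int = 4, kernel_size: int = 3):
        super().__init__()
        # Bank of 1D depthwise convolution filters over the token (sequence) dimension.
        self.filters = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
            for _ in range(num_filters)
        ])
        # Text-derived mixing weights: pooled text embedding -> softmax over the filter bank.
        self.to_weights = nn.Linear(text_dim, num_filters)

    def forward(self, tokens: torch.Tensor, text_emb: torch.Tensor, passes: int = 2) -> torch.Tensor:
        # tokens:   (B, N, C) flattened visual tokens from the image encoder
        # text_emb: (B, L, C_text) token embeddings from the text encoder
        pooled = text_emb.mean(dim=1)                       # global pooling over text tokens
        w = F.softmax(self.to_weights(pooled), dim=-1)      # (B, num_filters) text-guided weights
        x = tokens.transpose(1, 2)                          # (B, C, N) layout expected by Conv1d
        for _ in range(passes):                             # iterative refinement: same filters, same weights
            mixed = sum(w[:, i].view(-1, 1, 1) * f(x) for i, f in enumerate(self.filters))
            x = x + mixed                                   # residual update of the token features
        return x.transpose(1, 2)                            # back to (B, N, C)


# Example usage (shapes only; values are arbitrary):
# tokens = torch.randn(2, 196, 96); text_emb = torch.randn(2, 24, 768)
# refined = TGDCSketch(dim=96, text_dim=768)(tokens, text_emb)   # -> (2, 196, 96)
```

Because the text-to-weight projection and the depthwise filter bank are the only learned components in such a block, this style of fusion stays far cheaper than a full cross-attention layer, which is consistent with the efficiency argument made in the rebuttal.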
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A