Abstract
In medical image segmentation, obtaining pixel-level annotated data is costly. While semi-supervised and weakly supervised methods reduce annotation dependence, they still require some pixel-level annotations. In contrast, leveraging textual descriptions corresponding to medical images as supervisory information for segmentation is more promising: textual descriptions are easier to acquire, as users only need to provide the location and appearance details of lesions. We present TIFCMamba, a Mamba-based architecture for text-image fusion segmentation. The framework processes images and texts in parallel to establish cross-modal correspondences, aligning CLIP-encoded features through contrastive learning. The architecture employs a Mamba-based image encoder that reduces computational complexity compared to traditional Transformer models. We propose a Mamba Fusion (MF) module that integrates text and image features through Bi-Dimension Fusion (BiDF), enabling both intra-modal refinement and inter-modal interaction while preserving computational efficiency. Experiments on polyp and skin lesion datasets demonstrate competitive performance against fully supervised methods and state-of-the-art weakly supervised approaches. Code and dataset will be available at https://github.com/PZalio/TIFCMamba.
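For a concrete picture of the data flow the abstract describes, here is a minimal PyTorch sketch. All injected modules (the Mamba-based image encoder, the MF fusion block, the segmentation head, and the CLIP encoders) are hypothetical stand-ins under assumed interfaces, not the paper's actual components:

```python
import torch
import torch.nn as nn

class TextImageFusionSketch(nn.Module):
    """Hedged sketch of the TIFCMamba data flow described in the abstract."""

    def __init__(self, clip_image_enc: nn.Module, clip_text_enc: nn.Module,
                 image_encoder: nn.Module, mf_fusion: nn.Module, seg_head: nn.Module):
        super().__init__()
        # CLIP encoders stay frozen and act only as fixed feature extractors
        # (as the authors state in their rebuttal to Reviewer #1).
        self.clip_image_enc = clip_image_enc.eval()
        self.clip_text_enc = clip_text_enc.eval()
        for enc in (self.clip_image_enc, self.clip_text_enc):
            for p in enc.parameters():
                p.requires_grad_(False)
        self.image_encoder = image_encoder  # Mamba-based, trainable
        self.mf_fusion = mf_fusion          # Mamba Fusion (MF) with BiDF
        self.seg_head = seg_head            # pixel-wise prediction head

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor):
        img_feat = self.image_encoder(image)        # intra-modal image features
        with torch.no_grad():                       # frozen CLIP features
            img_clip = self.clip_image_enc(image)
            txt_clip = self.clip_text_enc(text_tokens)
        fused = self.mf_fusion(img_feat, txt_clip)  # inter-modal interaction
        mask = self.seg_head(fused)                 # segmentation prediction
        return mask, img_clip, txt_clip             # CLIP pair feeds the contrastive loss
```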
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0648_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{PanZhe_MAMBABased_MICCAI2025,
author = { Pan, Zhen and Huang, Wenhui and Zheng, Yuanjie},
title = { { MAMBA-Based Weakly Supervised Medical Image Segmentation with Cross-Modal Textual Information } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
pages = {299--309}
}
Reviews
Review #1
- Please describe the contribution of the paper
This work tackles weakly supervised image segmentation. Following recent works, the authors pursue text-based weak supervision: a CLIP-based contrastive architecture built on a Mamba model, together with an alignment loss. The method is evaluated on polyp datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Aims to reduce the supervision (and computation) cost of image segmentation.
- Leverages text and multi-modal learning.
- Provides experimental results.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited novelty: using text as weak supervision has existed since 2023, and it is not clear what is novel in the proposed method.
- It is clear that performance is boosted by increasing the model size (T vs. S vs. B). The Tiny model leads to poor performance in almost all cases, which also makes it difficult to compare this method to previous work.
- Using different text than previous work makes it difficult to compare the method; the same holds for the architecture.
- Evaluated only on the polyp dataset type.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Limited/unclear novelty.
- Difficult to compare to previous work: different model/architecture/text supervision.
- Performance is boosted by using a larger model.
- The method is evaluated only on the polyp dataset type.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Using text is not new, whether it comes from reports or ChatGPT. In addition, the evaluation doesn't cover different datasets. Finally, the model size drastically boosts performance, which suggests that the size of the model plays an important role and limits the method's benefit.
Review #2
- Please describe the contribution of the paper
Obtaining annotated medical data is costly and requires domain expertise; therefore, semi-supervised or unsupervised methods are highly desirable. This paper proposes TIFCMamba, a weakly supervised framework for image segmentation. The model integrates a Mamba-based architecture with a novel Mamba Fusion (MF) module, enabling bi-dimensional fusion of image and text features to enhance both intra- and inter-modal interactions. The authors also introduce a mutual alignment mechanism to address inconsistencies between training and testing phases that commonly arise in image-text alignment paradigms. Experimental results demonstrate the strong performance of the proposed approach on skin and polyp lesion datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper addresses a relevant and practical challenge in medical image analysis by aiming to reduce reliance on pixel-level annotations.
- The proposed Mamba Fusion module and the bi-dimensional fusion strategy are well-motivated, along with the proposed loss enhancing the effectiveness of cross-modal learning.
- Experimental validation on two challenging datasets demonstrates the proposed approach achieves performance comparable to fully supervised models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
As acknowledged by the authors, text generation remains a challenging aspect that requires further investigation. While temporal information was beyond the scope of this study, it could be a valuable addition to the proposed pipeline, especially given that colonoscopy data inherently contains rich temporal cues.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend acceptance of this paper due to its well-motivated and efficient approach to weakly supervised medical image segmentation, leveraging a novel Mamba-based architecture and bi-dimensional text-image fusion to achieve competitive results with significantly reduced annotation requirements. Despite minor limitations, the methodology is technically sound and experimentally well-validated.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper proposed a new medical image segmentation framework that incorporates textual information. The authors also introduced a novel fusion module to enable intra-modal refinement and inter-modal interaction, and conducted extensive experiments to demonstrate its effectiveness.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- They employed a Mamba-based architecture to replace the Transformer-based design, aiming to improve computational efficiency.
- They proposed a novel fusion module that enhances text-image feature interactions and addresses token fusion limitations in Mamba.
- They introduced an image-text mutual alignment mechanism to achieve more precise alignment.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In Fig. 1, why is $\hat{I}$ followed by a CLIP text encoder and $\hat{T}$ followed by a CLIP image encoder? This appears counterintuitive and would benefit from clarification.
- In Page 2, the authors state: “We introduce an image-text mutual alignment mechanism for precise alignment between image and text segments during training and testing.” However, there is limited experimental evidence to support the effectiveness of this mechanism. Could the authors provide more details and comparative analysis, e.g., using CTC Loss or other baselines?
- It is strongly recommended that the authors run multiple experiments with different random seeds to reduce the impact of randomness and provide more reliable results.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The main strengths of the paper lie in its novel approach to image segmentation based on vision-language multimodal learning, and the authors conducted extensive experiments to demonstrate its effectiveness. However, there are still some typographical errors and unclear statements throughout the paper that should be addressed.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
Dear Reviewers, We appreciate your valuable feedback. While Reviewers #2 and #3 acknowledged the novelty and completeness of our work, Reviewer #1 raised several concerns that may reflect misunderstandings. We address all comments below, with a focus on clarifying those points.
For Reviewer #1:
“limited novelty, using text exists since 2023”: Unlike prior works that rely on manually annotated clinical reports or structured templates, we employ automatically generated GPT-based descriptions, enabling scalable weak supervision even for datasets without textual metadata. While most methods rely on Transformer-based fusion with quadratic complexity O(n²·d + n·d²), our Mamba-based state-space model achieves linear complexity O(n), improving efficiency for high-resolution inputs (Sec. 2.1). Prior approaches align the entire image with a global sentence embedding, which can cause semantic drift. In contrast, our method applies localized contrastive alignment (Sec. 2.3) on text-relevant regions, improving segmentation in complex or multi-lesion cases.
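To make the complexity contrast in this rebuttal point concrete, the following is a toy PyTorch illustration only, not the authors' code: real Mamba uses input-dependent SSM parameters and a hardware-aware parallel scan rather than a Python loop.

```python
import torch

def attention_scores(x: torch.Tensor) -> torch.Tensor:
    """Self-attention materialises an (n, n) score matrix: O(n^2 * d)."""
    return x @ x.transpose(-2, -1)

def diagonal_ssm_scan(x: torch.Tensor, a: float = 0.9, b: float = 0.1) -> torch.Tensor:
    """Toy diagonal state-space recurrence: one pass over the sequence,
    O(n * d) time with O(d) state - the scaling argument behind Mamba."""
    h = torch.zeros(x.size(-1))
    outputs = []
    for t in range(x.size(0)):
        h = a * h + b * x[t]   # constant cost per token
        outputs.append(h.clone())
    return torch.stack(outputs)

x = torch.randn(1024, 64)       # n tokens, d channels
scores = attention_scores(x)    # (1024, 1024): grows quadratically with n
states = diagonal_ssm_scan(x)   # (1024, 64): grows linearly with n
```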
“evaluated only on polyp dataset type”: We would like to clarify that our method was evaluated on both polyp and non-polyp datasets. As shown in Table 1, Table 2, and Fig. 3, we include results on ISIC2017 (skin lesions) in addition to three polyp datasets.
“difficult to compare to previous work: different model/architecture/text-supervision”: We would like to clarify that all comparisons were conducted using the same datasets and standardized GPT-4-generated descriptions (e.g., lesion location, appearance), as noted in Sec. 3.1. These were applied consistently across all methods and will be released for reproducibility. Text variation was explored only in the ablation study (Fig. 3) to assess robustness under different phrasings for the same image, and not used to support main performance claims.
“performance is boosted due to using large model”: We would like to clarify a possible misunderstanding regarding the role of large models in our pipeline. While we use GPT-4 and CLIP, GPT-4 is used only offline to generate training descriptions and is not part of our model backbone. CLIP encoders are frozen throughout training and serve only as fixed feature extractors for contrastive learning. Thus, the observed performance gains do not result from large model tuning, but from our architectural innovations—including the Mamba Fusion Block, region-level alignment, and Bi-Dimension Fusion—which are validated through experiments and ablations.
For Reviewer #2: We sincerely thank the reviewer for the positive feedback and recognition of our method’s novelty. We agree that incorporating temporal information is a promising future direction and plan to explore it in follow-up work.
For Reviewer #3:
Clarification of Fig. 1: We apologize for the confusion. There was a labeling issue in the figure: $\hat{I}$ should go to the CLIP Image Encoder, and $\hat{T}$ to the CLIP Text Encoder.
Effectiveness of Alignment Loss and Comparison with CTC Alternatives: Thank you for the suggestion. Our alignment uses a Symmetric InfoNCE loss to optimize semantic similarity between image regions and text segments. While the CTC loss you mentioned is effective for sequential alignment (e.g., speech), it assumes monotonic paths and is not applicable to our cross-modal setting. We conducted preliminary comparisons between Symmetric InfoNCE and Symmetric Multi-class CE loss using our TIFCMamba-B. Substituting InfoNCE with CE caused consistent mDice drops: ClinicDB: 88.24 → 83.35; ColonDB: 87.74 → 82.97; LaribPolypDB: 88.92 → 82.46; ISIC2017: 87.95 → 83.04. These results confirm the superiority of our alignment loss.
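For reference, a generic Symmetric InfoNCE loss of the kind named above can be sketched as follows; this is a minimal version over paired embeddings and does not reproduce the authors' region/segment construction:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (B, D) embeddings;
    matched image/text pairs share the same batch index, all other
    pairs act as in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```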
Multi-Seed Evaluation: We agree that assessing robustness across multiple seeds is important. We will include multi-seed experiments and report mean ± std in the final version or future work to improve reproducibility.
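A minimal sketch of such a multi-seed protocol, with a placeholder standing in for the authors' training and evaluation pipeline, might look like:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix the common RNG sources so each run is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def train_and_evaluate() -> float:
    """Placeholder for the actual pipeline; returns a dummy mDice."""
    return float(torch.rand(1))

scores = []
for seed in (0, 1, 2):          # hypothetical seed choice
    set_seed(seed)
    scores.append(train_and_evaluate())
print(f"mDice: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```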
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Following a thorough review, two reviewers commended the work’s innovation in integrating Mamba-based frameworks with weakly supervised methods, noting significant advancements in the field. While one reviewer initially questioned the novelty, the authors’ rebuttal effectively demonstrated the non-trivial contribution—showcasing how the approach addresses key challenges and outperforms state-of-the-art methods empirically. We recommend acceptance, recognizing the work’s validated innovation and impact.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This is an overly complex approach targeting a scenario that is becoming increasingly less relevant with the emergence of foundation models. Moreover, the authors have not compared their method with segmentation approaches based on models such as SAM, which significantly weakens the evaluation and the overall impact of the work.