Region-Based Text-Consistent Augmentation for Multimodal Medical Segmentation
Abstract
Medical image segmentation is crucial for various clinical applications, and deep learning has significantly advanced this field. To further enhance performance, recent research explores multimodal data integration, combining medical images and textual reports. However, a critical challenge lies in image data augmentation for multimodal medical data, specifically in maintaining text-image consistency. Traditional augmentation techniques, designed for unimodal images, can introduce mismatches between augmented images and text, hindering effective multimodal learning. To address this, we introduce Region-Based Text-Consistent Augmentation (RBTCA), a novel framework for coherent multimodal augmentation. Our approach performs region-based image augmentation by first identifying image regions described in associated text reports and then extracting textual cues grounded in these regions. These cues are integrated into the image, and augmentation is subsequently performed on this modality-aware representation, ensuring inherent text-cue consistency. Notably, RBTCA’s plug-and-play design allows for straightforward integration into existing medical image analysis pipelines, enhancing its practical utility. We demonstrate the efficacy of our framework on the QaTa-Covid19 and our in-house Lung Tumor CT Segmentation (LTCT) datasets, achieving substantial gains, with a Dice coefficient improvement of up to 7.24% when integrated into baseline segmentation models. Our code will be released at https://github.com/KunyanCAI/RBTCA.
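For orientation, the following Python sketch outlines the augmentation workflow described in the abstract. It is only a schematic reading of that description: the helper callables (locate_described_regions, extract_text_cues, embed_cues_into_image) and the fused representation are placeholders, not the authors' implementation, which is available in the linked repository.

```python
def rbtca_augment(image, report, augment_fn,
                  locate_described_regions, extract_text_cues, embed_cues_into_image):
    """Schematic sketch of the RBTCA idea; all helper callables are hypothetical."""
    # 1. Identify the image regions that the associated report actually describes.
    regions = locate_described_regions(image, report)
    # 2. Extract textual cues grounded in those regions.
    cues = extract_text_cues(report, regions)
    # 3. Integrate the cues into a modality-aware image representation.
    fused = embed_cues_into_image(image, cues, regions)
    # 4. Augment the fused representation; because the cues travel with the image,
    #    text-image consistency is preserved by construction.
    return augment_fn(fused)
```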
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3559_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/KunyanCAI/RBTCA
Link to the Dataset(s)
QaTa-Covid19 dataset: https://www.kaggle.com/datasets/aysendegerli/qatacov19-dataset
BibTex
@InProceedings{CaiKun_RegionBased_MICCAI2025,
author = { Cai, Kunyan and Yan, Chenggang and He, Min and Qu, Liangqiong and Wang, Shuai and Tan, Tao},
title = { { Region-Based Text-Consistent Augmentation for Multimodal Medical Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15962},
month = {September},
pages = {541 -- 551}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes Region-Based Text-Consistent Augmentation (RBTCA), a framework addressing the challenge of multimodal data augmentation in medical segmentation tasks. The approach identifies regions described by associated textual reports, extracts textual cues grounded in these regions, and integrates them with image augmentations. The proposed method is plug-and-play, ensuring modality consistency during augmentation and significantly improving segmentation performance on QaTa-Covid19 and Lung Tumor CT datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper tackles an important and practical problem in multimodal medical image segmentation: ensuring semantic consistency during augmentation.
- The RBTCA framework offers a plug-and-play design, allowing easy integration into existing segmentation pipelines without architectural modifications.
- Experiments demonstrate clear segmentation improvements over multiple baselines and different architectures, highlighting the method’s effectiveness and generalizability.
- The authors commit to releasing the implementation, enhancing reproducibility and utility.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The discussion of related work is incomplete. The studied topic significantly overlaps with existing referring segmentation methods, specifically those described by Ouyang et al. (LSMS [3]) and Chen et al. (CausalCLIPSeg [4]). A detailed comparison and discussion of how this method distinguishes itself from these prior works are necessary.
- Experimental comparisons lack robustness. Essential segmentation backbones such as TransUNet [5] and nnU-Net [6], which have demonstrated strong performance in medical segmentation, are notably missing. Further, comparisons with other prominent text-driven methods (e.g., Liu et al.’s CLIP-driven approach [7]) would strengthen the evaluation.
More papers that should be cited:
[1] Koleilat T, Asgariandehkordi H, Rivaz H, et al. MedCLIP-SAMv2: Towards universal text-driven medical image segmentation. arXiv preprint arXiv:2409.19483, 2024.
[2] Killeen B D, Wang L J, Zhang H, et al. FluoroSAM: A language-aligned foundation model for x-ray image segmentation. arXiv preprint arXiv:2403.08059, 2024.
[3] Ouyang S, Zhang J, Lin X, et al. LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation. arXiv preprint arXiv:2408.17347, 2024.
[4] Chen Y, Wei M, Zheng Z, et al. CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2024: 77-87.
More backbones that should be included:
[5] Chen J, Lu Y, Yu Q, et al. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[6] Isensee F, Jaeger P F, Kohl S A A, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 2021, 18(2): 203-211.
[7] Liu J, Zhang Y, Chen J N, et al. CLIP-driven universal model for organ segmentation and tumor detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 21152-21164.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a promising method with demonstrated improvements. However, the inadequate treatment of related literature and insufficient experimental comparisons undermine confidence in the contributions. A thorough discussion comparing your method with existing referring segmentation approaches ([3,4]) and including more competitive baseline architectures ([5,6]) and text-driven methods ([7]) are necessary to better position your work within the existing landscape.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Thanks to the authors for providing the rebuttal. I carefully read the other reviewers’ comments as well as the author feedback. The authors stated that “space constraints limit our ability to include more results.” Thus, unfortunately, I have to maintain my original decision of weak reject.
Review #2
- Please describe the contribution of the paper
This work presents a data augmentation strategy for multimodal medical data with modality-aware representation and inherent text-image consistency. Experiments on x-ray and lung CT datasets show consistent improvements.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Paper is well-written
- Method is well-motivated and clear.
- Comprehensive experiments.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The CT dataset is private, limiting reproducibility. Please consider either releasing the dataset or using another public lung CT dataset, such as CT-RATE.
- The CT test set contains only 23 cases, which lacks statistical power.
- Improper metric usage (see Metrics Reloaded). Please replace IoU with boundary-based metrics.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
See weakness
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
- The authors propose a Region-Based Text-Consistent Augmentation (RBTCA) framework to maintain semantic consistency between augmented medical images and their corresponding textual reports.
- The RBTCA is designed as a lightweight, plug-and-play module that can be seamlessly integrated into existing medical image analysis pipelines.
- Experimental results demonstrate the effectiveness of the proposed framework, showing substantial performance improvements.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel formulation: The authors introduce a novel perspective by focusing on maintaining semantic consistency between image and text modalities during augmentation. This direction is relatively underexplored in existing multimodal data augmentation literature, making the approach innovative and timely.
- Practicality and compatibility: The proposed RBTCA framework is lightweight and designed as a plug-and-play module, which makes it easy to integrate into existing segmentation pipelines without requiring architectural changes. This enhances its practicality for real-world clinical applications.
- Clarity of presentation: The manuscript is well-structured and clearly written. The logical flow of the methodology and experiments allows readers to easily follow and understand the proposed approach.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Need for more clarity:
  - In Section 2.1, the authors state: “we utilize a BERT-based embedding model”, but no citation or detailed information about the specific BERT variant is given. It would be helpful for reproducibility if the authors could specify whether a pre-trained BERT (e.g., BERT-base, BioBERT, ClinicalBERT) was used and clarify whether any task-specific fine-tuning was applied.
  - In the description of the M_{ROIE} module, the authors mention that the image and textual prompt are “linearly projected into Query (Q), Key (K), and Value (V) representations”. However, it remains unclear how this projection is implemented. The manuscript would benefit from a clearer explanation or formula detailing this transformation (e.g., using learnable linear layers or convolutional projections).
- Experiments:
  - In Section 3.1, the authors do not clearly specify the train/test split strategy for the QaTa-COV19 dataset. Additionally, for the in-house LTCT dataset, more detailed information about the imaging acquisition should be provided, such as the CT scanner parameters (e.g., tube voltage, current, scanning protocol).
  - Table 2 reveals that adding TCA to multimodal models leads to a slight performance degradation. This raises a concern: does the proposed augmentation framework (MAR + TCA) primarily benefit unimodal models, while offering limited or even negative impact on multimodal ones? If so, why not directly use stronger multimodal models instead of applying the framework to unimodal ones? The authors are encouraged to elaborate on this trade-off and justify the design choice.
  - The manuscript sets the weighting factor λ in the loss function to 0.5 without justification. Please clarify how this value was determined. If λ = 0.5 is based on empirical tuning, it would be helpful to include a sensitivity analysis or validation experiment to support that this is indeed the optimal setting.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The novelty and the clarity of presentation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
This paper proposes a practical plug-and-play module that can be seamlessly integrated into various backbone networks for text-guided image segmentation. The method is well-designed and shows novelty. The authors have addressed reviewers’ concerns with detailed responses and additional experimental results, demonstrating the module’s generalizability and effectiveness. Although some limitations exist regarding dataset scale and certain details, the overall contribution meets the conference standards. I recommend acceptance.
Author Feedback
We thank the reviewers for their constructive feedback. We appreciate the positive comments on our important problem focus & plug-and-play design (R1), well-motivated method & comprehensive experiments (R2), and novel formulation with practical compatibility (R3). Our responses follow:
R1.1: Novelty and Distinction from Prior Works. Thanks for highlighting the relevant works. We propose a versatile plug-and-play module, which is distinct from traditional end-to-end referring segmentation frameworks (e.g., LSMS). Its core innovation is seamless, architecture-agnostic integration into existing backbones (e.g., U-Net) without requiring architectural changes, enabling text-guidance for unimodal models and enhancing multi-modal pipelines. Discussion will be added into the revision.
R1.2: Comparisons on more network backbones and text-driven methods. We have demonstrated our module’s generalizability across 7 representative unimodal backbones, including CNNs (e.g., UNet) and hybrid designs (e.g., SwinUNet), as well as three text-driven models (e.g., ASDA). The suggested architectures (nnU-Net, TransUNet, and Liu’s CLIP-driven method) align with our tested variants, being either CNN (UNet) or hybrid CNN-Transformer (SwinUNet) backbones or text-driven methods (ASDA). While conducting experiments on an even broader range of architectures would be ideal, space constraints limit our ability to include more results.
R2.1 & R2.2 & R3.2a: Dataset Details. The primary validation was conducted on the publicly available QaTa-COV19 dataset, where extensive experiments were performed (Sec. 3.1). The train/test split details are in Sec. 3.1, Pg. 6, following Li’s work (cited as [12] in the paper). The private 3D CT dataset (143 scans; 23 test) demonstrated 3D generalization. While larger test sets are ideal, these 23 cases provided meaningful initial 3D validation, complementing the 2D results. We are actively communicating with the data providers regarding a data sharing agreement for this private CT dataset.
R2.3: Metric Usage. Dice and IoU are the most widely recognized metrics for segmentation tasks, commonly used in previous work like UNet, SwinUNet and LViT. The studied segmentation task often involves disconnected components (a segmentation class often appearing as multiple, spatially distinct regions), which present challenges for boundary-based metrics due to the need for additional component-matching processes. We leave the evaluation on component-level boundary metrics as future work.
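For reference, the two reported overlap metrics can be computed for binary masks as in the generic sketch below; this is standard usage, not code from the paper.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Dice coefficient and IoU for binary segmentation masks (0/1 or boolean arrays)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou
```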
R3.1: Method Details: BERT & QKV. a) For text embedding (Sec. 2.1), we used a pre-trained BERT-base-uncased model, with its parameters unfrozen and fine-tuned end-to-end. b) In our M_ROIE module (Sec. 2.1), Q, K, V representations are generated from image/text features using learnable linear layers.
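Based on this description, a minimal PyTorch sketch of such projections might look as follows. Treating image features as queries and text embeddings (e.g., from the fine-tuned BERT-base-uncased encoder) as keys/values, along with all dimensions, are assumptions for illustration rather than the paper's confirmed design.

```python
import torch
import torch.nn as nn

class RegionTextCrossAttention(nn.Module):
    """Illustrative cross-attention with learnable linear Q/K/V projections."""
    def __init__(self, img_dim=256, txt_dim=768, attn_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(img_dim, attn_dim)   # image features -> Q (assumed)
        self.k_proj = nn.Linear(txt_dim, attn_dim)   # text embeddings -> K (assumed)
        self.v_proj = nn.Linear(txt_dim, attn_dim)   # text embeddings -> V (assumed)
        self.scale = attn_dim ** -0.5

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_pixels, img_dim); txt_feats: (B, N_tokens, txt_dim)
        q = self.q_proj(img_feats)
        k = self.k_proj(txt_feats)
        v = self.v_proj(txt_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # text-informed image features, shape (B, N_pixels, attn_dim)
```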
R3.2b: Performance on Multimodal Backbones & Design Rationale. We thank the reviewer for the constructive comments. Our MAR/TCA are flexible. As noted (Sec. 3.2), TCA’s strong text augmentation, which is vital for unimodal models as it enforces text-image consistency (with MAR), can slightly degrade performance on some multimodal models by conflicting with their inherent alignment mechanisms. Thus, we recommend MAR+TCA for unimodal, primarily MAR for multimodal. Our approach’s value is its plug-and-play simplicity and competitive efficacy. Lightweight MAR/TCA modules enhance existing models without complex redesign, unlike many SOTA multimodal systems. Importantly, MAR+TCA with simple unimodal backbones (e.g., U-Net) rivals or exceeds complex SOTA multimodal methods (Table 2). This efficiency justifies the TCA trade-off with advanced multimodal systems, suiting our goal to empower diverse architectures.
R3.2c: Justification of Loss Weight λ. We set λ to 0.5 based on preliminary validation over λ ∈ {0.1, 0.5, 0.9}, which balanced the Dice/CE losses and yielded strong performance. We will add more details regarding this in the revision.
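Assuming the weighted objective takes the common convex combination of Dice and cross-entropy losses (the exact formulation is defined in the paper), λ = 0.5 would enter as in this sketch:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, lam=0.5, eps=1e-7):
    """Assumed form: L = lam * L_Dice + (1 - lam) * L_CE for binary segmentation.
    `logits` and `target` are float tensors of the same shape; `target` holds 0/1 labels."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice_loss = 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    ce_loss = F.binary_cross_entropy_with_logits(logits, target)
    return lam * dice_loss + (1.0 - lam) * ce_loss
```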
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The paper receives mixed reviews (1 negative and 2 positives). I feel the paper can be accepted considering its plug-and-play module design and the experimental results. I recommend acceptance.