Abstract
Deep learning has achieved great success in medical image segmentation and computer-aided diagnosis, with many advanced methods reaching state-of-the-art performance in brain tumor segmentation from MRI. While studies in other medical domains show that integrating textual reports with images can enhance segmentation, there is no comprehensive brain tumor dataset pairing radiological images with textual annotations. This gap has limited the development of multimodal approaches. To address this, we introduce TextBraTS, the first publicly available, volume-level multimodal dataset with paired MRI volumes and textual annotations, derived from the BraTS2020 benchmark. Based on this dataset, we propose a baseline framework and a sequential cross-attention method for text-guided volumetric segmentation. Extensive experiments with various text-image fusion strategies and templated text demonstrate clear improvements in segmentation accuracy and provide insights into effective multimodal integration. The dataset and model are available at https://github.com/Jupitern52/TextBraTS.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2164_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/Jupitern52/TextBraTS
Link to the Dataset(s)
TextBraTS dataset: https://huggingface.co/datasets/Jupitern52/TextBraTS
BraTS20 dataset: https://www.med.upenn.edu/cbica/brats2020/data.html
BibTex
@InProceedings{ShiXia_TextBraTS_MICCAI2025,
author = { Shi, Xiaoyu and Jain, Rahul Kumar and Li, Yinhao and Hou, Ruibo and Cheng, Jingliang and Bai, Jie and Zhao, Guohua and Lin, Lanfen and Xu, Rui and Chen, Yen-Wei},
title = { { TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {650 -- 660}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper presents a dataset that enhances the BraTS data with textual descriptions, and leverages text guidance to improve the segmentation of brain tumors.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Combining textual information with radiology images promises to improve the quality of image analyses and enables bringing clinical information into the image processing pipeline.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper explores an important direction in brain tumor segmentation; however, several points need further clarification.
What prompt is used to generate the textual descriptions, and is there any analysis of the consistency of the descriptions generated in response to the prompts?
How many radiologists reviewed the reports, and was each report evaluated by multiple radiologists?
In the ‘data analysis’ section, a clearer description of the 3D spatial coordinates used would help readers understand the process.
In Section 3, what is meant by foundational strategies?
What is the main contribution of the pipeline presented in Figure 4?
The results presented in Table 1 should be updated, since the current SOTA on BraTS adult glioma segmentation is higher than what is reported in the table.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The technical contributions need more work; in particular, the performance metrics show only marginal improvements over the compared methods. The use of textual reports in their current format needs more justification, as it is currently hard to evaluate the correctness of the text descriptions generated using GPT.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
My concerns are addressed.
Review #2
- Please describe the contribution of the paper
The main contribution of this work is the development of a paired text-image dataset, generated using GPT-4, for advancing multi-modal brain tumor analysis.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This paper is easy to understand.
2. The pipeline, including both the methodology and data creation process, is clearly presented.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- From a technical perspective, the proposed bidirectional cross-attention mechanism is not novel to the multi-modal community, as similar designs have already been introduced by [A] and [B].
- Despite the necessity of generating reports to enhance visual understanding, the final results do not seem to sufficiently support the importance of the generated text data, as only approximately 1% Dice improvement is observed compared to competing methods.
- The proposed text-guided segmentation model requires further evaluation. For instance, there are no comparisons with text-enhanced approaches, and no ablation studies are provided to assess the feature fusion specifically at the bottleneck.
- The authors should provide more insights into how text data enhances segmentation outcomes. Since the extracted knowledge comes from distinct domains, how do they contribute to one another?
[A] Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers, NeurIPS 2024.
[B] Bidirectional correlation-driven inter-frame interaction transformer for referring video object segmentation. Pattern Recognition 2024.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Apart from the effort invested in constructing a paired text-image dataset, the overall evaluation and technical contributions of the work appear to be rather limited.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I thank the authors for their response. I am inclined to reject this paper for the following reasons: (1) If the technical contribution includes the use of textual data, the experimental comparisons should involve text-enhanced segmentation approaches; the current comparisons are insufficient and incomplete in this regard. (2) The proposed SeqCA module, which first uses textual features to enhance visual features and then refines only the enhanced visual features, does not represent a novel design. Similar ideas have already been explored in prior work, such as “Ariadne’s Thread: Using Text Prompts to Improve Segmentation of Infected Areas from Chest X-ray Images.”
Review #3
- Please describe the contribution of the paper
This paper presents a transformer-based approach for brain MRI segmentation that incorporates volume-level textual reports into the segmentation network. Specifically, the method employs a Swin UNETR architecture to segment tumor regions across the four MRI modalities provided by the BraTS dataset. A fusion block within the latent space integrates text-derived features—extracted using a frozen BioBERT model—with image-based features to guide the segmentation process and enhance accuracy. An additional contribution of this work is the authors’ commitment to sharing the dataset, including the generated textual reports for each volume, to support further research in this area.
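To make the described wiring concrete, here is a minimal sketch of how such a text-guided segmentation pipeline could be assembled; the BioBERT model id, tensor shapes, and the component modules passed in (image encoder, fusion block, decoder) are illustrative assumptions, not the authors' released implementation.

```python
# Schematic sketch (not the authors' code) of the pipeline described above:
# a frozen BioBERT encodes the volume-level report, a Swin-UNETR-style 3D
# encoder produces bottleneck features, a fusion block combines the two, and
# the decoder predicts tumor masks. Model id and dims are assumptions.
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class TextGuidedSegmenter(nn.Module):
    def __init__(self, image_encoder, fusion_block, decoder,
                 text_model="dmis-lab/biobert-v1.1"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_model)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        for p in self.text_encoder.parameters():   # BioBERT stays frozen
            p.requires_grad = False
        self.image_encoder = image_encoder   # e.g., a Swin UNETR encoder over 4-channel MRI
        self.fusion_block = fusion_block     # e.g., a cross-attention fusion at the bottleneck
        self.decoder = decoder               # upsampling path producing tumor masks

    def forward(self, volume, reports):
        tokens = self.tokenizer(reports, padding=True, truncation=True,
                                return_tensors="pt").to(volume.device)
        text_feats = self.text_encoder(**tokens).last_hidden_state  # (B, L, 768)
        img_feats, skips = self.image_encoder(volume)                # bottleneck + skip features
        fused = self.fusion_block(img_feats, text_feats)             # latent-space text-image fusion
        return self.decoder(fused, skips)                            # segmentation logits
```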
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper introduces an exciting and forward-thinking intuition that steers the medical image processing field toward prompt-based solutions—an approach that aligns well with real-world clinical practices where medical reports play a crucial role. The proposed method is both simple and effective, allowing for clear understanding and easy reproducibility. Furthermore, the release of the dataset, including the generated volume-level reports, adds significant value by establishing a strong baseline for future research in this direction. The performance validation is particularly robust, demonstrated through comparisons with state-of-the-art medical image segmentation models such as Swin UNETR (without text report integration) and nnUNet. Finally, a comprehensive ablation study is conducted to evaluate the contribution of each component introduced in the framework, further reinforcing the soundness of the proposed approach.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
As highlighted in the results, report templating is essential; however, this may be seen as a limitation due to the lack of standardized templates in real-world clinical applications for medical reports. Further work in this area is needed to promote report structuring and standardization on a global scale.
Additionally, the reports are initially generated by ChatGPT based solely on the volume inspection, which could introduce potential bias. This is especially relevant when considering the review process conducted by radiologists. ChatGPT serves as a single report generator, while, in practice, reports are produced by numerous sources with varying styles, which may influence the consistency and reliability of the generated reports.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper presents a novel and relevant approach by integrating volume-level text reports into a transformer-based segmentation network, aligning well with real-world clinical practices. The method is simple, effective, and well-validated against SOTA baselines, with a thorough ablation study supporting its contributions. The release of the dataset, including synthetic reports, adds value for future research. While the reliance on templated, ChatGPT-generated reports introduces some limitations, these are acknowledged and offer avenues for improvement. Overall, the paper is clear, impactful, and reproducible. I recommend acceptance.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The explanation from the authors about the use of ChatGPT for report generation, and its subsequent validation and inspection by radiologists, sufficiently clarifies my only concern about the adoption of the LLM for the initial step of report generation, as mentioned earlier in the review.
Author Feedback
We sincerely thank all reviewers (R1, R2, R3) for their constructive feedback. We appreciate the recognition of our contribution in releasing the first open, volume-level text–image paired dataset and proposing a text-guided 3D brain tumor segmentation framework. Below we address the main concerns:
(1) Novelty of Framework and Fusion Design (R1, R2) To our knowledge, this is the first approach that integrates text descriptions with 3D medical images for brain glioma segmentation (R2-W4). Our main contribution lies in the segmentation framework and fusion strategy, not just in applying cross-attention (Fig. 4, R2-W5). Unlike R1-[A][B], which apply parallel cross-attention by swapping Q/KV roles, our method introduces a sequential two-stage mechanism: text features act as queries (Q) and image features as keys/values (KV) to guide semantic alignment. The refined image features are fused with the originals through a second cross-attention, enabling more context-aware integration. This approach addresses the semantic gap between free-text and 3D volumetric data, which is more challenging than 2D vision-language tasks. As shown in Table 3, this sequential design improves multimodal segmentation performance (R1-W1). Our templated-text ablation (Table 2) shows that different textual content types benefit different segmentation aspects: e.g., location-only text improves anatomical localization (higher whole tumor Dice), while descriptive text improves boundary precision (lower HD95). Attention visualizations (Fig. 5) further confirm enhanced model focus after text fusion (R1-W4). In the final version, we will provide a comparison showing the superiority of our proposed method over the existing bidirectional approaches [A, B]. We will also rename our method as Sequential Cross-Attention (SeqCA).
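To make the sequential two-stage mechanism concrete, below is a minimal PyTorch sketch of how such a sequential cross-attention (SeqCA) fusion block could be implemented; the dimensions, module names, normalization, and residual connection are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of a sequential two-stage cross-attention fusion, assuming
# flattened bottleneck image tokens and report embeddings of matching width.
import torch
import torch.nn as nn

class SeqCAFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Stage 1: text queries attend to image keys/values (semantic alignment).
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: original image tokens attend to the text-aligned features.
        self.image_refine = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim) flattened bottleneck voxels (e.g., 4^3 tokens)
        # text_tokens:  (B, N_txt, dim) report embeddings (e.g., from a frozen BioBERT)
        # Stage 1: Q = text, K/V = image -> image content aligned to the report.
        aligned, _ = self.text_to_image(query=text_tokens,
                                        key=image_tokens,
                                        value=image_tokens)
        aligned = self.norm1(aligned)
        # Stage 2: Q = original image tokens, K/V = aligned features,
        # fusing the refined context back into the volumetric representation.
        fused, _ = self.image_refine(query=image_tokens,
                                     key=aligned,
                                     value=aligned)
        return self.norm2(fused + image_tokens)  # residual keeps the original image information
```

In this reading, stage one aligns image content to the report (text as Q, image as K/V), and stage two injects that aligned context back into the original bottleneck tokens before decoding.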
(2) Dataset Creation and Text Report Consistency Validation (R2, R3) All reports were created by human annotation using BraTS20 MRI scans and their corresponding segmentation masks. Each case was annotated by two radiologists, and discrepancies were adjudicated by a third expert (R2-W2). GPT-4 was only used to generate initial templates for efficiency; all final reports were reviewed and edited by medical experts. To ensure consistency, we implemented an automatic quality control system: each report was validated against predefined templates and keyword sets, and regenerated if inconsistencies were detected. Due to space constraints, we included only partial prompt examples; the full version will appear in both the released dataset and camera-ready appendix (R2-W1, R3-W1). Regarding spatial references, radiologists used standard anatomical terms (e.g., “left frontal lobe”) rather than absolute coordinates, following clinical conventions that account for anatomical variability. Our report structure follows this practice, as shown in Fig. 3 (R2-W3).
(3) Performance Justification and Experimental Settings (R1, R2) In glioma segmentation, year-over-year gains in top methods are typically within 0.5–1%. Our method outperforms [24] by over 1% Dice, with a p-value of 0.0077 across 10 runs (Table 1), demonstrating statistical and practical significance. We also used a more challenging data split (train/val/test = 220/55/94) than [24] (315/17/37), highlighting the model’s generalization (R1-W2). Our comparisons include several strong, recent baselines with publicly available code. Some newer SOTA methods were excluded due to the lack of official implementations (R2-W6). We tested several 3D text–image fusion strategies versus a pure image encoder (Tables 1 and 3). Our fusion is applied at the bottleneck layer, where spatial resolution is reduced (e.g., 4³ vs 64³), making low-level fusion impractical. Early- and mid-level fusion variants were also explored but showed inferior results, so were omitted from the main paper (R1-W3).
We thank the reviewers again and will address minor comments in the final revision to improve clarity and presentation.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This is an interesting yet controversial work, on which the reviewers diverge in their opinions; one reviewer actually reversed their recommendation from reject to accept. The strengths and weaknesses are in general very clear.
Strengths
- This is indeed the first work that provides brain tumor imaging with radiology reports, which can be of great interest to the community.
- The authors have provided a simple method that shows the benefit of adding textual information.
Weaknesses
- Although the reports have been reviewed and edited by experts, I am still concerned about the potential distribution shift between these synthetic reports and real reports in hospitals. As far as I understand, real reports in hospitals are still generated purely manually, without the use of LLMs. My experience is that LLM-generated reports can have very different styles from real reports written by radiologists, and it is difficult to restyle the reports simply by editing. Therefore, the significance of the new dataset is weakened if the provided reports do not faithfully reflect real reports. Why not directly ask radiologists to write the reports?
- Although a method is developed for image-text segmentation, its necessity and significance remain unclear without comparison with other segmentation methods based on images and text, for example LViT. To show that the new method described in the paper is really a contribution, it should be compared against existing vision-language segmentation models.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
After the rebuttal stage, two reviewers suggest accepting the paper and one insists on rejecting it. Despite some drawbacks, I still suggest accepting it. This paper is of high value because it addresses an important problem: integrating conventional semantic segmentation approaches with medical reports for real-world clinical workflows.