Abstract
Deep learning has shown potential to enable automated personalized cancer treatment by automating radiotherapy treatment (RT) planning. However, generalizing RT planning across multiple protocols with deep learning remains a critical challenge due to the diversity of clinical requirements. This paper introduces Treat: a unified Text-guided Radiotherapy model for dose prEdiction in Automated Treatment planning, to address these complexities. By leveraging conditional text embeddings from the CLIP text encoder, the model dynamically adapts to protocol-specific requirements, enabling the generation of high-quality per-protocol dose distributions. We propose an efficient text-conditioning method, graph prompts pooling (GPP), to effectively leverage multiple protocol-specific prompts, and dynamic batch weighting to balance model training across multiple datasets. We validated Treat on five datasets (two early-stage prostate, left and right partial breast, and head-and-neck) using clinically relevant metrics: the mean absolute error (MAE) of the homogeneity index (HI) and of the dose-volume histogram (DVH). Compared to the protocol-specific model, which attains an MAE-HI of 0.274 and an MAE-DVH of 7.46, Treat achieves superior performance of 0.062 and 2.87 on the MAE-HI and MAE-DVH scores, respectively. Compared to the baseline one-hot conditioning, with an MAE-HI of 0.085 and an MAE-DVH of 3.35, GPP demonstrates its efficiency in adapting prompt-based conditioning for predicting dose distributions across diverse protocols. The code is available at: https://github.com/mcintoshML/TextGuided_RT.
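As a rough illustration of the conditioning idea, here is a minimal Python sketch (not the authors' implementation): several protocol-specific prompts are encoded with a CLIP text encoder and fused into a single conditioning vector. Plain attention pooling stands in for the paper's graph prompts pooling, and the prompt wording is our own illustration.
```python
# Minimal sketch (not the authors' code) of the conditioning idea:
# encode several protocol-specific prompts with a CLIP text encoder, then
# fuse them into one conditioning vector. Plain attention pooling stands in
# for the paper's graph prompts pooling (GPP); prompts are illustrative only.
import torch
import clip  # https://github.com/openai/CLIP

model, _ = clip.load("ViT-B/32", device="cpu")
prompts = [
    "Radiotherapy for the prostate, 60 Gy in 20 fractions.",
    "Prostate radiotherapy plan, prescription dose 60 Gy, 20 fractions.",
]
with torch.no_grad():
    emb = model.encode_text(clip.tokenize(prompts)).float()  # (P, 512)

# Score each prompt, normalize over the prompt set, take the weighted sum.
scorer = torch.nn.Linear(emb.shape[-1], 1)
weights = torch.softmax(scorer(emb), dim=0)                  # (P, 1)
condition = (weights * emb).sum(dim=0)                       # (512,)
```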
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0197_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/mcintoshML/TextGuided_RT
Link to the Dataset(s)
OpenKBP: https://github.com/ababier/open-kbp
BibTex
@InProceedings{KimSan_Treat_MICCAI2025,
author = { Kim, Sangwook and Gao, Yuan and Purdie, Thomas G. and McIntosh, Chris},
title = { { Treat: A Unified Text-guided Conditioned Deep Learning Model for Generalized Radiotherapy Treatment Planning } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {617 -- 627}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents TREAT, a unified deep learning model designed for radiotherapy dose prediction. TREAT leverages text-guided conditioning using CLIP text embeddings, enabling effective generalization across multiple treatment protocols.
The primary methodological contributions include:
- Graph Prompts Pooling (GPP): A novel pooling method to dynamically integrate protocol-specific text embeddings for robust conditioning.
- Dynamic Batch Weighting (DBW): An adaptive loss-weighting strategy for balanced training across multiple heterogeneous datasets (a sketch follows the contribution summary below).
The authors validate the proposed approach on five distinct radiotherapy datasets (prostate, breast, head-and-neck cancers), demonstrating improved performance and generalizability compared to both baseline and protocol-specific models.
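The DBW component can be pictured with a short, hedged sketch; the paper's exact weighting rule is not reproduced here, so the inverse-frequency scheme below is only one plausible instantiation.
```python
# Hedged sketch of one plausible dynamic batch-weighting scheme: rescale each
# sample's loss by the inverse frequency of its source dataset in the batch,
# so every dataset contributes equally regardless of sampling imbalance.
import torch

def dynamic_batch_weighted_loss(per_sample_loss: torch.Tensor,
                                dataset_ids: torch.Tensor) -> torch.Tensor:
    """per_sample_loss: (B,) losses; dataset_ids: (B,) integer dataset labels."""
    ids, counts = dataset_ids.unique(return_counts=True)
    weights = torch.zeros_like(per_sample_loss)
    for i, c in zip(ids, counts):
        # Each dataset gets total weight 1/num_datasets, split over its samples.
        weights[dataset_ids == i] = 1.0 / (c.float() * len(ids))
    return (weights * per_sample_loss).sum()  # weights sum to 1 across the batch
```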
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Validation on Multiple Datasets: The proposed method is evaluated across five diverse radiotherapy datasets (prostate high-dose, prostate low-dose, left/right partial breast, head-and-neck), demonstrating performance in varying clinical contexts.
- Comprehensive Ablation Studies: The authors systematically investigate the contributions of individual components (e.g., CLIP embeddings, GPP, DBW), illustrating the effectiveness of each methodological component.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited Clinical Practicality (High Labeling Cost): Unlike the approach of Oh et al. [26], which effectively utilizes CT images and simple textual prompts, the proposed method (TREAT) requires detailed manual labeling, including precise segmentation of both planning target volumes (PTV) and organs at risk (OAR). This significantly increases labeling costs and limits clinical feasibility.
- Unclear Input-Output Relationship: The paper does not sufficiently clarify the difference between the provided target volume (TV) as input and the predicted dose distribution as output. This ambiguity makes the clinical relevance and practical implications of the method unclear.
- Methodological Clarifications Needed:
  - Training/validation/test splits vary inconsistently across datasets, raising concerns about the validity and generalizability of the reported results.
  - The reason for calculating the loss exclusively within ROIs is inadequately justified. It remains unclear why non-ROI regions were entirely excluded from the loss computation. If areas outside the ROI have incorrectly high predicted dose values, would additional post-processing be required in clinical scenarios?
  - Experiments that examine the performance of the method without ROI inputs (using purely CT-based predictions) are missing. It is unclear whether the baseline protocol-specific models used for comparison were equally conditioned on ROIs.
- Questionable Protocol Integration and Performance Claims:
  - The claim that TREAT effectively integrates inter-protocol information lacks robust experimental validation. The significant performance degradation observed upon removal of critical textual information (Table 4) implies heavy reliance on exact prompts rather than generalized semantic learning.
  - Residual U-Net 3D outperforms TREAT when critical textual components are removed, weakening the claim that textual conditioning offers substantial advantages.
- Incomplete Experimental Descriptions:
  - The specific composition and structure of textual prompts used for conditioning are insufficiently described.
  - The implementation details of “random pooling,” which surprisingly outperforms some intuitive methods (excluding TREAT) in Table 3, are inadequately explained, hindering reproducibility and understanding.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
In Table 3 (Experiment A1), the notation “N_C (No Condition)” appears misleading. It seems that only the first row truly represents the “No Condition” scenario, while subsequent rows (N_C+CLIP, N_C+CLIP+GPP, N_C+CLIP+GPP+DBW) incorporate conditioning elements (e.g., CLIP embeddings). Please clarify this notation to prevent confusion and improve readability.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The primary reasons for rejection are:
- Insufficient Practical Feasibility: The high labeling cost (ROI requirements) significantly limits practical clinical adoption compared to simpler methods.
- Weak Experimental Validation of Inter-protocol Generalization: The heavy reliance on explicitly critical textual input weakens the claim of robust inter-protocol generalization, a central contribution of the paper.
- Critical Methodological and Presentation Issues: Important aspects of the methodology (prompt construction, training details, ROI-only loss rationale) remain unclear, undermining reproducibility and trustworthiness of the findings.
While the proposed method presents some novel ideas, these significant weaknesses, particularly concerning practical deployment and experimental clarity, strongly support rejection at this stage.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ rebuttal sufficiently addresses most of my previous concerns. I recommend explicitly incorporating these clarifications into the manuscript and releasing the code to ensure reproducibility.
Review #2
- Please describe the contribution of the paper
The authors propose a novel approach for dose prediction across protocols (unified model). The approach uses graph pooling to fuse CLIP-based prompt embeddings and a SwinUNet backbone.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is really nicely presented; very clear and well explained. I enjoyed reading it, and the experimental design is very insightful. The technical novelty might be seen as low for MICCAI, but I mostly praise the new knowledge the authors presented:
- A unified model can outperform protocol-specific models on average
- Graph pooling can effectively fuse multiple prompts
- CLIP can outperform other bio-oriented models
- The comparison of backbones is very useful
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Remarks: I would have liked to see the protocol-specific results instead of the average of models vs. the unified model. In practice, one might decide on the multiple-models path. Here, my concern is whether one protocol-specific model might worsen the average, whereas other models could have outperformed TREAT as a unified model. From the values, it doesn’t seem likely to be the case, but I ask the authors to comment on what happened on each task. That CLIP outperformed other bio-inspired text encoders is surprising and counterintuitive indeed. I suggest the authors check whether that’s an artefact. If not, it would be a nice source of potential new knowledge for our community.
There are other dose prediction models based on diffusion that the authors could have added. These models do not use text (yet) and are protocol-specific, but from a pure SOTA point of view it would be interesting to know whether the community profits from going in that direction. The ultimate question here is whether a unified or a specific model is the way to go. Please comment.
One example: https://ieeexplore.ieee.org/document/10486983
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is really good; it just misses more evidence on whether a unified model is the way to go versus protocol-specific models. On this point, the authors did not compare to SOTA dose prediction models using diffusion.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The main contribution of this paper is a meta-architecture combining text and vision inputs for dose prediction in radiotherapy.
The meta-architecture comprises several components that contribute to the performance of the model: 1) self-attention graph pooling for determining optimal text representations based on a collection of prompts; and 2) a condition block in the encoder-decoder to support the incorporation of textual context (a generic illustration follows below).
Text-conditioned/driven modeling has rarely been explored in radiation therapy dose prediction and has the potential to be impactful.
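For readers unfamiliar with conditioning blocks, the sketch below shows one generic way a text embedding can modulate encoder-decoder features (FiLM-style scale-and-shift). The paper's actual condition block may be structured differently, and all names here are hypothetical.
```python
# Generic illustration of a condition block (FiLM-style scale-and-shift);
# the paper's actual block may differ, and all names are hypothetical.
import torch
import torch.nn as nn

class ConditionBlock(nn.Module):
    """Modulate volumetric features with a pooled text embedding."""
    def __init__(self, feat_channels: int, text_dim: int = 512):
        super().__init__()
        # Predict per-channel scale and shift from the text embedding.
        self.to_scale_shift = nn.Linear(text_dim, 2 * feat_channels)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, D, H, W) features; text_emb: (B, text_dim)
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)  # (B, C) each
        scale = scale[:, :, None, None, None]  # broadcast over D, H, W
        shift = shift[:, :, None, None, None]
        return feats * (1 + scale) + shift

block = ConditionBlock(feat_channels=64)
out = block(torch.randn(2, 64, 8, 16, 16), torch.randn(2, 512))  # (2, 64, 8, 16, 16)
```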
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Congratulations to the authors on this study.
Overall, the study uses well-formulated methodological reasoning to develop a system for dose prediction in radiation therapy use-cases. The novelty lies in the combination of text, input CT volume and ROI labels in predicting the dose.
The study will likely be impactful due to:
- A sufficient number of patients across 5 different RT datasets covering diverse anatomies.
- Appropriate use of methodology, such as the incorporation of SAGPool for selecting an optimal prompt and balanced loss weighting to accommodate different datasets.
- Multiple different metrics are used for comparison between methods.
- Appropriate ablations are performed, justifying the choices made in the modeling process.
- Promising results are obtained over baselines.
- Figures are clear and communicate the intent of the study and methodology design efficiently.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Despite the promising results provided by the paper, there are places where more information is left to be desired:
- While Fig. 1 shows some samples of the text prompts, it is difficult to understand the full range of text prompts without a few more examples. What exactly do these text prompts contain? Do they contain information such as IMRT/VMAT? Is it just the site and dose level that forms the text description of the protocol? I strongly recommend adding more examples and a structured description/subsection outlining this, as this is the main novelty of this work.
- If the text prompts are indeed variations of the site and dose level to the target, then the text modeling seems quite limited in incorporating relevant information. I recommend the authors add a section in the discussion about how not just patient metadata but also radiation therapy knowledge bases could be incorporated to assist better with conditioning.
- In terms of metrics, how is MAE-DVH calculated? How exactly is the predicted dose DVH compared to the ground-truth DVH? A small section on how this is computed would be recommended, unless an alternate citation can be furnished.
- All the results presented seem to be aggregated across datasets. While this is useful to show the unified-model paradigm, I would at least expect a table/plot showing the breakdown of the results per site. This would help inform under what sites and protocols the method works better.
- In Table 2, for protocol-specific models, SwinUNET3D is not the best performing, and is in fact one of the worst performers when it comes to HI and DVH. Why was this chosen as the backbone for TREAT? It would have been more appropriate to choose, for instance, a simple U-Net3D or Attention U-Net3D.
- For the condition block, is there a reason why cross-attention was not performed between the text and the image? It seems like cross-attention only comes into the picture with the ROI inputs.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The submission provides descriptions about the methodology used, however, without the actual code, reproducing results would take significantly more effort in adoption of their work. I recommend authors to open-source the full pipeline code as this would boost their work.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I’d request the authors to furnish more information about the text-prompts used in the study and an improved discussion about what other text data could be promising in this sort of modeling, given the insights the authors observed from this study. Currently, it seems to me that the paper uses the text as a slightly improved alternative to one-hot encoding with better relationships between prompts but the text could be used for so much more.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Authors have addressed most of my comments as possible within the context of the rebuttal
Author Feedback
We thank all reviewers for recognizing the strengths: clear presentation (R1, R3); extensive validation on multiple datasets and ablations (all); effective use of GPP for prompt integration (R1, R3) and balanced loss strategies (R3); and the promising potential of unified text-guided models for radiotherapy (R1, R3). We have thematically grouped the responses below. All citations here are from the main manuscript.
(R2: Input-output, splits, loss) The model takes CT volumes and ROI masks as inputs and predicts radiotherapy treatment dose distributions as output. We apologize for the lack of clarity and have revised the relevant figures accordingly. Dataset splits followed [14] for Prostate high and [2] for OpenKBP; the others were split 7:1:2 (train:val:test). All baselines used the same ROI conditioning. For random pooling, we sampled one prompt per case during training to expose the model to all prompts. During testing, a single representative prompt was used per dataset. These have been clarified in the manuscript.
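As described above, "random pooling" amounts to sampling one prompt per case during training and fixing a representative prompt at test time; a minimal sketch (with our own naming) is:
```python
# Sketch of the "random pooling" baseline as described in the rebuttal:
# sample one prompt per case during training; use a fixed representative
# prompt at test time. Variable names are ours.
import random

def pick_prompt(protocol_prompts, training: bool, representative_idx: int = 0):
    if training:
        return random.choice(protocol_prompts)   # expose the model to all prompts
    return protocol_prompts[representative_idx]  # one fixed prompt per dataset
```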
(R2, R3: Unclear prompts) Apologies for the lack of clarity on the prompts. We have included structured examples and discussed their design in the revised manuscript. Each prompt includes the treatment site, prescribed dose, and number of fractions.
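For concreteness, prompts following that structure might look like the illustrative examples below (wording and dose values are ours, not taken from the paper):
```python
# Illustrative prompts following the stated structure (treatment site,
# prescribed dose, number of fractions); wording and values are ours.
prompts = [
    "Radiotherapy for the prostate, prescribed 60 Gy in 20 fractions.",
    "Radiotherapy for the left partial breast, prescribed 40 Gy in 15 fractions.",
    "Radiotherapy for the head and neck, prescribed 70 Gy in 35 fractions.",
]
```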
(R3: MAE-DVH) MAE-DVH was computed following [2], which defines ROI-specific DVH metrics according to clinical protocol guidelines.
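In the spirit of the OpenKBP DVH score [2], a hedged sketch of such a metric: evaluate clinically relevant DVH points on the predicted and reference dose within an ROI and average the absolute differences. The paper's exact metric set may differ.
```python
# Hedged sketch in the spirit of the OpenKBP DVH score [2]: evaluate clinically
# relevant DVH points on predicted and reference dose within an ROI, then take
# the mean absolute difference. The paper's exact metric set may differ.
import numpy as np

def dvh_points(dose: np.ndarray, roi_mask: np.ndarray, ps=(99, 95, 1)) -> dict:
    """D_p: the dose received by at least p% of the ROI volume."""
    roi_dose = dose[roi_mask]
    return {f"D{p}": np.percentile(roi_dose, 100 - p) for p in ps}

def mae_dvh(pred_dose, true_dose, roi_mask) -> float:
    pred, true = dvh_points(pred_dose, roi_mask), dvh_points(true_dose, roi_mask)
    return float(np.mean([abs(pred[k] - true[k]) for k in pred]))
```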
(R1, R3: Per-dataset Performance) We aggregated results for space efficiency, but now include per-dataset summary scores in the revised version. Experimentally, we found that TREAT outperformed baseline models on all datasets.
(R2: Prompt dependency) TREAT’s performance declines when critical components are removed, which is expected given the importance of semantic conditioning. However, this does not reflect reliance on exact tokens; instead, TREAT generalizes across varied prompt formulations, learning meaningful inter-protocol structure using GPP. We’ve clarified this distinction in the revised discussion.
(R2: Table 3 notation) We acknowledge that the label “N_C” may be misleading and have updated Table 3 for clarity.
(R3: SwinUNET3D choice, condition block) We found SwinUNET3D to integrate effectively with our GPP module, given its patch-wise architecture, which aligns well with ROI and prompt conditioning. Nonetheless, we agree this choice may not be optimal for all metrics and plan to explore U-Net variants and cross-attention conditioning in future work.
(R2, R3: Reproducibility) We appreciate the concerns about the lack of details and plan to release the code upon acceptance to ensure the reproducibility of the method.
(R1: CLIP, diffusion) As in [5], we observed that CLIP outperforms domain-specific encoders, likely due to its rich pre-training, and we will analyze this further. Diffusion models are orthogonal to our method and require significant computational resources, but they are of great interest; we are exploring unified diffusion models as future work.
(R2: Clinical feasibility) We respectfully clarify that our goal differs from Oh et al. [26], who aim to segment target volumes (TVs) using CT images and prompts. In contrast, TREAT leverages existing contours to predict dose, a crucial component of routine radiotherapy planning. Since segmenting OARs and PTVs is standard-of-care at all hospitals [22], we do not introduce additional labeling burden. Thus, our method is designed to be compatible with existing clinical workflows.
(R2: Loss function) Our ROI-focused loss promotes accurate dose escalation to target volumes and sparing of OARs, aligning with clinical goals. We agree that using non-ROI loss could further improve safety, and we’ve included this as future work.
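A minimal sketch of an ROI-focused loss of the kind described above, assuming a boolean union mask of all PTV/OAR contours (the paper may weight individual ROIs differently):
```python
# Minimal sketch of an ROI-focused loss as described above, assuming a boolean
# union mask over all PTV/OAR contours (the paper may weight ROIs differently).
import torch

def roi_masked_mae(pred_dose: torch.Tensor,
                   true_dose: torch.Tensor,
                   roi_union_mask: torch.Tensor) -> torch.Tensor:
    # Average the absolute dose error only inside contoured regions.
    return (pred_dose - true_dose).abs()[roi_union_mask].mean()
```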
(R3: Enriching prompts) We agree that integrating radiotherapy knowledge could enrich the conditioning space beyond descriptive prompts. We have revised the discussion section to include the potential of providing meaningful clinical context.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors’ rebuttal successfully addresses most of the reviewers’ concerns, and one negative reviewer improved their score. Recommend acceptance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper proposes a framework that combines vision and language input for dose prediction in radiotherapy, which has not been explored. All reviewers recommend “Accept”.