Abstract

Cancer remains a leading cause of death, highlighting the importance of effective radiotherapy (RT). Magnetic resonance-guided linear accelerators (MR-Linacs) enable imaging during RT, allowing for inter-fraction, and perhaps even intra-fraction, adjustments of treatment plans. However, achieving this requires fast and accurate dose calculations. While Monte Carlo simulations offer accuracy, they are computationally intensive. Deep learning frameworks show promise, yet lack uncertainty quantification crucial for high-risk applications like RT. Risk-controlling prediction sets (RCPS) offer model-agnostic uncertainty quantification with mathematical guarantees. However, we show that naive application of RCPS may lead to only certain subgroups such as the image background being risk-controlled. In this work, we extend RCPS to provide prediction intervals with coverage guarantees for multiple subgroups with unknown subgroup membership at test time. We evaluate our algorithm on real clinical planning volumes from five different anatomical regions and show that our novel subgroup RCPS (SG-RCPS) algorithm leads to prediction intervals that jointly control the risk for multiple subgroups. In particular, our method controls the risk of the crucial voxels along the radiation beam significantly better than conventional RCPS.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0603_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0603_supp.pdf

Link to the Code Repository

https://github.com/paulkogni/SG-RCPS

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Fis_SubgroupSpecific_MICCAI2024,
        author = { Fischer, Paul and Willms, Hannah and Schneider, Moritz and Thorwarth, Daniela and Muehlebach, Michael and Baumgartner, Christian F.},
        title = { { Subgroup-Specific Risk-Controlled Dose Estimation in Radiotherapy } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors develop a deep learning-based model for estimating radiation dose. They use the DeepDose model and extend risk-controlling prediction sets (RCPS) to handle uncertainty quantification. The model is trained on 125 patients covering 5 different anatomical entities, with dose prediction for lymph nodes used as an out-of-domain entity during testing. The authors compare their approach (SG-RCPS) to a baseline RCPS model and show improved performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors show better performance of their model over a plain RCPS-based approach.
    2. The task at hand, i.e. fast calculation of radiation dose, is an essential next step in the advancement of radiation therapy.
    3. The authors test model performance on an out-of-domain entity to demonstrate generalizability.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors motivate their work with MR imaging and adaptive radiation therapy. However, the developed model is then trained on CT and other input information that cannot be accessed from MR imaging on the fly. This invalidates the argument for a model that can adapt the radiation dose online.
    2. The paper is generally hard to follow. The definition of “risk” is only given in the Supplementary Materials but is used throughout the paper.
    3. Moreover, the paper lacks a clear distinction between the authors' novel contributions and work developed elsewhere. For example, as far as I understand it, Section 2.3 consists entirely of contributions from prior work, which is not made clear.
    4. To train the DeepDose network, input segments are split into patches. How are those patches combined during inference to obtain an image-level prediction?
    5. Only one quantitative measure, i.e. the fraction of ground-truth voxels contained in the uncertainty intervals, is provided. Other metrics should also be considered, such as mean squared error or dose-volume histograms.
    6. The authors motivate the development of the model by the need for fast calculation. However, no inference time measurements are provided.
    7. Minor: In Figure 3, the authors write that the example represents a head and neck tumor. However, this looks more like a liver tumor to me.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The dataset is private and the code is not provided. However, pseudocode is published.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Authors should restructure the paper and make novel contributions and work developed elsewhere more clear.
    2. Definition of the risk provided in the Supplemental Materials should be part of the main body.
    3. Please specify how patches are combined.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper needs revision to be more clear and better structured. Different quantitative measures are missing and no measure of inference time is provided.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Why MR motivation when CT is used? The envisioned MR-guided workflow would then involve: MR acquisition - registration of MR and planning CT - dose prediction - intervention. To be really applicable online, during radiation therapy all those steps have to be performed in real-time. However, as the authors write themselves, DL-based dose prediction alone will require several seconds of computation time. In this case it also does not matter that the DL model will be faster than MC simulations, because it simply has to be fast enough. Therefore, a motivation by MR-guided radiation therapy does not make sense.

    Dose estimation performance: Further quantitative measures and, as outlined by R4, bounds for uncertainty intervals are still missing.



Review #2

  • Please describe the contribution of the paper

    The authors have introduced a method to predict uncertainty intervals (lower and upper bounds) in knowledge based planning dose prediction tasks. Specifically, they extend the risk-controlling prediction sets (RCPS) framework developed for image-to-image regression to include subgroup calibration. The subgroups in their work are the radiation absorbed in the foreground of the beam vs the background (along with total image), with the foreground defined by the area of the image receiving >ground truth dose. They validate their approach on clinical data for 5 treatment target domains, and compare it to the unmodified RCPS approach. They show that in their modified approach, the uncertainty interval captures a greater fraction of the predictions, and that the unmodified RCPS approach is not sufficient in predicting uncertainty intervals.
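To make the subgroup evaluation described in this summary concrete, here is a minimal sketch of how the fraction of voxels inside a predicted interval could be computed separately for the total image, the foreground, and the background. The function name `subgroup_coverage`, the `fg_threshold` parameter, and the toy numbers are hypothetical; the paper's actual foreground definition and evaluation code are not reproduced here.

```python
import numpy as np

def subgroup_coverage(lower, upper, truth, fg_threshold):
    """Fraction of voxels whose true dose lies inside [lower, upper], per subgroup."""
    inside = (truth >= lower) & (truth <= upper)
    # Hypothetical foreground rule: voxels whose true dose exceeds a threshold.
    fg = truth > fg_threshold
    groups = {
        "total": np.ones_like(fg, dtype=bool),
        "foreground": fg,
        "background": ~fg,
    }
    return {
        name: float(inside[mask].mean()) if mask.any() else float("nan")
        for name, mask in groups.items()
    }

# Toy data: six background voxels predicted well, two beam voxels predicted poorly.
truth = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 6.0])
pred = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 7.5])
half_width = 0.5
cov = subgroup_coverage(pred - half_width, pred + half_width, truth, fg_threshold=1.0)
print(cov)
```

In this toy case the total coverage is 75% although no foreground voxel is covered, which is the kind of imbalance the summary attributes to unmodified RCPS.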

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work addresses an important need in the knowledge-based planning field. As dose prediction methods mature and approach clinical adoption, techniques that provide uncertainty estimates for the predictions are necessary. The paper is well organized with good logical flow. The figures are clear and easy to follow. The authors propose a hypothesis and test it on clinical data of a modest size (N=125). The math is clear, and the authors provide additional details in the supplementary material.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper revolves around the experimental results. The authors only provide quantitative evidence of the ratio of voxels that fall outside/within the predicted uncertainty interval. What are the actual ranges (values in Gy)? As mentioned in their work, the SG-RCPS method leads to wider intervals (Figure 3, Conclusion). If these intervals are overly conservative, then they may capture most (or all) of the predictions, but be too wide to be useful. Specifically, in Figure 3, it seems that around the right posterior border of the beam, the ground truth dose is ~5-6 (Gy?). The RCPS interval is ~1, and the SG-RCPS interval is ~3. If the SG-RCPS interval is 50% of the predicted value, what is its degree of utility? In contrast, in the center of the beam, posteriorly, the ground truth looks quite hot at >15, while the interval is quite small at ~1.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    1. Reference [1] provides a repository for the original RCPS work. It would be helpful to provide a fork of the repo with the author’s modifications for sub-group calibration.
    2. If possible, provide a reference to an open source implementation of DeepDose (or release the author’s implementation).
    3. Reproducibility would be enhanced if the authors clarified the treatment details (imaging acquisition, prescription, treatment planning software) used to create the clinical plans.

    [1] Angelopoulos, A.N., Kohli, A.P., Bates, S., Jordan, M., Malik, J., Alshaabi, T., Upadhyayula, S., Romano, Y.: Image-to-image regression with distribution-free uncertainty quantification and applications in imaging. In: International Conference on Machine Learning. pp. 717–730. PMLR (2022)

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Major comments

    1. Please report the lower and upper bounds of the uncertainty intervals for both RCPS and SG-RCPS. This context is crucial for the interpretation of the work. Since each pixel has its own lower and upper bound associated with it, a summary statistic or histogram may be necessary.
    2. Alternatively, for each pixel, this reviewer suggests computing a ratio of the interval to the ground truth dose, and reporting a summary statistic or histogram of this. Additionally, this reviewer would like to see the subgroup analysis for border pixels of the foreground vs. non border pixels, as the border is much more difficult to predict. Additionally, please comment on the significance of these results.
    3. Although the differences are very large and likely to be significant, a statistical test between the RCPS and SG-RCPS results would strengthen the results. However, this must be taken in context of the range of uncertainty intervals.
    4. The original RCPS algorithm performs poorly in the total image case for the prostate, liver and mamma sites. When considering the total image only, this is equivalent to no subgroups. What are the authors’ hypotheses for the limitation of RCPS in this case?

    Minor comments

    1. Please clarify the sentence “Should be >=90%” in the caption for Figure 2. This seems like an error, and the sentence actually belongs to the caption for Table 1.
    2. Please add units/label to the colorbar in Figure 3.
    3. Are the uncertainty intervals symmetrical (for both RCPS and SG-RCPS)? I.e. is DeepDose more likely to overestimate or underestimate the dose?
    4. In Equation 1 of the Supplementary Materials, what does the variable Z refer to, since this is the no sub-group case?

    Please note that 1,2,3 are not requests for new experiments, but additional reporting of the results of the already completed experiments.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented work is novel, and addresses an important problem in the field. However, a crucial component of the results is missing (the actual range of the uncertainty interval) that is necessary to contextualize utility of the method. Providing this missing information may significantly improve this paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The problem tackled by this paper is important, but the main concern raised by this reviewer remains unaddressed.

    Specifically, without knowing the lower and upper bounds of the uncertainty interval, it is not possible to assess the model’s clinical utility. If the uncertainty interval is very wide, then most of the voxels will fall within the interval, but the interval is not useful for making decisions. As mentioned in the original review, Figure 3 shows that for some voxels, the confidence interval is 50% of the predicted value which is quite large. Further, the network predicts a dose per beam/segment. In order to create a full 3-D distribution of the radiation received by the targets and healthy organs, these predictions must be summed together across all beams/segments, and propagation of uncertainty must be performed. Therefore, the uncertainty interval will be even larger. Without seeing the Gy values of the lower and upper bounds in the rebuttal, this reviewer feels unable to raise their score.

    Also, this reviewer disagrees that summary statistics are not informative. For example, a statement such as “for all test set patients, X% of FG voxels had a corresponding interval of <Y Gy” would be very helpful to a radiation oncologist or medical physicist using a dose prediction tool.
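A minimal sketch of the kind of summary statistic requested here, together with the interval-to-ground-truth ratio suggested in the original review. The function names, the foreground mask rule, and all numbers are hypothetical and only illustrate the computation; they are not results from the paper.

```python
import numpy as np

def fraction_narrower_than(lower, upper, mask, max_width_gy):
    """Share of masked voxels whose interval width (upper - lower) is below max_width_gy."""
    widths = (upper - lower)[mask]
    return float((widths < max_width_gy).mean())

def relative_width(lower, upper, truth, mask, eps=1e-6):
    """Interval width divided by the true dose, for masked voxels only."""
    return (upper - lower)[mask] / np.maximum(truth[mask], eps)

# Toy voxel values in Gy (invented for illustration).
lower = np.array([4.0, 5.0, 9.0, 0.0])
upper = np.array([5.0, 8.0, 10.0, 0.2])
truth = np.array([4.5, 6.0, 9.5, 0.0])
fg = truth > 1.0  # hypothetical foreground rule

frac = fraction_narrower_than(lower, upper, fg, max_width_gy=2.0)
rel = relative_width(lower, upper, truth, fg)
print(f"{100 * frac:.1f}% of FG voxels have an interval narrower than 2 Gy")
print("relative widths:", rel)
```

A histogram of `rel` over all test voxels would give the per-voxel ratio view; the single percentage gives the "X% of FG voxels below Y Gy" statement.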

    Finally, in the rebuttal, the authors state that: “We can be confident in the SG-RCPS foreground intervals being as large as necessary (but no larger) by interpreting Fig. 2 and Tab. 1.” However, Table 1 shows that the head & neck category does not reach the desired error level (as stated by the authors in the last paragraph of 3.2), appearing to contradict this statement.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a method to calibrate the uncertainty estimate in dose prediction for radiotherapy. It builds on risk-controlling prediction sets (RCPS), which suggests to estimate an uncertainty interval and obtain a guarantee that the solution is within this interval with a certain probability. Utilizing previous work in dose prediction, DeepDose (a 3D U-Net architecture), RCPS is then extended and applied to neural network-based dose prediction. The network is adjusted to additionally predict a pixel-wise upper and lower bound on the uncertainty interval. This is then followed with a calibration step to adjust the intervals and guarantee that the predicted interval will contain a valid solution with the pre-specified probability. A calibration algorithm (SG-RCPS) is proposed to enable calibration of different subgroups that are unknown at test time.
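The calibration step summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the interval is taken to be the prediction minus/plus a scale factor `lam` times heuristic lower/upper widths, a plain Hoeffding upper confidence bound is swapped in for whatever bound the paper uses, each calibration image counts as one sample, and the subgroup variant simply requires every subgroup mask to pass the same test. All function and variable names are hypothetical.

```python
import numpy as np

def hoeffding_ucb(risk_hat, n, delta):
    """Upper confidence bound on the true risk for n losses bounded in [0, 1]."""
    return risk_hat + np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def calibrate(pred, lo_w, up_w, truth, masks, alpha, delta, lambdas):
    """Return the smallest lam (scanning from large to small) whose bound stays <= alpha
    for every mask; None if even the largest lam fails."""
    n = pred.shape[0]  # number of calibration images
    chosen = None
    for lam in sorted(lambdas, reverse=True):
        outside = (truth < pred - lam * lo_w) | (truth > pred + lam * up_w)
        for mask in masks:
            # Per-image fraction of masked voxels that fall outside the interval.
            per_image = (outside & mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
            if hoeffding_ucb(per_image.mean(), n, delta) > alpha:
                return chosen  # stop at the first violation
        chosen = lam
    return chosen

# Toy calibration set: 200 images with 10 voxels each; 9 easy background voxels
# (error 0.1) and 1 hard foreground voxel (error 1.0) per image.
truth = np.tile(np.array([0.1] * 9 + [1.0]), (200, 1))
pred = np.zeros_like(truth)
widths = np.ones_like(truth)
total = np.ones_like(truth, dtype=bool)
fg = truth > 0.5
lambdas = [0.05, 0.1, 0.5, 1.0, 2.0]

lam_plain = calibrate(pred, widths, widths, truth, [total], 0.2, 0.1, lambdas)
lam_sg = calibrate(pred, widths, widths, truth, [total, fg, ~fg], 0.2, 0.1, lambdas)
print(lam_plain, lam_sg)
```

With only the total-image mask, the scan settles on a small scale factor because the many easy background voxels dominate the average; adding the foreground mask forces a larger factor so that the beam voxel is covered as well.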

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper investigates an interesting and valuable research direction of uncertainty quantification in radiotherapy.

    The extension of RCPS to neural network-based dose prediction seems novel.

    The proposed subgroup calibration method appears easy to implement and effective, as can be seen from the results obtained.

    The complete method is straightforward and shows good performance in acquiring the desired risk intervals.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper would benefit from a discussion on how the proposed method integrates into existing clinical workflows or its potential impact on clinical practices, providing clearer applications or case studies.

    Certain details are vague or omitted, which could impact the clarity and comprehensiveness of the paper:
    • “For each patient, we extracted multi-leaf collimator (MLC) segments, resulting in a total of 6638 segments”: Were all available segments utilized, or how were the segments chosen?
    • Did you use SG-RCPS calibration on the training set or the test set? There is no mention of which subset the 3 randomly selected segments came from.
    • Can anything be said about the runtime or the size of the calibration set required for SG-RCPS?

    From the paper it is not clear how much effort was spent optimizing the 3D U-Net. Can anything be said about the performance regarding the initial training objective and the subsequent performance on the test set?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    While the paper lacks direct links to code or datasets, the inclusion of detailed pseudocode and methodological descriptions suggests that the approach should be reproducible. However, providing access to code and sample datasets could significantly enhance the paper’s utility and facilitate validation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please discuss how this methodology can be integrated into current clinical workflows. Including potential impacts or case studies could significantly strengthen the practical relevance of your work.

    Include details about the runtime and computational requirements for SG-RCPS to help assess the feasibility of implementing your method in practice.

    Optimization and Performance Metrics: It would be beneficial to provide information on any optimization efforts for the 3D U-Net and discuss the model’s performance on initial training objectives as well as its efficacy on the test set.

    Consider sharing the code and datasets used in your study, or at least providing a more comprehensive guide to reproducing your results. This would greatly aid in validating and extending your research.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Calibrating uncertainty estimates in radiotherapy dose prediction is a valuable contribution. Aligning risk-controlling prediction sets (RCPS), a known clinical characteristic, with a neural network-based uncertainty framework, will enhance its adoptability. This methodology addresses a gap in enhancing the reliability of radiotherapy treatments. The introduction of a subgroup calibration method, SG-RCPS, for handling unknown subgroups at testing time, is particularly notable and demonstrates strong experimental results, underscoring its practical impact and potential for improving patient outcomes. Furthermore, the paper is well written and offers clear methodologies.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The application of RCPS to neural network-based dose prediction is a valuable contribution to radiotherapy, representing some of the first efforts to integrate these frameworks for dose prediction. The authors have addressed the reviewers’ concerns in their rebuttal and committed to incorporating necessary improvements, such as clarifying clinical workflow integration and providing detailed performance metrics. Given these assurances and the promising initial results, the paper warrants publishing for its potential to enhance uncertainty quantification in radiotherapy.




Author Feedback

We thank the reviewers for their constructive feedback. We are pleased that they found the work novel [R5], a valuable and interesting research direction [R5], clinically relevant [R3, R4], and well-organized and easy to follow [R4]. Below, we address the main comments.

[R3] Why MR motivation when CT is used? The envisioned MR-guided radiation therapy workflow registers the planning CT volume to the image acquired in the MR-Linac. Dose prediction is performed using the transformed CT image. MR image intensities do not correspond to physical units and lack the necessary information to simulate dose deposition. Will be clarified.

[R3] Clarity and distinction to prior work: We acknowledge a lack of clarity in our submission. If accepted, we will rearrange the introduction for improved readability, defining risk early on. We will distinguish more clearly between the background in Sec. 2.3 and our novel subgroup RCPS algorithm in Sec. 2.4.

[R3] How were patch predictions combined? Inference was performed with overlapping 3D patches, and outputs were aggregated into a full 3D volume. This will be clarified.
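A minimal sketch of one common way to aggregate overlapping 3D patch outputs into a full volume, namely averaging wherever patches overlap. The rebuttal does not state which aggregation rule was used, so the averaging, the function name `aggregate_patches`, and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def aggregate_patches(patches, corners, volume_shape):
    """Average overlapping 3D patch predictions into one volume.

    patches: list of 3D arrays; corners: list of (z, y, x) start indices.
    Voxels not covered by any patch are left at zero.
    """
    acc = np.zeros(volume_shape, dtype=np.float64)
    count = np.zeros(volume_shape, dtype=np.float64)
    for patch, (z, y, x) in zip(patches, corners):
        dz, dy, dx = patch.shape
        acc[z:z + dz, y:y + dy, x:x + dx] += patch
        count[z:z + dz, y:y + dy, x:x + dx] += 1.0
    # Divide only where at least one patch contributed.
    return np.divide(acc, count, out=np.zeros_like(acc), where=count > 0)

# Two patches of shape (2, 2, 4) that overlap along the last axis.
patches = [np.full((2, 2, 4), 1.0), np.full((2, 2, 4), 3.0)]
corners = [(0, 0, 0), (0, 0, 2)]
volume = aggregate_patches(patches, corners, volume_shape=(2, 2, 6))
print(volume[0, 0])
```

The same routine could be applied to the predicted dose and to the lower and upper interval bounds separately; whether averaged bounds retain the calibration guarantee is a separate question that this sketch does not address.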

[R3, R5] Dose estimation performance: We validated our system with the gamma pass rate but did not report the figures due to our focus on uncertainty quantification. The median gamma pass rate was 98.9% for the 3%/3mm criterion. This will be included in the final version. The dataset currently lacks segmentations, so dose volume histograms could not be computed.

[R3, R5] Inference and training times, and calibration set requirements: DL-based dose prediction reduces prediction times from hours or days (MC simulations) to ~30 seconds. This speedup was addressed by DeepDose [12]. Building on [12], our contribution lies in added uncertainty quantification via RCPS and SG-RCPS. Inference times will be clarified. Training took around 10 days on an Nvidia 2080ti GPU. Calibration took around 15 hours for each entity. Determining the optimal number of calibration samples is difficult. However, each voxel is a calibration sample, providing a large effective calibration set size.

[R4] Actual Gy values for intervals in Fig. 3 and utility of intervals: We will report actual Gy ranges and add a legend with units. We can be confident in the SG-RCPS foreground intervals being as large as necessary (but no larger) by interpreting Fig. 2 and Tab. 1. The desired risk levels are met in close to 90% (=1-delta) of the segments, indicating that the interval size produces our desired risk characteristics. Larger BG intervals are necessary to satisfy BG and FG risks simultaneously. Discussion of results will be improved.

[R4, R5] Code availability: The code for SG-RCPS will be made publicly available upon manuscript acceptance.

[R4] Report the lower and upper bounds of the uncertainty intervals: Unfortunately, reporting these bounds as summary statistics is not informative as comparisons between patients are not meaningful.

[R4] Statistical test between the RCPS and SG-RCPS: Differences are statistically significant. Will be emphasized in text.

[R4] Unexpected behavior of different subgroups: We conclude from the behavior that the requirement for exchangeability between calibration and test set is better satisfied for FG voxels than BG voxels. SG-RCPS, which is dominated by foreground voxels, produces better risk-controlled intervals, while RCPS is more affected by violations in exchangeability.

[R4] Are the uncertainty intervals symmetrical? The intervals are not symmetrical as we estimate separate upper and lower bounds.

[R5] Questions about data splits: We performed a random train/test/val/calibration split at the patient level (see Table 1 in the supplementary materials). All segments from each set were used for training, evaluation, validation, and calibration. The description of the calibration split will be clarified.

[R3, R4, R5] Minor comments: All minor comments will be incorporated.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Accept, while some concerns still remain, especially the clinical applicability of the model due to computational time and the clinical utility of the uncertainties because of large margins. However, I think there is still value in its publication and inclusion in MICCAI even if the method is not the “final” solved solution for radiotherapy planning.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Following the rebuttal, all reviewers have reached a consensus to accept the paper.



