Abstract

Image-guided radiotherapy procedures in the abdominal region require accurate real-time motion management for safe dose delivery. Anticipating future 4D motion from live in-plane imaging is crucial for accurate tumor tracking, which enables sparing of normal tissue and reduces the probability of recurrence. However, current real-time tracking methods often require a specific template and volumetric inputs, which is not feasible for online treatments. Generative models remain hindered by several issues, including complex loss functions and training processes. This paper presents a conditional motion diffusion model that handles high-dimensional data describing complex anatomical deformations. A discrete wavelet transform (DWT) maps inputs into the frequency domain, allowing the top features to be selected for the denoising process. The end-to-end model includes a masking mechanism for deformation observations: during training, a motion diffusion model learns to produce deformations from random noise. For future sequences, a denoising process conditioned on input deformations and time-wise prior distributions is applied to generate smooth and continuous deformation outputs from cine 2D images. Lastly, a temporal 3D local tracking module exploiting latent representations refines the local motion vectors around pre-defined tracked regions. The proposed forecasting technique reduces errors by 62% compared with a 4D conditional Transformer displacement model, with target errors of 1.29 ± 0.95 mm and mean geometrical errors of 1.05 ± 0.53 mm on forecasted abdominal MRI.
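The DWT feature selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single-level Haar wavelet and a top-k rule based on coefficient magnitude; the abstract does not state the wavelet family, the number of levels, or the selection criterion, so `haar_dwt`, `keep_top_k`, and the toy displacement trace are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_dwt(signal):
    """Single-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * _S for a, b in pairs]
    detail = [(a - b) * _S for a, b in pairs]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the signal from both coefficient sets."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out


def keep_top_k(coeffs, k):
    """Zero every coefficient except the k with the largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


# Toy 1D displacement trace (hypothetical values, in mm).
trace = [1.0, 3.0, 2.0, 2.0]
approx, detail = haar_dwt(trace)
compressed = keep_top_k(approx + detail, k=2)
smoothed = haar_idwt(compressed[:2], compressed[2:])
```

Keeping only the strongest coefficients discards the small detail terms, so the reconstruction is a smoothed version of the input. In this toy case the two approximation coefficients survive and the trace collapses to its pairwise means.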

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0595_paper.pdf

SharedIt Link: https://rdcu.be/dV5vZ

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72089-5_9

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Thi_Conditional_MICCAI2024,
        author = {Thibeault, Sylvain and Romaguera, Liset Vazquez and Kadoury, Samuel},
        title = {{Conditional 4D Motion Diffusion Models with Masked Observations to Forecast Deformations}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {89--98}
}

Reviews

Review #1

Please describe the contribution of the paper

The authors utilise the diffusion model to describe and forecast complex anatomical deformations for 4D liver MRI scans. The authors combined multiple methodologies including masked observations, conditional priors and the final tracker to improve the baseline DDPM model. The authors compared the proposed model with several transformer-based models and ablation variations to demonstrate the effectiveness of the proposed model.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors conducted comprehensive ablation studies to show the effectiveness of each added component.
- The authors applied the diffusion model to improve motion forecasting in an image-guided radiotherapy scenario and demonstrated that the DL model yields some improvement in TRE.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Methodological Flaws: The manuscript presents a framework whose description lacks clarity, particularly in how the problem is formulated and the alignment of variables with the provided illustrations. This ambiguity complicates the understanding of the proposed method. Furthermore, the absence of detailed information on the loss functions employed raises questions about the methodology’s robustness. Additionally, the limited dataset size presents a significant risk of overfitting, given the complexity of the neural network proposed. A more comprehensive explanation of the dataset and strategies to mitigate overfitting would strengthen the paper.
- Clinical Relevance: The paper asserts the necessity of predicting organ motion to manage system latencies during online therapy sessions. However, diffusion-based models, as noted in the paper, are typically slow to execute, even in the testing phase. This characteristic casts doubt on the framework’s ability to effectively reduce system latencies rather than potentially exacerbating them. The inclusion of a comparative analysis of running and inference times would provide essential evidence to substantiate the claims of improved latency management.
- Comparisons Missing: While the paper employs target registration errors (TRE) derived from 3D volumes as an evaluation metric and offers qualitative comparisons in 2D sequences, it lacks integration of standard evaluation metrics such as Root Mean Square Error (RMSE), anatomical DICE scores, and Deformation Vector Field (DVF) Jacobian maps. These metrics are crucial for a comprehensive assessment and validation of the model. Additionally, considering that the proposed framework does not appear to be limited by input modality, evaluation using publicly available datasets would not only enhance the robustness of the results but also facilitate comparability with existing methods.
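Two of the metrics the reviewer asks for, the Dice score and the Jacobian determinant of a DVF, can be computed as in the sketch below. It is an illustration only: the 2D grid, the central-difference scheme, and the function names are assumptions, and none of this comes from the paper's evaluation code.

```python
def dice(mask_a, mask_b):
    """Dice overlap of two flattened binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * inter / total


def jacobian_determinant_2d(ux, uy, spacing=1.0):
    """det(I + grad u) at interior grid points, using central differences.

    ux[i][j] and uy[i][j] are the displacement components along x (column j)
    and y (row i). Values <= 0 indicate folding of the deformation.
    """
    rows, cols = len(ux), len(ux[0])
    det = []
    for i in range(1, rows - 1):
        row = []
        for j in range(1, cols - 1):
            dux_dx = (ux[i][j + 1] - ux[i][j - 1]) / (2 * spacing)
            dux_dy = (ux[i + 1][j] - ux[i - 1][j]) / (2 * spacing)
            duy_dx = (uy[i][j + 1] - uy[i][j - 1]) / (2 * spacing)
            duy_dy = (uy[i + 1][j] - uy[i - 1][j]) / (2 * spacing)
            row.append((1 + dux_dx) * (1 + duy_dy) - dux_dy * duy_dx)
        det.append(row)
    return det


# Toy field: 10% uniform stretch along both axes (hypothetical values).
stretch_x = [[0.1 * j for j in range(4)] for _ in range(4)]
stretch_y = [[0.1 * i for _ in range(4)] for i in range(4)]
det_map = jacobian_determinant_2d(stretch_x, stretch_y)
overlap = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

A determinant near 1 means local volume is preserved, values above 1 mean expansion, and non-positive values flag physically implausible folding.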
Please rate the clarity and organization of this paper

Poor
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
Majors:
- In the pipeline figure (Fig 1), M and 1-M vectors should be complementary according to the description in the Method section but they are illustrated as the same in the Fig 1.
- In section 2.2, the authors use a ‘temporal encoder’ to capture the feature vectors from the input 2D in-plane images. But no further details/citation about the decoder can be found. Similarly the authors introduce Model-based tracker in section 2.6 but not enough details both visually and in text are provided.
- The authors should consider adding a systematic summarizing of the objective function/loss function in the section 2 to clarify the problem formulation and training process.
- In section 3, the authors state that the DVFs between pairs of volumes were pre-computed using VoxelMorph and served as ground truth (GT). How much can we trust the output of VoxelMorph as ground truth? Could the authors also clarify how much data was used to pre-train VoxelMorph and how well this reference model performed?
- The authors claimed the sampling procedure DDIM “fine-tunes the parameters on the validation set”. I believe DDIM sampling is fine tuned based on the trained DDPM rather than on the validation set. Could the authors clarify this statement?
- In section 4, the authors described the proposed model as an “online forecasting model”, which seems a bit over-claimed to me. The term “online” is usually used to describe approaches that involve test-time optimisation, such as meta-learning or zero-shot learning, which do not appear in the proposed model. Also, as mentioned above, diffusion-based models are known for relatively slow inference, and there are no running-time experiments to show that the proposed approach achieves an “online speed”. Please clarify this term.
Minors:

Introduction section
- “recursive prediction” -> “recursive prediction process”
- “complex to train” -> “complicated to train”
- “These approaches were shown…” -> “These approaches show…”. The authors should use a consistent tense when describing related work.
- The authors mention GAN applications in image segmentation and reconstruction, “particularly for image segmentation and image reconstruction applications…”. Please consider citing related publications to support the claim.
- “determined from the principal modes of variation from surrounding organs and target shape,” This sentence does not make sense to me. Please consider splitting the long sentence into short and clear sentences.
- “NLP” -> “natural language processing (NLP)” if this is the first time the authors mention NLP.
Fig.1
- y_t-1 -> y_{t-1}
- x N -> \times N
Methods
- (optional) Preliminaries -> Problem definition/problem formulation
- “The feature encoder receives as input the” This sentence does not make sense to me. Maybe the author is trying to say “the feature encoder receives the input as the concatenation of…”?
- “between all the elements…” -> “among all the elements…”
- “and and” -> “and”
- “Table 1 presents the target registration errors from the tumor target regions, comparing the proposed model the recent spatiotemporal predictive methods, including MotionDiff [22], LMC [11] and a Transformer-based approach [17], which were trained with similar conditions to the proposed model.” Similarly, this sentence is too long and its subjects are unclear.
- The authors call the C component (in the ablation study) the “conditional prior” but sometimes also refer to it as the “conditional factor”. Please keep the term consistent.
- “error increase will not be apparent…” -> “error increase is not apparent…”
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Reject — could be rejected, dependent on rebuttal (3)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major factors that led me to my overall score for this paper are the major weaknesses and major concerns listed above.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The authors describe a novel, end-to-end, conditional diffusion model that is evaluated for its ability to prospectively predict deformations over time within a 4D T2-weighted abdominal MR dataset (n=30). Specifically, a tumor target area delineated by a radiologist was used to compare the model against other deformation prediction models (MotionDiff, LMC and a Transformer-based model). Ablation experiments are performed in which the effects of the conditional diffusion, masked observation and model tracking components of the overall model are also assessed. Using leave-one-out training:validation testing, the model achieves a TRE ranging from 1.31 +/- 1.02 to 1.28 +/- 0.92 mm depending on the elapsed time interval and compares favorably with other deformation models.
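For readers unfamiliar with the metric, a TRE of the form "mean +/- standard deviation" is typically obtained from Euclidean distances between predicted and reference landmark positions. The sketch below assumes landmarks given in millimetres and uses the population standard deviation; the paper's exact landmark set and averaging convention are not described here, so these choices and the sample coordinates are hypothetical.

```python
import math
import statistics


def target_registration_errors(predicted, reference):
    """Per-landmark Euclidean distance (mm) between matched 3D points."""
    if len(predicted) != len(reference):
        raise ValueError("landmark lists must have the same length")
    return [math.dist(p, r) for p, r in zip(predicted, reference)]


def summarize(errors):
    """Return (mean, population standard deviation) of the errors."""
    return statistics.mean(errors), statistics.pstdev(errors)


# Hypothetical landmark coordinates in mm.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
errors = target_registration_errors(predicted, reference)
mean_tre, std_tre = summarize(errors)
print(f"TRE = {mean_tre:.2f} +/- {std_tre:.2f} mm")
```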
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strengths of the paper include assessing a clinically meaningful task: prospective and accurate deformation prediction within a 4D medical imaging dataset. The value of better deformation prediction models includes use not only in radiation treatment but also for intra-procedural and intra-operative guidance for an array of interventional and surgical therapies. Another strength is the evaluation design, which focuses on all areas of the imaging volume as well as the targeted tumor, compares across multiple previously described deformation prediction models, and performs ablation experiments for the principal components of the novel model. Figures 2 and 3 were especially illustrative of how deformation prediction accuracy is influenced by respiratory motion and local anatomy.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The main weakness of the paper for this reviewer was the challenge of understanding how the different model components constitute the whole model (conditional diffusion, masked observation inference and model tracking). For instance, it would help those with less domain expertise in novel model design if these components were clearly described and annotated in figure 1. Additionally, it would be helpful to understand training and inference computation time to better gain a sense of translatability. In future studies the authors may consider simulating a more clinically meaningful endpoint such as dosage to tumors and organs at risk if their described approach were used for online, adaptive radiation treatment of a liver tumor.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

There is no link or discussion of specific programming languages used or packages utilized and so there’s no way to test reproducibility of the authors’ work.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

An expanded figure 1 that more clearly delineates the different components of the model described in the methods is needed to allow better appreciation of the role these components play. Including diagrams of the comparator architectures for juxtaposition would also improve the audience’s appreciation of the novelty of the work. If possible, including links to code and the specific training/test dataset would improve reproducibility of the authors’ work. It would also improve clinical translatability if the boundaries of organs at risk (OAR) were chosen as landmarks and dose maps across models were presented.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Accept — should be accepted, independent of rebuttal (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors appear to present a novel approach for deformation prediction within a 4D MR dataset that could have a clinically meaningful impact on minimally invasive liver interventions. They compare performance with multiple previously described models and, most importantly, illustrate how accuracy is affected by local anatomy and respiratory motion.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

This article proposes a real-time volumetric motion prediction method based on a diffusion model. The proposed model predicts the displacement vector field (DVF) based on the observed DVF sequence. The authors propose a loss function with a conditional term that measures the similarity between the current and the previous observed images to guarantee the smoothness of the model inference. As compared to the previous methods, the proposed one achieves better long-term stability with a reduction of error by 62% and lower errors in terms of target errors and mean geometrical errors.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The article proposes a real-time volumetric motion prediction model based on a diffusion model, achieving better long-term stability and lower errors than previous transformer-based prediction methods.
- The article is well written, with vivid visualizations of the dynamic tracking and prediction results.
- The smoothness term proposed in this article introduces efficient constraints to condition future predictions.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- It is unclear how the respiratory motion (as mentioned in the introduction Section, the last sentence of paragraph 4) is used as an additional prior to condition future predicted values. Do the authors separate the input sequence based on the respiratory phase?
- In Section 2.3, it is unclear what “K” stands for and how its value is determined.
- Temporal resolution is reported, while the inference speed and the training time are missing.
- It is unclear how the conditional parameter “c” is calculated from the anatomical image Vref and previously obtained feature maps Zf.
- It is unclear what the minimal requirement of the length of the previous observed DVF is to obtain good enough prediction results.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

The main concept of the proposed method is well introduced in the article. Some implementation details, such as the smoothness conditional parameters, are unclear, but this could easily be improved by adding more information.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
- In the second paragraph of the introduction section, the abbreviation LSTM first appears without its full name. Please add the full name here as Long Short-Term Memory (LSTM) and modify the related term in the next sentence to Convolutional LSTM.
- Providing more information about how the smoothness consistency is constrained may help the reader understand the importance and usefulness of the conditional prediction.
- I would like the authors to provide some information about the training time, inference speed of each new frame and the minimal requirement of the length of the previous observed DVF. Such information may help to elaborate the real-time and intraoperative performance of the proposed method.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Accept — should be accepted, independent of rebuttal (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper is overall well written. It clearly discusses the limitations of previous methods and smoothly raises the motivation for using the diffusion model for volumetric motion prediction. The challenges of leveraging the diffusion model are clearly stated, and the solutions are described in detail except for some missing explanations of terms, such as the smoothness conditional prior. Experiments are solid and the visualization of the results is impressive. The whole story flow is smooth and reasonable; thus, I would like to accept this article for MICCAI.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Author Feedback

Methodological Flaws

As suggested by R#1, we will add further description to the illustrations, defining each variable in the figure caption and explaining how it relates to the method. Descriptions of the loss function terms will be expanded, along with the weighting between the loss terms. Finally, we will provide more details on the dataset, which was used in previous studies for liver motion prediction from previous observations. While it may seem limited in size, each sequence includes more than 42,000 volumes, covering several breathing cycles.

Clinical Relevance. Diffusion-based models, as noted in the paper, are typically slow to execute, even in the testing phase. This characteristic casts doubt on the framework’s ability to effectively reduce system latencies rather than potentially exacerbating them.

We agree with R#1 that stable diffusion models are rather limited for real-time inference. This proof of concept shows that sufficient accuracy can be achieved in comparison to previous methods. In fact, Latent Consistency Models (LCMs) have recently attracted considerable interest because they can generate images in around 150 ms (acceptable for system latencies), as opposed to 10 seconds with vanilla Stable Diffusion. This will be investigated in further studies.

Comparisons Missing: While the paper employs target registration errors (TRE) derived from 3D volumes as an evaluation metric and offers qualitative comparisons in 2D sequences, it lacks integration of standard evaluation metrics such as Root Mean Square Error (RMSE), anatomical DICE scores, and Deformation Vector Field (DVF) Jacobian maps.

We do in fact report geometrical errors in the paper, as shown in Fig. 2. This yields a geometrical error of 1.05 +/- 0.53 mm. Due to page limits, we plan to report RMSE and DVF errors, as well as evaluate other modalities, in future work.

In the pipeline figure (Fig 1), M and 1-M vectors should be complementary according to the description in the Method section but they are illustrated as the same in the Fig 1.

We will distinguish vectors M and 1-M as complementary in Fig. 1.
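The complementarity of M and 1-M discussed here can be made concrete with a small sketch. It assumes the common masked-diffusion composition in which observed entries are kept where M = 1 and generated entries fill in where M = 0; whether the paper combines the masks in exactly this way is not stated on this page, and `blend_with_mask` and the sample values are hypothetical.

```python
def blend_with_mask(observed, generated, mask):
    """Return mask * observed + (1 - mask) * generated, element-wise."""
    if not (len(observed) == len(generated) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [m * o + (1 - m) * g for o, g, m in zip(observed, generated, mask)]


# Four time steps: the first two are observed, the last two are forecast.
mask = [1, 1, 0, 0]
complement = [1 - m for m in mask]  # the 1-M vector
observed = [1.0, 2.0, 0.0, 0.0]     # hypothetical observed deformations
generated = [9.0, 9.0, 3.0, 4.0]    # hypothetical denoiser output
blended = blend_with_mask(observed, generated, mask)
```

Because every position belongs to exactly one of the two masks, a figure that draws M and 1-M with the same pattern would contradict this composition.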

In section 2.2, the authors use a ‘temporal encoder’ to capture the feature vectors from the input 2D in-plane images. But no further details/citation about the decoder can be found. Similarly the authors introduce Model-based tracker in section 2.6 but not enough details both visually and in text are provided.

We will add a reference to the 2D in-plane decoder, as well as the tracker presented by (Romaguera TMI 2023).

The authors should consider adding a systematic summarizing of the objective function/loss function in the section 2.

The loss term in section 2.4 will be detailed.

How much can we trust the outcome from VoxelMorph serving as Ground Truth? Could author also clarify how good did this reference model achieve?

In several previous studies, VoxelMorph was used to register liver images between several phases, showcasing the method’s robustness to various types of breathing patterns and deformations. A citation for the reference model will be added.

The authors claimed the sampling procedure DDIM “fine-tunes the parameters on the validation set”. I believe DDIM sampling is fine tuned based on the trained DDPM rather than on the validation set. Could the authors clarify this statement?

The reviewer is correct: the fine-tuning is performed on the trained DDPM. This will be corrected in the revised paper.
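The point under discussion is that DDIM sampling reuses the noise predictor of an already trained DDPM and changes only the reverse update. A minimal sketch of one deterministic DDIM step (eta = 0), in its standard published form, is given below. The list-based vectors and the sample values are illustrative assumptions, and the authors' actual sampler settings are not given on this page.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    x_t: current noisy sample; eps_pred: noise predicted by the trained
    DDPM network; alpha_bar_*: cumulative products of (1 - beta).
    """
    sqrt_at = math.sqrt(alpha_bar_t)
    sqrt_one_minus_at = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - sqrt_one_minus_at * e) / sqrt_at for x, e in zip(x_t, eps_pred)]
    # Re-noise the estimate to the previous (less noisy) level.
    c_x0 = math.sqrt(alpha_bar_prev)
    c_eps = math.sqrt(1.0 - alpha_bar_prev)
    return [c_x0 * x0 + c_eps * e for x0, e in zip(x0_pred, eps_pred)]


# Sanity check with a perfect noise prediction (hypothetical values).
x0 = [1.0, -2.0]
eps = [0.5, 0.25]
x_t = [math.sqrt(0.5) * a + math.sqrt(0.5) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, alpha_bar_t=0.5, alpha_bar_prev=0.9)
```

No network weights are updated in this step. Only sampler choices, such as the number of steps and the subsequence of noise levels, remain to be tuned.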

In section 4, the authors described the proposed model as an “online forecasting model”, which seems a bit over-claimed to me.

The term “online” was intended to designate the near real-time performance of the model for predicting motion during radiation. However, to avoid any further confusion, we will therefore remove this

Meta-Review

Meta-review not available, early accepted paper.


Conditional 4D Motion Diffusion Models with Masked Observations to Forecast Deformations
