Abstract
Temporal modeling of regular respiration-induced motion is crucial to image-guided clinical applications. Existing methods cannot simulate temporal motions unless high-dose imaging scans, including both the starting and ending frames, are available simultaneously. However, in the preoperative data acquisition stage, slight patient movement may result in dynamic backgrounds between the first and last frames of a respiratory period. This additional deviation can hardly be removed by image registration and thus degrades the temporal modeling. To address this limitation, we pioneer the simulation of the regular motion process via an image-to-video (I2V) synthesis framework, which animates the first frame to forecast future frames of a given length. Moreover, to promote the temporal consistency of the animated videos, we devise the Temporal Differential Diffusion Model to generate temporal differential fields, which measure the relative differential representations between adjacent frames. A prompt attention layer is devised for fine-grained differential fields, and a field augmented layer is adopted to better integrate these fields with the I2V framework, promoting more accurate temporal variation in the synthesized videos. Extensive results on the ACDC cardiac and 4D Lung datasets show that our approach simulates 4D videos along the intrinsic motion trajectory, rivaling other competitive methods on perceptual similarity and temporal consistency. Code is available at https://github.com/AlexYouXin/Mo-Diff
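For readers skimming this page, a minimal sketch of one plausible reading of "temporal differential fields" (signed, normalized differences between adjacent frames) is given below; the paper's exact formulation may differ, so treat this as illustration only.

```python
import torch

def temporal_differential_fields(video: torch.Tensor) -> torch.Tensor:
    """One plausible reading of 'relative differential representations
    between adjacent frames': signed frame-to-frame differences with a
    per-pair normalization. The paper's exact definition may differ.

    video: (N, C, H, W) tensor holding N consecutive frames.
    Returns: (N - 1, C, H, W) tensor of differential fields.
    """
    diff = video[1:] - video[:-1]
    scale = diff.flatten(1).abs().amax(dim=1).clamp_min(1e-8)
    return diff / scale[:, None, None, None]  # roughly in [-1, 1]

fields = temporal_differential_fields(torch.rand(8, 1, 64, 64))
print(fields.shape)  # torch.Size([7, 1, 64, 64])
```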
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0894_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0894_supp.zip
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{YouXin_Temporal_MICCAI2025,
author = { You, Xin and Zhang, Minghui and Zhang, Hanxiao and Yang, Jie and Navab, Nassir},
title = {{Temporal Differential Fields for 4D Motion Modeling via Image-to-Video Synthesis}},
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {609--619}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors introduce a two-stage framework for generating 4D sequences from a single image slice. Initially, a Temporal Difference Diffusion Model (TDDM) predicts temporal difference fields for the entire sequence using a diffusion process, conditioned on the initial frame and the desired number of frames. Subsequently, a separate diffusion model utilizes these temporal difference fields to construct embeddings, which are then input into a Variational Autoencoder (VAE) decoder to synthesize the 4D video.
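To make the two-stage flow described above concrete, here is a heavily simplified sketch: `TinyDenoiser` stands in for the diffusion U-Nets, the sampler uses a fixed-step placeholder update rather than a real DDPM/DDIM noise schedule, and the VAE decoder plus time-step/frame-number conditioning are omitted. All shapes and channel counts are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion U-Net; the real networks are far larger
    and also take the diffusion time-step and frame number as inputs."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def sample(denoiser: nn.Module, cond: torch.Tensor, shape, steps: int = 10) -> torch.Tensor:
    """Toy reverse-diffusion loop: start from Gaussian noise and refine.
    A real sampler follows a DDPM/DDIM schedule, not this fixed step."""
    x = torch.randn(shape)
    for _ in range(steps):
        x = x - 0.1 * denoiser(torch.cat([x, cond], dim=1))
    return x

first_frame = torch.randn(1, 1, 64, 64)  # the single prompting frame
n = 4                                    # number of future frames to forecast

# Stage 1 (TDDM): predict one differential field per future frame,
# conditioned on the first frame.
tddm = TinyDenoiser(in_ch=2, out_ch=1)
fields = [sample(tddm, first_frame, first_frame.shape) for _ in range(n)]

# Stage 2 (I2V): synthesize each future frame conditioned on the first
# frame plus its differential field; a VAE decoder (omitted here) would
# map the resulting embeddings back to image space.
i2v = TinyDenoiser(in_ch=3, out_ch=1)
frames = [sample(i2v, torch.cat([first_frame, f], dim=1), first_frame.shape)
          for f in fields]

video = torch.cat([first_frame] + frames, dim=0)
print(video.shape)  # torch.Size([5, 1, 64, 64])
```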
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- An incremental method: A two-stage approach in which the first stage generates temporal difference fields, rather than generic flow information, and the second stage conditions on these fields to synthesize embeddings corresponding to the video frames.
- Demonstrates satisfactory image quality metrics on two datasets.
- Ablation studies highlight the importance of including the frame number and concatenation operations.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The experimental results could have been strengthened by evaluating performance on a wider range of downstream tasks like segmentation and localization.
- Although the number of models in the experimental section may appear sufficient, more relevant methods — particularly those that conditionally use temporal information within diffusion models — could have been included for a more comprehensive comparison.
- There is limited discussion of the memory footprint and inference time of the proposed method, which is particularly important for clinical applications where computational efficiency and speed are critical.
- The public repository lacks instructions for training and evaluating the proposed method. Additionally, it does not specify the required versions of libraries, which could hinder reproducibility.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The synthesized images should be evaluated using downstream tasks, not just image quality metrics. For instance, the ACDC dataset includes segmentation masks, and the 4D Lung dataset provides tumor location annotations. Evaluating the synthesized volumes on these tasks would offer a clearer picture of their potential clinical impact.
- It would still be interesting to explore alternatives to temporal difference fields, such as using optical flow or SIFT-flow, to model motion between frames.
- You mention that generation starts from the first frame, typically corresponding to the diastolic phase in the ACDC dataset. However, given the cyclic nature of the cardiac volume, frames near the beginning and end tend to resemble each other. Could you report image quality metrics specifically for the systolic phase in a separate table? (A sketch of such a per-phase evaluation follows this list.)
- You stated that using two prompting frames improves image quality, yet the current setup uses only one. Could you clarify the motivation for using a single frame? Is there a technical constraint or specific advantage to this choice?
- I noticed that metadata such as age, height, and weight is not used. Have you considered incorporating this information, either directly or via a language model to generate context-aware embeddings?
- Are the Gaussian noise samples $G_1$ and $G_2$ identical or different across the two diffusion stages?
- Given the two-stage nature of your method, the following related works should at least be discussed in the introduction: Ni et al., “Conditional Image-to-Video Generation with Latent Flow Diffusion Models”, CVPR 2023; Shen et al., “Decouple Content and Motion for Conditional Image-to-Video Generation”, AAAI 2024; Wang et al., “LEO: Generative Latent Image Animator for Human Video Synthesis”, IJCV 2024.
- Could you clarify what the variable $L$ refers to? Does it represent slice location in the volume, such as apical to basal slices?
- It would be fascinating to explore the method’s applicability beyond the image domain—for example, in k-space data or sinogram representations.
- From the code, it appears that the TDDM and I2V models must be trained separately. This is an important implementation detail that should be clearly mentioned in the paper. In addition, I was not able to find installation and usage instructions in the repository.
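Regarding the phase-specific evaluation suggested above, here is a minimal sketch of how per-phase image quality could be reported, assuming the end-systolic (ES) frame index is known per case, as the ACDC annotations provide. All names and shapes are illustrative, not the paper's evaluation code.

```python
import torch

def psnr(pred: torch.Tensor, ref: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two frames scaled to [0, max_val]."""
    mse = torch.mean((pred - ref) ** 2)
    return 10.0 * torch.log10(torch.tensor(max_val ** 2) / mse)

# Hypothetical tensors: (N, H, W) synthesized vs. reference sequences,
# with es_idx the annotated end-systolic frame for this case.
pred, ref, es_idx = torch.rand(12, 64, 64), torch.rand(12, 64, 64), 5

systolic_psnr = psnr(pred[es_idx], ref[es_idx])
cycle_psnr = torch.stack([psnr(p, r) for p, r in zip(pred, ref)]).mean()
print(f"ES-frame PSNR: {systolic_psnr.item():.2f} dB, "
      f"cycle mean: {cycle_psnr.item():.2f} dB")
```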
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Considering the novelty and technical soundness of the approach, this submission has the potential to be an interesting contribution to MICCAI 2025. However, the limited scope of comparative analysis and the absence of downstream task evaluations slightly diminish the overall impact of the work. Additionally, the publicly released code would benefit from clearer documentation, and an efficiency comparison is warranted—especially given the sequential and memory-intensive nature of diffusion-based models.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The work is focused on the estimation of physiological (cardiac and respiratory) motion from dynamic volumetric medical imaging (4D CMR and 4D lung CT). The specific problem addressed by the authors is the case of slight patient movement and/or unstable breathing that results in dynamic backgrounds between the first and last temporal frames, inducing additional bias that cannot be thoroughly removed through image registration and thereby affecting the temporal motion modeling. Previously proposed methods such as flow-based interpolation models or diffusion models do not completely address this problem. The authors propose a two-stage pipeline, termed Mo-Diff, aimed at simulating motion (cardiac or respiratory) via conditional diffusion models. An image-to-video (I2V) framework is combined with a temporal differential diffusion model (TDDM) to predict motion in subsequent frames from one initial reference frame.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
In itself, the approach is not new; image-to-video synthesis with conditional diffusion models has been used before to simulate physiological motion. Nevertheless, the authors demonstrate improved metrics compared to state-of-the-art methods, owing to the temporal differential fields used as conditional input to boost the temporal consistency of the synthesized video. These fields, which measure the relative differential representations between adjacent frames, are generated by the proposed Temporal Differential Diffusion Model (TDDM).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The manuscript is poorly written; the phrasing is hard to understand. Throughout the manuscript, the authors refer to simulating respiratory motion, even though half of the study was performed on cardiac MRI.
The authors state: “Due to a fixed sampling duration between frames, the specific N corresponds to the specific breathing period, acquired by electrocardiogram signals, thus influencing the rate of temporal motion variations.” This sentence, as is the case for much of the manuscript, is unclear. Electrocardiogram signals are used to acquire cardiac-resolved images, not respiratory-resolved ones. Furthermore, the authors state in the conclusion that “However, it requires more clinical guidance including electrocardiogram signals to simulate unstable breathing, which is promising in future work.” Again, this is very unclear and unrelated.
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The strategy to address the stated specific problem, i.e., “To eliminate the potential misalignment caused by patients’ movements in clinical practice”, was to “avoid the data acquisition of the ending CT or MRI frame in a respiratory cycle, and only collect the starting frame for the motion simulation of future frames. Thus we select the diffusion model conditioning on the first volume frame to simulate regular temporal motions.” I find this intriguing. Bulk patient motion appears randomly and may affect any part of the dynamic acquisition; the same holds for unstable breathing. The hypothesis that only the last frame of an acquisition is affected does not hold.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
Temporal modeling of regular respiratory motions is crucial for image-guided clinical applications. During the acquisition of 4D MRI/CT scans, even slight patient movement can introduce motion artifacts between the first and last frames of a respiratory cycle. This paper presents a method for synthesizing regular motions using the image-to-video (I2V) framework, which leverages the first frame to predict future frames over a given time span. The proposed approach employs a two-stage method: First, a temporal differential diffusion model is introduced to generate temporal differential fields as conditional guidance, enhancing the temporal consistency of the synthesized volumes. Second, a field-augmented layer is used to effectively integrate these fields with the I2V framework. Experimental results demonstrate that the proposed method can generate accurate and realistic videos with improved perceptual quality and temporal consistency, highlighting its potential for simulating regular cardiac and pulmonary motion.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well-structured, with clearly defined objectives and methodology. The experimental setup and results are presented in a clear and organized manner. The reported results are promising, and the deep architecture used is relatively simple, making the approach reproducible.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While simulating unstable breathing is proposed as future work, the diversity of the current dataset could be evaluated, provided that detailed dataset information is publicly accessible. The paper does not address the impact of varying noise levels on the model’s stability and performance. Including an analysis of this aspect would strengthen the empirical evaluation.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
It is mentioned that the diffusion time-step $T$ and frame number $N$ are passed through an MLP. However, a more detailed explanation of this design choice, specifically the motivation for using an MLP, would be helpful.
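For context, a common pattern that this design plausibly follows (an assumption, not the paper's verified implementation) is to map each scalar through a sinusoidal embedding and fuse the results with a small MLP; the MLP lets the network learn a nonlinear, task-specific conditioning vector from the raw scalars. All dimensions below are hypothetical.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard transformer-style sinusoidal embedding of a scalar index."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class ScalarConditioner(nn.Module):
    """Embed diffusion step T and frame count N, then fuse via an MLP.
    Layer sizes are hypothetical; the paper gives no exact dimensions."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, t_step: torch.Tensor, n_frames: torch.Tensor) -> torch.Tensor:
        e = torch.cat([sinusoidal_embedding(t_step, self.dim),
                       sinusoidal_embedding(n_frames, self.dim)], dim=-1)
        return self.mlp(e)

cond = ScalarConditioner()(torch.tensor([500]), torch.tensor([12]))
print(cond.shape)  # torch.Size([1, 128])
```

This is the conditioning scheme popularized by DDPM-style U-Nets, where the fused vector modulates intermediate feature maps.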
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed method has significant potential for applications in the medical domain and could pave the way for further research in image-to-video synthesis within this field.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
The motivation for using a single frame (R2). For temporal motion modeling, existing methods rely on both the starting and ending frames during test-time evaluation. However, during data acquisition in the preoperative stage, slight patient movement or unstable breathing results in dynamic backgrounds, which introduce additional bias that affects the motion modeling. Thus, in this paper, only the starting frame is collected. We then condition the diffusion model on the first volume frame and the video length to simulate regular temporal motions.
Generalization to varying noise levels (R1). Our model can handle cases with various pathologies. Specifically, the ACDC dataset contains 80% pathological cardiac cases (myocardial infarction, cardiomyopathy). Mo-Diff successfully synthesizes temporal frames conditioned on the first frame (see supplementary video).
Downstream tasks (R2). That is an insightful suggestion. We will conduct further evaluations on downstream tasks, including cardiac and tumor segmentation, in an extended journal version.
Comparison with more diffusion-based methods (R2). To clarify, we have compared Mo-Diff with four existing diffusion-based models: LDMVFI (AAAI 2024), DDM and LDDM (MICCAI 2022 and 2024), and conditional diffusion-based VFI (CVPR 2024). We are additionally implementing more comprehensive comparisons, which will be added to the camera-ready version.
Computational efficiency (R2). We will add the model efficiency to the manuscript. Mo-Diff requires 23.2 h of training time, 2.24 TFLOPs, and 49.8 s average inference time per case.
Instructions for the code repository (R2). We have completed the instructions in the anonymous GitHub repository, which will be made public.
Terminology (R3). We clarify that our work primarily focuses on breathing-induced motions, as stated in the first sentence of the Abstract, including cardiac beating and pulmonary respiratory motions. Thus, the term ‘respiratory motions’ will be revised to ‘respiration-induced motions’ or ‘breathing-induced motions’.
Electrocardiogram signals for respiration modeling (R3). Physiologically, the temporal modeling of respiratory lung CTs is closely related to the cardiac cycle. Thus, electrocardiogram signals may also boost pulmonary respiratory motion modeling. That is a potential direction for future work.
Random patient motion (R3). Random patient motion causes dynamic backgrounds across temporal sequences. However, during the intraoperative surgical phase, patients are typically under anesthesia and incapable of movement. Thus, our work is aimed at temporal modeling with static backgrounds. In fact, the two public datasets we adopted have been preprocessed to ensure static backgrounds. Moreover, the proposed image-to-video synthesis framework is demonstrated to generate temporal frames with good background consistency.
Minor issues (R2). 1) The Gaussian noise samples $G_1$ and $G_2$ are randomly sampled within two individual networks; thus, they are different. 2) The variable $L$ refers to the number of axial slices in the anatomical structures.
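To illustrate point 1): the two stages draw their initial noise independently, with no shared seed or noise reuse between them. A trivial sketch with hypothetical shapes:

```python
import torch

# G_1 and G_2 are drawn independently, one per diffusion network.
g1 = torch.randn(1, 1, 64, 64)  # initial noise for the TDDM stage
g2 = torch.randn(1, 1, 64, 64)  # initial noise for the I2V stage
assert not torch.equal(g1, g2)  # almost surely distinct samples
```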
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A