Abstract
This paper presents a novel motion feature guided diffusion model for unpaired video-to-video translation (MFD-V2V), designed to synthesize dynamic, high-contrast cine cardiac magnetic resonance (CMR) from lower-contrast, artifact-prone displacement encoding with stimulated echoes (DENSE) CMR sequences. To achieve this, we first introduce a Latent Temporal Multi-Attention (LTMA) registration network that effectively learns more accurate and consistent cardiac motions from cine CMR image videos. A multi-level motion feature guided diffusion model, equipped with a specialized Spatio-Temporal Motion Encoder (STME) to extract hierarchical coarse-to-fine motion conditioning, is then developed to improve synthesis quality and fidelity. We evaluate our method, MFD-V2V, on a comprehensive cardiac dataset, demonstrating superior performance over the state-of-the-art in both quantitative metrics and qualitative assessments. Furthermore, we show the benefits of our synthesized cine CMRs in improving downstream clinical and analytical tasks, underscoring the broader impact of our approach. Our code is publicly available at https://github.com/SwaksharDeb/MFD-V2V.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4670_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/4670_supp.zip
Link to the Code Repository
https://github.com/SwaksharDeb/MFD-V2V
Link to the Dataset(s)
N/A
BibTex
@InProceedings{DebSwa_Unsupervised_MICCAI2025,
author = { Deb, Swakshar and Wu, Nian and Epstein, Frederick H. and Zhang, Miaomiao},
title = { { Unsupervised Cardiac Video Translation Via Motion Feature Guided Diffusion Model } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper investigates unpaired cine CMR synthesis from DENSE CMR. The authors introduce a latent temporal multi-head attention registration network and a spatio-temporal motion encoder to enable CMR motion perception, and validate synthesis performance on a newly collected multi-site dataset.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Developing unsupervised cardiac video translation methods is a critical topic, given the scarcity of paired cine and DENSE video data.
- The proposed framework achieves promising synthesis results in translating DENSE CMR to cine CMR videos.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Please describe the novelty of the proposed LTMA in a straightforward way. What is the difference from traditional MSA?
- How does the proposed STME differ from vanilla multi-level feature fusion? From my perspective, it is an incremental module consisting of several attention layers. Why is it important for motion features?
- In Table 2, what is LTA? Please clarify whether it differs from LTMA.
- It would be helpful to introduce some unsupervised evaluation metrics, such as the Inception Score (IS), for cine CMR synthesis.
- More extensive experiments and visualizations should be provided to evaluate the downstream tasks.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors should clarify their novelty more convincingly. LTMA and STME read like modified multi-head self-attention modules with no significant technical contributions, and their motivations remain unclear.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors clarified the novelty of LTMA and mostly addressed my concerns; I tend to accept at this stage.
Review #2
- Please describe the contribution of the paper
This work introduces MFD-V2V, an unsupervised video-to-video (V2V) translation framework that synthesizes high-contrast cine cardiac MR (CMR) sequences from low-SNR DENSE CMR inputs. The model uniquely combines a Latent Temporal Multi-Attention (LTMA) registration network to capture temporal motion and a Spatio-Temporal Motion Encoder (STME) to extract multi-level motion features. These features guide a diffusion-based video generative model, ensuring both anatomical accuracy and temporal coherence in the synthesized cine CMR sequences.
Experiments on a large, multi-site CMR dataset show that MFD-V2V outperforms several baselines—including GAN- and diffusion-based models—across metrics like FID, KID, FVD, and FID-VID. The synthesized cine CMR sequences also improved segmentation performance on a downstream task, indicating practical clinical relevance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel Formulation through Integration of Components: While the individual components (e.g., video diffusion models, attention-based motion encoders, and registration networks) have prior art, the specific integration of a motion-aware registration network (LTMA), multi-level motion feature encoding (STME), and a video diffusion model within the medical imaging domain is new. The key novelty lies in formulating a fully unsupervised motion-conditioned video synthesis pipeline, which has not previously been applied to DENSE-to-cine CMR translation.
- Profound Evaluation Strategy: The evaluation is particularly well structured. The authors compare against both GAN-based (CycleGAN, RecycleGAN, etc.) and diffusion-based (VDM, ControlNet) baselines. With unpaired data, they use distributional video quality metrics (FID, KID, FVD, and FID-VID) that are well suited for unpaired, video-domain synthesis. An ablation study clearly quantifies the contribution of LTMA and STME to overall model performance.
- Clinical Feasibility Demonstrated: The proposed method targets a real, clinically relevant problem. It shows that generative models can augment low-SNR imaging sequences to meet analysis and diagnostic needs, a tangible translational outcome, and demonstrates practical benefit by improving automated segmentation.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited Architectural Novelty: While the combination is new in this context, the individual methods are well known and not extensively innovated upon. No fundamental architectural breakthrough is introduced.
- Insufficient Experimental Design Details: The manuscript lacks clarity regarding the experimental setup, including the train/test split, temporal registration methods, and frame extraction. Moreover, methods are compared on a single test set without mention of cross-validation or multiple runs, which could lead to biased performance estimates. While the specificity of the problem may pose challenges in curating large datasets, publicly available DENSE and cine cardiac MRI datasets could be leveraged (details are provided in the additional comments below).
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Suggestions and Requests for Clarification:
- Train/Test Split: Please clarify how the dataset was partitioned for training and testing.
- Frame and Slice Selection: Explain how cine and DENSE CMR sequences were sampled. Which slices from the cardiac volume were selected, and why?
- Temporal Registration Between Modalities: How did you temporally align cine and DENSE sequences? Did you use anchor points such as the ED/ES frames, or some form of motion-based interpolation?
- Use of LV Contour Masks: You mention manual annotations for LV masks; please clarify their role. Were they used for evaluation, model training (e.g., registration supervision), or the segmentation experiments?
- External Evaluation: Although paired datasets are scarce, your method does not require alignment. In the future, testing on external CMR and DENSE datasets (e.g., OCMR, CMRxRecon) would significantly strengthen generalizability.
- Improving Qualitative Results: I suggest selecting a different sequence for Fig. 3. The current example does not convincingly show LV wall thickening across frames. Since your dataset includes diverse cardiac conditions, please consider showing a case that clearly demonstrates contractile motion.
- Figure Usage: Fig. 2 could be reduced to allow more space for visual or quantitative comparison across baselines.
- Downstream Segmentation Details: You mention evaluating the synthesized cine CMR on a downstream segmentation task using a pretrained 3D U-Net. I understand page space is limited, but could you explain the design of this downstream experiment in more detail? Is the 3D U-Net pretrained on cine CMR or DENSE? If it is pretrained on cine CMR, it will surely fail on DENSE images.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The strength of this work lies more in its integration and application to an underexplored but clinically meaningful task. I appreciate authors efforts in implementation and conducting experiments on various methods. The methodologies are well explained while the details of experiments need more clarifications. Please see the additional comments for more details.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper proposes an unsupervised video-to-video (V2V) translation framework named MFD-V2V to synthesize high-contrast cine cardiac MR sequences from low-SNR DENSE MR sequences. The method consists of (1) a Latent Temporal Multihead Attention (LTMA) registration network that extracts temporally consistent motion fields from cine sequences, and (2) a video diffusion model conditioned on multi-level motion features derived from a Spatio-Temporal Motion Encoder (STME). The model is trained without paired data and evaluated on a multi-site dataset. Results show improved performance over GAN and diffusion baselines using FID, KID, FVD, and FID-VID. A downstream segmentation task is also used to show the clinical utility of the generated sequences.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Proposes a motion-conditioned diffusion framework for unpaired cardiac video translation, which is not commonly addressed in prior work.
- Introduces a registration module (LTMA) using multihead attention, which avoids recurrent modeling and is computationally efficient.
- The STME module enables multi-scale motion feature extraction and is shown to improve synthesis when ablated.
- Evaluation includes multiple baselines (GANs, VDM, ControlNet) and multiple metrics suited for video generation.
- Demonstrates improved segmentation accuracy using generated cine data, showing application beyond synthesis.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While DENSE motion fields are used during inference, it is not clearly discussed how robust the model is to noise or artifacts in those fields.
- The evaluation lacks metrics related to clinical usability, such as expert grading or task-based reader studies.
- The segmentation experiment uses a pretrained 3D U-Net, but implementation and setup details (e.g., data splits, preprocessing, whether trained on real or synthesized data) are underspecified.
- The method relies on synthetic supervision from a learned motion field rather than ground truth cine-DENSE alignment, which can introduce biases not discussed.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The paper does not discuss potential failure modes. How does the method perform under pathological conditions (e.g., infarcted myocardium) or with imaging artifacts such as poor gating? A brief error analysis or qualitative examples of failure cases would strengthen the paper.
- The STME module is presented without explanation of its internal structure. Please clarify the feature dimensionality, how the attention maps behave, and whether the extracted features correlate with known anatomical motion patterns.
- The method uses a hybrid of convolution and transformer layers, but only in the motion registration component. Why not consider a fully transformer-based architecture? Using transformer layers throughout could better capture long-range dependencies across space and time.
- The manuscript states that cine and DENSE sequences are unpaired, yet also mentions they are spatially and temporally aligned. Please clarify whether the sequences are from the same patient, and how training/test split integrity was ensured to avoid leakage.
- The segmentation task is under-specified. It is unclear what anatomical structure was segmented, what data were used, whether the model was trained or only tested on synthetic data, and how ground truth was obtained. These details are important to assess clinical relevance.
- The images were cropped to a fixed resolution, but the criteria for cropping are not described. Please clarify whether the cropping was centered on the myocardium, done manually, or automated, and whether it could introduce bias or affect generalization.
- Unpaired video-to-video translation lacks a clear ground truth. Please discuss how ambiguity in the mapping is handled. Are there any regularization mechanisms (e.g., motion consistency or anatomical priors) that help constrain synthesis? Could this lead to artifacts?
- As a minor point, references should be ordered by first appearance in the text.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a new method for translating DENSE to cine MRI using a motion-guided diffusion model. The approach is technically reasonable and shows better results than existing methods. However, several important details are missing, including how segmentation was done, how data were split, and how the method performs in challenging clinical cases. There is also limited explanation of some components. While the idea is useful and the results are promising, the paper would benefit from clearer presentation and discussion. I give a weak accept based on the originality and potential usefulness of the method.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors gave clear answers to the main concerns, including how the experiments were set up and how the model works. The method is not a big architectural change, but the way it combines known parts for this specific medical task is useful and shows good results. The response was well explained, and I keep my weak accept.
Author Feedback
We thank all reviewers for their valuable comments and suggestions.
[R1 & R2 & R3]
[Experimental Details] First, we used a 70/10/20 train/validation/test split at the subject level for all experiments, selecting short-axis slices of the left ventricular (LV) myocardium from both cine and DENSE sequences. Second, for the downstream segmentation task, the network was trained exclusively on standard-quality cine CMR images to predict LV segmentations. Experimental results show that inference on low-quality DENSE images led to a significant drop in performance (dice: 40.2%), while using our generated cine-like DENSE images restored segmentation accuracy to 81%. All evaluation metrics are averaged over 10 independent runs.
[R1]
[Model Novelty & Contribution] Our main contribution is the novel integration of temporal registration and multi-level motion feature guided diffusion within an unsupervised video-to-video translation framework for cardiac MRI, a direction that, to the best of our knowledge, has not been explored in the existing literature.
[Temporal Alignment] We agree that temporal alignment between cine and DENSE sequences is important during data preprocessing. In our experimental datasets, both sequences have comparable temporal resolutions, so no additional alignment was required.
[R2]
[Validation on Robustness & Potential Bias] While we haven’t explicitly quantified robustness to noise in the DENSE displacement fields, our architecture addresses this by extracting multi-level motion features via the STME module instead of conditioning directly on raw inputs, helping suppress noise and distill meaningful representations. We acknowledge that supervision from learned motion fields may introduce bias due to registration errors. However, given their good quality and the lack of ground truth cine-DENSE alignment, we find that the benefits of using this supervision strategy outweigh the potential drawbacks in our experiments. We will add related discussions in the revised manuscript.
[Spatial & Temporal Alignment] Both cine and DENSE sequences are acquired from the same patients but unpaired due to cardiac motions. In our paper, “spatial and temporal alignment” refers to matching spatial and temporal resolution (1 mm², 40 frames) across both modalities, as well as aligning the starting point of the cardiac cycle.
[More Architecture] The STME extracts motion features from 2D displacement fields of shape T×H×W×2 and outputs features of shape T×H×W×64. In contrast to fully transformer-based registration models, which are computationally intensive on high-dimensional images, our LTMA performs attention in the low-dimensional encoded velocity space, enabling efficient temporal reasoning at reduced computational cost.
[Ambiguity In Unpaired Translation] While we do not have paired cine-DENSE data for training or for evaluating the ambiguous mapping, one possible way to assess alignment is to compare myocardial strain between synthesized cine CMRs and the input DENSE, which we could not include due to space limits.
We agree that clinical validation is critical and plan to include it in an extended journal version.
[R3]
[LTMA Contribution] In contrast to traditional MSA, which is typically applied to input tokens, LTMA introduces a new approach by applying temporal attention directly in the encoded velocity space across time. This design allows the model to capture global temporal dependencies more effectively and generate consistent motion representations over time.
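The rebuttal's distinction, attention applied across time in the encoded velocity space rather than over input tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the latent dimension, head count, and random projection weights below are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_multihead_attention(z, num_heads=4, seed=0):
    """Self-attention across the time axis of encoded velocity latents.

    z: (T, d) -- one latent velocity vector per frame. Each head attends
    over all T frames, so every output latent mixes information from the
    whole cardiac cycle (the "global temporal dependencies" the rebuttal
    describes), without recurrent modeling.
    """
    T, d = z.shape
    assert d % num_heads == 0
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Toy query/key/value projections; a trained model would learn these.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = z @ Wq, z @ Wk, z @ Wv  # each (T, d)
    out = np.empty_like(z)
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(dh))  # (T, T)
        out[:, sl] = attn @ v[:, sl]
    return out

# 40 frames (the temporal resolution quoted in the rebuttal), toy latent size 32.
z = np.random.default_rng(1).standard_normal((40, 32))
y = temporal_multihead_attention(z)
print(y.shape)  # (40, 32)
```

Because the attention operates on T latent vectors instead of H×W image tokens, the (T, T) attention matrix stays small, which is the efficiency argument the rebuttal makes.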
[STME Feature Fusion] Unlike vanilla multi-level fusion, which simply aggregates features across layers, STME extracts motion representations by first applying 3D convolutions to learn local frame-wise correlations, followed by spatial and temporal attention to capture global anatomy and temporal dynamics. This motion-aware design is crucial for synthesizing realistic cine-like sequences, as shown in Table 2. The term “LTA” in Table 2 should be “LTMA”. We will report the IS in the revised version.
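The STME pipeline as described (local 3D convolution, then spatial and temporal attention over a T×H×W×2 displacement field) can be sketched as follows. This is a hypothetical NumPy approximation, not the paper's network: the feature width, kernel, and attention-reweighting scheme are illustrative stand-ins for learned layers.

```python
import numpy as np

def stme_sketch(disp, seed=0):
    """Hypothetical sketch of the described STME stages.

    disp: (T, H, W, 2) displacement field -> (T, H, W, C) motion features.
    C is a toy width here; the rebuttal reports 64 output channels.
    """
    T, H, W, _ = disp.shape
    C = 8
    rng = np.random.default_rng(seed)
    # Stage 1: lift the 2 displacement channels to C features (a 1x1x1
    # "conv"), then a 3-frame temporal box filter as a stand-in for the
    # local frame-wise correlation learned by 3D convolutions.
    W0 = rng.standard_normal((2, C)) / np.sqrt(2.0)
    feat = disp @ W0                                    # (T, H, W, C)
    pad = np.pad(feat, ((1, 1), (0, 0), (0, 0), (0, 0)), mode="edge")
    feat = (pad[:-2] + pad[1:-1] + pad[2:]) / 3.0
    # Stage 2: spatial attention -- softmax over the H*W locations of each
    # frame, reweighting anatomically salient regions.
    scores = feat.mean(-1).reshape(T, H * W)
    sw = np.exp(scores - scores.max(1, keepdims=True))
    sw = (sw / sw.sum(1, keepdims=True)).reshape(T, H, W, 1)
    feat = feat * (1.0 + H * W * sw) / 2.0
    # Stage 3: temporal attention -- softmax over frames, emphasizing
    # high-motion phases of the cardiac cycle.
    tscores = feat.mean((1, 2, 3))
    tw = np.exp(tscores - tscores.max())
    tw = tw / tw.sum()
    return feat * (1.0 + T * tw[:, None, None, None]) / 2.0

# Toy displacement field: 6 frames of an 8x8 grid with 2D displacements.
out = stme_sketch(np.random.default_rng(2).standard_normal((6, 8, 8, 2)))
print(out.shape)  # (6, 8, 8, 8)
```

The point of the sketch is the ordering: local mixing before global spatial and temporal reweighting, which is what distinguishes this design from plain multi-level feature concatenation.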
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A