Abstract
In medical imaging, 4D MRI enables dynamic 3D visualization, yet the trade-off between spatial and temporal resolution requires prolonged scan time that can compromise temporal fidelity—especially during rapid, large-amplitude motion. Traditional approaches typically rely on registration-based interpolation to generate intermediate frames. However, these methods struggle with large deformations, resulting in misregistration, artifacts, and diminished spatial consistency. To address these challenges, we propose TSSC-Net, a novel framework that generates intermediate frames while preserving spatial consistency. To improve temporal fidelity under fast motion, our diffusion-based temporal super-resolution network generates intermediate frames using the start and end frames as key references, achieving 6x temporal super-resolution in a single inference step. Additionally, we introduce a novel tri-directional Mamba-based module that leverages long-range contextual information to effectively resolve spatial inconsistencies arising from cross-slice misalignment, thereby enhancing volumetric coherence and correcting cross-slice errors. Extensive experiments were performed on the public ACDC cardiac MRI dataset and a real-world dynamic 4D knee joint dataset. The results demonstrate that TSSC-Net can generate high-resolution dynamic MRI from fast-motion data while preserving structural fidelity and spatial consistency.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2515_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/Joker-ZXR/TSSC-Net
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ZhoXua_ADiffusionDriven_MICCAI2025,
author = { Zhou, Xuanru and Liu, Jiarun and Yu, Shoujun and Yang, Hao and Li, Cheng and Tan, Tao and Wang, Shanshan},
title = { { A Diffusion-Driven Temporal Super-Resolution and Spatial Consistency Enhancement Framework for 4D MRI imaging } },
booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15969},
month = {September},
pages = {2 -- 12}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors present a temporal super-resolution framework for generating intermediate frames within sequences of cardiac and knee joint MRI. The key contributions include a cross-frame attention diffusion-based temporal super-resolution network and a tri-dimensional Mamba-based module designed to address spatial inconsistencies. The proposed approach demonstrates strong performance on both a public dataset and a private one.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The main strength of TSSC-Net lies in its ability to achieve a 6× increase in temporal resolution while outperforming three distinct registration-based methods.
Modularity: The introduction of a separate 3D spatial consistency enhancement module facilitates the analysis of individual component contributions and allows for integration and testing with alternative methods.
Evaluation: The inclusion of qualitative comparisons across two distinct datasets enhances the ability to benchmark the approach against other methods.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Limited application: While the 6× increase in temporal resolution helps reduce otherwise long MRI acquisition times, the applicability to downstream tasks is significantly constrained by how the sequence is generated. The authors did not provide any discussion on how the network adapts to varying motion patterns, which limits its broader usability.
Experimental decisions: Certain aspects of the experimental design remain unclear to the reviewer. Specifically, in the CMR experiments, the sequence is extended to 12 frames—but how were these frames selected? Does the range span from end-diastole (ED) to end-systole (ES)?
Clinical implications: Although the framework improves temporal resolution, its clinical relevance—particularly in cardiac MRI—seems limited to functional analysis. However, it remains unclear how well cardiac function is preserved in the generated sequences. Additionally, common clinical metrics such as ejection fraction (EF) do not require intermediate frames, raising questions about the added value of this resolution enhancement.
Evaluation across modalities: While the proposed method shows improved performance compared to prior work, the article would benefit from a more detailed analysis of performance differences across the two evaluated datasets. For example, to what do the authors attribute the lower performance observed on the knee MRI dataset?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The reviewer finds the article well-written, with a strong methodology and excellent presentation. The proposed stages represent a valuable contribution to the field. However, the article would greatly benefit from a more detailed discussion of each dataset’s results and the design choices made when adapting each data modality to the model. Additionally, incorporating further insights in the conclusion, particularly regarding the sources of performance differences between data modalities and their clinical implications, would significantly enhance the impact of this research.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The authors of this paper introduce TSSC-Net, a framework that can generate intermediate missing frames in longitudinal data with high temporal coherence, spatial consistency and high fidelity. The framework consists of a two-step generation process: first, a 2DxT slice-wise generation that employs a transformer architecture with factorized spatio-temporal attention in the denoiser, and second, a tri-directional Mamba-based module to capture long-range dependencies and resolve any spatial and volumetric inconsistencies. This method achieves high performance metrics on two dynamic 3D datasets with large motion between frames, a key challenge in frame interpolation and generation tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This approach results in a 6x increase in temporal resolution compared to the input images in a single inference step. The Mamba module, known for its computational efficiency and superior performance compared to transformers and CNNs, was used to effectively incorporate information from multiple directions of the synthesized images to ensure volumetric consistency. The authors have provided thorough validation against other deformation-based interpolation techniques and conducted an ablation on the spatial consistency enhancement module, showing superior performance for their final proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Ablations for the necessity of the combined loss terms are missing, raising questions about the impact on performance if one of the three loss terms (MSE, Wavelet Transform, or Total Variation Smoothness) is either removed or assigned a lower weight. However, ablations on the role of the spatial consistency network have been provided, showing enhanced volumetric coherence and increased performance when this module is included. Despite these improvements, the performance metrics for the proposed method are very close to the baseline model “UVI-Net.” Moreover, the use of a deterministic spatial refinement network after the diffusion model might limit the variability in the final generated images, which may negatively affect the diversity of the synthesized frames.
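Editorial illustration: to make concrete what such a loss ablation would vary, the sketch below combines the three terms named above (MSE, a wavelet-domain term, and total variation smoothness) as a weighted sum. Everything specific here is an assumption and not taken from the paper: the one-level Haar transform, the L1 distance on sub-bands, the function names, and the weights `w_wavelet` and `w_tv` are hypothetical placeholders. Setting a weight to zero corresponds to removing that term.

```python
import numpy as np

def haar_subbands(x):
    """One-level 2D Haar transform of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-pass approximation
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def tv_smoothness(x):
    """Mean absolute difference between neighbouring pixels along both axes."""
    return np.abs(np.diff(x, axis=0)).mean() + np.abs(np.diff(x, axis=1)).mean()

def combined_loss(pred, target, w_wavelet=0.1, w_tv=0.01):
    """Weighted sum of MSE, wavelet and TV terms (weights are hypothetical)."""
    mse = np.mean((pred - target) ** 2)
    wavelet = sum(
        np.mean(np.abs(p - t))
        for p, t in zip(haar_subbands(pred), haar_subbands(target))
    )
    return mse + w_wavelet * wavelet + w_tv * tv_smoothness(pred)

rng = np.random.default_rng(0)
target = rng.random((8, 8))
pred = target + 0.05 * rng.standard_normal((8, 8))

full = combined_loss(pred, target)
mse_only = combined_loss(pred, target, w_wavelet=0.0, w_tv=0.0)  # ablated variant
```

Comparing `full` with `mse_only` (or with only one weight zeroed) on held-out data is the kind of comparison the missing ablation would report.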
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Here are some additional comments that I would prefer be addressed:
Is the 6x improvement in temporal resolution a design choice or was this resolution empirically determined?
For the transformer backbone of the denoiser in the diffusion model, how are the latent representations of I_0 and I_1 incorporated as conditions in the training/sampling process? Additional details on the conditioning mechanism are needed.
It is mentioned on page 4 that the forward diffusion process is performed on image I_1, which I presume is the target also given as condition. Without the availability of the intermediate frames as ground truth while training the diffusion model, how is it optimized and how does the model learn to predict appropriate levels of noise for the inference?
For the baseline comparison, more details are required on how the interpolation was performed for registration networks such as VoxelMorph and TransMorph.
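Editorial illustration: the author feedback (A7) states that the registration-based baselines predict a deformation field from I₀ to I₁ and linearly interpolate it to obtain intermediate warped frames. One plausible reading of that description is sketched below in 2D with NumPy only. The function names, the backward-warping convention, and the bilinear sampler are assumptions for illustration; this is not the VoxelMorph or TransMorph code, and the displacement field here is hand-made instead of predicted by a network.

```python
import numpy as np

def warp_bilinear(image, displacement):
    """Backward-warp a 2D image: out[y, x] = image[y + dy, x + dx].

    displacement has shape (2, H, W) holding (dy, dx) in pixels.
    Sampling positions are clamped to the image border.
    """
    h, w = image.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(yy + displacement[0], 0, h - 1)
    x = np.clip(xx + displacement[1], 0, w - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def interpolate_frames(i0, displacement, num_intermediate):
    """Scale the full I0-to-I1 displacement linearly in time and warp I0."""
    frames = []
    for k in range(1, num_intermediate + 1):
        tau = k / (num_intermediate + 1)  # fraction of the full motion
        frames.append(warp_bilinear(i0, tau * displacement))
    return np.stack(frames)

# Toy example: a single bright pixel that moves 4 pixels to the right.
i0 = np.zeros((9, 9))
i0[4, 2] = 1.0
displacement = np.zeros((2, 9, 9))
displacement[1] = -4.0  # backward field: sample 4 pixels to the left

frames = interpolate_frames(i0, displacement, num_intermediate=3)
columns = [int(np.argmax(f[4])) for f in frames]
```

Because the motion is assumed to be a linear fraction of a single field, any error in that field propagates to every intermediate frame, which is the failure mode the abstract attributes to large deformations.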
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a novel diffusion model to increase the temporal resolution of 4D dynamic images by generating intermediate frames with high fidelity and temporal consistency. However, the paper lacks certain implementation details and ablations on the proposed loss functions, raising several questions about the training and inference processes.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper introduces TSSC-Net, a two-stage framework for interpolating 3D+t MR images. The method combines a diffusion-based model for temporal super-resolution with a Mamba-based module for spatial consistency enhancement. Experimental results on both cardiac (ACDC) and dynamic knee MRI datasets show improvements over registration-based baselines, indicating the potential of the proposed method in handling temporal interpolation tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The integration of diffusion models and Mamba-based refinement blocks is technically sound and well-motivated, enabling both smooth temporal interpolation and improved 3D spatial consistency.
- The framework demonstrates competitive performance on datasets with different motion characteristics/modalities, showcasing robustness across anatomical regions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Inappropriate Dataset Usage: The ACDC cardiac dataset is not ideally suited for evaluating 3D spatial consistency, as it exhibits high anisotropy (slice thickness of 5–10mm versus in-plane resolution ~1.5mm) (https://ieeexplore.ieee.org/abstract/document/8360453). However, the model’s spatial enhancement module assumes isotropic resolution, treating all spatial axes equally (as shown in Fig. 2), which may compromise anatomical fidelity and limit performance in the cardiac dataset. Additionally, the reported resampled volume size (256×256×32) is atypical for CMR—more explanation on the resampling strategy is needed.
- Lack of Methodological Clarity: Key components of the framework are under-explained. For example, the symbol “T” in Figure 1 is undefined, and the conditioning process in the diffusion model is not clearly described.
- Limited Baseline Comparisons: Although the paper critiques GAN-based interpolation methods, it does not include any in the experimental comparisons. A fair evaluation should include GAN-based baselines to support the claim of superiority.
- No Clinical Evaluation: While the paper shows strong quantitative and visual results, there’s no physician-based or task-driven assessment of clinical usefulness, such as a comparison of segmentation results.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Typos should be checked, such as “de-sign” and “lev-eraging” on page 4.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a promising approach, but the issues with dataset selection, methodological clarity, and incomplete baseline evaluation significantly weaken the overall contribution. I recommend rejection in its current form.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank the reviewers for acknowledging our contributions and providing valuable feedback. Our responses follow.
Reviewer #1
(Q1: Limited application.) A1: Our method generalizes to varying motion patterns by conditioning on start and end frames and using cross-frame attention to capture both small and large motions.
(Q2: Experimental decisions.) A2: The 12-frame sequences are designed to cover the entire motion process. For cardiac data, this spans from end-diastole (ED) to end-systole (ES); for knee data, it includes the full flexion-extension cycle.
(Q3: Clinical implications.) A3: Higher temporal resolution enables more detailed motion assessment (e.g., regional wall motion abnormalities), supplementing EF with more diagnostic information.
(Q4: Evaluation across modalities.) A4: The lower performance on the knee MRI dataset is primarily due to its complex, non-rigid joint motion and greater inter-frame variability compared to the relatively periodic and constrained motion in cardiac data.
Reviewer #2
(Q5: Whether the 6× temporal resolution is a design choice or empirical.) A5: It is a design choice to match standard 12-frame clinical sequences.
(Q6: Additional details on the conditioning mechanism and training/inference process.) A6: During training, we add noise to the intermediate frames and train the model to predict that noise, conditioned on I₀ (start frame) and I₁ (end frame). This allows the model to learn the noise distribution of intermediate dynamics. At inference, intermediate frames are generated from random noise via reverse diffusion guided by I₀ and I₁.
(Q7: More details on registration-based baselines.) A7: The registration-based methods predict deformation fields from I₀ to I₁ and perform linear interpolation to generate intermediate warped frames.
(Q8: Ablations on combined loss terms.) A8: Preliminary results show that removing wavelet and TV losses reduces performance, with PSNR dropping to 32.663 dB and SSIM to 0.970. We will include more detailed ablation studies in future work.
(Q9: Spatial refinement may reduce image diversity.) A9: While diversity can be valuable, clinical imaging prioritizes anatomical accuracy. Our spatial refinement module focuses on enhancing structural consistency, which is the main objective.
Reviewer #3
(Q10: More explanation on the resampling strategy.) A10: We will provide more details on the resampling strategy in the final version.
(Q11: Lack of methodological clarity.) A11: “T” denotes the number of diffusion steps. Conditioning is implemented by concatenating I₀ and I₁ with the noisy input and applying cross-frame attention. We will clarify this in the final version.
(Q12: Missing GAN-based baselines in comparison.) A12: It is a good suggestion and we will take it into consideration in our future work.
(Q13: No clinical evaluation.) A13: We agree and plan to include clinical feedback or segmentation-based evaluations in future studies.
(Q14: Typos such as “de-sign” and “lev-eraging”.) A14: We apologize and will correct all typographic issues in the final version.
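Editorial illustration: responses A6 and A11 describe training by adding noise to the intermediate frames, predicting that noise, and conditioning by concatenating I₀ and I₁ with the noisy input. A minimal NumPy sketch of how one such training pair could be assembled is given below. It assumes a standard DDPM-style forward process with a linear beta schedule, five intermediate frames, and concatenation along the frame axis; none of these specifics are confirmed by the rebuttal, the function names are hypothetical, and the cross-frame attention denoiser itself is omitted.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule (the paper's schedule is not stated here)."""
    return np.linspace(beta_start, beta_end, num_steps)

def make_training_example(i0, i1, intermediates, t, alpha_bar, rng):
    """Build one (model_input, noise_target) pair for noise-prediction training.

    i0, i1:        start and end frames, shape (H, W)
    intermediates: ground-truth intermediate frames, shape (K, H, W)
    t:             diffusion step index into alpha_bar
    """
    noise = rng.standard_normal(intermediates.shape)
    # Forward diffusion: blend clean frames with Gaussian noise at step t.
    noisy = np.sqrt(alpha_bar[t]) * intermediates + np.sqrt(1.0 - alpha_bar[t]) * noise
    # Conditioning by concatenation (assumed here to be along the frame axis).
    model_input = np.concatenate([i0[None], noisy, i1[None]], axis=0)
    return model_input, noise

def noise_prediction_loss(predicted_noise, noise_target):
    """Mean squared error between predicted and true noise."""
    return np.mean((predicted_noise - noise_target) ** 2)

rng = np.random.default_rng(0)
num_steps = 1000  # "T" in A11: the number of diffusion steps (value assumed)
alpha_bar = np.cumprod(1.0 - linear_beta_schedule(num_steps))

i0 = rng.random((8, 8))
i1 = rng.random((8, 8))
mid_frames = rng.random((5, 8, 8))  # hypothetical count of intermediate frames

model_input, target = make_training_example(i0, i1, mid_frames, 500, alpha_bar, rng)
```

At inference the same input layout would be used, with the intermediate slots initialised from random noise and refined step by step by the trained denoiser while I₀ and I₁ stay fixed.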
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A