Abstract

Synthesizing high-quality medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
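
For orientation, a minimal PyTorch-style sketch of the sequential spatial-temporal-channel design described in the abstract; standard softmax attention stands in here for the paper's linear-complexity WKV and global channel attention, and all names are illustrative, not the released FEAT code:

import torch
import torch.nn as nn

class SequentialSTCBlock(nn.Module):
    """Sketch: attention applied sequentially over the spatial, temporal,
    and channel dimensions of a video feature map (illustrative only)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-in for global channel attention: a learned channel mixing.
        self.channel = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) with N = H*W spatial tokens per frame.
        B, T, N, C = x.shape
        # 1) Spatial attention within each frame.
        s = x.reshape(B * T, N, C)
        s = s + self.spatial(s, s, s, need_weights=False)[0]
        # 2) Temporal attention at each spatial location.
        t = s.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        t = t + self.temporal(t, t, t, need_weights=False)[0]
        # 3) Channel interaction across the feature dimension.
        out = t.reshape(B, N, T, C).permute(0, 2, 1, 3)
        return out + self.channel(out)

# Example: 2 videos, 8 frames, a 16x16 latent grid, 64 channels.
x = torch.randn(2, 8, 256, 64)
print(SequentialSTCBlock(64)(x).shape)  # torch.Size([2, 8, 256, 64])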

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1131_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1131_supp.zip

Link to the Code Repository

https://github.com/Yaziwel/FEAT

Link to the Dataset(s)

N/A

BibTex

@InProceedings{WanHui_FEAT_MICCAI2025,
        author = {Wang, Huihan and Yang, Zhiwen and Zhang, Hui and Zhao, Dan and Wei, Bingzheng and Xu, Yan},
        title = {{FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {266--276}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces FEAT, a method for generating medical videos. Three main contributions are presented: integration of channel attention into the spatial-temporal attention, replacement of self-attention by a shift-based mechanism to ensure linear complexity, and the combination of video content and noise patterns as input embeddings.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the convincing quantitative and qualitative evaluation. The tables show the superior performance of the method, and the supplemental video shows very promising generated videos.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The major weakness of the paper is the lack of insight into the three main contributions. All three contributions seem to be adapted from works mainly from the language domain. They are properly cited; however, even after looking these papers up, I had trouble really understanding the point. For example, the “Shift” modules, which are obviously also part of the WKV attention mechanism, are adapted via 2D depth-wise convolutions. I know convolutions and I have an idea what a shift might be, but I have no idea what a 2D depth-wise convolutional shift is all about. It would be helpful to introduce this in more detail as a formula. Another example is the Residual Value Guidance Module. Although there is a formula explaining the math behind the input embedding Z, I would love to get more details on how Z is computed and what exactly distinguishes it from SOTA input embeddings.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Just two minor additional comments: To allow space for more insights and details of the method, as mentioned in the weakness section, I suggest omitting the repetition of the drawbacks of SOTA methods, which are stated in the abstract, the introduction, and Section 2.2. Additionally, the downstream task might be less important than the details of the method and the ablation study. On page 5, third-to-last row: one of the K_i should be a V_i.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The results are so promising that I suggest accepting the paper in principle. However, such a methodological paper should focus on details of the method and should be more or less self-contained. This is definitely worthy of improvement.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present FEAT, a novel approach to synthesizing high-quality dynamic medical videos. They highlight the challenges in Transformer-based methods, such as insufficient channel interactions and high computational costs. To capture global dependencies, the authors utilize a unified spatial-temporal-channel attention mechanism. At the same time, the proposed model utilizes WKV and global channel attention to reduce computational complexity. A residual value guidance module (ResVGM) is proposed for fine-grained, pixel-level guidance. This model has the potential to alleviate the data sparsity issue when developing medical image analysis models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors’ claims are backed up by extensive experiments, including a downstream task and ablation studies. The ablation studies demonstrate that each proposed change leads to an improvement.

    The proposed method achieves SOTA results with a small number of parameters compared to other methods.

    The downstream task shows the potential of generating new medical video datasets for clinical training.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Can the authors comment on their Transformer block compared to the efficient attention proposed by Shen et al. [1]?

    [1] Shen, Zhuoran, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. “Efficient attention: Attention with linear complexities.” In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 3531-3539. 2021.

    PolyDiag utilizes I3D to extract features; at the same time, FVD also uses I3D to evaluate the Fréchet distance between the generated and data distributions. As argued in [2], I3D is more sensitive to spatial distortion than to temporal distortion, so the results might not paint the full picture regarding the quality of the generated videos. A visual Turing test could therefore be an additional metric to assess the effectiveness of FEAT.

    [2] Ge, Songwei, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, and Jia-Bin Huang. “On the content bias in fréchet video distance.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7277-7288. 2024.

    For the quantitative comparison, it would be better to include a more recent diffusion-based baseline, for example, VideoCrafter2 [3].

    [3] Chen, Haoxin, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. “Videocrafter2: Overcoming data limitations for high-quality video diffusion models.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7310-7320. 2024.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed work could potentially be impactful; however, it is hard to gauge the results without comparing to more recent SOTA methods.

    Reproducibility is questionable based on the manuscript alone.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces FEAT, a full-dimensional efficient attention Transformer designed to serve as the core denoising network within diffusion-based medical video generation frameworks. It addresses key limitations of existing architectures through three innovations: (1) a spatial-temporal-channel attention framework capturing full-dimensional dependencies; (2) an efficient attention design using WKV and global channel attention to achieve linear complexity; (3) a novel Residual Value Guidance Module (ResVGM) for fine-grained pixel-level denoising guidance. Extensive experiments show strong gains over prior methods (e.g., Endora) in generation quality and downstream clinical utility.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Efficient yet effective. The use of WKV attention (RWKV-inspired) and global channel attention allows FEAT to outperform SOTA with far fewer parameters. FEAT-S achieves comparable performance to Endora with 23% of its size; FEAT-L achieves new SOTA.
    • Novel and lightweight ResVGM. Enhances timestep guidance by conditioning on the noisy input embedding, enabling content-aware denoising without additional priors and with minimal overhead.
    • Strong experiments. FEAT consistently outperforms baselines across two medical video datasets on FVD, FID, and IS metrics. Ablation studies validate the incremental value of each component.
    • Clinical relevance. Synthetic videos generated by FEAT significantly boost semi-supervised polyp classification (PolyDiag), showing clear benefit in data augmentation for medical AI.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Limited modality scope. Experiments are limited to endoscopy videos (Colonoscopic, Kvasir-Capsule), which may limit generalizability to other medical video modalities.
    • “Unified” attention is sequential. Full-dimensional attention is implemented via separate sequential modules, not a joint mechanism. The claim could be clarified.
    • No runtime benchmarks. Theoretical linear complexity is well-argued, but practical speed/memory improvements over baselines are not reported.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend accepting this paper. FEAT represents a thoughtful and practically impactful contribution to medical video generation. While it builds on existing components such as WKV attention and channel-wise modeling, the integration into a unified, efficient, and full-dimensional architecture tailored to diffusion-based video generation is both timely and well-executed. The proposed ResVGM module, though conceptually related to prior conditioning methods, introduces an efficient internal alternative to external priors and improves performance without added complexity. The model demonstrates strong empirical gains over the current state-of-the-art on multiple benchmarks and shows tangible benefits for downstream clinical tasks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely appreciate the reviewers for acknowledging our methodological contribution and providing constructive comments for further clarification. Our feedback is as follows.

Q1(R1): Lack of insight into the three contributions A1: Our objective is to develop a video-generation backbone that is both efficient and effective. From the standpoint of backbone design, we evaluate three key aspects: dimensional modeling, computational efficiency, and adaptiveness. Current architectures, however, (1) omit processing along the channel dimension, (2) incur quadratic computational complexity due to self‑attention, and (3) lack fine‑grained guidance for adaptively handling inputs at varying noise levels. Thus, we introduce channel attention from the image restoration domain, adopt WKV attention from the NLP domain, and newly propose ResVGM, addressing the three issues above respectively.

Q2(R1): 2D depth-wise convolutional shift A2: In the original RWKV, the shift mechanism strengthens local dependencies by linearly interpolating information from the preceding token into the current token. We extend this concept to 2D spatial data using the widely adopted 2D depth‑wise convolution, which shifts information from neighboring pixels to enrich local context; because it applies a separate spatial filter to each channel, it extracts per‑channel features while dramatically reducing computation compared to standard convolutions. We will describe this in detail in the revision.
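
To make this concrete, a minimal sketch of the shift idea under the description in A2 (a 3x3 depth-wise convolution is an assumption here, and none of this is the released FEAT code): in 1D, RWKV's token shift mixes each token with its predecessor; a 2D analogue can be realized with one small spatial filter per channel, so neighboring pixels are mixed in at a cost linear in the channel count.

import torch
import torch.nn as nn
import torch.nn.functional as F

def token_shift_1d(x: torch.Tensor, mu: float = 0.5) -> torch.Tensor:
    # Original RWKV shift: interpolate each token with its predecessor.
    # x: (B, T, C); pad one step in time so token t sees token t-1.
    prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
    return mu * x + (1.0 - mu) * prev

def depthwise_shift_2d(channels: int) -> nn.Conv2d:
    # 2D analogue: one 3x3 spatial filter per channel (groups=channels)
    # shifts in information from neighboring pixels at O(C) cost per pixel,
    # versus O(C^2) for a standard convolution over the same window.
    return nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

x = torch.randn(1, 64, 16, 16)          # (B, C, H, W) feature map
print(depthwise_shift_2d(64)(x).shape)  # torch.Size([1, 64, 16, 16])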

Q3(R1): Derivation of Z A3: Z is simply the embedding of the input (which is also the denoised result from the previous timestep and thus contains detailed noise information), obtained via convolution. Z serves as feature‑level guidance added to subsequent layers, ensuring that the features are progressively refined based on it. This gives the backbone more adaptiveness in handling inputs at different noise levels. We will describe this in detail in the revision.
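
One possible reading of A3 as code (a sketch under the stated description only; class and layer names are assumptions, not the authors' implementation): Z is a convolutional embedding of the noisy input x_t, added residually so every subsequent layer is conditioned on the actual noise content rather than only on a coarse timestep embedding.

import torch
import torch.nn as nn

class ResidualValueGuidanceSketch(nn.Module):
    """Illustrative reading of A3: Z = conv(x_t) is added residually to
    intermediate features as pixel-level guidance (shapes are assumptions)."""

    def __init__(self, in_ch: int, dim: int, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=3, padding=1)  # Z = conv(x_t)
        self.layers = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1) for _ in range(n_layers)
        )

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        z = self.embed(x_t)      # pixel-level guidance from the noisy input
        h = z
        for layer in self.layers:
            h = layer(h) + z     # features are progressively refined against Z
        return h

x_t = torch.randn(1, 4, 32, 32)  # noisy latent at some diffusion timestep
print(ResidualValueGuidanceSketch(4, 64)(x_t).shape)  # torch.Size([1, 64, 32, 32])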

Q4(R1): Make space for insights and method A4: We will take your suggestion into consideration in the final revision.

Q5(R1): K_i should be a V_i A5: We will correct it.

Q6(R2): Comparison with efficient attention A6: The linear attention [1] introduced by Shen et al. is closely related to WKV attention, but the two represent distinct strategies for achieving linear-complexity attention. Linear attention [1] lowers the computational cost by restructuring the standard attention product (QK')V into Q(K'V), yielding a complexity of O(TC^2). In contrast, WKV attention redefines the computation as (W+K)V, which runs in O(TC). Consequently, WKV attention, at O(TC), is more efficient than the O(TC^2) linear attention of [1]. We will properly cite and discuss [1].
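
A toy numerical sketch of the contrast drawn in A6 (heavily simplified; the feature maps and the per-channel decay/bonus follow the spirit of [1] and RWKV but not their exact definitions): linear attention materializes a CxC matrix K'V, hence O(TC^2), while a WKV-style recurrence carries only per-channel running sums, hence O(TC).

import torch

T, C = 128, 64
q = torch.randn(T, C)
k = 0.1 * torch.randn(T, C)
v = torch.randn(T, C)
w = torch.randn(C)  # per-channel decay (RWKV-style)
u = torch.randn(C)  # per-channel bonus for the current token

# Linear attention [1]: Q(K'V). K'V is a CxC matrix -> O(T*C^2).
kv = k.softmax(dim=0).t() @ v        # (C, C)
linear_out = torch.sigmoid(q) @ kv   # (T, C)

# WKV-style recurrence: only per-channel sums are carried -> O(T*C).
decay = torch.exp(-torch.exp(w))     # elementwise decay in (0, 1)
num = torch.zeros(C)
den = torch.zeros(C)
wkv_out = torch.zeros(T, C)
for t in range(T):
    bonus = torch.exp(u + k[t])
    wkv_out[t] = (num + bonus * v[t]) / (den + bonus)
    num = decay * num + torch.exp(k[t]) * v[t]
    den = decay * den + torch.exp(k[t])

print(linear_out.shape, wkv_out.shape)  # torch.Size([128, 64]) twice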

Q7(R2): Visual Turing test A7: We agree that FVD has limitations in fully assessing video quality. We will cite [2] and add a visual Turing test as future work in the discussion.

Q8(R2): Comparison with VideoCrafter2 A8: As no new experiments are allowed, we will properly cite it and compare against it in the future.

Q9(R3): Limited modality scope A9: We will conduct experiments on additional modalities in the future.

Q10(R3): “Unified” attention is sequential A10: The current full-dimensional attention is conducted sequentially, and we will clarify this in the revision.

Q11(R3): Runtime/memory benchmark A11: As no new experiments are allowed, we will report this in the future. In the meantime, Table 1 of our manuscript already reports the model FLOPs: 118.7 G for FEAT-S, 472.1 G for FEAT-L, and 465.8 G for Endora.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    All three reviewers are positive about accepting this work, and their ratings are ‘4. Weak Accept’, ‘4. Weak Accept’, and ‘5. Accept’. The authors are suggested to release the code and revise the paper based on the reviewer comments.


