Abstract

Reconstructing temporally coherent 3D meshes of the beating heart from multi-view MR images is an important but challenging problem. The challenge arises from the complexity of integrating multi-view data, the sparse coverage of a 3D geometry by 2D image slices, and the interplay between geometry and motion. Current approaches often treat mesh reconstruction and motion estimation as two separate problems. Here we propose Mesh4D, a novel motion-aware method that jointly learns cardiac shape and motion directly from multi-view MR image sequences. The method introduces three key innovations: (1) a cross-attention encoder that fuses multi-view image information, (2) a transformer-based variational autoencoder (VAE) that jointly models image features and motion, and (3) a deformation decoder that generates continuous deformation fields and temporally smooth 3D+t cardiac meshes. Incorporating geometric regularisation and motion consistency constraints, Mesh4D can reconstruct high-quality 3D+t meshes (7,698 vertices, 15,384 faces) of the heart ventricles across 50 time frames in under 3 seconds. Compared to existing approaches, Mesh4D achieves notable improvements in reconstruction accuracy and motion smoothness, offering an efficient image-to-mesh solution for quantifying shape and motion of the heart and creating digital heart models.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0276_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0276_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{QiaMen_Mesh4D_MICCAI2025,
        author = { Qiao, Mengyun and Zheng, Jin and Zhang, Weitong and Ma, Qiang and Li, Liu and Kainz, Bernhard and O’Regan, Declan P. and Matthews, Paul M. and Niederer, Steven and Bai, Wenjia},
        title = { { Mesh4D: A Motion-Aware Multi-View Variational Autoencoder for 3D+t Mesh Reconstruction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {344 -- 354}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a method that integrates boundary and temporal information to learn a deformation field for warping a template mesh in 4D cardiac reconstruction.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method demonstrates fast inference compared to existing baselines.

    2. The proposed approach achieves superior mesh reconstruction accuracy in comparison to prior methods, as shown in the experimental results. A visual demo is also provided to support the qualitative performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Several critical design choices lack sufficient justification.
      • First, the motivation for using a VAE framework is unclear. The task is to learn a deterministic, spatially and temporally smooth deformation trajectory, not to capture uncertainty or generate multiple plausible deformation fields. Moreover, representing the entire trajectory using a single mean and covariance combined with temporal embeddings is questionable. No ablation study is provided to validate this choice; a baseline comparison against a Transformer encoder with temporal embeddings that directly produces latent representations for decoding would be needed to validate it.
      • Second, the novelty of the approach is not well-articulated. The Transformer encoder appears to perform only temporal attention, which aligns closely with conventional practices in video transformers where temporal and spatial attention are computed in separate stages. Additionally, it is unclear why SAX features are used as queries for multi-view fusion instead of 2CH or 4CH, or whether this choice significantly impacts performance.
      • Third, the continuous deformation field decoder is not adequately described. Given its significant contribution to performance (as indicated in Table 2), the lack of architectural or implementation details hinders reproducibility and limits the interpretability of its contribution.
    2. The framework involves balancing seven different loss terms (Equation 3), yet there is no analysis or discussion regarding the sensitivity of performance to these hyperparameters.

    3. The boundary alignment loss relies on per-frame segmentation masks, which are often costly and time-consuming to acquire in practice. This requirement could limit the applicability of the method.

    4. The method is only evaluated on a single dataset. As such, the generalizability and robustness of the approach across different datasets remain unclear. Moreover, certain components of the framework appear specifically tailored to the UK Biobank dataset, raising concerns about adaptability.

    5. The paper lacks a dedicated related works section. As a result, it is difficult to understand how the proposed method compares with or improves upon prior work in temporal mesh reconstruction and transformer-based modeling.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See main weakness.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a three-stage framework, Mesh4D, for 3D+T reconstruction of the LV and RV from cardiac MR images. The first stage consists of a cross-attention encoder that fuses image features across multiple views (SAX, 2CH, 4CH). The second stage is a transformer VAE that takes a sequence of feature tokens from the first stage and generates another sequence of latent representations. The third stage is a decoder that takes this latent sequence as input and generates a deformation field of vertex-wise displacements for each time frame. This field is used to warp a template mesh to obtain the reconstruction. The loss consists of a boundary term (for aligning the reconstruction with the 2D segmentation contours), a template term (for aligning the reconstruction with a pre-registered template), geometric regularization terms (edge length, normal consistency, smoothing), a motion consistency term (temporal consistency) and a latent space regularizer (normal distribution for the latent space). The method is trained and evaluated on 1,984 subjects from the UK Biobank CMR dataset. The segmentations are generated using a public model. The accuracy of the reconstructed mesh is assessed using four metrics: Hausdorff distance (HD) and average symmetric surface distance (ASSD) for the meshes, Pearson’s correlation coefficient (r) for volume curves, and root mean squared error (RMSE) for volumetric differences. The proposed method is compared to three other methods and is shown to outperform all of them across all evaluation metrics. The inference time is 2.89 seconds. Ablations are done by removing the continuous deformation (CD) decoder, the alignment loss, and the motion consistency loss. Qualitative results are presented in Fig 3 and the supplementary video.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The writing and flow in the paper are clear

    • The evaluation of the proposed method is done on a publicly available dataset.

    • The method is shown to run within a few seconds (2.89 seconds)

    • The method combines geometric and motion constraints to mimic the physiological properties.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Typo : “and long-axis two-chamber (4CH)”

    • This sentence needs to be made clearer as there are 2 latent representations - “The Transformer decoder then reconstructs temporally smooth latent representations from the latent space.”

    • Why is the summation in equation 2 only until T-2?

    • Contours look bad in Fig 3 SAX row. Why is this the case?

    • Not many details about the networks and their layers and configuration are provided.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The paper proposes a three-stage method to reconstruct the LV and RV structures in 3D+T. The authors use a simple way (cross-attention) to fuse SAX with long-axis views. One thing that could improve the experiments is an ablation on the multi-view encoder to see whether the long-axis views are helping the final reconstruction or not. Another recommendation would be to enforce the motion consistency cyclically across the last frame (between the ‘T-1’-th and 0th frames). For reproducibility of the results, details about the network and layers should be provided (e.g. How many layers are there in the 3D conv encoder? Which VAE architecture is used? What kind of network is the CD field decoder?)

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The writing in this paper is clear and well-structured. The proposed method applies existing techniques (e.g. cross-attention, VAE) to predict 3D+T meshes for cardiac structures. Some of the details of the network architectures used in the various stages are missing. The evaluation is done against other methods on a public dataset. The metrics used in evaluation include both distance and clinical volume metrics. The proposed method is shown to outperform all three other methods.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    Maintaining temporal consistency in image-derived models is a challenging topic of interest to the image analysis community. This work proposes an innovative method: a motion-aware multi-view variational autoencoder for reconstructing 3D+time cardiac meshes from MRI sequences.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study proposes a novel 3D+time mesh reconstruction method that ensures temporal coherence of cardiac meshes across time. Cardiac motion and shape are jointly learned, rather than treating mesh reconstruction and motion estimation separately.

    The approach combines three innovative components: 1) a cross-attention encoder that integrates features across multiple imaging planes, 2) a transformer-based variational autoencoder that learns dependencies of multi-view features across time, and 3) a deformation decoder that learns mesh displacements.

    The proposed method performs favorably with respect to three alternative approaches.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    It is not clear if or how the deformation encoder ensures a diffeomorphic transformation or prevents self-intersection of the mesh.

    There are seven total loss terms enforcing alignment and regularization. The ablation studies do not appear to demonstrate how much each loss contributes and if they are all essential. The ablation study introduces new notation for the losses - it is unclear if the “alignment loss” refers to the boundary, temporal, or edge alignment, or a combination of them.

    The evaluation metrics mainly relate to reconstruction accuracy with respect to the ground truth segmentation, rather than the motion coherence of the mesh. The evaluation relies on the assumption that there is temporal consistency in the segmentations, but it is not entirely clear if and how temporal consistency was implemented in their generation.

    Minor note: The ground truth segmentation is difficult to interpret in the top row (SAX view) of Figure 3.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The 4D mesh generation architecture is interesting and innovative in that it spatially integrates multiple imaging planes while jointly learning motion. This strength outweighs comments related to the evaluation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Although there are a few experimental weaknesses and missing methodological details, the proposed motion-aware multi-view variational autoencoder for 4D mesh reconstruction is interesting, the analysis is carried out on a publicly available dataset, and the short inference time is favorable.




Author Feedback

[R1, R3] Architecture

(R1-q1, R3-q1) Regarding the deformation decoder: while we do not impose a hard diffeomorphism constraint, the decoder is guided by a template mesh and constrained by geometric losses (edge length, normal consistency, Laplacian smoothness; Sec. 2.4), which promote mesh integrity and smooth deformations. Ablation confirms their impact: removing the continuous deformation decoder increases HD from 4.350 mm to 5.515 mm. The deformation decoder is a lightweight MLP conditioned on latent and temporal embeddings.

(R3-q1) We clarify the rationale for adopting a variational autoencoder (VAE). Our task requires learning temporally smooth yet subject-specific deformation fields across a large and diverse cohort (n = 1,984). The VAE regularises the latent space distribution, leading to physiologically plausible and temporally coherent deformation fields. In contrast, a deterministic AE lacks an inherent mechanism to regularise the latent space, risking overfitting or discontinuities. Previous studies [20,28,31] demonstrate that a VAE is both effective and efficient for learning the distribution of 4D cardiac meshes and their motion. Our Transformer encoder targets 3D+t and multi-view volumetric features rather than 2D appearance features (e.g. RGB videos), and thus differs from standard video Transformer formulations.

(R3-q1) We use SAX features as queries in the cross-attention module because they span multiple slices in 3D space and offer high in-plane resolution, while 2CH/4CH are single-slice views. Cross-attention allows SAX features to integrate complementary 2CH/4CH information. Table 2 shows that using only SAX views degrades performance (HD increases to 5.014 mm), supporting this design choice.

[R1, R2, R3] Loss definitions

(R1-q2, R3-q2) The alignment loss L_a in the ablation study (Table 2) refers to the template alignment loss L_temp; this is a typo and will be corrected.

(R3-q3) The boundary alignment loss uses masks generated automatically by a pre-trained, publicly available segmentation model [4], widely used in cardiac segmentation and motion analysis tasks. The loss is computed per time point over 3D volumes using a one-sided Chamfer distance (Eq. 1), and is implemented in a batched manner for efficiency.

(R2-q3) Eq. 2 terminates at T–2 because Δv(t) requires a forward frame (v(t+1) – v(t)).

(R2-o2) We agree that cyclic temporal consistency is an interesting direction. Our current framework is unidirectional but could be extended in future work.

[R2, R3] Reproducibility

(R2-q4, R3-q4) Our code will be made publicly available upon acceptance. Although trained on UK Biobank data, the method is extendable to other multi-view cardiac imaging datasets. Even in a SAX-only setting, the model outperforms prior work (e.g. MeshHeart), indicating strong robustness (Table 2).

[R1] Motion fidelity metrics

(R1-q3) While ground-truth motion fields are unavailable, motion fidelity is indirectly validated through (i) a strong volume-curve correlation (Pearson r = 0.986; Fig. 2b) and (ii) a significant performance drop when the motion-smoothness loss L_m is removed (Table 2), confirming the role of motion regularisation.

[R1, R2] Minor corrections

(R1-q4, R2-q4) The SAX projection in Fig. 3 overlays multiple slices, causing visual clutter. In contrast, the 2CH/4CH rows show single slices with crisp contours. We will clarify this in the figure caption.

(R2-q1, R2-q2) We will correct “two-chamber” to “four-chamber”, and revise the sentence to: “The Transformer decoder…from the VAE latent space.”

[R3] Related works

(R3-q5) Due to the page limit, we included related works in the Introduction, grouped into segmentation-based and mesh-based methods. Furthermore, we reviewed the technical challenges in this field, including through-plane resolution and learning from multi-view data.

[31] N. Gaggion, et al. “Multi-view Hybrid Graph Convolutional Network for Volume-to-mesh …” arXiv:2311.13706 (2023).
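The two loss behaviours discussed in the rebuttal — a one-sided Chamfer term matching each segmentation-contour point to its nearest mesh point, and a motion term whose sum necessarily stops at T−2 because Δv(t) = v(t+1) − v(t) needs a forward frame — can be illustrated with a minimal NumPy sketch. Function names, tensor shapes, and the exact form of the motion penalty are assumptions for illustration, not the paper's actual implementation of Eqs. 1–2:

```python
import numpy as np

def one_sided_chamfer(pred_pts, contour_pts):
    """One-sided Chamfer distance: contour -> predicted mesh only.

    pred_pts:    (N, 3) points sampled from the predicted mesh surface.
    contour_pts: (M, 3) points sampled from the 2D segmentation contours.
    Each contour point is matched to its nearest mesh point; distances
    are averaged. One-sided, so unmatched mesh regions are not penalised.
    """
    # (M, N) pairwise Euclidean distances
    d = np.linalg.norm(contour_pts[:, None, :] - pred_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def motion_smoothness(verts_seq):
    """Penalise large frame-to-frame vertex displacements.

    verts_seq: (T, V, 3) vertex positions over T time frames.
    Delta v(t) = v(t+1) - v(t) exists only for t = 0 .. T-2, which is
    why the sum in the paper's Eq. 2 terminates at T-2.
    """
    dv = verts_seq[1:] - verts_seq[:-1]        # (T-1, V, 3)
    return np.mean(np.sum(dv ** 2, axis=-1))   # mean squared displacement
```

A static sequence (`verts_seq` constant over time) gives a zero motion penalty, and a mesh whose surface passes exactly through every contour point gives a zero boundary term, which is the intended behaviour of both constraints.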




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents an interesting and novel motion-aware VAE-based 4D mesh reconstruction from multi-view CMR sequences. Three main components, i.e., a cross-attention encoder for multi-view feature integration, a transformer-based VAE for motion modelling, and a deformation decoder that learns mesh displacements, are combined with prior, geometric and motion constraints in the loss function to achieve the 4D mesh. The clinical interest and the contributions are recognized by the reviewers. The concerns about network details and loss functions have been well addressed in the response.


