Abstract

Dynamic Magnetic Resonance Imaging (MRI) of the vocal tract has become an increasingly adopted imaging modality for speech motor studies. Beyond the image signal itself, the unpredictability of the MRI acquisition environment can cause systematic data loss, noise pollution, and audio file corruption. In such cases, generating audio from images is critical for data recovery in both clinical and research applications. However, this remains challenging due to hardware constraints, acoustic interference, and data corruption. Existing solutions, such as denoising and multi-stage synthesis methods, face limitations in audio fidelity and generalizability. To address these challenges, we propose a Knowledge-Enhanced Conditional Variational Autoencoder (KE-CVAE), a novel two-step “knowledge enhancement + variational inference” framework for generating speech audio signals from cine dynamic MRI sequences. This approach introduces two key innovations: (1) integration of unlabeled MRI data for knowledge enhancement, and (2) a variational inference architecture to improve generative modeling capacity. To the best of our knowledge, this is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. The proposed method was trained and evaluated on an open-source dynamic vocal tract MRI dataset recorded during speech. Experimental results demonstrate its effectiveness in generating natural speech waveforms while addressing MRI-specific acoustic challenges, outperforming conventional deep learning-based synthesis approaches.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2374_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/2374_supp.zip

Link to the Code Repository

https://github.com/YaxuanLi-cn/KE-CVAE

Link to the Dataset(s)

LaSVoM Dataset: https://huggingface.co/datasets/YaxuanLi/LaSVoM

Variational Inference Dataset: https://figshare.com/articles/dataset/A_multispeaker_dataset_of_raw_and_reconstructed_speech_production_real-time_MRI_video_and_3D_volumetric_images/13725546/1

BibTex

@InProceedings{LiYax_Speech_MICCAI2025,
        author = { Li, Yaxuan and Jiang, Han and Ma, Yifei and Qin, Shihua and Woo, Jonghye and Xing, Fangxu},
        title = { { Speech Audio Generation from Dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        pages = {627 -- 637}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution of this paper is a Knowledge-Enhanced Conditional Variational Autoencoder (KE-CVAE), a two-stage framework that achieves high-quality, temporally accurate speech audio generation directly from dynamic MRI video sequences. The main contributions include: (1) a self-supervised knowledge enhancement strategy that leverages large-scale unlabeled MRI data through a teacher-student vision transformer, optimized by a consistency loss, a masked reconstruction loss, and KoLeo regularization for domain-specific feature learning; (2) a variational inference architecture that integrates a conditional variational autoencoder, normalizing flows, and adversarial training, combining a posterior encoder, an MRI-conditioned prior encoder, and a decoder to synthesize high-fidelity speech waveforms through enhanced distribution modeling and adversarial refinement. This method effectively addresses MRI-specific challenges such as hardware constraints, acoustic noise, and data corruption, outperforming conventional methods in objective metrics (PESQ, Corr2D) and subjective MOS scores, and offering a paradigm for clinical and research applications in speech analysis and data recovery.
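For readers unfamiliar with the overall shape of such a pipeline, a minimal sketch is given below. All module names, layer choices, and dimensions are illustrative placeholders (simple linear layers standing in for the ViT encoder, the MRI-conditioned prior encoder, and a vocoder-style decoder); this is not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MRIToSpeechSketch(nn.Module):
    """Inference-time shape of a two-stage MRI-to-speech pipeline (placeholders).

    Stage 1: a knowledge-enhanced visual encoder turns MRI frame features into
             conditioning representations.
    Stage 2: an MRI-conditioned prior yields a latent sample that a decoder
             expands into waveform samples.
    """
    def __init__(self, feat_dim=768, latent_dim=192, hop=256):
        super().__init__()
        self.visual_encoder = nn.Linear(feat_dim, feat_dim)     # stand-in for the ViT
        self.prior_encoder = nn.Linear(feat_dim, 2 * latent_dim)
        self.decoder = nn.Linear(latent_dim, hop)               # stand-in for the vocoder

    def forward(self, frame_feats):                             # (batch, T, feat_dim)
        h = self.visual_encoder(frame_feats)
        mu, logvar = self.prior_encoder(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # sample from the prior
        # during training, a posterior encoder and a normalizing flow would refine z
        wav_chunks = self.decoder(z)                            # (batch, T, hop)
        return wav_chunks.flatten(1)                            # (batch, T * hop)

model = MRIToSpeechSketch()
audio = model(torch.randn(1, 50, 768))   # 50 MRI frames -> 12,800 waveform samples
print(audio.shape)                       # torch.Size([1, 12800])
```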

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper combines self-supervised feature learning (via teacher-student ViTs) with conditional variational autoencoders for MRI-to-speech synthesis. This addresses the limitations of prior methods by enabling end-to-end learning of latent representations directly from unlabeled MRI data. The self-supervised knowledge enhancement phase uses three complementary loss functions (consistency, reconstruction, and KoLeo regularization) to learn domain-specific vocal tract dynamics without manual annotations, significantly improving feature robustness and generalizability.
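The exact loss weighting is not reported in this review, so the snippet below is only a generic sketch of the KoLeo (Kozachenko-Leonenko) regularizer as used in DINOv2-style self-supervised training; it penalizes batches whose embeddings cluster too tightly and thereby helps prevent collapse. The function name and the commented total objective are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def koleo_loss(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KoLeo regularizer: push each embedding away from its nearest neighbor.

    features: (batch, dim) embeddings, e.g. the student [CLS] tokens.
    """
    x = F.normalize(features, dim=-1)        # project embeddings onto the unit sphere
    dist = torch.cdist(x, x)                 # pairwise distances, shape (B, B)
    dist.fill_diagonal_(float("inf"))        # exclude self-distances
    nn_dist, _ = dist.min(dim=-1)            # distance to each point's nearest neighbor
    return -torch.log(nn_dist + eps).mean()  # small nearest-neighbor distances are penalized

# Hypothetical combined objective for the knowledge-enhancement stage:
# total = consistency_loss + reconstruction_loss + lambda_koleo * koleo_loss(cls_tokens)
```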

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the authors claim to pioneer MRI-to-speech synthesis, they only benchmark against vanilla CNN/Transformer baselines and ignore comparisons with existing state-of-the-art MRI-based methods. For instance, Liu et al. [18] proposed a tagged-MRI-to-speech synthesis method using Non-negative Matrix Factorization (NMF), which is not discussed or compared against. This weakens the claim of novelty and leaves uncertainty about improvements over domain-specific predecessors. The network architecture also appears to resemble a patchwork of existing modules. In addition, the evaluation uses only 75 subjects from a single dataset; clinical feasibility requires larger, multi-site cohorts to ensure robustness across scanner types and patient conditions.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • The flowchart in Figure 1 has a relatively confusing and unclear structure, making it difficult to discern the experimental steps. It is recommended to revise this figure.
    • Some operators in the formula descriptions of the methodology section should be explained, such as the det operator in Formula and the E in Formula (6), etc.
    • Has the pre-trained MRI visual extractor, used in the prior encoder to obtain the hidden representation, also undergone training with the knowledge enhancement strategy?
    • Why did the authors use normalizing flows and adversarial training to address the issues of data loss, artifacts, and corruption during MRI data acquisition and reconstruction in the context of speech audio generation?
    • The comparison methods in the results section are relatively few, and it is recommended to compare with some relevant methods.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the authors compare against CNN and Transformer baselines, they do not compare against other speech synthesis models adapted for MRI data. This gap limits the assessment of the proposed method’s competitiveness. In addition, the experiments demonstrate improvements over the baseline methods (CNN, Transformer), but the comparisons are limited to a single dataset.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The study shows a nice application of MRI-to-speech conversion. The authors went to great lengths to create a large dataset focusing on the vocal tract (their LaSVoM dataset) and then moved to audio generation using a nice pipeline centered on knowledge enhancement of a student/teacher network. The generated speech samples are decent, with their method achieving a MOS above 4.10, as also indicated by the audio samples provided.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method relies on a VAE; however, the application seems reasonable and works. I appreciate the strong technologies together with a thorough loss analysis (ablation study, Table 2) to showcase the knowledge enhancement (from 0.712 to 0.818 in Corr2D and from 3.74 to 4.13 in mean opinion score (MOS)). The additional samples provided by the authors are useful and sound great. The qualitative assessment is convincing.

    The pipeline seems legitimate and reasonable; WaveNet-based audio sampling is state of the art and well-used in this context.

    I enjoyed the usage and description of the different losses provided, namely the consistency loss, the reconstruction loss, and the KoLeo loss, as well as the classical VAE objective (ELBO = reconstruction + KL). In general, the losses are known beforehand; however, their combination and their use on this particular problem are new, and the systematic analysis is appreciated.
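As a reference for the ELBO decomposition mentioned above, here is a minimal sketch of a conditional-VAE objective with an MRI-conditioned Gaussian prior; the specific choices (L1 reconstruction term, beta weighting, function name) are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def cvae_negative_elbo(post_mu, post_logvar, prior_mu, prior_logvar, x_hat, x, beta=1.0):
    """Negative ELBO: reconstruction term + beta * KL(q(z|x, c) || p(z|c)).

    q is the posterior from the audio (plus condition), p is the MRI-conditioned
    prior; both are diagonal Gaussians parameterized by mean and log-variance.
    """
    recon = F.l1_loss(x_hat, x)  # e.g. spectrogram or waveform reconstruction error
    kl = 0.5 * (
        prior_logvar - post_logvar
        + (post_logvar.exp() + (post_mu - prior_mu) ** 2) / prior_logvar.exp()
        - 1.0
    ).mean()
    return recon + beta * kl
```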

    Taken together, I like the application, which is novel, as well as the effort of the authors in compiling this (novel) dataset. The reasoning is solid, the analysis with the baselines is sufficiently informative, and the mean opinion score is the right analysis for the problem. The pipeline follows established and state-of-the-art methods, such that the presented method is of great interest to the MICCAI community.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I would appreciate it if the authors could provide a supplementary movie that shows both the MRI and the audio sequence together (maybe in a loop) to better justify the quality.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The quality of Figure 2 is unfortunately relatively poor. I can see everything I wanted to see, but the authors need to revise this figure, provide at least 300 dpi, and take more care with the presentation of the waveform and the spectrograms.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach is novel, the dataset is solid, and MRI-to-audio conversion is an important problem that people around the world are working on. I would love to see the paper discussed at MICCAI; it can be accepted as is (see my further comments for improving the paper).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a framework for speech audio generation from dynamic MRI. A self-supervised ViT-based teacher-student model is trained on large-scale unlabeled vocal tract MRI data to learn domain-specific representations, which serve as the encoder in a conditional variational autoencoder augmented with normalizing flows and adversarial training. This is the first work to directly synthesize speech from MRI sequences. The method outperforms CNN and Transformer baselines on standard objective and subjective metrics. Code and data are promised upon acceptance to support future research in this area.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This is the first study to directly generate speech audio from dynamic MRI sequences, addressing a relevant and underexplored problem in speech imaging.
    2. It trains a self-supervised ViT-based teacher-student model to extract articulatory-relevant features from unlabeled MRI. Masked modeling captures local vocal tract dynamics, and KoLeo regularization prevents collapse. The resulting encoder enables effective conditioning without manual labels.
    3. It innovatively integrates a conditional VAE with normalizing flows and adversarial training to improve posterior flexibility and waveform quality (see the sketch after this list). Ablation studies confirm the contribution of each component.
    4. It conducts a well-structured evaluation using both objective metrics such as Corr2D and subjective MOS ratings based on human listening tests, includes ablation studies to assess the impact of each component, and provides generated audio samples in the supplementary material to illustrate perceptual quality.
    5. It constructs a paired dataset of 30,000+ large-scale vocal tract MRI samples with the intent to release it via GitHub. The authors also plan to release code upon acceptance, supporting reproducibility and future research.
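Regarding item 3, the review does not say which flow family the paper uses, so the sketch below only illustrates the general mechanism: one planar normalizing-flow step warping the latent sample, together with the change-of-variables log-determinant that enters the KL/ELBO terms. The invertibility constraint on u and the class/parameter names are simplifications and assumptions.

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One planar flow step: z' = z + u * tanh(w.z + b), with its log|det J|."""

    def __init__(self, dim: int):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z: torch.Tensor):
        # z: (batch, dim); returns the transformed sample and log|det Jacobian|
        lin = z @ self.w + self.b                    # (batch,)
        f = torch.tanh(lin)
        z_new = z + self.u * f.unsqueeze(-1)         # (batch, dim)
        psi = (1.0 - f ** 2).unsqueeze(-1) * self.w  # derivative of tanh term w.r.t. z
        log_det = torch.log((1.0 + psi @ self.u).abs() + 1e-8)
        return z_new, log_det

flow = PlanarFlow(dim=192)
z, log_det = flow(torch.randn(4, 192))   # log_det is added to the ELBO correction term
```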
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Some implementation details are omitted or unclear. For example, the MLP used in the consistency loss is only briefly mentioned and not shown in Figure 1. It is also unclear whether the MLP is part of the teacher/student models or applied externally, which affects the interpretation of the feature alignment process and the student model’s update. While this may be clarified through released code, it is recommended that the authors revise Figure 1 to include the MLP component or explicitly describe its position and role in the text to improve clarity and reproducibility.
    2. The paper vaguely states that the reconstruction loss complements the consistency loss by enabling learning at “multiple levels”, but does not clearly specify how the two losses function at different levels of representation. For example, consistency loss supervises global features via the [CLS] token, while reconstruction loss operates at the patch level to capture local articulatory structure. The authors should clearly describe this division in the text and explain how these losses interact during training.
    3. The paper states that the dynamic MRI dataset includes 75 participants and that an 8:2 train-test split was used, but it does not clarify whether this split was performed in a speaker-independent manner. Without this information, it is unclear whether the evaluation truly reflects generalization to unseen speakers. The authors should specify whether the training and test sets were separated by participant to ensure fair performance assessment.
    4. The paper does not report whether statistical significance testing was performed on the subjective MOS scores or the objective metrics. Given the limited number of evaluated samples and raters, statistical analysis would help support the reported performance differences (a minimal example of such a paired test is sketched after this list).
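On point 4, a paired non-parametric test over per-utterance scores is one straightforward way to report significance. The numbers below are made-up placeholders, not scores from the paper.

```python
from scipy.stats import wilcoxon

# Hypothetical per-utterance MOS (or PESQ) for two systems on the same test items.
scores_proposed = [4.2, 4.1, 4.3, 3.9, 4.4, 4.0, 4.2, 4.1]
scores_baseline = [3.8, 3.9, 4.0, 3.5, 4.1, 3.6, 3.7, 3.9]

# Wilcoxon signed-rank test: paired and non-parametric, suitable for ordinal
# ratings and small samples where normality cannot be assumed.
stat, p_value = wilcoxon(scores_proposed, scores_baseline)
print(f"Wilcoxon W={stat:.2f}, p={p_value:.4f}")
```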
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents the first framework for directly generating speech audio from dynamic MRI sequences, addressing a practically important and technically underexplored problem. The proposed method combines a self-supervised ViT-based encoder with a conditional variational autoencoder, enhanced by normalizing flows and adversarial training. The approach is well-motivated, methodologically sound, and supported by a carefully designed evaluation protocol. Experimental results include both objective and subjective assessments, and ablation studies confirm the contribution of each component. While some implementation details require clarification and significance testing is not reported, these issues are minor and do not detract from the overall contribution. The planned release of code and data further increases the value of this work to the community. For these reasons, I recommend acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We first thank all reviewers and chairs for handling our submission. Below are our responses and clarifications on the major comments.

[R1]
Q1: A supplementary movie. A1: We will update the supplementary audios with their coupled videos. More video samples will also be uploaded to the open repository link.
Q2 (opt): The quality of Figure 2. A2: We will enhance the quality of this figure with a higher dpi.

[R2]
Q1: MLP in the consistency loss. A1: The MLP is used independently in the calculation of the consistency loss, propagating [CLS] to obtain pseudo-class probabilities. We will update Figure 1 with more detailed components to clarify the whole structure.
Q2: Explanation of “multiple levels”. A2: The consistency loss primarily supervises coarse-grained features (both local and global); the model can minimize this loss by focusing on abstract features corresponding to pseudo-labels. In contrast, the reconstruction loss emphasizes fine-grained features, as any overlooked details will increase the loss, compelling the model to attend closely to every local detail. We will further clarify this in the Methods section.
Q3: Split of training samples. A3: Thanks for the suggestion; this is an important issue to clarify. We performed all experiments in a speaker-independent manner, so the metrics we used are not affected by a change of speakers. We will clarify this in the final manuscript.
Q4: Statistical significance testing. A4: The current Tables 1 and 2 show the effectiveness of our method compared to other methods at a comprehensive level. We will add a report of statistical significance tests to further solidify the differences in effectiveness between the compared methods.

[R3]
Q1: Ignoring comparison with Liu et al. A1: We would like to first clarify that we have already discussed this work in the Introduction. We think that R3’s concern pertains to why we did not include a more detailed comparison. We have already mentioned that the existing method differs substantially from our task setting. The approach by Liu et al. relies on a pre-computed sequence of deformation fields from tagged MRI images for training. However, our ground truth is based solely on cine+audio pairs, without tagged data, and certainly does not contain any pre-established deformation fields for ground-truth comparison. Thus, we claimed that, to the best of our knowledge, this work is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. This frames our comparisons exclusively against CNN- and Transformer-based baselines.
Q2: Small dataset scale. A2: As mentioned in Section 3.1, our test dataset comprises 1,662 MRI movies out of a total of 8,310 MRI movies. In fact, the dataset, involving 75 subjects, is sufficiently large and diverse for speech synthesis tasks from dynamic MRI (note that unlike other medical imaging applications, real-time speech MRI movies are difficult to acquire), particularly under our task setting, which does not emphasize speaker identity.
Q3 (opt): Figure 1 is unclear. A3: Please refer to our response to [R1 Q2]; we will modify the figure to enhance its clarity and expressiveness.
Q4 (opt): Lack of explanation for the “det” operator and “E” in formulas. A4: We adopted common machine learning notation for determinants and expectations. In the revised manuscript, we will explicitly explain these operators.
Q5 (opt): Whether the MRI visual extractor has undergone knowledge enhancement (KE) training. A5: As described in Section 2.2, the pre-trained MRI visual extractor has indeed been trained using the KE strategy.
Q6 (opt): The roles of normalizing flows and adversarial training. A6: Normalizing flows enhance the expressiveness of the variational inference. Adversarial training aims to make the generated audio more realistic by reducing noticeable issues such as artifacts and corruption.
Q7 (opt): Insufficient comparison methods. A7: This issue has been addressed in our [R3 A1].
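To illustrate the [R2 A1] clarification (an MLP that maps the [CLS] token to pseudo-class probabilities for the consistency loss), here is a generic DINO-style sketch. The head size, pseudo-class count, temperatures, and the use of a single shared head are assumptions; details such as centering or a separate EMA teacher head are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Hypothetical MLP head mapping a [CLS] token to pseudo-class logits."""
    def __init__(self, dim: int = 768, n_pseudo_classes: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, n_pseudo_classes)
        )

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        return self.mlp(cls_token)

def consistency_loss(student_cls, teacher_cls, head, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student pseudo-class distributions.

    The teacher branch is treated as a fixed target (no gradient); only the
    student and the head receive gradients, as in teacher-student training.
    """
    student_log_probs = F.log_softmax(head(student_cls) / t_student, dim=-1)
    with torch.no_grad():
        teacher_probs = F.softmax(head(teacher_cls) / t_teacher, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

head = ProjectionHead()
loss = consistency_loss(torch.randn(8, 768), torch.randn(8, 768), head)
```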




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


