Abstract

Although multiple studies in recent years have examined the decoding of speech from brain activity through non-invasive technologies, the task remains challenging: decoding quality is still insufficient for practical applications. An effective solution could advance brain-computer interfaces (BCIs), potentially restoring communication for individuals with speech impairments. At the same time, such studies can provide fundamental insights into how the brain processes speech and sound. One approach to decoding perceived speech uses a self-supervised model trained with contrastive learning, which matches segments of the same length from magnetoencephalography (MEG) recordings to audio in a zero-shot way. We improve this method by incorporating a new architecture based on a CNN-Transformer. With the proposed modifications, the accuracy of perceived speech decoding increases significantly, from 69\% to 83\% and from 67\% to 70\%, on publicly available datasets. Notably, the greatest improvement in accuracy is observed on longer speech fragments that carry semantic meaning, rather than on shorter fragments containing sounds and phonemes. Our code is available at https://github.com/maryjis/MEGformer

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1369_paper.pdf

SharedIt Link: https://rdcu.be/dV1Ok

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72069-7_27

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1369_supp.pdf

Link to the Code Repository

https://github.com/maryjis/MEGformer

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Boy_MEGFormer_MICCAI2024,
        author = { Boyko, Maria and Druzhinina, Polina and Kormakov, Georgii and Beliaeva, Aleksandra and Sharaev, Maxim},
        title = { { MEGFormer: enhancing speech decoding from brain activity through extended semantic representations } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15002},
        month = {October},
        pages = {281 -- 290}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose an improved method for matching MEG to audio signals using a CLIP-based framework. The work shows that the proposed architecture modifications that replace the CNN with a transformer-based encoder substantially improve the results on two publicly available datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors claim to be the first work to apply a Transformer-based architecture to the task of speech decoding from MEG signals.

    The authors show that the model’s performance increases with longer sequences, which supports the benefits of using transformer-based architectures for this task.

    While iterative in nature, the proposed solution achieves promising performance and substantially improves on prior results.

    The paper is well structured and clearly written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of the presented work is limited as the authors propose an iterative improvement of an existing solution. The application of a Transformer-based architecture is a rather obvious design decision when working with sequential data.

    The paper lacks an explanation or discussion on how this work can be applied to practical use cases.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Not being an expert in the field, I assume that the MEG signals from listening to audio and from actual speaking are very different, which is not explained well in the paper. The authors should indicate more clearly that the publicly available MEG-audio datasets are acquired in a listening setting.

    The paper is lacking a discussion of the potential applications of the work - how exactly can it help impaired persons? While the problem seems challenging in general, I assume that top-1 accuracies of 55.08% and 39.43% on the respective datasets render the method inadequate for practical use cases.

    Not all dimensions are sufficiently explained in Fig. 2 - e.g. what does S correspond to?

    There is a typo in the first sentence of the results section (“archived”).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting field of research but lacks novelty. Furthermore, the application and outlook how this could be beneficial to patients is unclear and not explained in the paper. Therefore, I believe that the work could be more suited to a more niche journal or conference with experts in the field.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Based on the rebuttal and the author’s statement to add a discussion on how this work can be applied to practical use cases in the final camera-ready version, I change my recommendation from “weak reject” to “weak accept”.



Review #2

  • Please describe the contribution of the paper

    The authors propose a transformer-based architecture to encode brain signals for the speech decoding task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It outperforms SOTA methods

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Figures 3 and 4 are not explained in a clear way. Few references.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Results and Discussions section should be rewritten in a clear way.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Explanation of the proposed methodology is not very clear. Findings are not discussed.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a novel architecture for decoding speech from MEG (magnetoencephalography) using a CNN-Transformer combination, marking the first introduction of a transformer-based framework in this application. Furthermore, the study demonstrates that the performance of the model is enhanced by decoding longer segments of speech.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Simplicity: The proposed method is straightforward and results in improved performance. According to the authors, the transformer-based architecture helps to capture long-range interactions. State-of-the-art performance: The reported results demonstrate a significant improvement over the previous state-of-the-art model on both datasets. Investigation of an effective segment length: The authors conducted a detailed analysis of the segment length choice, showing that the longer the segment, the better the decoding quality. The results are backed by quantitative experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited clarity: It is not clear how predictions, in particular zero-shot predictions, are made. The evaluation used top-10 accuracy, which implies that the target segment was present among the 10 predicted ones. It is doubtful whether this evaluation metric is indicative of decoding performance; more discussion of this issue would be welcome. Limited performance comparisons: As shown by the authors, some techniques, such as pre-processing, splitting segments via sound, and simplifying the model, enhanced the accuracy by a significant margin, especially on the Gwilliams dataset. Similarly, it would be more comprehensive to see the performance of MEGFormer with and without these techniques. More discussion of the effect of the dataset size would be appropriate given the importance of training data for transformer architectures.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper provides a link to an anonymized implementation of the code, and the proposed method itself is straightforward, which makes the paper highly reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    First, it would be better to provide more explanation of the prediction process. Second, it would be better to have more results with MEGFormer (e.g. without preprocessing, splitting segments via sound, or the non-simplified model). Third, for future work, it would be more comprehensive to have subject-specific results, as provided in the previous work in [3]. Finally, I would recommend that the authors revise the paper for consistency in the text. For example, the authors mention that they evaluated using top-5 accuracy, while such results are missing from the paper. Also, a review for typos and formatting is recommended.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposed a new method with a substantial performance increase in MEG-based speech brain decoding. I think this work, backed with the open-source implementation and well-explained methods would contribute to the development of the field.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I would like to express my gratitude to the authors for their feedback on the review. The authors clearly addressed the major review concerns and explained them in a clear way. My questions about the prediction process were answered, and I hope the explanation will be incorporated into the paper too. Additionally, the authors addressed my concern about top-10 accuracy, and I believe this valuable discussion will also be reflected in the paper. Some of my concerns required additional experiments, so I hope the authors will consider them in their future work. I therefore keep my opinion as Accept.




Author Feedback

Thank you for your constructive feedback! R4: “The paper … lacks novelty. The application of a Transformer-based architecture is a rather obvious design decision when working with sequential data.” Applying Transformer-based approaches to various types of sequential data isn’t always optimal: for example, the CNN-based TimesNet outperforms Transformers in time series analysis tasks. Moreover, applying such an approach to MEG data for speech decoding hasn’t been explored previously. Our contribution isn’t in using a Transformer-based approach per se but, as R5 noted, in developing a hybrid CNN-Transformer architecture specifically tailored to MEG data. This involves incorporating spatial attention layers and a carefully designed sequence of blocks to accurately capture the nuances of brain activity. This hybrid architecture achieves SOTA performance in decoding speech. We also demonstrate improvements with longer speech segments, highlighting the model’s ability to accurately decode complex speech representations linked to high-level perception or even speech comprehension. R4-R6: “The paper lacks an explanation or discussion on how this work can be applied to practical use cases”, “Findings are not discussed”. Decoding speech directly from brain signals is a relatively new and emerging task. Our primary objective is to demonstrate the feasibility of decoding perceived speech. With the development of larger datasets for speech production, our method could be adapted to that task without significant modifications. Thus, our work represents an important step toward building a foundational model for brain recordings. Recent studies have utilized limited vocabularies, whereas our model demonstrates zero-shot performance with an unrestricted one. The performance of our model (55.08%) and of Defossez et al. (39.43%) is based on vocabularies that don’t overlap with the training set, showing that the models can generalise to new words without additional training.
In practical applications, such as aiding individuals with speech impairments, our model can be adapted to use a limited vocabulary set, significantly enhancing its performance and making it suitable for tasks like issuing a predefined set of commands. Our approach can also be useful for understanding the differences between healthy controls and patients with auditory processing disorder, which could lead to the development of brain interfaces for them [doi:10.3389/fnhum.2014.00151]. R5&R6: “Explanation of the proposed methodology is not very clear”, “It is not clear how predictions are made”. Our method decodes speech from MEG data using a dual-encoder architecture. The audio encoder employs a pre-trained wav2vec 2.0 model, while the brain activity encoder processes MEG signals with CNN layers to capture local spatial features and transformer layers to model long-range dependencies. A spatial attention module focuses on the most relevant MEG channels, and a subject-specific layer tailors the representation to individual subjects, enhancing accuracy. Training aligns features from both modalities using a contrastive loss function inspired by CLIP, encouraging matched MEG and audio representations to lie closer together in a shared latent space. During inference, given a new MEG signal, the model predicts the corresponding audio representation by finding the closest match in the learned latent space. R5: About evaluation metrics. While top-10 accuracy might seem less stringent than top-1 accuracy, it is crucial for understanding the model’s ability to narrow down potential matches in a large search space, reflecting its utility in applications where further refinement steps might be employed. Additionally, we included top-1 accuracy metrics to provide a more comprehensive evaluation of the model’s performance.
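The contrastive training objective and zero-shot retrieval evaluation described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions (L2-normalised embeddings, a symmetric CLIP-style InfoNCE loss, and hypothetical function names), not the authors' actual implementation:

```python
import numpy as np

def logsumexp_rows(x):
    """Numerically stable row-wise log-sum-exp."""
    m = x.max(axis=1, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=1, keepdims=True))

def clip_style_loss(meg_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    meg_emb, audio_emb: (batch, dim) L2-normalised embeddings; row i of each
    matrix is a matched MEG/audio pair. Matched pairs sit on the diagonal of
    the similarity matrix and are pulled together; all other entries act as
    negatives and are pushed apart.
    """
    logits = meg_emb @ audio_emb.T / temperature          # (batch, batch) similarities
    log_p_m2a = logits - logsumexp_rows(logits)           # MEG -> audio direction
    log_p_a2m = logits.T - logsumexp_rows(logits.T)       # audio -> MEG direction
    diag = np.arange(len(logits))
    return -0.5 * (log_p_m2a[diag, diag].mean() + log_p_a2m[diag, diag].mean())

def top_k_accuracy(meg_emb, audio_emb, k=10):
    """Zero-shot retrieval: rank all candidate audio segments by similarity to
    each MEG segment and check whether the true match appears in the top k."""
    sims = meg_emb @ audio_emb.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()
```

In this sketch, top-10 accuracy corresponds to calling `top_k_accuracy(..., k=10)` over the full pool of candidate audio segments, which is why it measures the model's ability to narrow down a large search space rather than to produce a single exact answer.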
We’ll carefully consider your suggestions and make the necessary revisions to figures (R4, R6) and text (R4, R5, R6), and in future work, we’ll further explore the impact of dataset size and subject-specific results.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have addressed the concerns raised by the reviewers in their rebuttal and have promised to make changes in the camera ready version as suggested by the reviewers.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The authors have addressed the concerns raised by the reviewers in their rebuttal and have promised to make changes in the camera ready version as suggested by the reviewers.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors were successfully able to clearly reply to most of the reviewers comments

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The authors were successfully able to clearly reply to most of the reviewers comments


