Abstract

While understanding visual processing in the human brain is fundamental for computational neuroscience, decoding objects from electroencephalography (EEG) remains challenging due to noisy neural dynamics during rapid image presentation and semantic misalignment in zero-shot settings. We propose BrainAlign, a novel framework leveraging contrastive learning to align EEG features with visual-language models (VLMs). Our approach addresses three fundamental challenges: (1) We introduce a Frequency-Aware Temporal Encoder (FATE) using real Fast Fourier Transform with tunable bandpass filters to compress noisy signals while preserving temporal fidelity. (2) We develop a Differentiable Cluster Assigner (DCA) that dynamically optimizes channel grouping through cross-attention mechanisms, adaptively suppressing noise and enhancing task-relevant features. (3) We implement a self-supervised framework aligning EEG features with VLMs through contrastive learning. Extensive experiments demonstrate state-of-the-art performance on large-scale datasets, improving zero-shot retrieval accuracy by 5.85% and classification accuracy by 3.3%. Our work establishes new possibilities for brain-computer interfaces.
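To make the FATE idea concrete, the following minimal PyTorch sketch filters the rFFT spectrum with a tunable soft bandpass gate and returns to the time domain. The module name, shapes, sampling rate, and the sigmoid gate parameterization are illustrative assumptions, not the authors' implementation:

    import torch
    import torch.nn as nn

    class FrequencyBandpass(nn.Module):
        # Hypothetical sketch: rFFT filtering with learnable band edges.
        # Input x is assumed to be raw EEG of shape (batch, channels, time).
        def __init__(self, n_time: int, fs: float = 250.0,
                     low: float = 0.5, high: float = 40.0):
            super().__init__()
            # Learnable cutoff frequencies (Hz), initialized to a broad pass band.
            self.low = nn.Parameter(torch.tensor(low))
            self.high = nn.Parameter(torch.tensor(high))
            # Frequency (Hz) of each rFFT bin for signals of length n_time.
            self.register_buffer("freqs", torch.fft.rfftfreq(n_time, d=1.0 / fs))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            spec = torch.fft.rfft(x, dim=-1)
            # Soft, differentiable bandpass gate over frequency bins.
            gate = torch.sigmoid(self.freqs - self.low) * torch.sigmoid(self.high - self.freqs)
            # Attenuate out-of-band bins, then invert so that temporal
            # structure remains available to subsequent layers.
            return torch.fft.irfft(spec * gate, n=x.shape[-1], dim=-1)

Filtering in the frequency domain and inverting back, rather than keeping only spectral magnitudes, is one way to reconcile noise compression with the temporal fidelity the abstract emphasizes.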

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3762_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTeX

@InProceedings{ShiEnz_BrainAlign_MICCAI2025,
        author = { Shi, Enze and Hu, Huawen and Yuan, Qilong and Zhao, Kui and Yu, Sigang and Zhang, Shu},
        title = { { BrainAlign: EEG-Vision Alignment via Frequency-Aware Temporal Encoder and Differentiable Cluster Assigner } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {100--110}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose BrainAlign, a novel framework leveraging contrastive learning to align EEG features with visual-language models (VLMs). Extensive experiments demonstrate state-of-the-art performance on large-scale datasets, improving zero-shot retrieval accuracy by 5.85% and classification accuracy by 3.3%.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Extensive experiments: The authors conduct a diverse set of experiments, including baseline comparisons and ablation studies, which provide strong empirical support for the proposed method.

    Strong performance: The experimental results demonstrate clear performance gains, suggesting that the method is both effective and competitive.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The sentence “achieving 5.85% and 3.3% accuracy improvement in zero-shot object recognition” is unclear. It is not evident whether this improvement refers to EEG-based classification or image-based classification performance. This ambiguity also appears in other parts of the manuscript where results are reported.

    In Section 3.1 “Datasets and Settings”, the authors do not clearly describe the experimental paradigms of the datasets used. For instance, it is unclear whether the tasks are binary or multi-class classification problems, and how the experimental conditions are structured. This information is essential for readers to understand the context and scope of the experiments.

    In Section 3.2 “Overall Performance”, the abbreviations “NICE” and “ATM-S” appear without prior introduction or explanation. All abbreviations should be defined when first used to ensure clarity for readers unfamiliar with the terminology.

    The manuscript mentions both “zero-shot retrieval” and “zero-shot classification”, but does not provide a clear distinction between the two. The authors should offer a concise explanation of these concepts and clarify how they are evaluated and differentiated in the context of this study.

    In Table 2, both “NICE [22]” and “NICE (Our Framework)” are listed as separate methods. However, the relationship between the two is not clearly stated. Are these two configurations of the same model, or is “NICE (Our Framework)” a modified version based on the referenced “NICE [22]”? The manuscript should explicitly describe the differences and connections between them.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the experiments are comprehensive, several parts of the manuscript lack clarity. The manuscript could be further improved to enhance readability and coherence.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    With the rise of vision-language models (VLMs), brain decoding using neural signals such as fMRI, MEG, and EEG has gained significant attention. Among them, EEG is particularly challenging due to its low signal-to-noise ratio, making the extraction of meaningful frequency information crucial. This paper proposes an end-to-end framework that tackles the critical problem of EEG-VLM alignment by capturing essential frequency components through cluster-aware modules.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • EEG-based brain decoding offers a cost-effective and temporally precise alternative to fMRI, making it valuable for real-time applications.
    • The paper effectively addresses the difficulty of extracting meaningful temporal and frequency features from noisy EEG signals by proposing an alignment and cluster-based strategy.
    • The method demonstrates strong zero-shot retrieval and classification performance, outperforming prior works.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Although the paper proposes cluster-aware feed-forward modules to capture frequency patterns, it still relies on static band-pass filtering. Other approaches [1][2] have proposed dynamic or learnable kernel techniques that adapt to crucial subject-specific frequency bands. Including a comparison or discussion with these approaches would enhance the work.
    • The contrastive loss used in the framework differs from the standard CLIP loss, which typically includes separate image-to-text and text-to-image objectives. The authors adopt a symmetric, single-objective contrastive loss. This design choice may influence the results and requires further clarification (a minimal sketch of the standard two-direction loss is shown after the references below).
    • While the paper includes model interpretation and ablation studies, the explanations are insufficient, and visualizations are difficult to interpret due to small or unclear figures. Clarifying the ablation conditions and improving the interpretability of visual content would greatly strengthen this section.

    [1] Li, Tianfu, et al. “WaveletKernelNet: An interpretable deep neural network for industrial intelligent diagnosis.” IEEE Transactions on Systems, Man, and Cybernetics: Systems 52.4 (2021): 2302-2312.

    [2] Kim, Jun-Mo, et al. “A learnable continuous wavelet-based multi-branch attentive convolutional neural network for spatio–spectral–temporal EEG signal decoding.” Expert Systems with Applications 251 (2024): 123975.
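    For reference, here is a minimal sketch of the standard two-direction (CLIP-style) contrastive loss mentioned above; per the review this is not the loss the paper adopts, and the variable names and the 0.07 temperature are illustrative assumptions:

        import torch
        import torch.nn.functional as F

        def clip_style_loss(eeg_emb: torch.Tensor, img_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
            # eeg_emb, img_emb: (batch, dim); row i of each forms a positive
            # pair, and all other in-batch combinations act as negatives.
            eeg = F.normalize(eeg_emb, dim=-1)
            img = F.normalize(img_emb, dim=-1)
            logits = eeg @ img.t() / temperature               # (batch, batch)
            targets = torch.arange(logits.size(0), device=logits.device)
            loss_e2i = F.cross_entropy(logits, targets)        # EEG -> image
            loss_i2e = F.cross_entropy(logits.t(), targets)    # image -> EEG
            return 0.5 * (loss_e2i + loss_i2e)

    A single symmetric objective, as the review describes the paper using, would collapse these two cross-entropy terms into one; comparing the variants empirically would address the reviewer’s concern.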

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is an impactful study with strong empirical performance. However, further clarification on the issues mentioned above—especially regarding the bandpass filtering strategy, contrastive loss, and interpretability—would greatly enhance the completeness and transparency of the work. Addressing these concerns would likely elevate the overall impact of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The main contribution of this paper is the introduction of the BrainAlign framework, a novel method for aligning EEG features with visual-language models (VLMs) through contrastive learning. The framework incorporates a Frequency-Aware Temporal Encoder (FATE) and a Differentiable Cluster Assigner (DCA) to address issues of noise in EEG signals and complex inter-channel dependencies, and to effectively align EEG features with visual semantics.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Application of Contrastive Learning: The application of contrastive learning for aligning EEG features with visual semantics is a novel approach. By using contrastive learning, the model can learn a direct mapping between EEG signals and visual semantics, enabling zero-shot recognition.

    Innovation of the FATE Module: The FATE module innovatively combines real Fast Fourier Transform (rFFT) with tunable bandpass filters to compress noisy signals while preserving temporal fidelity. This hybrid time-frequency analysis method is novel in EEG processing, as it leverages frequency information while retaining key temporal features.

    Dynamic Nature of the DCA Module: The DCA module dynamically optimizes channel groupings through cross-attention mechanisms, adaptively suppressing noise and enhancing task-relevant features.
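    As a hypothetical illustration of such attention-based soft clustering (an assumption for exposition, not a reconstruction of the paper’s DCA), learnable cluster queries can attend over channel features so that the grouping remains differentiable end to end:

        import torch
        import torch.nn as nn

        class SoftClusterAssigner(nn.Module):
            # Illustrative assumption: K learnable cluster queries attend over
            # channel features of shape (batch, channels, dim); a softmax over
            # clusters yields a differentiable soft assignment per channel.
            def __init__(self, dim: int, n_clusters: int):
                super().__init__()
                self.queries = nn.Parameter(torch.randn(n_clusters, dim))
                self.key_proj = nn.Linear(dim, dim)
                self.scale = dim ** -0.5

            def forward(self, ch_feats: torch.Tensor):
                keys = self.key_proj(ch_feats)                              # (B, C, D)
                attn = torch.einsum("kd,bcd->bkc", self.queries, keys) * self.scale
                assign = attn.softmax(dim=1)                                # soft cluster per channel
                clusters = torch.einsum("bkc,bcd->bkd", assign, ch_feats)   # pooled cluster features
                return clusters, assign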

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Theoretical Basis of the Methods Needs Strengthening: While the FATE module claims to preserve essential EEG characteristics through bandpass filtering, this approach may discard important phase information, which could be crucial for semantic decoding. Additionally, the probability calculation in the DCA module ignores the spatial topological structure between channels, potentially leading to clustering results that are inconsistent with actual neurophysiological structures.

    Questionable Rationality of the Contrastive Learning Strategy: Contrastive learning typically relies on clear positive and negative sample pairs. However, in the semantic decoding of EEG signals, defining such pairs is not straightforward.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I have a few additional comments for the authors. The experiments were conducted on a single dataset (THINGS-EEG) without cross-dataset validation, which raises concerns about the model’s generalizability. Although the method shows promising results in zero-shot retrieval and classification tasks, for future work I recommend that the authors validate its performance on more tasks, such as participating in the Medical Segmentation Decathlon. This would help assess the method’s applicability and generalizability across different tasks.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I chose to recommend acceptance based on the following major factors:
    • Novelty: The study proposes a novel framework, BrainAlign, which aligns EEG features with visual-language models through contrastive learning. This approach is innovative within the field of EEG decoding.
    • Performance Improvement: The experimental results demonstrate significant performance improvements in zero-shot retrieval and classification tasks, indicating the method’s effectiveness in handling complex EEG signals.
    • Potential Application Value: The research offers new insights and methodologies for brain-computer interfaces and neuroscience, with high potential for application.

    While there are aspects that need further refinement, such as a more detailed discussion of the theoretical basis and validation on additional datasets, the overall novelty and performance improvements make this study a valuable contribution.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their valuable feedback. We address the main concerns below:

  1. Clarity of Results and Experimental Design

R1: Ambiguity in reported accuracy improvements and experimental paradigms. We apologize for the lack of clarity. To clarify, all our retrieval and classification tasks are entirely EEG-based. More specifically, during both retrieval and classification, the model’s input consists solely of EEG data. The reported improvements (5.85% and 3.3%) refer to EEG-based zero-shot object recognition compared to baseline methods. We will ensure clearer expression in the camera-ready version.

R1: Insufficient description of experimental paradigms. We apologize for not providing a complete dataset description due to space constraints. However, we did include essential information in Section 3.1. To clarify, the study employed a rapid serial visual presentation (RSVP) paradigm where each image was shown for 100ms with a 200ms stimulus onset asynchrony. The task involves multi-class zero-shot classification, with the training set comprising 1654 image classes and the test set containing 200 different image classes.

R1: Distinction between “zero-shot retrieval” and “zero-shot classification”. We have clearly illustrated the inference processes for both zero-shot retrieval and zero-shot classification in Fig. 2, accompanied by corresponding textual explanations in Section 2.1. We invite the reviewer to refer to these sections for the detailed distinction between these two tasks.
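Since Fig. 2 is not reproduced on this page, the sketch below shows one common way such EEG-based zero-shot tasks are scored in this line of work; this is an assumed protocol for illustration, and the only difference between the two tasks here is the candidate set:

    import torch
    import torch.nn.functional as F

    def zero_shot_scores(eeg_emb: torch.Tensor, candidate_embs: torch.Tensor) -> torch.Tensor:
        # Cosine similarity of one EEG embedding (dim,) against a set of
        # candidate embeddings (num_candidates, dim); higher = better match.
        eeg = F.normalize(eeg_emb, dim=-1)
        cand = F.normalize(candidate_embs, dim=-1)
        return cand @ eeg

    # Retrieval: candidates are embeddings of the individual test images,
    # and the ranked list is evaluated (e.g., top-k accuracy):
    #   rank = zero_shot_scores(e, image_embs).argsort(descending=True)
    # Classification: candidates are one prototype embedding per unseen
    # class, and the argmax is the predicted label:
    #   label = zero_shot_scores(e, class_prototype_embs).argmax()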

R1: Relationship between “NICE” and “NICE (Our Framework)”. In Section 3.2, we briefly explained the definition of “Our framework,” which involves appropriately modifying NICE and ATM-S and embedding them into our framework for testing. The results demonstrate further improvement, highlighting the effectiveness of our approach.

  2. Methodological Considerations

R2: Reliance on static band-pass filtering versus dynamic approaches. We appreciate the suggestion to compare with dynamic frequency band methods. While our approach uses predefined frequency bands based on established neuroscientific literature, we acknowledge the benefits of learnable kernels. Our FATE module preserves frequency characteristics while allowing subsequent layers to learn cross-frequency interactions. We will include this discussion in the camera-ready version.
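For reference, the canonical band definitions from the EEG literature are listed below; whether BrainAlign’s predefined bands use exactly these edges is an assumption made here for illustration:

    # Canonical EEG frequency bands (Hz); exact edges vary slightly
    # across studies, so treat these values as representative.
    EEG_BANDS = {
        "delta": (0.5, 4.0),
        "theta": (4.0, 8.0),
        "alpha": (8.0, 13.0),
        "beta":  (13.0, 30.0),
        "gamma": (30.0, 100.0),
    }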

R3: Theoretical basis of methods and phase information loss. The reviewer raises a valid point about phase information. While bandpass filtering can indeed affect phase relationships, our subsequent cluster-aware feed-forward module is specifically designed to capture temporal dynamics and cross-frequency interactions that preserve crucial temporal patterns. Our experimental results and interpretability analysis (Fig. 4(b)) demonstrate that the resulting clusters effectively align with established functional brain regions, confirming the neurological validity of our approach. We acknowledge the reviewer’s point about spatial topological structure between channels, and we will focus on integrating spatial topological information into model training in our future work.

R3: Rationality of contrastive learning for EEG signals. For EEG-based semantic decoding, we define positive pairs as (EEG signal, corresponding visual stimulus) and negative pairs as (EEG signal, unrelated visual stimuli). Our experimental results (Section 3.2) validate this approach, showing meaningful alignment between EEG representations and visual semantics.

  3. Visual Interpretation and Figures

R2: Insufficient explanations and unclear visualizations. We apologize for the suboptimal visualization quality. To include more experimental results within the limited space, we consolidated multiple results together. In the final version, we will improve figure resolution, add more detailed annotations, and optimize the image arrangement to ensure readers can clearly interpret the visualizations.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


