Abstract
Emotion recognition plays a vital role in affective computing and mental health monitoring within intelligent healthcare systems. While EEG captures rich emotional patterns, its clinical applicability is limited by cumbersome acquisition and susceptibility to motion artifacts. In contrast, electrocardiogram (ECG) signals are more accessible and less prone to artifacts, but lack a direct semantic representation of emotion categories. To address this challenge, we introduce a cross-modal alignment approach based on contrastive learning. First, we extract emotional features from EEG signals using a pre-trained encoder. Then, we align the ECG encoder to these EEG-derived features through a contrastive learning framework, using sequence- and patch-level semantic alignment based on a temporal patch shuffle strategy. This method effectively combines the strengths of both modalities. Experiments on the DREAMER and AMIGOS datasets show that our method outperforms baseline methods on emotion recognition tasks. Additional ablation studies and visualizations further reveal the contribution of the core components. From a practical application perspective, our approach enables accurate emotion recognition in scenarios where EEG acquisition is impractical, providing a more accessible alternative for real-world affective computing applications. The code is available at https://github.com/pokking/ECG_EEG_alignment.
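To make the abstract's two-level alignment concrete, below is a minimal illustrative sketch of sequence- and patch-level contrastive alignment with a temporal-patch-shuffle negative. It assumes InfoNCE-style losses and encoders that return both patch tokens and a pooled sequence embedding; all names, shapes, and the loss weighting are assumptions for illustration, not the authors' released implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched rows are positives,
    all other rows in the batch serve as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def alignment_loss(ecg_encoder, eeg_encoder, ecg, eeg, shuffle_w=1.0):
    """Hypothetical two-level alignment loss.

    ecg, eeg: (B, C, T) raw windows. Both encoders are ASSUMED to
    return (patch_tokens, sequence_embedding) of shapes (B, N, D)
    and (B, D); the real interfaces may differ.
    """
    with torch.no_grad():                            # EEG encoder stays frozen
        eeg_patches, eeg_seq = eeg_encoder(eeg)
    ecg_patches, ecg_seq = ecg_encoder(ecg)

    # Sequence-level alignment on whole-window embeddings.
    loss_seq = info_nce(ecg_seq, eeg_seq)

    # Patch-level alignment: temporally matched patches are positives.
    B, N, D = ecg_patches.shape
    loss_patch = info_nce(ecg_patches.reshape(B * N, D),
                          eeg_patches.reshape(B * N, D))

    # Temporal patch shuffle: permuting patch order destroys temporal
    # semantics, yielding extra hard negatives for the sequence view.
    perm = torch.randperm(N, device=ecg.device)
    shuf_seq = ecg_patches[:, perm, :].mean(dim=1)   # crude pooled "sequence"
    candidates = torch.cat([ecg_seq, shuf_seq], dim=0)
    logits = F.normalize(eeg_seq, dim=-1) @ F.normalize(
        candidates, dim=-1).t() / 0.07               # (B, 2B)
    targets = torch.arange(B, device=ecg.device)     # unshuffled rows win
    loss_shuf = F.cross_entropy(logits, targets)

    return loss_seq + loss_patch + shuffle_w * loss_shuf
```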
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4191_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/pokking/ECG_EEG_alignment
Link to the Dataset(s)
DREAMER dataset: https://zenodo.org/records/546113
AMIGOS dataset: http://www.eecs.qmul.ac.uk/mmv/datasets/amigos/
BibTex
@InProceedings{WuYi_CrossModal_MICCAI2025,
author = { Wu, Yi and Chen, Yuhang and Cui, Jiahao and Liu, Jiaji and Liang, Lin and Li, Shuai},
title = { { Cross-Modal Contrastive Learning for Emotion Recognition: Aligning ECG with EEG-Derived Features } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15970},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a new training methodology for aligning physiological signals. The aim is to transfer knowledge from one signal to another. The approach aligns extracted patches and sequences of two different signals using Transformer-based methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Based on the paper's results, the methodology improves the joint exploitation of two different types of physiological signals by aligning Transformer-extracted features through the proposed training method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
First of all, the paper is outside the scope of MICCAI: none of the proposed methods could be applied to medical imaging problems.
The method has several weaknesses:
- there is a strong assumption that there is no emotion contamination in the signal outside the time windows considered. Indeed, the proposed windowing method with contrastive learning assumes no link between positive and negative examples.
- the 10-second time window assumes that the emotion is stable during emotion induction, whereas the participant’s evaluation only takes place at the end of emotion induction.
- The paper begins by stating that EEG is difficult to use, but the proposed method simply adds an extra sensor (ECG) on top of EEG. On the other hand, it is stated that the ECG does not carry emotional information, yet the paper proposes to use this very signal. This is probably a wording problem, but it creates an inconsistency in the paper.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(0) Not relevant to the MICCAI community
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
First of all, the paper is outside the scope of MICCAI: none of the proposed methods could be applied to medical imaging problems.
The method has several weaknesses:
- there is a strong assumption that there is no emotion contamination in the signal outside the time windows considered. Indeed, the proposed temporal blending method with contrastive learning assumes no link between positive and negative examples.
- the 10-second time window assumes that the emotion is stable during emotion induction, whereas the participant's evaluation only takes place after the emotion has been induced.
- The paper begins by stating that EEG is difficult to use, but the proposed method simply adds an extra sensor (ECG). On the other hand, it is stated that the ECG does not carry emotional information, yet the paper proposes to use this very signal. This is probably a wording problem, but it creates an inconsistency in the paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I completely missed the point that only ECG is used at inference time. Based on Reviewer 2’s comments, it seems I am not the only one — this aspect is not fully clear in the current manuscript. That said, the approach is interesting, the results are convincing, and similar work (e.g., Brant-X) uses comparable terminology. If possible, I recommend the authors add a sentence explicitly clarifying that after alignment, only ECG is required during inference.
Review #2
- Please describe the contribution of the paper
Addresses the desire to infer emotion from more accessible ECG. While the intention is to infer emotion, the major technical contribution is the semantic contrastive alignment of ECG and EEG.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well structured, with a clear technical focus on the contrastive semantic alignment. I also applaud the authors for “landing” it with an emotion inference task, invoking a classifier to reach a quantifiable endpoint.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Quantitative result reporting lacks statistical information and power tests.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Good framework. Structured design. Needs proper result support.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes a cross-modal contrastive learning framework for emotion recognition that aligns ECG features with semantically rich EEG-derived representations. The method leverages pretrained EEG encoders, temporal patch shuffling, and multi-scale contrastive loss to transfer emotional information from EEG to ECG. The resulting ECG encoder achieves strong performance on emotion classification tasks across two benchmark datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper has several notable strengths:
- Aligning ECG with EEG-derived semantic features is a practical and under-explored direction, especially considering the real-world limitations of EEG acquisition.
- The combination of patch-level and sequence-level contrastive objectives effectively improves the quality and discriminative power of ECG features.
- The method demonstrates consistent improvements over baselines on two datasets, and the ablation studies are relatively thorough.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper states that the EEG encoder is pretrained via reconstruction. However, no details are provided regarding the pretraining architecture or procedures, and the effectiveness of this pretraining is not validated through comparative or ablation experiments.
- Experimental results show that SSL-ECG outperforms EEG-Conformer, suggesting that ECG data already contains strong and discriminative information, possibly even superior to that of EEG. Therefore, it is worth questioning whether forcibly aligning ECG to EEG’s feature space is optimal. There may be more effective strategies for fusing both modalities that preserve the valuable information in ECG while incorporating the rich semantics of EEG.
- The novelty of the method lies more in the task formulation and training strategy than in architectural innovation. The overall model structure bears strong similarity to Brant-X and lacks distinct architectural originality.
- The generalizability of the model needs further validation and discussion. Since the paper’s motivation is to replace EEG with ECG in practical emotion recognition scenarios, it is essential to verify whether the learned ECG encoder generalizes to new ECG datasets or zero-shot settings. In practice, re-aligning to EEG may not be feasible for every new dataset.
- The formatting of tables is not standardized and should be improved for clarity and consistency.
- The chosen baselines are not necessarily state-of-the-art for the respective datasets, and thus lack representativeness. Stronger comparisons would strengthen the paper’s empirical claims.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend a weak accept. The paper proposes a well-motivated and practical cross-modal contrastive learning framework. While the core ideas are interesting and empirically validated, the exploration of multi-modal fusion and alignment strategies remains limited. The paper would benefit from clearer explanations of the training pipeline, comparisons to more recent multimodal methods, and broader generalization experiments. With revisions to improve clarity and completeness, this work has the potential to make a meaningful contribution to the field of emotion recognition and multi-modal physiological modeling.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
After the rebuttal, the paper still has the following three shortcomings: First, the paper does not make full use of the 8-page limit, so the authors should have enough space to elaborate on the pretraining process. Second, the approach of forcibly aligning the feature spaces of ECG and EEG remains questionable. Finally, the performance comparison with baseline models lacks persuasiveness.
Author Feedback
Thank you to the reviewers and AC for your careful review of the article and for your constructive comments. The following responses are arranged in the same order as the comments raised by each reviewer.

R1: 1. Thank you for the valuable feedback. We performed 5-fold cross-validation but only reported means. All SDs are below 1% (e.g., DREAMER valence: ±0.76%, arousal: ±0.39%), indicating stable performance. If permitted, we will include SDs. Following prior works, we used a 4:1 train-test split to ensure consistency and statistical power.

R2: 1. Due to space limits, we omitted pretraining details. As noted under your “strengths”, our work focuses on integrating EEG semantics into ECG via contrastive learning, and using a pretrained EEG encoder better demonstrates this integration. Given limited data and lightweight models, multi-stage training is more feasible than end-to-end encoder training and alignment.
2. Thank you. SSL-ECG targets ECG-based emotion recognition and is validated on DREAMER and AMIGOS, while EEG-Conformer learns general EEG features for broader tasks. Though ECG has less semantic content, it is more stable under emotional states, which may explain why SSL-ECG outperforms EEG-Conformer. Our findings demonstrate the effectiveness of the integration strategy for emotion classification, which can be explored further in subsequent work.
3. Thank you for your suggestions. Brant-X offers an impressive, unified EEG-EXG alignment framework for multiple scenarios. We both use contrastive learning, which tends to lead to architectural similarities. Contrastive learning centers on how positive and negative sample pairs are constructed, and here our paper differs significantly from Brant-X: we define positives based on temporal alignment and introduce negatives via segment shuffling. Ablation studies confirm that this strategy is crucial to our performance improvement. Our aims also differ: our approach integrates EEG’s semantic information into ECG to improve ECG-based emotion recognition. If permitted, we will clarify the core contribution in the architecture figure.
4. Thank you for the insightful comment. While our current work does not evaluate cross-dataset generalization, the ECG encoder functions independently at inference and benefits from the inherent stability of ECG signals. This may suggest better generalizability than EEG-based models. If possible, we will emphasize this point and the limitation.
5. Thank you. We followed the official template, which may have affected the table formatting. If possible, we will further refine the layout to improve clarity.
6. Thank you for the comment. Since our method uses ECG only at test time, a direct comparison with multimodal SOTA methods would be unfair. We therefore selected recent strong baselines using single-modality signals for a fair and representative evaluation.

R3: 1. Thank you. We believe our ECG-based diagnostic framework aligns well with MICCAI’s scope, especially within the CAI theme, and the organizing committee explicitly provides the option of “EEG/ECG” under “Mode”. Moreover, MICCAI has accepted a growing number of such studies (8, 11, and 10 papers from 2022–2024), indicating community relevance.
2-3. Thank you for the comment. While we define positive and negative pairs, our contrastive loss uses soft constraints without assuming strict independence. Following prior works, we excluded the preparation period and used 10-second windows that map to the overall emotional trend represented by the label. In practice, emotion recognition is typically performed continuously to support downstream tasks rather than as discrete classifications. In summary, we do not assume that the emotion in the input is stable, nor that the input captures all available information.
4. Thank you. Our goal is to enhance ECG-based emotion recognition by leveraging EEG semantics during training. During testing, we only use ECG, fulfilling our goal of ECG-only emotion classification when EEG is unavailable.
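To make the rebuttal’s ECG-only inference point concrete (the point Reviewer #1 initially missed), the following is a hypothetical sketch of probing the aligned ECG encoder with a lightweight classifier at test time, with no EEG input. The class name, encoder interface, and dimensions are illustrative assumptions, not the authors’ released API.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Linear probe on a frozen, contrastively aligned ECG encoder.
    The encoder is ASSUMED to return (patch_tokens, sequence_embedding)."""

    def __init__(self, ecg_encoder, embed_dim=128, num_classes=2):
        super().__init__()
        self.encoder = ecg_encoder              # frozen after alignment
        self.head = nn.Linear(embed_dim, num_classes)

    @torch.no_grad()
    def encode(self, ecg):
        _, seq_emb = self.encoder(ecg)          # (B, D) sequence embedding
        return seq_emb

    def forward(self, ecg):
        # Only ECG is needed here; the EEG branch is discarded at inference.
        return self.head(self.encode(ecg))      # logits, e.g. valence high/low

# Hypothetical usage, with ecg_batch of shape (B, channels, samples),
# e.g. one 10-second window per row:
#   model = EmotionClassifier(aligned_ecg_encoder)
#   pred = model(ecg_batch).argmax(dim=-1)
```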
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A