Abstract

Sleep stage prediction is a critical task in medical diagnostics, particularly for sleep disorders such as Obstructive Sleep Apnea-Hypopnea Syndrome (OSAHS). Traditionally, this task involves analyzing Electroencephalogram (EEG) signals and classifying the stages based on general features, often relying on medical expertise. However, this process is prone to bias and variance, as clinicians incorporate subjective experience into their assessments. In recent years, multimodal large language models (MLLMs) have demonstrated significant advancements, particularly in medical applications, outperforming traditional methods in many domains. Despite their promising potential, MLLMs are prone to memorization effects and require high-quality, well-labeled data for fine-tuning. Label noise, commonly present in real-world datasets, can severely hinder their performance and robustness. Consequently, directly applying MLLMs to sleep stage prediction using noisy EEG labels presents a challenge. In this paper, we introduce a novel framework for sleep stage prediction using EEG data under label noise, leveraging the power of MLLMs. Our approach integrates multi-perspective agreement techniques to identify high-quality samples based on the prior knowledge embedded in MLLMs. We then employ a self-training method to enhance prediction accuracy despite the presence of label noise. We validate our framework on real patient EEG data for sleep stage prediction, and the results demonstrate that our approach is both robust and accurate under label noise, outperforming other state-of-the-art robust learning methods. Our code will be made publicly available at https://github.com/Leonard-zc/MICCAI2025-RSSP.
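
As a point of orientation before the reviews, here is a minimal, hypothetical sketch of the two-stage pipeline the abstract describes (multi-perspective agreement filtering followed by self-training). The callables render_epoch, agrees_with_label, and fine_tune are illustrative placeholders supplied by the caller; they are assumptions, not the authors' released implementation.

```python
# Minimal sketch of the pipeline outlined in the abstract. The three callables
# are placeholders for components the paper describes but does not publish:
#   render_epoch(epoch) -> image           : draw a 30-s EEG epoch as an image
#   agrees_with_label(mllm, image, label)  : multi-perspective agreement check
#   fine_tune(mllm, samples) -> mllm       : supervised fine-tuning / self-training step

def filter_high_quality(epochs, noisy_labels, mllm, render_epoch, agrees_with_label):
    """Stage 1: keep only samples whose noisy label survives the MLLM agreement check."""
    clean = []
    for epoch, label in zip(epochs, noisy_labels):
        image = render_epoch(epoch)
        if agrees_with_label(mllm, image, label):
            clean.append((image, label))
    return clean

def robust_sleep_staging(epochs, noisy_labels, mllm, render_epoch, agrees_with_label, fine_tune):
    """Stage 2: fine-tune the MLLM on the filtered, high-quality subset."""
    clean = filter_high_quality(epochs, noisy_labels, mllm, render_epoch, agrees_with_label)
    return fine_tune(mllm, clean)
```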

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1959_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{QiuXih_Robust_MICCAI2025,
        author = { Qiu, Xihe and Zhan, Chen and Ma, Gengchen and Huang, Jingjing and Tan, Xiaoyu},
        title = { { Robust Sleep Stage Prediction from Electroencephalogram with Label Noise Using Multimodal Large Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {586--596}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a framework for robust EEG-based sleep stage classification under label noise by leveraging multimodal large language models. However, it has major problems with novelty, experiments, and baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Leveraging multimodal LLMs, a rising trend in medical imaging, for EEG-based sleep staging expands MLLM utility beyond vision tasks.
    2. The multi-perspective agreement filter coupled with self-training helps combat label noise.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Transforming EEG signals into PNGs discards fine temporal‑spectral information, increases compute, and lacks justification compared to 1D‑CNN or transformer approaches directly on raw time‑series.

    2. Naive angle that lacks reader interest. The paper does not focus on real research challenges (and thus lacks clinical interest) but instead on engineering challenges like label distribution. If you want to focus on these questions, fine, but then the approach is not at all interesting or up to date. What I see is copying existing methods to MLLMs.

    3. Very outdated baselines (from papers published in 2010, 2018, and 2019). There is no comparison to recent deep learning architectures tailored for EEG sleep staging, which routinely achieve good accuracy on Sleep-EDF.

    4. Iterative self‑training risks reinforcing early biases. The paper does not describe safeguards (e.g., confidence thresholds for pseudo‑labels) to prevent drift.

    5. Please use Grammarly, ChatGPT, or the help of an English speaker to go through the writing again, e.g., “We proposes”.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Architecture is trendy and method figure is good. However, please consider addressing major comments and submit to a journal. This is not a fit for MICCAI as MICCAI audience cares about novelty.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have carefully addressed each of my questions and concerns; therefore, I recommend Accept after the rebuttal.



Review #2

  • Please describe the contribution of the paper

    The main contribution of the paper is the development of a framework for sleep stage prediction using EEG data with label noise, leveraging multimodal large language models (MLLMs). The proposed approach integrates multi-perspective agreement techniques to identify high-quality samples and employs a self-training strategy to enhance the model’s robustness and accuracy despite noisy labels. This framework improves sleep stage classification performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    One of the major strengths of the paper is its sleep stage prediction framework that addresses the challenge of label noise in EEG data. The paper introduces an innovative combination of multi-perspective agreement techniques and self-training strategies, which effectively reduces the impact of noisy labels by selecting high-quality samples for training and refining the model iteratively. This approach is particularly interesting because it leverages the power of multimodal large language models (MLLMs) in the context of EEG-based sleep staging, an area where MLLMs have not been extensively applied. The paper also demonstrates strong clinical feasibility by validating the framework using real patient EEG data, showing improvements over existing noise-robust learning methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The core of the model is the combination of multi-perspective consistency techniques and self-training strategies to construct a high-quality training set and further improve the model’s robustness and accuracy through multi-round optimization. While this design strategy has shown good results in the experiments, there may be some potential issues with the model design.

    First, the model relies on multi-perspective consistency to filter high-quality samples. Although this method can effectively reduce the impact of label noise, it may struggle to capture all the important and valid samples, especially when the data quality is poor or the noise is highly complex. Therefore, the model might miss some potentially important data, especially in clinical data, where the diversity and complexity of the samples could lead to suboptimal consistency filtering results.

    Second, the model employs a self-training strategy that depends on the initial high-quality samples for secondary training and filtering. While this approach allows for gradual expansion of the training set, it is also susceptible to the quality of the initial samples. If the initial high-quality sample selection is flawed, those errors may be amplified and propagated through the entire training process, degrading subsequent performance optimization.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach presented, which combines multi-perspective consistency and self-training strategies for EEG sleep stage prediction under label noise, shows promising results, especially in improving robustness and accuracy. The paper provides a clear methodology, experimental validation with real clinical data, and demonstrates significant performance improvements over existing methods. However, the paper lacks sufficient exploration of the scalability of the model in diverse real-world clinical settings and does not fully address the challenges of noise patterns across different patient populations. Therefore, the paper shows strong potential but could benefit from additional discussion on these aspects.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a robust sleep stage classification framework using single-channel EEG data with label noise, leveraging the power of multimodal large language models (MLLMs). The key methodological innovation lies in combining a multi-perspective agreement strategy with an iterative self-training mechanism to identify and refine high-quality samples in noisy datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work presents a fresh approach by combining multi-perspective agreement with MLLM prompt-based question answering for label cleaning in EEG sleep staging. The innovation lies in both the method itself and its demonstrated effectiveness. I particularly appreciate the thoughtful implementation of dynamic curriculum optimization and ongoing data re-screening processes, which significantly enhance model robustness and generalization capability across different datasets. The ablation studies are presented with clarity and strong visual support, effectively showing how each component (consistency filtering, balanced positive-negative sampling, and the iterative self-training framework) contributes to the overall performance gains.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The core idea of using MLLMs for consistency-based sample filtering is compelling but would benefit from greater transparency. The paper does not provide examples of the prompts used, the format of EEG image inputs, or how responses are parsed. Without such details, reproducibility and clarity are limited.

    2. There is no indication of whether the code, prompts, or training scripts will be released. This hinders reproducibility, although the methods are described in reasonable detail.

    3. Lack of relevant cross-domain citations: While the paper proposes an innovative approach to handling label noise in EEG-based sleep staging, it misses the opportunity to reference related work that addresses similar issues in other physiological signals. For example, Zhao et al. (2025) proposed a method using the SuSiE model to segment heart rate time series and identify sleep regions [https://doi.org/10.1080/00952990.2024.2441868]. A possible citation sentence: ‘Similar to how Zhao et al. (2025) applied SuSiE for robust sleep region detection in heart rate signals, our work leverages multi-perspective consistency and self-training to identify clean samples in EEG data, ensuring robust sleep stage classification under label noise.’

    4. The current tables report only single-point accuracy results, without providing standard deviations from repeated runs or cross-validation. It is recommended that the authors include the mean ± standard deviation over five independent runs to enhance the credibility and robustness of the reported results.

    5. Additionally, adding a case study visualization (e.g., ground truth vs. initial prediction vs. post-filtering prediction vs. final self-trained prediction) would greatly improve the interpretability and persuasive power of the method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper offers a timely, well-motivated, and methodologically novel contribution to sleep stage classification in noisy EEG datasets. It creatively adapts large vision-language models for a biosignal task and addresses label noise through consistency filtering and self-training — both well-executed and empirically validated. The paper is clearly written, supported by comprehensive ablation studies, and poised for real-world impact. Minor improvements in model transparency and statistical reporting are encouraged, but they do not detract from the paper’s strong scientific merit and originality.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have provided a thorough and well-articulated rebuttal that effectively addresses the concerns I raised.




Author Feedback

We deeply appreciate the chairs and reviewers for their dedicated review of our work. We are encouraged by your recognition of our work. Below, we address the major concerns raised by each reviewer.

Reviewer #1
1. Converting electroencephalography (EEG) signals into images is an intentional design choice to enhance interpretability and support human-AI interaction, rather than merely to fit multimodal large language models (MLLMs). Visual formats make temporal-spectral features directly observable to clinicians, enabling more explainable outputs and easier error analysis, which 1D models struggle to provide.
2. We respectfully clarify that our work is not a simple adaptation of existing methods but a novel framework combining multi-view consistency filtering with multi-round self-training for robust EEG sleep staging. Due to inter-annotator variability and the inherent ambiguity of medical data, reliable labeling is highly challenging; label noise and annotation inconsistency are common and severely degrade model generalization, so learning under noisy labels has become a core issue in medical data analysis [1]. A key innovation of our method is leveraging an MLLM to perform multi-view consistency assessment. For each EEG image, the model is prompted from two angles: (1) a verification prompt checking whether the label fits the image, and (2) a classification prompt predicting the most likely sleep stage. A sample is retained only when both answers match the original label. This dual-perspective filtering is enabled by MLLMs' ability to reason across both visual and textual inputs, something conventional models such as CNNs or transformers cannot achieve, and it enables more robust sample selection under label noise.
3. We selected widely cited robust learning baselines (e.g., Co-Teaching+, GCE) to focus on label noise resilience. While these methods are earlier, they are still actively used in noisy-label learning benchmarks [2].
4. While our main text (Eq. 1, Sec. 2.1) presents binary consistency as a filtering gate, we did explore threshold-based confidence constraints during early iterations; these details were omitted due to space limits. Empirically, our multi-round design (Figure 2) naturally prevents drift via repeated filtering.
5. We will revise the manuscript with professional proofreading and native English review.
[1] Shi, J. et al. (2024). A survey of label-noise deep learning for medical image analysis. Medical Image Analysis, 95, 103166.
[2] Zhang, Q. et al. (2024). Cross-to-merge training with class balance strategy for learning with noisy labels. Expert Systems with Applications, 249, 123846.
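
To make the dual-perspective filtering described above concrete, here is a small illustrative sketch. The prompt wording, the STAGES list, and the ask_mllm(image_path, prompt) helper are assumptions for illustration only; the paper's actual prompt templates (Figure 1 of the paper) are not reproduced here.

```python
# Hypothetical illustration of the dual-perspective consistency check described
# in the rebuttal: a verification prompt and a classification prompt must both
# agree with the (possibly noisy) label for the sample to be kept.
# ask_mllm(image_path, prompt) -> str is a caller-supplied placeholder.

STAGES = ["Wake", "N1", "N2", "N3", "REM"]  # assumed AASM-style stage names

def dual_perspective_keep(image_path, noisy_label, ask_mllm):
    # Perspective 1: verification prompt ("does this label fit the image?")
    verify_prompt = (
        "This image shows a 30-second single-channel EEG epoch. "
        f"Is the sleep stage '{noisy_label}' consistent with it? Answer yes or no."
    )
    verified = ask_mllm(image_path, verify_prompt).strip().lower().startswith("yes")

    # Perspective 2: classification prompt ("which stage is this?")
    classify_prompt = (
        "Which sleep stage best describes this EEG epoch? "
        f"Answer with exactly one of: {', '.join(STAGES)}."
    )
    predicted = ask_mllm(image_path, classify_prompt).strip()

    # Conservative gate: retain the sample only when both perspectives match the label.
    return verified and predicted == noisy_label
```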

Reviewer #2
1. We acknowledge that multi-view consistency filtering may miss valid data under complex noise. To address this, our design incorporates multi-round self-training that progressively expands the high-confidence sample set. As shown in Figure 2(c), performance improves over iterations, validating the effectiveness of this mitigation.
2. To reduce early-stage bias propagation, we enforce a conservative consistency criterion (Eq. 1, Sec. 2.1) that requires agreement on both the verification and recognition prompts. This ensures that only reliable samples are included in each round.
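
A minimal sketch of what such a multi-round loop with a confidence safeguard could look like is given below. The fine_tune, predict_with_confidence, and is_consistent callables, the number of rounds, and the 0.9 threshold are assumptions for illustration, not values or code reported in the paper.

```python
# Hypothetical multi-round self-training loop: start from the consistency-filtered
# set, fine-tune, then re-screen all samples with the updated model, admitting a
# sample only when the prediction matches its label with high confidence.
# All callables and the threshold are placeholders, not the authors' settings.

def iterative_self_training(samples, mllm, is_consistent, fine_tune,
                            predict_with_confidence, rounds=3, conf_threshold=0.9):
    # Round 0: the conservative consistency gate from the multi-perspective check.
    clean = [(x, y) for x, y in samples if is_consistent(mllm, x, y)]

    for _ in range(rounds):
        mllm = fine_tune(mllm, clean)  # refine the model on the current clean set

        # Re-screen with the updated model; the confidence threshold limits drift
        # by keeping agreements only when the model is sufficiently sure.
        clean = []
        for x, y in samples:
            pred, conf = predict_with_confidence(mllm, x)
            if pred == y and conf >= conf_threshold and is_consistent(mllm, x, y):
                clean.append((x, y))
    return mllm
```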

Reviewer #3
1. We agree that reproducibility is important. Due to space and anonymization constraints, prompt templates and input examples are only briefly shown in Figure 1.
2. We confirm that we plan to release all code, prompt templates, and fine-tuning scripts upon publication, pending institutional clearance. We understand this is crucial for reproducibility.
3. Excellent suggestion. We will include relevant works on label-noise learning in physiological signals (e.g., SuSiE on heart rate) to strengthen the theoretical grounding.
4. We conducted five repeated runs and observed low variance. We will update Tables 1 and 2 with mean ± standard deviation values.
5. Due to space constraints, we will include a case-level visualization in our open-source repository to improve interpretability.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Vox populi: All three reviewers voted in favor of this paper, including a strong positive review and a vote switched to “for” post-rebuttal.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


