Abstract
The presence of interictal epileptiform discharges (IEDs) in electroencephalogram (EEG) recordings is a critical biomarker of epilepsy. Even trained neurologists find detecting IEDs difficult, leading many practitioners to turn towards machine learning for help. Although deep learning algorithms have shown state-of-the-art accuracy on this task, most models are uninterpretable and cannot justify their conclusions. Absent the ability to understand model reasoning, doctors cannot leverage their expertise to identify incorrect model predictions and intervene accordingly. To improve human-model interaction, we introduce ProtoEEG-kNN, an inherently interpretable IED-detection model that follows a simple case-based reasoning process. Specifically, ProtoEEG-kNN compares input EEGs to samples from the training set that contain similar IED morphology (shape) and spatial distribution (location). We show that ProtoEEG-kNN can achieve state-of-the-art accuracy while providing visual explanations that experts prefer over existing approaches.
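To make the case-based reasoning concrete, the retrieval step can be sketched as follows (a minimal illustration with hypothetical names, using cosine similarity as a stand-in for the paper's learned similarity; not the released implementation):

```python
import numpy as np

def retrieve_similar(query_emb, train_embs, train_labels, k=10):
    """Return the k training EEGs most similar to the query embedding.

    query_emb:    (D,) embedding of the input EEG
    train_embs:   (N, D) embeddings of all training samples (the prototypes)
    train_labels: (N,) soft labels, e.g. fraction of annotators voting IED
    """
    # Cosine similarity between the query and every training prototype.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q
    topk = np.argsort(-sims)[:k]
    # The k retrieved EEGs are shown to the user as the model's evidence;
    # a simple prediction is the mean label of those neighbors.
    return topk, sims[topk], float(train_labels[topk].mean())
```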
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3200_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/DennisTang2000/ProtoEEG
Link to the Dataset(s)
N/A
BibTex
@InProceedings{TanDen_This_MICCAI2025,
author = { Tang, Dennis and Donnelly, Jon and Barnett, Alina Jade and Semenova, Lesia and Jing, Jin and Hadar, Peter and Karakis, Ioannis and Selioutski, Olga and Zhao, Kehan and Westover, M. Brandon and Rudin, Cynthia},
title = { { This EEG Looks Like These EEGs: Interpretable Interictal Epileptiform Discharge Detection With ProtoEEG-kNN } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {627 -- 637}
}
Reviews
Review #1
- Please describe the contribution of the paper
Proposes an interpretable model, inspired by ProtoPNets, for IED detection from EEG signals. The comparison space of the prototypes is complemented with statistical features derived from signal analysis, providing a novel similarity metric that is averaged using channel-wise weights. The last linear layer is replaced with a kNN layer over the learned embeddings to provide comparisons with the top k most similar prototypes (instances).
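For concreteness, the channel-weighted aggregation described here can be sketched as follows (shapes and names are assumptions for illustration, not taken from the paper):

```python
import torch

def prototype_scores(per_channel_sim: torch.Tensor,
                     channel_logits: torch.Tensor) -> torch.Tensor:
    """Aggregate per-channel prototype similarities with learned channel weights.

    per_channel_sim: (M, C) similarity of each of M prototypes to the input,
                     computed independently on each of C EEG channels
    channel_logits:  (C,) unnormalized channel-importance scores w_c(x)
    """
    weights = torch.softmax(channel_logits, dim=0)   # nonnegative, sums to 1
    return per_channel_sim @ weights                 # (M,) one score per prototype
```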
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The related works include relevant references and their limitations, pointing to the main differences with respect to the current approach.
- Complete evaluation as it considers multiple levels: 1) comparisons to other baseline models in the main task, 2) ablations of the model components, 3) human-based evaluation to capture alignment of similarity given by models and humans.
- The method description is easy to follow and presents appropriate equations that enhance its understanding. Ideas borrowed from previous works are properly referenced, such as the kNN adoption to ProtoPNets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Some aspects of the evaluation are unclear: it is mentioned that there is disagreement on the labels, and I wonder how this affects the evaluation of the task; the authors could elaborate more on how the loss term handles uncertain labels. On page 6, the evaluation of some models is done over different types of inputs: FFTs of EEG samples (including the proposed approach) or ISFs of EEG samples. Why is this the case? Based on Figure 1, I thought the proposed architecture takes raw EEG signals as input.
- Limited justification for including all 3 ISFs, which seem to be relevant features in signal analysis. An ablation is reported when they are all removed, but not tested individually.
- The paper does not discuss potential computational issues, given that the new prototype layer creates a prototype from every training sample. Moreover, the new similarity metric requires more comparisons that need FFT-based computations (are these computed online?).
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- I suggest agreeing on what happens when models do not explain their reasoning: the abstract says “doctors cannot leverage their expertise to identify incorrect model predictions”, but the introduction frames the problem of providing no insight into how decisions are made as arising when a practitioner disagrees with a model.
- The 3rd contribution listed in the introduction (the channel-wise weights) sounds more like a step of the approach than a contribution.
- Not sure if this is a typo, but in the kNN replacement step paragraph, the notation in the prototype definition has a dash symbol before the different ISFs (it looks like “-var”, for example).
- Missing reference for EEG-ProtoPNet. Or is it the current method without the kNN? Same for SpikeNet.
- Figure 2 includes a table? Consider having a separate table.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The method definition is well justified and even though it borrows ideas from previous methods (like learning these prototypes and the kNN adaptation), it also introduces enhancements (like enriching the comparison space) motivated on the application domain (EEG analysis). The evaluation also demonstrates the contribution of different modules and comparisons to other baselines. The user study is relevant in the evaluation of such explanations that involve a notion of similarity.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper presents ProtoEEG-KNN, a novel and interpretable model based on ProtoEEGNet and kNN. The authors introduce new similarity metrics and augment the latent features with interpretable statistical features (ISFs), enabling more clinically meaningful comparisons of EEG signals, particularly focusing on spike morphology and spatial patterns. The model operates in a fine-grained, channel-wise regime, allowing more precise learning of the embedding space and offering channel-specific interpretability. At inference, ProtoEEG-KNN retrieves only the k most similar training samples rather than relying on the full set of learned prototypes, as done in standard prototypical networks. This improves quantitative performance while also reducing the cognitive load on users during model validation. Crucially, the authors go beyond algorithmic design by conducting a user study to assess the interpretability and fidelity of the model’s explanations—an essential but often overlooked step in explainable AI (XAI) research, especially in clinical domains. This makes the work a strong step toward real-world adoption of interpretable EEG classification models.
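As a rough illustration of the ISF augmentation described above (a sketch under assumed shapes; the function name is hypothetical):

```python
import torch

def build_prototype_representation(latent: torch.Tensor,
                                   isfs: torch.Tensor) -> torch.Tensor:
    """Concatenate per-channel latent features with interpretable statistics.

    latent: (C, D) per-channel encoder embedding of one EEG sample
    isfs:   (C, F) per-channel interpretable statistical features
            (e.g. range, variance, FFT magnitudes)
    """
    return torch.cat([latent, isfs], dim=-1)  # (C, D + F)
```

Concatenation keeps the interpretable features explicit in the comparison space rather than entangling them inside the learned embedding.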
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper clearly motivates the need for ProtoEEG-KNN by identifying limitations in prior prototypical network approaches for EEG, setting up a natural and logical progression to the proposed solution.
- The proposed use of Interpretable Statistical Features (ISFs) to augment the latent features is a novel contribution, especially in the EEG domain. This enhancement improves the semantic richness of the embeddings and aligns model reasoning with clinically relevant EEG characteristics.
- The model’s fine-grained, channel-wise design shows a smart and original adaptation. Rather than aggregating across channels, its encoder preserves channel-specific information, which allows for more fine-grained interpretability.
- Quantitative results show superior performance, and the qualitative analysis (e.g., retrieved EEG examples) demonstrates the model reasoning to be more meaningful and clinically intuitive.
- I found the inclusion of a user study to be a major strength. It evaluates the fidelity of model explanations from a clinical perspective. This is an important step that is missing in many explainable AI (XAI) papers, while being essential for real-world adoption.
- The method is evaluated on a fairly large dataset.
- Overall, despite some minor missing content, the paper is well-written and self-contained, with clear mathematical formulations and defined notations, which makes it easy to read.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
There are a few limitations in the current manuscript, some of which could potentially be addressed during the rebuttal. The key weaknesses are outlined below; additional limitations are discussed in subsequent sections of this review.
- Missing quantitative comparison to ProtoEEGNet [18]: Although ProtoEEGNet is cited, its performance is not reported in the results section. Given that the proposed method is a modified version of ProtoEEGNet, including a quantitative baseline comparison would strengthen the claim that the proposed design offers superior performance.
- Ambiguity in uncertainty handling: The first contribution highlights an update to the loss terms to handle uncertain labels (“…update the loss terms to reflect training under uncertain labels”). However, it is unclear how this uncertainty is actually modeled or utilized. Was this simply achieved by using BCE loss with soft labels (e.g., from 0/8 to 8/8) rather than one-hot encoding? If so, wouldn’t this be the same approach as ProtoEEGNet [18]? In that case the novelty claim should be reworded. If the uncertainty handling was achieved through other steps, then please clarify.
- Lack of centralized implementation details: Due to space constraints, and with the current content organization style, the paper lacks a dedicated implementation section. As a result, some technical details (e.g., selected hyperparameter values and their tuning strategies) are scattered or missing. This also applies to some dataset descriptions, citations, and reporting of results. If the authors plan to release the codebase, that could mitigate the impact of this limitation, but clarification on this point would be helpful.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Aside from the points mentioned so far in the major weaknesses, this paper can be further improved by addressing some minor weaknesses.
- In “… while revealing the spatial focus of the model’s attention across the EEG.”, readers may interpret this as meaning the model can highlight which part of the EEG signal within a channel is important (along the T dimension). Was this meant to be the message? Or did the authors mean by “spatial” which electrode or location on the brain is in focus (along the C dimension), e.g., Fp1 vs. O2?
- Will the codebase be published? With the current organization, the implementation details are scattered around in different sections, and this makes it a bit harder for readers.
###### Regarding the Implementation details: ######
- How many prototypes are there per class? How many classes? Are these the same as in ProtoEEGNet [18], which had 9 classes (0/8 to 8/8) and 12 prototypes per class (M = 108)?
- What are the values of M, c_fft, epsilon, and the loss coefficients (K_i), and how were some of these hyperparameters selected?
- Similarly, for the kNN-based inference, how was k = 10 selected?
- The Introduction and Related Work sections could potentially be merged to save space for further implementation details. However, this might be too substantial a change for the rebuttal stage, and the current flow is fairly smooth and readable.
###### Regarding the Dataset and Results: ######
- Is the data public or private?
- What is the distribution of the IED labels, in both granular and binary formats? Is the reported binary accuracy a balanced accuracy to account for class imbalance? The use of class-balanced sampling suggests imbalance—how exactly is this sampling performed, and is it based on binary or granular labels (0/8 to 8/8)?
- Why is the ablation study done on the test set? The test set should be reserved for final evaluation only—ablation should be performed on a validation set to avoid data leakage.
- “We train with 3 different random seeds and report mean and standard deviation.” Since SpikeNet is also a neural network–based model, it would be more consistent to report its results with mean and standard deviation as well. While this is a minor concern—especially given the noticeable performance gap in favor of the proposed model—consistent reporting would strengthen the experimental rigor and make comparisons more fair and transparent.
- SpikeNet has no citation.
- Figure 2’s quality can be improved; it is pixelated. Please fix this.
- Why is ProtoEEGNet [18] not reported in the quantitative results? It is cited as one of the related works, and its lack of fine-grained channel-wise interpretability is discussed too. So having it in the quantitative results would be helpful in demonstrating the superiority of ProtoEEG-KNN.
- lambda_{CoefReg} encourages the lambdas to be closer together. I am curious what the final learned values of the lambdas were! They were not mentioned in the results.
###### Some other minor comments: ######
- Citations grouped together should be in ascending order, e.g., [5, 6, 18] in “… Net style reasoning to IED detection [18, 6, 5]”.
- The phrase “in layer g” is repeated in the paragraph “ISFs and Prototype Similarity”.
- In Figure 1, the channel names appear too close to the w_c(x) bars, which may cause visual confusion. If possible, consider adjusting the layout to better separate the Input and w_c(x) columns—perhaps by adding more spacing or aligning the input channel names directly under the Input header. As it stands, readers unfamiliar with EEG terminology might not easily recognize that labels like “Fp1-Avg” refer to EEG input channels. Improving the figure’s alignment would enhance readability, especially for those outside the domain.
- Overall, the equations are informative and well presented. However, if space permits, it would strengthen the paper to briefly contextualize the significance of some of them when they are introduced. For instance, the second contribution states that the model captures both “spike morphology and spatial distribution patterns,” yet the ISFs—such as range, variance, and FFT—are defined without connecting them back to how they contribute to modeling these specific aspects of EEG signals. Providing this context would improve the narrative flow and help readers better appreciate the motivation behind these design choices, making the paper more engaging and accessible.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I believe that most of the minor issues can be addressed, provided space allows, and doing so would significantly improve the clarity and completeness of the paper.
For the major concerns, I would particularly need clarification from the authors on two points: (1) the absence of ProtoEEGNet [18] performance results, which is important for validating the improvements claimed, and (2) the specific mechanism used for handling uncertainty in labels, which currently lacks clarity. The second point may be resolved with rewording too. Additionally, the missing implementation details—such as hyperparameter choices and tuning methods—could be consolidated or expanded if space permits.
Overall, the paper presents several innovative ideas, thoughtful architectural design choices, and promising evaluation results. I view the current limitations as addressable and believe that, with a satisfactory rebuttal, the paper could be suitable for acceptance.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes ProtoEEG-kNN, an adaptation of ProtoPNet with novel components to capture EEG characteristics while improving the classification accuracy and interpretability of deep networks. It adds a prototype layer to a pretrained SpikeNet for every EEG channel and uses kNN-based classification during testing to improve accuracy.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Extremely well written paper with good flow, supporting figures, and sound equations.
- Provides good prototype based explanation of final predictions
- Well-conducted experiments and the results are also validated by 4 neurologists showing good translation.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Methodology is computationally expensive. Requires the model to compare NxN times during training and also to store a prototype for every training sample for future use. The required compute power increases exponentially with larger datasets.
- Single-dataset experiments. Since the prototypes are matched with the training set, the model may or may not be generalizable. It might in fact require retraining for datasets collected at different sites. No results are provided to prove generalizability.
- Single train/val/test-split experiments. It is difficult to judge the statistical significance of the improvement; the model is possibly overfit to this dataset.
- The only methodological inventions, in my opinion, are adapting ProtoPNet per channel and computing weighted similarity values of three different types. This also increases compute 37-fold (the number of channels). The ablations don’t do all-channel computation, if I am understanding the table correctly. Even the ablations are not significantly different from each other.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Some missing details:
- The changes made to SpikeNet are very hard to follow.
- What is the input length in seconds per sample?
- To decide the class of a sample, was the max of the 8 annotations taken whenever needed (e.g., in L_{clst})?
- Was majority voting used over the kNN neighbors, or the mean?
- How was R^2 computed?
- In the equations under the subsection “ISFs and Prototype Similarity”, it makes more sense to subscript f(x_{i,c}) than f_c(x_i); the former matches p_{j,c}.
- It is unclear why some of the results have standard deviations while others don’t. Were three different random seeds used to train on the same train/test splits?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed methodology is quite interesting, and the experiments conducted show good results; it is particularly impressive that 4 neurologists validated it. But the final score they gave is still quite low, although better than that of the compared methods. The experiments are conducted on a single dataset and possibly require high compute power for future adaptations, hence the method may not be generalizable. It might be very difficult to reproduce the results given the lack of code and the use of a private dataset.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We appreciate all reviewers’ (R2,3,4) effort and detailed feedback. We address the following major weaknesses and resolve minor ones in the final version:
(R2, R3) Ambiguity with loss and 8-vote labels: For binary accuracy and AUROC, labels are binarized where 4+ votes indicate a positive IED. During training, we use 8-vote labels and require prototypes of each vote count to ensure diversity.
In contrast to ProtoEEGNet, our loss terms now account for the relative 0-8 scale. We replace their 9-class cross-entropy loss with L_{BCE} that uses soft labels to calibrate model outputs with vote proportions. L_{clst} is multiplied by y_{i} to be stronger for samples with higher labels (matching yes-IED is more important than matching no-IEDs). L_{sep} is weighted with |class(j^{*}) - y_{i}| to be higher when labels are more separated (e.g. 0 vs 8 = higher; 7 vs 8 = lower). L_{ortho} now applies to both embedding and FFT space.
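A minimal sketch of this weighting scheme (illustrative variable names and functional forms paraphrasing the description above, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def soft_label_losses(logits, y, dist_same, dist_other, other_class):
    """Loss terms adapted to soft 0-8 vote labels (a sketch, not exact code).

    logits:      (B,) model outputs
    y:           (B,) soft labels in [0, 1], i.e. votes / 8
    dist_same:   (B,) distance to the closest prototype of the sample's class
    dist_other:  (B,) distance to the closest other-class prototype j*
    other_class: (B,) vote fraction of that prototype, class(j*)
    """
    # BCE with soft labels calibrates outputs to vote proportions.
    l_bce = F.binary_cross_entropy_with_logits(logits, y)
    # Cluster loss scaled by y_i: matching a yes-IED prototype matters more.
    l_clst = (y * dist_same).mean()
    # Separation loss weighted by |class(j*) - y_i|: pushes harder when labels
    # are far apart (0 vs 8 > 7 vs 8); the sign follows the ProtoPNet
    # convention of maximizing this distance.
    l_sep = -((other_class - y).abs() * dist_other).mean()
    return l_bce, l_clst, l_sep
```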
(R2, R4) kNN-Efficiency: The added computational cost for kNN-replacement and 37-channel comparisons is small due to the single instruction, multiple data parallelism inherent in GPUs. The non-kNN model takes ~2 seconds to run 2,000 test samples and the kNN version takes ~6 seconds (on 1 P100 GPU). FFT is computed online and takes ~2 seconds for 100k EEGs. We note (for R4) that the kNN replacement occurs after training and therefore does not incur a NxN runtime expense. During inference, compute increases linearly (not exponentially) with respect to the size of the training data.
Our user-study shows the kNN-step significantly improves match quality. For added efficiency (at the potential cost of match-quality), users could reduce redundant prototypes with clustering methods or by pruning samples from the same patient.
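A sketch of why inference stays fast (illustrative code, assuming precomputed prototype embeddings and cosine similarity as a stand-in for the learned metric):

```python
import torch
import torch.nn.functional as F

def batched_similarities(test_eegs, train_embs, encoder):
    """Score a test batch against all N stored training prototypes.

    test_eegs:  (B, C, T) raw EEG batch
    train_embs: (N, D) precomputed prototype embeddings
    encoder:    network mapping (B, C, T) -> (B, D)

    Cost grows linearly (not exponentially) in N, and the single matrix
    product parallelizes well on a GPU, consistent with the ~6 s reported
    for 2,000 test samples.
    """
    with torch.no_grad():
        emb = encoder(test_eegs)                           # (B, D)
        # FFT-based features are cheap enough to compute online.
        fft_mag = torch.fft.rfft(test_eegs, dim=-1).abs()  # (B, C, T//2 + 1)
        sims = F.normalize(emb, dim=1) @ F.normalize(train_embs, dim=1).T
    return sims, fft_mag                                   # sims: (B, N)
```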
(R2) Input/Baseline Clarification: Our full model takes raw EEG signals as input, then combines their embedding with ISFs of the raw EEG. This balances the expressiveness of the neural-network embeddings (helps accuracy) and the simplicity of the ISFs (helps interpretability). We compared with two relevant ablations: kNN over the ISFs/FFT alone, and a kNN over the embedding space only (denoted “Remove wc & ISFs”).
(R2, R3) ISF motivation: We selected ISFs based on domain expert feedback to optimize the semantic match quality. Range encourages similarity in wave amplitude, variance can differentiate erratic fluctuations from flat backgrounds, and FFT reveals frequency content to match signals with similar sharpness and rhythm characteristics. We exclude mean/median as they do not capture shape well. We do not ablate over individual ISFs as they negligibly affect accuracy (Table 1) and would be too expensive to include in the user-study.
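A minimal sketch of these three ISFs computed per channel (illustrative only; n_fft_coefs stands in for the paper's c_fft, and normalization details are omitted):

```python
import torch

def compute_isfs(x: torch.Tensor, n_fft_coefs: int = 16) -> torch.Tensor:
    """Per-channel interpretable statistical features for one EEG, x: (C, T).

    Range tracks wave amplitude, variance separates erratic activity from a
    flat background, and low-order FFT magnitudes summarize frequency content
    (sharpness and rhythm).
    """
    rng = x.max(dim=-1).values - x.min(dim=-1).values           # (C,)
    var = x.var(dim=-1)                                         # (C,)
    fft_mag = torch.fft.rfft(x, dim=-1).abs()[:, :n_fft_coefs]  # (C, n_fft_coefs)
    return torch.cat([rng.unsqueeze(-1), var.unsqueeze(-1), fft_mag], dim=-1)
```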
(R3) Code/Data: We will release code containing our entire Bayesian hyper-parameter search, training details, and all learned coefficients. We will also add a citation for the recently published data.
(R3) ProtoEEGNet [18] Comparison: [18] is a non-archival extended abstract with no public code. One can view our submission, in a sense, as a publication-quality improvement of [18], with public code. We note that they report a +0.7% AUC improvement over SpikeNet, while ProtoEEG-kNN has a +3.7% increase over SpikeNet (although data splits differ across papers).
(R4) Method contribution: We also introduce channel-wise weighting (in addition to per-channel prototypes) and a kNN replacement step. These innovations are meaningful because they significantly improve match quality over ProtoPNet (5% -> 47.5% ‘best match’ rate; Fig. 2, top) while aligning model reasoning with neurologists (evaluating both spike location & overall distribution). It is very difficult to improve interpretability (Fig. 1, top; Table 2) while (a) not harming accuracy and (b) being fast enough for online inference. Therefore, the similarity in ablation accuracy is a positive: combined with the user study, it shows we achieve (a), and our ~6 s runtime over the test set shows (b).
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A