Abstract
Multiple Instance Learning (MIL) methods have succeeded remarkably in histopathology whole slide image (WSI) analysis. However, most MIL models only offer attention-based explanations that do not faithfully capture the model’s decision mechanism and do not allow human-model interaction. To address these limitations, we introduce ProtoMIL, an inherently interpretable MIL model for WSI analysis that offers user-friendly explanations and supports human intervention. Our approach employs a sparse autoencoder to discover human-interpretable concepts from the image feature space, which are then used to train ProtoMIL. The model represents predictions as linear combinations of concepts, making the decision process transparent. Furthermore, ProtoMIL allows users to perform model interventions by altering the input concepts. Experiments on two widely used pathology datasets demonstrate that ProtoMIL achieves a classification performance comparable to state-of-the-art MIL models while offering intuitively understandable explanations. Moreover, we demonstrate that our method can eliminate reliance on diagnostically irrelevant information via human intervention, guiding the model toward being right for the right reason.
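The transparent decision process described above — a slide score formed as a linear combination of concept activations, with human intervention zeroing out spurious concepts — can be sketched as follows. This is a minimal illustration, not the paper's implementation; all function and variable names (`slide_score`, `disabled`, etc.) are hypothetical.

```python
import numpy as np

def slide_score(concept_activations, class_weights, bias=0.0, disabled=()):
    """Linear score over concept activations; `disabled` lists indices of
    concepts a user has switched off via intervention (illustrative only)."""
    a = np.asarray(concept_activations, dtype=float).copy()
    a[list(disabled)] = 0.0  # human intervention: remove spurious concepts
    return float(a @ np.asarray(class_weights, dtype=float) + bias)

def contributions(concept_activations, class_weights):
    """Per-concept contributions (activation * weight) form the explanation."""
    return np.asarray(concept_activations, dtype=float) * np.asarray(class_weights, dtype=float)
```

Because the score is linear, each concept's contribution can be read off directly, which is what makes the per-slide explanation and the intervention (zeroing a concept) straightforward.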
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0542_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/ss-sun/ProtoMIL
Link to the Dataset(s)
N/A
BibTex
@InProceedings{SunSus_PrototypeBased_MICCAI2025,
author = { Sun, Susu and Midden, Dominique van and Litjens, Geert and Baumgartner, Christian F.},
title = { { Prototype-Based Multiple Instance Learning for Gigapixel Whole Slide Image Classification } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {519 -- 529}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper addresses the challenge of whole slide image (WSI) classification under the ubiquitous multiple instance learning (MIL) paradigm, focusing on the need for interpretability in computational pathology. The authors propose ProtoMIL, a novel MIL model that integrates concept discovery via sparse autoencoders to generate human-interpretable explanations and enable model interventions. The model represents predictions as linear combinations of concept activations, allowing for transparent decision-making and manual removal of spurious concepts.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper proposes an interesting combination of two components: 1) sparse autoencoder-based unsupervised concept discovery and 2) interpretable MIL with human intervention support.
- The ProtoMIL framework is designed to be inherently interpretable, allowing both local and global explanations via concept-level contributions.
- The paper presents a clear and well-structured methodology with illustrative figures that allow for easy understanding.
- Human intervention to remove spurious features is well-motivated and demonstrated convincingly using domain expert feedback.
- Quantitative experiments show that ProtoMIL performs comparably to state-of-the-art MIL methods while offering significant interpretability gains.
- The method appears to be generalizable across datasets, as shown by training on both Camelyon16 and PANDA with consistent interpretability and performance.
- Concept-level visualizations and naming of concepts with pathologist input enhance the practical relevance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While innovative in the context of computational pathology, the use of sparse autoencoders for unsupervised concept discovery is well established in other domains; novelty is incremental in this regard.
- No comparison is made to other concept-based MIL models like Concept Bottleneck or Concept MIL using vision-language models on similar datasets.
- The evaluation metrics are limited to Accuracy and AUC. Metrics such as AU-PRC or class-wise precision/recall would be valuable, especially for imbalanced settings.
- Additionally, only evaluation on two rather small datasets.
- No information is provided about runtime, computational cost, or efficiency compared to other MIL models.
- While intervention is a key contribution, the ablation comparing performance before and after intervention is relatively minor in performance gain and could benefit from more detailed analysis (e.g., number of retraining epochs, stability).
- Concept pruning is done manually. A brief discussion on semi-automated or scalable concept filtering methods would strengthen the practicality of the approach.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Please include more details on the hyperparameter selection process (especially λ₁ and d_hid). Was this dataset-specific?
- Consider discussing related work more broadly, especially other approaches combining unsupervised concept discovery with MIL, such as Concept Bottleneck models [10], and related MIL concept-based methods [19].
- While ProtoMIL aims to eliminate reliance on spurious correlations, it would be informative to quantify the performance gain (or robustness improvement) in a stress test, e.g., with added artifacts or under domain shift, but I understand that this exceeds the scope of MICCAI.
- Since you adopt the nomenclature of “prototype-based”, it would be wise to briefly mention and explain prototypes in MIL and cite related works such as [1] Butke, Joshua, Noriaki Hashimoto, Ichiro Takeuchi, Hiroaki Miyoshi, Koichi Ohshima, and Jun Sakuma. “Mixing Histopathology Prototypes into Robust Slide-Level Representations for Cancer Subtyping.” In International Workshop on Machine Learning in Medical Imaging, pp. 114-123. Cham: Springer Nature Switzerland, 2023.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is intriguing and introduces an interpretable MIL framework with human-intervention capability, which is relevant. The combination of sparse concept discovery and model-level transparency addresses a critical need in WSI classification. However, the novelty is limited as the individual components (SAE, MIL, concepts) are thoroughly explored, and comparison to similar concept-based methods is lacking. Additionally, the evaluation is convincing but rather weak and limited. With improvements in clarity on implementation/reproducibility, and most importantly broader related work coverage and comparison to those methods, the paper could merit acceptance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper introduces ProtoMIL, a novel inherently interpretable Multiple Instance Learning (MIL) model for histopathology Whole Slide Image (WSI) analysis. The key innovations of this work include:
- A method that employs a sparse autoencoder (SAE) to automatically discover human-interpretable concepts from the image feature space, overcoming limitations of previous approaches that required predefined concepts.
- A transparent decision process in which predictions are represented as linear combinations of discovered concepts, making the model's reasoning accessible to pathologists.
- A human-in-the-loop framework that allows users to perform model interventions by altering the input concepts, particularly by removing diagnostically irrelevant features.
- A practical demonstration showing that the approach can eliminate reliance on spurious correlations through human intervention, guiding the model toward "being right for the right reason."
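The SAE-based concept discovery summarized above follows the standard sparse-autoencoder recipe: an overcomplete ReLU encoder produces non-negative, sparse activations, trained with a reconstruction loss plus an L1 sparsity penalty. The sketch below is a generic illustration of that recipe under assumed dimensions and a random initialization, not the authors' code; `d_in`, `d_hid`, and `lam` correspond only loosely to the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 16, 64  # d_hid > d_in: overcomplete concept dictionary (assumed sizes)

W_enc = rng.normal(scale=0.1, size=(d_in, d_hid))
b_enc = np.zeros(d_hid)
W_dec = rng.normal(scale=0.1, size=(d_hid, d_in))

def encode(x):
    """ReLU encoder: yields non-negative, sparse concept activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_loss(x, lam=1e-3):
    """Reconstruction MSE plus an L1 penalty that encourages sparsity,
    i.e. few active concepts per patch feature vector."""
    z = encode(x)
    recon = z @ W_dec
    mse = np.mean((recon - x) ** 2)
    sparsity = lam * np.mean(np.abs(z))
    return mse + sparsity
```

After training, each hidden unit is treated as a candidate concept, inspected via its top-activating patches.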
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- ProtoMIL achieves classification performance comparable to state-of-the-art MIL models while offering intuitive explanations. Experiments on Camelyon16 and PANDA datasets demonstrated competitive accuracy and AUC metrics, with the added benefit of transparent decision processes.
- The sparse autoencoder effectively learns sparse, human-interpretable pathology concepts automatically, eliminating the need for hand-crafted features or predefined textual concepts as required in previous approaches (SI-MIL and Concept MIL). The authors found that 75% of detected concepts on Camelyon16 and 86% on PANDA captured monosemantic pathology features, as verified by pathologists.
- The ability to inspect activated concepts through representative patches and to modify the model by disabling spurious concepts represents a significant advance in human-AI collaboration for medical imaging. This addresses a critical need in clinical applications where explanation and trust are paramount.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- As acknowledged in the literature on prototype-based models, these approaches can sometimes face challenges in making the meaning of prototypes clear and unambiguous. While the paper demonstrates that many concepts were mono-semantic, some remained polysemantic, which could potentially lead to confusion in clinical interpretation.
- The multi-stage process (SAE training, concept discovery, ProtoMIL training) introduces computational overhead that isn't fully benchmarked in the paper. For practical deployment in clinical settings, understanding the computational requirements and training time would be valuable, especially for very large WSI datasets.
- While the approach was tested on two standard datasets (Camelyon16 and PANDA), more extensive validation across diverse pathology domains, cancer types, and staining methodologies would strengthen the claims of generalizability. The paper does not fully address how the approach performs with more heterogeneous data.
- The effectiveness of ProtoMIL depends heavily on the quality of concepts discovered by the SAE. The paper doesn't thoroughly explore how concept quality might vary across different tissue types or disease processes, or how the approach would handle rare pathological findings that might not be well-represented in the training data.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Based on a thorough analysis of this paper, I recommend its acceptance. This work makes several significant contributions to the field of computational pathology and addresses a critical gap between high-performance deep learning models and the interpretability needs of clinical practice. The weaknesses identified are reasonable areas for future work rather than fundamental flaws in the approach.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors state that existing MIL models offer suboptimal explanations and have poor actionability. This paper introduces ProtoMIL, an interpretable-by-design multiple instance learning model for analyzing whole-slide images, equipping end users with explanations and the ability to perform human interventions. The framework comprises a sparse autoencoder that automatically discovers interpretable pathology concepts from the images, which are then used to train ProtoMIL. The model's predictions are linear combinations of those concepts. The quality and representativeness of the learned concepts can be assessed via probing-set patches, so that clinical users can act on the input concepts to remove those that are diagnostically irrelevant. That "blacklist" of concepts is removed from model learning to reduce spurious correlations and better guide the model. The method reaches performance comparable to the state of the art on public datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Addressing explainability in computational pathology is an important aspect. Giving the clinical end-use the ability to perform model interventions to eliminate dependence on spurious signals is interesting.
The paper is overall well written and easy to follow. Although the idea is quite simple, the methodology is straightforward, well designed, and demonstrated effectively.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Section 2.2 (bottom) – the authors state that prior studies have shown that neurons in the latent representation often align with distinct and interpretable concepts. I am unsure whether the cited references are appropriate; there is extensive literature arguing that fully disentangled concept encodings are impossible: more often than not, different visual concepts are encoded in the same latent representation and are highly uninterpretable to humans.
Section 2.2 (end) – "… gave sufficiently sparse and interpretable concepts, and we used those values for our experiments." – I am confused about how one can tell whether the concepts are indeed interpretable at this stage. For that to be possible, you need to associate concepts with visual representations (next section). I suggest the authors comment on that beforehand, restructure the subsections, or move that empirical statement to a later subsection or directly to the Results.
Figure 3 – How do you prevent different neurons from encoding the same visual concepts? As we can see from Figure 3c), for instance, the top-left patch of neuron 1215 is very similar to the top-left patch of neuron 1863.
Section 3 – Data and Baselines. The two datasets differ greatly in size; the authors mention that they merge data from the two sources to learn shared concepts: how do they counterbalance the large size mismatch? The proportions of positive and negative classes are not specified for either dataset. The classification performance could be unreliable with very little data (training breast cancer metastasis detection with only 216 WSIs) and with highly imbalanced classes.
Table 1 – Are those results obtained on the internal validation set (during training) or on an external held-out test set? Also, are they obtained from a single run or from bootstrapping/cross-validation/different random seeds? Without standard deviations or confidence intervals, this is an incomplete picture of the results.
Section 3 – Evaluations with interventions on spurious concepts – I believe this spurious-concept-removal study is not well structured. It is obvious that re-training the same network on the same data after removing the concepts deemed spurious leads the network not to base its predictions on them (they are no longer there). This approach requires that the network be trained (at least) twice and that an expert pathologist validate the visual concepts again. The true/causal or false/spurious nature of individual concepts is strictly empirical and tied to the data used and the downstream task. I think it would be more appropriate to use an external dataset to assess the generalization of this approach. The authors should comment on this.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
In addition to the points raised above, I would add the following:
Figure 1 – it would be good to add to this figure the same mathematical formalism introduced in the text, such as the dimensions of the latent embeddings (d_in, d_hid, etc.).
Figure 1 – the caption says "with the option to disable spurious concepts via human intervention." Where in the figure is this represented? It is not clear where spurious concepts, such as neuron 267 from panel b), appear in panel c).
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
See above
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their constructive and insightful feedback. We are encouraged by their recognition that our work addresses a critical need in clinical applications, and that the proposed method is well-designed, effectively demonstrated, and makes significant contributions to the field of explainable AI in medicine (R1, R2, R3). We appreciate the suggestions for improvement and will address them in the camera-ready version through the following revisions:
- Training details (R1, R2): To enhance the reproducibility and clarity, we will include additional training details, such as computational requirements, training and inference time, etc., in the Experiments and Results section. As mentioned in the original manuscript, we will also make our code publicly available.
- Related work (R1): We will expand the related work to include a broader discussion of relevant multiple instance learning (MIL) approaches, such as ProtoMixer, a prototype-based method as suggested by R1. Additionally, we will include Concept MIL as an additional baseline in the Experiments and Results section to represent the performance of concept-based approaches that use histopathology vision-language models.
- Discussion and Future Work (R1, R2, R3): We will provide a more in-depth discussion of the limitations of our current approach, along with potential directions for future work. These include improving the monosemanticity of the learned concepts, addressing challenges posed by heterogeneous data and rare pathological findings, and systematically evaluating model interventions in stress-test settings within artificially designed shortcut learning scenarios.
We believe these improvements will further strengthen the quality of our paper.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A