Abstract

Deploying digital pathology models across medical centers is challenging due to distribution shifts. Recent advances in domain generalization improve model transferability in terms of aggregated performance measured by the Area Under Curve (AUC). However, clinical regulations often require control over the transferability of other metrics, such as prescribed sensitivity levels. We introduce a novel approach to control the sensitivity of whole slide image (WSI) classification models, based on optimal transport and Multiple Instance Learning (MIL). Validated across multiple cohorts and tasks, our method enables robust sensitivity control with only a handful of calibration samples, providing a practical solution for reliable deployment of computational pathology systems.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2982_paper.pdf

SharedIt Link: https://rdcu.be/eHdUg

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04978-0_54

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/owkin/tsm

Link to the Dataset(s)

N/A

BibTex

@InProceedings{PigArt_Robust_MICCAI2025,
        author = { Pignet, Arthur AND Klein, John AND Robin, Geneviève AND Olivier, Antoine},
        title = { { Robust sensitivity control in digital pathology via tile score distribution matching } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {565--574}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper introduces a method called Tile-Score Matching (TSM). TSM is designed to control the sensitivity of whole slide image classification models, addressing the challenge of maintaining a consistent operating point, such as a specified sensitivity level, when deploying models across different medical centers or datasets. It uses optimal transport and Multiple Instance Learning, operating at the tile level rather than the whole-slide level.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- TSM explicitly controls sensitivity, a critical metric in clinical settings that is often not guaranteed by methods focusing solely on AUC.
- Unlike prior methods such as UPA that match scores at the WSI level, TSM operates on tile-level prediction scores. This significantly increases the number of samples available for calibration, making the process more robust even with very few WSIs.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The method, particularly the reweighting step using importance sampling, relies on knowing the prevalence of positive samples in the target (calibration) cohort. The paper mentions that analyzing TSM in settings where this prevalence is unknown and must be estimated is a direction for future research.
- The theoretical proof for sensitivity control relies on the assumption that tiles within a WSI are independent and identically distributed (i.i.d.) conditional on the slide label. While the experiments show practical success, this assumption is almost always violated in practical scenarios since tissue samples have correlated spatial structure to them.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper tackles the important and practical challenge of ensuring reliable model performance, specifically sensitivity control, when deploying histopathological models in new clinical environments, which is crucial for regulatory approval and clinical trust.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

The authors detail a calibration method for classification models for Whole Slide Images (WSI) based on the Chowder architecture. Under this architecture, the slide is divided into a series of tiles. The proposed Tile-Score Matching (TSM) method works at the tile-level and uses an optimal transport approach to ensure that the sensitivity for a previously chosen threshold is maintained for a new dataset. The authors include theoretical arguments to support their approach and results are shown for several tasks, including ER/PR/HER2 status prediction for breast cancer and MSI status for colorectal cancer.
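To make the optimal transport step concrete for readers of this review: in one dimension, the optimal transport map between two score distributions (for a convex cost) reduces to quantile matching. The sketch below is a minimal illustration under that assumption and is not the authors' implementation; the function name `fit_quantile_map` and the toy score values are hypothetical.

```python
from bisect import bisect_right


def fit_quantile_map(target_scores, reference_scores):
    """Return a non-decreasing map from target tile scores to reference tile scores.

    Assumption: scores are 1D, so the transport map is F_ref^{-1} o F_target,
    estimated here from empirical quantiles.
    """
    tgt = sorted(target_scores)
    ref = sorted(reference_scores)
    if not tgt or not ref:
        raise ValueError("both score samples must be non-empty")
    n_tgt, n_ref = len(tgt), len(ref)

    def transport(score):
        # Number of target scores <= score, i.e. n_tgt times the empirical CDF.
        rank = bisect_right(tgt, score)
        # Integer ceiling of rank * n_ref / n_tgt gives the matching reference rank.
        ref_rank = (rank * n_ref + n_tgt - 1) // n_tgt
        # Clip so scores outside the observed range map to the extreme reference values.
        return ref[min(max(ref_rank - 1, 0), n_ref - 1)]

    return transport


reference = [0.1, 0.2, 0.3, 0.4, 0.5]  # made-up tile scores from the training cohort
target = [0.3, 0.4, 0.5, 0.6, 0.7]     # made-up tile scores from a new center (shifted)
to_reference = fit_quantile_map(target, reference)
mapped = [to_reference(s) for s in target]
```

In this toy case the shifted scores are mapped back onto the reference values, so a threshold chosen on the reference cohort would see the same score distribution at the new center.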
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The authors demonstrate that the calibration method can maintain the desired sensitivity operating point even when few positive samples are available. This is particularly important for pathology applications where the number of WSIs can be limited.
2. The authors show that the calibration method does not negatively impact the overall diagnostic performance, in terms of the ROC curves and AUC.
3. Results are demonstrated for several tasks, ER/PR/HER2 status prediction for breast cancer and MSI status for colorectal cancer, and performance is compared against two important baseline methods, UPA and PLTS, with favorable results.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The proposed approach seems specific to the Chowder architecture, in particular, breaking WSIs into tiles and only keeping the top-k and bottom-k tiles by score. The second condition is likely much more restrictive than the first. Generalizing the approach to all tile-based (or MIL-based) methods would greatly improve the applicability of the technique.
2. Some of the mathematical formalism reduces the clarity and accessibility of the paper. Some notation, e.g. M_#, may not be known to all readers. It may be possible to describe the approach initially at a high-level before the mathematical description or to move some of the content to an appendix. An algorithmic (or pseudocode) description may increase the likelihood that the method would be utilized by a wider audience. Some background information on optimal transport may also be beneficial.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

Figure 3 appears to be missing part of its caption.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper details a method for controlling the sensitivity operating point of whole slide image classification models based on the Chowder architecture. It shows that the calibration technique does not negatively impact overall diagnostic performance and that the sensitivity operating point can be maintained even when few positive samples are available for calibration.
Reviewer confidence

Not confident (1)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The paper examines the problem of the shift in the distribution of prediction scores (classification) between training and inference. The problem is tackled in the context of WSI binary classification, typically based on multiple instance learning (MIL). The Chowder MIL architecture is considered in the paper. The main contribution is a method for calibrating distributions, acting at instance level (on tiles) rather than at WSI level, to reach pre-specified sensitivity levels. The calibration is realized by matching the score distributions to a reference distribution using optimal transport.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The proposed tile-score distribution matching is novel, well-motivated, and theoretically solid with proofs ensuring:
- transferability of the model’s sensitivity when the calibration set contains only positive labels
- that the tile selection of the Chowder MIL model is invariant through calibration (implying interesting computation properties). Compared to WSI-level approaches, the number of WSI used for calibration is reduced.
2) The proposed method shows good performance on several benchmark datasets and classification tasks; overall, it performs better than two state-of-the-art methods. Moreover, the authors show that transferability of the model’s sensitivity is also good when the calibration set contains both positive and negative labels.

3) The paper is well-written and the method is simple and easy to understand.

4) In medical image classification, tackling sensitivity issues is far more relevant than looking for the best AUC (or even accuracy / f1) scores. The extension of the method to reach pre-specified specificity levels is straightforward by simply considering negative samples instead of positive ones (details are omitted in the paper).
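The invariance property mentioned in item 1 can be illustrated with a small sketch: if the calibration map applied to tile scores is strictly increasing, the ranking of tiles is unchanged, so the same top-k and bottom-k tiles are selected before and after calibration. The code below is an illustration under that assumption only; `calibrate` is a hypothetical stand-in for the transport map, and the scores are made up.

```python
import math


def extreme_tile_indices(scores, k):
    """Indices of the k highest and k lowest tile scores (Chowder-style selection)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-k:][::-1], order[:k]


def calibrate(score):
    # Hypothetical strictly increasing calibration map, for illustration only.
    return 1.0 / (1.0 + math.exp(-(3.0 * score - 1.0)))


tile_scores = [0.12, -0.80, 1.45, 0.33, -0.05, 2.10, -1.30]
top_before, bottom_before = extreme_tile_indices(tile_scores, k=2)
top_after, bottom_after = extreme_tile_indices([calibrate(s) for s in tile_scores], k=2)
```

Because the selected indices do not change, the selection only needs to be computed once; with a map that is merely non-decreasing, ties between scores could break this guarantee.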
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

1) Although the Chowder MIL architecture has been applied to several classification problems (as mentioned in the introduction), more recent MIL methods achieve SOTA results (the conclusion mentions extending the approach to them as a perspective). With such methods, the transferability and invariance properties (lemma and theorem) may no longer hold.

2) The prevalence of positive samples must be known.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

To compare the approach with others, 30 calibration WSI are used, which doesn’t represent much diversity in the appearance of the tiles, as a WSI is often made up of tiles that look more or less the same (at least for cytology, perhaps less so for histology). This lack of diversity could be a problem in some clinical configurations.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The considered calibration problem is challenging. The proposed method is interesting, but limited by the MIL method and the knowledge of prevalence.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We want to thank the reviewers for their positive and valuable feedback, and for suggesting several avenues for improvement.

First, several reviewers noted that the prevalence of the calibration set must be known, which is indeed a limitation. We point out that it can also be estimated from various sources. For instance, when the model is deployed in a hospital, the historical prevalence of the center can be used, as can the indication prevalence from existing epidemiological data or public health statistics. While it is not permitted to significantly amend the paper with new experiments, we observed experimentally that in the setting where labels are not available for the calibration samples, but the prevalence in the validation cohort is known (which corresponds to a setting where the model is deployed in a new center, and one has access to a historical prevalence for this center), the performance of TSM changes only marginally (compared to having access to the labels of the calibration samples). Finally, as TSM operates in a low-data regime, the labelling cost is drastically reduced compared to PLTS. We thank the reviewers for pointing out the crucial role of the prevalence, and will add this discussion to the camera-ready version of the paper.
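As a generic illustration of how a known prevalence can enter a reweighting step, the sketch below computes textbook class-prior importance weights for labelled calibration slides. This is an assumption made for illustration and not necessarily the scheme used in the paper; the function name `prevalence_weights` and the example labels and prevalence are hypothetical.

```python
def prevalence_weights(labels, deployment_prevalence):
    """Per-slide weights so the weighted positive rate equals the given prevalence.

    Assumption: standard class-prior (label shift) importance weighting, with
    binary labels and both classes present in the calibration set.
    """
    n = len(labels)
    n_pos = sum(labels)
    if n_pos == 0 or n_pos == n:
        raise ValueError("need at least one positive and one negative slide")
    if not 0.0 < deployment_prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    calib_prevalence = n_pos / n
    w_pos = deployment_prevalence / calib_prevalence
    w_neg = (1.0 - deployment_prevalence) / (1.0 - calib_prevalence)
    return [w_pos if y == 1 else w_neg for y in labels]


calib_labels = [1, 1, 1, 0]  # made-up calibration set, 75% positive
weights = prevalence_weights(calib_labels, deployment_prevalence=0.25)
weighted_positive_rate = sum(w for w, y in zip(weights, calib_labels) if y == 1) / sum(weights)
```

Here the positives are down-weighted and the negative is up-weighted so that the weighted calibration set reflects the assumed 25% prevalence.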

Second, we acknowledge that the method is tailored to the Chowder architecture, leveraging its specificity of (i) computing 1D tile scores, (ii) selecting top-k and bottom-k scores, and (iii) passing the scores to a subsequent MLP. Intuitively, aligning the tile scores with TSM allows the final MLP to work in a pseudo “in-domain” fashion when the model is deployed in a new center. We now wish to provide a more thorough explanation of the difficulty of extending TSM to other MIL architectures. Attention-Based MIL (AB-MIL) [Ilse2018] is another classical MIL architecture. Given a bag $(h_1, \dots, h_N)$ of embeddings, where $N$ denotes the number of tiles and each embedding $h_i$ is of dimension $d$, AB-MIL computes a vector of 1D attention scores $(a_1, \dots, a_N)$. Then, a slide-level representation $z = \sum_{k=1}^{N} a_k h_k$ is computed, before being passed to a last MLP. While it would be possible to apply TSM to the attention scores $(a_1, \dots, a_N)$, the fact that they are then multiplied by the features $h_k$ to produce the slide-level representation $z$ would not allow the last MLP to work in an “in-domain” fashion. This makes the translation of TSM to an architecture such as AB-MIL far from straightforward. In that sense, it also becomes an “advantage” of Chowder to allow for a coupling with TSM.
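The contrast between the two pooling schemes can be sketched in a few lines. This is a simplified illustration, not either architecture's reference code: the function names are hypothetical, attention weights are assumed to be already normalized, and the embeddings are toy values.

```python
def attention_pool(attention, embeddings):
    """AB-MIL slide representation z = sum_k a_k * h_k (attention assumed normalized)."""
    dim = len(embeddings[0])
    return [sum(a * h[j] for a, h in zip(attention, embeddings)) for j in range(dim)]


def chowder_pool(tile_scores, k):
    """Chowder-style input to the final MLP: the k highest and k lowest tile scores."""
    ranked = sorted(tile_scores, reverse=True)
    return ranked[:k] + ranked[-k:]


attention = [0.5, 0.25, 0.25]
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy domain shift: the new center's embeddings are scaled by 2.
shifted_embeddings = [[2.0 * v for v in h] for h in source_embeddings]

z_source = attention_pool(attention, source_embeddings)
z_shifted = attention_pool(attention, shifted_embeddings)
mlp_input = chowder_pool([0.9, 0.1, 0.5, -0.3], k=1)
```

Even with identical attention scores, `z_shifted` differs from `z_source` because the shift lives in the embeddings, whereas the Chowder-style MLP input consists only of scalar tile scores, which score matching can realign directly.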

We also acknowledge that the hypotheses required for the theoretical proof of the sensitivity control of TSM (Theorem 1) may not hold in real-world settings, especially the i.i.d. assumption on tiles, conditional on the slide label. As pointed out by Reviewer #3, “this assumption is almost always violated in practical scenarios since tissue samples have correlated spatial structure to them”. However, when tissue patches are extracted from the slide, it is possible to control the sparsity of the tiling. When processed at 20x magnification, a typical resection slide can contain up to 10,000 patches. Controlling the sparsity of the tiling, for instance by extracting only a subset of tiles (5% or 10%), can be an effective way to reduce the spatial correlation between the considered tiles. We will add this discussion to the camera-ready version of the paper.
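A minimal sketch of such sparse tiling, assuming uniform random subsampling of tile coordinates, is shown below. The function name `subsample_tiles`, the fixed seed, and the 100 x 100 grid are hypothetical, and other sparsification strategies (for example, regular strides) are equally plausible.

```python
import random


def subsample_tiles(tile_coords, fraction, seed=0):
    """Keep a uniformly random subset of tiles to thin out spatially adjacent ones."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not tile_coords:
        return []
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_keep = max(1, round(fraction * len(tile_coords)))
    return rng.sample(tile_coords, n_keep)


# Toy slide with 10,000 tiles on a 100 x 100 grid; keep 5% of them.
all_tiles = [(row, col) for row in range(100) for col in range(100)]
kept_tiles = subsample_tiles(all_tiles, fraction=0.05)
```

With 5% of 10,000 tiles retained, the 500 kept tiles are on average much farther apart than neighbouring tiles in the full grid, which is the intuition behind the reduced spatial correlation.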

Finally, as noted by Reviewer #4, we agree that the proofs and some of the mathematical formalism would fit better in an appendix. However, appendices were not allowed at this year’s MICCAI, and the 8-page limit was strict. To promote the adoption of the method by a wider audience, we shared our implementation at https://anonymous.4open.science/r/tsm/README.md, and will share it in a de-anonymized fashion for the conference.

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A


Robust sensitivity control in digital pathology via tile score distribution matching

Author(s):