Abstract

Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further identify that these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1975_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1975_supp.zip

Link to the Code Repository

https://github.com/swiss-ai/patho-ssl-data-curation

Link to the Dataset(s)

https://huggingface.co/datasets/swiss-ai/patho-ssl-data-curation

BibTex

@InProceedings{CheBoq_Revisiting_MICCAI2025,
        author = { Chen, Boqi and Vincent-Cuaz, Cédric and Schoenpflug, Lydia A. and Madeira, Manuel and Fournier, Lisa and Subramanian, Vaishnavi and Andani, Sonali and Ruiperez-Campillo, Samuel and Vogt, Julia E. and Luisier, Raphaëlle and Thanou, Dorina and Koelzer, Viktor H. and Frossard, Pascal and Campanella, Gabriele and Rätsch, Gunnar},
        title = { { Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        page = {564 -- 574}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study presents a stratified sampling approach based on hierarchical clustering of patch embeddings from TCGA to automatically curate an optimal dataset for upstream pre-training of pathology encoders. Compared to utilizing all data or sampling based on supervised WSI-level labels, the author’s proposed method achieves higher accuracy on the aggregate across 9 ROI-level datasets and 10 WSI-level datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors address an unmet need in the literature. There are many pre-trained encoders that try to leverage WSI-level information that comes for free in the LIS (like tissue type, procedure) or the expertise of pathologists in order to curate data. In fact, one could argue that many of these pre-trained encoders only vary in their data selection process. Moreover, to my knowledge this issue has not been explicitly called out before, so the authors are ahead of the curve, so to speak.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lack of methodological novelty: The authors borrow their data selection method from “Automatic data curation for self-supervised learning: A clustering-based approach” from the general computer vision literature. This limits the novelty of the method.

    Lack of statistical comparisons: Fig 3 depicts a sensitivity analysis of varying the number of clustering centers at various levels (T1 and T2) along with varying the amount of pre-training data. The analysis is important, as the method should not be too sensitive to how the clusters are initialized.

    Impracticality at scale: The current dataset requires computing 350 million embeddings and the use of distributed k-means on these embeddings. The first level of embeddings (3.5 million centroids) is also quite large. Nearly all recent foundation models in pathology utilize nearly 1 million slides, which would put the number of tiles at 33.5 billion, pushing the limits of tools like FAISS. The current method does not seem like it would scale for those institutions interested in training their own foundation model. This is why (as the authors point out) others use simple features, often at low resolution, to pre-select regions for downstream SSL. Embedding all patches is impractical from the get-go.

    Lack of comparisons: The authors leverage only UNI in the assessment of their automatic data curation method. It would make for stronger results if the overall superior performance of their selection method were to hold with other models (and is not a result of the tested model itself).

    Lack of convincing results: Although the authors’ proposed method does seem to improve downstream performance in aggregate, it seems to vary in its effectiveness depending on the benchmark used. Out of 8 RoI-level benchmarks, the authors’ proposed method outperforms the others half the time. Likewise, at the WSI level it is 3/9. F-BR (i.e., no sampling, use everything) is the best in a few benchmarks. Finally, there are no error metrics estimating the variance in performance, even though reporting them is standard practice in the benchmarks the authors use.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Missing labels: The mid-level and dark blue bars in Fig 3 do not have labels. Also, the BR label doesn’t have a corresponding bar that is the same color. Please fix for clarity.

    Crucial clarification: Y’all should mention that you use k-means++ (as used in your code), as conventional k-means is sensitive to the initialization of centroids, which a reader could understand to have affected your analyses.
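
    For context, here is a minimal sketch of standard k-means++ seeding (generic algorithm, not the authors' implementation), showing why it is less sensitive to initialization than drawing all centres uniformly at random:

    ```python
    # Generic k-means++ seeding: each new centre is drawn with probability
    # proportional to its squared distance from the nearest centre chosen so far,
    # which makes the result far less dependent on the random seed than uniform init.
    import numpy as np

    def kmeanspp_init(X, k, rng):
        centres = [X[rng.integers(len(X))]]               # first centre: uniform at random
        for _ in range(k - 1):
            # squared distance of every point to its nearest already-chosen centre
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
            centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.stack(centres)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5_000, 32))                      # toy stand-in for tile embeddings
    print(kmeanspp_init(X, k=8, rng=rng).shape)           # (8, 32)
    ```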

    Additional information: Can the authors comment on their hardware and the time it took to build the indexes for their clustering? This would put into perspective the practicality of this method.

    Future work: Maybe consider performing clustering top-down rather than bottom-up. Start at the slide level, get your centroids from the patches there (maybe 20 or so), then move down a level to tissues (epithelial, connective, muscle, nervous), as these are architecturally similar within categories. Then organ, then organ system. You would have to stick with what is free information in the LIS, so you could play a bit with which abstractions you use. The authors might also consider trying their method by first clustering to 266 clusters, similar to their supervised sampling approach. That way, they can see whether it is the granularity of the clusters (i.e., 3.5 million) or their quality (i.e., curation) that contributes to downstream performance gains (or losses).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I am recommending a score of 4 because the paper addresses a timely and important problem in computational pathology—automatic data curation for pretraining encoders—and proposes a thoughtful application of clustering-based sampling that outperforms some existing strategies. However, the paper falls short in several critical areas: the method lacks originality, as it is a direct adaptation from the general vision literature; the evaluation is not comprehensive, with missing comparisons to other models and insufficient statistical rigor (e.g., lack of error bars or variance estimates); and the proposed approach is computationally impractical at scale, limiting its utility for institutions aiming to train large foundation models. While the conceptual contribution is valuable and forward-looking, these shortcomings collectively temper the overall enthusiasm, meriting a “weak accept.”

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors apply an unsupervised data curation framework (hierarchical clustering [21]) to improve the training of foundation models (FMs) in digital pathology. The curation begins by clustering 350M pathology tiles into clusters at four hierarchical levels. Higher levels contain fewer clusters representing abstract concepts, while lower levels have more clusters capturing finer concepts. Clustering is performed on image embeddings extracted from an existing FM (UNI) using k-means. Starting from the bottom level, each higher level is formed by clustering the centroids obtained from the level below. A curated subset of size N is then generated using a top-down sampling strategy by balancing the samples extracted from all clusters at the same level.
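
    For readers unfamiliar with [21], the following is a minimal, illustrative sketch of the hierarchy construction and balanced sampling described above; the cluster counts, data sizes, and helper names are invented for illustration and do not reflect the authors' implementation.

    ```python
    # Illustrative sketch: hierarchical k-means over tile embeddings, then balanced
    # sampling from the top level. Cluster counts and data sizes are toy values.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(5_000, 32))          # stand-in for 350M UNI tile embeddings

    def build_hierarchy(X, ks=(256, 64, 16, 4)):
        """Cluster the points, then repeatedly cluster the centroids of the level below.
        Returns, per level, the cluster assignment of every original point."""
        levels, labels, data = [], np.arange(len(X)), X
        for k in ks:
            km = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=0).fit(data)
            labels = km.labels_[labels]                 # chain assignments back to the raw tiles
            levels.append(labels.copy())
            data = km.cluster_centers_                  # the next level clusters these centroids
        return levels                                   # levels[-1] is the top-level assignment

    def balanced_sample(top_labels, n_target, rng):
        """Split the sampling budget uniformly across top-level clusters
        (a full top-down implementation would recurse through the lower levels too)."""
        clusters = np.unique(top_labels)
        chosen = []
        for c in clusters:
            members = np.flatnonzero(top_labels == c)
            take = min(n_target // len(clusters), len(members))   # small clusters may deplete
            chosen.extend(rng.choice(members, size=take, replace=False))
        return np.asarray(chosen)

    levels = build_hierarchy(embeddings)
    subset = balanced_sample(levels[-1], n_target=1_000, rng=rng)
    print(f"curated subset: {len(subset)} tiles")
    ```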

    This curated subset is used to train FMs, and a batch stratification technique is proposed: samples are uniformly distributed among all top-level clusters to ensure diversity and balance in each training batch, without considering lower-level structures.
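
    Similarly, a minimal sketch of the batch stratification just described, assuming each training tile carries a top-level cluster id; the shuffled-queue refill below is a simplification of the tracking of under-sampled clusters, not the authors' code, and all sizes are toy values.

    ```python
    # Illustrative sketch: stratified batch construction where each batch draws an
    # equal number of tiles from every top-level cluster.
    import numpy as np

    def stratified_batches(top_labels, batch_size, n_batches, rng):
        clusters = np.unique(top_labels)
        per_cluster = batch_size // len(clusters)        # e.g. 2048 / 64 = 32 tiles per cluster
        # One shuffled queue of tile indices per cluster, refilled once exhausted so
        # that no tile is revisited before the rest of its cluster has been sampled.
        queues = {c: list(rng.permutation(np.flatnonzero(top_labels == c))) for c in clusters}
        for _ in range(n_batches):
            batch = []
            for c in clusters:
                if len(queues[c]) < per_cluster:
                    queues[c] += list(rng.permutation(np.flatnonzero(top_labels == c)))
                batch.extend(queues[c][:per_cluster])
                queues[c] = queues[c][per_cluster:]
            yield np.asarray(batch)

    rng = np.random.default_rng(0)
    top_labels = rng.integers(0, 64, size=200_000)       # toy: 200k tiles in 64 top-level clusters
    for b in stratified_batches(top_labels, batch_size=2048, n_batches=3, rng=rng):
        print(b.shape)                                    # (2048,) tile indices, 32 per cluster
    ```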

    Evaluations were conducted on 8 ROI- and 9 WSI-level benchmarks across different sites by comparing the performance of FMs trained on the full (uncurated) dataset and curated datasets, both supervised and unsupervised.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Valuable insights into the clustering results.
    2. Extensive evaluations on clustering tree sizes (T1, T2), sampling levels (3 and 4), and curated subset sizes (1%, 10%, 20%).
    3. Benchmarked on 8 RoI-level and 9 WSI-level tasks.
    4. Batch stratification improves performance on curated datasets (both supervised and unsupervised) compared to random sampling.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Method Component 1 (Section 2.1): The data curation method closely follows [21] and is not novel.

    2. Method Component 2 (Section 2.2): The proposed batch stratification seems like a natural approach for achieving a balanced batch distribution, and it is unclear why the top-down sampling strategy applied in data curation is not also used in batch construction. Specifically, in the proposed batch stratification method, batches are constructed to include an equal number of tiles from each top-level cluster (e.g., 2048/64 = 32 samples per cluster in T1, and 2048/2048 = 1 sample per cluster in T2). Within each top-level cluster, tiles are randomly selected without considering lower-level hierarchies or subtree cluster sizes. Although the authors track sampled tiles and prioritize under-sampled clusters to improve intra-cluster diversity, it is unclear why the top-down sampling strategy used in data curation is not reused here (e.g., with the batch size as N), which might reduce engineering effort and improve the distribution among lower-level clusters.

    3. The curated training set does not consistently outperform the uncurated full dataset or supervised curated data. For instance, T1-BR achieved equal or worse overall performance compared to F-BR in ROI- and WSI-level tasks, respectively. Did the authors implement the resampling-clustering strategy proposed in [21], which was shown to be crucial for performance gains?

    4. The computational cost and strategies for large-scale clustering, particularly at the bottom level with 350M tiles, are not discussed. Addressing this would enhance reproducibility.

    5. Why are the results of T2 not presented in Table 1 and Table 2?

    6. Figure 3 appears incomplete, with results only shown for the 10% setting in the BR scenario.

    7. The method is clearly presented and easy to follow; however, the interpretation of results needs additional explanations for general readers (details see additional comments below).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Figure 2(c–d) is hard to interpret without prior knowledge of [21]. Additional explanations would improve clarity. For example, it took me quite long to understand “a significant imbalance for all curated subsets with more than 1% data, since the smallest top-level clusters are depleted during sampling”: curating, for instance, 10% of the data (350M × 10% = 35M images) uniformly across 62 clusters means that 35M / 62 ≈ 560k images should be extracted from each cluster, but the small clusters contain only ~100k tiles (information hidden in Fig 2a), so depleting them leads to a high total variation (TV) from the uniform distribution (a small numeric sketch of this depletion effect is given after this list). However, the claim that “sampling from any level results in a nearly maximal imbalance at the other level” is confusing. As I understand it, the maximal imbalance corresponds to the highest TV. However, sampling at level 4 (with >50% proportion) results in lower TV at level 3 than at level 4, as shown in Figure 2(c).

    2. Some result interpretations are difficult to follow without reading [21]. For example, “Optimizing this trade-off can be achieved with a minimal subset size by allocating samples accordingly to the volume of each bottom-level cluster…” is hard to understand without additional explanation. Additionally, conclusions like “Among the most comparable settings, e.g., T1-BS and S-BS, we observe strong correlations between performance and both tile proportions and their origins” on page 6 lack sufficient supporting details in the figures/tables. As experts, the authors likely understand the implications, but additional elaboration would help general readers.

    3. In the subsection Interpretability of Hierarchical Clusters, including a figure showing ARI variation against different hand-crafted feature types would be nice.

    4. In the subsection Benchmarked Methods, the authors state that 10% of the dataset is curated and used for training. However, Section 2.2 mentions a complete pass over the full dataset (350M tiles) for training. This inconsistency should be clarified.

    5. In Table 2, “Site 1” and “Site 2” likely correspond to breast and lung, respectively.
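
    The following toy computation illustrates the depletion effect mentioned in comment 1; only the ~350M tile total and the 62 top-level clusters come from the paper/review, while the individual cluster sizes are invented for illustration.

    ```python
    # Toy total-variation (TV) computation: split a 10% budget uniformly across 62
    # top-level clusters; small clusters are depleted, so the realised proportions
    # deviate from uniform. The cluster sizes below are invented for illustration.
    import numpy as np

    cluster_sizes = np.array([100_000] * 10 + [6_700_000] * 52)   # ~350M tiles, a few tiny clusters
    budget = int(0.10 * cluster_sizes.sum())                      # curate 10% of all tiles
    per_cluster = budget // len(cluster_sizes)                    # ~560k requested per cluster
    taken = np.minimum(cluster_sizes, per_cluster)                # the ~100k clusters run dry
    # (leftover budget is not re-allocated here, for simplicity)

    realised = taken / taken.sum()
    uniform = np.full(len(cluster_sizes), 1 / len(cluster_sizes))
    print(f"TV from uniform: {0.5 * np.abs(realised - uniform).sum():.3f}")
    ```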

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the dataset curation method largely follows [21] and the proposed batch stratification approach is natural and has limitations, the evaluations and insights provided are nonetheless valuable to the community. The method is clearly presented and easy to follow; however, the interpretation of results needs additional explanation for general readers.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper tackles the critical issue of pre-training data curation for Foundation Models in Digital Pathology. The authors argue that traditional curation methods often rely on WSI-level annotations or heuristics, potentially overlooking tile-level diversity and imbalance within massive datasets. They propose a fully unsupervised, automatic data curation pipeline operating at the tile level on a large dataset. Experimental results demonstrate that the combination of hierarchical curation and stratified batch sampling could improve foundation model performance on diverse downstream DP tasks compared to baseline methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Data scale, diversity, and imbalance are fundamental challenges in training robust FMs, especially in DP. This paper directly confronts the need for better, scalable data curation strategies beyond simple WSI-level filtering.

    2. While clustering for data curation isn’t entirely new in ML, this work presents the first fully automated, unsupervised, tile-level curation framework specifically designed and evaluated for pathology FMs.

    3. The authors perform extensive experiments comparing their proposed method against relevant baselines.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The entire curation process hinges on the quality and structure of the embeddings produced by the initial FM (UNI). The paper doesn’t explore how sensitive the resulting clusters and downstream performance are to the choice of this initial embedding extractor. Would using a different FM lead to significantly different cluster structures or performance outcomes?

    2. The choice of hierarchical clustering parameters and the decision to sample/stratify at Level 4 seem somewhat empirically driven. While the sensitivity analysis shows robustness within the BS strategy, a deeper justification or exploration of how to optimally choose these parameters a priori would strengthen the work.

    3. This method requires extracting features in advance using UNI before sampling, which may, however, incur significant computational cost.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although there are some limitations in terms of computational load and hyperparameters, the motivation behind this method is highly reasonable, and comprehensive experiments ultimately demonstrate its effectiveness.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their constructive and thoughtful feedback. Below we provide detailed responses to the raised concerns.

Sensitivity to the Initial Feature Model (R1, R2, R3): Our batch stratification uses the highest-level clusters, which are designed to capture broad semantic concepts shared across tissue types. These high-level clusters are expected to remain relatively stable across different FMs, as such models tend to learn similar overarching representations. Naturally, as we move to lower levels in the hierarchy, cluster assignments become more sensitive to the specific FM used, as these clusters capture finer details of the representations learned by these models. Nevertheless, the exponential reduction in the number of clusters as we ascend the hierarchical levels introduces a degree of robustness to the variability introduced by different FMs. While a systematic evaluation of multiple embedding extractors is an important future direction, we believe that the current formulation is a self-contained and solid contribution. We highlight this limitation in the manuscript and plan to address it comprehensively in future work.

Choice of clustering level and parameters (R3): Our choice was guided by Vo et al. (2024), whose empirical results showed that Level 4 yielded the best performance among depths {1–4}, and whose theoretical analysis suggested this choice depends primarily on embedding dimensionality, a factor shared in our setup. As their work did not address tree width, we focused our analysis there. Sampling at the top level is also intuitive, as these clusters are more likely to have uniform volume and capture comparable semantic information. That said, our sensitivity analysis shows that stratifying at intermediate levels (e.g., T1, Level 3) can sometimes improve performance, likely due to correlations between pretraining and downstream datasets. As discussed in the conclusion, we also see promise in analyzing embedding similarities between datasets to quantify the absolute gains and to mitigate the lack of patch-level labels for our curation/training strategies, as well as in the extensions mentioned.

Scalability (R1, R2, R3): Our data curation does not use FAISS. The k-means algorithm can be run with GPU-based batching, scaling up to billions of samples. Obtaining T1 (A100 GPUs, k-means++ initialization, and resampling after Level 1) required ~1.2k GPU hours (1k for Level 1). For comparison, FM training took ~1.9k GPU hours. Therefore, we consider the curation computationally affordable within the FM-for-pathology context, even at scale. Nonetheless, exploring more efficient variants remains a promising direction. Since the complexity is O(N*K1), where N and K1 are the number of samples and the number of clusters at Level 1, methods such as hierarchical schemes from low to high resolutions, more informative sampling at Level 1 with a smaller K1, and semi-supervised approaches that split WSIs (e.g., per tissue type) before curation can potentially improve scalability.
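
As an aside, the following is a generic sketch of a memory-bounded (GPU-batchable) Lloyd iteration in PyTorch, illustrating how the O(N*K1) assignment step can be computed chunk by chunk; this is not the authors' pipeline (k-means++ initialization and resampling are omitted), and all sizes are toy values.

```python
# Generic sketch of a batched Lloyd iteration: points are assigned to centroids
# chunk by chunk, so only a (chunk x K1) distance matrix is ever materialised.
import torch

def kmeans_step(X, centroids, chunk=65_536):
    K, D = centroids.shape
    sums = torch.zeros(K, D, device=X.device)
    counts = torch.zeros(K, device=X.device)
    for start in range(0, len(X), chunk):
        xb = X[start:start + chunk]
        assign = torch.cdist(xb, centroids).argmin(dim=1)   # O(chunk * K1) distances
        sums.index_add_(0, assign, xb)
        counts.index_add_(0, assign, torch.ones(len(xb), device=X.device))
    return sums / counts.clamp(min=1).unsqueeze(1)           # updated centroids

# Toy usage on CPU; move X and centroids to CUDA for GPU-based batching.
X = torch.randn(100_000, 64)
centroids = X[torch.randperm(len(X))[:256]].clone()
for _ in range(5):
    centroids = kmeans_step(X, centroids)
```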

Completing Figure 3 (R1, R2): The T-BR method using 10% of curated samples aligns with the default strategy of Vo et al. (2024), making it the most relevant setting for sensitivity analysis. Our BS method consistently outperforms BR across three settings, providing sufficient evidence of its effectiveness. Due to the high computational cost of training FMs, we did not conduct T-BR experiments at varying percentages of curated data.

Minor issues: (R1, R2) The use of k-means++ and resampling will be clarified. (R1) Only two colors will be used in Fig. 3. (R1) We will report the aggregated standard deviations across datasets. (R2) We will further detail our analysis of hierarchical clustering for a broader audience. (R2) “A complete pass” refers to the same number of training iterations as a complete pass over the full dataset, but with 10% of the data. (R2) Site 1 and Site 2 correspond to different clinical centres (MSHS and MSKCC). We will modify these in the revised version.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A
