Abstract

In recent years, the development of foundation models (FMs) for digital pathology has relied heavily on scaling both the pre-training datasets and the model size, yielding large and powerful models. While this scaling has improved performance on diverse downstream tasks, it has also increased computational cost and inference time. In this work, we explore the distillation of a large foundation model into a smaller one, reducing the number of parameters by more than an order of magnitude. Leveraging distillation techniques, our distilled model, H0-mini, achieves performance comparable to large FMs on the public HEST and EVA benchmarks at a significantly reduced inference cost. Additionally, we conduct robustness analyses on the PLISM-WSI dataset and a multi-scanner, multi-staining private breast cancer cohort. We demonstrate that our distilled model reaches excellent robustness to variations in staining and scanning conditions, significantly outperforming other state-of-the-art models. This opens new perspectives for designing lightweight and robust models for digital pathology without compromising on performance. We publicly release H0-mini (https://huggingface.co/bioptimus/H0-mini) along with plismbench (https://github.com/owkin/plism-benchmark), the first robustness benchmark of pathology foundation models based on the PLISM dataset.
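As a quick-start illustration (not taken from the paper itself), the released checkpoint can in principle be loaded through timm's Hugging Face Hub integration; the exact keyword arguments and preprocessing transforms are assumptions here and should be checked against the model card.

import timm
import torch

# Minimal loading sketch; the hub identifier comes from the release above, but
# any extra keyword arguments (e.g. num_classes=0) should be taken from the
# official model card rather than from this example.
model = timm.create_model("hf-hub:bioptimus/H0-mini", pretrained=True)
model.eval()

with torch.inference_mode():
    dummy = torch.rand(1, 3, 224, 224)   # one RGB tile
    features = model(dummy)              # tile-level embedding
print(features.shape)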

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4289_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/owkin/plism-benchmark/tree/main

Link to the Dataset(s)

https://huggingface.co/datasets/owkin/plism-dataset

BibTex

@InProceedings{FilAle_Distilling_MICCAI2025,
        author = { Filiot, Alexandre and Dop, Nicolas and Tchita, Oussama and Riou, Auriane and Dubois, Rémy and Peeters, Thomas and Valter, Daria and Scalbert, Marin and Saillard, Charlie and Robin, Geneviève and Olivier, Antoine},
        title = { { Distilling foundation models for robust and efficient models in digital pathology } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a distilled version of the H-optimus-0 foundation model. The hypothesis is that a distilled model could be vastly more efficient but potentially also more robust than the respective teacher model. This is investigated through experiments on multiple datasets, comparing against different SOTA baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • I see the major strength of this work in the public availability of a distilled model and the experiments showing its usefulness, so other researchers can now test this on other tasks.
    • Having a smaller-scale model for fine-tuning opens up a lot of possibilities, so I consider this of good value to the community.
    • The authors follow up with a nice set of evaluations, both in latent space and in classification.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Metrics: The paper introduces top-10 accuracy for the evaluation of robustness. From what I understand, the authors rank the tiles according to cosine similarity, and a tile is a "match" if it ranks in the top 10 of all tiles. I have not really heard of this metric for evaluation, and it seems to be custom to this paper. I am really not a fan of non-standard metrics, as they might favor the work of the authors over other approaches. The choice of the hyperparameter k=10 is also not explained.
    • Comparisons: Instead, on the PLISM dataset I am completely missing a downstream task (e.g., organ classification) that would be perfectly doable on this dataset and would provide a measure of real accuracy on a downstream task. The point the authors raise is about robustness (which is sensible for a distilled model that potentially has a lower tendency to overfit specific characteristics), and given this I was disappointed that the authors did not provide comparisons on a downstream task on a public dataset.
    • Comparisons: In Table 4, the authors only compare against H-optimus-0, and not against the other SOTA models. In particular Virchow2 has shown strong performance (Table 1), so it’s not clear why the authors did not compare against this model additionally.
    • Novelty: The paper basically uses an established workflow for model distillation, and the methodological novelty is hence limited.
    • Methodology: The authors made several design decisions that were neither explained nor ablated. For instance, they decided to remove the stochastic depth and Koleo regularization. These are commonly used to improve generalization and training stability. Given the potential impact of these components on training stability and performance, further explanation or empirical validation would strengthen the justification of this design choice.
    • Data: The authors do not detail how they selected the 6k tiles from TCGA. Given that the task is about generalization, the composition of the dataset matters, and I think the authors should be more transparent here in order to allow for the research to be reproducible.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Comparisons: While there are many recent publicly available foundation models, the authors only distilled one of them. While this is understandable given the half a year of compute that was spent on this model, I would still have liked to read a rationale on why the H-optimus-0 was chosen.
    • The authors state that SSL-based models are a cornerstone of modern CPath frameworks. While foundation models have gained attention, I think this is a considerable overstatement. Traditional supervised pretraining (e.g., on ImageNet) and domain-specific supervised models (e.g., trained on CAMELYON or TCGA data) are currently widely used, especially in clinical translation settings.
    • The authors state that they use the PLISM dataset, but in fact only use the part of the PLISM dataset that was acquired using WSI scanners (PLISM-wsi).
    • Please fix your running title.
    • Please add a link to your HF page or wherever you distribute the model and code.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While I see the merit for the community, I think the scientific value of the paper could be improved by a more balanced evaluation, which I feel might currently be biased towards the proposed model.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper addresses the challenge of large computational cost and potential lack of robustness associated with pathology foundation models. The authors propose distilling knowledge from a large ViT-Giant foundation model (H-Optimus-0) into a much smaller ViT-Base model (H0-mini). The results show that H0-mini achieves performance comparable to larger state-of-the-art models on downstream tasks while being significantly smaller, and notably demonstrates superior robustness to scanner and staining variations. The authors also release the H0-mini model and a processed robustness benchmark based on PLISM.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The work directly tackles two major hurdles for the clinical adoption of pathology FMs: their large size/computational cost and their robustness to real-world variations.

    2. Despite its significantly reduced size, H0-mini achieves competitive performance on diverse downstream tasks covered by the EVA and HEST benchmarks.

    3. The public release of the distilled H0-mini model and the plismbench dataset/benchmark is a commendable contribution that facilitates reproducibility and future research in efficient and robust pathology AI.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. A central claim of the paper is the improved efficiency (reduced inference cost) of the distilled H0-mini model. While the reduction in parameter count strongly implies this, the paper lacks quantitative evidence in the results tables. It is strongly recommended that the authors add a comparison of inference times (e.g., time per WSI or time per 1k patches for feature extraction) for H0-mini versus the relevant baseline models (especially H-Optimus-0 if possible, and other models in Tables 1-3) in the main results tables.

    2. Although the distillation method builds on prior work, the paper would benefit significantly from a dedicated figure illustrating the overall training/distillation workflow. This figure should ideally depict the teacher (H-Optimus-0) and student (H0-mini) models, the input augmented views, the flow through patch/class tokens, the DINO and iBOT heads/objectives, and potentially the EMA update mechanism.

    Typo: On page 3, under the “Distillation setup” section, there is a typo: “128 Nvidia V100 32Go”. This should likely be corrected to “32GB”.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the overall approach lacks particularly innovative aspects, a lightweight foundation model for pathology indeed holds promising application prospects. Moreover, this paper thoroughly validates through experiments that the distilled lightweight model can still achieve relatively good performance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors use knowledge distillation to distill the knowledge of the H-optimus-0 (H0) model, with 1 billion parameters, into a student model called H0-mini (with 86 million parameters). It turns out that H0-mini achieves competitive performance on the HEST and EVA benchmarks and, more interestingly, better robustness to staining and scanner variations on PLISM and another private dataset. The authors also organize the PLISM dataset and make it easier for the community to use.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strength of this work lies in the experiments and benchmarking on 4 different datasets (3 public, 1 private) across multiple tasks. While the benchmarking results on the HEST and EVA datasets are reasonable for a distilled model one-tenth the size of its competitors, the robustness results on PLISM and the private dataset achieve SOTA, which is impressive. And as the authors say, this is the first work in computational pathology to distill a large foundation model into a smaller model and to show that the distillation can improve the robustness of the model.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While this might be the first work in computational pathology to distill a foundation model (not including GPFM here, as the resulting model there is also a large model), it is not the first in general computer vision to do so. Accordingly, the technical aspect of the distillation relies entirely on prior works (references 5 and 20 in the paper, as cited by the authors). The technical contribution of the work is therefore limited.

    Secondly, the benchmarking results on HEST and EVA show that H0-mini is comparable to the large foundation models. This is expected: some loss in performance is to be anticipated due to the limited size of the resulting model, and the results are not high enough to compete with the top-performing foundation models.

    Thirdly, the PLISM dataset is a publicly available dataset, and the authors mention that they process it and make the processed dataset publicly available. It is unclear what the extent of this processing is and whether it is worthy of being called a contribution. I encourage the authors to clarify this in the paper if space allows.

    Finally, while the robustness results on PLISM and the private dataset look promising, they need to be studied more thoroughly (in future work): given the complexity of the experimental setup and the choice of evaluation metrics (cosine similarity and top-10 retrieval) used to justify the robustness, the description in the paper does not do them enough justice. I don't blame the authors, as I understand the conference limits the paper to 8 pages of content, which is why I highly encourage the authors to follow up on this topic in a future work.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It is interesting to see that even a distilled version of a foundation model required intensive computational resources: 6,000 TCGA WSIs, 100,000 training iterations, 128 NVIDIA V100 GPUs with 32 GB each (please correct the typo in the paper), and over 4,300 GPU hours. While this is much less than what a foundation model needs, it is still significant compute, which perhaps makes this a mini foundation model.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a good paper. Given the extensive experiments and the interesting and promising results on robustness of the foundation models, I believe this should be accepted to the conference.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

First, we want to thank the reviewers for their positive and valuable feedback and for suggesting several avenues for improvement. We start this answer by providing a general clarification on the PLISM evaluation before answering each reviewer individually.

Clarifications on PLISM. As noted by reviewer #1, the choice of a retrieval metric such as the top-k accuracy to evaluate the robustness of a foundation model (FM) is not standard. A more common metric is the cosine similarity. However, it has limitations: an FM producing constant embeddings would reach a cosine similarity of 1 while having no value, as it would provide no information on the discriminative capacity of the model. For this reason, we also introduced the top-k accuracy. We also acknowledge that the choice of k=10 may seem arbitrary. Lower values of k, such as k=1, make the retrieval task much harder. However, we observed the same trend in the results when considering other values, such as k=1 and k=5. For the top-1 and top-5 accuracy, the order of the 5 best FMs remains unchanged, with Ours > H-Optimus-0 > CONCH > Virchow2 > Gigapath. Note that, to guarantee the reproducibility of these results, they can be generated with the anonymized repository that was shared: https://anonymous.4open.science/r/plism-benchmark-E13D/README.md
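For clarity, the top-k retrieval accuracy described above can be sketched as follows. This is a minimal illustration assuming paired embedding matrices in which row i of both conditions corresponds to the same tissue location; it is not necessarily the exact implementation used in plismbench.

import numpy as np

def top_k_accuracy(emb_a, emb_b, k=10):
    """Fraction of tiles from condition A whose true counterpart in
    condition B ranks among its k most cosine-similar tiles."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T                                            # (n, n) cosine similarities
    diag = np.take_along_axis(sims, np.arange(len(a))[:, None], axis=1)
    ranks = (sims >= diag).sum(axis=1)                        # rank of the true match
    return float((ranks <= k).mean())

# Illustrative usage with synthetic features standing in for two
# scanner/staining conditions of the same tissue regions.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(500, 768))
feats_b = feats_a + 0.1 * rng.normal(size=(500, 768))
print(top_k_accuracy(feats_a, feats_b, k=10))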

Answer to R1. We acknowledge that we did not implement a downstream task on the PLISM dataset. While we completely agree that implementing an organ classification task on PLISM would be doable, it is not a standard benchmark to evaluate the downstream performance of pathology foundation models. For this reason, we decided to focus on the public HEST and EVA benchmarks, as well as on the private Breast-Biomarker tasks. Those benchmarks contain several classification tasks, allowing the evaluation of the accuracy of the various FMs.

We dropped stochastic depth during the distillation following the recommendation of the DINOv2 paper [DINOv2]. Regarding the removal of the KoLeo regularization, we noticed that recent papers [Virchow2, Midnight] no longer use it, in favor of KDE regularization. Besides, we distill a teacher model that has itself been trained with KoLeo regularization; our assumption is that this regularization transfers to the student model during the knowledge distillation. This part of the camera-ready paper has been modified accordingly.
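For context, the KoLeo regularizer discussed above (as used in DINOv2, after Sablayrolles et al.) encourages a uniform spread of the L2-normalized features $x_1, \dots, x_n$ within a batch:

\mathcal{L}_{\mathrm{KoLeo}} = -\frac{1}{n}\sum_{i=1}^{n} \log d_{n,i}, \qquad d_{n,i} = \min_{j \neq i} \lVert x_i - x_j \rVert .

Dropping this term removes an explicit uniformity pressure on the student; the assumption stated above is that this property is instead inherited from the teacher through distillation.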

To allow other researchers to use the same dataset, we will release the coordinates of the tiles within each slide and make the link to a shared Google Drive available in the de-anonymized version of the paper.

Answer to R2. We have updated a table with the inference times in mixed precision for the various architecture sizes of the FMs considered in this study. We estimate the time to extract 10,000 features from a TCGA WSI, averaged over 10 repeats. The hardware configuration is an NVIDIA T4 GPU with 16 GB of VRAM, 32 CPUs and 128 GB of RAM; the torch version is 2.5.1+cu124. We share here the inference times for a subset of the FMs in this study: Ours, 33.7 s (std. 0.5); UNI, 69.7 s (0.1); UNI2-h, 193.3 s (0.2); Virchow2, 205.1 s (0.2); Gigapath, 236.2 s (0.1); H-Optimus-0, 341.0 s (0.3). There is a 10x factor between the distilled model and the original teacher H-Optimus-0, a key difference to promote its adoption in low-compute settings.
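As a rough sketch of how such timings could be reproduced (batch size, input resolution and the use of synthetic tiles are assumptions here, not the exact protocol of the rebuttal):

import time
import torch

@torch.inference_mode()
def time_extraction(model, n_tiles=10_000, batch_size=32, img_size=224):
    """Wall-clock time to embed n_tiles synthetic tiles in mixed precision."""
    device = torch.device("cuda")
    model = model.eval().to(device)
    start = time.perf_counter()
    for i in range(0, n_tiles, batch_size):
        batch = torch.rand(min(batch_size, n_tiles - i), 3, img_size, img_size,
                           device=device)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            _ = model(batch)                 # (B, embedding_dim) features
    torch.cuda.synchronize()
    return time.perf_counter() - start

# Hypothetical usage: average over 10 repeats, as reported above.
# times = [time_extraction(my_feature_extractor) for _ in range(10)]
# print(sum(times) / len(times))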

Answer to R3. We clarified the description of the pre-processing of the PLISM dataset. The main processing steps were:

  • a registration step, using the Elastix software,
  • a matter detection step using a U-Net model,
  • a tiling step to extract tissue patches at precise coordinates (a minimal tiling sketch under stated assumptions is given below).

By making this dataset available at XXX, we hope to promote its usage, jointly with https://anonymous.4open.science/r/plism-benchmark-E13D/README.md, to assess the robustness of digital pathology FMs.
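A minimal sketch of the tiling step, assuming level-0 (x, y) coordinates stored in a CSV file and a fixed tile size (both are illustrative choices, not the exact released format):

import csv
import openslide

def extract_tiles(wsi_path, coords_csv, tile_size=224, level=0):
    """Read fixed-size tiles at pre-computed level-0 coordinates from a WSI."""
    slide = openslide.OpenSlide(wsi_path)
    tiles = []
    with open(coords_csv) as f:
        for row in csv.DictReader(f):             # expects "x" and "y" columns
            x, y = int(row["x"]), int(row["y"])
            region = slide.read_region((x, y), level, (tile_size, tile_size))
            tiles.append(region.convert("RGB"))   # drop the alpha channel
    slide.close()
    return tiles

# Hypothetical usage on one registered slide of the cohort:
# tiles = extract_tiles("slide_001.svs", "slide_001_coords.csv")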

We also agree that the computational cost remains significant, and the name "mini foundation model" is appropriate!




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


