Abstract
Digital pathology has seen the advent of a wealth of foundation models (FMs), yet to date their performance on cell phenotyping has not been benchmarked in a unified manner. We therefore propose PathoCellBench: A comprehensive benchmark for cell phenotyping on Hematoxylin and Eosin (H&E) stained histopathology images. We provide both PathoCell, a new H&E dataset featuring 14 cell types identified via multiplexed imaging, and ready-to-use fine-tuning and benchmarking code that allows the systematic evaluation of multiple prominent pathology FMs in terms of dense cell phenotype predictions in a range of generalization scenarios. We perform extensive benchmarking of existing FMs, providing insights into their generalization behavior under technical vs. medical domain shifts. Furthermore, while FMs achieve macro F1 scores > 0.70 on previously established benchmarks such as Lizard and PanNuke, on PathoCell we observe scores as low as 0.20. This indicates a much more challenging task not captured by previous benchmarks, establishing PathoCell as a prime asset for future benchmarking of FMs and supervised models alike. Code and data are available on GitHub.
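For context on the macro F1 figures quoted above, here is a minimal sketch of how macro-averaged F1 behaves under class imbalance, using scikit-learn; the random labels and the class count are illustrative only and not taken from the released benchmarking code.

```python
# Minimal sketch: macro F1 over per-cell class predictions (illustrative data).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
num_classes = 14                                      # PathoCell defines 14 cell types
y_true = rng.integers(0, num_classes, size=10_000)    # per-cell ground-truth labels
y_pred = rng.integers(0, num_classes, size=10_000)    # per-cell model predictions

# Macro F1 averages the per-class F1 scores with equal weight, so rare classes
# that are seldom predicted correctly pull the overall score down sharply.
macro_f1 = f1_score(y_true, y_pred, average="macro")
per_class_f1 = f1_score(y_true, y_pred, average=None)
print(f"macro F1: {macro_f1:.3f}")
print("per-class F1:", np.round(per_class_f1, 3))
```

Because macro averaging weights all classes equally, a handful of rare cell types with near-zero F1 can drag the overall score far below what the dominant classes alone would suggest.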
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4441_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/Kainmueller-Lab/Pathology-Foundation-Model-Benchmark
Link to the Dataset(s)
https://github.com/Kainmueller-Lab/Pathology-Foundation-Model-Benchmark
BibTex
@InProceedings{LüsJér_PathoCellBench_MICCAI2025,
author = { Lüscher, Jérôme and Koreuber, Nora and Franzen, Jannik and Reith, Fabian H. and Winklmayr, Claudia and Baumann, Elias and Schürch, Christian M. and Kainmüller, Dagmar and Rumberger, Josef Lorenz},
title = { { PathoCellBench: A Comprehensive Benchmark for Cell Phenotyping } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a new dataset (PhenoCell) of H&E images comprising 109 high-resolution FoVs, with a total of 88 million individual cells and 14 distinct cell types. The authors benchmark seven pathology foundation models on PhenoCell against the HoVer-NeXt baseline. They also use two existing H&E datasets (PanNuke, Lizard) and one synthetic dataset (Arctique) to evaluate the cell phenotyping performance of the foundation models.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper introduces a new dataset (PhenoCell) of H&E images comprising 109 high-resolution FoVs, with a total of 88 million individual cells and 14 distinct cell types (with considerable class imbalance among them), collected from 35 colon carcinoma patients at a single hospital, Universitätsspital Bern.
- The dataset covers the following domain shifts: colon vs. mucinous adenocarcinoma, and tumor staging (Stage 3 vs. Stage 4).
- The authors evaluate using three different splits (Base-split, Tumor-type-split and Tumor-stage-split) to enable detailed benchmarking under domain shifts.
- They benchmark linear probing on a diverse set of seven pathology foundation models (UNI, UNI2, Virchow2, Phikon-v2, Prov-GigaPath, MUSK, TITAN) on PhenoCell, against the end-to-end HoVer-NeXt baseline.
- They also use two existing H&E datasets (PanNuke: nuclei, 5 classes, 19 organs; Lizard: nuclei, 6 classes, 1 organ, 6 different medical centers) and one synthetic dataset (Arctique: subset, 1 organ, 5 classes) to evaluate the cell phenotyping performance of the foundation models.
- I adore the graphics of the paper. The authors clearly have a lot of skill in this domain, which makes the paper easy on the eyes.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Transparency: Neither the dataset nor the code is available for review. Given that the paper is mainly about a dataset, this is a major problem, since the reviewers cannot assess dataset quality. Please see this resource for anonymizing GitHub repositories: https://anonymous.4open.science/
- Dataset quality: My biggest concern is the quality of the dataset. As the authors' results show, all approaches yielded poor results, with F1 scores < 0.3. In my view, this could point towards three main issues: 1) The label quality might be suboptimal. This could be due to the automatic segmentation pipeline (CODEX toolkit segmenter) that the authors used. The authors also do not specify how exactly the data was checked and merely state that an expert (pathologist) was involved. Given the enormous size of the dataset (88 M cells), it is hard to believe that the data was thoroughly checked by the expert. 2) The H&E images may simply not contain the visual cues needed to discriminate the classes. Given that 56 antibody markers were used (for labeling, I assume) but are apparently not made available, one could assume that the information that differentiates, e.g., dendritic cells from tumor cells (as in Fig. 2) is just not present in the H&E image and only available via the antibody staining. In that case, the dataset would not show great promise for machine learning either. 3) It could simply point towards an issue with the training pipeline. In any of these cases, this is a problem with the submission and/or the dataset.
- Methods: ‘These samples were imaged using 56 antibody markers across multiple cycles of staining, imaging, washing, and re-staining.’ – If the slides were stained and washed multiple times, was there not any loss of data? Were the images still registered?
- Methods: If the main contribution of the paper is a dataset, I think there needs to be more extensive descriptions of how it is obtained, especially accounting for the chemical processes and the loss of data, if any.
- Methods: There also needs to be a proper validation of the dataset to make sure that the labels are correct and make sense. If the performance is so low, there is a question about label quality. Rather than benchmarking foundation models, I think it would have been better to benchmark more supervised approaches.
- Methods: The experiments were not described comprehensively (e.g., how many times the experiments were repeated, whether cross-validation was used; standard deviations are not reported).
- Results: While the figures are visually appealing, I would appreciate a table of the results with the actual numbers (macro F1 scores), with the best/worst ones highlighted, rather than statements like ‘In the Tumor-Type-Split, HoVer-NeXt’s performance remains stable, decreasing by only .005 F1, whereas foundation models drop by 0.02 F1 on average’
- Other topics: The name of the dataset is not well chosen, as it is not unique: https://www.phenobench.org/. While the other PhenoBench dataset is from another domain (agriculture), both are computer vision datasets, which invites confusion between the two.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
For a good dataset paper, I would suggest extensively describing the process of acquiring the dataset, including the staining, washing, and re-staining steps. I would also question the integrity of the samples after multiple such cycles. Lastly, I think it is very important to validate the label quality with extensive baseline experiments (some classes have a score of 0.6 whereas others have a score of 0, Fig. 2(b)).
I also highly recommend providing insights into how biases in the annotation process were counteracted. Especially in semi-automated workflows, we can expect automation bias and perhaps even confirmation bias. To this end, the authors should make sure they independently assess label quality.
I still don’t quite understand why foundation models are actually important for the assessment in this work. The PhenoBench dataset does not work well in an end-to-end-trained setup, so why would the authors expect different behavior from a foundation-model-based evaluation? I can’t help but have the feeling that foundation models were added because they are a current topic, with little motivation for why they combine well with the dataset at hand.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While this paper creates a huge new dataset and benchmarks several foundation models and one supervised model, the performance remains low. This paper could have been of much more value if the experiments had convinced the readers of the label quality of the dataset. Instead, the authors try to persuade us that low F1 scores (<0.3) are acceptable because they imply a lot of algorithmic headroom. Even if huge algorithmic advances were made, I have the feeling we would still be well within a clinically meaningless regime of performance scores.
I also have the feeling that the foundation models serve as a side topic of the paper that adds little value to it.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
- The authors publish PhenoCell, the largest publicly available dataset for H&E-based cell phenotyping with fine-grained cell type annotations, encompassing 14 distinct cell types and 88 million individual cells, much larger than all previous datasets.
- The authors benchmark seven pathology foundation models and a strong supervised baseline, on the PhenoCell dataset with four different dataset splits to test generalization performance under domain shifts.
- The authors test and compare linear probing of ViT patch-level features and a UNetR decoder architecture for dense prediction of cell phenotypes (a minimal sketch of this setup follows the list below).
- The authors publicly release their benchmarking pipeline along with the dataset, enabling more comprehensive benchmarking of FMs in pathology.
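For illustration, here is a minimal sketch of the frozen-encoder probing setup mentioned above, assuming a ViT-style backbone that returns a grid of patch tokens; `FrozenBackbone`, the 768-dimensional token size, and all other names are placeholders rather than the released benchmarking code.

```python
# Sketch of dense cell-phenotype prediction from frozen ViT patch tokens (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackbone(nn.Module):
    """Stand-in for a frozen pathology FM: maps an image to (B, N, D) patch tokens."""
    def __init__(self, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False          # linear probing keeps the encoder frozen

    def forward(self, x):
        tokens = self.proj(x)                # (B, D, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D)

class LinearProbe(nn.Module):
    """Per-token linear classifier, upsampled to full resolution for dense prediction."""
    def __init__(self, dim, num_classes, patch=16):
        super().__init__()
        self.head, self.patch = nn.Linear(dim, num_classes), patch

    def forward(self, tokens, image_size):
        b, n, _ = tokens.shape
        h = w = image_size // self.patch
        logits = self.head(tokens).transpose(1, 2).reshape(b, -1, h, w)
        return F.interpolate(logits, size=image_size, mode="bilinear", align_corners=False)

backbone, probe = FrozenBackbone(), LinearProbe(dim=768, num_classes=14)
x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    tokens = backbone(x)                     # frozen features
dense_logits = probe(tokens, image_size=224) # (2, 14, 224, 224) cell-type logits
```

A UNetR-style head would replace the single linear layer with a convolutional decoder that progressively upsamples the token grid, while the frozen-encoder setup stays the same.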
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors publish PhenoCell, the largest publicly available dataset for H&E-based cell phenotyping with fine-grained cell type annotations, encompassing 14 distinct cell types and 88 million individual cells, much larger than all previous datasets.
- The authors benchmark seven pathology foundation models and a strong supervised baseline, on the PhenoCell dataset with four different dataset splits to test generalization performance under domain shifts.
- A SOTA supervised learning model and the existing foundation models show high performance on conventional datasets (Lizard, Pannuke, and Arctique) but significantly lower performance on the dataset proposed by the authors (PhenoCell), suggesting a new challenge in pathology image analysis.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Although the dataset is new, only existing methods are evaluated; methodologically, this is a proposal paper for the dataset only.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
A SOTA supervised learning model and the existing foundation models show high performance on conventional datasets (Lizard, Pannuke, and Arctique) but significantly lower performance on the dataset proposed by the authors (PhenoCell), suggesting a new challenge in pathology image analysis. On the other hand, only existing methods are evaluated, so this is essentially a dataset-proposal paper. It is unclear to me how to weigh this kind of paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
A SOTA supervised learning model and the existing foundation models show high performance on conventional datasets (Lizard, Pannuke, and Arctique) but significantly lower performance on the dataset proposed by the authors (PhenoCell), suggesting a new challenge in pathology image analysis.
Review #3
- Please describe the contribution of the paper
Authors propose a comprehensive benchmark for cell phenotyping on hematoxylin and eosin (H&E) stained histopathology images, focusing on the behavior of existing foundation models (FMs) compared with a strong baseline model, specifically HoVer-NeXt. Within this benchmark, the Authors introduce a new dataset called PhenoCell, which, judging by the models’ performance, appears more challenging than the other datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel Dataset Annotations: Authors propose the segmentation masks for the existing PhenoCell dataset.
- Novel Benchmark: The Authors select diverse datasets and introduce different dataset-split concepts (random, technical, medical), which they evaluate across different FMs using a simple linear projection head and a UNetR head.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Potential Dataset Overlap: The paper acknowledges that while the pre-trained versions did not explicitly use any of the other datasets, there is a possibility that images from the same dataset sources were used. For example, Pannuke contains images from TCGA, which also contributes to Lizard. Similarly, MUSK and Phikon-v2 are trained on TCGA. Although the annotations may differ, there is a potential intersection between images used for pre-training and the dataset used in the study.
- Terminology “Medical centers”: Instead of using “different medical centres,” I suggest that the Authors should use “different dataset sources.” This is because Pannuke, which is part of Lizard, consists of images from four datasets. Using “medical centers” could mislead readers into thinking that the images and annotations originate from the same institution, which is not the case.
- Choice of GlaS for the Lizard Test Set: The reasoning behind selecting GlaS as the test set for Lizard is not explicitly clarified. The Authors should explain whether this decision was based on the dataset’s unique features.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
To avoid confusion and ensure a unique benchmark name, I recommend changing the current name if it makes sense, as “PhenoBench” already exists as a benchmark in the agricultural field.
The following comments are divided by section, figure, or table for clarity:
Fig. 1(b):
- Clarification of Model Values: The figure should specify which value corresponds to the linear decoder and which one corresponds to UNetR, either directly in the figure or in the caption.
- UMAP Explanation: The UMAP visualization is presented in Fig. 1 but is not explained. A brief description should be included to clarify the key insights.
- Datasets:
- Pannuke Evaluation: The paper should clarify why Pannuke was evaluated separately when it is already part of Lizard.
- Subset of Arctique: The Authors should justify why only a subset of the Arctique dataset was used instead of the full dataset. Additionally, they should clarify the criteria used to select this subset.
2.1 PhenoCell Dataset:
- Granular Term Usage: The term granular might not be necessary. If it does not add value, I suggest removing it, as [20] simply states “28 cell types”.
- Clarification of “Granular Cell Phenotypes”: The sentence “we merged overly granular cell phenotypes…” should be reformulated for clarity.
- Pathology Foundation Models:
- Uni2 Citation: A citation should be added when mentioning Uni2.
- Dataset Name for Prov-GigaPath: When stating that Prov-GigaPath was trained on “one of the largest proprietary WSI datasets,” the name of this dataset should be included.
Fig. 2:
- Dataset Split Clarification: It is unclear whether the performances shown in Fig. 2(a) are based on the “base-split” dataset split for each dataset. This should be explicitly stated, especially since Lizard and PhenoCell also use other dataset split techniques.
- Consistency in Terminology: In Fig. 2(b), the terminology should match that used in Fig. 1(a) for consistency. For example, “tumor cells” should be used instead of “tumor,” and “smooth muscle” instead of “muscle,” ensuring clarity and uniformity across figures.
Fig. 3:
- Consistency in Terminology: In Fig. 3(b), the terminology should match that used in Fig. 1(a) for consistency. For example, “tumor cells” should be used instead of “tumor,” and “smooth muscle” instead of “muscle,” ensuring uniformity across figures.
- Experiments and Results:
- Training Parameters Clarification: The manuscript mentions that additional training parameters can be found in the code, but it should specify which parameters were tuned and based on what criteria.
- Average F1 Calculation: Clarification is needed on how the average F1-score is computed. Specifically: Is the highest F1-score selected between the linear/UNetR models for each case before averaging? Does this average include HoVer-NeXt as well?
- Domain Shift Terminology: The manuscript refers to “technical, medical, and biological domain shifts.” If “medical domain shift” is synonymous with “biological domain shift”, one unique term should be used for clarity.
- Incorrect Figure Citations: Fig. 3(a) is incorrectly cited; it shows Lizard, not PhenoCell. PhenoCell results are in Fig. 3(b).
- Fig. 3(c) relates to the medical data splits, not the technical data splits (it covers both medical data splits).
Fig. 4(b):
- Training Strategy Clarification: It should be explicitly stated whether Fig. 4(b) corresponds to the base-split training strategy. If so, specifying this would improve clarity.
For future work, I would recommend:
- Additional Metrics: In addition to the F1-score, other metrics such as DSC, Precision, and Accuracy should be included for a more comprehensive evaluation.
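For concreteness, here is a minimal sketch of how per-class Dice and precision could be derived from a confusion matrix; the toy label arrays are illustrative and the snippet is not taken from the Authors’ evaluation code.

```python
# Per-class Dice and precision from a confusion matrix (toy example, illustrative only).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])    # hypothetical cell-type labels
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])    # hypothetical predictions

cm = confusion_matrix(y_true, y_pred)          # rows: true class, cols: predicted class
tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp

precision = tp / np.maximum(tp + fp, 1e-8)
dice = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-8)   # equals per-class F1 for hard labels
print("precision:", np.round(precision, 3))
print("dice:     ", np.round(dice, 3))
```

Note that for hard label assignments the per-class Dice coefficient coincides with the per-class F1 score, so the added value of reporting both mainly comes from also breaking out precision and recall.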
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a comprehensive benchmark focusing on different datasets and testing it on FMs compared to HoVer-NeXt. The strength of this benchmark lies not only in the testing of datasets with respect to technical and medical splits but also in the introduction of segmentation masks for PhenoCell. I hope that the Authors may consider these comments to facilitate readability and understanding for the readers.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I thank the Authors for addressing the questions. I am also pleased that the Authors appreciated the detailed suggestions and plan to incorporate them into the camera-ready version of the paper. The main concerns I had have been resolved, and I am satisfied with the proposed adjustments. Therefore, I recommend acceptance.
Author Feedback
We thank the reviewers for their feedback and constructive criticism. We are glad they recognized the value of our benchmark dataset, the comprehensive evaluation, and the clarity of our presentation.
Scope of our study (R1,R2) Our goal was to benchmark pathology Foundation Models (FMs) for cell phenotyping, assessing their understanding of tissue composition. We observed that on existing datasets (PanNuke, Lizard), FMs often do not outperform strong supervised baselines like HoVer-NeXt (HNX) (<0.02 F1 score difference), partly because these datasets might approach performance saturation. In addition, some previously published datasets have been part of FM training data (R3). For a more insightful assessment of FM performance, we introduce PhenoCell (PC), a larger and more challenging H&E cell phenotyping benchmark dataset, which is, to our knowledge, not part of any existing FM training data. Based on the PC dataset, our work identifies contexts in which FMs do outperform supervised baselines: on rare cell types (FMs outperform the baseline by up to 0.06 F1 score) and under technical domain gaps (Fig. 3a).
Dataset quality and accessibility (R1) The PC data was originally presented in Schürch et al. (2020), which lays out details about the original CODEX imaging protocol. The CODEX version of the dataset has since been used in many high-impact works, including biomedical (e.g. Gavish et al., 2023) and methods publications (e.g. Brbic et al., 2022, Amitay et al., 2023). For PC, we use the same cell labels as in the CODEX version of the dataset. We obtained the respective H&E data and revised the dataset in collaboration with an expert pathologist. We inspected all FoVs and discarded 31 of them due to H&E image quality or registration issues. We consolidated the 28 cell types uncovered in CODEX into 14 H&E-distinguishable types. PC’s 14 cell types are more granular than what existing benchmarks offer and yield a harder task for FMs and baselines alike. In summary, the cell labels in PC are certainly not free of errors; yet the practical usability of the reference labels for biomedical and methods works has already been shown. We will make the dataset accessible upon publication (as MICCAI regulations prohibit us from adding URLs during rebuttal). The same training pipeline was used for all datasets; thus, pipeline errors are unlikely to affect PC selectively (R1).
Presentation of Results (R1) While full tabular results were omitted due to page limits, the source data for all plots, detailing individual runs and configurations, will be made available in our public repository and we will enhance Fig. 3 with numerical F1 scores for clarity (R1).
Training Details (R3) The Lizard-GlaS split (R3) was chosen following established practices (e.g. HNX) to evaluate domain generalization. Regarding other datasets, PanNuke was evaluated separately as only its small subset of colon images is included in Lizard (R3). Regarding the ARCTIQUE dataset, we clarify misleading wording in the manuscript (R3): our experiments use the entire 1.5K “normal” split of the latest v3 release. Regarding average F1 scores reported for FMs, we consistently used the UNetR decoder results as they yielded higher performance; HNX results were not included in these FM averages (R3).
Naming conflict and minor points (R1,R3) We acknowledge the naming conflict and will rename our benchmark and dataset to “PathoCell” or a similar name (R1,R3). We also thank R3 for the many detailed suggestions regarding terminology, consistency, etc., all of which we will incorporate into the camera-ready version of the paper. We hope to have clarified the scope of our work and to have made a convincing argument that our contribution highlights a previously uncharacterized difference between FMs and other models thanks to the PC dataset. We truly think that the research community would benefit from having this challenging benchmark at hand when developing future models.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
PhenoCell presents a novel, large-scale H&E cell phenotyping dataset and benchmarks seven pathology foundation models against HoVer-NeXt. Reviewers praised its scope but raised concerns about label quality, methodological details, and result presentation. After the rebuttal—and with R2 and R3 upgrading to “accept”—remaining concerns about transparency and quality have been addressed. Therefore, given PhenoCell’s demonstrated challenge to current models and its value as a community resource, I recommend acceptance. Please make sure to update all clarifications and discussions from the rebuttal in the final version of the manuscript.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The majority of reviewers vote for acceptance.