Abstract

Toxicity assessment of candidate compounds is an essential part of safety evaluation in the preclinical stage of drug development. Traditionally, drug safety evaluations depend on manual histopathological examination of tissue sections from animal subjects, often entailing substantial effort in evaluating normal tissues. Moreover, the collection of abnormal samples poses significant challenges due to the rarity and diversity of abnormality types. This makes it impractical to develop a comprehensive training dataset that encompasses all potential anomalies, particularly those that are underrepresented. Consequently, traditional supervised learning methods may face difficulties, leading to a growing interest in unsupervised approaches for anomaly detection. In this study, we present GraphTox, a multi-resolution graph-based anomaly detector designed to assess hepatotoxicity in Rattus norvegicus liver tissues. GraphTox is built upon a novel resolution-aware foundation model pre-trained on 2.7 million liver tissue patches. Additionally, GraphTox employs graph-based feature distillation on normal liver whole slide images (WSIs) to identify hepatotoxicity. Our results demonstrate that GraphTox achieves an 11.1% improvement in area under the receiver operating characteristic curve (AUC) on an independent testing set compared to the best-performing non-graph-based anomaly detection models, and an 8.1% improvement over a graph-based model derived from the resolution-agnostic foundation model UNIv2. These findings highlight that GraphTox effectively leverages the resolution-aware digital pathology foundation model to capture multi-scale tissue characteristics within the local tissue graphs, thereby enhancing anomaly detection across various scales. Code: https://github.com/xxx

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1466_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://linlilamb.github.io/GraphTox-project-page/

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiLin_Unsupervised_MICCAI2025,
        author = { Li, Lin and Shelton, Lillie and Forest, Thomas and Janardhan, Kyathanahalli and Jenkins, Tiffany and Napolitano, Michael J. and Wang, Roujia and Leigh, David and Shah, Tosha and Carlson, Grady Earl and Soans, Rajath and Chen, Antong},
        title = { { Unsupervised Anomaly Detection on Preclinical Liver H&E Whole Slide Images using Graph based Feature Distillation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15971},
        month = {September},
        pages = {670 -- 679}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contributions of this paper are the implementation of a student–teacher model trained on 2.7 million liver tissue patches and the introduction of a graph-based feature distillation approach for anomaly detection in Rattus norvegicus liver tissue Whole Slide Images (WSIs) across multiple resolutions (5×, 10×, and 20×). While conventional feature distillation methods often overlook resolution-specific information, this work emphasizes the importance of multi-resolution analysis in WSI anomaly detection, recognizing that certain anomalies may only be visible at specific magnification levels.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Incorporating resolution-awareness significantly enhances anomaly detection performance on WSIs, while the addition of graph-based feature distillation further strengthens the model by capturing precise relationships between patches within the broader tissue context. The paper also presents qualitative comparisons with baseline methods using heatmaps, which aid in improving the interpretability of the results. Moreover, the comparison between single-resolution and mixed-resolution approaches effectively demonstrates the value of a multi-resolution training scheme.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Foundation models in digital pathology have been explored at scale, such as GigaPath [1], trained on nearly 1.3 billion patches, and Virchow [2], which includes 1.5 million WSIs; however, these approaches are not included in the baseline comparisons. In contrast, the dataset used in this study comprises only 819 subjects with 2.7 million patches, which is relatively limited compared to other foundation model efforts in the WSI domain. Additionally, the paper does not benchmark against DAS-MIL [3], a closely related graph-based multi-scale MIL method that also employs knowledge distillation for WSI analysis. DAS-MIL has demonstrated that combining multi-scale training with graph-based mechanisms can yield substantial performance improvements in WSI tasks (see also [4]). The organization of the paper further hampers clarity, making the proposed method difficult to follow. Crucially, the paper lacks a detailed explanation of the graph-based feature-distillation pipeline, specifically how features are aggregated and utilized within the graph framework.

    [1] Xu, H., Usuyama, N., Bagga, J., Zhang, S., Rao, R., Naumann, T., … & Poon, H. (2024). A whole-slide foundation model for digital pathology from real-world data. Nature, 630(8015), 181-188.

    [2] Vorontsov, E., Bozkurt, A., Casson, A., Shaikovski, G., Zelechowski, M., Liu, S., … & Fuchs, T. J. (2023). Virchow: A million-slide digital pathology foundation model. arXiv preprint arXiv:2309.07778.

    [3] Bontempo, G., Bolelli, F., Porrello, A., Calderara, S., & Ficarra, E. (2023). A graph-based multi-scale approach with knowledge distillation for WSI classification. IEEE Transactions on Medical Imaging, 43(4), 1412-1421.

    [4] Li, B., Li, Y., & Eliceiri, K. W. (2021). Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 14313-14323.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    In the reference list, entry 16 is missing its citation; this appears to be a minor mistake.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main motivation of the paper is compelling, and the experimental results demonstrate notable improvements in evaluation metrics. However, several important baselines are missing, as outlined in the weaknesses, which limits the strength of the comparative analysis. Additionally, the paper would benefit from careful editing to improve clarity and conciseness, along with more detailed explanations of the graph models used in the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The core motivation of the paper is compelling, and the use of multi-resolution inputs is a strong and well-motivated design choice. The experimental results also show notable improvements across key evaluation metrics. However, the authors have not sufficiently addressed the central concerns regarding the omission of important baselines for WSI analysis. While differences between preclinical and clinical datasets may justify training a separate foundation model, it remains essential to assess how this method performs relative to existing clinical foundation models, at least as a point of reference. Likewise, the second concern, regarding the lack of comparison with DAS-MIL [3], a closely related graph-based multi-scale MIL method that also incorporates knowledge distillation for WSI analysis, has not been adequately addressed. Given the architectural and methodological similarities, a controlled experiment or a discussion of expected differences would help readers evaluate tradeoffs and provide valuable context for assessing the proposed approach.



Review #2

  • Please describe the contribution of the paper

    The paper outlines two main contributions for detecting liver pathologies in whole slide images: a resolution-aware pretrained vision transformer and GraphTox, an unsupervised anomaly detection method utilizing patch hierarchy across multiple resolutions. The core idea is to combine the pretrained foundation model with a novel graph-based feature distillation (FD) approach. For a given image, GraphTox purportedly matches student-teacher feature embeddings for each patch and its subpatches at higher resolutions (4 patches at 10x and an additional 16 patches at 20x). Given that the distillation is done only on normal images, the discrepancy between the student and teacher embeddings can be used as the anomaly score. The experimental evaluation shows that GraphTox improves the AUC by 11.1% over the best-performing non-graph model and by 8.1% over a baseline graph-based method, underscoring its efficacy in capturing scale-dependent anomalies in liver tissue samples.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper provides a strong motivation for their work, with the current needs of digital pathology clearly laid out
    • The authors clearly describe the high level details of the transformer model used and their inclusion of resolution embeddings during training
    • The authors report a noticeable improvement in performance compared to baselines
    • Authors test against multiple baselines and training paradigms
    • Figures are helpful in gleaning out certain details
    • If foundation model made open-source via the GitHub repo, it could help the community at large (although this is never made explicit)
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Overall lack of clarity and detail regarding methodology and hyperparameters
    • Only AUC-ROC is reported in evaluation. It is not uncommon to report additional metrics in the anomaly detection field such as Per-Region Overlap or a Dice/Hausdorff distance after automatic thresholding. Additionally, it is unclear whether this is pixel-level or image-level AUC.
    • Results do not have standard deviations (seems to be a range). Is this across multiple runs or within a single test set? It is also unclear which resolution is used for evaluation when in the ‘Combined’ results
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Details around GraphTox in Section 2.2 are lacking. The main contribution seems to be the loss function $L_{FD}$ in Eq. 3, which implicitly uses the graph-hierarchy information by accumulating the losses across neighboring patches across scales. Concretely, for a given patch at 5x resolution, the model accumulates anomaly scores for subpatches at higher resolutions, i.e., in Eq. 3, $n = 1 + 4 + 16 = 21$ for each patch location. This is my current understanding, and I could have easily misinterpreted what was done. The authors would benefit from making these details explicit. As of now, there is a high burden on the reader to parse through the section and assume certain details.
    • Stating that ‘nearest neighbour patches’ were identified is a bit vague. How is this done? Presumably the code will have more details but since this is central in building the graph, a few more qualifying sentences would benefit the reader.
    • Table 1 needs a better caption describing what is being reported. Is this average AUC? Across how many runs? What is the range in parentheses over? What is the output resolution when you combine the resolutions?

    • Given the general lack of a granular discussion of hyperparameters (e.g., learning rates, batch sizes) and training protocols, it is likely difficult to reproduce what was done in this work. Providing these details, or confirming that released code will include full training details, would greatly assist reproducibility.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Minor revisions required. The performance of the model seems promising. However, it is difficult to ascertain the details of what the authors did in GraphTox to construct and use the graph. I believe a revision could easily clarify this for the reader.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes an approach for unsupervised anomaly detection on preclinical liver whole slide images (WSIs), achieving a novel resolution-aware digital pathology foundation model and employing graph-based feature distillation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method takes into account the gigapixel resolution of whole slide images (WSIs) and their hierarchical organization across various magnifications.
    2. Multi-scale tissue information enhances the performance of anomaly detection.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The limited number of comparison methods may not adequately cover the existing state-of-the-art anomaly detection techniques.
    2. The analysis only considers resolutions of 5x, 10x, and 20x. Would incorporating additional resolutions be advantageous?
    3. The pre-training methods used in constructing the base model lack ablation studies, as only DINOv2 has been considered.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering multiple resolutions to provide more comprehensive tissue information for anomaly detection is an innovative concept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ rebuttal addressed my concerns. I believe this paper makes a meaningful contribution to the field of anomaly detection and is suitable for acceptance.




Author Feedback

We appreciate the reviewers’ valuable feedback. Comments are grouped and addressed using tags [R#]. This study explores two hypotheses: (1) A resolution-aware foundation model trained on Rattus norvegicus liver WSIs effectively captures domain-specific features for preclinical toxicity assessment. (2) Multi-resolution graph-based feature distillation enables anomaly detection as a one-class classification problem, where normal samples are abundant but abnormal cases are rare and diverse.

Foundation model comparison and motivation [R3] 1) We appreciate the recommendation to compare more foundation models. UNI and UNIv2 were chosen as strong resolution-agnostic baselines, with UNIv2 consistently outperforming others like GigaPath and Virchow. 2) Training our own preclinical foundation model was motivated by the differences between preclinical and clinical pathology studies. Preclinical data focus on toxicity detection in whole organ sections with primarily normal tissues, whereas clinical data often include biopsies from patients with limited normal tissue variance. Our approach requires learning the full distribution of normal tissue. Therefore, we hypothesize that our preclinical model can effectively serve as a teacher network for capturing normal liver variations in rats. In contrast, DP foundation models like GigaPath, trained on 1.3 billion clinical patches (~3.4 million liver patches), may be inappropriate for this task.

GraphTox graph construction [R1,R3] GraphTox is a multi-resolution graph-based feature distillation approach built on a resolution-aware foundation model. It leverages graph-hierarchy information by aggregating losses across neighboring patches at different scales. For each 5x patch, we identify 4 10x patches (2-by-2) and 16 20x patches (4-by-4) from the same location, connecting each patch node to its corresponding lower resolution node (Fig.2(c)). Experiments showed that mean aggregation of patch-level anomaly scores outperformed the graph transformer with message passing. These details will be clarified in the manuscript for better readability.
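The hierarchy and scoring described above can be sketched as follows. This is a minimal illustration only: the dictionary-of-features data layout, function names, and use of an L2 discrepancy are assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

def subpatch_indices(row, col, factor):
    """Grid indices of the factor-by-factor block of higher-resolution
    subpatches covering the lower-resolution patch at (row, col)."""
    return [(row * factor + dr, col * factor + dc)
            for dr in range(factor) for dc in range(factor)]

def anomaly_score(teacher_feats, student_feats, row, col):
    """Mean teacher-student feature discrepancy over one 5x patch,
    its 4 10x subpatches (2-by-2), and its 16 20x subpatches (4-by-4),
    i.e. n = 1 + 4 + 16 = 21 nodes per 5x patch location."""
    nodes = [("5x", (row, col))]
    nodes += [("10x", rc) for rc in subpatch_indices(row, col, 2)]
    nodes += [("20x", rc) for rc in subpatch_indices(row, col, 4)]
    diffs = [np.linalg.norm(teacher_feats[res][rc] - student_feats[res][rc])
             for res, rc in nodes]
    return float(np.mean(diffs))  # mean aggregation of patch-level scores
```

Mean aggregation here mirrors the rebuttal's observation that averaging patch-level anomaly scores outperformed a graph transformer with message passing.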

Multiple instance learning (MIL) [R3] We appreciate the suggestion; however, MIL is unsuitable for anomaly detection. It assumes positive bags contain at least one positive instance called a “witness”, while negative bags contain only negative instances. Because anomalies are diverse, it is implausible to assume all anomalous bags share a similar witness that can be categorized into a single class (e.g. necrosis vs. vacuolation, Fig. 1). It is also challenging to collect a training set of exhaustively labeled anomalies.

SOTA comparison, model architecture, and resolutions [R5] 1) We included comparisons with a GAN-based model and a set of UNI-based models as strong baselines. We will extend the SOTA comparisons in the final paper. 2) Since DINOv2 is widely used for DP foundation models, we preserved its self-supervised learning design while adapting the ViT for resolution awareness. 3) Our study uses 5x, 10x, and 20x resolutions, standard in digital pathology WSIs. While 40x is included in WSIs, our collaborating pathologists rarely use it for toxicity detection. Our approach can be extended to 40x if needed.

Results clarity [R1] 1) Table 1 presents WSI-level AUC (mean with 95% confidence interval) on an independent test set, estimated via 500 bootstrap iterations. We can switch to standard deviation if preferred. 2) Pixel-wise anomaly annotation is expensive and challenging. Weakly supervised models allow experimentation with larger, more diverse datasets. Since our experiments focus on WSI-level decisions, we use AUC as the evaluation metric instead of region-based metrics like Dice. 3) The ‘combined’ results (Table 1) use max pooling to select the highest anomaly score across three single-res FD models, meaning different resolutions may be chosen per WSI. Hyperparameters and foundation model weights will be released with code.
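The bootstrap procedure from point 1) can be sketched as below. This is a minimal illustration under stated assumptions: the rank-based AUC, percentile confidence interval, and resampling details are generic choices, not the authors' exact code.

```python
import numpy as np

def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic (rank-based; ties not averaged)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Mean WSI-level AUC with a percentile (1 - alpha) CI over n_boot
    bootstrap resamples, as in the 500-iteration estimate in Table 1."""
    rng = np.random.default_rng(seed)
    n, aucs = len(labels), []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)  # resample WSIs with replacement
        if labels[idx].min() == labels[idx].max():
            continue  # AUC undefined without both classes
        aucs.append(auc(labels[idx], scores[idx]))
    aucs = np.array(aucs)
    return aucs.mean(), np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
```

The 'combined' scores fed into such an evaluation would, per point 3), be the per-WSI maximum over the three single-resolution FD models.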

Citation [R3] Ref. 16 is our own publication and was therefore masked for anonymity.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers are mixed, with two recommending acceptance and one rejection. The paper proposes a resolution-aware foundation model and a novel graph-based feature distillation method for anomaly detection in liver WSIs, which are conceptually interesting and address an important MIC problem. Strengths include the multi-resolution design and reported improvements over baseline methods.

    However, key concerns were raised about missing comparisons to relevant prior work, particularly large-scale foundation models (e.g., GigaPath, Virchow) and related multi-scale methods such as DAS-MIL. While the rebuttal clarifies several methodological points, it does not fully address these omissions, which limits the ability to position the contribution relative to the state of the art. I recommend acceptance, while encouraging the authors to strengthen the discussion of related work and provide clearer methodological details in the final version.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    GraphTox introduces a resolution-aware vision transformer and a graph-based feature distillation method to detect anomalies in preclinical rat liver whole-slide images. Concerns about methodological details and missing baselines were partially addressed in the rebuttal, with explanations and justifications for using a preclinical teacher model. Although some issues—such as foundation model and backbone selection—were not fully resolved in the rebuttal, the multi-scale graph design still provides valuable insights into anomaly detection in digital pathology. Therefore, I recommend accepting this paper. The authors are encouraged to update the final version of the manuscript based on the rebuttal and further address R3’s concerns.


