Abstract

Foundation models have demonstrated significant promise in medical image analysis, particularly in pathology. However, their black-box nature makes it challenging for clinicians to understand their decision-making processes. In this paper, we evaluate the explainability of existing pathology foundation models based on visual concepts. Considering the hierarchical structure of pathological anatomy, comprising regions, units, and cells, we introduce a novel Hierarchical Concept-based Explanation (HCE) method to illuminate how concepts at different levels influence the model’s predictions. Specifically, our approach begins by using a specialist-generalist collaborative segmentation model to perform instance segmentation across various levels. We then employ a surrogate model to approximate the target foundation model and compute the Shapley values for each concept. Finally, we visualize these contributions through a comprehensive global ShapMap. We evaluate several state-of-the-art pathology foundation models, including CONCH, UNI, and Virchow, on an adenoma classification task. The findings reveal that the explanations provided by CONCH show better composability, suggesting it draws on a wider contextual understanding, whereas those of UNI demonstrate greater separability, reflecting a reliance on specific regions. Additionally, we explore the consistency of concept explanations across different foundation models.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1641_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTeX

@InProceedings{XuShu_Explain_MICCAI2025,
        author = { Xu, Shuting and Hou, Junlin and Chen, Hao},
        title = { { Explain Any Pathological Concept: Discovering Hierarchical Explanations for Pathology Foundation Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {218 -- 228}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a method called Hierarchical Concept-Based Explanation (HCE) aimed at interpreting pathology foundation models. The core idea is to leverage the hierarchical structure of pathological anatomy (regions, units, cells). The method first performs instance segmentation at these different scales using a combination of specialist segmentation models and a generalist model (SAM2). It then employs a lightweight surrogate model to approximate the target foundation model’s predictions based on the presence or absence of these segmented regions (represented via binary encoding). Shapley values are computed using Monte Carlo simulation to quantify the contribution of each segmented region (defined as a “concept” at a specific scale) to the final classification outcome (an adenoma classification task in their experiments). Finally, these contributions are visualized using a “ShapMap”. The authors evaluate their approach by applying it to three pathology foundation models (UNI, CONCH, Virchow) and analyze aspects like classification accuracy at different scales, concept separability/composability (using deletion AUC), and the similarity of attended regions across models.
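    For readers unfamiliar with the estimator, the following is a minimal
    sketch of permutation-based Monte Carlo Shapley estimation over binary
    concept coalitions (Python). The interface is illustrative: `surrogate`
    stands in for the paper's trained surrogate network, and all names are
    assumptions; the authors report binomial coalition sampling with
    p = 0.5, so their exact sampling scheme may differ from this textbook
    variant.

        import numpy as np

        def mc_shapley(surrogate, n_concepts, target_class, n_perms=1000, seed=0):
            """Estimate per-concept Shapley values by sampling permutations.

            surrogate: callable mapping a {0,1}^n_concepts vector to class
            probabilities (a stand-in for the trained surrogate network).
            """
            rng = np.random.default_rng(seed)
            phi = np.zeros(n_concepts)
            for _ in range(n_perms):
                order = rng.permutation(n_concepts)
                coalition = np.zeros(n_concepts, dtype=int)
                prev = surrogate(coalition)[target_class]  # empty coalition
                for i in order:
                    coalition[i] = 1                       # add concept i
                    cur = surrogate(coalition)[target_class]
                    phi[i] += cur - prev                   # marginal gain
                    prev = cur
            return phi / n_perms

        # Toy usage: a fake surrogate in which concept 0 dominates class 1.
        def toy_surrogate(b):
            p1 = 1.0 / (1.0 + np.exp(-(2.0 * b[0] + 0.5 * b[1])))
            return np.array([1.0 - p1, p1])

        print(mc_shapley(toy_surrogate, n_concepts=3, target_class=1, n_perms=200))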

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Addresses a Critical Problem: The paper tackles the crucial and timely challenge of interpretability for foundation models in computational pathology. Enhancing the transparency and trustworthiness of these powerful models is essential for their safe and effective clinical adoption.
    2. Clinically Relevant Motivation: The motivation to provide explanations aligned with the hierarchical nature of pathological examination (analyzing tissues from broad regions down to cellular details) is well-grounded and clinically relevant. Attempting to connect explanations to the multi-scale analysis performed by pathologists is a valuable direction.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Significant Overclaiming and Misleading Definition of “Concept”: The title “Explain Any Pathological Concept” and the framing throughout the paper constitute a major overstatement. The method does not explain arbitrary or semantically meaningful pathological concepts as understood by clinicians (e.g., nuclear atypia, mitotic activity, glandular architecture distortion, desmoplasia, necrosis). Instead, it defines “concepts” merely as segmentation masks corresponding to physical regions at different pre-defined scales (cells, units, regions). This conflation is a fundamental flaw, as these physical regions are not synonymous with the complex, often subtly defined, visual features and patterns that truly constitute pathological concepts.
    2. Superficial Interpretability (Attribution, Not Explanation): The method primarily provides attribution, indicating which regions of the image influence the model’s prediction, rather than offering a deeper explanation of why. Calculating Shapley values for these regions tells us about the importance of spatial areas, but it fails to reveal the specific visual features or patterns within those regions that the model detected and used for its decision (e.g., was it cell density, nuclear morphology, tissue texture, gland shape irregularity?). This lacks the explanatory depth needed for genuine understanding of the model’s reasoning process. The approach remains largely within the realm of feature attribution techniques applied to segmented regions.
    3. Limited Novelty and Insight of Findings: The core findings offer limited novel insights.
      • The observation that classification accuracy tends to increase when models consider larger context (region > unit > cell) is largely expected in computer vision tasks.
      • The analysis based on deletion AUC (composability/separability) provides descriptive metrics but offers limited mechanistic understanding of how the models integrate information differently. The link drawn to clinical reasoning styles (“key abnormality detector” vs. “pattern reasoning”) feels rather speculative without more direct evidence or validation. (A minimal sketch of the deletion-AUC metric follows this list.)
      • The similarity analysis (Fig. 4) highlights overlap in attended regions but doesn’t progress beyond a surface-level description.
    4. Methodological Concerns:
      • Limited Model Scope: The evaluation is based on only three foundation models. Drawing general conclusions about the interpretability of “pathology foundation models” based on such a small sample is problematic.
      • Unclear Technical Details: The technical pipeline description and Figure 1 are too high-level. Key details regarding the surrogate model architecture, the binary encoding implementation, and the specifics of the Monte Carlo approximation for Shapley values are insufficiently explained, hindering reproducibility and a full grasp of the method’s intricacies.
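    As referenced in point 3 above, deletion AUC progressively removes
    concepts in order of estimated importance and integrates the model's
    confidence along the way. A minimal sketch, assuming per-concept masks
    and a probability-returning model; both the interface and the zero-fill
    deletion are assumptions, as the paper's exact protocol is not
    specified here:

        import numpy as np

        def deletion_auc(model_prob, image, masks, shapley_values):
            """Delete concepts from most to least important and average the
            model's confidence; lower AUC indicates sharper attributions.

            model_prob: callable image -> probability of the predicted class
            masks: (n_concepts, H, W) boolean masks, one per concept
            """
            order = np.argsort(shapley_values)[::-1]  # most important first
            img = image.copy()
            probs = [model_prob(img)]
            for i in order:
                img[masks[i]] = 0                     # zero out concept i
                probs.append(model_prob(img))
            return float(np.mean(probs))              # rectangle-rule AUC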
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is Weak Reject. The decision is primarily driven by the significant disconnect between the paper’s ambitious claims (explaining “any concept”) and the method’s actual capabilities and contributions.

    The major factors leading to this score are:

    1. Fundamental Misrepresentation of “Concept”: The work’s core premise relies on equating physical image regions with pathological concepts, which is an oversimplification and misrepresentation that undermines the claimed contribution to explainability.
    2. Lack of Explanatory Depth: The method delivers region-based attribution rather than a deeper explanation of the model’s reasoning based on relevant pathological features. It answers “where” the model looks, but not “what” it sees or “why” it’s important.
    3. Limited Novelty and Insight: The findings are largely descriptive or expected, offering insufficient new knowledge about foundation model behavior or significant methodological advancement in XAI for pathology.
    4. Scope and Clarity: The limited number of models evaluated and the lack of clarity in the technical description further weaken the paper’s current standing.

    While the problem domain (XAI for pathology foundation models) is highly relevant and the hierarchical motivation is sound, the current execution suffers from fundamental conceptual and methodological limitations. The paper significantly overclaims its contribution and does not provide the level of interpretability suggested. Substantial revisions addressing the definition of concepts, the depth of explanation provided, the scope of the evaluation, and the clarity of the methodology would be required for reconsideration.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Dear AC,

    Thank you. After reviewing the rebuttal, my stance remains negative. The authors’ response does not adequately resolve the core issue: the paper overclaims its ability to explain “pathological concepts” by equating them with pre-defined physical regions. While the method offers improved region-based attribution, it falls short of providing deep, semantic explanations of why those regions are important based on specific pathological features. The interpretability offered is therefore limited.

    Best regards, Weiqin



Review #2

  • Please describe the contribution of the paper

    This paper evaluates the explainability of pathology foundation models by adopting the principles of Concept-Based Explanation. A concept is generally defined as a semantically meaningful part of an image (e.g., the tail of a cat). However, in this study, the concepts are manually selected and consist of three categories: Cells, Units, and Regions. To estimate the importance of each concept, the authors apply the method from [1], leveraging SAM2 to segment the target concepts and training a surrogate model to estimate their Shapley values.

    The approach is applied to three foundation models—CONCH, UNI, and Virchow—using linear probing for the adenoma classification task. The authors assess classification performance, “Separability and Composability,” and the similarity of important features identified by these models. Results show that CONCH outperforms the other foundation models in classification performance. While UNI achieves the highest separability scores, CONCH demonstrates superior composability. The analysis of feature similarity reveals that UNI and CONCH focus on many of the same features, but CONCH identifies additional regions.
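    For context, linear probing trains only a linear classifier on top of
    frozen foundation-model embeddings. A minimal sketch, assuming
    precomputed feature matrices; the variable names and the use of
    scikit-learn are illustrative, not the paper's implementation:

        from sklearn.linear_model import LogisticRegression

        def linear_probe(train_feats, train_labels, test_feats, test_labels):
            """Fit a linear classifier on frozen embeddings and report
            held-out accuracy; the encoder itself is never updated."""
            clf = LogisticRegression(max_iter=1000)
            clf.fit(train_feats, train_labels)
            return clf.score(test_feats, test_labels)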

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Relevance of the Study: The analysis of learned histopathology foundation models is a valuable and timely contribution.

    • Clarity and Readability: The paper is well-written and easy to follow.

    • Sound Experimental Design: The experimental setup is logical and well-structured, supporting reproducibility and reliability.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Limited Novelty: The paper primarily adopts the Explain Any Concept (EAC) methodology with minimal modifications to the architecture. The proposed ShapMap/HCE lacks methodological innovation, as it mainly stems from the manual selection of target concepts (cells, units, regions) and their multi-scale nature.

    • The authors assume that relevant features of foundation models can be fully captured by three predefined concepts—cells, units, and regions—which may be an oversimplification.

    • The assessment of feature importance using Linear Probing on a single classification task may not generalize well. Results could vary significantly across different aggregation methods or datasets.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper provides a relevant analysis of histopathology foundation models using Concept-Based Explanation. The study is well-written, and the experimental setup is sound. However, the novelty appears limited, as the proposed HCE builds on the existing EAC framework with minimal modifications. The manual selection of target concepts and reliance on Linear Probing for a single task may also limit generalizability. Nevertheless, the paper remains valuable, and I am open to increasing my score if the authors can better highlight the methodological novelty.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Although the authors did not convincingly address the highlighted weaknesses in the paper, these are still outweighed by the overall quality and strength of the contribution.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a Hierarchical Concept-based Explanation (HCE) method that analyzes how concepts at different feature levels influence a model’s predictions. The authors explore the interpretability of three pathology foundation models through an adenoma classification task.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem addressed in this paper is well motivated. The manuscript is well written and structured, and the proposed method is clearly explained. The authors conduct thorough evaluations on three state-of-the-art pathology foundation models.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    (1) The proposed framework has quite a bit of similarity to the prior work, Explain Any Concept (EAC). Notably, Figure 1 and the section on “Shapley Value” appear to overlap significantly with Figure 1 and the “Phase Three: Concept-based Explanation” section in the EAC paper. The authors should clarify the novelty and distinct contributions of their work compared to EAC. (2) The paper does not provide sufficient detail regarding the concepts used at each level for surrogate model training. Specifically, how many concepts are there at each level, and what clinical features are these concepts associated with? More details are also needed about the model training process that leads to the results in Table 1. Specifically, the distinctions between the “Cell,” “Unit,” and “Region” methods are unclear. Are the inputs to these methods binary concept vectors? What is the dimensionality of these inputs? Furthermore, are the Fully Connected Layer (FCL) and the surrogate model in Figure 1 trained using the same parameters?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper addresses an important question regarding the interpretability of foundation models and proposes a Hierarchical Concept-based Explanation (HCE) framework for this issue. The paper is well written and structured, and the proposed method is convincing. However, the novelty beyond prior work should be more clearly emphasized, and additional details on the concepts used for training the surrogate model are needed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns regarding the novelty of the paper and the essential methodological details have been addressed in the rebuttal. Overall, I believe this is a solid piece of work and recommend it for acceptance.




Author Feedback

Thanks for the comments.

[R1-W1, R3-W1: Model Innovation] While EAC performs well on natural images, it has limitations in pathology, such as less meaningful segmentation, limited approximation ability, and a single visualization target. We proposed several improvements:

  1. In pathology, accurate diagnosis relies on examination across multiple structural levels, from broad regions to units and individual cells. Thus, we designed a hierarchical segmentation method that generates concepts more interpretable to pathologists than those from EAC.
  2. Given the complexity of pathology foundation models, we replace EAC’s linear surrogate with a non-linear one for better approximation.
  3. Natural images usually contain a single salient target, so the SHAP explanation in EAC highlights only one concept. However, pathological diagnosis involves multiple key targets. To capture this complexity, we developed ShapMap, which conveys richer information (see the sketch below).
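As a rough illustration of how a per-pixel ShapMap could be composed from per-concept Shapley values (the feedback gives no equation for this step, so the summation over overlapping masks below is an assumption):

    import numpy as np

    def build_shapmap(masks, shapley_values, shape):
        """Paint each concept's Shapley value onto its mask to obtain a
        per-pixel contribution map (overlaps resolved by summation)."""
        shapmap = np.zeros(shape, dtype=float)
        for mask, phi in zip(masks, shapley_values):
            shapmap[mask] += phi                # mask: boolean (H, W) array
        return shapmap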

[R1-W2: Predefined Concepts] To explain which areas of a pathology image contribute most to the prediction, we defined three hierarchical levels based on the nature of pathology, which cover all meaningful areas.

[R1-W3: Experiment] We conducted experiments on the Chaoyang dataset, including hierarchical classification, separability/composability, feature similarity, and ShapMap visualization. Together, these form a systematic framework for explaining pathology foundation models. We will extend validation to more pathology datasets in future work.

[R2-W1: Concept Overclaiming] A visual concept refers to a semantically meaningful part of an image that can explain model decisions. “Concept” in our study refers to visual concepts in pathology images (cell, unit, and region) and aims to identify which areas the model focuses on during classification. Such area-based explanations help clinicians assess whether the model’s focus is reasonable and enhance trustworthiness. We will explore more clinically relevant concept explanations as suggested in future work.

[R2-W2: Attribution vs Explanation] Feature attribution explains predictions by indicating how much each input feature of an image contributes to the output. Traditional methods may produce ambiguous explanations, whereas our method provides human-interpretable concepts as explanations, helping clinicians better understand predictions.

[R2-W3: Insight of Findings]

  1. To test interpretability at three levels, it’s essential to first validate the classification performance. Results show that performance improves with larger segmentation as expected, supporting the validity of our segmentation. More importantly, CONCH consistently outperforms other foundation models across all levels.
  2. The differences in composability/separability may be linked to the underlying SSL methods (DINOv2 vs. iBOT). High separability indicates a model detects abnormalities via a few key areas (“key abnormality detector”), while high composability reflects integration across multiple areas (“pattern reasoning”). This analogy is meant to aid interpretation, not to assert scientific equivalence.
  3. The similarity analysis reveals differences in model focus. At broader levels, CONCH captures more features than UNI, contributing to its superior performance. At finer levels, the similarity among the three models is lower, likely due to the high diversity of cell-level concepts.

[R2-W4, R3-W2: Model Scope and Details] The three pathology foundation models we selected are widely used and state-of-the-art, which is sufficient to validate the generality of our method. For model details, each image contains on average 286 cells (e.g., epithelial cells), 4 units (e.g., glands), and 2 regions (e.g., foreground). The foundation model takes 3D images as input, while the surrogate model receives a (1000, X) binary matrix (X = concept count per level). The FC layer is frozen and not shared with the surrogate model, which is a lightweight non-linear network trained with cross-entropy loss. Monte Carlo sampling uses a binomial distribution (p = 0.5). We will add these details in the revision. (A sketch of this surrogate setup follows below.)
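A minimal sketch of the surrogate setup described above: 1000 binomial(p = 0.5) coalition vectors per image and a small non-linear network trained with cross-entropy to mimic the frozen foundation model plus FC probe. The hidden width, optimizer settings, and the `foundation_predict` callable (compose the masked image from kept concepts, run the frozen model and FC layer, return class indices) are all assumptions:

    import torch
    import torch.nn as nn

    def train_surrogate(n_concepts, foundation_predict, n_classes=2,
                        n_samples=1000, epochs=50):
        """Fit a lightweight non-linear surrogate on binary concept masks."""
        B = torch.bernoulli(torch.full((n_samples, n_concepts), 0.5))
        with torch.no_grad():
            targets = foundation_predict(B).long()  # (n_samples,) labels
        surrogate = nn.Sequential(                  # small non-linear net
            nn.Linear(n_concepts, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )
        opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(surrogate(B), targets)   # match frozen pipeline
            loss.backward()
            opt.step()
        return surrogate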




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Paper Summary: This paper investigates how pathology foundation models make classification decisions by introducing a Hierarchical Concept-Based Explanation (HCE) framework. To align with the multi-scale reasoning used by pathologists, the authors segment images into cells, units, and regions using a specialist-generalist segmentation pipeline (expert models plus SAM2), then train a lightweight surrogate model that takes binary concept masks as inputs. Monte Carlo–estimated Shapley values quantify each concept’s contribution, and the results are visualized in a “ShapMap.” The method is evaluated on adenoma classification using three foundation models (UNI, CONCH, Virchow), assessing classification accuracy, concept separability/composability, and cross-model feature similarity.

    Key Strengths: Reviewers consistently praised the paper’s focus on a clinically important interpretability problem, its clear motivation grounded in pathology practice, and the logical experimental design. The hierarchical segmentation approach and thorough evaluation across multiple SOTA foundation models were noted as particularly well executed, and the manuscript was commended for being well written and easy to follow.

    Key Weaknesses: All three reviews highlighted limited methodological novelty beyond the existing EAC framework, with concerns that the formalization of “concepts” as fixed segmentation masks oversimplifies pathological features and conflates attribution with true explanation. The evaluations, while thorough, are restricted to a single task and three models, and critical implementation details, such as the number and clinical interpretation of concepts at each level and the surrogate model’s input encoding, are insufficiently specified.

    Review Summary: There is broad agreement that the hierarchical focus and clarity of the presentation add value and that the study addresses an important gap in XAI for pathology. Disagreements center on the depth of innovation and conceptual framing: some reviewers view HCE as a straightforward extension of prior work lacking new insight, while others find the hierarchical emphasis and multi-scale analysis a meaningful contribution. Concerns were raised about overselling the term “concept,” the explanatory depth of Shapley-based attributions, and the need for more transparent methodological details to support generalization.

    Decision: Invite to rebuttal: to allow the authors to clarify their novel contributions, more rigorously define and justify the concept hierarchy, and provide detailed methodological and implementation information.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    As pointed out by one of the reviewers, this paper seems to overclaim its ability to explain “pathological concepts” by equating them with pre-defined physical regions. While the method offers improved region-based attribution, it falls short of providing deep, semantic explanations of why those regions are important based on specific pathological features. The interpretability offered is therefore limited.


