Abstract

With the rapid development of computational pathology, many AI-assisted diagnostic tasks have emerged. Cell nuclei segmentation can delineate various types of cells for downstream analysis, but it relies on predefined categories and lacks flexibility. Moreover, pathology visual question answering supports image-level understanding but lacks region-level detection capability. To address this, we propose a new benchmark called Pathology Visual Grounding (PathVG), which aims to detect regions based on expressions with different attributes. To evaluate PathVG, we create a new dataset named RefPath, which contains 27,610 images with 33,500 language-grounded boxes. Compared to visual grounding in other domains, PathVG presents pathological images at multiple scales and contains expressions involving pathological knowledge. In our experimental study, we found that the biggest challenge lies in the implicit information underlying pathological expressions. Based on this, we propose the Pathology Knowledge-enhanced Network (PKNet) as the baseline model for PathVG. PKNet leverages the knowledge-enhancement capabilities of Large Language Models (LLMs) to convert pathological terms carrying implicit information into explicit visual features, and fuses knowledge features with expression features through the designed Knowledge Fusion Module (KFM). The proposed method achieves state-of-the-art performance on the PathVG benchmark. The source code and dataset are available at https://github.com/ssecv/PathVG.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1180_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ssecv/PathVG

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhoChu_PathVG_MICCAI2025,
        author = { Zhong, Chunlin and Hao, Shuang and Wu, Junhua and Chang, Xiaona and Jiang, Jiwei and Nie, Xiu and Tang, He and Bai, Xiang},
        title = { { PathVG: A New Benchmark and Dataset for Pathology Visual Grounding } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        pages = {456--466}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces PathVG, a new benchmark for pathology visual grounding, along with a newly constructed dataset, RefPath, which comprises over 27,000 multi-scale pathological images and 33,000+ language-grounded region annotations. It also proposes PKNet, which incorporates pathology-related knowledge via an LLM and a knowledge fusion module to enhance grounding performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper introduces a new formulation of visual grounding specifically tailored to computational pathology, aiming to localize pathological regions based on descriptive language expressions.
    2. The authors present a large-scale dataset of 27,610 multi-resolution (20×, 40×) pathology image patches with 33,500 corresponding natural-language region descriptions.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The RefPath dataset is the core contribution, but its construction pipeline is not rigorous by benchmark standards:

    1. Expert involvement must be explicitly defined. The paper refers to “professional pathology experts” validating the expressions and annotations, but does not specify how many experts, what their levels of experience were, or whether inter-rater agreement or quality control procedures were used. This is essential for establishing the dataset’s reliability as a benchmark.
    2. The choice of LLaVA-Med as the base model for knowledge extraction is not sufficiently justified. Given the emergence of pathology-specific LLMs, the authors should explain why a general-purpose medical LLM was preferred over pathology-trained models. Did the authors attempt to use any domain-specific pathology LLMs, or is this a matter of resource convenience? This impacts both the baseline fairness and the generalizability of the method.
    3. In Figure 4, the knowledge branch is shown as relying on GPT-based explanation of pathology terms. However, the technical detail is too vague: How are visual-grounded terms extracted? Is there a pre-defined list of medical terms? Is LLaVA-Med used in zero-shot mode or with task adaptation? These design choices are crucial for understanding how the knowledge representation is built and how reliable it is for grounding tasks. The current explanation lacks sufficient algorithmic clarity.
    4. What are the “designed specific prompts” used to guide GPT-4V in generating pathology expressions? The prompt content is not shown. Since prompt engineering is a crucial part of the dataset design, it should be presented in full.
    5. The magnification 20× and 40× are central to the benchmark, but the biological meaning and practical distinction between “cell arrangement” and “cell structure” is unclear. The authors should either cite clinical sources to justify the distinction or clearly define how these concepts are operationalized in expressions and annotations. If the authors can adequately address the technical and methodological concerns raised, I will increase my score.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty, and the solidity and reliability of the dataset construction pipeline.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The construction of the benchmark is limited due to the lack of detail, but it adds to the ongoing efforts to establish evaluation frameworks in pathology images.



Review #2

  • Please describe the contribution of the paper

    This paper extends visual grounding to the pathology domain by constructing PathVG, the first visual grounding benchmark for pathological images. Supported by GPT-4V and pathologists, the authors build a dataset called RefPath, covering localized regions of pathological slide patches at both 20× and 40× magnifications. The paper further proposes PKNet, a baseline model for pathological visual grounding that integrates knowledge from LLaVA-Med.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This work introduces a new pathology research task. When curating datasets, this paper uses YOLOv10 for region detection and GPT-4V for doctor-referenced image description generation.
    2. The baseline method in this paper enhances text representation via LLMs, transforming pathological expressions into morphological features.
    3. Results show the proposed model achieves the best performance, with ablation studies confirming each module’s contribution.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of clarification on whether the dataset originates from ethically reviewed private sources or publicly available projects such as TCGA or GTEx.
    2. The dataset includes 20× and 40× magnifications but omits other clinically common magnifications such as 10×, which limits its comprehensiveness.
    3. There is no quantitative evaluation standard for comparing the generated bounding boxes with expert pathologists’ annotations.
    4. The Vision-Language Transformer architecture lacks a clear explanation of how visual and textual features are fused or aligned.
    5. The model generates only one bounding box per textual expression, which may not reflect the complexity of real-world diagnostic scenarios where multiple regions might correspond to a single description.
    6. The dataset is restricted to grounding of cell clusters, and the accuracy of medical descriptions generated by GPT-4V is not validated—although pathologist review is mentioned, no quantitative results of this review are provided.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Dataset quality and the dataset's contribution to this field.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The first benchmark dataset for Pathology grounding.



Review #3

  • Please describe the contribution of the paper

    This paper proposes the PathVG benchmark to evaluate models' visual grounding capabilities in pathology images. The authors created a large-scale dataset and proposed a baseline model, PKNet, to verify the effectiveness of the benchmark.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed insights are targeted and relevant.

    A large-scale benchmark dataset has been created, addressing a critical gap in current computational pathology research.

    A baseline model was developed to effectively evaluate the dataset, and the comparative experiments are comprehensive.

    The manuscript is written in a rigorous and well-organized manner.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I only have a few minor concerns:

    The introduction of the dataset is somewhat brief. The current manuscript does not clearly illustrate the types of visual grounding examples included in the dataset. Additionally, the anatomical origins of the images are not well described. I suggest that the authors include a figure in the revised version to illustrate the distribution of RefPath, such as the real-world sources of the data (which can be provided at the camera-ready stage), anatomical site distribution, and representative examples (e.g., various types of cells, glands, and tissue structures). This would help readers better understand the clinical relevance of the dataset.

    The design of the baseline model is somewhat coarse, lacking more detailed ablation studies. For example, ablations comparing the CNN and Transformer modules in the Visual Branch, as well as variations in the number of components in the Knowledge Fusion Module, are missing.

    Although the authors mention that the baseline model is not pretrained on any pathology-specific data, recent foundation models in computational pathology have already shown strong capabilities in feature representation. I am curious about how PKNet would perform if using vision encoders pretrained on histopathology images and text encoders trained on medical semantic descriptions. I recommend that the authors either include some preliminary results or add a discussion section to elaborate on these promising future directions.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The benchmark proposed in this paper deserves attention and exploration by the computational pathology community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their constructive feedback. Below, we address the key concerns regarding dataset rigor, model clarity, and design justification:

Ethical sourcing and expert involvement (R1W1, R2W1, R3W1): RefPath was obtained from our collaborating hospital and approved by its institutional ethics committee. Five pathologists at or above the attending level participated in the annotation process. A standardized annotation protocol was jointly defined, and a cross-checking strategy was used to ensure consistency. We will clarify our quality control procedures in the final version.

Visual grounding scope and magnification settings (R1W2, R1W5): Following Medical Phrase Grounding in X-Ray [1], this work focuses on establishing a foundational one-to-one grounding benchmark. We intentionally excluded 10× to avoid the ambiguity of one-to-many mappings at that resolution. One-to-many grounding and broader magnification ranges are directions for future research, which we will discuss in the final version.

Validation of our dataset (R1W3, R1W6): During the dataset validation phase, pathologists first classified the reviewed data into three classes based on specific criteria: completely correct (32%), partially correct but revisable (27%), and incorrect (41%). Subsequently, incorrect data were discarded, and partially correct data were revised.

The mechanism of Vision-Language Transformers (R1W4): In our architecture, visual and textual features are concatenated into a mixed sequence via joint input, fed into shared transformer layers, and fused through self-attention to enable global cross-modal interaction.
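
For readers unfamiliar with this joint-fusion design, a minimal sketch is given below. It assumes standard token-level features coming out of the visual and text encoders; the class and variable names are illustrative and are not taken from the PKNet code.

import torch
import torch.nn as nn


class JointFusionTransformer(nn.Module):
    """Illustrative joint fusion: concatenate visual and text tokens,
    then let shared self-attention layers mix the two modalities."""

    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Learnable embeddings marking which modality each token belongs to.
        self.modality_embed = nn.Embedding(2, dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim), text_tokens: (B, Nt, dim)
        v = visual_tokens + self.modality_embed.weight[0]
        t = text_tokens + self.modality_embed.weight[1]
        mixed = torch.cat([v, t], dim=1)   # joint input sequence
        fused = self.encoder(mixed)        # shared layers, global self-attention
        return fused                       # a downstream head would regress the box


# Random features standing in for encoder outputs.
fused = JointFusionTransformer()(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
print(fused.shape)  # torch.Size([2, 216, 256])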

Choice of LLaVA-Med over pathology-specific models (R2W2): To the best of our knowledge, among the open-source options available before submission, LLaVA-Med (NeurIPS 2023) and QUILT-LLaVA (CVPR 2024) were the only candidates with pathology-related pretraining. We selected LLaVA-Med due to its larger training corpus, which includes pathological data, and its stronger generalizability.

Clarification on the Knowledge Branch (R2W3): To preserve the diversity and free-form nature of the expressions, we do not employ a predefined medical term list; instead, we use LLaVA-Med to identify and extract visually related terms and to generate explanations in a zero-shot setting.
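
As a rough illustration of such a two-step, zero-shot pipeline (the prompts and the vlm_generate helper below are hypothetical stand-ins for LLaVA-Med inference, not the authors' actual prompts or code):

def vlm_generate(image, prompt: str) -> str:
    # Stub standing in for a medical VLM (e.g., LLaVA-Med) call; replace with real inference.
    return "hyperchromatic nuclei, irregular glandular arrangement"


def build_knowledge_text(image, expression: str) -> str:
    # Step 1: ask the model which terms in the expression are visually grounded.
    extract_prompt = (
        "List the pathology terms in the following expression that refer to "
        f"visually observable features, separated by commas:\n{expression}"
    )
    terms = [t.strip() for t in vlm_generate(image, extract_prompt).split(",") if t.strip()]

    # Step 2: ask the model to explain each term as observable morphology.
    explanations = []
    for term in terms:
        explain_prompt = (
            f"In one sentence, describe what '{term}' looks like in this H&E image "
            "in terms of cell morphology, arrangement, and staining."
        )
        explanations.append(vlm_generate(image, explain_prompt))

    # The concatenated explanations would then be encoded and fused with the
    # expression features (e.g., by a knowledge fusion module).
    return " ".join(explanations)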

Prompt of GPT-4V in dataset construction (R2W4): Inspired by Ferret [2], we designed structured prompts to guide GPT-4V in generating expressions. Each prompt has two parts: first, a Basic Task Definition, instructing the model to output expressions strictly corresponding to the specified region; second, Content Constraints, which limit descriptions to the designated area, focus on features appropriate to the magnification, and require standard medical terminology and formats.
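
An illustrative reconstruction of what such a two-part prompt could look like is sketched below; the wording and the build_gpt4v_prompt helper are ours, since the exact prompt used for RefPath is not shown in the paper.

def build_gpt4v_prompt(magnification, box):
    # box is assumed to be (x1, y1, x2, y2) in patch pixel coordinates.
    task_definition = (
        "You are a pathologist. Write a referring expression that describes ONLY "
        f"the region inside the bounding box {box} of the attached H&E patch."
    )
    if magnification == 40:
        focus = "Focus on cell structure: nuclear morphology, chromatin, and staining."
    else:  # assumed 20x
        focus = "Focus on cell arrangement: cluster shape, spatial distribution, and glands."
    content_constraints = (
        f"{focus} Do not mention anything outside the box. "
        "Use standard medical terminology and keep the expression to one sentence."
    )
    return f"{task_definition}\n{content_constraints}"


print(build_gpt4v_prompt(40, (120, 80, 256, 210)))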

Definitions of cell arrangement and cell structure (R2W5): According to our consulting pathologists, 40× magnification is used to examine fine-grained cellular morphology, such as nuclear features, staining patterns, and structural abnormalities. In contrast, 20× magnification assesses broader tissue context, including cell-cluster arrangement, spatial distribution, and glandular organization. These definitions are also supported by [3], cited in the Introduction. We will include them in the final version.

More analyses of PKNet (R3W2, R3W3): To ensure fairness, we initially did not use pathology-pretrained encoders (e.g., CONCH, UNI). Inspired by your suggestion, the revised version will report results with PKNet's encoders replaced by pathology-pretrained ones, add module ablations, and discuss related future directions.

References: [1] Chen Z, Zhou Y, Tran A, et al. Medical phrase grounding with region-phrase context contrastive alignment. MICCAI 2023. [2] You H, et al. Ferret: Refer and Ground Anything Anywhere at Any Granularity. ICLR 2024. [3] Rasoolijaberi M, et al. Multi-magnification image search in digital pathology. IEEE JBHI, 2022.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Paper Summary: This paper introduces PathVG, a benchmark for visual grounding in computational pathology. To support this, the authors constructed RefPath, a large-scale dataset of over 27,000 pathology image patches at 20× and 40× magnification, each paired with natural-language region descriptions generated via GPT-4V and refined by pathologists. They further propose PKNet, a baseline visual-language model that fuses image features with pathology knowledge by incorporating LLaVA-Med and a knowledge-fusion module, achieving strong grounding performance with ablation studies to assess each component.

    Key Strengths: The work fills a clear gap by defining the problem of visual grounding for pathology images and providing a substantial, multi-scale dataset tailored to this domain. The dataset curation pipeline, leveraging state-of-the-art object detectors and GPT-4V for annotation, demonstrates an innovative use of both vision and language models. PKNet offers a solid baseline by integrating medical-knowledge features via an LLM, and the experimental evaluation is comprehensive, with ablations confirming the contribution of individual modules.

    Key Weaknesses: Several methodological details remain under-specified: the ethical provenance and sourcing of the images, the role and number of expert pathologists and any inter-rater agreement metrics, and the exact mechanism by which visual and textual features are fused in the Vision-Language Transformer. The omission of other clinically relevant magnifications (e.g., 10×) and the restriction to one bounding box per expression may limit real-world applicability. Important design choices, such as the prompt content for GPT-4V, the selection of LLaVA-Med over pathology-specific LLMs, and quantitative validation of generated descriptions, lack sufficient justification or evaluation.

    Review Summary: All reviewers agree on the novelty and importance of establishing a pathology-specific visual grounding benchmark and commend the scale of RefPath and the baseline results of PKNet. They concur that the manuscript is generally well-written and organized. Disagreements center on the rigor of the dataset construction (clarity on expert involvement, QC procedures), the depth of methodological explanations (fusion architecture, prompt engineering), and the extent of ablation studies. One reviewer highlights the need for richer dataset diversity and validation of GPT-4V outputs; another calls for more extensive model ablations, and another suggests clarifying the biological rationale behind the chosen magnifications.

    Decision: In light of these converging praises and calls for clarification, we invite the authors to submit a focused rebuttal. Addressing the provenance and annotation protocols, elaborating on prompt design and fusion mechanisms, and expanding the ablation study will enable the preparation of a well-informed final manuscript.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


