Abstract

Segmenting hepatocellular carcinoma (HCC) and vessels encapsulating tumor clusters (VETC) is a new paradigm for prognostic analysis. However, the clustered morphology of VETC nuclei, which is difficult to represent at the patch level, makes segmentation highly challenging. Recent visual prompt-based methods incorporating nucleus prior knowledge have shown promise but assume patch pixels lack spatial correlation, failing to capture nuclear morphology at the pixel level. To address this, we propose a Patch-to-Pixel Visual Prompt (VPP2P) framework, which models VETC morphological features by propagating visual prompts from patches to pixels. Built on contrastive learning, our semi-supervised approach samples positive and negative pairs within patches to enhance feature learning. Experiments show that VPP2P achieves performance comparable to fully supervised methods using only 10% of the training data. With 30% of the training data, VPP2P attains a Dice score of 90.52%, outperforming state-of-the-art visual prompt-based methods by an average margin of 6.6%. To the best of our knowledge, this is the first semi-supervised deep learning approach for VETC morphological analysis, offering new insights into HCC clinical research. Code is available at https://github.com/sm8754/VPP2P.
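
The within-patch positive/negative pair sampling mentioned in the abstract can be pictured with a short InfoNCE-style sketch in PyTorch. Everything below (the function name, embedding size, and sampling scheme) is an illustrative assumption, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feats, labels, temperature=0.1, n_anchors=256):
    """feats: (N, D) per-pixel embeddings within one patch (N = H*W);
    labels: (N,) per-pixel class from ground truth or teacher pseudo-labels."""
    feats = F.normalize(feats, dim=1)
    idx = torch.randint(0, feats.size(0), (n_anchors,))   # randomly sample anchor pixels
    sim = feats[idx] @ feats.t() / temperature            # (n_anchors, N) cosine similarities
    pos = labels[idx].unsqueeze(1).eq(labels.unsqueeze(0)).float()
    pos[torch.arange(n_anchors), idx] = 0.0               # exclude anchor-to-self pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)   # InfoNCE normalization over all pixels
    loss = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1.0)
    return loss.mean()

# Toy usage: a 32x32 patch with 128-dim pixel embeddings and binary labels
feats = torch.randn(32 * 32, 128)
labels = torch.randint(0, 2, (32 * 32,))
print(pixel_contrastive_loss(feats, labels))
```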

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0208_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/sm8754/VPP2P

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YuJia_Segmenting_MICCAI2025,
        author = { Yu, Jiahui and Ma, Tianyu and Gu, Shenjian and Guo, Yuping and Chen, Feng and Li, Xiaoxiao and Xu, Yingke},
        title = { { Segmenting Vessels Encapsulating Tumor Clusters via Fine-Grained Visual Prompt } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {496--506}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a novel patch-to-pixel visual prompting framework (VPP2P) specifically designed for segmenting vessels encapsulating tumor clusters (VETC). VPP2P demonstrates remarkable efficiency, achieving performance comparable to fully supervised methods while using significantly less labeled data. The idea of propagating visual prompts from the patch level to the pixel level is interesting, and the experimental validation shows that VPP2P achieves superior results.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. VPP2P, built upon a semi-supervised learning framework, achieves performance comparable to fully supervised methods while using significantly less labeled data. This makes it highly beneficial for clinical applications where labeled data are scarce or expensive.
    2. The idea of propagating information from patch to pixel via visual prompts is interesting.
    3. The motivation is clear and the paper is overall well-organised.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The title “Preliminaries” in Section 2.1 may not clearly represent the content. Since the entire section is dedicated to explaining the generation of visual prompts, consider using a more precise title, such as “Visual Prompt Generation,” to enhance clarity and accurately reflect the section’s content.
    2. The description of your “Patch-to-Pixel” mechanism can be improved for better clarity. Specifically, the current explanation is somewhat confusing because it states visual prompts propagate information from patch-level to pixel-level. However, according to your description, these visual prompts originate from nuclei segmentation, which already provides fine-grained, pixel-level information.
    3. It’s unclear what exactly is meant by “source nuclei” and “target nuclei” in section 2.1.
    4. What are the differences between the weak and strong augmentations? You have only mentioned one set of augmentations in the implementation details.
    5. Some of the notations used in the manuscript are unclear and potentially confusing. For example, in Section 2.2, you first introduce f_{h,w} to represent the feature map. Could you clarify what exactly the subscripts h,w indicate here? Do they represent spatial dimensions, feature lengths, or the number of features? Additionally, what does the subscript p represent? Does it denote ‘patch’? Lastly, it would be helpful if you clearly specify the dimensions of these feature maps to avoid ambiguity.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I appreciate the innovative idea of using visual prompts as a medium to transfer information from patch-level to pixel-level representations. However, the current manuscript does not clearly describe the mechanism behind this propagation process. Additionally, although the motivation behind adopting a semi-supervised learning approach is clearly stated, the overall framework appears limited in novelty, as it largely resembles existing teacher-student frameworks. The experimental results do effectively validate the proposed method. If the authors could provide a clearer and more detailed explanation of how the visual prompts propagate information at different levels, I would consider raising my evaluation score.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I thank the authors for their efforts in addressing my concerns. However, my concern primarily lies with the lack of clarity in Section 2.1 regarding the initialization and integration of the visual prompts. In the original submission, the prompts appear to be derived from spatial attributes computed over tissue segmentations, whereas the rebuttal states that the visual prompts are initialized from feature maps extracted from image patches. I find it difficult to reconcile these differing descriptions, and the confusion remains unresolved. Furthermore, the responses to my concerns (3, 4, and 5) were too vague, and the promise of a future code release does not sufficiently improve the clarity or rigor of the current manuscript.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a semi-supervised, pixel-level visual prompting framework named VPP2P for VETC segmentation, overcoming the spatial limitations of previous visual prompt techniques and achieving competitive results with minimal labeled data in various datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper introduces a patch-to-pixel visual prompting approach to model the fine-grained morphological features of VETC in hepatocellular carcinoma, addressing limitations of prior methods that focused only on patch-level representations.
    2. It adopts a semi-supervised contrastive learning framework, effectively reducing the reliance on extensive manual annotations.
    3. The proposed method achieves significant performance improvements over existing state-of-the-art approaches.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited novelty in the contrastive learning framework: The paper claims to propose a novel semi-supervised architecture; however, its contrastive learning strategy closely resembles existing methods like CDCL [21] and MiDSS [23], which also implement dense multi-level contrastive supervision. The paper does not sufficiently distinguish its approach from these prior works in terms of conceptual or methodological innovation.
    2. Insufficient explanation of pixel-level feature extraction: In Section 2.2, the paper mentions extracting a joint feature map from the image and visual prompts. However, it does not clarify how pixel-level features are derived. If features are obtained from DenseUNet and ViT-L, both of which operate on patches or tokens, it is unclear how these are reconciled at the true pixel level. The description lacks the technical detail needed to assess whether the features are genuinely pixel-aligned or still patch-based.
    3. Lack of justification and ablation on network choices: The proposed framework adopts DenseUNet for feature extraction, HoVer-Net for nucleus segmentation, and ViT-L as the prompt encoder-decoder. However, there is no justification for these choices or accompanying ablation studies. For instance, it remains unclear why DenseUNet was selected over lighter alternatives like a standard UNet or transformer-based backbones, and why HoVer-Net was used for nucleus extraction without evaluating alternatives or robustness to segmentation quality.
    4. No analysis of computational cost: The architecture employs both ViT-L for prompt generation and DenseUNet for feature extraction, two computationally heavy models. However, the paper does not provide any analysis of runtime, GPU memory usage, or inference speed. This raises concerns about the model's practicality in real-world clinical workflows, especially in resource-constrained settings.
    5. Lack of external dataset validation: The dataset appears to be private, as its source is not clearly disclosed (“**” is used in place of institutional names), which raises concerns about generalizability. Morphological and staining variations across labs can significantly affect segmentation performance. The method's applicability to broader clinical scenarios is uncertain without validation on public datasets such as TCGA-LIHC or the LiTS challenge.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the weakness part for details.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After considering the rebuttal, I have revised my recommendation to weak acceptance, as the authors have largely addressed my concerns regarding the computational cost and the novelty of the proposed contrastive learning framework.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a patch-to-pixel visual prompt-based segmentation framework that needs less training data to achieve high performance. It addresses the annotation-scarcity problem.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The paper presents comprehensive experiments comparing the proposed method against state-of-the-art techniques across different data regimes, and the ablation study clearly shows why pixel-level visual prompts are better.

    2. By incorporating nucleus spatial distributions into the visual prompt generation process, the paper effectively fuses domain-specific knowledge with advanced deep learning techniques.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper employs a sophisticated hierarchical contrastive learning approach at both the patch and pixel levels. The paper should add experiments clarifying why such complexity is necessary or advantageous over simpler baseline methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Using contrastive learning at both the patch and pixel levels is good. The method creates positive and negative pairs based on how close parts of the image are and on the tissue structure. This step-by-step approach makes the training more detailed and improves segmentation accuracy. The paper also uses the pattern of nucleus locations in generating visual prompts, which combines medical knowledge with modern deep learning techniques. I recommend that this paper be accepted.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Accept, though not with full confidence.




Author Feedback

We sincerely thank the reviewers for their constructive feedback. We appreciate the recognition of our innovative and interesting approach (R1, R3), effective performance (R1, R2, R3), clear presentation (R1, R2, R3), and the reproducibility of our manuscript (R1, R3). We address each question as follows:

Q1: Patch-to-pixel propagation mechanism (R1-2, R2-2)
A1: We clarify that the visual prompt is a semantic supervisory signal that encodes the spatial distribution of VETC in a pixel-level representation. It is derived from nuclear segmentation and differs from the “pixel-level information” mentioned in the review. Due to dimensional constraints, it does not represent morphological semantics directly. Instead, we embed this task-specific signal into each real pixel of the feature map, so that each pixel is associated with a visual prompt embedding. Specifically, we extract a spatially aligned patch-level morphological feature map (128, 32, 32) via DenseUNet, representing each of the 1024 pixels as a 128-dim vector. Concurrently, a ViT decodes the nuclear spatial distribution into a 10-dim visual prompt per pixel. By flattening the 32 × 32 feature map, we concatenate each pixel's 10-dim prompt with its 128-dim morphology vector, yielding a 138-dim representation, and fuse prompts via soft pixel alignment across all 1024 positions, thereby enabling true pixel-wise supervision (see the sketch after this feedback). Unlike position-sensitive token assignments, our “prompt” is task-specific yet class-agnostic. Finally, L_{pixel} performs pixel-level contrastive learning across all pixel positions.

Q2: Novelty relative to teacher-student frameworks (R1, R2-1)
A2: We acknowledge that many current works adopt feature-sharing teacher-student frameworks (TSF). Fundamentally different from these, VPP2P addresses the spatial independence assumption inherent in such frameworks (see the introduction). These methods fail to capture task-specific spatial nuclear patterns, which are only discernible at the pixel level (R1). In contrast to [21] and [23], our method focuses on incorporating dynamic, domain-specific visual features as prompts and leveraging them at the pixel level. To our knowledge, few studies have explored visual prompt-based pipelines for pixel-level semi-supervised segmentation. VPP2P is specifically designed for VETC and achieves superior scores (Table 1). In comparison, previous works rely on subjective, task-agnostic supervision.

Q3: Detailed descriptions (R1-1, R1-3, R1-4, R1-5)
A3: We appreciate this key feedback. We will release our code within one month of publication to ensure full reproducibility.

Q4: Network choices (R2-3)
A4: The effectiveness of the visual prompt hinges on high-dimensional representations that retain fine-grained local features. The dense connectivity of DenseUNet, where each layer receives inputs from all preceding layers, enables texture and margin features to propagate to higher layers. Our experiments show a 2.85% improvement over UNet (87.67%). HoVer-Net is a widely used nuclear extraction tool, and we followed its pipeline.

Q5: Baselines (R3)
A5: We agree with this constructive feedback. We will further investigate performance under ablated settings, specifically without pixel-level prompts and without positive-negative sampling.

Q6: Computational cost (R2-4)
A6: We acknowledge that deployment cost is a consideration. While training involves two backbone branches and contrastive learning, it typically completes within one GPU-day. During inference, VPP2P achieves an average speed of 2.3e-2 seconds per patch, which is entirely feasible in real clinical settings.

Q7: External dataset validation (R2-5)
A7: We trained VPP2P on a clinical dataset (“*” for blind review). Table 1 shows that VPP2P outperforms the SOTA method by 3.04%, surpassing other prompt-based methods significantly. Tests will be carried out on TCGA in the future. We are also committed to releasing our dataset after publication to support future research on VETC.
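
The following is a minimal PyTorch sketch of the per-pixel fusion described in A1, using the dimensions stated there (a 128×32×32 morphological feature map and a 10-dim prompt per pixel). The variable names, and the attention-style reading of “soft pixel alignment,” are illustrative assumptions rather than the authors' released implementation:

```python
import torch

feat = torch.randn(1, 128, 32, 32)    # patch-level morphological feature map (e.g., from DenseUNet)
prompts = torch.randn(1, 10, 32, 32)  # 10-dim visual prompt decoded for every pixel (e.g., from a ViT)

# Flatten the 32x32 grid so each of the 1024 pixels is a feature vector
feat_px = feat.flatten(2).transpose(1, 2)        # (1, 1024, 128)
prompt_px = prompts.flatten(2).transpose(1, 2)   # (1, 1024, 10)

# Concatenate each pixel's prompt with its morphology vector -> 138-dim representation
fused = torch.cat([feat_px, prompt_px], dim=-1)  # (1, 1024, 138)

# "Soft pixel alignment" is read here as attention-style weighting across all
# 1024 positions; this interpretation is an assumption, as the rebuttal does
# not spell the operation out.
attn = torch.softmax(fused @ fused.transpose(1, 2) / fused.size(-1) ** 0.5, dim=-1)
aligned = attn @ fused                           # (1, 1024, 138) pixel-wise supervised features
print(aligned.shape)
```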




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Paper Summary: The paper tackles the challenge of segmenting vessels encapsulating tumor clusters (VETC) in hepatocellular carcinoma whole-slide images under extreme label scarcity. To address the clustered morphology of VETC nuclei and the high cost of pixel-level annotation, the authors propose VPP2P, a semi-supervised patch-to-pixel visual prompting framework. VPP2P first generates morphology-aware prompts from nucleus locations, then propagates these prompts from coarse patches down to individual pixels within a contrastive learning architecture. With only 10–30% of labels, the method achieves segmentation accuracy on par with or exceeding fully supervised baselines.

    Key Strengths: The proposed patch-to-pixel prompting strategy is an innovative extension of visual prompting, embedding fine-grained nuclear morphology directly into feature learning. Extensive experiments demonstrate that VPP2P matches or outperforms state-of-the-art methods using a fraction of labeled data, with clear gains in Dice score and boundary accuracy across multiple label regimes. Comprehensive ablation studies validate the benefit of pixel-level prompts over patch-level or random baselines, and visualizations illustrate that the learned prompts capture distinct VETC morphology.

    Key Weaknesses: The manuscript leaves several technical details ambiguous, including the precise mechanism by which prompts propagate from patches to pixels and the definitions of “source” versus “target” nuclei. Architectural choices, such as using DenseUNet, HoVer-Net and ViT-L, are not justified or ablated, and there is no analysis of computational cost or inference speed. Novelty is questioned since the contrastive learning framework resembles prior dense contrastive methods. Finally, the private dataset limits assessment of generalizability, as no external or public dataset validation is provided.

    Review Summary: All three reviews concur that the paper addresses a clinically important problem and delivers strong empirical results with limited supervision. They agree that the patch-to-pixel prompt formulation is promising and that experiments are well executed. Two reviewers raise concerns about the method’s novelty relative to existing contrastive prompt works, unclear exposition of core mechanisms, and a lack of justification for key design choices and resource costs. One reviewer strongly supports acceptance, emphasizing the hierarchical contrastive design and integration of nuclear spatial priors.

    Decision: Invite for Rebuttal—given this mix, a rebuttal stage will allow the authors to resolve ambiguities and strengthen their case. VPP2P shows promising novelty and strong results under label scarcity, but key details (patch-to-pixel prompt propagation, definitions of source/target nuclei, notation), justifications for core architectural choices, computational cost metrics, and external dataset validation are currently insufficient, and addressing these points in rebuttal will clarify the method’s novelty, interpretability, and practical feasibility.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces VPP2P, a semi-supervised “patch-to-pixel” visual-prompting framework for segmenting vessels encapsulating tumor clusters (VETC). It encodes nucleus-distribution patterns as pixel-level prompts and learns them through a dual-branch contrastive teacher-student design, achieving performance on par with fully supervised DenseUNet while using only a quarter of the manual annotations and outscoring recent prompt-based and semi-supervised baselines by up to three mIoU points on a multi-centre hepatocellular-carcinoma cohort.

    During the first review round the scores were 3 / 3 / 5, reflecting two weak rejects and one accept. After the authors’ rebuttal clarified the propagation of 10-dimensional nucleus prompts, explained the necessity of Stage 2 refinement, justified the DenseUNet + HoVer-Net backbone and reported practical training and inference times, the second and third reviewers raised their recommendations to accept; only the first reviewer maintained a reject, citing remaining presentation ambiguities. The final tally is therefore two accepts against one reject.

    The AC agrees with the accepting majority. The central contribution—the explicit injection of task-specific spatial priors as fine-grained prompts, learned at pixel level—is original and empirically well supported. The residual concerns centre on notation, section headings and baseline labelling rather than on algorithmic soundness, and can be resolved editorially without breaching MICCAI’s no-new-experiments rule.

    Consequently the AC recommends acceptance. In the camera-ready version the authors should rename Section 2.1 to “Visual-prompt generation,” introduce all symbols at first use, add the concise prompt-concatenation description from the rebuttal, distinguish clearly between “Baseline” and “WSSS-Tissue” in the tables, incorporate ground-truth masks into Figure 2 and include the promised code-and-data release note. With those clarifications the paper will be fully transparent for readers.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers recognize the novelty of its visual prompting design and acknowledge its superior performance. Therefore, I recommend acceptance.


