Abstract

Survival prediction using whole-slide images (WSIs) is crucial in cancer research. Despite notable success, existing approaches are limited by their reliance on sparse slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. Recently, vision-language (VL) models, which incorporate additional language supervision, have emerged as a promising solution. However, VL-based survival prediction remains largely unexplored due to two key challenges. First, current methods often rely on a single simple language prompt and basic cosine similarity, which fails to learn fine-grained associations between multi-faceted linguistic information and visual features within WSIs, resulting in inadequate vision-language alignment. Second, these methods primarily exploit patch-level information, overlooking the intrinsic hierarchy of WSIs and the interactions within it, and thus fail to model hierarchical interactions effectively. To tackle these problems, we propose a novel Hierarchical vision-Language collaboration (HiLa) framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. At each level, a series of language prompts describing various survival-related attributes is constructed and aligned with visual features via Optimal Prompt Learning (OPL). This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts, thereby improving vision-language alignment. Furthermore, we introduce two modules, Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL), to maximize hierarchical cooperation by promoting interactions and consistency between the patch and region levels. Experiments on three The Cancer Genome Atlas (TCGA) datasets demonstrate our state-of-the-art performance.
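
The abstract describes OPL only at a high level. As a loose illustration of the kind of entropic optimal-transport alignment it builds on, the PyTorch sketch below matches a set of survival-attribute prompt embeddings to visual tokens via Sinkhorn iterations; the function names, uniform marginals, and dimensions are our assumptions for illustration, not details from the paper.

```python
import torch

def sinkhorn_alignment(visual, prompts, eps=0.1, n_iters=50):
    """Align visual tokens with prompt embeddings via entropic optimal transport.

    visual:  (N, d) L2-normalized visual token features (e.g., patch features).
    prompts: (K, d) L2-normalized prompt embeddings, one per survival attribute.
    Returns an (N, K) transport plan whose columns weight the visual tokens
    most relevant to each prompt.
    """
    cost = 1.0 - visual @ prompts.T                            # cosine distance, (N, K)
    gibbs = torch.exp(-cost / eps)                             # Gibbs kernel
    a = torch.full((visual.shape[0],), 1.0 / visual.shape[0])  # uniform source marginal
    b = torch.full((prompts.shape[0],), 1.0 / prompts.shape[0])
    u = torch.ones_like(a)
    for _ in range(n_iters):                                   # Sinkhorn-Knopp updates
        v = b / (gibbs.T @ u)
        u = a / (gibbs @ v)
    return u[:, None] * gibbs * v[None, :]                     # (N, K) transport plan

# Per-attribute visual features: each row of `feats` aggregates the visual
# evidence the plan assigns to one survival-related prompt.
visual = torch.nn.functional.normalize(torch.randn(1024, 512), dim=-1)
prompts = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
plan = sinkhorn_alignment(visual, prompts)
feats = plan.T @ visual                                        # (K, d)
```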

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1060_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/gluucose/HiLa

Link to the Dataset(s)

https://www.cancer.gov/ccg/research/genome-sequencing/tcga

BibTex

@InProceedings{CuiJia_HiLa_MICCAI2025,
        author = { Cui, Jiaqi and Wen, Lu and Fei, Yuchen and Liu, Bo and Zhou, Luping and Shen, Dinggang and Wang, Yan},
        title = { { HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {239--249}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a novel hierarchical vision-language collaboration framework for survival prediction. The core contribution lies in the integration of multiple language prompts with Optimal Prompt Learning (OPL) to strengthen the alignment between diverse survival-related attributes and visual features extracted from whole slide images (WSIs). In addition, the authors introduce two key modules: Cross-Level Propagation (CLP) to facilitate effective hierarchical interaction, and Mutual Contrastive Learning (MCL) to enforce consistency across different levels.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The methodology is clearly explained, and the proposed components are well-motivated. The authors present an interesting approach by combining prompt learning with optimal transport. This integration enhances the alignment between survival-related language prompts and visual features from whole slide images. The introduction of the Cross-Level Propagation (CLP) module is an additional contribution, modeling hierarchical interactions between different levels of visual and semantic information.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. While the proposed framework is well-structured, it primarily builds upon existing techniques, namely prompt learning, optimal transport, hierarchical feature integration, and contrastive learning. The combination of these components, though meaningful, offers limited technical innovation, as each element has been explored in prior literature.
    2. The paper lacks critical implementation details regarding the prompt learning component, which raises concerns about reproducibility: Are prompt features shared across all WSIs within a dataset, or are they instance-specific? What is the dimensionality of the prompt features? Why was PLIP selected as the text encoder over other alternatives?
    3. Many SOTA baselines appear to be missing from the comparison, e.g., Song, A. H., Chen, R. J., Ding, T., Williamson, D. F., Jaume, G., & Mahmood, F. (2024). Morphological prototyping for unsupervised slide representation learning in computational pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11566-11578). Additionally, many of the compared methods have evaluated performance on multiple TCGA datasets (e.g., BLCA, GBMLGG), whereas the current work focuses on a limited subset. Expanding the evaluation to more diverse datasets would strengthen the evidence for the generalizability and robustness of the proposed method.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper demonstrates a promising direction and solid implementation, many aspects could be further improved to strengthen the contribution and the quality of this paper, especially in enhancing the experimental scope and clarifying methodological details.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Thank you to the authors for the rebuttal. Most of my concerns are addressed, and I am inclined to rate this manuscript as accept after the rebuttal.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a novel Hierarchical Vision-Language Collaboration (HiLa) framework for survival prediction from whole-slide images (WSIs) in cancer research. Specifically, the paper proposes Cross-Level Propagation (CLP), which establishes hierarchical, attention-based connections between the patch and region levels by using patch-level knowledge for region-level prediction. Moreover, the paper proposes Mutual Contrastive Learning (MCL), which enforces consistency between the patch-level and region-level visual features of each patient through contrastive learning. Experimentally, HiLa is shown to achieve superior performance on three public cancer datasets.
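
As a rough sketch of the cross-attention pattern this review attributes to CLP (region-level queries attending over patch-level keys and values), the following PyTorch module is our own minimal rendering; the class name, head count, and residual/norm layout are assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class CrossLevelPropagation(nn.Module):
    """Propagate patch-level knowledge to region-level tokens via cross-attention.

    Region tokens act as queries; patch tokens supply keys and values, so each
    region representation is refined by the patches it summarizes.
    """
    def __init__(self, dim=512, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_tokens, patch_tokens):
        # region_tokens: (B, R, d); patch_tokens: (B, P, d)
        out, _ = self.attn(query=region_tokens, key=patch_tokens, value=patch_tokens)
        return self.norm(region_tokens + out)  # residual connection + layer norm

# Example: 16 region tokens attend over 1024 patch tokens from one WSI.
clp = CrossLevelPropagation(dim=512)
regions = torch.randn(1, 16, 512)
patches = torch.randn(1, 1024, 512)
refined = clp(regions, patches)               # (1, 16, 512)
```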

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. HiLa addresses the limitation of previous methods, which consider only the local patch information of WSIs, by introducing global region information and multi-level feature extraction.
    2. On the text branch, the paper proposes Optimal Prompt Learning (OPL), which aligns linguistic information and visual features at a finer granularity.
    3. The ablation study is detailed enough to clearly show the contribution of each component.
    4. The paper is clearly written, and the experimental results are strong.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The previous study VLSA [1] used five datasets and an additional metric (MAE). Why does this paper use only three of those datasets for experiments?
    2. The paper compares against models such as RRT-MIL, CoOp, and VLSA, but the results of these models in Table 1 differ considerably from those reported in the VLSA paper. For a fairer comparison with other models, the same settings as the previous work should be used.
    3. The paper only mentions using GPT-4o to generate more detailed prompts. It would be better to add comparative experiments with different LLMs (Claude, Grok, Deepseek, Llama).

    [1]Liu, P., Ji, L., Gou, J., Fu, B., & Ye, M. (2024). Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology. arXiv preprint arXiv:2409.09369.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has some innovations and relatively good results, but its experiments are somewhat insufficient compared with previous research.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors answered my questions very well in the rebuttal. The paper has methodological innovation and strong experimental results, so it should be accepted.



Review #3

  • Please describe the contribution of the paper

    The manuscript presents HiLa, a novel framework for enhanced survival prediction using WSIs. The authors address three significant challenges in current methodologies: (1) the limitations of simple language prompts, (2) the inadequacy of cosine similarity metrics for capturing nuanced associations between language and visual features, and (3) insufficient modeling of the patch-region structure within WSIs. Their approach integrates optimal prompt learning, cross-level propagation, and mutual contrastive learning to overcome these limitations. Experimental evaluation demonstrates that the proposed method achieves state-of-the-art performance, with ablation studies confirming the effectiveness of each component.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written and easy to follow
    2. The authors provide a thorough analysis of critical limitations in existing approaches, specifically addressing the ineffectiveness of naïve prompting methods, the shortcomings of simple cosine similarity measurements between language and visual features, and the inadequate modeling of WSIs’ hierarchical nature.
    3. The experimental methodology is robust, with results that convincingly demonstrate the efficacy of the proposed approach. The ablation studies also validate the contribution of each component to the overall performance.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The manuscript would benefit from a sensitivity analysis of key hyperparameters, including the selection ratio, the λ value, and the memory queue length. This additional analysis would enhance reproducibility and provide valuable insights into the robustness of the model.
    2. The authors' choice of λ warrants further examination and justification. Given this parameter's relatively small magnitude, it would be valuable to include a comparative analysis demonstrating how this specific value influences the learning dynamics, particularly by examining the relative magnitudes of the two loss components and their respective contributions to the overall optimization process.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper identifies several key challenges in the field and proposes effective solutions to them, despite some remaining minor flaws.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed my concerns. I understand that no additional experimental results are allowed in the rebuttal phase. I would recommend accepting this paper.




Author Feedback

We thank all the reviewers (R1, R2, R3) for their constructive comments.

Q1: Technical novelty. (R3)
A1: We would like to highlight our novelty as follows. First, rather than a conventional oversimplified prompt, we design tailored language prompts enriched with survival-related priors to enable a comprehensive survival assessment from multiple perspectives. Second, moving beyond basic cosine similarity for vision-language alignment, we recognize the complex nature of WSIs and design OPL, which enhances optimal transport via a visual token selection mechanism to capture fine-grained discriminative patterns in WSIs. Third, different from traditional hierarchical feature integration, our CLP establishes explicit correspondences between patch-level features and their respective regions, ensuring effective hierarchical information propagation from local to global. In addition, we introduce MCL, which improves standard contrastive learning by enforcing mutual interaction between patch-level and region-level prototypes, enhancing hierarchical consistency. We contend that our method, as an early exploration of VL-based survival prediction with tailored techniques, offers novel insights for survival prediction in CPath. We will strengthen the discussion of our novelty in the final paper.

Q2: Implementation details. (R3)
A2: 1) Prompt features are shared across all WSIs within a dataset to ease the annotation burden on pathologists. 2) The prompt dimension d is 512. Specifically, the original visual features extracted by the patch-level and region-level feature extractors have dimensions of 384 and 192, respectively; these features are then linearly projected to 512 dimensions to facilitate subsequent vision-language collaboration. 3) PLIP is trained on a large collection of pathology images paired with expert-written texts. Compared to alternatives such as CLIP and CXR-BERT (trained on natural and X-ray image-text pairs), PLIP's text encoder imparts rich pathological context during language prompt generation, thereby enhancing survival prediction. We will include these details in the final paper.

Q3: Experiments regarding methods and datasets. (R1&R3)
A3: (1) Based on the official implementation by Song et al., our method achieves a 2.2% improvement in overall CI. (2) We evaluated five datasets prior to submission but reported three due to space constraints; HiLa consistently outperforms all baselines across them. Full numerical results will be included in the final version.

Q4: Results differing from VLSA. (R1)
A4: Our method employs HIPT [24] as the feature extractor to obtain both patch-level and region-level visual features, while VLSA uses CONCH for feature extraction in both its proposed and comparison methods. Since the original implementations of the comparison methods adopt different feature extractors (e.g., ResNet, PLIP), we uniformly adopt HIPT across all comparison methods to ensure a fair and consistent evaluation in this paper. Therefore, variations in performance compared to the original VLSA paper are expected. Our paper focuses on CI and KM analysis, which are more appropriate and common for survival prediction than MAE, as they align better with time-to-event data for evaluating discriminative ability and clinical utility.

Q5: Utilization of other LLMs. (R1)
A5: We tried LLaMa-3, Claude-3, and GPT-4o during model development and found that our method yields robust results across different LLMs, with GPT-4o performing best; GPT-4o is therefore selected. Numerical details will be provided in the final paper.

Q6: Sensitivity analysis of hyperparameters. (R2)
A6: We tested various hyperparameter values during model development to select the optimal ones. Specifically, we evaluated the ratio r% ∈ {0.4, 0.5, 0.6, 0.7}, λ ∈ {0.005, 0.01, 0.05, 0.1}, and queue length B ∈ {10, 20, 30}, and found that the CI peaks at the settings reported in our paper. We will include the numerical details.

Q7: Reproducibility.
A7: Code will be released upon acceptance.
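
The rebuttal's A2 and A6 pin down a few concrete numbers: 384-d patch and 192-d region features projected to a shared 512-d space, a memory queue of length B ∈ {10, 20, 30}, and a λ-weighted auxiliary loss. Below is a minimal sketch of how these pieces could fit together, assuming an InfoNCE-style objective for MCL; the exact formulation is in the paper, and every name and design choice here is our assumption.

```python
import torch
import torch.nn.functional as F

# Linear projections matching the dimensions reported in A2: patch-level (384-d)
# and region-level (192-d) features are mapped to a shared 512-d space.
proj_patch = torch.nn.Linear(384, 512)
proj_region = torch.nn.Linear(192, 512)

def mutual_contrastive_loss(patch_proto, region_proto, queue, tau=0.07):
    """InfoNCE-style sketch of mutual contrastive learning with a memory queue.

    patch_proto, region_proto: (d,) prototypes of one patient at the two levels.
    queue: (B, d) prototypes of other patients, serving as negatives. The loss is
    applied symmetrically so each level pulls toward the other (the "mutual" part).
    This is an assumed objective, not the paper's exact formulation.
    """
    p = F.normalize(patch_proto, dim=-1)
    r = F.normalize(region_proto, dim=-1)
    negs = F.normalize(queue, dim=-1)

    def nce(anchor, positive):
        # Positive similarity first, then similarities to queued negatives.
        logits = torch.cat([(anchor * positive).sum(-1, keepdim=True),
                            anchor @ negs.T]) / tau          # (1 + B,)
        return F.cross_entropy(logits[None], torch.zeros(1, dtype=torch.long))

    return 0.5 * (nce(p, r) + nce(r, p))

patch_proto = proj_patch(torch.randn(384))
region_proto = proj_region(torch.randn(192))
queue = torch.randn(20, 512)          # B = 20, one of the queue lengths tested in A6
loss_mcl = mutual_contrastive_loss(patch_proto, region_proto, queue)
# Total objective (assumption): survival loss plus a λ-weighted MCL term,
# with λ swept over {0.005, 0.01, 0.05, 0.1} per A6.
```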




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


