Abstract

Histo-genomic multimodal survival prediction has garnered growing attention for its remarkable model performance and potential contributions to precision medicine. However, a significant challenge in clinical practice arises when only unimodal data is available, limiting the usability of these advanced multimodal methods. To address this issue, this study proposes a prototype-guided cross-modal knowledge enhancement (ProSurv) framework, which eliminates the dependency on paired data and enables robust learning and adaptive survival prediction. Specifically, we first introduce an intra-modal updating mechanism to construct modality-specific prototype banks that encapsulate the statistics of the whole training set and preserve the modality-specific risk-relevant features/prototypes across intervals. Subsequently, the proposed cross-modal translation module utilizes the learned prototypes to enhance knowledge representation for multimodal inputs and generate features for missing modalities, ensuring robust and adaptive survival prediction across diverse scenarios. Extensive experiments on four public datasets demonstrate the superiority of ProSurv over state-of-the-art methods using either unimodal or multimodal input, and the ablation study underscores its feasibility for broad applicability. Overall, this study addresses a critical practical challenge in computational pathology, offering substantial significance and potential impact in the field.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2488_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/cyclexfy/ProSurv

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiuFen_PrototypeGuided_MICCAI2025,
        author = { Liu, Fengchun and Cai, Linghan and Wang, Zhikang and Fan, Zhiyuan and Yu, Jin-Gang and Chen, Hao and Zhang, Yongbing},
        title = { { Prototype-Guided Cross-Modal Knowledge Enhancement for Adaptive Survival Prediction } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {531 -- 541}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a prototype-guided cross-modal knowledge enhancement framework for multimodal (i.e., WSI and Genomics) survival prediction. In particular, the proposed intra-modal prototype banks help capture modality-specific survival-relevant features, whereas the cross-attention-based alignment helps align the cross-modal features.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed intra-modal prototype learning is interesting, especially the event-aware sampling, which makes the prototype bank more informative about specific events. This is meaningful in survival prediction tasks.

    2. The proposed knowledge-enhanced learning works for both complete modalities and missing modalities.

    3. The proposed method shows promising empirical results on four survival prediction datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The multimodal ProSurv appears to not outperform the unimodal ProSurv in some datasets, which is quite counterintuitive and somewhat defeats the effectiveness of the proposed multimodal learning.

    2. Given that the average improvement of multimodal over unimodal ProSurv is quite marginal, can the authors provide any statistical test results?

    3. The proposed learning scheme under missing modalities appears to optimize only L^p or L^g; how can this preserve the knowledge from other modalities? Presumably, when one modality is missing, one would like to align the unimodal representations with the multimodal representations via, e.g., knowledge distillation (reminiscent of the coordinated models [1] in multimodal learning, which align unimodal and multimodal representations in distribution by optimizing a KL divergence).

    4. Although ablation studies are provided, the sensitivity analysis on \alpha and \beta is missing.

    [1] Joint multimodal learning with deep generative models

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents an interesting idea for multimodal survival prediction with WSI and Genomics. My main concern is the effectiveness of the multimodal framework over the unimodal ones, supported by the empirical results on Table 1, as well as the rationale of how the proposed method can handle missing modalities. Therefore, my initial evaluations lean toward “weak reject”.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This study proposes a Prototype-guided Cross-modal Knowledge Enhancement (ProSurv) framework, which removes the dependency on paired data and enables robust learning and adaptive survival prediction. Specifically, an intra-modal updating mechanism is introduced to construct modality-specific prototype banks that capture the statistical patterns of the entire training set and preserve risk-relevant features/prototypes across intervals within each modality. Subsequently, a cross-modal translation module leverages these learned prototypes to enhance knowledge representation for multimodal inputs and generate features for missing modalities, thereby supporting robust and adaptive survival prediction across diverse scenarios.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The cross-modal translation module utilizes the learned prototypes to enrich knowledge representation for multimodal inputs and to synthesize features for missing modalities, thereby enabling robust and adaptive survival prediction across diverse scenarios.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    In the survival analysis based on pathological images and multi-omics data, it is still unclear which modality is more expensive and whether multi-omics fusion will definitely have a better prediction effect (Table 1 in the manuscript also illustrates this point). In the Knowledge-Enhanced Learning and Prediction section, when the modality is missing, is it necessarily beneficial to predict survival by querying the feature representation in the prototype bank of the relevant missing modality through the existing modality?

    What is the difference between the prototypes proposed in this manuscript and the methods in the following 3 articles? Song A H, Chen R J, Jaume G, et al. Multimodal prototyping for cancer survival prediction[J]. arXiv preprint arXiv:2407.00224, 2024. Zhang Y, Xu Y, Chen J, et al. Prototypical information bottlenecking and disentangling for multimodal cancer survival prediction[J]. arXiv preprint arXiv:2401.01646, 2024. Xu Y, Zhou F, Zhao C, et al. Distilled Prompt Learning for Incomplete Multimodal Survival Prediction[J]. arXiv preprint arXiv:2503.01653, 2025.

    What is the relationship between the size of the prototype bank and the size and complexity of the dataset? The authors only conducted experiments on TCGA-CRAD, which is not enough to answer this question.

    Methods based on prototypes and cross-attention have already appeared in survival analysis, and their innovation is average. The method in this manuscript performs well in dealing with missing modality, but may not be as good as direct cross-attention based methods when dealing with multimodal fusion.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The cross-modal translation module can synthesize features for missing modalities, thereby enabling robust and adaptive survival prediction across diverse scenarios. However, methods based on prototypes and cross-attention have already appeared in survival analysis, and their innovation is average. The method in this manuscript performs well in dealing with missing modality, but may not be as good as direct cross-attention based methods when dealing with multimodal fusion.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have answered most of my concerns.



Review #3

  • Please describe the contribution of the paper

    The authors propose ProSurv, a prototype-guided cross-modal knowledge enhancement framework for adaptive survival prediction. The key innovation lies in eliminating the dependency on paired data by learning modality-specific prototype banks that capture risk-relevant features across survival intervals. These prototypes are then used to guide cross-modal translation, enabling the generation of features for missing modalities. Depending on the data availability, ProSurv adaptively combines original and translated features for robust survival prediction. The framework is applied to histopathology whole slide images and genomic data in a multimodal setting. It introduces novel mechanisms including intra-modal prototype updating with contrastive learning, and alignment losses to ensure semantic consistency between translated and real features. Extensive experiments are conducted across four datasets, evaluating ProSurv under both unimodal and multimodal inference scenarios. The method demonstrates state-of-the-art performance compared to baselines and competitive models addressing the missing modality problem. Additionally, comprehensive ablation studies validate the contributions of each module and loss component.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Thorough ablation studies: The paper conducts well-designed ablation studies that isolate the contributions of each major component — the prototype banks, the intra-modal similarity loss, and the alignment loss. Results across multiple modality configurations (complete, pathology-only, genomics-only) clearly demonstrate the performance impact of removing each component (Table 2), offering strong empirical evidence for their necessity.
    2. Novel and significant use of prototype banks: The idea of learning modality-specific prototype banks tied to survival intervals and using them for cross-modal translation is novel in the context of survival analysis. Unlike previous methods requiring paired modalities or direct reconstruction, ProSurv uses these learned prototypes to guide feature synthesis in a more structured and risk-aware manner (Sections 2.3 & 2.4), making the approach more flexible and clinically applicable.
    3. Demonstrated robustness to missing modalities: The authors explicitly evaluate ProSurv under varying percentages of unimodal data during training (Fig. 3b), showing that performance remains stable even when over half of the training data lacks one modality. This demonstrates the framework’s robustness and adaptability to real-world clinical settings where complete multimodal data is often unavailable.
    4. Comprehensive comparative evaluation: The paper compares ProSurv against baseline models spanning: unimodal pathology models, unimodal genomic models, multimodal models without missing modality handling, multimodal methods with missing modality consideration. These comparisons are carried out on four diverse datasets using the C-index metric, with ProSurv achieving state-of-the-art results under both unimodal and multimodal testing.
    5. Flexible inference mechanism: ProSurv introduces a unified inference strategy that dynamically adjusts based on available data: combining original and translated features when both modalities are present, or synthesizing missing modality features when only one is available. This flexibility is well-supported with separate equations and loss formulations, making the model truly adaptive in practical deployment scenarios.
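    For readers unfamiliar with the C-index used in these comparisons, the following is a minimal sketch of Harrell's concordance index. It is not taken from the paper or its repository; the function name `concordance_index` and the toy values are illustrative assumptions, and tied survival times are simply skipped.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    times:  observed follow-up times
    events: 1 if the event (death) was observed, 0 if censored
    risks:  predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event received the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values): the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risks))
```

    A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why small C-index differences between unimodal and multimodal inputs are hard to interpret without a statistical test.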
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The authors need to clarify the clinical feasibility of the proposed framework. While the method shows strong technical results, its reliance on both whole slide images and genomic profiles may limit its real-world applicability. These modalities are not always collected together in routine clinical workflows, especially in low-resource settings or when patient data is fragmented. The construction of the prototype banks lacks sufficient detail. It is unclear how the samples used to initialize or update the prototype banks are selected, whether randomly or based on specific criteria. The choice of 32 prototypes per bin is not justified, nor is there discussion about whether this number should be adapted depending on dataset size or feature distribution.
    2. The results raise an important concern: in 2 out of the 4 datasets (e.g., BRCA and STAD), using pathology alone during inference yields better or similar performance compared to using both modalities. The overall gain from adding genomic information appears marginal. This raises the question of whether pathology features alone are sufficient in many cases, and whether genomic profiles should be deprioritized given their limited availability and clinical cost.
    3. The authors do not provide a sensitivity analysis or discussion on the choice of hyperparameters, particularly the weights assigned to loss terms like the alignment loss. While the current setting (0.2) appears to work reasonably well, it is possible that tuning these weights could yield further performance improvements or offer insights into the model’s behavior.
    4. The paper does not report on the computational cost associated with maintaining and updating the prototype banks. Since the banks are updated per interval and modality, they may introduce memory or compute overhead, particularly on larger datasets. The lack of a runtime or resource analysis makes it hard to assess scalability.
    5. The authors state that prototype banks “encapsulate statistics of the whole training set,” but there is no evaluation or evidence provided to support this claim. It remains unclear how well these prototypes generalize or whether they truly reflect global data characteristics, especially in diverse or imbalanced datasets.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Recommendation: Accept

    The paper addresses an important and clinically relevant problem of survival prediction using multimodal data in the presence of missing modalities. It introduces a novel and effective methodology through the use of prototype-guided cross-modal knowledge enhancement, and demonstrates strong empirical results across four public datasets. The framework is flexible, robust, and outperforms several state-of-the-art methods, with particularly good design around adaptive inference and loss component analysis.

    However, to improve clarity and completeness, the authors should further justify several methodological decisions. These include how the prototype bank is constructed and why its size is fixed at 32 prototypes per interval. A more detailed explanation would enhance the transparency and reproducibility of the work. Additionally, the choice of hyperparameters should be discussed or explored.

    Finally, the paper would benefit from a more explicit discussion about clinical feasibility and real-world deployment, especially considering that whole slide images and genomic profiles are not always collected together in practice. It would also be helpful to clarify why pathology-only inference performs nearly as well as multimodal inference, and whether this suggests that pathology alone may often be sufficient.

    Despite these limitations, the core contribution is strong, well-motivated, and clearly demonstrated through experiments. I recommend acceptance, with minor revisions to address the above points.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed the weaknesses well. I still recommend that they revise the paper to assert more on the clinical impact and the choice of samples and size of the prototype bank.




Author Feedback

We thank the reviewers for their valuable comments and suggestions.

  1. Ablation Study. Although hyperparameter analysis (e.g., α/β and K) was not included in the manuscript, all values were carefully selected based on results across datasets. The results in Table 1 correspond to the best-performing configurations determined from ablation studies.
    • Experiments showed that setting α to 0.2 (with β fixed) yields the best C-index on all datasets. Similarly, setting β to 0.2 is the optimal choice.
    • We evaluated the prototype bank size K across datasets and concluded that 32 is the optimal choice, balancing feature discrimination and redundancy, independent of dataset size.
    • We evaluated the computational cost of maintaining and updating the prototype bank: it increases the time cost by 1/3.
  2. Prototype Banks.
    • We analyzed the prototype distributions by computing the Jensen-Shannon divergence and Silhouette score of dimension-reduced features across different survival stages. Results on all datasets show clear separation and compactness, confirming that the prototypes of both modalities effectively capture survival-stage-specific data patterns.
    • The comparison of ProSurv with its variant (w/o prototypes) in Table 2 quantitatively confirms the effectiveness of querying under missing modality. Fig.3 further verifies this qualitatively.
  3. Clinical Feasibility and Necessity of Gene Data. Although collecting multimodal data is challenging, public datasets (e.g., TCGA) provide many paired samples. Meanwhile, ProSurv utilizes both uni- and multi-modal data for training, making it feasible in clinical scenarios. The unstable effectiveness of genomic data (good in BLCA, CRAD; bad in BRCA, STAD) is due to disease characteristics. When trained on partially paired data, ProSurv reduces the dependency on gene input at test time, making it more scalable.
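
The prototype-distribution analysis mentioned in point 2 is not detailed in the rebuttal. As a rough illustration only, the sketch below computes a Jensen-Shannon divergence between histograms of a one-dimensional projection of prototypes from two survival intervals. The helper names, the binning scheme, and the toy values are assumptions and do not come from the authors' code; the Silhouette score is omitted.

```python
import math


def histogram(values, lo, hi, bins):
    """Normalized equal-width histogram of values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp out-of-range values into edge bins
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint supports."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 wherever x > 0 by construction of m.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical 1-D projections of prototypes from an early and a late survival interval.
early = [0.10, 0.20, 0.15, 0.30]
late = [0.80, 0.90, 0.70, 0.85]
p = histogram(early, 0.0, 1.0, bins=4)
q = histogram(late, 0.0, 1.0, bins=4)
print(js_divergence(p, q))
```

Under this reading, a divergence near 1 bit between intervals would indicate well-separated prototypes, while a value near 0 would indicate that the intervals are indistinguishable in the projected space.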

  4. Reasons for better results of using unimodal data and performance differences in ProSurv.
    • ProSurv narrows the gap between uni- and multi-modal inputs by using prototype-guided cross-modal translation to complete the missing modality. Additionally, H&E samples outnumber paired image-gene data during training (e.g., only 687 of 868 BRCA cases are paired). Previous multimodal methods train only on paired data and ignore unpaired data, whereas ProSurv can learn from the whole dataset, paired and unpaired. Moreover, the inferior performance of genomic data lowers the overall multimodal results on BRCA and STAD, so H&E-only input obtains results close to, or even exceeding, those of multimodal input.
    • The difference between H&E-only and multimodal inputs is not statistically significant, showing the effectiveness of cross-modal reconstruction. The gene-only vs. multimodal difference is significant, confirming multimodal ProSurv’s advantage. Still, gene-only ProSurv outperforms other gene-only baselines.
  5. Preserving modality knowledge and completing missing modality. When both modalities are available, ProSurv fuses and optimizes them using L^p,g, preserving multimodal knowledge through L_sim and L_align. When only one modality is available, it preserves the input modality by optimizing L_sim and reconstructs the missing-modality features via prototype-guided cross-modal translation. Methods (e.g., knowledge distillation) that align unimodal representations with multimodal ones require complete multimodal data for training and are therefore less flexible than ProSurv, which can train on both uni- and multi-modal data.
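
To make the idea of querying a prototype bank more concrete, here is a minimal sketch of attention-based translation: a feature from the available modality attends over the prototypes of the missing modality, and the attention-weighted sum serves as the synthesized feature. This is a simplified reading of the mechanism described in the reviews and rebuttal, not the authors' implementation. It omits learned projections, and the function names and toy bank are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def translate_with_prototypes(query, bank):
    """Synthesize a missing-modality feature from its prototype bank.

    query: feature vector of the available modality (list of d floats)
    bank:  prototypes of the missing modality (list of d-dimensional lists)
    Returns a convex combination of the prototypes, weighted by scaled
    dot-product attention between the query and each prototype.
    """
    d = len(query)
    scores = [
        sum(q * p for q, p in zip(query, proto)) / math.sqrt(d)
        for proto in bank
    ]
    weights = softmax(scores)
    return [
        sum(w * proto[k] for w, proto in zip(weights, bank))
        for k in range(d)
    ]


# Toy bank with two prototypes (hypothetical; the rebuttal reports K = 32).
bank = [[1.0, 0.0], [0.0, 1.0]]
pathology_feature = [10.0, 0.0]  # strongly aligned with the first prototype
synthesized = translate_with_prototypes(pathology_feature, bank)
print(synthesized)
```

Because the output is a convex combination of prototypes, it always lies within the span of what the bank has stored, which is one way to see why the quality of the bank matters for unimodal inference.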

  6. Method Differences. Prototypes: The prototypes in MMP and PIBD learn and fuse features, whereas ours are used to reconstruct missing modalities. DisPro uses an LLM with prompts to distill knowledge for unimodal inference, whereas ProSurv achieves this via prototype-guided cross-modal translation, which is essentially different. Cross-attention: Prior cross-attention methods process paired multimodal data, whereas ProSurv can flexibly leverage both uni- and multi-modal samples.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors address the concerns in the rebuttal.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All concerns presented by reviewers are well addressed. This work presents enough technical contributions and meets the bar of MICCAI.


