Abstract

Survival analysis based on Whole Slide Images (WSIs) is crucial for evaluating cancer prognosis, as WSIs offer detailed microscopic information essential for predicting patient outcomes. However, traditional WSI-based survival analysis usually faces noisy features and limited data accessibility, hindering its ability to capture critical prognostic features effectively. Although pathology reports provide rich patient-specific information that could assist analysis, their potential to enhance WSI-based survival analysis remains largely unexplored. To this end, this paper proposes a novel Report-auxiliary self-distillation (Rasa) framework for WSI-based survival analysis. First, advanced large language models (LLMs) are utilized to extract fine-grained, WSI-relevant textual descriptions from original noisy pathology reports via a carefully designed task prompt. Next, a self-distillation-based pipeline is designed to filter out irrelevant or redundant WSI features for the student model under the guidance of the teacher model’s textual knowledge. Finally, a risk-aware mix-up strategy is incorporated during the training of the student model to enhance both the quantity and diversity of the training data. Extensive experiments carried out on our collected data (CRC) and public data (TCGA-BRCA) demonstrate the superior effectiveness of Rasa compared with state-of-the-art methods. Our code is available at https://github.com/zhengwang9/Rasa.
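
To make the pipeline described in the abstract easier to picture, the following is a minimal, hypothetical sketch of its three steps in Python; every function name, the relevance threshold, and the toy data are illustrative assumptions rather than the authors’ released implementation (see the linked code repository for the actual method).

```python
# Minimal, hypothetical sketch of the three steps above (LLM-based report
# filtering, teacher-guided patch selection, risk-aware mix-up). Every name,
# threshold, and the toy data are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def llm_extract(report: str) -> str:
    """Stand-in for the LLM step: keep only WSI-relevant sentences."""
    return ". ".join(s for s in report.split(". ") if "lymph node" not in s.lower())

def teacher_relevance(patch_feats: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Teacher scores each patch by similarity to the textual embedding."""
    sims = patch_feats @ text_feat
    return (sims - sims.min()) / (np.ptp(sims) + 1e-8)

def risk_aware_mixup(feat_hi: np.ndarray, feat_lo: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Blend the features of a high-risk and a low-risk case to augment training data."""
    return lam * feat_hi + (1.0 - lam) * feat_lo

# Toy case: 6 patch embeddings (dim 4), one text embedding, and a second case to mix with.
patches_a, patches_b = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
text_a = rng.normal(size=4)

report = "Moderately differentiated adenocarcinoma with desmoplastic stroma. Two lymph nodes examined, negative."
print(llm_extract(report))                    # WSI-relevant text kept, node counts dropped

rel = teacher_relevance(patches_a, text_a)    # per-patch relevance in [0, 1]
kept = patches_a[rel > 0.5]                   # student trains only on these patches

mixed = risk_aware_mixup(kept.mean(axis=0), patches_b.mean(axis=0))
print(kept.shape, mixed.shape)
```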

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1264_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/zhengwang9/Rasa

Link to the Dataset(s)

N/A

BibTex

@InProceedings{WanZhe_Enhancing_MICCAI2025,
        author = { Wang, Zheng and Liu, Hong and Wang, Zheng and Li, Danyi and Cen, Min and Magnier, Baptiste and Liang, Li and Wang, Liansheng},
        title = { { Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {199--209}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study presents a novel approach to incorporate pathology report data into survival analysis frameworks utilizing Hematoxylin and Eosin (H&E) stained Whole Slide Images (WSI). Furthermore, the authors introduce a risk-aware mix-up strategy aimed at mitigating the challenges associated with limited dataset sizes. The efficacy of the proposed method is assessed through comparative evaluations with state-of-the-art (SOTA) techniques, and systematic ablation studies are conducted to provide insight into the underlying mechanisms.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The methodology is rigorously developed and thoroughly described, facilitating a comprehensive understanding of the approach.
    • The proposed method is evaluated against a relevant and diverse set of state-of-the-art (SOTA) baseline models, ensuring a robust assessment of its performance.
    • Systematic ablation analyses are conducted to elucidate the functional contributions of individual submodules, providing valuable insights into the method’s internal workings and validating the design choices.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The evaluation of the method is limited and fails to provide conclusive evidence of its utility, primarily due to the lack of external validation, which is a standard practice in the field. Furthermore, the use of a non-publicly available CRC dataset hinders reproducibility, despite the existence of publicly available alternatives such as TCGA-CRC. Additionally, the small sample size of 331 cases in the TCGA-BRCA cohort is not justified, given that it comprises over 1000 patients.
    • The choice of using BioBERT for text encoding, while utilizing CONCH for H&E image encoding, is not well justified, as CONCH could have been employed for both modalities to facilitate better image-text alignment from the outset.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is promising. However, its potential is hindered by the absence of external validation and large-scale training cohort evaluations, which are essential for establishing its robustness and generalizability.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed my main concern about external validation. It is now clear to me that the authors trained on an internal CRC cohort and evaluated on TCGA-CRC. While we believe further external validation for the other experiments (e.g. BRCA) would be beneficial in future work, we recommend the manuscript for acceptance.



Review #2

  • Please describe the contribution of the paper

    This paper introduces real-world pathology reports into WSI-based survival analysis by proposing a two-stage multimodal self-distillation framework, Rasa. The method leverages large language models to structure clinical text, uses a teacher model to guide patch selection, and incorporates a risk-aware mix-up strategy to enhance the robustness of training. The approach demonstrates strong performance across multiple cancer datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The manuscript is generally well-written, clearly structured, and easy to follow.
    2. The authors successfully leverage GPT-4 and BioClinicalBERT to transform unstructured clinical text into structured representations aligned with image semantics, enhancing both interpretability and modality consistency.
    3. The proposed two-stage distillation framework integrates text-guided patch selection and risk-aware augmentation, enabling more effective training.
    4. The method demonstrates strong performance across two datasets, and the authors have committed to releasing their private dataset and implementation code.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The teacher model is frozen after training in Stage 1, and its distilled knowledge is only utilized for non-augmented samples. This design may limit the adaptability of the teacher’s guidance during training.
    2. Since clinical narratives may vary across institutions, it is unclear how the authors ensure the selected textual tokens remain effective and generalizable across datasets.
    3. In Section 2.3, the authors refer to a “predefined threshold” but do not clarify how it is determined. It remains unclear how different threshold settings affect downstream performance.
    4. The accuracy of the predicted sample-level risk scores is critical, yet the manuscript does not sufficiently justify their reliability. It is particularly concerning that potential errors around the median-based dichotomization into high/low-risk groups might adversely affect Stage 2 training.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors need to justify how they can ensure that the selected textual tokens remain effective and generalizable across datasets, and how the designed parameters affect performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I am satisfied with the authors’ response.



Review #3

  • Please describe the contribution of the paper
    1. A task-specific prompt guides LLMs (e.g., GPT-4) to extract fine-grained, WSI-aligned textual descriptions from raw pathology reports, addressing noise (e.g., WSI-irrelevant lymph node information) in traditional reports.
    2. A teacher-student architecture leverages textual knowledge to filter redundant WSI features (e.g., non-tumor regions) and introduces a risk-aware mix-up strategy to enhance data diversity.
    3. Superior performance on both a self-collected colorectal cancer (CRC) dataset and the public TCGA-BRCA dataset, achieving CI scores of 0.6834 and 0.6972, respectively, outperforming state-of-the-art methods (e.g., TransMIL, MCAT, PseMix).
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. LLM-generated textual descriptions align pathology reports with WSI visual features (e.g., emphasizing keywords like “tumor cells”), surpassing the limitations of generic text in prior methods (e.g., QPMIL-VL) (Table 2).
    2. SOTA results on both private CRC and public TCGA-BRCA datasets demonstrate robustness.
    3. Eliminates manual tumor region annotation (via text-guided patch sampling), reducing reliance on expert annotations, making it suitable for resource-constrained scenarios.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Key parameters (e.g., γ=0.5, p_aug=0.7) lack ablation studies, with no justification for their selection (e.g., dataset-specific dependency).
    2. Using GPT-4 and CONCH encoders may incur high computational costs, with no discussion of lightweight deployment (e.g., smaller LLMs or model compression).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite minor limitations in novelty and experimental details, the paper demonstrates significant innovation in integrating pathology reports with WSIs, self-distillation framework design, and risk-aware mix-up, validated through cross-dataset experiments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I think the authors’ attitude is quite sincere, and the paper has a certain degree of innovation.




Author Feedback

We thank the reviewers for appreciating our work’s novelty and clarity. Their constructive comments significantly help improve the quality of this manuscript.

R1Q1 Justification of hyper-parameters We have conducted detailed analysis experiments on p_aug and γ in Fig. 2 and Fig. 4, respectively, and specified their values according to the optimal results on the validation dataset. Our previous γ-effect studies, consistent with the ‘Effect of Threshold’ discussion in Sec. 3.4, show performance drops at γ=0.3/0.7, yet the model still outperforms SOTA baselines.

R1Q2 Computational costs GPT-4: While we acknowledge the additional computational costs (e.g., time and API fees), our method significantly benefits from the powerful capabilities of GPT-4 in report preprocessing as evidenced by Tab. 2. We will investigate cost-efficient alternatives (e.g., smaller LLMs) in future work. CONCH: The overhead of feature extraction using CONCH is affordable, as a single RTX3090 is sufficient for deployment.

R2Q1 Justification of evaluation For the issue of external validation, our previous experiments show that our model (trained only on our CRC dataset, Tab. 1) consistently outperforms SOTA baselines on the TCGA-CRC dataset mentioned by the reviewer under the same setup. The limited scale of TCGA-BRCA (n=331) is because we included only patients with at least a 5-year follow-up. This criterion ensures both sufficient statistical power and reliable data support for the study conclusions, since the 5-year follow-up period is a widely accepted standard in clinical studies [1]. Additionally, all baselines used consistent dataset settings to ensure fair comparison. [1] Clark, Taane G., et al. “Survival analysis part I: basic concepts and first analyses.” British Journal of Cancer 89.2 (2003): 232–238.

R2Q2 Issue on Reproducibility To ensure reproducibility, we will open-source both the code and the extracted features of the non-public CRC dataset.

R2Q3 Choice of Text encoder As shown in the ‘CONCH Text-Encoder’ row of Tab. 2, using CONCH for both modalities underperforms our method. We attribute the CONCH text encoder’s weak performance to its limited understanding of long pathological reports, as it was only pre-trained on short patch-level captions.

R3Q1 Configuration of teacher model and distill loss We carefully compared different implementations of the teacher module (i.e., non-frozen teachers and full-sample distillation) during development, and empirical findings suggested against adopting them. We infer that a non-frozen teacher may overfit the training data, and that the teacher’s predictions on unseen augmented data may introduce undesired noise.

R3Q2 Generalization of the selected textual tokens As outlined in Sec. 2.1, we enhance textual token generalizability through carefully designed prompts, ensuring the reports generated by GPT-4 emphasize WSI microscopic visual characteristics while excluding information unrelated to WSIs. This strategy helps improve the consistency of resulting reports across different institutions. We will publicly share transformed reports of the two datasets from different institutions (i.e., CRC and TCGA-BRCA).
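
The paper’s actual prompt is not reproduced in this rebuttal; as a purely hypothetical illustration of the kind of WSI-focused task prompt described here, a sketch might look like the following (the wording and the helper name build_llm_input are assumptions, not the authors’ prompt).

```python
# Hypothetical illustration of the kind of WSI-focused task prompt described above.
# The wording below is an assumption; the paper's actual prompt may differ.
TASK_PROMPT = (
    "You are a pathology assistant. From the report below, extract only the "
    "microscopic findings visible on the H&E whole slide image, such as tumor "
    "cell morphology, differentiation, and invasion pattern. Exclude anything "
    "not observable on the slide, e.g., lymph node counts, staging codes, and "
    "clinical history. Return concise, slide-grounded sentences."
)

def build_llm_input(raw_report: str) -> str:
    """Concatenate the fixed task prompt with one institution's raw report text."""
    return f"{TASK_PROMPT}\n\nReport:\n{raw_report}"

print(build_llm_input(
    "Adenocarcinoma, moderately differentiated. 0/12 lymph nodes involved. pT3N0."
))
```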

R3Q3 Impact of γ Please see R1Q1.

R3Q4 Reliability of predicted sample-level risk scores We agree that accurate sample-level risk scores are crucial. To reduce the impact of inaccuracies in Stage 2, we guide the mixture of textual features only between high- and low-risk samples at a coarse level, rather than always assigning the textual features of the sample with the relatively higher risk score to the mixture at a fine-grained level. This approach lessens reliance on precise predictions and balances leveraging risk scores with minimizing noise. Although median-based dichotomization is a simple method for identifying risk groups, it serves as an initial step toward using mix-up to enhance WSI-based survival analysis, and its effectiveness was empirically verified in our experiments. We will develop more adaptive alternatives to median-based dichotomization in future work.
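
A minimal sketch of the coarse-view pairing described in this response, assuming median-based dichotomization of teacher-predicted risk scores; all names and the toy scores are illustrative, not the authors’ implementation.

```python
# Hypothetical sketch of the coarse-view pairing described above: cases are split
# at the median predicted risk, and a mix-up partner is drawn from the opposite
# group, so only the group label (not the exact score) matters. Names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
risk_scores = rng.uniform(size=10)                # teacher-predicted sample-level risks

median = np.median(risk_scores)
high_idx = np.flatnonzero(risk_scores >= median)  # coarse high-risk group
low_idx = np.flatnonzero(risk_scores < median)    # coarse low-risk group

def sample_mixup_pair():
    """Pick one high-risk and one low-risk case; small errors near the median
    threshold only move borderline cases between groups."""
    return int(rng.choice(high_idx)), int(rng.choice(low_idx))

i, j = sample_mixup_pair()
lam = 0.5
# In Stage 2, the WSI/text features of cases i and j would be blended with weight lam.
print(i, j, round(float(risk_scores[i]), 3), round(float(risk_scores[j]), 3), lam)
```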




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper presents a novel approach to incorporate pathology report data into survival analysis frameworks utilizing Hematoxylin and Eosin (H&E) stained Whole Slide Images (WSI). Furthermore, the authors introduce a risk-aware mix-up strategy aimed at mitigating the challenges associated with limited dataset sizes. The efficacy of the proposed method is assessed through comparative evaluations with state-of-the-art (SOTA) techniques, and systematic ablation studies are conducted to provide insight into the underlying mechanisms. All reviewers are happy to accept this manuscript for publication.


