Abstract

To improve the prediction of cancer survival using whole-slide images and transcriptomics data, it is crucial to capture both modality-shared and modality-specific information. However, multimodal frameworks often entangle these representations, limiting interpretability and potentially suppressing discriminative features. To address this, we propose Disentangled and Interpretable Multimodal Attention Fusion (DIMAF), a multimodal framework that separates the intra- and inter-modal interactions within an attention-based fusion mechanism to learn distinct modality-specific and modality-shared representations. We introduce a loss based on Distance Correlation to promote disentanglement between these representations and integrate Shapley additive explanations to assess their relative contributions to survival prediction. We evaluate DIMAF on four public cancer survival datasets, achieving a relative average improvement of 1.85% in performance and 23.7% in disentanglement compared to current state-of-the-art multimodal models. Beyond improved performance, our interpretable framework enables a deeper exploration of the underlying interactions between and within modalities in cancer biology. Code and checkpoints are publicly available at: https://github.com/Trustworthy-AI-UU-NKI/DIMAF

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4457_paper.pdf

SharedIt Link: https://rdcu.be/eHxb0

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05185-1_12

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Trustworthy-AI-UU-NKI/DIMAF

Link to the Dataset(s)

https://portal.gdc.cancer.gov

BibTex

@InProceedings{EijAni_Disentangled_MICCAI2025,
        author = { Eijpe, Aniek AND Lakbir, Soufyan AND Erdal Cesur, Melis AND Oliveira, Sara P. AND Abeln, Sanne AND Silva, Wilson},
        title = { { Disentangled and Interpretable Multimodal Attention Fusion for Cancer Survival Prediction } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        page = {117 -- 127}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper introduces DIMAF (Disentangled and Interpretable Multimodal Attention Fusion), a novel framework for cancer survival prediction using whole-slide images (WSIs) and transcriptomics data. DIMAF innovatively separates modality-specific and modality-shared representations through a disentangled attention mechanism composed of self-attention and cross-attention layers. To enforce and quantify this disentanglement, it employs a Distance Correlation-based loss, and uses SHapley Additive Explanations (SHAP) for interpretability. Evaluated on four TCGA cancer datasets, DIMAF outperforms state-of-the-art multimodal models by achieving a relative average improvement of 1.85% in survival prediction and 23.7% in disentanglement. Importantly, it provides biologically interpretable insights into the distinct and shared contributions of each data modality.
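The separation of self-attention and cross-attention described in this summary can be illustrated with a minimal sketch. It assumes that the four representations come from two self-attention passes (one per modality) and two cross-attention passes (one per direction). The function names, dictionary keys, and toy tokens are hypothetical, learned projections are omitted, and this is not the authors' implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys_values):
    # Scaled dot-product attention without learned projections:
    # every query token becomes a weighted average of the key/value tokens.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out

def disentangled_fusion(wsi_tokens, rna_tokens):
    # Hypothetical layout: intra-modal (self) and inter-modal (cross)
    # interactions are kept in four separate outputs, not one joint matrix.
    return {
        "wsi_specific": attention(wsi_tokens, wsi_tokens),  # self-attention
        "rna_specific": attention(rna_tokens, rna_tokens),  # self-attention
        "wsi_shared": attention(wsi_tokens, rna_tokens),    # cross-attention
        "rna_shared": attention(rna_tokens, wsi_tokens),    # cross-attention
    }

# Toy inputs: 3 WSI prototype tokens and 2 pathway tokens, both of width 2.
wsi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rna = [[0.5, 0.5], [1.0, -1.0]]
reps = disentangled_fusion(wsi, rna)
```

Keeping the four outputs separate is what makes it possible to attach a disentanglement penalty between them and to attribute predictions to each one individually.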
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Interpretability via SHAP and Prototype Analysis: The paper employs SHAP values to quantify feature contributions and introduces morphological prototypes in WSIs.
- Empirical Performance on Diverse Cancer Types: DIMAF outperforms baselines like MMP and PIBD across four distinct TCGA cancer datasets. The consistent gains in both prediction (c-index) and disentanglement (DC) suggest robustness and adaptability of the method across cancer types with varied modalities and data complexities.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Methodological Novelty is Incremental: While the disentangled attention fusion is well-motivated, the core architecture is an assembly of known components: cross-attention layers, self-attention layers, SHAP, and DC. SurvPath (CVPR 2024) and MMP (ICML 2024) already integrate WSIs and transcriptomics using transformer-based fusions and interpretable tokenizations. PIBD (ICLR 2024) already attempts disentanglement in cancer multimodal learning.
- Limited Scope in Generalizability of Interpretability: While SHAP and expert annotations provide interpretability for WSIs, there is less interpretability or qualitative validation for transcriptomics features. Since transcriptomics are more abstract than WSIs, some visualization or clustering of pathway activations could have strengthened biological insights.
- Dataset Selection Omits Key Comparability with Prior Works: The chosen TCGA datasets (BRCA, BLCA, LUAD, KIRC) differ from those used in SurvPath (BRCA, BLCA, COADREAD, HNSC, STAD), which makes direct performance comparison less meaningful.
- No Discussion on Model Scalability: The use of multiple attention blocks and GMM summarization could impose computational overhead, especially as the number of pathways or WSI patches increases.
- Statistical Significance Missing in Performance Results: Although improvements in c-index and DC are reported, no statistical tests (e.g., paired t-tests or Wilcoxon) are presented. It weakens the claim of performance superiority over baselines like MMP and PIBD when no confidence intervals or p-values are provided.
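The kind of paired comparison requested in the last point can be sketched as follows. Rather than the paired t-test or Wilcoxon test named above, this example uses an exact sign-flip permutation test, which needs only the standard library. The per-fold c-index values are placeholders for illustration and are not results from the paper, and the five-fold setup is an assumption.

```python
from itertools import product

def paired_sign_flip_test(scores_a, scores_b):
    # Two-sided exact permutation test on paired differences.
    # Under the null hypothesis each per-fold difference is equally
    # likely to have either sign, so all sign patterns are enumerated.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for float round-off
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Placeholder per-fold c-indices for two models (not from the paper).
model_a = [0.70, 0.72, 0.68, 0.75, 0.71]
model_b = [0.69, 0.70, 0.69, 0.72, 0.70]
p_value = paired_sign_flip_test(model_a, model_b)
```

With five folds there are only 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625. This shows how hard it is to demonstrate significance from a handful of cross-validation folds.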
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the paper proposes a well-structured and interpretable fusion framework for multimodal cancer survival prediction, its methodological novelty is incremental over prior work like SurvPath and PIBD. Additionally, the lack of key datasets used in comparable baselines limits the generalizability and fairness of its empirical evaluation.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This paper presents DIMAF, a novel multimodal fusion framework for cancer survival prediction that explicitly disentangles modality-specific and modality-shared representations using attention-based fusion. The authors introduce a distance correlation-based loss to promote disentanglement and use SHAP values to interpret the importance of each component. The model is evaluated on four TCGA datasets and shows superior performance and interpretability over state-of-the-art baselines.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Measuring and enforcing disentanglement between latent representations in multimodal survival prediction is a meaningful yet underexplored direction. The paper clearly motivates the need to separate modality-shared and modality-specific features for both performance and interpretability.
2. Employing distance correlation (DC) as a differentiable loss to promote disentanglement is well-founded. It effectively captures non-linear dependencies and is more general than orthogonality or contrastive constraints used in prior work.
3. The model achieves consistent improvements in c-index across four datasets and demonstrates significantly better disentanglement compared to baselines.
4. The integration of SHAP analysis to assess the relative contributions of each representation type is a practical and insightful addition that goes beyond standard performance metrics.
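The claim in point 2 that distance correlation captures non-linear dependencies can be illustrated with a small sketch. It computes the sample distance correlation of Székely et al. for scalar samples and compares it with Pearson correlation on a purely quadratic relationship. The function names are hypothetical. A training loss would instead operate on batches of representation vectors, using Euclidean distances and a differentiable framework.

```python
import math

def _centered_distances(x):
    # Pairwise absolute distances, double-centered (row, column, grand mean).
    n = len(x)
    d = [[abs(x[i] - x[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]   # equals the column means (d is symmetric)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def distance_correlation(x, y):
    # Sample distance correlation in [0, 1] for two scalar samples.
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def pearson(x, y):
    # Ordinary linear correlation, for comparison.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(vx * vy)

# y depends on x deterministically, but not linearly.
x = [(i - 10) / 5 for i in range(21)]   # symmetric grid from -2 to 2
y = [v * v for v in x]
linear = pearson(x, y)                  # essentially zero
nonlinear = distance_correlation(x, y)  # clearly positive
```

Because Pearson correlation is essentially zero here while distance correlation is not, a penalty based only on linear decorrelation would treat these two variables as independent, whereas a distance-correlation penalty would not.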
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

SHAP principle insufficiently explained: While the paper leverages SHAP to analyze the contribution of modality-specific and modality-shared representations to survival prediction—thereby enhancing model interpretability—it does not provide a sufficient explanation of the underlying principles, assumptions, or applicability of the SHAP method itself. Given that many readers may not be familiar with SHAP, it would be beneficial to include a brief description to make the interpretability analysis more accessible.
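As a brief illustration of the principle behind SHAP, the sketch below computes exact Shapley values by enumerating all coalitions of four inputs. The four labels and the toy value function are hypothetical and stand in for a model's expected output when only a subset of representations is available. The shap library uses approximations rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    # Exact Shapley values: each player's marginal contribution,
    # averaged over all orders in which players can join a coalition.
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical additive contributions of four representations.
BASE = {"wsi_specific": 0.10, "rna_specific": 0.30,
        "wsi_shared": 0.05, "rna_shared": 0.05}

def toy_value(coalition):
    # Toy "model output" for a coalition: additive terms plus an
    # interaction bonus when both modality-specific parts are present.
    v = sum(BASE[p] for p in coalition)
    if {"wsi_specific", "rna_specific"} <= coalition:
        v += 0.20
    return v

players = list(BASE)
phi = shapley_values(players, toy_value)
```

The attributions sum to the difference between the value of the full coalition and the empty one (the efficiency property), and the interaction bonus is split equally between the two representations that create it.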
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well-motivated, technically sound, and thoughtfully executed. It addresses an important gap in multimodal survival analysis by explicitly modeling and quantifying disentanglement. The results are strong, and the use of SHAP and pathologist assessment adds a valuable interpretability dimension.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

After reading the authors’ rebuttal to the other reviewers’ comments, I decided to maintain my original score.

Review #3

Please describe the contribution of the paper

The authors propose a novel framework for multimodal survival prediction from whole-slide images (WSIs) and transcriptomics data. Its key component is explicitly modeling the shared and exclusive information from the modalities, which is enforced through (i) the fusion model architecture in combination with (ii) a disentanglement loss based on distance correlations between internal representations. The framework is evaluated on four TCGA datasets, where it achieves the highest mean C-index compared to 3 state-of-the-art baseline methods. The authors further show that the disentanglement loss was effective by comparing the disentanglement measures to a model trained without it and one separate baseline method also designed for disentanglement. Finally, they describe how to measure modality-specific and modality-shared contributions from the predictions with SHAP, and showcase how to derive insights from these attributions.
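For readers unfamiliar with the C-index mentioned above, the following sketch shows Harrell's concordance index for right-censored data. The function name and the toy data are hypothetical and pairs with tied times are simply skipped. The paper's exact evaluation code may differ.

```python
def concordance_index(times, events, risks):
    # Harrell's C-index: the fraction of comparable pairs in which the
    # patient with the shorter observed survival has the higher risk.
    # A pair (i, j) is comparable when i had an event before time j.
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparison
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: follow-up times, event indicators (1 = event, 0 = censored),
# and predicted risk scores.
times = [5, 10, 12, 3]
events = [1, 0, 1, 1]
risks = [0.8, 0.3, 0.2, 0.9]
c_perfect = concordance_index(times, events, risks)
c_reversed = concordance_index(times, events, [-r for r in risks])
c_constant = concordance_index(times, events, [0.0] * 4)
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to uninformative predictions, and 0.0 means every pair is ranked the wrong way round.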
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well written, clearly motivated, and easy to follow. A better utilization of modality-specific information together with the interpretability of modality contributions could be highly impactful in various multimodal fusion tasks, especially when modalities are more orthogonal. The provided formalism is helpful and easy to understand. The figure gives a good overview of the method. All relevant methodical and experimental details are provided.
- The framework has a clean, convincing set up. Even though many of its components are not novel but adopted from previously tried-and-tested approaches, it neatly combines them to achieve state-of-the-art performance with additional interpretability. It does not introduce any unneeded or overly engineered components with questionable purpose or effect; every part is justified.
- The concept for interpretability is convincing and may yield interesting insights into the model predictions.
- With four evaluated datasets, the experimental setup is sufficiently extensive. The chosen baselines are strong, meaning that the performance requirements for the new framework were high. The additional evaluation of disentanglement adds a relevant angle to the evaluation beyond just downstream performance.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Significance of performance improvements: The reported performance improvements in DSS prediction are relatively small, while the standard deviations are high. Therefore, the experiments may not prove that the framework produces significantly enhanced performance compared to the baselines. Nonetheless, the experiments clearly show that the framework is competitive in performance while providing better disentanglement.
- Results for KIRC: Table 2 indicates that for KIRC, the DIMAF disentanglement is comparably low, and overall even lower than for the PIBD baseline. At the same time, its mean performance is considerably higher than for all baselines. Do the authors have any intuition or explanation why, apparently, the disentanglement loss had a limited effect here, and where the DSS performance gains over the baselines came from in this case?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper introduces a novel approach for a relevant problem integrated with many state-of-the-art ideas and cleanly shows its effectiveness from multiple angles. At the same time, I could not spot any major weaknesses. Therefore, I believe that the paper will be a strong contribution to the conference.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The author response has not changed my perspective on the manuscript, and the strengths and weaknesses I mentioned persist. Therefore, my final recommendation is still accept.

Author Feedback

We sincerely thank the reviewers for their thoughtful feedback and for the time and effort they dedicated to our submission. Below, we address the concerns raised by the reviewers and provide clarifications.

Incremental methodological novelty (R1): We thank the reviewer for acknowledging that our method is well-motivated. To address the concern about the methodological novelty, we acknowledge that our method builds on established literature. However, to our knowledge, this is the first work to propose an attention-based architecture specifically designed to support easy integration of disentangled representation learning losses, instantiated here with DC. Unlike prior works that rely on a single attention matrix, we introduce four distinct attention mechanisms to learn modality-shared and modality-specific representations separately, enabling disentanglement and enhancing interpretability. Moreover, as also noted by reviewer 2, applying DC in this context is novel, offering a better alternative to orthogonal or contrastive losses, or to the mutual information approximations that require additional networks, as used in PIBD. Additionally, while prior works often claim disentanglement based on architecture or loss design, they typically do not measure it. In contrast, we are the first to explicitly quantify and report disentanglement in this context. Our novelty therefore lies in restructuring transformer-based fusion to enable disentangled representations, employing DC to promote and quantify the disentanglement, and incorporating SHAP for additional explainability. Taken together, these contributions create a disentangled framework that achieves state-of-the-art performance with additional interpretability, as also stated by reviewer 3.

Limited scope in generalizability of interpretability (R1): To clarify the interpretability of the transcriptomic features, we process the gene expression data to be grouped into biological pathways, which then serve as the transcriptomic features. These pathways correspond to well-established, expert-annotated biological processes and functions related to cancer and are thereby semantically meaningful and inherently interpretable by domain experts. For example, one of the pathway features is annotated as the E2F targets, a well-known regulator of cell cycle progression. We agree that visualizing pathway activations can further enhance biological insight by highlighting which pathways are up- or downregulated. However, due to space limitations, we were unable to include such visualizations.

Lack of key datasets (R1): Thank you for raising the point on the dataset selection. Our dataset selection balanced space constraints with the goal of a fair and robust evaluation. We prioritized the MMP datasets, as MMP previously outperformed SurvPath. We did not adopt PIBD’s dataset selection, as their results are based on validation rather than test sets, limiting direct comparability. From the six datasets used by MMP, we selected the four with the largest sample sizes. Notably, two of these overlap with datasets used in SurvPath and PIBD, enabling direct comparison.

Model scalability (R1): While increasing the number of pathways or prototypes can indeed add computational cost, our design reflects a deliberate trade-off: by adding complexity to the fusion mechanism, we eliminate the need for post-fusion FFNs, as used in MMP. As a result, our model achieves computational efficiency comparable to MMP and improved efficiency compared to SurvPath and PIBD.

Statistical significance of the results (R1, R3): In this work, we followed the reporting conventions of our baselines. Nonetheless, we agree that including statistical tests would strengthen the performance claims and will include them in future work.

Explanation of SHAP (R2): We thank the reviewer for the suggestion and will include a brief explanation of SHAP to improve clarity.

Results for KIRC (R3): This is indeed an interesting point and we will investigate this in future work.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The reviews highlighted strengths in the motivation, performance, and interpretability of this work. Most of the reviewers’ concerns were adequately addressed in the rebuttal. I recommend acceptance.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

This work presents enough technical contributions and meets the bar of MICCAI.


Disentangled and Interpretable Multimodal Attention Fusion for Cancer Survival Prediction
