Abstract

Most recently, molecular pathology has played a crucial role in cancer diagnosis and prognosis assessment. Deep learning-based methods have been proposed for integrating multi-modal genomic and histology data for efficient molecular pathology analysis. However, current multi-modal approaches simply treat each modality equally, ignoring the modality-unique information and the complex correlation across modalities, which hinders the effective multi-modal feature representation for downstream tasks. Besides, considering the intrinsic complexity in the tumour ecosystem, where both tumour cells and the tumour microenvironment (TME) contribute to the cancer status, it is challenging to utilize a single embedding space to model the mixed genomic profiles of the tumour ecosystem. To tackle these challenges, in this paper, we propose a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics data. Specifically, to enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments on two public datasets demonstrate the effectiveness of our proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3080_paper.pdf

SharedIt Link: https://rdcu.be/dY6iC

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72083-3_25

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3080_supp.pdf

Link to the Code Repository

https://github.com/helenypzhang/Subspace-Multimodal-Learning

Link to the Dataset(s)

https://portal.gdc.cancer.gov/projects/TCGA-GBM

https://portal.gdc.cancer.gov/projects/TCGA-LGG

https://www.cancerimagingarchive.net/collection/ivygap/

BibTex

@InProceedings{Zha_Knowledgedriven_MICCAI2024,
        author = { Zhang, Yupei and Wang, Xiaofei and Meng, Fangliangzi and Tang, Jin and Li, Chao},
        title = { { Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal Learning } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {263 -- 273}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a new multi-modal learning framework that extracts tumor and tumor microenvironment (TME)-related morphological features from whole slide images (WSI) and genetic information from genomics to provide accurate and interpretable automated cancer diagnosis. To capture the tumor- and TME-related morphological features from WSIs effectively, the authors leveraged a cross-modal deformable attention module with a gene-guided consistency strategy. Furthermore, the authors addressed the challenges in obtaining a global optimum in training multi-modal classifiers by proposing a novel approach based on dynamic regulation. The proposed model architecture was assessed across three downstream tasks: Glioma Diagnosis, Glioma Grading, and Survival Analysis, against eight state-of-the-art (SOTA) models on TCGA GBM-LGG and IvyGAP datasets. The proposed model architecture outperformed all eight SOTA models on all the downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This research delves into the interaction between the whole slide images and the genomic markers from the perspective of tumor- and TME-related genes. This is a promising direction of research, and the authors back this claim with research evidence.
    • The idea of encouraging gene-knowledge penetration towards the WSIs features is promising. Furthermore, ablation study confirmed that this fusion enhances the downstream tasks performance.
    • The authors also discussed a critical aspect of training a classifier based on multi-modal features, which is obtaining a global optimum.
    • The authors proposed a Confidence-guided Gradient Coordination scheme, through which the gradients from different subspace features are modulated to avoid conflicts dynamically.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors mentioned that the proposed model is biologically interpretable, but they haven’t discussed model interpretation.
    • The manuscript does not specify how many times the experiments were repeated; it appears that each experiment was run only once. A better result can arise by chance from a particular dataset split, so results from multiple runs should be reported.
    • It is mentioned that the top 30% of shared gene signatures across the TCGA GBM-LGG and IvyGAP were selected, but the justification for selecting only the top 30% is missing.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The idea of conducting multi-modal data analysis by fusing tumor- and TME-related genes with histology features is interesting and a potentially encouraging path to improve patient care in cancer patients.
    • However, the authors mentioned that the proposed model is biologically interpretable, but they haven’t discussed model interpretation. Model interpretation is a key aspect (especially in healthcare) which offers a view on what factors drive a model to give a certain prediction and enables healthcare professionals to take appropriate courses of treatment.
    • The experimental results should be reported with multiple independent replicates.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The main idea is innovative.
    • However, the lack of biological interpretation/discussion and the reporting of performance without multiple experiments raise major concerns.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    Although some minor questions were answered well, the failure to report multiple experimental results (mean ± std or confidence interval) is critical. The rebuttal did not even state clearly how many times the experiments were repeated. Moreover, the lack of biological interpretation remains a concern.



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors have proposed a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics data on various clinical prediction tasks. The experimental results show its strong predictive ability in comparison with SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and the experimental results are convincing in comparison with SOTA methods.
    2. The deformable attention module is effective in integrating the multi-modal data.
    3. The CG-Coord scheme is designed to obtain the global training optimum via dynamic gradient regulation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The Cross-modal Deformable Attention Module is widely applied in the multi-modal learning. The author should state the difference between them.
      [a] Cross-modal learning with 3d deformable attention for action recognition. CVPR 2023. [b] Three-Dimensional Medical Image Fusion with Deformable Cross-Attention. In International Conference on Neural Information Processing. MICCAI 2023.
    2. Can you comment on the convergence properties of the optimization strategy?
    3. Accuracies need to have confidence intervals, e.g. for Tables 1 and 2. Without that, saying our method is much better is meaningless.
    4. The author should report the results of ablation studies on the survival prediction task.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The author should state the novelty of the proposed Cross-modal Deformable Attention Module in comparison with the existing studies.
    2. Cross-validation should be applied to verify the effectiveness of the proposed method.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this paper presents an interesting study for the integrative analysis of imaging and genomics data on various clinical prediction tasks. The paper is also written in clear English and is easy for audiences to follow.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a method for predicting multiple outcomes by combining histology and genomics data for glioma datasets. The key contributions are an explicitly separate modelling of tumour and tumour micro-environment genes, along with novel fusion, cross-modal attention mechanisms and a gradient adjustment scheme to ensure balanced learning. Results show a clear improvement compared to the state-of-the-art baseline methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and easy to follow. The results are impressive and conducted in a relatively well-established framework for this type of problem.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The test set is very small for the type of analysis being performed, particularly the survival analysis problem. It would be helpful to have confidence intervals here to give a stronger impression of the significance of the results.

    Since there are so many comparable datasets available in TCGA, why was the method not evaluated on these also? There is no obvious reason why this should only apply to gliomas.

    Further discussion on how compatible the two datasets are would be helpful. If they are collected from different centres then there are likely to be significant differences in the underlying data distribution (i.e. staining profiles). If this is routinely done by other works, with these problems already addressed, please make this clearer and cite the key work.

    A key component of the work is that it separates out the tumour related genes from the tumour micro-environment related genes; however, there is no discussion or evaluation of this in the results. In general, a lot more justification is required for this choice, why is it seen as beneficial to explicitly model this in the architecture?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Code is provided, datasets are publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    No further comments.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Whilst the paper and methodology are good and well presented, the evaluation is lacking, which does not give the reader confidence that the method represents a genuine improvement.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all reviewers for their constructive comments and for recognizing the strengths of our work.

R1Q3 & R3Q1 & R4Q2: Confidence intervals and repeat times. We randomly split the dataset and repeated all experiments multiple times. Due to space constraints, we only reported the mean values. We will clarify the description.
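
As an illustration of the reporting the reviewers ask for, the following is a minimal sketch (not the authors' evaluation code) of summarising repeated random-split results as mean ± standard deviation together with a percentile-bootstrap 95% confidence interval. The accuracy values, the function names `summarize_runs` and `bootstrap_ci`, and the choice of 2,000 resamples are assumptions made purely for the example.

```python
import random
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical accuracies from five random splits (illustrative values only).
accuracies = [0.81, 0.79, 0.83, 0.80, 0.82]
mean, std = summarize_runs(accuracies)
lo, hi = bootstrap_ci(accuracies)
print(f"accuracy = {mean:.3f} +/- {std:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

With only a handful of runs the bootstrap interval is coarse; it is shown here because it needs no distributional assumption, and a t-based interval would be a reasonable alternative.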

R1Q1: Difference from other deformable cross-attention methods. We thank the reviewer for pointing out recent approaches, which are either uni-modal deformation (Sangwon et al., CVPR 2023) or resampling of raw data of each modality with concatenated multi-modal information (Liu et al., ICONIP 2023), or computing attention between tokens within pre-set windows (Chen et al., International Workshop on MLMI 2023). While effective, these may be less applicable to our study given the significant heterogeneity between genes and WSIs. In contrast, our method integrates histological and genetic features to deform the attention on WSIs. To the best of our knowledge, this is the first work of cross-modal deformable attention in multi-modal (Gene and WSI) cancer analysis.

R1Q2: Convergence properties of the optimization strategy. Our CG-Coord module adjusts the conflicting and less-confident gradient, dynamically avoiding subspace conflicts during training, which accelerates convergence. Table 2 shows this module improves performance within a fixed number of epochs, demonstrating its effective convergence ability.
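
To make the idea of adjusting a conflicting, less-confident gradient concrete, here is a minimal sketch under stated assumptions. It is not the authors' CG-Coord implementation: it uses a PCGrad-style projection as a stand-in, treats each subspace's confidence as a scalar supplied by the caller, and the names `coordinate_gradients`, `g_tumor`, `g_tme`, `conf_tumor` and `conf_tme` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def coordinate_gradients(g_tumor, g_tme, conf_tumor, conf_tme):
    """Resolve a conflict between two subspace gradients.

    Assumption: two gradients conflict when their inner product is negative.
    In that case the gradient of the less-confident subspace is projected
    onto the plane orthogonal to the more-confident one, so it no longer
    opposes it. Non-conflicting gradients are returned unchanged.
    """
    if dot(g_tumor, g_tme) >= 0:
        return list(g_tumor), list(g_tme)

    tumor_is_kept = conf_tumor >= conf_tme
    keep, fix = (g_tumor, g_tme) if tumor_is_kept else (g_tme, g_tumor)

    # A negative inner product implies `keep` is non-zero, so this is safe.
    scale = dot(fix, keep) / dot(keep, keep)
    fixed = [f - scale * k for f, k in zip(fix, keep)]

    if tumor_is_kept:
        return list(keep), fixed
    return fixed, list(keep)


# Toy example: the TME gradient opposes the more confident tumour gradient.
new_tumor, new_tme = coordinate_gradients([1.0, 0.0], [-1.0, 1.0], 0.9, 0.6)
print(new_tumor, new_tme)  # [1.0, 0.0] [0.0, 1.0]
```

After the projection the two gradients are orthogonal, so an update along their sum no longer moves against the more confident subspace.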

R1Q4: Ablation study of survival prediction. We appreciate the reviewer’s insightful suggestion. Due to the page limit, we did not present all ablation results, which will be added in the final version.

R3Q1: Dataset size. We leverage a meta-dataset with over 2,000 high-quality slides from IvyGAP and TCGA, the largest open-source WSI dataset for glioma as far as we know. Consistent results on multiple tasks across our multi-cohort dataset could validate our model’s generalization ability to some extent.

R3Q2: Disease beyond glioma. Glioma, as a representative tumor with remarkable heterogeneity, is an excellent testbed to evaluate our method. We thank the reviewer for this constructive comment; future studies warrant validation on pan-cancer research.

R3Q3: Meta dataset variations. TCGA GBMLGG itself is a multi-cohort dataset, containing WSIs from multiple institutions. For our meta dataset, we performed stain normalization to harmonize the data, as per (Chen et al., IEEE TMI 2020). Details will be added to the final version.

R3Q4: Separating tumor- and TME-related genes. Tumor and TME provide crucial information for cancer analysis while presenting significant variances in genes and WSIs. Inspired by this, we separate genes to deform the attention on WSIs regarding each subspace. Comparisons with other methods without gene separation in Table 1 prove our method’s effectiveness. We will improve the discussion to elaborate on this.

R4Q1: Model Interpretation. Mounting research (Rebeca et al., Cancer Letters 2019; Karin E. et al., Cancer Cell 2023) has revealed the crucial importance of characterizing tumor and TME features for a deeper cancer understanding. We built our method on this biological prior knowledge, learning clinically relevant multimodal features in each subspace. Due to the page limit, discussions on the motivation and outcomes are brief. We thank the reviewer for this constructive comment and will add more discussions.

R4Q3: Top 30% genes. Highly Variable Genes indicate high signal-to-noise ratio information within an organism, allowing for more biological information captured with smaller dimensions. The 30% of genes is an empirical choice according to previous studies (Akhilesh et al., Nature Communications 2020). We appreciate the reviewer’s suggestion and will add a discussion in the final version.
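
The selection step described here can be sketched as follows, under loudly stated assumptions: variability is measured by the plain population variance of expression values pooled over both cohorts, which may differ from the criterion actually used by the authors or by the cited study. The function name `select_top_variable_genes` and the toy expression values are hypothetical.

```python
import statistics


def select_top_variable_genes(cohort_a, cohort_b, fraction=0.3):
    """Keep the top `fraction` of genes shared by both cohorts.

    Each cohort maps a gene name to its expression values across samples.
    Assumption: genes are ranked by the variance of the pooled values.
    """
    shared = sorted(set(cohort_a) & set(cohort_b))  # genes present in both
    if not shared:
        return []

    def pooled_variance(gene):
        return statistics.pvariance(cohort_a[gene] + cohort_b[gene])

    ranked = sorted(shared, key=pooled_variance, reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


# Toy data: ten shared genes whose spread grows with the gene index,
# plus one gene that exists in only one cohort and is therefore ignored.
cohort_a = {f"G{i}": [0.0, float(i), 2.0 * i] for i in range(10)}
cohort_b = {f"G{i}": [0.0, float(i), 2.0 * i] for i in range(10)}
cohort_b["ONLY_B"] = [0.0, 100.0, 200.0]

selected = select_top_variable_genes(cohort_a, cohort_b)
print(selected)  # ['G9', 'G8', 'G7']
```

Exposing `fraction` as a parameter makes it straightforward to check how sensitive downstream results are to the 30% threshold that Reviewer #1 questions.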




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    NA

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NA



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Authors have proposed a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics data. Reviewer concerns included differences relative to previous work, convergence, sample size, confidence intervals, additional ablation studies. Rebuttal clarifies and addresses these concerns quite extensively. Paper worthy of acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Authors have proposed a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics data. Reviewer concerns included differences relative to previous work, convergence, sample size, confidence intervals, additional ablation studies. Rebuttal clarifies and addresses these concerns quite extensively. Paper worthy of acceptance.


