Abstract

Real-life medical data is often multimodal and incomplete, fueling the growing need for advanced deep learning models capable of integrating it efficiently. The use of diverse modalities, including histopathology slides, MRI, and genetic data, offers unprecedented opportunities to improve prognosis prediction and to unveil new treatment pathways. Contrastive learning, widely used for deriving representations from paired data in multimodal tasks, assumes that different views contain the same task-relevant information and leverages only shared information. This assumption becomes restrictive when handling medical data since each modality also harbors specific knowledge relevant to downstream tasks. We introduce DRIM, a new multimodal method for capturing these shared and unique representations, despite data sparsity. More specifically, given a set of modalities, we aim to encode a representation for each one that can be divided into two components: one encapsulating patient-related information common across modalities and the other encapsulating modality-specific details. This is achieved by increasing the shared information among different patient modalities while minimizing the overlap between shared and unique components within each modality. Our method outperforms state-of-the-art algorithms on glioma patient survival prediction tasks, while being robust to missing modalities. To promote reproducibility, the code is made publicly available at https://github.com/Lucas-rbnt/DRIM.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1276_paper.pdf

SharedIt Link: https://rdcu.be/dV1Vp

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72384-1_16

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1276_supp.pdf

Link to the Code Repository

https://github.com/Lucas-rbnt/DRIM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Rob_DRIM_MICCAI2024,
        author = { Robinet, Lucas and Berjaoui, Ahmad and Kheil, Ziad and Cohen-Jonathan Moyal, Elizabeth},
        title = { { DRIM: Learning Disentangled Representations from Incomplete Multimodal Healthcare Data } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {163 -- 173}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work introduces a new method for fusing four modalities – DNA Methylation, RNA-Seq, MRI, and WSI – in the context of glioma survival prediction. The model learns a joint latent space to handle missing information effectively. It employs two encoders for each modality: a shared encoder to capture patient-specific information common across modalities, and a unique encoder to extract modality-specific details. This approach distinguishes itself by recognizing that each modality offers unique insights into survival prediction. The authors introduce a novel loss function combining task-specific, shared, and unique loss terms. The task-specific term focuses on survival prediction, the shared term uses supervised contrastive learning, and the unique term minimizes mutual information between modality embeddings for independence. This model surpasses other fusion strategies and performs well even when dealing with missing data.
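
    The three-term objective summarized above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weights `w_shared` and `w_unique`, the temperature, and the rule that embeddings of the same patient act as positives are all hypothetical choices. The shared term is written as a generic InfoNCE-style contrastive loss, and the task and unique terms are passed in as precomputed numbers because the review does not specify how they are estimated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shared_contrastive_loss(embeddings, patient_ids, temperature=0.1):
    """InfoNCE-style loss over shared embeddings.

    Assumption: embeddings that belong to the same patient (one per
    modality) are positives; all other embeddings are negatives.
    """
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = [cosine(embeddings[i], embeddings[j]) / temperature for j in others]
        # Numerically stable log-sum-exp over all non-anchor embeddings.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        for j, logit in zip(others, logits):
            if patient_ids[j] == patient_ids[i]:
                total += log_denom - logit
                pairs += 1
    return total / pairs if pairs else 0.0


def total_loss(task, shared, unique, w_shared=1.0, w_unique=1.0):
    """Weighted sum of the task, shared, and unique terms (weights are assumptions)."""
    return task + w_shared * shared + w_unique * unique


# Hypothetical shared embeddings: two patients, two modalities each.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = shared_contrastive_loss(embeddings, ["A", "A", "B", "B"])
misaligned = shared_contrastive_loss(embeddings, ["A", "B", "A", "B"])
```

    In this toy setting the loss is small when embeddings of the same patient point in similar directions (`aligned`) and large when the patient labels are shuffled (`misaligned`), which is the behaviour a shared term of this kind is meant to reward.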

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Authors introduce three different loss terms that take into account how the embeddings are optimised and the final performance of the model. This makes it possible to obtain a good modality embedding that captures important information and that can be further used for the task.
    • The model can be easily extended to any given task with minimal changes (by changing the L_task term of the loss function).
    • They compare their model with 12 different models, showing that in the majority of the cases their model outperformed the other proposals.
    • Their model can accurately stratify high and low risk patients, obtaining a lower logrank p-value than other models presented in literature.
    • Authors state in the manuscript that they will release their code and model, which is highly appreciated by the scientific community.
    • They present results with missing information, where they set different modalities to zero. Their model obtains similar results across the different tests (Table 2), showcasing the robustness of their model to missing information.
    • Their methodology also scales better, since incorporating new modalities only increases the encoder part of the architecture, not the fusion module. This is not the case for other methodologies.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Authors state in their evaluation section that they first divided the data into training and test sets, and then performed a five-fold cross-validation on the training set. Then, they present the performance metrics on the test set from the models trained on 4/5 of the training set. This is not a proper way of performing the cross-validation. Authors should have used the remaining split (1/5) to test the model trained on the 4 folds, and used that to select their hyperparameters. Then, they could use the remaining 20% as a hold-out test set. That would have been a better way to evaluate all models.
    • The idea of using two encoders to obtain different representations of the data, one modality-based and one patient-based, has already been explored in previous works using distance metrics between embeddings (Cheerla et al. 2019, Bioinformatics; Qui et al. 2024, Physics in Medicine & Biology; Chen et al. 2020, IEEE Transactions on Medical Imaging).
    • Authors compare with multiple models in the same task. However, it is not clear whether they have trained the models themselves, are using the pretrained models, or are just applying each fusion methodology to the embeddings that their own encoders output.
    • No statistical evaluation of results: paired tests would give statistical weight to the argument of “superiority” of the proposed method.
    • While their model outperforms other fusion methodologies (even though it is not clear how they are comparing them), the improvement is marginal in some cases. For instance, in terms of C-index, the improvement of their proposed method over a simple max fusion is only 0.008.
    • Regarding the 5-fold cross validation, authors do not state if they are performing a patient-wise stratification. A TCGA patient can have more than one slide, thus, it is crucial that all slides belonging to the same patient are in the same split, be it training, validation, or test.
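
    The evaluation protocol discussed in the points above (a hold-out test set, five-fold cross-validation on the remaining patients, and patient-wise splitting) can be sketched as follows. This is an illustration in plain Python under stated assumptions: the helper names, the 20% test fraction, the seed, the cohort, and the scores are hypothetical and are not taken from the paper.

```python
import random
import statistics


def holdout_and_folds(patient_ids, test_fraction=0.2, k=5, seed=0):
    """Split unique patients into a hold-out test set and k CV folds.

    Splitting on patient IDs (not on slides or scans) keeps every sample
    of a patient inside a single partition.
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test, train = patients[:n_test], patients[n_test:]
    folds = [train[i::k] for i in range(k)]
    return test, folds


def cross_validation_rounds(folds):
    """Yield (training patients, validation patients) for each fold."""
    for i, validation in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, validation


# Hypothetical cohort: 50 patients, some with more than one sample.
samples = [f"patient_{i % 50:02d}" for i in range(80)]
test_patients, folds = holdout_and_folds(samples)
rounds = list(cross_validation_rounds(folds))

# Hypothetical test-set C-indices of the five fold models (not from the paper).
scores = [0.71, 0.73, 0.72, 0.70, 0.74]
summary = (statistics.mean(scores), statistics.stdev(scores))
```

    Under this scheme the validation fold of each round is used for hyperparameter selection, the hold-out patients are never seen during training, and the spread across the five fold models can be reported as mean and standard deviation.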
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The idea of obtaining different embeddings that capture both patient-specific and modality-specific information is really interesting. Furthermore, the use of a transformer-based model that does not require an increased number of parameters when new modalities are added is really interesting, and it looks like it is going to be the way forward. However, the authors need to improve their methodology and use a real test set. It is really difficult to understand how the models were tested by just reading the manuscript, which strongly undermines the results that have been obtained. The comparison to other fusion strategies is also difficult to comprehend with the information given, which greatly diminishes the value of the results presented. If only the fusion strategies were used, and not the models presented in those works, it seems that the other works may have obtained better results than those presented by the authors. This would show that, even though the fusion methodology is indeed better, there are some components of the architecture that need to catch up with the other models presented in the literature. Regarding the encoders selected, at least for digital pathology, there are multiple newer encoders trained with self-supervised learning that can improve the performance of the feature extraction, without requiring the authors to train their own. While it might be excessive for this work, it would be really interesting to explore the latent space created by the architecture, to test whether patients with specific phenotypes or similar characteristics are grouped together. It would also be interesting to test whether a single embedding can be optimized using the three losses, instead of requiring two embeddings and adding complexity to the architecture.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the architecture proposed by the authors is novel, especially the fusion strategy using a transformer-based model and the incorporation of two encoders, the incorporation of different losses to account for patient and modality similarities has already been explored in the literature (Cheerla et al. 2019, Bioinformatics; Qui et al. 2024, Physics in Medicine & Biology; Chen et al. 2020, IEEE Transactions on Medical Imaging). In particular, the improvement they obtained over other “simpler” fusion methods is small, and it adds complexity. The way of comparing their proposed model with others available in the literature is not clear, which greatly diminishes the quality of the results presented. Whether the comparison is made with the raw models or only with the fusion methodologies would substantially change how the results should be evaluated.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have successfully answered my concerns regarding the evaluation process, the novelty and related work, and the comparison clarification. Thus, I now recommend acceptance of the work.



Review #2

  • Please describe the contribution of the paper

    The paper focuses on an important clinical problem, i.e., incomplete multimodal survival prediction. The representations of each modality are decomposed into two parts, i.e., common features and unique features. The main idea is feasible for multimodal learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The proposed shared loss and unique loss are reasonable for decoupling representations.
    2. Applying masked transformers for multimodal fusion also benefits the prediction task.
    3. The performance of the proposed method is competitive.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. Novelty. Applying masked self-attention for incomplete multimodal fusion has been explored in [1, 2]. What’s the main difference between this work and other approaches?
    2. Clarity. For DRIM-U, I do not understand how it works. Is it an unsupervised pretraining stage for encoders?
    3. Experiments. All experiments focus on the various fusion strategies, and the effectiveness of each component is not well explored. Besides, I suggest comparing with more approaches to verify the superiority of the proposed decoupled representation learning, e.g., learning robust joint features [3, 4].

    [1] Ma M, Ren J, Zhao L, et al. Are multimodal transformers robust to missing modality?[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 18177-18186.
    [2] Zhou Q, Zou H, Jiang H, et al. Incomplete Multimodal Learning for Visual Acuity Prediction After Cataract Surgery Using Masked Self-Attention[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023: 735-744.
    [3] Zhao J, Li R, Jin Q. Missing modality imagination network for emotion recognition with uncertain missing modalities[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021: 2608-2618.
    [4] Zhang Y, He N, Yang J, et al. mmformer: Multimodal medical transformer for incomplete multimodal learning of brain tumor segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022: 107-117.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have provided the codes in the supplementary files.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    More details about the dataset should be provided, e.g., the missing rate of multimodal samples. Besides, I am curious about whether a single unified model is utilized to handle various combinations of input modalities, or if multiple models are trained and tested for each distinct scenario. This would affect the flexibility of proposed approach heavily.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My major concerns about the paper are the novelty and the experiments. It seems that the proposed model mainly benefits from the masked self-attention fusion, which has been explored in other methods. More experiments to verify the effectiveness and superiority of the decoupled representation are needed.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The author has addressed my concerns.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a multimodal method, DRIM, for the fusion of MRI, WSI, and genomics data to predict prognosis in patients with glioma, which could learn shared and unique representations among modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The article has a clear structure and is easy to follow. The idea of learning shared and unique representations among modalities is interesting. The proposed model demonstrated competitive results compared to other methods in predicting prognosis in patients with glioma. This method also shows good results when modalities are missing during inference.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This model was evaluated on multimodal data of patients with gliomas. However, because the data include both low-grade gliomas and glioblastomas, the prognosis differs significantly between these two types of patients. For clinical practice, distinguishing between high- and low-risk patients with low-grade gliomas is more meaningful. Alternatively, the author should evaluate the performance of the model on more publicly available datasets, such as TCGA-BRCA.

    Due to page limitations, the author’s description of multimodal fusion is not very clear.

    The author should compare the proposed model with existing multimodal fusion models.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The author has provided the code in the supplementary files.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The author should use statistical tests to evaluate whether there is a significant difference in metrics between the proposed method and other methods. Secondly, the author should evaluate the proposed method on more publicly available datasets.
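
    The kind of paired test requested here can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the function name and the per-fold scores are hypothetical and do not come from the paper, and the sketch stops at the test statistic rather than computing a p-value.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 entries")
    # Differences are taken fold by fold, so both methods must share the folds.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-fold C-indices for two methods (not from the paper).
method_a = [0.72, 0.74, 0.75, 0.73, 0.76]
method_b = [0.71, 0.72, 0.72, 0.71, 0.74]
t_value, dof = paired_t_statistic(method_a, method_b)
```

    The p-value follows from a Student t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` computes the statistic and the p-value in one call if SciPy is available.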

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This model was evaluated on multimodal data of patients with gliomas. However, because the data include both low-grade gliomas and glioblastomas, the prognosis differs significantly between these two types of patients. For clinical practice, distinguishing between high- and low-risk patients with low-grade gliomas is more meaningful. Alternatively, the author should evaluate the performance of the model on more publicly available datasets, such as TCGA-BRCA.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The author has addressed most of my concerns in the rebuttal, but I still recommend using statistical tests to compare models. Secondly, the author should provide detailed descriptions of the network in the supplementary materials. Finally, the issue of distinguishing between low-grade gliomas remains my biggest concern.




Author Feedback

We thank all reviewers for their insightful comments and for underlining that we propose an interesting (R1-4), novel (R1-4) and well-organized (R1-3-4) approach to solve an important clinical problem (R3). Below, we address concerns raised by reviewers.

  • Regarding the evaluation process (R1), the authors acknowledge that the formulation of the methodology may have been misleading and will be clarified in the final version. Indeed, we implemented the method suggested by R1, using cross-validation within the training set to select hyperparameters. Then, we evaluate each of the five trained models on the hold-out test set, aggregating the results as mean±std. As for the patient-wise stratification (R1), each TCGA patient is linked to a single, hand-selected slide with minimal artifacts (pen marks, etc.). Coverage rates and percentages of modality combinations for the dataset (R3) are given in the experiments section and Table 2.

  • Novelty and related work (R1-3): We propose the first method to learn disentangled representations through mutual information and combine them with a two-stage attention-based fusion on incomplete radiology, pathology and genomics data. Our method can be extended to any task and any new modality. All the mentioned references (R1-3) use single-stage fusion and do not use dedicated disentangling criteria. R1.2 (reference 2 of R1) is designed for bimodal interaction without easy modality extension and does not rely on mutual information as advocated by (Bengio et al. 2019, ICLR). R3.3 is designed for 3 temporally aligned modalities and handles missing modalities only at inference. R3.3 and R1.2 aim to reconstruct the missing latent space, a non-trivial task; our approach differs by using only available information. Both R3.2 and R3.4 use only imaging modalities, applying attention to the learned tokens before they are input into specific decoders: the closest comparison possible may be the one with the vanilla MAF in Table 1.

  • Regarding component influence analysis (R3), we detailed an ablation study on the disentanglement term in the appendix.

  • Comparisons clarification (R1): We retrained all comparative methods. As our study is the first to tackle these modalities with incomplete data, no pre-trained models are directly available. Also, some methods do not provide their data splits, or use cross-validation without a hold-out test set, which does not meet our evaluation criteria. We agree with R1 on the use of different backbone encoders. However, since contributions in these methods arise from fusion techniques and auxiliary losses (with simple CNN/MLP backbones), we chose to fix modality encoders for all our experiments. This ensures consistency in parameters and training processes across comparisons, focusing on where the novelty lies.

  • Marginal gain over max fusion (R1): although our method yields a slightly better C-index, the gap is much larger when looking at the model calibration (IBS, INBLL), especially with varied modality combinations (Table 2). Previous methods may deteriorate with added modalities, whereas ours consistently improves. Paired t-tests will be added to the manuscript if allowed.

  • Results in Table 2 (R3) indeed come from a single unified model handling any combination of modalities as variable-sized sequences, which is faster than using zero-filled tensors.

  • DRIM-U (R3): indeed, it highlights our method’s ability to derive unsupervised embeddings achieving competitive results by quickly fine-tuning top layers.

  • We agree with including results for patients of various grades (R4) for greater clinical relevance.
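
The variable-sized sequence handling mentioned in the response on Table 2 can be illustrated with a small plain-Python sketch. It is not the authors' fusion module: the helper names, the single learned `query`, the 2-d embeddings, and the modality keys are assumptions chosen for illustration, and the attention shown is a generic single-query softmax pooling.

```python
import math


def available_tokens(sample, modality_order):
    """Keep only the modalities present in the sample, in a fixed order."""
    return [sample[m] for m in modality_order if sample.get(m) is not None]


def attention_pool(tokens, query):
    """Single-query softmax attention over a variable-length token list."""
    if not tokens:
        raise ValueError("at least one modality must be available")
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(len(query))]
    return pooled, weights


# Hypothetical 2-d embeddings; this patient has no MRI and no WSI.
order = ["DNAm", "RNA", "MRI", "WSI"]
patient = {"DNAm": [1.0, 0.0], "RNA": [0.0, 1.0], "MRI": None}
query = [1.0, 1.0]

tokens = available_tokens(patient, order)
pooled, weights = attention_pool(tokens, query)

# Zero-filling instead keeps four tokens, and the zero vectors still
# receive attention weight unless they are explicitly masked.
zero_filled = [patient.get(m) or [0.0, 0.0] for m in order]
_, zero_weights = attention_pool(zero_filled, query)
```

With the variable-length sequence, attention is spread only over the two modalities that exist; with zero-filled inputs the same pooling assigns non-zero weight to the placeholder tokens, which is one reason a masking or sequence-based treatment is usually preferred.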

We thank the reviewers for considering the page limit while providing suggestions for future work which align perfectly with the aim of the paper, with new datasets (R4), methods (R3-4) and the further exploration of the latent space to better discern patient phenotypes (R1). This will be accelerated by the full availability of dataset preprocessing, code and models.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposed a new method based on mutual information for learning disentangled feature representations from incomplete radiology, pathology and genomics data. Reviewers reached a consensus to accept this paper. The previous major concerns from Reviewer 1 about the evaluation process and novelty, etc., have been cleared. The technical contributions from this paper are appreciated. On the other hand, it is suggested that the authors consider providing computational complexity for all methods compared in Table 1.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This manuscript investigates the way to utilize multi-modal information in the diagnosis of glioma. Although I agree with some reviewers that the technical novelty is somewhat limited, I think the new application on glioma genomics and WSI in survival prediction is quite novel. After the rebuttal, all the reviewers agree that the strengths outweigh the weaknesses. I also share this sentiment and look forward to more insights along the multi-modal analysis of MRI and genomics.



