Abstract

Early prediction of hepatocellular carcinoma (HCC) is necessary to facilitate an appropriate surveillance strategy and reduce cancer mortality. Incorporating CT scans and clinical time series can greatly increase the accuracy of predictive models. However, there are two challenges to effective multi-modal learning: (a) CT scans and clinical time series can be asynchronous and irregularly sampled; (b) CT scans are often missing compared with clinical time series. To tackle these challenges, we propose a Temporal Neighboring Multi-modal Transformer with Missingness-Aware Prompt (TNformer-MP) to integrate clinical time series and available CT scans for HCC prediction. Specifically, to explore the inter-modality temporal correspondence, TNformer-MP exploits a Temporal Neighboring Multimodal Tokenizer (TN-MT) to fuse the CT embedding into its multiple-scale neighboring tokens from clinical time series. To mitigate the performance drop caused by the missing CT modality, TNformer-MP exploits a Missingness-aware Prompt-driven Multimodal Tokenizer (MP-MT) that adopts missingness-aware prompts to adjust the encoding of clinical time series tokens. Experiments conducted on a large-scale multimodal dataset of 36,353 patients show that our method achieves superior performance compared with existing methods.
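To make the notion of multiple-scale temporal neighbors concrete, here is a minimal toy sketch. It is not the authors' tokenizer: it assumes timestamps measured in days, and the function name `neighbor_indices` and the window sizes (30, 90, 365) are hypothetical choices for illustration. It only shows how a single CT timestamp could be matched to irregularly sampled clinical visits at several temporal scales.

```python
from typing import Dict, List, Sequence


def neighbor_indices(
    ct_time: float,
    visit_times: Sequence[float],
    windows: Sequence[float] = (30.0, 90.0, 365.0),  # hypothetical scales, in days
) -> Dict[float, List[int]]:
    """For each window size, return indices of visits within +/- window of the CT scan."""
    return {
        w: [i for i, t in enumerate(visit_times) if abs(t - ct_time) <= w]
        for w in windows
    }


# Irregularly sampled clinical visits (days since first visit) and one CT scan at day 400.
visits = [0, 180, 390, 410, 500, 900]
neighbors = neighbor_indices(400, visits)
print(neighbors)
```

In an actual model, the CT embedding would then be fused with the clinical tokens selected at each scale. How that fusion is done is defined in the paper, not here.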

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0728_paper.pdf

SharedIt Link: https://rdcu.be/dVY86

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72378-0_8

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0728_supp.pdf

Link to the Code Repository

https://github.com/LyapunovStability/TNformer-MP.git

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xu_Temporal_MICCAI2024,
        author = { Xu, Jingwen and Zhu, Ye and Lyu, Fei and Wong, Grace Lai-Hung and Yuen, Pong C.},
        title = { { Temporal Neighboring Multi-Modal Transformer with Missingness-Aware Prompt for Hepatocellular Carcinoma Prediction } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {79--88}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The work introduces a multi-modal framework (TNformer-MP) to perform early HCC prediction by integrating clinical time series and CT scans. The authors propose a temporal neighboring multi-modal tokenizer and a missingness-aware prompt-driven multimodal tokenizer to bridge the paired modalities in the time domain while addressing modality incompleteness. The effectiveness of the method is validated on large-scale real-world patient data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The way of combining temporal multimodal information for future HCC prediction is a quite novel design.

    This work collects a large inhouse dataset for validating the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I am not quite convinced that combining clinical and CT information will give a more accurate prediction of future HCC.

    Explainability of this work is insufficient: which clinical features are beneficial? Do clinical and CT features contribute equally to the task? Why can the missing features be replaced by the so-called missing-aware prompts? Are these prompts learnable, and if so, what do they learn for the task?

    This work lacks external validation, and it is hard to know whether the proposed method generalizes well to other centers.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    What clinical factors are collected in this work? Why are the clinical factors important and helpful for predicting HCC? The authors may need to elaborate more on the background information.

    In equation (4), ELU is defined neither in the equation nor in Fig. 2. Is this an activation function?

    “CT scans, being the auxiliary modality, may be missing”. Why is CT used as an auxiliary modality for predicting HCC? CT should be the most informative indicator for predicting HCC. How does the missingness of CTs affect the prediction accuracy?

    How do you decide the number of missing-aware prompts? Is the proposed method sensitive to the choice of hyperparameters?

    Are these missing-aware prompts learnable? Why can they replace the missing CT features? How robust is the method when CT is entirely missing?

    In terms of risk stratification, what criteria and basis did you use for dividing the patients into low- and high-risk groups?

    For prognostic evaluation, the C-index is a standard metric to use in addition to time-dependent AUC. Why is the C-index not included in the metrics?

    Different metrics are used for evaluating the low-risk group (sensitivity and precision) and the high-risk group (specificity and NPV). Can you explain the inconsistency? And for risk stratification, why is survival analysis not conducted? For example, KM plots may be a better presentation of the results.

    If possible, multi-center retrospective data should be collected for further evaluation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation and clinical significance of the work are not clear; I am not convinced the proposed method is principled and reproducible; the prognostic metrics used for evaluation are not standard.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Thanks for the authors’ response. From the clinical perspective, I am still not quite convinced that combining serum biomarkers and CT would lead to better prediction, since serum factors are vulnerable and fragile and can vary greatly between timepoints, which could make the prediction noisier. “It includes 46 clinician-recommended clinical parameters, which are closely linked to liver disease severity”: this has not been articulated at all in the paper. Imaging itself should be the most informative modality for predicting HCC; not only contrast CT but also contrast MRI would provide useful features that are complementary for the task. For instance, differentiation of hepatic nodules, e.g. LGDN and HGDN, can be done better on MRI, and this is critical for predicting the development of HCC. So it would make more sense to combine different imaging modalities to improve performance. The authors respond, “Previous research has shown that learning clinical time series is effective for predicting HCC.” This has neither been included in the paper nor can it be found anywhere in the prior art.



Review #2

  • Please describe the contribution of the paper

    This paper presents a Transformer-based model for early prediction of hepatocellular carcinoma (HCC) using multimodal data comprising time-series CT scans and clinical indicators. Specifically, the authors propose two token generation methods:

    1. Temporal Neighboring Multi-modal Tokenizer (TN-MT): This tokenizer connects each CT scan with nearby clinical time-series data to address inconsistencies in acquisition times, while incorporating multi-scale image feature extraction.
    2. Missingness-aware Prompt-driven Multi-modal Tokenizer (MP-MT): For samples lacking CT scans, this tokenizer generates a prompt from clinical time-series data to simulate the processing as if all modalities were present, enhancing the Transformer’s ability to predict early HCC.

    Additionally, the paper validates these methods on a large-scale clinical dataset with 36,353 patients, demonstrating the model’s effectiveness in improving early detection of HCC.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper introduces a novel transformer-based approach for early prediction of hepatocellular carcinoma (HCC) by leveraging multimodal data that combines time-series CT scans and clinical indicators. The main strengths of this work include:

    1. Detailed Consideration of Clinical Problems: The authors meticulously address the clinical issues associated with two time-related modality types. They have developed innovative token construction techniques to manage the discrepancies in acquisition times between modalities, as well as to handle missing modalities effectively. TN-MT effectively integrates CT and clinical data despite their temporal discrepancies by creating a bridge between each CT scan and a neighboring range of clinical time series. Meanwhile, MP-MT addresses scenarios where CT scans are missing, using learnable prompts to adapt the clinical time series tokens for consistent processing.
    2. Comprehensive Validation on a Large Dataset: The model’s effectiveness has been validated on a substantial dataset comprising 36,353 patients, demonstrating its superiority over existing methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper introduces an innovative algorithm for early hepatocellular carcinoma (HCC) prediction using multimodal data but has several weaknesses that need addressing:

    1. Concerns with Token Construction: The authors address two limitations related to temporal multimodal data by creating two types of tokens: TN-MT and MP-MT. The MP-MT dynamically adapts tokens generated solely from clinical indicators using learnable prompts to mitigate performance drops due to missing CT data. However, these MP-MT generated tokens are based solely on clinical data, so they may differ significantly from those generated under fully modal conditions. This raises concerns about potential new interference introduced by MP-MT.
    2. Poor Performance on Critical Metrics: The proposed method and the methods it is compared against perform poorly on key metrics such as AUPRC, Precision, and Sensitivity, even below 50%. Despite the proposed method outperforming the compared methods, the low performance raises concerns about its clinical applicability.
    3. Imbalanced Data Concerns: Although the sample size used in this paper is large (36,353 patients), the class distribution is extremely unbalanced: positive samples (HCC patients) account for only 7.4%. I therefore worry that the imbalance may bias model learning during training and affect the results, which would also make the quantitative analysis less reliable.
    4. Confusing Experimental Setup and Analysis: Given the large sample base (36,353) and the proportion of positive samples (7.4%), the authors could draw a subset of more than 5,000 patients with a 1:1 ratio of positive to negative samples, which would still be a large-scale dataset. Assuming 5,000 class-balanced samples were obtained, more than 700 of these patients would have both CT and clinical data (based on the data analysis mentioned in the paper), enough to support a comprehensive experimental evaluation. In addition, the authors mention that the test set is divided into three different subsets: (TS+CT)ALL, (TS+CT)PAIR, and (TS+CT)PARTIAL. This statement is confusing. As far as I understand it, (TS+CT)PARTIAL actually represents the samples with missing CT data. Moreover, the subset names are written with uppercase subscripts in Table 1 but with lowercase subscripts in Table 2.
    5. Inadequate Comparative Analysis: The authors compare against three other methods, cited in the paper as [5], [10], and [20]. However, [5] considers the fusion of clinical data and X-rays, [10] is derived from non-peer-reviewed work, and [20] proposes a multitask approach for segmentation and survival prediction. The comparison with these methods does not seem adequate and objective.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    To provide detailed and constructive feedback for this innovative study on early Hepatocellular Carcinoma (HCC) prediction using multimodal data, I recommend the following:

    1. Consistency Constraint in Token Generation: I am curious whether it would be feasible to introduce a consistency constraint during the training process between the new tokens generated by MP-MT and those generated under fully modal conditions. This could ensure that the tokens produced by MP-MT align more closely in distribution with those generated from paired data, potentially reducing the introduction of biases or errors.
    2. Data Sampling for Balanced Analysis: I recommend that the authors consider conducting a random selection of balanced positive and negative samples from such a large dataset. Even selecting 5,000 patients would maintain a substantial dataset size, which could support comprehensive experimental evaluation. Training a model on a balanced dataset might result in more reliable outcomes and improved performance on key metrics.
    3. Detailed Methodological Description: It would be beneficial for the authors to enhance the description of the methodological implementation details, such as the dimensions of the input tokens and the number of time points included in the input. Additionally, I suggest clarifying the confusion in the paper, especially regarding the division, naming, and usage of the test subsets. Moreover, the reference to supplementary material on page 6, specifically “(Appendix C),” appears to be missing or unclear.
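    The balanced-sampling suggestion can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `balanced_subsample` is hypothetical, the labels are synthetic, and the only figures taken from the reviews are the cohort size (36,353) and the positive rate (7.4%).

```python
import random
from typing import List


def balanced_subsample(labels: List[int], seed: int = 0) -> List[int]:
    """Return indices of a 1:1 subset: positives plus an equal number of random negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    return sorted(rng.sample(pos, k) + rng.sample(neg, k))


# Rough size of the balanced subset implied by the figures quoted in the reviews.
n_patients, prevalence = 36_353, 0.074
n_pos = round(n_patients * prevalence)  # about 2,690 HCC patients
print(n_pos, 2 * n_pos)                 # balanced subset of about 5,380 patients

# Toy check on synthetic labels with the same positive rate (not real data).
labels = [1] * 74 + [0] * 926
idx = balanced_subsample(labels)
print(len(idx), sum(labels[i] for i in idx))
```

    Note that prevalence-dependent metrics such as precision measured on a balanced subset are not comparable with those measured at the original 7.4% prevalence.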

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation for this paper is influenced by its innovative approach and the use of a large clinical dataset, tempered by significant concerns regarding its performance and experimental details. Here are the key factors shaping my evaluation:

    1. The paper’s methodological innovation and its application to a vast clinical dataset are commendable. These elements typically suggest potential for significant impact in the field of early hepatocellular carcinoma (HCC) detection.
    2. All methods discussed exhibit suboptimal performance on crucial metrics such as AUPRC, Precision, and Sensitivity, even falling below 50%. This poor performance critically undermines the clinical applicability of the proposed solutions.
    3. The setup and rationale for the experiments are not well explained, leading to confusion about the division and use of the test set, among other aspects. More clarity and justification in the experimental design would strengthen the study’s validity.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you for your rebuttal, which has largely addressed my concerns regarding the experimental setup and principles. While there is some methodological innovation, the overall description and key performance in the submitted version are suboptimal. Therefore, I still consider this a borderline paper. Nonetheless, I have revised my score to weak accept at this stage.



Review #3

  • Please describe the contribution of the paper

    This manuscript proposed a temporal neighboring multi-modal Transformer with missingness-aware prompt for early HCC prediction based on clinical time series and CT scans.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main advantages of this manuscript are as follows: (1) To bridge the paired modalities from temporal correspondence, the authors introduced a Temporal Neighboring Multi-modal Tokenizer (TN-MT) to combine each CT embedding with a neighboring range of clinical time series. (2) For CT-missing patients, the authors introduced a Missingness-aware Prompt-driven Multimodal Tokenizer (MP-MT) to adopt learnable prompts to adapt the clinical time series tokens to the missing modality scenario. (3) The effectiveness of the proposed method TNformer-MP is validated on a large-scale real-world cohort of 36,353 patients.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The possible shortcomings of this manuscript are: (1) There is no innovative solution proposed for multimodal alignment and modal loss, and the method innovation is slightly insufficient; (2) There is very little background introduction about HCC prediction, and there is also relatively little introduction to related methods. The gold standard for discrimination is also unclear. (3) Due to the private collection of data, it may affect subsequent repeatability verification.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It would be even better to share the source code of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Constructive comments: (1) Expand the introduction of the clinical background, especially the characteristics of HCC prediction data. (2) Clarify the HCC prediction gold standard: the timing of HCC diagnosis is related to the time of information collection, and the process of transitioning from benign to early HCC is very difficult to determine in clinical practice. (3) Add an introduction to HCC prediction from CT.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I made the above choice mainly based on the following factors: (1) Although the technological innovation of the method is limited, a large amount of data validation has reference value for clinical practice. (2) Although the introduction of clinical background and purpose is insufficient, the overall presentation is relatively clear and the process is straightforward.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Keep my earlier decision




Author Feedback

Reply to R1:

  1. Background of Clinical Data: Previous research has shown that learning from clinical time series is effective for predicting HCC. Our dataset covers 15 years of clinical data and includes 46 clinician-recommended clinical parameters, which are closely linked to liver disease severity.
  2. How Missing-Aware Prompts Work: The missing-aware prompts are learnable vectors. We determine the optimal number of prompts by hyperparameter search. Instead of recovering missing information, the learnable prompts adapt the model, via a dedicated tokenization, to the different input distribution caused by the missing modality.
  3. External Validation: We evaluate our method on a large territory-wide dataset from all public hospitals and clinics. The performance therefore already reflects generalization across institutions. We would like to conduct further experiments to validate cross-region performance.
  4. CT as Auxiliary Modality: Most patients have clinical time series recorded over a long follow-up. In contrast, CT scans are often missing because few patients undergo CT at an early stage. Therefore, we regard CT as an auxiliary modality. Missing CT changes the input distribution, which limits fusion performance.
  5. No Survival Analysis with C-index: We perform an N-year early prediction task rather than survival analysis. Therefore, we use AUROC/AUPRC rather than the C-index. Survival analysis requires the exact HCC occurrence time. However, we can only reliably determine the year of HCC occurrence, since patients are not intensively monitored during long follow-up.
  6. Risk Stratification Criteria & Metrics: A common dual-cutoff strategy is used for risk stratification, where the upper threshold is set at 90% specificity and the lower threshold at 90% sensitivity on the validation set. This strategy already ensures the high-risk group has high specificity & NPV and the low-risk group has high sensitivity & precision. Therefore, we evaluate the sensitivity & precision for the high-risk group, and the specificity & NPV for the low-risk group.

Reply to R3:

  1. Clinical Background: Please refer to our answer R1.1. We will also add more background on CT-based HCC prediction.
  2. HCC Prediction Gold Standard: Real-world HCC risk scores include the CU-HCC, GAG-HCC, PAGE-B, and REAL-B scores.

Reply to R4:

  1. Concern with Token Construction: Our MP-MT avoids the issue of significant token difference. Table 1 demonstrates this with notable and similar improvements (12% vs. 10%) on (TS+CT)PAIR and (TS+CT)PARTIAL over the uni-modal accuracy. This is because MP-MT mimics the multi-modal tokenization by dynamically adapting each missing-modality token with a prompt, rather than by simple concatenation.
  2. Performance on Critical Metrics: Our method achieves >50% sensitivity in the high-risk group and maintains high specificity & NPV in the low-risk group (Fig. 3). This shows potential for clinical application: assisting clinicians in filtering out at-risk patients and allocating healthcare resources efficiently.
  3. Imbalanced Dataset: Our large imbalanced dataset reflects the real-world, territory-wide patient distribution. Despite this challenging scenario, we still achieve comparable improvement (e.g. Fig. 3), which highlights the practical importance. We would like to conduct extra validation on a balanced subset, where the results may be even better.
  4. More Comparative Analysis: Few methods are available for disease prediction using medical imaging and clinical time series, especially for HCC prediction. Our baselines include the three most relevant and recent multi-modal methods. [10] was published at MICCAI 2023. For [20], we use only the fusion module, supervised by the HCC prediction task, for a fair comparison.
  5. Test Set Division: (TS+CT)ALL denotes the whole test set. It is further categorized by modality availability: (TS+CT)PAIR for patients with paired modalities, and (TS+CT)PARTIAL for patients with missing CT.
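The dual-cutoff strategy described in the reply to R1 (upper threshold at 90% specificity, lower threshold at 90% sensitivity on the validation set) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tie-handling convention, and the validation scores are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def dual_cutoffs(scores: Sequence[float], labels: Sequence[int],
                 target: float = 0.90) -> Tuple[float, float]:
    """Return (lower, upper) cutoffs chosen on a validation set.

    upper: scores above it are 'high'; at least `target` of negatives lie at or below it.
    lower: scores below it are 'low'; at least `target` of positives lie at or above it.
    """
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    k_neg = math.ceil(round(target * len(neg), 9))  # negatives kept at or below upper
    k_pos = math.ceil(round(target * len(pos), 9))  # positives kept at or above lower
    upper = neg[k_neg - 1]
    lower = pos[len(pos) - k_pos]
    return lower, upper


def stratify(score: float, lower: float, upper: float) -> str:
    """Assign a risk group from a predicted score and the two cutoffs."""
    if score > upper:
        return "high"
    if score < lower:
        return "low"
    return "intermediate"


# Hypothetical validation scores: 10 negatives followed by 10 positives.
val_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80,
              0.12, 0.33, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.85, 0.90]
val_labels = [0] * 10 + [1] * 10
lower, upper = dual_cutoffs(val_scores, val_labels)
print(lower, upper, stratify(0.9, lower, upper))
```

The sketch assumes the model separates the classes well enough that the lower cutoff falls below the upper one. If the cutoffs cross, the intermediate group is empty and the strategy needs revisiting.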




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Despite the large sample size, the lack of clarity in the methodology, data, and experimental design makes the work difficult to understand. Moreover, the large dataset consists mainly of clinical information, which belongs to the field of bioinformatics, and technical details regarding the image encoder seem to be missing. Furthermore, the results lack validation of clinical relevance: e.g., Precision (about 30%) is too low and Specificity (about 67%) is low, which leads to many false positives given the low prevalence of HCC (7.4% in this paper) and may consequently lead to excessive waste of healthcare resources. Additionally, the use of the term “territory-wide” may violate MICCAI’s blind review guidelines.
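    The dependence of precision on prevalence raised here follows from Bayes' rule and can be checked with a few lines. The sketch is illustrative only: the 7.4% prevalence is the figure quoted above, the 90% specificity is the upper cutoff described in the rebuttal, and the 50% sensitivity is a hypothetical value chosen for the example.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: 50% sensitivity, 90% specificity, 7.4% prevalence.
p = ppv(0.50, 0.90, 0.074)
print(round(p, 3))
```

    At this operating point fewer than one in three flagged patients would be a true case, which is the resource concern the meta-review describes.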

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I think the comments from the meta-reviewers make sense. This is a borderline case for me (which is also reflected in the rankings from the other two meta-reviewers). I agree with the meta-reviewer that the performance is still low and the validation could be more rigorous. But I think the problem setting would raise some interesting discussion at MICCAI.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



