Abstract

Current prognostic and diagnostic AI models for healthcare often limit informational input capacity by being time-agnostic and focusing on single modalities, therefore lacking the holistic perspective clinicians rely on. To address this, we introduce a Time-Aware MultiModal Transformer Encoder (TAMME) for longitudinal medical data. Unlike most state-of-the-art models, TAMME integrates longitudinal imaging, textual, numerical, and categorical data together with temporal information. Each element is represented as the sum of embeddings for high-level categorical type, further specification of this type, time-related data, and value. This composition overcomes limitations of a closed input vocabulary, enabling generalisation to novel data. Additionally, with temporal context including the delta to the preceding element, we eliminate the requirement for evenly sampled input sequences. For long-term EHRs, the model employs a novel summarisation mechanism that processes sequences piecewise and prepends recent data with history representations in end-to-end training. This enables balancing recent information with historical signals via self-attention. We demonstrate TAMME’s capabilities using data from 431k+ hospital stays, 73k ICU stays, and 425k Emergency Department (ED) visits from the MIMIC dataset for clinical classification tasks: prediction of triage acuity, length of stay, and readmission. We show superior performance over state-of-the-art approaches especially gained from long-term data. Overall, our approach provides versatile processing of entire patient trajectories as a whole to enhance predictive performance on clinical tasks.
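The token composition described in the abstract (one vector per EHR element, formed as the sum of type, type-specification, time, and value embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding width `DIM`, the random lookup tables, the sinusoidal time encoding, and the linear value projection are all assumptions chosen only to make the summation concrete.

```python
import math
import random

DIM = 8  # hypothetical embedding width (must be even for the sinusoidal encoding)


def make_table(n_rows, dim, seed):
    """Random lookup table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]


# Hypothetical tables: 4 high-level categories, 16 finer-grained specifications.
TYPE_TABLE = make_table(4, DIM, seed=0)
SUBTYPE_TABLE = make_table(16, DIM, seed=1)


def time_embedding(delta_hours, dim=DIM):
    """Sinusoidal encoding of the gap to the preceding element (an assumption)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.append(math.sin(delta_hours * freq))
        out.append(math.cos(delta_hours * freq))
    return out


def value_embedding(value, dim=DIM):
    """Fixed linear map of a numeric value; stands in for a learned projection
    or, for images and text, the output of a frozen pre-trained encoder."""
    return [value * (i + 1) / dim for i in range(dim)]


def token(type_id, subtype_id, delta_hours, value):
    """Element-wise sum of the four component embeddings."""
    parts = [
        TYPE_TABLE[type_id],
        SUBTYPE_TABLE[subtype_id],
        time_embedding(delta_hours),
        value_embedding(value),
    ]
    return [sum(column) for column in zip(*parts)]


if __name__ == "__main__":
    print(token(type_id=1, subtype_id=3, delta_hours=2.5, value=0.7))
```

Because the components are summed rather than concatenated, every element ends up with the same width regardless of modality, which is what allows a single transformer encoder to consume the mixed sequence.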

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3193_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/go31glX57/tamme

Link to the Dataset(s)

https://mimic.mit.edu

BibTex

@InProceedings{SusTob_AHolistic_MICCAI2025,
        author = { Susetzky, Tobias and Qiu, Huaqi and Braren, Rickmer and Rueckert, Daniel},
        title = { { A Holistic Time-Aware Classification Model for Multimodal Longitudinal Patient Data } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {24 -- 34}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes the TAMME model which analyzes complex, multimodal EHR to predict various targets for patients, e.g. long vs. short hospital length of stay or readmission. The model uses EHR that has been reduced to tokens of category and temporal information (age, time interval). The authors use merged MIMIC datasets to train and test their model. For different tasks, the TAMME model improves prediction vs. various other models used as baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The model demonstrates that tokens comprised of EHR category and temporal information may provide predictive value for various clinical-related assessments.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The token generation seems to primarily rely on existing domain-specific, pre-trained models. If there are new methods developed to create the tokens, they need to be better defined. The contribution of the paper is unclear. There is only a high-level diagram for the TAMME model. The model would benefit from a more granular, technical illustration. It is not clear whether the underlying data which a clinician would consider (images, reports, etc.) factors into the prediction at all. An ablation study or an exploration of different numbers of data categories might help clarify the relevance of various data. The separation between information used for prediction and “overly salient and revealing events” could be more clear. The maximum limit of 12 hours is unclear. Does this mean no records after the first 12 hours of a particular visit are included for a task like length of stay? The related works section should include more on the various models used as comparison baselines, including why they were chosen. The paper does not discuss why the inclusion of additional history beyond 6 months often reduces task accuracy/AUROC. Similarly, the decision to use the best configuration from Table 2 in Table 3 (including in some cases no history) is not justified. Information on model training is limited.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The meaning of payload and how it might contribute to the model is unclear. The meaning of ED Triage Acuity is unclear.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is limited by the lack of technical detail regarding the presented model and the baselines against which it is compared. The results are not really discussed, particularly the effects of including different lengths of historic EHR.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces TAMME, a Time-Aware Multimodal Transformer Encoder specifically designed for longitudinal electronic health records (EHRs). TAMME eliminates the need for evenly sampled data or a fixed vocabulary by representing time via delta encodings and text/image content using frozen domain-specific encoders, ensuring extendability to unseen data without retraining.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Innovative token representation that combines type category, type-specific semantics, temporal context (delta to previous entry), and value (numeric, image, or text).

    The model introduces a summarization technique to process very long sequences by replacing older segments with summary tokens and prepending them to the recent sequence. Evaluated on six clinically meaningful tasks using four MIMIC databases (MIMIC-IV, ED, Notes, and CXR), including both imaging and text modalities.

    Tasks (triage, readmission, survival, LOS) are directly aligned with clinical decision-making workflows, and the paper enforces realistic constraints (e.g., excluding discharge summaries during survival prediction).
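The summarisation idea noted in the strengths above (older segments replaced by summary tokens that are prepended to the recent sequence) can be sketched roughly as follows. This is an assumption-laden stand-in: the paper obtains history representations from the model itself in end-to-end training, whereas this sketch uses plain mean pooling, and the parameters `window` and `chunk` are hypothetical names.

```python
def mean_pool(tokens):
    """Average a non-empty list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def summarise_and_prepend(sequence, window, chunk):
    """Keep the last `window` tokens verbatim and compress everything older.

    The older part is split into pieces of at most `chunk` tokens; each piece
    is reduced to a single summary vector (mean pooling here, standing in for
    a learned summary representation) and placed before the recent tokens.
    """
    if window <= 0 or chunk <= 0:
        raise ValueError("window and chunk must be positive")
    recent = sequence[-window:]
    history = sequence[:-window]
    summaries = [
        mean_pool(history[i:i + chunk]) for i in range(0, len(history), chunk)
    ]
    return summaries + recent


if __name__ == "__main__":
    seq = [[float(i)] for i in range(10)]  # ten one-dimensional toy tokens
    print(summarise_and_prepend(seq, window=4, chunk=3))
    # [[1.0], [4.0], [6.0], [7.0], [8.0], [9.0]]
```

The resulting sequence is shorter than the full record, yet self-attention over it can still weigh the compressed history against the recent tokens.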

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Although attention weights among token types are analyzed (Figure 4), there’s no discussion of prediction rationale or failure modes.
    2. The paper does not include clinician experiments or human evaluation of model predictions.
    3. The paper does not discuss latency or training cost.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (6) Strong Accept — must be accepted due to excellence

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    TAMME presents a well-designed and novel architecture that pushes the boundary of multimodal temporal modeling in healthcare. Its tokenization scheme, time-aware encoding, and piecewise summarization mechanism are highly original and offer both technical and clinical value.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have provided a clear and thoughtful rebuttal that directly addresses the key concerns raised in the initial reviews.

    The authors respond appropriately to concerns about the novelty of using pre-trained models. They argue convincingly that this is a pragmatic design choice rather than a methodological flaw, and their integration strategy remains novel.

    Regarding performance, the authors provide a reasonable explanation rooted in the nature of the task and the impact of long-term vs. short-term data.

    No new experiments were promised or introduced, and the authors adhered to rebuttal guidelines. Based on the original submission and this rebuttal, the paper merits serious consideration for acceptance.



Review #3

  • Please describe the contribution of the paper

    The paper presents TAMME, a model that captures relationships across different types of patient data—like vitals, imaging, and prescriptions—before hospital admission. This approach is valuable for hospitals, as it helps improve clinical decision-making by including a patient’s full medical history. The model outperforms existing methods in five out of six prediction tasks, showing that using long-term data can lead to better results.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method models the patient’s full medical history available in different formats (text, numerical, images) rather than focusing on only one format for prediction.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper mentions 32 manually defined data categories, but does not explain how these were chosen or whether they are based on clinical reasoning.

    Many results are shown only through figures and tables, with little explanation in the text—Figures 3 and 4, in particular, are hard to understand.

    Also, TAMME performs much worse than the model by Yaddaden et al. on the ED triage acuity task. The authors do not explain why this happens. Since triage decisions rely on immediate patient data, TAMME’s focus on long patient history might not help in this case. This difference should be discussed more clearly.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (6) Strong Accept — must be accepted due to excellence

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their thorough evaluation, especially reviewers R1 and R3 for their positive feedback acknowledging the methodological contribution of TAMME’s integration of temporal data, the proposed summarization mechanism, and its novel multimodal token representation. In the following, we would like to address their concerns.

  1. Composition of Token Representation (R2): We would like to clarify a crucial misunderstanding. R2 claims that TAMME reduces EHRs to time and category information, omitting the actual values. We consider multimodal EHRs, i.e. sequences of images, texts, numerics, and categoricals, where each element carries time and type information. Instead of reducing them, TAMME obtains a uniform token sequence that integrates all of these multimodal and temporal data, as detailed in Sec. 3 and Fig. 1. As recognized by R1 and R3, our token representation is the sum of embeddings for type category, type specifics, time, and value.

  2. Use of External Models (R2) R2 states that our token representation solely consisted of obtaining embeddings from existing models and therefore lacked novelty. We clarify that existing models are used only to embed image and text values, which is just one part of our proposed method for joint representation of all multimodal EHR data, including numeric values as well as time and two-fold type information of each element (Sec. 3, Fig. 1). Further, obtaining semantically meaningful embeddings from pre-trained encoders exploits progress in unimodal learning. We argue this is not a weakness, but substantially strengthens the method’s ability to extend seamlessly to the most advanced models in the field. With this capability as only one of our building blocks, our methodology remains clearly distinctive and highly novel, acknowledged by R1 and R3.

  3. Performance Drops in some Experiments (R2, R3): R3 requests an explanation of TAMME’s performance on the triage task compared to the baseline. Unlike our general approach, the baseline is highly specialized, using manually selected data and features. We agree with R3 that short-term signals relevant for triaging could easily be obscured by long-term data. We suspect a similar signal-vs-noise tradeoff explains the minor performance drops when using an extended EHR window in some tasks, which R2 questioned. However, this negative impact from long-term information is marginal and vanishes in additional long-run experiments, whereas the positive impact is significant (Tasks 2, 3, 5). Overall, we reveal the insight that the ideal EHR window varies among tasks (Tab. 2, Fig. 2). For instance, readmission and survival tasks benefit from long-term history, while recent records are sufficient for ICU length of stay and triaging.

  4. Minor Clarifications (R1, R2, R3) We agree with the reviewers that certain aspects could benefit from additional detail or refinement. Our decision to only use the best config from Tab. 2 in further experiments (R2) aims to reduce computational cost, assuming uniform pre-training affects performance equally for all context windows used in fine-tuning. The high-level type categories for EHR elements (R3) have been chosen for maximum generalizability in consultation with experienced clinicians. All requested refinements and training details (R1, R2) will be incorporated in the camera-ready version along with a strengthened discussion of results. Additional external sources for relevant information will be explicitly referenced.

  5. Extensions (R1, R2) Due to space limits, we only include an ablation study on EHR history length for now. A more extensive study will be provided in the future, encouraged by R2. We agree with R1 that exploring the model’s reasoning, e.g. by human and clinical evaluation, is of high interest. Thus, it will be addressed in our future work.

We again thank all reviewers for their valuable feedback and look forward to refining our work based on their constructive remarks.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The authors should clarify the points raised by the reviewers in their rebuttal.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I see that the scores are high and that two reviewers appear to be excited about this work, but I question whether this is the right paper for MICCAI altogether. BioNLP or some other natural language processing venues, perhaps? Yes, the authors employ image encoders, but the time-aware analysis techniques (summarization, tokenization, etc.) come from the NLP field. The radiological reports and other modality descriptions are also just texts. Also, in my reading of the paper, I noticed very few details about the model training, and the GitHub repository contains no code as of today (June 5).


