Abstract

In intensive care units (ICUs), patients with complex clinical conditions require vigilant monitoring and prompt interventions. Chest X-rays (CXRs) are a vital diagnostic tool, providing insights into clinical trajectories, but their irregular acquisition limits their utility. Existing tools for CXR interpretation are constrained by cross-sectional analysis, failing to capture temporal dynamics. To address this, we introduce CXR-TFT, a novel multi-modal framework that integrates temporally sparse CXR imaging and radiology reports with high-frequency clinical data (such as vital signs, laboratory values, and respiratory flow sheets) to predict the trajectory of CXR findings in critically ill patients. CXR-TFT leverages latent embeddings from a vision encoder that are temporally aligned with hourly clinical data through interpolation. A transformer is trained to predict CXR embeddings at each hour, conditioned on previous CXR embeddings and clinical measurements. In a retrospective study of 20,000 ICU patients, CXR-TFT demonstrated 95% accuracy in predicting abnormal CXR findings 12 hours before they became radiographically evident, indicating that clinical data contains valuable respiratory state progression information. By providing distinctive temporal resolution in prognostic CXR analysis, CXR-TFT offers actionable predictions with the potential to improve the management of time-sensitive critical conditions, where early intervention is crucial but timely diagnosis is challenging.
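
The temporal alignment step mentioned above can be illustrated with a minimal sketch. It assumes plain linear interpolation between consecutive CXR embeddings and holds the nearest embedding outside the observed range; the function name `interpolate_embeddings` and the toy two-dimensional embeddings are hypothetical and are not taken from the authors' repository.

```python
from bisect import bisect_right


def interpolate_embeddings(cxr_hours, cxr_embeddings, hourly_grid):
    """Linearly interpolate sparse CXR embeddings onto an hourly grid.

    cxr_hours: strictly increasing acquisition times in hours (assumed).
    cxr_embeddings: one embedding (list of floats) per acquisition time.
    hourly_grid: hours at which an embedding is wanted.
    Hours before the first or after the last CXR reuse the nearest
    embedding; this boundary rule is an assumption of the sketch.
    """
    if not cxr_hours or len(cxr_hours) != len(cxr_embeddings):
        raise ValueError("need one embedding per CXR time")
    result = []
    for t in hourly_grid:
        if t <= cxr_hours[0]:
            result.append(list(cxr_embeddings[0]))
        elif t >= cxr_hours[-1]:
            result.append(list(cxr_embeddings[-1]))
        else:
            hi = bisect_right(cxr_hours, t)  # first CXR strictly after t
            lo = hi - 1                      # last CXR at or before t
            w = (t - cxr_hours[lo]) / (cxr_hours[hi] - cxr_hours[lo])
            result.append([(1 - w) * a + w * b
                           for a, b in zip(cxr_embeddings[lo], cxr_embeddings[hi])])
    return result


# Toy example: two CXRs taken 4 hours apart, aligned to hours 0..4.
hourly = interpolate_embeddings([0, 4], [[0.0, 0.0], [4.0, 8.0]], range(5))
```

Each interpolated row can then be paired with the clinical measurements recorded for the same hour. As Review #3 notes, a straight line in latent space may be a poor approximation for abrupt changes, so the interpolated rows are better read as approximate targets than as ground truth.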

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4128_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Kamaleswaran-Lab/cxrgen

Link to the Dataset(s)

N/A

BibTex

@InProceedings{AroMeh_CXRTFT_MICCAI2025,
        author = { Arora, Mehak and Ali, Ayman and Wu, Kaiyuan and Davis, Carolyn and Shimazui, Takashi and Alwakeel, Mahmoud and Moas, Victor and Yang, Philip and Esper, Annette and Kamaleswaran, Rishikesan},
        title = { { CXR-TFT: Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {158 -- 167}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a novel multi-modal framework that integrates temporal CXR imaging and EMR data to predict future CXR findings in critically ill patients. This may create a more robust characterization of the multi-modal latent representation and enable richer, higher-fidelity generated images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is dealing with a critical and important clinical issue.
    2. Using fusion of multimodality to predict a future modality embedding in time is novel and interesting.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. As the authors mention, most ICUs today do not routinely perform daily CXRs; therefore, the direct clinical significance of the paper cannot be determined.
    2. An ablation study on the input and output components of the model and of the fusion strategy is missing. The paper presents a single fusion strategy with a single input-output pipeline and interpolation strategy.
    3. An ablation study on model architecture is also missing.
    4. In the evaluation of the results, the image embedding baseline comparison is incomplete and somewhat biased. It should also include a baseline that fuses all tabular data with the CXR embedding.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Additional comments:

    • Typo in the abstract: is are
    • BioCLIP mentioned with a wrong citation (I think the authors meant BioMedCLIP instead)
    • Figure 1: How is the baseline computed? Only CXR? It needs to be compared to CXR + EMR.
    • Figure 3: Why is the overall AUROC in (a), the previous embeddings, so high?
    • The results are lacking clinical significance testing.
    • Overall, the rationale for the chosen architecture is not clear: there is no ablation study on the fusion, and no comparison to using only CXR embeddings or only EMR embeddings as inputs.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The evaluation of the results compared to the CXR-only baseline is only partial, and it is difficult to assess the method in this way. In addition, an ablation study is missing, and the application is weak from a clinical point of view.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The lack of comparisons to baselines that include both CXR and EMR, along with the lack of comparisons to other related methods and/or an ablation study on model architectures, weakens the results presented in the paper.



Review #2

  • Please describe the contribution of the paper

    The authors propose CXR-TFT, a sequence-to-sequence Transformer architecture that integrates sparse, temporally irregular chest X-ray (CXR) embeddings with high-frequency clinical time-series data to forecast future CXR latent representations. By linearly interpolating embeddings in the latent space of a pretrained vision encoder (BioCLIP), the model predicts what the next CXR embedding will be at each hourly timestep, enabling early estimation of radiographic findings up to 24 hours in advance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The task itself is an important and novel application of deep vision models for medicine. The paper is indeed among the earlier works to approach this clinically useful yet underexplored problem.
    2. The writing is detailed and easy to follow, although minor grammar mistakes exist.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. I really like the paper’s simple approach to a novel and useful problem. The point of MICCAI review, starting in 2024, is not to encourage adding a bunch of additional experiments. However, having one baseline and no ablation does make the argument relatively weak. Writing papers is about making new findings/claims and proving them with results.

    2. Grammar mistakes; please fix them to make your paper more professional: “encoder that is are”, “acted on prior the time”, “Numerical values were min-max normalization”, “Gradient clipping were used”, “Figure 3, shows the temporal variations…”.

    3. Any reason why this is limited to a single center study when MIMIC-CXR exists as one of the largest publicly available multimodal datasets (images can be linked with MIMIC EHR)?

    4. Lacking related works, although I am fully aware of the page limit. If you talk about multimodal fusion, discuss if not compare with some fusion sotas; if you talk about temporal, discuss if not compare with some temporal sotas. Not going to suggest any specific papers here because I am not going to risk the validity of this review by promoting self-cites.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Great problem with simple methods, but limitations exist. Please check the major comments above. “Dependent on rebuttal” means you need to fix my main concerns: 1) comparison, 2) dataset, 3) grammar.

    3) is easy. 2) is already acknowledged by the authors in the limitations but is indeed a weakness. 1) is tricky: the authors should justify their methods by comparing with more baselines. However, I am aware that MICCAI doesn’t (and neither should other venues) promote additional experiments. Thus the authors need to use additional writing to justify their choices.

    I will review the paper again after the rebuttal and write additional comments on how much your paper has improved in terms of convincing readers.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have carefully addressed each of my questions and concerns; therefore, I recommend Accept after the rebuttal.



Review #3

  • Please describe the contribution of the paper

    The paper’s main contribution is CXR-TFT, a framework that predicts future CXR “trajectories” by fusing latent embeddings from past chest X-rays with high-frequency clinical data (e.g., labs, vital signs, respiratory parameters). Unlike typical cross-sectional methods that only interpret static CXR snapshots, CXR-TFT operates at an hourly time scale, interpolating CXR embeddings and aligning them with continuously measured clinical features via a transformer model. This design enables more granular forecasting of abnormal findings up to 12–24 hours before they become radiographically visible, thus offering a potentially transformative way to identify and manage emerging pathologies in ICU patients.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. CXR-TFT can estimate which findings (edema, consolidation, pneumonia, etc.) are likely to appear up to 12–24 hours before they are visible on a new CXR. This goes beyond most existing research, which either uses static images for classification or broad “worsening/improving” trends.

    2. The method handles irregularly sampled chest X-rays by linearly interpolating the embeddings between consecutive CXRs.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Although the authors assemble a sizeable ICU cohort (17,690 patients), all data come from a single institution. This choice raises concerns about the framework’s generalizability to other hospital systems with different patient demographics or clinical practices.

    2. The paper linearly interpolates consecutive CXR embeddings to align imaging with hourly clinical data. However, direct linear interpolation in latent space may be suboptimal or biologically implausible for abrupt pathophysiological changes (e.g., acute pneumothorax).

    3. The model forecasts abnormal CXR embeddings, but it does not assess direct impacts on clinical endpoints (e.g., mortality, ventilator days, ICU length of stay). While predicting pneumonia or edema earlier is clinically meaningful, many ICU decisions hinge on broader patient-level outcomes.

    4. CXR-TFT infers future latent embeddings via a large transformer, which can be seen as a “black box.” Although the paper discusses performance gains and potential clinical implications, it does not detail how interpretability or explainability might be offered to clinicians.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite limitations such as single-center data and linear interpolation in latent space, I recommend a Weak Accept because the paper’s central idea, predicting near-future CXR abnormalities by fusing sparse chest imaging with high-frequency clinical measurements, is both novel and clinically motivated, showing significantly improved lead time in identifying emerging pathologies.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #4

  • Please describe the contribution of the paper

    In this paper the authors propose a multi-modal framework that integrates temporally sparse CXR imaging and clinical data to predict future CXR findings. This is achieved by embedding the CXR images using a vision encoder, enhancing the CXR embeddings with additional clinical data. Based on this input, a transformer model is trained to predict CXR embeddings at an hourly rate, conditioned on previous CXR embeddings and clinical measurements. A classifier is used on the predicted CXR embeddings to identify abnormal CXR findings. Empirical results using data from a single site illustrate that the proposed framework can predict radiological findings with a 95% accuracy 12 hours before, and a 94% accuracy 24 hours before the next CXR scan.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel approach to combine CXR embeddings with clinical data to enable prediction of future radiological findings that could potentially be useful in an ICU setting.
    • Well written & structured paper.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The authors should empirically validate how much added value the CXR embeddings provide compared to simply using the clinical data measurements. This can be achieved, for example, by using the CXR radiology report labels rather than the CXR images. A classifier can then be trained to predict future clinical findings using only clinical data. Such an experiment would show the potential value of augmenting the clinical data with the CXR embeddings.
    • In Section 2.4 all the details of the Transformer architecture and model training process should be provided.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    On the positive side, this paper presents an interesting approach to combine CXR embeddings with clinical data. Empirical evidence is presented that the proposed framework can enable prediction of future radiological findings that could potentially be useful in an ICU setting.

    On the negative side, the authors did not empirically validate how much added value the CXR embeddings provide compared to simply using the clinical data measurements. The authors should design the appropriate baselines to verify that the CXR embeddings provide substantial additional information to these predictions.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their constructive feedback and address major concerns.

Choice of baselines and lack of input modality ablations (R2, R3): In clinical practice, the last known CXR findings serve as a reference point until new clinical indicators suggest a change in status. Our central hypothesis is that EHR data, collected at a higher temporal frequency than CXRs, contains valuable respiratory state progression information. Our baseline, which carries forward previous CXR findings, tests this hypothesis by evaluating the added predictive value of EHR data trends beyond this clinical standard. The high baseline AUROCs (R3’s question) are expected, since follow-up CXRs are typically ordered for patients with abnormal initial findings, which persist when lung physiology remains stable. We can add sensitivity metrics to our experimental results, which show CXR-TFT’s 72% improvement over the previous-CXR baseline (avg recall: M: 0.657, B: 0.387) for a 12-hour lead time, forming the key takeaway supporting our central hypothesis. Training a baseline model solely on EHR would not reflect clinical practice, where clinicians integrate previous CXRs with emerging clinical indicators rather than using either in isolation.
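
The carry-forward baseline described above can be sketched as follows. This is only an illustration under stated assumptions: findings are stored as one binary label vector per CXR, times are in hours and strictly increasing, and the function name `carry_forward_findings` is hypothetical rather than part of the released code.

```python
from bisect import bisect_right


def carry_forward_findings(cxr_hours, cxr_labels, query_hours):
    """Return, for each query hour, the findings of the most recent CXR.

    cxr_hours: strictly increasing CXR acquisition times in hours (assumed).
    cxr_labels: one list of binary finding labels per CXR.
    query_hours: hours at which a baseline prediction is wanted.
    """
    if not cxr_hours or len(cxr_hours) != len(cxr_labels):
        raise ValueError("need one label vector per CXR time")
    predictions = []
    for t in query_hours:
        idx = bisect_right(cxr_hours, t) - 1  # last CXR taken at or before t
        if idx < 0:
            raise ValueError("query hour precedes the first CXR")
        predictions.append(list(cxr_labels[idx]))
    return predictions


# Toy example: CXRs at hours 0 and 10, two findings each.
baseline = carry_forward_findings([0, 10], [[0, 1], [1, 1]], [0, 5, 10, 12])
```

Under this rule the baseline can only change when a new CXR arrives, so any finding that first appears between two images is missed until the next acquisition.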

Limited comparisons with other CXR+EHR baselines (R1, R3, R4): We acknowledge this limitation and explain the motivation for our design choices. Recent multimodal (CXR+EHR) models train unimodal encoders along with classifiers for static tasks such as phenotyping and mortality prediction (Guerra-Manzanares et al., TMLR 2025; Yao et al., AAAI 2024). A fair comparison would require significant modifications, as CXR-TFT predicts hourly latent embeddings in a seq-to-seq manner, and these are evaluated with a pre-trained, frozen MLP classifier. Other studies address asynchronicity using modified multimodal bottleneck transformers (Lee et al., PMLR 2023) or by generating CXRs via conditional Latent Diffusion Models (Yao et al., NeurIPS 2024), but these would need modifications to utilize a sequence of previous CXRs and clinical time series as CXR-TFT does. Adapting these techniques within our framework is important future work, and these related works can be added to the Discussion. R1 questions the validity of linear interpolation in cases of acute disease onset. While we agree, we would like to add that any interpolation method only approximates the true “latent” trajectory, and interpolated trajectories should be considered noisy labels. We chose linear interpolation for simplicity and vanilla transformers for their proven performance with multivariate EHR (Lee et al., ICLR 2023). Alternative interpolation methods, architectures, and interpretability (R1) will be explored in future work.

Lack of external validation (R1, R4): Our current cohort has carefully curated clinical features and temporal labels for key clinical events (sepsis, etc.). Creating a comparable external dataset (MIMIC) with correct mapping between features requires thorough validation and is an ongoing effort.

Concerns about clinical significance (R1, R3): Reflecting R3’s concern, patients may have baseline CXRs as part of the clinical workup, but most ICUs order follow-up CXRs only when clinically indicated. While studies show that infrequent imaging does not worsen mortality or morbidity (R1’s concern), predicting radiological findings, such as hospital-acquired pneumonia, before overt clinical indication could enable earlier imaging and intervention, which would likely impact previously understudied metrics like time to antibiotics, cost of workup components, fluid intake, etc. Early prediction of radiological findings using CXR+EHR data trends is previously unexplored, so this work is a foundational step. This approach also extends to other imaging modalities where interval imaging is critical, like serial head CTs after intracranial hemorrhage. Assessing direct clinical impacts would require a prospective study, a logical next step for this initial work.

All writing suggestions (R2, R3, R4) will be diligently addressed.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This seems to be a borderline paper among the reviewers; all reviewers agree that the paper addresses an underexplored clinical problem. However, two points make acceptance hard to justify: (1) the single-institution evaluation of the method, which limits generalizability, and (2) the authors' own statement that daily CXRs are not routinely performed in the ICU, which limits clinical applicability. With no ablation study, and since no additional experiments are permitted, I don't think the paper can be accepted in its current form.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


