Abstract

Longitudinal medical image studies often involve multiple scans of the same patient taken at different times, potentially with different modalities (e.g., 2D vs. 3D volumetric imaging). In this work, we propose a single diffusion-based framework that can predict future embeddings of imaging data for predefined time points. Our approach uses a universal vision encoder, able to ingest either 2D or 3D scans, combined with a temporal transformer to fuse embeddings across multiple timepoints. A conditional latent diffusion model then produces the future output in latent space, encoding the longitudinal information of the patient. We evaluated our method on two crucial tasks involving radiological imaging: (1) predicting future pathology in the form of segmentation masks, exemplified by Interstitial Lung Disease (ILD) progression on 3D chest CT scans of Systemic Sclerosis (SSc) patients, and (2) generating radiology reports that incorporate prior imaging context, exemplified by longitudinal chest X-rays from MIMIC-CXR. Results indicate that this unified diffusion approach outperforms existing baselines in both pixel-level forecasting and report generation, highlighting its versatility and effectiveness for longitudinal medical imaging.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2656_paper.pdf

SharedIt Link: https://rdcu.be/eG4C3

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05182-0_11

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{MouNab_Conditional_MICCAI2025,
        author = { Mouadden, Nabil AND Laousy, Othmane AND Marini, Rafael AND Ong, Valentin AND Revel, Marie-Pierre AND Chassagnon, Guillaume AND Christodoulidis, Stergios AND Vakalopoulou, Maria},
        title = { { Conditional Latent Diffusion Models for Irregularly Spaced Longitudinal Radiological Data } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        page = {106 -- 115}
}

Reviews

Review #1

Please describe the contribution of the paper

In longitudinal medical image studies, it is necessary to process multiple scan images taken at different time points for the same patient, which may involve different imaging modalities. The paper proposes the use of UniMiSS to encode 2D and 3D medical data, a Transformer for the time dimension to process multiple time points, and then uses CDM to predict image data and generate radiology reports.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper proposes the use of a transformer to model the longitudinal temporal dimension, capturing relationships between different examinations through a self-attention mechanism, and combining the features of each examination with a trainable temporal embedding to form a single representation.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The flowchart in the paper is not very clear, and it does not clearly depict how the features obtained from the image data are input into the Temporal Transformer. Additionally, the explanation on how to use large models to generate medical reports is also unclear.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The flowchart in the paper is not very clear, and it does not clearly depict how the features obtained from the image data are input into the Temporal Transformer. Additionally, the explanation on how to use large models to generate medical reports is also unclear.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

The paper presents a novel framework for analyzing longitudinal medical imaging data called the Conditional Latent Diffusion Model (CLDM). This framework addresses challenges in longitudinal studies where patients undergo multiple scans at irregular intervals and possibly different imaging modalities (2D and 3D). By integrating a universal vision encoder for both 2D and 3D scans and a temporal transformer for synthesizing multi-timepoint data, the model intelligently weighs the importance of various scans in the context of disease progression. The study focuses on two primary applications: predicting future pathological conditions, specifically in Interstitial Lung Disease (ILD) for Systemic Sclerosis patients, and generating automated radiology reports based on imaging context. Results indicate that the CLDM framework outperforms existing methods in both tasks, highlighting its potential utility in clinical decision-making and enhancing patient care through improved forecasting and reporting capabilities.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1: The framework effectively integrates both 2D and 3D imaging data through a single architecture, facilitating broader application in various medical contexts.
2: Utilizes a temporal transformer that accurately captures inter-exam relationships, which are crucial for understanding disease progression over time.
3: Validates the approach on real-world datasets, ensuring the findings are applicable to clinical settings.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

1: The framework’s complexity may limit its adoption in clinical settings without adequate technical support and infrastructure.
2: Generative models typically require large datasets for training, which might not be readily available in all medical fields.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper presents a novel framework for analyzing longitudinal medical imaging data and demonstrates superior performance in predicting future disease states and generating reports compared to existing methods.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

Accept

Review #3

Please describe the contribution of the paper

The proposed work introduces a general conditional latent diffusion model tailored for longitudinal medical imaging tasks. It leverages the UniMiSS framework to extract embeddings and integrate both 2D and 3D images. These embeddings are concatenated with temporal embeddings and fed into a temporal transformer to capture inter-exam relationships, addressing the challenge of irregular time intervals between scans. The conditional latent diffusion model is then used to predict future lesion features, and task-specific decoders are applied to generate the final outputs. The framework is validated on two clinical tasks: CT-based segmentation of diffuse lung diseases and longitudinal chest X-ray report generation. Experimental results demonstrate the effectiveness of the proposed method for longitudinal medical imaging.
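The fusion step summarized above (per-exam embeddings joined with a temporal embedding, then self-attention across exams) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed sinusoidal time encoding (the paper is described as using a trainable temporal embedding), the identity attention projections, and the mean pooling are all assumptions made only to keep the example small and dependency-free.

```python
import math

def time_encoding(days, dim):
    """Fixed sinusoidal encoding of elapsed time in days.
    Assumption: a stand-in for a trainable temporal embedding; dim must be even."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        enc.append(math.sin(days * freq))
        enc.append(math.cos(days * freq))
    return enc

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_exams(embeddings, days, time_dim=4):
    """Fuse a variable number of exam embeddings into one vector.

    embeddings: list of equal-length float lists, one per exam.
    days: acquisition time of each exam in days (may be irregularly spaced).
    """
    # Concatenate each exam embedding with its time encoding to form a token.
    tokens = [list(emb) + time_encoding(d, time_dim)
              for emb, d in zip(embeddings, days)]
    scale = math.sqrt(len(tokens[0]))
    attended = []
    for query in tokens:
        # Scaled dot-product attention with identity projections (hypothetical
        # simplification: a real transformer learns query/key/value weights).
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, tokens))
                         for j in range(len(query))])
    # Mean-pool the attended tokens into a single patient-level representation.
    n = len(attended)
    return [sum(tok[j] for tok in attended) / n for j in range(len(attended[0]))]

# Three exams at irregular intervals, each with a toy 2-dimensional embedding.
fused = fuse_exams([[0.1, 0.3], [0.2, 0.1], [0.4, 0.0]], days=[0, 95, 410])
```

Because attention operates on a set of tokens, the same function accepts any number of prior exams, which is the property the review highlights for irregularly spaced data.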
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This paper explores an important and interesting problem in medical image computing: predicting future clinical information based on historical medical images acquired at different time points.
2. The proposed method effectively supports inductive learning from both 2D and 3D data across patients with varying numbers of scans.
3. Experiments conducted on both public and real-world clinical datasets enhance the model’s practical relevance and applicability to real clinical settings.
4. The framework demonstrates the flexibility to handle multiple downstream tasks through the use of task-specific decoders.
5. Compared to baseline methods, the proposed approach shows significantly improved prediction accuracy.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. While the proposed application scenario is both novel and practical, the model itself appears to be a combination of existing components, including UniMiSS, Transformer, and diffusion models. This raises concerns about the level of technical innovation. The authors have not provided sufficient explanation regarding their contributions beyond integrating these existing techniques.
2. The workflow diagram is somewhat unclear, with several important implementation details missing.
3. I have concerns about the actual capability of the model, as the comparison with other longitudinal imaging methods is limited. In addition, the paper lacks clarity regarding the pretraining setup, which could have a significant impact on the model’s performance.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1. Figure 1 only provides a simplified overview of the model pipeline and does not clarify how the encoder outputs are fed into the Temporal Transformer. Moreover, there is a lack of detailed architectural descriptions for key components, including the encoder, decoder, Temporal Transformer, and the conditional latent diffusion model. It remains unclear whether any architectural improvements or modifications were introduced.
2. While the authors mention that the decoder is pretrained, they do not specify the dataset used for pretraining. It is also unclear whether the decoder was pretrained jointly with the encoder, and whether such pretraining is feasible in real-world clinical scenarios. Clarifying these aspects would greatly improve the reproducibility of the study and provide a better understanding of the model’s initialization conditions.
3. The authors should provide a rationale for selecting a conditional latent diffusion model for feature prediction over other potential alternatives.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The main concerns lie in the lack of sufficient novelty and clarity in the model design. The proposed approach largely integrates existing components without introducing clear architectural innovations. Moreover, the paper lacks detailed descriptions of key model components and training procedures, which hinders reproducibility and limits the reader’s understanding of the technical contributions. The comparative experiments are also relatively limited, particularly with respect to existing longitudinal imaging methods.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have responded to my concerns regarding the novelty of the model and the insufficiency of the experiments, provided clarifications on the experimental setup, and also committed to modifying the workflow diagram in the manuscript.

Author Feedback

We thank the reviewers for their constructive criticism and positive evaluation, especially for mentioning that our method is novel (R1, R2), flexible with respect to the input data (R1, R2), tested on real-world data (R1, R2), and overall reports better performance than the baselines (R2). Here, we address the major concerns; minor ones, as well as the flow chart of our method (R2, R3), will be addressed in the camera-ready version.

### Complexity and missing details about the training and pretraining of the different models

R1 raised a concern about the potential complexity of our proposed framework, which might hinder its adoption in clinical settings. While our method comprises several components—including an input encoder, a temporal transformer for integrating temporal information, a latent diffusion model, and a task-specific decoder—the inference process is highly efficient. Specifically, predictions for new exam sets take only a few seconds on a V100 GPU, which is comparable to the requirements of other deep learning methods already in clinical use. This highlights the practical utility of our approach. Moreover, our flexible encoder, capable of handling both 2D and 3D exams, further supports the model’s applicability to a wide range of multi-temporal radiological tasks in clinical practice.

R1 also raised valid concerns regarding the training of generative models on large medical datasets, which are often difficult to obtain. We fully acknowledge and share this concern, as data availability remains a critical challenge in medical AI applications. To address this issue, we adopted two key strategies: (i) we employed pretrained encoders and decoders to avoid training the entire framework end-to-end, thereby reducing data requirements; and (ii) we applied data augmentation by generating diverse combinations of available temporal exams. These approaches allowed us to effectively train our pipeline for both applications explored in this study.
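The rebuttal does not specify how the "diverse combinations of available temporal exams" are generated. One plausible reading, sketched below purely as an assumption, is to treat each exam as a prediction target and pair it with every non-empty subset of the exams that precede it. The function name and the `min_history` parameter are hypothetical.

```python
from itertools import combinations

def temporal_training_samples(exam_days, min_history=1):
    """Enumerate (history, target) pairs from one patient's exam times.

    Assumption: every later exam can serve as a target, and any subset of
    at least `min_history` earlier exams can serve as its conditioning history.
    """
    days = sorted(exam_days)
    samples = []
    for t_idx in range(min_history, len(days)):
        earlier = days[:t_idx]
        for k in range(min_history, len(earlier) + 1):
            for history in combinations(earlier, k):
                samples.append((history, days[t_idx]))
    return samples

# A patient with three exams yields four training samples:
# ((0,), 90), ((0,), 400), ((90,), 400), ((0, 90), 400)
samples = temporal_training_samples([0, 90, 400])
```

Under this reading the number of samples grows quickly with the number of exams per patient (11 samples for four exams), which is how a small cohort could be stretched into a larger training set.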

R2 raised concerns about the pretraining setup of our method, while R3 noted that the explanation of the report generation is unclear. We apologize for any lack of clarity regarding parts of our pretraining. The UniMiSS encoder was trained on a large collection of public 2D X-rays and 3D CT scans, as described in [18]. For the decoders, we used different data sources depending on the task. For report generation, we fine-tuned BioGPT [9] on the MIMIC-CXR dataset. For disease segmentation, we fine-tuned a decoder that mirrors the UniMiSS resolution hierarchy using transposed convolutions, on 200 patients from the Systemic Sclerosis cohort. We hope this part is now clearer.
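To make concrete how a stack of transposed convolutions can mirror an encoder's resolution hierarchy, the sketch below computes the spatial size after each upsampling stage using the standard transposed-convolution size formula (dilation 1). The kernel size, stride, starting resolution, and number of stages are illustrative assumptions; the rebuttal does not state the values used.

```python
def transposed_conv_output_size(size, kernel, stride, padding=0, output_padding=0):
    """Output length along one axis of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def decoder_resolutions(start, num_stages, kernel=2, stride=2):
    """Spatial size after each decoder stage.

    Assumption: every stage uses the same kernel and stride, so each stage
    doubles the resolution when kernel == stride == 2.
    """
    sizes = [start]
    for _ in range(num_stages):
        sizes.append(transposed_conv_output_size(sizes[-1], kernel, stride))
    return sizes

# Four hypothetical stages starting from an 8-voxel-wide feature map.
resolutions = decoder_resolutions(8, 4)  # [8, 16, 32, 64, 128]
```

With kernel and stride both equal to 2, each stage exactly undoes one 2x downsampling step of an encoder, which is the sense in which a decoder can "mirror" the hierarchy.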

### Novelty and comparisons

R2 noted that, although the framework is novel, it is composed of existing components. We agree that our study does not introduce a new architectural module. However, our primary contribution lies in the design of a flexible and practical framework capable of handling both 2D and 3D exams, as well as longitudinal data with varying time intervals. We believe this versatility is highly relevant for real-world clinical applications. Furthermore, our extensive experiments across two challenging tasks demonstrate the robustness and effectiveness of our method, particularly when compared to other similar approaches.

R2 pointed out the limited number of comparisons with existing longitudinal methods. We agree that longitudinal studies are relatively sparse in the current literature. In our work, we compared our approach with two recent methods—SADM and [20], both presented at MICCAI 2023—and observed superior performance. We believe that these results, combined with the flexible design of our framework, underscore the strengths and advantages of our method. Additionally, an ablation study illustrating the contribution of each component of our framework is provided in Table 1.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

After the rebuttal, there are two accepts and one reject. After reading the comments carefully, the AC thinks the concerns of the negative reviewer have been addressed by the rebuttal, as confirmed by another reviewer. Overall, the AC recommends that this paper be considered acceptable for MICCAI.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Conditional Latent Diffusion Models for Irregularly Spaced Longitudinal Radiological Data

Author(s):