Abstract
3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment and incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible-length sequence, which is reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with anatomical fidelity and spatiotemporal consistency. Code is available at: https://github.com/VinyehShaw/TRACE
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3840_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/VinyehShaw/TRACE
Link to the Dataset(s)
https://huggingface.co/datasets/ibrahimhamamci/CT-RATE
BibTex
@InProceedings{ShaMin_TRACE_MICCAI2025,
author = { Shao, Minye and Miao, Xingyu and Duan, Haoran and Wang, Zeyu and Chen, Jingkun and Huang, Yawen and Wu, Xian and Deng, Jingjing and Long, Yang and Zheng, Yefeng},
title = { { TRACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
pages = {625--635}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents TRACE, a framework for text- and segmentation-guided 3D CT generation. The authors propose to formulate the problem as a video generation task, employing a 2D slice-wise diffusion model that jointly generates frame pairs. The authors claim to generate coherent CT scans with variable axial resolution while being computationally efficient.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper addresses a relevant problem, namely the guided generation of 3D volumes in resource-constrained environments. The use of multiple conditioning mechanisms is generally interesting.
While the idea of using video generation models for 3D generation is known from previous works, it has not yet been properly explored, so I think the research direction of this paper is interesting to explore further.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
This paper has some major weaknesses. I’ll focus on the most relevant ones here and will address minor flaws in the additional comments:
(1) The description of the method lacks important information:
- It remains unclear how optical flow is integrated into the model. The authors just state that they compute dense optical flow using RAFT and integrate it via a trainable adapter to guide temporal alignment. It remains unclear what this adapter network looks like and how this condition is injected into the diffusion model. I additionally don't understand where this flow information should come from during inference.
- The authors state to employ an additional coherence loss term to enhance consistency between frames (Equation 2). It is however never stated how this loss is incorporated into the diffusion loss (Equation 3). Is it just added? Is there a weighting factor?
- The authors claim to additionally provide semantic guidance via text prompts. These text prompts are embedded using a pretrained CLIP encoder and are then added to the encoder's hidden state h_t via some trainable adapter. This adapter is composed of \phi_1 and \phi_2, as well as W_1 and W_2. It remains unclear, however, what these variables denote. I also don't understand the difference between the encoder's hidden state, denoted h_t, and the previously mentioned latent representation of a frame, z_t. Can you clarify this?
- The inference procedure lacks important information. The authors speak about some transformations that are applied to x(n) (Equation 6), but there is no explanation of what these transformations do and why they are necessary. Some parameters, e.g., \gamma, are never introduced.
- Relevant information on the network architectures is missing completely. As no code is provided, this makes the paper not reproducible.
(2) The evaluation of this paper is not convincing and poorly presented:
- The authors use scores from VBench, a video evaluation framework. First I think the authors should at least cite the paper that presents this framework. As most metrics used in this framework rely on some pretrained feature extractors, the authors should clarify which feature extractors were used and they should state why these metrics (developed for natural videos) are at all meaningful in the medical domain. I believe what makes a good natural video does not necessarily make a good CT scan.
- The presentation of results in Table 1 is very confusing. The authors talk about PFM, DCG (which is never mentioned anywhere else in this whole paper), and OFG before ever introducing them anywhere. It is very hard to understand what 50/[1], 100/[1,2] or 50/[1,2,4] should mean at all.
- The authors simply made the scores in the last row bold. These are, however, not always the best scores, which is very confusing.
- It remains unclear how the authors identify an improvement of 1142% (705% would be correct) in Dice and 705% in Jaccard (1152% would be correct) scores. There is a problem with these numbers. The authors should also not claim this improvement over both compared methods if they actually compare to only one of them.
- The authors first compare their method against GenerateCT and MedSyn. Then they change the compared methods to Imagen, SD, Phenaki and GenerateCT without giving any reason for this.
(3) Information on training/sampling times is missing completely. As sequential modeling is usually very time-consuming, this information should be added.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Here are some additional comments:
- The authors claim to reduce training and inference cost (it is not specified what "cost" means in this context) by some percentage (end of Section 1). There are no experiments or numbers to support this claim.
- Figure 1 is very overwhelming. It is hard to understand what exactly is shown where. Additionally, the authors never refer to this figure in the text, so it remains unclear when to look at it.
- The authors speak about c being the channel count at the beginning of the section "Anatomical Guidance". c is, however, missing in the description of m.
- When referring to Figure 2, the authors sometimes write fig. 2c and sometimes fig. 2 (d). The notation should be consistent; I would prefer Fig. 2(a), etc. There are also missing brackets in the description of this figure.
- The feature extractors used for FID and FVD should be clearly stated, as should the number of samples used to compute these scores.
- The ablations on Anatomical Mask Granularity are not convincing at all, as no scores or other quantitative results are given for this study.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(1) Strong Reject — must be rejected due to major flaws
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The novelty of this paper is limited. It simply combines existing approaches and fails to explain them properly, which makes this paper impossible to reproduce. In addition, the experimental results presented in the paper are neither convincing nor well presented. Taking this into account, a lot of work and changes are needed, and I really hope the comments (especially on the methods and experiments sections) help the authors improve the manuscript. For now, this paper must be rejected due to major flaws.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Although the authors addressed some of my concerns in their response, I still believe that this paper’s weaknesses outweigh its strengths. The description of the method in particular needs to be improved. Even after the authors’ response, the notation remains unclear to me. In their response and Eq. 2, the noisy latent is called z, but in Eq. 3 and Algorithm 1, it is x. Is this the same? It is also still unclear to me how optical flow is incorporated during sampling. My concerns regarding the evaluation still remain. I think this paper can unfortunately not be accepted in the current state.
Review #2
- Please describe the contribution of the paper
The paper proposes an iterative image generation framework for CT imaging, conditioned on text prompts, anatomical shape information, and relative positional embeddings.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper proposes a memory-efficient image generation framework suited for large-scale CT imaging, addressing an important practical bottleneck.
- The evaluation is broad, covering a wide range of metrics, which demonstrates the paper's effort to comprehensively assess the method. Although a deeper discussion of the relevance of each metric would strengthen the work, the breadth of evaluation is appreciated.
- The integration of anatomical consistency into the generation process, combined with the use of text embeddings, is an interesting and novel aspect.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The naming of the paper and method is somewhat misleading. While the approach treats CT slices in a video-like manner, CT volumes are inherently spatial, not temporal, and the analogy could be made more carefully. The naming throughout the paper should be more consistent.
- Not all proposed modifications appear to be ablated. A more thorough and structured ablation study would have helped clarify the contribution of each component.
- The organization of results, particularly between Table 1 and Table 2, can be improved. Table 1 mixes baselines and multiple metrics without clear separation. Adding a meta-column for metric categories and clarifying bolding practices would improve readability. Abbreviations should also be explained directly in the captions, and the structure of the tables could better differentiate baselines, ablations, and final models. Splitting medical baselines and natural-image baselines across the two tables also adds to the confusion.
- While there are many qualitative results, their value is difficult to assess. Several generated images show severe visible artifacts, raising questions about the claimed anatomical and especially the claimed temporal/slice consistency. Although the reported metrics are strong, the visual quality appears less convincing and should be addressed, especially in the comparison with GenerateCT. If these are indeed not artifacts, further discussion of the metrics, and whether they even capture slice-wise consistency, would benefit the paper. Further discussion or analysis of the discrepancy between quantitative and qualitative results would be warranted.
- Mathematical definitions are inconsistent. For example, the functions in Equation 6 are not properly introduced, and the iterative procedure described around Equation 7 conflicts with the definition of G. The instantiations of G_n and G_{n+1} are ambiguous within the loop description. These inconsistencies reduce the overall rigor of the method section, albeit having some intuitive meaning.
- Table 2 refers to the method as "3D," when it would be more accurate to describe it as 2.5D, following common conventions.
- The computation of 2D metrics on 3D images is not sufficiently explained. It is unclear whether they are averaged slice-wise, stacked into volumes, or otherwise aggregated.
- It is also unclear how ground truth (GT) data could yield a negative FID score; this suggests either a reporting error or a need for clarification on how FID is computed in this setting.
- Furthermore, the reported results for GenerateCT differ from those in the original work [9], where stronger FID, video, and medical metrics (the latter of which is not reported here) were achieved. Since the same dataset was used, a discussion of these discrepancies is warranted to understand whether differences stem from reimplementation details, experimental settings, or other factors. The absence of this analysis weakens the credibility of the reported comparisons.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The claims made in the paper, particularly regarding temporally reliable 3D generation, appear overstated. The qualitative examples, even at reduced scale, show artifacts and inconsistencies relative to baselines, raising concerns about the actual fidelity of the generated images. This further questions the validity and interpretation of the reported metrics, especially given the comparatively poor performance of e.g. GenerateCT.
Nonetheless, the paper introduces a memory-efficient generation approach that achieves strong quantitative results across multiple metrics, addressing a relevant and important problem. With a more balanced discussion of qualitative versus quantitative performance, a clearer explanation of the evaluation metrics, and improved methodological clarity, the work has potential to evolve into a strong contribution.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Even after the rebuttal, we remain divided. The paper has clear strengths, particularly in efficient text- and anatomy-guided large image generation. The experimental section is reasonably thorough, with a wide range of metrics.
We remain pedantic about the use of the term "temporal" and the use of "2.5D vs. 3D". This is a minor issue and was not properly addressed, but it is not critical to our decision. Still, in our opinion the term 2.5D captures the methodology and the type of method well enough, whereas there is literally no time involved, despite the similarities to videos. In videos, time has a clearly distinguished direction, whereas here the method could be applied along any axis, and in any order. Hence the use of "time" is misleading.
Our main concern remains the qualitative comparison between TRACE and GenerateCT. The rebuttal did not address the apparent slice-wise artifacts in TRACE, which are absent in GenerateCT. For an image generation method, stronger qualitative results, especially regarding slice consistency, would have been important. While we don't agree with R2's overall negativity, some of it stems from the shallow explanation, which could also be due to limited space; this paper suffers more from the page restriction than other papers. Yet we share R2's concern that not all chosen metrics may fully capture clinical relevance. That said, we acknowledge the use of anatomically meaningful metrics, and the segmentation / anatomical fidelity results are promising in that respect. We are still not certain how the language-based conditions were tested. We hope the authors improve the structure of the finished manuscript.
Since we did not reproduce GenerateCT ourselves, we cannot verify the performance difference, but we note that the authors claim identical experimental conditions.
In summary: sufficient novelty, despite some limitations in qualitative evaluation and presentation.
Review #3
- Please describe the contribution of the paper
This method repurposes 2D diffusion models to generate anatomically accurate 3D CT images while preserving spatiotemporal features. By treating each 2D slice as a video frame, the method enforces structural integrity using an overlapping frame strategy. The diffusion model is conditioned on anatomical segmentation priors and text prompts from radiology reports to enforce semantic constraints, while positional encodings are used to enhance temporal coherence, followed by the use of optical flow to maintain spatial consistency.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This method synthesizes 3D images with unrestricted axial length. To ensure temporal consistency, the method uses frame skipping and positional encoding, while optical flow is used for spatial alignment. Semantic priors in the form of segmentation masks and radiology-informed constraints are used to maintain anatomical integrity. The proposed method achieves superior anatomical accuracy. The segmentation prior has information about the majority of organs, thus ensuring the anatomical soundness of the generated samples. Since CLIP is pretrained and validated on natural, in-the-wild data, it may not generate accurate text latent representations for radiology prompts. To address this, the authors incorporate a separate trainable module to extract meaningful embeddings before integrating them into the diffusion model. Temporal sinusoidal embeddings use the concatenated representations from adjacent frames, enhancing temporal coherence. The generation process follows an overlapping-frame strategy, similar to autoregressive techniques, where each generated frame guides the synthesis of subsequent frames.
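For concreteness, the following is a minimal, illustrative sketch of a standard transformer-style sinusoidal positional embedding of a frame pair's axial index, of the kind the review describes; the function name, dimensions, and usage are assumptions and are not taken from the paper.

```python
import math
import torch

def pair_position_embedding(pair_index: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal embedding of a frame pair's axial position (pair_index: shape [B])."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = pair_index.float()[:, None] * freqs[None, :]             # [B, half]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [B, dim]

# e.g. embed the position of the pair formed by slices (42, 42 + k) in a volume
emb = pair_position_embedding(torch.tensor([42]), dim=128)
```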
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Why a 2D diffusion model is used to capture temporal information along the channel dimension is unclear. The convolution operation alone does not inherently enforce temporal consistency for frames concatenated in this manner, and the temporal position embeddings were not directly fused along the channel dimension to ensure the model effectively understands frame differences. The need for a large number of segmentations (n = 128) may not align with real-world medical data, which often contains missing information and lacks such clinical annotations.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Here are additional comments that I would prefer be addressed:
In the CT frame sampling strategy for training, why is the “i mod k=0” constraint present?
How is the full-resolution optical flow downsampled during training? More technical details on this method would be helpful.
“This transformed embedding is integrated into the encoder’s hidden states as ht = ht+ v′t.” In this line, is ht the output of each convolutional block in the encoder of the noise prediction network \epsilon_\theta?
Why are the intermediate temporary frames \tilde{x}(n) and \tilde{x}(n+1) required to be computed (line 8 of algorithm)? Why is \tilde{x}(n) used in line 9 instead of \tilde{x}(n+1) to compute G(n+1)?
More details on how the features of the CLIP adapter, optical flow adapter and relative position embeddings are combined together before being used as condition in the diffusion model would be helpful. If they are not combined, explanation on how the optical flow and position embeddings are individually integrated with the UNet is needed, since only the integration of text embeddings is specified.
For the Imagen and Stable Diffusion baselines, were the 3D volumes generated slice-wise? If so, why were video property metrics not reported in Table 1 for these methods?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Though this paper lacks some technical details, it proposes a new method for integrating several constraints and priors to ensure anatomical soundness of the generated volumes. It has great potential for future work, integrating these modalities to form a multi-modal generation framework.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Although the paper and the authors' rebuttal still did not clarify most of their implementation details and methodological design choices, this paper introduces a novel and efficient approach to integrating multiple modalities, achieving higher performance metrics than existing methods. The authors could benefit from a poster session with a detailed discussion to further improve this work.
Author Feedback
R2W1.1: Optical flow is passed through a learnable convolutional MLP and injected via mid-block residuals as a structural condition. During inference, as shown in Fig. 1, flow is computed from overlapping generated frames and recursively guides the synthesis of subsequent pairs. We will clarify this in the revision.
R2W1.2: We use a single MSE loss. Eq. 2 describes the implicit modeling of temporal correlation across paired frames and anatomical masks jointly fed into the UNet, rather than an additional term. Eq. 3 shows how anatomical conditions are integrated via channel-wise concatenation during denoising. To be clarified in the revision.
R2W1.3: The text adapter is an MLP: W₁/W₂ are linear projections and φ₁/φ₂ are GELU activations. It injects the CLIP embedding via cross-attention into the hidden state hₜ. Note that hₜ refers to the hidden representations, whereas zₜ is the noisy latent input at diffusion step t; they occupy different stages of the network. Details will be clarified in the revision.
R2W1.4/5: CT slices at small Z-intervals show HU continuity and stable anatomy. Amplifying high-intensity regions (γ), smoothing low-confidence areas (H), and normalizing (F) stabilize key features and suppress diffusion noise, reducing uncertainty in subsequent frame synthesis. To be clarified in the revision and code.
R2W2.1: We respect your opinion, but our view differs. VBench submetrics are in fact cited in Section 3.1, and we will cite VBench itself in the revision. Our use of video-based metrics follows and extends GenerateCT (e.g., FID, FVD); as CT sequences exhibit consistent anatomy and gradual intensity transitions, these metrics are meaningful in this context.
R2W2.2-4: The abbreviations PFM, DAG, and OFG refer to Paired Frame Modeling (Sec. 2.1.1), Dual Anatomical Guidance (Sec. 2.1.2), and Overlapping Frame Guidance (Sec. 2.2), respectively. Notation such as "100/[1,2]" indicates the number of patients and the set of skip intervals. The percentage improvements were mistakenly inverted: the correct Dice gain is 705% and the Jaccard gain is 1142% (rounded from 4.98). These issues, including the bolding, will be revised.
R2W2.5: To ensure fairness, we followed GenerateCT and included representative baselines used in their evaluation.
R2W3: Will be included in the revision as advised.
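For illustration only, here is a minimal sketch of the two adapters as described in this response (text adapter: W1/W2 linear projections with GELU activations φ1/φ2; flow adapter: a learnable convolutional MLP injected as a mid-block residual). All module names, channel widths, layer ordering, and injection points are assumptions, not taken from the paper or its code.

```python
import torch
import torch.nn as nn

class TextAdapter(nn.Module):
    """Sketch of the described text adapter; layer order and dimensions are assumed."""
    def __init__(self, clip_dim: int = 768, hidden: int = 1024, out_dim: int = 768):
        super().__init__()
        self.w1, self.w2 = nn.Linear(clip_dim, hidden), nn.Linear(hidden, out_dim)
        self.phi1, self.phi2 = nn.GELU(), nn.GELU()

    def forward(self, v: torch.Tensor) -> torch.Tensor:  # v: CLIP text embedding [B, clip_dim]
        # The output v' is then combined with the UNet hidden state h_t (via cross-attention
        # per the rebuttal, or h_t = h_t + v' per the equation quoted by Review #3).
        return self.w2(self.phi2(self.w1(self.phi1(v))))

class FlowAdapter(nn.Module):
    """Sketch of the 'learnable convolutional MLP' for dense (RAFT) optical flow;
    channel widths and the residual injection into the mid-block are assumed."""
    def __init__(self, mid_channels: int = 1280):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 64, 3, stride=2, padding=1), nn.GELU(),    # 2-channel dense flow
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(256, mid_channels, 1),
        )

    def forward(self, flow: torch.Tensor, mid_feat: torch.Tensor) -> torch.Tensor:
        f = self.net(flow)                                           # flow: [B, 2, H, W]
        f = nn.functional.adaptive_avg_pool2d(f, mid_feat.shape[-2:])
        return mid_feat + f                                          # residual at the mid-block
```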
R3W1: 2D convolutions capture frame-wise changes by modeling cross-channel correlations in concatenated frame pairs, with the MSE objective enforcing local temporal coherence. The temporal position encoding indicates each pair's location in the volume, not differences between frame pairs. VISTA3D claims 128 classes covering major organs and performs well in thoracic segmentation. We will clarify this in the revision.
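A toy example of the channel-wise pairing this response describes: two adjacent latent slices are concatenated along the channel axis so that an ordinary 2D convolution mixes information across the pair. Shapes and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

z_i  = torch.randn(1, 4, 64, 64)          # latent of slice i      (assumed 4 latent channels)
z_ik = torch.randn(1, 4, 64, 64)          # latent of slice i + k
pair = torch.cat([z_i, z_ik], dim=1)      # [1, 8, 64, 64] -- frames stacked along channels

conv = nn.Conv2d(in_channels=8, out_channels=320, kernel_size=3, padding=1)
h = conv(pair)                            # a single 2D conv already mixes across the two slices
```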
R4W1: We respect your point. CT slices are scanned sequentially, not simultaneously, and are thus temporally akin to video frames. Will be clarified.
R4W2-4: Due to space limits, we modularized and abbreviated key components, though some are too interdependent for isolated ablation. The Table 1 ablations are detailed in Sec. 3.4, with row references provided; see also R2W2.2-4 for clarification. GenerateCT results are from their official Hugging Face release. Our method improves volume length, frame coherence, slice quality, and anatomical fidelity. Some GT volumes contain fringe artifacts, which can propagate into the generation. Will revise and clarify these as advised.
R4W5: At iteration n, we compute G(n) via H, F, and γ to guide x(n+1), then update G(n+1) for the next step. Notation was simplified due to space limits; we will clarify this in the revision.
R4W6-9: We label our method "3D" because videos are treated as H×W×T and our output is a volume. We follow GenerateCT's protocol and apply all metrics to full volumes; no standalone 2D metrics are used. The negative FID stems from highly similar slices causing unstable covariance estimates. Our evaluation used their released samples, but the results differ from their reported claims. We will clarify these in the revision.
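The overlapping-frame loop described in R4W5 could look roughly like the sketch below. The transforms H (smoothing), F (normalization), the gain γ, and the pair-wise sampler are all placeholders here, since their exact definitions are not given in the rebuttal.

```python
import numpy as np

def smooth(x):                              # stand-in for H (low-confidence smoothing, assumed)
    return x

def normalize(x):                           # stand-in for F (intensity normalization, assumed)
    return (x - x.mean()) / (x.std() + 1e-6)

def generate_volume(sample_pair, first_pair, num_pairs, gamma=1.0):
    """Overlapping-frame inference: the last generated slice, transformed into a guide
    G(n), conditions the synthesis of the next frame pair; only the new slice is kept."""
    volume = list(first_pair)               # [x(0), x(1)]
    guide = volume[-1]
    for n in range(1, num_pairs):
        g = normalize(smooth(gamma * guide))   # assumed composition of gamma, H and F
        x_n, x_n1 = sample_pair(g)             # one run of the pair-wise diffusion sampler
        volume.append(x_n1)                    # x(n) overlaps the guide; x(n+1) extends the scan
        guide = x_n1
    return np.stack(volume, axis=0)
```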
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The reviewers are generally in favour of accepting this paper, appreciating its novelty and comprehensive experiments. There are still remaining concerns regarding the lack of details and methodological ambiguities. The authors are encouraged to address these concerns, improve the clarity, and, if possible, release the source code for better reproducibility.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A