Abstract
We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative process: an initial score-based diffusion model synthesizes low-resolution PET/CT volumes from the demographic variables alone, providing global anatomical structures and approximate metabolic activity, followed by a super-resolution residual diffusion model that refines spatial resolution. Our framework was trained on 18F-FDG PET/CT scans from the AutoPET dataset and evaluated using organ-wise volume and standardized uptake value (SUV) distributions, comparing synthetic and real data across demographic subgroups. The organ-wise comparison demonstrated strong concordance between synthetic and real images. In particular, most deviations in metabolic uptake values remained within 3–5% of the ground truth in subgroup analysis. These findings highlight the potential of cascaded 3D diffusion models to generate anatomically and metabolically accurate PET/CT images, offering a robust alternative to traditional phantoms and enabling scalable, population-informed synthetic imaging for clinical and research applications. Code is available at https://github.com/siyeopyoon/TotalGen.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3070_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/siyeopyoon/TotalGen
Link to the Dataset(s)
N/A
BibTex
@InProceedings{YooSiy_Cascaded_MICCAI2025,
author = { Yoon, Siyeop and Song, Sifan and Jin, Pengfei and Tivnan, Matthew and Oh, Yujin and Kim, Sekeun and Wu, Dufan and Li, Xiang and Li, Quanzheng},
title = { { Cascaded 3D Diffusion Models for Whole-body 3D 18-F FDG PET/CT synthesis from Demographics } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15962},
month = {September},
pages = {100 -- 110}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a two-stage framework for generating 3D FDG PET/CT volumes conditioned on patient demographics, including sex, age, height, and weight. Specifically, a 3D diffusion model is trained to generate low-resolution volumes from a four-dimensional demographic vector, followed by an additional 3D diffusion model that upsamples the generated volume and performs patch-wise super-resolution. The paper further analyzes whether the distributions of quantitative measures — such as organ volumes (liver, heart, kidney) and SUVs — in the generated images differ statistically from those in real images at the group level.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Generating PET/CT images conditioned on demographics is a particularly challenging and important task, especially given the limited data availability in this modality and the need to accurately model not only anatomical structures but also quantitative measures such as SUVs. The paper tackles this with a reasonable modeling approach, making it a meaningful contribution.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The technical novelty of the method is limited. Using a two-stage diffusion model that first generates low-resolution images and then applies super-resolution is a well-established approach in both the general vision and medical imaging domains. It would have been more interesting if the authors had proposed a method that better captures the unique conditioning of demographic features or explicitly guides the model to pay closer attention to quantitative measures such as SUV.
- The method performs patch-based sampling and simply aggregates the patches to form a global volume. This approach often leads to boundary artifacts, especially in modalities like CT. While PET images are inherently low-resolution and heavily smoothed, making such artifacts less noticeable, they could be more problematic in CT images. Incorporating techniques such as MultiDiffusion [1] could potentially address this issue.
- In Fig. 3, the generated CT images appear not only to have blurred anatomical structures but also to exhibit a noisy texture resembling PET images. While this may be less of an issue for PET images due to their inherently noisy and smooth characteristics, it raises concerns about the quality of the generated CT images.
- Although the paper presents statistical comparisons of several quantitative measures, it remains unclear whether the generated images accurately reflect demographic characteristics such as taller height, obesity, or smaller body frame. At the very least, some qualitative results or more detailed subgroup analyses comparing statistics within and across demographic subgroups would strengthen the paper. Additionally, an ablation study comparing the use of demographics versus not using them would have been beneficial.
- Given the importance of image quality in medical image synthesis tasks, it would have been helpful to report image quality metrics such as FID, using a backbone network pre-trained on the AutoPET dataset.
- Comparisons with other medical image synthesis methods (e.g., [2]) are also lacking and should be included to better position the proposed approach relative to existing work.
[1] Bar-Tal, Omer, et al. "MultiDiffusion: Fusing diffusion paths for controlled image generation." (2023).
[2] Peng, Wei, et al. "Generating realistic brain MRIs via a conditional diffusion probabilistic model." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper addresses an important problem and demonstrates engineering effort, the lack of technical novelty, insufficient comparisons with recent methods, and concerns about the quality and consistency of the generated images lead me to recommend a reject.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
- Technical novelty and patch artifacts: While the methodological contribution still appears limited, I am willing to overlook this point when considering the work as an application study; it seems acceptable in that context.
- Concerns about image quality and the need for quantitative metrics (e.g., FID): Since this is fundamentally an image generation study, the quality of the generated images is crucial. Concerns remain regarding the perceptual quality of the outputs, and I did not find the authors' rebuttal sufficiently convincing on this matter. Additionally, unlike PSNR and SSIM, FID can be used regardless of whether the setting is paired. The authors failed to justify why FID was not included.
- Need for additional statistical comparisons and ablation studies: These are essential to validate the effectiveness of the proposed method. Again, the authors did not provide an adequate rebuttal addressing this point.
- On the image generation method [2] I referenced in the review: This approach trains a 2D diffusion model to generate slice-consistent 3D volumes through iterative inference. Since it relies solely on a 2D model, it may actually consume less GPU memory than the method proposed in this paper. Therefore, the authors' argument that a comparison was infeasible due to GPU memory limitations when generating whole 3D volumes is not convincing.
For these reasons, I recommend rejection.
[2] Peng, Wei, et al. "Generating realistic brain MRIs via a conditional diffusion probabilistic model." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.
Review #2
- Please describe the contribution of the paper
The paper proposes a cascaded 3D diffusion framework that generates realistic whole-body PET/CT volumes directly from demographic data.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Synthesizing whole-body 3D PET/CT directly from demographics is a unique and impactful direction for digital twin generation and data augmentation. The global-to-local generation strategy effectively separates structural layout and detail refinement, improving both anatomical plausibility and scalability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The framework is only validated on one dataset (AutoPET) and focuses on FDG PET/CT; generalizability to other imaging modalities or pathologies is not shown. No direct comparisons with existing phantom-based or conditional generative methods (e.g., text/image-conditioned diffusion, GANs) are provided, making it harder to assess relative performance. While interesting, conditioning only on demographics may restrict downstream clinical applicability unless further fine control (e.g., pathology, anatomical constraints) is enabled. The approach requires significant computational resources (e.g., 4x A100 GPUs, 72 hours), which may limit its accessibility for broader adoption.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a novel and well-executed framework for synthesizing whole-body PET/CT from demographics, with strong quantitative results and clear potential for data augmentation and digital twin applications. However, limited validation across datasets and modalities, and the absence of baseline comparisons, slightly weaken its impact. A strong rebuttal clarifying generalizability and comparative performance could justify acceptance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper introduces a dual-diffusion model framework for generating high-resolution 3D PET/CT volumes solely from patient demographic data; it is validated by comparing the distributions of organ volumes and uptake values between ground-truth and generated volumes. A baseline comparison with flow matching showed that the stochastic sampling in diffusion models increases variability while producing similarly high-resolution images without compromising quality.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This framework, for the first time, utilizes only demographic information to generate PET/CT images, eliminating the need for anatomical or structural image priors such as MRIs. This method also uses enhanced data augmentation while maintaining anatomical consistency via the preprocessing technique of producing multiple low-resolution, sub-sampled volumes. The generated volumes have accurate and consistent anatomical and metabolic features that are also correlated with the input demographics. The authors also performed statistical tests on the computed error percentages for the organ volumes and SUVs and show that most high error rates reported by the baseline are statistically significant.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The super-resolution model lacks technical details such as the method of patch extraction, whether the patches overlap, and how spatial information is encoded before being passed as conditional input to the diffusion model. A strong motivation for the decoupled approach of low and high resolution generation is not explained in detail. What are the performance implications when the diffusion model is trained directly on the high-resolution images? Using two separate diffusion models to generate a final image significantly increases computational costs. Future work could explore a more efficient single-step process for generating both the anatomical blueprint and the high-resolution image. No quantitative metrics have been reported for directly assessing the quality of the generated samples, such as PSNR, SSIM, or FID for CT images.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Here are additional comments that I would prefer be addressed:
How are demographic sub-groups determined for the sub-group analysis? Clear criteria and justification for their selection would be helpful.
Appropriate citations could be provided for all score-based generative equations listed.
How are continuous demographic variables encoded? Information on the embedding layer or encoder type used would help clarify the process.
The second super-resolution model is described as being conditioned on noise level. Where is this explicitly shown in equation 4?
What is the intuition behind the heart exhibiting large errors in both volume measurements and SUV values?
Typo corrections: A missing parenthesis in equation 4 and the word “raining” on page 6.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This method is one of the first works to provide a framework for complete FDG PET/CT image synthesis using demographic information alone, thus paving the way for structure-independent generation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I recommend that this paper be accepted, as the authors have addressed most of my comments. Although some experimental decisions remain unclear, such as the criteria for defining demographic subgroups and evaluating the fidelity of the generated samples against the test cohort, the core idea of generating whole-body 3D PET/CT solely from demographic information is novel and valuable.
Author Feedback
Dear Reviewers, We thank the reviewers for their insightful evaluation of our demographic-only, global-to-local 3D diffusion models, recognizing it as a “unique and impactful direction.” Below, we address key concerns and outline planned revisions.
Novelty While low-to-high-resolution diffusion is common in 2D, our work is the first to scale it to multi-modal, whole-body 3D volumes under extremely weak conditioning (age, height, weight, sex) with only 565 cases. Demographics serve as spatial priors to guide global anatomical layout. To expand training data without geometric or intensity augmentations, we use Cartesian-product sub-sampling (global stage) and 3D patch extraction (super-resolution). This two-stage strategy yields a novel, data-efficient pipeline for PET/CT synthesis.
Controllability Controllability is central to our model: omitting demographics leads to anatomies lacking demographic consistency. Subgroup analyses (Table 1) stratified by sex showed heart volume and SUV differences due to segmentation variability, which we will clarify in Section 3. No significant effects were seen across age (<65 vs. ≥65) or weight (<70 kg vs. ≥70 kg), likely due to mixed demographic influences. However, Figure 2 shows smooth transitions in torso size and adipose distribution as BMI varies from 19.6 to 24.5 (fixed age/height), supporting meaningful demographic control in our models.
Other Methods and Comparison Full-resolution 3D training was infeasible due to GPU limits; patch-wise high-resolution methods disrupt anatomical coherence (e.g., duplicated organs), and VAEs oversmooth details. Standard metrics (FID, PSNR, SSIM) are unsuitable due to the cross-modal, unpaired nature of our task; they fail to capture inter-modality alignment or clinical fidelity. Instead, we report organ volumes and SUVs agreeing within 3–5% across sex-stratified subgroups. Phantom-based models like XCAT/BOSS [1] produce only surface meshes or homogeneous intensities. MAISI [2] requires internal organ masks for both training and inference, making a direct comparison infeasible. Therefore, our design relies solely on demographics, eliminating segmentations and focusing evaluation on clinical variables.
Resource & Reproducibility Training takes 2×72 h on 4×A100 GPUs, similar to 2D EDM2 [3], but handles 224×224×384 volumes vs. 32×32 CIFAR-10 images. Inference takes ≈30 s/sample (global stage on a 2 GB GPU) and ≈6 min/sample (super-resolution on 24 GB, reducible via optimization when GPU memory is limited), highlighting our framework's efficiency. Code and models will be released upon acceptance.
Other Clarifications We will incorporate clarifications, add discussions, correct typos, and cite all relevant works in the revised version.
- PET/CT was chosen for its co-registered nature, while whole-body MRI remains scarce. As future work, we are assembling a multi-sequence MRI cohort and plan to extend to focal-lesion CT (WAW-TACE [4]) using pathology labels from LLMs.
- Position encoding follows PatchDiffusion [5], tagging each patch with normalized (–1 to 1) x,y,z coordinates. Demographics (age, height, and weight, each divided by 100) and sex (0 = female, 1 = male) are concatenated as additional channels of the input volume.
- CT noise (Fig. 3) arises from residual noise in the global stage, which uses only 35 reverse sampling steps (30 s); this noise is reduced as the number of sampling steps increases.
- For seams, volumes are partitioned into overlapping patches along z (224×224×96, with a 16-slice overlap), with averaging in the overlap regions; we will discuss advanced artifact solutions [6].
- We will add missing parentheses, include σ in Eq. 4, and cite SDE literature.
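The patch-conditioning scheme the rebuttal describes (normalized x,y,z coordinates plus scaled demographics as extra input channels) can be sketched in a few lines of NumPy. This is an illustrative reconstruction only; the function name, argument names, and shapes are assumptions, not the authors' released code.

```python
import numpy as np

def build_conditioning(patch_shape, z0, full_depth, age, height, weight, sex):
    """Illustrative conditioning channels for one z-patch.

    Per the rebuttal: per-voxel x,y,z coordinates normalized to [-1, 1]
    over the full volume, plus constant channels for age/100, height/100,
    weight/100, and a sex flag (0 = female, 1 = male).
    """
    d, h, w = patch_shape
    # z coordinates of this patch's slices within the full volume depth
    z = -1.0 + 2.0 * (z0 + np.arange(d)) / (full_depth - 1)
    y = np.linspace(-1.0, 1.0, h)
    x = np.linspace(-1.0, 1.0, w)
    zz, yy, xx = np.meshgrid(z, y, x, indexing="ij")  # each (d, h, w)
    const = lambda v: np.full((d, h, w), v, dtype=np.float64)
    # 7 conditioning channels, to be concatenated with the PET/CT channels
    return np.stack([xx, yy, zz,
                     const(age / 100.0), const(height / 100.0),
                     const(weight / 100.0), const(float(sex))])
```

In training, these channels would be concatenated with the noisy patch along the channel axis before being fed to the super-resolution network.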
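The seam-handling step (overlapping z-patches averaged where they meet) can likewise be sketched as a simple accumulate-and-normalize pass. The patch size and 16-slice overlap follow the rebuttal's description; everything else (names, array layout) is an assumed illustration, not the authors' implementation.

```python
import numpy as np

def stitch_overlapping_z(patches, starts, depth):
    """Average overlapping z-axis patches into a single volume.

    patches: list of (d, H, W) arrays produced patch-wise
    starts:  z-offset of each patch in the full volume
    depth:   number of slices in the full volume
    """
    H, W = patches[0].shape[1], patches[0].shape[2]
    acc = np.zeros((depth, H, W))        # running sum of patch values
    cnt = np.zeros((depth, 1, 1))        # how many patches cover each slice
    for patch, z0 in zip(patches, starts):
        acc[z0:z0 + patch.shape[0]] += patch
        cnt[z0:z0 + patch.shape[0]] += 1.0
    return acc / np.maximum(cnt, 1.0)    # plain average in overlap regions
```

A uniform average is the simplest choice; weighted (e.g., linearly ramped) blending or MultiDiffusion-style joint sampling [6] would further suppress seam artifacts.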
We sincerely thank the reviewers for time and constructive feedback.
Best, All authors
[1] Shetty et al., BOSS, Comput. Biol. Med. (2023)
[2] Guo et al., MAISI, WACV (2025)
[3] Karras et al., EDM2, CVPR (2024)
[4] Bartnik et al., WAW-TACE, Radiology: AI (2024)
[5] Wang et al., PatchDiffusion, NeurIPS (2023)
[6] Bar-Tal et al., MultiDiffusion, ICML (2023)
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This is strong work, but it lacks experiments, especially given that only a limited number of whole-body PET/CT datasets are available. Ablation studies would play an important role. I partially agree with R2's concern about the assessments and would urge the authors to add unpaired FID (even just for the compared organs) as a quantitative assessment. Although the organ volumes and SUV measurements are strong indicators, they reflect only one aspect of performance rather than general image quality, which is why the quality of the images shown in Fig. 3 remains uncertain. I really hope the authors can share their code as soon as possible after this stage. Despite its clear shortcomings, this work has genuine innovation, with great interdisciplinary effort behind it. I believe it is worth being seen by the MICCAI community. I vote for accepting this work, but agree that the manuscript needs improvement in its present shape.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Although the authors have adequately addressed most major concerns, remaining issues include image quality and insufficient quantitative comparisons supported by statistical analysis.