Abstract

X-ray imaging is a rapid and cost-effective tool for visualizing internal human anatomy. While multi-view X-ray imaging provides complementary information that enhances diagnosis, intervention, and education, acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address these challenges, we propose a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single view. Unlike prior methods, which are limited in angular range, resolution, and image quality, our approach leverages the Diffusion Transformer to preserve fine details and employs a weak-to-strong training strategy for stable high-resolution image generation. Experimental results demonstrate that our method generates higher-resolution outputs with improved control over viewing angles. This capability has significant implications not only for clinical applications but also for medical education and data extension, enabling the creation of diverse, high-quality datasets for training and analysis. Our code is available at https://github.com/xiechun298/SV-DRR.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4926_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/xiechun298/SV-DRR

Link to the Dataset(s)

Original Chest CT dataset (LIDC-IDRI): https://www.cancerimagingarchive.net/collection/lidc-idri/
Preprocessed CTs and DRRs (LIDC-IDRI-DRR): https://github.com/xiechun298/SV-DRR

BibTex

@InProceedings{XieChu_SVDRR_MICCAI2025,
        author = { Xie, Chun and Yoshii, Yuichi and Kitahara, Itaru},
        title = { { SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. Introduces SV-DRR, a diffusion-based model leveraging a Diffusion Transformer (DiT) to synthesize high-resolution, anatomically coherent multi-view X-ray images from a single input.
    2. Stabilizes high-resolution synthesis via progressive resolution refinement and positional embedding interpolation, addressing instability in prior methods (a sketch of such embedding interpolation follows this list).
    3. Curates LIDC-IDRI-DRR, a dataset with 1,500 views per CT scan, enabling robust training and evaluation across diverse viewpoints.
    4. Demonstrates potential to reduce radiation exposure, enhance medical education, and support sparse-view CT reconstruction.
    5. Achieves perceptual quality indistinguishable from simulated ground-truth X-rays, validated via expert evaluation.
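    For readers unfamiliar with the positional embedding interpolation mentioned in point 2, the sketch below (PyTorch) shows one common way to resize learned 2D positional embeddings when a DiT trained at a low resolution is reused at a higher one; the hidden size, patch layout, and token counts are illustrative assumptions, not details taken from the paper's code.

        import torch
        import torch.nn.functional as F

        def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
            """Resize learned 2D positional embeddings for a higher-resolution token grid.

            pos_embed: (1, old_grid*old_grid, dim) embeddings trained at the lower resolution.
            Returns:   (1, new_grid*new_grid, dim) embeddings usable at the higher resolution.
            """
            dim = pos_embed.shape[-1]
            # (1, N, D) -> (1, D, H, W) so standard image interpolation can be reused.
            grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
            grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
            return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

        # Example: a model trained on 16x16 latent tokens (e.g., 256 px images with an 8x VAE
        # downsampling and patch size 2) initializes a 32x32-token run for 512 px images.
        pos_lo = torch.randn(1, 16 * 16, 1152)
        pos_hi = interpolate_pos_embed(pos_lo, old_grid=16, new_grid=32)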
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Outperforms state-of-the-art methods (e.g., XraySyn, Zero123) in metrics like PSNR, SSIM, and FID, particularly for large angular displacements.
    2. Weak-to-strong training reduces computational overhead while maintaining consistency across resolutions (256 to 1024 pixels).
    3. Explicit view embeddings and cross-attention mechanisms enable precise control over target viewpoints (a minimal conditioning sketch follows this list).
    4. Public release of LIDC-IDRI-DRR facilitates reproducibility and future research.
    5. Expert evaluations confirm realism, with synthetic images indistinguishable from simulated ground truth (48.6% classification accuracy, near chance).
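    As a rough illustration of point 3, the sketch below shows one plausible way to inject an explicit view embedding into a transformer block through cross-attention (PyTorch); the layer sizes and the exact conditioning pathway are assumptions and do not reproduce the authors' implementation.

        import torch
        import torch.nn as nn

        class ViewConditionedBlock(nn.Module):
            """Transformer block whose cross-attention keys/values come from a view embedding."""

            def __init__(self, dim: int = 1152, heads: int = 16, view_dim: int = 4):
                super().__init__()
                # Map raw view parameters (e.g., sin/cos of azimuth and elevation) to one token.
                self.view_proj = nn.Linear(view_dim, dim)
                self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

            def forward(self, tokens: torch.Tensor, view_params: torch.Tensor) -> torch.Tensor:
                # tokens: (B, N, dim) latent patch tokens; view_params: (B, view_dim).
                view_token = self.view_proj(view_params).unsqueeze(1)            # (B, 1, dim)
                x = self.norm1(tokens)
                tokens = tokens + self.self_attn(x, x, x)[0]
                x = self.norm2(tokens)
                tokens = tokens + self.cross_attn(x, view_token, view_token)[0]  # condition on target view
                return tokens

        block = ViewConditionedBlock()
        view = torch.tensor([[0.5, 0.866, 0.0, 1.0]])          # sin/cos of azimuth and elevation
        out = block(torch.randn(1, 256, 1152), view)           # (1, 256, 1152)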
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Tested only on chest CT scans; performance on other anatomies (e.g., joints, pelvis) remains unverified.
    2. Simulated X-rays (via DiffDRR) may not fully capture real-world noise, artifacts, or clinical variability (a toy DRR sketch follows this list).
    3. Training on H100 GPUs and large batch sizes (e.g., 64 for 256px) limits accessibility for resource-constrained settings.
    4. Inference requires 20 diffusion steps, which may hinder real-time applications.
    5. Relies on SDXL’s VAE and CLIP encoders, introducing biases from non-medical pre-training.
    6. No explicit evaluation of anatomical alignment across synthesized views, critical for 3D reconstruction tasks.
    7. Excludes CT scans with slice thickness >2.5mm, reducing diversity.
    8. Ground-truth X-rays are synthetic, lacking validation against real clinical data.
    9. Limited detail on the achievable angular range (e.g., maximum viable azimuth/elevation).
    10. No ablation studies to isolate contributions of DiT vs. weak-to-strong training.
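    As background for point 2, the toy sketch below illustrates the basic idea behind a digitally reconstructed radiograph (DRR): integrating attenuation along rays through a CT volume. It uses a naive parallel-beam sum rather than DiffDRR's differentiable cone-beam projector, so it conveys the concept only.

        import numpy as np
        from scipy.ndimage import rotate

        def toy_drr(ct_hu: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
            """Parallel-beam DRR: rotate the volume to the target view, then integrate along one axis.

            ct_hu: CT volume in Hounsfield units, shape (Z, Y, X). Returns a 2D image in [0, 1].
            """
            mu = np.clip(ct_hu + 1000.0, 0.0, None)                        # crude HU -> attenuation proxy
            vol = rotate(mu, azimuth_deg, axes=(1, 2), reshape=False, order=1)
            vol = rotate(vol, elevation_deg, axes=(0, 2), reshape=False, order=1)
            line_integral = vol.sum(axis=2)                                # integrate along the beam direction
            drr = 1.0 - np.exp(-line_integral / line_integral.max())       # Beer-Lambert style intensity
            return (drr - drr.min()) / (np.ptp(drr) + 1e-8)

        ct = np.random.randint(-1000, 1000, size=(64, 64, 64)).astype(np.float32)  # stand-in volume
        projection = toy_drr(ct, azimuth_deg=30.0, elevation_deg=10.0)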
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The reliance on simulated data, unproven generalization to real-world clinical settings, and high computational costs limit immediate practical impact. Key technical gaps (e.g., cross-view consistency, anatomical diversity) require further investigation. While groundbreaking in methodology and results, the work’s clinical applicability remains hypothetical until validated on real X-rays and diverse anatomies.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Comprehensive answer to the review comments.



Review #2

  • Please describe the contribution of the paper

    The paper introduces a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single 2D X-ray input. The proposed model follows the Latent Diffusion Model framework and incorporates explicit view embeddings and latent space conditioning for multi-view generation. This approach addresses key limitations in prior methods by enhancing angular coverage and resolution, thereby significantly expanding the range of synthesizable viewpoints. Evaluation against the state-of-the-art methods demonstrates the superior performance of the proposed SV-DRR model.

    The paper also presents a large dataset comprising multi-view X-ray images with precise view annotations, which the authors intend to release publicly.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The user study involving 15 board-certified medical experts and clinical practitioners validates the anatomical accuracy and clinical relevance of the generated novel views.

    2. By incorporating explicit view embeddings and latent space conditioning, the proposed model generates high-quality and anatomically accurate outputs from a single 2D X-ray image. Clinically, this capability to synthesize high-resolution, multi-view X-rays has the potential to reduce patient exposure to X-rays, streamline clinical workflows, and improve diagnostic accuracy by providing comprehensive anatomical views from multiple perspectives.

    3. The paper is well-written and includes visual illustrations that effectively communicate the proposed method and its results.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The method was only evaluated on one dataset consisting exclusively of chest X-rays. While the authors aim to create views for other body parts, such as the knee, shoulder, or pelvis, the evaluation does not include these regions. Publicly available datasets like CTSpine1k and CTPelvic1k could have been used to broaden the scope of evaluation.

    2. The paper does not address whether the dataset represents diverse patient populations. For example, it is unclear how the method performs for patients with disabilities (e.g., scoliosis) in novel view synthesis for spine X-rays. This could be evaluated using datasets like the CT Spine Scoliosis dataset (https://huggingface.co/datasets/TrainingDataPro/ct-of-the-spine-scoliosis). Similarly, there is no mention of its performance on pediatric patients, which could be assessed using datasets like Pediatric-CT-SEG.

    3. The authors always set the source image to Posterior Anterior (PA) view, which is standard for chest X-rays, but there is no evaluation of how the method performs when the source view is changed to Anterior Posterior (AP), lateral, or oblique views, which are more common for other body regions.

    4. The evaluation only compares against Zero-1-to-3 and Zero123-XL, despite known limitations in these models (i.e., inconsistent and implausible multi-view outputs). More recent and advanced single-image novel view synthesis methods—such as Consistent-1-to-3 (Ye, Jianglong, et al.), ViVid-1-to-3 (Kwak, Jeong-gi, et al.), Free3D (Zheng, Chuanxia, and Andrea Vedaldi.), and Zero-to-Hero (Sobol, Ido, Chenfeng Xu, and Or Litany.)—were not included in the comparisons. These methods have demonstrated superior performance compared to Zero-1-to-3 and Zero123-XL and should have been considered to strengthen the evaluation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. It would be valuable to compare the proposed method with other state-of-the-art models, such as X-Gaussian and DVG-Diffusion, in terms of output quality and computational efficiency. This could provide deeper insights into the relative strengths and weaknesses of the approach.

    2. Exploring the impact of conditioning on additional scanner parameters, such as Tube Voltage and Focal Spot Size, might lead to further improvements in the quality of the generated outputs.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper demonstrates strong performance in predicting novel views from a 2D X-ray image, outperforming existing models. This is supported by expert-validated evaluations, which confirm the anatomical accuracy and clinical relevance of the generated outputs. These contributions have the potential to improve radiology practices by reducing patient radiation exposure and enhancing diagnostic workflows. Additionally, the authors’ publicly released dataset could facilitate further research and innovation in this area.

    However, the paper’s impact and rigor could have been strengthened by evaluating the method on more diverse datasets, including other body regions, patients with disabilities (e.g., scoliosis), and pediatric populations. Furthermore, comparisons with more recent and superior novel view synthesis methods would provide a clearer understanding of its relative performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My original positive impression of the paper remains after having read all other reviews and the authors’ response.



Review #3

  • Please describe the contribution of the paper

    This is a nice topic of research: novel view synthesis conditioned on an AP X-ray view and a target view angle. The network uses a denoising diffusion transformer within a VAE latent space and can produce high-resolution generated images (256, 512, and 1024 pixels). The network is trained on projections derived from 889 CT scans and is assessed on 36 (or 1499) views generated from 16 CT scans. Synthesized views reached SSIM values ranging from 0.73 to 0.75, and the authors claim that the proposed method outperforms the SOTA methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -The developed method has high potential for clinical applications

    -It appears to be the first time that a VAE and a Diffusion Transformer have been combined in this setting, moreover making use of the denoising time step

    -Good SSIM results achieved for synthesized views at high resolution

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    -It is unclear how the frozen parts of the network are pre-trained

    -It is unclear what is trainable in the block “Projection” (Fig. 1)

    -The error distribution per angle is missing: for which viewing angles are the errors largest, for instance?

    -Evaluation: since the model is able to predict multiple views, the authors would be expected to also provide a 3D reconstructed volume and analyze performance on 3D volume reconstruction (as SOTA methods often do, using 3D SSIM and PSNR).

    -It is not specified where the real X-ray images come from, nor their characteristics (imaging device, etc.).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    -Providing training time and inference time would be useful technical information
    -Please define the abbreviation DiT
    -Introduction: before the three statements (bullet points), a sentence introducing the bullets is missing
    -AdmW: typo -> AdamW

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper could still be improved by gaining clarity in the method description and by adding 3D volume reconstruction metrics for the performance evaluation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Although the authors provided answers to the reviewers’ concerns in the rebuttal, they did not specify which modifications they plan to make to further improve the paper (the paper still requires improvements).

    Since the proposed method outperforms the SOTA and an impressive panel of experts was enrolled for visual inspection of the generated X-rays, this paper brings interesting information to the scientific community about X-ray novel view synthesis.




Author Feedback

We sincerely thank the reviewers for their thoughtful feedback. We are encouraged by the recognition of our work’s novelty, strong methodology, and clinical potential. We address the raised concerns below.

  1. Evaluation limited to chest X-rays / Data diversity (R1,R3) Our framework is designed to be anatomy-agnostic. The significant improvement shown in Tab. 1 and Fig. 2 indicates that, if trained for other anatomies, our model would also outperform the baselines. To demonstrate view flexibility, we tested hemisphere views on chest X-rays; this flexibility is transferable to other anatomies due to the anatomy-agnostic design. For fairness, we adopted the same data and preprocessing as XraySyn. Following XraySyn, scans with slice thickness > 2.5 mm were excluded to obtain clearer DRRs (Sec. 4.1), aligning with our goal of generating high-fidelity images. As the code and dataset will be released, we respectfully suggest that our model can serve as a strong prior, which the community may further adapt to diverse anatomies, disabilities, or pediatric cases in future applications.

  2. Use of simulated X-rays vs. real clinical images (R3) As collecting annotated multi-view X-rays is difficult, we followed prior practice (e.g., XraySyn, MedNeRF) by using DRRs. We have tested real X-ray inputs in Sec. 4.4, and more results will be released. Expanding validation on real data is key future work.

  3. Comparison to newer methods (R1) Thank you for pointing this out. Although the mentioned newer methods have improved on the view consistency of Zero123-XL, they primarily target the natural image domain. Fig. 2 indicates that the domain-bias issue is significantly more severe than the view-consistency issue; therefore, these methods are unlikely to produce more competitive results.

  4. Pre-training details and biases (R2,R3) We use SDXL’s VAE and CLIP (Sec. 4.2) to leverage their zero-shot capability, a strategy widely adopted in related work such as Zero123. The results in Fig. 2 show that potential biases can be corrected by the trainable modules. We agree that medical pre-training may further improve performance and plan to explore this direction in future work.
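  A minimal sketch of how SDXL's VAE and a CLIP image encoder are typically loaded as frozen feature extractors with the Hugging Face diffusers/transformers packages; the checkpoints and shapes shown are common defaults, not necessarily the exact configuration used in the paper.

      import numpy as np
      import torch
      from PIL import Image
      from diffusers import AutoencoderKL
      from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

      # Frozen encoders: the VAE maps the input X-ray into the latent space the DiT denoises in,
      # and the CLIP vision tower yields a global image embedding used for conditioning.
      vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval()
      clip = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()
      processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

      xray = Image.fromarray((np.random.rand(512, 512) * 255).astype(np.uint8)).convert("RGB")   # placeholder image
      pixels = torch.from_numpy(np.asarray(xray)).float().permute(2, 0, 1)[None] / 127.5 - 1.0   # (1, 3, 512, 512)

      with torch.no_grad():
          latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor   # (1, 4, 64, 64) latent map
          image_embed = clip(**processor(images=xray, return_tensors="pt")).image_embeds  # (1, 768) global condition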

  5. Cross-view consistency and 3D reconstruction evaluation (R2,R3) Thanks for the insightful suggestion. We currently focus on high-fidelity novel view synthesis, but improving cross-view consistency is an important future direction for 3D reconstruction.

  6. Image source and technical clarifications (R2) The projection block is a learnable linear layer. Real X-rays are from Radiopaedia. We will clarify these points in the camera-ready version.

  7. Computational cost (R3) Training requires a powerful GPU, but inference is efficient (~2 s, 12 GB VRAM). Acceleration (e.g., LCM) is feasible, though it may affect quality and requires careful tuning. Despite this, the method remains deployable on consumer GPUs.

  8. Source view, angular range, and error distribution (R1,R2,R3) We use PA views as the source to align with chest X-ray practice and ease the training, but our method can be adapted to arbitrary source views for other anatomies with appropriate training. We covered azimuth and elevation ranges of ±90°, and the model is scalable to full 360° generation if needed. Typically, larger angular displacements cause larger errors due to increased missing information. We agree that analyzing per-angle trends is a meaningful extension.
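  A small sketch of one plausible way to encode the azimuth/elevation of a target view as a conditioning vector; the sin/cos parameterization is a common choice in view-conditioned diffusion models (e.g., Zero-1-to-3) and is an assumption here, not necessarily the paper's exact encoding.

      import math
      import torch

      def view_embedding(azimuth_deg: float, elevation_deg: float) -> torch.Tensor:
          """Encode a target view relative to the source (e.g., PA) as a 4-D conditioning vector.

          sin/cos keeps the encoding smooth and unambiguous over the +/-90 degree range.
          """
          az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
          return torch.tensor([math.sin(az), math.cos(az), math.sin(el), math.cos(el)])

      # Example: request a view 45 deg of azimuth and 10 deg of elevation away from the source view,
      # then feed the vector to a learnable projection layer (cf. point 6 above).
      cond = view_embedding(45.0, 10.0)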

  9. Lack of ablation study (R3) While an ablation study would further strengthen the analysis, our design is grounded in prior findings: DiT generally outperforms UNet-based diffusion models, and weak-to-strong training is validated in studies like PixArt-Σ. We also attempted direct training at 1024 resolution, which failed to converge. Detailed ablations are planned for future work.

  10. Optional Comments (R1, R2) We will correct typos and add training/inference time information; we will also consider DVG-Diffusion and X-Gaussian in future evaluations and explore conditioning on scanner parameters.

We sincerely thank the reviewers again for their feedback.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers were satisfied with the rebuttal and found the paper impressive. The authors should address the remaining concerns in the camera-ready version.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


