Abstract

Panoramic X-ray (PX) is a prevalent modality in dental practice owing to its wide availability and low cost. However, as a 2D projection image, PX does not contain 3D anatomical information and therefore has limited use in dental applications that can benefit from 3D information, e.g., tooth angular misalignment detection and classification. Reconstructing 3D structures directly from 2D PX has recently been explored to address this limitation, with existing methods relying primarily on Convolutional Neural Networks (CNNs) for direct 2D-to-3D mapping. These methods, however, are unable to correctly infer depth-axis spatial information. In addition, they are limited by the intrinsic locality of convolution operations, as convolution kernels only capture information from immediately neighboring pixels. In this study, we propose 3DPX, a progressive hybrid Multilayer Perceptron (MLP)-CNN pyramid network for 2D-to-3D oral PX reconstruction. We introduce a progressive reconstruction strategy, in which 3D images are progressively reconstructed in 3DPX with guidance imposed on the intermediate reconstruction result at each pyramid level. Further, motivated by recent advances in MLPs that show promise in capturing fine-grained long-range dependencies, 3DPX integrates MLPs and CNNs to improve semantic understanding during reconstruction. Extensive experiments on two large datasets involving 464 studies demonstrate that 3DPX outperforms state-of-the-art 2D-to-3D oral reconstruction methods, including standalone MLPs and transformers, in reconstruction quality, and also improves the performance of downstream angular misalignment classification tasks.
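
The abstract describes the hybrid MLP-CNN block only at a high level. As a rough illustration of what such a block can look like, a minimal PyTorch-style sketch follows: a convolutional branch for local detail paired with a token-mixing MLP branch for long-range spatial dependencies. The class name, branch design, and dimensions are assumptions for illustration, not the authors' exact 3DPX architecture.

```python
# Minimal sketch of a hybrid MLP-CNN block (illustrative only; not the
# authors' exact 3DPX design). A 3x3 convolutional branch captures local
# detail, while an MLP mixes information across all spatial tokens to
# model long-range dependencies.
import torch
import torch.nn as nn

class HybridMLPCNNBlock(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # Local branch: standard 3x3 convolution.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: MLP mixing across all H*W spatial tokens.
        tokens = height * width
        self.token_mlp = nn.Sequential(
            nn.Linear(tokens, tokens // 4),
            nn.GELU(),
            nn.Linear(tokens // 4, tokens),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.conv_branch(x)
        b, c, h, w = x.shape
        # Flatten the spatial grid to tokens, mix them, and restore the shape.
        mixed = self.token_mlp(x.flatten(2)).view(b, c, h, w)
        return x + local + mixed  # residual fusion of both branches

# Example: one block at a 32x64 feature resolution with 128 channels.
block = HybridMLPCNNBlock(channels=128, height=32, width=64)
out = block(torch.randn(2, 128, 32, 64))
```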

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2442_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2442_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_3DPX_MICCAI2024,
        author = { Li, Xiaoshuang and Meng, Mingyuan and Huang, Zimo and Bi, Lei and Delamare, Eduardo and Feng, Dagan and Sheng, Bin and Kim, Jinman},
        title = { { 3DPX: Progressive 2D-to-3D Oral Image Reconstruction with Hybrid MLP-CNN Networks } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new architecture for 3D oral reconstruction that effectively improves upon the performance of state-of-the-art models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method uses a progressive hybrid MLP-CNN architecture that improves the reconstruction of 3D oral structures compared with the baselines.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experiments omit comparisons with baseline models that use other kinds of hybrid architectures, such as CNN+Transformer.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Could the proposed method be extended to some other cross-dimension transfer work like [1]?
    2. Is there any comparison with other hybrid architecture, like CNN+Transformer?

    [1] Ying, Xingde, et al. “X2CT-GAN: reconstructing CT from biplanar X-rays with generative adversarial networks.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the hybrid architecture of CNN and MLP has been proposed in many earlier works, this method has rarely been evaluated on cross-dimensional transfer models. This paper fills the gap by demonstrating the effectiveness of such a model in the task of oral reconstruction.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a new U-Net-based network for the 2D to 3D reconstruction of panoramic X-ray images. The proposed method includes 1) hybrid MLP-CNN blocks in the decoder of U-Net, 2) progressive guidance of the intermediate layers.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s main strength lies in its novel approach of incorporating MLP-CNN blocks into the U-Net decoder and utilizing progressive guidance of the intermediate layers to enhance 3D reconstructions. It employs two large datasets to compare PSNR and SSIM metrics across various methods. Additionally, ablation studies are conducted to demonstrate the enhancements provided by the proposed progressive guidance and MLP-CNN blocks. The downstream applications of the 3D reconstructions, such as jaw bone segmentation and angular misalignment classification tasks, are also analyzed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper is difficult to follow regarding the various referenced methods. For instance, methods like UNETR and MLP-Unet are used for comparison, but their references are missing. Furthermore, the standard deviations are not shown for the metrics, which makes it difficult to compare the results effectively.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The network is described sufficiently to get started. One dataset was sourced from an open-access source, while the other dataset is described as private.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I cannot find references for the compared methods listed in Table 1, except for Oral 3D. Are the remaining methods custom-built? If so, it might be clearer to present Table 1 as part of an ablation study. Additionally, the DSC scores for UNet-based models consistently increase alongside PSNR and SSIM, yet the Residual-based models exhibit lower DSC scores than the UNet-based ones despite higher PSNR and SSIM values. Could the authors explain this discrepancy?

    On page 7, the text states that ‘With the progressive intermediate guidance, our 3DPX achieved the highest score on all three metrics.’ However, Table 2 indicates that the highest DSC score belongs to the 3D Decoder UNet. Furthermore, Table 2 lacks references for UNETR and MLP UNet, and these references are not mentioned elsewhere in the paper. There also appears to be a disconnect between the DSC scores and PSNR/SSIM in Table 2, which warrants discussion. Including details about the number of parameters and training/inference times would also enhance the comparison of models.

    The DSC score can depend on how the ground truth was generated; this could be clarified in the study.

    Possible future improvements: comparison with a deep supervision loss, and reporting test results separately for the two datasets.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main issues were the difficulty of following the article, missing references, and the absence of error measures (e.g., standard deviations). This might be due to the limited space available, given that a large amount of good work appears to have been done.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This study proposes a method for reconstructing 3D volumes from 2D panoramic X-ray (PX), overcoming the limitations of 2D imaging in revealing 3D anatomical information. The study makes two main contributions: 1) a progressive reconstruction strategy, in which 3D images are progressively reconstructed with guidance imposed on the intermediate reconstruction results at each pyramid level; and 2) a Hybrid MLP-CNN Block that integrates the advantages of MLPs and CNNs, allowing the capture of long-range visual dependencies and subtle details and thereby improving semantic understanding during reconstruction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper is generally well written and structured, and it is easy to understand the purpose, the improvements over previous research (CNN and Oral-3D), and the challenges addressed (e.g., reconstruction of back details). 2) The progressive reconstruction strategy is an interesting and meaningful approach, utilizing 3D images as progressive guides for the intermediate reconstructions at each pyramid level. 3) The Hybrid MLP-CNN Block, which integrates the advantages of MLPs and CNNs, can capture long-range visual dependencies, improving semantic understanding during the reconstruction phase. 4) 3DPX, a progressive hybrid MLP-CNN pyramid network for 2D-to-3D oral PX reconstruction, demonstrates superior reconstruction performance compared to CNN, GAN, and Oral-3D baselines.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) In clinical practice, actual PX images usually have resolutions above 900 × 1500. However, the authors use synthesized PX images of size 128 × 256. Regarding clinical feasibility, is the proposed method applicable to actual PX images?

    2) The low-level and latent features extracted by ordinary CNNs and U-Nets are often not intuitively understandable to humans. The progressive guidance with the sum-of-squared-errors loss forces the intermediate feature maps at each depth level to follow the scaled Y. Although the complex, deep layers are not explainable, they are generally believed to output more meaningful features as the network processes the data. I wonder whether it makes much sense to force the intermediate output feature maps to follow the downscaled Y. What do the authors think?
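
    For concreteness, the guidance objective referred to above can be written in a generic form as below. This is a hedged reconstruction from the review's description (intermediate reconstructions regressed to the ground truth downscaled to each pyramid level, with a sum-of-squared-errors penalty); it is not necessarily the paper's exact Eq. 1.

```latex
% Hedged, generic form of the progressive intermediate guidance discussed above.
% B_l(x): intermediate reconstruction at pyramid level l; down_l(Y): ground-truth
% volume Y downscaled to that level's resolution; L pyramid levels in total.
\[
  \mathcal{L}_{\mathrm{sse}} \;=\; \sum_{l=1}^{L} \bigl\lVert B_l(x) - \mathrm{down}_l(Y) \bigr\rVert_2^2
\]
```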

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    One of the main contributions, Progressive Guided Reconstruction, requires more detailed information about the scaled label Y. Although all decoder blocks have an identical output channel size of 128, there is a variation in the channel size per depth level at the encoder. The Reviewer thinks it is hard to reproduce the Progressive Guided Reconstruction method due to the lack of details about the output channel and activation function of B(x) and scaled Y in Fig 2 (red arrows).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) The authors need to correct typographical errors in quotations and formula abbreviations (for example, revise L_see to L_sse in Eq. 1).

    2) The SSIM (Structural Similarity Index) loss can be used to learn structural information [1]. [1] Zhao, Hang, et al. “Loss functions for image restoration with neural networks.” IEEE Transactions on computational imaging 3.1 (2016): 47-57.

    3) The Reviewer recommends embedding the diffusion process at the high-level feature (orange arrow in Fig 2), which can improve the volume stability and quality after reconstruction [2]. [2] Croitoru, Florinel-Alin, et al. “Diffusion models in vision: A survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although some limitations for clinical feasibility remain, this paper provides a novel approach to 2D-to-3D reconstruction between synthetic PX and real CBCT images.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Reviewer #1

Q6 & Q10-2: Missing comparison with baseline hybrid models.
A: We did compare our model with UNETR, a hybrid transformer-CNN architecture, in the paper. Unfortunately, we accidentally omitted the UNETR reference, which has now been corrected.

Q10-1: Can the method be extended to other cross-dimension transfer work like X2CT-GAN [1]?
A: Yes, the proposed method can be extended to X2CT-GAN by substituting the decoder blocks with hybrid MLP-CNN blocks and adding the intermediate guidance. In general, variants of encoder-decoder structures are compatible with our techniques. In future work, we will conduct experiments to demonstrate this capability.

Reviewer #3

Q6-1: Clinical PX images usually have resolutions over 900 × 1500, while the synthesized PX images are of size 128 × 256.
A: The straightforward solution to this resolution discrepancy is to down-sample the clinical or synthesized PX when the method is applied. This is a common strategy to improve computational efficiency across the X-ray analysis domain [1] (X2CT-GAN). In future work, we propose to close the information gap between synthesized and clinical PX by exploring domain adaptation methods together with realistic CBCT resampling to simulate the clinical PX resolution.

Q6-2: Understandability/explainability of features; should the intermediate output feature maps follow the downscaled Y?
A: The manipulation of intermediate features is the core concept of the proposed progressive guidance. In most deep learning tasks, e.g., classification and regression, networks are trained to focus on high-level semantic features that are abstract, because the network must ‘understand’ the scene to make predictions. The 2D-to-3D task, however, has a straightforward target that remains consistent throughout the reconstruction process, in both shallow and deep layers.

Q9: Progressive Guided Reconstruction (PGR) requires more detailed information about the scaled label Y.
A: As suggested, we have clarified the details of the PGR as follows: the intermediate guidance denoted by the red arrows in Fig. 2 is the scaled label Y. In every layer of the encoder and decoder, the label Y is reshaped to the size of the intermediate feature for a proper comparison.

Q10-1: Typographical errors.
A: We have corrected the errors and carefully proofread the paper.

Q10-2: SSIM loss can be used to learn structural information [2] (Zhao, Hang, et al.).
A: As suggested, we will consider SSIM as part of the loss function in future experiments.

Q10-3: Embedding the diffusion process at the high-level feature (orange arrow in Fig. 2) could improve volume stability and quality after reconstruction [3] (Croitoru, Florinel-Alin, et al.).
A: As suggested, we will investigate the diffusion process in our 2D-to-3D reconstruction tasks in future work. Thanks!

Reviewer #4

Q6 & Q10-1 & Q10-3: The paper is difficult to follow due to missing references in Tables 1 and 2.
A: The missing reference for UNETR and the unclear information about MLP-Unet have now been clarified. MLP-Unet is a custom-built UNet with MLP blocks used for comparison, and therefore has no reference. The methods without references in Table 1 are also custom-built for comparison. The references have been updated accordingly, and the description of the custom-built comparison methods has been clarified.

Q10-2: Disconnect between DSC scores and PSNR/SSIM across UNet-based and Residual-based models.
A: The discrepancy arises because PSNR and SSIM measure different aspects of model performance than DSC. While Residual-based models may produce images that are perceptually and pixel-wise closer to the ground truth (higher PSNR and SSIM), they may not be as effective at precise segmentation (lower DSC) as UNet-based models, which are specifically designed for segmentation tasks. We will add a statement to this effect in the revised paper.
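
To make the clarified PGR mechanism concrete, a minimal sketch under stated assumptions follows: intermediate reconstructions are collected per pyramid level, the ground-truth volume Y is resized to each level's shape (trilinear interpolation is an assumption; the rebuttal only says Y is reshaped), and a sum-of-squared-errors term is accumulated across levels. All names are illustrative; no official code is released.

```python
# Minimal sketch of the progressive intermediate guidance described in the
# rebuttal: the label Y is reshaped to the size of each intermediate
# reconstruction and a sum-of-squared-errors term is accumulated per level.
# Names and the resizing choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def progressive_guidance_loss(intermediates, y):
    """intermediates: list of per-level reconstructions, each (B, 1, D, H, W);
    y: full-resolution ground-truth volume of shape (B, 1, D, H, W)."""
    loss = 0.0
    for b_l in intermediates:
        # Downscale Y to this level's spatial size for a like-for-like comparison.
        y_l = F.interpolate(y, size=b_l.shape[2:], mode="trilinear", align_corners=False)
        loss = loss + torch.sum((b_l - y_l) ** 2)
    return loss

# Example with three pyramid levels of a 64^3 ground-truth volume.
y = torch.randn(1, 1, 64, 64, 64)
levels = [torch.randn(1, 1, s, s, s) for s in (16, 32, 64)]
print(progressive_guidance_loss(levels, y))
```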




Meta-Review

Meta-review not available, early accepted paper.


