Abstract

Three-dimensional reconstruction of soft tissues from stereoscopic surgical videos is crucial for enhancing various medical applications. Existing methods often struggle to generate accurate soft tissue geometries or suffer from slow network convergence. To address these challenges, we introduce SDFPlane, an innovative method for fast and precise geometric reconstruction of surgical scenes. This approach efficiently captures scene deformation using a spatial-temporal structure encoder and combines an SDF decoder with a color decoder to accurately model the scene’s geometry and color. Subsequently, we synthesize color images and depth maps with SDF-based volume rendering. Additionally, we implement an error-guided importance sampling strategy, which directs the network’s focus towards areas that are not fully optimized during training. Comparative analysis on multiple public datasets demonstrates that SDFPlane accelerates optimization by over 10× compared to existing SDF-based methods while maintaining state-of-the-art rendering quality. Code is available at: https://github.com/IRMVLab/SDFPlane.git
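
As a rough illustration of the SDF-based volume rendering mentioned in the abstract, the following Python sketch composites color and depth along a single ray. It assumes a NeuS-style conversion from SDF values to opacity (the author feedback further down names NeuS as the source of the SDF rendering). The function names, the fixed sharpness `s`, and the per-interval colors are hypothetical choices made for illustration and are not taken from the released SDFPlane code.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def render_ray(sdf, t, colors, s=50.0):
    """Composite one ray from SDF samples (NeuS-style sketch).

    sdf:    SDF values at N+1 sample points, ordered near to far
    t:      distances of those N+1 points along the ray
    colors: N RGB tuples, one per interval between consecutive points
    s:      sharpness of the logistic CDF (assumed fixed here)
    """
    cdf = [sigmoid(s * d) for d in sdf]
    transmittance = 1.0
    weights = []
    for i in range(len(sdf) - 1):
        # Opacity of interval i: relative drop of the logistic CDF, clipped at 0.
        alpha = max((cdf[i] - cdf[i + 1]) / (cdf[i] + 1e-8), 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    # Color and depth are weighted sums over the intervals.
    rgb = tuple(sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3))
    mids = [(t[i] + t[i + 1]) / 2.0 for i in range(len(t) - 1)]
    depth = sum(w * m for w, m in zip(weights, mids))
    return rgb, depth, weights
```

For a ray whose SDF samples change sign between two consecutive points, almost all of the weight falls on the interval containing the zero crossing, so the rendered depth lands close to the surface.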

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2098_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2098_supp.zip

Link to the Code Repository

https://github.com/IRMVLab/SDFPlane

Link to the Dataset(s)

https://github.com/med-air/EndoNeRF https://zenodo.org/records/8154924

BibTex

@InProceedings{Li_SDFPlane_MICCAI2024,
        author = { Li, Hao and Shan, Jiwei and Wang, Hesheng},
        title = { { SDFPlane: Explicit Neural Surface Reconstruction of Deformable Tissues } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work, the authors present a framework for 3D+t reconstruction of soft tissues from stereoscopic surgical videos. For a stereoscopic surgical video, a network is trained to compute the color and the signed distance function (SDF) to the closest surface, allowing a faster and better-quality reconstruction of the soft tissue surfaces. The method is evaluated on two public datasets and compared to several SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    – Table 1 shows that the proposed framework offers an improvement in quality with respect to SOTA methods

    – The proposed method is also 10x faster to compute than the SOTA methods, except for Lerplane, which is 3x faster.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – Several parts of the framework are direct uses of building blocks from the SOTA, which is normal, but what is specific and innovative is not always easy to find in the paper without checking the original papers from the literature. Some sentences are also misleading: for example, at the end of the intro, in the “Our contributions include” paragraph, contribution 2 was already presented in [17] and [21].

    – The method is faster than all but one (Lerplane) of the SOTA methods compared. A paragraph on the computation time revealing what makes it faster is missing.

    – What is the convergence criterion used for the different methods? I wonder if optimizing Lerplane for the same time as the proposed method would lead to similar results.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • To highlight the differences between this paper and the SOTA, it would be interesting to add a table (in the supplementary, for example) that shows the features (SDF vs. density, spatio-temporal, computation time, …) for the current method and the SOTA methods.

    MINOR:

    2.2 Error-Guided importance Aampling -> Sampling

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    see weakness

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is interesting as it offers improved performance. Some clarifications (especially with regard to the convergence criterion and the comparison to Lerplane) in the rebuttal period are, however, necessary before acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I don’t really know what the conference acceptance policy is for this case. I would accept the paper, but only if the authors modify their paper regarding the claims about contributions that are already present in the SOTA and the incorrect formulations. For example:

    • end of the intro, in the “Our contributions include” paragraph, the contribution 2 was already presented in [17] and [21].
    • point (1) in section 6 Rev 4
    • Rev 5 : “training time is slightly slower than Lerplane[20]” - It’s more than 3 times slower than Lerplane so I don’t think the word “slightly” is appropriate here.



Review #2

  • Please describe the contribution of the paper

    The authors introduce SDFPlane, a method for geometric reconstruction of surgical scenes.
    A spatial-temporal structure encoder is employed for capturing scene deformation, and an SDF decoder is combined with a color decoder to accurately infer the scene geometry and color. Additionally, an error-guided importance sampling strategy is used to direct the network’s focus towards areas that are not correctly optimized during training. Experimental results on two public datasets demonstrate state-of-the-art rendering quality and 3D reconstruction with low training times.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The positive aspects are: (1) the paper is, in general, well written and simple to follow; (2) the proposed paper tackles two clear problems in the literature: improve 3D reconstruction of deformable tissues in stereoscopic surgical videos + improve training efficiency using SDF based NeRF. (3) the experimental results are very encouraging, show state-of-the-art performance in synthetic image rendering and 3D reconstruction in two public datasets (EndoNeRF dataset [18] and the StereoMIS dataset [7]). Also, the training time is considerably reduced with respect to other SDF-based approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The paper introduces a novel architecture for NeRF-based synthetic view rendering and 3D reconstruction. However, many of its components seem to be adaptations of existing approaches in the literature:

    • The Spatial-temporal Structure Encoder, employing a six-plane framework, is derived from prior work [5,2] and Lerplane [20]. The paper fails to acknowledge that Lerplane [20] also seems to be using the same six multi-resolution planes. Moreover, the method for feature vector extraction using bilinear interpolation (Equations 2 and 3) resembles that presented in Section 2.3 (Equation 3) of [20]. As a reader, it’s difficult to distinguish the relevant novel contributions from what has been done in e.g. [20].
    • Regarding the SDF-based volume rendering and optimization, apart from the inclusion of normals in Equation 4, it’s unclear how this differs from existing approaches.
    • The error-guided importance sampling using color loss appears to be a simple, yet relevant contribution.

    (2) Although the authors mention a reduction in the training time of SDF-based approaches, it’s essential to note that in many applications, the bottleneck lies in rendering time and 3D reconstruction time rather than training time. Unfortunately, the paper lacks information on rendering time, raising doubts about its practicality in clinical settings.

    (3) Gaussian Splatting [A] was introduced last year as a more efficient alternative to NeRFs. It would have been relevant to understand the pros and cons of the proposed NeRF approach with respect to Gaussian Splatting.

    [A] “3D Gaussian Splatting for Real-Time Radiance Field Rendering”, Kerbl et al., ACM Transactions on Graphics, 2023
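
For readers unfamiliar with the six-plane encoding discussed in point (1) above, the following Python sketch shows the general idea: a 4D point (x, y, z, t) is projected onto the six axis-aligned planes, each plane is queried with bilinear interpolation, and the six feature vectors are fused. It is an illustration under assumptions, not the paper's Equations 2 and 3: the fusion by element-wise product, the single resolution, and all names (`bilerp`, `query_feature`, `AXIS_PAIRS`) are hypothetical.

```python
from itertools import combinations

# The six axis pairs of a 4D point (x, y, z, t): xy, xz, xt, yz, yt, zt.
AXIS_PAIRS = list(combinations(range(4), 2))


def bilerp(plane, u, v):
    """Bilinearly interpolate a grid of feature vectors at (u, v) in [0, 1].

    plane[row][col] is a list of C feature channels; the grid needs at
    least 2 rows and 2 columns.
    """
    h, w = len(plane), len(plane[0])
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - fx) + plane[y0][x0 + 1][c] * fx
        bot = plane[y0 + 1][x0][c] * (1 - fx) + plane[y0 + 1][x0 + 1][c] * fx
        out.append(top * (1 - fy) + bot * fy)
    return out


def query_feature(planes, point):
    """Fuse the six plane features for one normalised point (x, y, z, t).

    planes: dict mapping each axis pair in AXIS_PAIRS to a feature grid.
    Fusion by element-wise product is an assumption of this sketch.
    """
    feat = None
    for a, b in AXIS_PAIRS:
        f = bilerp(planes[(a, b)], point[a], point[b])
        feat = f if feat is None else [p * q for p, q in zip(feat, f)]
    return feat
```

A multi-resolution variant would repeat this lookup on grids of several sizes and concatenate the results before passing them to the decoders.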

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) Can you provide the time information in Table 1 for “SDFPlane w/o sample”? (2) Title of Section 2.2 “Aampling” should be “Sampling”; (3) For the competing approaches, have you re-trained the models or used the author’s models?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a new architecture of SDF-based NeRF for synthetic view generation and 3D reconstruction targeting stereoscopic surgical videos. The experimental results show state-of-the-art performance in two public datasets. Nevertheless, it’s not clear from reading the paper what are the relevant major contributions, and a discussion about the current suitability of the rendering/3D reconstruction time of the proposed approach for a real medical setting is missing. In case the authors tackle my comments, better highlight the relevant contributions for each module (e.g. discuss the differences of their approach with Lerplane [20], in particular the decomposition into the six multi-resolution grid planes and feature vector extraction), and discuss how far the rendering/reconstruction quality and time are from the requirements of a real medical setting, then I would be willing to increase my rating.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This submission proposes an improvement to neural 3D reconstruction methods that focus on deformable scenes viewed from a static stereo camera (same setting as EndoNERF, EndoSURF, etc). The key new ideas here are a new light-ray sampling process guided by their respective losses, and the incorporation of spatial-temporal encoding from recent CVPR work [2,5].

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The method performance seems very competitive with EndoNERF/EndoSurf at a fraction of the computational time.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The sampling process guided by loss is not fully justified compared to the one from LerPlane, neither in principle nor empirically (see detailed comments).

    • The method is not validated in terms of shape reconstruction accuracy. E.g., EndoSURF reports RMSE and PCD in millimetres for the EndoNERF and Scared datasets. The qualitative results are not sufficient to assess this in full.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    — Error Guided Sampling — The authors propose an alternative ray sampling process to LerPlane, where instead of weighting rays by the amount of motion of reconstructed points, rays are weighted by their previous loss value. This is a simpler process and I can see it making sense, but there is no evidence that this is better than LerPlane’s motion-based sampling. To justify this the authors vaguely mention that “the effectiveness of this method [LerPlane sampling] in clinical settings warrants further investigation.” - but it is not really clear what this means, comments:

    • Is there any more detailed reason why the authors prefer to guide the sampling by loss value rather than motion (LerPlane)?
    • Empirically demonstrating this would require apples-to-apples comparison of both sampling processes. In Table 1, LerPlane and the proposed method have different encoding strategies so it’s impossible to say that the difference in performance is caused by the sampling. Justifying this would only be possible by additionally testing SDFPlane w/ Lerplane sampling (or LerPlane w/ proposed sampling)

    — Method details —

    • Eq (1) symbols (L_i, M_i, L_color, etc) are not formally defined
    • Spatial-temporal features are extracted at different resolutions. How many and which resolutions? If they are the same as in [5] or [2], this can be easily clarified in the manuscript.
    • “deformation reconstruction is a serious pathological problem” - this can probably be worded better

    — Experiments —

    • No computational time for SDFPlane w/o sample. This would be very important to evaluate the impact of this component on method efficiency.
    • “training time is slightly slower than Lerplane[20]” - It’s more than 3 times slower than Lerplane so I don’t think the word “slightly” is appropriate here. I understand that relative to EndoNERF/SURF these are relatively much smaller differences, but this still seems like exaggerated language.
    • As mentioned in “Weaknesses”, it’s a shame that 3D reconstruction errors are not reported, given that one of the advantages of using sdf is precisely to get better reconstructions.

    typos: “it surpass traditional methods” -> “surpasses”; “Aampling” -> “Sampling”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the experimental results look promising, so I’m leaning slightly positive here. However, the manuscript has clear flaws that definitely need to be discussed in the rebuttal. The most novel contribution here is the new guided sampling (the feature extraction is directly taken from CVPR papers), and at present it is fully justified neither in principle nor empirically.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The rebuttal has solved my concerns.

    The rebuttal mentions a crucial implementation detail of LerPlane’s sampling that strictly limits it to static scenes. I’d suggest that the authors mention this in the final version, since understanding this key detail would otherwise require a thorough reading of LerPlane’s paper.

    After reading the remaining reviewers’ comments, I also feel that the remaining concerns have been addressed, namely clarification of contributions and runtime comparisons.




Author Feedback

Thanks to all reviewers for taking the time to review and provide feedback on our submission. Please see below for our responses to the major weaknesses:

1. Main Contributions (R3, R4): Our paper addresses the challenge of balancing accurate geometric reconstruction with fast training speeds in internal environments. Our framework combines Lerplane’s six-plane structure for acceleration with NeuS’s SDF rendering for precise geometry. As noted by reviewers, this simple yet effective design allows us to achieve accurate 3D structures without extended training times. Furthermore, we found limitations in Lerplane’s sampling strategy. To address this, we developed a more versatile Error-Guided Importance Sampling strategy. This approach significantly improves efficiency and achieves superior results. It also broadens applicability beyond single-viewpoint scenarios, marking a key difference from Lerplane.

2. Training Time Acceleration (R3): The training speed improvement in SDFPlane is primarily due to encoding spatial dynamic scenes with a multi-resolution six-plane structure. Essentially, it replaces implicit structures with explicit ones, significantly reducing the number of parameters the model needs to optimize. Besides, the six-plane structure uses trilinear interpolation from eight neighboring points, which gives faster feature queries than an MLP.

3. Convergence Criterion (R3): As mentioned in Sec 3.2, we standardized batch=2048 and epoch=9600 across the different methods to test model performance. We did not compare models based on the same training duration for several reasons. First, for models like EndoNeRF and Endosurf, which require hours of training, a few minutes of training is insufficient to demonstrate their capabilities. Second, maintaining consistent training times in code implementation is challenging and may lead to experimental inaccuracies if not precisely managed.

4. Render Time (R4): The rendering time for SDFPlane is 8 min, while Endosurf takes over 30 min. This is a huge improvement in rendering speed for SDF-based methods. Fast, high-quality reconstruction is very important, and our reconstruction quality and training time meet practical requirements.

5. NeRF vs 3DGS (R4): Compared to NeRF, 3DGS has the advantage of faster training speeds and can produce high-quality, high-resolution images from new viewpoints. However, in internal environments, NeRF holds an advantage. Firstly, internal datasets often lack enough viewpoints. Since 3DGS uses direction-sensitive SH coefficients to calculate color, limited viewpoint information can easily cause training failures. Secondly, the 3D ellipsoids in 3DGS make it difficult to define surface normals and depth, resulting in poor performance with SDF-based methods.

6. Retrained? (R4): Our experimental results are derived from models we trained ourselves by modifying the SOTA methods to use batch=2048 and epoch=9600. Apart from that, we didn’t modify the baseline methods in any other way.

7. Training Time (R4, R5): The training time without the sampling strategy (w/o sample) is similar to that with it.

8. Sample (R5): Firstly, Lerplane’s sampling strategy uses a prior mask from the input images to focus learning on masked areas, but it lacks the ability to adjust based on the model’s learning progress. In contrast, our method updates in real time based on previous loss data, allowing the sampling weights to be adjusted according to the learning progress. This enables our method to adapt and refine the sampling process throughout the training period. Secondly, the implementation of Lerplane’s sampling involves averaging all images and measuring the differences between each image and the average to find areas with large deformation. This approach is only suitable for single-viewpoint datasets, as the average image loses its meaning when the camera moves. Our proposed method isn’t restricted by camera movement, making it more broadly applicable.

9. Reconstruction Quality (R5): The average RMSE is 0.0247 (EndoNeRF), 0.0191 (Endosurf), 0.0479 (Lerplane), and 0.0157 (SDFPlane).
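
The contrast drawn in point 8 of the author feedback can be sketched in a few lines of Python. The first class keeps one error value per ray and redraws batches in proportion to the most recent loss, so the weights follow training progress. The second function computes a fixed prior from the difference between each frame and the mean frame, which only makes sense for a static camera. Both are minimal illustrations under assumptions: the class and function names, the `floor` term, and the use of flattened grayscale frames are hypothetical and not taken from the SDFPlane or Lerplane code.

```python
import random


class ErrorGuidedSampler:
    """Draw ray indices with probability proportional to their last loss."""

    def __init__(self, num_rays, floor=1e-3, seed=0):
        # Every ray starts with the same error, i.e. uniform sampling.
        self.errors = [1.0] * num_rays
        # Small constant so well-fitted rays are still revisited occasionally.
        self.floor = floor
        self.rng = random.Random(seed)

    def sample(self, batch_size):
        weights = [e + self.floor for e in self.errors]
        return self.rng.choices(range(len(weights)), weights=weights, k=batch_size)

    def update(self, ray_ids, losses):
        # Called after each training step with the per-ray losses of the batch.
        for i, loss in zip(ray_ids, losses):
            self.errors[i] = loss


def mean_image_difference_prior(frames):
    """Static prior: absolute difference of each frame to the mean frame.

    frames: list of flattened grayscale frames taken from a fixed camera.
    The result never changes during training.
    """
    n = len(frames)
    mean = [sum(px) / n for px in zip(*frames)]
    return [[abs(p - m) for p, m in zip(frame, mean)] for frame in frames]
```

In a training loop, `sample` would pick the ray batch, the renderer would return one loss per ray, and `update` would write those losses back before the next draw.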




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces SDFPlane, an efficient and precise method for reconstructing deformable tissues from endoscopic videos. Utilizing a spatiotemporal structure encoder based on multi-scale planar composition alongside SDF-based volume rendering, the approach achieves higher quality reconstructions. Notably, the method’s speed surpasses that of existing SDF-based methods by more than tenfold. After careful consideration of the authors’ rebuttal, most reviewers now lean towards accepting the paper. I agree that the authors have adequately addressed the major concerns and questions raised by the reviewers regarding the novelty of the proposed methods and some of the technical aspects. That said, I lean towards accepting the paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper score was improved post-rebuttal to Accept, Weak-Accept and Weak-Reject (unchanged). Reviewers agreed to accept the paper upon addressing specific concerns in the final version regarding 1) the contribution statement, and 2) providing implementation details.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



