Abstract

Neural radiance fields have recently emerged as a powerful representation for reconstructing deformable tissues from endoscopic videos. Previous methods mainly rely on depth supervision from endoscopic datasets, and depth values have proven important for reconstructing deformable tissues. However, the need to collect large datasets with accurate depth values limits the applicability of these approaches to endoscopic scenes. To address this issue, we propose a novel self-supervised monocular 3D scene reconstruction method based on neural radiance fields that does not use prior depth as supervision. We consider monocular 3D reconstruction from two perspectives: ray-tracing-based neural radiance fields and structure-from-motion-based photogrammetry. We introduce a structure-from-motion framework and leverage color values as supervision to complete the self-supervised learning strategy. In addition, we predict depth values from the neural radiance fields and enforce a geometric constraint on depth values from adjacent views. Moreover, we propose a looped loss function to fully explore the temporal correlation between input images. Experimental results show that the proposed method, without prior depth, outperforms previous depth-supervised methods on two endoscopic datasets. Our code is available.
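
The geometric constraint between adjacent views mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `depth_consistency`, the pinhole intrinsics `K`, the relative pose `R`, `T`, nearest-neighbour sampling, and the normalized-difference error are all assumptions chosen for illustration, and tissue deformation is ignored.

```python
import numpy as np


def depth_consistency(depth_t, depth_t1, K, R, T):
    """Mean normalized difference between depth_t warped into view t+1 and depth_t1.

    Hypothetical helper: assumes strictly positive depths, a pinhole camera K,
    and a rigid relative pose (R, T) from view t to view t+1.
    """
    h, w = depth_t.shape
    # Homogeneous pixel grid of view t, shape (3, h*w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0).astype(np.float64)
    # Back-project every pixel to a 3D point in the camera frame of view t.
    pts_t = (np.linalg.inv(K) @ pix) * depth_t.ravel()
    # Move the points into the camera frame of view t+1.
    pts_t1 = R @ pts_t + T.reshape(3, 1)
    z = pts_t1[2]
    # Project into the image plane of view t+1 (guarding against z <= 0).
    proj = K @ pts_t1
    z_safe = np.where(z > 1e-6, z, 1.0)
    u1 = np.rint(proj[0] / z_safe).astype(int)
    v1 = np.rint(proj[1] / z_safe).astype(int)
    valid = (z > 1e-6) & (u1 >= 0) & (u1 < w) & (v1 >= 0) & (v1 < h)
    if not valid.any():
        return 0.0
    # Compare the transformed depth with the depth predicted for view t+1.
    sampled = depth_t1[v1[valid], u1[valid]]
    diff = np.abs(z[valid] - sampled) / (z[valid] + sampled)
    return float(diff.mean())


K = np.array([[100.0, 0.0, 16.0], [0.0, 100.0, 12.0], [0.0, 0.0, 1.0]])
same = depth_consistency(np.full((24, 32), 2.0), np.full((24, 32), 2.0),
                         K, np.eye(3), np.zeros(3))
print(same)
```

With an identity pose and identical depth maps the error is numerically zero; any disagreement between the warped and predicted depths raises it, which is the kind of signal a depth-free training objective could penalize.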

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1656_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/MoriLabNU/EndoSelf

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_EndoSelf_MICCAI2024,
        author = { Li, Wenda and Hayashi, Yuichiro and Oda, Masahiro and Kitasaka, Takayuki and Misawa, Kazunari and Mori, Kensaku},
        title = { { EndoSelf: Self-Supervised Monocular 3D Scene Reconstruction of Deformable Tissues with Neural Radiance Fields on Endoscopic Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work incorporates geometric and photometric loss into conventional deformable NeRF algorithm.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    o It is widely agreed that depths from NeRF are not reliable. Yet, this research reveals that unreliable depth is still beneficial in constraining photometric and geometric consistencies. I believe this is the main and most valuable contribution of this article.
    o The authors claim that code is available (not attached or indicated in the manuscript).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    o What is the basic loss in Fig. 1?
    o Please fix the matrix/vector size inconsistency in Eq. 1.
    o I am confused about the self-supervised learning module. How is Depth Consistency combined with the original deformable NeRF? Or, how is Fig. 2 integrated into Fig. 1?
    o How many images are used for training and testing?
    o The authors said, “Our code is available.” No code has been attached or provided.
    o Considering this is mapping research, more results should be shown in a video or attachment.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors claim that code is available (Not attached or indicated in the manuscript).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The writing of this manuscript can be further polished and improved. Relation between Fig.1 and Fig.2 is not clear.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall impression, presentation and completeness of the evaluation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a 3D reconstruction method for posed monocular endoscopic videos, combining aspects of neural radiance fields and structure-from-motion (SfM) techniques. The proposed method does not require ground truth depth maps, as SfM between adjacent views is applied to the predicted depth values to provide geometric constraints. In addition, temporal consistency between adjacent frames is exploited through a proposed looped loss function. Ablation studies on the different components provide clear information on what drives the performance. The evaluations on the ENDONERF and SCARED datasets show that the proposed technique outperforms previously proposed state-of-the-art techniques on all the image similarity metrics (PSNR, SSIM and LPIPS).
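
Of the image similarity metrics named above, PSNR has a simple closed form and can be sketched as follows. This is the standard definition, not the paper's evaluation script: the function name `psnr`, the `max_value` parameter, and the assumption that images are scaled to [0, 1] are illustrative, and any masking of surgical tools or invalid pixels used in the actual evaluation is not reproduced here.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between a rendered view and its reference image."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Mean squared error over all pixels and channels.
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
print(psnr(rendered, reference))
```

Higher PSNR means the rendered novel view is closer to the held-out frame. As Review #2 notes, this measures 2D image error only and says nothing directly about 3D reconstruction accuracy.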

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Combining techniques from structure from motion (for the depth consistency loss) with neural radiance fields provides a novel approach and removes the need for ground truth depth for the reconstruction.
    • The proposed looped loss function exploiting temporal consistency.
    • Comprehensive evaluations on benchmark dataset with clear improvements compared to previously proposed state-of-the-art techniques.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Inclusion of 3D errors on the reconstructions could provide valuable indications compared to [25], since the evaluation metrics only indicate the 2D errors on the image similarity of the novel rendered view.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The readability of Sec. 2.2 can be improved. Some of the equations are well known to readers familiar with the field; moving them to an appendix or the supplementary material would help improve readability.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method eliminates the need for ground truth depth maps and exploits temporal consistency between adjacent frames. Comprehensive evaluations demonstrate clear improvements over state-of-the-art techniques. The inclusion of 3D reconstruction errors and improved readability of Section 2.2 could further enhance the paper. Overall, the listed strengths of the proposed method and its superior performance are the main factors for the rating.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Neural radiance fields (NeRFs) are an effective tool for modeling deformable tissue structures from intra-operative visible-spectrum imaging. Previous approaches leverage depth information as additional supervision, which is commonly available from stereo laparoscopic video. Endoscopic video, which is primarily monocular, does not enable convenient depth collection suitable for depth-supervised NeRF methods. The manuscript considers a NeRF-based architecture for surface reconstruction from posed endoscopic videos (such as might be obtained from a tracked endoscope). The method exploits geometric and photometric consistency between adjacent frames in the loss function during a looped optimization process. The method is evaluated on the ENDONERF and SCARED datasets and compared to EndoNeRF and EndoSurf methods in terms of the PSNR, SSIM, and LPIPS, for which it outperforms prior work.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The mathematical details are comprehensive and readable.
    • The method outperforms relevant prior work despite the lack of depth supervision.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The method requires posed endoscopic images as input.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    None. Code will be appreciated.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The manuscript has several strong points in its favor. It meticulously outlines the approach to enforcing temporal consistency during deformable NeRF optimization. While there are a few sticking points (see below), the readability and exhaustiveness of this treatment is much appreciated.

    The reliance on camera poses for each frame is a minor drawback. These can be obtained, as done here, using structure from motion methods or, potentially, from a navigated endoscope, so it is not a major issue. However, the drawbacks of structure from motion, which can suffer from robustness and accuracy issues for endoscopic video, which features primarily depth-wise motion with a camera-mounted light source through highly specular anatomy, should be discussed.

    Finally, the text would benefit from further proof-reading and a grammar checker before final publication.

    Additional comments:

    • Abstract: “Neural radiance fields have …”
    • In equation 1, $\mathbf{D} : \mathcal{R}^2 \rightarrow \mathcal{R}$ is a function, not a matrix. Therefore I suggest a lowercase, non-bold symbol $d$. Same with $\mathbf{M}$ in eq 4.
    • In Section 2.2, the formulation with $t_r$ and $t_s$ is unnecessarily complicated, since these times are always adjacent. Without loss of generality (since the constraint is symmetric) you can just use $t$ and $t+1$. I highly recommend this, since it is not clear at first glance that the two times are adjacent, and the formulation suggests otherwise. This should be reflected in Fig. 1.
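
For illustration only, the notation suggested in the comments above (a lowercase depth function and adjacent times $t$ and $t+1$) might look as follows in a generic reprojection-based depth consistency term. This is not the paper's Eq. 1: the intrinsics $\mathbf{K}$, relative pose $(\mathbf{R}_{t \to t+1}, \mathbf{T}_{t \to t+1})$, projection $\pi$, valid pixel set $\mathcal{V}$, and the normalized-difference form of the loss are all assumptions.

```latex
\begin{align}
  d_t &: \mathbb{R}^2 \rightarrow \mathbb{R}
      && \text{(depth rendered at time } t\text{)} \\
  \mathbf{X}_{t \to t+1}(\mathbf{p})
      &= \mathbf{R}_{t \to t+1}\, d_t(\mathbf{p})\, \mathbf{K}^{-1} \tilde{\mathbf{p}}
         + \mathbf{T}_{t \to t+1}
      && \text{(pixel } \mathbf{p} \text{ lifted to 3D and moved to view } t+1\text{)} \\
  \mathbf{p}' &= \pi\!\left(\mathbf{K}\, \mathbf{X}_{t \to t+1}(\mathbf{p})\right),
  \qquad z(\mathbf{p}) = \left[\mathbf{X}_{t \to t+1}(\mathbf{p})\right]_z \\
  \mathcal{L}_{\text{geo}}
      &= \frac{1}{|\mathcal{V}|} \sum_{\mathbf{p} \in \mathcal{V}}
         \frac{\left| z(\mathbf{p}) - d_{t+1}(\mathbf{p}') \right|}
              {z(\mathbf{p}) + d_{t+1}(\mathbf{p}')}
\end{align}
```

Here $\tilde{\mathbf{p}}$ denotes the homogeneous coordinates of pixel $\mathbf{p}$. Writing the two views as $t$ and $t+1$ makes their adjacency explicit, and swapping their roles gives the mirrored constraint.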

    References:

    [1] Phillips, Ian H. D., et al. “A Real-Time Endoscope Motion Tracker.” IEEE J. Transl. Eng. Health Med., vol. 10, 2022, doi:10.1109/JTEHM.2022.3214148.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is strong work. The incorporation of temporal consistency into the optimization of a neural radiance field, divided here into optimization of the geometric and photometric information, enables higher quality NeRF reconstruction for surgical scenes. These scenes are challenging for a number of reasons, ranging from specular, inconstant lighting conditions to soft tissue deformation. Whereas previous approaches have leveraged the depth information to enable training for neural deformation fields, the approach here requires only camera poses. While these are not trivially obtained, navigated endoscopes are one option [1]. Despite not having access to depth information, the approach here results in higher quality NeRF renderings for test set images in laparoscopic and endoscopic procedures, compared to relevant baselines.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all the reviewers for their encouraging feedback and constructive comments. We will include the reviewers’ suggestions in our final version and carefully proofread our paper.




Meta-Review

Meta-review not available, early accepted paper.


