Abstract

Scale-aware monocular depth estimation poses a significant challenge in computer-aided endoscopic navigation. However, existing depth estimation methods that do not consider geometric priors struggle to learn the absolute scale from training with monocular endoscopic sequences. Additionally, conventional methods face difficulties in accurately estimating details on tissue and instrument boundaries. In this paper, we tackle these problems by proposing a novel enhanced scale-aware framework that only uses monocular images with geometric modeling for depth estimation. Specifically, we first propose a multi-resolution depth fusion strategy to enhance the quality of monocular depth estimation. To recover the precise scale between relative depth and real-world values, we further calculate the 3D poses of instruments in the endoscopic scenes by algebraic geometry based on the image-only geometric primitives (i.e., boundaries and tip of instruments). Afterwards, the 3D poses of surgical instruments enable the scale recovery of relative depth maps. By coupling scale factors and relative depth estimation, the scale-aware depth of the monocular endoscopic scenes can be estimated. We evaluate the pipeline on in-house endoscopic surgery videos and simulated data. The results demonstrate that our method can learn the absolute scale with geometric modeling and accurately estimate scale-aware depth for monocular scenes. Code is available at: https://github.com/med-air/MonoEndoDepth
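
The scale-recovery step described in the abstract can be illustrated with a minimal sketch. It assumes that metric depths for the instrument pixels are already available (for example, from an instrument pose estimate) and takes the median ratio between those metric depths and the network's relative depths as a single global scale factor. The helper name `recover_scale`, the median-ratio rule, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def recover_scale(relative_depth, instrument_mask, metric_instrument_depth):
    """Estimate one global scale factor from instrument pixels (hypothetical helper).

    relative_depth: (H, W) up-to-scale depth from a monocular network.
    instrument_mask: (H, W) boolean mask of instrument pixels.
    metric_instrument_depth: (H, W) metric depth in mm, valid where the mask is True.
    """
    rel = relative_depth[instrument_mask]
    met = metric_instrument_depth[instrument_mask]
    valid = rel > 0  # ignore pixels with non-positive relative depth
    if not np.any(valid):
        raise ValueError("no valid instrument pixels to estimate the scale")
    # The median of per-pixel ratios is robust to a few outlier pixels.
    return float(np.median(met[valid] / rel[valid]))

# Toy example: the metric scene is exactly 50x the relative prediction.
rng = np.random.default_rng(0)
relative = rng.uniform(0.5, 2.0, size=(4, 6))
metric = 50.0 * relative
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:5] = True  # pretend these pixels belong to the instrument

scale = recover_scale(relative, mask, metric)
scale_aware_depth = scale * relative  # scale applied to the whole depth map
```

In this toy setup the recovered factor matches the synthetic factor of 50, so multiplying the relative map by it reproduces the metric map. With real data the quality of the factor would depend on how accurate the instrument depths and the mask are.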

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1856_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1856_supp.pdf

Link to the Code Repository

https://github.com/med-air/MonoEndoDepth

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Wei_Enhanced_MICCAI2024,
        author = { Wei, Ruofeng and Li, Bin and Chen, Kai and Ma, Yiyao and Liu, Yunhui and Dou, Qi},
        title = { { Enhanced Scale-aware Depth Estimation for Monocular Endoscopic Scenes with Geometric Modeling } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce a novel image-based framework for scale-aware depth estimation in monocular endoscopic scenes. This approach integrates both monocular depth estimation and 3D instrument pose estimation, based on the known sizes of surgical tools within the scene. The authors compare the proposed work against other monocular depth estimation strategies, potentially showing the added value of the proposed strategy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The combination of the depth estimation with the known 3D shape of the tool for scale awareness is an interesting approach. To achieve this, the authors introduced a geometrically-driven formulation for estimating the 3D pose of the instrument, consequently calibrating the depth scale based on the radius and cylindrical shaft profile of the tool. Moreover, the authors compared the performance of the proposed strategy against other state-of-the-art monocular depth estimations.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper lacks quantification of the added value of the modules within the proposed strategy, particularly in comparing the performance of Multi-Resolution depth estimation with and without the scale-aware module. Furthermore, inadequate detail regarding the dataset used in the study, including specific surgeries performed and instruments used, raises questions about the reliability and reproducibility of the experiments. Additionally, there is a lack of discussion regarding potential errors associated with the ground truth generation method, specifically standard stereo matching. This omission raises concerns about the reliability of the ground truth data and how these errors might impact the final performance of the proposed strategy.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The dataset used in the study is not adequately detailed, namely the specific surgeries performed, instruments used and size of the instruments for the estimation of the pose. Moreover, it was stated that the training process of Monodepth2 was followed. The authors should make clear if the training parameters, namely learning rate, epoch, optimizer, … are the same.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors rely on standard stereo matching for generating ground truth data, but they do not discuss the potential errors associated with this method and how these errors might impact the final performance of their proposed strategy. Moreover, the paper mentions the need for a tool segmentation strategy. The performance of this network can also influence the final performance of the proposed methodology. The authors could provide results using the GT masks of the instruments for the pose estimation step, thus minimising the influence of the segmentation on the final results and quantifying the added value of this methodology.

    There is a lack of quantification of the added value of the modules of the proposed strategy. The authors should report a comparison of the performance of the proposed Multi-Resolution depth estimation with and without the scale-aware module, thus quantifying the added value of all modules of the proposed methodology.

    The dataset used in the study is not adequately detailed, namely the specific surgeries performed, instruments used and size of the instruments for the estimation of the pose. For example, Figure 3 includes non-surgical images. If these images were used for validation, their inclusion could potentially bias the results. Moreover, it is unclear whether the other networks mentioned in the paper were trained on the same dataset created by the authors or on different datasets, which could affect the comparability of results.

    While the paper acknowledges the loss of information at different image resolutions, it does not discuss how this loss affects the network’s performance. Additionally, it lacks clarity on the selection process for the two resolutions used in the study.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting approach to improve current monocular depth estimation strategies through an Enhanced Scale-aware Depth Estimation strategy. The authors presented results that show the potential of the proposed strategy for this task, which is an important topic for the field.

    Still, with more experiments and details about the current tests, the work can be improved.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a monocular depth estimation algorithm that can recover scale from multi-resolution processing and 3D pose of the instrument. As laparoscopic instruments do typically have a standard shaft size to fit through the ports, this method could have broad applicability. While frame-to-frame and anatomy-based geometric constraints have been explored in past works, the absolute ground truth this paper proposes is novel and a significant contribution. The paper is well written and easy to follow.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel solution to depth estimation with scale from endoscopic images
    • Ablation over the resolutions and the pose estimation aspects of the method
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • No evaluation on publicly available datasets
    • Evaluation may be biased towards the proposed method
    • Description of runtime is unclear, making it difficult to assess whether the proposed method can be used intraoperatively
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The pose of the instrument may not be uniquely determined from some views where the tip is hidden. How does the algorithm handle that?

    In a similar vein, the images from the dataset as shown in Fig. 2 are all very clean (no smoke, blood, good lighting). This method has a strong dependency on the pose of the instrument being well-identified for scale reconstruction. Further testing on more realistic datasets would provide a better understanding of the robustness of the proposed algorithm to surgical conditions.

    Making the evaluation dataset publicly available would help the reproducibility of this work as there is no evaluation on public datasets.

    The runtime reported is a little misleading since it’s only for the tool segmentation step. What’s the total runtime of the depth estimation pipeline?

    The evaluation is biased towards the proposed method as the previous methods do not claim to adjust for scale and therefore should not be expected to perform well in the absolute sense. Some adjustment for scale would have been a more fair comparison.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Recovering ground truth depth from endoscopic video is interesting and a useful contribution. My enthusiasm was somewhat dampened by the lack of evaluation on publicly available datasets.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a framework that can enhance the quality of monocular depth estimation. This framework can further calculate the instrument poses and enable the scale recovery of relative depth maps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study not only estimates the scale-aware depth of monocular endoscopic scenes, but also considers the 3D pose of the instrument, which is practical and clinically relevant. The method does not require a CAD model or sensor information, which makes it easier to apply. The theoretical framework is sound and the paper is very well written. The experiments, which include in-house surgical clips and simulator data, are also convincing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I don’t see any major weakness in the paper worth mentioning, except that some of the authors’ novelty claims are questionable (see comments).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Although code and data will not be shared, the authors did report sufficient information to support the reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors claim that “we first propose a multi-resolution depth merging strategy to enhance the quality of monocular depth estimation.” but [1] have proposed a framework that merges depth estimations for the same image at different resolutions adaptive to the monocular image content to generate a result with high-frequency details while maintaining the structural consistency.

    [1] Miangoleh, S.M.H., Dille, S., Mai, L., Paris, S. and Aksoy, Y., 2021. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9685-9694).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clear, well written and has potential to be translated into clinical practice with a convincing theoretical framework and experiments.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thanks for your careful and valuable comments. We will explain your concerns point by point.

Q1 (R1&R4): Data: no evaluation on public and challenging data. A1: Currently, there are no available datasets for depth estimation in robotic surgical scenes with GT depth. Therefore, we first extracted the required clips from our in-house dataset and then performed the evaluation on them. Moreover, we use a standard stereo matching method to calculate GT depth. Additionally, surgical scenes may suffer from smoke, blood, and poor lighting, and it is meaningful to estimate the metric depth of these scenes. However, there is no public data for these scenarios, and it is difficult to obtain GT from such data. For this reason, we have not evaluated the proposed method on more challenging datasets.

Q2 (R4): Data: how does error in GT depth calculated by stereo matching affect evaluation? A2: We evaluated the proposed scale-aware depth estimation on clinical surgical data. The only way to calculate GT depth for clinical data is stereo matching. The stereo matching method has been evaluated on the public SCARED dataset with high accuracy (RMSE of 2.959 mm). Therefore, we believe that the collected data is sufficient to evaluate the performance of scale-aware depth estimation.

Q3 (R4): Data: details about the collected data. A3: We extracted data from the in-house daVinci robotic prostatectomy dataset. The instruments used in surgery are standard daVinci instruments, which have a cylindrical shaft with a constant radius of 4.5 mm.

Q4 (R1): Method: since the compared methods are not targeted at scale-aware depth estimation, the evaluation is biased towards the proposed method. Some adjustment for scale would have been a fairer comparison. A4: Among the compared methods, “MonoDepth Stereo” can estimate scale-aware depth from monocular images; nevertheless, our method outperforms it. For those methods not targeted at scale-aware depth estimation, we have manually performed scale adjustments using GT information.

Q5 (R1): Method: how does the algorithm handle instruments that are out of view? A5: In most cases, for safety reasons, the surgeon should keep instruments in view while performing procedures. If instruments are withdrawn from the scene, instrument-tissue manipulation is considered to have ceased, since the surgeon cannot see the instruments, so the method waits for the instrument to return to the field of view before continuing.

Q6 (R4): Method: the authors should report the comparison of the proposed multi-resolution depth estimation with and without the scale-aware module. A6: This question assumes that the scale-aware module improves depth quality. However, we use the module to recover the 3D pose of the instrument and calculate the scale, whereas the merging strategy is what enhances depth quality. In our evaluation, depth estimates without real scale are first scaled using GT, and then the accuracy is calculated.
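
The protocol mentioned in A4 and A6 (scale an up-to-scale prediction using GT, then measure accuracy) can be sketched as follows. Per-image median scaling, which is commonly used when evaluating self-supervised monocular depth models, is assumed here; the rebuttal does not state which scaling rule was actually used. The function names and toy values are hypothetical.

```python
import numpy as np

def median_scale(pred, gt):
    """Align an up-to-scale prediction to GT with a single median ratio."""
    valid = gt > 0  # pixels without GT are marked with 0
    ratio = np.median(gt[valid]) / np.median(pred[valid])
    return pred * ratio

def rmse(pred, gt):
    """Root-mean-square error over pixels that have GT."""
    valid = gt > 0
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))

# Toy example: the prediction is the GT divided by 100 wherever GT exists.
gt = np.array([[40.0, 50.0], [60.0, 0.0]])   # depth in mm, 0 = missing GT
pred = np.array([[0.4, 0.5], [0.6, 0.9]])    # relative (unitless) depth

aligned = median_scale(pred, gt)
error_mm = rmse(aligned, gt)
```

Because the scale here comes from GT, the resulting error reflects only the quality of the relative depth. A scale-aware method would be scored without this alignment step.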

Q7 (R4): Method: how does low-res input affect network performance? How to choose image resolutions when performing depth merging? A7: In our paper, we described that the depth estimated by the network from a low-res image loses many details. Please refer to the supplementary materials for further information. In addition, the paper includes an ablation on the choice of resolutions.

Q8 (R1): Method: the runtime of the pipeline. A8: The runtime of the proposed pipeline is around 66ms per frame.
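
As a quick check of what 66 ms per frame means for intraoperative use, the latency can be converted to a frame rate as in the small sketch below. The 30 fps video rate used for comparison is an assumption for illustration, not a figure from the paper or the rebuttal.

```python
def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms

pipeline_fps = frames_per_second(66.0)         # roughly 15 frames per second
assumed_video_fps = 30.0                       # assumed endoscope video rate
keeps_up = pipeline_fps >= assumed_video_fps   # False under this assumption
```

Under the assumed 30 fps stream, the reported latency would correspond to processing about every second frame.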

Q9 (R2): Method: The authors claim that “we first propose …” but [1] have proposed a framework that merges depth. A9: Sorry for the confusion. “First” here means that merging depth estimations is the first step in our pipeline. Our approach focuses on the scale-aware depth estimation of monocular endoscope images, not just on improving depth quality. However, [1] aims at depth quality improvement.

Q10 (R4): Experiment: providing depth evaluation results using GT tool masks. A10: Currently, the predicted masks are accurate enough for tool pose estimation.




Meta-Review

Meta-review not available, early accepted paper.


