Abstract

Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient methods to adapt these models to endoscopic depth estimation. We propose Endoscopic Depth Any Camera (EndoDAC), an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. Specifically, we develop the Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks to tailor the foundation model to the surgical domain, utilizing remarkably few trainable parameters. Given that camera information is not always accessible, we also introduce a self-supervised adaptation strategy that estimates camera intrinsics using the pose encoder. Our framework is capable of being trained solely on monocular surgical videos from any camera, ensuring minimal training costs. Experiments demonstrate that our approach obtains superior performance even with fewer training epochs and without knowledge of the ground-truth camera intrinsics. Code is available at https://github.com/BeileiCui/EndoDAC.
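As a rough illustration of why low-rank adaptation needs so few trainable parameters, the sketch below counts parameters for one linear layer under standard LoRA. This is plain LoRA, not the paper's DV-LoRA, whose details are not given on this page. The layer sizes and rank are hypothetical values chosen only for the example, and `lora_param_counts` is an illustrative helper, not part of the EndoDAC code.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer.

    full: fine-tuning the whole weight matrix W of shape (d_out, d_in).
    lora: training only the low-rank factors A (rank x d_in) and
          B (d_out x rank) while W stays frozen, as in standard LoRA.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical sizes, assumed for illustration only.
full, lora = lora_param_counts(d_in=768, d_out=768, rank=4)
ratio = lora / full
print(f"full fine-tuning: {full} parameters")
print(f"LoRA (rank 4):    {lora} parameters ({ratio:.2%} of full)")
```

Under these assumed sizes the low-rank factors amount to roughly 1% of the layer's weights, which is the general mechanism behind the small trainable-parameter budgets that LoRA-style methods report.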

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0225_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0225_supp.pdf

Link to the Code Repository

https://github.com/BeileiCui/EndoDAC

Link to the Dataset(s)

https://hamlyn.doc.ic.ac.uk/vision/

https://endovissub2019-scared.grand-challenge.org/

BibTex

@InProceedings{Cui_EndoDAC_MICCAI2024,
        author = { Cui, Beilei and Islam, Mobarakol and Bai, Long and Wang, An and Ren, Hongliang},
        title = { { EndoDAC: Efficient Adapting Foundation Model for Self-Supervised Depth Estimation from Any Endoscopic Camera } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a method to adapt an existing foundation model for depth estimation (Depth Anything – DA) for endoscopic applications. For this, the authors propose two modifications and one self-supervised learning strategy. The proposed approach is evaluated on two public data sets, demonstrating improved performance over state-of-the-art alternatives.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interesting method for a relevant application
    • Thorough evaluation and benchmarking on two data sets
    • Accurate method to estimate intrinsic camera parameters
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Effect size of individual contributions unknown, seems limited
    • Main contribution a bit unclear, as several methods are simultaneously introduced (DepthNet, Pose-Intrinsics Net and Self-supervised depth and ego-motion estimation)
    • Confidence intervals and statistical significance not reported (minor)
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper addresses a relevant problem, is rigorous and clear. I think the authors did an excellent job, given the extensive amount of experiments and benchmarking. However, I still feel a bit on the fence regarding the paper, mostly due to two issues, which I outline below.

    1 MOTIVATION OF MAIN CONTRIBUTIONS To me, the motivations for the main contributions (DepthNet, Pose-Intrinsics Net, SS Depth and Ego-motion estimation) are still a bit unclear. For example, why was it necessary to adapt the original LoRA approach? What was the motivation to change it in this exact way? How will the network behaviour change as a result of this change? Similar questions could be asked for the other proposed additions. These are especially relevant questions to address, as the improvement w.r.t. the DA baseline (in Table 1) seems small and the three proposed additions seem largely based on existing methods (which is in itself not a problem, but may limit the novelty). I suggest the authors better frame their contributions, by drawing a clearer line between existing works and the proposed approach, including a motivation for each of the three methods. This might also help in putting the biggest contribution in the spotlight, as I was a bit lost between the different outcomes. To me alleviating the need for the intrinsic camera parameters may be the most striking contribution, but that could also be due to the fact that I could not fully appreciate the other contributions w.r.t. related work.

    2 EFFECT SIZE OF CONTRIBUTIONS The previous point is further emphasized by the results presented in Table 2, showing the ablation study. In the absence of confidence intervals, these results are hard to interpret, as they seem extremely close. This raises the question of whether any of these modules contribute significantly, or whether something else is actually driving the marginal improvement shown in Table 1. I do not want to disparage the authors' contributions, but under the presented evidence, the story is not very convincing, and it raises the question of what would happen if all of the modules were ablated. I would like to invite the authors to address these concerns and maybe reconsider the main contributions of the paper, as discussed above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is rigorous and clear, introduces some interesting novelties, but it is unclear whether these actually contribute to the main result, or that the original method (DA) would have achieved the same results if the authors really wanted to boost this baseline.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a self-supervised depth estimation framework. It adopts foundation models and requires a small number of parameters for fine-tuning. The model incorporates DepthNet, which estimates the multi-scale depth map, and a pose-intrinsic network, which estimates the motion variation between adjacent images. The paper claims that the proposed approach results in low computational resources and short training time.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-structured and easy to follow.
    • The results are presented on two datasets and also include an ablation study.
    • The proposed model has fewer trainable parameters.
    • Exhaustive experiments are performed and compared with recent methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Justification for dataset split is missing: The sample count used in the test set for the SCARED dataset is not appropriate and requires reasoning behind such a split.
    • More details required on experimental settings: The experimental settings used for methods in the comparison table should have been provided in more detail.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The SCARED dataset has been split into 15351, 1705 and 551 frames for training, validation and test set. This shows that test frames are less than 5% of the total frames, which is generally not a good sample count for testing. A proper justification should be given for this.
    • The paper mentions that “all the other compared methods involving training have a total epoch greater or equal to 40 while our approach trains for 20 epochs.” Do the methods used for comparison present the best outcome after 40 epochs? Is an early stopping criterion used, or could any kind of overfitting be involved?
    • The paper mentions that the SCARED dataset videos are accompanied by ground truth depth maps. While showing the qualitative results such as those shown in Fig. 3, the ground truth images should also be included for better comparison.
    • Minor: (a) “” are incorrectly used in the table captions. (b) Table 3 and Table 4 do not show boldface markings. (c) Fig. 1 could mention the definition for notations used to denote source and reconstructed image as these notations are defined in later sections, and it becomes difficult to follow at once in Fig.1.
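The split proportions behind the first comment above can be checked with a few lines of arithmetic. The sketch below uses only the frame counts quoted in the review (15351, 1705 and 551); the variable names are illustrative.

```python
# Frame counts for the SCARED split as quoted in the review.
splits = {"train": 15351, "val": 1705, "test": 551}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {splits[name]} frames ({frac:.2%} of {total})")
```

With these counts the test set is about 3.1% of all frames, consistent with the reviewer's remark that it is below 5%.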
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written and proposes a self-supervised approach to estimate depth in endoscopic videos. The proposed model also has fewer training parameters, although the inference speed is not affected much. However, there are a few concerns that must be addressed. For example, some details on experimental settings and justification for the dataset split ratio are required.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a method called EndoDAC for self-supervised depth estimation in the endoscopic domain. The method aims to adapt an existing foundation model called “Depth Anything” in order to perform depth estimation for surgical videos using a small number of trainable parameters, making it computationally efficient and fast to train. Additionally, the authors present a self-supervised adaptation strategy where depth, ego-motion, and camera’s intrinsic parameters estimations are trained in parallel, potentially allowing for the adaptation of the method to surgical videos from any unknown camera, making it broadly applicable to most surgical video datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    One of the main strengths of the paper is the proposed Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) method, which efficiently adapts foundation models to surgical scene depth estimation. This method requires only a small number of trainable parameters, resulting in low computational resources and short training time.

    Additionally, the paper presents a self-supervised adaptation strategy where depth, ego-motion, and camera’s intrinsic parameters estimations are trained in parallel. This strategy allows for the adaptation of the method to surgical videos from any unknown camera, making it broadly applicable to most surgical video datasets. This is a novel approach that enhances the universality of the adaptation process.

    The paper also demonstrates the higher performance of the proposed method over other state-of-the-art SSL depth estimation methods with significantly fewer trainable parameters. Furthermore, the paper provides a comparison of quantitative results, showing that the proposed method exceeds all compared methods, regardless of the awareness of camera intrinsics.

    Overall, the combination of the introduced DV-LoRA method, the self-supervised adaptation strategy, superior performance, and real-time implementation capabilities make this paper a strong contribution to surgical self-supervised depth estimation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of comparison with recent foundation models developed specifically for endoscopy videos. Example: Wang, Zhao, et al. “Foundation model for endoscopy video analysis via large-scale self-supervised pre-train.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.

    This omission makes it difficult to assess the novelty or superiority of the proposed method in the context of depth estimation specifically or evaluating whether the introduced DV-LoRA method is generalizable to other types of foundation models.

    2. Limited discussion on generalizability: While the paper claims that the proposed method can be applied to surgical videos from any unknown camera, there is limited discussion on the generalizability of the model to different surgical scenarios or datasets. Without further evidence or experiments, it is unclear how well the method would perform in diverse surgical environments.

    3. Insufficient evaluation on clinical feasibility: The paper lacks a comprehensive evaluation of the proposed method’s clinical feasibility, such as its performance in real surgical settings or its impact on surgical outcomes. Without such evaluations, it is challenging to assess the practical value of the proposed method for endoscopic procedures.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Data: is public, so it should be accessible.

    Code: Authors have shared anonymized code with sufficient technical details.

    So overall their methods should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • There seems to be a typo on page 7: “Table 5 presents the comparison of quantitative results of the aforementioned methods. Our method exceeds all of the compared methods by a significant margin regardless of the awareness of camera intrinsics.” There is no Table 5 in the manuscript; I think the authors meant Table 1.

    • Figure captions are not descriptive, the main message should be there. For example in the caption for Figure 3 we have “Fig. 3. Qualitative depth comparison on the SCARED dataset.” But what is the message here? What should we expect to see here ideally that we are not seeing for other models?

    • In the results section on page 7, there is a lack of explicit information regarding the criteria used for selecting the two example sequences for the comparative analysis of Pose and Intrinsics Estimation. It would have been beneficial if the authors had provided more detailed information about the selection process to avoid any doubts regarding the subjectivity of the selection.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper demonstrates the superior performance of the proposed method in SSL depth estimation compared to other state-of-the-art methods. The proposed model achieves strong results with significantly fewer trainable parameters than other methods. Also, the paper presents both quantitative and qualitative results to support the effectiveness of the proposed model. These results show that the proposed method exceeds all compared methods without information about camera intrinsics. Furthermore, the paper includes ablation studies to validate the effectiveness of each module in the proposed model. This demonstrates a thorough analysis of the proposed approach and strengthens its credibility. All of these contributed to my acceptance decision.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their efforts and helpful comments on our paper. We will fix all typos and grammatical errors in the camera-ready version of the manuscript. We would like to clarify some of the reviewers' main misunderstandings:

  1. Lack of comparison with recent foundation models developed specifically for endoscopy videos. (R1) This is what we plan to do in our future work. Thank you for the advice.

  2. There is a lack of explicit information regarding the criteria used for selecting the two example sequences for the comparative analysis of Pose and Intrinsics Estimation. (R1) We selected these two sequences following many previous works, such as [1-2].

  3. Why was it necessary to adapt the original LoRA approach? (R3) LoRA has proven to be a very effective method for fine-tuning in NLP and language-based foundation models. Therefore, we use a LoRA-like method to adapt the foundation model to medical scenes.

  4. Justification for dataset split is missing (R5) The split follows previous works [1-2].

  5. Do the methods used for comparison present the best outcome after 40 epochs? (R5) We apply the same training settings to the different baselines. All the other methods converge at around 30 to 40 epochs, whereas our method mostly converges before 20 epochs.

  6. The ground truth images should also be included (R5) The ground truth depth for SCARED is sparse, and a relatively complete map is available only for the first frame of each video. All the other GTs are obtained by reprojecting the GT of the first frame using the known camera kinematics. Therefore, the GTs for some of the selected visualizations contain only small valid regions, and we chose not to show the GT, again following previous work [1-2].

[1] Shao S, Pei Z, Chen W, et al. Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue. Medical Image Analysis, 2022, 77: 102338.

[2] Yang Z, Pan J, Dai J, et al. Self-Supervised Lightweight Depth Estimation in Endoscopy Combining CNN and Transformer. IEEE Transactions on Medical Imaging, 2024.




Meta-Review

Meta-review not available, early accepted paper.


