Abstract

Relative monocular depth, inferring depth correct up to a shift and scale from a single image, is an active research topic. Recent deep learning models, trained on large and varied meta-datasets, now provide excellent performance in the domain of natural images. However, few datasets exist which provide ground truth depth for endoscopic images, making training such models from scratch infeasible. This work investigates the transfer of these models into the surgical domain, and presents an effective and simple way to improve on standard supervision through the use of temporal consistency self-supervision. We show temporal consistency significantly improves on supervised training alone when transferring to the low-data regime of endoscopy, and outperforms the prevalent self-supervision technique for this task. In addition, we show our method drastically outperforms the state-of-the-art method from within the domain of endoscopy. We also release our code, models, and ensembled meta-dataset, Meta-MED, establishing a strong benchmark for future work.
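
To make "correct up to a shift and scale" concrete, the sketch below fits a single scale and shift that best map a prediction onto a reference depth map in the least-squares sense, then reports the mean absolute error of the aligned prediction. It is an illustrative assumption only: the function names `align_scale_shift` and `aligned_mae` are hypothetical, and this is not claimed to be the paper's SSIMAE definition or the code in the released repository.

```python
import numpy as np


def align_scale_shift(pred, target, mask=None):
    """Return (s, t) minimising sum((s * pred + t - target) ** 2) over valid pixels.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones_like(pred, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    # Design matrix [pred, 1] so that A @ [s, t] approximates target.
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target[mask], rcond=None)
    return float(s), float(t)


def aligned_mae(pred, target, mask=None):
    """Mean absolute error after the least-squares scale and shift alignment."""
    s, t = align_scale_shift(pred, target, mask)
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones_like(pred, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    return float(np.mean(np.abs(s * pred[mask] + t - target[mask])))
```

Under this sketch, a prediction that differs from the reference only by an affine map, for example `target = 2.5 * pred - 0.7`, scores an error of essentially zero, which is the sense in which a relative depth model is not penalised for an unknown scale or offset. The optional `mask` stands in for pixels without valid ground truth; how such pixels are handled in the paper is not stated here.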

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1332_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1332_supp.zip

Link to the Code Repository

https://github.com/charliebudd/transferring-relative-monocular-depth-to-surgical-vision

Link to the Dataset(s)

https://github.com/charliebudd/transferring-relative-monocular-depth-to-surgical-vision

BibTex

@InProceedings{Bud_Transferring_MICCAI2024,
        author = {Budd, Charlie and Vercauteren, Tom},
        title = {{Transferring Relative Monocular Depth to Surgical Vision with Temporal Consistency}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a method to address the data shortage issue in the depth estimation problem during endoscopic surgery. The proposed transfer method consists of three components: standard supervision, a temporal consistency loss, and an augmentation consistency loss. In addition, the paper curates existing depth datasets into a collection that will be released for further research.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper addresses depth estimation in endoscopic surgery using fine-tuning instead of building a new model. The proposed methods are simple and efficient.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It would be beneficial to test more depth estimation models. While two models showed consistent improvement, the proposed method’s effectiveness can vary depending on the type of dataset used for pre-trained depth estimation models (i.e., domain gaps). It would also be helpful if the author could analyze some of the failure cases.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is true that the proposed method was effective in solving the existing problem; however, more innovation or experiments are needed.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    • The paper demonstrates the successful transfer of state-of-the-art transformer-based relative monocular depth models from natural image meta-datasets to the surgical domain.
    • It showcases significant improvements in performance through careful fine-tuning and the incorporation of self-supervision techniques, particularly temporal consistency.
    • The research highlights the potential of transferring natural image models to endoscopy, showcasing the benefits of leveraging large-scale unlabeled data for depth estimation tasks in surgical settings.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clear and well-structured presentation of the research methodology, experiments, and results.
    • Comprehensive evaluation metrics, including SSIMAE for individual image accuracy and temporal smoothness analysis for endoscopic footage clips.
    • Comparison with existing state-of-the-art methods in endoscopy, demonstrating superior performance and advancements in the field.
    • Release of code, models, and the Meta-MED dataset to facilitate further research and benchmarking in the domain of monocular endoscopic depth estimation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Limited discussion on the generalizability of the proposed method to different types of surgical procedures or endoscopic imaging systems.

    • Further exploration of the impact of different hyperparameters or model architectures on the performance could enhance the robustness of the findings.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • It would be beneficial for the authors to address the scalability of their approach to diverse surgical scenarios and datasets, emphasizing the adaptability of the proposed method.

    • Future work could focus on exploring the potential integration of additional tasks to further enhance the performance of the depth estimation models in surgical vision applications. The authors are encouraged to engage with the research community through workshops or presentations to share their methodology, results, and insights, fostering collaboration and knowledge exchange in the field.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper makes contributions to the domain of monocular endoscopic depth estimation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper the authors compile a dataset from multiple public sources to create Meta-MED, which they make public. They use this dataset to fine-tune the Depth Anything model to the surgical domain. They also add a self-supervised temporal consistency loss that improves results. They add a scale- and shift-invariant error metric, SSIMAE, which they use in training and testing. They perform qualitative and quantitative experiments with ablation studies and comparison to other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a novel approach to depth estimation. Their results show improvements to the state of the art. The paper is well written and structured. The novelty is clearly stated. Including a table explaining Meta-MED data used for training and testing is very helpful. They also show helpful video results in supplementary material.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The weaknesses are mainly points that need clarification. In the detailed comments, I mention various clarifications that the authors could make. Another weakness is not comparing to [7]. The reason they provide is not very convincing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Meta-MED is claimed to be public but I couldn’t find it. I assume it would be made available upon publication. Making Meta-MED public would make research easier for other researchers, but the licenses of the constituent datasets should be clearly stated when it is used. The code is public but the link is not in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. “relative depth, meaning the estimated depth values are only sought up to scale and shift”. This statement is confusing, I believe the authors might have meant not up to scale and shift, since relative depth means not up to scale and shift.
    2. It is unclear what is meant by shift. Please explain why there would be a shift between the ground truth and the estimated depth.
    3. “structure from motion (SFM) approaches had been used to generate pseudo ground truth depth”. This statement is inaccurate. Such methods use the idea of SFM but they are self supervised where they estimate pose and depth to reconstruct the original image and only use reconstruction losses as a supervision signal.
    4. Please clarify what is meant by: “While elegant, this does not provide a scalable solution for general surgical vision”. All methods mentioned train on a specific dataset whether ex-vivo or in-vivo it is unclear what you mean by general surgical vision.
    5. There are more methods after af-sfmlearner that are not mentioned, such as:
       a. Self-Supervised Lightweight Depth Estimation in Endoscopy Combining CNN and Transformer
       b. Tackling Challenges of Low-texture and Illumination Variations for Endoscopy Self-supervised Monocular Depth Estimation
       c. EndoDepthL: Lightweight Endoscopic Monocular Depth Estimation with CNN-Transformer
       d. Task-Guided Domain Gap Reduction for Monocular Depth Prediction in Endoscopy
       e. LightDepth: Single-View Depth Self-Supervision from Illumination Decline
    6. Another possible relevant paper to include: Depth Anything in Medical Images: A Comparative Study
    7. “present results for depth up to scale only rather than scale and shift,” it is possible to evaluate on only scale invariant results and compare to [7] by removing the shift from SSIMAE when testing.
    8. Please clarify where equation 1 is adapted from.
    9. Please clarify where was the Mslow Mfast idea adapted from or if it was novel.
    10. Are alpha and beta the same for all images or different?
    11. Fig. it would be interesting to comment on the masking result and whether it looks reasonable.
    12. In the evaluation the authors mention using a clip from the SCARED dataset for the trajectory. However, earlier in the paper, they mention using only 45 images from SCARED. Please clarify.
    13. The results are interesting. More discussion would be nice to explain why the models with Laug were not the best.
    14. “Following the method used for Depth Anything, we also experimented with multitask learning for binary tool segmentation.” this is unclear; what is meant by multitask? Were depth and segmentation trained at the same time or was segmentation trained separately?
    15. As a limitation of the framework, it would be good to note in the paper that the same metrics are used for both training and testing.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper needs some clarifications, but the main aspects for acceptance are present. This includes novelty, and good experiments, results, introduction, and writing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper presents an approach to transfer the outstanding performance of general-purpose “natural” monocular depth estimation models to the domain of surgical endoscopy. The authors propose a fine-tuning pipeline that leverages custom augmentations, a combination of full supervision and self-supervision, and temporal consistency, applied to a popular transformer-based architecture for monocular depth estimation. Moreover, the authors combine multiple publicly available surgical datasets into a larger meta-dataset that is employed for this purpose. The authors also provide the details of an evaluation phase that shows the results of their contribution compared to state-of-the-art baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written, organized and structured. Every detail is carefully explained and justified. The motivation behind this contribution is extremely interesting and will prove useful for future work: the importance of depth data for 3D reconstruction tasks in surgery is undeniable and effectively translating the knowledge of ViTs is the key to jumps in state of the art performance. The contribution of the paper is surprisingly simple (a fine-tuning) but it is sensible, well contextualized, exhaustively described and adequately validated. Collecting (and releasing) a large meta-dataset that combines popular well-established smaller datasets further elevates this work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This work would be complete with the acknowledgement of some limitations, which are completely missing. For example: could camera motion interfere with the temporal consistency loss? Also, assessing scale preservation and distortion in the predicted depth maps might be useful, as depth estimation ViTs for natural images often have to deal with large scale ranges, anisotropy in depth detailing and sky depth. Without explicitly accounting for these factors in the loss, the depth maps predicted by the fine-tuned models might be distorted to some extent. I would also specify which hardware was used for training.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Most of my comments are reported in the strengths and weaknesses paragraphs. It was a pleasure to read a paper with a clear motivation, a well-targeted application and a simple and effective approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a well-targeted application undertaken with a simple and effective approach: the authors provide an advanced fine-tuning methodology that includes a multi-dataset source and a few critically chosen loss functions, with the goal of adapting depth estimation ViTs to the surgical endoscopy domain. This same approach is also validated and compared with multiple baselines.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available, early accepted paper.


