Abstract

Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as device-related difficulties such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map unseen video frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of pituitary surgery, i.e. transsphenoidal adenomectomy, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections, together with the weights of the trained YOLOv7 model, is available at: https://surgicalvision.bmic.ethz.ch.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0376_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0376_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Sar_VisionBased_MICCAI2024,
        author = { Sarwin, Gary and Carretta, Alessandro and Staartjes, Victor and Zoli, Matteo and Mazzatenta, Diego and Regli, Luca and Serra, Carlo and Konukoglu, Ender},
        title = { { Vision-Based Neurosurgical Guidance: Unsupervised Localization and Camera-Pose Prediction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a pose estimation model for endoscopy. The proposed framework employs two models: the first generates bounding-box predictions for detecting anatomical landmarks in the image; the second is an encoder-decoder model that regresses the bounding boxes and landmark classes, together with the viewing direction.
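
    To make the two-stage structure described above concrete, the following is a minimal PyTorch sketch of such a pipeline's second stage; every module name, dimension, and output head is an illustrative assumption, not the authors' implementation.

        # Illustrative sketch only: maps per-frame landmark detections to a
        # location along a 1D surgical path, viewing angles (pitch, yaw), and
        # reconstructed "centered-view" boxes. All sizes are assumptions.
        import torch
        import torch.nn as nn

        NUM_CLASSES = 16   # hypothetical number of anatomical landmark classes
        BOX_DIM = 4        # (cx, cy, w, h) per detected landmark
        IN_DIM = NUM_CLASSES * (1 + BOX_DIM)  # presence flag + box per class

        class PoseFromDetections(nn.Module):
            def __init__(self, hidden=128):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Linear(IN_DIM, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                )
                self.loc_head = nn.Linear(hidden, 1)    # location on the path
                self.angle_head = nn.Linear(hidden, 2)  # pitch and yaw
                self.decoder = nn.Sequential(           # canonical (0-degree) boxes
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, IN_DIM),
                )

            def forward(self, det):                     # det: (batch, IN_DIM)
                z = self.encoder(det)
                loc = torch.sigmoid(self.loc_head(z))   # normalized to [0, 1]
                return loc, self.angle_head(z), self.decoder(z)

        model = PoseFromDetections()
        loc, angles, recon = model(torch.zeros(1, IN_DIM))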

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Once the object detector is trained, the encoders for pose estimation can be trained in an unsupervised manner.

    The paper discusses the limitations of the current approach, which helps to understand the scope and future directions of the presented work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper presents some metrics computed on the synthetic dataset, which is appreciated, considering the difficulty of obtaining pose ground truth in medical endoscopy. However, it does not compare against other methodologies for camera-pose estimation, making it difficult to evaluate the proposed model’s position with respect to the current art. For example, it is common to compare models against pose estimates generated by SfM (as an upper bound) in addition to other state-of-the-art models.

    Even though the work is presented as an unsupervised approach, it requires training a bounding box detector, requiring annotations for this task.

    It is unclear whether the output is a 3D trajectory/pose and what the coordinate reference for the obtained rotation is. For example, is the obtained pose expressed with respect to the world coordinate system or with respect to the current camera frame at time t?

    It is mentioned that the decoder regresses the bounding box information as seen from a centered (0-degree angle) view. How can it be guaranteed that the model generates centered views? I think this is explained at the end of Section 2.3 on page 5, but I find the explanation somewhat unclear. More clarification on this point is necessary.

    How is the localization component included in the framework? Is it predicted with respect to the video length?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper offers an option for camera view direction estimation. Some points for clarification are mentioned in the weaknesses section.

    In addition to the previous points, navigation systems usually localize the endoscope with respect to a reference (for example, CT). The proposed method appears to offer image-level and 2D clues about the endoscope’s position and orientation. Is this enough to provide precise localization during surgery?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The rating considers the points discussed in the weaknesses section. For example, the fact that annotations might be required for training the object detector reduces the impact of the unsupervised part of the pose estimation. In addition, the method is not compared against similar methods. For example, structure from motion could be employed to obtain a baseline upper bound for endoscopic sequences. 

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    Because of a limited field of view and a lack of distinguishable textures and landmarks, understanding the context of the image in order to plan a surgical pathway is a common problem during endoscopic procedures unless the surgeon is highly experienced. To address the need for more readily available access to anatomical details of the organ being investigated, the authors present a deep learning method that constructs a surgical path from surgical videos, modeling relative location and variations due to different viewing angles. Their approach has the potential to obviate the need for pre-surgical MR imaging as a complementary guidance modality. They validate their method on a synthetic dataset, as well as one that consists of surgical videos of transsphenoidal adenomectomies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work contains two novel components: (1) instead of identifying salient structures explicitly, the authors rely on the detection of semantic bounding boxes of these structures; (2) an important parameter returned by the algorithm is the pose of the camera.

    This work is important because the described approach enables the mapping of an unseen video’s image onto the surgical pathway, as well as estimating the viewing angle, aiming to provide additional guidance cues to reach the desired target safely. Overall, it provides a mechanism for the less experienced surgeon to navigate through otherwise unfamiliar territory (in this case, endoscopically guided pituitary surgery). One of the outcomes is reducing the reliance on intra-operative images.

    By providing a reference visualization of the desired or planned viewing direction necessary to locate a particular anatomical structure, the surgeon can modify the endoscope’s pose with respect to a reference direction to quickly access the appropriate view of the desired target.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is difficult for the reader to understand the clinical significance of this technique, without some discussion of how, from the surgeon’s perspective, the procedure is enhanced through the use of this approach.

    This work is proposed in pursuit of a more cost-effective real-time solution to the problem of increasing understanding of the anatomical environment and surgical pathways, and the supplementary videos (if they are in fact real-time) are compelling. However, some indication of processing speed and limitations on frame rate or resolution would be appreciated, as would a textual description of what we are seeing in the three supplementary videos. At the review stage, without access to ref. 18, it is difficult to understand what the authors mean by the statement in Section 3.3. The significance of the images being mapped to 0.81% of the latent space, rather than to a single point, needs more elaboration than the sentence that follows provides.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Although there is a general discussion of the network architecture, along with a generic schematic, I don’t think there is sufficient detail to reproduce it precisely.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The first reference to Figure 2 in the text indicates that it shows a sequence of video frames, whereas Fig. 2 shows an overview of the model used for the creation of the synthetic dataset. It would indeed be helpful to have a figure that shows “a sequence of video frames”.

    The second reference to Fig. 2 is appropriate for the actual Fig. 2.

    Fig. 3. It is not clear what the horizontal panel with dots along a line labeled 0 to 1 represents. There is a reference in the caption to the predicted location along the surgical path, but it’s difficult to see how this graphic relates to the image.

    The sentence “The idea that….. more or less so.” is, at more than 50 words, far too long; consider shortening it. Is the difference between 0.97 and 0.94 really significant in terms of the ultimate impact of this procedure on the patient or the surgeon’s performance? This sentence is also confusing: looking at Fig. 3, my impression was that the (yellow) arrows represent the current pose of the endoscope. Where is the “reference arrow” referred to in the above-mentioned sentence?

    What is the impact of the mean errors in the angle predictions of 0.43 and 0.69, with standard deviations of 2.38 and 1.74 (presumably in degrees), for the pitch and yaw angles, respectively? Given the general imprecision of manually guiding an endoscope, how important are these errors?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well-articulated study with direct clinical relevance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors present their work on unsupervised localization and camera-pose prediction for vision-based neurosurgical guidance. The key challenge brought up is the frequent disorientation experienced by physicians during endoscopic procedures due to the limited field of view of the endoscope, the lack of discernible features, and challenging intraoperative lighting conditions. An anatomy/object recognition network is used to construct a path of surgical features in an unsupervised manner. The goal is then to map unseen video frames onto the path and estimate the viewing angle and the path to a specific target. The authors evaluate their method on synthetic and real benchmark surgical datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The challenge of disorientation and need for navigation support during endoscopic procedures is an important issue and stands as a difficult problem to solve.
    • The authors do a good job of highlighting the clinical challenges (tissue deformation, lack of discernible features, poor FOV, lighting, etc.) and the limitations of current guidance systems (brain shift from preoperative imaging) to set the stage and describe the novelty of their efforts relative to prior contributions.
    • Novel approach in the integration of endoscope viewing direction (pitch and yaw) to inform the surgical trajectory and relation to other anatomical landmarks.
    • The authors present qualitative performance of their algorithm on real procedure data (no ground truth labels) and quantitative performance on a synthetic dataset generated in Blender.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Not really a weakness, but more of an observation – the current approach does not consider the physical properties of the endoscope and the likely/possible range of motion of a user as constraints on the looking-direction computation. For example, users are likely to move the endoscope around the patient’s body cavity at a similar rate, with large uncontrolled movements being very uncommon. If the per-frame direction is calculated, some sort of filtering could help suppress jitter (see the sketch after this list).
    • It might be worth mentioning alternative non-vision based approaches to endoscope look direction calculation that rely on other tracking approaches in the intro. For example, electromagnetic tracking, shape fiber-based tracking (like what’s used in the Ion catheter by Intuitive), or kinematic tracking in the case of a robotic system.
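
    As a toy illustration of the filtering suggested above (not part of the paper), an exponential moving average over per-frame angle predictions might look as follows; the smoothing factor is an arbitrary assumption.

        # Exponential moving average to damp jitter in per-frame (pitch, yaw)
        # predictions; alpha = 0.2 is an arbitrary choice, not from the paper.
        def smooth_angles(angles, alpha=0.2):
            """angles: iterable of (pitch, yaw) tuples, one per frame."""
            smoothed, state = [], None
            for pitch, yaw in angles:
                if state is None:
                    state = (pitch, yaw)
                else:
                    state = (alpha * pitch + (1 - alpha) * state[0],
                             alpha * yaw + (1 - alpha) * state[1])
                smoothed.append(state)
            return smoothed
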
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Why were only pitch and yaw incorporated in the construction of the viewing direction (and not endoscope roll)?
    • With the assumption that we know some generalized camera parameters of an endoscope used to capture data, how do you anticipate the performance of your approach would change?
    • How would you approach the extension of this model to more complex 3D anatomical regions where the 1D surgical path assumption does not hold?
    • How do you suggest that the information regarding direction/orientation to relevant anatomical landmarks is communicated to a surgeon user?
    • What is the runtime latency of your algorithm and is it suitable for real-time use?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Thanks for this work, it was an interesting read. My rating is due to (1) the novelty in the integration of endoscope viewing direction for path finding; (2) the detailed effort in validating your algorithm’s performance on real and synthetic datasets; and (3) the importance and need for a solution to these navigation challenges with many endoscopic procedures.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

First and foremost, we would like to thank the reviewers for taking the time to review our paper and for their insightful remarks. Below, we will respond to the reviewers’ specific comments as well as address general concerns.

General:

Inference Speed:

We do not yet have a streamlined setup that allows us to measure the inference speed of the entire framework. However, we recognize the importance of these numbers and provide isolated inference times for the YOLO network as well as our encoder:

YOLO: 13 ms (1920-pixel input, NVIDIA A100-40GB)

Encoder: 20 ms (average inference time per sample over 1500 samples, Intel Core i7-6700K CPU)

This adds up to a combined inference time of 33 ms per frame, i.e. roughly 30 frames per second, indicating that real-time applications should be feasible.
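
For reference, per-component latencies of this kind can be measured with a simple harness like the one below; the model objects, input shapes, and run counts are placeholders, not our actual setup.

    # Hypothetical timing harness; models and inputs are placeholders.
    import time
    import torch

    def mean_latency_ms(model, example, n_warmup=10, n_runs=100, device="cpu"):
        model = model.to(device).eval()
        example = example.to(device)
        with torch.no_grad():
            for _ in range(n_warmup):          # warm up caches / CUDA kernels
                model(example)
            if device == "cuda":
                torch.cuda.synchronize()       # GPU execution is asynchronous
            start = time.perf_counter()
            for _ in range(n_runs):
                model(example)
            if device == "cuda":
                torch.cuda.synchronize()
        return 1000 * (time.perf_counter() - start) / n_runs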

Reproducibility:

We are working on providing the code to ensure reproducibility. Additionally, we provide the model weights of the YOLO network and have built an online tool where researchers can upload their videos to get detection predictions (https://surgicalvision.bmic.ethz.ch).
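
For those who prefer to run the released YOLOv7 weights locally, a minimal sketch is given below; it assumes the standard WongKinYiu/yolov7 repository is on the Python path, and the weights and frame file names are placeholders.

    # Minimal local-inference sketch; file names are placeholders.
    import cv2
    import torch
    from models.experimental import attempt_load   # from the yolov7 repo
    from utils.general import non_max_suppression

    model = attempt_load("surgical_yolov7.pt", map_location="cpu").eval()

    frame = cv2.imread("endoscope_frame.png")        # BGR, HxWx3
    img = cv2.resize(frame, (640, 640))[:, :, ::-1]  # BGR -> RGB
    x = torch.from_numpy(img.copy()).permute(2, 0, 1).float().unsqueeze(0) / 255.0

    with torch.no_grad():
        pred = model(x)[0]
    dets = non_max_suppression(pred, conf_thres=0.25, iou_thres=0.45)
    print(dets[0])                                   # rows: (x1, y1, x2, y2, conf, cls)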

Specific:

Reviewer #1:

  • We apologize for the faulty reference to Fig. 2. We intend to add a frame sequence to the figure in the camera-ready version.
  • The arrow in Fig. 3 is indeed the current pose of the endoscope, and we meant that surgeons can orient themselves by referencing that arrow.
  • It is challenging to determine the impact of the errors in angle predictions; these are indeed in degrees. The reason behind providing these numbers was to show that, even though we make considerable simplifications, we can still predict the angles with relatively small error. The intent behind this orientation mechanism is to provide general feedback regarding viewing direction and magnitude, rather than exact angles, as these might be hard to interpret.

Reviewer #3:

  • We would like to thank the reviewer for the insightful suggestion and agree that incorporating movement constraints in combination with post-processing could improve performance significantly.

  • We considered incorporating endoscope roll into the model; however, for this particular surgery, the roll seemed to be negligible compared to yaw and pitch, and we therefore decided not to include it for simplicity. In other surgeries, endoscope roll could play a very significant role.
  • If camera parameters are available, we anticipate the predictions to improve since, in the current scenario, no camera parameters are used in the framework.
  • While we anticipate that various surgical paths can still be projected onto a 1D surgical path, expanding the current model to more complex surgical environments is our current research focus.
  • Since the endoscope video output is shown on a screen in the operating room, it could be superimposed in a fashion similar to car navigation applications. Another option could be showing the information in the corners of the screen since the video is circular.

Reviewer #5:

  • Since SfM is more general and our method is created with a specific purpose, we considered a direct comparison to be out of scope. However, we realize that a direct comparison is important and could provide valuable information.
  • We would like to clarify that supervision is indeed required for the detection part of the proposed method. The modeling of the viewing direction and the surgical path is performed in an entirely unsupervised manner.
  • The pose/localization/centered view is with respect to an abstract surgical path that best represents the data. This is learned by the model without supervision and generated by the decoders. Section 2.3 explains how we achieve consistency in the predicted rotation.
  • The best-case scenario would be to deploy a combination of existing reference methods, such as MRI, and our method. In this setting, existing methods could support precision, while the proposed method could provide live information.




Meta-Review

Meta-review not available, early accepted paper.


