Abstract

Monocular depth estimation in colonoscopy video aims to overcome the unusual lighting properties of the colonoscopic environment. One of the major challenges in this area is the domain gap between annotated but unrealistic synthetic data and unannotated but realistic clinical data. Previous attempts to bridge this domain gap directly target the depth estimation task itself. We propose a general pipeline of structure-preserving synthetic-to-real (sim2real) image translation (producing a modified version of the input image) to retain depth geometry through the translation process. This allows us to generate large quantities of realistic-looking synthetic images for supervised depth estimation with improved generalization to the clinical domain. We also propose a dataset of hand-picked sequences from clinical colonoscopies to improve the image translation process. We demonstrate the simultaneous realism of the translated images and preservation of depth maps via the performance of downstream depth estimation on various datasets.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1520_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1520_supp.pdf

Link to the Code Repository

github.com/sherry97/struct-preserving-cyclegan

Link to the Dataset(s)

endoscopography.web.unc.edu

BibTex

@InProceedings{Wan_Structurepreserving_MICCAI2024,
        author = { Wang, Shuxian and Paruchuri, Akshay and Zhang, Zhaoxi and McGill, Sarah and Sengupta, Roni},
        title = { { Structure-preserving Image Translation for Depth Estimation in Colonoscopy } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a method for domain transfer of colonoscopy images that preserves depth information.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • This paper generated two hand-picked datasets from clinical colonoscopies for depth estimation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • This paper just adopts existing deep learning algorithms, which shows a lack of novelty. • The datasets in the paper are not publicly available.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • The authors may consider adding the inference time of the proposed method.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experimental design in this paper is good. However, the overall method design lacks novelty. Also, the dataset is not shared to the public.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have justified the design of their model. I have no more comments.



Review #2

  • Please describe the contribution of the paper

    The authors propose an image translation method that generates realistic-looking video frames from synthetic colonoscopies while preserving depth information. This approach bridges the gap between annotated but unrealistic synthetic data and realistic but unannotated clinical data, to improve depth estimation on unseen clinical data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper adapts a modified CycleGAN for structure-preserving image translation from synthetic to clinical colonoscopy domains, ensuring that depth information is maintained while enhancing the realism of synthetic images.
    2. The introduction of two distinct datasets, one consisting of oblique views and the other of en face views, provides a novel resource for testing and validating the image translation method.
    3. The effectiveness of the image translation method is demonstrated using monocular depth estimation as a performance metric.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The presentation of Table 3 is somewhat confusing, and it’s unclear what network architecture each model uses or what training data was involved. Including detailed descriptions of each network and the training data would clarify this.
    2. The authors claim that generated images can be used “for supervised or semi-supervised training of arbitrary networks for depth estimation.” I hope the authors can clearly indicate which experiments are supervised and which are semi-supervised. Additionally, I don’t find sufficient support for the claim of “arbitrary networks.”
    3. Providing more information on how the oblique and en face datasets were manually selected would offer valuable context.

    In addition, I also have some questions about the experimental results.

    1. Does “Ours_{CG}” in Table 3 use the data generated by CycleGAN in Table 1? If this is the case, why is there such a small difference in the depth estimation results when there is a significant difference in the quality of the generated images?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In addition to the points mentioned above, I hope the authors can demonstrate the stability of predictions between adjacent frames in the sequence.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed image translation method demonstrates promising visual results. However, the description of the experimental section could be clarified further.

    Additionally, the lack of real datasets for quantitative evaluation in this research area (not just in this paper) remains a limitation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Author’s response partially addressed my concerns. I keep my original score.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a method to support monocular depth estimation in colonoscopy via synthetic-to-real image-to-image translation, addressing the main gap between the availability of unannotated clinical data and annotated but unrealistic synthetic data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper are its deep knowledge of the clinical domain and the video data. The effort behind preprocessing, decomposing videos into three datasets, and pairing them with available synthetic datasets denotes a clear understanding of the domain. The description of the process and the evaluation of the results (both for image translation and depth estimation) are modestly valid and clear. Publicly available data and code add value to the work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weaknesses. Mutual information used in image-to-image translation is not novel per se (https://arxiv.org/abs/1902.03938, as an example), but its application to preserving depth consistency is novel enough.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    A reference to https://arxiv.org/abs/1902.03938, or more recent work applying an MI loss to image translation, should be added. Vertical spacing between tables, captions, images, and paragraphs is not consistent throughout; I suggest the authors check consistency against the rules provided by the template.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper suggests a framework to handle a practical problem of data scarcity in the specific clinical domain, achieving realistic samples from synthetic data. The authors claim dataset and code availability. As the dataset is presented, the quality should be good. No major flaws in the method explanation or validation, enough overall novelty.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Clear responses, I confirm my choice.




Author Feedback

Dear Reviewers, Area Chairs, and Program Chairs:

Thank you for your insightful comments on our paper “Structure-preserving Image Translation for Depth Estimation in Colonoscopy”. We will take your suggestions into account moving forward. In this rebuttal, we would like to address the following concerns raised by the reviewers:

Concern 1: Lack of open source data and/or code We plan to publicly release our code and proposed datasets after the paper is accepted, and will update the final version of the paper to include these links wherever release upon acceptance is mentioned. The synthetic dataset used (SimCol3D) is already publicly available.

Concern 2: Lack of novelty (Reviewer #4) In the image translation component, we modify CycleGAN to use an additional mutual information-based loss to enforce depth consistency between the synthetic and generated images. While CycleGAN is an existing deep learning algorithm and mutual information losses have been proposed previously, we claim that the use of a mutual information loss for preserving depth consistency, especially without requiring feature extraction, is novel. For the downstream depth estimation task, we use the Monodepth2 architecture trained fully supervised for depth estimation. We note that while Monodepth2 is a common architecture for monocular depth estimation, this task is meant to demonstrate the effectiveness of image translation in allowing downstream depth estimation models to generalize well to challenging clinical frames. Thus, we are interested in the effect of changing the training data (and in particular, the effect of using our translated images) on the depth estimation result rather than performance improvements stemming from changes in the deep learning algorithm. Therefore, the use of an existing deep learning algorithm for this task is purposeful and not meant to demonstrate novelty in the depth estimation algorithm.
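The depth-consistency idea above (a mutual information loss computed directly on images, with no feature extraction) can be illustrated with a simple histogram-based MI estimate. This is a hedged NumPy sketch for intuition only: the function name and binning are our own choices, and it is not the authors' differentiable training loss.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information between two images.

    Operates directly on pixel intensities (no feature extraction).
    Illustrative sketch only -- not the paper's actual loss term.
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()            # joint distribution P(x, y)
    px = pxy.sum(axis=1, keepdims=True)  # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)  # marginal P(y)
    nz = pxy > 0                         # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

Maximizing such a quantity between a synthetic frame and its translated counterpart encourages the translation to keep the underlying scene structure while freely changing appearance.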

Concern 3: Data selection method for oblique and en face datasets We picked continuous sequences of the same (oblique or en face) viewpoint. Note, however, that the distinction between an axial and oblique view, and that between an oblique and en face view, is subjective so the cutoff between viewpoints is similarly subjective.

Concern 4: Ambiguity in Table 3 (Reviewer #5) Table 3 describes the depth estimation results measured on C3VD after median rescaling. All experiments labelled Baseline or Ours_{…} use the Monodepth2 architecture and are trained fully supervised, varying only on the training images. In particular, we use translated versions of the SimCol3D dataset, where the translation is performed using our various ablations and modifications of CycleGAN described in Table 2. The category label in the first column denotes whether the model had used C3VD data in training (where multi-shot models are trained in part or in whole on C3VD and zero-shot models are not trained on it). With regards to the unexpectedly high performance of Ours_{CG} in Table 3, we attribute this (and the overall similar performance across all models) to the domain gap between C3VD and clinical images. In particular, we find that high performance on this dataset does not require extensive training regimes (note the similar performance of NormDepth which is trained entirely self-supervised), so we rely upon the qualitative evaluation on clinical datasets as a more accurate indicator of generalization performance.
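For readers unfamiliar with the "median rescaling" mentioned above: it is the standard alignment step in monocular depth evaluation, scaling the scale-ambiguous prediction by the ratio of ground-truth to predicted medians before computing error metrics. The sketch below is an assumed, generic implementation (function name and metric selection are ours, not from the paper).

```python
import numpy as np

def median_rescale_metrics(pred, gt, eps=1e-6):
    """Standard monocular depth metrics after median rescaling (illustrative)."""
    mask = gt > eps                                            # valid GT pixels only
    pred = pred[mask] * (np.median(gt[mask]) / np.median(pred[mask]))
    gt = gt[mask]
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))           # mean absolute relative error
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))           # root mean squared error
    thresh = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(thresh < 1.25))                     # accuracy under threshold
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Because the prediction is rescaled to match the ground-truth median, a prediction that is correct up to a global scale factor scores perfectly, which is the appropriate evaluation for scale-ambiguous monocular methods.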

Concern 5: Ambiguity in training procedure for depth estimation (Reviewer #5) All our depth estimation results using our translation results are trained fully supervised and use the Monodepth2 architecture. However, the modular design of our framework allows for easy substitution of other architectures or training regimes (e.g. semi-supervised training) but we do not include those results in this paper.

Thank you for your time and consideration!




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes a method for supporting monocular depth estimation in colonoscopy by using image-to-image translation from synthetic to real images, addressing the gap between unannotated clinical data and annotated but unrealistic synthetic data. Strengths of this paper include a deep understanding of the clinical domain, meticulous preprocessing, and pairing of video data with synthetic datasets. The evaluation of image translation and depth estimation results is clear, and the availability of data and code adds significant value. While the use of mutual information in image translation is not novel, its application in preserving depth consistency is innovative. However, the paper lacks novelty as it mainly adopts existing deep learning algorithms, and the datasets are not publicly available. There are minor issues with the clarity of some tables and the experimental description. Given the strengths and weaknesses of this paper, I suggest accepting this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


