Abstract

Video inpainting fills in corrupted video content with plausible replacements. While recent advances in endoscopic video inpainting have shown potential for enhancing the quality of endoscopic videos, they mainly repair 2D visual information without effectively preserving crucial 3D spatial details for clinical reference. Depth-aware inpainting methods attempt to preserve these details by incorporating depth information. Still, in endoscopic contexts, they face challenges including reliance on pre-acquired depth maps, less effective fusion designs, and neglect of the fidelity of 3D spatial details. To address these challenges, we introduce a novel Depth-aware Endoscopic Video Inpainting (DAEVI) framework. It features a Spatial-Temporal Guided Depth Estimation module for direct depth estimation from visual features, a Bi-Modal Paired Channel Fusion module for effective channel-by-channel fusion of visual and depth information, and a Depth Enhanced Discriminator to assess the fidelity of the RGB-D sequence composed of the inpainted frames and estimated depth images. Experimental evaluations on established benchmarks demonstrate our framework’s superiority, achieving a 2% improvement in PSNR and a 6% reduction in MSE compared to state-of-the-art methods. Qualitative analyses further validate its enhanced ability to inpaint fine details, highlighting the benefits of integrating depth information into endoscopic inpainting.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0179_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0179_supp.pdf

Link to the Code Repository

https://github.com/FrancisXZhang/DAEVI

Link to the Dataset(s)

https://datasets.simula.no/hyper-kvasir/
https://github.com/endomapper/Endo-STTN

BibTex

@InProceedings{Zha_DepthAware_MICCAI2024,
        author = { Zhang, Francis Xiatian and Chen, Shuang and Xie, Xianghua and Shum, Hubert P. H.},
        title = { { Depth-Aware Endoscopic Video Inpainting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a framework for inpainting endoscopic videos to remove corruptions. It proposes a framework based on depth guidance, where a depth map is estimated from corrupted inputs and fed as additional reference for inpainting. The model is evaluated on the HyperKvasir dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Novel framework: The paper introduces a novel framework for depth-aware inpainting, which combines several existing modules into a cohesive system.

    • Good results: The proposed framework shows improved results for endoscopic video inpainting on the HyperKvasir dataset compared to other methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Limited evaluation: The paper lacks a thorough evaluation of aspects that are integral to the application, such as 3D spatial consistency and temporal consistency. A comparison to similar depth-guided or otherwise guided (e.g., edge-guided, flow-guided) methods is missing. Additionally, two out of the four baselines originate from the pre-deep-learning era and can thus be considered outdated.

    • Lack of clarity: Not all aspects of the method and experiments are clearly explained, for example, how inpainting ground truth is obtained for training and testing, and how STTN was adapted to the task.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • I have reservations about the suitability of generative inpainting as an aid in clinical decision making. There exists a risk that pathologies and abnormalities may be removed during the inpainting process.

    • The comparison to other methods is somewhat limited, as it solely relies on the experimental results reported in related work (Ref. [5]). Such a comparison might not be fair due to a different experimental setup (hardware, software), hyperparameters (batch size), and implementation details (data processing, data augmentation). Using results obtained from the authors’ own experiments would provide a more reliable basis for comparison. Additionally, while several methods for depth-guided video inpainting are mentioned in the paper (e.g., [6,7,8]), no comparison to these closely related methods is made. Aside from depth guidance, many other guidance methods exist, such as flow-guided [1] or edge-guided [2] methods, which were also omitted. Instead, two of the baselines ([11,22]) date from a pre-deep-learning era and can therefore be considered outdated.

    • One of the key distinctions between the presented work and the benchmark by Daher et al. [5] is the introduction of a parallel depth branch, which probably significantly increases the parameter count and model capacity. It would be interesting to compare the two in terms of FLOPs, runtime, and model parameters to assess whether the improved performance stems from a clever network design or, partly, simply from an increase in parameters.

    • The paper claims that an omission of “assessing the 3D spatial fidelity” in related work leads to less reliable inpainting. However, it is unclear what this means, and such a claim should be substantiated with corresponding experiments.

    • Although 3D spatial consistency and 3D spatial awareness are emphasized several times as an important capability and major contribution, the experiments fall short in evaluating this aspect thoroughly. Aside from a single example of depth RMSE, no qualitative or quantitative results regarding the 3D consistency of the framework are given.

    • For video inpainting, the literature commonly reports metrics connected to temporal consistency, such as flow warping error and video Fréchet inception distance (VFID). I think these metrics would shed more light on the performance of the framework in terms of temporal consistency, which should be important for providing surgeons with artifact-free videos. Qualitative video results would also help in assessing the temporal consistency of inpainted frames.

    • I think the description of STGDE should be revised. Currently, a considerable portion of the description is dedicated to reiterating the basics of STTN. More care should be taken to differentiate between the authors’ own designs/formulations and related work.

    • How does the DED assess temporal dimensions? Is there any explicit strategy for this?

    • It does not appear that the dataset used provides a ground truth for the inpainting task, so I am wondering what images are used for computation of the metrics. If the reference images shown in Fig. 1 are used, it may be more meaningful to utilize metrics that do not rely on pixel-by-pixel concordance but instead on perceptual metrics such as learned perceptual image patch similarity (LPIPS) or Fréchet inception distance (FID); a minimal sketch of computing LPIPS is given after the references below.

    • I have difficulties interpreting Figure 3. The images are very small, making it difficult to see any differences. There is no ground truth provided for RGB frames. It is unclear why the authors’ result was generated using DepthNet from the RGB frame instead of utilizing the intermediate depth map.

    • For the experiment in Table 2, the question arises as to why only a subset of the features would be used for depth estimation. Maybe the rationale behind this experiment could be explained more clearly.

    [1] Li, Zhen, et al. “Towards an end-to-end framework for flow-guided video inpainting.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
    [2] Gao, C., Saraf, A., Huang, J.-B., Kopf, J. “Flow-edge guided video completion.” Computer Vision – ECCV 2020, Part XII, pp. 713–729. Springer International Publishing, 2020.
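    As a concrete illustration of the perceptual-metric suggestion above, the following is a minimal sketch of how LPIPS could be computed with the off-the-shelf lpips package; the random tensors stand in for inpainted and reference frames and are illustrative assumptions, not part of the paper’s evaluation protocol.

        # Sketch only: LPIPS between inpainted and reference frames using the
        # `lpips` package; inputs are (N, 3, H, W) tensors scaled to [-1, 1].
        import torch
        import lpips

        loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone; 'vgg' is also available

        # Placeholder tensors standing in for inpainted and reference frames.
        inpainted = torch.rand(4, 3, 256, 256) * 2 - 1
        reference = torch.rand(4, 3, 256, 256) * 2 - 1

        with torch.no_grad():
            dist = loss_fn(inpainted, reference)  # shape (N, 1, 1, 1); lower is better
        print(dist.mean().item())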

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The primary reason for recommending rejection is the insufficient evaluation of the framework. A more convincing evaluation could be made with a comparison to similar guided, deep learning-based video inpainting techniques, alongside state-of-the-art metrics such as (V)FID. Additionally, incorporating metrics that assess the spatial and temporal consistency of the framework would enhance the strength of the evaluation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents a depth-aware endoscopic video inpainting framework to address corruptions in such videos, such as specular reflections and instrument shadows. It proposes three main modules, each with a specific role: depth estimation, fusion of visual and depth information, and fidelity validation of inpainted frames and estimated depth images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Inclusion of depth for endoscopic video inpainting: The paper introduces a depth-aware inpainting method to mitigate the corruptions due to specular reflections or instrument shadows in endoscopic videos.
    • Ablation study: The paper includes an ablation study to analyze the importance of different modules by replacing each module with some other baseline network.
    • Generalizability validation: Some qualitative results are shown on a dataset apart from the one used for training.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some parts are difficult to follow: It is sometimes difficult to follow the different concepts presented in the paper.
    • Some results could have been added: (a) Although the study shows qualitative outcomes on another dataset, SERV-CT, quantitative outcomes are missing. I would have preferred quantitative results as well for better generalizability validation and analysis. (b) Table 1 shows the ablation study by removing modules one by one. It would have been interesting if experiments with a single module had also been added. Currently, the outcomes w/o DED (with STGDE and BMPCF) are almost comparable with those of Daher et al.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • It is sometimes difficult to tell whether only reflections and instrument shadows are considered in the study or whether some other artefacts are also included. This matters because many other artefacts, such as ghost colours, motion blur, etc., exist in endoscopic videos. It should be made clearer in the paper.
    • Section 2.3 mentions GEN loss and DED loss, and it is suggested that the same notations be reflected in Fig. 2.
    • Section 3.1 mentions that PSNR, SSIM and MSE are computed for corrupted regions. I am curious how the overall translation performance is assessed for the other regions, which might be corrupted during this translation if inpainting is performed using an adversarial approach. If this is not the case, it is suggested that more clarity be provided on the procedure, as it is currently difficult to follow in some places.
    • Section 3.1 mentions “..every 5 corrupted frames alongside 10 nearly corrupted frames sampled for reference..”. Is this line correct? Why were corrupted frames selected for reference? Fig. 1 caption states that “reference frames are selected from less corrupted frames”. It is suggested to check the statement in section 3.1 and provide clarity on it. Also, on what basis are the reference frames selected? This should also be clarified.
    • The depth ground truth is generated from a pre-trained endoscopic depth estimator. Why is the same method not used directly instead of STGDE when it is already being used to generate the ground truth? What could be the reason behind the better performance of STGDE over the depth estimator in Table 1?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting study by incorporating depth information for inpainting corrupted regions of endoscopic videos. While the model shows novelty, the paper’s clarity could be improved, as some sections are challenging to follow, prompting questions that require clarification.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After reading the rebuttal, I stick to my previous recommendation of ‘weak accept.’ While the authors have addressed several of my concerns, I still find some concepts difficult to follow. For example, as previously mentioned, it remains unclear how the translation approach ensures the retention of details other than those that are inpainted. The paper does not sufficiently explain how this aspect is handled within the DED module. Further clarification and detailed discussion on such points would enhance the overall understanding and strength of the paper.



Review #3

  • Please describe the contribution of the paper

    The paper presents a depth-aware endoscopic video inpainting method that takes ideas from general depth-aware video inpainting and adapts them to endoscopy. Specifically, the authors add a spatial-temporal guided depth estimation module that generates depth and inpainted image features. These features are then fused using a Bi-Modal Paired Channel Fusion module. After that, a depth-enhanced discriminator with reconstruction losses is introduced that takes RGB and depth into account. The reconstruction losses include L1 for depth against its pseudo ground truth from a pretrained model, as well as L1, perceptual, and style losses for the inpainted image against its pseudo ground truth before masking. The paper also performs comparisons with other methods using qualitative and quantitative evaluations. It also includes ablation studies and visual results on the downstream task of depth estimation.
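    For readers who want a concrete picture of the loss combination summarized above, here is a minimal sketch of an L1 + perceptual + style reconstruction loss built on VGG-19 features plus an L1 depth term; the chosen feature layers, the Gram-matrix style term, and the loss weights are generic assumptions and not the paper’s actual implementation.

        # Hedged sketch: L1 + perceptual + style losses on VGG-19 features plus an
        # L1 depth term. Layer indices and weights are assumptions, not the paper's.
        import torch
        import torch.nn as nn
        import torchvision.models as models

        class VGGFeatures(nn.Module):
            """Extracts a few intermediate VGG-19 feature maps (frozen)."""
            def __init__(self, layers=(3, 8, 17)):  # roughly relu1_2, relu2_2, relu3_4
                super().__init__()
                self.vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
                self.layers = set(layers)
                for p in self.parameters():
                    p.requires_grad_(False)

            def forward(self, x):
                feats = []
                for i, layer in enumerate(self.vgg):
                    x = layer(x)
                    if i in self.layers:
                        feats.append(x)
                return feats

        def gram(f):  # style representation of a feature map
            b, c, h, w = f.shape
            f = f.view(b, c, h * w)
            return f @ f.transpose(1, 2) / (c * h * w)

        def reconstruction_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, vgg,
                                weights=(1.0, 0.1, 10.0, 1.0)):
            l1 = nn.functional.l1_loss
            pf, gf = vgg(pred_rgb), vgg(gt_rgb)
            loss_perc = sum(l1(p, g) for p, g in zip(pf, gf))
            loss_style = sum(l1(gram(p), gram(g)) for p, g in zip(pf, gf))
            return (weights[0] * l1(pred_rgb, gt_rgb) + weights[1] * loss_perc
                    + weights[2] * loss_style + weights[3] * l1(pred_depth, gt_depth))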

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written. Even though there are multiple complex parts of the system, the authors are able to explain them well and clearly state their contributions. The paper is well structured. The experiments are extensive; the authors perform ablation studies as well as comparisons to other methods. They also include a few results on depth estimation as a downstream task. Their method shows improvement compared to other state-of-the-art methods. Their framework is novel, making use of the literature while still adding novelty to adapt it to the endoscopic domain.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness is that the authors do not explain why they do not compare to [8]. There are also some clarifications that need to be made, as mentioned in the detailed comments: mainly, why the authors evaluate depth estimation quantitatively on only one example, and why the online inference performance analysis was not done for [5].

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors claim that the benefit of inpainting is to help surgeons; however, a more direct benefit is improving downstream tasks.
    2. “prodect” should be product
    3. Even though the system is complex, the authors explain it well. Still, in a journal extension I would advise having a summary after each section of the methodology saying which parts are adapted from where, which parts are novel, and the reason for including each part, or even a table summarizing these aspects.
    4. In the methodology, it is unclear what the group-wise convolution is. Is it an architecture adapted from [16]? The same goes for the encoders and decoders used: where is their architecture taken from? Please reference that.
    5. It is better to refer to the supplementary material within the main paper.
    6. It is unclear to me why the authors did not compare to [8]. The authors argue in the literature review that [8] needs ground-truth depth. However, since a pretrained depth model is used to obtain pseudo ground truth for the proposed training, the same could also be used to train [8].
    7. Corruptions are defined as reflections or instrument shadows. Since the authors use masks from [5], these are mainly reflection masks.
    8. In Fig. 3, one visual example is shown with quantitative results. Why aren’t quantitative results included for the whole SERV-CT test set?
    9. Table 2 discussion: it is unclear which blocks are being referred to. Please state that these are the STGDE TB blocks, with number Ns. It is also unclear here whether the models were trained with different Ns values, and why the results would differ if the last or the first 4 blocks are used.
    10. In the Online Inference Performance Analysis, the authors claim that their model can be run in an online fashion using only previous frames. However, as I understand from previous sections, the authors use the same sampling technique as [5]: “For inference, the model processes every 5 corrupted frames alongside 10 nearby corrupted frames sampled for reference”. Please explain whether you change that for this experiment, and if you do, why the same cannot be done for [5] to enable a comparison.
    11. In the supplementary material, the references are split across the two pages, which should be fixed.
    12. The authors also do not mention that the masks they use from [5] are shifted, which is why they have a pseudo ground truth in training. It would be good to mention that the pseudo ground truth is similar to [5].
    13. For a journal extension it would be interesting to see the effect of the different losses included.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though the paper could be improved, the authors did present a novel framework and performed ablation studies and showed improvement when compared to the state of the art.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I still believe this paper should be accepted. I would like to point out that [8] does have an available code and model: https://github.com/lishibo888/DGDVI. It would be interesting to compare with it in a journal extension, as well as adding quantitative results for Fig. 3 on the full test sequence.




Author Feedback

We thank the reviewers for the valuable feedback and positive comments about the novel framework (R3, R4, R5), extensive experiments (R5), and good results (R3).

We noticed suggestions for additional results. MICCAI doesn’t allow new results in rebuttal, but we are willing to include more details in the final paper or appendix if permitted.

  1. More Quantitative Results (R3, R4, R5): Our primary results (Table 1) compare against the existing benchmark [5]. For a more intuitive comparison, we presented qualitative findings (Fig. 3) to show the generalization and the preservation of 3D details, showcasing the superiority in helping downstream tasks (R5). More quantitative results on 3D consistency could be included in the final version if permitted.

  2. Baseline Selection (R3, R5): Flow-guided or flow-edge guided methods were not included initially because the typical brightness constancy assumption for flow estimation is often violated in endoscopy due to frequently changing lighting conditions [15]. Other depth-guided methods were not compared because [6] and [7] require LiDAR data and uncorrupted ground truth, respectively, which cannot be obtained by current endoscopic cameras. Additionally, [8] has not released its code and model.

  3. Table 2 (R3, R5): In a deep neural network, shallower layers learn low-level features (e.g., textures), while deeper layers learn high-level features (e.g., semantics) [Ref1]. In Table 2, three variants are tested with the same number of layers but different layers for depth estimation. The results show that using all layer features achieves the best performance, demonstrating the necessity of our current STGDE design.

Ref1: Hu, H., et al. “Learning implicit feature alignment function for semantic segmentation.” In: ECCV, pp. 487–505. Springer, 2022.

For R3: R3-1. Temporal Consistency: We respectfully argue that warping errors and VFID are not convincing for our task. Warping errors rely on flow estimation, which suffers from frequent lighting changes in endoscopy, making it noisy. For VFID, recent research [Ref2] has shown that it focuses more on individual frames than on temporal dynamics. Thus, we qualitatively demonstrate temporal consistency by showing nearby frames with less corruption in Fig. 1. We will release video examples for further demonstration on our GitHub page.

Ref2: Ge, S., et al. “On the content bias in Fréchet video distance.” In: CVPR, 2024.

R3-2. Ground Truth: We used the pseudo ground truth provided by [5], which establishes it by detecting corrupted regions and shifting the masks onto the same frames.

R3-3. DED: It includes 3D convolutions for temporal assessment.
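To make the point above concrete, here is a minimal hypothetical sketch of a discriminator over RGB-D clips built from 3D convolutions, so that real/fake decisions depend on temporal context; the channel widths, kernel sizes, and layer count are illustrative assumptions and not the paper’s actual DED architecture.

    # Hypothetical sketch: a 3D-convolutional discriminator over RGB-D clips of
    # shape (B, 4, T, H, W). Widths and kernels are assumptions, not the paper's DED.
    import torch
    import torch.nn as nn

    class Clip3DDiscriminator(nn.Module):
        def __init__(self, in_ch=4, base=64):  # 4 channels = RGB + estimated depth
            super().__init__()
            def block(cin, cout):
                return nn.Sequential(
                    nn.Conv3d(cin, cout, kernel_size=(3, 5, 5),
                              stride=(1, 2, 2), padding=(1, 2, 2)),
                    nn.LeakyReLU(0.2, inplace=True),
                )
            self.net = nn.Sequential(
                block(in_ch, base),
                block(base, base * 2),
                block(base * 2, base * 4),
                # patch-level real/fake scores over space and time
                nn.Conv3d(base * 4, 1, kernel_size=(3, 5, 5),
                          stride=(1, 2, 2), padding=(1, 2, 2)),
            )

        def forward(self, rgbd_clip):  # rgbd_clip: (B, 4, T, H, W)
            return self.net(rgbd_clip)

    # Example: scores = Clip3DDiscriminator()(torch.rand(2, 4, 5, 128, 128))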

For R4: R4-1. Reference Frames: The caption “Reference frames …” in Fig. 1 refers to specific cases showing inpainted content consistency. In our inference, we automatically sample 10 reference frames around the corrupted clip (one every 5 frames) without manual selection, similar to [20]. Other corrupted frames still provide varied uncorrupted regions, since the corrupted regions change frequently across frames, giving the framework more context.

R4-2. Pre-trained Model: The corrupted regions on input frames are masked out, so the pre-trained estimator [15] cannot provide convincing depth maps (Fig. 1 of our appendix). In contrast, our STGDE estimates depth better.

R4-3. Adversarial Training: Our DED evaluates the fidelity of the entire inpainted frame against the ground truth.

R4-4. Ablation Study: We removed each module to analyze its influence on our framework and are open to trying your suggested ablation.

For R5: R5-1. Online Inference: We modified the sampling to include 5 frames before the corrupted clip. [5] samples every 10 frames as a reference throughout the whole video, resulting in many more reference frames for longer videos than ours. Even if only sampling past frames, its inference speed cannot meet the needs of online inference.
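As a rough illustration of the two sampling regimes described here (reference frames drawn around the corrupted clip offline, and only past frames online), the following hypothetical sketch shows how such index selection could be implemented; the exact offsets and strides are assumptions, not the paper’s code.

    # Hypothetical sketch of the two reference-sampling regimes discussed above.
    # Offsets and strides are assumptions; only the past-frames-only constraint
    # for the online setting is taken from the rebuttal text.
    def sample_references_offline(clip_start, clip_end, num_frames, n_ref=10, stride=5):
        """Pick up to n_ref indices around [clip_start, clip_end), one every `stride` frames."""
        refs, offset = [], stride
        while len(refs) < n_ref and offset <= num_frames * stride:
            for idx in (clip_start - offset, clip_end - 1 + offset):
                if 0 <= idx < num_frames and len(refs) < n_ref:
                    refs.append(idx)
            offset += stride
        return sorted(refs)

    def sample_references_online(clip_start, n_ref=5):
        """Online variant: only frames before the corrupted clip are available."""
        return [max(clip_start - i, 0) for i in range(n_ref, 0, -1)]

    # Example: a corrupted clip covering frames 100-104 of a 400-frame video.
    print(sample_references_offline(100, 105, 400))  # [75, 80, 85, 90, 95, 109, 114, 119, 124, 129]
    print(sample_references_online(100))             # [95, 96, 97, 98, 99]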

We will also follow the reviewers’ other suggestions carefully.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


