Abstract

In Minimally Invasive Surgery (MIS), temporally consistent depth estimation is necessary for accurate intraoperative surgical navigation and robotic control. Despite the plethora of stereo depth estimation methods, estimating temporally consistent disparity is still challenging due to scene and camera dynamics. The aim of this paper is to introduce the StereoDiffusion framework for temporally consistent disparity estimation. For the first time, a latent diffusion model is incorporated into stereo depth estimation. Advancing existing depth estimation methods based on diffusion models, StereoDiffusion uses prior knowledge to refine disparity. Prior knowledge is generated using optical flow to warp the disparity map of the previous frame and predict a reprojected disparity map in the current frame to be refined. For efficient inference, fewer denoising steps and an efficient denoising scheduler have been used. Extensive validation on MIS stereo datasets and comparison to state-of-the-art (SOTA) methods show that StereoDiffusion achieves best performance and provides temporally consistent disparity estimation with high-fidelity details, despite having been trained on natural scenes only.
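
The abstract's prior-generation step (warping the previous frame's disparity map with optical flow) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `warp_disparity`, the use of a backward flow field (from each current-frame pixel to its location in the previous frame), and nearest-neighbour sampling are all assumptions made for illustration.

```python
import numpy as np


def warp_disparity(prev_disparity, backward_flow):
    """Reproject the previous frame's disparity into the current frame.

    prev_disparity: (H, W) array of disparities for the previous frame.
    backward_flow:  (H, W, 2) array; for each current-frame pixel, the
                    (dx, dy) offset to its position in the previous frame
                    (an assumed convention, not taken from the paper).
    Pixels whose source falls outside the image are set to NaN.
    """
    h, w = prev_disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup of the source location in the previous frame.
    src_x = np.rint(xs + backward_flow[..., 0]).astype(int)
    src_y = np.rint(ys + backward_flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.full((h, w), np.nan, dtype=float)
    warped[valid] = prev_disparity[src_y[valid], src_x[valid]]
    return warped


# Toy example: every current pixel came from one pixel to its right.
prev = np.arange(12, dtype=float).reshape(3, 4)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0
warped = warp_disparity(prev, flow)
```

In this toy case the first three columns of `warped` equal the last three columns of `prev`, and the final column is NaN because its source lies outside the image. Such holes are one kind of artefact the refinement stage would have to fill.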

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0240_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0240_supp.zip

Link to the Code Repository

https://github.com/xuhaozheng/StereoDiff

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xu_StereoDiffusion_MICCAI2024,
        author = { Xu, Haozheng and Xu, Chi and Giannarou, Stamatia},
        title = { { StereoDiffusion: Temporally Consistent Stereo Depth Estimation with Diffusion Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    For the first time, as authors claim, this paper incorporates the diffusion model into the stereo depth estimation. The proposed algorithm, as the results reveal, outperforms the state-of-the-art stereo depth estimation methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea is novel through incorporating the diffusion model into the stereo depth estimation algorithm.
    2. The results demonstrate improved performances with respect to existing methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The effectiveness of the added diffusion module has not been explicitly validated.
    2. The results only show the monocular images. It is not clear whether the dataset the authors utilise is monocular or stereo, by observing the shown images.
    3. The statistical tests (e.g., paired t-tests) have not been done.
    4. The writing needs to be further improved while the discussion/description of the results should also be strengthened.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The ablation study should be added to validate the effectiveness of the added diffusion module.
    2. The results only show the monocular images. It is not clear whether the dataset the authors utilise is monocular or stereo, by observing the shown images.
    3. The statistical tests (e.g., paired t-tests) have not been done, and thus it is not clear whether the improvements are statistically significant.
    4. The writing needs to be further improved while the discussion of the results should also be strengthened.
    5. The authors are also encouraged to discuss the limitations of the paper.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The solid validation of the added diffusion model to the stereo depth estimation method.
    2. The description/discussion of the results is poor.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work introduces a StereoDiffusion framework to achieve temporally consistent depth estimation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. In this paper, a latent diffusion model is used for disparity refinement.
    2. The prior knowledge fed to the diffusion model is generated using optical flow to warp the disparity map of the previous frame.
    3. Experimental results validate the effectiveness of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper refines inter-frame disparity maps using the existing stable diffusion V2 method without any model improvements, merely application. I find the innovation insufficient. Moreover, the critical part depicted in Fig. 2 is not clearly described.
    2. The paper’s frequent mention of real-time inference methods merely involves directly adopting existing methods in [14], with a very concise description in the method section.
    3. While the conversion from disparity maps to depth maps should be explained in the paper, it only focuses on refining and comparing disparity maps in the method and experimental sections, which is inconsistent with the previous depth estimation task.
    4. The paper lacks essential ablation experiments to assess the impact of introducing the diffusion method. Additionally, real-time inference is mentioned but not compared in the experiments.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The code used in this paper will not be provided as indicated. This would not prove that the results are reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should enhance the description of the key aspects of disparity map refinement and provide a clearer explanation of the inference acceleration process. Additionally, conducting ablation experiments on the method is essential to determine the effectiveness of its various components.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper aims to generate disparity maps with greater temporal consistency by incorporating temporal information. However, the entire paper merely applies existing excellent diffusion methods [10] to optimize disparity maps, lacking detailed descriptions of key contributions and ablation experiments.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The authors’ response failed to address my concerns: 1. Lack of novelty: this paper only achieves stereo depth estimation using the existing Stable Diffusion V2 method with minimal improvement. 2. This work claims to achieve real-time inference. However, the actual inference speed of this work is only 5fps, which does not align with the stated claim.



Review #3

  • Please describe the contribution of the paper

    This paper presents the StereoDiffusion framework, a novel approach that incorporates a latent diffusion model into stereo depth estimation for the first time. The framework consists of three main parts: initially, it estimates disparity maps from the previous frame (ts-1) using two cameras. Next, it combines the estimated optical flow with the current frame to warp the previous frame’s disparity maps and predicts reprojected disparity maps (dts) for the current frame. Finally, using the current frame’s image as a condition, a latent diffusion model refines the reprojected disparity maps through just 10 iterations of denoising. Importantly, the paper demonstrates through extensive validation that the StereoDiffusion framework surpasses several advanced stereo depth estimation methods, achieving high fidelity in temporal consistency without the need for training on domain-specific datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper boasts several key strengths in advancing stereo depth estimation through the integration of a latent diffusion model:
    1. Innovation through Integration of Latent Diffusion Models: The principal innovation of this paper lies in its integration of latent diffusion models with stereo depth estimation. This approach is novel because it utilizes a Variational Autoencoder (VAE) to generate latent representations of disparity maps and conditioned RGB images, which are then refined through a diffusion process. This method effectively addresses variations in dynamic scenes, representing a significant improvement over traditional methods for maintaining temporal consistency.
    2. Superior Performance Metrics: The authors conducted extensive experimental analysis on the proposed StereoDiffusion model. The results demonstrate outstanding performance not only in terms of lower End-Point Error (EPE) and higher Intersection Over Union (IOU) scores but also in maintaining temporal consistency in disparity maps. Such consistency is critical for the continuity required in medical imaging and robotic surgery applications.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses identified in the StereoDiffusion paper primarily concern the foundational premises of the described methods and the thoroughness of the experimental validation:

    1. Lack of Quantitative Analysis for Certain Claims: The paper occasionally asserts broad claims about the performance and efficiency of the StereoDiffusion model without providing quantitative support or detailed comparative analysis with other methods that might use similar or different architectures, such as Conditional GANs. Importantly, for the purpose of optimizing reprojected disparity maps, Conditional GANs appear to be a more straightforward method that could offer speed or performance benefits under certain conditions. Moreover, the paper does not clearly explain why diffusion models are used for this task, and lacks experimental comparisons with approaches like Conditional GANs, leaving the purpose and advancement of using this model unclear.
    2. Absence of Ablation Studies: The paper lacks systematic ablation studies to dissect the contribution of individual components of the model, such as the denoising steps or the impact of using a latent diffusion model versus other types of models. Ablation studies could help in understanding the necessity and efficiency of each component in the proposed framework, thereby solidifying the claims made about the model’s performance.
    3. Insufficient Contrast with Existing Regression-Based Methods: While the paper emphasizes the novelty of integrating latent diffusion models with stereo depth estimation, it does not adequately compare its approach to existing regression-based methods or justify why the proposed method is superior in practical applications, especially given that regression models might be simpler or more interpretable.
    4. Temporal Consistency and Error Metrics: The paper’s performance in maintaining temporal consistency is validated qualitatively rather than quantitatively, which may not adequately demonstrate the model’s effectiveness across diverse conditions. This qualitative validation might overlook specific discrepancies that quantitative metrics could reveal.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. Clarification of Novelty and Methodological Foundations: The paper presents an innovative integration of latent diffusion models with stereo depth estimation. However, it would be beneficial to provide a clearer explanation of why this particular model was chosen over alternatives like Conditional GANs, which might be perceived as more straightforward for this application. It’s important to elucidate the specific advantages of the latent diffusion approach in terms of efficiency or performance in dynamic scenes to strengthen the justification for this choice.
    2. Quantitative Validation: The manuscript would benefit greatly from a more thorough quantitative analysis to support the claims about the superiority of the proposed method. This includes detailed performance comparisons with state-of-the-art methods that use different architectures. Specifically, adding a quantitative measure of how much improvement is achieved, particularly in terms of temporal consistency, would substantiate the claims made.
    3. Ablation Study: Incorporating systematic ablation studies would help in understanding the impact and necessity of each component of the framework, such as the denoising steps and the use of the latent diffusion model itself. This could provide valuable insights into what contributes most to the performance gains and under what conditions each component is most effective.

    In addition, there is a very small suggestion: although this paper is already very good in expression, it is suggested that the authors keep the symbols in the pictures consistent with the paper, including whether they are italicized, etc.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is significantly influenced by the innovative integration of latent diffusion models into stereo depth estimation, which introduces a new method for improving temporal consistency and depth estimation accuracy in dynamic scenes. However, this also raises some concerns and curiosities.
    1. Lack of Comparative Analysis: The absence of a comparative analysis with methods such as Conditional GANs raises concerns about the superiority of using latent diffusion models in terms of estimation accuracy and time efficiency for this task. Including such comparisons would help substantiate the claimed advantages of the proposed model over existing techniques.
    2. Depth of Validation: While the model has been extensively validated, it lacks quantitative depth in certain areas. A more detailed analysis, including ablation studies, would provide a clearer understanding of the effectiveness and efficiency of each component of the proposed model.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you for your rebuttal. However, it does not fully address my concerns regarding the paper. The authors claim that their conclusions are based on previous experiments, but this remains unconvincing. Additionally, the references in the manuscript are highly inconsistent in format, with many being non-peer-reviewed or lacking proper citations. These issues, along with the lack of code, which hinders reproducibility, exacerbate my overall negative impression of the experiments and analyses. Nonetheless, considering the high quality of the visual results presented, I am maintaining my original score.



Review #4

  • Please describe the contribution of the paper

    The paper describes a novel approach for stereo depth estimation using a latent diffusion model that utilizes prior knowledge generated through optical flow estimates, thus guaranteeing temporally consistent depth estimates.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes a novel approach for stereo depth estimation using optical flow maps and disparity computations as inputs to a latent model to propose a refined disparity map. Overall each concept (optical flow and latent models for disparity evaluation) is not novel on its own, but, to the best of my knowledge, both of them have not been combined previously and tested on endoscopic data. The advantage of this work is that some of the components (namely the RAFT models) did not need to be fine-tuned, and only the latent diffusion model underwent fine-tuning, but on natural scenes only. With good results on endoscopic data, this opens up the avenue to improve stereo depth estimation in the medical field while alleviating the burden of generating large annotated training datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main “weakness” is that the building blocks of this approach are not novel per se. These are concepts borrowed from other applications, but I still don’t think this should take away from the value of the paper given the quality of the reported results.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I honestly did not find any major flaws with the methods, study, or results reported. Obviously, additional datasets and other applications (e.g., estimating the position of an endoscope) can be further explored. But the authors have validated the contribution claims they made, and conducted fair comparisons to other methods as well.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments support the authors’ claims, and even though the different components of the approach are not novel, the authors demonstrated the usefulness of this approach, and its superiority compared to other approaches, which can benefit the community at large.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you for the rebuttal. I believe the authors have addressed my comments, albeit to some extent. Even though, as I mentioned before, additional studies can be conducted in the future to better support the authors’ claims and the work’s novelty, I think the quality of the reported results justifies the publication of this work, and would benefit the community.




Author Feedback

We would like to thank the reviewers for their constructive comments. All issues will be addressed in the revised paper. About reproducibility, we’ll share the code once the paper gets accepted.

R1

Ablation study: In Tables 1&2, we compare the disparity generated by RAFT-stereo with the refined RAFT disparity generated by our StereoDiffusion. This is our ablation study which validates the effectiveness of the diffusion model in refining RAFT’s disparity estimation. We have already tested the model by removing the optical flow component and we found that the accuracy and the temporal consistency of the depth estimation deteriorate. Hence, we did not include these results in our paper.

Monocular/stereo Image: Our method predicts disparity from stereo data. Due to the page limit, only the left RGB images and the corresponding disparity maps are shown, as is commonly done in the literature.

Statistical tests: Paired t-tests on the EPE and D3 metrics, with alpha equal to 0.05, verified that there is a significant difference between StereoDiffusion and each of the baselines.
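
For readers unfamiliar with the test, a paired t-test on per-sample errors can be sketched as follows. The function name `paired_t_statistic` and the error values are hypothetical placeholders chosen for illustration; they are not results from the paper, and the authors' exact procedure is not described here.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired per-sample errors.

    The statistic is computed on the per-sample differences a - b.
    Assumes the differences are not all identical (non-zero spread).
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up EPE values for the same four frames under two methods.
baseline_epe = [2.0, 3.0, 4.0, 5.0]
refined_epe = [1.0, 1.0, 3.0, 3.0]
t_value, dof = paired_t_statistic(baseline_epe, refined_epe)

# Two-sided critical value for alpha = 0.05 with 3 degrees of freedom.
T_CRITICAL_DOF3 = 3.182
significant = abs(t_value) > T_CRITICAL_DOF3
```

The critical value is hard-coded for this toy sample size; with real data one would look it up (or compute a p-value) for the actual number of paired samples.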

Writing and results discussion: They will be improved and strengthened in the revised paper.

Limitations:

The limitation is the inference time (~5fps) which could be improved in the future using faster denoising steps.

R3

Conditional GANs (cGANs): The diffusion model has multiple significant advantages over cGANs, including training stability, higher generated image quality, less frequent training collapse, and higher robustness and generalizability. Our previous experiments showed that cGAN-based models for depth estimation perform poorly on medical datasets, so we didn’t include them in our comparison.

Ablation Study: Please refer to R1Q1.

Regression-Based Methods: We have already tested SOTA regression-based methods like Marigold [6]. This method cannot generate accurate depth maps on medical datasets like SCARED and STIR. Considering the significant depth error, we didn’t include it in our comparison. The low performance is due to the huge domain gap between natural and medical scenes, and the low generalisability of the existing regression-based methods. This is also the motivation and advantage of our proposed integration of the diffusion model into stereo depth estimation.

Temporal Consistency Metrics:

Currently, there are no medical datasets with both GT optical flow and depth information which could be used to quantitatively evaluate temporal consistency. SCARED only provides GT depth for sparse keyframes and STIR provides only sparse feature correspondences.

R4

Novelty: Our contributions are: 1. We integrate for the first time a latent diffusion model into a novel stereo depth estimation pipeline for disparity refinement. 2. Our method advances existing depth estimation methods based on diffusion models, as it does not treat the diffusion model as a regression model (using RGB to predict depth as in Marigold [6]) but uses prior knowledge to refine disparity by using RGB and disparity together to predict disparity. 3. A tailored data normalization and denormalization process has been designed to make the disparity prediction invariant to different camera set-ups. Our validation verifies that StereoDiffusion has superior performance on medical data although it is trained on natural scenes only.

R5

Novelty: Please refer to R3Q3 & R4, which explain the difference between our method and the work in [6], adapted from [10].

Inference: The use of fewer denoising steps (10) and of an efficient denoising scheduler enables StereoDiffusion to run faster at inference (5fps) than other diffusion models (such as [14] with 0.1fps) while generating high-quality images.

Disparity/depth conversion: Given the intrinsic and extrinsic camera parameters, the standard function which relates disparity and depth is:

depth = (baseline * focal length) / disparity
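
As a worked instance of this relation, the sketch below converts a single disparity value to depth. The function name `disparity_to_depth` and the rig parameters (1000 px focal length, 4 mm baseline) are hypothetical values chosen for illustration, and a rectified stereo pair is assumed.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_mm):
    """Convert a disparity in pixels to depth in the units of the baseline.

    Implements depth = (baseline * focal length) / disparity for a
    rectified stereo pair. Non-positive disparities are treated as
    points at infinity (an illustrative convention).
    """
    if disparity_px <= 0:
        return float("inf")
    return baseline_mm * focal_length_px / disparity_px


# Hypothetical rig: 1000 px focal length, 4 mm baseline, 40 px disparity.
depth_mm = disparity_to_depth(disparity_px=40.0, focal_length_px=1000.0, baseline_mm=4.0)
print(depth_mm)  # 100.0
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error for distant points than for near ones.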

Ablation Study: Please refer to R1Q1




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed comments from the reviewers. While some comments were justified by the authors, the quality and reproducibility of this paper seems somewhat a concern. Suggest making the code available.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The paper received mixed comments from the reviewers. While some comments were justified by the authors, the quality and reproducibility of this paper seems somewhat a concern. Suggest making the code available.


