Abstract
3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is often the case in real-world clinical scenarios. To tackle this sparsity challenge, we propose a framework leveraging the prior knowledge from multiple foundation models during the reconstruction process. Experimental results indicate that our proposed strategy significantly improves the geometric and appearance quality under challenging sparse-view conditions, including using only three views. In rigorous benchmarking experiments against the state-of-the-art methods, EndoSparse achieves superior results in terms of accurate geometry, realistic appearance, and rendering efficiency, confirming the robustness to the sparse-view limitation in endoscopic reconstruction. EndoSparse signifies a steady step towards the practical deployment of neural 3D reconstruction in real-world clinical scenarios. Project page: https://endo-sparse.github.io/.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0791_paper.pdf
SharedIt Link: pending
SpringerLink (DOI): pending
Supplementary Material: N/A
Link to the Code Repository
https://endo-sparse.github.io/
https://github.com/CUHK-AIM-Group/EndoSparse
Link to the Dataset(s)
N/A
BibTex
@InProceedings{Li_EndoSparse_MICCAI2024,
author = { Li, Chenxin and Feng, Brandon Y. and Liu, Yifan and Liu, Hengyu and Wang, Cheng and Yu, Weihao and Yuan, Yixuan},
title = { { EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting } },
booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
year = {2024},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15006},
month = {October},
page = {pending}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper incorporates priors from vision foundation models to regularize the ill-posed problem of deformable 3D reconstruction from sparse view points.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- the paper uses a novel approach to incorporate appearance priors to regularize the 3D reconstruction.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
(1) the paper is based on the assumption that appearance priors can regularize deformable reconstruction from sparse view points. However, in section 2.2 it is not clear how the regularization actually works. Is the SDS applied to images rendered from virtual cameras (camera pose that is not in the training samples), or from not observed time points (tissue is deforming)? Additionally, it is not clear how the model is able to reliably estimate the tissue deformation over the whole sequence if only 3 frames are used for training and how the appearance prior could possibly help in this scenario. (2) geometrical priors are not novel. They have been used in other works either as depth from stereo (Wang, EndoNerf, MICCAI2022), or depth from monocular images (Y. Huang, Endo-4DGS: Endoscopic Monocular Scene Reconstruction with 4D Gaussian Splatting, Arxiv preprint 2024). (3) in the experimental section it is not clear how train and test images are sampled. Are training samples spread over the whole sequence and all the rest is testing?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
- As stated in the main weaknesses, the sampling of train and test data is not clear.
- it is not clear if the baselines use monocular or stereo images for model fitting.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
- the ablation study could be expanded to show performance results from 3 frames up to the whole sequence to show up to which point it is useful to add the priors.
- the paper states the rendering FPS but it would be interesting to compare also fitting times.
- eq. 1 is not explained or referenced. Also there is no normalization term in the original 3D-GS paper.
- [27] does not use 4D voxels but hex-planes
- \hat{C}_t -> probably a typo in section 2.2
- \sigma_1 in figure 3a is not defined
- it would be convincing to provide videos of the rendered sequences
- it would be interesting to provide details on the evaluation for each sequence.
- It would be interesting to have more insight into the method: in which cases the method helps and where it fails (and why)
- using only 3 frames to fit a model showcases how the method is able to reconstruct from sparse views. But is this a realistic scenario? It would be interesting to run the method on a real surgery and exclude the frames that are blurred, occluded etc. This would make the motivation of the paper much stronger.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Reject — could be rejected, dependent on rebuttal (3)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I selected weak reject because the key elements of the method are not clear. Namely, how images are sampled for train/test and how the appearance priors are used. see main weaknesses (1) and (3)
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Reject — should be rejected, independent of rebuttal (2)
- [Post rebuttal] Please justify your decision
I reject the paper because my major concern regarding the sampling of test and train images remains. Although it has been addressed in the rebuttal, it is still not clear to me how samples are selected. Given this and the fact that the code will not be published, it is not possible to reproduce the results.
Further, the authors promise to “thoroughly modify revision for the entire text” which seems not reasonable as there should only be minor changes after acceptance.
ranking 2 out of 2
Review #2
- Please describe the contribution of the paper
1) The paper applies the Gaussian splatting method to endoscopic reconstruction and reaches the state-of-the-art level. 2) In endoscopic videos, there are varying lighting conditions, a significant amount of noise, and shaking. Noise is introduced into endoscopic images and a diffusion model is used to obtain clean images. 3) A prior RGB2Depth model is used to obtain depth maps, to address situations with limited perspectives and to improve the coherence of the learning set.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The latest vision-based 3D reconstruction methods are applied to endoscopic reconstruction, pointing out the problems that exist when using endoscopic images for reconstruction compared to natural scene images, such as unstable equipment, varying lighting and noise, and low-quality images. The authors proposed two strategies to address these issues. 1. Using prior knowledge from pre-trained 2D generative models to improve the quality of visual reconstruction, achieving state-of-the-art reconstruction results. 2. Solving the problem of coherent learning of geometric capabilities through pre-generated depth maps. The authors’ solutions are very innovative and effective
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1) The author should pay attention to some details in the expressions of this paper. In “Deformable Scene Reconstruction”, “[27]” should not be the subject. In “Implementation Details”, in “we utilize an Adam”, the “W” should be capitalized. 2) I hope the author can add more explanations. Authors can try to explain the experimental phenomenon in figure 3(a), such as why, on the geometry-quality metrics (δ and TV), Full EndoSparse is worse than Geo prior. In implementation details, why is it necessary to use the warm-up strategy? I believe authors can explain the time consumption required for training the model.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
I hope the author can share the source-code related to the project to enhance the reproducibility of the paper.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
Adding user studies involving ratings and evaluations from clinical physicians can make the research more comprehensive and convincing.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Accept — could be accepted, dependent on rebuttal (4)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The author introduced 3D GS into the field of endoscopic 3D reconstruction, analyzed the challenges of applying 3D GS to endoscopic video reconstruction, and proposed two methods to enhance the performance of geometry and texture reconstruction, achieving state-of-the-art results. The English writing is good, and the problem analysis is specific and comprehensive. However, the author’s performance in some experimental and article details is not rigorous enough.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #3
- Please describe the contribution of the paper
The paper describes a method to render novel views of endoscopic scenes from only a few observations. It optimizes a Gaussian splatting technique by incorporating prior knowledge of geometry by utilizing the previous work Depth-Anything and appearance by Stable Diffusion. As opposed to some of the compared techniques, their implementation can run in real time on a current GPU and yields superior visual and geometric quality in EndoNeRF-D and SCARED. Moreover, the paper gives an extensive overview of several related works.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is very well written, and reads extremely well. The paper also describes many fundamentals, so that even a novice in the field can easily follow the explanations. The literature review lists an impressive number of very recent related papers and cites them for their key contributions, creating effectively a categorization of these papers. The mathematical formulations and details about the three major components (Gaussian Splatting, diffusion prior and geometric prior) are very clear and easy to follow. The experiment seems well designed, uses the common choice of metrics and the results are impressively good, clearly beating the other compared methods. The ablation studies nicely show the effects of how the individual parts of the method contribute to the quality of the overall method.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
My main ‘complaint’ is editorial in nature. I find the references of several papers somewhat confusing. The paper demonstrates comprehensive engagement with relevant literature, which is commendable. However, to enhance clarity and readability, I recommend refining the citation style used in the introduction and throughout the document. The current approach of citing multiple sources frequently and sometimes inconsistently throughout sentences could be streamlined for better reader comprehension. Therefore, aim for a consistent placement of references, preferably at the end of sentences or statements where they support or justify the claim made. This practice helps maintain the flow of the text and makes it easier for readers to follow the narrative without disruption. Furthermore, contributions should be clarified: When multiple papers are cited together to support a single claim, consider briefly mentioning the specific contribution for each cited work. This helps in delineating the landscape of existing research. Consider reducing the frequency of citations where a single, well-chosen reference would suffice. Over-citation can overwhelm readers and may dilute the impact of key points.
A minor weakness may be that the authors use a somewhat ‘sensational’ or very enthusiastic language describing their own method and have nothing ‘bad’ to say about their method. Right now, I can only believe their claims, as I have no way of verifying them. However, I believe, that no method is flawless. Therefore I would very much appreciate, if the authors could introduce a short section talking about some limitations of their method. This could include scenarios where the method underperforms compared to other techniques, instances where it may require significantly more memory, or situations where it imposes strict constraints on the training data. Highlighting these limitations will enhance the credibility of the paper.
Other than that I provide a number of very minor editorial suggestions in the main comment section.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
Right now there seems to be no mention of making the source code available. I hope that the authors plan to release it. While I believe that the details in the paper are enough to recreate the system, it would at least be a major engineering task.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
Disclaimer: This is not my expert area; therefore, some of my comments regarding the content may be incorrect. However, I focus on several editorial flaws that, when addressed, would improve the paper.
Introduction:
- “surgical training [32] for medical professionals [24,32].” - Referencing reference 32 twice in the same sentence is unnecessary here.
- “This advancement could further minimize the need for invasive follow-up procedures” - This sounds very general and it’s unclear what the authors mean.
- “EndoNeRF [24] and its follow-up works [24,32,30]” - how can EndoNeRF be its own follow-up work?
- “rendering cost since the NeRF approach requires querying such neural radiance fields multiple times for a single pixel, limiting the applicable usage in intraoperative applications” - the multiple querying is not the problem. I assume that the resulting performance hit might be the problem.
- “necessitating eliminating a significant number of low-quality views [4].” - reference 4 seems odd to me. It does not seem to back the claim in the text, is quite old, and seems to focus rather on human perception. Either I misunderstood, or this is a weird choice for this claim.
- “this paper presents the first investigation into the medical scene reconstruction under sparse-view settings.” - this is a very bold claim and without context does not seem correct. It may be the first in Gaussian splatting in endoscopy, but the authors should formulate this claim more focused and less sensational and general.
- “Visual Foundation Models (VFMs) [6,21,31]” and other examples, which are backed up by multiple references at once. It is unclear which claim is backed by which reference. Citing many references for one claim is confusing and tiring to the reader. If some of the related works are highlighted as ‘impressive’ (e.g. “Our insight is inspired by the impressive results…”) the authors can consider stating the authors explicitly like “Author et al. [1] showed in the impressive work that…” or stating the name of the system.
- “In this paper, we introduce EndoSparse,” sounds like the beginning of a new paragraph to me.
- The highlighting in the Introduction seems a bit arbitrary to me. Some highlights are made using bold text, some underlining, and some italics. This should be consolidated.
- I’m somewhat confused by Fig. 1. Should the output of the boxes on the right also affect the left box? there seems to be no final output?
Method
- “Overall, the proposed EndoSparse is robust against degraded reconstruction quality due to only having sparse observations” - due to? Grammatically, this sounds like the reconstruction quality is robust because it only has sparse observations. This is not what the authors mean?
- The abbreviation “MLP” has not been introduced before.
- “Building upon 3D-GS, [27] introduces a deformation” - I think reference numbers should not be used like words. Consider naming the authors, or the name of their technique here.
- “declines as the number of available viewpoints decreases” - I don’t think emphasis makes sense here.
Instilling Diffusion Prior for Plausible Appearance
- “In essence” can be removed for conciseness.
- “we use that predicted clean” - I think it should be ‘this’ instead of ‘that’.
Overall Optimization
- “geometrical loss”. The authors use both ‘geometrical’ and ‘geometric’ several times in the text. Are they interchangeable?
Experiment Settings
- the abbreviations PSNR, SSIM, and LPIPS were not introduced.
- “Following [27,16], we adopt” - again, I discourage using reference numbers as words in the text.
- “methodology in to model” - superfluous ‘in’
- “the coefficients of photometric loss term” - missing ‘the’?
- “Following [27,16], we adopt a” - this exact string appears twice shortly one after the other. Avoid repetition.
- “which initially optimize Canonical Gaussians” - should be “optimizes”. Also, why is ‘Canonical’ capitalized?
- “on a RTX 4090 GPU” - should be ‘an’. But ideally should be “an Nvidia RTX 4090”. Could also possibly list VRAM size etc.
- Fig. 2: “EndoGauassian” - spelling mistake.
- Fig. 3: “Plain Spare-view” - spelling mistake.
- Fig. 3: The arrows in the figure indicate in which direction the values of each metric are better. Are the directions of the arrows correct? They do not seem consistent.
Comparison with State-of-the-arts
- “state-of-the-arts” should be singular.
- “NeRF representation [24,32,30]” - should be “representations” and the references can be omitted, as they already appear in the sentence before.
- “recovers a accurate” - “an accurate”
- “proffers “ - “offers”?
References
- references 3,7 and 27 are missing the year.
- regarding reference 15: I’m not sure if a sparse CT reconstruction paper is applicable here.
- regarding reference 17: MICCAI should be spelled out like in [2]
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Accept — should be accepted, independent of rebuttal (5)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I enjoyed reading the paper quite a bit, even though it is outside my field. I think the literature review is quite extensive and seems to give a good overview. The description of the method is very clear and the experiments and results seem to be very good. If the authors address the minor issues I listed, I think this paper is excellent. Maybe adding some honest limitations, as mentioned above, could be a nice addition, as right now the paper seems one-sidedly positive.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Accept — should be accepted, independent of rebuttal (5)
- [Post rebuttal] Please justify your decision
The other reviewer’s comments and the rebuttal do not substantially change my opinion of the paper. As my concerns about the paper were mostly editorial in nature, and the authors seem to be happy to address several of the reviewer’s comments in the final version, I maintain my vote to ‘accept’.
Rank 1 of 3
Author Feedback
First, we express our gratitude to all reviewers for their valuable feedback and the general consensus that the work is innovative & effective, reads extremely well (R#3). Now, we address the specific comments.
[R#1, Q1] Further Details. [(I) In Fig 3(a), why Full model worse than Geo prior in Geometry quality?] This could be due to the added diffusion prior having a slight negative impact on geometric quality. Nevertheless, the complete model outperforms the +Geo-Prior variant in terms of SSIM and visual metrics (e.g., PSNR+1.11, SSIM+0.035), indicating a significant overall gain. [(II) Why warm-up?] We adopt warm-up in line with common settings in [27,16]. Initially, we optimize only the static 3D representation and then we train the deformation of static Gaussian splatting across different time frames. [(III) Training time consumption] The average training time is about 2.0 minutes.
[R#3, Q1] Suggestions for Writing. We greatly appreciate your suggestions! In revision, we will carefully revise literature, including standardizing the format of literature. Also, we will thoroughly modify revision for the entire text.
[R#3, Q2] Limitation? Although promising, our method needs precise camera poses. Future work could explore unconditioned scenarios with sparse, inaccurate, or unavailable camera poses, leveraging recent progress in unconditioned natural image reconstruction (Wang S, DUSt3R, Arxiv 2024).
[R#4, Q1] Is SDS applied to images rendered from virtual cameras, or unobserved time points? For ENDONERF dataset with fixed viewpoint and different times, we apply SDS at unobserved time points due to missing frames. For SCARED dataset with static scenes from different viewpoints, we apply SDS at unobserved viewpoints due to missing viewpoints. We will clarify it in revision.
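For readers unfamiliar with the SDS term discussed in this exchange, the following is a sketch of the standard score distillation sampling gradient as it is usually written in the diffusion-prior literature. It is not taken from the paper, whose exact formulation may differ, and all symbols are assumptions: \theta denotes the scene parameters (here, the Gaussians), x the image rendered at the unobserved time point or viewpoint, \hat{\epsilon}_\phi the pretrained denoiser with optional conditioning y, and w(t) a timestep weighting.

```latex
% Standard SDS gradient (assumed form; the paper's variant may differ).
% x = g(\theta) is the rendered image, z_t its noised version at timestep t.
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(z_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
z_t = \alpha_t\, x + \sigma_t\, \epsilon,
\quad \epsilon \sim \mathcal{N}(0, I).
```

Under this reading, the rebuttal's answer amounts to choosing where x is rendered: at held-out time points for the fixed-viewpoint data and at held-out camera poses for the static data.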
[R#4, Q2] How does the model estimate tissue deformation given 3 training views? How does the appearance prior help? Please note that we have selected three training viewpoints at equal intervals covering all sequences. Given few training views, the renderings from the reconstructed scene tend to contain some distortions. Intuitively, the diffusion prior learned from large-scale datasets will correct them to be more plausible and consistent with the distribution of real images.
[R#4, Q3] Novelty compared to (EndoNerf, MICCAI’22) & (Endo-4DGS, Arxiv’24)? We respectfully argue that our work is novel in its use of geometric prior, as it differs from these works in the problems addressed and the priors used: (I) While both EndoNerf and Endo-4DGS utilize estimated depth maps to supervise NeRF training across all views, our approach refines the geometric quality of rendering from sparsely supervised views. We dynamically update the depth on newly rendered images, providing a feedback supervision signal. (II) Different Geometric Priors: Compared to the ad-hoc endoscope depth estimator used in EndoNerf, our geometric prior is learned from a large dataset using Visual Foundation Models. It’s also noteworthy that Endo-4DGS is a concurrent work on arXiv, and “arXiv papers are not considered as prior work” based on MICCAI review regulations. We will include more acknowledgement & discussion of Endo-4DGS in revision.
[R#4, Q4] Split of Training & Testing Images? We follow the common train-test split setting in EndoSuRF that splits training and testing images in a 7:1 ratio, with the test images evenly sampled. When the sparse-view setting is imposed, we evenly sample k views from the training views and discard the remaining.
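One possible reading of the split described in [R#4, Q4] is sketched below in Python. This is an illustration under stated assumptions, not the authors' code: the function names `split_train_test` and `sample_sparse_views` are hypothetical, and holding out every 8th frame is only one way to obtain a 7:1 ratio with evenly sampled test images. The sketch operates on frame indices of a single sequence, so it does not settle whether splits are made within or across videos.

```python
def split_train_test(num_frames, test_every=8):
    """Hold out every `test_every`-th frame for testing (assumed convention).

    With test_every=8 this gives a 7:1 train/test ratio with evenly
    spaced test frames.
    """
    test = [i for i in range(num_frames) if i % test_every == test_every - 1]
    train = [i for i in range(num_frames) if i % test_every != test_every - 1]
    return train, test


def sample_sparse_views(train_ids, k):
    """Evenly pick k training frames (first and last included); drop the rest."""
    n = len(train_ids)
    if k >= n:
        return list(train_ids)
    if k == 1:
        return [train_ids[n // 2]]
    # Integer arithmetic keeps the chosen positions deterministic.
    return [train_ids[(i * (n - 1)) // (k - 1)] for i in range(k)]


if __name__ == "__main__":
    train, test = split_train_test(16)
    print("test frames:", test)
    print("3 sparse training views:", sample_sparse_views(train, 3))
```

In this sketch the sparse views are always a subset of the training frames, so they never overlap with the held-out test frames of the same sequence.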
[R#4, Q5] Is the model fitted using monocular or stereo images? The input for training the model is oriented towards monocular images.
[R#4, Q6] Ablation studies from 3 frames to the entire sequence. We have shown the results under different k-view settings in Fig 3(b). As expected, more views correlate with better visual and geometrical output quality.
Meta-Review
Meta-review #1
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This submission received mixed reviews, where R1 recommended weak accept, R3 recommended accept, and R4 recommended reject. The main concerns from R4 were lack of method clarity (persisting as issues after the rebuttal), weak ablation studies, lack of failure mode analysis, and some other key aspects that I also agree with. In particular:
R4: “Using only 3 frames to fit a model showcases how the method is able to reconstruct from sparse views. But is this a realistic scenario? It would be interesting to run the method on a real surgery and exclude the frames that are blurred, occluded etc. This would make the motivation of the paper much stronger.”
And R4’s concern about training/test splitting, for which the response was not clear in the rebuttal: “[R#4, Q4] Split of Training & Testing Images? We follow the common train-test split setting in EndoSuRF that splits training and testing images in a 7:1 ratio, with the test images are evenly sampled. When the sparse-view setting is imposed, we evenly sample k views from the training views and discard the remaining.” It is not clear if testing and training images were separated across videos, or subjects. Randomized splitting, which mixes images from the same videos for testing and training, is not good practice. This aspect must be clarified.
I also have an important concern about validation, which is the lack of 3D reconstruction accuracy assessment. The presented metrics are standard visual quality metrics (PSNR, SSIM, and LPIPS) which, while useful, do not clearly assess reconstruction accuracy due to depth ambiguities in deformable environments. However, I will note that the reviewers did not raise this objection (yet - it should have been raised, in my opinion - especially because there exist public datasets that have ground-truth depthmaps that can be used for validation, including those used in this work where depthmaps have been used for supervision). While I feel this aspect of validation is weak, I am inclined to accept the work, since it is offset by several technical novelties to achieve reconstruction with sparse image sets, which appear to be effective in terms of the visual quality metrics.
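The distinction drawn here between visual quality and reconstruction accuracy can be illustrated with a minimal Python sketch. PSNR is computed with its standard definition and compares pixel intensities only. The `depth_rmse` helper is a hypothetical example of the kind of ground-truth depth comparison the meta-review asks for, not a metric reported in the paper, and treating a depth of 0 as "no ground truth" is an assumption.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images given as flat
    sequences of pixel intensities in [0, max_value]."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def depth_rmse(pred_depth, gt_depth):
    """Hypothetical geometry check: RMSE between predicted and ground-truth
    depth over pixels with valid ground truth (assumed here: gt > 0)."""
    pairs = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depth pixels")
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


if __name__ == "__main__":
    print(psnr([0.0, 0.0], [0.1, 0.1]))               # image similarity only
    print(depth_rmse([1.0, 2.0, 5.0], [1.0, 4.0, 0.0]))  # geometric error
```

A rendering can score well on the first function while the underlying depth is wrong, which is why a depth-based measure on datasets with ground-truth depth maps would complement the image metrics.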
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).