Abstract

The application of ultrahigh-definition endoscopy systems in minimally invasive surgery has become increasingly widespread. However, their high resolution comes with a reduced depth of field (DOF), making it difficult to achieve clear imaging across the entire frame. Rather than modifying the optical structure, we address this issue with a deep learning-based multi-focus image fusion (MFIF) approach. Traditional MFIF methods are less effective in endoscopic scenarios because they are not designed to extract information from complex organ structures. To address these limitations, this work proposes a two-streamed cascaded encoder-decoder network that incorporates multi-scale feature extraction and fusion mechanisms validated in medical image segmentation. The network includes a novel multi-scale fusion module with cross-axial attention that hierarchically integrates features using attention-guided weights and hybrid operations, effectively preserving intra-domain textures while modeling cross-domain dependencies. The framework is rigorously validated on newly collected real-world endoscopic datasets acquired on an experimental imaging platform. The experimental results demonstrate that the proposed method outperforms traditional approaches in benchmark tests. Our code will be made publicly available in our final version.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3862_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/luoyu5023/CTMFusion

Link to the Dataset(s)

N/A

BibTex

@InProceedings{DenXia_Endoscopic_MICCAI2025,
        author = {Deng, Xiang and Liu, Xing and Xu, Tian and Liu, Xiaoyue and Gan, Tianyuan and Lu, Chen and Zhou, Congcong and Wang, Peng and Lei, Yong and Ye, Xuesong},
        title = {{Endoscopic Depth-of-Field Expansion via Cascaded Network with Two-streamed Multi-scale Fusion}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {141--150}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a multi-focus endoscopic image fusion network built on a two-streamed cascaded encoder-decoder architecture. The cascaded network preserves intra-domain textures using multi-scale fusion and the decision-map generation paradigm, while a feature fusion module models cross-domain dependencies with a hierarchical hybrid fusion strategy and a cross-axial attention mechanism. Superior or comparable results are achieved on the newly collected real-world endoscopic datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) This paper focuses on the Multi-Focus Image Fusion (MFIF) problem in real endoscopic scenarios and proposes a two-stream cascaded encoder-decoder network to facilitate a deeper exploration of multi-focus image interdependencies. 2) The authors construct a simulated abdominal cavity environment using fresh ex vivo porcine organs and propose a new real-world endoscopic dataset, EVP-MFI. 3) Superior or comparable results are achieved on the newly collected real-world endoscopic datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The main focus of the paper is unclear. The title refers to “Endoscopic Depth-of-Field Expansion”, while the primary task appears to be multi-focus image fusion (MFIF). Related statements in the paper are also inconsistent, such as “a high-resolution DOF expansion method … which effectively fuses multi-focus endoscopic images” and “achieving DOF expansion through the generation of decision maps and MFIF.” 2) In the discussion of previous studies, the authors point out a main drawback as “allocates computational resources disproportionately to the feature extraction module”, but the proposed method includes little discussion of computational resources. 3) In the first paragraph of the Method section, the description does not match Fig. 1. For example, “RGB inputs are first converted into the YCbCr color space,” and “the Y (luminance) channel is employed as the input of the fusion model” find no correspondence in Fig. 1. 4) The definitions of L_{int}, L_{text}, and L_{ssim} in Equation 1 are not provided. 5) There may be some misunderstanding regarding Fig. 3. Do “blue insets” refer to “difference maps representing the difference between the near-focus image and the fused results”? If so, why does this prove that the method “preserves intricate textural details from far-focused regions”? 6) Many inferences are not common sense but lack supporting literature or evidence. For example:

    • In the third paragraph of Section 2.1, why can direct mask operations ensure spatial continuity of the in-focus region to eliminate subjective bias?
    • In the second paragraph of Section 2.2, why can channel concatenation in shallow layers enhance local feature representation, while max fusion in deeper layers enhances salient features and suppresses blur artifacts? What exactly are channel concatenation and max fusion?

    7) Minor: Although the dataset is real-world, it is still simulated on porcine organs. The relationship to and differences from real clinical scenarios are not analyzed in the paper. 8) Minor: At the beginning of the third paragraph of the Introduction, it is unclear what “these methods” refers to.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper does not provide clear technical descriptions, despite the statement that the code will be released. The motivation and overall logic are unclear, and the review of existing literature is insufficient. Therefore, this reviewer recommends rejection.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thanks for the detailed rebuttal. Most of my concerns are addressed. Still, I am not convinced by the responses to Q2, Q5, and Q6. For Q2, better results achieved by the early fusion design do not necessarily demonstrate that this is due to balanced utilization. For Q5, a texture difference between the near-focus and fused image does not indicate correspondence with real textures; it would be more convincing if the far-focus and fused image matched in this regard. For Q6, better results than DRPL [6] do not necessarily imply that this is due to direct mask operations. Therefore, I keep my recommendation of reject. Considering its dataset and method contributions, I will not oppose or defend if the AC and the other two reviewers recommend accept. I just hope the authors can make these questions clear.



Review #2

  • Please describe the contribution of the paper

    This study focuses on addressing the depth-of-field (DOF) limitation in endoscopic imaging systems. The key contributions are summarized as follows:

    1. A novel network architecture incorporating multi-scale feature fusion modules specifically designed to mitigate DOF-related image quality degradation in endoscopic systems.
    2. The establishment of the EVP-MFI test dataset, constructed using real-world clinical endoscopic data (from animals), providing a standardized benchmark framework for future research in this field.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The focus of the study is practical and corresponds to the scope of the conference.
    2. The training and testing comparisons are abundant.
    3. The establishment of the EVP-MFI test dataset provides a new standardized benchmark.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The methodological innovation is not strong. Cascaded encoder-decoder and two-stream designs are typical, and the fusion module is an assembly of several classical components.
    2. There is no comparison of parameter counts.
    3. As for the significance of solving the DOF problem in clinical applications, no comparison or evidence indicates the models' real-time performance (inference time) or system integration.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Are pretrained weights derived from training on datasets outside the scope of this study, such as ImageNet, utilized in the cross-Swin-Transformer?
    2. What are E_A and E_B?
    3. What is the Stem module? I do not see any description of its structure or calculation method.
    4. Could the authors specify the equation for the loss function? A more detailed formula is required.
    5. Testing on separate static images cannot prove feasibility in a real clinical environment. In practice, camera movement, moisture, and uneven illumination also challenge the efficiency of deep learning algorithms. Beyond these factors, the crucial question is whether the method in this research can work well under real-time conditions.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The study exhibits practicality, potential, and significance for clinical practice. However, the study remains incomplete: there is no evidence of its integration into a medical system with respect to speed and portability, and the model offers limited novelty. These concerns weaken the methodological innovation and the engineering significance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed all my concerns.



Review #3

  • Please describe the contribution of the paper

    The paper presents a method to fuse near-focus and far-focus images to expand the depth of field with minimal blur artifacts and good preservation of details and textures. To do so, they propose a two-streamed cascaded encoder-decoder network with multi-scale feature extraction and fusion mechanisms.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Well written
    • Code will be released
    • Extensive comparison with other methods proving that the proposed method outperforms existing methods in real-world endoscopic sequences
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • If I understood correctly, Fig. 3 should show in the difference image the parts that have more detail; is that correct? Could you give more details?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written, and the results show an improvement over previous decision map-based models, with better preservation of more comprehensive feature information, which could be useful for other image processing algorithms.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have answered all the reviewers' questions.




Author Feedback

We thank the reviewers (R2, R3, R4) for their positive comments and key suggestions, which help us improve our work. Below we address the major concerns.

The main focus (R3Q1): Our ultimate goal is to extend DOF in high-resolution endoscopy. Our technical approach is multi-focus image fusion (MFIF), which effectively fuses multi-focus endoscopic images without altering optical structures. Generating decision maps is a common method in MFIF.

The novelty of the proposed method (R2Q1): Unlike prior work focused primarily on general domains, we are the first to extend MFIF to endoscopic applications, design a specialized model for medical scenarios, and establish the first real-world medical dataset, EVP-MFI, for the MFIF setting.

Loss function (R2&R3Q4): Following common practice in prior works [13,23], we have:

L_{ssim} = 0.5 (1 - SSIM(F, A)) + 0.5 (1 - SSIM(F, B)),
L_{text} = (1/HW) || |∇F| - max(|∇A|, |∇B|) ||_1,
L_{int} = (1/HW) || F - mean(A, B) ||_1,

where F is the fused image, A/B is the near/far-focus image, ||·||_1 denotes the l1 norm, and ∇ is the Sobel gradient operator. We did not include these definitions due to length limits and will add them in our final paper (a minimal code sketch is also provided below).

Explanation of Fig. 3 (R3Q5&R4): “Blue insets” refer to the “difference maps between the near-focus and the fused image” in the caption. The near-focus image shows clearer details in the lower heatmap region, whereas the far-focus image excels in the upper region. The difference maps should therefore highlight far-focus details, manifesting as obvious texture details in the upper region. Our method exhibits this expected behavior.

Real-time working (R2Q2-3): Prior work prioritized algorithmic accuracy over speed. To address real-time surgical needs, we designed a more lightweight model architecture to reduce computational cost, including a hybrid fusion strategy and a small number of layers and dimensions in the Cross-Swin component. The adaptor shown in Fig. 1 can achieve stable real-time video output by deploying the trained model on an Nvidia Jetson Orin. Due to length limits, this work paid more attention to improving fusion precision. We will validate performance under moisture and uneven illumination conditions in future work.

R3Q2: We mentioned this main drawback to emphasize the benefit of balanced utilization of the extraction/fusion stages in cascaded networks. The early fusion design achieved better results in the ablation study (Row 4 in Table 3) and in CEDNet [16]. We will clarify this ambiguous expression.

R3Q3: We will revise Fig. 1 to clearly illustrate this step.

Supporting evidence (R3Q6): First, related work [21] mentions that the distinction between sharpness and blur is inherently continuous and probabilistic, rather than a subjective binary (0/1) segmentation. This approach avoids the consistency verification that DRPL [6] requires after fusion and achieves better results in Table 2. Second, channel concatenation means ‘torch.cat’ and max fusion means ‘torch.max’ in PyTorch, applied to feature fusion (see the component sketch at the end of this rebuttal). SwinFusion [13] directly fuses the two outputs after Cross-Swin by concatenation. Element-wise max operations at the low-level stage suppress low-confidence artifacts arising from error accumulation as the network deepens. This hybrid strategy demonstrates its effectiveness in Row 6 of Table 3 and has been widely used in prior research, such as “Low-Light Image Enhancement Network Based on Multi-Scale Feature Complementation” (AAAI 2023).
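
To make the definitions concrete, here is a minimal PyTorch sketch of the three loss terms. This is an illustrative sketch, not our released implementation: the `ssim` helper is assumed to come from the third-party pytorch_msssim package, input tensors are single-channel Y images of shape (N, 1, H, W) in [0, 1], and the per-term weights are placeholders rather than the values used in Equation 1.

```python
# Hypothetical sketch of the loss terms defined in the rebuttal above.
import torch
import torch.nn.functional as fn   # aliased to avoid clashing with F, the fused image
from pytorch_msssim import ssim    # assumed third-party SSIM implementation


def sobel_grad(x: torch.Tensor) -> torch.Tensor:
    """Sobel gradient magnitude |∇x|, approximated as |g_x| + |g_y|."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      dtype=x.dtype, device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return fn.conv2d(x, kx, padding=1).abs() + fn.conv2d(x, ky, padding=1).abs()


def fusion_loss(fused, near, far, w=(1.0, 1.0, 1.0)):
    """Total loss; the weights w are illustrative placeholders."""
    # L_ssim = 0.5 (1 - SSIM(F, A)) + 0.5 (1 - SSIM(F, B))
    l_ssim = 0.5 * (1 - ssim(fused, near, data_range=1.0)) \
           + 0.5 * (1 - ssim(fused, far, data_range=1.0))
    # L_text = (1/HW) || |∇F| - max(|∇A|, |∇B|) ||_1
    # (mean-reduced l1_loss divides by N*C*H*W, i.e. HW for one Y image)
    l_text = fn.l1_loss(sobel_grad(fused),
                        torch.maximum(sobel_grad(near), sobel_grad(far)))
    # L_int = (1/HW) || F - mean(A, B) ||_1
    l_int = fn.l1_loss(fused, (near + far) / 2)
    return w[0] * l_ssim + w[1] * l_text + w[2] * l_int
```
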
R3Q7: There are no clinically approved multi-focus endoscopes at present, and laparoscopic image processing research has widely used porcine organs due to their abdominal anatomical similarity to humans [3].

R3Q8: “These methods” refers to traditional MFIF methods without deep learning. We will correct the confusion.

R2 Optional: Our method and the baselines do not use pretrained weights from ImageNet. Our dataset settings follow Table 1. E_A/E_B denote the near/far-focus encoder outputs at resolutions H/4 × W/4 × C1, H/8 × W/8 × C2, and H/16 × W/16 × C3. The Stem consists of two 3×3 convolutional layers, as mentioned in the final paragraph on page 3 (a minimal sketch of these components follows).
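
Similarly, the following is a minimal sketch of the Stem and the two fusion operations clarified above. Channel widths, activations, and the example shapes are illustrative assumptions, not the exact configuration of our model.

```python
# Hypothetical sketch: a Stem of two 3x3 convolutions, channel
# concatenation (torch.cat) at the shallow fusion stage, and
# element-wise max (torch.max) at the deeper stage.
import torch
import torch.nn as nn


class Stem(nn.Module):
    """Two 3x3 convolutional layers, per the rebuttal; ReLUs are assumed."""
    def __init__(self, in_ch: int = 1, out_ch: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)


def fuse_shallow(e_a: torch.Tensor, e_b: torch.Tensor) -> torch.Tensor:
    # Channel concatenation: stack the two streams along the channel axis.
    return torch.cat([e_a, e_b], dim=1)


def fuse_deep(e_a: torch.Tensor, e_b: torch.Tensor) -> torch.Tensor:
    # Element-wise max: keep the stronger response from either stream,
    # suppressing low-confidence activations.
    return torch.maximum(e_a, e_b)


# Example with encoder outputs E_A/E_B at the first scale (H/4 x W/4 x C1):
e_a = torch.randn(1, 32, 64, 64)
e_b = torch.randn(1, 32, 64, 64)
print(fuse_shallow(e_a, e_b).shape)  # torch.Size([1, 64, 64, 64])
print(fuse_deep(e_a, e_b).shape)     # torch.Size([1, 32, 64, 64])
```
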




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have clearly clarified the issues raised by the reviewers. The explanations in the rebuttal look reasonable and correct to me.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


