Abstract

Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DiT) model. In our work, the illumination prompt module guides the model to adapt to different exposure levels and perform targeted image enhancement, while the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules further boost the concurrent representation learning between the prompt parameters and features. In addition, the U-shaped restoration DiT model captures the long-range dependencies and contextual information needed for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance. The code and the proposed dataset are available at github.com/longbai1006/EndoUIC.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0221_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0221_supp.pdf

Link to the Code Repository

https://github.com/longbai1006/EndoUIC

Link to the Dataset(s)

https://mycuhk-my.sharepoint.com/:u:/g/personal/1155161502_link_cuhk_edu_hk/EZuLCQk1SjRMr7L6pIpiG5kBwhcMGp1hB_g73lySKlVUjA?e=g84Zl8

https://data.mendeley.com/datasets/3j3tmghw33/1

https://mycuhk-my.sharepoint.com/:u:/g/personal/1155161502_link_cuhk_edu_hk/EYtX3vMBWE1KizB1scvGOkgBzG4JW5SjTMAnJuxZTUAwdg?e=gbdyuR

https://mycuhk-my.sharepoint.com/:u:/g/personal/1155161502_link_cuhk_edu_hk/EZ_Dz7G4J4hBpDKn3YPng6cByGmdGt1z2Qd51fZsmv6DoA?e=veMC5d

BibTex

@InProceedings{Bai_EndoUIC_MICCAI2024,
        author = { Bai, Long and Chen, Tong and Tan, Qiaozhi and Nah, Wan Jun and Li, Yanheng and He, Zhicheng and Yuan, Sishen and Chen, Zhen and Wu, Jinlin and Islam, Mobarakol and Li, Zhen and Liu, Hongbin and Ren, Hongliang},
        title = { { EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors proposed EndoUIC, a promptable diffusion model for unified WCE illumination correction and image enhancement. The proposed illumination prompt module dynamically produces and integrates prompt parameters with features and enhances the interaction by the Adaptive Prompt Integration (API) module and the Global Prompt Scanner (GPS) module, respectively. In addition, the authors presented a novel Capsule-endoscopy Exposure Correction (CEC) dataset, containing 1,000 ground-truth and corrupted image pairs. Experiments show that the proposed method outperforms state-of-the-art methods on four datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The performance of the proposed method seems outstanding.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are many redundant and fancy sentences in the manuscript (e.g. self-adaptive dynamic feature space, dynamic selection mechanism), which have no further explanation or visualization to prove the efficacy. They are overly abstract and make the reader confused about the core idea. Besides, in the Introduction section, there are too many contributions and motivations that the authors would like to stress. It would be great to focus on the key points instead of mentioning all of them at the same time.
    2. The authors made overly abstract claims without further experiments. For example, could the authors explain the following sentences and present the corresponding results to support the statement? “Then, it shall steer the model within the parameter space toward different low-level details essential for EC and LLIE. Thus, leveraging the task-specific knowledge acquired by the model, it dynamically adapts the input data according to different brightness levels. Additionally, the prompt module is capable of learning different levels of illumination abnormalities within a single degradation. Thus, even if the input exhibits only one type of degradation, the model can still maintain effective restoration performance.”
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. To demonstrate the effectiveness of API, the reviewers would like to see more visualization results when it comes to the uneven illumination conditions in local regions.
    2. The authors mentioned that brightness levels in overexposure often extend beyond the dynamic range. As a result, the reviewer would like to know how the proposed model performs in these cases compared to other methods.
    3. Why is the performance of the proposed method (LPIPS) less satisfactory in Tab. 2?
    4. The authors should align the symbols in Fig. 1 and the ones in Sec. 2.2.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though the performance is good, the method seems over-designed and the writing is too wordy. In addition, the experiment is not comprehensive and lacks results other than performance to prove the effectiveness of the proposed module.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose EndoUIC, utilizing an end-to-end DFT model to address the low-light issues in WCE. It adjusts internal illumination to prevent over- or under-exposure of images. The illumination prompt module regulates exposure levels, while the API and GPS modules further enhance concurrent exposure, demonstrating additional utility from downstream experiments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A new research dataset was proposed and used in the experiments, featuring annotations added by experts on realistic images, which facilitated better training. Extensive experiments demonstrated the method's efficiency, and comparison with existing methods showed the superiority of the proposed approach. Without the need for additional networks, the images are improved through the prompt parameters P. By leveraging prior information, applicability was demonstrated in both low- and high-exposure scenarios.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While acknowledging the limitations of the data, it’s regrettable that the comparison was conducted solely on synthetic datasets, leaving uncertainty about performance in real-world settings. The description of how the initial P values are set in the API step appears somewhat inadequate, indicating a need for further elaboration. Additionally, there’s a need to validate the impact of additional P values. Although comparisons were made with existing methods, the specific differences between these methods and the proposed approach are not clearly described. Providing a brief overview of the distinguishing features would have facilitated a more informative comparison.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Once accepted, the code and the proposed dataset will be made publicly available, ensuring reproducibility without any issues.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the points listed in the weaknesses item above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is written in an easy-to-understand manner and attempts to validate its superiority through numerous experiments.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose EndoUIC, an Endoscopic Unified Illumination Correction method: a promptable diffusion model for unified WCE illumination correction. The illumination prompt module is designed to guide the diffusion module to correct particular illumination conditions. It exploits a transformer to make use of global illumination information as well as multi-scale contextual information for the restoration process.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The problem is important and clinically relevant for improving the outcomes of wireless capsule endoscopy, as more information can be retrieved thanks to the exposure correction. The problem is well motivated.

    2. The state of the art is comprehensive enough, although some mention of image enhancement efforts outside endoscopy might help situate the reader with respect to the work done in this field at large.

    3. The experimental design and the tests done by the authors are very thorough. They performed experiments on four datasets (two EC and two LLIE): Capsule-endoscopy Exposure Correction (CEC) and Endo4IE, as well as the Kvasir-Capsule and Red Lesion Endoscopy (RLE) datasets.

    4. The results are well presented and the authors present both quantitative and qualitative assessments of the experiments, presenting ablation studies in certain detail.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The methodology is difficult to follow, even for someone who has been working in this area for some time. I would absolutely like to see a more conceptual description of the proposed approach before Figure 1. This figure as it stands now provides too many details, but is not clear enough. More details for improving this part are given below in question 10.

    2. It could have been interesting to see a more detailed analysis of the robustness of the different methods, making use of box-plots or some similar analysis. Also, it could be interesting to see failure cases. Does the image enhancement introduce any artifacts that could be considered detrimental for clinical purposes?

    3. It is not clear how the method could be used in practice. Is the video analyzed and the correction done after the procedure was performed? The authors do not discuss the number of parameters of the proposed models and do not delve into inference performance. How long does it take to process a video sequence for the compared methods?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper makes use of the publicly available dataset. If the code is made available it could be easily reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    There are some minor concerns and details that the authors need to address to improve the paper

    1. In the new dataset (Capsule-endoscopy Exposure Correction (CEC)), how do you define or measure the complexity of the generated images in terms of the generated degradations? It could be interesting to define the metrics used for generating the dataset.

    2. It could be interesting to provide a qualitative example of this statement: “Thus, even if the input exhibits only one type of degradation, the model can still maintain effective restoration performance.”

    3. The text in the article needs to reflect better what is shown in Figure 1. For instance, several blocks are not properly defined in the text: for the DFT layer, the FFN block is not defined — what is the purpose of this block?

    For the Illumination Prompt Module, blocks such as DWConv (depthwise convolution) and the Selective-Scan (coming from the VMamba paper) are not clearly explained. Should the Xo and Xi terms be understood as Xin and Xout in the diagram? The paper should be easily understood on its own, even if inspired by other works.

    4. I understand that some of these modules are mostly inspired by VMamba, but still it could be good if the authors made some additional effort to make this point clearer. Looking at the VMamba paper, the Global Prompt Scanner basically looks like a Mamba block.

    5. Is the adaptively learned parameter the same as the kernels in the sub-figure, or how do they differ? The figure should contain the terms discussed in the text. For instance, the combined feature XA does not appear in the image. Making such things clearer in the figure would make the text easier to follow. For instance, XG is reused in the GPS module, making it confusing to follow.

    6. The entire GPS module is summarized just as this: “The combined feature is then fused with features from higher spatial dimensions to facilitate the illumination restoration process of the overall model.” More explanation is needed, in my opinion.

    7. The authors need to state at the beginning which modules were completely developed by them and which are largely reused from the state of the art. As it stands now, the distinction between components from the state of the art and what has been proposed by the authors is not completely clear.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper is a solid contribution and it could be a good paper for MICCAI. However, the writing and organization of the paper could be improved to simplify certain aspects and enhance the readability of the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I maintain my original review of the paper, provided the authors address my comments as discussed in their rebuttal.




Author Feedback

We thank the reviewers for their valuable feedback. We appreciate that the reviewers find our work to be “well motivated & thorough” (R1) and “superior” (R3). We address the major concerns as follows:

Clarify the Prompt Block (R1, R3, R5): We use additional learnable prompt parameters to encode key discriminative information about brightness degradation levels (over & underexposure). The API block is used to learn discriminative information, while GPS integrates the learned prompt information with the image feature, guiding the model’s learning process. We randomly initialize P values.

We observe the feature clustering of over- and underexposed images in the CEC test set with t-SNE. After passing through the 1st, 2nd, and 3rd prompt blocks, the features of over- and underexposed images become more distinct and better clustered (Davies-Bouldin Index: 1.68 -> 1.55 -> 0.53). After removing the prompt blocks, we find worse clustering results (2.24 -> 1.94 -> 1.87). This indicates that the prompt block helps optimize the feature clustering based on the discriminative illumination information.
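For reference, the Davies-Bouldin index reported above can be computed as follows. The exact feature pipeline (t-SNE embeddings taken after each prompt block) is the authors'; this sketch only implements the index itself and checks it on hypothetical toy 2-D blobs, so all variable names here are illustrative assumptions.

```python
import numpy as np

def davies_bouldin(features, labels):
    """Davies-Bouldin index over a labelled feature set (lower = better separated)."""
    classes = np.unique(labels)
    # per-cluster centroids and mean intra-cluster distance (scatter)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    scatter = np.array([
        np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    # for each cluster, take the worst (largest) similarity ratio to any other cluster
    index = 0.0
    for i in range(len(classes)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(classes)) if j != i
        ]
        index += max(ratios)
    return index / len(classes)

# Toy check: two well-separated Gaussian blobs vs. two overlapping ones.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
separated = np.concatenate([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
overlapping = np.concatenate([rng.normal(0, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
```

A falling index across the three prompt blocks, as in the rebuttal's numbers, indicates that the over- and underexposed features are becoming more compact and further apart.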

How the Prompt Block Works on One Type of Degradation (R1, R5): When there is only one type of degradation, the prompt block acts like an attention mechanism and increases the depth of the network, which will not affect the performance. Tab.3 results show that our model still works effectively when only having the LLIE task.

Key Contribution Clarification (R1, R5): The main motivation is to tackle the brightness anomalies (under & overexposure) in WCE. The core contributions are: 1) designing a prompt block to guide the image enhancement; 2) developing a new dataset.

The Selective-Scan from VMamba [13] & DWConv are from existing papers. We develop our prompt block based on these methods. We will explain our motivation & contributions in an easy-to-understand way and clarify the relationship between our proposed block and existing works.

Overexposure Results (R5): Tab.1 shows the overall scores for under & overexposed images. When tested separately on overexposed images, our method also achieves SOTA results (PSNR: Ours 29.84, PyDiff 26.92, PromptIR 27.32).

Discussion on Less Satisfactory Results (R5): We are sorry that we did not discuss the less satisfactory results in Tab. 2. Our method primarily restores images at the pixel level, leading to top-1 results in PSNR & SSIM. However, it falls behind LANet in perceptual feature quality, while our method still ranks 2nd in LPIPS.

Confusing & Wordy Method Writing (R1, R5): DFT: Diffusion Transformer; FFN: Feed-forward Network. On Page 5, Para. 2, we construct kernels of different sizes to provide the model with different receptive fields, and fuse them with average & max pooling. We will remove the abstract descriptions of “self-adaptive dynamic feature space” & “dynamic selection”. We will fix all inconsistent or missing symbols in the figure for better clarity.

Result Robustness (R1): Below are the STDs for results in Tab.1 (PSNR: Ours 29.65±3.51, PyDiff 28.18±3.47, PromptIR 28.27±3.79). Our STDs are within an acceptable range. We use the downstream lesion segmentation to validate that our model can benefit clinical uses (Tab.3 last row).

Real-time Deployment (R1): Here are the FLOPs & speed (Ours 14.36M / 3.29 FPS, PyDiff 97.89M / 8.65 FPS, PromptIR 35.59M / 4.23 FPS). Our model is not large, but its inference speed needs improvement. This will be our focus for future work.

Dataset Generation Process (R1, R3): We simulate camera aperture settings with the Adobe Camera Raw SDK to adjust exposure values (EVs). Changing the exposure value is equivalent to changing the camera aperture size. We render the raw image with varying numerical EVs to change the exposure range of highlights and shadows, emulating real exposure mistakes. Each raw image is randomly assigned an EV from a defined range (underexposure [-4, -3.5], overexposure [3, 3.5]).
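The EV-shift idea above can be sketched as follows. The authors render through the Adobe Camera Raw SDK; this is only a simplified linear-light approximation (one EV = one photographic stop, i.e. a 2x change in scene luminance) for illustration, and the `simulate_exposure` helper and the gamma value are assumptions, not the paper's pipeline.

```python
import numpy as np

def simulate_exposure(img, ev, gamma=2.2):
    """Approximate an exposure shift of `ev` stops on an 8-bit image.

    Decodes to linear light with a rough gamma, scales by 2**ev
    (one EV = one stop), clips to the displayable range, re-encodes.
    """
    srgb = img.astype(np.float64) / 255.0
    linear = srgb ** gamma              # rough display -> linear decode
    linear = linear * (2.0 ** ev)       # exposure shift in stops
    out = np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)
    return (out * 255.0).round().astype(np.uint8)

# Draw EVs from the ranges stated in the rebuttal.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in frame
under = simulate_exposure(img, rng.uniform(-4.0, -3.5))       # underexposed copy
over = simulate_exposure(img, rng.uniform(3.0, 3.5))          # overexposed copy
```

Note how clipping at the top of the range is what makes overexposure destructive: highlight detail pushed past 1.0 cannot be recovered by simply scaling back down, which is why learned restoration is needed.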

We do not add any additional experiments, but only use new metrics.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


