Abstract

Volumetry is one of the principal downstream applications of 3D medical image segmentation, for example, to detect abnormal tissue growth or for surgery planning. Conformal Prediction is a promising framework for uncertainty quantification, providing calibrated predictive intervals associated with automatic volume measurements. However, this methodology is based on the hypothesis that calibration and test samples are exchangeable, an assumption that is in practice often violated in medical image applications. A weighted formulation of Conformal Prediction can be framed to mitigate this issue, but its empirical investigation in the medical domain is still lacking. A potential reason is that it relies on the estimation of the density ratio between the calibration and test distributions, which is likely to be intractable in scenarios involving high-dimensional data. To circumvent this, we propose an efficient approach for density ratio estimation relying on the compressed latent representations generated by the segmentation model. Our experiments demonstrate the efficiency of our approach to reduce the coverage error in the presence of covariate shifts, in both synthetic and real-world settings.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3051_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/benolmbrt/wcp_miccai

Link to the Dataset(s)

https://www.synapse.org/#!Synapse:syn51156910/wiki/622461

BibTex

@InProceedings{Lam_Robust_MICCAI2024,
        author = { Lambert, Benjamin and Forbes, Florence and Doyle, Senan and Dojat, Michel},
        title = { { Robust Conformal Volume Estimation in 3D Medical Images } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work, the authors propose a method based on conformal prediction theory to estimate the uncertainties of a medical image segmentation network. This theory assumes that there is no domain shift between the calibration and the test data, which is not true in practice. To cope with this problem, they use weighted conformal prediction and propose to use a latent vector computed from the segmentation network features to compute the weights in a tractable way. The method is evaluated on a synthetic dataset and on the BraTS 2023 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    – The proposed method effectively improves the uncertainty estimation when the distribution of a given covariate in the calibration dataset differs from its distribution in the test dataset, which is a real limitation of conformal prediction theory.

    – The proposed method is usable in practice: it adds only a small overhead with respect to standard conformal prediction, and does not require training another deep classifier.

    – The paper is interesting and clearly written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – The experiments validate the point of the authors but in a somewhat restricted setting. More experiments would help to convince the reader of the usefulness of the method in real-life settings (see section 10 for details).

    – Experiments show that the new method does improve coverage, but at the cost of wider intervals. This may be expected behaviour: if the real width is indeed larger, it is better estimated. But an estimated width far larger than the real width will also improve the coverage while making the model unusable if the uncertainties are huge. It seems that the experimental protocol does not answer this point. How can we be sure that the method does not estimate an overly large width?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    – In the experiments, the authors use a different distribution of the covariate between the ID and the (shifted) test sets. For example, for BraTS, the split of glioblastoma vs meningioma is 30%-70% in ID and 82%-18% in the shifted test dataset. What are the limits of the shift? What if the split is 0%-100% in ID? A study on the influence of shift amplitude would be interesting.

    – What would happen on data from a completely different dataset, for example when deploying a model in a new hospital? Experiments on BraTS with, for example, https://paperswithcode.com/dataset/ucsf-pdgm as test data would also be interesting.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper deals with an interesting, trending method for uncertainty estimation in deep networks. The proposed method seems to effectively mitigate the effect of covariate shift. The rebuttal should specifically address my question about the potentially overestimated width.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    A novel approach for volume estimation in 3D medical images. Weighted conformal prediction is proposed and shown to be effective in tackling covariate shifts in medical image analysis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Conformal prediction and calibrated systems are current research problems for the use of AI systems in mission-critical applications. This paper is a step in the right direction.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I look for discussions on the following points during rebuttal.

    1. The paper mentions that the exchangeability hypothesis does not hold for medical images, as there might be covariate shifts during data acquisition. More explanation of this claim is needed, specifically of how the exchangeability hypothesis relates to conformal prediction tasks.

    2. In section 2.3, the paper claims that it is very difficult to train a neural network to distinguish training data from testing data, which is essentially binary classification. In this age of modern high-performing neural networks, binary classification of any sort is a pretty intuitive task, so an explanation is needed in this regard.

    3. The paper also claims that a sufficient amount of calibration data is not easily available for medical images. No citation or evidence is provided to support this.

    4. Equation 5 does not seem correct in terms of indices and notation. It is a summation of a single term with index j, and the presence of index i is not explained. Is the definition correct when compared with similar literature? Kindly check.

    5. Section 2.4 certainly needs more details on the methodology, such as the reason for averaging over all spatial dimensions. Too much space is given to explaining existing knowledge, and section 2.4 therefore lacks the space needed to convey its ideas properly.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I could not execute the code to validate the reported results. This may require more effort to understand the working of the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Weighted conformal prediction is a novel contribution. The section where the major ideas are presented, i.e., section 2.4, should be expanded a bit, at the expense of the discussion of previous studies or background.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I look for the justification for a few points as mentioned in the section on “weakness” during rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors explore the task of identifying objects of interest from 3D medical images, and estimating their volume, with confidence bounds (using weighted CP method).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • proposes a computationally efficient way to use conformal prediction on 3D volume estimation tasks
    • technically sound, well-written, easy to comprehend
    • detailed explanations of their method and evaluation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • the authors have not compared their results against baselines other than their own
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    When generating synthetic shifted data, how did the authors modify SNR? Was it additive noise? For instance, if the noising process is affine, then InD and shifted data would be linearly correlated, and would represent a very specific case of covariate shift.

    In their evaluation, the authors consistently observe a 50% accuracy for the InD cases of W-Latent and W-Oracle. Is this for calibration-ID vs test-ID prediction? I recommend updating the figure captions / table headers for clarity.

    Since the datasets are split in multiple levels for the evaluation, I recommend having a visual representation of the split procedure to improve the clarity of presentation.

    For the brain tumor dataset, the authors treat class imbalance as the covariate shift. How well does this reflect a practical observation of covariate shift? What would happen if an entirely OoD input is given (e.g., image with no tumor)?

    Moreover, to help readers without background in conformal prediction, I recommend describing how the actual inference process would take place.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • proposes a computationally efficient way to use conformal prediction on 3D volume estimation tasks
    • technically sound, well-written, easy to comprehend
    • detailed explanations of their method and evaluation.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for their useful and positive comments, which overall commend the clarity of the paper and the novelty of the ideas. We acknowledge all suggestions to improve the final version.

We agree with R1: despite the promising experimental results reported in the paper, more experiments are required to confirm the value of the approach in different contexts and when applied to substantially different datasets. Our code is available to the community for possible feedback. The link will appear in the final text.

As pointed out by R1, the improvement of the coverage can lead to an overly large estimated width. This is a limitation of the technique that can only be verified using simulations. In the conformal calibration step, only the bounds of the intervals are modified, while the estimated volume is the same for the standard and weighted conformal step. The only way to correct the undercoverage caused by the covariate shift is thus to enlarge the intervals. Moreover, the final width may depend on several factors: the classifier used and its calibration, leading to a more or less precise approximation of the weights; the score function; and the presence of outliers. Note that since the introduction of WCP by Tibshirani et al 2019, some works have investigated better ways to handle non-exchangeability, see for instance Farinhas et al 2024.
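
To illustrate how a weighted calibration step can move only the interval bounds, here is a minimal sketch. It assumes an absolute-residual score and a symmetric interval, neither of which is specified in this feedback; the function names, scores, and weights are hypothetical, and this is not the authors' implementation. The weighting follows the general form of weighted conformal prediction, in which the test sample carries its own weight on a score of infinity.

```python
import math

def weighted_conformal_quantile(scores, cal_weights, test_weight, alpha):
    """Level-(1 - alpha) quantile of the weighted calibration scores.

    The test point carries its own weight on a score of +infinity, so the
    result is infinite when the calibration weights cannot reach the level.
    """
    total = sum(cal_weights) + test_weight
    target = (1.0 - alpha) * total  # compare unnormalized masses
    cumulative = 0.0
    for score, weight in sorted(zip(scores, cal_weights)):
        cumulative += weight
        if cumulative >= target:
            return score
    return math.inf

def volume_interval(predicted_volume, quantile):
    """Symmetric interval around an unchanged point estimate (assumed form)."""
    return predicted_volume - quantile, predicted_volume + quantile

# Hypothetical calibration scores |true volume - predicted volume|.
scores = [1.0, 2.0, 3.0, 4.0]
uniform = [1.0, 1.0, 1.0, 1.0]     # standard CP: equal weights
shifted = [0.25, 0.25, 0.5, 2.0]   # hypothetical density-ratio weights

q_standard = weighted_conformal_quantile(scores, uniform, 1.0, alpha=0.5)
q_weighted = weighted_conformal_quantile(scores, shifted, 1.0, alpha=0.5)
print(volume_interval(10.0, q_standard))  # (7.0, 13.0)
print(volume_interval(10.0, q_weighted))  # (6.0, 14.0)
```

In this toy case the point estimate stays at 10.0 in both calls; the weights shift mass toward the largest calibration score, so only the bounds widen.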

Availability of calibration data (R6): the current guideline for conformal prediction is to rely on a set-aside calibration dataset of N=1000 samples in order to obtain precise intervals (see Angelopoulos et al. 2023). The BraTS 2023 dataset, which is one of the largest and most popular open-source medical image segmentation datasets, only contains around 1100 subjects, which are used in this study for training, calibration, and testing. For other tasks such as multiple sclerosis lesion segmentation, the number of open-access cases is much smaller (a few hundred), making the calibration procedure noisier. We thank R1 for pointing out new datasets for brain tumors.

Concerning the generation process of synthetic data (R7), we start by uniformly sampling a random target SNR from the range [1, 20]. The next step is to convert the binary mask into an intensity image matching the predefined SNR. This is achieved by setting the background intensity to 0 and the sphere intensity to 1, and then adding random Gaussian noise drawn from N(0, 1/SNR) to the image. As a result, the generated image has an SNR that matches the target one. We agree that this process generates a very specific type of covariate shift, and it is used only to demonstrate our approach on a dataset with a controlled shift.
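
The generation steps described above can be sketched as follows. This is a hedged illustration rather than the released code: the image size, sphere radius, and function name are hypothetical, and 1/SNR is interpreted here as the standard deviation of the noise (so that a unit signal divided by the noise standard deviation equals the target SNR), which is an assumption about the notation N(0, 1/SNR).

```python
import numpy as np

def make_noisy_sphere(shape=(32, 32, 32), radius=8.0, rng=None):
    """Return (image, mask, snr) for one synthetic sample."""
    rng = np.random.default_rng() if rng is None else rng
    snr = rng.uniform(1.0, 20.0)  # target SNR sampled uniformly in [1, 20]

    # Binary mask of a centered sphere: background False, sphere True.
    grid = np.indices(shape)
    center = np.array([(s - 1) / 2.0 for s in shape]).reshape(3, 1, 1, 1)
    mask = ((grid - center) ** 2).sum(axis=0) <= radius ** 2

    # Intensities 0 (background) and 1 (sphere), plus additive Gaussian noise.
    # Assumption: 1/SNR is used as the noise standard deviation.
    image = mask.astype(np.float64)
    image += rng.normal(0.0, 1.0 / snr, size=shape)
    return image, mask, snr

image, mask, snr = make_noisy_sphere(rng=np.random.default_rng(0))
empirical_snr = 1.0 / image[~mask].std()  # unit signal over background noise
print(f"target SNR = {snr:.2f}, empirical SNR = {empirical_snr:.2f}")
```

Under this interpretation, the empirical SNR measured on the background voxels should be close to the sampled target.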

Concerning the influence of shift amplitude (R1) and out-of-distribution samples (R7): importantly, the covariate shift can only be tackled if it is not too large. For example, if the proportion of glioblastoma / meningioma were 100:0 in the calibration dataset and 0:100 in the test data, then the density ratio would be undefined and there would be no hope of accounting for this radical shift (see Dockès et al 2021). This is also true for an OOD input (e.g. an image without a tumor).
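
A small worked example may make the limit concrete. It assumes, purely for illustration, that the shift is fully described by the tumor-type proportions, and it uses the proportions quoted by R1 (30%-70% in-distribution, 82%-18% in the shifted test set); the function name is hypothetical.

```python
def density_ratio(p_test, p_cal):
    """Ratio p_test / p_cal for one class of samples."""
    if p_cal == 0.0 and p_test > 0.0:
        # No calibration sample can stand in for this part of the test data.
        raise ValueError("density ratio undefined: class absent from calibration data")
    if p_cal == 0.0:
        return 0.0  # class absent from both sets, so it never receives weight
    return p_test / p_cal

# Proportions quoted in Review #1 (assumed to fully describe the shift).
cal = {"glioblastoma": 0.30, "meningioma": 0.70}
test = {"glioblastoma": 0.82, "meningioma": 0.18}
weights = {name: density_ratio(test[name], cal[name]) for name in cal}
print({name: round(w, 2) for name, w in weights.items()})

# Radical shift: 100:0 in calibration and 0:100 in test.
try:
    density_ratio(p_test=1.0, p_cal=0.0)
except ValueError as err:
    print(err)
```

In the moderate case each calibration glioblastoma counts roughly 2.7 times and each meningioma roughly 0.26 times; in the radical case no finite weight exists.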

Section 2.3 (R6): even if the final binary classification is not computationally demanding per se, the standard CP procedure requires reweighting the calibration dataset each time a new test sample is considered, which can be computationally intensive and not feasible in practice (no access to the calibration dataset, no computing resources available at the center where the system is deployed, etc.). Moreover, density ratio estimation using a classifier heavily relies on the calibration of the predicted probabilities, which is known to be a pitfall of modern deep CNNs (Guo et al. 2017).
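
The dependence on probability calibration can be seen from the usual conversion between a domain classifier's output and a density ratio. The sketch below uses the standard probabilistic-classification identity w(x) = (n_cal / n_test) * p(x) / (1 - p(x)), where p(x) is the predicted probability that x comes from the test set; the function name and sample sizes are hypothetical, and this is not presented as the authors' estimator.

```python
def ratio_from_probability(p_test, n_cal, n_test):
    """Density ratio w(x) from a classifier's probability that x is a test sample."""
    if not 0.0 <= p_test < 1.0:
        raise ValueError("p_test must lie in [0, 1)")
    return (n_cal / n_test) * p_test / (1.0 - p_test)

# Equal-sized calibration and test sets (hypothetical sizes).
for p in (0.5, 0.9, 0.99):
    w = ratio_from_probability(p, n_cal=100, n_test=100)
    print(f"p = {p:.2f} -> weight = {w:.1f}")
```

A probability of 0.5 gives a weight of 1, while moving from 0.9 to 0.99 multiplies the weight by about eleven, so an overconfident classifier can inflate the weights considerably.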

Eq. 5 seems incorrect (R6): thank you for your careful reading; the sum is over i=1 to n. This is corrected in the final version.




Meta-Review

Meta-review not available, early accepted paper.


