Abstract

We investigate the role of uncertainty quantification in aiding medical decision-making. Existing evaluation metrics fail to capture the practical utility of joint human-AI decision-making systems. To address this, we introduce a novel framework to assess such systems and use it to benchmark a diverse set of confidence and uncertainty estimation methods. Our results show that certainty measures enable joint human-AI systems to outperform both standalone humans and AIs, and that for a given system there exists an optimal balance in the number of cases to refer to humans, beyond which the system’s performance degrades.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1535_paper.pdf

SharedIt Link: https://rdcu.be/dV53F

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_1

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1535_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Kon_Aframework_MICCAI2024,
        author = { Konuk, Emir and Welch, Robert and Christiansen, Filip and Epstein, Elisabeth and Smith, Kevin},
        title = { { A framework for assessing joint human-AI systems based on uncertainty estimation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {3 -- 12}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper describes a framework for evaluating joint human-AI systems.

    The AI system must provide a confidence or uncertainty score, but the framework is agnostic to the choice. Given this uncertainty score, a threshold is chosen below which the decision is deferred to a human. The risk evaluation metrics are then computed for the whole system – for cases where the AI answers confidently, the AI result is used, but for cases where the AI defers, the human result is used. The authors argue this differentiates their approach from others in the literature which generally leave out the human results.

    The system is evaluated on a database of ultrasound images used for classification of ovarian tumors. A network is trained and several confidence/uncertainty methods are implemented. Humans also provided predictions. The risk-coverage curves of the joint human-AI system show that there is a “sweet spot” for how often the AI should defer. Here, the performance is generally better than AI or human alone.
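
    To make the deferral scheme concrete, below is a minimal sketch of how such a joint risk-coverage curve could be computed, assuming per-case arrays of AI predictions, human predictions, ground-truth labels, and a certainty score psi (the helper and all names are illustrative, not taken from the paper):

        import numpy as np

        def joint_risk_coverage(psi, ai_pred, human_pred, y_true):
            # Sort cases from most to least certain; at coverage k/n the AI answers
            # the k most certain cases and defers the remainder to the human reader.
            order = np.argsort(-psi)
            n = len(psi)
            coverages, risks = [], []
            for k in range(n + 1):
                keep = np.zeros(n, dtype=bool)
                keep[order[:k]] = True
                joint_pred = np.where(keep, ai_pred, human_pred)
                coverages.append(k / n)
                risks.append(np.mean(joint_pred != y_true))  # joint error rate ("risk")
            return np.array(coverages), np.array(risks)

    The “sweet spot” mentioned above is then simply the coverage level at which this joint risk is lowest.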

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Often the motivation for uncertainty/confidence estimates in AI models is to flag cases requiring human intervention. Rather than focusing on methodology for estimating uncertainty, however, this paper is remarkable because it directly addresses this use case and shows that even very simple confidence estimates based on the model’s output probabilities are useful for flagging cases for human intervention and reaping a benefit.

    The experiments are quite extensive, including multiple human readers with varying levels of expertise.

    The inclusion of in-domain and out-of-domain comparisons strengthens the validation. It also shows that when domain drift is higher, human intervention can be increased to compensate.

    In terms of clinical applicability, it is promising to see that human involvement can provide substantial benefits even at high levels of model coverage, and therefore the benefit can be reaped without requiring huge amounts of expert clinician time. The comparison of performance between experts and non-experts is also informative in this regard, because the benefit is available (albeit smaller) even when the humans have less expertise.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    One weakness of the paper is in relation to the proposed metrics (AUJRC_gamma, AUJF1C_gamma). While the curves themselves make sense, the reasoning for the choice of cutoff is less clear. What considerations go into the choice of gamma, and are the metric values meaningful on an absolute scale or only useful for comparison? Furthermore, why integrate the curve from a threshold, rather than compare performance at the optimal coverage tau? What if gamma is too high and excludes the optimal coverage level? Would the values be easier to interpret if we scaled by the integration range (1 - gamma)? The paper could describe the properties of the proposed metrics in better detail to answer some of these questions.

    Along these lines, the paper argues towards the end that the proposed metrics better reflect the performance of the systems, as shown in Fig 4. But Fig 4 is essentially plotting the peaks of the same curves that are being integrated by the metrics, so this is not terribly surprising, and raises the question: why not just use those peaks rather than integrating the curves?

    In section 4, confidence and uncertainty metrics are described. It is claimed these approaches can be applied to the seven model configurations used; however, it is not stated exactly which are used when computing the results (such as the curves in Fig 4).

    Finally, the authors claim that 66 doctors, 33 of whom were highly experienced, reviewed the scans, and each scan was read by an average of 14 doctors. What isn’t clear is how these reader results were used to provide single answers in the joint system. Theoretically the system is described with the AI deferring to one reader. How was the human answer chosen from the ~14 readers for each image when needed? Was it an average? Was one reader chosen somehow? And what implications do these choices have for the system calibration? As noted later in the paper, when applying the system in a new domain or with new readers, it would need to be recalibrated to choose the appropriate threshold for deferral. Does having an inconsistent set of readers in the validation data set affect the interpretation of the thresholds in the paper’s results?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Barring the lack of clarity on how the human answers were chosen, and exactly which confidence/uncertainty measure was applied, the paper is otherwise clearly described. If the authors clarify those points, I would be satisfied with the reproducibility of the work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the weaknesses section I describe several clarifications that could be made that I think would improve the paper.

    I also think a bit more information about the imaging data set would be useful (even if as a supplement). As it stands, very little information is given outside the numbers of scans and readers. What kinds of scanners? How many scans from each institution? Summary patient demographic information? Etc

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper is a valuable contribution to the community, but there are a few clarifications that would be needed for acceptance, in my opinion. Hence I am giving a 4.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This study proposes joint risk-coverage curves that measure the overall performance of a joint human-AI system where humans classify images that an AI-based classifier is uncertain about. Prior research used risk-coverage curves, which do not consider the performance of human evaluators.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study has several strengths. First and foremost, the study proposes curves that consider the performance of human evaluators in joint human-AI systems, which previous research does not consider. These curves could be used to help determine the uncertainty threshold at which AI-based classifiers will defer to a human expert in a clinical setting. Additionally, they could be used to visualize the utility of a joint human-AI system. Second, the study evaluates its methods on a very large and diverse dataset, including evaluations on unseen hospitals and devices. Third, the study bootstraps its results. Fourth, the study compares seven uncertainty/confidence estimation methods within its proposed paradigm.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While I appreciate that the proposed curves evaluate an entire joint human-AI pipeline, my main concern is that the curves will not be widely applicable due to the limited availability of human experts for experimental and computational studies. The curves require a human assessment for every point in the testing dataset, which is not feasible in many scenarios.

    My secondary concern is that the manuscript treats the clinical setting as if AI models that are certain would have the only say in the diagnosis of disease. In reality, especially in the cancer setting that the manuscript explores, physicians make the final call, using AI models as tools. A more robust evaluation of a joint human-AI pipeline would have humans evaluate all the test images, but with the AI predictions available as a tool when forming predictions for data on which the AI model exhibited high confidence.

    Finally, the readability of the manuscript can be challenging at times. I delineated my concerns in the comments section in an effort to help the authors.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    While the paper contains results whose methods are not fully explained (see comments below), making the study itself harder to reproduce (it was also conducted on a proprietary dataset), the proposed curves could be reproduced, which is the main purpose of this study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I’ve detailed my comments below. I understand that there is limited room in the template, so while I think many points should be clarified, I understand if there is not enough room to include all of them.

    Major comments:

    • There are many results presented in the paper whose methods are not discussed:
      1) How is human performance measured? Is it an average of all 16 doctors? Why is human performance different for the AI-first-reader and Arbitration scenarios (i.e., the horizontal lines in Figure 3)?
      2) Who are the experts and non-experts? Is this a split among the doctors? How many fall into each category?
      3) How was bootstrapping performed? Or even that it was performed.
      4) What are “always malignant” and “always benign” in Figure 1 (d)?
      5) How many estimates are used for MCD? How many members are in the ensemble?
      6) All the training procedures for the ResNet50 models, besides the fact that they were pretrained on ImageNet.
      7) The methods say that the results with the lowest joint risk would be reported, but then areas under curves are reported in Table 1?
    • It concerns me that a main contribution of the paper is AUJRC, but it isn’t even presented in the main table (Table 1).
    • I’m not convinced that partial areas under the curves are better than the full areas. It is much easier to interpret how good performance is when the potential range is between 0 and 1, instead of between 0 and an arbitrary value. Moreover, while full ranges do take into account more human error, isn’t that a better measurement of the entire system? Is it not plausible that a model will reject a large number of images in out-of-domain scenarios?

    Writing comments:

    • In Figure 1, there are several symbols that are not explained. What does the gauge represent? The 0.93? The boxplots? Also, aren’t subplots (c) & (d) the same in the diagnostic process? It is disingenuous to assume that the samples rejected due to uncertainty aren’t going to be labeled by physicians. I would also capitalize the beginning of the subplot headings.
    • The variable psi is overloaded. You define it to be the confidence or uncertainty of a model and refer to it as certainty. This limits the readability of the paper, as uncertainty and certainty are opposites. For example, in section 3 you say “AI with low certainty, .., correlates with higher error rates”, which would make sense except that you defined certainty to be uncertainty (footnote 1). Furthermore, Equation (7) is measuring certainty when it claims to be measuring uncertainty.
    • The “AI models” section could be clarified. The abbreviation “NN” is never defined. TS, TTA, MCD, etc. are posed as AI models, but they are confidence/uncertainty estimation approaches that can be applied to almost any model?
    • The way the first sentence is written, it makes the reader wonder what the difference in errors is between human and AI evaluators.
    • In your contribution paragraph, I would change “unique” to something similar to “large and diverse” ultrasound dataset. If you have such a big dataset, I would flaunt it instead of making readers wonder at what “unique” means.
    • When defining confidence and uncertainty estimates, it would be good academic practice to cite those who originally proposed these estimates for uncertainty estimation (e.g., Hendrycks et al.’s MSP for (6)).
    • “InD” and “OoD” are used before the abbreviations are defined.
    • I would change the approximately equal symbol to the word “approximately”.
    • In equations 3 & 4, what is “j”?
    • The Related Works section is hard to follow. I can infer the difference between quality and utility from Figure 1, but it isn’t clear on the first readthrough. What do you mean by “better” in “better scores”? Isn’t ECE literally measuring the alignment of confidence (output probabilities) with the ground truth probabilities? So why isn’t it lumped in with “better scores”? What are some examples of “better scores”? Are Bungert and Alves truly contradictory (Bungert doesn’t include evaluations for ensembles – only for 4 confidence scoring functions when many exist)? The last paragraph in general seems to be attempting to motivate the proposed metrics by highlighting the failures of previous metrics, but the cited failure points are of the confidence scoring functions themselves, not the measures of confidence scoring function utility.
    • Where do the plots in Figure 2 come from? As their origin wasn’t explained and they are presented so early in the paper, it feels like they were contrived to be a “proof by example” while reading through the Methods.
    • Additionally, in Figure 2, a star (or some other symbol) on the optimal operating point in (c) may be helpful.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper demonstrates novelty in the proposed curves and I appreciate that it considers the whole human-AI system, I’m not convinced that the proposed curves would be widely applicable, as most studies don’t have human experts readily available.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes an evaluation framework for assessing joint human-AI classification systems. This framework assesses how AI models, which compute confidence or uncertainty measures, interact with doctors according to clinically informed metrics. To evaluate the performance of the system, the task was to classify ovarian tumors in a large data set of ultrasound images. The ground truth for model training and evaluation was histological diagnosis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses an interesting issue in the application of AI technologies. The authors propose a sound methodological approach with an extensive dataset. It opens the road for investigating new certainty metrics that assess joint performance.

    The study used a large data set of 17,119 ultrasound images from 3,652 patients, gathered from 20 centers across eight countries using 21 different ultrasound systems. A total of 66 doctors, including 33 experts with over five years of experience, assessed the exams. Each exam was reviewed by an average of 14 doctors.

    The paper shows that integrating any of the evaluated models within a collaborative human-AI framework consistently enhanced diagnostic accuracy beyond what could be achieved by either humans or AI systems operating independently.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors do not clearly explain the nature of the images (e.g., the type of ultrasound), the potential differences across the centers, the heterogeneity of the data, or the influence of these factors on the assessments by both the AI and the expert or non-expert readers.

    A small experiment with a reduced data set would have been very useful to assess the performance of such a system in real clinical conditions, such as imbalanced data or data sets with few cases.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Neither the algorithms nor the data will be released. Nevertheless, the methods and the references are well described, so that anyone with the required knowledge would be able to implement the approach and apply it to a different dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is clear and well organized. The results are driven by interesting questions. It falls within the MICCAI scope. Examples on real images would add important value to the presentation at the conference.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an important issue in current AI technologies.

    The assessment of uncertainty of both humans and AI is a crucial aspect of improving detection and diagnosis. The paper illustrates the performance with a large data set and a full set of experiments. It deserves further discussion during the conference.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

CONFIDENCE VS. UNCERTAINTY Despite being conceptually different, confidence and uncertainty can serve a similar function in the clinic. We intentionally define the term psi to refer to either confidence or (negative) uncertainty to reflect their similar utility. We are not implying equivalence; e.g., a model can have terrible calibration yet good uncertainty estimates.

ALPHA SELECTION The problem of alpha selection is similar to selecting the operating point of a classifier. During our in-domain (InD) experiments, we found the alpha which yields peak validation performance to consistently yield nearly peak performance on the test set. For unseen hospital (OoD-H) experiments, the alpha that showed best joint performance on the validation set does not translate well to the test set – it underestimates the required doctor oversight (but the joint system still outperforms standalone AI or doctors). The extent of this discrepancy was smaller for unseen devices (OoD-D). Thus, we emphasized the need for a small set of cases and assessments for choosing the alpha for out-of-domain settings. Having few doctors for calibration is not a problem, provided that the joint system will be used by doctors with a similar level of experience.
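
As a hedged illustration of this calibration step (reusing the joint_risk_coverage helper sketched earlier on this page; the variable names are ours, not the paper's), alpha can be chosen as the certainty threshold whose coverage minimizes joint risk on the validation set and then applied unchanged at test time:

    import numpy as np

    def select_alpha(psi_val, ai_val, human_val, y_val):
        # Coverage step k keeps the k most certain validation cases, i.e. it
        # corresponds to thresholding psi at the k-th largest value (inf for k = 0).
        cov, risk = joint_risk_coverage(psi_val, ai_val, human_val, y_val)
        thresholds = np.concatenate(([np.inf], np.sort(psi_val)[::-1]))
        best = int(np.argmin(risk))
        return thresholds[best], cov[best], risk[best]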

GAMMA SELECTION The selection of partial areas (gamma values) is necessarily somewhat arbitrary, as this value will vary from center to center depending on demand and available healthcare resources. We tried to select values that address a range of realistic use cases. For example, a hospital with limited resources may rely on its AI model to diagnose the majority of patients, reducing the workload of doctors. In this case, the hospital would be interested in comparing models at high coverage ranges only, using a partial AUJRC with e.g. gamma = 0.9. Looking only at the full AUJRC (gamma = 0) score, a model that performs well at low-to-medium coverage ranges but poorly at high coverage ranges could be chosen despite not being the optimal model for the realities of the setting.
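
As a minimal sketch, assuming AUJRC_gamma is the area under the joint risk-coverage curve restricted to coverage >= gamma (the paper's exact definition may differ), the partial area, with the optional (1 - gamma) rescaling raised by Reviewer #1, could be computed as:

    import numpy as np

    def partial_aujrc(coverages, risks, gamma=0.9, normalize=False):
        # Integrate the joint risk-coverage curve over coverage in [gamma, 1];
        # dividing by (1 - gamma) gives the mean joint risk on that range.
        mask = coverages >= gamma
        area = np.trapz(risks[mask], coverages[mask])
        return area / (1.0 - gamma) if normalize else area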

FEASIBILITY While the joint risk curves may be resource intensive, the point of our work is that ignoring the joint function of humans and AI in a diagnostic setting can lead to suboptimal performance and a misguided reliance on existing strategies. We must account for the interaction between doctor assessments and uncertainty/confidence if we are to have a clear understanding of how (or if) these quantities are helpful.

JOINT SYSTEM USAGE We agree that in many realistic scenarios, doctors should have access to model predictions (and uncertainties) to revise their assessments. However, this can only be done prospectively at high cost, which we plan to do. That said, the AI-as-first-reader scenario we discuss is a practical and valid use case that is worthy of study and likely to be deployed in the coming years.

DATASET Our data contained Doppler and grayscale transvaginal and abdominal ultrasound images. Centers showed high variance in their diagnosis distributions, sonographer experience, and device models. Standalone experts (median 17 years of experience) had, on average, a 5-point higher F1 score than non-expert doctors (median 5 years of experience).

BOOTSTRAPPING We employed bootstrapping in which we first sampled cases, then a single random doctor per case. When bootstrapping doctor assessments, we normalized the sampling probability with respect to the number of assessments done by each doctor.
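
A sketch of one plausible reading of this scheme ("normalized with respect to the number of assessments" interpreted here as inverse weighting by each doctor's assessment count; the data structures are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_replicate(case_ids, assessments, n_assessments_per_doctor):
        # Resample cases with replacement, then draw one doctor per sampled case,
        # down-weighting doctors in proportion to how many cases they assessed.
        sampled = rng.choice(case_ids, size=len(case_ids), replace=True)
        picks = []
        for case in sampled:
            doctors, labels = assessments[case]  # readers of this case and their answers
            w = 1.0 / np.array([n_assessments_per_doctor[d] for d in doctors])
            w /= w.sum()
            i = rng.choice(len(doctors), p=w)
            picks.append((case, doctors[i], labels[i]))
        return picks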

TRAINING DETAILS We used each method’s default hyperparameters except for learning rate and schedule, both of which we selected manually by inspecting performance on the validation set. Similarly, we picked the 5-member ensemble’s diversity regularization strength using the validation set. We used 10 configurations for MCD (trained with p=0.2), 10 checkpoints for SWAG, and 10 particles for SVGD. For each method we selected an applicable confidence or uncertainty measure based on the peak validation F1RC performance.
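
For illustration, a minimal MC dropout inference sketch consistent with these settings (assuming a PyTorch-style classifier; the framework and the exact confidence measure chosen per method are not stated here, so treat the details as assumptions):

    import torch

    def mc_dropout_predict(model, x, n_samples=10, p=0.2):
        # Keep dropout stochastic at test time and average the softmax outputs of
        # n_samples forward passes; MSP of the mean serves as one possible confidence.
        model.eval()
        for m in model.modules():
            if isinstance(m, torch.nn.Dropout):
                m.p = p
                m.train()
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
        mean_probs = probs.mean(dim=0)
        confidence = mean_probs.max(dim=-1).values
        return mean_probs, confidence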




Meta-Review

Meta-review not available, early accepted paper.


