Abstract
Accurate probability estimates are critical for clinical decision-making, yet many Multiple Instance Learning (MIL) methods prioritize classification performance alone. We investigate the calibration quality of various MIL aggregation strategies, comparing them against simpler instance-based probability pooling in both in-distribution and out-of-distribution ultrasound imaging scenarios. Our findings reveal that attention-based aggregators yield stronger discrimination but frequently produce overconfident predictions, leading to higher calibration errors. In contrast, simpler instance-level methods offer more reliable risk estimates, albeit with a modest reduction in classification metrics. These results underscore a trade-off between predictive strength and calibration in MIL, emphasizing the importance of evaluating both aspects for clinically robust applications.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4101_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{GeyAxe_Calibration_MICCAI2025,
author = { Geysels, Axel and Van Calster, Ben and De Moor, Bart and Froyman, Wouter and Timmerman, Dirk},
title = { { Calibration in Multiple Instance Learning: Evaluating Aggregation Methods for Ultrasound-Based Diagnosis } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15974},
month = {September},
pages = {55--64}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper addresses both empirical analysis and methodological development, with a stronger emphasis on the latter. The authors begin by raising an important question regarding the calibration of bag-level predictions in MIL methods. They provide a valuable empirical evaluation of various MIL aggregation strategies (mean-pooling, max-pooling, gated attention, and their proposed variant) in the context of transvaginal ultrasound-based tumor diagnosis, highlighting that most attention-based MIL approaches are not well-calibrated at the bag level. To address this, they propose a modified method (MIL+GA+Uncertainty) that incorporates uncertainty into the aggregation process—offering a simple yet meaningful extension to existing approaches.
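For readers who want the aggregators discussed throughout these reviews pinned down, here is a minimal PyTorch sketch of gated-attention pooling in the style of Ilse et al. (2018), on which the MIL+GA variants build. Layer sizes, names, and the single-bag tensor shapes are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Gated-attention MIL pooling in the style of Ilse et al. (2018):
    each instance embedding h_k gets a learned weight a_k, and the bag
    embedding is the attention-weighted sum of instance embeddings."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.V = nn.Linear(dim, hidden)  # tanh branch
        self.U = nn.Linear(dim, hidden)  # sigmoid gating branch
        self.w = nn.Linear(hidden, 1)    # per-instance attention score

    def forward(self, h: torch.Tensor):
        # h: (num_instances, dim), one bag (patient) at a time
        scores = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))
        a = torch.softmax(scores, dim=0)   # attention weights, sum to 1
        bag = (a * h).sum(dim=0)           # (dim,) bag-level embedding
        return bag, a.squeeze(-1)
```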
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The paper presents an interesting and practically relevant motivation, raising a critical question about the calibration of MIL aggregation methods.
2) It offers a thoughtful empirical analysis of existing MIL aggregation strategies, supported by the use of multiple calibration metrics to provide a comprehensive evaluation.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Comparison with Related Work: The paper would benefit from a direct comparison between MIL+GA+Uncertainty and similar approaches, such as the method presented in [1]. This would help better situate the contributions of the proposed method within the field.
Calibration vs. AUC Improvement: While the method improves AUC, its impact on calibration scores appears limited. If the primary focus has shifted from calibration to overall performance, a comparison with other state-of-the-art MIL methods would provide a more comprehensive evaluation of its effectiveness. Such a comparison would help clarify whether the observed improvements in AUC are competitive with existing approaches in the field.
Lack of Instance-Level Attention Analysis: 1) In Figure 3, the distinction between positive and negative instances appears less clear in MIL+GA+Uncertainty than in the instance-based method (e.g., compare p in the instance-based method with a_i in MIL+GA+Uncertainty). Including a comparison with standard MIL attention or other variants could clarify how the proposed method modifies instance-level attention. 2) In applications such as tumor localization, instance-level attention maps are critical for accurate region detection. The current results raise a valid concern about whether the method’s modifications impact instance-level performance. To address this, additional analysis, such as instance-level AUC evaluation, could help ensure the method’s robustness in real-world scenarios. A brief discussion of this potential limitation would further strengthen the paper.
[1] https://doi.org/10.1007/978-3-030-68763-2_11
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Comments to Improve Presentation:
1) The font size in several figures is quite small, which makes them hard to read. Increasing the font size would improve readability.
2) Figure 1, in particular, is difficult to interpret in its current form. It might be better presented as a distribution plot instead of bar plots, or alternatively included in the supplementary materials.
3) In the Abstract, Future Steps, and Key Takeaways sections, it would be helpful to briefly mention and summarize the proposed method (MIL+GA+Uncertainty). If the method is not entirely novel, appropriate citations should be included.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I found the motivation and the central question of the paper very compelling. However, the proposed solution does not seem fully aligned with the original goal, which focuses on improving calibration. Additionally, I’m concerned that the method might negatively impact instance-level predictions, as there is no analysis provided to assess its effect at that level. Including such an evaluation would strengthen the overall contribution.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper analyzes the calibration aspect of Multiple Instance Learning (MIL). While the overall study is straightforward, it offers some valuable insights into MIL.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Through experiments on several common MIL strategies, the study reveals that there might be a trade-off between discrimination and calibration in MIL models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In the paragraph “Attention and Uncertainty-Based Aggregation,” the authors state: “we propose a weighted version of the…”. It should be clarified whether this weighting scheme is newly introduced in this paper or adapted from prior work. If it is based on previous studies, proper citations should be provided. If it is an original contribution, it would be helpful to discuss how it relates to or differs from similar designs in the literature.
- MIL models are often trained with a batch size of 1 (i.e., one bag per batch) to accommodate varying numbers of instances across bags. The authors mention implementing a custom batching procedure with batch size >1; however, key implementation details are missing.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper analyzes the calibration aspect of Multiple Instance Learning (MIL). While the overall study is straightforward, it offers some valuable insights into MIL.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper systematically evaluates the calibration of multiple instance learning (MIL) aggregation strategies for ultrasound-based diagnosis. While MIL is widely used for weakly supervised medical image classification, most prior work focuses only on discrimination (e.g., accuracy, AUC), neglecting calibration, i.e., the alignment between predicted probabilities and true outcome frequencies, which is critical for clinical decision-making. The authors compare attention-based, mean/max pooling, and uncertainty-weighted aggregation methods on a large, multi-center ovarian ultrasound dataset. Their key finding is that attention-based MIL achieves higher discrimination but suffers from overconfident (poorly calibrated) predictions, while simpler instance-level pooling yields better-calibrated risk estimates, albeit with slightly lower discrimination. This work highlights a fundamental trade-off between discrimination and calibration in MIL, urging the community to prioritize calibration for clinical deployment.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Rigorous Calibration Analysis: The study addresses a crucial but underexplored issue in MIL for medical imaging: probability calibration. Calibration is essential for risk-based clinical decisions.
Comprehensive Benchmarking: Multiple aggregation strategies (mean, max, gated attention, uncertainty-weighted) are compared using both in-distribution (ID) and out-of-distribution (OOD, leave-one-center-out) protocols, providing robust generalizability assessment.
Large, Diverse Dataset: The dataset includes 8,824 images from 1,457 patients across 11 centers in 6 countries, supporting the reliability and external validity of the findings.
Clinical Relevance: The paper demonstrates that attention-based MIL, despite interpretability and discriminative gains, can produce overconfident predictions, a real risk for clinical adoption.
Clear Recommendations: The results provide actionable guidance: calibration metrics should be central in MIL evaluation, not just discrimination metrics.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Limited Scope of Aggregation Methods: The study does not evaluate other calibration-improving aggregation techniques, such as temperature scaling (Guo et al., 2017) or deep ensemble-based calibration (Lakshminarayanan et al., 2017).
No Post-hoc Calibration: The paper omits post-hoc calibration methods (e.g., Platt scaling, temperature scaling) that are widely used to improve neural network calibration in medical imaging (Guo et al., 2017).
Single Clinical Task: Results are limited to ovarian tumor ultrasound. The generalizability to other modalities (e.g., histopathology, mammography) or tasks is not assessed, despite MIL being broadly used in these contexts.
Interpretability Not Quantitatively Evaluated: While attention weights are shown for interpretability, there is no quantitative analysis of how well these correspond to clinically meaningful regions, as recommended in recent interpretability studies.
No Analysis of Calibration Impact on Clinical Utility: The study does not assess how calibration differences affect downstream clinical decisions or patient outcomes, a key aspect of model utility (Assel et al., 2017).
Ref:
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On Calibration of Modern Neural Networks. In: Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pp. 1321–1330 (2017)
Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In: Advances in Neural Information Processing Systems 30 (2017)
Assel, M., Sjoberg, D.D., Vickers, A.J.: The Brier Score Does Not Evaluate the Clinical Utility of Diagnostic Tests or Prediction Models. Diagnostic and Prognostic Research 1, 19 (2017). https://doi.org/10.1186/s41512-017-0020-3
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper addresses a critical and underexplored issue in MIL for medical imaging—calibration of predicted probabilities. The experimental design is robust, and the findings are clinically meaningful. Adding a comparison to state-of-the-art post-hoc calibration methods and assessing generalizability beyond a single task could modestly increase the impact.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for recognizing the importance of calibration in MIL and for the constructive suggestions. Below we clarify the main points raised.
UNCERTAINTY-WEIGHTED ATTENTION We introduced MIL+GA+Uncertainty as an additional test case in our study, not as a universal solution to calibration problems. In this variant, for each image k, the gated-attention weight α_k is scaled by (1+E_k)^{-1} and then re-normalized, yielding uncertainty-aware soft pooling in a single deterministic pass. This inverse-entropy idea is novel to this paper; the most similar method mentioned by Reviewer #3, “Certainty-Pooling” (Gildenblat et al., ICPR 2021), differs by selecting a single instance via a hard arg-max and relying on Monte-Carlo dropout.
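A minimal sketch of the re-weighting step described above, assuming E_k is the binary entropy of each instance's predicted probability; the rebuttal does not specify exactly how E_k is computed, so this choice and the function name are illustrative:

```python
import torch

def uncertainty_weighted_attention(alpha: torch.Tensor,
                                   instance_probs: torch.Tensor,
                                   eps: float = 1e-8) -> torch.Tensor:
    """Scale each gated-attention weight alpha_k by (1 + E_k)^(-1) and
    re-normalize, as described in the rebuttal. E_k is assumed here to
    be the binary entropy of the instance-level probability p_k."""
    p = instance_probs.clamp(eps, 1 - eps)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())  # E_k in [0, log 2]
    reweighted = alpha / (1.0 + entropy)                # down-weight uncertain instances
    return reweighted / reweighted.sum()                # weights sum to 1 again
```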
MULTI-BAG BATCHING Each mini-batch contains several complete patients (up to 64 images total). For each bag (patient) we compute the bag-level cross-entropy loss and average these losses across the batch. This preserves the standard MIL objective—minimizing the per-bag loss—while using the GPU more efficiently than processing one bag at a time.
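A minimal sketch of this objective, assuming a hypothetical `model` that maps one variable-size bag to a single bag-level logit; the binary loss and names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def multi_bag_loss(model, bags, labels):
    """Average the bag-level cross-entropy over several complete bags
    (patients) in one mini-batch, as described in the rebuttal.

    bags:   list of tensors; bag i has shape (K_i, dim), K_i may vary
    labels: tensor of shape (num_bags,) with binary targets
    """
    losses = []
    for bag, y in zip(bags, labels):
        logit = model(bag)  # one bag-level logit per patient
        losses.append(F.binary_cross_entropy_with_logits(logit.squeeze(), y.float()))
    return torch.stack(losses).mean()  # standard per-bag MIL objective, batched
```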
FOCUS ON INTRINSIC CALIBRATION Calibration is measured directly on the raw, unscaled outputs. By examining these “native” probability estimates we isolate the influence of the pooling mechanism itself. Post-hoc calibrators such as temperature or Platt scaling could be applied to any method but would obscure the differences we aim to study here.
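For contrast, a minimal sketch of the kind of post-hoc calibrator the authors deliberately did not apply: temperature scaling (Guo et al., 2017) fit on held-out validation logits. The binary formulation and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor,
                    steps: int = 200, lr: float = 0.01) -> float:
    """Fit a single scalar temperature T > 0 minimizing the negative
    log-likelihood of val_logits / T on a validation set.
    At test time, predictions become sigmoid(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(val_logits / log_t.exp(),
                                                  val_labels.float())
        loss.backward()
        opt.step()
    return float(log_t.exp())
```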
INSTANCE-LEVEL ANALYSIS This study targets patient-level diagnosis, not lesion localization. Each patient has only a single histological label (benign vs. malignant); the images selected by the expert gynecologist as representative of the tumor were not individually labelled. Without such image-level ground truth we cannot compute the instance-level AUC as suggested by Reviewer #3. We do agree that a comparison of the attention weights produced by MIL+GA vs. MIL+GA+Uncertainty would offer additional insights into the behavior of both aggregation methods.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
All reviewers agree that exploring the trade-off between calibration and discrimination in the context of MIL is a worthwhile topic that has been overlooked in the past. There remain some low-level technical doubts (R2: batch size; R3: deeper discussion of instance-level analysis), and I would be grateful if the authors could clarify these in the final version of their work, but they do not seem serious enough to block acceptance.