Abstract
Digital pathology has revolutionized the field by enabling the digitization of tissue samples into whole slide images (WSIs). However, the high resolution and large size of WSIs present significant challenges for applying Deep Learning models. As a solution, WSIs are often divided into smaller patches with a global label (i.e., diagnosis) per slide, instead of a (too) costly pixel-wise annotation. By treating each slide as a bag of patches, Multiple Instance Learning (MIL) methods have emerged as a suitable solution for WSI classification. A major drawback of MIL methods is their high variability in performance across different runs, which can reach up to 10-15 AUC points on the test set, making it difficult to compare different MIL methods reliably. This variability mainly comes from three factors: i) weight initialization, ii) batch (shuffling) ordering, and iii) learning rate. To address this, we introduce a Multi-Fidelity, Model Fusion strategy for MIL methods. We first train multiple models for a few epochs and average the most stable and promising ones based on validation scores. This approach can be applied to any existing MIL model to reduce performance variability. It also simplifies hyperparameter tuning and improves reproducibility while maintaining computational efficiency. We extensively validate our approach on WSI classification tasks using 2 different datasets, 3 initialization strategies and 5 MIL methods, for a total of more than 2000 experiments.
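To make the strategy described in the abstract concrete, here is a minimal sketch of multi-fidelity model fusion: train several candidate models for a few warm-up epochs, keep the most promising ones according to validation AUC, average their weights, and continue training the fused model. This is an illustrative sketch only, not the authors' code; the helpers make_model, train_epochs and validate_auc are hypothetical stand-ins, uniform averaging is used as the merging rule, and the actual implementation is in the code repository linked below.

```python
# Illustrative sketch only, not the authors' implementation (see the linked repository).
# make_model, train_epochs and validate_auc are hypothetical stand-ins for any MIL
# model, its training loop and its validation metric.
import copy
import torch

def multi_fidelity_fusion(make_model, train_epochs, validate_auc,
                          seeds, warmup_epochs=5, top_t=3, total_epochs=100):
    # Low-fidelity stage: train one candidate per seed for a few epochs only.
    candidates = []
    for seed in seeds:
        torch.manual_seed(seed)
        model = make_model()
        train_epochs(model, num_epochs=warmup_epochs)
        candidates.append((validate_auc(model), model))

    # Keep the T most promising checkpoints according to validation AUC.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    top_models = [model for _, model in candidates[:top_t]]

    # Fuse them by uniform weight averaging (Model Soup style); TIES-Merging
    # could be substituted here as an alternative merging rule.
    fused_state = copy.deepcopy(top_models[0].state_dict())
    for key in fused_state:
        stacked = torch.stack([m.state_dict()[key].float() for m in top_models])
        fused_state[key] = stacked.mean(dim=0).to(fused_state[key].dtype)

    fused_model = make_model()
    fused_model.load_state_dict(fused_state)

    # High-fidelity stage: continue training the single fused model to completion.
    train_epochs(fused_model, num_epochs=total_epochs - warmup_epochs)
    return fused_model
```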
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4644_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/mammadov7/mil_merging
Link to the Dataset(s)
Camelyon16 dataset: https://camelyon16.grand-challenge.org/
BRACS dataset: https://www.bracs.icar.cnr.it/
BibTex
@InProceedings{MamAli_Reducing_MICCAI2025,
author = { Mammadov, Ali and Le Folgoc, Loïc and Hocquet, Guillaume and Gori, Pietro},
title = { { Reducing Variability of Multiple Instance Learning Methods for Digital Pathology } },
booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper addresses the issue of high variability in results for multiple instance learning. The authors investigate this challenge in the context of pathology datasets and whole-slide imaging, proposing the use of model merging techniques as a means to reduce performance variability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The problem of multiple instance learning is indeed relevant and high-impact, particularly in biomedical image analysis, such as whole-slide image analysis in pathology. The MIL formulation and its working mechanism are clearly explained in the paper.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Motivation: While the multiple instance learning (MIL) problem is indeed important, the motivation for applying model merging techniques in this context is not entirely clear to me. The main problem the authors try to tackle, namely the variability among MIL models, appears somewhat self-imposed. Model merging approaches are typically intended to enhance performance or facilitate multi-task learning. Could the authors clarify why such techniques are appropriate for reducing variability? And what is the intuition or theoretical justification for why model averaging would be effective in this case?
In the conclusions, the authors mention “MIL methods suffer from high variability in performance across different runs, which can hamper reproducibility and trustworthiness when comparing different methods.” However, if the primary concern is result reliability, practitioners can already account for performance variability when evaluating methods, for example by reporting confidence intervals, running multiple seeds, and considering other practical factors such as computational cost. Do we truly need new techniques that are solely focused on reducing performance variability?
Experimental setup: Why was Soup used for MaxMIL and ABMIL, and Ties for DSMIL, CLAM, and TransMIL? Could the authors clarify the rationale for assigning a specific merging strategy to each MIL method? Was this based on some empirical observation, or was it simply an arbitrary choice?
In the Methods overview, it is stated that “the learning rate can be either randomly chosen or tuned using the validation set.” What exactly does “randomly choosing” the learning rate mean? The learning rate is typically tuned for each task, and an improper learning rate has a significant influence on performance variability itself. What is the purpose of having your “Baseline” method if no learning rate tuning is performed?
Results: In most cases in Table 1, the Soup3 and Ties3 models do not show large differences in standard deviation compared to the Best on Val models. While they tend to perform better, which aligns with their intended purpose of enhancing performance, this makes the paper's primary contribution regarding variability reduction somewhat unclear to me.
Regarding the “LR tuned” setting, why are all models trained for 100 epochs? In practice, learning rate tuning can typically be done using only a few initial epochs, as is done for Soup3 and Ties3. Similarly, for Best on Val, why train all 10 models for the full 100 epochs? It seems inefficient to fully train configurations that clearly perform poorly early on.
The tables are somewhat difficult to interpret due to the volume of values reported (e.g., min, max, etc.). For the goal of illustrating variability, showing just the mean and standard deviation should be sufficient and would improve readability.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I have many concerns about the current state of the manuscript, and I do not believe it is in a good shape for submission. Please see Weaknesses.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The authors’ explanation regarding the learning rate for their baseline model is unclear and technically questionable. In the author response, the authors claim the learning rate was set “without any additional tuning on the validation set, and selected solely based on prior knowledge or existing literature (Baseline)”, leaving me quite concerned about the correctness of their evaluations.
Review #2
- Please describe the contribution of the paper
The authors propose a Multi-Fidelity, Model-Fusion strategy to improve the reliability of MIL method assessment, which is challenging due to the high variability of final checkpoint performance with respect to random initialization. The approach involves training M models for a few epochs and then averaging the T most effective checkpoints before training further. Experiments provided by the authors show that the Multi-Fidelity, Model-Fusion strategy achieves better performance while being more efficient than training M models for the full number of epochs and selecting the best one on the validation set.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The topic is highly relevant to medical image analysis tasks that involve whole-slide images.
- The proposed method, while intuitive and simple, is grounded by findings in Model Soup [24] and TIES-Merging [25], which provide the foundation for this approach.
- There are multiple runs for every experiment, 2 relevant datasets and 5 MIL approaches in the comparison, which makes the validation convincing. I also find most of the selected baselines to be reasonable.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is strange that the authors treat the learning rate as a source of instability and randomness, similarly to the random seed. Unlike initialization seeds, the learning rate is not selected randomly, and there are usually reasons for selecting a particular learning rate and scheduler. If the learning rate is not chosen appropriately, repeating the training with different initialization seeds will not help. I suspect the problem of selecting a proper base learning rate might be partially alleviated by the Adam optimizer and cosine annealing scheduler used in the experiments, but I would argue against the claim that “The learning rate can be either randomly chosen”. The authors justify neither that claim nor why the learning rate should be treated differently from other hyper-parameters, such as batch size, number of epochs or weight decay.
- Regarding the above point, I wonder what the strategy was for selecting the base learning rate in the different experiments?
- Figure 3 of the ablation study reports very different numbers compared to Table 1 or Fig. 1. For example, in Figure 3, MaxMIL + Soup3 never exceeds 75 AUC on BRACS, while Table 1 shows 85 AUC on average.
- Table 2 shows huge differences between the Test and Val evaluations. What dataset is used for this experiment, and how can this difference be explained?
- The paper would benefit from a deeper theoretical analysis (see my suggestion below), but I do not consider this to be a major concern.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The work would benefit from a deeper analysis of the proposed method. Comparing with (Izmailov et al., 2018) or follow-up works could probably help.
Reference: Izmailov et al., “Averaging Weights Leads to Wider Optima and Better Generalization,” 2018.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors proposed simple yet effective method to improve stability of training by merging several best checkpoints after training for several epochs. The validation convincingly shows the method reduces computational burden for assessing MIL methods compared to baselines. There are concerns about experimental details that can be resolved during rebuttal.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
For me, this is a borderline decision due to concerns about the wording choices made by the authors. Nonetheless, I recommend acceptance, as from my perspective these issues do not undermine the main results reported in Tab.1, which demonstrate that the method indeed reduces variability. I think the authors can address wording issues when preparing the final version of the paper. More specifically:
- The phrase “selected solely based on prior knowledge or existing literature (Baseline)” leaves room for interpretation. Was the same baseline learning rate chosen for the proposed method and the baseline in Tab. 1? Does “prior knowledge” refer to the authors’ own previous experiments? If so, how reliable were those experiments? Are the accuracies reported in Table 1 consistent with those found in other works using similar methods? Clarifying these points would significantly reduce concerns about the hyperparameter choices and the overall relevance of the work.
- According to the authors’ feedback, the “ablation” study in Fig. 3 was not performed around the final setup, as would be expected. Moreover, the varied parameter (K) has a significant impact on the result. This raises questions about the reliability of the interpretation of Fig. 3 used in the feedback. Additionally, the way the term “ablation” is used in the main text is both imprecise and misleading.
- I would also expect the authors to cite a paper that supports the claim that the gap between validation and test performance is a common challenge in MIL. This statement is not self-evident and should be properly substantiated.
Despite these concerns, I find the proposed method to be simple and potentially valuable for researchers in this area. The main results, presented in Table 1, support the authors’ claims that the method reduces performance variability across runs while preserving top performance and maintaining a reasonable computational cost.
Review #3
- Please describe the contribution of the paper
The authors study the variability in training of multiple instance learning methods due to random initialization, random training instance order, and training settings. The proposed method trains multiple models for a few epochs, selects the best ones based on the validation set, averages the weights, and continues training the averaged weights. The results show that this reduces the variability of the final performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper combines simple existing methods to achieve a lower variance. The experiments and results are comprehensive. The approach has large potential since many people are using MIL approaches to train models for histopathology.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
One of the hyperparameters, the number of epochs until the models are compared based on validation set performance, was set to 5 but not discussed or studied in more detail. Can the authors explain how the value of 5 was selected and how strong the effect of this hyperparameter is on the performance of the approach?
The dataset size could also play a role in the variability of the final models. Can the authors comment on this?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The experiments and reported results are comprehensive. The approach is simple and therefore widely applicable, albeit not very novel technically. There are some points on which the authors could elaborate in more detail.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors appropriately answered the questions I raised in the review. My prior recommendation was weak accept, and I now recommend to accept the paper.
Author Feedback
We thank the reviewers for their thoughtful comments and address each concern below:
[R1,R2] Clarification about the learning rate (LR). We apologize; the phrase “randomly chosen” was misleading. We will replace it with: “without any additional tuning on the validation set, and selected solely based on prior knowledge or existing literature (Baseline)”. Moreover, we observed that performance can vary significantly depending on the choice of the LR, even when using the Adam optimizer and tuning it on the validation set (see “LR tuned” in Tab. 1). In contrast, the other MIL hyperparameters (batch size, number of training epochs, and weight decay) are either chosen with a precise rule or do not lead to such high performance variability. Specifically, the batch size is typically constrained to one slide due to the variable number of patches per slide. The number of epochs is fixed at 100 for all methods to ensure a fair comparison. For weight decay, we found its influence to be minor compared to the LR. We will clarify these points in the final manuscript.
[R1] Different results between Tab. 1, Fig. 1 and Fig. 3. In the ablation study (Fig. 3), we evaluate the impact of the number of selected models T independently of the aggregation epoch K. We thus set the aggregation epoch to K=100. In Tab. 1 and Fig. 1, aggregation is done at K=5 instead, hence the different results.
[R1] Difference between test and val performance in Tab. 2? Tab. 2 is for Camelyon16. The gap between val and test is a common challenge in MIL. It also motivates our search for more effective methods than Best on VAL.
[R1] Theoretical comparison with SWA As already shown in Soup [24] (Fig. L.1, L.2), SWA is complementary to our approach. It can be used together with SOUP to improve performance. However, it would introduce additional computational overhead. We will clarify it in the Conclusion.
[R2] Assessing vs. reducing performance variability. Confidence intervals quantify uncertainty but do not reduce it. MIL suffers from high variance (Fig. 1). Moreover, the validation AUC does not reliably reflect the test AUC in MIL (Tab. 2). Previous studies have empirically shown that averaging (or ensembling) the weights of independent runs trained with different hyperparameters improves classification performance [24]. Here, we confirm that this also holds for MIL, and further show that it reduces variability in a computationally efficient way.
[R2] Choice of methods in Fig. 1? Due to space constraints, we chose to show results for both methods, Soup3 and Ties3, whose performances are very similar. For clarity, we will only show results for Ties3 in the final version.
[R2] Comparison between Soup3 and Ties3 w.r.t. Best on VAL. The performance variability of our methods and of Best on VAL can indeed be similar (even if in 5/10 experiments it is actually lower; see Tab. 1). However, Best on VAL comes with a high computational cost (x6) and, as noted by R3, Soup3 and Ties3 tend to perform better.
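As an aside for readers unfamiliar with the second merging rule compared here, below is a simplified sketch of TIES-Merging [25] (trim, elect sign, disjoint mean), which the Ties3 variant presumably applies to the selected checkpoints in place of uniform averaging. Using the common initialization as the reference state and a 20% keep ratio are illustrative assumptions, not details taken from the paper or its repository.

```python
# Simplified TIES-Merging sketch [25]; assumptions (not from the paper): checkpoints
# share a common reference state (e.g. their common initialization) and 20% of each
# delta's entries are kept after trimming.
import torch

def ties_merge(reference, checkpoints, keep_ratio=0.2):
    """reference and each element of checkpoints are state dicts (name -> tensor)."""
    merged = {}
    for name, ref in reference.items():
        if not torch.is_floating_point(ref):
            merged[name] = ref.clone()  # leave integer buffers (e.g. counters) untouched
            continue
        # Task vectors: difference of each checkpoint from the reference, flattened.
        deltas = torch.stack([ckpt[name] - ref for ckpt in checkpoints])
        deltas = deltas.reshape(len(checkpoints), -1)
        # 1) Trim: keep only the largest-magnitude entries of each delta.
        n = deltas.shape[1]
        k = max(1, int(keep_ratio * n))
        thresh = deltas.abs().kthvalue(n - k + 1, dim=1).values
        trimmed = torch.where(deltas.abs() >= thresh.unsqueeze(1), deltas,
                              torch.zeros_like(deltas))
        # 2) Elect sign: the sign with the larger total mass per parameter entry.
        elected = torch.sign(trimmed.sum(dim=0))
        # 3) Disjoint mean: average only the entries agreeing with the elected sign.
        agree = (torch.sign(trimmed) == elected.unsqueeze(0)) & (trimmed != 0)
        summed = torch.where(agree, trimmed, torch.zeros_like(trimmed)).sum(dim=0)
        counts = agree.sum(dim=0).clamp(min=1)
        merged[name] = ref + (summed / counts).view_as(ref)
    return merged
```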
[R2] LR-Tuned and Best on VAL trained for 100 epochs. We tried using fewer epochs (3 or 5) and even an early-stopping rule. This decreased the computational cost but increased variability and degraded performance. By fully training for 100 epochs, we aimed to establish a fair comparison showing the best possible results (in terms of variability and performance) for the proposed baselines. We will clarify this in the final version.
[R2] Tables interpretation We understand the reviewer’s concern, but min and max provide a more thorough summary of variability. In particular, the proposed methods consistently improve the minimum score and reduce the range.
[R3] Why K = 5? This is based on the ablation (Tab. 2), which gives the same variance reduction and mean AUC as K = 10 but at half the initial training cost. Using K = 3 degrades the performance.
[R3] Dataset size We experimented with a small dataset (N=223) and a larger one (N=395). We observed comparable variability, indicating our method’s effectiveness across this range.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper received an initial review of 2 Weak Accepts (R1, R3) and 1 Reject (R2). After the rebuttal, R1 and R3 changed to Accept while R2 remained Reject. The main concerns include: 1) R2’s questions about the learning rate selection methodology, though R1 noted that the reported values appear reasonable compared to the literature and the “LR tuned” experiments show adequate hyperparameter choices, 2) minor technical inconsistencies between the ablation study parameters (K=100) and the main results (K=5), which the authors clarified were intentional for independent evaluation, 3) R2’s concerns about motivation, though R1 and R3 recognized the practical value for the MIL community. While R2 maintained skepticism about the experimental methodology, R1 and R3 found the approach simple, effective, and valuable for reducing MIL training variability at a reasonable computational cost. The comprehensive experiments across multiple datasets and MIL methods, along with the provided code for reproducibility, support the work’s practical contributions. I suggest a recommendation of Accept, as the majority of reviewers recognized the method’s utility despite R2’s methodological concerns, and the authors adequately addressed most technical questions in their rebuttal.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The paper proposes a simple model fusion strategy to reduce performance variability in MIL. While the idea is practical, it lacks theoretical depth, clear motivation, and rigorous experimental design. Key concerns—such as learning rate handling, inconsistent variability reporting, and arbitrary design choices—remain unresolved. I recommend rejection.