Abstract

Multiple Instance Learning (MIL) effectively analyzes whole slide images but is prone to overfitting due to attention over-concentration. While existing solutions rely on complex architectural modifications or additional processing steps, we introduce Attention Entropy Maximization (AEM), a simple yet effective regularization technique. Our investigation reveals a positive correlation between attention entropy and model performance. Building on this insight, we integrate AEM regularization into the MIL framework to penalize excessive attention concentration. To address sensitivity to the AEM weight parameter, we implement Cosine Weight Annealing, reducing parameter dependency. Extensive evaluations demonstrate AEM’s superior performance across diverse feature extractors, MIL frameworks, attention mechanisms, and augmentation techniques. Code is available at: https://github.com/dazhangyu123/AEM.
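
For readers who want the mechanism at a glance, here is a minimal PyTorch sketch of the AEM objective. This is our reading of the abstract, not code from the repository: the function names, the epsilon term, and the exact annealing schedule (a cosine decay of the weight from λ to 0, matching Review #2's description that the constraint loosens as training continues) are illustrative assumptions.

    import math

    import torch
    import torch.nn.functional as F

    def aem_loss(attn_logits: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw attention logits into a probability
        # distribution over the N instances (patches) of one slide.
        attn = F.softmax(attn_logits, dim=-1)
        # Shannon entropy of that distribution; the epsilon guards log(0).
        entropy = -(attn * torch.log(attn + 1e-8)).sum(dim=-1)
        # Return negative entropy: minimizing this term maximizes
        # attention entropy, penalizing over-concentration.
        return -entropy.mean()

    def cosine_annealed_weight(lam: float, step: int, total_steps: int) -> float:
        # Cosine Weight Annealing (illustrative schedule): decay the
        # AEM weight from lam to 0 over the course of training.
        return lam * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

    # Sketch of one training step:
    #   bag_logits, attn_logits = model(bag_features)
    #   w = cosine_annealed_weight(0.1, step, total_steps)
    #   loss = F.cross_entropy(bag_logits, label) + w * aem_loss(attn_logits)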

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5183_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/dazhangyu123/AEM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhaYun_AEM_MICCAI2025,
        author = { Zhang, Yunlong and Li, Honglin and Sun, Yuxuan and Shui, Zhongyi and Li, Jingxiong and Zhu, Chenglu and Yang, Lin},
        title = { { AEM: Attention Entropy Maximization for Multiple Instance Learning based Whole Slide Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a regularization technique into the MIL framework to address the overfitting challenge. The authors evaluate the proposed strategy across multiple feature extractors, MIL frameworks, attention mechanisms, and augmentation techniques.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The writing and presentation of this paper are good. The analysis of results is good. The design of Attention Entropy Maximization is reasonable.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. The paper claims to solve the overfitting problem; however, the approach appears to be a combination of existing techniques with additional simple strategies. If overfitting is considered a major challenge in previous research, the paper should first discuss in detail how overfitting affects previous models, then provide evidence that the proposed approach can effectively mitigate this problem. Currently, the paper only presents the problem, the solution, and the resulting F1-score improvement, without further discussion or verification of the overfitting problem.
    2. The experiments employ different feature extractors to show that the proposed strategy works across various settings. However, the comparison is inconsistent: while the paper compares the proposed method with ABMIL when using UNI and CONCH, the other models listed in Table 2 (e.g., GIGAPATH, UNI, and CONCH) are not compared using these diverse feature extractors.
    3. Since the proposed method is simple, more extensive experiments on a broader range of datasets or tasks would be beneficial to convincingly demonstrate its generalizability and robustness.
    4. The simplicity of the proposed method requires a stronger and more detailed motivation. The paper should provide more empirical or observational evidence to more convincingly demonstrate the necessity of the method and how it addresses the shortcomings found in previous research.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See weakness.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The manuscript presents a regularization loss on the attention values of the attention-based multiple instance learning framework for WSI classification. This regularization prevents spiky attention distributions at the beginning of training, loosening this constraint as training continues, and is shown to lead to less overfitting.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper presents a simple regularization and loss that improves training dynamics of the widely used ABMIL WSI classification architecture
    • This addition is evaluated across 3 datasets, against many existing baselines, and with several encoders.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The private dataset is not clearly described
    • The hyperparameter tuning and train-val-test setups are not clearly described
    • The presented research appears to contain too many details and experiments to be accurately presented in such a short conference format.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Fig 2 and Fig 5b appear to state that a more uniform attention distribution at inference time is preferred for high performance. The rest of the paper appears to focus on training dynamics; constraining the attention at the start of training. Do the authors believe a more uniform distribution is always better? (If so, MeanMIL would be superior.) E.g., for presence-of-lesion classification, the reviewer could imagine that spiky attention on only patches containing tumor would lead to higher performance than a more uniform distribution, similar to many other tasks. Or is it important to improve training dynamics, independent of the resulting attention distribution?
    • In Table 2, do ABMIL and AEM (ABMIL+AEM) have the exact same model architecture?
    • In Table 2, what hyperparameter tuning has been performed for all these models?
    • Could the authors clarify the LBC dataset (tumor entity, task). The reviewer finds “WSIs with four cytological categories” unclear.
    • Why are the tested lambda values in the ablation study different? (C16 uses {0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05}, while C17 and LBC use {0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.)
    • Generally, how have all hyperparameters been selected? Since the regularization is mostly about training stability, have other training hyperparameters been explored to investigate how the effect of the regularization depends on them?
    • Is Fig 7a based on the complete loss, or only the classification loss? Since ABMIL+AEM has a different loss function, this may not be a fair comparison; e.g., L_{ce} could go up while L_{aem} goes down.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Attention regularization appears to be a useful addition to ABMIL, independent of the used encoder. However, many training details of the models, and a clear description of one of the datasets, are not presented. Since the method is intended to improve training dynamics, presentation of the training and optimization setup is vital, and should preferably be experimented with. The reviewer questions whether the number of experiments and the details required for proper interpretation can be fairly presented in a short conference paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The method is reasonably motivated, and the experimental results are extensive for a conference-format paper. Although the number of experiments limits the space available for motivation and for detailed analyses of the hyperparameters and training dynamics, the reviewer thinks the insights and ideas are generally interesting for the MICCAI community. These ideas may motivate the community to focus on the training dynamics of MIL besides metric optimization, which would be a highly valuable addition to the field of computational pathology. The reviewer believes that further analysis would be valuable in a future journal publication with fewer constraints, and should focus primarily on training dynamics.



Review #3

  • Please describe the contribution of the paper

    The paper introduces Attention Entropy Maximization (AEM), a simple regularization method that mitigates over-concentrated attention in MIL by promoting higher attention entropy, leading to improved performance across various MIL settings.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper addresses the overfitting issue in attention-based MIL by proposing a simple and effective AEM regularization method, which can be easily integrated into most attention-based MIL frameworks.
    2. The method is grounded in an interesting observation—a positive correlation between AUROC performance and attention entropy—which forms the basis for the proposed regularization term.
    3. The effectiveness of AEM is validated across various foundation tile encoders and different MIL frameworks, demonstrating its generalizability.
    4. Extensive ablation studies provide strong evidence of AEM’s ability to mitigate overfitting and reduce attention over-concentration.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The AEM regularization is applied to the softmax-normalized attention scores based on empirical observation. However, it remains unclear whether using the original attention logits, which maintain consistency across different WSIs, could yield similar or even better results. The paper does not explore or discuss this alternative, which could be important for understanding the method’s full potential.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technical contribution and sufficient experiments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Technical Contribution for MIL in Pathology




Author Feedback

We thank all reviewers and ACs for their valuable feedback. We appreciate recognition of our simple yet effective regularization method (R1, R2, R3), strong validation across datasets/encoders (R1, R2), solid theoretical grounding (R1), and clear presentation (R3). Due to space limitations, we address major concerns below, with all issues to be addressed in our revision.

To meta-review and R3 [Q3: Broader experiments for generalizability]: We appreciate this suggestion. Following standard MIL literature practices on subtyping tasks, we’ve gone beyond typical studies (which use 1-2 feature extractors) by demonstrating consistent improvements using five different extractors—providing stronger evidence of generalizability. MICCAI’s space constraints and rebuttal guidelines prevent including additional experiments now, but we commit to extending AEM to survival and biomarker prediction tasks in future work.

To meta-review [Q: Unique value for computational pathology]: AEM’s unique value lies in its simplicity and versatility—it requires no additional modules or processing steps, enabling seamless integration with existing MIL frameworks while maintaining computational efficiency. Our experiments demonstrate AEM’s effectiveness when combined with subsampling augmentation, advanced frameworks (DTFD-MIL, ACMIL), and various attention mechanisms, addressing a critical overfitting issue in pathology MIL applications.

To R1 [Q1: Using original attention logits vs. softmax-normalized scores]: Thank you for this insightful question. We applied AEM to softmax-normalized attention scores because entropy is mathematically defined for probability distributions (non-negative values summing to 1). Original attention logits can be negative and don’t satisfy these constraints, making entropy calculation mathematically invalid. We will clarify this mathematical necessity and discuss the trade-off between normalization and consistency across WSIs in our revised paper.
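
For concreteness, the entropy term in question (notation ours, not from the paper) is

    H(a) = -\sum_{i=1}^{N} a_i \log a_i,  with  a = \mathrm{softmax}(z),

which requires a_i >= 0 and \sum_i a_i = 1; raw logits z can violate both conditions, so the entropy is not defined on them.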

To R2 [Q1,Q2,Q3: Dataset clarity, Hyperparameter tuning details, Train-val-test protocol]: We appreciate these insightful comments. While extensive experiments were conducted to demonstrate AEM’s generalizability, we recognize the need for clearer methodological details within conference constraints. We will enhance our manuscript by: 1) Describing the LBC dataset (1,989 WSIs of cervical cancer across four cytological categories: Negative, ASC-US, LSIL, ASC-H/HSIL). 2) Clarifying that hyperparameter selection was based on validation performance optimization, with default λ values of 0.001, 0.1, and 0.2 for C16, C17, and LBC respectively. 3) Detailing dataset splits: C16 (270 training WSIs from hospital 1 split 9:1 for train/val, 130 testing WSIs from hospital 2); C17 (500 training WSIs with 300 from hospitals 1-3 split 9:1 for train/val, remaining 200 from hospitals 4-5 as test set); LBC (split 6:2:2 for train/val/test). These clarifications will significantly improve reproducibility.
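
For quick reference, the defaults and splits stated above, gathered into one structure (the layout and key names are ours, not taken from the repository):

    # Defaults and splits as reported in the rebuttal; structure is illustrative.
    DATASET_SETUP = {
        "C16": {
            "default_lambda": 0.001,
            "train_val": "270 WSIs from hospital 1, split 9:1",
            "test": "130 WSIs from hospital 2",
        },
        "C17": {
            "default_lambda": 0.1,
            "train_val": "300 of 500 WSIs (hospitals 1-3), split 9:1",
            "test": "200 WSIs (hospitals 4-5)",
        },
        "LBC": {
            "default_lambda": 0.2,
            "train_val_test": "1,989 WSIs, split 6:2:2",
            "classes": ["Negative", "ASC-US", "LSIL", "ASC-H/HSIL"],
        },
    }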

To R3 [Q1,Q4: Stronger motivation and Effectiveness evidence]: Thank you for this insight. We have established the connection between overfitting and low attention entropy through their correlation in Fig. 2. Based on this observation, AEM specifically targets overfitting by regulating attention entropy. Fig. 7 provides empirical verification by showing AEM’s impact on test performance curves, demonstrating reduced overfitting compared to baselines. We will strengthen this narrative in our revision to better highlight this mechanism-based approach to addressing overfitting.

To R3 [Q2: Inconsistent feature extractor comparisons across models]: We appreciate this observation. Due to MICCAI’s strict 8-page limit with prohibitions on supplementary materials for additional results or analyses, we prioritized demonstrating generalizability across feature extractors for our baseline comparison. We’ll clarify this constraint and provide comprehensive comparisons in our journal extension and code release.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The authors need to pay more attention to the concerns raised by Reviewer #3. More importantly, the paper requires stronger justification of its research focus on WSI analysis specifically: (1) What makes this approach uniquely valuable for computational pathology compared to existing methods? (2) Why wasn’t the proposed method validated on more general natural image tasks to demonstrate its broader applicability?

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a simple yet effective entropy-based regularization to mitigate over-concentration in attention-based MIL models. The method is evaluated across datasets, encoders, and MIL variants, and is backed by extensive empirical evidence and ablations. While some reviewers found the contribution impactful and practical, others noted limited novelty, incomplete theoretical analysis, and insufficient clarity on overfitting mitigation and generalization.

    Given the method’s simplicity and the remaining concerns about experimental completeness and justification, I recommend acceptance, with the caveat that the final version should better articulate the novelty, overfitting analysis, and broader applicability.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


