Abstract

Multiple instance learning (MIL) has proven effective in classifying whole slide images (WSIs), owing to its weakly supervised learning framework. However, existing MIL methods still face challenges, particularly over-fitting due to small sample sizes or limited WSIs (bags). Pseudo-bags enhance MIL’s classification performance by increasing the number of training bags. However, these methods struggle with noisy labels, as positive patches often occupy small portions of tissue, and pseudo-bags are typically generated by random splitting. Additionally, they face difficulties with non-discriminative instance embeddings due to the lack of domain-specific feature extractors. To address these limitations, we propose Phenotype Clustering Reinforced Multiple Instance Learning (PCR-MIL), a novel MIL framework that integrates clustering-based pseudo-bags to improve MIL’s noise robustness and the discriminative power of instance embeddings. PCR-MIL introduces two key innovations: (i) Phenotype Clustering-based Feature Selection (PCFS) selects relevant instance embeddings for prediction. It clusters instances into phenotype-specific groups, assigns positive instances to each pseudo-bag, and then uses Grad-CAM to select the most relevant positive embeddings. This approach mitigates noisy label challenges and enhances MIL’s robustness to noise; (ii) Reinforced Feature Extractor (RFE) uses reinforcement learning to train an extractor based on selected clean pseudo-bags instead of noisy samples. This approach improves the discriminative power of extracted instance embeddings and enhances the feature representation capabilities of MIL. Experimental results on the publicly available BRACS and CRC-DX datasets demonstrate that PCR-MIL outperforms state-of-the-art methods. The code is available at: https://github.com/JingjiaoLou/PCR-MIL.
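The clustering-then-redistribution idea behind PCFS can be illustrated with a minimal numpy sketch. This is an illustration of the concept described in the abstract, not the released PCR-MIL code: the function names, the cluster counts, and the round-robin assignment are all assumptions.

```python
import numpy as np

def _kmeans_labels(X, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm) returning a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return labels

def make_pseudo_bags(embeddings, n_pseudo_bags=4, n_phenotypes=8, seed=0):
    """Split one bag's instance embeddings (N x D) into pseudo-bags of indices.

    Each phenotype cluster is spread across all pseudo-bags, so every
    pseudo-bag inherits the parent bag's phenotype mixture; this reduces the
    chance that a pseudo-bag from a positive slide contains no positive tissue.
    """
    labels = _kmeans_labels(embeddings, min(n_phenotypes, len(embeddings)), seed=seed)
    rng = np.random.default_rng(seed)
    pseudo_bags = [[] for _ in range(n_pseudo_bags)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        for i, inst in enumerate(idx):  # round-robin within each cluster
            pseudo_bags[i % n_pseudo_bags].append(int(inst))
    return [sorted(pb) for pb in pseudo_bags]

emb = np.random.default_rng(1).normal(size=(200, 16))
bags = make_pseudo_bags(emb)
```

Contrast with random splitting: a random 4-way split of a slide with few positive patches can easily produce an all-negative pseudo-bag that still inherits the positive slide label, whereas the cluster-balanced split above gives every pseudo-bag a share of each phenotype group.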

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4174_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/JingjiaoLou/PCR-MIL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LouJin_PCRMIL_MICCAI2025,
        author = { Lou, Jingjiao and Pan, Qingtao and Yang, Qing and Ji, Bing},
        title = { { PCR-MIL: Phenotype Clustering Reinforced Multiple Instance Learning for Whole Slide Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15967},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    1. The proposed instance clustering and redistribution approach, combined with the Grad-CAM selection method, effectively mitigates the label inconsistency issue between pseudo-bags and their parent bags—a notable improvement over conventional pseudo-bag generation methods.

    2. The integration of reinforcement learning for pseudo-bag selection further enhances the framework by filtering out high-noise pseudo-bags, which contributes to more robust model performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The proposed instance clustering and redistribution approach, combined with the Grad-CAM selection method, effectively mitigates the label inconsistency issue between pseudo-bags and their parent bags—a notable improvement over conventional pseudo-bag generation methods.

    2. The integration of reinforcement learning for pseudo-bag selection further enhances the framework by filtering out high-noise pseudo-bags, which contributes to more robust model performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The authors claim that prior methods generate pseudo-bags “through random splitting,” yet numerous recent studies (e.g., [1]) employ non-random strategies. This assertion requires stronger empirical support and a more comprehensive literature review.

    2. The statement that “Embedding-based MIL suffers from non-discriminative features” overlooks advances in pathology-specific foundation models (e.g., CONCH, TITAN, MUSK), which demonstrably extract highly discriminative features from histopathology images. The authors should address these contemporary approaches.

    3. The evaluated baselines are predominantly from 2021–2022, while state-of-the-art (SOTA) pseudo-bag generation methods (e.g., [1–4]) are omitted. For instance, DTFD-MIL (2022) [5] is the only pseudo-bag method cited, despite newer techniques such as ProDiv [1], ReMix [3], and RankMix [4].

    [1] Yang R, Liu P, Ji L. ProDiv: Prototype-driven consistent pseudo-bag division for whole-slide image classification[J]. Computer Methods and Programs in Biomedicine, 2024, 249: 108161.
    [2] Liu P, Ji L, Zhang X, et al. Pseudo-bag mixup augmentation for multiple instance learning-based whole slide image classification[J]. IEEE Transactions on Medical Imaging, 2024, 43(5): 1841-1852.
    [3] Yang J, Chen H, Zhao Y, et al. ReMix: A general and efficient framework for multiple instance learning based whole slide image classification[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022: 35-45.
    [4] Chen Y C, Lu C S. RankMix: Data augmentation for weakly supervised learning of classifying whole slide images with diverse sizes and imbalanced categories[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 23936-23945.
    [5] Zhang H, Meng Y, Zhao Y, et al. DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 18802-18812.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The authors claim that prior methods generate pseudo-bags “through random splitting,” yet numerous recent studies (e.g., [1]) employ non-random strategies. This assertion requires stronger empirical support and a more comprehensive literature review.

    2. The statement that “Embedding-based MIL suffers from non-discriminative features” overlooks advances in pathology-specific foundation models (e.g., CONCH, TITAN, MUSK), which demonstrably extract highly discriminative features from histopathology images. The authors should address these contemporary approaches.

    3. The evaluated baselines are predominantly from 2021–2022, while state-of-the-art (SOTA) pseudo-bag generation methods (e.g., [1–4]) are omitted. For instance, DTFD-MIL (2022) [5] is the only pseudo-bag method cited, despite newer techniques such as ProDiv [1], ReMix [3], and RankMix [4].

    [1] Yang R, Liu P, Ji L. ProDiv: Prototype-driven consistent pseudo-bag division for whole-slide image classification[J]. Computer Methods and Programs in Biomedicine, 2024, 249: 108161.
    [2] Liu P, Ji L, Zhang X, et al. Pseudo-bag mixup augmentation for multiple instance learning-based whole slide image classification[J]. IEEE Transactions on Medical Imaging, 2024, 43(5): 1841-1852.
    [3] Yang J, Chen H, Zhao Y, et al. ReMix: A general and efficient framework for multiple instance learning based whole slide image classification[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022: 35-45.
    [4] Chen Y C, Lu C S. RankMix: Data augmentation for weakly supervised learning of classifying whole slide images with diverse sizes and imbalanced categories[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 23936-23945.
    [5] Zhang H, Meng Y, Zhao Y, et al. DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 18802-18812.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have mainly addressed my concerns.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a novel MIL framework consisting of two interesting components, PCFS and RFE, which improve the framework’s robustness to noise and its ability to extract discriminative embeddings, respectively.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The manuscript is well-written and supported by clearly structured figures that facilitate understanding.
    2. The paper utilizes PCFS to assign positive instances to each pseudo-bag to mitigate noisy labels, and proposes RFE, which integrates RL into MIL to identify clean pseudo-bags for pre-training a domain-specific feature extractor.
    3. The paper demonstrates its effectiveness compared to other methods on two popular datasets, and its ablation study also demonstrates the effect of the two key components.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The overall work is solid. The following comments are intended to help improve the clarity of the manuscript and better substantiate the claims.
    1. There are overlapping parts between the text and content in the figure. It would be helpful to adjust the position of the text to avoid overlap and improve the readability of the figure.
    2. The reason for using an attention-based MIL model as the classifier in the RFE module is not clearly explained. Given that the attention-based MIL model is relatively dated and has been shown to underperform compared to more recent methods such as DSMIL or other advanced MIL models, it would be helpful for the authors to clarify why this particular model was chosen. This choice may limit the overall performance of the framework, and a justification would strengthen the methodology.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As noted in the main strengths, the paper introduces PCFS to assign positive instances to each pseudo-bag, which helps reduce the impact of noise. In addition, a reinforcement learning strategy is employed to train a domain-specific extractor for capturing discriminative representations. Although there are some minor issues in the manuscript, they do not significantly affect the overall quality of the work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents PCR-MIL, a novel framework designed to enhance Multiple Instance Learning (MIL) for classifying Whole Slide Images (WSIs). PCR-MIL introduces two innovative components: Phenotype Clustering-based Feature Selection (PCFS) and Reinforced Feature Extractor (RFE). PCFS aims to improve noise robustness by constructing phenotype-specific pseudo-bags and selecting positive instance embeddings. RFE incorporates reinforcement learning to enhance the discriminative power of feature extractors by training on selectively chosen pseudo-bags with reduced noise. Experimental results indicate that PCR-MIL surpasses current state-of-the-art MIL methods on the BRACS and CRC-DX datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - The introduction of phenotype clustering is a novel strategy that addresses noisy labels, potentially enhancing the robustness of MIL methods for histopathology image classification.
    - The use of reinforcement learning in training feature extractors offers an original approach to improving the discriminative power and representation capability of MIL models. Demonstrated improvements in accuracy and AUC on benchmark datasets imply improved clinical feasibility and effectiveness in practical applications.
    - The paper provides a detailed empirical evaluation showing PCR-MIL’s performance relative to existing techniques, suggesting promise in real-world applications.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    - The choice of datasets, BRACS and CRC-DX, may not be ideal for evaluating MIL frameworks, which are traditionally applied to megapixel WSIs rather than 512×512 patches; this calls into question the necessity of applying MIL to such datasets.
    - Baseline accuracy and AUC results for DSMIL, both reported as 100%, appear unusually high and warrant verification, especially without comparisons to state-of-the-art foundation models such as UNI [1], CONCH [2], and GigaPath [3], which excel in classification tasks.
    - The absence of results from visualization and explainability analyses limits the ability to fully assess the model’s internal workings and interpretability.
    - The paper’s evaluation scope could be broadened to include these foundation models to better judge PCR-MIL’s applicability and performance relative to cutting-edge standards.
    - The explanation of the mechanism of both PCFS and RFE could be further detailed for better comprehension, especially on the integration of phenotype clustering and reinforcement learning.
    - Although the paper introduces novel components, it could further benchmark performance against a wider range of datasets to comprehensively assess generalization across different pathological conditions.
    - Some claims, such as enhanced robustness to noise and improved feature representation, would benefit from more detailed discussion of why these approaches surpass existing techniques.
    - Limited context is provided regarding computational complexity and runtime performance, which could be critical considerations for deploying the framework in clinical settings.

    [1] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., et al. Towards a general-purpose foundation model for computational pathology. Nat Med (2024). https://doi.org/10.1038/s41591-024-02857-3
    [2] Lu, M.Y., Chen, B., Williamson, D.F., Chen, R.J., Liang, I., Ding, T., et al. A visual-language foundation model for computational pathology. Nat Med (2024).
    [3] Xu, H., et al. A whole-slide foundation model for digital pathology from real-world data. Nature (2024).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the current weaknesses and areas of concern, I would recommend a neutral stance regarding acceptance, contingent upon the authors addressing the highlighted issues. Specifically, there is a need to:

    1. Clarify the rationale behind dataset selection and its relevance to typical MIL applications.
    2. Verify the baseline DSMIL results and provide comparisons with state-of-the-art foundation models to ensure comprehensive benchmarking.
    3. Incorporate visualization techniques and provide explanations for model decisions to improve transparency and interpretability.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    This work’s lack of engagement with the advances in foundation models is what tempers my recommendation for acceptance.




Author Feedback

We thank the reviewers for their valuable and constructive comments.

Q1: Classifier choice (R1). A1: We chose an attention-based MIL because it is the most suitable for our task. The PCFS module requires that the extractor have a strong capacity to extract diverse phenotype features. Only under this condition can PCFS effectively select the embeddings that contribute most to classification, forming the input bag of the 2nd MIL and enhancing noise robustness. Therefore, we followed [1] and adopted an attention-based MIL with Siamese fully convolutional networks, which have been shown to effectively learn informative representations for individual phenotypes.

Q2: Dataset choice (R2). A2: BRACS and CRC-DX have been used in many studies [2, 3], demonstrating their applicability for evaluating MIL frameworks. It is common practice to first divide WSIs into patches and then use these patches as input to MIL networks. The patches in BRACS and CRC-DX are derived from megapixel WSIs.

Q3: Baseline results and foundation models (R2). A3: The excellent results of DSMIL may be attributed to its use of multiscale features. In [4], a high F1-score of 96.32 on BRACS was reported using graph representation features at two different scales simultaneously. Additionally, by leveraging features extracted from the UNI foundation model, the baseline DTFD-MIL achieves 93.33 accuracy (Acc) and 99.56 AUC on BRACS (lower than ours), and 90.43 Acc and 94.29 AUC on CRC-DX (higher than ours).

Q4: Visualization and explainability (R2). A4: Our work was inspired by [5] and DTFD-MIL. Similar to our proposed RFE, [5] employed reinforcement learning (RL) to remove noisy data for robust emotion classification. Our PCFS module is an improvement upon DTFD-MIL. The analyses presented in these two existing works provide a certain degree of visualization and explainability for our method.

Q5: Explanation of the mechanism (R2). A5: PCFS enhances noise robustness by reducing the number of noisy bags. It achieves this by constructing pseudo-bags that capture diverse phenotype patterns in the 1st MIL and selecting phenotype embeddings with the highest probability of belonging to a specific category in the 2nd MIL. RFE improves feature representation by training a domain-specific extractor. It does so through RL, which selects pseudo-bags with less noise for training. The integration of phenotype clustering and RL empowers PCR-MIL with both strong noise robustness and improved feature extraction capabilities.

Q6: Dataset generalization (R2). A6: The results on BRACS and CRC-DX demonstrate the effectiveness of our proposed method and its individual components. We will extend the evaluation to broader datasets in the future.

Q7: Model deployment (R2). A7: Our PCR-MIL contains 9,692,930 parameters and requires an average of 1.43 minutes per case to generate a diagnostic prediction.

Q8: Random splitting (R3). A8: Our intention was to emphasize that the challenge of using pseudo-bags lies in the presence of noisy labels. Random splitting and a small proportion of positive tissue are two causes of this noise.

Q9: Foundation models (R3). A9: Non-discriminative features are indeed a challenge faced by previous embedding-based MIL methods. Foundation models, like our approach, aim to address this challenge. (See A3 for foundation model results.)

Q10: SOTA methods (R3). A10: We selected classic methods that are commonly used for comparison. ProDiv [6], one of the SOTA pseudo-bag generation methods, achieves 74.00 Acc and 79.12 AUC on CRC-DX, and 90.00 Acc and 98.29 AUC on BRACS (lower than ours).

[1] Yao. Whole slide images based cancer survival … . Med Image Anal, 2020.
[2] Yang. MambaMIL: Enhancing … . MICCAI 2024.
[3] Schirris. DeepSMILE: Contrastive … . Med Image Anal, 2022.
[4] Pati. Hierarchical graph representations … . Med Image Anal, 2022.
[5] Li. Deep reinforcement learning for robust emotional … . Knowl-Based Syst, 2020.
[6] Yang. ProDiv: Prototype-driven … . Comput Meth Prog Bio, 2024.
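The RL selection described in A5 can be caricatured with a toy REINFORCE sketch. This is an illustrative assumption only, not the paper's implementation: the reward design (purity of the kept set), the logistic Bernoulli policy, and every variable name here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Toy ground truth: is each pseudo-bag low-noise ("clean")?
clean = rng.random(n) < 0.6
# A scalar summary per pseudo-bag; clean pseudo-bags score higher on average.
feat = clean.astype(float) + rng.normal(scale=0.3, size=n)

w, b, lr = 0.0, 0.0, 1.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * feat + b)))  # Bernoulli keep-probability
    keep = rng.random(n) < p                   # sample a selection action
    if not keep.any():
        continue
    reward = clean[keep].mean()                # purity of the selected set
    adv = reward - clean.mean()                # baseline: purity of keeping all
    # REINFORCE for a logistic Bernoulli policy: d/dw log pi_i = (keep_i - p_i) * feat_i
    w += lr * adv * ((keep - p) * feat).mean()
    b += lr * adv * (keep - p).mean()
```

In this toy setup the policy learns a positive weight on the noise-correlated feature, so thresholding the keep-probability tends to retain a cleaner-than-average subset; in the RFE module those retained pseudo-bags would then pre-train the feature extractor.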




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper receives an initial review of 2WA (R1, R2) and 1WR (R3). After rebuttal, R3 changes to Accept while R2 changes to Reject, resulting in a split decision. The main concerns include: 1) R2’s fundamental questions about dataset appropriateness, noting that BRACS and CRC-DX use 512×512 patches rather than traditional megapixel WSIs, and suspicious baseline results with DSMIL achieving 100% accuracy/AUC, 2) Missing comparisons with state-of-the-art foundation models like UNI, CONCH, and GigaPath, which R2 and R3 emphasized as critical for contemporary MIL evaluation, 3) Outdated baseline methods (predominantly 2021-2022) and missing recent SOTA pseudo-bag generation techniques like ProDiv, Remix, and RankMix (R3), 4) R1’s concerns about using attention-based MIL classifiers instead of more advanced methods like DSMIL. The authors addressed many concerns in rebuttal, providing foundation model comparisons showing competitive performance and clarifying technical details, which satisfied R3 but not R2. While R2 remained concerned about foundation model integration, R1 found the work solid with novel PCFS and RFE components, and R3 appreciated the phenotype clustering approach for noise reduction. I suggest a recommendation of Accept, as the paper introduces meaningful contributions to MIL with phenotype clustering and reinforcement learning integration, the authors adequately responded to technical concerns, and the majority assessment recognizes the work’s value despite some methodological debates about baseline selection.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The field of pathology AI entered the era of foundation models several years ago, with many backbones available to the public. I agree with R3 that the paper in its current version overlooked the advances in pathology foundation models, and further improvements over foundation model backbones should be shown to demonstrate the advantage of a paper aiming to address the pathology MIL problem. In my opinion, the paper in its current version has not met the standard of MICCAI.


