Abstract

Weakly supervised whole slide image (WSI) classification is challenging due to the lack of patch-level labels and high computational costs. State-of-the-art methods use self-supervised patch-wise feature representations for multiple instance learning (MIL). Recently, methods have been proposed to fine-tune the feature representation on the downstream task using pseudo labeling, but they mostly focus on selecting high-quality positive patches. In this paper, we propose to mine hard negative samples during fine-tuning. This allows us to obtain better feature representations and reduce the training cost. Furthermore, we propose a novel patch-wise ranking loss in MIL to better exploit these hard negative samples. Experiments on two public datasets demonstrate the efficacy of these proposed ideas. Our code is available at https://github.com/winston52/HNM-WSI.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3822_paper.pdf

SharedIt Link: https://rdcu.be/dY6ir

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72083-3_14

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3822_supp.pdf

Link to the Code Repository

https://github.com/winston52/HNM-WSI

Link to the Dataset(s)

https://www.cancer.gov/ccg/research/genome-sequencing/tcga

https://camelyon16.grand-challenge.org/Data/

BibTex

@InProceedings{Hua_Hard_MICCAI2024,
        author = { Huang, Wentao and Hu, Xiaoling and Abousamra, Shahira and Prasanna, Prateek and Chen, Chao},
        title = { { Hard Negative Sample Mining for Whole Slide Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {144 -- 154}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the ubiquitous computational pathology problem of gigapixel whole slide image classification under the multiple instance learning principle with only slide-level labels given. The authors propose a variant of contrastive learning enhanced by hard negative sample mining to improve performance and training efficiency. The proposed methodology works with only a subset of negative patches instead of all negative patches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The authors introduce the combination of two technical contributions: (1) a pairwise multiple instance ranking loss and (2) hard negative sample mining within a contrastive learning framework. 2) The paper is written clearly and is mostly easy to follow and understand. The figures are well designed. 3) The claim of improving training efficiency by focusing on a smaller number of hard negative instances is intuitively believable. 4) The proposed method focuses on the top percentage of negative instances according to predicted scores, instead of naively using all negative instances from a given negative slide. 5) The ablation study on using the ranking loss with the proposed method as well as with the baseline methods is well motivated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The paper lacks a discussion of related works that focus on similar problems, particularly hard-negative-mining-based modifications published at recent MICCAI conferences and in related venues. As such, the experimental comparison is limited to two works based on contrastive learning (besides the obligatory max-pooling and ABMIL baselines). 2) Limited novelty in the proposed methodology. As described in 1), hard negative mining in combination with the MIL setting has already been explored extensively. 3) The main experimental results show only accuracy and AUROC metrics. Results on precision/recall or AUPRC would be important in the multiple instance learning context, especially when patch-wise classification is considered. 4) While the ablation study on the proportion of negative samples used is welcome, the included runtime comparison lacks meaning. Such an analysis depends on far too many other factors when comparing different methods. The claimed training efficiency gain is motivated but not shown in empirical results compared to baseline methods. 5) There is no strong difference in the visualization of instances between the proposed method and the baseline ItS2CLR (Fig. 3). 6) There is no discussion of or comparison to other works using related multiple instance ranking methods.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The datasets used are standard benchmark datasets in the computational pathology community and thus easy to find and use. The training methodology and model design are relatively clearly stated. However, there is no published code or dependency list to easily reproduce the claimed results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) The related work section should discuss both previous works using hard negative mining and previous works using a ranking loss between the multiple instances. At least the following works [1-3] should be considered. 2) In the implementation details section it does not become clear how the hyperparameters w_b and w_r were selected to be 0.5 and 0.1, respectively. 3) The same is true for the sampling rates r_p and r_n as well as K. 4) While additional experiments cannot be requested, I would like to see some additional discussion on how the claimed training time gains compare to other methodologies (with and without the MI ranking loss). 5) Showing selected top negative instances from the proposed method, and discussing what is present in them vs. instances found by naive hard negative sample mining, might give additional valuable insight. 6) For equation 4 it is stated that the similarity between two instances is defined as in reference 14. The definition should be included in the main text for easier understanding of the sim() operation (a generic formulation is sketched after the references below).

    [1] Li, Meng, et al. “Deep instance-level hard negative mining model for histopathology images.” MICCAI 2019.
    [2] Butke, Joshua, et al. “End-to-end multiple instance learning for whole-slide cytopathology of urothelial carcinoma.” PMLR 2021.
    [3] Lu, Ming, et al. “Data-efficient and weakly supervised computational pathology on whole-slide images.” Nature Biomedical Engineering 2021.
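    For context, sim(·,·) in such contrastive objectives conventionally denotes the cosine similarity between two instance embeddings. The following is an assumed, SimCLR-style formulation, not the paper's own definition (which it defers to its reference 14):

```latex
% Assumed SimCLR-style convention; the paper defers the exact definition to its reference 14.
% Cosine similarity between two instance embeddings u and v:
\mathrm{sim}(u, v) \;=\; \frac{u^{\top} v}{\lVert u \rVert_2 \,\lVert v \rVert_2}
% typically used inside a temperature-scaled contrastive loss of the form
\mathcal{L}_{\mathrm{con}} \;=\;
  -\log \frac{\exp\!\bigl(\mathrm{sim}(u, v^{+})/\tau\bigr)}
             {\sum_{v'} \exp\!\bigl(\mathrm{sim}(u, v')/\tau\bigr)}
```

    Here v^+ is a (pseudo-)positive key for the query u, the sum runs over the positive and negative candidates, and τ is a temperature parameter.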

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As is, the paper largely does not justify the main conference track of MICCAI. The methodological novelty is limited and of a combinatorial nature (hard negative sampling in contrastive-learning-based MIL) or trivial (using fewer instances makes training faster). The experimental results show an incremental improvement over two baselines in contrastive learning. I would recommend revising the paper to discuss how the proposed technique compares to other hard negative mining approaches (e.g., clustering-based) to strengthen how this work differs and excels in performance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The author’s rebuttal addresses my major concerns somewhat. The necessary discussion on other hard negative mining methods is mentioned to be worked into the main text but was not discussed in this rebuttal. Regarding all comments by all reviewers, there are quite a lot of changes to incorporate for the revised manuscript. I hope the authors honor all their promised changes in the case of acceptance. Still, the paper is a borderline valuable contribution to the community. Thus, overall I will upgrade my rating to ‘weak accept’.



Review #2

  • Please describe the contribution of the paper

    The authors propose a multiple instance learning (MIL) approach for classifying histological whole slide images. The proposed method incorporates (hard) negative samples during fine-tuning of the network. An evaluation is performed on two common datasets (Camelyon16, TCGA).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written and easy to follow
    • Argumentation for the novel loss formulation is clear
    • The application scenario is currently of high interest in the community
    • The method’s description seems to be correct
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • According to the authors: “For Camelyon16 dataset, we reported the results of the official testing set.” According to the website, this testing set consists of 130 images. I did not find any information that the experiments were repeated. In this setting, the classification of a single image as class 0/1 corresponds to a difference in accuracy of about 0.77 %. It is hard to assess in this setting whether improvements are random or systematic. As long as the test set is only used for testing after all (meta-)parameters are fixed, this is part of the game (of a challenge, for example). However, this is typically not the case as soon as the data is disclosed. Following the Bonferroni argument, meta-parameter optimization can be very dangerous in such a setting. I do not understand why a cross-validation was not performed.
    • For me it is not clear how the numbers were obtained. Since only 130 test images were used without any repetitions, I expect that only the following accuracies are possible: 1.00, 1.00-1/130, 1.00-2/130, … -> 1.0000 0.9923 0.9846 0.9769 0.9692 0.9615 0.9538 0.9462 0.9385 0.9308 0.9231 0.9154 0.9077 0.9000 0.8923 0.8846 0.8769 0.8692 0.8615 0.8538 0.8462 0.8385 0.8308 0.8231 0.8154 0.8077 0.8000 …

    Please indicate why the accuracies reported in the paper differ from these values (a small enumeration sketch is given below).
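    The reviewer's point is accuracy quantization: with a fixed test set of n slides and no repetitions, accuracy can only take values k/n. The sketch below (illustrative only; names are ours) enumerates the attainable values, including the grid for 129 slides that the author feedback later refers to.

```python
# Illustrative sketch of attainable accuracies on a fixed test set of size n_test.
def attainable_accuracies(n_test: int, max_errors: int = 10) -> list[float]:
    """Top accuracies reachable when each slide is either correct or wrong."""
    return [round((n_test - e) / n_test, 4) for e in range(max_errors)]

print(attainable_accuracies(130))  # [1.0, 0.9923, 0.9846, 0.9769, ...]
print(attainable_accuracies(129))  # [1.0, 0.9922, 0.9845, 0.9767, ...]
```

    Values off either grid can also arise when results are averaged over several independent runs, as stated in the author feedback.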

    • Were the baseline methods also evaluated in combination with pre-trained networks and a single resolution? For some of them, this is not the optimal configuration (e.g., DS-MIL).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Without source code, it is hard to reproduce the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Please explain why you did not perform a cross validation based on the Camelyon data set.
    • Also please explain how the resulting accuracies (see above) were obtained.
    • Please also provide details how the baseline methods were evaluated.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The paper is well written and presentation is nice (figures etc)
    • The technical contribution is sound and well motivated
    • My main merit refers to the evaluation, which is of a very high importance for the community. For details, please see above.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    My points were clarified accordingly.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a hard negative mining strategy to fine-tune feature extractors pre-trained on the ImageNet dataset, addressing the disparity between natural image and histopathology image feature distributions. The paper argues that negative patches are overwhelming and redundant; like positive patches, they should be subject to selection. Furthermore, the authors propose a multiple instance ranking loss that pushes the most similar negative patches far away from the true positive patches. The proposed method is an iterative process in which MIL training and feature extractor training take place alternately (a schematic sketch follows below). Experimental results show that this method achieves state-of-the-art performance.
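    A minimal sketch of that alternation, written with hypothetical function names rather than the authors' actual code:

```python
# Hypothetical outline of the iterative procedure described above: each round
# trains the MIL aggregator on frozen patch features, uses its instance scores
# to select pseudo-positive and hard negative patches, and then fine-tunes the
# feature extractor contrastively on that reduced subset.
def iterative_training(extractor, aggregator, slides, n_iterations,
                       extract_features, train_mil, select_patches,
                       finetune_contrastive):
    for _ in range(n_iterations):
        features = extract_features(extractor, slides)               # patch embeddings per slide
        train_mil(aggregator, features)                              # slide-level MIL training
        pseudo_pos, hard_neg = select_patches(aggregator, features)  # top-scoring patches
        finetune_contrastive(extractor, pseudo_pos, hard_neg)        # update the feature extractor
    return extractor, aggregator
```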

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors introduce an innovative approach for fine-tuning the feature extractor on histopathology images. The novelty of this method resides in the selection of hard negative patches: a classifier is employed to score the negative patches, and the highest-scoring ones are treated as hard negatives. This strategy not only enhances the performance of the models but also reduces the computational costs associated with training.
    2. The experimental results provided are compelling, demonstrating that the proposed method has achieved state-of-the-art results across all datasets.
    3. The paper is well-structured and clearly written, and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper lacks statistical measures such as standard deviations or significance tests (p-values) to validate the effectiveness of the proposed methods, which is a significant oversight. Furthermore, given the iterative nature of the process, where MIL training and feature encoder training occur alternately, the paper does not provide specific details regarding the number of epochs for MIL training, the number of epochs for feature extractor training, and the total number of iterations. To ensure reproducibility of the results, it is imperative for the authors to include this information.
    2. Figure 3 appears to be of low resolution, making it difficult to discern whether the discrepancies between the proposed methods and the ground truth represent actual tumor regions. Given that the ground truth seems to be manually annotated by the authors, there are subtle areas that may not be tumor regions. If the ground truth is as labeled in this figure, the method appears to suppress too many positive instances. The authors may need to fine-tune the hyperparameters further to achieve a more balanced result.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors are encouraged to offer more detailed information regarding the number of epochs for aggregator and contrastive learning training, as well as the total number of iterations required. This additional data would provide readers with a more comprehensive understanding of the training process and the computational resources involved. Furthermore, it would be highly beneficial if the authors could consider sharing the source code. This would not only enhance the transparency and reproducibility of the study but also allow other researchers to build upon this work, thereby contributing to the advancement of the field.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It is recommended that the authors consider integrating the tables and figures (specifically Figure 3 and Table 1) directly into the experimental section of the text. Currently, the placement of these elements makes it somewhat challenging for readers to cross-reference the data. By aligning the visual aids with the corresponding text, the authors could significantly enhance the readability and comprehension of the paper.
    2. The paper seems not to clearly specify the type of aggregator utilized in the methodology. It would be advantageous for the authors to conduct a comparative analysis of the aggregator’s performance with and without the loss and the iterative process. This comparison could yield valuable insights into the effectiveness of the proposed approach and potentially strengthen the overall impact of the research.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach to fine-tuning the feature extractor, a critical component in WSI classification. This is achieved through the application of contrastive learning with pseudo-labels and a unique MI ranking loss. The latter is designed to distance high-score negative patches and low-score patches, thereby enhancing the overall performance of the model. While there are certain issues that need to be addressed, as highlighted in the preceding sections, these do not significantly detract from the overall quality of the research. However, it is recommended that the authors address these points to further strengthen the robustness of their study and enhance the potential for practical application.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have mainly addressed my concerns; however, the training of the whole framework still remains somewhat confusing. I hope the authors can release the code so that these details become clear.




Author Feedback

We thank all reviewers for their constructive comments. We will improve the presentation according to suggestions. Below we address major concerns.

R1: Missing references [1-3] and comparison. A: We will include discussions regarding hard negative mining methods[1,2] and multiple instance ranking methods[3]. We reproduced the model in [1], achieving an ACC of 0.8605 and AUC of 0.8781 on Camelyon16, which is worse than our results.

R1: Novelty: hard negative mining + MIL has been explored extensively. A: Previous work (e.g., [3]) only mines negative samples for MIL using attention values. Our work is the first to mine hard negative samples for feature representation learning. We propose an elegant solution: leveraging instance predictions to select hard negative samples, which are then used for contrastive learning to improve feature quality (a minimal selection sketch is given below). Extensive experiments show the power of our method in reducing feature representation learning time.
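A minimal sketch of score-based hard negative selection as described above (an assumption about the mechanism, not the released implementation): among patches from negative slides, only the top r_n fraction by the instance classifier's predicted score is kept for contrastive fine-tuning.

```python
import numpy as np

def mine_hard_negatives(neg_scores: np.ndarray, r_n: float = 0.05) -> np.ndarray:
    """Indices of the top r_n fraction of negative patches, ranked by predicted score."""
    k = max(1, int(round(r_n * len(neg_scores))))
    return np.argsort(neg_scores)[::-1][:k]   # highest-scoring (hardest) negatives first

# Example: with r_n = 0.05, only 5% of the negative patches enter fine-tuning.
scores = np.random.rand(10_000)                # placeholder instance predictions
hard_neg_idx = mine_hard_negatives(scores, r_n=0.05)
```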

R1: Metric: AUPRC and P/R are missing. A: We follow previous works ([10,12,16,26] in the paper) and report ACC and AUC. We will add AUPRC and P/R as suggested.

R1: Efficiency gain is not shown. A: Testing on our machine (1 Nvidia A5000 GPU, 48GB VRAM), our method (top 5% negative samples) converges in 12 iterations in about 8 hours. In comparison, ItS2CLR takes about 2 days.

R1: Visualization (Fig. 3): no difference with ItS2CLR. A: They are supposed to be close. We are showing that, with far fewer training samples, our method achieves the same performance as ItS2CLR.

R1, R4: Implementation details. A: The code will be made publicly available upon acceptance to ensure reproducibility. The MIL aggregator was trained for 350 epochs. The fine-tuning was set to 15 epochs. For hyperparameters, we ran experiments with different values and selected them using the validation set. The values explored for w_r were in the range [0.05, 1] with step size 0.05, K in {5, 10, 50}, and r_n in {0.02, 0.05, 0.1, 0.2, 1}; w_b was set to 0.5 as in DSMIL, and r_p was set to 0.2 as in ItS2CLR (the stated grid is written out below).
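For readability, the stated search space written out explicitly (values are taken from the text above; the variable names mirror the paper's notation):

```python
import numpy as np

# Hyperparameter grid as stated in the rebuttal (values from the text above).
search_space = {
    "w_r": np.arange(0.05, 1.0 + 1e-9, 0.05).round(2).tolist(),  # 0.05, 0.10, ..., 1.00
    "K":   [5, 10, 50],
    "r_n": [0.02, 0.05, 0.1, 0.2, 1],
}
fixed = {"w_b": 0.5, "r_p": 0.2}  # w_b as in DSMIL, r_p as in ItS2CLR
```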

R1: Showing hard negatives compared with naive hard negative sampling methods. A: We will add visualizations of the hard negative samples selected during training. But, as mentioned above, the two approaches solve different problems: previous methods only sample hard negatives for MIL using attention values.

R4: Standard deviation. A: We provided standard deviations for TCGA in the supplementary material, showing significant performance improvements over baselines using a t-test (95% confidence interval); a sketch of such a test is shown below. We will add standard deviations for Camelyon16 as well.
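A sketch of the kind of per-run significance test referred to above (assumed usage; the exact protocol and the actual run results are in the paper's supplementary material, and the numbers here are placeholders):

```python
from scipy import stats

ours     = [0.91, 0.92, 0.90]   # placeholder per-run AUCs of the proposed method
baseline = [0.88, 0.89, 0.87]   # placeholder per-run AUCs of a baseline
t_stat, p_value = stats.ttest_ind(ours, baseline)
significant = p_value < 0.05     # significance at the 95% confidence level
```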

R4: Low resolution in Fig.3. A: Fig. 3 shows patch-level ground truth and predictions. Each pixel represents a patch. Thus the low resolution. We will add magnified figures.

R4: Fig.3 Ground truth. A: We use the original ground truth from Camelyon16 and we did not manually annotate it.

R4: MIL aggregator selection and ablation study. A: We use DSMIL as our aggregator. In Tab. 2, “DSMIL + ranking loss” shows results without iteration, and “Ours w/o ranking” shows results without ranking loss. Our method improves both metrics, demonstrating the effectiveness of each component.

R5: Camelyon16: 130 testing images, accuracy numbers do not match the list. A: Although the official test set has 130 images, most existing works, including DSMIL and ItS2CLR, excluded one test slide due to quality issues. We use the same setting, with 129 test images, hence the accuracy values.

R5: Why not cross-validation for Camelyon16. A: We follow the official test set split to ensure fair comparisons with previous methods. The accuracy results are the average of three independent experiments.

R5: Baseline evaluation: DSMIL only used one resolution. A: Thank you for pointing out this oversight. We reevaluated DSMIL with two-scale features (optimal in the original paper), obtaining an ACC of 0.8837 and AUC of 0.9095. This is still worse than our results. We will update these numbers in the revised version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The main concern with this borderline paper is that hard negative mining in MIL has been done before and it is not clear what novelty is presented here.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


