Abstract

Whole slide image (WSI) classification plays a crucial role in digital pathology data analysis. However, the immense size of WSIs and the absence of fine-grained sub-region labels pose significant challenges for accurate WSI classification. Typical classification-driven deep learning methods often struggle to generate informative image representations, which can compromise the robustness of WSI classification. In this study, we address this challenge by combining discriminative and contrastive learning for WSI classification. Unlike existing contrastive learning methods for WSI classification, which primarily rely on pseudo labels assigned to patches based on WSI-level labels, our approach constructs positive and negative samples directly at the WSI level. Specifically, we select a subset of representative image patches to represent each WSI and create positive and negative samples at the WSI level, facilitating effective learning of informative image features. Experimental results on two datasets and ablation studies demonstrate that our method significantly improves WSI classification performance compared with state-of-the-art deep learning methods and learns informative features that promote the robustness of WSI classification.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1644_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Lia_Enhancing_MICCAI2024,
        author = { Liang, Peixian and Zheng, Hao and Li, Hongming and Gong, Yuxin and Bakas, Spyridon and Fan, Yong},
        title = { { Enhancing Whole Slide Image Classification with Discriminative and Contrastive Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper combines discriminative and contrastive learning to enhance the robustness of classification and to obtain more compact and informative image representations.

    Considering the immense size of WSIs, the paper uses a set of image patches to represent the whole image.

    When selecting representative image patches, the paper adopts SAM as a foundation model to extract patch features and uses k-means to group the patches into k clusters, which better represent the whole image (a code sketch follows this list).

    The paper performs the classification task at the image scale rather than the patch scale, so it can use the image labels directly as classification labels, avoiding the assignment of pseudo class labels to patches.

    The paper performs contrastive learning with image-level labels, which can exploit pathology-related discriminative information, whereas patch-level contrastive learning only captures contextual similarity.
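
    For concreteness, a minimal sketch of the patch-selection step described above, assuming precomputed embeddings from a frozen SAM encoder and scikit-learn's KMeans; the function name and the patches_per_cluster parameter are illustrative, not taken from the paper:

        import numpy as np
        from sklearn.cluster import KMeans

        def select_representative_patches(sam_features, k=8, patches_per_cluster=16, seed=0):
            """Cluster frozen SAM patch embeddings into k groups and sample a
            fixed number of patches per cluster, so the selected subset covers
            the visual variety of the whole WSI."""
            kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sam_features)
            rng = np.random.default_rng(seed)
            selected = []
            for c in range(k):
                members = np.where(kmeans.labels_ == c)[0]
                take = min(patches_per_cluster, len(members))
                selected.extend(rng.choice(members, size=take, replace=False))
            return np.asarray(selected)

    Sampling a fixed number of patches per cluster keeps the per-WSI input size constant, which is what makes downstream WSI-level training tractable.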

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • a novel way to perform WSI classification within a limited computational budget. The paper performs WSI tasks at the image scale rather than the patch scale and uses a set of patches to represent the whole image. SAM and k-means are adopted to select representative patches and improve performance.

    • a better way to avoid label noise. In other weakly supervised learning methods, image-level labels are used to assign pseudo class labels to patches, which introduces noise. Using image-scale labels avoids this label noise and yields better representations.

    • a better way to perform contrastive learning. In other methods, patches, together with their augmented or semantically similar counterparts, are treated as positive samples, while semantically dissimilar patches are treated as negatives; this criterion is not tied to WSI class information. This paper uses the WSI label as the criterion for constructing positive and negative samples, which better integrates class information (a code sketch follows this list).
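
    To make the contrast concrete, below is a hedged sketch of a WSI-level contrastive objective in which slides sharing a class label are positives and all others are negatives. The paper reportedly uses a maximum-margin formulation (see the weaknesses below), so this SupCon-style loss is only a stand-in illustrating the positive/negative construction:

        import torch
        import torch.nn.functional as F

        def wsi_contrastive_loss(wsi_embeddings, wsi_labels, temperature=0.1):
            # One embedding per WSI; positives are other WSIs with the same label.
            z = F.normalize(wsi_embeddings, dim=1)
            sim = z @ z.T / temperature
            pos_mask = (wsi_labels[:, None] == wsi_labels[None, :]).float()
            pos_mask.fill_diagonal_(0)  # a slide is not its own positive
            sim = sim - 1e9 * torch.eye(len(z), device=z.device)  # drop self-similarity
            log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
            return -(pos_mask * log_prob).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()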

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • jointly using contrastive learning and classification is not a novel trick

    Several existing works also used contrastive learning to improve the classification performance, such as

    Yang, P., Hong, Z., Yin, X., Zhu, C., Jiang, R.: Self-supervised visual representation learning for histopathological images. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pp. 47–57. Springer International Publishing (2021).

    Basak, H., Yin, Z.: Pseudo-label guided contrastive learning for semi-supervised medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023).

    • the performance of the method is not very convincing

    DC-WSI is compared with other methods on only two benchmarks, which cannot demonstrate the generalizability of the model. Moreover, DC-WSI does not outperform the other methods on the TCGA-Lung benchmark, and the t-SNE visualization of WSI image features does not clearly demonstrate the improvement of the classification+contrastive model over the classification-only model.

    • a comparison between different contrastive learning methods is lacking

    The paper does not give reasons for using maximum-margin classification rather than other contrastive training methods. It would be better to add comparison experiments to identify the best option.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It would be better to provide training iterations, training scheduler, and hardware environment (such as GPU memory).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • provide more experimental results on different benchmarks to show the generalizability of the method, and a more compelling comparison with existing methods to show the model’s strengths.
    • provide a clearer t-SNE comparison to show the effectiveness of the contrastive learning method.
    • explain the ‘semantic information’ in ‘semantic information obtained by these methods’ in the Introduction: how is ‘semantic information’ defined, and what are the strengths of ‘WSI class information’ compared with ‘semantic information’?
    • If possible, try more contrastive learning methods, such as triplet contrastive learning and SimCLR, and compare their performance (a code sketch of these alternatives follows).
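
    For reference, the two alternatives named above could be compared using stand-ins like the following hedged sketch (dummy embeddings; neither loss is claimed to match the paper's formulation):

        import torch
        import torch.nn.functional as F

        # Dummy WSI-level embeddings, purely for illustration.
        anchor, positive, negative = (torch.randn(4, 128) for _ in range(3))

        # Triplet loss: anchor and positive share a WSI label, negative differs.
        loss_triplet = torch.nn.TripletMarginLoss(margin=1.0)(anchor, positive, negative)

        # SimCLR-style NT-Xent between two views z1, z2 of the same slides.
        def nt_xent(z1, z2, tau=0.5):
            z = F.normalize(torch.cat([z1, z2]), dim=1)
            sim = z @ z.T / tau
            sim.fill_diagonal_(float("-inf"))  # remove self-similarity
            n = z1.shape[0]
            # Row i's positive is row i+n (and vice versa).
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
            return F.cross_entropy(sim, targets)

        loss_simclr = nt_xent(torch.randn(4, 128), torch.randn(4, 128))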
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • This paper has some contributions on the WSI classification in whole image scale, especially on how to select representative patches of images and how to use image level label to perform training.
    • However, it lacks demonstration and justification of the effectiveness and generalizability of their method.
    • It would be better to add more experiments and impressive results to further demonstrate their strengths.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors introduce an end-to-end pipeline that, based on the reported performance, classifies different types of cancer more accurately than prior state-of-the-art methods. However, it is unclear whether the results are statistically significant.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Patches from abnormal histopathological images can sometimes be labeled as normal, making it challenging to apply contrastive learning in this context. The method described here overcomes this issue by using a foundation model and clustering the histopathological patches, indirectly generating appropriate labels.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is unclear whether the authors re-trained the state-of-the-art methods on their data split or used the results from the original papers. If they used the results directly, the numbers do not match those in the original papers; can the authors show exactly where and how these numbers were obtained? If they re-trained the state-of-the-art architectures on their data split, there is a gap between their results and those of the original studies. In any case, the model’s performance is poorly presented: it should include other metrics such as precision and recall, with means and standard deviations.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It is evident that the authors dedicated significant effort to this study. The novel methodology overcomes challenges in patch-level contrastive learning by leveraging the SAM foundation model and an unsupervised clustering method.

    I’d like to offer a few more comments. The paper’s structure is well-organized, but there are instances where the meaning is unclear due to grammatical errors and repetitive information.

    Major Issues:

    One of the study’s main contributions is demonstrating the model’s robustness for WSI classification. However, no statistical significance analysis is provided, and key metrics like precision and recall are missing. Cross-validation is also crucial for providing means and standard deviations (a code sketch of such an evaluation follows). Detailed documentation of the methodology is essential for reproducibility, especially since the compared methods share their code on GitHub.
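
    A sketch of what the requested evaluation could look like, assuming slide-level probabilities from any of the compared models; train_and_predict is a hypothetical placeholder, not an interface from the paper or the compared repositories:

        import numpy as np
        from sklearn.metrics import precision_score, recall_score, roc_auc_score
        from sklearn.model_selection import StratifiedKFold

        def cross_validate(train_and_predict, X, y, n_splits=5, seed=0):
            """Report mean and std of precision/recall/AUC over stratified folds.
            train_and_predict(X_tr, y_tr, X_te) must return predicted probabilities."""
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
            scores = {"precision": [], "recall": [], "auc": []}
            for tr, te in skf.split(X, y):
                prob = train_and_predict(X[tr], y[tr], X[te])
                pred = (prob >= 0.5).astype(int)
                scores["precision"].append(precision_score(y[te], pred))
                scores["recall"].append(recall_score(y[te], pred))
                scores["auc"].append(roc_auc_score(y[te], prob))
            return {k: (np.mean(v), np.std(v)) for k, v in scores.items()}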

    Minor Issues:

    In the paragraph discussing the paper’s contributions, the same contribution seems to be repeated three times or is presented more as a methodology description. In the second paragraph of the introduction, the acronym ‘WSI’ is repeated too frequently, which makes the sentences hard to follow. The phrase ‘to learn compact and robustness image representations for accurate’ contains a grammatical error. In Figure 2, ‘CC’ is labeled as Contrastive Learning, which should instead be ‘CL.’

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The primary reasons the paper received a weak accept are the absence of statistical significance analysis demonstrating the robustness and superiority of the approach compared with state-of-the-art architectures, and the fact that the reported performance numbers do not align with the original studies.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper describes an “end-to-end”* approach for learning Multiple-Instance-Learning-like models for binary classification in histopathology, using a loss based on both contrastive and discriminative terms.

    • The “end-to-end” approach is enabled by two things: i) sampling a fixed number of patches from each image, and ii) turning each patch into a smaller 1D vector using a pre-trained network (SAM). One can argue that this isn’t truly “end-to-end” as claimed, since this initial patch network is fixed, but novelty exists in the use of a second trainable patch-level network that is trained “end-to-end” on this 1D description of the patch (a code sketch follows below). This is the first time I have seen this approach.

    While sampling patches is not new, the authors introduce a sampling strategy that clusters feature vectors and samples a fixed number from each cluster. The superiority of this sampling method is demonstrated empirically.
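
    As a reading aid, a minimal sketch of the two-feature-generator design described above: SAM embeddings are computed once by a frozen encoder, and only a small patch-level network plus aggregator is trained end-to-end. The attention pooling and layer sizes are assumptions for illustration; this review does not specify the paper's exact aggregator:

        import torch
        import torch.nn as nn

        class WSIClassifier(nn.Module):
            def __init__(self, sam_dim=256, hidden=128, n_classes=2):
                super().__init__()
                # Trainable patch-level network on top of frozen 1D SAM features.
                self.patch_net = nn.Sequential(
                    nn.Linear(sam_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
                self.attn = nn.Linear(hidden, 1)   # simple attention pooling
                self.head = nn.Linear(hidden, n_classes)

            def forward(self, sam_feats):          # (n_patches, sam_dim)
                h = self.patch_net(sam_feats)
                w = torch.softmax(self.attn(h), dim=0)
                slide = (w * h).sum(0)             # image-level feature
                return self.head(slide), slide

    Because each patch enters as a small 1D vector rather than raw pixels, all selected patches of a slide fit in memory at once, letting gradients flow through the whole bag.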

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The methods presented (see 4) can be generalised to any binary classification problem using MIL. The sampling strategy has even wider scope (e.g. multi-class problems). While individually most parts that make up this approach (MIL, contrastive learning, etc.) have been used before, the use of two feature generators (one fixed, one trainable) at the patch level is a great idea that I haven’t seen before, and it facilitates end-to-end learning in a scenario where this isn’t usually possible due to memory constraints. The sampling also contributes to this tractability, although sampling has been presented before. The particular sampling strategy (cluster, then sample from clusters) isn’t new in the general sense, but I’ve not seen it used in this context, and its superiority is shown in this work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    If I have to name a weakness, the contrastive term appears to be applicable only to binary problems and not to multi-class or continuous-outcome problems (e.g. survival). Additionally, the formulation doesn’t seem to be able to make use of completely unlabeled images, which is usually cited as the main advantage of contrastive learning. That said, the contrastive term improves things over simply using a discriminative loss, so this is perhaps being picky.

    Another criticism might be the use of only an 80:20 train:test split. Why no 5-fold cross-validation?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The datasets (2) are standard, and the method seems easy enough to replicate despite no code being mentioned. I do encourage code publication though: there are so many papers on MIL variations that expecting others to re-implement your code is not very community-minded.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Abstract: “Different from the extant contrastive learning methods for WSI classification that primarily assign pseudo labels to patches based on the WSI-level labels” - I’m not sure this really describes the body of MIL literature well. Most modern MIL methods produce a patch embedding that is fed through an attention mechanism and then combined into an image-level feature before classification. That sentence really only describes quite early MIL approaches.

    P1: “Despite their promising classification performance, these classifier-driven methods face challenges in attaining compact image representations to enhance the robustness of classification accuracy in that these methods employ discriminative information alone to learn” - This is pure speculation; you present no evidence for it. The paper shows a small improvement over SOTA, and the claim that the representation is a step change is not backed up by evidence (although it may be true). It may also be true that the features generated by discriminative learning are a good representation; we have seen that with ImageNet-trained models being applied in a wide range of circumstances!

    I’m going to say it: it is 2024, and not doing cross-validation in your experiments is not good practice.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the ideas in this paper are generally useful to the community and I’d like to see this published. The paper is generally well written and the evaluation (on two standard datasets/problems) sound.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely appreciate the reviewers’ insightful comments and constructive suggestions. Their unanimous recognition of the innovation, importance, and practical application value of our method is truly encouraging. Below, we address concerns raised by the reviewers:

Following the constructive suggestions, we will report five-fold cross-validation results in the final version. We will also make our code publicly available along with the cross-validation results to promote reproducibility. We will also improve the manuscript following the reviewers’ specific suggestions.

R3: 1) In the present study we evaluated our method on binary classification problems; however, it is straightforward to apply it to multi-class classification problems. Extending the method to survival analysis merits further investigation. The method does not work directly in a completely unsupervised setting, but it can be used to finetune models built with self-supervised contrastive learning, or integrated with self-supervised contrastive learning in a semi-supervised setting. 2) Despite their promising classification performance, classifier-driven methods are not equipped to learn compact image representations. We will improve the abstract following the reviewer’s constructive comments to provide accurate descriptions of existing methods.

R4: 1) The studies of Yang et al. and Basak et al. both learn features guided by information at the image-patch level, while ours uses contrastive learning to learn features guided by information at the WSI level. We will include these references and discuss the similarities and differences between these methods and ours. 2) We have compared our method with those built upon self-supervised contrastive learning, Transformers, and graph-based techniques, since they are widely used. More comparison results with cross-validation, particularly against methods built upon contrastive learning, will be provided in the final version. 3) Our method achieved remarkable accuracy, even though its AUC value on the lung dataset was not the best. 4) Semantic information refers to the visual patterns of the WSIs (e.g., shape, color, common patterns), whereas WSI class information means specific disease-related information (e.g., diseased and normal regions).

R5: 1) We will update the results with cross-validation and provide statistical significance test results. 2) For a fair comparison, we trained all models using the code released by the respective authors with consistent training/testing split settings, i.e., all the methods under comparison were trained and tested on the same datasets. 3) We will improve the manuscript and correct all typos.




Meta-Review

Meta-review not available, early accepted paper.


