Abstract

Self-supervised representation learning has been highly promising for histopathology image analysis, with numerous approaches leveraging the patient-slide-patch hierarchy of such images to learn better representations. In this paper, we explore how combining domain-specific natural language information with such hierarchical visual representations can benefit rich representation learning for medical image tasks. Building on automated language description generation for features visible in histopathology images, we present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS), for histopathology images. We explore contrastive objectives and granular, language-description-based text alignment at multiple hierarchies to inject language modality information into the visual representations. Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets. Our framework also provides better interpretability through its language-aligned representation space. The code is available at https://github.com/Hasindri/HLSS.



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0460_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0460_supp.pdf

Link to the Code Repository

https://github.com/Hasindri/HLSS

Link to the Dataset(s)

https://opensrh.mlins.org

https://www.cancer.gov/ccg/research/genome-sequencing/tcga

BibTex

@InProceedings{Wat_Hierarchical_MICCAI2024,
        author = { Watawana, Hasindri and Ranasinghe, Kanchana and Mahmood, Tariq and Naseer, Muzammal and Khan, Salman and Shahbaz Khan, Fahad},
        title = { { Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes the novel concept of hierarchical text-to-vision SSL for slides, which are characterized by their gigapixel size and hierarchy. It resorts to LLMs to generate dataset-specific granular characteristic descriptions. A language-guided framework is also proposed to encourage the model to learn multimodal information.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Hierarchical multimodal learning is intriguing and worth exploring. The paper uses LLMs to generate dataset-specific descriptions, which can inspire visual-language learning in the histopathology field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the topic is intriguing, some details are missing and thus the paper is somewhat confusing.

    1. In terms of the text generation, what is the prompt given to the LLM? Was hallucination considered? What’s more, what is the advantage of dataset-specific descriptions compared to sample-specific descriptions? Intuitively, sample-specific text benefits the model more because it can provide more fine-grained and heterogeneous information.
    2. In terms of the hierarchical visual learning, it seems that the feature z is directly processed by three MLPs to generate three new features, as shown in Fig. 2. I think this is unreasonable and cannot support the concept of hierarchical learning.
    3. In terms of the experiments, what training set is used for the baselines? SimCLR was first adopted for natural images. Here, do you directly use a SimCLR-trained backbone or reproduce it in the medical domain? In addition, HIPT is not included in the baselines.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I think hierarchical multimodal learning is intriguing in the pathology field. The training framework needs further improvement to better learn hierarchical information. The details of the text generation are missing and the advantage of the dataset-specific descriptions needs to be explained.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The training framework is not so convincing and some key details are missing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The missing details and insufficient experiments make the paper confusing.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a novel pre-training method for pathological images, which utilizes a multi-level structure to align visual and linguistic information, enhancing accuracy on downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The writing of the paper is good.
    2. The paper utilizes hierarchical information to guide the model to obtain feature representations of the entire sample at the patch level of images.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Only kNN was used as the calculation method for downstream task accuracy, and no more methods were employed for ablation.
    2. Only classification tasks were used as the scenario for downstream task experiments, which is insufficient.
    3. The authors did not clearly explain how they obtained text information at different levels through the LLM, which is a point of confusion.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As work to enhance the capabilities of pre-trained models, the authors should conduct more experiments in diverse downstream task scenarios.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See the strengths and weaknesses above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a new vision-language self-supervised learning framework called Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images. It proposes learning text-vision alignment at the patch, slide, and patient levels, and shows how these self-supervised representations can be used to achieve strong kNN performance on two downstream classification datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is the first to exploit the hierarchical nature of pathology information in a VLM setup.
    2. The hierarchical vision-text alignment and the contrastive vision objectives are well motivated.
    3. The results on both SRH and TCGA show promise. The ablations neatly unpack the role of the different proposed parts of the approach.
    4. The interpretability analysis and inclusion of negative control tests for sanity checking are convincing and interesting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The use of an MLP for patch, slide and patient features in the PPM module is a slightly awkward formulation, as it makes the output permutation dependent. Depending on how the patches from the slide are arranged after the reshape operation, z_patch, z_slide and z_patient can end up producing different outputs just because of a difference in ordering. Multiple instance learning or any other permutation-invariant function should be preferred in such situations.
    2. The baselines used in the paper are claimed to be state-of-the-art SSL methods for pathology, but no datasets or SSL methods specific to pathology have been included for comparison. Some examples include [1, 2, 3, 4].
    3. Not a weakness, but more of a question. In equation 6, the loss L_HA is defined as the KL divergence between the visual embedding z and the best matching text embedding from each hierarchy. Is this a typo, with the correct expression being KL(z′_slide, t′_patient), or is this correct and the KL for all hierarchies is computed with the same z?

    [1] Kang, Mingu, et al. “Benchmarking self-supervised learning on diverse pathology datasets.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
    [2] Filiot, Alexandre, et al. “Scaling self-supervised learning for histopathology with masked image modeling.” medRxiv (2023): 2023-07.
    [3] Chen, Richard J., et al. “Towards a general-purpose foundation model for computational pathology.” Nature Medicine 30.3 (2024): 850-862.
    [4] Vorontsov, Eugene, et al. “Virchow: A million-slide digital pathology foundation model.” arXiv preprint arXiv:2309.07778 (2023).
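
The permutation concern in weakness 1 can be illustrated with a small sketch. It assumes, as the reviewer's reading does, that patch embeddings are concatenated in a fixed order before a projection layer; this is not confirmed to be how the PPM module works. The helpers `linear`, `flatten` and `mean_pool`, the random weights and the toy dimensions are all hypothetical. The sketch contrasts a projection over concatenated patches with a projection over a mean-pooled (permutation-invariant) summary.

```python
import random

def linear(weights, x):
    """Apply a bias-free linear layer; weights is a list of rows, x a flat list."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def flatten(patches):
    """Concatenate patch embeddings in the order given (stand-in for a reshape)."""
    return [v for patch in patches for v in patch]

def mean_pool(patches):
    """Average patch embeddings element-wise; patch order does not matter."""
    n = len(patches)
    return [sum(col) / n for col in zip(*patches)]

def close(a, b, tol=1e-9):
    """Element-wise comparison with a small tolerance for float rounding."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

random.seed(0)
dim, n_patches, out_dim = 4, 3, 2  # toy sizes, chosen only for illustration

# Hypothetical patch embeddings from one slide, and the same set reordered.
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
shuffled = [patches[2], patches[0], patches[1]]

# One projection over the concatenation, one over the pooled summary.
w_concat = [[random.gauss(0, 1) for _ in range(dim * n_patches)] for _ in range(out_dim)]
w_pool = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(out_dim)]

out_concat = linear(w_concat, flatten(patches))
out_concat_shuffled = linear(w_concat, flatten(shuffled))
out_pool = linear(w_pool, mean_pool(patches))
out_pool_shuffled = linear(w_pool, mean_pool(shuffled))

concat_is_order_dependent = not close(out_concat, out_concat_shuffled)
pool_is_order_invariant = close(out_pool, out_pool_shuffled)
print(concat_is_order_dependent, pool_is_order_invariant)
```

Mean pooling is only one permutation-invariant choice; attention-based multiple instance learning pooling would serve the same purpose while keeping learnable weights over patches.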

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The paper needs more thorough and convincing benchmarking on different datasets and problems. The metrics should go beyond kNN performance and show downstream performance improvement compared to non-VLM state-of-the-art baselines from previous papers.
    2. The authors should improve Figure 1 to clarify how the visual and language hierarchies are formulated. Currently, the visual image shows a single high-resolution patch.
    3. Please clarify your thoughts on points 1 and 3 from the weakness section above.
    4. Please fix a small typo/missing text on Page 6, first line. “for details on exact prompts, example descriptions, and manual inspection process.”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a new hierarchical VLM formulation in pathology which is a relevant line of work. The formulation is well motivated and shows promising results. There are some concerns regarding the thoroughness of the evaluation in terms of datasets, tasks, and baselines. There’s also some scope for improvements in the PPM module formulation. I will wait for the author’s rebuttal to clarify these questions.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

R1 Formulation of PPM: During the reshape operation, the order of patch representations is kept consistent across hierarchical levels. z is simply reshaped differently at each layer to indicate the number of independent entities and positively paired patches. The motivation behind using a different projection layer for each level is to learn a separate secondary feature space.

More baselines: We would like to acknowledge the previous works mentioned. Virchow and UNI are pretrained on large in-house data. Since Virchow is not publicly available, we compare our model to UNI. UNI achieves a lower linear accuracy on SRH than on the NCT-CRC dataset, which consists of H&E images. Notably, many large vision models in histopathology are trained on H&E stained images, restricting their application to other image modalities like SRH. While we focused on building a model for SRH data, we also demonstrated how it can perform equally well on H&E stained image data using TCGA. The most notable prior works that explore hierarchy are HiDisc (Cheng et al.) and HIPT (Chen et al.). Since we closely follow the visual hierarchical concept introduced in HiDisc, we included it as our baseline.

Confusion regarding the use of the same patch representation (z) in L_HA: This is not a typo. The KL divergence between the most aligning text vector and the visual representation is calculated using the same z at all levels. Unlike our text hierarchy, which is formed of separately curated granular descriptions for each level, the visual hierarchy is always formed on patch-sized views. We traverse the visual hierarchy by altering the count of positively paired patches based on a common origin.

Multiple reviewers mentioned the lack of diversity in downstream datasets, tasks and metrics: To address the lack of comparison to baselines, we compared HLSS with UNI. We also achieve significant results on kNN, linear and out-of-distribution classification on multiple downstream datasets.

How visual and language hierarchy is formed: The granular text descriptions for each level are separately curated to form the language hierarchy by describing the dataset-specific visual characteristics at the given granularity. The input to the visual encoder is a patch-sized image. Some prior works have explored visual hierarchy using a pyramid of visual encoders, which is highly computationally expensive. We take inspiration from HiDisc and explore the visual hierarchy using a slightly different positive pairing mechanism for each level. This formulation integrates well with contrastive objectives. While we only utilise patch-sized visual inputs, they are paired in a level-specific manner to create a visual contrastive objective unique to each level.

R3 We utilised multi-stage prompting: Provide a list of visual attributes present in Stimulated Raman Histology (SRH) images of brain tumor patients. These visual attributes should be categorised in relation to their granularity, based on whether they represent patch level features (cellular level), slide level features (tissue characteristics) or patient level features (holistic view). Provide 3 separate lists for patch, slide and patient level features, each containing 128 non-overlapping visual attributes. For each of the above patch level features, provide four sentences, each describing a SRH brain tumor image with respect to the corresponding feature. These prompts are repeated for the other levels. After obtaining the visual attributes, they were manually inspected by an expert to remove hallucinations. The main motivation behind using dataset-specific descriptions instead of sample-specific descriptions is to avoid noise in the language data. We first adapt SimCLR to our domain.

R4 Beyond classification, we have provided interpretability tests in the supplementary material. They are conducted using unseen biomarkers for tumor classes of SRH, obtained from a pathologist.
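
The level-specific positive pairing described in the rebuttal can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a batch laid out as patients × slides × patches × augmented views (in the spirit of HiDisc), and the function names `group_ids` and `positives_per_anchor` and the toy batch sizes are hypothetical. Every input is a patch-sized view; only the grouping that defines positives changes with the level.

```python
from itertools import product

def group_ids(n_patients, n_slides, n_patches, n_views):
    """Return, for each hierarchy level, one group id per view in the batch.

    Views that share a group id at a level are treated as positives there.
    """
    ids = {"patch": [], "slide": [], "patient": []}
    for p, s, k, _ in product(range(n_patients), range(n_slides),
                              range(n_patches), range(n_views)):
        ids["patch"].append((p, s, k))   # same patch, different augmentation
        ids["slide"].append((p, s))      # any patch from the same slide
        ids["patient"].append(p)         # any patch from the same patient
    return ids

def positives_per_anchor(level_ids):
    """Count how many other views share a group with the first view."""
    anchor = level_ids[0]
    return sum(1 for other in level_ids[1:] if other == anchor)

# Toy batch: 2 patients, 2 slides each, 2 patches per slide, 2 views per patch.
ids = group_ids(n_patients=2, n_slides=2, n_patches=2, n_views=2)
counts = {level: positives_per_anchor(level_ids) for level, level_ids in ids.items()}
print(counts)  # {'patch': 1, 'slide': 3, 'patient': 7}
```

The number of positives per anchor grows from patch to slide to patient level, which matches the rebuttal's description of altering the count of positively paired patches based on a common origin.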




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mostly negative feedback. I agree with the reviewers that this paper proposes an interesting self-supervised learning framework for hierarchical text-to-vision alignment in histopathology. However, the implementation lacks thorough benchmarking against pathology-specific datasets and state-of-the-art methods. Moreover, the clarity of methodological details and experimental settings needs improvement for better reproducibility and validation of the claimed benefits. Therefore, the paper currently falls short in substantiating its contributions and the recommendation is towards negative.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers and AC acknowledge the importance and value of hierarchical text-vision alignment for pathology image representation learning. However, as pointed out by the reviewers, this paper lacks necessary technical description and experimental results. Moreover, generating the text from an LLM rather than from histopathology reports is also questionable.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed reviews, and the criticism relates to the insufficient benchmarking and the relatively early stage of development and evaluation of the approach. This meta reviewer argues that the paper makes a valuable contribution despite its limitations. In particular, the reviewers are in general consensus that there is a valuable contribution in the way the authors explore a vision-language self-supervised learning framework focused on histopathology images. The authors should improve the clarity of the presentation and highlight limitations in their discussion.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



