Abstract

Recently, vision-language pre-trained models have emerged in computational pathology. Previous works generally focused on the alignment of image-text pairs via the contrastive pre-training paradigm. Such pre-trained models have been applied to pathology image classification in a zero-shot learning or transfer learning fashion. Herein, we hypothesize that pre-trained vision-language models can be utilized for quantitative histopathology image analysis through a simple image-to-text retrieval. To this end, we propose a Text-based Quantitative and Explainable histopathology image analysis, which we call TQx. Given a set of histopathology images, we adopt a pre-trained vision-language model to retrieve a word-of-interest pool. The retrieved words are then used to quantify the histopathology images and generate understandable feature embeddings due to the direct mapping to the text description. To evaluate the proposed method, the text-based embeddings of four histopathology image datasets are utilized to perform clustering and classification tasks. The results demonstrate that TQx is able to quantify and analyze histopathology images with performance comparable to that of the prevalent visual models in computational pathology.
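
As a rough illustration of the retrieval step described in the abstract, the following Python sketch ranks a word pool by cosine similarity to an image embedding and builds a similarity-weighted embedding from the top matches. It is not the authors' implementation: the random vectors stand in for the outputs of a pre-trained vision-language model, and the function name, the softmax weighting, the toy word list, and the value of k are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def text_based_embedding(image_emb, word_embs, words, k=3):
    """Hypothetical helper: describe an image by its k most similar words.

    Returns the weighted sum of the top-k word embeddings together with
    the (word, weight) pairs that produced it, which is what makes the
    resulting embedding readable.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(word_embs)
    sims = txt @ img                          # cosine similarity per word
    top = np.argsort(sims)[::-1][:k]          # indices of the k best matches
    # Assumption: softmax over the top-k similarities as combination weights.
    weights = np.exp(sims[top] - sims[top].max())
    weights = weights / weights.sum()
    embedding = weights @ txt[top]            # weighted sum of word embeddings
    explanation = [(words[i], float(w)) for i, w in zip(top, weights)]
    return embedding, explanation


# Toy data standing in for encoder outputs (assumption, not real features).
rng = np.random.default_rng(0)
words = ["gland", "necrosis", "stroma", "lymphocyte", "mucin"]
word_embs = rng.normal(size=(5, 8))
# An image embedding placed close to the "necrosis" word embedding.
image_emb = word_embs[1] + 0.1 * rng.normal(size=8)

embedding, explanation = text_based_embedding(image_emb, word_embs, words, k=3)
for word, weight in explanation:
    print(f"{word}: {weight:.3f}")
```

Because each embedding is returned with the words and weights that produced it, a reader can inspect which terms drive a given image's representation.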

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2481_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2481_supp.pdf

Link to the Code Repository

https://github.com/anhtienng/TQx

Link to the Dataset(s)

https://github.com/QuIIL/KBSMC_colon_cancer_grading_dataset

https://wsss4luad.grand-challenge.org/

https://iciar2018-challenge.grand-challenge.org/Dataset/

https://figshare.com/projects/nmi-wsi-diagnosis/61973

BibTex

@InProceedings{Ngu_Towards_MICCAI2024,
        author = { Nguyen, Anh Tien and Vuong, Trinh Thi Le and Kwak, Jin Tae},
        title = { { Towards a text-based quantitative and explainable histopathology image analysis } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper hypothesizes and validates the potential of pre-trained vision-language models for quantifying histopathology images. The text-based image embeddings can be associated with human-readable histopathologic terms, thus achieving interpretability. The construction of the text pool is also being investigated.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    a. The exploration of the potential of a pre-trained vision-language model for quantifying histopathology images is interesting. b. The paper is well-organized and easy to understand.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    a. The text-based interpretation is inherent in pre-trained vision-language models, which align visual features with text. b. The study limits itself to a single pre-trained vision-language model without exploring other VLMs. c. A discussion of how the proposed word-of-interest pool relates to the original text used for training VLMs is missing.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The experimental settings are very detailed, which should ensure the reproducibility of the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    a. How do you explain that the silhouette coefficients of the ground truth clusters are smaller than those of the visual/text embedding clusters shown in Fig. 2? b. Is interpretability related to the knowledge granularity of the text in the pre-trained model? Do other pre-trained vision-language models have these kinds of properties? c. How does combining image embedding and text-based image embedding affect classification performance? d. What about the original text-based embedding from the pre-trained VLM? Does it already provide a better interpretation than the text from the proposed WoI pool? e. The color of Fig. 1 is too light to read the content. (minor comment)

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The exploration of text-based interpretation is interesting. The paper expands the interpretability capabilities of multi-modality models from more standardized clinical texts.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    Building on the foundation of vision-language pre-trained models in the field of computational pathology, the authors demonstrate that text-based embeddings can effectively quantify histopathology images, enhancing their explainability. To validate the effectiveness of these features, the authors employ them in clustering and classification tasks, showcasing their practical utility in distinct applications within the domain.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors conduct comprehensive experiments to substantiate the hypothesis that Vision-Language Models (VLMs) can effectively quantify histopathology images. The study reveals that text-based embeddings can achieve performance comparable to that of visual embeddings in classification tasks. This finding is noteworthy as text-based features not only parallel the effectiveness of visual counterparts but also offer enhanced explainability.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A notable weakness of the paper is the handling of the ‘word-of-interest’ pool and the filtered ‘WoI pool’. Even in its reduced state, the smallest pool contains over 2,000 keywords, which likely includes abundant and noisy labels. A more specific and refined ‘word-of-interest’ pool tailored to the dataset would significantly enhance the quality of the results. In terms of classification, the paper would benefit from demonstrating how the combination of text and visual embeddings affects performance, providing a more comprehensive understanding.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Utilizing a Large Language Model (LLM) to generate keywords represents a promising direction to consider as a replacement for the current ‘word of interest’ pool.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall score for this paper was primarily influenced by its findings and analysis, and the comprehensive experiments conducted to validate them.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces TQx, a text-based framework for quantitative and explainable histopathology image analysis using pre-trained vision-language models (VLMs). TQx harnesses an image-to-text retrieval process to generate understandable feature embeddings, enabling both clustering and classification tasks. The authors evaluate TQx on four histopathology image datasets and demonstrate its ability to deliver comparable performance to conventional visual models while providing interpretability through human-readable keywords.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The framework is evaluated on four different histopathology datasets, showing capabilities across a diverse set of conditions. The model provides a direct mapping between text-based features and medical terms, improving the interpretability of results for clinical professionals and researchers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Explainability Not Fully Addressed: The paper does not provide a comprehensive explanation of the framework’s explainability aspects. More information is needed to justify why it is labeled as explainable and to demonstrate how the model improves downstream tasks. The authors don’t compare TQx with other studies on the specific datasets, which would provide a clearer understanding of its comparative advantages and limitations.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Include comparisons with existing VLMs and relevant non-VLM approaches. Such analysis will give a better perspective on TQx’s relative strengths and weaknesses. Conduct an ablation study or investigate more in depth to evaluate the text encoder’s importance and enhance the study’s emphasis on explainability.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    • Lack of comparison to existing vision-language models (VLMs) or non-VLMs on the specific datasets
    • Lack of explainability evaluation

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The primary focus of this paper lies in introducing a novel methodology for text-based, quantitative, and explainable histopathology image analysis, primarily reliant on VLMs. This innovative approach combines the principles of image retrieval with VLMs to enhance the interpretability of the analysis. To validate the effectiveness of the proposed method, comprehensive testing was conducted across four histopathology datasets. The method was evaluated on both clustering and classification tasks, showcasing the robustness and applicability of the approach across varied contexts. By seamlessly integrating text-based methodologies with advanced image analysis techniques, this paper contributes significantly to the advancement of histopathology image analysis, offering a promising avenue for improved understanding and interpretation within the field.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper’s main strength lies in using VLMs for health issues, which is a fresh approach. Additionally, it addresses two major challenges: combining vision and language models for clustering and classification in histopathology, which is crucial for diagnosing cancer accurately. Their thorough evaluation across four datasets adds weight to their findings, making them more reliable. This comprehensive testing ensures that the proposed methods are effective and applicable across various scenarios, which is important for real-world use.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I found this paper quite captivating and well-structured. However, there are a couple of areas that could be improved. Firstly, the methodology could benefit from more detailed explanation to enhance clarity and understanding. Secondly, the experiment design could be made clearer, providing a better framework for interpreting the results. Improving these aspects would enhance the overall quality and accessibility of the paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors should make the code publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, I found this paper quite captivating and well-structured. However, there are a couple of areas that could be improved upon. Firstly, I believe that providing a more detailed explanation of the methodology would enhance clarity and understanding for readers. Secondly, clarifying the experiment design would provide a better framework for interpreting the results. Addressing these aspects would significantly enhance the overall quality and accessibility of the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The primary factors include the novelty of the problem and the utilization of various datasets to assess the proposed model.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

  1. The reviewers questioned why the silhouette coefficients of the ground truth cluster are smaller than those of the visual/text embedding cluster shown in Fig. 2: We would like to clarify that the silhouette coefficients for the ground truths are computed by first clustering the samples using either visual or text embeddings and then re-assigning the samples to the ground truth class labels. Hence, the clustering is still based upon the visual or text embeddings. These embeddings are obtained without utilizing the ground truth class labels. The difference in the silhouette coefficients indicates that the clusters by the visual or text embeddings include differing class labels, which is shown in Fig. 3. By definition, tissues belonging to the same class label share common histologic properties. However, borderline cases exist where tissues possess heterogeneous characteristics that can be related to multiple class labels. The fact that the visual and text embeddings achieve higher silhouette scores demonstrates that both embeddings are able to find the common characteristics among tissues that are slightly different from the ground truth labels. The visual and text embeddings can be understood as an alternative interpretation and explanation of the tissue samples, potentially providing more fine-grained information, e.g. 100 relevant terms in comparison to 4 class labels. 

  2. The reviewers asked whether the interpretability of our work is related to the knowledge granularity of the text in the pre-trained model and whether other pre-trained vision-language models have these kinds of properties: The interpretability of the text-based embedding is based on the direct generation from natural-language pathology terms. In particular, we know exactly which terms are used to produce the text-based embedding as well as the weights in the combination that are based on the similarity scores. Therefore, any pre-trained vision-language model can be used without losing the explainability.

  3. The reviewers raised a question on the effect of combining image embedding and text-based image embeddings on the classification performance: The combination of visual and text-based embeddings slightly improves the classification performance. However, the combination has two main drawbacks. Firstly, additional weights are required to combine these two embeddings. Secondly and more importantly, the combined embedding loses the interpretability of the proposed text-based image embedding. 

  4. The reviewers asked whether the original text-based embedding from the pre-trained VLM already provides the interpretability that the proposed method suggests: The embeddings generated by the pre-trained VLMs are difficult to interpret due to their numeric form, which is why the traditional visual embeddings are unexplainable. Therefore, direct mapping from the embeddings to human-readable texts is required to understand the embeddings fully.

  5. The reviewers asked to provide a more detailed explanation of the methodology and experiment design to improve the readability of our work: In response to the reviewers’ comments, we will update the Methodology section to clarify the procedures and to improve the understanding and readability of our work in the final manuscript.

  6. The reviewers suggested making comparisons with existing VLMs and relevant non-VLM approaches and ablation experiments to provide an in-depth evaluation of our work: We appreciate the reviewers’ comments. Due to the MICCAI policy, we cannot provide additional experiments and results. We will leave these for future study.

  7. The reviewers asked to change the color of Fig. 1 since it is too light to read the content: We will update the color of Fig. 1 in the final manuscript.
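
The silhouette comparison discussed in the author feedback can be illustrated with a small sketch. It scores the same embeddings under two labelings: cluster assignments derived from the embeddings, and ground-truth labels in which one borderline sample carries the other class's label. The toy 2-D points, the labels, and the helper function are hypothetical, and the silhouette is computed by hand with NumPy rather than with whatever tooling the authors used.

```python
import numpy as np


def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distances.

    For each sample: a = mean distance to the other members of its own
    group, b = smallest mean distance to any other group, and the score
    is (b - a) / max(a, b). Samples alone in their group score 0.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    groups = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton group: score stays 0
        a = dist[i, same].sum() / (n_same - 1)  # excludes the zero self-distance
        b = min(dist[i, labels == g].mean() for g in groups if g != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


# Toy embeddings forming two tight groups (assumption, not real features).
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
cluster_labels = [0, 0, 0, 1, 1, 1]  # grouping found in the embedding space
gt_labels = [0, 0, 1, 1, 1, 1]       # one borderline sample labeled as class 1

s_cluster = silhouette(X, cluster_labels)
s_gt = silhouette(X, gt_labels)
print(f"embedding clusters: {s_cluster:.3f}, ground-truth labels: {s_gt:.3f}")
```

In this toy setting the ground-truth labeling scores lower simply because one sample sits with the other group in embedding space, which mirrors the borderline-case argument in the rebuttal.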




Meta-Review

Meta-review not available, early accepted paper.


