Abstract

In the field of computational histopathology, both whole slide images (WSIs) and diagnostic captions provide valuable insights for making diagnostic decisions. However, aligning WSIs with diagnostic captions presents a significant challenge. This difficulty arises from two main factors: 1) gigapixel WSIs are unsuitable for direct input into deep learning models, and the redundancy and correlation among patches demand more attention; and 2) authentic WSI diagnostic captions are extremely limited, making it difficult to train an effective model. To overcome these obstacles, we present PathM3, a multimodal, multi-task, multiple instance learning (MIL) framework for WSI classification and captioning. PathM3 adapts a query-based transformer to effectively align WSIs with diagnostic captions. Given that histopathology visual patterns are redundantly distributed across WSIs, we aggregate patch features with an MIL method that considers the correlations among instances. Furthermore, PathM3 overcomes the scarcity of WSI-level captions by leveraging limited WSI diagnostic caption data through multi-task joint learning. Extensive experiments showing improved classification accuracy and caption quality demonstrate the effectiveness of our method on both the WSI classification and captioning tasks.
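
For readers unfamiliar with correlation-aware MIL aggregation, the following is a minimal PyTorch sketch of the idea described above, assuming pre-extracted patch features and a TransMIL-style self-attention layer; the module and variable names are illustrative, not taken from the paper:

    import torch
    import torch.nn as nn

    class CorrelatedMILAggregator(nn.Module):
        # Illustrative bag-level aggregator: self-attention models correlations
        # among patch instances before pooling them into one slide embedding.
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, patch_feats):
            # patch_feats: (batch, n_patches, dim) pre-extracted patch embeddings
            corr, _ = self.attn(patch_feats, patch_feats, patch_feats)
            x = self.norm(patch_feats + corr)  # residual connection + layer norm
            return x.mean(dim=1)               # (batch, dim) bag-level feature

    # Example: one WSI bag of 1,000 patch features
    bag = torch.randn(1, 1000, 512)
    slide_feat = CorrelatedMILAggregator()(bag)  # shape: (1, 512)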

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3991_paper.pdf

SharedIt Link: https://rdcu.be/dY6iO

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72083-3_35

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3991_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

https://zenodo.org/records/6550925

BibTex

@InProceedings{Zho_PathM3_MICCAI2024,
        author = { Zhou, Qifeng and Zhong, Wenliang and Guo, Yuzhi and Xiao, Michael and Ma, Hehuan and Huang, Junzhou},
        title = { { PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {373--383}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a multimodal, multi-task, multiple instance learning framework for WSI classification and captioning, named PathM3.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Extensive experiments have been conducted to demonstrate the effectiveness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) For the classification task, both the image and the caption are sent to the network. In a real-world scenario, do we need a classification model when we already have the caption written by the pathologist? 2) The caption T = {T1, T2, . . . , TN} is defined but never explained or mentioned again. How is the caption used in the network? 3) Typo in Eq. 4: is LR actually LG?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) The authors need to explain the motivation for designing the multi-modal prediction task. 2) Explain in detail what happens after the caption is input. 3) Check for typos.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea seems reasonable and the experiments are quite solid.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces a multi-modal multi-task framework to both classify and caption whole slide images (WSIs). Classification and captioning being the two tasks of interest. The paper proposes a way to fuse image and text information together as well an approach to utilize WSI captions in a more efficient way.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • While others have attempted classification and/or captioning of WSIs in the past, the captions in this interesting work are more specific (not single-word captions) and expert-provided (not LLM-generated).
    • The paper clearly explains the challenges that are unique to WSIs and may not exist for natural images, making it easier for the reader to appreciate the work that has been done.
    • An ablation study at the inference stage is conducted, showing how the incorporation of different modules of the framework affects performance, which helps demonstrate the usefulness of the proposed approach. It also shows how differently the models perform depending on which data modalities are available at inference.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some of the shortcomings of other WSI captioning methods could be explained in slightly more depth. In particular, why are some methods limited to captioning at the patch level rather than the WSI level?
    • A little more explanation of how the proposed method differs from MIL methods that work at the bag level (ABMIL, TransMIL), which are mentioned in the same paragraph of Section 2, would help. Currently, it is not clear enough what makes ABMIL and TransMIL different from the proposed method.
    • I suspect there is a bit of a flaw in the problem formulation. For the classification task, if the model is trained to accept both image and text information as input, the text input might be so strong or useful that the image information becomes irrelevant, because text classification is an easier problem than image classification. I am not sure the classification task tells us much if we are feeding in the caption T. Effectively, how do we know, for the classification problem, that the image information is contributing to the predictions? Classifying text is easier than classifying images because it is easier to encounter words (in a caption) that are associated specifically with only one class.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Public dataset used (specifically the PatchGastric dataset introduced by Tsuneki and Kanavati).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • In Fig. 1, is the caption input supposed to be the textual embeddings?
    • Please clarify what aspect of the correlation module actually captures correlations.
    • In Equation 2, Q and K are not defined. Do they have to do with the query matrix and the key matrix of the query-based transformer?
    • The Problem Formulation subsection and the Multi-Task Joint Learning subsection seem to imply slightly different things with regard to the use of text input in the classification task. Is text used as input for classification at inference?
    • LR and LG on page 5 seem to refer to the same thing, the generative loss. Please use one of them consistently throughout. If they are different, please define LR, because I could not find a definition for it.
    • How do you determine the value of alpha in Equation 4?
    • With regard to Table 4, I find it fascinating that, on their own, correlation and multi-task lead to only very minimal increases in BLEU and METEOR (which, as metrics, are more similar to each other than either is to SPICE), but together there is significant improvement. Is there an intuitive explanation for why this happens?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-written paper that proposes a way to classify and caption WSIs while taking the characteristics of WSIs into consideration (patches of WSIs are related, etc.) through the correlation module. However, the point of using text in image classification and how ABMIL and TransMIL differ from the proposed method warrant further clarification (through a rebuttal); hence I have currently set the score to 4, Weak Accept.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The manuscript proposes PathM3, a multi-modal, multi-task, multiple instance learning framework for histopathology image analysis, exploring the alignment between WSIs and the corresponding diagnostic captions. PathM3 aggregates patch features with an MIL method that evaluates correlations among instances. Furthermore, PathM3 can leverage the limited WSI diagnostic captions in multi-task joint learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed PathM3 efficiently fuses WSI-level images and captions via multi-modal alignment. The usage of limited caption data is also practical for real-world clinical applications. Compared with the baseline models, PathM3 achieves good performance on multiple tasks. Systematic experiments are conducted to demonstrate the performance of PathM3 and the effect of each module design.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses are summarized below; please view the detailed comments in “10. detailed and constructive comments”:

    1. More multi-class evaluation metrics could be added to Table 1 and Table 2.
    2. More details should be added to improve the clarity of the methodology.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Beyond accuracy, more evaluation metrics, such as micro/macro F1-score, could be included in Table 1 and Table 2 to better evaluate the 3-subtype classification.
    2. I am curious about the detailed subtype classification results. Does the model perform consistently well across all subtypes, or does it tend to perform well on only some subtypes?
    3. Several details should be added to improve the clarity of the methodology: (1) What is alpha in Eq. (4), and how is its value determined to balance the two different losses? (2) How is the input query determined, and how is the corresponding query embedding obtained?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written, and the capability of the proposed model has been demonstrated through several experiments with multiple tasks. The experiment designs are reasonable, and the baseline inclusion is sufficient. Compared with the baseline models, PathM3 can achieve a good performance on multiple tasks via multimodal clinical data.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors propose a novel multi-task algorithm that leverages both a multiple instance learning algorithm and a large language model in the digital pathology domain.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The correlation module proposed in this pipeline is innovative.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The codebase, parameter/hyperparameter settings, and infrastructure information are missing, which raises reproducibility concerns.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors propose a novel multi-task algorithm that leverages both a multiple instance learning algorithm and a large language model in the digital pathology domain.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Innovative approach.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely appreciate the constructive feedback from all the reviewers. Our responses to the raised concerns are as follows:

For our problem formulation and contribution: Our method addresses a realistic scenario where only limited captions are available. Therefore, we employ the multi-task setting during training, where images are used as the input, with captions serving as ground truth. During inference in this scenario, images alone are used as input (R5). Additionally, we input both images and captions (textual embeddings) only when both are available, which represents the upper bound of our method (R3, R5).

Regarding the shortcomings of previous methods, as stated in Section 1, reliable WSI diagnostic captions require specialized pathologists and face privacy concerns, leading to a limited number of captions for training models. Therefore, previous methods gather data from textbooks and the internet, which only yield patch-text pairs. We would like to clarify that in Section 2, when we state "Our work is distinct from these works as it seeks alignment at the bag level," we mean that our proposed approach differs from instance-level image-text alignment methods such as CITE and MI-Zero, as our method achieves image-text alignment at the bag level.

For our methodology: the input query consists of learnable vectors, which are initialized with the pretrained weights from BLIP-2 (R4). We employ a self-attention mechanism to establish correlations, as demonstrated by TransMIL (R5). Alpha is a hyperparameter used for loss balancing (R4, R5); we experimented with values ranging from 0.2 to 0.8 to find the best one. In Equation 2 (R5), K and Q follow the standard attention mechanism, and we use the Nyström attention approximation instead of regular attention. The correct notation in Equation 4 should be "LG"; we apologize for the typos and will correct them in the final version (R3, R5).
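
As a hedged illustration of the multi-task objective described in this response (a classification loss plus alpha times the generative loss LG, with alpha swept from 0.2 to 0.8), the following PyTorch sketch shows one plausible form; the function and variable names are assumptions, not the authors' code:

    import torch.nn.functional as F

    def multitask_loss(cls_logits, cls_labels, cap_logits, cap_tokens, alpha=0.5):
        # L_C: cross-entropy over WSI subtype predictions
        l_c = F.cross_entropy(cls_logits, cls_labels)
        # L_G: token-level cross-entropy for caption generation
        # cap_logits: (batch, seq_len, vocab); cap_tokens: (batch, seq_len)
        l_g = F.cross_entropy(cap_logits.flatten(0, 1), cap_tokens.flatten())
        # alpha balances the two objectives; the rebuttal reports tuning it in [0.2, 0.8]
        return l_c + alpha * l_g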

For the experiments: Regarding the significant improvement from combining the multi-task and correlation components (R5), we propose that correlation learns spatial redundancies and contextual relationships, which identify the key diagnostic areas. The multi-task setting further leverages subtype labels to locate ROIs. Together, these enhancements enable significant progress on the captioning task by focusing on diagnostically relevant areas. Regarding the image and text contributions to the results, Table 3 in our ablation studies indicates that using both image and text data during training improves accuracy by at least 4.81% compared to using only text input (R5). The AUC of our method is 85.22 (0.24), and the per-subtype accuracies are: "moderately differentiated tubular adenocarcinoma" 67.93 (3.27), "poorly differentiated adenocarcinoma" 63.22 (6.53), and "well differentiated tubular adenocarcinoma" 79.06 (3.35) (R4).




Meta-Review

Meta-review not available; early accepted paper.


