Abstract

Histopathology serves as the gold standard in cancer diagnosis, with clinical reports being vital in interpreting and understanding this process, guiding cancer treatment and patient care. The automation of histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden on pathologists in report writing. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two delicately designed modules, aiming to boost report generation by aligning whole slide images (WSIs) and diagnostic reports at both local and global granularities. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and corresponding highly summarized reports. Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, the results of fine-tuning our model on cancer subtyping and survival analysis tasks further demonstrate superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset and code are available here.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0796_paper.pdf

SharedIt Link: https://rdcu.be/dY6iv

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72083-3_18

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0796_supp.pdf

Link to the Code Repository

https://github.com/dddavid4real/HistGen

Link to the Dataset(s)

https://github.com/dddavid4real/HistGen

BibTex

@InProceedings{Guo_HistGen_MICCAI2024,
        author = { Guo, Zhengrui and Ma, Jiabo and Xu, Yingxue and Wang, Yihui and Wang, Liansheng and Chen, Hao},
        title = { { HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {189 -- 199}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a new method for aggregating WSI patches while incorporating global and cross-modal context. The authors show promising results of this approach in WSI report generation and slide-level prediction tasks. They additionally create a dataset for WSI report generation and evaluate their and existing methods on it.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The local-global hierarchical pooling formulation is interesting and simplifies WSI aggregation, which is challenging due to the large number of patches.
    • The cross-modal context provides a memory bank for both modalities to query relevant context from.
    • The idea of using WSI report generation as a pre-training setup for learning slide-level representations is interesting and promising.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It's unclear if the prototypes are needed, as we could potentially use region-level representations. The details of how the query response from a small number of prototypes is aggregated back into the large number of patches are not clearly discussed.
    • While the LGH aggregator seems useful, it's unclear how it compares with other existing MIL aggregators such as HIPT, TransMIL, etc.
    • It's unclear how the #regions and #prototypes are chosen and how they impact the results.
    • According to the figure, the region-level encoder is applied twice (before and after the WSI-level encoder), but according to the equations it is only applied once, after the WSI-level encoder.
    • The authors compare their pre-trained ViT-L against ResNet-50, but it's unclear how it compares against other openly available pathology pre-trained ViT backbones such as Phikon (ViT-B), UNI (ViT-L), and Lunit (ViT-S).
    • For evaluation on slide-level tasks, is the LGH encoder frozen or fine-tuned while training a classifier on the pooled WSI representation? If fine-tuned, can the authors elaborate on the number of trainable parameters, as that could drive the improvements in results?

    Phikon: https://github.com/owkin/HistoSSLscaling
    UNI: https://github.com/mahmoodlab/UNI
    Lunit: https://github.com/lunit-io/benchmark-ssl-pathology

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors could expand more on the training details, modelling details, and hyper-parameters chosen. Maybe adding some pseudo-code for the LGH encoder and expanding on its interactions with the cross-modal module would be useful in clarifying details.

    • It's unclear what training dataset was used for pre-training.
    • The authors mention using word embeddings, but it's unclear which text encoder or word embeddings are used.
    • The model settings are unclear, as the hidden/embedding dimensions of the vision encoder (1024) differ from those of the decoder attention (512) and the context module (512×2048), so it's unclear how these hyper-parameters are chosen and how they interact.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper's formulation and use of WSI report generation for pre-training is interesting and novel. The creation of a new dataset for evaluation is useful. But the authors don't describe a lot of the details, and the paper needs more work to expand on and clarify the technical details and to compare against other openly available MIL featurizers and aggregators to understand the utility of the proposed approach. While I understand MICCAI doesn't allow including new experiments, I would be open to changing my recommendation if the other improvements and clarifications are made.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Based on the authors' comments, I am changing my decision to Weak Accept. The justification is the authors' clarifications on the feature extractor comparisons and the technical details.



Review #2

  • Please describe the contribution of the paper

    The authors curate a WSI-report dataset and propose local-global feature encoding and a cross-modal module as the report generator. Furthermore, after fine-tuning, the proposed model can outperform previous works on cancer subtyping and survival analysis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The automatic generation of pathology reports is intriguing. The proposed model also shows capability on cancer subtyping and survival analysis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The lack of description and analysis of the generated WSI-report dataset. What is the prompt given to the LLM to summarize and clean the reports? Are the descriptions summarized in the final report clinically relevant and correct? Considering LLM hallucination, the generated text may contain incorrect or unfaithful information, which impacts the reliability of a model trained on this dataset.

    2. Concern about the noisy correspondence between the WSI and the generated report. Reports from hospitals often include information not discernible from the images alone, such as ‘lymph node metastasis’ in Fig. 1 of the Supplementary Material. This raises concerns about the model’s ability to avoid generating unreasonable content.

    3. In Table 3, the model does not show a significant improvement on cancer subtyping compared to ABMIL, which is a relatively simple framework with fewer parameters.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The topic is interesting, but more in-depth analysis is required. The authors claim too many contributions, including a new dataset, a visual encoder, a cross-modal module, and a new pre-training method. Thus, many details are inevitably missing.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The concern about the dataset generation process with the aid of GPT, together with the missing details, makes the paper somewhat confusing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    The authors have too many experiments and details left to complete. I tend to reject.



Review #3

  • Please describe the contribution of the paper

    The authors introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation. They also introduce the first benchmark dataset for evaluation, curating a WSI-report dataset of around 7,800 pairs. They develop a local-global hierarchical encoder for efficient visual feature aggregation from a region-to-slide perspective and claim that the proposed model outperforms state-of-the-art (SOTA) models by a large margin.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Finding a good dataset for radiology, pathology, or histopathology report generation is always a challenge. The authors have curated a benchmark WSI-report dataset of around 7,800 pairs, which will definitely help other researchers with similar interests.
    2. The authors have developed a local-global hierarchical visual encoder for effective encoding and aggregation of extensive WSI patch features in a region-to-slide manner.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Manual evaluation by domain experts is not included in the paper.
    2. Statistical significance criteria are crucial for gaining confidence in the findings, yet they are not included in the publication.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Public access to the dataset and code is necessary to enable researchers to utilize them for further study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Can the authors provide statistical significance metrics that allow reviewers and readers to critically evaluate the study’s methodology and results? This promotes transparency and helps ensure the rigor of the research conducted.
    2. The authors should add a manual evaluation performed by domain experts. Manual evaluation metrics help validate the effectiveness of automated algorithms in accurately identifying and characterizing histopathological features. Researchers need to demonstrate that their algorithms produce results that are consistent with those obtained through manual evaluation by experts.
    3. It would be great if the authors gave one example of input and output in the main paper. It would help readers understand the problem statement clearly.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The paper is well written with clarity.
    2. The new dataset together with the novel model will definitely advance the research in this field.
    3. The paper can definitely be accepted if the authors address the reviewers’ comments.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all reviewers for their constructive comments. We will open-source our code and dataset upon acceptance to boost reproducible research. To Reviewer #1: R1.1: We designed the prototype learning module to select key tokens of the WSI for cross-modal interaction, addressing the redundancy and computational complexity of the long patch sequence. We used uniform sampling to select key tokens and aggregate them back with a cross-attention layer. R1.2: You mentioned we did not compare the LGH module with MIL aggregators. Since our modules are specifically proposed for the task of report generation, we mainly compare our model (LGH + CMC) with report generation models (Table 1). To verify that our model learned diagnosis-related information during this task, we compared the LGH with MIL aggregators on subtyping and survival analysis (Tables 3 and 4) by fine-tuning it with a learning rate of 1e-4, 8 attention heads, a hidden dimension of 512, dropout of 0.1, and the Adam optimizer. R1.3: We conducted ablation studies on the #region size (64, 96, 128, 256, 384, 512) and #prototypes (64, 128, 256, 512, 768). A #region size of 96 (with #prototypes fixed at 512) yields the best BLEU-4 score of 0.182, and #prototypes of 256 (with #region size fixed at 96) yields the best BLEU-4 score of 0.185. R1.4: Thank you for highlighting the equation ambiguities. We omitted the first region-encoder pass in the equations but mentioned it in the text; we will revise this for clarity. R1.5: We compared our DINOv2 ViT-L to several backbones on report generation under BLEU-4: ours (0.184), Phikon (0.178), PLIP (0.053), UNI (0.151), CONCH (0.077). Ours outperformed the others, with only Phikon (0.178) being comparable. R1.6: We make some minor clarifications. The dataset used for pre-training is our proposed TCGA WSI-report dataset. The text decoder is a 3-layer transformer. The DINOv2 backbone dimension is 1024, reduced to 512 by the LGH module, matching the CMC module input dimension (512 × 2048) for cross-attention and the text decoder attention (512).
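
To make R1.1 more concrete, below is a minimal PyTorch sketch of the described idea: uniformly sample a small set of prototype tokens from the long WSI patch sequence and let every patch token query that compact set through a cross-attention layer. This is an illustrative sketch based only on the rebuttal's description, not the authors' released HistGen code; the class name, default dimensions, and sampling details are assumptions.

```python
import torch
import torch.nn as nn

class PrototypeCrossAttention(nn.Module):
    """Illustrative sketch (not the released HistGen implementation): uniformly
    sample prototype tokens from a long WSI patch sequence, then let every patch
    query the compact prototype set via cross-attention."""

    def __init__(self, dim: int = 512, num_prototypes: int = 256, num_heads: int = 8):
        super().__init__()
        self.num_prototypes = num_prototypes
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim), where N may be tens of thousands of patches
        B, N, _ = patch_tokens.shape
        idx = torch.linspace(0, N - 1, steps=min(self.num_prototypes, N)).long()
        prototypes = patch_tokens[:, idx]  # (B, P, dim): uniformly sampled key tokens
        # Each patch attends to the small prototype set instead of the full sequence,
        # keeping the cross-modal interaction tractable.
        out, _ = self.attn(query=patch_tokens, key=prototypes, value=prototypes)
        return out  # (B, N, dim): context-enriched patch features


# Example usage with random features (dimensions are assumptions):
feats = torch.randn(1, 10000, 512)
module = PrototypeCrossAttention(dim=512, num_prototypes=256)
enriched = module(feats)  # shape (1, 10000, 512)
```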

To Reviewer #3: R3.1: For report data preprocessing with GPT-4, the following prompt is used: "Help me check the formatting and spelling of the supplied pathology report, including incorrected use of punctuation like misusing of ‘x’ and ‘X’, and capitalization as well as deletion of some words of unclear meaning." It checks formatting and spelling while preserving the report’s meaning, ensuring clinical correctness in the final report. R3.2: Regarding noisy WSI-report correspondence, we clarify that reports do contain information such as patient age/sex or lymph node metastasis. The former (irrelevant information) will be filtered more strictly in future work. For the latter (diagnostic information not directly visible in the WSI), since deep learning has been used to predict lymph node metastasis, treatment response, or gene mutations directly from WSIs, we anticipate our model will implicitly learn this and reflect it in the generated report. R3.3: You mentioned our model surpasses ABMIL only marginally. We argue that our DINOv2 backbone provides superior initial features, so a simple MIL model can already achieve good results. We verify this by using an ImageNet-pretrained ResNet50 for feature extraction: ABMIL achieves 0.673 accuracy on UBC-OCEAN subtyping, while with our backbone it reaches 0.792. R3.4: Please refer to R1.1–1.3 and R1.5–1.6 for more details and analysis of our work.
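
For readers who want to reproduce the report-cleaning step described in R3.1, a minimal sketch using the OpenAI Python client is shown below. Only the cleaning instruction itself is taken from the rebuttal; the model string, decoding settings, and function structure are assumptions, not the authors' actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Cleaning instruction quoted from the rebuttal (R3.1); wording kept as reported.
CLEANING_PROMPT = (
    "Help me check the formatting and spelling of the supplied pathology report, "
    "including incorrected use of punctuation like misusing of 'x' and 'X', and "
    "capitalization as well as deletion of some words of unclear meaning."
)

def clean_report(report_text: str, model: str = "gpt-4") -> str:
    """Run one raw pathology report through the cleaning prompt and return the cleaned text."""
    response = client.chat.completions.create(
        model=model,  # hypothetical choice; the rebuttal only says GPT-4 was used
        messages=[
            {"role": "system", "content": CLEANING_PROMPT},
            {"role": "user", "content": report_text},
        ],
        temperature=0,  # deterministic cleanup rather than creative rewriting
    )
    return response.choices[0].message.content
```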

To Reviewer #4: R4.1: We conducted t-tests comparing our method to SOTA methods, resulting in P-values of 0.0003 for report generation, 0.004 for cancer subtyping, and 0.005 for survival analysis, indicating significant performance improvements. R4.2: We acknowledge the importance of manual evaluation. In future work, domain experts will be included to further clean the dataset and validate the quality of generated reports as well as the effectiveness of our method. R4.3: Our algorithm takes a WSI as input and generates the corresponding diagnosis report, as shown in Supplementary Figure 1.
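
As a usage note on R4.1, the snippet below shows one plausible way to compute such a significance test with SciPy. The rebuttal does not state whether paired or unpaired tests were used, and the numbers here are placeholders for illustration only, not the paper's results.

```python
import numpy as np
from scipy import stats

# Placeholder per-slide BLEU-4 scores (hypothetical values, not the paper's results).
# Both models are evaluated on the same test WSIs, so a paired t-test is one natural choice.
histgen = np.array([0.19, 0.17, 0.21, 0.16, 0.18])
baseline = np.array([0.14, 0.15, 0.16, 0.13, 0.15])

t_stat, p_value = stats.ttest_rel(histgen, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```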




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is a very hot topic and there have been a number of very high-impact papers published recently; however, the reviewers all point to a lack of clarity in the paper. The provision of a public dataset was a key plus for a couple of the reviewers, but it would need to be validated much more thoroughly by domain experts before it will be of general use, so it is not a sufficient reason to overlook the other weaknesses in the paper.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents two main contributions: a dataset and a method for whole-slide-representation learning / automatic report generation.

    The topic is interesting, timely, and relevant, and this submission occurs concurrently with highly related papers. The proposed approach is seen as interesting by the reviewers, and the efforts to make a corresponding dataset available are appreciated. The reviews also mention substantial criticism, including missing details of the underlying approach, missing comparisons w.r.t. other pretrained backbones and MIL approaches, and questions about the curation of the paired image-report dataset. While some of these concerns were alleviated in the rebuttal, the opinions of the reviewers are still divided. One issue is the validation of the dataset, which is currently only automatically preprocessed.

    Since the majority of the reviewers leans toward an accept and the technical concerns were mostly addressed in the rebuttal, the paper, although borderline, has sufficient merit for publication at MICCAI.

    I’d like to stress that the authors should consider including a concise discussion of the potential limitations of the dataset in a revised version of the paper. Furthermore, from my perspective, the paper currently does not sufficiently address usability from a human-AI collaboration perspective.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    There were opinions suggesting that the explanation of the technique needs to be strengthened. However, I believe that the work conducted in this paper is a suitable topic for presentation at the conference. I agree with Meta-Reviewer 3’s opinion of accepting the paper, with the specific details presented by the authors in the rebuttal incorporated into the final version.



