Abstract

Vision-language models (VLMs) have achieved success in both natural language comprehension and image recognition tasks. However, their use in pathology report generation for whole slide images (WSIs) is still limited due to the huge size of multi-scale WSIs and the high cost of WSI annotation. Moreover, most existing research on pathology report generation has not conducted sufficient validation of clinical efficacy. Herein, we propose a novel Patient-level Multi-organ Pathology Report Generation (PMPRG) model, which uses multi-scale WSI features from our proposed multi-region vision transformer (MR-ViT) model, together with their real pathology reports, to guide VLM training for accurate pathology report generation. The model then automatically generates a report based on the key-feature-attended regional features. We assessed our model on a WSI dataset spanning multiple organs, including the colon and kidney. Our model achieved a METEOR score of 0.68, demonstrating the effectiveness of our approach. This model allows pathologists to efficiently generate pathology reports for patients, regardless of the number of WSIs involved.
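For readers unfamiliar with the metric, METEOR scores a generated sentence against a reference using unigram matching with stemming and synonym support. Below is a minimal sketch using NLTK; the example sentences are invented, and this is not the authors' evaluation code.

```python
# Minimal sketch of the METEOR metric reported in the abstract, computed
# with NLTK. Illustrative only: the sentences are made up, and this is
# not the authors' evaluation pipeline.
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet synonym matching
nltk.download("punkt", quiet=True)    # tokenizer models for word_tokenize

reference = "Adenocarcinoma, moderately differentiated, with lymphatic invasion."
generated = "Moderately differentiated adenocarcinoma with lymphatic invasion."

# Recent NLTK versions expect pre-tokenized inputs (lists of tokens).
score = meteor_score([word_tokenize(reference)], word_tokenize(generated))
print(f"METEOR: {score:.2f}")
```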

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1738_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1738_supp.pdf

Link to the Code Repository

https://github.com/hvcl/Clinical-grade-Pathology-Report-Generation/tree/main

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Tan_Clinicalgrade_MICCAI2024,
        author = { Tan, Jing Wei and Kim, SeungKyu and Kim, Eunsu and Lee, Sung Hak and Ahn, Sangjeong and Jeong, Won-Ki},
        title = { { Clinical-grade Multi-Organ Pathology Report Generation for Multi-scale Whole Slide Images via a Semantically Guided Medical Text Foundation Model } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a model for pathology report generation using multiple WSIs per patient. The underlying vision-language architecture consists of a multi-region ViT that processes WSIs, a tag-guided feature extractor, and a pretrained medical language model for processing report data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed approach is well motivated, and the latest deep-learning-based techniques are used to solve the problem. The method is well described, and the visualization supports intuitive understanding.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main concerns about the paper relate to insufficient experiments:

    • Standard deviations are missing; a common scheme for slide- or patient-level analysis is k-fold cross-validation to alleviate the effects of data splits and random seeds (see the sketch after this list).

    • Lack of comparison with other slide-processing models. The proposed multi-region ViT should be compared with TransMIL, AB-MIL, and DS-MIL rather than only hierarchical encoders. TransMIL, for instance, can also infer slide-level encodings for the follow-up task. This would also show whether region-wise visual feature extraction is necessary compared to extracting features at a single level, e.g., 20x magnification. Since HIPT was not trained on the same slide data, despite following other studies, it is not a reasonable comparison method in this context. ZoomMIL is trained with a ResNet50 feature extractor, and since the pre-trained CNN models are not what should be compared here to show the superiority of the proposed method: either also use ResNet-50 to compare with ZoomMIL, or also use VGG16 for ZoomMIL, to isolate the effects of the distinct architectures. Moreover, most recent works use CTransPath as a baseline pathology image encoder.

    • Lack of comparison in PMPRG: Why is ZoomMIL (better than HIPT for the image encoding task) left out of the experiments here? It is not fair to compare HIPT (which is trained on another dataset) against the proposed MR-ViT model, which is trained on the authors’ in-house dataset that is also used for testing. There should also be more method comparisons here.
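To make the cross-validation request above concrete, the following is a minimal sketch of stratified k-fold evaluation that reports the mean and standard deviation of macro-F1 across folds. The synthetic data and RandomForest model are placeholders, not the paper's pipeline.

```python
# Hedged sketch of the k-fold evaluation requested above: report
# mean +/- std of macro-F1 across folds. Everything here (model,
# features, labels) is a placeholder, not the paper's method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy dataset standing in for slide- or patient-level labels.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=8,
                           weights=[0.7, 0.2, 0.1], random_state=0)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]),
                           average="macro"))

print(f"macro-F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```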

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors did not state that the implementation will be published. The method is evaluated on an in-house dataset, and several layer/data shapes of the method are omitted. Thus, this paper is not reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Image encoding comparison: The accuracy metric is heavily influenced by class distributions. However, the data were split randomly. Did you remove excess classes from the test set for this experiment? Please clarify.

    • I assume that the dataset splits ensure the same proportion of organs in each split, to numerically validate the multi-organ hypothesis? Please clarify.

    • The authors should numerically support the claim of a more lightweight/efficient region-wise encoder compared to HIPT, e.g., by providing the number of parameters and by performing a training-time comparison.

    • Please clarify the choice of the loss weight parameters alpha (0.2), beta (0.6), and gamma (0.2).

    • Attention maps are provided, but it is neither discussed whether they are valid nor whether these regions align with statements in the generated report. Thus, the stated novelty of an “explainable model” is not sufficiently validated and confirmed.

    • Also see the section “Weaknesses”.

    Minor comments:

    • In the abstract, the MR-ViT abbreviation is not introduced
    • Page 3, first bullet point: the format violates MICCAI guidelines, and a space is missing here
    • Sec. 2.2: what is DINO in this context (citation)? What is meant by batch-training?
    • Sec. 2.2: how big are Q and L, the random selection of regions for MR-ViT_S, and the latent dim?
    • Sec. 2.3: space missing in “F_{R,pat}.During”
    • Sec. 2.3: “activation function omitted”; is a softmax missing here?
    • Please state which computational resources and training times are needed for your method and experiments
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Lacking method comparisons; insufficient experiments; several additional minor issues

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    The authors have addressed the issue of class imbalance by clarifying that they used the macro-F1 score, and they have explained how the loss function hyper-parameters were derived.

    However, the main concern regarding evaluation of the approach remains:

    (1) Evaluation:

    • Table 1 shows that the authors compare feature-encoding networks instead of their proposed slide-level approach, and thus tune their approach in a different way than the comparable methods. They call this an ablation.
    • In Table 2 there is only one comparable method, which is not sufficient. Furthermore, this comparable method is not even trained on the same dataset.
    • The answer about time constraints is not valid, because by this reasoning, ViTs would never have emerged in the era of CNNs.
    • Furthermore, single-scale methods are still reasonable baselines that have not been evaluated for the vision-language task; they should be considered, as in very recent literature. In summary, there is a lack of comparable methods, and those that are included have been modified inappropriately in ways that benefit the proposed approach.

    (2) Analysis/data splitting:

    • Standard deviations are missing, and the experiments lack k-fold CV, which is common in the literature even under high class imbalance.
    • The authors did not report problems with class imbalance in the paper, nor did they use data downsampling or fit a weighted loss.

    Overall, considerable further work is needed to provide an unbiased evaluation (comparable methods, metrics, data) to really indicate whether and to what extent this approach can outperform other SOTA methods.



Review #2

  • Please describe the contribution of the paper

    The authors train a vision-language model with whole slide histological images and pathology reports. The model is able to generate reports from whole slide images automatically.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A vision language model for pathology that considers multiple WSIs is novel
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Performance for tumor grading seems to be far from accurate
    • Language scores do not provide a good measure for the usefulness of the reports. A quality validation from pathology experts is missing
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Code would be great. Information on the origin of the data would be very beneficial.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The motivation at the beginning of the introduction is not really to the point; it is more a collection of general medical AI issues.
    • Avoid too many abbreviations: a sentence like “VLMs have gained traction in pathology report generation (PRG). Prior works like Zhang et al. [26] combined CNN and LSTM for VLM training using ROI or sampled patches from WSIs. PLIP [6], MI-Zero [11] and CITE [25]…” is really hard to understand if you are not working on these topics.
    • Please label all panels in Fig. 1
    • Showing the report in Fig. 1A and D is redundant
    • Tables 1 and 2: showing four digits is probably overconfident. Please estimate the standard deviation of your methods and adapt accordingly
    • Fig. 2 is very instructive. The generated report suggests immunohistochemistry findings; did you check whether these are accurate?
    • Disease classes in the main paper and the supplement do not match
    • Code would be great
    • Information on the origin of the data would be very beneficial.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good idea, but validation of the generated reports needs to be done more thoroughly.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    After reading the reviews and the rebuttal letter and re-reading the manuscript, I think that a more solid evaluation of the reports beyond language scores is necessary. My concern that the tumor grade prediction is off still holds. My concern about the correctness of the reports is exacerbated after realizing that some of the high-attention regions in Fig. 2a point to background pixels, not tissue.



Review #3

  • Please describe the contribution of the paper

    The paper aims to generate slide-level reports for histopathology cases. Some interesting additions are a true patient-level report obtained by combining multiple slides, a region-selection method, and separate tag and organ classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method is definitely novel; I have not seen patient-level report generation that combines multiple WSIs.
    2. The method is at least competitive with the state of the art in the experiments the authors conducted.
    3. The validation dataset is quite substantial, which increases the trustworthiness of the results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There is no statement on the availability of the code or the dataset, which makes it harder to replicate the study later on.
    2. Validation on external data would have been a ‘nice-to-have’.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I think this is the one part of the paper that can be substantially improved by releasing the code, and ideally the data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. As stated above, I think my main concern revolves around validation on an extra, external dataset. For example, I know TCGA (at least in the past) also provided text reports associated with the histopathology slides.
    2. Code availability would definitely increase the value of the paper.
    3. The section on patient report generation is somewhat hard to read due to the removal of a lot of whitespace and the inline introduction of symbols. Given the size constraints this might not be fixable, but maybe moving part of it to the supplementary material would help?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has few weaknesses, is novel, and is well validated. There are some points for improvement, which is why I did not rate it a 6.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My positive impression of the paper remains; I see no reason to alter the score.




Author Feedback

Reproducibility. [R1, R3, R4] We will release the source code and related data to the public upon acceptance.

Tumor grade performance. [R1] Determining the tumor grade is challenging because there are multiple slides per patient, each with varying grades; the final grade is based on the highest grade observed across all slides. Low tumor-grade performance has also been reported in other work, e.g., Chan CVPR23.

Quality validation of generated reports. [R1] As depicted in Fig. 2 and Fig. S2, our in-house dataset exhibits a relatively structured report format rather than being entirely freeform (which would entail considerable variation in expressions conveying similar meanings). Moreover, based on the observation in Table 1 of a linear relationship between the CE metric and the NLG metric, we anticipate that the NLG metric indirectly but effectively reflects the validity of the generated reports.

Cross-validation + data splitting. [R1, R3] Cross-validation: Our dataset has severe class imbalance, making it difficult to distribute minority classes evenly across folds (each sample has multiple tags, and each tag has multiple inner classes). Consequently, we were unable to perform cross-validation and cannot provide standard deviations. Data splitting: Although we split the data randomly, we ensured that 1) no classes appear in the test set that are not present in the training set, and 2) as many classes as possible are included, by adjusting the seeds. Additionally, to avoid inflated metric scores from biased predictions on our imbalanced dataset, we reported the macro-F1 score in Table 1 for a more accurate assessment. Organ quantity: Although the quantities of organs are not exactly the same, the ratio of kidney:colon:rectum is approximately 1:2:1 in each split. No single organ is excessively dominant, and we believe the macro-F1 metrics accurately represent the model’s performance across multiple organs.
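As a small illustration of why macro-F1 is the safer choice on imbalanced data (the point made above), the following sketch contrasts micro- and macro-averaged F1 for a classifier that ignores the minority class; the toy labels are invented, not drawn from the paper's dataset.

```python
# Micro-averaging is dominated by the majority class, while
# macro-averaging weights each class equally and exposes the failure.
from sklearn.metrics import f1_score

y_true = [0] * 90 + [1] * 10   # 90:10 class imbalance
y_pred = [0] * 100             # predictor that always outputs the majority class

print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # ~0.90
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47
```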

Lack of comparison with other slide-processing models. [R3] Existing multi-scale models (e.g., ZoomMIL, HIPT) have already demonstrated superior performance compared to single-scale models (e.g., TransMIL, AB-MIL), so we did not compare ours with them. We found that recent histopathology VLMs (Guevara MIDL23, Sengupta ML4H23, MI-Gen) employ HIPT without fine-tuning. We followed the same experimental setup, but it did not perform well on our dataset. We also tried to fine-tune HIPT but failed due to the extremely long training time. We observed that ResNet50 features outperformed VGG16 features for ZoomMIL, but VGG16 features were better for our model.

Lack of comparison in PMPRG. [R3] Due to time constraints, we were unable to include the result of PMPRG with ZoomMIL on the entire dataset. However, we observed that our method outperforms ZoomMIL on the small dataset (kidney data) in PMPRG, in line with the encoder-only performance shown in Table 1.

Quantitative comparison with HIPT. [R3] With similar computational resources, HIPT took about 938 seconds to train on a single WSI, whereas our model finished in just 3 seconds, i.e., HIPT is approximately 300 times slower (938/3 ≈ 313; see Image Encoder in Section 3.1).

Loss weight parameters clarification. [R3] We empirically found that the combination of alpha (0.2), beta (0.6), and gamma (0.2) provided the best performance.
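For illustration, a minimal sketch of a weighted multi-task loss with these values is shown below; the mapping of the three terms to the paper's actual loss components is an assumption, not the authors' implementation.

```python
# Hedged sketch of a weighted multi-task loss with the weights stated in
# the rebuttal. Which loss corresponds to which term is assumed here.
import torch

alpha, beta, gamma = 0.2, 0.6, 0.2  # values reported in the rebuttal

def total_loss(loss_a: torch.Tensor,
               loss_b: torch.Tensor,
               loss_c: torch.Tensor) -> torch.Tensor:
    # Weighted sum of three task losses (e.g., tag, report, organ heads).
    return alpha * loss_a + beta * loss_b + gamma * loss_c
```

Note that the weights sum to one, so the combination is a convex mixture of the three objectives, with the middle term dominating.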

Fig. 2 validation. [R1, R3] Regarding the model’s prediction of the IHC-related content in Fig. 2, the ground-truth full-length report also contains similar text in the note section (not shown in Fig. 2), so it is a valid prediction. The attention maps were generated from the attention scores between each tag token and the regional features. Since each tag token plays a crucial role in generating sentences related to the corresponding disease, these maps directly indicate where the model focuses when generating sentences about each disease. We briefly assessed the validity of the attention maps with our pathology collaborator, and the initial feedback was positive.
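A hedged sketch of how such a tag-to-region attention map could be computed is shown below; the feature dimension, region count, and variable names are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a tag-to-region attention map as described in the rebuttal:
# scores between one tag token and the regional features, softmax-
# normalized into one weight per region. All shapes are assumed.
import torch

d = 256                                 # assumed feature dimension
tag_token = torch.randn(d)              # embedding of one disease-tag token
region_feats = torch.randn(100, d)      # assumed 100 regional features

scores = region_feats @ tag_token / d ** 0.5  # scaled dot-product scores
attn = torch.softmax(scores, dim=0)           # attention weight per region

# Painting each weight back onto its region's slide coordinates yields a
# heatmap like the ones shown in Fig. 2.
```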




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper presents a patient-level report generation framework that works from multiple WSIs. As acknowledged by R5, this setting is new and important. Moreover, the problem of WSI report generation is also very important. However, R1 and R3 raise severe concerns about the method evaluation and the experimental results.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


