Abstract

The characterization of the Tumor MicroEnvironment (TME) is challenging due to its complexity and heterogeneity. Relatively consistent TME characteristics are embedded within highly tissue-specific features, which makes them difficult to predict. The ability to accurately classify TME subtypes is of critical significance for clinical tumor diagnosis and precision medicine. Based on the observation that tumors of different origins share similar microenvironment patterns, we propose PathoTME, a genomics-guided representation learning framework employing Whole Slide Images (WSIs) for pan-cancer TME subtype prediction. Specifically, we utilize a Siamese network to leverage genomic information as a regularization factor that guides WSI embedding learning during the training phase. Additionally, we employ a Domain Adversarial Neural Network (DANN) to mitigate the impact of tissue type variations. To further eliminate domain bias, a dynamic WSI prompt is designed to unleash the model’s capabilities. Our model achieves better performance than other state-of-the-art methods across 23 cancer types on the TCGA dataset. The related code will be released.
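The genomics-guided alignment described in the abstract can be illustrated with a minimal, hypothetical sketch: the WSI branch output p_x is pulled toward the gene branch output z_g by minimizing negative cosine similarity, with the gene embedding treated as a constant (stop-gradient). All names and shapes below are illustrative, not the paper's actual code.

```python
import numpy as np

def negative_cosine_similarity(p, z):
    """Negative cosine similarity between a predicted embedding p and a
    target embedding z. In Siamese training with stop-gradient, z is
    treated as a constant: gradients would flow only through p."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)  # z comes from the frozen gene branch
    return -float(np.dot(p, z))

# Toy embeddings standing in for p_x = h(f_i(x_i)) and z_g = f_g(x_g)
rng = np.random.default_rng(0)
p_x = rng.normal(size=128)   # WSI branch output (trainable)
z_g = rng.normal(size=128)   # gene branch output (frozen, stop-grad)

loss = negative_cosine_similarity(p_x, z_g)
print(loss)  # a value in [-1, 1]; minimized when the embeddings align
```

Minimizing this loss drives the WSI embedding toward the gene embedding; the value reaches -1 exactly when the two vectors point in the same direction.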

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1063_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1063_supp.pdf

Link to the Code Repository

https://github.com/Mengflz/PathoTME

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Men_Genomicsguided_MICCAI2024,
        author = { Meng, Fangliangzi and Zhang, Hongrun and Yan, Ruodan and Chuai, Guohui and Li, Chao and Liu, Qi},
        title = { { Genomics-guided Representation Learning for Pathologic Pan-cancer Tumor Microenvironment Subtype Prediction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a deep learning method, PathoTME, for pan-cancer TME subtype classification from histopathology images. The main contributions of this paper are the utilization of an adversarial loss, visual prompt tuning, and a gene expression based cosine similarity loss to boost classification performance and generalizability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    1. The paper presents a cost-effective solution for an important problem of characterizing the tumor microenvironment.
    2. The use of domain adversarial training to learn shared representations of whole slide images across diverse cancer pathologies is innovative.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are by now several deep learning methods to directly predict gene expression profiles from histopathology images [e.g., ST-Net (https://www.nature.com/articles/s41551-020-0578-x), HE2RNA (https://www.nature.com/articles/s41467-020-17678-4), BLEEP (https://arxiv.org/abs/2306.01859)]. It is unclear what technical advantage their method provides over these approaches, which can in principle be used to infer the TME subtypes from imputed gene expression profiles.
    2. The rationale for choosing HIPT as the starting point for whole slide image feature extraction is not appropriately justified. Especially in the context of other popular self-supervised feature extractors such as: cTransPath (https://www.sciencedirect.com/science/article/abs/pii/S1361841522002043), and RetCCL (https://www.sciencedirect.com/science/article/abs/pii/S1361841522002730). Do simpler training strategies achieve similar state of the art performance by replacing HIPT with another self-supervised feature extraction model?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • It is unclear how the ROC AUC and PR AUC metrics and their respective standard deviations were calculated given 4 distinct TME classes. Furthermore, there is a considerable difference between the AUC values and F1 scores which is not explained.

    • The tile size and base magnification used for whole slide image feature extraction are not specified in the implementation details.

    • Are the dynamic weight lambda and the hyperparameter lambda_p the same?
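On the reviewer's question about computing ROC AUC over 4 TME classes: one standard approach is one-vs-rest macro averaging (the paper may have computed it differently). A hypothetical sketch with synthetic labels and predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, k = 200, 4                        # 200 samples, 4 TME subtypes
y_true = rng.integers(0, k, size=n)  # synthetic ground-truth labels
logits = rng.normal(size=(n, k))
# Softmax so each row is a valid probability distribution over classes
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One-vs-rest, macro-averaged multiclass ROC AUC
auc = roc_auc_score(y_true, probs, multi_class="ovr", average="macro")
print(auc)
```

With random predictions the score hovers near 0.5; per-class one-vs-rest scores could also be reported individually, with the standard deviation taken across classes or across cross-validation folds.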

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Incorporation of additional datasets for validation beyond TCGA (for e.g., CPTAC) was missing in this study. Validation on independently generated datasets can help further bolster the robustness and accuracy of the proposed approach.
    2. A flowchart depicting the train-validation and test data used in each benchmarking experiment was missing. Having such a flow chart in supplementary can be useful for understanding exactly how different models were trained and compared.

    3. The writing of the methods in the paper can be improved in places to increase clarity. Grammatical errors and typos in notations such as:
    • “To further fine-tuning the feature extractor”
    • “Image xi passes through image encoder fi and a predictor MLP head h, while g pass through gene extractor fg and SNN network. Two output vectors denoted as px = h(fi(xi)), zg = fg(xg), we minimize their negative cosine similarity”

    Can be written as:

    • “To further fine-tune the feature extractor”
    • “The whole slide image xi passes through the image encoder fi and a MLP head h, whereas the gene expression data xg passes through the feature extractor fg and SNN network to generate the following two embeddings: px = h(fi(xi)), zg = fg(xg). We then minimize the negative cosine similarity between the whole slide image and gene expression data embeddings px and zg as follows:”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper provides an interesting strategy to interrogate the TME from routinely collected histopathology slides of diverse tumor types. However, the novelty and impact of this work have not been adequately demonstrated.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors overall provided good justification for why they chose to directly perform TME subtype classification over gene imputation. They plan to perform further evaluations using other feature encoders and validations on independent datasets. The training strategies presented in this paper are innovative and aim to solve a challenging problem of characterizing the TME. I am looking forward to seeing future versions of this work.



Review #2

  • Please describe the contribution of the paper

    The paper proposes to learn a WSI latent space guided by genomics data. Since genomics data contains more information about the TME than WSI image features, the authors argue that using the genomic latent representation to refine the WSI latent representation will help learn a more accurate latent space for TME subtype classification. The latent space thus learned is used for pan-cancer subtype classification and shows better performance compared with other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    These are the strengths of the paper:

    1. Learning a WSI latent space guided by genomics data. Genomics data contains more information about the TME than WSI image features, so the authors' argument for using the genomic latent representation to refine the WSI latent representation is sound.

    2. Visual prompt tuning to avoid fine-tuning the whole HIPT model is also a good choice to save computational resources.

    3. Reporting results on pan-cancer images and beating the state-of-the-art methods for pan-cancer subtype classification.

    4. Relevant ablation experiments covered in the paper with comparison to appropriate baselines.

    5. Using a domain adversarial neural network and corresponding loss to ensure the model learns organ/tissue-type-independent features.
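The domain adversarial component praised above typically relies on a gradient reversal layer, as in DANN (Ganin & Lempitsky, 2015): identity on the forward pass, gradient scaled by -λ on the backward pass, so the shared encoder is pushed to discard tissue-type cues. A minimal hand-rolled sketch, illustrative only and not the authors' implementation:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lambda in the
    backward pass, as in DANN. The domain classifier learns to predict
    tissue type, while the reversed gradient makes the upstream encoder
    unlearn tissue-type-specific features."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                        # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed gradient reaches the encoder

grl = GradientReversal(lam=0.5)
x = np.array([1.0, 2.0, 3.0])
print(grl.forward(x))           # identical to x
print(grl.backward(np.ones(3))) # each gradient entry flipped and scaled by 0.5
```

In an autograd framework this would be a custom function with these exact forward/backward rules; the hand-rolled class just makes the mechanics explicit.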

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are some weaknesses of the paper:

    1. Shouldn’t the first comparison be with subtype prediction from gene expression data? That would set up a baseline and prove the underlying assumption that subtypes can be predicted from gene expression information. It would also give access to the maximum accuracy achievable with the model.

    2. The latent representation of the WSI image and the latent representation of gene expression do not share the same vector space and come from different modalities (pathology images vs. gene expression). This is like comparing animals with stationery; these are totally different modalities. So the logic of using cosine similarity might be flawed. Density alignment methods in latent space such as MMD or Wasserstein distance might have been better, or some form of contrastive loss might perform better. Why does supervised contrastive loss perform worse than PathoTME? They work on the same underlying phenomenon, namely that gene expression data can better cluster WSI features, but that doesn’t seem to be the case given those results.

    3. Additionally, using stop-gradient in equation 4 limits the flexibility of the model: the model’s task becomes adapting the WSI representation to gene-expression clusters via pairwise matching. If the latent space of WSI image features has more variability than that of genes, then one-way feature matching might be harmful and result in mode reduction of the WSI feature space.

    4. A comparison with HIPT features used directly for pan-cancer classification is missing and would be relevant. Since the proposed model claims better clustering than HIPT features, a HIPT baseline is relevant.

    5. The original HIPT paper reports an AUC of about 87.4% for BRCA subtyping, which is significantly higher than PathoTME (71.2%). PathoTME, given its multi-modality (gene expression data), should have performed better than classification on the original HIPT feature space. The pan-cancer task might have introduced the drop in performance (due to negative gradients), but then it violates the assumption that TME features across different organ/tissue types are similar in gene expression. Don’t these results conflict with the underlying assumption of the proposed methodology?

    6. ABMIL + Siamese in Table 2 is not explained. Is that a fusion method? What is the difference between ABMIL + Siamese and the ablation study’s “+Siamese”? Or is there no difference, despite the difference in average ROC score?

    7. The plots in the appendix don’t really make sense. Additionally, they don’t seem to align with the observation that spatial clustering has been reduced.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    If the code is released to the public, I don’t see any issues reproducing the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Suggestions for improvements:

    1. Addressing the choice of using negative cosine similarity instead of MSE or other distance measures such as JSD, MMD, or Wasserstein distance, which have shown good performance in reducing latent space distance for domain adaptation problems. Additionally, addressing the choice of stop-gradient in equation 4.

    2. From my understanding, the proposed method should work for all types of MIL methods, so adding additional MIL methods with the same process of aligning the gene expression latent space would also be relevant and might show the generality of the approach.

    3. Regarding “Gene Embedding from tabular data”: would using a tabular encoder, as done for Wikipedia tables and other document encoders, be better?

    4. Adding a HIPT comparison for pan-cancer subtype classification and addressing the huge discrepancy in the results for single-cancer subtype classification (BRCA subtyping) raised above. The HIPT baseline comparison is very important, because those features are used to represent the slides.

    5. Clarifying the Table 2 comparisons.
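As a concrete illustration of the alignment alternatives raised in point 1, a biased squared-MMD estimate with an RBF kernel compares the two embedding distributions as a whole rather than matching pairs. The embeddings and parameters below are synthetic stand-ins, not values from the paper:

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF
    kernel: MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
wsi = rng.normal(0.0, 1.0, size=(64, 16))   # toy WSI embeddings
gene = rng.normal(0.5, 1.0, size=(64, 16))  # toy gene embeddings, shifted mean
print(mmd2_rbf(wsi, gene))  # larger when the two distributions differ more
```

Unlike negative cosine similarity, this loss needs no sample-level pairing and drops to zero when the two batches come from the same distribution, which is why it is popular in domain adaptation.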

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has sound underlying reasoning and ideas. However, there are minor points that need to be addressed and clarified, especially:

    1. Addressing the choice of using negative cosine similarity instead of MSE or other distance measures such as JSD, MMD, or Wasserstein distance.

    2. The HIPT comparison for pan-cancer subtype classification, and addressing the huge discrepancy in the results for single-cancer subtype classification (BRCA subtyping) raised above.

    3. Clarifying the Table 2 comparisons and adding additional analysis for the points mentioned above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors cleared some of my doubts and clarifications in the rebuttal. I do appreciate the novelty of the method and think that MICCAI community might enjoy reading this paper with some modification in results.

    However, I have decided to keep my score the same.

    1. I am still concerned about the huge performance difference between single-cancer subtype classification in HIPT and other methods in the literature and the numbers reported in the paper. Even though they are not pan-cancer, the proposed method, with its better WSI representation, should outperform on single-cancer subtyping as well.

    2. The authors’ explanation that unfreezing the model made it diverge suggests to me that there might be other factors of influence unaccounted for in the model, or that more experiments might be required to find an optimal training procedure, perhaps burn-in of both branches at a high learning rate followed by fine-tuning at a slowly decreasing learning rate.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a framework for pan-cancer tumor microenvironment classification based upon whole slide imaging (WSI). The proposed approach leverages multiple interesting techniques including the integration of gene-level and WSI features and domain generalization strategies and demonstrates robust performance on a large dataset comprised of 23 cancer types.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses an interesting classification task of pan-cancer tumor microenvironment classification which to my knowledge has not been performed using WSI analysis. The proposed method outperforms other approaches.

    The proposed modules are logically motivated, and while fairly simple, are sufficiently novel and contribute to an overall framework that is well-justified for the task. Specifically, while contrastive architectures have been explored previously for multi-modal feature fusion, the decision to pair this module with the domain adversarial branch to avoid focus on cancer-subtype-specific features is well thought out and clearly improves model performance.

    The paper presents appropriate ablation studies to justify each portion of the proposed architecture and provides extensive metrics for model performance when compared with several baselines.

    The inclusion of the t-SNE representations of model embeddings at various steps is compelling and further supports the effectiveness of different model components.

    In general, the paper is well-organized and well-reasoned.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I wonder if there might be additional baseline comparisons that could appropriately be made. I understand the paper’s point that the inclusion of genetic information is a unique attribute of PathoTME and that including this information directly at inference may lead to information leakage and defeat the purpose of WSI-only classification of the TME, but I still wonder whether there are more sophisticated baselines that could be compared, since the ones listed are all fairly basic MIL baselines.

    This may have been due to space limitations, but I would like to see some analysis or discussion on why there is so large a variance in some of the scores when stratified by cancer type. There could be obvious reasons for this (perhaps certain cancer types do not fall neatly into the 4 TME classes identified by previous studies), but some additional insight would be appreciated since the scores do vary dramatically for some cancers.

    One thing that is a little unclear is how the training and testing splits are performed. “These paired samples are randomly stratified into training sets(85%) and test sets(15%) based on the origins.” Does this mean that, on a per-cancer-subtype basis, 85% were taken for training and 15% for testing?

    The paper acknowledges that performance for this task is ultimately low to a degree, but this is somewhat mitigated by superior performance to baselines and ablation studies demonstrating the contribution of each module of the architecture.

    A minor weakness of the paper is that the language is occasionally a little difficult to follow.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper should be easily reproduced with the details provided if the code is indeed made public. All datasets were from TCGA.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I think that this paper is generally well-written and compelling. The proposed method is strongly justified and logically presented. The results are strong, and demonstrated robustly over a large, pan-cancer dataset. This task appears to be novel, and of clinical significance.

    The paper could benefit from a little revision in its wording, though in general this does not impede a reader’s understanding of the material. Specific sections that could use some attention are:

    1. the description of training and testing data partitioning (described in the weaknesses section)

    2. “Noteworthily, during the pretraining phase, the gene extractor and SNN network are trained with TME subtypes labeled yt. Once trained, its weights will be frozen and it will stop gradient descent.” It is clear here that the contrastive module is a training-only step, but it is not clear to me what the authors mean that the gene extractor and SNN are trained with yt.

    Some additional changes that I might suggest to make the paper more compelling are:

    1. The authors could place a greater emphasis on the fact that this approach effectively leverages genetic information while not requiring genetic information as input during inference. I think this is a very strong contribution of the paper that is not fully highlighted.

    2. A more thorough exploration of the variance in model performance on different subtypes. Does lower performance correlate with lower-N datasets, as the paper currently speculates?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a significant task and robustly outperforms existing methods on a large, pan-cancer dataset. The generalizability of the approach to this pan-cancer dataset, combined with the technical innovation of leveraging genetic information during training while requiring solely WSI in inference makes this paper suitable for acceptance in MICCAI, in my opinion. The weaknesses of this paper are relatively minor as discussed in my previous comments.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed my comments adequately. I agree with the other reviewers that additional experiments could further strengthen the work, but as they are not allowed by MICCAI in the rebuttal period, I still believe that my initial review remains in favor of acceptance. Even without the additional experiments, I believe the approach is interesting enough to be presented at MICCAI.




Author Feedback

We thank all the reviewers for your constructive comments.

R1: Q1. Gene imputation methods. Thank you for pointing out these methods. However, our model is significantly different from the named methods. We would like to further clarify our novelty: a) Task and model design: our model focuses on TME prediction, a complex task with clinical significance which heavily relies on gene guidance. In comparison, the named methods aim to predict genes from images, leading to distinct model designs by nature: our framework is end-to-end, while these methods have to impute genes before prediction. b) Performance: the named methods perform well on several genes but are less effective overall, with mean correlations between predicted genes and ground truth mostly below 0.3 (Gao et al., 2024, 101536, Table 1). TME prediction needs hundreds of genes (we used 1810 knowledge genes). Inaccuracies in individual genes accumulate, resulting in poor overall prediction accuracy. c) Training strategy: our model employs gene information at the training stage to guide TME prediction, distilling the knowledge from genes into a TME classification network trained with TME labels. The named methods use gene expression as labels directly. However, gene expression may contain biological and technical biases, especially in ST genes (ST-Net, BLEEP), limiting model training.
Q2. Feature extractor. We chose HIPT for its pyramid structure, which can extract the larger organizational features needed for the TME. Our method is agnostic to SSL feature extractors; here we use HIPT as an example. cTransPath and RetCCL can also work within our framework. We will make a detailed comparison in the future.
Q3. Reproducibility. Tile size is 4096 px at 20x magnification. The dynamic weight λ is the same as lambda_p. We used 5-fold cross-validation on the training set and calculated all metrics on the test set. We will add a dataset partition flowchart in the final version. We agree that tests on more datasets can further bolster robustness, and we will include this in a future version.

R3: Q1. Additional comparisons. Thank you for the suggestions. We can compare our method with the gene imputation methods mentioned in R1 Q1.
Q2. Large variance. We think sample size (N) may cause the variance. Correlations between N and std are below -0.6, confirming our assumption. Some samples not fitting neatly into the TME classes may also be a reason. We will do further analysis.
Q3. Dataset partition. Yes, for each tumor type there is an 85%/15% data split. We explained the data split in detail in R1 Q3.

R4: Q1. Gene and HIPT feature baselines. Indeed, based on our previous tests, using genes alone predicted well (0.93 AUC). kNN on HIPT features directly achieved a mean AUC of 0.60 (pan-cancer) and 0.62 (BRCA). The difference between the results in the HIPT paper and ours is due to the different tasks: the former is a simpler binary classification task with clear histological features due to tumor origins, hence the better results.
Q2. Representation alignment & gene encoders. Thank you for your constructive suggestions. Our contribution is mainly a framework that uses genes to guide the model. Multiple density alignment methods can also work within our framework; we chose cosine similarity as an example for its simplicity and speed. We are already exploring these methods and other gene encoders, and will add results in future versions.
Q3. Stop gradient. In our experiments, we found that not freezing the model made it hard to converge. For classification, learning a compact and discriminative latent space is key; therefore, aligning WSI image features to compact gene features helps improve classification efficiency. More variability leads to ambiguous decision boundaries.
Q4. ABMIL+Siamese. The structures of ABMIL+Siamese and “+Siamese” are the same, but the former was trained separately on each tumor type while the latter was trained once on the pan-cancer dataset, which explains the difference in ROC scores.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is an interesting paper that aims to identify TME type rather than predicting gene expression, which may have applications in identifying patients who would benefit from immunotherapy. The use of an adversarial network to remove reliance on cancer type is neat. The reviewer who initially scored this as a 3 increased their score to accept after the rebuttal, and the consensus is for accept.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal addresses most of the raised concerns. There is a clear consensus among the reviewers, so I recommend acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



