Abstract

Spatial Transcriptomics is a novel technology that aligns histology images with spatially resolved gene expression profiles. Although groundbreaking, it struggles with gene capture yielding high corruption in acquired data. Given potential applications, recent efforts have focused on predicting transcriptomic profiles solely from histology images. However, differences in databases, preprocessing techniques, and training hyperparameters hinder a fair comparison between methods. To address these challenges, we present a systematically curated and processed database collected from 26 public sources, representing an 8.6-fold increase compared to previous works. Additionally, we propose a state-of-the-art transformer-based completion technique for inferring missing gene expression, which significantly boosts the performance of transcriptomic profile predictions across all datasets. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on spatial transcriptomics.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2459_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2459_supp.pdf

Link to the Code Repository

https://github.com/BCV-Uniandes/SpaRED

Link to the Dataset(s)

http://157.253.243.29

BibTex

@InProceedings{Mej_Enhancing_MICCAI2024,
        author = { Mejia, Gabriel and Ruiz, Daniela and Cárdenas, Paula and Manrique, Leonardo and Vega, Daniela and Arbeláez, Pablo},
        title = { { Enhancing Gene Expression Prediction from Histology Images with Spatial Transcriptomics Completion } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work makes two key contributions. First, it provides a large-scale, spatially resolved gene expression database for benchmarking histology-based spatial transcriptomics prediction methods. Second, the authors show that imputed spatial transcriptomics profiles boost the accuracy of histopathology-based gene expression prediction methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work provides a comprehensive and standardized collection of datasets to benchmark emerging deep learning methods for gene expression prediction from histopathology images.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The technical novelty of this work is not adequately discussed, especially in the context of other spatial transcriptomics imputation methods that integrate knowledge from single-cell RNA-seq datasets (e.g., Tangram, gimVI, Harmony, LIGER, Seurat, SpaGE, stplus) (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02653-7).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No additional comments

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should provide some intuition/rationale for applying median completion prior to running the transformer based imputation approach.

    • It is not clear whether position embeddings representing the spatial locations of different spots were utilized in the training/inference process. Intuition suggests they would be important for imputation of spatially variable genes.

    There are some minor grammatical errors in the paper that can be corrected to improve clarity. For example:

    • “standardize and cure 26 public ST databases”. Perhaps the authors mean curate and standardize?
    • “To adequate different gene dimensionalities to a fixed transformer dimension…”. Perhaps the authors mean accommodate?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In this paper, the authors primarily aim to address the challenge of missing gene imputation, which is not necessarily the focus of this conference. However, this work is likely to have a lasting impact on many computational pathology and spatial transcriptomics applications. Thus, this work would be better suited for a bigger journal publication.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have provided a strong and convincing rebuttal. The authors have also now clearly explained the technical novelty of their work in contrast to other imputation methods. It is very interesting that position encodings do not affect imputation performance. Overall, this work presents a very important contribution to the emerging field of spatial transcriptomics and histopathology.



Review #2

  • Please describe the contribution of the paper

    The authors presented an integrated database, SpaRED, built by systematically curating and standardizing 26 public ST datasets, and a transformer-based completion/imputation model, SpaCKLE. The study showed that SpaCKLE-completed ST data enhanced the performance of 7 state-of-the-art models across the 26 datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    i. Detailed discussion and comparison with existing methods ii. Clear methodology

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    i. An important aspect, data scaling or normalization, was not discussed. ii. There is no discussion of the target gene selection.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I suggest providing more details on data normalization and target gene selection.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The study provides a comprehensive performance evaluation and comparison with existing algorithms, and offers a useful data repository for spatial transcriptomic data (if made available to the public).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    In this work, the authors present a benchmark of gene expression prediction models using spatial transcriptomics data, combining 26 public sources (named SpaRED). This extensive dataset represents an 8.6-fold increase compared to previous works in the literature. They then evaluate 7 state-of-the-art prediction methods on the dataset, showing the best-performing methods across the homogenised dataset. The collection contains both human and mouse samples. Furthermore, as the technical contribution, they introduce a novel gene expression completion method (named SpaCKLE) that outperforms other gene completion methods and further improves the gene expression prediction capabilities of the tested models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors did great work combining 26 public spatial transcriptomic datasets, given the lack of curated datasets available at the moment.
    • The authors show that they can corrupt up to 70% of the gene expression data and their proposed methodology, SpaCKLE, can still accurately impute the missing values (Figure 3). This can also be nicely visualised across the tissue sample. Furthermore, it vastly outperforms the other two presented completion models.
    • As shown in Figure 1b, using SpaCKLE improves the performance of the gene expression models. That is a really interesting result, given the high quantity of missing values that these datasets usually have.
    • They compare 7 well-known gene expression prediction methods, showing that the most powerful models (in terms of parameters) do not always obtain the best performance. This is quite useful to know, and benchmarking all models in different tissues gives a better sense of which model can be most useful for a given task. However, as the authors also point out, this might be a consequence of the low number of spots available in these datasets.
    • They also compare their models in both intra-patient and inter-patient scenarios, which gives more context and further results for the models compared.
    • Authors state that they are going to release both the benchmark and the models trained after the acceptance of this work, which is highly appreciated by the scientific community.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • No statistical evaluation of results: paired tests would give statistical weight to the argument of which models are better.
    • Authors provide the batch size and optimiser used to train the model but they say: “… use an Adam optimiser with the default parameters”. The default parameters are library dependent, so it would have been nice to include them.
    • While the selected number of genes, 32 and 128 for the gene expression prediction tasks, is well justified (given the gene expression quality), it is also true that these genes could have been selected based on their importance for each tissue. Also, a comparison for a “high-resolution model” where all genes available are included, would have been interesting.
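The paired statistical evaluation requested in the first weakness above can be sketched as follows. This is a minimal illustration, not the analysis the authors ran: it assumes each model has one score per dataset, and it uses an exact sign-flip permutation test as one possible paired test. The function name `paired_sign_flip_test` is hypothetical.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired permutation (sign-flip) test.

    Returns the p-value for the null hypothesis that the paired
    differences between the two models are symmetric around zero.
    Illustrative sketch only; not taken from the paper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one per dataset")
    if not scores_a:
        raise ValueError("at least one pair of scores is required")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    total = 0
    extreme = 0
    # Enumerate all 2**n sign assignments; only feasible for small n.
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        total += 1
        # The small tolerance guards against floating-point ties.
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With 26 datasets the full enumeration has 2^26 sign assignments, so in practice one would sample random sign flips or use a library implementation of a paired test instead.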
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors present an impressive benchmark of spatial transcriptomics datasets. Given the scarce availability of this kind of dataset, a resource like this is highly anticipated by the scientific community. Furthermore, given the increasing availability of models that predict gene expression from digital pathology slides, it would help to further measure the performance of newer models. I would suggest that the authors increase the number of predicted genes to at least the thousands. That would allow a better analysis of the models. The same applies to their proposed SpaCKLE model. Furthermore, given that we are dealing with spatially resolved data, using some kind of distance (such as the Earth Mover's Distance) to compute how similar the prediction is to the ground truth would highlight whether the spatial structure is maintained. This would apply both to the gene expression imputation (as depicted in Figure 3) and to the gene expression prediction.
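To make the Earth Mover's Distance suggestion concrete, here is a minimal sketch restricted to one dimension. It assumes a single row of equally spaced spots with non-negative expression values, and treats each profile as a mass distribution over spot positions. In this 1D case the distance equals the sum of absolute differences between the two cumulative distributions. The function name `emd_1d` is hypothetical, and a full 2D tissue comparison would need an optimal-transport solver instead.

```python
from itertools import accumulate


def emd_1d(truth, prediction):
    """Earth Mover's Distance between two expression profiles measured
    along one row of equally spaced spots, in units of spot spacing.

    Both profiles are normalised to unit total mass, so the result
    reflects where expression is located, not how much there is.
    """
    if len(truth) != len(prediction):
        raise ValueError("profiles must cover the same spots")
    if any(v < 0 for v in truth) or any(v < 0 for v in prediction):
        raise ValueError("expression values must be non-negative")
    total_t, total_p = sum(truth), sum(prediction)
    if total_t <= 0 or total_p <= 0:
        raise ValueError("profiles must have positive total expression")

    # Cumulative distributions of the normalised profiles.
    cdf_t = accumulate(v / total_t for v in truth)
    cdf_p = accumulate(v / total_p for v in prediction)
    # In 1D, the transport cost is the area between the two CDFs.
    return sum(abs(a - b) for a, b in zip(cdf_t, cdf_p))
```

A prediction that places the expression peak two spots away from the ground truth scores 2, whereas a spot-wise error such as MSE would not distinguish a near miss from a distant one.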

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As far as I am concerned, this is the biggest curation of spatial transcriptomic datasets to date. Having such a benchmark to test gene expression prediction models, which are gaining a lot of interest in recent times, can be really valuable for the community. Showing that their proposed methodology for data imputation preserves the spatial gene expression characteristics of the tissue and it helps in the gene expression prediction task is also interesting, and having such a model can help in multiple downstream tasks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have successfully answered the questions raised by me and the other reviewers, so I maintain my score and the acceptance of the manuscript.




Author Feedback

We want to thank the reviewers for their insightful comments. Below, we respond to their concerns individually.

Reviewer 1:

Data pre-processing details: We follow the standard practice in the literature for data scaling, normalization, and gene selection reported in [11]. As we stated at the end of the introduction, all benchmark data and source code from our project will be publicly available upon acceptance.

Reviewer 3:

Relevance for MICCAI: We want to address explicitly the reviewer’s concern that our work may not align well with the conference’s focus. Our primary contribution is the most extensive standardized benchmark for gene expression prediction from histopathological images. This task, at the intersection of histology and transcriptomics, is poised to shape the future of next-generation histology applications, where AI bridges the gap between medical imaging and molecular data types. We believe that this new benchmark in an emerging modality will not only be highly valued by the MICCAI histology community but also, as the reviewer rightly pointed out, will have a profound and lasting impact on numerous computational pathology and spatial transcriptomics applications.

Comparison with single-cell RNAseq reference-based imputation: We would like to point out that those methods belong to a different family of gene completion algorithms that require a paired single-cell dataset to work correctly. This paired dataset can sometimes be challenging or impossible to find, limiting the usability and practicality of such techniques. In contrast, SpaCKLE is a reference-free completion method applicable to more general setups.

Median Completion Initialization: We used median completion for three main reasons: (1) it allowed the calculation of a complete reconstruction loss between input and output matrices, which ensured faster training convergence and guaranteed non-zero predictions; (2) having a rough estimate of missing data allowed us to supervise the model in regions where a large patch of data was missing, and (3) as the previous state-of-the-art [11], it was a natural baseline to build upon. Positional encodings did not provide any noticeable improvements to the results.
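As an illustration of the initialization idea described above, the following minimal sketch fills every missing entry with a median so that a model receives a complete matrix as input. It is not the authors' implementation. It assumes a spots-by-genes matrix with `None` marking missing values and uses the per-gene median over all observed spots, whereas the procedure in [11] may rely on spatial neighbourhoods. The function name `median_complete` is hypothetical.

```python
from statistics import median


def median_complete(expression):
    """Return a copy of a spots-by-genes matrix in which every missing
    entry (None) is replaced by the median of the observed values of
    the same gene. Genes with no observed value are filled with 0.0.

    Assumption: a global per-gene median; a spatially local median
    would need spot coordinates, which this sketch does not use.
    """
    if not expression:
        return []
    n_genes = len(expression[0])
    completed = [list(row) for row in expression]  # leave the input intact

    for g in range(n_genes):
        observed = [row[g] for row in expression if row[g] is not None]
        fill = median(observed) if observed else 0.0
        for row in completed:
            if row[g] is None:
                row[g] = fill
    return completed
```

Because the completed matrix has no gaps, a reconstruction loss can be computed over every entry, which matches the first reason given in the rebuttal.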

Reviewer 4:

We are delighted to learn that the reviewer recognizes the significant value of our benchmark for the histopathology community, particularly considering the current lack of reference points for gene expression prediction from histology images. We fully acknowledge the importance of evaluating statistical significance when comparing models and will include this analysis in the final version of our paper. The implementation details of our methods will be fully documented in our paper’s public code repository. We also agree that there are other experiments worth considering, such as filtering the genes based on their tissue relevance. The source code of our project is already designed to be flexible, allowing users to select any specific set of genes of interest.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors introduced SpaRED, an integrated database of 26 public spatial transcriptomics (ST) datasets, and SpaCKLE, a transformer-based imputation model, demonstrating that SpaCKLE-enhanced ST data improved the performance of seven state-of-the-art prediction models across these datasets. The method is interesting and novel. The paper is clearly written, and the experimental setup and ablation study are sound. Previous concerns of reviewers have been addressed during the rebuttal. After the rebuttal, the reviewers reached consensus about its acceptance.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have introduced SpaRED, a robust and extensive database compiled from 26 public spatial transcriptomics datasets, and SpaCKLE, a novel transformer-based imputation model. These contributions significantly enhance the performance of gene expression prediction models from histology images.

    The authors have effectively addressed the reviewers’ concerns in their rebuttal. For instance, they provided detailed clarifications on their data preprocessing methods and emphasized the relevance of their work to the MICCAI community. Additionally, the authors justified the use of median completion as an initialization step for SpaCKLE, citing its role in ensuring faster convergence and better supervision during training.



