Abstract

Recent advances in spatial transcriptomics (ST) make it possible to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with the gene expression of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2317_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2317_supp.pdf

Link to the Code Repository

https://github.com/XiaofeiWang2018/Diffusion-ST

Link to the Dataset(s)

https://www.10xgenomics.com/datasets?menu%5Bproducts.name%5D=Spatial%20Gene%20Expression&query=&page=1&configure%5BhitsPerPage%5D=50&configure%5BmaxValuesPerFacet%5D=1000&refinementList%5Bproduct.name%5D=&refinementList%5Bspecies%5D=&refinementList%5BstainingMethods%5D=&refinementList%5BdiseaseStates%5D=

BibTex

@InProceedings{Wan_Crossmodal_MICCAI2024,
        author = { Wang, Xiaofei and Huang, Xingxu and Price, Stephen and Li, Chao},
        title = { { Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose to leverage the high-resolution information present in histology images alongside the rich gene expression information present at low resolution in spatial transcriptomics to obtain high-resolution profiles of gene expression. They do so by modelling a cross-modality diffusion system that takes into account the correlations between the two modalities and predicts the high-resolution profiles of multiple genes jointly.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of using conditional diffusion processes to infer the high-resolution spatial transcriptomics from low-resolution ST and high-resolution pathology information is interesting given the power of diffusion models shown in diverse domains and applications.
    • The utilisation of gene co-expressions to collectively predict multiple genes leverages the dependencies of the genes and is interesting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper does not clearly motivate the different components of the proposed method, or how they fit into the full pipeline. E.g., why is there a need for hierarchical cell-to-tissue feature extraction and curriculum learning (and how is this done)? What if you use a simpler pooling which does not use attention? The cross-modal adaptive modulation and multi-modal disentangling also appear rather complex without sufficient justification. E.g., why is there a need for a “disentangling” network and how is it used?
    • Evaluation metrics: It is not clear how the different methods are compared. Do the authors have access to the high-resolution ST images as ground truth? If not, how are the metrics designed? With respect to what is the RMSE computed? Further, how does the proposed method compare to other related works like iFuse and xFuse?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Is there a range of genes which have shown to be easier to predict from LR-ST + histology? Would you expect that all genes can be predicted easily? This would be an interesting aspect of the problem to experiment on.
    • In Section 2.2, how are the cells identified for the hierarchical cell-to-feature extraction? How is the patch defined? Why is there a ‘screening’ step and what is it designed to tackle?
    • It is not really clear how the CIGC-graph network fits into the overall pipeline. Perhaps the overall model can be described together in a section after the different elements of the model are presented?

    Minor comments:

    • In equation (1) - two different styles of x are used to denote the same thing.
    • Figure 2 has no (a) (b) labeled, although it’s referenced in the text.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is positioned to tackle a relevant problem in the area of tissue analysis. The overall idea of the paper to leverage diffusion processes is interesting and powerful. However, different components of the proposed method are not sufficiently motivated/described and explained with respect to the overall pipeline. Therefore, it is difficult to understand the value of each component. The manuscript needs to be further improved to be of use to the relevant audience.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a methodology for performing super-resolution of spatial transcriptomics data using a cross-modal diffusion process. The backward diffusion process predicts a high-resolution image of the spatial transcriptomics data by conditioning on the underlying histology image and ground-truth low resolution spatial transcriptomics data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The methodology is well described and provides a considerable improvement over existing state of the art methods using the evaluation metrics provided. The paper is thorough with clear diagrams explaining the structure of the model used. The ablation study justifies the inclusion of each element effectively.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is hard to gauge the significance of these results without any downstream context in which to understand them. How does the improved resolution of spatial transcriptomics data lead to any further biological insights? Does the original low resolution place a limit on what can be achieved? The TESLA work, which I have briefly reviewed, provides exemplar studies such as identifying detailed segmentation maps of tumour regions in the histology based on known genes expressed in the tumour. Is this model capable of this, and does the improved ST resolution offer an improvement in these downstream tasks?

    More details on the patching approach and the meaning of “cell level” features are required. Are these patches of the individual cells, or are they patches taken at a resolution suitable to observe individual cell properties (i.e. 40x/20x)?

    The reader cannot fully understand the experiments without having read prior referenced works (refs. 10, 22). Some additional details on how the datasets are prepared would improve the manuscript allowing it to be read and understood without referring to other works.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Code is included, pre-trained model promised upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See note about details for experimental setup above.

    Demonstrating the downstream significance of the results would further improve the manuscript.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a good piece of work with strong experimental results compared to other baselines. There are some further details required, particularly on the data preparation and tile extraction.

    Lack of evaluation of the downstream significance of the results weakens the impact of the paper.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This manuscript introduces a diffusion-based multi-modal SR method for ST maps. It addresses the issue of low resolution in current ST platforms. It proposes a cross-modal conditional diffusion model that integrates histology images with gene expression profiles, enabling the modulation of complementary information for more accurate spatial gene expression analysis. The paper’s main contributions include a multi-modal disentangling network, a co-expression intensity-based gene-correlation graph network, and a cross-attention modeling strategy. The results show that it outperforms existing methods on three public datasets. The validation part is well presented. The paper is clear and well written.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s main strengths lie in how it improves spatial transcriptomics using Diff-ST. It introduces a new approach, referred to as a cross-modal conditional diffusion model. This model combines histology images and gene expression profiles to make spatial details clearer. This approach is particularly interesting because it addresses a critical limitation of current ST platforms and offers a promising solution for in-depth spatial gene expression analysis. Additionally, the proposed multi-modal disentangling network and co-expression intensity-based gene-correlation graph network represent novel methodologies that contribute to the effectiveness of the Diff-ST model. Overall, the paper’s contributions are substantial and are well presented.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I recommend clarifying the methodology section, especially by providing a more detailed explanation of how the disentangling network is specifically designed to capture both the unique features and the shared aspects of multi-modal data.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Accept — must be accepted due to excellence (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work is new and well presented. This deserves to be accepted.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their valuable feedback.

Disentangling network {R3,R5} We thank R3 for acknowledging our disentangling network. To clarify further, spatial transcriptomics (ST) and WSI carry both shared and unique genetic and morphological information. WSI provides unique features of cellular patterns at high resolution, while spot-ST bears unique features of expression patterns that form the basis of ST super-resolution. Meanwhile, the shared features of tissue structure across ST maps and WSI can bridge the translation from image to spatial expression. Hence, disentangling shared and unique features could better facilitate ST enhancement. Specifically, we devise a disentangling loss to derive the shared and unique features of each modality, where the disparity among the shared features is minimized while that among the unique features is maximized.
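One possible reading of the disentangling loss described above, as a minimal sketch only: the rebuttal does not give the formula, so the use of cosine similarity, the hinge on the unique term, and the names `cosine` and `disentangling_loss` are all assumptions made for illustration and are not taken from the paper or its code.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def disentangling_loss(shared_img, shared_st, unique_img, unique_st):
    """Hypothetical loss: pull shared features together, push unique ones apart.

    Assumption: disparity is measured with cosine similarity; the paper's
    actual distance measure is not stated in the rebuttal.
    """
    # Shared features of the two modalities should agree (disparity minimized).
    shared_term = 1.0 - cosine(shared_img, shared_st)
    # Unique features should not overlap (disparity maximized); only positive
    # similarity is penalized so the term is bounded below by zero.
    unique_term = max(0.0, cosine(unique_img, unique_st))
    return shared_term + unique_term
```

Under these assumptions the loss is near zero when the shared features coincide and the unique features are orthogonal, and it grows as the unique features of the two modalities become aligned.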

Cell-to-tissue feature extraction & cell patching {R4,R5} To generate cellular-level high-resolution (HR) ST maps, it is central to consider cellular heterogeneity, i.e., cells may contribute distinctly to expression profiles. Hence, we perform cell patching to extract cellular features. Specifically, as described in Section 2, we perform patching at an average scale (a 160 µm region at 20X) that shows cellular organization. We then use cross-attention to identify key patches for certain genes. In contrast, simple pooling averages the gene expression attribution of different cell patches, and so cannot generate accurate ST maps at the cellular level.
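The contrast between attention-based and average pooling drawn above can be illustrated with a small sketch. It uses generic single-query scaled dot-product attention, not the authors' dynamic cross-attention module; the gene query vector, the patch feature matrix, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(query, patch_feats):
    """Weight patch features by their relevance to one (hypothetical) gene query.

    query:       shape (d,)   embedding standing in for a gene
    patch_feats: shape (n, d) one feature vector per cell patch
    """
    d = patch_feats.shape[1]
    weights = softmax(patch_feats @ query / np.sqrt(d))
    return weights @ patch_feats, weights

def mean_pool(patch_feats):
    """Simple pooling: every patch contributes equally."""
    return patch_feats.mean(axis=0)
```

With an uninformative (all-zero) query the attention weights become uniform and the result reduces to mean pooling, which is the sense in which simple pooling ignores differences between patches.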

Dataset and evaluation metrics {R4,R5} For the Xenium dataset, we have publicly available HR ST [9] as ground truth, as described in Section 3.1. HR ST is unavailable in the two external validation sets, so we follow [10] for model evaluation, where ST enhancement should retain the original spot-level pattern while increasing resolution [10]. Specifically, the enhanced ST maps are first downsampled to LR and then used to calculate metrics against the paired LR ST. We will further clarify the dataset preparation and evaluation metrics.
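The evaluation protocol for datasets without HR ground truth can be sketched as follows. This is an illustration under stated assumptions: the downsampling operator is taken to be a block mean over an integer factor, which the rebuttal does not specify, and the function names are hypothetical. RMSE and PCC are the metrics named elsewhere on this page.

```python
import numpy as np

def downsample_mean(hr_map, factor):
    """Block-mean downsampling of a 2-D map (assumed operator, see lead-in)."""
    h, w = hr_map.shape
    if h % factor or w % factor:
        raise ValueError("map size must be divisible by factor")
    return hr_map.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def rmse(a, b):
    """Root-mean-square error between two maps of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def pcc(a, b):
    """Pearson correlation coefficient between two flattened maps."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def evaluate_against_lr(enhanced_hr, lr_map, factor):
    """Downsample the enhanced map to LR, then score it against the measured LR map."""
    pooled = downsample_mean(enhanced_hr, factor)
    return rmse(pooled, lr_map), pcc(pooled, lr_map)
```

Note that this check only verifies consistency with the measured spot-level pattern; it cannot reward or penalize detail finer than the LR grid.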

Downstream task validation (R4) HR ST at cellular resolution has inherent advantages over spot ST in downstream analysis, e.g., localizing cell types and studying cell-cell communication [3,10]. We appreciate the reviewer’s insightful suggestion to conduct downstream exemplar tasks for model validation. Due to the paper length limit, we were unable to include these results; future work will include downstream validation at the cellular level.

Curriculum learning (R5) The complexity of spatial gene expression patterns varies with cellular complexity and heterogeneity (Jiaren Lin, 2023), which makes model training unstable across different patches. Therefore, we first estimate the cell complexity of individual patches using Shannon entropy [1], and then apply curriculum learning across the patches, which lets the model learn from samples of varied difficulty and stabilizes training for better results.
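A minimal sketch of entropy-based curriculum ordering follows. The rebuttal does not say what the entropy is computed over or how the curriculum is scheduled, so two assumptions are made here: entropy is taken over a grayscale intensity histogram of each patch, and patches are simply presented from lowest to highest entropy. The function names are hypothetical.

```python
import numpy as np

def shannon_entropy(patch, bins=32):
    """Shannon entropy (bits) of a patch's intensity histogram.

    Assumption: `patch` holds grayscale intensities scaled to [0, 1].
    """
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the sum
    return float(-(p * np.log2(p)).sum())

def curriculum_order(patches, bins=32):
    """Return patch indices sorted from low to high entropy (easy to hard)."""
    scores = [shannon_entropy(p, bins) for p in patches]
    return sorted(range(len(patches)), key=lambda i: scores[i])
```

A training loop could then iterate over patches in this order, or gradually widen the pool of admitted patches as training progresses; either schedule is a design choice not specified in the source.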

Comparisons (R5) We thank the reviewer for pointing out the two representative methods. Despite their effectiveness in ST enhancement, both xFuse and iStar are unable to directly utilize HR ST maps in training; they only use the downsampled LR ST as weak supervision and are thus less capable of reconstructing expression details. In addition, xFuse may suffer from low test speed (~1 day/WSI). We will add detailed quantitative comparisons with these methods in a future version.

Gene range (R5) We observed that highly variable genes (e.g., GATA3 for breast cancer) are easier to predict, consistent with [10]. We thank the reviewer for this comment and will investigate the gene range in future work.

CIGC (R5) The CIGC-Graph network is proposed to model the correlation between the features of multiple genes. Fig. 1 and Supplementary Fig. 1 show its information flow and structure. We will follow the advice and describe the model better in a future version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Reviewers were concerned that individual blocks of the network are not well justified or explained, and asked for more clarity about how the evaluation was done. The rebuttal clarifies these points very well. Given the novelty and the evaluation that has been done, the paper is worthy of acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The manuscript introduces a novel method for enhancing the spatial resolution of spatial transcriptomic (ST) maps by leveraging a cross-modal conditional diffusion model. This method integrates histology images with gene expression profiles to achieve super-resolved ST maps, addressing the critical issue of low resolution in current ST platforms.

    Strengths of the paper:

    • The paper proposes a cross-modal conditional diffusion model. The use of a multi-modal disentangling network and cross-modal adaptive modulation is novel and shows significant improvements over existing methods.
    • The paper demonstrates substantial improvements over state-of-the-art methods in multiple evaluation metrics across three public datasets. The quantitative comparisons show the proposed method achieving the lowest RMSE and highest PCC, indicating superior performance.
    • The paper includes a thorough ablation study, validating the effectiveness of each component of the proposed model. The visual comparisons further illustrate the improved resolution and clarity of the super-resolved ST maps.

    Weaknesses and responses:

    • The need for hierarchical cell-to-tissue feature extraction and the curriculum learning approach could have been better justified in the original submission. The authors clarified that cellular heterogeneity requires considering distinct contributions of cells to gene expression profiles, which a simple pooling approach would not capture accurately.
    • The authors clarified that for datasets without high-resolution ST ground truth, they followed established evaluation methods by downsampling enhanced ST maps and comparing them to the original low-resolution ST maps. The detailed comparisons with methods like iFuse and xFuse were acknowledged and promised for future versions, highlighting the distinct advantages of their approach.
    • While the paper did not include downstream tasks due to space limitations, the authors acknowledged the inherent advantages of their high-resolution ST maps for such analyses and indicated that future work would include these validations.
    • The rebuttal clarified the importance of disentangling shared and unique features to bridge the translation from histology images to spatial gene expression, further supporting the methodological soundness of their approach.
  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



