Abstract

Spatial transcriptomics (ST) provides crucial insights into tissue micro-environments, but is limited by its high cost and complexity. As an alternative, predicting gene expression from pathology whole slide images (WSI) is gaining increasing attention. However, existing methods typically rely on single patches or a single pathology modality, neglecting the complex spatial and molecular interactions between target and neighboring information (e.g., gene co-expression). This leads to a failure in establishing connections among adjacent regions and capturing intricate cross-modal relationships. To address these issues, we propose NH²ST, a framework that integrates spatial context and both pathology and gene modalities for gene expression prediction. Our model comprises a query branch and a neighbor branch to process paired target patch and gene data and their neighboring regions, where cross-attention and contrastive learning are employed to capture intrinsic associations and ensure alignments between pathology and gene expression. Extensive experiments on six datasets demonstrate that our model consistently outperforms existing methods, achieving nearly 28.77% in PCC metrics.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2697_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/MCPathology/NH2ST

Link to the Dataset(s)

STNet and Skin dataset: https://github.com/NEXGEM/TRIPLEX

INT and ZEN dataset: https://github.com/mahmoodlab/HEST

PCW and Mouse dataset: https://github.com/JiawenChenn/STimage-1K4M

BibTex

@InProceedings{QuMin_Spatially_MICCAI2025,
        author = { Qu, Mingcheng and Wu, Yuncong and Di, Donglin and Gao, Yue and Su, Tonghua and Song, Yang and Fan, Lei},
        title = { { Spatially Gene Expression Prediction using Dual-Scale Contrastive Learning } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        page = {580 -- 589}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce NH²ST, a framework for predicting gene expression from WSIs that aims to learn cross-modal interactions between the modalities (ST and histopathology) using cross-attention and contrastive learning in the query branch. Additionally, it incorporates hypergraph learning through a neighbor branch to integrate spatial context in both modalities, to support the query branch and learn more robust patch features. Moreover, a predictor is trained to predict gene expression from the patch features. The authors show that NH²ST improves over 6 datasets compared to several baselines and they perform extensive ablation studies.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The use of cross attention in addition to contrastive learning allows for deeper modeling of cross-modal interactions.
    • The use of hypergraph learning to model both the spatial and molecular interactions between neighboring patches.
    • NH²ST does not rely on representative spots from the training data during inference.
    • The authors performed extensive ablation studies, helping to understand and motivate the contributions of each component.
    • Extensive evaluation on multiple datasets and metrics and overall improved performance.
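
    The cross-attention strength listed above can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: the token shapes, the identity projections, and the choice to give each attention direction its own projection matrices are assumptions made only for illustration.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product: rows of a times columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    # Numerically stable softmax applied to each row.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` tokens attend to `context` tokens."""
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

# Hypothetical toy features: 2 pathology tokens and 3 gene tokens, dimension 2.
h_p = [[1.0, 0.0], [0.0, 1.0]]
h_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections

# Pathology attends to gene features, and gene features attend to pathology.
z_p, w_pg = cross_attention(h_p, h_g, identity, identity, identity)
z_g, w_gp = cross_attention(h_g, h_p, identity, identity, identity)
```

    In a trained model the three projections would be learned, and each direction could carry its own set; passing them as explicit arguments keeps that choice visible.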
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The decision to use only the top 250 highly expressed genes may oversimplify the prediction task and introduce bias by focusing mostly on easy targets that are potentially highly expressed everywhere. While this strategy can reduce noise, it also potentially ignores biologically important genes, that are less highly expressed or vary more across different patches. A more robust alternative might be to select genes based on biological relevance.

    • The experimental setup for the baselines is not well-documented:
        o Are baselines trained under the same conditions (e.g., optimizer, batch size, epochs)?
        o Do all baselines use the same pretrained encoder, or are their original encoders used?
        o If training setups and/or image encoder backbones differ, performance differences cannot be directly attributed to the architectural changes in NH²ST.
        o Overall, the experimental setup of the baselines and comparison with the proposed framework should be expanded and better documented, to give better and fairer insights into the novelty of the current work. While the ablation studies are helpful, the emphasis on these should not outweigh the setup and comparison of the baselines.

    • There are inconsistencies and ambiguities in the mathematical notation that may confuse readers or reduce clarity.
        o Inconsistent use of bold, capitalization and lower case for matrices.
        o It also helps if vectors and scalars have a different notation, e.g., by making the vectors bold.
        o The indexing of the data is not consistent. Sometimes a patch is referred to as x and other times it is referred to as x_i. Same with the gene expressions y_i and y. Moreover, in the MSE loss, y should be changed to y_i and same for ŷ.
        o Sometimes the authors refer to the obtained gene and image features as h^g and h^p and other times as h^g_s and h^p_s.
        o According to equation 1, the same Q, K and V projection matrices are used for both modalities, i.e., to obtain z^g_s and z^p_s. I assume this is not the case, but if they are indeed distinct, the notation should reflect this explicitly.
        o For L^n, the authors refer to L. I assume the authors mean a similar loss as formula 2, but notation is off.

    • Figure 2 lacks clarity and discussion. What do the values represent—expression values of a specific gene?

    • The absence of publicly available code and unclear train/val/test split settings, combined with the notably high PCC results, raises concerns about the potential for data leakage. It is not possible to assess this further without access to the exact implementation details and/or code.
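
    One common way to rule out the kind of leakage raised above is to split at the slide (or patient) level rather than at the spot level. The sketch below is a hypothetical illustration in plain Python; the function name, the slide identifiers, and the 20% test fraction are assumptions, not details taken from the paper.

```python
import random

def split_by_slide(spot_slide_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no slide contributes spots to both sets."""
    slides = sorted(set(spot_slide_ids))
    rng = random.Random(seed)
    rng.shuffle(slides)
    n_test = max(1, round(len(slides) * test_fraction))
    held_out = set(slides[:n_test])
    train_idx = [i for i, s in enumerate(spot_slide_ids) if s not in held_out]
    test_idx = [i for i, s in enumerate(spot_slide_ids) if s in held_out]
    return train_idx, test_idx

# Hypothetical toy data: 10 spots drawn from 5 slides.
slide_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]
train_idx, test_idx = split_by_slide(slide_ids)
train_slides = {slide_ids[i] for i in train_idx}
test_slides = {slide_ids[i] for i in test_idx}
```

    A spot-level random split would place neighboring, highly correlated spots of the same slide on both sides of the split, which can inflate correlation metrics.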

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • The first two experiments (on STNet and Skin datasets) add limited insights, as they rely on results directly sourced from the TRIPLEX paper, making the comparisons less meaningful due to differences in experimental setup.

    • The paper is overall well written and the motivation is clearly stated. Some minor notes:
        o Terminology: I would recommend to refer to histopathology images rather than “pathology images,” as the work focuses on WSIs.
        o The caption of Fig. 1 states that “the model consists of two key components: a query patch and a neighbor branch.” I assume the authors mean query branch instead of query patch?
        o The title states “spatially gene expression prediction” but it should be “spatial gene expression prediction”.

    For future work, I would recommend to explore predictions on a broader range of genes (e.g., highly variable or biologically curated sets). Moreover, I would recommend to keep the experimental setup as much the same as in the proposed framework for a fair and meaningful comparison.
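
    The difference between selecting highly expressed genes and selecting highly variable genes can be seen on a toy count matrix. This is a sketch under stated assumptions: scoring is done on log1p-transformed counts, the helper names are hypothetical, and the paper's actual selection procedure may differ.

```python
import math

def log1p_matrix(counts):
    # Log-transform raw counts (spots x genes).
    return [[math.log1p(c) for c in row] for row in counts]

def top_k_genes(expr, k, criterion="mean"):
    """Rank genes by mean expression or by variance across spots; return top-k indices."""
    n_spots = len(expr)
    scored = []
    for g in range(len(expr[0])):
        col = [expr[s][g] for s in range(n_spots)]
        mean = sum(col) / n_spots
        if criterion == "mean":
            score = mean
        elif criterion == "variance":
            score = sum((v - mean) ** 2 for v in col) / n_spots
        else:
            raise ValueError("criterion must be 'mean' or 'variance'")
        scored.append((score, g))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scored[:k]]

# Hypothetical counts, 4 spots x 3 genes:
# gene 0 is high everywhere, gene 1 is switched on in some spots only, gene 2 is low and flat.
counts = [
    [100, 0, 5],
    [101, 50, 5],
    [99, 0, 6],
    [100, 60, 5],
]
expr = log1p_matrix(counts)
by_mean = top_k_genes(expr, k=1, criterion="mean")
by_variance = top_k_genes(expr, k=1, criterion="variance")
```

    Here the mean criterion picks the uniformly high gene, while the variance criterion picks the spatially varying one, which is the kind of gene the recommendation above is concerned about losing.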

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is well-motivated and novel. NH²ST combines cross-attention, contrastive learning, and hypergraph-based context modeling and the authors support their approach with extensive ablation studies. Notably, NH²ST does not rely on representative spots from the training data during inference. The consistent improvements across diverse datasets further demonstrate the promise of the method. One major weakness that cannot be addressed in the rebuttal is the decision to limit the prediction task to the top 250 highly expressed genes, which may bias the evaluation toward easily predictable genes and exclude biologically meaningful low-expression genes. However, there are several issues that limit the reproducibility and clarity of the work, that can be (partially) addressed during the rebuttal. First, the experimental setup of the baselines is not sufficiently documented. Without knowing whether comparable pretrained patch encoders and/or training protocols were used, it is difficult to determine whether performance gains can be attributed solely to architectural changes. Second, the mathematical notation contains several inconsistencies, which complicate readability. Clarifying these points would significantly improve the clarity of the technical content. Additionally, Figure 2 lacks context and interpretation. Finally, the lack of publicly available code and the absence of a clear train/validation/test split prevent verification of the experimental setup and raise concerns about potential data leakage, especially given the unusually high reported PCC scores. Providing more information on the data split strategy or making the code available, would help with addressing this concern.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose NH²ST, a spatial transcriptomics framework designed to predict gene expression by leveraging both histopathology images and spatial context. The architecture is structured around two main components: a query branch and a neighbor branch. Both branches incorporate cross-attention mechanisms and cross-modal contrastive learning to align and refine features across modalities (histopathology image patches and gene expression data). The query branch processes a histopathology patch in conjunction with the gene expression vector of a single target spot, while the neighbor branch operates on a local neighborhood of patches and their corresponding gene expression vectors to model spatial interactions.

    A key element of the framework is the prediction translator, which maps the learned pathology features from the query branch to a gene expression vector, enabling the inference task. Each component (the query branch, the neighbor branch, and the translator) has its own associated loss function. The total training objective is defined as the weighted sum of these three losses, guiding the model to jointly optimize feature alignment and gene expression prediction. Experimental validation across six publicly available datasets demonstrates that NH²ST consistently achieves superior performance over existing methods, as measured by Pearson Correlation Coefficient (PCC).
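
    The weighted three-term objective described above can be sketched as follows. This is an illustration only: the symmetric InfoNCE form of the contrastive terms, the temperature value, and the exact placement of the weights λ₁ and λ₂ are assumptions, since the review does not reproduce the paper's equations.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, tau=0.1):
    """Mean cross-entropy of matching anchors[i] to positives[i] among all positives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / tau for p_j in p]
        mx = max(logits)
        log_denominator = mx + math.log(sum(math.exp(l - mx) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

def symmetric_contrastive(img_feats, gene_feats, tau=0.1):
    # Average of image-to-gene and gene-to-image directions (assumed form).
    return 0.5 * (info_nce(img_feats, gene_feats, tau) + info_nce(gene_feats, img_feats, tau))

def total_loss(l_pred, l_query, l_neighbor, lambda1=1.0, lambda2=1.0):
    # Assumed weighting: prediction loss plus weighted query and neighbor losses.
    return l_pred + lambda1 * l_query + lambda2 * l_neighbor

# Hypothetical toy features for two spots.
img = [[1.0, 0.0], [0.0, 1.0]]
gene_aligned = [[1.0, 0.0], [0.0, 1.0]]
gene_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive(img, gene_aligned)
loss_swapped = symmetric_contrastive(img, gene_swapped)
combined = total_loss(0.5, 0.2, 0.1, lambda1=2.0, lambda2=3.0)
```

    Matched image and gene features give a loss near zero, while swapped pairs are penalized heavily, which is the alignment pressure the contrastive terms are meant to provide.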

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Relevance of the Task

    • The paper highlights the importance of gene expression prediction from histopathology images, emphasizing the advantages offered by spatial transcriptomics while acknowledging its scalability limitations due to cost and technical complexity.

    Technical Novelty of the Approach

    • A limitation in current approaches is the lack of explicit modeling of interactions between ST spots. They address this gap using hypergraph-based modeling, which allows for capturing complex spatial relationships.

    • While previous methods leveraging contrastive learning focus primarily on aligning features from histopathology and gene expression within the same spatial location, this work introduces a dual-branch design that incorporates dual-level contrastive losses to enforce intrinsic cross-modal consistency. To the best of my knowledge, this is the first work to combine hypergraph modeling with contrastive learning in this context, which represents a substantial contribution.

    Technical Correctness of the Paper

    • The preprocessing pipeline, including the selection of the 250 most highly expressed genes and the application of a log transformation, demonstrates methodological practice and reflects state-of-the-art domain understanding.

    • The model architecture includes mutual refinement of pathology and ST features via cross-attention and alignment through contrastive learning, which strengthens the multimodal integration.

    • Two hypergraphs are constructed based on feature similarity and spatial proximity, providing a robust framework to model the interactions among different ST spots for both modalities, capturing complex spatial and molecular interactions.
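
    A rough sketch of how two such hypergraphs might be built is given below. It assumes k-nearest-neighbor hyperedges (one per spot, containing the spot and its k closest spots) under Euclidean distance, once on spatial coordinates and once on feature vectors. The paper's actual construction rule is not stated in this review, so the function names and the distance choice are hypothetical.

```python
import math

def knn_hyperedges(points, k):
    """One hyperedge per node: the node itself plus its k nearest neighbors."""
    edges = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        edges.append(sorted([i] + others[:k]))
    return edges

def incidence_matrix(edges, n_nodes):
    """H[v][e] = 1 if node v belongs to hyperedge e, else 0."""
    h = [[0] * len(edges) for _ in range(n_nodes)]
    for e, members in enumerate(edges):
        for v in members:
            h[v][e] = 1
    return h

# Hypothetical toy data: 4 spots with 2-D coordinates and 2-D features.
coords = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
feats = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.1), (0.0, 0.9)]

spatial_edges = knn_hyperedges(coords, k=1)   # groups physically adjacent spots
feature_edges = knn_hyperedges(feats, k=1)    # groups spots with similar features
h_spatial = incidence_matrix(spatial_edges, len(coords))
h_feature = incidence_matrix(feature_edges, len(feats))
```

    In this toy case the spatial hypergraph pairs spots 0 and 1 and spots 2 and 3, while the feature hypergraph pairs spots 0 and 2 and spots 1 and 3, showing how the two views can encode different neighborhoods.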

    Related work

    • The introduction demonstrates a solid understanding of prior work in gene expression prediction from histopathology images.

    • The comparative analysis is extensive, involving multiple state-of-the-art baselines and a diverse set of datasets. It is commendable that the authors included large-scale datasets such as HEST-1K and STimage-1K4M, which reflect current standards in spatial transcriptomics.

    Experimental Validation

    • The model consistently reports the best PCC across all datasets.

    • Ablation studies are comprehensive, covering all major components: query and neighbor branches, cross-attention, contrastive learning, and different graph types (standard graphs vs. hypergraphs). Further ablations investigate hypergraph parameters (number of neighbors k, number of layers L), as well as the impact of batch size, loss balancing parameters (λ₁, λ₂), and the temperature coefficient (τ) in contrastive learning.

    Writing and Presentation

    The paper is generally well written and logically structured, making it easy to follow.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Technical Correctness of the Paper

    The translator module relies solely on the query branch for inference. While this design simplifies the prediction phase, it fails to fully utilize the rich contextual information from neighboring patches that was considered during training. As acknowledged by the authors, the neighbor branch currently serves only as an auxiliary training signal to enhance the pathology encoder.

    In the conclusions, the authors mention that attempts to use the neighbor branch during inference led to degraded performance due to added complexity or noise, but do not provide concrete experimental evidence. It would be important to clarify what experiments were conducted, how the neighbor branch was incorporated during inference, and under what conditions it failed. As stated, this reads more like an unrealized extension rather than a demonstrated limitation.

    Related work

    Although mclSTExp is mentioned in the introduction, it is not included in the experimental comparisons. Given its methodological similarity, especially the use of contrastive learning and neighbor context, this omission is notable and weakens the comparative evaluation.

    Experimental Validation

    The model performs worse than STNet on MAE and MSE for the HEST-1K datasets, but this discrepancy is not discussed.

    The second-best results are not highlighted in the performance tables, and no discussion is provided around them, which could offer additional insights.

    Design choices related to the encoders are neither explored nor ablated.

    In the ablation study section, the statement “Integrating CL with cross-attention improved all three metrics” is only partially accurate. The configuration using query branch and self-attention yields better MAE and MSE results than the one integrating CA, yet this is not acknowledged.

    The prioritization of PCC throughout the experiments and hyperparameter selection (e.g., tuning λ₁ and λ₂) is not explained. There should be a discussion on the implications of optimizing for PCC versus MAE/MSE, and how each branch (query or neighbor) contributes differently to this trade-off.

    The qualitative results in Figure 2 are discussed superficially, and the visual differences between TRIPLEX and NH²ST are not evident. The claim that the proposed model shows “higher visual consistency with the true values” lacks strong visual support.
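
    The point above about PCC versus MAE/MSE can be made concrete with a small numeric sketch. PCC is unchanged by positive rescaling and shifting of the predictions, so a model can rank first on PCC while ranking worse on MAE and MSE. The numbers below are made up for illustration and do not come from the paper.

```python
import math

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical expression values of one gene across four spots.
truth = [1.0, 2.0, 3.0, 4.0]
pred_rescaled = [2.0 * v + 3.0 for v in truth]  # perfectly correlated, wrong scale
pred_close = [1.2, 1.8, 3.3, 3.7]               # small errors, imperfect correlation

pcc_rescaled, pcc_close = pearson(truth, pred_rescaled), pearson(truth, pred_close)
mse_rescaled, mse_close = mse(truth, pred_rescaled), mse(truth, pred_close)
mae_rescaled, mae_close = mae(truth, pred_rescaled), mae(truth, pred_close)
```

    The rescaled prediction wins on PCC but loses clearly on MSE and MAE, which is why selecting hyperparameters by PCC alone deserves an explicit justification.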

    Writing and Presentation

    There are minor typographical errors (e.g., “qurey”, “cross-attention CL”) that should be corrected.

    Figure 1 could be improved to better reflect the model’s structure. It would be helpful to include notations such as the encoders (ϕ), input patches (x), and gene expression vectors (y). Additionally, component naming should be unified throughout the paper, for instance, the figure uses “Query Branch” and “Neighbor Branch,” while the figure description uses “query patch branch.” Similarly, “Image Encoder” and “ST Encoder” in the figure are referred to as “Pathology and ST encoders” in the manuscript. Finally, the box labeled “Prediction” in the figure presumably refers to the Predictor Translator, which should be explicitly named.

    A potential error exists in the inference description: “During inference, since gene expression data is unavailable, only the query branch is used to predict gene expression values ŷᵢ = φg(φp(xᵢ))”. Based on the rest of the paper, it seems this should be: “only the Predictor Translator is used to predict gene expression values ŷᵢ = φt(φp(xᵢ))”.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Recommendations

    It would be beneficial to include comparisons with BG-TRIPLEX [1], a recent model that has demonstrated improved PCC over TRIPLEX. Additionally, comparing against HGGEP [2], which also employs hypergraph-based modeling, would provide a more complete assessment of the proposed approach. At a minimum, these models should be referenced and discussed in the introduction as part of the related work.

    The inclusion of statistical tests to assess the significance of performance differences would strengthen the evaluation. As demonstrated by benchmark studies such as SpaRED [3], many methods yield statistically non-significant enhancement results.
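
    One simple option for such a test is a paired sign-flip permutation test on per-gene PCC differences between two methods. The sketch below uses only the standard library; the function name, the number of resamples, and the toy PCC values are assumptions for illustration, and other paired tests could equally be used.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=2000, seed=0):
    """Two-sided sign-flip permutation test on paired differences; returns a p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-gene PCC values for a baseline and two competing methods.
baseline = [0.30, 0.35, 0.28, 0.40, 0.33, 0.37, 0.31, 0.29, 0.36, 0.34, 0.32, 0.38]
consistent = [v + 0.05 for v in baseline]        # better on every gene
mixed_a = [0.50, 0.40, 0.60, 0.55]
mixed_b = [0.49, 0.42, 0.58, 0.56]               # gains and losses cancel out

p_consistent = paired_permutation_test(consistent, baseline)
p_mixed = paired_permutation_test(mixed_a, mixed_b)
```

    A consistent per-gene gain yields a small p-value, whereas gains that cancel out across genes yield a p-value of 1, even if a single aggregate number might suggest an improvement.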

    References

    [1] M. Qu, Y. Wu, D. Di, A. Su, T. Su, Y. Song, and L. Fan, “Boundary-Guided Learning for Gene Expression Prediction in Spatial Transcriptomics,” in Proc. 2024 IEEE Int. Conf. Bioinformatics and Biomedicine (BIBM), Dec. 2024, pp. 445–450.

    [2] B. Li, Y. Zhang, Q. Wang, C. Zhang, M. Li, G. Wang, and Q. Song, “Gene expression prediction from histology images via hypergraph neural networks,” Brief. Bioinform., vol. 25, no. 6, bbae500, Nov. 2024. [Online]. Available: https://doi.org/10.1093/bib/bbae500

    [3] G. Mejia, D. Ruiz, P. Cárdenas, L. Manrique, D. Vega, and P. Arbeláez, “Enhancing Gene Expression Prediction from Histology Images with Spatial Transcriptomics Completion,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Interv. (MICCAI), Cham, Switzerland: Springer Nature, Oct. 2024, pp. 91–101.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a relevant problem in spatial transcriptomics by proposing a novel framework that outperforms state-of-the-art methods. The technical contributions about combining the dual-branch architecture and the use of hypergraphs to model inter-patch relationships, are original and well-motivated. The experimental results demonstrate consistent improvements in PCC across six datasets, and the ablation studies are thorough. However, the core limitation that prevents a higher recommendation is the design choice to use only the query branch during inference, despite having trained the model with a rich contextual representation through the neighbor branch. Additionally, while the authors acknowledge this issue in the conclusions, they do not present concrete experimental evidence to support their claim that incorporating the neighbor branch during inference leads to degraded performance. Despite the weaknesses, the paper offers a meaningful contribution to the field and introduces ideas that could inspire further work.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a new framework for spatial prediction of gene expression using a dual approach that: 1) aligns pathology and genomics at a tile level using cross-attention and contrastive learning, 2) explores the interaction among neighbor patches for both modalities using graph neural networks, 3) incorporates this dual-approach to enhance the final prediction of gene expression in the prediction branch that can run independently during inference. Additionally, it evaluates the framework in a large cohort with 6 different datasets and benchmarks against six different previous methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s major strength is its innovative framework that combines genomics and pathology feature interactions locally through cross-attention and contrastive alignment and spatially using graph neural networks for improving the direct prediction of spatial expression of genes from H&E WSI. Additionally, through ablation studies the authors show 1) the positive impact of including cross-attention before contrastive learning and 2) the performance gains from incorporating the neighbor branch with graph neural networks. All of this is evaluated in a comprehensive dataset and benchmark study design.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    A weakness of the paper is the lack of reproducibility, given that no GitHub repository is provided and the description of the implementation is vague. Additionally, it is not clear from the ablation study what the difference is between the Q+N with graphs (G.) and (H.G.) from the previous neighbor ablation studies. The authors also used a pretrained ResNet18 as their image encoder, missing the opportunity to benchmark against specialized pathology foundation models such as CONCH, Virchow2, and UNI (https://doi.org/10.48550/arXiv.2408.15823), and especially TANGLE, which incorporates genomics data from a larger pretraining dataset to learn pathology representations through contrastive learning.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Overall, this study presents an interesting approach to training the prediction of local gene expression, but it has to train its foundation anew in every training run. Instead, the authors could consider training a larger foundation model on these datasets using similar approaches that could then be used during inference for other tasks to improve the translational value of the study. This would align with current trends in computational pathology where foundation models like CONCH and Virchow2 have demonstrated strong generalization capabilities across multiple downstream tasks.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is a well-designed study that tackles the need for more accessible spatial transcriptomics and region-aware representations from WSIs. It proposes an approach for predicting spatial genomic expression from H&E by the combination of graph neural networks together with contrastive learning methods, which is a novel approach for locally aware pretraining. It includes different architecture design and ablation studies to prove the relevance of their design choice. However, the authors should consider comparing to SOTA methods for computational pathology.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the ACs and reviewers for their constructive feedback. Some grammatical and spelling errors have been corrected.

R#1-1: The experimental setup. A1-1: We will revise and clarify all experimental configurations (optimizers, learning rate schedules, training epochs, etc.).

R#1-2: Mathematical notation. A1-2: We have now carefully revised all equations and formulations.

R#1-3: Figure 2 clarity and discussion. A1-3: Figure 2 visualizes the expression values of specific marker genes across distinct tissue regions.

R#1-4: Available code. A1-4: We will release code and training/testing split.

R#2-1: Model performance and Metrics. A2-1: Our model demonstrates overall advantages across all datasets. We will refine the description of metrics in the revised version.

R#2-2: Figure 2 Discussion. A2-2: We acknowledge that TRIPLEX and NH²ST show visual similarities, making it challenging to distinguish between the two models based solely on visual inspection. We will clarify and show more results.

R#3-1: Ablation Study Details. A3-1: Q.+N. refers to the combination of the query branch and neighbor features without using any graph structure. In this case, the neighbor features are aggregated through a simple averaging of the neighbor patches. In contrast, G. uses a standard graph structure, while H.G. employs a hypergraph structure to capture more complex relationships.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


