Abstract

Spatial transcriptomics (ST) is a promising technique that characterizes the spatial gene profiling patterns within the tissue context. Comprehensive ST analysis depends on consecutive slices for 3D spatial insights, whereas the missing intermediate tissue sections and high costs limit the practical feasibility of generating multi-slice ST. In this paper, we propose C2-STi, the first attempt at interpolating missing ST slices at arbitrary intermediate positions between adjacent ST slices. Although intuitive, effective ST interpolation presents significant challenges, including 1) limited continuity across heterogeneous tissue sections, 2) complex intrinsic correlation across genes, and 3) intricate cellular structures and biological semantics within each tissue section. To mitigate these challenges, in C2-STi, we design 1) a distance-aware local structural modulation module to adaptively capture cross-slice deformations and enhance positional correlations between ST slices, 2) a pyramid gene co-expression correlation module to capture multi-scale biological associations among genes, and 3) a cross-modal alignment module that integrates the ST-paired hematoxylin and eosin (H&E)-stained images to filter and align the essential cellular features across ST and H&E images. Extensive experiments on a public dataset demonstrate our superiority over state-of-the-art approaches on both single-slice and multi-slice ST interpolation. Code is available at https://github.com/XiaofeiWang2018/C2-STi.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0326_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/XiaofeiWang2018/C2-STi

Link to the Dataset(s)

OpenST dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE251926

BibTex

@InProceedings{QueNin_Adaptive_MICCAI2025,
        author = { Que, Ningfeng and Wang, Xiaofei and Chen, Jingjing and Jiang, Yixuan and Li, Chao},
        title = { { Adaptive Spatial Transcriptomics Interpolation via Cross-modal Cross-slice Modeling } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {46 -- 55}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel method named C2-STi to interpolate missing spatial transcriptomics (ST) slices. The method contains three novel parts: a distance-aware local structural modulation module to capture cross-slice deformations and enhance positional correlations between slices, a pyramid module to capture multi-scale associations among genes, and a cross-modal alignment module that integrates ST and hematoxylin and eosin (H&E) images to filter and align essential cellular features. Experiments on the HNSCC dataset demonstrate the effectiveness of C2-STi.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (a) This paper introduces a novel method, C2-STi, for interpolating missing spatial transcriptomics (ST) data. The proposed framework incorporates three key modules: cross-modal alignment, pyramid gene-expression correlation, and local structural modulation. These components are designed to capture structural, latent, and correlative information across adjacent tissue slices, thereby improving interpolation performance. Notably, the integration of these modules in the context of ST data interpolation is novel and original. (b) The cross-modal alignment module effectively leverages the correlation between H&E-stained histology images and ST slices. It employs a ResNet-50 backbone to extract visual features, which are then aligned via concatenated feature maps refined by attention mechanisms and activation functions. This approach has been experimentally validated to efficiently extract and align cross-modal features. (c) The pyramid gene-expression correlation module is designed to address both tissue-level and cellular-level deformations, as well as gene-specific expression variability. It consists of a pyramid encoder that extracts multi-scale features and a Multi-Gene Co-expression Graph (MGC-Graph) to model gene co-expression relationships. Both components are novel contributions to the ST image interpolation task, and their effectiveness has been demonstrated through experiments. (d) The distance-aware local structural modulation module captures local structural patterns across ST slices. It incorporates cross-section distance modeling, a coarse-to-fine modulation strategy, and deformable convolutional fusion layers. Experimental results confirm the effectiveness of this module in enhancing the interpolation accuracy.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    (a) The selection of comparison methods is not sufficiently convincing. Specifically, the paper includes several video-generation methods that are not explicitly designed for spatial transcriptomics (ST) image interpolation, which limits the fairness of the comparison. Additionally, the use of U-Net as a baseline is outdated and may not represent the current state-of-the-art. Although the authors cite two relevant works on ST image interpolation (references [11] and [19]), neither is included in the experimental comparisons. To strengthen the evaluation, the authors should incorporate more advanced and task-specific methods as baselines. (b) The evaluation metrics are not clearly explained. Since multiple images may be generated for a single inference, it is important to clarify how standard metrics—such as PSNR, SSIM, PCC, and NMSE, which are typically defined for single image pairs—were computed in this multi-output setting. A detailed explanation of how these metrics were adapted or aggregated across generated outputs would improve the clarity and reproducibility of the experimental evaluation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed paper presents a novel method, C2-STi, for interpolating missing spatial transcriptomics (ST) data, which is a significant and timely problem in biomedical image analysis. The authors introduce three innovative modules: a cross-modal alignment module that integrates histology and transcriptomics data, a pyramid gene-expression correlation module to capture multi-scale gene associations, and a distance-aware local structural modulation module to enhance slice-wise consistency. The overall architecture is thoughtfully designed, and the experimental results on the HNSCC dataset demonstrate promising performance improvements, which validate the proposed approach. However, there are a few limitations that affect the overall strength of the paper. First, the choice of baseline methods is not fully convincing. Some of the comparison methods are designed for video generation and are not directly tailored for ST interpolation. The use of U-Net, while popular, is outdated and may not serve as a strong baseline for this specific task. More relevant and recent ST-specific methods (e.g., those referenced in the paper itself, such as references [11] and [19]) should be included in the comparison to better position the proposed method within the current state of the art. Second, the explanation of evaluation metrics lacks sufficient detail. Since multiple images can be generated for a single inference, clarification is needed on how metrics like PSNR, SSIM, PCC, and NMSE were computed or aggregated. These metrics are typically defined for single image pairs, so their use in a multi-output context should be clearly justified to ensure transparency and reproducibility. Despite these weaknesses, the core idea and contribution of the paper are novel and relevant to the MICCAI community. The integration of cross-modal information and the emphasis on both structural and gene-level representations offer a valuable perspective for advancing ST data processing. 
    With improved experimental comparisons and clearer metric descriptions, this paper has strong potential. Therefore, I recommend a Weak Accept, contingent on the authors addressing the concerns raised in a rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes C2-STi, the first attempt at interpolating missing spatial transcriptomics (ST) slices at arbitrary intermediate positions between adjacent ST slices. To achieve this, the authors propose a distance-aware local structural modulation module and a pyramid gene co-expression correlation module to capture positional correlations between ST slices and multi-scale biological associations among genes. In addition, a cross-modal alignment module is developed to align the essential cellular features across ST and H&E images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is well structured and written.
    2. The idea of interpolating missing ST slices between adjacent ST slices is interesting and novel. Introducing H&E images alongside ST images makes sense.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Although the proposed methods yield promising results, the network architecture is complex. Therefore, I suggest comparing the number of parameters and other relevant metrics of the proposed method with existing approaches to provide a comprehensive evaluation.
    2. The baseline methods considered in the comparison were all published before 2022. To more rigorously demonstrate the superiority of the proposed method, recent approaches should be included in the evaluation.
    3. The description of Fig. 2 is lacking. In addition, the quality of the interpolated ST slices appears suboptimal. The difference between the proposed method and the ground truth remains substantial.
    4. In the ablation experiments, the effectiveness of some crucial modules (e.g., CFM) has not been adequately validated.
    5. In Sec. 2.4, please number the total loss function as Eqs. (1)–(5). In addition, the values of λ_sim and λ_smo are both set as 1. How were these values determined? Was any parameter tuning or selection performed?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the comments above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes C2-STi, a novel deep learning framework for interpolating missing spatial transcriptomics (ST) slices at arbitrary positions between observed tissue sections. Unlike previous methods limited to single-slice or midpoint interpolation, C2-STi enables flexible and structure-aware slice generation through three key modules:

    • Cross-modal alignment module to fuse ST maps with paired H&E histology images,
    • Pyramid gene co-expression correlation module to model gene dependencies across multiple spatial resolutions,
    • Distance-aware local structural modulation (DLSM) module to adaptively model non-linear tissue deformations and enable fine-grained interpolation. Experiments demonstrate superior performance on one public dataset.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Unlike prior work limited to midpoint or fixed-location interpolation, C2-STi supports generating any number of slices between input ST images.
    2. The design addresses cross-slice deformation, gene co-expression structure, and multimodal histology–transcriptomic alignment.
    3. The cross-modal attention module enhances gene prediction by using the structural context of histology images.
    4. Outperforms five SOTA interpolation methods (e.g., RIFE, DAIN, IFRNet) in both single-slice and multi-slice ST interpolation on the HNSCC dataset.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Only tested on one dataset (HNSCC); robustness on other tissues, resolutions, or ST platforms (e.g., MERFISH, Slide-seq) is unknown. No runtime or efficiency analysis is provided. Although interpolation quality is measured quantitatively, no biological interpretation (e.g., preserved gene domains or cell boundaries) is provided. The combination of pyramid encoders, GCN, and deformable convolutions makes the architecture sophisticated—potentially harder to tune or deploy.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, C2-STi is a technically impressive and novel solution for arbitrary ST slice interpolation. It successfully addresses structural deformation, gene-level correlation, and multimodal fusion—backed by strong empirical gains over baselines. With improved clarity, broader validation, and runtime analysis, the work would make a compelling contribution to spatial omics modeling and deep learning for tissue reconstruction. I suggest weak accept.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors propose a model for imputing intermediate ST slices by integrating H&E images and ST gene expression maps. The method includes cross-modal alignment, a pyramid gene co-expression module, and a distance-aware structural modulation to preserve spatial context across sections.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The model is designed to capture gene co-expression relationships and incorporate histological features from H&E images, addressing unique challenges in ST data.
    • The model is trained on a dataset with subcellular-resolution gene expression, which enables finer-grained spatial predictions. This offers a clear advantage over models trained on lower-resolution, spot-based ST platforms such as Visium.
    • The model shows a clear improvement over SOTA methods in 4 metrics.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The predictability of genes in ST data varies widely depending on sparsity, histology association, and cell-type specificity. Notably, genes expressed by mobile or small-sized cells such as lymphocytes may not be well preserved across sections. It would be good to have the range or variability of performance across different genes in the results tables.
    • A decrease in performance is observed when the number of interpolated slices increases. It is unclear whether this degradation is due to the increased number of slices being predicted simultaneously, greater physical distance between the input slices and the target slices, or both. The paper would benefit from reporting metrics for each interpolated slice individually in the N-slice setting. If slice number is the primary factor impacting performance, it may be more effective to predict one slice at a time for N consecutive sections.
    • When the number of interpolated slices increases from 1 to 2, RIFE, DAIN, and IFRNet show improved PCC, while C2-STi shows a slight decline. The authors should explain why their model underperforms in this scenario.
    • DeepSpaCE (Monjo et al., 2022) is, to our knowledge, the first model specifically designed for predicting ST in intermediate tissue sections. Please justify why it is not included in the method comparison.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a new method for gene expression interpolation, making a useful contribution to spatial transcriptomics.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank reviewers and AC for their valuable feedback.

[R1]

  1. Performance Variability Across Genes. We agree that gene predictability varies due to factors like cell type specificity. For instance, genes expressed in small, mobile lymphocytes (e.g., IGHG3, PCC 0.26) are harder to predict than highly expressed genes in stromal regions (e.g., KRT6A, PCC 0.64).

  2. Multi-slice Interpolation. We explored two factors affecting performance: (1) the number of predicted slices and (2) the distance between slices. For (1), increasing the number of target slices from 1 to 2 (with fixed 15μm distance to inputs) reduced SSIM from 0.73 to 0.69. For (2), increasing input spacing from 20μm to 30μm (with one target slice) lowered SSIM from 0.77 to 0.66. These results indicate that both more predicted slices and wider input spacing degrade performance.

  3. Compared Method. DeepSpaCE, while pioneering in ST interpolation, only supports a single fixed-position slice and lacks open-source code. We plan to re-implement and extend it for multi-slice settings in future work.
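The per-gene variability raised in item 1 can be reported with a small helper. The sketch below is illustrative only and is not taken from the C2-STi code: it assumes predictions and ground truth are stored as arrays of shape (genes, height, width), and the names `per_gene_pcc`, `summarize`, and the gene labels are hypothetical.

```python
import numpy as np

def per_gene_pcc(pred, target):
    """Pearson correlation per gene for arrays shaped (genes, height, width).

    Genes whose prediction or ground truth is constant get NaN,
    because the correlation is undefined in that case.
    """
    g = pred.shape[0]
    p = pred.reshape(g, -1).astype(float)
    t = target.reshape(g, -1).astype(float)
    p = p - p.mean(axis=1, keepdims=True)
    t = t - t.mean(axis=1, keepdims=True)
    denom = np.sqrt((p ** 2).sum(axis=1) * (t ** 2).sum(axis=1))
    out = np.full(g, np.nan)
    valid = denom > 0
    out[valid] = (p[valid] * t[valid]).sum(axis=1) / denom[valid]
    return out

def summarize(pccs, gene_names):
    """Range of per-gene PCC, ignoring undefined (NaN) genes."""
    return {
        "min": float(np.nanmin(pccs)),
        "median": float(np.nanmedian(pccs)),
        "max": float(np.nanmax(pccs)),
        "worst_gene": gene_names[int(np.nanargmin(pccs))],
        "best_gene": gene_names[int(np.nanargmax(pccs))],
    }
```

Reporting the minimum, median, and maximum alongside the mean would show how much genes such as those cited in the response differ in predictability.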

[R2]

  1. Comparison Methods. DeepSpaCE [11] is limited to one intermediate slice and does not generalize to our multi-slice setup. Diff-ST [19] focuses on super-resolution rather than interpolation. Nonetheless, we acknowledge the need for stronger baselines and will include additional methods in future work.

  2. Evaluation Metrics. For multi-slice settings, we compute PSNR, SSIM, PCC, and NMSE for each slice and report the average across all targets.
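The aggregation described in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation script: the metric definitions (PSNR with a data range of 1.0, NMSE normalized by the ground-truth energy, PCC over all flattened values) and the function names are assumptions. SSIM is omitted to keep the sketch dependency-free; a library implementation such as scikit-image's `structural_similarity` could be averaged the same way.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB (data_range is an assumed value)."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def nmse(pred, target):
    """Squared error normalized by the ground-truth energy (assumed definition)."""
    return float(np.sum((pred - target) ** 2) / np.sum(target ** 2))

def pcc(pred, target):
    """Pearson correlation over all flattened values of one slice."""
    return float(np.corrcoef(pred.ravel(), target.ravel())[0, 1])

def average_over_slices(preds, targets):
    """Compute each metric per interpolated slice, then average across slices."""
    per_slice = [
        {"psnr": psnr(p, t), "nmse": nmse(p, t), "pcc": pcc(p, t)}
        for p, t in zip(preds, targets)
    ]
    mean = {k: float(np.mean([m[k] for m in per_slice])) for k in per_slice[0]}
    return per_slice, mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = [rng.random((4, 16, 16)) for _ in range(2)]  # two target slices
    preds = [t + 0.05 * rng.standard_normal(t.shape) for t in targets]
    per_slice, mean = average_over_slices(preds, targets)
    print(per_slice, mean)
```

Keeping the per-slice values next to the averaged ones also makes it possible to report each interpolated slice individually, as requested in the reviews.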

[R3]

  1. Model Complexity. Our C2-STi has 12.2M parameters, fewer than DAIN (24.0M) and RIFE (13.8M), yet achieves comparable or better results, demonstrating efficiency.

  2. Compared Methods & Ablation. As suggested, we will explore more recent methods and conduct additional ablation studies on modules such as CFM in future work.

  3. Others. We will include a detailed explanation for Fig. 2 and number the full objective as Eqs. (1)–(5). Hyperparameters λ_sim and λ_smo are both set to 1, based on prior knowledge indicating equal importance of similarity and smoothness. We will explore more tuning in future work.

[R4]

  1. Test Set. We used the HNSCC dataset due to limited availability of consecutive ST-H&E pairs. Future validation will include synthetic and additional real-world datasets.

  2. Model Efficiency. In single-slice interpolation, C2-STi achieves 0.0218s per slice, outperforming RIFE (0.025s) and significantly faster than DAIN (1.343s), confirming its efficiency.

  3. Biological Interpretation. Due to space constraints, we omitted biological insights. We found that our model better predicts genes with distinct spatial boundaries. In future work, we will assess biological relevance through gene module preservation, spatial domain clustering, and consistency with cellular structure.

  4. Architecture. Our model is designed to capture cross-modal, cross-slice, and cross-gene dependencies. We acknowledge the need for simpler, more deployable models and will aim to optimize and streamline the architecture in future versions.
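The response does not state how the per-slice timings in item 2 were measured. One common way to obtain such a number is sketched below, with warm-up runs and a median over repeated calls. The helper `seconds_per_slice` and its parameters are hypothetical, and `fn` stands for any interpolation callable.

```python
import statistics
import time

def seconds_per_slice(fn, inputs, n_slices=1, warmup=3, repeats=20):
    """Median wall-clock time of fn(*inputs), divided by the slices it produces.

    Warm-up calls are excluded so that one-off setup costs do not skew the result.
    """
    for _ in range(warmup):
        fn(*inputs)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*inputs)
        times.append(time.perf_counter() - start)
    return statistics.median(times) / n_slices
```

For models running on a GPU, the device would need to be synchronized before each clock reading (for example with `torch.cuda.synchronize()` in PyTorch); otherwise asynchronous kernel launches make the measured time too small.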




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


