Abstract

Histological images are essential in biomedical research and diagnosis, extending beyond detailed cell and tissue morphology to provide an intuitive view of the cellular microenvironment and spatial relationships. While single-cell gene expression data reveal molecular distinctions in cell states, their complexity obscures cellular interactions and spatial organization. To overcome this, reconstructing histological images from large-scale single-cell data is essential for intuitively visualizing spatial architecture. This paper proposes a single-cell-level histological image generation method that derives cell state representations from gene expression data using a single-cell foundation model. A conditional diffusion model is leveraged to generate histological images, accurately reconstructing the cellular microenvironment and spatial cell type distribution. By decoupling cellular state into two components, cell type and microenvironment, we propose two complementary approaches for generating pathology images, one conditioned on scRNA-seq data and the other on cell type. Our approach successfully generates high-quality histological images of human breast and colon cancer tissues, capturing key spatial features such as cell density, compositional distribution, and cell spacing within tissues.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3404_paper.pdf

SharedIt Link: https://rdcu.be/eHw59

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05141-7_24

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{CaiHon_GE2Hist_MICCAI2025,
        author = { Cai, Hongmin AND Ji, Boan AND Cai, Shangyan AND Liao, Yi AND Chen, Jiazhou AND Huang, Weitian},
        title = { { GE2Hist: Generating Histology Images from Single-cell Gene Expression via Cross-modal Generative Network } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {240 -- 250}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper leverages a diffusion backbone modified by single-cell gene expression embeddings to generate histology images. A single-cell foundation model is used to obtain embeddings from scRNA-seq data. Two VAE encoders independently encode the scGPT embedding to make it more interpretable from a biological perspective. The results are compared with those of existing algorithms.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors leverage single-cell expression data as additional information to guide histology image generation. This pipeline is robust in natural vision generation. The application in this specific area is novel and proper for the research questions.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

- In Section 3.1, there is no detailed description of the data (sample size, cell type labels, multi-modal data pairing, etc.).
- For a generative model, the training strategies and the detailed model implementation are not clearly introduced.
- The cell type classifier model is not introduced. Neither the training/testing procedure for the classification task nor the label distribution is mentioned.
- Writing ambiguity: the term ‘single-cell gene expression’ in the title may not fully describe the data used in the pipeline. The core backbone is a diffusion model, and the single-cell profile is added to guide the generation process. Therefore, “histology images from single-cell gene expression” is not an accurate description.
- In Section 2.1, Equation (2), ‘p(ze)’ is not clearly defined.
- In Section 2.2, Equation (9), the default values of beta and gamma should be clearly stated.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The pipeline is compelling. But the real-world application may be limited and the results reported are hard to reproduce.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper
- The paper proposes an interesting concept of using scRNA-seq data and cell type to generate histopathology images
- A conditional diffusion model is modified to generate single-cell histology images along with their microenvironment
- First algorithm that aims to generate histological images from single-cell gene expression data
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel model architecture that explicitly considers cell type.
- Evaluation of cellular spacing and density of images in Figure 3, in addition to established metrics such as FID and LPIPS
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The study states that 38 cell types are identified; however, no details are provided on them. Fig. 4 shows different cell types A-D without explaining what they are. The proposed metrics on cellular distribution in Figure 3 need to consider the different cell types to ensure that the proportions of these cell types are captured. Also, this figure does not provide a comparison to other methods.
- While the concept is novel, the motivation for developing this approach is not clear: single-cell gene expression data are far more expensive to generate, so it is unclear how this method will be used in practice. In contrast, RNA-CDM considers bulk gene expression, which might be more feasible to obtain.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

An interesting concept is presented, but the novelty is limited and the relevance of the approach to real-world settings is limited. Also, the presented images are limited and do not capture the wide variability of tissue images.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The paper presents a deep learning based method for synthesizing histological images from single cell gene expression/transcriptomic data.

A foundational model (pre-trained) embeds gene expression data. Subsequently, two variational auto-encoders decouple two signals from this embedding: mean+variance in a latent cell type classification space, and a mean+variance in a latent microenvironment descriptor space.

The first signal is trained by cross entropy with known cell type labels (it is not clear enough in the paper whether these labels are human annotations that came with the dataset, or if they are predictions from the foundational model).

The second signal is regularized via KL divergence from a normal distribution (the fact that p(z_e) is an assumed normal prior should be stated much closer to equation (2) in the paper).
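For reference, when the prior is a standard normal, a KL regularizer of this kind has a well-known closed form for a diagonal-Gaussian posterior. The sketch below illustrates only that general formula; the function name and the log-variance parameterization are assumptions for illustration, not details taken from the paper.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    mu and log_var are equal-length sequences of floats (hypothetical
    encoder outputs for the microenvironment latent).
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )
```

The term vanishes exactly when the posterior equals the prior (zero mean, unit variance) and grows as the encoder output drifts away from it.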

The first and second signals are concatenated and used as the conditional variable for a conditional diffusion network (CDN). The CDN incrementally transforms noise into a salient image conditioned on the cell type and microenvironment embeddings. The CDN model weights, and in turn the microenvironment signal, are trained with the difference between the CDN model’s noise prediction and the true image’s noise value, at each increment in the diffusion process. (It is not sufficiently clear in the methods description that \epsilon_t is derived from given ground truth images; moreover, how \epsilon_t is derived from them should be explicitly described.)

Data from the Visium HD Spatial Gene Expression Library, which includes paired histological images and spatial transcriptomic data, is used to train the model. Two tissue types are included: breast cancer samples and colon cancer samples.

For both tissue types, the proposed model generated images whose distributions more closely match the ground truth image set than two baseline comparison models.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

First, let me say I think this paper is excellent. The work is very interesting and the methods proposed are at the cutting edge of multi-modal machine learning techniques. Several specific strengths are:

A well thought out and novel architecture wherein latent spaces are driven to have interpretable meaning. Learning a Gaussian Mixture Model in the cell type classifier space in particular is an excellent idea. This facilitates sampling of entirely de novo histological images without new transcriptomic data as input. This way, realistic examples of cell type categories can be synthesized in the context of different micro-environments.
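A rough sketch of what such de novo sampling could look like: pick a mixture component in the cell type space, draw a cell type latent from it, draw a microenvironment latent from a standard normal, and concatenate the two into the conditioning vector. Diagonal covariances, the function name, and the argument layout are all assumptions made for illustration; the paper's actual mixture parameterization is not described on this page.

```python
import random


def sample_condition(components, weights, env_dim, rng):
    """Draw a conditioning vector [z_c, z_e] without any transcriptomic input.

    components: list of (mean, std) pairs, one per cell type; each is a
                list of floats (diagonal Gaussian, an assumption).
    weights:    mixture weights, one per component.
    env_dim:    length of the microenvironment latent z_e.
    rng:        a random.Random instance, for reproducibility.
    """
    k = rng.choices(range(len(components)), weights=weights, k=1)[0]
    mean, std = components[k]
    z_c = [rng.gauss(m, s) for m, s in zip(mean, std)]    # cell type latent
    z_e = [rng.gauss(0.0, 1.0) for _ in range(env_dim)]   # microenvironment latent
    return k, z_c + z_e
```

Fixing the component index while redrawing the microenvironment latent would correspond to showing one cell type in different microenvironments.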

Generated images appear both qualitatively and quantitatively superior to baseline methods.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

Although I really like this paper, nothing is perfect and there are some weaknesses to point out.

Only data produced by one spatial transcriptomic method was tested. Frequently in the paper the authors claim implicitly that the model will work for any scRNA-seq input, but this claim is only evaluated for one transcriptomic data modality. The authors should not underestimate the variability of transcriptomic data collected via different platforms, instruments, and chemistries. It may be that the use of the foundational model upfront protects the overall method to some extent from this variability. However, without specifically testing for generalizability across different datasets, the claim of any scRNA-seq data must be tempered to the specific data presently evaluated.

In the introduction the authors also claim “The first algorithm for generating histological images from single-cell gene expression data has been developed…” However, in the preceding paragraphs the authors themselves enumerated existing prior techniques, and the presence of two baseline methods in the experiments section also contradicts this claim. The authors should be more respectful of prior art and temper their own claims to what is objectively true in a broad, rather than narrow, sense. Attentive reviewers and readers see through this kind of exaggeration.

In several places mathematical descriptions are incomplete:

I. It is not clear enough in the paper whether the supervising cell type labels are human annotations that came with the dataset, or if they are predictions from the foundational model.
II. The fact that p(z_e) is an assumed normal prior should be stated much closer to equation (2).
III. It is not sufficiently clear that \epsilon_t is derived from given ground truth images; moreover, how \epsilon_t is derived from them should be explicitly described.
IV. The parameter \alpha_t, which appears in equations (4) and (7), is never described anywhere in the text.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

On reproducibility - no source code is offered, and due to a few omissions and some lack of clarity in the mathematical details, I do not feel confident that I could reproduce this work myself.

Another question - and this is just for discussion purposes and is not related to whether I think the paper should be accepted or not. What is the aspirational clinical relevance of this kind of work?

From a technical standpoint it is very interesting, and the authors motivate the work initially by stating “… multimodal studies using paired histological images and gene expression data have demonstrated significant potential in disease diagnosis.” Several citations are given to support that statement. I am not intimately familiar with those citations, but from context it seems reasonable to assume those studies do not use synthetic data, but rather multiple modalities collected de novo from the samples. It is not assessed in the present work whether de novo histological images contain any features which are absent from the synthetic images. Put computationally, would a classifier trained with two de novo collected modalities perform better than a classifier trained with one de novo modality and one synthetic modality? Is everything present in the synthetic image just latent in the independent variable and therefore accessible to a classifier with the right architecture anyway, without the intermediate step of synthesizing a second modality?

I think these questions are very interesting and are the directions I would pursue with this work if it is going to be submitted to a journal. An additional direction would be to give real and synthetic images to pathologists and ask them to identify any differences (or see if they can correctly classify which images are real and which are synthesized).
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall - though there are some exaggerated claims and some details missing from the methods - I believe this work is of excellent quality. The methods are cutting edge and the application area is very interesting. I would be happy to see this paper selected for an oral presentation.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The biggest question I had about this work prior to rebuttal was what the application for this might be. The authors have provided a convincing option. This model can help estimate the phenotypic consequences of an expressed genotype change. That is, one can synthetically knock down a gene (dampen or zero its value in the scRNA-seq input vector) and see how the synthesized image changes. If enough of the true relationships between gene expression and cell type/microenvironment characteristics have been captured, then this could be a very useful tool for drug target discovery and cell culture experiment design.
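The input-side perturbation described here can be sketched in a few lines. This covers only the step of dampening or zeroing one gene in an expression vector; the embedding and image-generation steps are omitted because no code was released. The function name, the gene labels, and the `factor` parameter are hypothetical.

```python
def knock_down(expression, gene_names, gene, factor=0.0):
    """Return a copy of `expression` with one gene dampened.

    factor=0.0 zeroes the gene (full knock-out); 0 < factor < 1 dampens it.
    The input vector is left unmodified so the unperturbed cell can be
    compared against the perturbed one.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must lie in [0, 1]")
    idx = gene_names.index(gene)  # raises ValueError if the gene is absent
    perturbed = list(expression)
    perturbed[idx] *= factor
    return perturbed


# Placeholder gene labels and values, for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C"]
baseline = [2.0, 4.0, 1.0]
silenced = knock_down(baseline, genes, "GENE_B")
```

Feeding both `baseline` and `silenced` through the same trained generator, with the same noise seed, would then isolate the visual effect attributed to that gene.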

Author Feedback

We sincerely thank all the reviewers for their valuable feedback and for recognizing the innovative contributions in addressing a challenging problem. Their positive comments were particularly uplifting, with remarks such as “This paper is excellent,” and “The application in this specific area is novel and proper for the research questions.” Reviewers also kindly noted that “these questions are very interesting and are the directions I would pursue with this work if it is going to be submitted to a journal” and expressed enthusiasm, stating they “would be happy to see this paper selected for an oral presentation.” We have carefully considered all of the concerns raised. As they were highly consistent, we have organized our detailed responses by theme below.

What is the motivation and practical use? Our method translates scRNA-seq data—which captures not just individual cell characteristics but also their state as influenced by the surrounding ecological niche (e.g., cellular density, composition, distribution) and implicitly reflects cell-cell interactions—into intuitive histological images. Generating these vital histological images is crucial because they serve to bridge micro-scale cellular behaviour with macro-scale tissue understanding, enabling the observation of complex life processes or disease progression. This offers a holistic view of how cells function in their “native ecological niche” to shape tissue physiology. In emerging cellular digital twin applications, our approach offers precise visual simulation of how interventions like gene editing or drug perturbations reshape a cell’s ecological niche. This provides vital visual evidence for predicting and understanding cellular responses in their microenvironmental context, potentially reducing R&D costs. This paper represents an exploratory step by our team in the digital twin domain. Prior approaches have often utilized bulk gene expression averages for data augmentation; this methodology differs significantly from our primary research focus.

How is the model implemented? Regarding mathematical descriptions (full formulations will be in the revised manuscript): the prior p(ze) (Eq. 2) is N(0, 1); the defaults are β = 0.2 and γ = 0.1 (Eq. 9); for Eqs. 4 and 7, αt = 1 − βt (from the noise schedule), with ϵt derived accordingly. To clarify our model implementation and training strategy (exhaustive details will be in the final paper): RNA sequences (>19k-dim) are compressed by the foundation model to 256D, then by a pre-trained VAE to a 64D latent vector zs = [zc, ze]. Critically, the VAE’s zc and ze decoupling is not achieved via pre-training but is guided during a subsequent comprehensive training phase by cell-type classification, KL divergence, and diffusion denoising losses. Our U-Net (ResBlocks, CrossAttention for zs conditioning) performs diffusion (T = 1000 steps, cosine schedule, Adam). An MLP cell type classifier is randomly initialized and trained concurrently with this diffusion.
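To make the roles of βt, αt and ϵt concrete, here is a minimal sketch of a cosine noise schedule and the forward noising step of a standard denoising diffusion model. It assumes the commonly used cosine schedule with offset s = 0.008 and a clipping value of 0.999; the rebuttal says only "cosine schedule", so those constants, the function names, and the flat-list image representation are assumptions, not the authors' implementation.

```python
import math
import random


def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    """Per-step noise levels beta_1..beta_T from a cosine schedule (assumed form)."""
    def f(t):
        return math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - f(t) / f(t - 1), max_beta) for t in range(1, T + 1)]


def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bar, rng):
    """Noise a ground-truth image x0 to step t (1-based).

    Returns (x_t, eps); eps is the Gaussian noise actually added and is
    the regression target for the denoising network at this step.
    """
    a = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_mse(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)
```

In this formulation the noise is sampled rather than computed from the image: the ground-truth image enters only through x_t, and the network is trained to recover the sampled noise from x_t, the step index, and the conditioning vector.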

What are the specifics of the data used in this study? Our study utilized data from the Visium HD Library, obtaining whole-slide images (tens of thousands of cells/image) with corresponding sub-cellular precision RNA sequencing. Through preprocessing, including nuclei localization and segmentation, we derived per-nucleus sequencing data, which was then paired with a 256x256 image patch centered on each nucleus, representing its ecological niche. Cell types (identified via a foundation model; full lists will be in the manuscript) exhibited natural imbalance; we retained all data as experimentation showed balancing did not improve, and often hindered, results. The foundation model we incorporated serves to alleviate data batch effects and imbalance; concurrently, we are building our own dataset. We will also correct noted writing ambiguities in the final manuscript, without detailing those changes here for brevity.
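The pairing of each nucleus with a 256x256 patch can be illustrated with a small helper that computes the crop window around a nucleus centroid. How nuclei near the slide border are handled is not stated; shifting the window inward so it stays inside the image is an assumption of this sketch, as are the function name and the pixel-coordinate convention.

```python
def patch_bounds(cx, cy, width, height, size=256):
    """Return (left, top, right, bottom) of a size x size window around (cx, cy).

    Coordinates are integer pixels with the origin at the top-left corner.
    The window is centered on the nucleus where possible and shifted
    inward at the borders (an assumed policy) so it never leaves the image.
    """
    if width < size or height < size:
        raise ValueError("image is smaller than the requested patch")
    half = size // 2
    left = min(max(cx - half, 0), width - size)
    top = min(max(cy - half, 0), height - size)
    return left, top, left + size, top + size
```

The returned tuple follows the (left, upper, right, lower) box convention used by common imaging libraries, so it can be passed to a crop routine directly.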

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The majority of reviewers vote for paper acceptance.

GE2Hist: Generating Histology Images from Single-cell Gene Expression via Cross-modal Generative Network

Author(s):