Abstract
Diffusion models, while trained for image generation, have emerged as powerful foundational feature extractors for downstream tasks. We find that off-the-shelf diffusion models, trained exclusively to generate natural RGB images, can identify semantically meaningful correspondences in medical images. Building on this observation, we propose to leverage diffusion model features as a similarity measure to guide deformable image registration networks. We show that common intensity-based similarity losses often fail in challenging scenarios, such as when certain anatomies are visible in one image but absent in another, leading to anatomically inaccurate alignments. In contrast, our method identifies true semantic correspondences, aligning meaningful structures while disregarding those not present across images. We demonstrate superior performance of our approach on two tasks: multimodal 2D registration (DXA to X-Ray) and monomodal 3D registration (brain-extracted to non-brain-extracted MRI). Code: https://github.com/uncbiag/dgir
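A plausible reading of this objective as an equation (a sketch with assumed notation: moving image I_m, fixed image I_f, deformation φ predicted by the registration network, frozen diffusion feature extractor F; the regularization term and weight λ are standard assumptions, not taken verbatim from the paper):

```latex
% Similarity is measured on diffusion features instead of raw intensities:
\mathcal{L}(\varphi) = 1 - \mathrm{LNCC}\big(F(I_m \circ \varphi),\, F(I_f)\big)
  + \lambda\, \mathrm{Reg}(\varphi)
```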
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4536_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/uncbiag/dgir
Link to the Dataset(s)
N/A
BibTeX
@InProceedings{TurNur_Guiding_MICCAI2025,
author = { Tursynbek, Nurislam and Greer, Hastings and Demir, Başar and Niethammer, Marc},
title = { { Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors tackle the problem of registering two images where structures are potentially missing in one of them, either because of poor tissue contrast in multi-modal scenarios (soft tissues in DXA->X-Ray) or because of different preprocessing (whole head vs. skull stripping). They show that conventional image similarity metrics (uni- or multi-modal) struggle to handle such cases. For this reason, they propose to use feature-based registration, where they guide a registration network with similarity metrics computed on features extracted from the input images by a diffusion model. Interestingly, the authors show that the diffusion model doesn’t have to be trained for this specific task; rather, they use a frozen existing model trained on 2D natural images.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well motivated and the objectives/contributions are clearly stated.
- Comprehensive lit review about image similarity metrics and diffusion models for registration.
- Here the novelty doesn’t come from the technical side, but from the very interesting idea of using off-the-shelf diffusion models trained on natural images, which are found to be robust to multi-modal settings and disappearing regions.
- Comprehensive ablation studies.
- The maths are sound and the paper is well written, which makes it easily reproducible.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major concerns:
- My main concern is that the authors have ignored the whole field of feature-based registration, even though this is the very paradigm of this work. This field is an active area of research, especially with representation learning, and should be 1) reviewed in the introduction, and 2) evaluated in the experiments.
- The authors cannot claim to outperform the baselines without running statistical tests (the mean is not enough).
- I am missing a figure showing the features extracted from the diffusion model.
- This is more of a suggestion, but wouldn’t it be better/feasible to aggregate features extracted at different values of n/t to include features with finer/coarser details? Maybe something to add to the discussion?
Minor concerns:
- It’d be good to introduce the dimensions of x, alpha, beta, epsilon in section 3, to better clarify the nature of these objects for the readers unfamiliar with diffusion.
- In 3.1: “where hn is output of block n of the diffusion model”. The n is not properly introduced, it’s only in 3.2 that we understand these are resolution blocks in the diffusion UNet.
- In Fig 5b, shouldn’t the x-axis read “time steps” instead of “block number”?
- In 4.2: “diffsuion” -> “diffusion”
- Isn’t there a problem with the last column of Table 1? (it has zeros everywhere)
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The paper is well-written and gives all details necessary to reproduce its results. Moreover it uses publicly available datasets and models. So overall, the reproducibility seems very good.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is a well-written work, with clear motivation, novel idea, comprehensive evaluation, and interesting findings. I would gladly recommend strong accept, but I’m very troubled by the forgetting of representation learning in the introduction/baselines. I still recommend weak accept since this paper is likely to foster discussion at the conference.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This work proposes to use features from diffusion models trained on natural RGB images for similarity loss (LNCC) computation, and performs 2D knee (DXA to X-Ray) and 3D brain MRI (skull-stripped to non-skull-stripped) registration.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors leverage pretrained diffusion models on natural RGB images for medical image registration, which is a novel and interesting idea.
- Extensive experiments are conducted on two tasks, with informative visualizations.
- The method demonstrates robustness in handling missing anatomical structures, such as the skull.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The resolution of Diffusion features in Fig.1 is much lower than other methods. It might not be a fair comparison.
- Missing details regarding the feature extraction. a) How is a grayscale image input to diffusion models trained on 3-channel RGB natural images? Does it undergo the same normalization as the RGB images? b) The U-Net has 37 blocks and experiments were done using different blocks. But how is the resolution of the features handled? c) “We randomly select N coronal, or sagittal, or axial slices.” What is N? How is full 3D registration achieved from these sampled 2D slices?
- Does 3D feature extraction significantly increase runtime, given that multiple forward passes through the diffusion model are required?
- The 3D brain registration evaluation uses only 4 region labels. Why were only these selected?
- It is very interesting to see that the diffusion features are robust in the with/without-skull registration setting. However, it might not be relevant to the real application (skull stripping can be done with FreeSurfer/SynthStrip). Besides, SynthMorph can also handle the without-skull case.
- Missing reasoning behind the registration improvement from adding noise. Why would adding noise be better than no noise at all? Are there more insights in this observation? Minor: Is the x-axis label in Fig. 5b correct?
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Clarify experimental details. For example, while diffusion features are extracted from different blocks, do AE, VGG, and DINO baselines also use different layers, or a fixed one?
- Consider adding AE, VGG, and DINO feature-similarity heatmaps to Fig. 1; this might be more relevant since they are likewise models pretrained on natural images, whereas MSE, LNCC, NGF, etc. are similarity computations without feature extraction.
- A more in-depth analysis of why pretrained diffusion features outperform other models pretrained on natural images, e.g., DINO, would be interesting.
- Relevant literature: Song X, Xu X, Yan P. DINO-Reg: General Purpose Image Encoder for Training-Free Multi-modal Deformable Medical Image Registration. Kögl F, Reithmeir A, Sideri-Lampretsa V, Machado I, Braren R, Rueckert D, Schnabel JA, Zimmer VA. General Vision Encoder Features as Guidance in Medical Image Registration.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is an interesting work that leverages models pretrained on natural images to perform medical image registration. However, it lacks key technical details and in-depth analysis explaining why diffusion features are effective, which limits the clarity of the method.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper describes a registration method that utilizes features from off-the-shelf diffusion models pre-trained on ImageNet. The authors replace the fixed and moving images with their corresponding diffusion features to compute the LNCC similarity loss. The proposed LNCC+Diffusion model outperforms or matches state-of-the-art (SOTA) methods in both 2D and 3D registration.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
(1) The ability of diffusion features to capture correspondences in medical images is clearly demonstrated in Figures 1 and 2. (2) The choice of parameters for extracting diffusion features (i.e., time step and block) is well explored in the ablation study. (3) The paper provides comprehensive quantitative results and visualizations comparing registration performance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
(1) [Writing] Please clarify the statement in Section 3.2: “i.e., for a perfect image alignment these features would by construction be identical.”
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The problem formulation is well-defined and supported by both quantitative results and image/feature visualizations. A comprehensive ablation study is included, and the model demonstrates competitive performance in both 2D and 3D medical image registration.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We are extremely grateful to the reviewers for providing detailed feedback and for provisionally accepting our paper! Thank you for the encouraging words about the novelty and innovation of our approach!
Reviewer 1 concerns:
1) In our paper, we included the closest unsupervised deep feature learning approach [6] and compared against it. Other feature learning approaches usually require annotations (segmentations or ground-truth deformation fields) or modality-specific training, or are non-differentiable and thus unsuitable for deep learning. We instead use a general feature extractor that is differentiable, fully unsupervised with respect to ground-truth labels, and, moreover, saw no medical images during training.
2) We followed the same test set and evaluation metrics as [7].
3) Due to the space limit, we did not include diffusion feature visualizations, but some can be seen in the method overview in Figure 3.
4) Thanks for your suggestions. We will try to incorporate them into the final version.
Reviewer 2 concerns:
1) We appreciate the reviewer’s attention to detail regarding the resolution differences in Figure 1. We respectfully disagree that the apparent lower resolution of the diffusion features affects the validity and fairness of our comparison: the figure merely illustrates where the losses look. Moreover, we upsample the diffusion features to the original image resolution for illustration purposes.
2) Details regarding feature extraction:
a) Yes, we repeat the grayscale image 3 times to make it 3-channel and normalize it to [-1, 1], as done during diffusion training.
b) We simply apply the 1 - LNCC loss to the outputs of the layers, no matter what their resolution is. The loss encourages the diffusion features of the warped moving image and the original fixed image to be as similar as possible.
c) N is simply the number of slices picked at each iteration; we use N = 4 due to memory constraints. It should be mentioned that for 3D registration the inputs are 3D moving and fixed volumes (HxWxD) and the output is also a 3D deformation field (HxWxD), making the warped image 3D (HxWxD) as well. From the warped and fixed images we pick the same N slices and align them, backpropagating gradients through only these slices. Since we pick N random slices not just from one axis but from all 3 axes (axial, coronal, or sagittal), the whole volume is covered over the course of training (a minimal sketch of this procedure follows below).
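For concreteness, here is a minimal PyTorch sketch of the preprocessing and loss described in a)-c). It is an illustration under stated assumptions, not our exact implementation (that lives at https://github.com/uncbiag/dgir): the `feat` callable standing in for the frozen diffusion UNet feature extractor, the window size, and all helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def to_diffusion_input(x_gray):
    # Grayscale (B,1,H,W) in [0,1] -> 3-channel input in [-1,1],
    # matching the normalization used when the diffusion model was trained.
    return 2.0 * x_gray.repeat(1, 3, 1, 1) - 1.0

def lncc(a, b, win=9, eps=1e-5):
    # Local normalized cross-correlation between feature maps of identical
    # shape (B,C,h,w); applied directly at whatever resolution the chosen
    # UNet block emits.
    c = a.shape[1]
    k = torch.ones(c, 1, win, win, device=a.device) / win ** 2
    conv = lambda z: F.conv2d(z, k, padding=win // 2, groups=c)
    mu_a, mu_b = conv(a), conv(b)
    var_a = conv(a * a) - mu_a ** 2
    var_b = conv(b * b) - mu_b ** 2
    cov = conv(a * b) - mu_a * mu_b
    return (cov ** 2 / (var_a * var_b + eps)).mean()

def slice_similarity_loss(feat, warped, fixed, N=4):
    # `warped`/`fixed`: (B,1,D,H,W) volumes; `feat`: hypothetical frozen 2D
    # diffusion feature extractor mapping a 3-channel image to a (B,C,h,w)
    # feature map. Sample N slices along a randomly chosen axis and
    # backpropagate 1 - LNCC through those slices only.
    axis = int(torch.randint(2, 5, (1,)))  # D, H, or W: axial/coronal/sagittal
    loss = 0.0
    for i in torch.randint(warped.shape[axis], (N,)).tolist():
        w = to_diffusion_input(warped.select(axis, i))
        f = to_diffusion_input(fixed.select(axis, i))
        loss = loss + (1.0 - lncc(feat(w), feat(f)))
    return loss / N
```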
3) Yes, 3D registration needs much more memory than 2D registration and more time (iterations) to train. 4) Neurite-OASIS has both a 4-label tissue-type segmentation and a 35-label region segmentation. We wanted to show that pixel-based registration mostly changes the outer tissues (unnecessarily stretching the cortex and grey matter toward the neck and skull region). Because the 35-label segmentation mostly focuses on inner regions, we evaluated on the 4-label segmentation.
5) This is a proof-of-concept work showing that, in the case of missing anatomies, pixel-level registration losses focus on sharp boundaries instead of semantically meaningful correspondences. Moreover, there are still cases (for example emergencies, stroke/trauma) where rapid registration without preprocessing delays can be critical. Our approach aligns with clinical needs where time-consuming skull stripping may not be feasible.
6) It is possible that by adding small-to-medium noise, the diffusion model ignores details and focuses more on semantic features useful for registration; a similar observation was made in [35] (see the standard noising equation below).
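For context, this is the standard DDPM forward (noising) process, a textbook fact rather than anything specific to our paper, with x, \bar\alpha, and \epsilon as introduced in Section 3:

```latex
% Larger time steps t inject more noise, suppressing fine intensity detail
% so the UNet must rely on coarser, more semantic structure.
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```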
7) Thanks for relevant literature and suggestions to improve the work. We will try to incorporate them in the final version.
Reviewer 3 concerns:
The statement in Section 3.2, “i.e., for a perfect image alignment these features would by construction be identical,” means that if two images can indeed be perfectly aligned, then computing the features of the aligned image (instead of warping the features) would, by construction, yield features identical to those of the fixed image.
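Restated as an equation (a sketch with assumed notation: moving image I_m, fixed image I_f, deformation φ, feature extractor F):

```latex
% Perfect alignment implies identical features, by construction:
I_m \circ \varphi = I_f \;\Longrightarrow\; F(I_m \circ \varphi) = F(I_f)
```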
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A