Abstract

Existing medical image registration algorithms rely on either dataset-specific training or local texture-based features to align images. The former cannot be reliably implemented without large modality-specific training datasets, while the latter lacks global semantics and thus can easily be trapped in local minima. In this paper, we present DINO-Reg, a training-free deformable medical image registration method that leverages a general-purpose image encoder for feature extraction. Although the DINOv2 encoder was trained on natural images, its ability to capture semantic information generalizes even to unseen domains. With such semantically rich features, our method achieves accurate coarse-to-fine registration through simple feature pairing and conventional gradient descent optimization. We conducted a series of experiments to understand the behavior and role of such a general-purpose image encoder in the application of image registration. Our method shows state-of-the-art performance on multiple registration datasets. To our knowledge, this is the first application of general vision foundation models in medical image registration.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2230_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

https://learn2reg.grand-challenge.org/Datasets/

BibTex

@InProceedings{Son_DINOReg_MICCAI2024,
        author = { Song, Xinrui and Xu, Xuanang and Yan, Pingkun},
        title = { { DINO-Reg: General Purpose Image Encoder for Training-free Multi-modal Deformable Medical Image Registration } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15002},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    Demonstrates that a general purpose encoder (trained on ImageNet) can be used to extract features from medical images which enable rigid and non-rigid image registration both within and across modalities. A nice thing is that the approach does not require any training or fine tuning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Shows that a pretrained encoder can produce features from medical images which enable efficient and effective deformable registration across modalities. Set of experiments demonstrate that the approach outperforms several competitors on public datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The necessity to threshold the original image to obtain the foreground patches is understandable and justified in the text, but seems like a bit of a fudge. Is there not a more elegant way of doing it? Having to downsample the images to fit in memory is understandable but does potentially limit the accuracy, as the authors acknowledge.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minor points: Abstract: “leveraging the general purpose” -> “leveraging a general purpose”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A relatively straightforward approach is shown to give very encouraging results. Likely to inspire further work.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper describes the adoption of vision-pretrained feature extractors, together with some statistical mapping and hand-crafted masking operations, for highly accurate and robust thorax-abdominal medical image registration. It positively stands out as a work that is successfully applied to real-world datasets which address clinical needs and have not yet been “solved”.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Very high performance on challenging real-world datasets, with e.g. huge improvements over basic baselines (VoxelMorph) but also over more recent SotA work (ConvexAdam, FourierNet, etc.)
    • Meaningful ablation study and use of two complementary multimodal datasets
    • Moderately incremental but nevertheless impactful methodological contributions regarding the masking of foreground for embeddings, PCA-based feature alignment and upsampling

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • To my understanding, only axial slices at an interval of 3 are used for the final input to coarse-scale and ADAM instance optimisation; this could limit the accuracy for smaller anatomical details (see suggestions below)
    • SAME embeddings are conceptually similar and code for this is publicly available: could they be incorporated into a comparison/discussion?
    • The PCA-based feature alignment across modalities is not very clearly described
    • The upsampling method seems rather inefficient and might be replaced by dilated convolutions (see below)

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper uses public benchmarks and a publicly pretrained feature embedding. The other steps are described with sufficient detail to enable a reproduction.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    One could move the DinoV2 description into the related/prior work section, as it is not done by the authors, and maybe shorten it slightly in favour of a more detailed description of the PCA feature alignment. In principle the eigenvectors of two separate PCAs could be aligned (by some chance), but there is usually a risk of flips in sign and ordering/permutation, which will have a negative impact (and might also explain the need for LNCC as metric). Maybe some alignment of their eigen-spectra (see Mateus) could help. Ref: Mateus et al., Articulated shape matching using Laplacian eigenfunctions and unsupervised point registration, CVPR, https://ieeexplore.ieee.org/abstract/document/4587538

    It is unclear whether eigenvectors of a (sparse) PCA are computed and then used to project all inputs to the reduced embedding, or whether the output of the PCA is used directly. The former should certainly be faster and might alleviate the need for slice interpolation.

    A more detailed discussion of the complementary strengths and weaknesses of the proposed use of DinoV2 vs. handcrafted features (MIND) and/or SAME would be beneficial. It is briefly mentioned (almost in passing) that the test results on OncoReg were obtained through a combination of MIND+DinoV2; hence a further comment on whether this also improves results for Abdomen MR-CT would be appreciated.

    While the relatively poor score of VoxelMorph is not too unexpected (many prior publications have shown that it does not work well beyond inter-subject brain alignment), it could potentially be boosted by doing at least a cross-validation (e.g. leave-2-out) including a subset of paired MR-CT validation cases during training. Otherwise the domain gap from unpaired MR-CT could be too large.

    Regarding my comment on whether or not upsampling will always be necessary: many vision models can replace some strides (which lead to the undesired downsampling) with corresponding dilations in all subsequent layers without retraining the CNN; see e.g. https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py#L186 This could potentially help to reduce the required upsampling while maintaining a higher-resolution output.
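    The stride-to-dilation substitution mentioned above can be illustrated with a minimal NumPy sketch (toy 1D kernels of my own choosing, not the paper's or torchvision's code): removing the stride from one layer and dilating the following layer yields a full-resolution output whose subsampled entries exactly match the original strided pipeline, which is why no retraining is needed.

    ```python
    import numpy as np

    def conv1d(x, k, stride=1, dilation=1):
        """Valid-mode 1D cross-correlation with stride and dilation."""
        span = (len(k) - 1) * dilation + 1
        return np.array([np.dot(x[i:i + span:dilation], k)
                         for i in range(0, len(x) - span + 1, stride)])

    x = np.arange(16.0)             # toy input signal
    k1 = np.array([1.0, 2.0, 1.0])  # first-layer kernel
    k2 = np.array([1.0, -1.0, 0.5]) # second-layer kernel

    # Original network: layer 1 has stride 2 (realised here as subsampling),
    # then layer 2 runs on the downsampled signal.
    y = conv1d(x, k1)                    # layer 1 at full resolution
    strided_path = conv1d(y[::2], k2)    # stride 2, then layer 2

    # Modified network: stride removed, layer 2 dilated instead.
    dilated_path = conv1d(y, k2, dilation=2)

    # Every other sample of the dilated output equals the strided output.
    assert np.allclose(dilated_path[::2], strided_path)
    ```

    The dilated path keeps twice the output resolution while agreeing with the strided path at the shared positions, which is the property that would reduce the need for the paper's upsampling step.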

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well written and interesting paper that has a very strong validation with robust performance on clinically relevant and challenging tasks. While the method itself is not too complex, I weigh this not negatively because in this case simplicity seems to just work fine.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes to use a feature extractor model trained on natural images as a feature extractor for multimodal medical image registration. The model extracts features for each slice of the input volumes; these features are compressed using PCA and used for dense deformable registration.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper poses and answers a single research question, therefore providing a clear value to the reader.
    • The description of the approach is clear and it would be possible for the reader to reproduce this work
    • The experiments are quite detailed including 2 datasets, different competing methods and an ablation study
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Main weakness: It is not clear how much of the registration accuracy is due to the DINO features and how much to the chosen registration algorithm. A comparison using MIND features with the proposed registration algorithm and using DINO features with ConvexAdam could answer this question. It should be easy to do since both use SSD and ConvexAdam is open source.

    Other weaknesses:

    • It is unclear in which sense a PCA transform applied on the concatenated fixed and moving features should align them into the same space.
    • It is unclear how, in Section 2.3, a rigid registration is computed from matching slices indices. Is this only a translation across the slice dimension?
    • A description of the limitations of the current approach (examples: is it expected to work for any modality pair? are we losing some information by encoding the input slice by slice?) and of future work directions is missing
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The description of the approach is clear and it would be easily possible for the reader to reproduce this work. Nonetheless availability of source code would be better.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper addresses a single research question offering valuable insights to the reader. The approach is well-described, facilitating potential replication.

    However, a notable ambiguity exists regarding the attribution of registration accuracy to the DINO features versus the chosen registration algorithm. Furthermore, certain sections could benefit from further elucidation, and the limitations of the current approach should be acknowledged.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper poses and answers a single research question of general interest therefore providing a clear value to the reader.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We are thrilled to receive such detailed and high-quality feedback! Your comments and suggestions shed light on our path forward.

PCA clarification Both Reviewer #1 and #4 had questions on how PCA was performed in this work, and we hope to provide further clarification on this matter. Suppose the two images, Ref and Mov, are each encoded into a feature map of size [H,W,D,C]. We flatten each feature map into [HWD,C] and concatenate the two to obtain a [2HWD,C] matrix. PCA is performed to reduce the column size of this matrix so that the result is [2HWD,C'], with C' < C. The resulting matrix is then split back into two feature maps of size [H,W,D,C'].
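The joint-PCA step described above can be sketched in a few lines of NumPy (the function name `joint_pca` and the SVD-based PCA are illustrative choices, not the authors' released code):

```python
import numpy as np

def joint_pca(feat_ref, feat_mov, c_out):
    """Project two [H, W, D, C] feature maps into a shared c_out-dim space
    by fitting a single PCA on their concatenated voxel features."""
    H, W, D, C = feat_ref.shape
    stacked = np.concatenate([feat_ref.reshape(-1, C),
                              feat_mov.reshape(-1, C)], axis=0)  # [2HWD, C]
    mean = stacked.mean(axis=0, keepdims=True)
    # Principal axes from the SVD of the centered, concatenated matrix.
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    reduced = (stacked - mean) @ vt[:c_out].T                    # [2HWD, c_out]
    ref_out, mov_out = np.split(reduced, 2, axis=0)
    return ref_out.reshape(H, W, D, c_out), mov_out.reshape(H, W, D, c_out)
```

Because a single set of principal axes is fit on both images at once, the two reduced feature maps live in the same C'-dimensional space, avoiding the sign/permutation ambiguity that two separate PCAs would introduce.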

Feature vs. registration algorithm Reviewer #4 asked about the performance of recombining the proposed registration algorithm, ConvexAdam, DINOv2 features, and MIND features. Since the bulk of the proposed registration algorithm is plain gradient descent, we find it to perform similarly to ConvexAdam, which adds convex global optimization before the gradient descent step. For the abdomen MR-CT dataset specifically, rigid alignment in the axial direction might be more suitable.

Foreground patches We agree with Reviewer #3 that using thresholding to obtain foreground patches is not the most elegant implementation. We are actively researching better solutions. When left unmasked, the first principal component of the patch features should differentiate foreground/background. However, defining a threshold to automate the process is not as straightforward.
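The idea mentioned above, that the first principal component of unmasked patch features separates foreground from background, can be sketched as follows (the function `foreground_mask`, the fixed threshold, and the synthetic feature layout are all illustrative assumptions; as noted, choosing the threshold automatically is the hard part, and the sign of the principal axis is arbitrary):

```python
import numpy as np

def foreground_mask(patch_feats, thresh=0.0):
    """Split N patches into two groups by thresholding their projection onto
    the first principal component of the patch features ([N, C] array)."""
    centered = patch_feats - patch_feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = centered @ vt[0]   # score along the first principal axis
    return pc1 > thresh      # boolean mask; axis sign (hence polarity) is arbitrary
```

On features where one group is clearly offset from the other, this cleanly separates the two clusters, but which side of the threshold corresponds to "foreground" still has to be resolved, e.g. by the intensity-thresholding heuristic the paper currently uses.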

Again, we really appreciate the reviewers’ constructive feedback. Hope to see everyone at the conference!




Meta-Review

Meta-review not available; early accepted paper.


