Abstract

Conventional medical image registration approaches directly optimize over the parameters of a transformation model. These approaches have been highly successful and are used generically to register different anatomical regions. Recent deep registration networks are incredibly fast and accurate but are only trained for specific tasks; hence, they are no longer generic registration approaches. We therefore propose uniGradICON, a first step toward a foundation model for registration that provides 1) strong performance across multiple datasets, which is not feasible for current learning-based registration methods; 2) zero-shot capabilities for new registration tasks whose acquisitions, anatomical regions, and modalities differ from the training data; and 3) a strong initialization for finetuning on out-of-distribution registration tasks. UniGradICON unifies the speed and accuracy benefits of learning-based registration algorithms with the generic applicability of conventional non-deep-learning approaches. We extensively trained and evaluated uniGradICON on twelve different public datasets. Our code and weights are available at https://github.com/uncbiag/uniGradICON.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0527_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0527_supp.pdf

Link to the Code Repository

https://github.com/uncbiag/uniGradICON

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Tia_uniGradICON_MICCAI2024,
        author = { Tian, Lin and Greer, Hastings and Kwitt, Roland and Vialard, François-Xavier and San José Estépar, Raúl and Bouix, Sylvain and Rushmore, Richard and Niethammer, Marc},
        title = { { uniGradICON: A Foundation Model for Medical Image Registration } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15002},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors trained a foundation model for medical image registration on multiple datasets. This paper combines the speed and accuracy advantages of learning-based registration algorithms with the generic applicability of traditional methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is well-written, and the motivation is clear.
    2. This paper has done comparative experiments on many datasets, and the results look good.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. From a technical perspective, the novelty of this paper is limited. This paper just adopts an existing architecture, and then trains a general model on a large amount of data.
    2. Some comparisons of results may not be particularly convincing. As far as I know, the validation results of the top-5 teams in the Learn2Reg challenge can all be found on the website [1], but the authors do not include them in Tables 3 & 4.
    3. A statistical analysis is necessary to demonstrate the effectiveness of the method.

    [1] https://learn2reg.grand-challenge.org/evaluation/task-3-validation/leaderboard/

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please address the issues mentioned in “Weaknesses of the paper”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the method seems to work, there are some major and minor issues with this paper that need to be further polished.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    My major concerns have been well addressed, and I recommend accepting this paper. After this paper is accepted, I recommend the authors submit the TEST results to the L2R organizers to achieve a fair comparison. In many L2R tasks, the performance on TEST data can be much lower than that on VAL data.



Review #2

  • Please describe the contribution of the paper

    This work presents a foundation model for medical image registration on multiple datasets. The trained model can be applied to out-of-distribution images from different sources, anatomical regions and image modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of extending existing learning-based registration to a universal model is novel and more relevant to real-world applications.
    2. The paper conducted thorough experiments on how the trained foundation registration model adapts to in- and out-of-distribution tasks, especially on three different types of OoD tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The first main weakness is the lack of detail in the presented results: the tables only present mean values without standard deviations, and no analysis of failure cases is presented.
    2. The second weakness is that no failure detection mechanism was presented in the paper, which is vital for the proposed foundation model to be deployed in reality.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors could include more details of the results. For example, box or violin plot can be used to present the distribution of registration accuracy. Also, some examples of failure cases would be useful to evaluate the proposed foundation model.
    2. I also recommend the authors to consider some failure detection strategies, which could be useful for OoD scenarios.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is novel and well-written, and is relevant to real-world applications. However, some flaws in the experiments and results may prevent it from actual deployment.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    While many deep learning-based image registration models aim to solve task-specific problems, this paper aims to create a generic image registration model suitable for multiple tasks. The idea relies on a conclusion from a reference work in which a task-independent model, in terms of hyperparameter settings, was introduced.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written and easy to follow.
    • The author uses conclusions made in reference work [32] to investigate the benefits of the invention (the benefits of using the same training procedure).
    • The author presents a novel approach to train the model on unevenly distributed data.
    • The author presents a rich and well-organized experiment by dividing the tasks into several different categories.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • In the experiment, the author uses instance optimization which improves the results drastically. The procedure for instance optimization is missing, and I guess that it follows the procedure in [32].

    Some thoughts regarding the comparison and the presentation of the results are listed below:

    In-distribution evaluation:

    • The author claims that the model achieves excellent registration accuracy by comparing the results against two other models (SyN and VoxelMorph-SVF). However, the comparison models are trained (or solved) using other similarity metrics (MI and MSE). The reason why the similarity metric differs is not stated. For the reader, it would be interesting to see a comparison using the same similarity metrics as well.
    • In Table 1 the author presents results from the L2R-Abdomen dataset and compares them against other task-specific models. However, in Table 4 (out-of-distribution evaluation), using the same dataset(?), the author includes results from Learn2Reg which outperform all others. A comment on why this model is not included in Table 1 would be of interest.

    Fine-tuning:
    • The author shows that the accuracy improves using fine-tuning. The fine-tuning process is run for 4000 epochs. I consider it hard to say whether the accuracy of the model stems from the fine-tuned parameters, or whether a similar accuracy would be possible by re-training the entire model. To determine the value of the pre-training process, I would argue that a comparison with a retrained model, using the same hyperparameters, is needed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is easy to follow and the experiments are well-performed. However, I am not convinced about

    • When a foundation model is preferable to task-specific models. The performance of task-specific models is overall better.
    • Why this model performs better than other generic registration approaches. The models are trained differently (using different image similarities). The reason for that has to be stated.
    • The use as an initialization model and finetuning for out-of-distribution tasks. To state this, I request a comparison with a trained default model using the same training procedure.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper is well-written and easy to follow. The author presents the idea of a fundamental image registration method that may be beneficial and follows it up with extensive experiments. However, there is some concern regarding the conclusions, where missed or inconsistent evaluations are made as stated above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I agree with the author that a baseline model from which clinics can fine-tune their own models is beneficial. However, this paper does not convince me that the model should be generic for all kinds of image registration tasks. If the aim is to solve the problem with a limited amount of data, I believe that federated learning and privacy protection are research fields one may consider, and this should be stated more clearly if that is the case.

    In the fine-tuning experiment, I disagree with the author since a comparison with a training-from-scratch model would only add one extra row to Table 5 and show the true benefit of using their approach.

    However, generic DL-based models have shown impressive results lately. I encourage the author to continue the investigation, perhaps by taking inspiration from work in LLMs and collecting a great deal of data.




Author Feedback

We appreciate the reviewers’ comments and address the main concerns below.

MOTIVATION, NOVELTY, and IMPACT (R3, R5): We thank R4 for recognizing that uniGradICON is the first universal registration model that will lead to real-world impact. The practical ramifications of a universal registration network are profound. Before developing uniGradICON, we often had this conversation with potential collaborators: “Can your deep learning approach help us register our novel dataset? Sure, how many images do you have so far? 15.” In these cases, slower and potentially less accurate conventional, optimization-based registration approaches were still required. Now, we can get the best of both worlds: fast and accurate registrations (especially when combined with instance optimization) while largely preserving the generality of conventional registration approaches. UniGradICON has already generated substantial interest from industry and research collaborators, demonstrating its practical impact.
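
The instance optimization referenced above follows [32]; since reviewers asked about the procedure, a generic sketch of test-time instance optimization is given below. The function name, step count, learning rate, and loss signature are illustrative assumptions, not the authors' exact settings.

```python
import copy
import torch

def instance_optimize(model, image_A, image_B, loss_fn, steps=50, lr=2e-5):
    """Refine a copy of a pretrained registration network on a single image pair.

    model    -- pretrained network mapping (image_A, image_B) to a transform phi_AB
    loss_fn  -- the same objective used during training (e.g., similarity + regularizer)
    Returns the transform refined for this specific pair; the foundation weights
    themselves are left untouched.
    """
    model = copy.deepcopy(model)            # do not overwrite the foundation model
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model, image_A, image_B)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return model(image_A, image_B)
```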

EXPERIMENTS
Statistical analysis (R3, R4): If the policy allows, we will add standard deviations (std). Learn2Reg (L2R) [15] did not report std, only boxplots, so we cannot provide std for their methods, which is why we did not report the std in the first place. Due to space limitations and our extensive experiments, we cannot include boxplots and qualitative failure results.

In-dist. - Inconsistent evaluations (R4): We observed gradient explosion when training VoxelMorph and LapIRN with the same LNCC similarity measure, which generally happens when using an unsuitable regularizer weight (lambda). Finding the ONE lambda that works for the composite dataset is challenging for VM and LapIRN because the optimal lambda for each dataset is different. Despite testing several similarity losses and lambda values from the official repo, we could only train VM+MSE with default parameters, yielding moderate performance (converging on COPDGene, OAI, and Abdomen but diverging on HCP in Tab. 1). This training difficulty is discussed in Sec. 3.1 and Appendix Table 7. UniGradICON does not suffer from this issue thanks to gradient inverse consistency (Sec. 5.3 of [32]), which leads to better performance. We use the official settings for VM+MSE and (uni)GradICON. To keep the experiment consistent, we use MI for SyN as it is the official default setting.
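
For context on the gradient inverse consistency mentioned here (Sec. 5.3 of [32]): the regularizer penalizes how far the Jacobian of the composed forward/backward maps deviates from the identity. Below is a minimal PyTorch sketch, assuming the maps are represented as normalized sampling grids; the function names and the simple finite-difference approximation are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def identity_grid(shape, device="cpu"):
    """Identity sampling grid in normalized [-1, 1] coordinates, shape (1, D, H, W, 3)."""
    axes = [torch.linspace(-1, 1, s, device=device) for s in shape]
    mesh = torch.meshgrid(*axes, indexing="ij")           # z, y, x order
    return torch.stack(mesh[::-1], dim=-1).unsqueeze(0)   # grid_sample expects (x, y, z)

def compose(phi_outer, phi_inner):
    """Evaluate phi_outer at the locations given by phi_inner, i.e., phi_outer o phi_inner."""
    field = phi_outer.permute(0, 4, 1, 2, 3)              # treat the map as a 3-channel volume
    warped = F.grid_sample(field, phi_inner, mode="bilinear", align_corners=True)
    return warped.permute(0, 2, 3, 4, 1)

def gradient_icon_loss(phi_AB, phi_BA, identity):
    """|| grad(phi_AB o phi_BA) - Id ||^2 via forward finite differences (up to voxel spacing)."""
    residual = compose(phi_AB, phi_BA) - identity         # zero if the maps are exact inverses
    loss = 0.0
    for dim in (1, 2, 3):                                  # spatial axes of (N, D, H, W, 3)
        loss = loss + (residual.diff(dim=dim) ** 2).mean()
    return loss

# Toy check on slightly perturbed identity maps (a real network predicts phi_AB and phi_BA).
identity = identity_grid((32, 32, 32))
phi_AB = identity + 0.01 * torch.randn_like(identity)
phi_BA = identity + 0.01 * torch.randn_like(identity)
print(gradient_icon_loss(phi_AB, phi_BA, identity))
```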

In-dist. - Missing evaluations in Tab. 1 (R5): Thanks for pointing this out. The top L2R task-specific models in Tab. 4 outperform others due to training with a segmentation-based DICE loss (see the discussion in [15], Sec. V-B). Among unsupervised methods, uniGradICON (52.2 on the VALIDATION set) outperforms the others (49-51 on the TEST set) reported in [15] on the Abdomen CT/CT task. A TEST/VALIDATION comparison may raise questions about fairness, which is why we did not add these results to Tab. 1.

Out-of-dist. - L2R validation leaderboard (R3): The L2R validation leaderboard collects self-reported performance without verification by the L2R organizers. It is unclear whether these methods were trained on the validation set or not. Thus, we only compare with peer-reviewed results [15] and results recognized by the challenge organizers (the test leaderboard posted by the organizers on the website for L2R-NLST).

FUTURE DIRECTIONS
Failure detection mechanism (R4) and extensive study of finetuning settings (R5): Introducing failure detection can enhance robustness and trustworthiness but is beyond the scope of the current work. Our finetuning experiment shows that uniGradICON can be further optimized for better performance. Due to space constraints, we could not extensively compare finetuning versus training from scratch under various settings, such as different dataset sizes and domain shifts. To emphasize, these future directions would not be valid problems without the existence of a registration foundation model, which, in turn, demonstrates the necessity and impact of uniGradICON.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers recommended weak acceptance after authors’ rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    All reviewers recommended weak acceptance after authors’ rebuttal.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


