Abstract

Multimodal large deformation image registration is a challenging task in medical imaging, primarily due to significant modality differences and large tissue deformations. Current methods typically employ dual-branch multiscale pyramid registration networks. However, the dual-branch structure fails to explicitly enforce that the model learns modality-invariant image registration features. Furthermore, in the multiscale registration process, only the deformation field is propagated, which restricts the model’s capacity to accommodate more complex deformations. To enhance the model’s ability to learn features from different modalities, we propose a modality representation disentanglement method, incorporating a Multi-layer Contrastive Loss (MCL) to enforce the learning of modality-invariant features. To address the challenge of complex large deformations, we introduce a Multi-Scale Feature fusion Registration module (MSFR), which integrates features and deformation fields from different scales during the registration process. To explore the registration potential of the trained model, we propose a recursive inference enhancement strategy that further improves registration performance. This model is referred to as RDMR. Based on experimental results from both private and public datasets, the RDMR model outperforms other SOTA models. Compared to the baseline registration model (VoxelMorph), the RDMR model achieved improvements of 1.4 and 4.5 percentage points in the DSC metric on the two datasets, respectively. Our code is publicly available at: https://github.com/ybby2020/RDMR

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2376_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ybby2020/RDMR

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HuYib_RDMR_MICCAI2025,
        author = { Hu, Yibo and Zhao, Ziqi and Zhang, Qi and Xu, Lisa X. and Sun, Jianqi},
        title = { { RDMR: Recursive Inference and Representation Disentanglement for Multimodal Large Deformation Registration } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        page = {519 -- 529}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a novel framework for multimodal deformable image registration, addressing the challenges of large anatomical deformations and inter-modality intensity discrepancies via three key innovations: (1) a Multi-Scale Feature Fusion Registration module (MSFR) that enables interaction across different feature scales, overcoming limitations of traditional multiscale deformation field propagation and improving the model’s ability to capture complex deformations; (2) a modality-invariant contrastive learning loss (MCLoss) that enforces consistency across modality-specific feature distributions at multiple encoder depths, thus promoting disentangled representation learning; and (3) a recursive inference enhancement mechanism (RDMR) that refines model performance during inference without requiring retraining.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Novel multiscale feature fusion strategy: The paper introduces a new multiscale feature interaction module (MSFR) that goes beyond conventional hierarchical deformation field propagation. By enabling direct fusion of multiscale features, the method captures complex deformations more effectively, a particularly important capability in multimodal settings where anatomical structures may exhibit non-linear distortions. This approach represents an innovation in the design of multiscale registration architectures. (2) Disentanglement via modality-invariant contrastive learning: The proposed multi-layer contrastive loss (MCLoss) is a novel contribution to multimodal registration. Unlike standard feature alignment methods, MCLoss enforces consistency across modalities at multiple encoder depths, encouraging the learning of modality-invariant and anatomically meaningful representations. This design improves generalizability and aligns well with recent trends in representation learning, but has not been widely explored in the registration domain, making it an original and timely addition.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    (1) Clarity and completeness of methodological description: (a) It is unclear whether the warped feature map F^{i}_{m} is warped using the deformation field 𝜙^{i} or 𝜙^{i+1}. The ambiguity around the indexing (i) and propagation across scales affects interpretability. (b) Although the multiscale registration pipeline is a central component of the method, the definition of the index i is never explicitly provided. Consequently, it is difficult to determine whether the multiscale behavior arises from hierarchical feature fusion or convolution-based upsampling. (c) Equation (1) introduces “CConv” without any definition; it is unclear whether this refers to a specific convolution type or is a typographical error. (d) The variable p in Eq. (4) is never defined, likely referring to pixels, but clarification is necessary. (2) Insufficient justification of parameter settings and recursive strategy: (a) The recursive inference enhancement (RDMR) is stated to rely on four hyperparameters optimized at inference time. However, this optimization procedure is vaguely described and could be seen as manual post-processing. Without transparency and fairness in hyperparameter tuning, the generalizability and reproducibility of the method may be compromised. (b) The choice of weights 𝜆1=10, 𝜆2=0 for inter-patient registration is not well justified. If 𝜆2=0, then the modality-invariant loss L_{MCL} is effectively removed, contradicting the claim that modality-invariant constraints are enforced. (3) Incomplete dataset and modality description: (a) The authors do not specify whether the task is performed in 2D or 3D. (b) The modalities used in the private dataset are not described, and the presence or absence of diseased tissue is not clarified. (c) For the AMOS dataset, which contains multiple organ annotations, only liver registration is evaluated. Results for smaller and more challenging structures like the kidneys and spleen would provide a more complete assessment of performance. (4) Evaluation limitations and inconsistencies: (a) In Figure 2, the authors suggest that their method produces larger deformations, but this alone does not imply better registration accuracy. Deformation quality should be evaluated using both Jaccard/DSC metrics and fold-checking methods to ensure mathematical stability. (b) Despite reporting high DSC standard deviations and small performance differences between methods, no statistical analysis (e.g., paired t-tests) is performed to determine whether improvements are significant. (c) Table 1 includes a “P(K)” metric, but it is never defined or explained in the text. (d) The Jacobian determinant is presented in Table 1 but is not discussed; its behavior, especially when RDMR outperforms other baselines (e.g., LKU, RDP), deserves explanation. (e) The DSC value reported for AMOS in Table 1 does not match the value in Table 2, suggesting inconsistencies in reporting. (f) It would be nice to have, as a reference, the initial (without registration) DSC and J values of alignment between images to fully appreciate the registration effect. (5) Ambiguity in training conditions: (a) It is unclear how the model is expected to enforce modality-invariant constraints if it is trained without the L_{MCL} loss term, as claimed in some settings. This undermines the narrative around disentanglement. (b) The model claims to generalize well, yet many design choices (e.g., recursive inference tuning, parameter selection, loss term usage) appear to require dataset-specific empirical adjustments.
    (6) Table 2: The choice of the 1121 configuration for the HB and HC datasets is not very clear, since, looking at the means and standard deviations for HB and HC, the 1121 and 1131 configurations are essentially equal; the same holds for AMOS regarding the 1211 choice versus the 1112 and 1311 configurations. The reason for the missing values in Table 2 (1311 for HB and HC, 1131 for AMOS) is not explained.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The paper introduces several promising components for multimodal deformable registration, particularly the disentanglement-based learning and the recursive inference enhancement strategy. However, a number of important issues limit the clarity and interpretability of the work in its current form. (1) I encourage the authors to clarify and standardize the use of indices (e.g., i, i+1) and deformation field references, particularly in Sections 2 and 3. Definitions of critical variables such as F_{m}^{i}, p, and “CConv” should be explicitly provided. Clearly stating whether the task is performed in 2D or 3D and describing the modality of the private dataset used would also help contextualize the results. (2) The recursive inference mechanism, while interesting, needs a more rigorous description. If performance is highly dependent on manual hyperparameter tuning at inference time, this raises concerns about fairness and generalization. Consider including a more formal description of how these hyperparameters are selected and whether the same strategy could be applied consistently across datasets. (3) I suggest revisiting the evaluation section to improve both completeness and reliability. Specifically: (a) For future work (not for rebuttal purposes), provide statistical significance testing to support performance differences between models. (b1) For future work (not for rebuttal purposes), report results for all organs available in datasets like AMOS, including smaller structures such as the kidneys and spleen, which are particularly challenging in registration tasks. (b2) For future work (not for rebuttal purposes), a discussion of how results might change depending on whether the organ to be registered is healthy or diseased would be interesting. (c) Supplement figures showing deformation fields with visual overlays of aligned masks and corresponding quantitative metrics (e.g., DSC, IoU) to enable more direct interpretation of registration quality. (d) Explain all reported metrics (e.g., P(K)) and justify any missing experimental configurations. (e) The modality-invariant component and the use of the L_{MCL} loss should be better connected to the experiments. If this loss is omitted in certain training regimes, it becomes unclear how the model maintains modality consistency. (f) Additionally, for future work (not for rebuttal purposes), some parameter choices appear empirical (e.g., 1121 vs. 1131), and further analysis of why certain settings perform better would strengthen the contribution.

    Overall, the paper presents an interesting direction, but it would greatly benefit from improved clarity in methodology, results description, and a more transparent connection between claims and performance evidence.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents several novel ideas, including a multiscale feature fusion module and a recursive inference strategy, the current version lacks sufficient clarity and rigor to fully support its contributions: (1) The paper relies on empirical and manually tuned hyperparameters during inference without detailing a principled or automated strategy. In particular, the choice to disable the modality-invariant loss in certain settings is not justified, undermining the theoretical consistency of the proposed disentanglement approach. (2) From an evaluation standpoint, the results are not sufficiently validated. Minor numerical improvements across configurations are reported without statistical testing, and visual examples lack quantitative overlays or fold-checking to ensure the stability of predicted deformations. Furthermore, inconsistencies across tables (e.g., AMOS DSC values), missing metric definitions (e.g., P(K)), and missing descriptions of metric results (e.g., the Jaccard index) reduce confidence in the reported results. (3) Finally, the dataset description is incomplete, with no information provided on modality type, imaging dimension (2D vs. 3D), or disease status. This limits the ability to interpret the clinical relevance or generalizability of the approach.

    Overall, while the ideas are promising, the lack of clarity in key areas and the insufficient experimental and statistical analysis prevent a full endorsement of the work in its current form.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After reviewing the authors’ rebuttal, I believe that the major concerns have been addressed, particularly regarding:

    • The methodological description of the warping logic and multi-scale registration process
    • Justification of loss parameter settings and recursive inference strategy
    • Clarification of dataset composition, modality usage, and training conditions
    • The rationale behind the configurations and missing entries in Table 2

    The explanations provided resolve the key ambiguities and should enable the authors to revise the final manuscript.

    That said, concerns remain regarding the relatively small improvements reported for the AMOS dataset in Table 1. Due to rebuttal constraints, statistical significance analysis could not be included, and this limits the strength of claims about performance gains.

    Despite this, the work introduces novel ideas for recursive registration of large organs and could contribute to the field with its extension in future work. I encourage the authors to include statistical analysis in the final version to support the robustness of their findings.



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel multimodal unsupervised registration framework. The model is named RDMR: recursive inference and representation disentanglement for multimodal large deformation registration. The main contribution is a Multi-Scale Feature Fusion Registration module (MSFR), which propagates the image features themselves across scales in addition to the deformation field. This allows finer control and more stable alignment. RDMR uses contrastive learning to extract modality-invariant features, which helps the model learn shared anatomical structure and enables accurate multimodal alignment, e.g., CT to MRI, without needing paired or labeled data. After training, RDMR iteratively refines the deformation field during inference using its own outputs as input, which boosts performance without retraining.
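
    Conceptually, this recursive refinement re-warps the moving image with the current field and feeds it back into the registration network, composing the resulting fields. The following is only a minimal sketch of that idea, not the authors' implementation; model, warp, and compose are assumed helper functions.

        import torch

        @torch.no_grad()
        def recursive_inference(model, moving, fixed, warp, compose, n_iters=2):
            """Generic recursive refinement at inference time. model(moving, fixed)
            returns a displacement field; warp and compose are assumed helpers for
            warping an image and composing two displacement fields."""
            total_flow = model(moving, fixed)
            for _ in range(n_iters - 1):
                warped = warp(moving, total_flow)           # re-warp with the accumulated field
                residual = model(warped, fixed)             # register the remaining misalignment
                total_flow = compose(total_flow, residual)  # accumulate the fields
            return total_flow, warp(moving, total_flow)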

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper introduces RDMR for multimodal registration under large anatomical deformation, which is a critical and underexplored challenge in medical image analysis. The proposed MSFR results in finer structural alignment and better preservation of anatomical details. In addition, by applying contrastive learning between CT and MRI image pairs from the same subject, RDMR learns shared structural features enabling accurate alignment without acquiring paired ground truth or labels. This method is validated on both private multi-institutional clinical datasets and public datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Although the method is designed for general multimodal large deformation registration, the experiments are limited to abdominal imaging, specifically abdominal CT-to-MRI registration. It is not clear how well the model generalizes to other anatomical regions or different modality combinations.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed model, RDMR, combines three key innovations: 1) MSFR, 2) contrastive learning, and 3) a recursive inference strategy. These contributions are well motivated and effectively integrated. The application demonstrates strong clinical relevance with clear potential for use.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposed a novel approach for multimodal deformable registration. Firstly, the paper proposed a novel multi-layer modality-invariant contrastive loss to enforce consistent features for the same subject across different modalities. Secondly, the paper introduces Multi-Scale Feature fusion Registration modules, which pass image features during multi-scale registration in addition to the deformation field described in the dual-stream pyramid registration network. Lastly, the paper introduces a recursive inference enhancement strategy during inference to further improve registration performance. The proposed architecture has shown SOTA performance on both an in-house and a public dataset (AMOS) in terms of segmentation Dice score.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The proposed multi-layer modality-invariant contrastive loss is an innovative way to encourage the model to learn cross-modality structural information, which is an interesting self-supervised approach to multimodal registration.
    2. The proposed multi-scale feature registration module allows the architecture to incorporate image features from the previous scale (represented as F_c^{i+1}), in addition to the previous deformation field described in prior work, to assist in generating the current deformation field, which is an innovative approach.
    3. The visualization of the method’s performance in Fig. 3 also shows promising results in correctly reproducing the structures of the fixed images with clear boundaries, which is very impressive.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. Although the paper proposes an innovative recursive inference method to further improve model performance, the best-performing configuration only repeats the MSFR module once at a single scale, and the differences between the recursion settings are not supported by statistical tests such as a paired Student’s t-test, making it hard to measure the effectiveness of the proposed recursive inference.
    2. For the DSC performance on the AMOS dataset, the performance of the proposed architecture is very similar to prior works such as RDP and PIVIT, which needs explanation. Is this because the training data are unpaired, so the model does not benefit from contrastive learning?
    3. AMOS contains multi-organ segmentation labels, but the paper does not mention which segmentations are used for the Dice score metric, which may need clarification.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper proposed several innovative modification to the existing dual-stream pyramid registration network to allow multi-modal image registration. The approach in this paper also shows SOTA registration performance on the in-house dataset, which proves the effectiveness of the approach.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ rebuttal clearly clarified the metrics used for the AMOS dataset and gave a detailed explanation of the recursive inference strategy. However, in Section 1 of the rebuttal, the authors mention starting MSFR at the smallest scale, which refers to the scale of the feature map, whereas the order of the MSFR deformation fields should be the reverse, from coarse-scale to fine-scale deformation, based on the authors’ description. Such terminology should be clarified if the paper is accepted.




Author Feedback

We appreciate the reviewers’ recognition of our work’s novelty and their insightful suggestions. We will revise the manuscript following the reviewers’ suggestions. Our responses are organized as follows:

  1. Clarification of the Method (Reviewer 3): The multi-scale deformable registration begins at the smallest scale, and the registration features and deformation field are propagated to the next scale for further refinement. The index i denotes the current scale, while i+1 refers to the previous (coarser) scale. At every scale, the moving image F_m is first warped by the deformation field 𝜙^{i+1} obtained from the previous scale. Eq. 1 describes the processing within the MSFR, where ‘CConv’ denotes a two-layer convolution operation (Fig. 2(b)). In Eq. 4, the variable P represents the deformation field and is used to define the regularization loss that encourages smoother deformations.
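
    For illustration, a minimal 2D PyTorch-style sketch of one MSFR refinement step is given below. This is not our exact implementation: the helper names, the two-layer realization of CConv, the 2D setting, and the channel sizes are simplifying assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def warp(feat, flow):
            """Warp a feature map with a dense 2D displacement field (B, 2, H, W).
            Displacements are assumed to be in normalized [-1, 1] grid coordinates."""
            B, _, H, W = feat.shape
            ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=feat.device),
                                    torch.linspace(-1, 1, W, device=feat.device),
                                    indexing="ij")
            grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
            return F.grid_sample(feat, grid + flow.permute(0, 2, 3, 1), align_corners=True)

        class MSFRStep(nn.Module):
            """One coarse-to-fine refinement step: warp moving features with the
            previous-scale field, fuse them with fixed features and the propagated
            registration features, then predict a residual field."""
            def __init__(self, ch):
                super().__init__()
                # 'CConv' is taken here to be a two-layer convolution block.
                self.cconv = nn.Sequential(
                    nn.Conv2d(ch * 3, ch, 3, padding=1), nn.LeakyReLU(0.2),
                    nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2))
                self.flow_head = nn.Conv2d(ch, 2, 3, padding=1)

            def forward(self, feat_m, feat_f, feat_prev, flow_prev):
                size = feat_m.shape[-2:]
                # Upsample the coarser field phi^{i+1} and registration features to scale i.
                flow_up = F.interpolate(flow_prev, size=size, mode="bilinear", align_corners=True)
                feat_up = F.interpolate(feat_prev, size=size, mode="bilinear", align_corners=True)
                warped_m = warp(feat_m, flow_up)                    # warp F_m with phi^{i+1}
                fused = self.cconv(torch.cat([warped_m, feat_f, feat_up], dim=1))
                flow = flow_up + self.flow_head(fused)              # residual refinement of the field
                return fused, flow                                  # both are propagated to the next scale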

  2. Dataset and Parameter Settings (Reviewers 1, 2 & 3): The private dataset comprises CT and T1w MR volumes from hepatocellular carcinoma patients. These data are 3D volumes. The clinical objective for multimodal registration is to align different modalities from the same patient. To address this challenge, we introduce the MCL loss, which assumes that volumes from the same patient share greater structural similarity across modalities. By enforcing a contrastive loss, the model learns structural features while suppressing modality-specific features. In the private dataset, volumes are paired, making them well suited for the MCL loss. Results show that our model achieves the best performance on the HB dataset. The HC dataset, which comprises independent data, also shows that our approach outperforms all competing models. These findings confirm that the MCL loss effectively guides the model to focus on modality-invariant features and reduces domain shift. Analyses of other modalities and anatomical structures will, due to space limitations, be presented in the journal article. Public datasets seldom include multimodal scans of the same patient. For the unpaired AMOS dataset, we relax the weighting of the MCL loss. Even under this condition, our proposed model maintains superior performance relative to the baselines. To ensure consistency with the task, liver labels from AMOS are used.
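
    For intuition, a minimal sketch of a multi-layer contrastive objective over paired CT/MR encoder features follows. The InfoNCE-style form, global pooling, and temperature are simplifications for illustration, not our exact MCL formulation.

        import torch
        import torch.nn.functional as F

        def multilayer_contrastive_loss(feats_ct, feats_mr, temperature=0.1):
            """feats_ct / feats_mr: lists of encoder feature maps, one per depth,
            each of shape (B, C, ...), where index b in both lists belongs to the
            same patient. Same-patient cross-modality pairs are pulled together;
            other patients in the batch serve as negatives."""
            total = 0.0
            for f_ct, f_mr in zip(feats_ct, feats_mr):
                # Global-average-pool each feature map to a vector and L2-normalize.
                z_ct = F.normalize(f_ct.flatten(2).mean(-1), dim=1)    # (B, C)
                z_mr = F.normalize(f_mr.flatten(2).mean(-1), dim=1)    # (B, C)
                logits = z_ct @ z_mr.t() / temperature                 # (B, B) similarities
                labels = torch.arange(z_ct.size(0), device=logits.device)
                # Symmetric InfoNCE: CT -> MR and MR -> CT directions.
                total = total + 0.5 * (F.cross_entropy(logits, labels)
                                       + F.cross_entropy(logits.t(), labels))
            return total / len(feats_ct)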

  3. Recursive Inference (Reviewer 3): The four hyperparameters for recursive inference determine the number of recursions at each scale. We begin with a single recursion per scale to identify which scale yields the greatest improvement in registration quality. We then increment the recursion count at that scale until no further gains are observed. Unlike tuning learning rates or loss weights, which typically requires retraining, this strategy adjusts the recursions post-training, greatly enhancing usability. Dataset-specific tuning is necessary but easy to perform. In Tab. 2, each row corresponds to one dataset, and the choice of recursions follows two rules: (1) select the setting yielding the highest DSC; (2) if DSC values are equal, choose the configuration with the lower computational cost. For the HB and HC datasets, the 1121 recursion schedule produced the best improvement; increasing to 1131 yielded no additional benefit, so 1121 was selected. On the unpaired AMOS dataset, 1211 achieved the highest DSC, and testing 1311 showed no clear benefit, so 1211 was chosen. Blank entries in Tab. 2 indicate that the corresponding datasets did not require this test.
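
    The selection rule above can be summarized as a simple greedy search over per-scale recursion counts; the following is only a sketch, where evaluate_dsc is a hypothetical helper that runs inference with a given schedule and returns mean validation DSC.

        def greedy_recursion_schedule(evaluate_dsc, num_scales=4, max_per_scale=3):
            """Greedy post-training search for a recursion schedule such as [1, 1, 2, 1]
            (written 1121). evaluate_dsc(schedule) is a hypothetical helper that runs
            inference with the given per-scale recursion counts and returns mean DSC."""
            schedule = [1] * num_scales                  # start with one recursion per scale
            best_dsc = evaluate_dsc(schedule)
            improved = True
            while improved:
                improved = False
                for s in range(num_scales):
                    if schedule[s] >= max_per_scale:
                        continue
                    trial = list(schedule)
                    trial[s] += 1                        # try one extra recursion at scale s
                    dsc = evaluate_dsc(trial)
                    if dsc > best_dsc:                   # keep only if DSC strictly improves
                        schedule, best_dsc, improved = trial, dsc, True
            return schedule, best_dsc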

  4. Clarification of Results Evaluation (Reviewer 3): In Fig. 2, the model with recursive inference yields denser deformations; the corresponding quantitative metrics are presented in Tab. 1. The last two columns of Tab. 1 report inference time (T(s)) and the number of model parameters (P(K)). Since the proportion of negative Jacobian determinants remains below 0.1% for most models, indicating minimal folding, this metric was not analyzed further. The slight discrepancy in DSC decimal places between Tab. 1 and Tab. 2 arises from independent computations and rounding. We will correct this error in the text.
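
    For reference, the two quantities mentioned here, the Dice similarity coefficient and the fraction of voxels with a non-positive Jacobian determinant (a standard folding check), can be computed roughly as in the following generic sketch; this is not our evaluation code.

        import numpy as np

        def dice(mask_a, mask_b):
            """Dice similarity coefficient between two binary masks."""
            inter = np.logical_and(mask_a, mask_b).sum()
            return 2.0 * inter / (mask_a.sum() + mask_b.sum() + 1e-8)

        def negative_jacobian_fraction(disp):
            """Fraction of voxels where the deformation folds (det J <= 0).
            disp: displacement field of shape (3, D, H, W) in voxel units."""
            grads = [np.gradient(disp[c]) for c in range(3)]             # du_c / d(z, y, x)
            J = np.stack([np.stack(g, axis=0) for g in grads], axis=0)   # (3, 3, D, H, W)
            J = J + np.eye(3).reshape(3, 3, 1, 1, 1)                     # Jacobian of phi(x) = x + u(x)
            det = np.linalg.det(np.moveaxis(J, (0, 1), (-2, -1)))        # (D, H, W)
            return float((det <= 0).mean())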




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal addressed the reviewers’ concerns well; I agree to accept.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


