Abstract

Unsupervised medical image synthesis faces significant challenges due to the absence of paired data, often resulting in global anatomical distortions and local detail loss. Existing approaches primarily rely on convolutional neural networks (CNNs) for local feature extraction; however, their limited receptive fields hinder effective global anatomical modeling. Recently, Vision Mamba (ViM) has demonstrated efficient global modeling capabilities via state-space models, yet its potential in this task remains unexplored. To address this gap, we propose a hybrid architecture, CRAViM (Convolutional Residual Attention Vision Mamba), which integrates the precise local anatomical feature extraction of CNNs with the long-range dependency modeling of state-space models, thereby enhancing the structural fidelity and detail preservation of synthesized images. Furthermore, we introduce a cycle denoise consistency-based training framework that incorporates transport loss and random denoise loss to jointly optimize global structural constraints and local detail restoration. Experimental results on two public medical imaging datasets demonstrate that CRAViM achieves notable improvements in key metrics such as SSIM and NMI over existing methods, effectively maintaining global anatomical consistency while enhancing local details, thus validating the effectiveness of our approach. The code for this paper can be found at https://github.com/jmzhang-cv/CRAViM.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1652_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/jmzhang-cv/CRAViM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhaJun_Hybrid_MICCAI2025,
        author = { Zhang, Junming and Jiang, Shancheng},
        title = { { Hybrid State-Space Models and Denoising Training for Unpaired Medical Image Synthesis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {237 -- 246}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a hybrid architecture of CNNs and state-space models for unsupervised medical image synthesis; it also proposes a cycle denoise consistency-based training framework.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper introduces a novel training framework based on cycle denoising consistency, which shows improvements compared to pure cycle consistency.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper claims a contribution by introducing Mamba into the architecture; however, the performance gains appear marginal. For instance, in Table 2, the model shows only limited improvement over AttGAN—a non-Mamba-based architecture—on several metrics. If Mamba is to be highlighted as a key innovation, its impact should be substantiated with corresponding ablation studies, such as comparisons against CNN- or Transformer-based alternatives.

    2. Section 2.2 seems to be the core innovation of the paper and appears to drive the significant performance gains reported in Table 2. However, the writing in this section is unclear. Several variables in Figure 2, as well as components like the adversarial and cycle-consistency losses, are not properly defined in the text. The figure is difficult to interpret without adequate captioning or accompanying explanation—some elements are not mentioned in the paper at all. Furthermore, the motivation for introducing the two additional loss terms should be better articulated to clarify their necessity and contribution.

    3. The scope of evaluated datasets and tasks is somewhat limited. For example, SynDiff explores modality transfer across a wide range of modalities, while this paper evaluates only two modalities in MRI. Given the incremental performance gains, it is difficult to justify the added value of incorporating Mamba under such a narrow evaluation setting.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Minor Comments:

    1. The paper omits several important implementation details, including the model architecture hyperparameters and the weighting of different loss components. Including these would improve reproducibility.
    2. There are a lot of typos in Figure 2, and the figure captions would benefit from brief explanatory notes to improve readability.
    6. I have some concerns regarding the evaluation metrics used in the paper. Although SynDiff also relies solely on similarity-based metrics, I question the reliability of this approach, particularly because the reference images are obtained through registration, which may introduce inaccuracies. As a result, certain pixel-wise metrics may not be entirely trustworthy. Incorporating distribution-based metrics, such as FID, could provide a more robust and reliable evaluation.
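
The concern about registration-dependent metrics can be illustrated with a small synthetic experiment. The sketch below is not taken from the paper: the disc image, the 2-pixel shift, the 32-bin histogram, and the NMI definition (H(A) + H(B)) / H(A, B) are all assumptions chosen for illustration. It only shows that a slightly misaligned reference changes both MSE and NMI even when the image content is identical.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def nmi(a, b, bins=32):
    """Normalized mutual information, here assumed to be (H(A) + H(B)) / H(A, B)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()          # joint intensity distribution
    px = pxy.sum(axis=1)               # marginal of a
    py = pxy.sum(axis=0)               # marginal of b

    def entropy(p):
        p = p[p > 0]                   # skip empty bins to avoid log(0)
        return float(-np.sum(p * np.log(p)))

    return (entropy(px) + entropy(py)) / entropy(pxy.ravel())

# Hypothetical test image: a bright disc with mild texture noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:64, 0:64]
image = ((yy - 32) ** 2 + (xx - 32) ** 2 < 20 ** 2).astype(float)
image += 0.05 * rng.standard_normal(image.shape)

# Simulate a 2-pixel registration error in the reference image.
shifted = np.roll(image, shift=2, axis=1)

mse_same, mse_shift = mse(image, image), mse(image, shifted)
nmi_same, nmi_shift = nmi(image, image), nmi(image, shifted)
print(f"MSE: aligned={mse_same:.4f}, shifted={mse_shift:.4f}")
print(f"NMI: aligned={nmi_same:.4f}, shifted={nmi_shift:.4f}")
```

Under these assumptions, the aligned pair gives the best possible score on both metrics and the shifted pair scores worse on both, which is the effect the reviewer is worried about when references come from imperfect registration.
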
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper claims two main methodological contributions: the incorporation of Mamba and the introduction of denoising consistency. However, the impact of the former appears limited, while the latter is supported by a solid ablation study but suffers from unclear presentation. Based on these observations, I have assigned the current overall score.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed most of my concerns. Please ensure that the cycle denoising framework is clearly explained. Therefore, I improved the score to accept after rebuttal.



Review #2

  • Please describe the contribution of the paper

    This paper proposes CRAViM, a hybrid network combining CNNs and Vision Mamba-based state-space modeling for unpaired medical image synthesis. CRAViM aims to overcome CNNs’ limited receptive field and ViT’s weak inductive bias by merging precise local anatomical feature extraction with efficient long-range dependency modeling. To enhance synthesis fidelity, the authors further introduce a cycle denoise consistency training framework, which includes novel transport loss and random denoise loss to jointly enforce global anatomical coherence and local detail preservation. Experiments on IXI and BraTS2018 datasets demonstrate performance gains in SSIM, MSE, and LPIPS over existing methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Clear Motivation: The limitations of both CNNs and ViTs are articulated well, and the architectural decisions are logically motivated.

    Novel Architecture: The integration of EfficientMamba into the unsupervised image-to-image translation pipeline is original. The CRAViM claims to address the trade-off between local fidelity and global structure.

    Innovative Loss Design: The proposed transport and random denoise losses offer new perspectives on optimizing unpaired translation, particularly under noisy or ill-posed domain mappings.

    Comprehensive Evaluation: The paper includes multi-metric quantitative evaluation (SSIM, LPIPS, MSE, NMI), paired with ablation studies and visualizations on two open datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Insufficient Description and Justification of Vision Mamba (ViM):

    • Figures 1 and 2 are densely annotated and hard to interpret without detailed explanation. The equations and variables appearing in Fig. 1 are not mentioned in the main text.
    • Several expressions in Fig. 2 are missing closing right brackets, which impairs clarity. Please revise the mathematical notations to be complete and syntactically correct.
    • In Fig. 2 and Equations (3) and (4), there is ambiguity in how G, F and Gₓ, Gᵧ relate. It is unclear whether G and F represent forward and backward generators, or whether Gₓ / Gᵧ are separate networks. This notation should be made consistent and explicitly explained in both the figure caption and the main text.

    Scope of Experiments:

    • Although the proposed method is described as modality-agnostic and unpaired, only brain MRI images are used in experiments. Brain MRIs (T1, T2) are known to exhibit relatively less misalignment across subjects, which weakens the generalization claims. Testing on more challenging or diverse modalities (e.g., MR-CT or PET-CT) would strengthen the validation.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    – Missing citations for LPIPS.
    – In Section 2.1, the authors write: “we introduce the efficient visual state-space module (EVSS Block)[14]”. Since EVSS is not proposed in this work but rather in [14], this should be revised to “we adopt EVSS…” or “we adapt EVSS…” depending on whether it has been used as-is or modified.
    – “patial-hannel attention” -> “spatial-channel attention”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a technically sound and creative approach to improving unpaired medical image synthesis. Its hybrid CRAViM architecture addresses core limitations of both CNNs and transformers, and the novel denoising consistency framework is a valuable contribution. That said, the paper is not yet polished in terms of explanation and reproducibility. Nonetheless, the work is promising and makes a relevant, novel contribution, warranting acceptance with minor revisions for completeness and clarity.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper pioneers the integration of Vision Mamba (ViM), based on state-space models (SSMs), into unpaired medical image synthesis. By combining CNN’s local feature extraction with ViM’s global modeling, the proposed CRAViM hybrid architecture addresses the limitations of traditional CNNs in global anatomical modeling (e.g., CycleGAN’s structural distortion) and avoids detail loss caused by ViT’s lack of local inductive bias (e.g., AttGAN’s local blurring). This innovation opens a new research direction for medical image synthesis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. First application of ViM’s linear-complexity global modeling (via EVSS Block’s selective scanning) in medical synthesis, achieving 265.16G FLOPs at 256×256 resolution.
    2. Ablation studies prove CRAViM achieves LPIPS=0.098 in T2→T1 tasks, 17.6% lower than purely convolutional AttGAN.
    3. Significant improvements (p<0.05) on both IXI (normal brains) and BraTS2018 (pathological brains) in SSIM, NMI, etc.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Fails to compare with state-of-the-art medical Mamba variants (e.g., MedMamba in Bansal et al., 2024).
    2. No discussion on its performance degradation with ultra-long-range dependencies (>50% image span, e.g., whole-brain lesions).
    3. Tests only on 256×256 slices, while BraTS uses 240×240×155 volumes. Anisotropic information loss in downsampling is unexamined.
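
To make the shape mismatch in the third point concrete, the sketch below shows one way a 240×240×155 volume could be turned into 256×256 axial slices. It is an assumption for illustration only: the review does not say how the authors preprocess the data, and symmetric zero-padding is just one option (interpolating resize is another). The helper name `pad_slice_to` and the placeholder volume are hypothetical.

```python
import numpy as np

def pad_slice_to(slice_2d, size=256):
    """Zero-pad a 2D slice symmetrically to size x size (hypothetical helper)."""
    h, w = slice_2d.shape
    if h > size or w > size:
        raise ValueError("slice is larger than the target size")
    top = (size - h) // 2
    left = (size - w) // 2
    pad = ((top, size - h - top), (left, size - w - left))
    return np.pad(slice_2d, pad, mode="constant")

# Placeholder volume with the BraTS shape quoted in the review.
volume = np.zeros((240, 240, 155), dtype=np.float32)
volume[100:140, 100:140, 70:80] = 1.0   # a dummy "lesion" block

# Each axial slice is processed independently, so through-plane context is lost.
slices = [pad_slice_to(volume[:, :, k]) for k in range(volume.shape[2])]
print(len(slices), slices[0].shape)
```

Padding keeps the in-plane voxels unchanged, whereas a resize would interpolate them; in either case a 2D model sees no information along the third axis.
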
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper makes groundbreaking contributions by adapting Mamba to medical image synthesis, supported by rigorous experiments and novel training strategies, although weaknesses exist.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

Dear Reviewers,

Thank you for your valuable time and insightful evaluations of submission ID: 1652. We have carefully considered all comments; our detailed responses are below.

Response to Reviewer #1 (R1):

W1: Regarding your concerns about the clarity of the Vision Mamba description and Figures 1 & 2, we will enhance both figures with detailed captions, ensuring all elements and formulas (like those for A, B, h_t, y_t in Fig 1) are explained or referenced in-text. Definitions of generators G and F in Figure 2 and their relation to G_X (Eq.3) & G_Y (Eq.4) will be clarified, and all notational issues (e.g., missing brackets in Fig 2, as you pointed out) and overall mathematical correctness will be thoroughly reviewed and corrected.

W2: We acknowledge that current experiments are limited to brain MRI. While our framework is designed to be applicable across different modalities, we will discuss this limitation due to current constraints and note validation on other modalities (e.g., MR-CT, PET-CT) as future work.

R1.C1-C3: We will also revise the wording for EVSS Block attribution to “we adopt/adapt [14]” (addressing your comment on its introduction), add the LPIPS citation, and correct the “patial-hannel” typo.
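
For readers unfamiliar with the symbols A, B, h_t and y_t mentioned in the response above, they are the usual quantities of a discrete state-space recurrence, h_t = A h_{t-1} + B x_t and y_t = C h_t. The sketch below is a generic, fixed-parameter illustration of that recurrence and is not the paper's EVSS Block: the output matrix C, the function name `ssm_scan`, and the toy matrices are assumptions, and Mamba-style models additionally make the parameters input-dependent.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a sequence.

    x: (T, d_in), A: (n, n), B: (n, d_in), C: (d_out, n).
    One pass over the sequence, so cost grows linearly with T.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # state update carries information from all earlier steps
        ys.append(C @ h)      # readout at step t
    return np.stack(ys)

# Toy example: an impulse at t=0 decays by half at each step.
A = 0.5 * np.eye(2)
B = np.ones((2, 1))
C = np.array([[1.0, 0.0]])
x = np.array([[1.0], [0.0], [0.0]])
y = ssm_scan(x, A, B, C)
print(y.ravel())  # [1.   0.5  0.25]
```

The impulse response shows how an early input still influences later outputs through the hidden state, which is the long-range dependency property discussed in the reviews.
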

Response to Reviewer #2 (R2):

W1: Regarding comparison with medical Mamba variants like MedMamba: We note such models are primarily designed for classification/segmentation, with architectures (e.g., patch merging) often optimized for those tasks rather than high-fidelity synthesis requiring fine detail preservation. Our CRAViM hybrid design was motivated by a careful consideration of these differing task requirements and architectural paradigms.

W2: We agree that evaluating performance under extreme long-range dependencies is a valuable future direction and will note it in the discussion.

W3: Using 256×256 2D slices is standard for fair comparison and resource management. We will detail preprocessing and acknowledge 3D-to-2D information loss. Exploring direct 3D application to best leverage CRAViM’s strengths is valuable future work, given current constraints.

Regarding reproducibility, we confirm that our code will be released upon acceptance, as stated in our abstract.

Response to Reviewer #3 (R3):

W1: Concerning Mamba’s contribution: Even atop our high-performing Cycle Denoise Consistency framework, Mamba-based CRAViM consistently outperforms the convolutional AttGAN (e.g., T2→T1: 0.96% SSIM up, 7.55% LPIPS down), offering insight into its long-range modeling benefits. While further CNN/Transformer ablations (your valuable suggestion) were beyond current scope due to conference constraints, Mamba’s synergy and gains will be more explicitly detailed in our revised discussion of architectural choices and results. Extensive comparisons are future work.

W2: We agree Sec 2.2 & Fig 2 need clarity. New losses tackle key issues: Transport loss preserves source anatomy against target noise. Random denoise loss trains on corrupted inputs for noise-invariant features, boosting robustness. They synergize for better global/local results. Motivations, all loss definitions, & Fig 2 will be clarified in revision.

W3: We acknowledge the limited dataset scope. Our framework is designed for general applicability across modalities, and we will state multi-modality evaluation as future work.

C_MC6: Regarding evaluation metrics (your Minor Comment 6), we appreciate your concerns about pixel-level metrics (SSIM, MSE) and the suggestion for FID. LPIPS is included for its robustness to minor misalignments. While FID is valuable, its natural-image bias from Inception might not ideally suit grayscale MRI. We will discuss these limitations and LPIPS’s role.

C_MC4: Finally, to improve reproducibility and address missing implementation details (your Minor Comment 4), we will add detailed hyperparameters (model architecture details, loss weights) in Section 3.1, and the code will be released upon acceptance.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This work is close to early accept, and the authors have addressed the reviewers’ concerns in the rebuttal. I agree to accept it.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


