Abstract
Compared to single view medical image classification, using multiple views can significantly enhance predictive accuracy as it can account for the complementarity of each view while leveraging correlations between views. Existing multi-view approaches typically employ separate convolutional or transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and suffer from a limited receptive field (convolutional neural networks, CNNs) or quadratic computational complexity (transformers). Inspired by state space sequence models, we propose \emph{XFMamba}, a pure Mamba-based cross-fusion architecture to address the challenge of multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy, facilitating the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies in each view while enhancing seamless information transfer between views. Results on three public datasets, MURA, CheXpert and DDSM, illustrate the effectiveness of our approach across diverse multi-view medical image classification tasks, showing that it outperforms existing convolution-based and transformer-based multi-view methods.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1773_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/XZheng0427/XFMamba
Link to the Dataset(s)
MURA dataset: https://stanfordmlgroup.github.io/competitions/mura/
CheXpert dataset: https://stanfordmlgroup.github.io/competitions/chexpert/
CBIS-DDSM dataset: https://www.cancerimagingarchive.net/collection/cbis-ddsm/
BibTex
@InProceedings{ZheXia_XFMamba_MICCAI2025,
author = { Zheng, Xiaoyu and Chen, Xu and Gong, Shaogang and Griffin, Xavier and Slabaugh, Greg},
title = { { XFMamba: Cross-Fusion Mamba for Multi-View Medical Image Classification } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15960},
month = {September},
pages = {676--686}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes XFMamba, a multi-view classification framework leveraging a pure Mamba-based architecture. The main contributions include: 1) A four-stage encoder employing Visual State Space Modules (VSSMs), effectively capturing hierarchical, multi-scale features from individual views. 2) A two-stage fusion module comprising the Cross-View Swapping Mamba (CVSM) for local interleaving and the Multi-View Combination Mamba (MVCM) for global feature fusion, enabling effective cross-view feature alignment and integration. 3) Extensive experimental validation on three standard medical imaging datasets (MURA, CheXpert, CBIS-DDSM), demonstrating superior performance against CNN-based and Transformer-based state-of-the-art methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The application of Mamba-based architectures for multi-view fusion in medical imaging is innovative, representing a clear departure from traditional CNN or Transformer-based methods. 2) State-space modeling (SSM) principles are formulated with appropriate mathematical clarity (e.g., Eqs. (1)-(5)). 3) The CVSM and MVCM fusion blocks are carefully designed and motivated, effectively handling both local and global cross-view correlations. 4) Robust experimental validation across diverse datasets (musculoskeletal, chest, mammography) clearly shows significant improvements over existing methods. 5) Extensive ablation studies precisely quantify each proposed module’s contribution, enhancing credibility and interpretability. 6) Effective use of GradCAM visualizations demonstrates improved localization capabilities, aiding clinical interpretability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) While theoretically robust, the complexity of the CVSM and MVCM modules could limit practical interpretability and adoption by clinicians or practitioners who may struggle to understand intricate cross-view feature interactions. 2) Despite the authors’ claims regarding efficiency, the computational cost (FLOPs, parameter counts) remains substantial (e.g., 90M parameters for the largest model variant). Detailed computational runtimes or real-world efficiency benchmarks are missing. 3) The experiments primarily demonstrate quantitative improvements in predictive accuracy. However, deeper analysis of how such improvements translate directly into clinical benefits or outcomes remains unexplored. 4) While three datasets are included, they primarily represent standard benchmarks rather than challenging clinical scenarios. The method’s robustness under limited data, imbalanced classes, or noisy labels—common in real-world medical imaging—is not extensively evaluated. 5) The sensitivity of model performance to hyperparameter choices (learning rate, depth of fusion modules, etc.) is not sufficiently explored, potentially limiting the ease of reproducibility in different clinical setups.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
** While your integration of the Mamba architecture into multi-view medical classification tasks is interesting, the methodological novelty beyond applying Mamba to this task remains limited. The proposed modules, although carefully designed, are essentially adaptations of existing techniques from recent literature (e.g., VMamba). The authors should explicitly articulate what specific aspects of the Mamba-based fusion provide unique, substantial improvements beyond existing fusion strategies.
** The design and implementation of the Cross-View Swapping Mamba (CVSM) and Multi-View Combination Mamba (MVCM) modules are quite intricate. The current manuscript does not provide intuitive explanations or sufficient justification for these complex operations, so please clarify the theoretical basis and practical motivations behind each step. Consider adding illustrative examples (possibly simplified diagrams) explaining how these modules operate at each stage to help the reader better understand their functional mechanisms.
** The largest model variant (XFMamba-B) is notably large (90 million parameters), which could limit clinical deployment, particularly in environments with limited computational resources. No detailed computational runtime analysis or benchmark is provided. Please provide explicit computational benchmarks (runtime, memory consumption) across different clinical computing platforms (CPU-only, low-powered GPU scenarios) to clearly illustrate practical efficiency and feasibility in clinical environments.
** While results demonstrate improved performance numerically, the paper does not effectively establish clear clinical relevance. It remains unclear whether modest numerical improvements will meaningfully impact clinical diagnosis, interpretation, or workflow, so the authors should explicitly discuss how these performance gains translate into improved clinical outcomes, reduced diagnostic uncertainty, or enhanced clinical efficiency.
** Although you have demonstrated improvements on standard benchmark datasets, you have not thoroughly explored robustness in real-world scenarios such as dataset imbalance, noisy labels, or multi-center data variability. The authors should consider additional experiments to evaluate performance under realistic clinical data challenges (imbalanced datasets, label noise, or multi-center variability).
** A sensitivity analysis for key hyperparameters, such as model depth, fusion complexity, and learning rate, is missing. Please provide a detailed sensitivity analysis to demonstrate model robustness to variations in these parameters, improving reproducibility and trustworthiness.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces XFMamba, an adaptation of the Mamba architecture for multi-view medical image classification, demonstrating improved numerical performance on public benchmarks. However, despite strong empirical validation, the manuscript falls short regarding clear methodological innovation, interpretability, and practical clinical relevance. Additionally, concerns regarding model complexity, computational efficiency, generalizability, and lack of robustness evaluations hinder its potential impact.
The authors must convincingly clarify the unique theoretical and methodological contributions, provide computational efficiency benchmarks, perform detailed robustness analyses, and explicitly discuss the clinical implications of their approach.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal justifies that CVSM and MVCM extend Mamba’s capabilities for cross-view fusion, introducing specialized mechanisms like interleaved SSM and matrix-based view fusion, which go beyond plug-and-play use of Mamba in prior work.
This is a valid and significant architectural contribution, especially in multi-view medical contexts.
The authors provide runtime and memory usage benchmarks in the rebuttal, addressing deployment concerns.
They propose XFMamba-T and -S variants for efficient deployment, which is a realistic compromise.
The rebuttal correctly points out that the chosen datasets (e.g., CheXpert, CBIS-DDSM) include noisy labels, multi-center variability, and uncertainty, inherently reflecting real clinical settings.
The authors convincingly explain that improved cross-view fusion can reduce diagnostic ambiguity, making the modest performance gains clinically meaningful, especially in screening contexts.
The rebuttal promises revisions to improve Figure 1 and naming consistency (SSM/SS2D confusion, early/late fusion vs. CVSM/MVCM).
These are minor presentation issues and do not undermine the technical contributions.
While the performance gains over early/late fusion are modest (~0.9–1.0%), they are statistically consistent across multiple runs and datasets, and the authors justify the frontal view’s dominance in CheXpert, which compresses gain margins.
The detailed design of the fusion blocks still leads to cleaner GradCAMs and better interpretability, an often underemphasized but important dimension.
Review #2
- Please describe the contribution of the paper
- Combination of two image views for prediction, using cross-fusion.
- The introduction of the MVCM block.
- Above-SOTA performance on multiple datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The MVCM block is seemingly novel.
- Comparisons to many SOTA baselines, showing superior results.
- Experiments and results on three separate datasets, supporting generalizability.
- Ablation studies to investigate the effects of the CVSM and MVCM blocks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- There is only a modest increase in performance from combining the CVSM block and MVCM block over using only one of them. Additionally, discarding both blocks (late or early fusion) only leads to a small decrease in performance. A discussion of this is lacking.
- There is no investigation of the relative importance of the two views in the cross-fusion (attributability). E.g., using only frontal views performs almost as well as combining the two views with early-, late-, or even cross-fusion.
- Unclear if and how early/late fusion (e.g. table 2) correspond directly to shallow/deep fusion (e.g. fig 1).
- Illegible figures 1 and 2.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Font size in fig 1 and 2 too small and illegible. Please adjust.
- You use abbreviation SSM for both State Space Model and Selective Scan Module, which is confusing. Please change one.
- Please clarify if CheXpert was used for a binary task (disease v. normal), or multi-class classification (13 disease classes). AUROC typically implies binary classification, please indicate how it was calculated if multiclass.
- Do you use early=shallow and late=deep to denote types of fusion? Please be consistent in naming.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is well written with proper ablation studies and baseline comparisons, and the methodology and experiments seem sound and relevant. Combining different image views in a novel way to increase performance is relevant and interesting. However, there are some clarifications needed in the method, and more discussions on the results.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors addressed the comments from the reviewers satisfactorily. There were no major critiques regarding, e.g., the novelty or significance of the model.
Review #3
- Please describe the contribution of the paper
This paper proposes a new Mamba-based cross-fusion architecture to address the challenge of multi-view medical image classification, going beyond simple feature fusion. The framework comprises two key components: a four-stage encoder that captures multi-scale features across multiple views and a two-stage fusion module to exchange and align cross-view information and enhance multi-view integration. The authors present comprehensive experimental results to validate their approach.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The overall paper structure and flow are well-organized, though some core concepts would benefit from more precise articulation. Overall, the methodology is well-designed and is thoroughly validated. Indeed, the experimental validation effectively demonstrates the performance of both individual components and the complete system.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The Introduction is not well balanced: too little space is devoted to explaining the contributions and the rationale behind the design choices, while, in my opinion, too much space is devoted to background and related work. The authors should spend more space making the introduction compelling and captivating, to intrigue the reader and provide more insight. As written, it is difficult to understand what the authors' contributions mean. The authors should also include more examples to get the reader on the same page before the Method section.
Figure 1 is a bit cluttered and difficult to understand. There is no description of what many acronyms mean, such as the LN, DWConv, and SS2D blocks. Also, the two hand radiographs shown in (i) are really dark, and one needs to zoom in and increase the brightness of the display to understand what they represent. I suggest the authors search for better candidates in their dataset, or increase the image contrast and brightness of these images for visualization purposes only (not model training). The image caption should be self-contained.
Table 1 - are the results in Table 1 obtained on the validation set or an external test set? The authors should specify that.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Abstract – the “CNNs” acronym is not defined earlier in the abstract; also, it is not used anywhere else in the abstract.
Introduction – regarding the sentence “To address the aforementioned limitations, Selective Structured State Space Models (S6),”: why “S6”? If it refers to the four capital S’s, is it a typo?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
see above
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank all reviewers for their constructive feedback and positive reception. They found our work novel and interesting (R1, R2, R3) and noted that it outperforms SOTA methods (R1, R2, R3). Below we respond to specific comments and will revise our paper accordingly.
Methodological Novelty and Contribution (R1, R3): While inspired by VMamba, with its long-range dependency modeling and linear time complexity, our cross-fusion architecture introduces substantial innovations beyond prior work, namely the CVSM and MVCM blocks that perform shallow and deep fusion for multi-view medical images. CVS-SSM introduces an interleaved/deinterleaved scan mechanism to facilitate localised cross-view feature exchange, while the Mamba block captures long-range dependencies. MVC-SSM uses the system matrix C to decode and fuse features from both views, enhancing multi-view context-awareness. Together, this cascaded design outperforms simple early/late fusion and attention-based approaches, which lack long-range modeling or incur higher computational cost.
Computational Efficiency (R1): We acknowledge that XFMamba-B may challenge deployment in low-resource settings. While omitted due to space limits, we evaluated computational efficiency in prior experiments. XFMamba benefits from Mamba’s linear-time architecture (O(L·D)). On an A100 GPU (40GB), the memory usage (average inference runtime per sample) of XFMamba-T/S/B is 16.97% (24.28 ms), 31.45% (47.83 ms), and 46.97% (75.40 ms), respectively. XFMamba-T/S provide optimal performance–efficiency trade-offs and are potential solutions for clinical deployment. We are actively exploring model distillation to benefit clinical PC or laptop usage.
Real-World Clinical Challenges and Clinical Relevance (R1): Our experiments address real-world datasets that inherently exhibit imbalance, label noise, and multi-center variability.
CBIS-DDSM includes data from four clinical institutions; CheXpert contains radiographs collected over 15 years at Stanford Hospital across diverse inpatient and outpatient settings; MURA and CheXpert use uncertainty labels with noise for the training set due to manual annotation; and, to ensure robust evaluation, test labels were further refined by board-certified Stanford radiologists. These factors reflect real-world clinical variability across radiologists and datasets. Regarding clinical relevance, improved cross-view fusion in chest radiography, mammography, or musculoskeletal imaging can significantly enhance diagnostic confidence and reduce false positives/negatives, leading to better detection of breast cancer, chest pathologies, and musculoskeletal abnormalities.
Hyperparameter Sensitivity (R1): We performed extensive evaluations across learning rates (1e-5 to 1e-3) and fusion depths (1 to 3 layers), selecting 1e-4 and depth=1 as optimal for the performance–efficiency trade-off. Although detailed results were excluded due to space, we will release code, configurations/hyperparameters, and pre-trained weights for reproducibility.
Model Performance (R2): For ablation studies on the CheXpert dataset, the frontal view contains more diagnostic information than the lateral view, which explains why it performs similarly to simple early or late fusion. However, our cross-fusion mechanism effectively captures complementary information and long-range dependencies between views, resulting in a notable 1.03% AUROC improvement (from 0.9081 to 0.9184) over using the frontal view alone. CheXpert is a multi-label binary task with 13 diseases, each evaluated as a separate binary classification. AUROC is computed per class and averaged across all classes.
Clarifications: (R2) There is a difference between our CVSM/MVCM (enhancement of cross-view feature interaction and integration) and straightforward early/late fusion (concatenation before/after encoding, with limited improvement).
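The per-class AUROC averaging described in the rebuttal (compute AUROC independently for each disease label, then average) can be sketched as follows. This is a hypothetical illustration using the rank-based (Mann–Whitney) formulation of AUROC, not the authors' evaluation code; in practice one would typically call a library routine such as scikit-learn's roc_auc_score with macro averaging.

```python
def binary_auroc(scores, labels):
    """AUROC for one binary class via the Mann-Whitney statistic:
    the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half-correct."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:  # AUROC undefined without both classes
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(score_rows, label_rows):
    """Macro AUROC for a multi-label task (e.g. 13 CheXpert classes):
    per-class AUROC, averaged over classes. Rows are samples,
    columns are classes."""
    n_classes = len(score_rows[0])
    per_class = [
        binary_auroc([row[c] for row in score_rows],
                     [row[c] for row in label_rows])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes
```

A perfect ranking within every class yields a macro AUROC of 1.0; a class whose positives and negatives are tied at the same score contributes 0.5.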
(R3) All reported performances are based on the hold-out and unseen testing sets. The Introduction/Fig.1 will be revised.
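The interleaved/deinterleaved scan mechanism the rebuttal attributes to CVSM can be illustrated with a minimal sketch. This is a hypothetical toy on token lists, not the authors' implementation: tokens from two views are woven into one sequence so that a sequential (Mamba-style) scan alternates between views, and the result is split back into per-view sequences afterwards.

```python
def interleave(view_a, view_b):
    """Weave two equal-length view token sequences into one:
    [a0, b0, a1, b1, ...], so a sequential scan over the merged
    sequence alternates between the two views."""
    assert len(view_a) == len(view_b), "views must align token-wise"
    merged = []
    for ta, tb in zip(view_a, view_b):
        merged.extend([ta, tb])
    return merged

def deinterleave(merged):
    """Split an interleaved sequence back into its two view
    sequences (inverse of interleave)."""
    return merged[0::2], merged[1::2]
```

Running a state-space scan over the interleaved sequence lets each view's token condition on the other view's neighbouring token, which is the localised cross-view exchange the rebuttal describes; deinterleaving then restores the per-view layout for the next stage.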
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
All three reviewers acknowledged the potential value of the proposed method. At the same time, they raised important concerns regarding different aspects of the paper, e.g., methodological novelty, practical clinical relevance, and computational efficiency. The authors should carefully address each of these points in the rebuttal.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Post-rebuttal all reviewers agree that the paper should be accepted. Congrats!
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A