Abstract
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin.
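To make the two-stage design described in the abstract concrete, a minimal PyTorch sketch is given below. This is a schematic illustration only, not the authors' implementation (see the code repository linked below): the convolutional stem, the bidirectional GRU used here as a stand-in for the bidirectional Mamba block, and all shapes and hyperparameters are illustrative assumptions chosen just to show the temporal-first scan followed by global spatial self-attention.
```python
# Schematic sketch (NOT the authors' code) of the two-stage BrainMT idea:
# (1) a convolutional stem compresses each fMRI volume,
# (2) a bidirectional sequence model scans tokens in temporal-first order
#     (a bidirectional GRU stands in for the bidirectional Mamba block),
# (3) a Transformer encoder models global spatial relationships,
# (4) a linear head predicts the phenotype.
# All shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class BrainMTSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_outputs=1):
        super().__init__()
        # 3D conv stem applied per frame to shrink the spatial grid
        self.stem = nn.Sequential(
            nn.Conv3d(1, d_model, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv3d(d_model, d_model, kernel_size=3, stride=2, padding=1),
        )
        # Stand-in for the bidirectional Mamba block (temporal-first scan)
        self.temporal = nn.GRU(d_model, d_model // 2, batch_first=True,
                               bidirectional=True)
        # Transformer block for global spatial mixing of temporally pooled tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.spatial = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_outputs)

    def forward(self, x):                       # x: (B, T, X, Y, Z)
        b, t = x.shape[:2]
        v = x.reshape(b * t, 1, *x.shape[2:])   # fold time into batch for the stem
        f = self.stem(v)                        # (B*T, D, x, y, z)
        d, s = f.shape[1], f.shape[2:].numel()
        f = f.reshape(b, t, d, s).permute(0, 3, 1, 2)   # (B, S, T, D)
        # temporal-first scan: each spatial token's full time course is contiguous
        f = f.reshape(b * s, t, d)
        f, _ = self.temporal(f)                 # (B*S, T, D)
        tokens = f.mean(dim=1).reshape(b, s, d) # pool time -> one token per location
        tokens = self.spatial(tokens)           # global spatial self-attention
        return self.head(tokens.mean(dim=1))    # (B, n_outputs)


if __name__ == "__main__":
    vols = torch.randn(2, 8, 32, 32, 32)        # tiny toy volumes, not real fMRI sizes
    print(BrainMTSketch()(vols).shape)          # torch.Size([2, 1])
```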
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1341_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/arunkumar-kannan/BrainMT-fMRI
Link to the Dataset(s)
HCP dataset: https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation
UKB dataset: https://www.ukbiobank.ac.uk
BibTex
@InProceedings{KanAru_BrainMT_MICCAI2025,
author = { Kannan, Arunkumar and Lindquist, Martin A. and Caffo, Brian},
title = { { BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15971},
month = {September},
pages = {151--161}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces BrainMT, a hybrid framework with bi-directional Mamba blocks and a temporal-first scanning mechanism to efficiently learn the spatiotemporal complexity of 4D fMRI. It achieves strong performance on large-scale datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well-written and addresses a fundamental challenge in fMRI research: effectively capturing the complex spatio-temporal features of 4D fMRI data. This is a notoriously difficult problem, and the authors present a thoughtful approach that demonstrates clear progress in tackling it.
- The proposed framework leverages a Mamba-based architecture and achieves competitive performance on two important tasks: sex classification and cognitive intelligence prediction. This indicates that the method is both versatile and practically applicable across cognitive neuroscience tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The number of frames must be manually selected, and as it increases, the computational cost grows significantly. Currently, this frame selection process is arbitrary, which remains a limitation of the approach.
- The authors successfully reduce computational cost in voxel-level 4D fMRI analysis using convolutional blocks to compress feature dimensions and effectively model spatio-temporal information. Recent works, e.g., [1], have applied Mamba-based architectures to reduce computational complexity. The authors should clarify any additional contributions toward computational efficiency beyond the use of convolution blocks and learnable spatial/temporal embeddings.
- The paper proposes a temporal-first approach utilizing Mamba. As addressed by prior studies [1],[2], the way to scan has been shown to significantly influence performance. A more explicit comparison with these works would help clarify the originality and advantage of the proposed scanning strategy.
- While the framework shows good performance on both sex classification and cognitive intelligence prediction tasks, the paper lacks comparison with more recent and competitive baselines for each task. Including these comparisons would enhance the evaluation robustness.
- The evaluation metrics could be expanded. In particular, reporting widely used metrics such as the F1-score would provide a more comprehensive performance assessment, especially for imbalanced classification tasks.
- Several points require further clarification: a) In Experiment E in the ablation study, the authors mention “processing more frames at once.” However, it is unclear exactly how many frames are being processed. Providing precise details would improve reproducibility and clarity. b) The concept of “partially overlapping patches” in the method section is introduced but not sufficiently explained. An additional explanation of how these patches are constructed and used would strengthen the method section.
[1] Islam, Md Mohaiminul, et al. “BIMBA: Selective-Scan Compression for Long-Range Video Question Answering.” arXiv preprint arXiv:2503.09590 (2025).
[2] Gong, Sitong, et al. “AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation.” arXiv preprint arXiv:2501.07810 (2025).
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is well-written and effectively addresses the critical challenge of learning spatio-temporal features from 4D fMRI data. However, the current version lacks comparisons with recent works and experiments that substantiate the paper’s contributions, particularly regarding the manual frame selection process, and it needs a more thorough discussion of the novelty of the proposed methods for computational efficiency and of the temporal-first approach. For these reasons, I hold my rating at “weak reject”. I suggest the authors address these issues in the rebuttal.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors provided convincing clarifications regarding the core contributions of their framework. Specifically, they demonstrated that the spatio-temporal Mamba block, rather than the convolutional front-end alone, is the main source of the reported efficiency and modeling gains. These responses effectively address the main concerns.
Review #2
- Please describe the contribution of the paper
This paper introduces BrainMT, a novel hybrid Mamba–Transformer architecture for modeling long-range spatiotemporal dependencies in 4D fMRI data. The authors combine bidirectional Mamba blocks (temporal-first scanning) with a lightweight global Transformer to enable end-to-end phenotypic prediction from volumetric resting-state fMRI. The model is evaluated on UK Biobank and HCP for both regression (cognitive intelligence) and classification (sex) tasks, achieving consistent gains over existing voxel-level baselines.
The proposed approach is conceptually strong and well-motivated, leveraging recent advances in state-space modeling to improve efficiency and sequence length handling in fMRI contexts. However, while the architecture is novel and promising, the paper would benefit from improvements in evaluation design and interpretability to fully justify its gains.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- End-to-end volumetric modeling of fMRI is an important trend, and integrating long-range modeling blocks (Mamba) is well-justified.
- The method shows solid improvements over prior work (especially SwiFT) on two large datasets.
- Figure 1d clearly demonstrates the computational advantage of BrainMT over prior Transformer-based models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- From Figure 2 and the description of predicted results, it is unclear whether BrainMT preserves meaningful connectomic patterns. The interpretability maps are promising but remain high-level. More quantitative validation (e.g., brain module-level accuracy or attention heatmaps) would strengthen the impact.
- While the ablation results are extensive (Table 3), the analysis lacks deeper interpretation. For instance, the gain from adding the Transformer block is numerically small and may not justify the added complexity without further justification.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is technically sound, addresses a clear gap in the literature, and shows promising results. However, the limited pattern-level validation and the lack of deeper interpretation of the architectural contributions hold the paper back from a higher score. I believe this work is publishable and relevant, but it requires further refinement to reach its full impact.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes BrainMT, a hybrid deep learning framework for phenotypic prediction from resting-state fMRI data. BrainMT combines a bi-directional Mamba block for efficient long-range temporal modeling with a Transformer block to capture global dependencies. The authors evaluate their method on two large-scale neuroimaging datasets (UK Biobank and HCP), demonstrating better performance in both regression (cognitive intelligence) and classification (sex prediction) tasks compared to existing voxel- and correlation-based methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Timely and relevant problem: The paper addresses the limitations of both correlation-based and voxel-based approaches in modeling long-range spatiotemporal dependencies in fMRI data.
- Novel architecture: The proposed hybrid framework combining Mamba and Transformer modules is well motivated and technically interesting.
- Empirical performance: BrainMT achieves better results across two large-scale benchmark datasets on both classification and regression tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Dataset statistics are incomplete: Important dataset characteristics such as the original temporal length of fMRI sequences (T), spatial dimensions (H, W, D), and label distributions (e.g., class balance for sex prediction) are not clearly reported.
- Random sampling of fMRI frames lacks justification: The authors sample 200 (or 100/300) frames randomly from each sequence but do not provide a sufficient motivation or analysis of how this affects temporal continuity and ordering of the original fMRI signals, which is critical in fMRI analysis.
- Correlation-based baselines are underdeveloped: Only a single parcellation (HCP multimodal atlas) is used. However, prior literature suggests that model performance is highly sensitive to the choice of parcellation. Using multiple parcellation strategies would make the comparison more convincing.
- Limited ML baselines: Standard and widely-used machine learning methods in neuroimaging such as SVM, ElasticNet, or Kernel Ridge Regression are not included. Including such baselines could help assess the relative merit of BrainMT over simpler, interpretable models.
- Lack of tuning for baseline methods: The authors state they used hyperparameters from the original baseline studies, but given the different datasets and splits used in this work, hyperparameter tuning on validation data would have been more fair and informative.
- Unclear description in ablation study (E): The section on “predicting functional connectivity correlations” lacks detail on the methodology and evaluation protocol used for this task.
- Counterintuitive ablation results not discussed: In Table 3, removing both Conv and Transformer modules (“No Conv & Transf.”) seems to slightly improve performance compared to removing only the Transformer. This unexpected result is not discussed or explained.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend a Weak Accept. While the paper proposes a novel architecture with promising empirical performance, the current version has several important issues, especially related to experimental clarity and reproducibility. Key limitations include incomplete dataset statistics, lack of baseline tuning, insufficient exploration of alternative parcellations and machine learning baselines, and some ambiguity in ablation explanations.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their constructive feedback and for recognising our work as novel (R1, R2); conceptually strong (R1); well-motivated and technically interesting (R2); a thoughtful approach (R3) with computational advantages (R1); and versatile and practically applicable (R3). Below we address the main concerns of R1-3.
R3.2 Computational efficiency beyond the convolution front-end. We would like to clarify that our efficiency gains come from more than the front-end convolutions. Instead, the key component is the spatiotemporal Mamba block, whose state-space formulation scales linearly with sequence length, unlike the quadratic self-attention used in prior voxel-level models (e.g., the TFF/SwiFT baselines). Because of this linear cost, we can model up to 400 frames in an fMRI sequence instead of truncating to 20, while cutting peak memory by 35.8% (Fig. 1d). Furthermore, the Mamba variant in [1] (CVPR ’25) relies on a ViT front-end suited for 2D videos; extending that to 4D fMRI data (400×91×91×109 voxels) would re-introduce quadratic cost, motivating our use of a convolutional block.
R3.3 Temporal-first scan novelty. We agree that the scan rule is crucial and would like to point to Table 3 (Exp. D1-D2, p. 7), where we replaced our temporal-first Mamba layer with the variants used in [2] (VMamba, four-direction 2D scan) and [1] (VisionMamba, spatial 1D scan), keeping everything else identical. Both alternatives reduced Pearson’s r from 0.41 (BrainMT) to 0.24 (VMamba) and 0.07 (VisionMamba), showing that modeling temporal correlations first is essential when analyzing an fMRI sequence. This neurobiologically motivated scan order is therefore the key advantage of our approach (p. 5).
R2.2, R3.1 Sampling of fMRI frames. We sample a contiguous 200-frame window per scan, keeping temporal order intact. Because resting-state fMRI is near-stationary, any contiguous window is representative; we therefore randomize the window selection to learn diverse temporal contexts and reduce overfitting. We plan to investigate data-driven window selection (e.g., [1]) in future work.
R1.1 Interpretability. We would like to clarify that Integrated Gradients produces voxel-wise attribution scores, so the resulting maps are inherently low-level. To address the request for higher-level validation, we will add brain-module aggregates of these voxel scores to Fig. 2. In addition, to show that BrainMT captures connectomic structure, we point to Table 3 (Exp. E) where, using a 200-frame input window, BrainMT’s predicted functional-connectivity matrix correlates strongly with the empirical matrix at r = 0.72.
R2.3 Multiple parcellations. We acknowledge that parcellation choice can affect correlation-based baselines, which motivated our move to a voxel-level framework. To ensure a fair comparison, we evaluated baselines using the HCP multimodal atlas, chosen for its integration of multi-modal data, widespread adoption, and inclusion in both UKB and HCP releases, thereby avoiding the need for extra preprocessing.
R2.4, R2.5 ML baselines. We chose XGBoost as our default ML baseline because of its fast training and easy interpretability in high dimensions. While we cannot add new experiments during the rebuttal, we acknowledge the value of alternative ML methods and will consider them in future work. Furthermore, all baselines were hyperparameter-tuned on the validation split; notably, this tuning recovered the same settings as the original papers, which we will clarify.
Clarity on ablations. R1.2: Beyond a small performance gain, we integrate the Transformer to capture global spatial dependencies that complement Mamba’s long-range temporal modeling. R2.6, R3.6a: Please refer to R1.1 above for Exp. E details.
Minor points. R2.1: We will add UKB/HCP scan lengths (490/1,200), spatial size (91×91×109, MNI), and class balance (female: 50.86% UKB; 51.16% HCP). R3.5: For sex classification (Table 2), we currently report balanced accuracy and will also include F1-scores.
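As an illustration of the contiguous-window sampling described in the rebuttal (R2.2/R3.1), a minimal sketch is shown below. The function name, the default window length of 200, and the use of NumPy are illustrative assumptions, not the authors' implementation; the only point is that a random start index is drawn and a contiguous block of frames is kept, so temporal order is preserved.
```python
# Minimal sketch (an assumption, not the authors' code) of contiguous-window
# sampling: pick a random start index, then take a contiguous block of frames
# so the temporal ordering of the fMRI signal is preserved.
import numpy as np


def sample_contiguous_window(fmri, window=200, rng=None):
    """fmri: array of shape (T, X, Y, Z); returns (window, X, Y, Z)."""
    rng = rng or np.random.default_rng()
    t = fmri.shape[0]
    if t <= window:                            # short scans are returned unchanged
        return fmri
    start = rng.integers(0, t - window + 1)    # randomized window placement
    return fmri[start:start + window]          # temporal order kept intact


# toy usage: an HCP-like scan length of 1,200 frames on a tiny spatial grid
scan = np.random.randn(1200, 8, 8, 8)
print(sample_contiguous_window(scan).shape)    # (200, 8, 8, 8)
```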
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Reviewers were excited by the proposed model and the promising experimental results, while also noting concerns regarding needed clarifications about the method and experimental settings, as well as some perceived missing experiments or analysis. The rebuttal clarifies several of these concerns, and in the end all reviewers lean toward acceptance of the work. I agree that the proposed approach would be of interest to the fMRI community and thus recommend acceptance.
The authors should please include the requested clarifications in the final version of the paper.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A