Abstract

Recent advances in deep learning and generative AI have enhanced our understanding of brain function and enabled brain-computer interfaces to reconstruct stimuli from non-invasive neuroimaging data. In this work, we introduce an efficient two-stage training framework for captioning stimulus images from fMRI data, leveraging the compact representations of vision-language models and incorporating contrastive learning with text embeddings. Our approach demonstrates strong performance in fMRI captioning across multiple evaluation metrics and enables multimodal retrieval, highlighting the advantages of contrastive learning. Additionally, we conduct a region-of-interest (ROI) analysis to examine the contributions of specific brain regions to the decoding process, providing interpretable results that align with neuroscience theories. Our findings contribute to advancing brain decoding techniques and improving model interpretability.
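For illustration, the following is a minimal sketch of a CLIP-style symmetric contrastive objective for aligning predicted fMRI embeddings with text embeddings, assuming a PyTorch setup; the embedding dimension, batch size, and temperature are placeholders, not the paper's configuration.

    import torch
    import torch.nn.functional as F

    def symmetric_contrastive_loss(fmri_emb, text_emb, temperature=0.07):
        # fmri_emb, text_emb: (batch, dim) pooled embeddings in a shared space;
        # matching fMRI/text pairs share the same row index.
        fmri_emb = F.normalize(fmri_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = fmri_emb @ text_emb.t() / temperature      # (batch, batch) cosine similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy over both retrieval directions (fMRI -> text, text -> fMRI).
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Toy usage with random placeholder embeddings.
    loss = symmetric_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))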

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2049_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

Natural Scenes dataset: https://naturalscenesdataset.org/

BibTex

@InProceedings{SheVya_Interpretable_MICCAI2025,
        author = { Shen, Vyacheslav and Kunanbayev, Kassymzhomart and Jang, Donggon and Kim, Daeshik},
        title = { { Interpretable fMRI Captioning via Contrastive Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {302 -- 312}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution is a model that embeds images, captions, and fMRI data from human subjects viewing the images into a shared space, enabling mapping between these domains. The authors’ embeddings are lower-dimensional than those of previous works.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors’ method works reasonably well on retrieval, and outperforms other methods on most metrics in fMRI captioning. Also, the captions generated from synthetic fMRI signals match our understanding of the roles of different brain regions.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The major weakness is that it is not clear whether this work is much of a breakthrough. As the authors acknowledge, there are several previous works that align both image and caption embeddings with fMRI data. Further, their two-stage training procedure and brain ROI interpretability experiment seem to be directly modeled after Ozcelik et al. It is also unclear to me why they don’t report the performance of text retrieval for methods such as MindEye or UniBrain, since it seems that both of those methods can map fMRI to captions.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the authors note that their embeddings are lower-dimensional than those of previous methods, it’s unclear to me how much this would matter in a practical setting. Is this method significantly more efficient with respect to necessary hardware or time to perform computations? How do other methods do in mapping captions to fMRI data? I acknowledge that the captioning results are positive, but as of now this seems to be the only novel result.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My initial concerns have been adequately addressed. I still believe that existing methods could be trivially extended to perform fMRI-to-text retrieval, but I do believe that their text-to-fMRI retrieval result is novel.



Review #2

  • Please describe the contribution of the paper
    • They proposed a computationally efficient BLIP-2-based algorithm for fMRI captioning, demonstrating improved performance in generating fMRI captions.
    • They provided examples of how the method can be used to explore the roles of different brain regions in neural decoding.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s major strength lies in its introduction of a computationally efficient algorithm that generates compact visual embeddings, enhancing fMRI captioning performance and broadening application potential compared to existing high-dimensional methods. Unlike existing methods that predict high-dimensional embeddings (257 × 768 and 257 × 1024) from an already high-dimensional fMRI voxel vector (length 15,724), the proposed approach generates more compact visual embeddings (32 × 768) while achieving superior fMRI captioning performance in most cases.
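    A rough back-of-the-envelope comparison of these output sizes, assuming a hypothetical single linear mapper from the 15,724-voxel fMRI vector (the paper's actual mapping network is not reproduced here):

        voxels = 15_724

        def linear_params(num_tokens, dim):
            # Weights plus bias for one fully connected layer: voxels -> num_tokens * dim.
            out_features = num_tokens * dim
            return voxels * out_features + out_features

        print(f"32 x 768   -> {linear_params(32, 768):,} parameters")    # ~0.39 billion
        print(f"257 x 768  -> {linear_params(257, 768):,} parameters")   # ~3.1 billion
        print(f"257 x 1024 -> {linear_params(257, 1024):,} parameters")  # ~4.1 billion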

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    MindEye achieves nearly perfect accuracy in retrieving the correct image from its corresponding fMRI signal and vice versa. In contrast, the proposed method achieves less than 50% accuracy in retrieving the correct image from its corresponding text signal and vice versa. It is unclear what additional benefits the method offers for multimodal retrieval, especially if the primary goal is to understand brain function and decode visual representations, where image-to-brain (I → B) and brain-to-image (B → I) retrieval should be prioritized over brain-to-text (B → T) or text-to-brain (T → B). The benefits may lie more in fMRI captioning, where text retrieval and interactions between text and brain retrieval are relevant, areas where other methods fall short. I would like to see more explanation and discussion of this.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a computationally efficient algorithm that enhances fMRI captioning and broadens application potential. However, revisions are needed to clarify the benefits of multimodal retrieval and to address accuracy concerns compared to existing methods like MindEye.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper describes a novel method for the very interesting application of multimodal fMRI captioning, based on the BLIP-2 Q-former and contrastive learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The motivation and prior work are very well described, highlighting the key state of the art and the aspects in which this work contributes.
    • The methodology, although heavily based on existing techniques and inspired by previous works, is novel and introduces key improvements towards the final aim of multimodal fMRI captioning. In particular, the use of the more compact BLIP-2 model and contrastive loss are worthy contributions to the field.
    • Results are excellent, proving the potential of the proposed technique for multimodal fMRI captioning. Discussion of the results is detailed and convincing.
    • The region-specific analysis at the end is really intriguing, and it further proves the importance of this work.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Experiments are run on data from a single subject. This is a minor limitation, as this is common practice in similar studies, but further improvements clearly would need to consider multi-subject datasets.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a solid work on a very interesting application. The methodological novelty and the extensive results are well worth acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal correctly addresses the concerns raised by the reviewers. I stick to my initial recommendation of acceptance.




Author Feedback

We would like to sincerely thank all reviewers for their insightful feedback.

R1 (1): “not clear if their work is much of a breakthrough. seems to be directly modeled after Ozcelik et al”. While previous works rely solely on CLIP image embeddings and do not employ contrastive learning for fMRI captioning, our approach introduces a single model capable of bi-directional retrieval across three modalities (B ↔ T, B ↔ I, and T ↔ I), which, to the best of our knowledge, has not been demonstrated previously. Also, our two-stage training pipeline is distinct from that of Ozcelik et al.: their second stage trains additional regressors and does not incorporate contrastive learning, whereas we continue fine-tuning the model in Stage 2 with a contrastive loss. Although our ROI-level analysis follows their methodology, we repurpose it for the novel task of generating fMRI captions and verify that the resulting interpretations are consistent with their findings.

R1 (2): “why they don’t report performance of text retrieval for methods such as MindEye or UniBrain”. MindEye-2 projects fMRI activity into the CLIP ViT-L/14 image embedding space and subsequently generates captions via a GIT image-to-text decoder; because the mapping is performed in image space, direct text retrieval is not defined for that model. UniBrain, although architecturally capable of retrieval through its BERT and CLIP text encoders, does not report quantitative results for brain-text retrieval tasks. In the absence of retrieval scores in the published papers, we could not compare these works with our method.

R1 (3): “While … embeddings are lower dimensional than previous methods, it’s unclear to me how much this would matter in a practical setting”. We note that lower-dimensional embeddings yield substantial computational advantages without sacrificing performance, which is especially critical in resource-constrained environments. For example, our experiments were conducted on a single RTX A6000 (48 GB) GPU, whereas MindEye-2 reported training on an 8 × A100 (80 GB) GPU cluster. In addition, lower-dimensional processing can lead to simpler and better interpretations, which opens the door for future research on better decoding of brain fMRI.

R2 (1): “Experiments are run on data from a single subject.” To ensure a fair comparison, we report results only for subject 1 because prior fMRI-captioning works (MindEye-2, UniBrain) conduct their experiments solely for this subject. While revisiting Table 1, we found two transcription errors and will correct them in the camera-ready version:

  • MindEye-1, B → I top-1 accuracy should be 94.7% [19].
  • Brain Diffuser retrieval scores are the average of four subjects, not subject 1 alone. Thus, the caption of Table 1 will be revised: “Top-1 retrieval accuracies. All values are computed for subject 1, except Brain Diffuser, whose numbers are the mean over subjects 1, 2, 5, 7 [20]…”

Please note that these adjustments do not affect any qualitative conclusions or relative rankings described in the paper.

R3 (1): “Revisions are needed to clarify the benefits of multimodal retrieval”. To address the reviewer’s comments, we have revised the introduction section: “… a process known as fMRI captioning. In this context, multimodal retrieval provides a flexible way to decode both what is seen and the underlying semantic content from brain activity.” and the results section: “… 45.0% for brain signals from text (among 300 candidates). This multimodal retrieval unlocks natural-language querying of fMRI data, resulting in a more comprehensive interpretation of brain activity.”
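For reference, a minimal sketch of how top-1 retrieval accuracy among a pool of candidates (such as the 300-candidate setting quoted above) can be computed; the embeddings here are random placeholders, and this is not the authors' evaluation code.

    import torch
    import torch.nn.functional as F

    def top1_retrieval_accuracy(query_emb, candidate_emb):
        # query_emb[i] should match candidate_emb[i]; both tensors are (N, dim).
        q = F.normalize(query_emb, dim=-1)
        c = F.normalize(candidate_emb, dim=-1)
        sims = q @ c.t()                           # (N, N) cosine similarity matrix
        predicted = sims.argmax(dim=-1)            # most similar candidate per query
        correct = predicted == torch.arange(len(q))
        return correct.float().mean().item()

    # Toy usage: 300 random pairs, so chance level is about 1/300.
    accuracy = top1_retrieval_accuracy(torch.randn(300, 768), torch.randn(300, 768))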




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


