Abstract

Functional connectivity (FC) analysis is the primary approach for studying functional magnetic resonance imaging (fMRI) data, focusing on the spatial patterns of brain activity. However, this method often neglects the temporal dynamics inherent in the timeseries nature of fMRI data, such as latency structure and intrinsic neural timescales (INT). These temporal features provide complementary insights into brain signals, capturing signal propagation and neural persistence information that FC alone cannot reveal. To address this limitation, we introduce Prompt enhanced multimodal integrative analysis (PMIL), a multimodal framework built on a transformer architecture that integrates latency structure and INT with conventional FC, enabling a more comprehensive analysis of fMRI data. Additionally, PMIL leverages text prompts within a state-of-the-art vision-language model to enhance the integration of INT with latency structure and FC. Our framework achieves state-of-the-art performance on an autism dataset, effectively distinguishing autistic patients from neurotypical individuals. Furthermore, PMIL identified disease-affected brain regions that align with findings from existing research, thereby enhancing its interpretability. The code for PMIL is publicly available at https://github.com/gudtls17/PMIL.
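
For readers unfamiliar with the two temporal features named in the abstract, the following is a minimal sketch of how a pairwise latency and an INT value are commonly estimated from ROI timeseries. It is not taken from the PMIL repository. The function names, the integer-lag search, and the INT estimator (sum of autocorrelation over the initial positive lags, scaled by the repetition time) are assumptions chosen for illustration.

```python
from statistics import fmean


def _centered(x):
    """Subtract the mean so that correlations are computed on fluctuations."""
    m = fmean(x)
    return [v - m for v in x]


def lagged_corr(x, y, lag):
    """Correlation of x[t] with y[t + lag], normalised by the full-series norms."""
    xc, yc = _centered(x), _centered(y)
    n = len(xc)
    if lag >= 0:
        pairs = zip(xc[: n - lag], yc[lag:])
    else:
        pairs = zip(xc[-lag:], yc[: n + lag])
    num = sum(a * b for a, b in pairs)
    den = (sum(a * a for a in xc) * sum(b * b for b in yc)) ** 0.5
    return num / den if den else 0.0


def peak_lag(x, y, max_lag):
    """Lag in samples with the highest correlation; positive means y follows x."""
    return max(range(-max_lag, max_lag + 1), key=lambda k: lagged_corr(x, y, k))


def intrinsic_timescale(x, tr_sec, max_lag):
    """Sum the autocorrelation from lag 1 until it first turns non-positive, times TR."""
    total = 0.0
    for k in range(1, max_lag + 1):
        r = lagged_corr(x, x, k)
        if r <= 0:
            break
        total += r
    return total * tr_sec
```

For a toy signal `x = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]` and a copy delayed by two samples, `peak_lag(x, y, 3)` returns 2. Latency studies typically refine the integer lag by interpolating around the peak, which this sketch omits.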

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2349_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/gudtls17/PMIL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ChoHyo_PMIL_MICCAI2025,
        author = { Choi, Hyoungshin and Kim, Jonghun and Chung, Jiwon and Park, Bo-Yong and Park, Hyunjin},
        title = { { PMIL: Prompt enhanced Multimodal Integrative analysis of fMRI combining functional connectivity and temporal Latency } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15971},
        month = {September},
        pages = {541 -- 551}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce Prompt enhanced multimodal integrative analysis (PMIL), which is built on a transformer architecture that integrates latency structure and INT with conventional FC, enabling a more comprehensive analysis of fMRI data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strength is the introduction of text prompts within a pretrained vision-language model to enhance the integration of INT with latency structure and FC. The ablation experiments are efficient.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The main concern is that the authors treat the connectivity matrix as an image, despite the lack of meaningful spatial relationships in FC matrices. Since many studies adopt graph-based methods for this reason, the authors should clarify how patches are defined in the vision transformer given the absence of inherent spatial structure.

    The usage details of the OCRead layer are unclear.

    It appears that the OCRead layer further decodes the features, and the MLP is used for final prediction. In this case, it is questionable how the learned weights in the merged embedding module can offer interpretability, since they do not directly contribute to the final decoding step used for prediction.

    Statistical tests need to be performed to determine whether the performance significantly outperforms other methods or variants.

    The implementation of the decoder modules is not described.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main concern is that the authors treat the connectivity matrix as an image. However, unlike medical images, the FC-like matrix lacks a meaningful spatial relationship (i.e., continuity between neighboring pixels). As a result, many studies on functional connectivity adopt graph-based approaches, where edges represent the connections between different regions of interest (ROIs). Therefore, the authors need to clearly explain how they define patches in the vision transformer method when applying it to FC data, given the absence of inherent spatial structure.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have provided details of the OCRead layer and clarified the use of the transformer model, which addresses my major concerns. It would be beneficial to include these details in the final version before publication. I recommend a final decision of Accept.



Review #2

  • Please describe the contribution of the paper

    The paper considers cross-correlation and the intrinsic time scale of brain functional dynamics for ASD diagnosis. The intrinsic time scale of each brain region is mapped to text so that vision-language modelling techniques can be applied.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Same as main contribution

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The improvement is limited in many respects. In Tables 2 and 3, given the large variation, the contribution of the key proposed algorithm may not be significant. For example, “ZL+CL” with vision only reaches an accuracy of 74.0±2.8 and “W/O cycle loss” reaches an AUC of 83.2±1.1, which are very close to the accuracy of 75.9±1.9 and AUC of 83.4±0.5 of the proposed method. Considering the additional complexity (parameters) the method introduces, I am not convinced of its essential advantage and contribution.
    2. As I understand it, “W/O text (Numeric category)” means representing the intrinsic time scale with three numeric categories. What about using the original value of the intrinsic time scale?
    3. What is the architecture of the decoder layer used for reconstruction?
    4. The layout and color coding of Fig 1 are hard to follow.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Limited improvement from the method and key modules

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The evidence for the improvement comes from 100 permutation tests, which is somewhat questionable and is given without a stated rationale. The results could vary with different permutation realisations. The authors should use a stricter test, such as a paired t-test or a signed-rank test, to obtain a single deterministic significance value that supports the improvement with higher confidence.
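
To illustrate the kind of deterministic paired test suggested here, the sketch below implements an exact two-sided Wilcoxon signed-rank test by enumerating every sign assignment, so no random permutation draw is involved. The function names and the assumption that scores are paired per cross-validation fold are illustrative and do not come from the paper or the rebuttal.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def signed_rank_exact(scores_a, scores_b):
    """Exact two-sided Wilcoxon signed-rank test on paired scores.

    Returns (W_plus, p_value). Pairs with zero difference are dropped.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)
    hits = 0
    total = 0
    # Enumerate all 2^n sign assignments: the null distribution is exact
    # and identical on every run, unlike a sampled permutation test.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w - expected) >= observed_dev - 1e-12:
            hits += 1
    return w_plus, hits / total
```

With five untied pairs the smallest attainable two-sided p-value is 2/32 = 0.0625, so a test of this kind needs more paired observations (for example, repeated cross-validation runs) before it can reach the 0.05 level.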



Review #3

  • Please describe the contribution of the paper

    The study proposes a novel fMRI analysis method named PMIL (Prompt enhanced Multimodal Integrative Analysis), which integrates functional connectivity (FC), latency structure, and intrinsic neural timescale (INT) while leveraging a pretrained vision-language model (BiomedCLIP) to enhance analytical capabilities. The method demonstrates strong performance in autism spectrum disorder (ASD) classification tasks and identifies ASD-related brain regions, showcasing high clinical relevance and interpretability.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The study introduces a novel multimodal framework, PMIL, which for the first time combines latency structure and INT with traditional FC and enhances analysis through text prompts.
    2. The study provides a comprehensive description of PMIL’s architecture and implementation, including the computation of latency structure and INT, text prompt design, encoder-decoder implementation, and loss function formulation.
    3. The study is validated on the publicly available ABIDE dataset, with comparisons against baseline models. Results demonstrate PMIL’s superiority in accuracy, AUROC, and other metrics, proving its effectiveness.
    4. By visualizing important brain regions, PMIL identifies ASD-related regions and links them to cognitive functions, significantly enhancing model interpretability.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The rationale for using a vision-language model (e.g., BiomedCLIP) and the role of text prompts remain unclear. Specifically, how the “text descriptions” are generated (e.g., content, format, and thresholds like “short/intermediate/long” INT categories) is insufficiently explained.

    2. Overlooked Related Work: The authors highlight two key challenges: (1) limited use of timeseries data and (2) sparse use of textual descriptions in fMRI analysis. However, recent studies have addressed similar challenges, such as: [1] Spatio-temporal hybrid attentive graph network for diagnosis of mental disorders on fMRI time-series data. [2] FM-APP: Foundation Model for Any Phenotype Prediction via fMRI to sMRI Knowledge Transfer. These works should be cited and discussed to better contextualize PMIL’s contributions.

    3. The authors’ claim that “FC analysis often neglects the temporal dynamics of brain signals” is questionable. Numerous studies have explicitly incorporated temporal dynamics.

    4. Incomplete Data Setup: The version and implementation details of BiomedCLIP are not specified.

    Minor Issues:

    1. Missing Citations: The statement, “Temporal dynamics can be investigated through latency analysis, which measures delays or advances in the timeseries between brain regions,” lacks supporting references.
    2. Formula Notation Error: In Equation (3), the notation for “X^’text” is incorrectly formatted.
    3. Grammatical error: “Fig. 2(c) demonstrated the cognitive…”.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiment design is persuasive enough to prove the effectiveness of this method.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed my comments.




Author Feedback

We thank the reviewers for their valuable comments. Below we address their concerns.

Q1(R1-4&R4-1): Limited improvement over baselines. A: 1) When we compare our model to the second-best one using 100 permutation tests, our model shows a significant improvement in AUROC (p=0.039) in Table 1. In the ablation study (Table 2), our model significantly outperforms the model using only vision (ZL+CL) in AUROC (p=0.049). Our method is better than the other merged-embedding strategies (p<0.01) in Table 3. 2) Adding the text representation (7th row in Table 2) improves accuracy by 4.4% and AUROC by 1.4 over using ZL (4th row in Table 2), showing the strength of the text representation. Note that the text encoder is frozen and thus does not incur additional complexity. We will amend Tables 1-3 to mark statistical significance.

Q2(R1-1): Patch definition in transformer. A: We did not use the vision transformer but used the standard transformer mentioned in the Method. We used the text encoder of BiomedCLIP to generate text representation. The text representation is appended to the X_ZL (FC) and X_CL (time delay) matrices and they are broken into 200-sized tokens. These tokens are processed through standard transformer decoder layers. We used the transformer model because it has performed well in conventional brain analysis. While graph-based methods can capture brain regional interactions, attention mechanisms in Transformers can serve a similar role by token interactions. We will clarify this.

Q3(R1-2,3): Details of OCRead layer and interpretability. A: The OCRead layer consists of a standard transformer encoder layer and a token pooling layer. The encoder layer processes the merged embedding once more and aggregates all token embeddings with CLS tokens for visualization. The aggregated embedding represents importance because it captures which ROIs interact with one another. The token pooling layer leverages brain network features by soft-clustering the token embeddings and projecting them orthogonally to produce meaningful graph-level embeddings, aiding autism classification. We will clarify this.
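
As a rough illustration of the token pooling step described in A3, the sketch below soft-assigns token embeddings to a set of orthonormal cluster centers and pools them into one embedding per cluster. This is one possible reading of "soft-clustering and projecting orthogonally", not the authors' implementation. The Gram-Schmidt initialisation, the softmax over inner products, and all function names are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors):
    """Gram-Schmidt: turn linearly independent vectors into an orthonormal set."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


def soft_assign(token, centers):
    """Softmax over token-center inner products: one weight per cluster."""
    logits = [dot(token, c) for c in centers]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cluster_readout(tokens, centers):
    """Pool N token embeddings into K cluster embeddings (weighted sums)."""
    k, d = len(centers), len(tokens[0])
    pooled = [[0.0] * d for _ in range(k)]
    for tok in tokens:
        for ki, w in enumerate(soft_assign(tok, centers)):
            for di in range(d):
                pooled[ki][di] += w * tok[di]
    return pooled
```

Because each token's assignment weights sum to one, the K pooled embeddings together preserve the sum of the token embeddings; the flattened K-by-D result would then feed the final MLP mentioned in Review #1.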

Q4(R1-5&R4-3): Decoder layer details. A: Just like the encoder, a standard transformer decoder layer was used for the decoder layer. Due to the nature of the transformer model, the input shape of the FC and time-delay matrices is maintained through reconstruction. We will clarify this.

Q5(R3-1): Rationale of text description. A: We created a template description containing the length of the timescale and location of ROI like “ timescales ”. Following [3,13,25], we assign timescales 0 to 3 sec as short, 3 to 5 sec as intermediate, and 5+ sec as long. We will clarify this.
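
The thresholds in A5 can be sketched as a small mapping from an INT value to a category label. The boundary handling (3 sec counted as intermediate, 5 sec as long) and the prompt wording in `build_prompt` are assumptions, since the rebuttal does not state how boundary values are treated and the template text is not reproduced in full above.

```python
def int_category(timescale_sec):
    """Map an intrinsic neural timescale in seconds to a coarse label.

    Assumed intervals: [0, 3) short, [3, 5) intermediate, [5, inf) long.
    """
    if timescale_sec < 0:
        raise ValueError("timescale must be non-negative")
    if timescale_sec < 3:
        return "short"
    if timescale_sec < 5:
        return "intermediate"
    return "long"


def build_prompt(roi_name, timescale_sec):
    """Build a text prompt for one ROI.

    Hypothetical wording: the authors' exact template is not given here.
    """
    return f"{int_category(timescale_sec)} timescales in {roi_name}"
```

A string of this kind would then be passed to the frozen BiomedCLIP text encoder to obtain the text representation.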

Q6(R3-2&minor1): Missing related work. A: Thank you for the references. We will cite them and contrast them with our study in the Introduction. Study1, Spatio-temporal.. will be mentioned as a recent study in spatial-temporal modeling. Study2, FM-APP.. will be mentioned as a prominent example of medical vision-language models. We will also add [22] as pioneering research on latency analysis.

Q7(R3-3): Justify claim related to temporal dynamics. A: We agree with you and will tone down the text as “Older approaches underrepresented the temporal dynamics of brain signals, though recent work increasingly addresses this aspect”.

Q8(R3-4): BiomedCLIP details. A: We used PubMedBERT_256-vit_base_patch16_224, trained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles. This will be added.

Q9(R4-2): Ablation on intrinsic time scale (INT). A: As you mentioned, the ablation study about text is the result of using the original value of INT. To avoid confusion, we will replace “Numeric category” with “Real valued INT”.

Q10(R3-minor2,3&R4-4): Missing reference, formula notation, grammar error, and Fig 1 layout. A: We will correct the first three as suggested. The layout of Fig 1 will be enhanced using block diagrams instead of pictorial blocks.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I agree with Reviewer 3 that the improvements introduced by the proposed method and its key modules are limited. I recommend reject.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


