Abstract

Epilepsy affects over 50 million people worldwide, with antiseizure medications (ASMs) as the primary treatment for seizure control. However, ASM selection remains a “trial and error” process due to the lack of reliable predictors of effectiveness and tolerability. While machine learning approaches have been explored, existing models are limited to predicting outcomes only for ASMs encountered during training and have not leveraged recent biomedical foundation models for this task. This work investigates ASM outcome prediction using only patient MRI scans and reports. Specifically, we leverage biomedical vision-language foundation models and introduce a novel contextualized instruction-tuning framework that integrates expert-built knowledge trees of MRI entities to enhance their performance. Additionally, while training only on the four most commonly prescribed ASMs, our framework generalizes to predicting outcomes and effectiveness for unseen ASMs not present during training. We evaluate our instruction-tuning framework on two retrospective epilepsy patient datasets, achieving average AUCs of 71.39 and 63.03 in predicting outcomes for the four primary ASMs and three completely unseen ASMs, respectively. Our approach improves the AUC by 5.53 and 3.51 over standard report-based instruction tuning for seen and unseen ASMs, respectively. Our code, MRI knowledge tree, prompting templates, and TREE-TUNE generated instruction–answer tuning dataset are available at https://github.com/khoapham154/Knowledge-Tree-Driven-Contextualized-Instruction-Tuning-of-Foundation-Models.git.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2254_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/khoapham154/Knowledge-Tree-Driven-Contextualized-Instruction-Tuning-of-Foundation-Models.git

Link to the Dataset(s)

N/A

BibTex

@InProceedings{PhaDuy_Knowledge_MICCAI2025,
        author = { Pham, Duy Khoa and Mehta, Deval and Jiang, Yiwen and Thom, Daniel and Chang, Richard Shek-kwan and Nazem-Zadeh, Mohammad and Foster, Emma and Fazio, Timothy and Holper, Sarah and Verspoor, Karin and Liu, Jiahe and Nhu, Duong and Barnard, Sarah and O’Brien, Terence and Chen, Zhibin and French, Jacqueline and Kwan, Patrick and Ge, Zongyuan},
        title = { { Knowledge Tree Driven Contextualized Instruction Tuning of Foundation Models for Epilepsy Drug Recommendation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {380 -- 390}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors adapt a foundation model to predict the effectiveness of antiseizure medication (ASM) in epilepsy patients using MRI scans and clinical reports. They propose TREE-TUNE, a novel approach to instruction tuning that incorporates an expert-curated knowledge tree. This method provides richer context and enables the model to learn more meaningful relationships between MRI findings and ASM effectiveness. Furthermore, to enhance the model’s generalizability to unseen ASMs, the medications were encoded using Simplified Molecular Input Line Entry System (SMILES) representations. This approach allows the model to focus on molecular features rather than specific ASMs.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    By using SMILES representations for ASMs, the model learns from molecular structures rather than just categories. This allows it to predict the effectiveness of previously unseen ASMs, a major improvement over current methods. Additionally, TREE-TUNE presents a novel approach to instruction tuning by integrating an expert-curated knowledge tree. This enhances the model’s understanding by providing more detailed context, which helps it establish more significant connections between MRI results and the effectiveness of antiseizure medication. These two novel aspects help to train a model that shows better performance compared to other baseline methods.

    In more detail:

    1. Relevant Problem & Motivation: The paper addresses a high-impact clinical problem: guiding the selection of antiseizure medications (ASMs) using noninvasive, routinely available diagnostic inputs. The authors correctly identify the “trial-and-error” nature of ASM selection and position their work as a step toward more personalized and data-driven epilepsy management. The use of MRI and associated textual reports aligns with existing clinical guidelines, making the proposed pipeline potentially relevant for deployment.
    2. Incorporation of Domain Knowledge: The introduction of a structured “MRI knowledge tree” to condition GPT-4-generated instruction-answer pairs is a key design element in the framework. It enables the authors to inject clinical priors and anatomical semantics into the fine-tuning process.
    3. Zero-Shot Generalization Approach: The model’s capacity to handle unseen ASMs through SMILES representations is well-motivated and consistent with trends in molecular machine learning. However, this technique—learning outcome predictors from molecular graph encodings—has been extensively studied in the context of drug response prediction and zero-shot pharmacological inference. Thus, while useful in this application, the approach is not conceptually novel (see the featurization sketch after this list).
    4. Experimental Effort: The authors evaluate on two independent retrospective datasets, indicating a solid effort to validate the method. They report performance on both seen ASMs (in-distribution) and unseen ASMs (out-of-distribution), which demonstrates the intended generalization capability. The results include metrics (AUC, precision, recall) and comparisons to ablated versions of their approach (e.g. vision-only, vision+text without knowledge tree, etc.), which lends credibility to their analysis. The consistent improvement (e.g. +5.5% AUC over a text-only instruction-tuned model) suggests that each component (image, report, knowledge context) added some value.
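    To make item 3 concrete, the snippet below shows one common way to place drugs in a shared molecular feature space via their SMILES strings, so that a predictor can at least be queried on drugs never seen in training. This is a generic RDKit Morgan-fingerprint sketch, not the paper’s encoder (the authors reportedly use MoLeR); the skeleton SMILES are illustrative and omit stereochemistry.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def smiles_to_features(smiles: str, n_bits: int = 128) -> np.ndarray:
    """Encode a drug as a fixed-length Morgan fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(list(fp), dtype=np.float32)

# A training drug and two "unseen" drugs land in the same 128-d feature space
# (skeleton SMILES, stereochemistry omitted):
levetiracetam = smiles_to_features("CCC(C(N)=O)N1CCCC1=O")            # seen ASM
valproate = smiles_to_features("CCCC(CCC)C(=O)O")                     # unseen ASM
phenytoin = smiles_to_features("O=C1NC(=O)C(c2ccccc2)(c2ccccc2)N1")   # unseen ASM
```

    Because every drug maps into the same feature space, zero-shot querying is mechanically possible; whether the model meaningfully exploits molecular substructure is exactly the question raised in weakness 3 below.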
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of Substantial Novelty: While the integration of imaging, textual reports, and drug structure is well-executed, the overall framework primarily reassembles known components rather than introducing a fundamentally new modeling paradigm. Instruction tuning using synthetic Q&A pairs generated by GPT-4 has been established in prior works such as LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [Li et al., 2023, https://arxiv.org/abs/2306.00890]. The inclusion of structured domain knowledge has similarly been explored in DR.KNOWS: Leveraging Medical Knowledge Graphs for LLM-Guided Diagnosis [Gao et al., 2025, https://arxiv.org/pdf/2308.14321]. Additionally, using SMILES encodings of molecules to enable generalization across unseen drugs is standard practice in computational pharmacology, as described in the foundational paper SMILES: A Chemical Language and Information System [Weininger, 1988, https://doi.org/10.1021/ci00057a005] and extended in Representation of molecules for drug response prediction [Xin An et al., 2021, https://academic.oup.com/bib/article/23/1/bbab393/6375515]. Despite these precedents, the manuscript frames TREE-TUNE as a novel contribution, without clearly delineating what is truly new in the architecture, training process, or theoretical insights. The result is a system-level application of known techniques to a specific task—epilepsy treatment prediction—which, while practically motivated, does not by itself meet the bar for methodological novelty in a top-tier venue.
    2. Limited Justification for Using a Knowledge Tree Instead of a Graph: The paper introduces a hierarchical “knowledge tree” to organize anatomical and imaging features. However, this tree structure enforces single-parent relationships and lacks the flexibility of a full-fledged knowledge graph, which can express polyhierarchical and relational dependencies—common in radiological and clinical ontologies. The authors do not explain why a graph-based structure (such as those used in DR.KNOWS) was not adopted, nor do they explore how this limitation may impact the model’s semantic capacity. As a result, the contribution of the knowledge structure remains under-justified and possibly constrained by its chosen format.
    3. Overstated Generalization Claims: The claim that TREE-TUNE generalizes to “completely unseen ASMs” via SMILES embeddings is only partially supported. The evaluation includes three “unseen” ASMs—valproate, phenytoin, topiramate—which are all standard anti-epileptic drugs and likely share therapeutic or structural features with the training set. The reported AUC of ~0.63, while above chance, is modest and not enough to substantiate the strength of the generalization claim. Importantly, the authors do not analyze whether the model meaningfully leverages molecular substructure, pharmacophores, or drug class similarity.
    4. Insufficient Clarity in Methodological Description: While the paper includes several high-level visualizations (Figures 1–3), these serve primarily as illustrative concept diagrams and do not substitute for a technically detailed, step-by-step description of the TREE-TUNE pipeline. The most critical components of the system—namely how instruction-answer data is generated, how multimodal inputs are fused, how instruction supervision is structured, and how the dataset is composed—are under-described or omitted entirely.

    Prompting and Dataset Construction: Figure 1 (panels a–c) demonstrates examples of GPT-4o-generated instructions based on MRI scans, reports, and optionally a curated “knowledge tree.” While visually clear, the actual prompting strategy is not provided: the templates, parameters, and even the typical format of the instruction (e.g., yes/no, open-ended, multi-choice) are missing. Section 2.1 (p. 4–5) states only that GPT-4o is “prompted as an expert biomedical assistant”, but does not explain whether the prompting is deterministic or varied, whether generations were filtered or validated, or how many examples were generated per patient. It is also unclear how tree nodes are mapped to prompts or whether instructions vary by tree depth or semantic category. No statistical overview is provided for the instruction corpus: e.g., class balance (seizure-free vs. not), average instruction length, or lexical richness beyond what is shown in a single figure. As a result, while the figure suggests lexical diversity, the reproducibility of the prompt-generation pipeline is not established, particularly considering the use of the closed-source GPT-4o API (a hypothetical sketch of such a pipeline is given after this list).

    Architectural Specification and Fusion Details: Figure 2 presents a high-level architecture showing three streams—MRI, SMILES, and instruction—projected into a shared language space before being passed to a language model. However, the critical mechanism of fusion across modalities is missing. The authors do not state whether embeddings are concatenated, summed, passed via cross-modal attention, or otherwise aligned. No details are given on the dimensionality of H_I, H_ASM, and H_q; whether positional embeddings are shared; or how temporal consistency is handled across multi-turn instructions. The authors refer to LLaVA and MoLeR as component models, but do not explain how these are adapted or reconfigured for their multimodal epilepsy task. There is no block diagram showing actual data flow, token-level interactions, or fusion operations that would enable reproducibility or adaptation.

    Token-Level Learning and Instruction Supervision: The model is said to be trained using both an autoregressive loss (for instruction generation) and a binary cross-entropy loss (for outcome prediction), as described on pp. 5–6. However, the relationship between the generated text and the binary label is never clarified. It is unclear if the generated answers are constrained to specific formats, or whether the language model output is fully supervised, partially guided, or purely autoregressive. The paper does not explain how the model handles multiple instruction-answer pairs per input, whether it uses a sequential aggregation strategy, or if intermediate reasoning steps are weighted during training. Figures 1 and 3 illustrate instruction diversity and quality, but these are anecdotal and not grounded in a measurable training schema.
    5. Use of Closed Models and Private Data: A major limitation is that all instruction-answer data is generated using GPT-4o, a proprietary model accessed via API. Neither the prompts nor the generated instruction corpus is shared, nor is there any ablation indicating how much this component contributes to performance. Furthermore, the two datasets used (ASM-ED1 and ASM-ED2) are fully private (Section 3, p. 6), meaning the entire TREE-TUNE pipeline cannot be reproduced or externally validated. The paper does not include guidance for porting the pipeline to public datasets or applying the prompting strategy to alternative domains. Without open data or open prompt code, even reimplementation is speculative.
    6. Questionable Clinical Relevance of Performance Metrics: Although TREE-TUNE achieves an AUC of up to 76.45% on one dataset for seen drugs, the average performance is closer to 71.39%. On unseen drugs, the AUC drops to 63.03%. While these numbers are statistically better than random, they fall short of clinical significance thresholds for deployment in sensitive treatment decision-making contexts.
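    To make the reproducibility gap in weakness 4 concrete, here is a hypothetical sketch of a tree-conditioned instruction-generation step, consistent with the rebuttal’s later description (temperature 0.1, parent nodes as context, child nodes as answers). The template wording, node names, and helper functions are guesses; the authors’ actual prompts are not public.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def build_prompt(parent_path: list[str], child_finding: str, report: str) -> str:
    # Parent nodes of the knowledge tree supply context; the child node is the
    # target answer (per the rebuttal's description of IA generation).
    context = " > ".join(parent_path)  # e.g. "Findings > Temporal lobe > Hippocampal sclerosis"
    return (
        "You are an expert biomedical assistant.\n"
        f"Knowledge-tree context: {context}\n"
        f"MRI report: {report}\n"
        f"Write an open-ended instruction whose correct answer is: {child_finding}"
    )

def generate_ia_pair(parent_path, child_finding, report):
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,  # near-deterministic, per the rebuttal (R1-Q4)
        messages=[{"role": "user",
                   "content": build_prompt(parent_path, child_finding, report)}],
    )
    return response.choices[0].message.content, child_finding
```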
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    While the code is available on GitHub, there are aspects that make full reproduction challenging. A major limitation lies in the fact that the instruction-answer dataset used for instruction tuning was generated entirely via GPT-4o—a proprietary, closed-source model accessible only via API. Neither the exact prompt templates nor the generated corpus is made available, and no information is provided regarding the filtering, formatting, or validation of these synthetic instruction-answer pairs. Moreover, the contribution of this component to downstream model performance is not quantified via ablation or sensitivity analysis, making it difficult to assess its necessity or portability.

    In addition, both datasets used in training and evaluation (ASM-ED1 and ASM-ED2) are internal and private, as noted in Section 3 (p. 6). These datasets are not publicly released, nor are summary statistics or standardized preprocessing pipelines provided. This restricts the community’s ability to replicate, benchmark, or adapt TREE-TUNE to alternative datasets. The authors also do not offer guidance on how to apply their method to publicly available epilepsy imaging datasets or synthetic MRI/report pairs.

    Although Figure 2 outlines a high-level architectural flow, the actual implementation details—such as fusion mechanisms, positional encoding strategies, and how embeddings are passed through the Vicuna decoder—are missing. No explicit hyperparameter configuration or training script is referenced. While the authors mention using LLaVA and MoLeR as base components, they do not clarify which checkpoints were used, how those were modified, or how modality-specific embeddings are projected into the language space.

    In summary, although partial reproducibility may be achievable using the shared codebase, the lack of open data, instruction templates, and detailed architectural specification significantly limits independent verification of the results. For TREE-TUNE to be broadly usable or extensible, a public release of either the synthetic instruction data or a detailed prompting framework, as well as benchmarking on publicly available datasets, would be essential.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation for a “weak reject” is based on the combination of four key concerns:

    1. Lack of Substantial Novelty: While the integration of imaging, text, and drug representations is technically sound, the framework largely reassembles existing components—such as GPT-generated instruction tuning, use of structured ontologies, and SMILES-based generalization—that have been previously demonstrated in works like LLaVA-Med, DR.KNOWS, and numerous drug-response models. The manuscript does not clearly articulate what is methodologically new or non-trivial in TREE-TUNE’s architecture or training strategy, and positions the work as novel without sufficient justification.
    2. Incomplete Methodological Transparency: Several core elements of the pipeline—particularly the GPT-4 prompting process, the fusion of multimodal embeddings, and the supervision scheme—are under-described or entirely missing. Figures 1–3 offer helpful intuition, but do not substitute for formal definitions of the prompting logic, data schema, or architectural implementation. This lack of detail impairs reproducibility and prevents a rigorous evaluation of the framework’s validity.
    3. Limited Reproducibility and Generalizability: The full pipeline relies on a closed-source GPT-4 API to generate its instruction dataset, and both evaluation datasets (ASM-ED1 and ASM-ED2) are private. Without public code for the prompting module, open access to data, or guidance on applying the method to external benchmarks, the study cannot be independently replicated or adapted. Moreover, generalization claims are made on structurally related “unseen” drugs and not substantiated by mechanistic analysis or broader validation.
    4. Modest and Statistically Unvalidated Gains: The reported AUC improvements (~3–5%) over instruction-tuned baselines are potentially meaningful but lack statistical support (e.g., confidence intervals or DeLong tests). The peak AUC of 76.45% on seen ASMs is promising, but performance on unseen ASMs drops to 63.03%, which is below commonly accepted clinical thresholds for deployable models. Given these margins, the conclusions on robustness and generalizability appear overstated.

    In summary, the paper introduces a well-motivated application and contains promising components, but falls short in methodological clarity, empirical rigor, and reproducibility to warrant acceptance at this stage. With significant revisions—including architectural transparency, open resource availability, and deeper validation—the work could mature into a more impactful contribution.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Several comments remain:

    1. The authors claim the dataset is available, stating: “ASM-ED2 dataset could be curated from the publicly available Human Epilepsy Project [NYU Data Catalog, PMID:37846772], while ASM-ED1 is available upon request only for research.” In paper PMID:37846772 the same dataset is described as 443 patients, of whom 161 (36.3%) remained on monotherapy: levetiracetam (254), lamotrigine (77), oxcarbazepine (38), and carbamazepine (24), with 282 not seizure-free. But their article reports completely different figures for the number of patients and the distribution of medications: “The most frequently used were levetiracetam, lamotrigine, oxcarbazepine, and carbamazepine, having 250, 79, 37, and 24 cases in ASM-ED1, and 116, 41, 39, and 25 cases in ASM-ED2.” ASM-ED2 had 247 patients (62 seizure-free, 185 not). That is, it is not clear how exactly the dataset for training and testing was formed, so the results cannot be fully reproduced.
    2. The authors write that “We will add CIs and p-values in the revision,” but do not provide them in the rebuttal, so these cannot be assessed. The additional results look good (no worse than those presented) but were not added to the paper due to page limits: “We also performed inter-cohort validation (trained on ED1, tested on ED2) yielding 0.647 AUC - small, and testing on unseen non-epileptic drugs (e.g. Metformin, Omeprazole) on ED1 yielded 0.862 ACC, but these were omitted for brevity.” Most likely these were not obtained before publication, and in any case reviewers should not take additional results into account.



Review #2

  • Please describe the contribution of the paper

    The paper describes the use of vision-language foundation models (VLMs) together with a domain ontology (knowledge tree) for prompt optimization and VLM fine-tuning to predict the outcome of anti-seizure medications (ASMs) for epilepsy.

    The main contributions are: 1) the first use of VLMs for ASM outcome prediction, and 2) a novel knowledge tree with an associated tree-based prompt-optimization method (TREE-TUNE).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strong points are:

    1. The first-time use of VLMs for this task indicates an improvement and hence encourages further research.
    2. The knowledge tree itself is novel (though it could possibly be replaced with a similar tree, as the embedded information is public).
    3. A novel (but not open) rich dataset.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The major weak points are:

    1. The building blocks and the training strategies used are known.
    2. The performance assessment is based on TREE-TUNE (knowledge-tree-based prompt optimization) only, whereas it would be nice to see how non-VLM approaches would perform, because this is the first time VLMs are used here.
    3. The difference between the ED1 and ED2 datasets is not clear. Why is the performance so different? This needs to be discussed.
    4. Considering the average improvement is not fair; TREE-TUNE should only be compared to the scan+report option, for which the improvement is significant in ED2 but not in ED1.
    5. The discussion is weak and does not address the above points.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper takes a novel and contemporary approach to an important problem, but the results are not that impressive and the assessment is not complete. We should see how VLMs perform against more conventional approaches to get an insight. It would also be really helpful to have the code available, or at least an online site where implementation details are shared.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Though the overall approach may not be deeply novel, nor the performance striking, I find the expansion of applications an important and viable contribution. The authors state “… generates richer, contextualized IA datasets by integrating multi-modal, structured domain knowledge…” as their main contribution, which is fair and honest. Hence, I am on the positive side, yet I draw back my recommendation to mark it as a “highlighted poster”.



Review #3

  • Please describe the contribution of the paper

    The authors present a framework consisting of foundation models leveraging the inclusion of knowledge trees of MRI reports and scans to improve the performance for anti-seizure medication prediction.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The presented framework provides improved classification performance in terms of ASM recommendations based on MRI scans and reports compared to other approaches
    • Methodology is well presented.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • I don’t quite understand why the model is evaluated on seen ASMs as well as unseen ASMs. Aren’t only the unseen ASMs interesting, since the seen ones would already lead to good results?
    • I might be mistaken, but the comparison to baseline results seems to be conducted with results which were generated based on different datasets. Is this even comparable in this case?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Although a link to the code is shared, the datasets are private, which prohibits reproduction of the results.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall interesting and nicely written paper. Some confusion arose due to the usage of seen ASMs. Furthermore, are the comparison to the baseline results conducted on the same dataset?

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Questions that arose previously were clarified.




Author Feedback

Thank you for recognizing the importance of our work and for your constructive feedback. If we have addressed your concerns, we kindly ask you to reconsider your score.

Open resource availability: We have released our MRI tree and code. To ensure full reproducibility, we will release our TREE-TUNE generated instruction-answer (IA) tuning dataset and prompting templates. The ASM-ED2 dataset could be curated from the publicly available Human Epilepsy Project [NYU Data Catalog, PMID:37846772], while ASM-ED1 is available upon request only for research. Responses to the remaining questions are below.

R1-Q1 Core Novelty: TREE-TUNE uniquely generates richer, contextualized IA datasets by integrating multi-modal, structured domain knowledge—unlike prior work (e.g., LLaVA-Med, ShareGPT-4V) using only text or text+image. We benchmark this via lexical (Tab 1) and qualitative (Fig 1) analyses, showing downstream benefits in ASM prediction (Tabs 2–3). We also introduce a rich knowledge tree pertinent to epilepsy MRIs and provide a novel framing of ASM response prediction (classification) as a VQA task under an unseen ASM setting, using a tailored framework to fairly evaluate our IA dataset.

R1-Q2 Graph vs Tree: Our hierarchical knowledge tree emanates from clinically grounded MRI findings in epilepsy as described in PMID:23925763. Single-parent relationships enable clear categorization and are clinically appropriate for our task; a graph adds unnecessary complexity without semantic benefit. While DR.KNOWS uses graphs, it relies on general conceptual knowledge in UMLS with unclear relevance to epilepsy MRIs, which may add noise for our task. It also doesn’t generate IA datasets and is thus not compared.

R1-Q3 Generalization: We also performed inter-cohort validation (trained on ED1, tested on ED2) yielding 0.647 AUC, and testing on unseen non-epileptic drugs (e.g. Metformin, Omeprazole) on ED1 yielded 0.862 ACC, but these were omitted for brevity. We will analyze drug class similarity as our future work.

R1-Q4 We will clarify the following in the revision. IA generation: GPT-4o was deterministically prompted (temp = 0.1) with sequential inputs (tree, MRI, report) to generate open-ended IA pairs, using parent nodes as context and child nodes as answers. ED1 yielded 4,896 IA pairs across 274 patients (median: 24/patient; avg. length: 49.1). Lexical diversity is in Tab 1. Architecture & Fusion: MRI, ASM (SMILES), and instructions are encoded to H_I (1024), H_ASM (128), and H_q (128), projected into a shared space, and concatenated before being fed into a causal LM. The output token’s hidden state forms H_a (128). Training combines autoregressive (with sequential aggregation) and binary cross-entropy losses.
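A minimal sketch of the fusion described above, using the dimensionalities given in this rebuttal (H_I = 1024, H_ASM = 128, H_q = 128, H_a = 128). The linear projections, the small non-causal stand-in for the Vicuna decoder, and the equal loss weighting are our assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionSketch(nn.Module):
    """Project three streams into a shared space, concatenate, run an LM, and
    read the treatment outcome from the final token's hidden state (H_a)."""

    def __init__(self, d_shared: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.proj_img = nn.Linear(1024, d_shared)  # MRI stream (H_I)
        self.proj_asm = nn.Linear(128, d_shared)   # drug stream (H_ASM, e.g. MoLeR)
        self.proj_q = nn.Linear(128, d_shared)     # instruction stream (H_q)
        # Small non-causal stand-in for the causal LM (reportedly Vicuna).
        layer = nn.TransformerEncoderLayer(d_shared, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_shared, vocab_size)  # next-token logits
        self.to_ha = nn.Linear(d_shared, 128)           # H_a
        self.outcome_head = nn.Linear(128, 1)           # seizure-free vs. not

    def forward(self, h_img, h_asm, h_q):
        # (B, n_i, 1024), (B, n_a, 128), (B, n_q, 128) -> one token sequence
        seq = torch.cat(
            [self.proj_img(h_img), self.proj_asm(h_asm), self.proj_q(h_q)], dim=1
        )
        hidden = self.lm(seq)
        token_logits = self.lm_head(hidden)   # autoregressive supervision
        h_a = self.to_ha(hidden[:, -1])       # last output token -> H_a
        outcome_logit = self.outcome_head(h_a).squeeze(-1)
        return token_logits, outcome_logit

# Combined objective (equal weighting is an assumption):
# loss = F.cross_entropy(token_logits.transpose(1, 2), target_ids) \
#      + F.binary_cross_entropy_with_logits(outcome_logit, outcome_label)
```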

R1-Q5: See open source availability,R1-Q4

R1-Q6 Clinical Relevance: Clinical deployment of AI-based ASM recommendation remains an open challenge, with prior works reporting only ~0.65 AUC. Our average AUC of 0.714 exceeds this benchmark with statistically significant gains (p<0.05). We will add CIs and p-values in the revision.
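For readers who want to reproduce such statistics, a standard percentile-bootstrap confidence interval for AUC can be computed as below; this is a generic recipe, not the authors’ analysis code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Point AUC plus a (1 - alpha) percentile-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, y_score), (lo, hi)
```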

R2-Q1 See R1-Q1

R2-Q2 Non-VLM approaches were evaluated but had much lower performance (only 0.52 AUC on ED1) and were therefore not reported.

R2-[Q3-Q5] Discussion: i) ED2 has fewer samples and greater variability due to multiple hospital cohorts, while ED1 includes data from only two hospitals, resulting in higher performance (for inter-cohort validation, see R1-Q3); ii) ED2 and the unseen-ASM scenarios (Tab 2) present more complex challenges than ED1, and TREE-TUNE’s performance improvement in the ED2 and unseen-ASM settings highlights its strength. We will include these discussions in the revision.

R3-Q1 Seen ASMs are included in training; unseen ASMs are entirely novel to the model. Prior work focused only on seen ASMs in closed-set settings. We are the first to define and assess unseen ASMs (see R1-Q3).

R3-Q2 Our core novelty is generating a high-quality contextualized IA dataset. Thus, comparisons focus on models fine-tuned with different IA datasets (see R1-Q1).




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    A well-motivated and thoughtfully engineered application of foundation models and domain knowledge for ASM prediction, offering practical clinical relevance despite modest novelty.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


