

Abstract

Identifying the associations between imaging phenotypes and disease risk factors and outcomes is essential for understanding disease mechanisms and improving diagnostic and prognostic models. However, traditional approaches rely on human-driven hypothesis testing and selection of association factors, often overlooking complex, non-linear dependencies among imaging phenotypes and other multi-modal data. To address this, we introduce MESHAgents, a framework that leverages large language models as agents to dynamically elicit, surface, and decide on confounders and phenotypes in association studies, using cardiovascular imaging as a proof of concept. Specifically, we orchestrate a multi-disciplinary team of AI agents, which spontaneously generate and converge on insights through iterative, self-organizing reasoning. The framework dynamically synthesizes statistical correlations with multi-expert consensus, providing an automated pipeline for phenome-wide association studies (PheWAS). We demonstrate the system’s capabilities through a population-based study of imaging phenotypes of the heart and aorta. The framework autonomously uncovered correlations between imaging phenotypes and a wide range of non-imaging factors, identifying additional confounder variables beyond standard demographic factors. Validation on diagnosis tasks reveals that MESHAgents-discovered phenotypes achieve performance comparable to expert-selected phenotypes, with mean AUC differences as small as -0.004 on disease classification tasks. Notably, recall improves with MESHAgents-discovered phenotypes for 6 out of 9 disease types. These results demonstrate MESHAgents’ ability to automatically discover clinically relevant imaging phenotypes with transparent reasoning trails, providing a scalable alternative to traditional expert-driven approaches in medical imaging studies.
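The abstract says the framework combines statistical correlations with multi-expert consensus for PheWAS. The statistical half can be sketched as an exhaustive phenotype-by-factor correlation scan with a Bonferroni threshold. This is a minimal illustration under stated assumptions, not the authors' pipeline: Pearson correlation, the Fisher-z p-value approximation, the function names (`pearson`, `fisher_p_value`, `phewas_scan`), and the toy data are all hypothetical choices.

```python
import math
from statistics import NormalDist, fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 via Fisher's z-transform."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite when |r| = 1
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def phewas_scan(phenotypes, factors, alpha=0.05):
    """Test every phenotype-factor pair; keep pairs below a Bonferroni-corrected threshold."""
    threshold = alpha / (len(phenotypes) * len(factors))
    hits = []
    for p_name, p_vals in phenotypes.items():
        for f_name, f_vals in factors.items():
            r = pearson(p_vals, f_vals)
            p = fisher_p_value(r, len(p_vals))
            if p < threshold:
                hits.append((p_name, f_name, round(r, 3), p))
    return sorted(hits, key=lambda h: h[3])  # most significant first

# Toy, hypothetical data (11 subjects): LVEDV tracks a standardized height score,
# while "unrelated" (the squared score) is essentially uncorrelated with LVEDV.
v = range(-5, 6)
phenotypes = {"LVEDV": [3 * x + 100 + (1 if x % 2 else -1) for x in v]}
factors = {"height_sd": [float(x) for x in v], "unrelated": [float(x * x) for x in v]}
hits = phewas_scan(phenotypes, factors)
```

Here only the LVEDV/height_sd pair survives correction. In a real PheWAS the scan would also adjust for confounders, which is the part the agents are described as deciding on.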

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0477_paper.pdf

SharedIt Link: https://rdcu.be/eHwLp

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04927-8_41

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhaWei_MultiAgent_MICCAI2025,
        author = { Zhang, Weitong AND Qiao, Mengyun AND Zang, Chengqi AND Niederer, Steven AND Matthews, Paul M. AND Bai, Wenjia AND Kainz, Bernhard},
        title = { { Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {429 -- 439}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper introduces MESHAgents, a multi-agent LLM framework for automated discovery of cardiovascular imaging phenotypes and their associations with non-imaging factors. The system mimics a multidisciplinary team, providing transparent, automated phenome-wide association studies and achieving diagnostic performance comparable to expert-selected features.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Novelty: First to use a multi-agent LLM system for medical imaging phenotype discovery.
2. Clinical Relevance: Demonstrates comparable or better performance than expert-selected features on real-world disease classification tasks.
3. Confounder Identification: Automatically identifies additional relevant confounders, addressing a key challenge in medical studies.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Implementation Details: Lacks specifics on LLM agent configuration, prompts, and domain knowledge integration, limiting reproducibility.
2. Comparative Analysis: Does not benchmark against traditional feature selection methods (e.g., LASSO, PCA).
3. Hallucination Analysis: Limited analysis of how the system mitigates LLM hallucinations in a clinical context.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Strong methodological innovation and clinical relevance, but limited by lack of detail and broader validation. Addressing these would strengthen the paper’s impact.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have addressed all the major weaknesses in the rebuttal process.

Review #2

Please describe the contribution of the paper

The paper presents MESHAgents, a multi-agent framework driven by large language models to dynamically identify and select confounders and phenotypes in association studies, with a focus on cardiovascular imaging. The AI agents first analyze imaging phenotypes within a defined clinical domain. They then explore a broad spectrum of non-imaging variables to uncover statistically significant associations. In the final stage, the agents engage in collaborative reasoning to reach a consensus and generate comprehensive reports. When applied to heart and aorta imaging data, the framework successfully identified meaningful associations with non-imaging factors, revealing additional confounders beyond standard demographic variables.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper presents an intriguing and timely idea. The objectives, problem formulation, and methodology are clearly articulated. It tackles a critical challenge in medical research—enhancing our understanding of disease mechanisms and improving diagnostic and prognostic models. By leveraging a multi-agent framework powered by large language models (LLMs), the approach aligns well with current research trends. Its ability to dynamically identify confounders and phenotypes in association studies marks a distinctive and valuable contribution.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The presentation would benefit from a more detailed explanation of the implementation. Additionally, further clarification on the design and role of the domain-specific LLM agents is needed.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

The paper introduces a timely and innovative approach to understanding disease mechanisms and improving diagnostic and prognostic models. It clearly outlines the objectives, problem, and methodology, utilizing a multi-agent framework with large language models (LLMs). The framework’s ability to dynamically identify confounders and phenotypes in association studies presents a unique and valuable contribution to the field. Providing more in-depth implementation details would enhance the clarity of the presentation. Moreover, a clearer explanation of the structure and functions of the domain-specific LLM agents is recommended. A few errors must be rectified: the Table 1 caption should be placed above the table; the abstract strings together multiple "and"s (“understanding disease mechanisms and improving diagnosis and prognosis models”); and deviation values are shown as subscripts, so please check the required style.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper introduces a timely and innovative approach to understanding disease mechanisms and improving diagnostic and prognostic models. It clearly outlines the objectives, problem, and methodology, utilizing a multi-agent framework with large language models (LLMs). The framework’s ability to dynamically identify confounders and phenotypes in association studies presents a unique and valuable contribution to the field. Providing more in-depth implementation details would enhance the clarity of the presentation. Moreover, a clearer explanation of the structure and functions of the domain-specific LLM agents is recommended.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The concerns mentioned in the review comments have been addressed.

Review #3

Please describe the contribution of the paper

The authors propose the MESHAgents framework (Multi-agent Exploratory Synergy for the Heart), which leverages large language models (LLMs) as agents to find phenome-wide associations (PheWAS) between imaging phenotypes (structural and functional) and non-imaging factors (demographics, anthropometrics, lifestyle, risk factors) in the context of cardiovascular diseases. This work is likely the first framework for medical imaging PheWAS. The proposed framework addresses current challenges of LLMs in two ways. First, the lack of multi-disciplinary domain expertise is addressed by memory-augmented agents that integrate domain expertise with historical experience across local analysis and group reasoning, enabling collective analysis. Each specialist agent maintains a structured, long-term memory bank that encodes imaging phenotype patterns and associated factors from past analyses, so agents can draw on past experience and statistical evidence when evaluating phenotypes and factors. Second, the tendency of LLMs to hallucinate, which limits their use for phenotype discovery, is addressed by a sequential discussion protocol for building consensus. This sequential mechanism creates traceable emergence patterns and transparent validation.
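The memory-augmented agents and sequential discussion protocol described above can be illustrated with a small sketch. Everything in it is hypothetical: the `MemoryEntry` schema, the `take_turn` rule (a deterministic stand-in for an LLM call), and the stance labels are assumptions, not the paper's design. What it shows is how a fixed speaking order, in which each agent sees every earlier turn, produces a transcript that doubles as a traceable reasoning trail.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    """One finding remembered from a past analysis (hypothetical schema)."""
    phenotype: str
    factor: str
    effect: float    # e.g. a correlation coefficient
    p_value: float

@dataclass
class SpecialistAgent:
    domain: str                                   # e.g. "LV", "AAo"
    memory: list = field(default_factory=list)    # long-term memory bank

    def remember(self, entry: MemoryEntry) -> None:
        self.memory.append(entry)

    def recall(self, phenotype: str, alpha: float = 0.05) -> list:
        """Return remembered findings for this phenotype that were significant."""
        return [m for m in self.memory if m.phenotype == phenotype and m.p_value < alpha]

    def take_turn(self, phenotype: str, transcript: list) -> dict:
        """Deterministic stand-in for an LLM call: cite memory, else defer to earlier supporters."""
        evidence = self.recall(phenotype)
        supporters = [t["agent"] for t in transcript if t["stance"] == "support"]
        if evidence:
            stance = "support"
        elif supporters:
            stance = "defer"
        else:
            stance = "abstain"
        return {
            "turn": len(transcript),
            "agent": self.domain,
            "stance": stance,
            "cites": [(m.factor, m.effect) for m in evidence],
            "responds_to": supporters,
        }

def sequential_discussion(agents: list, phenotype: str) -> list:
    """Agents speak in a fixed order; each sees all earlier turns, so the transcript is the trail."""
    transcript = []
    for agent in agents:
        transcript.append(agent.take_turn(phenotype, transcript))
    return transcript

# Hypothetical usage: only the LV agent remembers evidence about LVEDV.
lv = SpecialistAgent("LV")
lv.remember(MemoryEntry("LVEDV", "height", 0.45, 1e-10))
aao = SpecialistAgent("AAo")
trail = sequential_discussion([lv, aao], "LVEDV")
```

Because every turn records what it cites and whom it responds to, a reviewer can audit why a phenotype was supported rather than trusting a single unexplained answer.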
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Validation of the proposed framework appears very sound and well designed. The framework was tested for automatic phenotype discovery (auto-PheWAS) and outperformed recent medical multi-agent frameworks (MedAgents and RareAgents). To assess clinical utility in disease diagnosis, the phenotype-factor associations discovered by MESHAgents were used to train classification models, whose performance was compared with that of models trained on imaging phenotypes identified by human experts. The classification results confirm that MESHAgents-identified parameters maintain robust diagnostic value across different conditions, demonstrating the generalizability and scalability of the proposed approach.
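The diagnostic comparison described above, and the AUC differences and recall figures quoted in the abstract, rest on two standard metrics. A minimal sketch, assuming binary labels and per-subject classifier scores; the function names and toy scores are invented for illustration and are not results from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))

def recall(labels, preds):
    """Fraction of true cases that the classifier flags: TP / (TP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn)

def to_preds(scores, threshold=0.5):
    """Binarize scores at a fixed (assumed) decision threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# Toy scores from two hypothetical classifiers trained on different phenotype sets.
labels = [0, 0, 1, 1, 0, 1]
expert_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
agent_scores = [0.2, 0.3, 0.5, 0.7, 0.1, 0.6]
delta_auc = roc_auc(labels, agent_scores) - roc_auc(labels, expert_scores)
```

The pairwise AUC is O(n_pos * n_neg), which is fine for illustration; a rank-based formula scales better for population-sized cohorts.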
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

none that I could identify
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The approach is of high interest and the validation seems very good
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We sincerely thank the reviewers for their valuable feedback. We’re pleased that reviewers recognized our work as “intriguing and timely” (R1), “clearly articulated” (R1), with “very sound and well-designed validation” (R3), and “strong methodological innovation and clinical utility” (R2, R3).

Implementation details (R1, R2-1, R3): MESHAgents implements a structured multi-agent reasoning framework for phenotype discovery via a three-stage cardiovascular pipeline. Domain-specific agents (LV, RV, LA, RA, AAo, DAo) operate with: (1) EvidencePool component managing statistical evidence with quantifiable metrics; (2) ConsensusBuilder orchestrating multi-round voting; (3) DebateArena implementing sequential discussion with traceable pathways. Agent communication uses standardized JSON with domain-specific prompts. The framework integrates phenotype analysis across domains while maintaining evidence-based inference chains. As noted by R1, the manuscript provides a “clear and detailed description of the algorithm” that contributes to reproducibility. To ensure full transparency and facilitate adoption by the clinical and research communities, we will release the complete codebase and accompanying documentation upon acceptance. We are currently preparing a clean and accessible version of the implementation to ensure maximum clinical impact.
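The rebuttal mentions JSON-based agent communication and multi-round voting (the ConsensusBuilder) but gives no schema or voting rule. The sketch below is one hypothetical way such a loop could work: the message fields, the quorum rule, and the convergence test (stop once the tally no longer changes) are assumptions, and the agents are plain functions rather than LLM calls.

```python
import json
from collections import Counter

def make_message(agent, round_no, votes):
    """Serialize one agent's ballot as a JSON message (hypothetical schema)."""
    return json.dumps({"agent": agent, "round": round_no, "votes": sorted(set(votes))})

def run_consensus(agents, quorum, max_rounds=5):
    """Multi-round voting: each agent sees the previous round's tally and may revise its ballot.

    agents: mapping of agent name -> callable(previous_counts) -> list of phenotype names.
    Stops when the tally stops changing or after max_rounds.
    """
    prev_counts, accepted, trail = {}, set(), []
    for round_no in range(1, max_rounds + 1):
        messages = [make_message(name, round_no, vote(prev_counts))
                    for name, vote in agents.items()]
        counts = Counter()
        for raw in messages:
            counts.update(json.loads(raw)["votes"])  # votes are already de-duplicated per agent
        accepted = {p for p, c in counts.items() if c >= quorum}
        trail.append({"round": round_no, "messages": messages, "accepted": sorted(accepted)})
        if dict(counts) == prev_counts:  # tally unchanged: consensus has converged
            break
        prev_counts = dict(counts)
    return accepted, trail

# Hypothetical agents: the AAo and RV agents revise their ballots once LVEF gains support.
agents = {
    "LV": lambda prev: ["LVEF", "LVEDV"],
    "AAo": lambda prev: ["AAo distensibility"] + (["LVEF"] if prev.get("LVEF", 0) >= 1 else []),
    "RV": lambda prev: ["RVEF"] + (["LVEF"] if prev.get("LVEF", 0) >= 2 else []),
}
accepted, trail = run_consensus(agents, quorum=2)
```

Keeping every serialized message in `trail` is what makes the outcome auditable: each round's ballots and the resulting accepted set can be replayed after the fact.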

Comparative analysis with traditional methods (R2-2): We acknowledge the value of comparing against traditional feature selection methods such as LASSO and PCA. Our work includes Linear Discriminant Analysis (LDA) as a downstream classifier (Table 2), which evaluates the diagnostic utility of the selected phenotypes. PCA is not applicable to these tasks [1] because it produces latent components that are linear combinations of the input features and does not retain the original, interpretable variables. Supervised LASSO performs sparsity-driven feature selection but requires access to outcome labels (e.g., disease status) to optimize its selection [2]. It also does not incorporate domain-specific dependencies and is therefore unsuitable for unsupervised reasoning tasks that consider anatomical structure, clinical relevance, or reasoning traceability. Furthermore, these methods were not used in the compared papers [3] or in clinical studies [4]. In contrast, MESHAgents is designed to identify clinically meaningful and anatomically diverse features by coordinating multiple specialist agents equipped with structured memory and consensus protocols. The framework recovered important phenotypes such as AAo distensibility and alcohol intake, which may be overlooked by sparsity-based methods. We agree that benchmarking against LASSO-based selection would provide a useful comparison and will consider it in future work.
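The point that supervised LASSO depends on outcome labels can be made concrete with a minimal coordinate-descent LASSO. This is a generic textbook sketch, not anything from the paper; it assumes centered features, no intercept, and the objective (1/2n)||y - Xb||^2 + lam * ||b||_1, and the function names are hypothetical. Swapping the label vector flips which feature survives, so the selection cannot be run without labels.

```python
def soft_threshold(a, lam):
    """Proximal operator of lam * |b|: shrink a toward zero by lam."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    X is a list of rows (n samples x p features), assumed centered; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # y - Xb with b = 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            # correlation of feature j with the partial residual (its own contribution added back)
            rho = sum(c * (r + c * b[j]) for c, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam) / z
            delta = new_bj - b[j]
            if delta:
                resid = [r - c * delta for c, r in zip(col, resid)]
            b[j] = new_bj
    return b

# Two orthogonal toy features; the labels decide which one LASSO keeps.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
b_label_a = lasso_cd(X, [2, -2, 2, -2], lam=0.1)  # labels follow feature 0
b_label_b = lasso_cd(X, [2, 2, -2, -2], lam=0.1)  # labels follow feature 1
```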

[1] Jolliffe, I.T. Principal component analysis. Springer New York, 2002.

[2] Freijeiro‐González, L., et al. A critical review of LASSO and … Int. Stat. Rev. 90(1):118-145, 2022.

[3] Tang, X., et al. Medagents: Large language models as collaborators … arXiv:2311.10537, 2023.

[4] Bai, W., et al. A population-based PheWAS of cardiac and aortic structure and function. Nat. Med. 26(10), 2020.

Addressing LLM hallucination analysis (R2-3): Our framework mitigates hallucination through three verification mechanisms: (1) EvidencePool implements statistical validation, rejecting associations without sufficient support; (2) cross-agent validation requires multiple confirmations; (3) quantitative assessment through the structure-function metrics in Table 1 shows that MESHAgents achieved the best score (0.350) compared with single- and multi-agent SOTA methods. Diagnostic validation (Table 2) confirms that our framework is competitive with expert-selected approaches, reliably identifying genuine clinical associations, as demonstrated by downstream performance (green boxes).
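The first two mechanisms above, statistical gating and cross-agent confirmation, can be expressed as a simple filter. The sketch assumes a p-value lookup per (phenotype, factor) pair, a significance level `alpha`, and a `min_confirmations` threshold; the `Claim` type, these names, their defaults, and the toy values are illustrative and do not come from the paper's EvidencePool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    """An association asserted by one agent (hypothetical schema)."""
    agent: str
    phenotype: str
    factor: str

def validate_claims(claims, p_values, alpha=0.05, min_confirmations=2):
    """Accept a (phenotype, factor) pair only if it is statistically supported
    AND asserted by at least `min_confirmations` distinct agents."""
    supporters = {}
    for c in claims:
        supporters.setdefault((c.phenotype, c.factor), set()).add(c.agent)
    accepted, rejected = [], []
    for pair, agents in sorted(supporters.items()):
        p = p_values.get(pair)
        if p is None or p >= alpha:
            rejected.append((pair, "no statistical support"))
        elif len(agents) < min_confirmations:
            rejected.append((pair, "insufficient cross-agent confirmation"))
        else:
            accepted.append(pair)
    return accepted, rejected

claims = [
    Claim("LV", "LVEDV", "height"),
    Claim("AAo", "LVEDV", "height"),
    Claim("LV", "LVEF", "shoe size"),           # plausible-sounding but unsupported
    Claim("AAo", "AAo distensibility", "age"),  # supported, but asserted by one agent only
]
p_values = {("LVEDV", "height"): 1e-8, ("AAo distensibility", "age"): 1e-6}  # toy values
accepted, rejected = validate_claims(claims, p_values)
```

Recording a rejection reason for each pair keeps the filter transparent: a hallucinated association is dropped with an explicit explanation rather than silently.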

Minor corrections (R1): We will correct the table caption placement, review the writing throughout the abstract and main text, and ensure consistent formatting in the final version.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

There is strong support from two reviewers. The third acknowledges the innovation but finds the level of detail insufficient.
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

This paper proposes a novel and interesting multi-agent framework. The authors addressed most of the reviewer concerns.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The paper introduces a novel multi-agent LLM framework for medical imaging phenotype discovery with strong clinical relevance and promising results. While reviewers raised concerns about missing implementation details, limited benchmarking against traditional methods, and hallucination handling, the rebuttal provides reasonable clarifications and commits to releasing code and improving transparency. Overall, the paper offers meaningful contributions, and I recommend acceptance.


Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis

Author(s):