Abstract
Current echocardiography MLLMs rely on diagnostic-focused data lacking detailed image-text descriptions and systematic multi-modal cardiac knowledge, resulting in suboptimal performance across diverse echocardiography visual question answering tasks. Existing methods to integrate clinical expertise face three key challenges when adapting to echocardiography: labor-intensive curation processes, overlooking textual or diagrammatic knowledge sources essential in cardiac diagnosis, and incompatibility with pretrained MLLMs. To address these gaps, we propose the Multi-Agent Collaborative Expertise Extractor (MACEE), a multi-agent framework employing MLLM-powered agents to extract echocardiography expertise from diverse sources. With MACEE, we collect the EchoCardiography Expertise Database (ECED), the first comprehensive knowledge repository covering 100+ common and rare cardiac conditions from textbooks, guidelines, and case studies. To integrate ECED into MLLMs, we introduce Echocardiography Expertise-enhanced Visual Instruction Tuning (EEVIT), a lightweight training framework using expertise-guided instruction tuning. EEVIT employs adapters in the vision and language modules, enabling efficient expertise integration while training less than 1% of the model's parameters. Experiments validate the effectiveness of these three components. Code and license details: https://github.com/xmed-lab/ECED
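As a rough illustration of the parameter-efficient tuning described in the abstract, the sketch below shows a generic residual bottleneck adapter attached to a frozen backbone. It is an assumption-laden stand-in, not the paper's actual EEVIT adapter design; all class and function names here are hypothetical.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so each adapter starts as an identity map
        # and the pretrained model's behavior is untouched at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

def attach_adapters(backbone: nn.Module, dim: int, num_layers: int) -> nn.ModuleList:
    """Freeze the pretrained backbone; only the new adapters receive gradients."""
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.ModuleList(BottleneckAdapter(dim) for _ in range(num_layers))

# Toy check of the trainable fraction on a stand-in backbone; for a
# billion-parameter MLLM the same adapters would fall well under 1%.
backbone = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(24)])
adapters = attach_adapters(backbone, dim=1024, num_layers=24)
total = sum(p.numel() for p in backbone.parameters()) \
      + sum(p.numel() for p in adapters.parameters())
trainable = sum(p.numel() for p in adapters.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # ~3% on this toy model
```

Zero-initializing the up-projection is a standard trick: the adapter contributes nothing at the start of training, so expertise can be injected gradually without disturbing the pretrained model's behavior.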
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3614_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/xmed-lab/ECED
Link to the Dataset(s)
N/A
BibTex
@InProceedings{QinYi_MultiAgent_MICCAI2025,
author = { Qin, Yi and Gamage Nanayakkara, Dinusara Sasindu and Li, Xiaomeng},
title = { { Multi-Agent Collaboration for Integrating Echocardiography Expertise in Multi-Modal Large Language Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {361--371}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a Multi-Agent Collaborative Expertise Extractor (MACEE) pipeline for constructing a new echocardiography benchmark:
1) The paper presents the MACEE pipeline, a novel approach for efficiently extracting expert knowledge from diverse sources to build specialized medical databases.
2) It introduces the EchoCardiography Expertise Database (ECED), a comprehensive resource that covers over 100 heart conditions in multiple formats, designed to enhance multimodal learning in echocardiography.
3) The paper also proposes Echocardiography Expertise-enhanced Visual Instruction Tuning (EEVIT), a method that incorporates ECED knowledge into pretrained models by enriching image-based instruction data with domain-specific expertise.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strengths of the paper:
- The paper introduces the EchoCardiography Expertise Database (ECED), which covers over 100 heart conditions and five different echocardiography modalities.
- Instead of relying on traditional human annotation, the paper proposes a pipeline that uses multiple agents to build the dataset more efficiently.
- The paper also introduces Echocardiography Expertise-enhanced Visual Instruction Tuning (EEVIT), a method that boosts pretrained models with lightweight adapters to efficiently integrate echocardiographic knowledge into MLLMs.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Although I agree with the motivation behind building a new benchmark for echocardiography, I have several questions and concerns about this paper:
- Will the database presented in this paper be open-source, or will only the multi-agent collaborative pipeline's code be made available?
- While multi-agent approaches to building datasets are common in general domains, in the medical domain the performance of current SOTA LLMs/VLMs still needs to be verified by clinicians. This is particularly important because the primary contribution of the paper is the echocardiography dataset. Has the dataset been evaluated by human clinicians in any way?
- The authors claim that the database incorporates knowledge from four textbooks, 35 ASE guidelines, 9 Chinese guidelines, and 176 online case studies. My main concern here is: have the authors checked the licenses for these materials or obtained the necessary copyright permissions? Some online sources have strict terms against scraping data and using it to train deep learning models.
For the experiments section:
- The implementation details of the proposed EEVIT are insufficient. In Table 2, when it says "ours," does this mean EEVIT was used with the proposed framework to fine-tune on the EchoCardiography Expertise Database and then tested on PMC-VQA, PMC-OA, and ROCO-V2? Were all other methods tested directly on PMC-VQA, PMC-OA, and ROCO-V2 without training on the proposed benchmark?
- In Table 1, as stated in the paper, I assume the performance results are only for the selected echocardiogram data. After the selection from PMC-OA and ROCO-V2, I think there are still open-ended questions. How did the authors calculate F1 and recall? Additionally, the BLEU-1 scores seem somewhat low from my perspective; are the selected questions too difficult?
- Table 2 presents the ablation study of the proposed EEVIT, but the table lacks important details. What does "CG" mean? I could not find a definition for it.
- Regarding the ablation study in Table 2: I believe it primarily demonstrates that using the proposed ECED dataset can improve VLM performance, but it is unclear what the authors are trying to convey here. Also, why are image-text pairs, tables, diagrams, and pure text included as part of the ablation study? More explanation is needed; the results are not entirely clear. Specifically, how are tables, diagrams, and image-text pairs used for training: via VQA or captioning?
- Furthermore, the authors introduce EEVIT, but there is a lack of experiments validating its effectiveness. I understand the main idea behind EEVIT is to fine-tune models more efficiently with fewer parameters, but it would be helpful to compare it with LoRA alone, Expert-Lens alone, and full supervised fine-tuning to better demonstrate the performance improvements.
Some writing suggestions:
- There are too many abbreviations.
- In Fig 1(c), the text is too small. If it’s just for illustration, you could simplify it.
- In line 11, “unparalleled repository” might be too strong.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I think the motivation behind this paper is good, and building a dedicated echocardiography database is highly meaningful. However, I have several concerns, as outlined above, that need to be addressed. These include questions around dataset evaluation, licensing issues, the clarity of the ablation study, and the lack of sufficient experimental validation for the proposed methods.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Thank you for your response. However, several key concerns from my side remain unresolved. As a benchmark paper, the licensing and copyright issues are still unclear. Additionally, the experimental details and clinical evaluation lack sufficient clarity and completeness.
I reviewed reference [20] and their dataset repository, which also lack detailed information about copyright. Many of my concerns require major revision, so I must maintain my original decision.
Review #2
- Please describe the contribution of the paper
The paper’s main contribution is a novel integrated approach for injecting echocardiography domain expertise into a multi-modal large language model. The authors introduce a Multi-Agent Collaborative Expertise Extractor (MACEE) framework that uses multiple MLLM-based agents to automatically gather and organize echocardiography knowledge from diverse sources (textbooks, clinical guidelines, case studies). This process yields the EchoCardiography Expertise Database (ECED) – the first comprehensive echocardiography knowledge repository covering over 100 cardiac conditions across multiple ultrasound modalities. Furthermore, they propose Echocardiography Expertise-enhanced Visual Instruction Tuning (EEVIT), a lightweight fine-tuning strategy that augments a pre-trained vision-language model’s training data with the curated ECED knowledge and employs parameter-efficient adapters to integrate this expert information while training less than 1% of the model’s parameters. Together, the MACEE pipeline and EEVIT tuning enable the model to achieve significantly improved understanding of echocardiography images and questions, attaining state-of-the-art performance on several echocardiographic comprehension benchmarks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
A key strength is the novel MACEE pipeline that automates the curation of echocardiography expertise with minimal human labor. By deploying five specialized LLM-powered agents for tasks such as content extraction, figure segmentation, text cleaning, and concept summarization, the authors drastically reduce manual effort compared to prior methods that required extensive hand-crafted curation.
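To make this division of labor concrete, here is a minimal sketch of such a sequential multi-agent hand-off. The first four stage names follow the review's description; the fifth (alignment) is inferred from the weaknesses section below, and the prompts, function names, and data layout are invented for illustration, not taken from the MACEE implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SourceDocument:
    raw_pages: list                       # e.g., scanned textbook or guideline pages
    extracted: dict = field(default_factory=dict)

def call_agent(instruction: str, doc: SourceDocument) -> str:
    # Stand-in for one MLLM invocation with a role-specific prompt;
    # a real pipeline would call a model API here.
    return f"[output of agent prompted with: {instruction!r}]"

def run_pipeline(doc: SourceDocument) -> dict:
    """Sequential hand-off between task-specific agents, one per stage."""
    stages = [
        ("content_extraction",    "Extract echocardiography-relevant passages."),
        ("figure_segmentation",   "Locate and crop figures; return their captions."),
        ("text_cleaning",         "Remove artifacts and normalize terminology."),
        ("concept_summarization", "Summarize each condition into key concepts."),
        ("image_text_alignment",  "Pair each figure with the text describing it."),
    ]
    for name, instruction in stages:
        doc.extracted[name] = call_agent(instruction, doc)
    return doc.extracted

print(run_pipeline(SourceDocument(raw_pages=["page-1", "page-2"])))
```

The design choice worth noting is one narrow task per agent: each stage gets a focused prompt rather than one agent juggling extraction, cleaning, and summarization at once, which is what the review credits with reducing manual curation effort.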
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Although the framework is well-designed, some components of the approach are built on existing ideas, which limits its groundbreaking nature. The multi-agent extraction pipeline largely leverages known capabilities of LLMs for tasks like OCR, summarization, and alignment, which is a clever engineering integration but not a fundamentally new technique. Similarly, the use of lightweight adapter-based fine-tuning to inject knowledge is an effective but already established strategy in multi-modal modeling.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper presents a useful system and a well-constructed echocardiography knowledge base (ECED), its methodological novelty is somewhat incremental, mainly involving a combination of existing components. The lack of clinical validation and limited comparison to recent echocardiography-specific models weakens its overall impact. However, the engineering contribution is solid, the experiments are well-conducted, and the domain relevance is strong. Therefore, I recommend a weak accept: the work is promising, but some parts require refinement or additional evidence to reach a stronger recommendation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes the Multi-Agent Collaborative Expertise Extractor (MACEE), a multi-agent framework employing MLLM-powered agents to extract echocardiography expertise from diverse sources.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The writing is clear and well organized with clear motivation
- The experiment is relatively comprehensive and does a good job on ablation study
- The method is sound, and the results well support the claims
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- How the number of agents affects performance should be analyzed and discussed.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The writing is clear and well organized with clear motivation
- The experiment is relatively comprehensive and does a good job on ablation study
- The method is sound, and the results well support the claims
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their insightful feedback. The reviewers commended the motivation as "clear" (R2), "good" (R3), and "promising" (R4); the problem as "highly meaningful" (R3) with "strong domain relevance" (R4); the method as "sound" (R2) and "well-designed" (R4); and the experiments as "comprehensive" (R2) and "well-conducted" (R4). The remaining concerns focus on clinical validation, licensing, implementation, and experiment details. Below we address the reviewers' concerns and will incorporate the corresponding updates in our revision.
[R3, R4] Clinical Evaluation: Two board-certified echocardiologists validated this study by (1) confirming that the Expertise Database's sources are authoritative and comprehensive, and (2) verifying that the output of each step in our method is clinically correct and relevant.
[R2] Effect of Agent Quantity. We followed the agent design principle to create five agents, each for a specific task. Fewer agents (multi-tasking) reduce task-specific performance, while more agents increase unnecessary token costs.
[R3] Reproducibility. The database and pipeline code will be released under a suitable license, and we will ensure full reproducibility.
[R3] License and copyright. We followed the IP conventions from prior works ([20] in paper), ensuring all sources were processed legally under their respective licenses. Details of the book sources and licenses will also be provided.
[R3] EEVIT Implementation: Our method ("Ours" in Table 1) used EEVIT to fine-tune on the Expertise Database and was then tested on the three benchmarks, while the other methods were tested directly on these benchmarks. To ensure fairness, no models were trained on the testing benchmarks (see Dataset and Preprocessing).
[R3] Metric Calculation. We used the evaluation code from published works to calculate F1, recall, and BLEU-1 (see Evaluation Metrics). F1 and recall were computed at the token level by comparing predicted and ground-truth answer tokens. Due to the diverse writing styles and answer patterns in these benchmarks, BLEU-1 scores are generally lower.
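For concreteness, token-level F1 and recall of this kind are commonly computed as a multiset overlap between predicted and reference tokens (as in SQuAD-style open-ended evaluation). The sketch below illustrates that convention; it is an assumption about the cited evaluation code, not a reproduction of it.

```python
from collections import Counter

def token_f1_recall(prediction: str, reference: str) -> tuple[float, float]:
    """Token-level F1 and recall via multiset overlap of whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall), recall

f1, rec = token_f1_recall("severe mitral regurgitation is present",
                          "the patient shows severe mitral regurgitation")
print(f"F1={f1:.3f}, recall={rec:.3f}")  # F1=0.545, recall=0.500
```

Because the overlap is order-insensitive, this metric is more forgiving of paraphrase than BLEU-1, which is consistent with the rebuttal's point that diverse answer styles depress BLEU-1 scores.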
[R3] "CG" in Table 2 corresponds to "EE" in the caption. We apologize for the typo and will correct it in the revision.
[R3] Ablation Details (Table 2). The ablation aims to (1) validate the contribution of image-text pairs, tables, diagrams, and pure text in ECED to model performance; the results show that including more types of knowledge enhances performance; and (2) verify the best strategy for handling pure text in multi-modal tuning ("D"/"EE" in Table 2); the results show that our new augmentation strategy ("EE") is the best. These are the core contributions of our work. Our method uses both VQA and captioning tasks to train on tables, diagrams, and image-text pairs by leveraging the proposed augmentation strategy (see Fig. 3a).
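As an illustration of how multi-form knowledge might be turned into VQA and captioning supervision, the sketch below converts one hypothetical knowledge entry into instruction-tuning samples. The field names, question templates, and helper name are invented for illustration and do not come from the paper.

```python
def to_instruction_samples(entry: dict) -> list[dict]:
    """Turn one knowledge entry into captioning- and VQA-style samples."""
    samples = []
    if "image" in entry:
        # Captioning-style supervision for diagrams and image-text pairs.
        samples.append({
            "image": entry["image"],
            "instruction": "Describe the echocardiographic findings in this image.",
            "response": entry["caption"],
        })
        # VQA-style supervision grounded in the paired expertise.
        samples.append({
            "image": entry["image"],
            "instruction": "Which cardiac condition is most consistent with this view?",
            "response": entry["condition"],
        })
    else:
        # Pure-text knowledge, e.g., a flattened table row or guideline passage.
        samples.append({
            "instruction": f"What does the guideline state about {entry['topic']}?",
            "response": entry["text"],
        })
    return samples

demo = {"image": "plax_view.png",
        "caption": "Thickened mitral leaflets with restricted opening.",
        "condition": "mitral stenosis"}
print(to_instruction_samples(demo))
```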
[R3] The effectiveness of EEVIT. Thank you for your advice. EEVIT contains two innovations: the new augmentation strategy and the adapter design. Table 2 validated the effectiveness of the augmentation. In earlier experiments, we already confirmed that our proposed adapter outperforms LoRA (one of the designs mentioned in the comment), verifying its effectiveness. Due to length and rebuttal restrictions, we will include these comparisons in the discussion of the revision.
[R4] Method Novelty. Our method introduces two key methodological innovations. Instead of using MLLMs individually, as in prior works, our multi-agent extraction pipeline proposes a new collaborative way to maximize MLLMs' capacity to process diverse knowledge sources. Our fine-tuning method overcomes the limitation of prior works that rely solely on image-caption pairs to inject knowledge from scratch, by using multi-form expertise augmentation and new adapters on pretrained MLLMs (see Fig. 3a).
[R4] Comparison to recent echocardiography-specific models. Recent echocardiography models are mainly CLIP-based discriminative models, while we focus on enhancing pretrained generative MLLMs and providing an expertise database. Enhancing such models would be promising future work.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The reviewer mentioned that the licensing and copyright issues remain unclear, which undermines the credibility of a benchmark paper.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A