Abstract
Radiology students often struggle to develop perceptual expertise due to limited time for expert mentorship, leading to errors in visual search patterns and diagnostic interpretation. These perceptual errors—such as missed fixations, brief dwell times, or misinterpretations—are not adequately addressed by existing AI systems, which focus on diagnostic accuracy but fail to explain how and why errors occur. To bridge this gap, we propose MAARTA (Multi-Agentic Adaptive Radiology Teaching Assistant), a multi-agent framework that analyzes gaze patterns and radiology reports to provide personalized feedback. Unlike single-agent models, MAARTA dynamically recruits agents based on error complexity, ensuring adaptive and efficient reasoning. By leveraging thought graphs to compare expert and student gaze behavior, the system identifies missed findings and assigns Perceptual Error Teacher (PET) agents to analyze discrepancies. Using Chain-of-Thought (CoT) prompting, MAARTA generates meaningful insights, helping students understand their errors and refine their diagnostic reasoning, ultimately enhancing AI-driven radiology education.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1455_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/a04101999/MAARTA.git
Link to the Dataset(s)
https://github.com/a04101999/MAARTA.git
BibTex
@InProceedings{AwaAka_MAARTAMultiAgentic_MICCAI2025,
author = { Awasthi, Akash and Chung, Brandon V. and Vu, Anh M. and Le, Ngan and Agrawal, Rishi and Deng, Zhigang and Wu, Carol and Nguyen, Hien V.},
title = { { MAARTA: Multi-Agentic Adaptive Radiology Teaching Assistant } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {359--369}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper presents MAARTA, a multi-agentic adaptive radiology teaching assistant framework aimed at improving perceptual learning in radiology students. The main contribution lies in integrating eye-tracking data and radiology reports to provide personalized feedback on perceptual errors made by students, offering a more detailed explanation than current systems, which typically focus only on diagnostic accuracy.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Multimodal integration: The integration of eye-tracking data and diagnostic reports offers a unique method for analyzing perceptual errors and improving diagnostic reasoning.
- Scalable and efficient: MAARTA’s multi-agent approach balances computational efficiency with enhanced reasoning capabilities, offering potential scalability in real-world applications.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Inconsistency of results in Table 1: Some results in Table 1 appear inconsistent. For instance, in the Single Agent section, the first and second rows, and likewise the first rows of Single Agent versus Single Agent with Thought Graph, do not align in terms of accuracy, precision, recall, and F1 score. The relationships between these metrics do not follow the pattern expected for balanced data, which raises doubts about the reliability and validity of the results and makes it difficult to draw clear conclusions from them.
- Lack of discussion of results: The paper does not provide detailed discussion of its results, such as the performance degradation observed when thought graphs are integrated or when agents communicate. A more thorough analysis of why this happens, along with potential solutions or refinements, would strengthen the paper’s contributions.
- Lack of Clarity in Figure 1: The current layout and the depiction of the different components, such as the workflow of various steps (ABCD), are somewhat difficult to follow. Improving the clarity of this figure would contribute to the overall readability and comprehensibility of the paper.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Here are some additional comments and suggestions for minor improvements and clarifications in the paper:
- Explanation on inconsistency in Table 1: I highly suggest the authors make a thorough discussion about the results provided in Table 1. Understanding why these metrics behave in such a manner and whether there are any underlying factors affecting the results would greatly enhance the transparency and robustness of the study’s findings.
- Add discussion of results: It would be helpful to discuss in more detail how the thought graphs and agent communication decrease performance.
- Revision of Figure 1 for clarity: We kindly ask the authors to revise Figure 1 for clarity. Specifically, enhancing the visual distinctions between the different components and providing clearer annotations or labels would help readers better understand the methodology.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces the MAARTA framework, which is a promising approach to improving radiology education using multi-agent systems and eye-tracking data. However, the following issues led to my recommendation for rejection:
- Inconsistent Results in Table 1: The correspondence between different metrics (e.g., precision & recall vs F1) seems inconsistent, raising concerns about the reliability of the results.
- Lack of Discussion on Results: The paper lacks a detailed analysis of the performance degradation observed when integrating thought graphs or enabling agent communication.
- Lack of Clarity in Figure 1: The figure is unclear and could benefit from better visual distinctions and annotations to aid in understanding the methodology. These weaknesses make it difficult to draw clear conclusions from the paper, and further revisions are necessary for it to be considered for acceptance. Therefore, I recommend a rating of weak reject in the current form.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
My comments have been well addressed. The rebuttal provides a thorough clarification, improving the precision and depth of the discussion. I am satisfied and accept the paper.
Review #2
- Please describe the contribution of the paper
This paper introduces MAARTA (Multi-Agentic Adaptive Radiology Teaching Assistant), a multi-agent system designed to provide personalized feedback to radiology students by analyzing their eye gaze patterns and diagnostic reports. The system leverages Large Multimodal Models (LMMs) to analyze perceptual errors and adjust the number of agents based on error complexity. The authors demonstrate MAARTA’s effectiveness through experiments on a simulated dataset and real-world chest X-ray images, showing improvements in diagnostic accuracy and efficiency.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Innovative Multi-Agent Framework:
- The approach of dynamically recruiting agents based on error complexity is a novel and effective solution for adapting to the varying levels of perceptual errors in radiology students. This enables the system to scale efficiently, providing adaptive and targeted feedback based on the student’s performance.
- Personalized Feedback:
- MAARTA moves beyond traditional methods by offering personalized feedback that focuses on the student’s perceptual errors. By analyzing eye-tracking data alongside diagnostic reports, the system provides real-time, context-specific insights, which is critical for enhancing diagnostic skills and reducing errors in medical education.
- Clear Experimental Design:
- The paper presents a comprehensive experimental setup and demonstrates consistent performance improvements across various models (GPT-4o, LLaMA, Mistral). The comparison with baseline models highlights the advantages of the multi-agent system in improving diagnostic accuracy without compromising computational efficiency.
- Computational Efficiency:
- The system shows that it can enhance reasoning capabilities through multi-agent coordination while maintaining efficiency. The slight increase in response time for the multi-agent system, compared to single-agent models, suggests that MAARTA strikes a good balance between performance and computational demands.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited Real-World Testing:
- While the experiments on simulated data and chest X-ray images show promising results, the evaluation lacks a broader range of real-world data. Further testing on diverse medical imaging datasets (e.g., MRI, CT scans) and more varied diagnostic scenarios would help establish the system’s robustness and generalizability in practical clinical settings.
- Error Complexity Function:
- The method for calculating error complexity and determining the number of agents to be recruited is an interesting approach, but the linear relationship assumption may not be optimal across all scenarios. A more detailed exploration of the error complexity function or additional empirical validation could strengthen the system’s theoretical foundation.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper introduces an innovative approach to enhancing radiology education through personalized feedback based on perceptual error analysis. The use of a multi-agent framework demonstrates clear benefits in diagnostic accuracy and computational efficiency. However, further testing with real-world data and refinement of the error complexity model would be beneficial. Despite these points, I recommend weak accept, with the expectation that these areas will be addressed in a revised version.
- Reviewer confidence
Not confident (1)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
- A novel simulated dataset focused on perceptual errors in radiology.
- A very innovative framework that analyses students’ eye-tracking data and reports, explaining why they missed a particular finding, thus providing personalised feedback to improve diagnostic skills.
- A multi-agent system that dynamically recruits LLM/LMM agents based on error complexity to efficiently process multimodal data, enhancing reasoning while maintaining scalability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The method is very novel, combining LLMs, CoT, and gaze data, and achieves superior results compared to the baselines.
- The innovative use of graphs to model gaze data makes the representation of gaze behavior more structured and also simplifies computing errors from the graphs.
- A dataset on perceptual errors will be released, which will promote development in this research field.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The application prospects are relatively limited. This method requires gaze data to drive it, but it is difficult to obtain gaze data in practical applications other than teaching.
- Many technical details in the article are unclear, such as the training loss.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is a very novel paper, with impeccable ideas and impressive model performance. However, the lack of many technical details makes the article difficult to understand and reproduce, and the model’s application prospects in a wider range of medical tasks are relatively limited. If these concerns were resolved, the article would deserve an accept rather than a weak accept.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
This paper addresses an important subfield of medical diagnosis: using gaze information to assist diagnosis. Although the paper focuses on the teaching aspect, its effectiveness also suggests that encoding information from the clinician’s gaze reasoning process can robustly guide a model to perform diagnostic reasoning.
Author Feedback
We appreciate the reviewers’ unanimous recognition of the novelty of our proposed framework, MAARTA. We are encouraged by the positive feedback on its scalability, efficiency, innovative multi-agent design, sound methodology, strong performance, and the contribution of a new dataset. Below, we address key reviewer concerns:
R1: Inconsistency of Results in Table 1 The perceived inconsistencies stem from multilabel evaluation metrics, not computational errors. Our task is multilabel classification. The reported “accuracy” is subset accuracy, a stringent metric requiring all predicted labels to match the ground truth exactly. In contrast, macro-averaged precision, recall, and F1-score evaluate each class independently and permit partial matches—often yielding higher values despite lower subset accuracy. Label co-occurrence in our dataset contributes to this pattern. While class distribution is provided in the supplementary material, we will add a brief version to the main paper and revise the evaluation section to clarify the role of subset accuracy. All metrics were computed using standard scikit-learn functions.
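The distinction the rebuttal draws between subset accuracy and macro-averaged metrics can be reproduced with a toy multilabel example (illustrative data only, not the paper's dataset), using the same scikit-learn functions the authors cite:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy multilabel ground truth and predictions: 3 samples, 3 labels.
y_true = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [1, 0, 1]])
# Each prediction partially matches its sample but never matches exactly.
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

# Subset accuracy: a sample counts only if ALL labels match exactly.
subset_acc = accuracy_score(y_true, y_pred)

# Macro-averaged metrics score each label column independently,
# so partial matches still earn credit.
macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Subset accuracy is low while macro metrics stay comparatively high,
# which is the pattern the rebuttal attributes to Table 1.
print(subset_acc, macro_p, macro_r, macro_f1)
```

Here one of the three samples matches exactly, so subset accuracy is about 0.33, while every macro metric exceeds it because two of the three label columns are predicted perfectly.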
R1: Lack of Discussion on Results Agent Communication: MAARTA’s communication between agents can introduce overhead or misalignment, particularly when different agents process separate segments of multimodal input (i.e., thought graphs). This is more noticeable in simpler error cases and aligns with findings in Zhang et al. (2024) “Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems.”
Thought Graph Integration: In single-agent settings, performance drops with Thought Graphs due to increased prompt complexity. MAARTA’s multi-agent setup distributes this cognitive load across agents, boosting both efficiency and performance. We will elaborate on these findings in the revised manuscript. Additional discussion has also been added to our anonymous repository.
R1: Clarity in Figure 1 We thank the reviewer for this suggestion and will revise Figure 1 to enhance clarity and readability.
R2: Limited Real-World Testing Our work focuses on the technical development of MAARTA for delivering personalized perceptual feedback in radiology education, particularly addressing perceptual errors through gaze-informed interventions. Clinical user studies are vital but require IRB approvals and integration into hospital workflows, which are logistically intensive. Our team includes radiologists and is actively planning hospital-based evaluations to assess MAARTA’s impact. Our current controlled evaluation using chest X-rays demonstrates feasibility. Due to the lack of public gaze datasets for CT/MRI, we did not evaluate those modalities here. However, we are now collecting private gaze-annotated data and will extend MAARTA to CT, MRI, and more complex diagnostics. These limitations and future plans will be included in the final version.
R2: Error Complexity Function We initially used a linear assumption between error complexity (C_error) and the number of agents (N_agents) for simplicity. As shown in Fig. 2B, this relationship varies by model size and is not strictly linear—already noted in the results section. Future work will explore empirically derived or learned mappings (e.g., logarithmic or polygonal).
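The linear recruitment rule mentioned above can be sketched as follows. This is a hypothetical illustration: the symbols C_error and N_agents come from the rebuttal, but the exact mapping, the normalization of C_error, and the agent cap `max_agents` are all assumptions, not the paper's implementation.

```python
import math

def recruit_agents(c_error: float, max_agents: int = 5) -> int:
    """Hypothetical linear mapping from error complexity to agent count.

    Assumes c_error (C_error) is normalized to [0, 1] and that N_agents
    scales linearly with it, clamped so at least one agent is always
    recruited and no more than max_agents ever are.
    """
    return max(1, min(max_agents, math.ceil(c_error * max_agents)))

# Simple errors get a single agent; complex errors recruit the full pool.
print(recruit_agents(0.1), recruit_agents(0.5), recruit_agents(1.0))
```

A learned or logarithmic mapping, as the rebuttal proposes for future work, would replace the `c_error * max_agents` term while keeping the same clamping.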
R3: Limited Application Prospects MAARTA is specifically designed to identify and explain perceptual errors in radiology education—where gaze-based reasoning is essential. As perceptual errors are tightly linked to visual search behavior, integrating gaze data with LLMs/LMMs enables more personalized and informative feedback for trainees.
R3: Missing Technical Details This work relies on prompt-based reasoning using pretrained LLMs/LMMs and does not involve training or loss functions. Implementation details, libraries, APIs, and the error dataset construction are included in the main paper. We will ensure all technical aspects are explicitly stated in the camera-ready version.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper focuses on a model serving as a teaching assistant for radiology students, in contrast to the many proposed models that provide clinical diagnostics. I think it can be an interesting paradigm for the MICCAI community to explore further. The authors also address the reviewers’ concerns well with the suggested improvements.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A