Abstract
In this work, we introduce MedAgentSim, an open-source simulated clinical environment with doctor, patient, and measurement agents designed to evaluate and enhance LLM performance in dynamic diagnostic settings. Unlike prior approaches, our framework requires doctor agents to actively engage with patients through multi-turn conversations, requesting relevant medical examinations (e.g., temperature, blood pressure, ECG) and imaging results (e.g., MRI, X-ray) from a measurement agent to mimic the real-world diagnostic process.
Additionally, we incorporate self-improvement mechanisms that allow models to iteratively refine their diagnostic strategies. We enhance LLM performance in our simulated setting by integrating multi-agent discussions, chain-of-thought reasoning, and experience-based knowledge retrieval, facilitating progressive learning as doctor agents interact with more patients. We also introduce an evaluation benchmark for assessing the LLM’s ability to engage in dynamic, context-aware diagnostic interactions. While MedAgentSim is fully automated, it also supports a user-controlled mode, enabling human interaction with either the doctor or patient agent. Comprehensive evaluations in various simulated diagnostic scenarios demonstrate the effectiveness of our approach. Our codebase, simulation environment, and benchmark datasets are publicly available on the project page: https://medagentsim.netlify.app/.
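To make the interaction pattern described in the abstract concrete, the following is a minimal, hypothetical sketch of how a multi-turn doctor-patient-measurement loop could be orchestrated. The `ask` callable (standing in for any chat-LLM backend), the `REQUEST:` / `FINAL DIAGNOSIS:` message conventions, and the `case` dictionary fields are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable

def run_consultation(
    ask: Callable[[str, list[str]], str],  # (system_prompt, history) -> reply
    case: dict,
    max_turns: int = 10,
) -> str:
    """Drive one simulated consultation and return the doctor's final message."""
    history: list[str] = []
    for _ in range(max_turns):
        doctor_msg = ask(
            "You are a doctor. Ask the patient questions, request tests with "
            "'REQUEST: <test name>', or end with 'FINAL DIAGNOSIS: <name>'.",
            history,
        )
        history.append(f"Doctor: {doctor_msg}")
        if "FINAL DIAGNOSIS" in doctor_msg:
            return doctor_msg  # doctor commits to a diagnosis
        if doctor_msg.strip().startswith("REQUEST:"):
            # Measurement agent: reveal only the test result that was asked for.
            test = doctor_msg.strip().removeprefix("REQUEST:").strip()
            result = case.get("tests", {}).get(test, "unavailable")
            history.append(f"Measurement: {test} = {result}")
        else:
            # Patient agent: answer from the case's symptoms without revealing
            # the ground-truth diagnosis.
            patient_msg = ask(
                f"You are a patient experiencing: {case['symptoms']}. "
                "Answer only the doctor's last question.",
                history,
            )
            history.append(f"Patient: {patient_msg}")
    return "no diagnosis reached"
```

In MedAgentSim the doctor agent must drive this kind of loop itself, deciding which questions to ask and which examinations to order before committing to a diagnosis.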
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2575_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/2575_supp.zip
Link to the Code Repository
https://medagentsim.netlify.app/
Link to the Dataset(s)
Benchmark datasets: https://huggingface.co/ItsMaxNorm/MedAgentSim-datasets
BibTex
@InProceedings{AlmMoh_MedAgentSim_MICCAI2025,
author = { Almansoori, Mohammad and Kumar, Komal and Cholakkal, Hisham},
title = { { MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {363--373}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces MedAgentSim, an open-source, self-evolving multi-agent simulation environment designed to enhance the performance of Large Language Models in realistic, interactive clinical diagnostic scenarios.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The framework’s design is innovative, particularly its emphasis on interactive, multi-turn conversations that closely mirror clinical decision-making processes. Unlike previous static, single-step evaluation environments, MedAgentSim requires agents to proactively query patients and systematically integrate diagnostic tests and medical imaging into their decision-making. This realism is reinforced by the simulation of dynamic doctor-patient interactions and the requirement for explicit diagnostic test requests. Finally, the evaluation is robust, covering a diverse array of widely recognized benchmarks and clearly demonstrating the superiority of the proposed method over baselines.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Despite strong efforts to replicate clinical interactions, the simulation environment remains simplified compared to real clinical practice. For instance, the complexity of actual patient behavior, nuanced symptom presentation, and clinical decision-making under uncertainty is partially abstracted. Furthermore, the diagnostic accuracy metric relies on an LLM-based evaluator to judge correctness, potentially introducing bias or variability into evaluations. A more rigorous external clinical validation or manual expert verification of diagnoses would further strengthen the robustness of the evaluation.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, the paper significantly advances interactive, dynamic LLM simulations in clinical settings, effectively bridging the gap between static benchmarks and real-world applicability, despite some realism limitations and evaluation pitfalls.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
MedAgentSim introduces an open-source simulated clinical environment featuring doctor, patient, and measurement agents to enhance and evaluate LLM performance in dynamic diagnostic settings. It innovates by requiring doctor agents to engage in multi-turn conversations with patients, requesting various medical examinations and imaging results to mimic real-world diagnostics. The framework incorporates self-improvement mechanisms, including multi-agent discussions, chain-of-thought reasoning, and experience-based knowledge retrieval, enabling progressive learning. An evaluation benchmark assesses the LLM’s ability to handle dynamic, context-aware diagnostic interactions. MedAgentSim supports both automated and user-controlled modes for flexible evaluations, demonstrating its effectiveness through comprehensive tests in diverse simulated scenarios.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The automatic diagnostic process proposed in MedAgentSim is innovative, as the system is fully automated while also supporting a user-controlled mode. It simulates the entire dynamic consultation process, demonstrating strong potential for practical applications.
2. The manuscript is well-organized and logically smooth, highlighting the strengths of the method and presenting a clear structure of the entire process. Furthermore, the MP4 video included in the appendix effectively illustrates the complete workflow of this study.
3. The visualizations in this paper enrich the experimental section, with Figures 3 and 4 clearly demonstrating the robustness of the proposed method and its advantages in handling erroneous cases.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The comparative experiments evaluate only general-domain language models, such as Llama, GPT, and Qwen, on medical tasks, without including a comparison with specialized medical language models.
2. The article’s layout has some issues; in particular, reference citations sometimes occupy a line by themselves, which should be avoided.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
None
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
None
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
1) Introduces a game-based hospital environment powered by open-source LLMs (e.g., LLaMA 3.3, Mistral) to simulate dynamic doctor-patient interactions.
2) Implements doctor, patient, and measurement agents that engage in multi-turn dialogues, mirroring real-world diagnostic workflows (e.g., requesting tests like X-rays, blood pressure checks).
3) Integrates memory buffers (Medical/Experience Records) and retrieval-augmented learning for iterative refinement of diagnostic strategies (see the illustrative sketch after this list).
4) Combines text-based dialogues with visual data (e.g., X-rays, MRIs) using VLMs like LLaVA 1.5.
5) Proposes a dynamic benchmark for assessing LLMs in context-aware clinical interactions, addressing gaps in static medical QA datasets.
6) Supports human-AI interaction modes, enabling real-time collaboration with doctor/patient agents.
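A minimal sketch of the experience-record retrieval idea mentioned in point 3, assuming a simple record format and a crude word-overlap similarity; the paper's actual memory buffers and retrieval mechanism may differ.

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceBuffer:
    # Each record: {"summary": str, "diagnosis": str, "correct": bool}
    records: list[dict] = field(default_factory=list)

    def add(self, summary: str, diagnosis: str, correct: bool) -> None:
        """Store the outcome of a finished consultation."""
        self.records.append(
            {"summary": summary, "diagnosis": diagnosis, "correct": correct}
        )

    def retrieve(self, query: str, k: int = 3) -> list[dict]:
        """Return the k most similar past cases by word overlap with the query."""
        q = set(query.lower().split())
        scored = sorted(
            self.records,
            key=lambda r: len(q & set(r["summary"].lower().split())),
            reverse=True,
        )
        return scored[:k]
```

Retrieved records would typically be prepended to the doctor agent's prompt as worked examples before the next consultation begins.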
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Dynamic multi-turn interactions and test-dependent data access replicate real-world clinical workflows better than static benchmarks.
- Memory buffers and experience replay enable LLMs to learn from past cases, improving accuracy over time (e.g., a 16.1% boost for LLaMA 3.3 with CoT + ensembling; see the sketch after this list).
- Incorporates medical imaging (via LLaVA 1.5), addressing a critical gap in prior text-only simulations.
- Avoids reliance on closed-source models (e.g., GPT-4o), promoting reproducibility and community-driven development.
- Quantifies model sensitivity to cognitive biases (Figure 3), highlighting fairness considerations for clinical AI.
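The chain-of-thought plus ensembling referenced above can be pictured as self-consistency-style majority voting over several sampled answers. The sketch below is only an illustration under that assumption (the `sample_answer` callable and the `DIAGNOSIS:` answer convention are hypothetical), not the paper's exact ensembling scheme.

```python
from collections import Counter
from typing import Callable

def ensemble_diagnosis(
    sample_answer: Callable[[str], str],  # returns "<reasoning>\nDIAGNOSIS: <name>"
    prompt: str,
    n_samples: int = 5,
) -> str:
    """Majority vote over independently sampled chain-of-thought answers."""
    votes = []
    for _ in range(n_samples):
        answer = sample_answer(prompt)
        # Keep only the final diagnosis line for voting.
        final = answer.rsplit("DIAGNOSIS:", 1)[-1].strip().lower()
        votes.append(final)
    return Counter(votes).most_common(1)[0][0]
```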
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Performance heavily relies on the base LLM’s medical knowledge (e.g., Mistral 24B underperforms LLaMA 3.3).
- Multi-agent ensembling and memory retrieval require significant GPU resources (4×A6000 GPUs), limiting scalability.
- Evaluations are confined to synthetic/benchmark data (NEJM, MedQA); real-world clinical validation is absent.
- Predefined datasets (e.g., MIMIC-IV) may not capture global demographic variability in symptoms or communication styles.
- Unsupervised “self-improvement” could propagate biases or errors stored in memory buffers without clinical oversight.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Innovation: MedAgentSim advances clinical LLM evaluation by integrating multi-agent dynamics, visual data, and self-improvement mechanisms, addressing critical limitations of static benchmarks.
Practicality: While the framework is technically robust, reliance on synthetic data and high compute costs hinder immediate real-world adoption.
Reproducibility: Open-source code, models, and benchmarks ensure transparency and community engagement.
Impact (8/10): The focus on bias analysis and progressive learning aligns with ethical AI priorities, though clinical validation remains pending.
Key Weakness: Lack of real-world deployment evidence and ethical safeguards for autonomous “self-improvement” lowers the score. However, the framework’s methodological rigor and open-source ethos make it a significant step toward realistic clinical AI simulations.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank all reviewers for their thoughtful feedback and their recommendations. Below we focus on the major concerns they raised.
- Simulation realism & clinical validation Critique (R1, R3): Our simulated doctor-patient interactions abstract away patient behavior nuances and rely solely on synthetic benchmarks, with no external clinical validation.
- User-controlled mode for expert evaluation: MedAgentSim’s “user-controlled” mode (Sect. 3.1) is designed exactly to enable human or clinical experts to step in, either to test the system or to be tested themselves, thereby providing a path for manual, real-world evaluation without altering the core pipeline.
- Manual review of accuracy logs: As noted in the manuscript (“Both the dataset conversion process and accuracy logs were manually reviewed to ensure reliability”), we manually verified a representative subset of the LLM-based judge’s binary verdicts, each of which assesses whether the natural-language diagnosis matches the ground-truth multiple-choice answer, and compared them against the original dataset labels to confirm fidelity.
- Future clinical pilots: We agree that large-scale external validation is essential; we are actively planning a small pilot study with clinical collaborators using the user-controlled mode, which will appear in a follow-up paper.
- Reproducibility & open sourcing Critique (R1, R2, R3): Submission lacks sufficient detail for reproduction; code/data release is only promised “upon acceptance.”
- Fully open-source repository: Our complete codebase, including environment setup, dataset conversion scripts, accuracy-logging utilities, and example prompts, is already available on a public GitHub repo. We will include a pointer to that repository in the camera-ready version.
- Detailed README and sample configs: The repository contains step-by-step instructions and sample configuration files to reproduce all experiments end-to-end.
- Comparison to specialized medical LLMs Critique (R2): Experiments evaluate only general-domain LLMs (LLaMA, GPT, Qwen) and omit specialized medical models.
- Following MedPrompt findings: We intentionally mirrored the setup of MedPrompt (PromptBase, Microsoft), which showed that generalist foundation models often outperform narrow-domain fine-tuned models in few-shot settings.
- Pipeline-centric focus: Our key contribution is the interactive, test-dependent pipeline; demonstrating strong gains on generalist models highlights its broad applicability.
- Future work: We fully agree that evaluating medical-foundation models (e.g., BioMedLM, BiMediX, MedPaLM) is valuable, and we plan such comparisons in a companion study.
- LLM-based evaluator bias Critique (R1): Reliance on an LLM evaluator for diagnostic accuracy may introduce bias/variability.
- Manual consistency checks: As above, whenever the LLM-based judge issues its binary “correct/incorrect” verdict, comparing the natural-language diagnosis to the ground-truth multiple-choice answer, we manually verify that this judgment aligns with the original dataset label (a minimal sketch of such a check follows below).
- Robust benchmark suite: We report results across multiple benchmark datasets (NEJM, MedQA, MIMIC-IV), and all exhibited the same performance ordering, indicating that evaluator variability did not drive the observed gains.
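As a rough illustration of the spot check described in the manual consistency item above, a sampled audit of the judge's verdicts against the dataset labels might look like the following; the field names and the `human_check` callable are hypothetical assumptions rather than the authors' tooling.

```python
import random
from typing import Callable

def audit_judge(
    cases: list[dict],                      # each: {"diagnosis", "gold_option", "judge_verdict"}
    human_check: Callable[[str, str], bool],  # human decides if diagnosis matches gold option
    sample_size: int = 50,
    seed: int = 0,
) -> float:
    """Return the fraction of sampled LLM-judge verdicts that a human agrees with."""
    rng = random.Random(seed)
    sample = rng.sample(cases, min(sample_size, len(cases)))
    if not sample:
        return 0.0
    agree = sum(
        human_check(c["diagnosis"], c["gold_option"]) == c["judge_verdict"]
        for c in sample
    )
    return agree / len(sample)
```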
- Compute requirements & scalability Critique (R3): Multi-agent ensembling and memory retrieval require 4×A6000 GPUs, limiting scalability.
- Peak-accuracy goal: Our primary aim was to demonstrate the maximum attainable performance of this pipeline.
- Under-utilized capacity: In practice, GPU utilization plateaued at 60-70% (not 100%), and many steps can be performed offline or batched.
- Light-weight variants: We are exploring distilled ensembles and single-model memory retrieval as future enhancements.
- Minor formatting & layout
- We have tightened citation placements to avoid isolated lines (R2), improving flow.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A