Abstract

Large language models (LLMs) have demonstrated considerable potential in automating assignment scoring within higher education, providing efficient and consistent evaluations. However, existing systems encounter substantial challenges when assessing students’ responses to open-ended short-answer questions. These challenges include the need for large, annotated datasets for fine-tuning or additional training, as well as inconsistencies between model outputs and human-level evaluations. This issue is particularly pronounced in domains requiring specialized knowledge, such as dentistry. To address these limitations, we propose DentEval, an LLM-based automated assignment assessment system supporting multimodal inputs (e.g., text and clinical images) that is tailored for dental curricula. This framework integrates role-playing prompting and Self-refining Retrieval-Augmented Generation (SR-RAG) to assess student responses and ensure that the system’s outputs closely align with human grading standards. We further utilized a dataset annotated by dental professors, dividing it into few-shot learning and testing sets to evaluate the DentEval framework. Results demonstrate that DentEval exhibits a stronger correlation with human grading compared to representative baselines. Finally, comprehensive ablation studies validate the effectiveness of the individual components incorporated in DentEval. Our code is available on GitHub at: https://github.com/DXY0711/DentEval

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1271_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/DXY0711/DentEval

Link to the Dataset(s)

N/A

BibTex

@InProceedings{DenXin_DentEval_MICCAI2025,
        author = { Deng, XinYu and Miletic, Vesna and Trinh, Elvis and Gao, Jinlong and Xu, Chang and Liu, Daochang},
        title = { { DentEval: Fine-tuning-Free Expert-Aligned Assessment in Dental Education via LLM Agents } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {143 -- 152}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a multimodal, LLM-based framework designed to simulate human-like reasoning when evaluating student responses to dental questions, with potential for broader educational applications.

    The main contributions of DentEval, an automated LLM-based homework assessment framework tailored to dental curricula, are as follows:

    • Self-Refined Retrieval Augmented Generation (SR-RAG):

    Innovation: Introduction of a new framework that refines the retrieved results and autonomously assesses their relevance.

    • Role-Playing and Sample Answer Generation (SAG) Module:

    Innovation: Use of a SAG module in which an LLM plays the role of a teacher to generate reference answers.

    The framework operates in three main stages: (1) processing four inputs — the query, a related figure, the student’s answer, and the marking rubric; (2) retrieving and refining relevant information from the inputs and dental handbooks to generate reference answers; and (3) assessing the student’s response by comparing it to the reference answers using few-shot learning, ultimately producing a final score through majority voting.
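
    To make the staged workflow concrete, the following is a minimal Python sketch of such a pipeline. It assumes a hypothetical `call_llm` helper standing in for any multimodal chat-completion API; the prompt wording, the 0–10 scale, and the naive top-k retrieval are illustrative placeholders, not the authors' actual SR-RAG or evaluator implementation.

    ```python
    from collections import Counter

    def call_llm(prompt: str, image_path: str | None = None) -> str:
        """Hypothetical wrapper around any multimodal chat-completion API (e.g. GPT-4o)."""
        raise NotImplementedError("plug in an LLM provider here")

    def retrieve_and_refine(query: str, handbook_chunks: list[str]) -> str:
        """Stage 2 (simplified): retrieve handbook passages and let the LLM refine them."""
        context = "\n".join(handbook_chunks[:5])  # naive top-k retrieval stands in for SR-RAG
        return call_llm(
            "Refine the following passages so they directly answer the question.\n"
            f"Question: {query}\nPassages:\n{context}"
        )

    def grade(query: str, student_answer: str, rubric: str,
              references: list[str], figure_path: str | None = None) -> int:
        """Stage 3 (simplified): score the answer against each reference, then majority-vote."""
        scores = []
        for ref in references:
            reply = call_llm(
                "Act as a dental professor. Using the rubric, score the student's answer "
                "from 0 to 10 against the reference answer. Reply with the number only.\n"
                f"Rubric: {rubric}\nReference: {ref}\n"
                f"Question: {query}\nStudent answer: {student_answer}",
                image_path=figure_path,
            )
            scores.append(int(reply.strip()))
        return Counter(scores).most_common(1)[0][0]  # majority vote across references
    ```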

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Addresses the challenge of acquiring sufficient domain-specific knowledge without having to fine-tune the model on specialized data.

    The use of role-playing and benchmark responses generated by LLM agents allows for close alignment of system results with human grading standards.

    DentEval operates without requiring significant computing resources or retraining, making it more accessible and cost-effective.

    Ablation study

    Table 1 and the results are well explained.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    As it stands, the paper is not reproducible because it lacks explanations (prompting techniques, the diagram in Figure 1, the sufficiency check, few-shot learning, etc.).

    The system relies on several complex steps, including knowledge retrieval, information refinement, and reference response generation, which makes it too difficult to use in real cases.

    The quality of assessments is highly dependent on the quality of the input data, including the student’s response, the rubric, and information extracted from the dental textbook.

    Although the diversity of reference responses is an advantage, it can also introduce variability into assessments. Aggregating scores by majority vote may not always guarantee a consistent assessment, especially if the generated reference responses are highly diverse.

    The novel SAG approach is good but deserves more explanation as well: why does it improve quality? In the experiments, what is N? And how should N be chosen to be optimal?

    Limited value for the MICCAI conference: the imaging part is not central and, in some configurations, is not present in the framework at all; there is no clinical application and no diagnosis.

    Two questions aren’t enough to make meaningful claims about performance or reliability. It is difficult to convincingly show that a framework works broadly if it’s only tested on two examples.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Page 3: the authors state that the framework requires three inputs, while the caption of Figure 1 states that four inputs are required.

    The paper does not provide information about the LLM professor versus the LLM evaluator: do they use different prompting?

    Table 2 is not useful, as it repeats the same information as the text.

    Page 7: the sentence should not be interrupted by Table 1.

    For the evaluation, we would expect results across diverse or standardized datasets, but in the paper the evaluation is very limited.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an automated assessment framework for dental domain; however, there are several critical issues that limit its suitability for acceptance at this conference:

    Insufficient Evaluation: The primary concern lies in the evaluation methodology. The framework is assessed using a dataset comprising only two questions. This extremely limited sample size is insufficient to draw any meaningful or generalizable conclusions about the effectiveness, robustness, or applicability of the proposed method. A more comprehensive evaluation—quantitative and/or qualitative—across diverse and representative data would be necessary to support the paper’s claims.

    Misalignment with Conference Scope – Image Computing: While the paper includes images as an input modality, the role of image data is secondary and not central to the contribution. The focus is not on image processing techniques, algorithms, or advancements. The paper does not contribute new techniques or insights specific to image analysis, which is core to this conference’s domain.

    Limited Relevance to Medical Applications: While medical questions are used in the evaluation, the primary goal of the paper is to assess general framework capabilities, not to solve a specific medical problem. The contribution is therefore more methodological than application-driven, and does not clearly address a concrete need in the medical imaging domain.

    Lack of Reproducibility and Methodological Clarity: The paper lacks sufficient detail to enable reproduction of the proposed framework. Key components of the workflow are under-described, including how few-shot learning is implemented, how the different LLMs are used in the roles of professor or evaluator, and how data flows through the system. This lack of clarity impedes both verification and potential extension of the work.

    Taken together, the limited experimental validation and misalignment with both the technical and application scope of the conference make this submission unsuitable for acceptance in its current form.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    It would be better to take the time to revise the paper and submit it next time, in order to produce a high-quality paper.



Review #2

  • Please describe the contribution of the paper

    The paper addresses the problem of automatic scoring for dental evaluations with LLMs and multi-modal inputs that incorporate text and images. The proposed model incorporates a multi-agent solution that integrates self-refining retrieval-augmented generation and role-playing to address domain-specific requirements for generating adequate grades in the dentistry domain.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses an interesting topic that aims to extend the applications of LLMs into automatic grading in the medical domain, particularly in dentistry.

    • While the application is domain-specific, the proposed model could be employed with different tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Some points of the work can be clarified, for example:

    • Would it be possible to provide more details regarding the sufficiency check? It is mentioned that an LLM is used to determine if the retrieved information is “sufficient” to generate a high-quality reference answer. How is this decision process performed, and what is considered “sufficient information”?

    • How many questions of type 1 and type 2 are in the dataset?

    • I do not see a reason to avoid using the model for different applications. It is possible that the same architecture could be employed for automatic grading in other domains; what would be the requirements or potential limitations in these cases?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an application of LLMs for automatic grading; while the applications presented in the paper are domain-specific (to my understanding), they can potentially be extended to other domains. Some points of the methodology could be clarified, as could the description of the dataset employed. In particular, I am not sure how big the dataset is in terms of diversity and number of questions.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After reading the response and the other reviews, a suggestion for improving the manuscript is to expand the number of questions used in the evaluation. To my understanding, only two questions were employed, which may limit assessment of the model’s capacity to handle questions drawn from a wider variety of topics. However, the experiments were performed with a dataset that includes responses from 28 users for each category. While the number of questions is a limitation of the study, the number of responses included in the analysis can provide insight into the model’s capability to handle variation in response style across individual users. After considering the rebuttal, I will keep my initial tendency to accept the paper.



Review #3

  • Please describe the contribution of the paper

    The paper’s primary contribution is a fine-tuning-free, expert-aligned framework that leverages LLM agents, retrieval-augmented generation, and role-playing to address challenges in automated assessment for specialized education, achieving both high accuracy and efficiency.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. SR-RAG enables autonomous knowledge enrichment, eliminating the need for expensive fine-tuning on large datasets while ensuring accuracy in interpreting nuanced dental terminology.
    2. By generating diverse references, SAG aligns the system with human grading practices, where experts recognize multiple valid reasoning paths. This reduces bias in scoring and improves accuracy, especially for questions with subjective or multi-perspective answers.
    3. Ablation studies provide clear insights into the framework’s design, ensuring reproducibility and guiding future iterations.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. While the paper highlights multimodal support for text and clinical images, the integration of visual content is relatively simple. The system relies on GPT-4o’s generic multimodal capabilities but does not develop specialized visual processing in the dental context.
    2. While the data is annotated by dental professors, the small sample size and narrow domain raise concerns about generalizability to other dental subfields such as orthodontics and oral surgery as well as more complex clinical conditions.
    3. While role-playing prompts are a key innovation, the paper does not address the subjectivity of expert grading or potential biases in how LLMs simulate professor roles.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The recommendation is due to the paper’s novel contributions to expert-aligned assessment in dental education, specifically the integration of SR-RAG and role-playing prompts to address domain-specific challenges without fine-tuning. However, existing weaknesses limit its broader impact.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The author addresses the concerns from reviewers in a satisfactory manner.




Author Feedback

Many thanks to the reviewers. We address the major concerns below.

<R1: Sufficiency Check> A dedicated Judge LLM assesses whether the retrieved evidence is sufficient to answer the question, prompted with: “Based on the question, evaluate whether the following information is sufficient to give a full mark answer.” This process is grounded in a theoretical framework (Ref: arXiv:2411.15594) and was validated by dental academics, who confirmed that the judgments align with expert assessments.
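
A hedged sketch of how such a check could look in code follows. Only the quoted prompt comes from the rebuttal; the yes/no reply format, the retry loop, and the `call_llm` and `retrieve` helpers are assumptions for illustration, not the paper's exact procedure.

```python
def is_sufficient(question: str, retrieved_info: str, call_llm) -> bool:
    """Judge-LLM sufficiency check (sketch). The prompt wording follows the rebuttal;
    forcing a yes/no reply is an assumption made here for easy parsing."""
    prompt = (
        "Based on the question, evaluate whether the following information is "
        "sufficient to give a full mark answer. Reply 'yes' or 'no'.\n"
        f"Question: {question}\nInformation:\n{retrieved_info}"
    )
    return call_llm(prompt).strip().lower().startswith("yes")

def self_refining_retrieve(question, retrieve, call_llm, max_rounds=3):
    """If the judge deems the evidence insufficient, broaden retrieval and try again."""
    info = retrieve(question)
    for _ in range(max_rounds):
        if is_sufficient(question, info, call_llm):
            break
        info += "\n" + retrieve(question + " (broaden the search)")  # illustrative only
    return info
```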

<R1 R2 R3: Diversity of Dataset> We respectfully disagree with R3’s claim that the dataset is “not meaningful.” It includes two questions (Fig. 2, Page 5), each with 28 student responses. The questions form two subsets that differ both structurally and substantively: (1) One is text-only, while the other requires multimodal (text + image) reasoning. (2) They cover distinct dental domains, such as morphology and materials. (3) Responses vary in length and complexity, requiring flexible, domain-aware evaluation. To our knowledge, this is the first and only dataset specifically constructed for automated assessment in dental education, and no comparable datasets currently exist for direct benchmarking. As such, it provides a necessary foundation for future work in this underexplored domain. Our evaluation is conducted at the response level, regardless of the number of questions, and the proposed method outperforms baselines with statistically significant results (p=0.0001 in Table 1, Page 7).

<R1: Generalizability to Other Domains> Our framework is generalizable. Adapting it to new domains mainly requires domain-specific rubrics and a relevant knowledge base for retrieval and evaluation.

<R2: Multimodal Support> The core innovation lies in SR-RAG and role-playing agent design (Page 2). Multimodal input improves flexibility for diverse question types. The modular structure allows GPT-4o to be replaced by more specialized vision-language models (e.g., Gemini 1.5; Ref: arXiv:2403.05530) without changing the overall architecture or evaluation pipeline.

<R3: Reproducibility> We acknowledge the concern. All prompts and code will be released upon publication to ensure reproducibility.

<R3: System Complexity and Real-World Applicability> We disagree with the claim that the system is too complex. The framework uses an elegant prompting-based design, avoids fine-tuning, and suits education settings with limited resources (Table 2, Page 7). All components, including rubrics, responses, and handbooks, are drawn from authentic dental education settings, and the evaluation results (Page 7) validate their practical effectiveness.

<R3: Relying on Input Data Quality> The experiments use raw data from actual dental education settings, without any quality-enhancing preprocessing. This reflects deployment conditions and demonstrates strong performance under realistic constraints (Table 1, Page 7).

<R3: Diversity of Reference Responses and why SAG improves quality> We respectfully disagree with the claim that diverse references reduce consistency. The concern reflects a misunderstanding. For questions with multiple valid answers (e.g., Question 2, Fig. 2), reference diversity improves fairness by capturing acceptable variation. Majority voting mitigates individual bias and enhances consistency. This design also supports the effectiveness of Sample Answer Generation (SAG), which produces varied, high-quality references aligned with authentic student reasoning.
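
As an illustration of how SAG-style reference generation could feed the later voting step, here is a minimal sketch. The professor-role prompt wording, the choice of N, and sampling at non-zero temperature are assumptions for illustration rather than the paper's exact procedure.

```python
def generate_references(question: str, rubric: str, n: int, call_llm) -> list[str]:
    """SAG sketch: an LLM in a 'professor' role drafts N diverse full-mark answers."""
    prompt = (
        "You are a dental professor. Write a full-mark answer to the question below "
        "that satisfies every rubric criterion, using one distinct, valid line of reasoning.\n"
        f"Question: {question}\nRubric: {rubric}"
    )
    # Sampling the same prompt N times (ideally at non-zero temperature) yields diverse
    # but rubric-compliant references; each is then used for an independent scoring pass,
    # and the per-reference scores are aggregated by majority vote.
    return [call_llm(prompt) for _ in range(n)]
```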

<R3: Relevance to MICCAI> This work is closely aligned with the growing interest in agent-based large multimodal model frameworks within the MICCAI community. Recent MICCAI 2024 papers, such as MRScore (Liu et al., 2024), show how LLMs can support medical report evaluation. Building on this emerging direction, our framework exhibits strong generalizability and can be extended to clinical applications, diagnostic reasoning, and medical education.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The proposed pipeline, although somewhat complex, is intuitive in combining self-refined RAG knowledge and role-playing prompting techniques to approximate expert-level assessments. However, the authors should seriously consider the concerns raised by the reviewers, since the generalizability of the evaluation pipeline is not studied in the present work.


