Abstract

Recently, large language models (LLMs) have been increasingly utilized for decision support across various domains. However, owing to their probabilistic nature and the diversity of their training data, LLMs can generate inaccurate or fabricated information, a phenomenon known as "hallucination". This issue is particularly problematic in fields such as medical diagnosis, where accuracy is crucial and the margin for error is minimal. The risk of hallucination is exacerbated when patient data are incomplete or vary across clinical departments. Consequently, using LLMs directly for clinical decision support presents significant challenges. In this paper, we introduce ProCDS, a system that integrates Prolog-based rule diagnostics with LLMs to enhance the precision of clinical decision support. ProCDS first converts medical protocols into a set of rules and patient information into facts. An update cycle then extracts and revises the related facts and rules to resolve discrepancies and missing patient information. Next, the Prolog engine performs logical inference to produce the response. If the engine cannot produce a definite result, ProCDS runs another iteration of the fact-and-rule update to repair the potential mismatch and performs inference again. Through this iterative neuro-symbolic process, ProCDS delivers transparent and accurate clinical decision support. We evaluated ProCDS on real-world clinical scenarios for Obstructive Sleep Apnea Hypopnea Syndrome (OSAHS) as well as on general logical reasoning benchmarks, achieving high accuracy and reliability. Our project page is available at: https://github.com/testlbin/procds.
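To give a rough sense of the fact-and-rule representation described above, here is a minimal Python sketch (the actual system compiles to Prolog; the predicates, patient ID, and rule below are invented for illustration):

```python
# Hypothetical sketch of the fact/rule representation ProCDS builds.
# Facts the LLM might extract from a patient record (predicate, argument) pairs.
facts = {("bmi_over_30", "p1"), ("snoring", "p1"), ("daytime_sleepiness", "p1")}

def high_risk_osahs(patient, facts):
    """Toy diagnostic rule: high risk if BMI > 30 plus two clinical signs."""
    required = [("bmi_over_30", patient),
                ("snoring", patient),
                ("daytime_sleepiness", patient)]
    return all(f in facts for f in required)

print(high_risk_osahs("p1", facts))  # True
```

In the real system this rule would be a Prolog clause, and the iterative update cycle would add or revise facts whenever the engine fails to reach a conclusion.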

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2399_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/testlbin/procds

Link to the Dataset(s)

N/A

BibTex

@InProceedings{TanXia_PrologDriven_MICCAI2025,
        author = { Tan, Xiaoyu and Li, Bin and Xu, Weidi and Qu, Chao and Chu, Wei and Xu, Yinghui and Qi, Yuan and Qiu, Xihe},
        title = { { Prolog-Driven Rule-Based Diagnostics with Large Language Models for Precise Clinical Decision Support } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {412 -- 422}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • Proposes a two‑stage framework that adaptively refines rules and facts through feedback from a Prolog engine.
    • Demonstrates the approach on an obstructive sleep apnea–hypopnea syndrome (OSAHS) diagnostic task.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The adaptive rule‑refinement pipeline is conceptually sound and clearly outperforms baseline approaches.
    • Baseline selection and evaluation metrics are appropriate and well explained.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It is unclear why the paper includes evaluations on non-medical reasoning tasks. Although these additional benchmarks are executed competently, they seem tangential to MICCAI’s focus and may be better placed in a different publication.
    • Key methodological details are missing. For example, in the masking experiment: 1) Why was 10 % of the cohort masked? 2) How were the masked “pieces of information” selected, and by how many experts? 3) Can inter‑rater reliability be reported?
    • The statement “GPT‑3.5‑Turbo was able to correct most errors” is vague. Consider rephrasing for a more precise claim or quantifying “most.”
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Please proofread the manuscript to eliminate typos and informal phrasing.
    • Subsections in Section 3.1 can be better organized for a clearer flow.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The framework is interesting and clinically relevant, but gaps in methodological detail and scope dilute the contribution. Strengthening the Methods section and focusing the evaluation on clinically meaningful benchmarks would significantly improve the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns were sufficiently addressed in the rebuttal. I find the contributions to be meaningful and clearly presented, and I gave a recommendation to accept.



Review #2

  • Please describe the contribution of the paper

    This paper presents a novel framework that combines Prolog-based rule diagnostics with LLMs to improve the accuracy of clinical decision support. It refines rules and facts using feedback from the Prolog engine to increase adaptability and precision.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The overall idea of integrating Prolog engine to enhance the reasoning adaptability is well-motivated, considering the current challenges.
    2. The performance improvement of the proposed framework is impressive.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Several terms are unclear or ambiguous. For instance, “reasoning steps c” in section 2.1 and “COT” in section 3.2 are introduced without prior context or explanation.
    2. Please explain the necessity of the two rounds of prompting in Stage 1. If the first extraction might be insufficient, does that imply the second round could also be insufficient? It’s better to provide a comparison of the results with and without the iteration.
    3. Please explain the adaptation details for Proofwriter and GSM8K datasets, especially for how the rules and facts are defined within this framework for these specific datasets.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Please see the weakness.
    2. It’s better to provide experiments on the pure Llama3 backbone, as the paper has already applied the framework to this backbone and demonstrated its performance.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Limited novelty; lack of discussion of several steps of the framework design.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a ProCDS system that integrates the Prolog logic programming language with Large Language Models (LLMs), aiming to enhance the precision of clinical decision-making. Notably, this approach significantly reduces the issue of hallucinations by LLMs in this field. The proposed system is divided into two stages: extracting relevant rules and facts, and iteratively refining the neuro-symbolic ensemble. The iterative reasoning and error correction mechanisms notably improve the system’s accuracy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The integration of Prolog and Large Language Models (LLMs) significantly reduces the issue of hallucinations.

    2. The introduction of dynamic updates and iterative reasoning further enhances the accuracy of diagnoses.

    3. By integrating LLMs, the system’s capabilities naturally improve over time as the LLMs’ performance advances.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Please provide a more detailed description of the specific inputs and outputs of the trained neural network. This should include, but not be limited to, the type, format, and expected output results.

    2. Combining Prolog logic engines with LLMs is not entirely novel. The authors’ primary innovation lies in its application to clinical decision support. However, a key issue needs to be addressed before this method can be applied in clinical practice: dependency on high-quality prompts. The paper does not provide a detailed discussion of how variations in prompt design affect model performance. In practical applications, prompt design is crucial for obtaining accurate and relevant responses, so further research is needed to optimize prompt design for different application scenarios.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Compared to the weaknesses, the strengths are more prominent. Therefore, the overall score is Accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Q6 lists the major strengths. The author also explains some questions about the details in the rebuttal.



Review #4

  • Please describe the contribution of the paper

    This paper proposes a method that uses Prolog as a guide to improve the outcomes of LLMs. The authors tested their method on both clinical scenarios and general reasoning tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of a Prolog engine to execute Prolog programs generated from rules and facts has proven effective in improving LLM outcomes.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While it is commendable that the authors included two general tasks to showcase the generalizability of their method, the connection between these tasks and the core medical application is not clearly explained. As a result, the paper’s focus feels somewhat diluted, and the motivation for these auxiliary experiments could be better justified.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the connection between the general tasks and the main application scenario could be more clearly articulated, I consider this a minor limitation. Overall, the paper makes a valuable contribution and meets the standards for acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We appreciate the reviewers’ valuable feedback. We are grateful for the positive evaluations and have thoroughly addressed all raised concerns. To enhance clarity and avoid redundancy, we organize our responses into three focused sections corresponding to shared reviewer questions.

I. Clarification Details for GSM8K and ProofWriter

We appreciate the reviewers’ questions (#3, #5, #6) regarding the inclusion of GSM8K and ProofWriter. Below we clarify the motivation and implementation details:

  1. Motivation. We included GSM8K and ProofWriter in the evaluation for two key reasons. Generalizability: demonstrating strong performance on these widely used, high-difficulty datasets shows that ProCDS is not limited to a single domain but can transfer to varied reasoning tasks. Clinical relevance: real-world medical decision-making often requires multi-step logical and numerical reasoning (e.g., dosage calculations, rule-based differential diagnosis). Excelling on GSM8K and ProofWriter therefore indicates ProCDS’s capacity to tackle the same complexity in future, more demanding clinical scenarios.
  2. Dataset Adaptation. For ProofWriter, each problem provides premises and “if–then” rules. These are converted into Prolog facts and rules, enabling logical deduction chains fully compatible with Prolog inference. For GSM8K, we extract entities, quantities, and relations as logical facts, and represent arithmetic operations using Prolog rules or built-in predicates, enabling step-by-step symbolic reasoning.
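The ProofWriter adaptation described above can be illustrated with a toy example. The sketch below is hypothetical Python standing in for Prolog clauses: the premise and rule are invented, and a simple forward chainer plays the role of the Prolog engine.

```python
# Toy ProofWriter-style adaptation: premises become facts and "if-then" rules
# become Horn clauses over one variable (invented example).
facts = {("cold", "bob")}
# "If something is cold then it is furry"  ->  furry(X) :- cold(X).
rules = [("cold", "furry")]

def forward_chain(facts, rules):
    """Derive new facts until a fixed point is reached."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, arg in list(derived):
                if pred == premise and (conclusion, arg) not in derived:
                    derived.add((conclusion, arg))
                    changed = True
    return derived

print(("furry", "bob") in forward_chain(facts, rules))  # True
```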

II. Clarification of Experiment Design in ProCDS

We thank reviewer #5 for requesting further explanation of our masking and error-correction experiments. Below we detail the key aspects:

  1. Masking experiment details. To simulate real-world data imperfections, we designed the following masking strategies. 10% mask: chosen to mimic the rate of missing fields in real EHRs and to test general robustness. Key-field mask: “critical” fields (e.g., BMI, weight) masked in 10% of records; these fields influence ≥3 diagnostic rules. Three clinicians selected the key fields by consensus, following clinical guidelines.
  2. Error-correction quantification. As shown in Table 2, GPT-3.5-Turbo reduced error cases by an average of 57%. We have replaced the vague phrase “most errors” with this exact figure.
  3. Proofreading and structure. We will correct all typos and reorganize Section 3.1 to improve clarity.

III. Clarification of Methodology

We thank reviewers #1 and #6 for requesting further explanation of the methodology and terminology. Below we detail the key aspects:

  1. Inputs and Outputs Inputs: Patient EHR records and associated medical-protocol text. After preprocessing, the LLM translates these natural-language inputs into structured facts and rules (e.g., age, BMI, clinical signs). Outputs: A diagnostic label (e.g., “High-Risk OSAHS”) and an explicit reasoning path. Results are returned in structured formats (text or JSON), listing the diagnosis along with supporting facts and rules.
  2. Prompt Robustness and Design. Prompt design plays a crucial role in system performance. ProCDS enhances robustness through: Two-round prompting: in Stage 1, two rounds of prompting are used to improve the comprehensiveness and accuracy of fact and rule extraction. This helps mitigate extraction errors caused by LLM limitations such as parameter size or generalization bias. Iterative error correction: a refinement mechanism dynamically revises extracted facts and rules, compensating for potential errors or incompleteness. This iterative process further improves reasoning accuracy and system robustness. While prompt optimization remains an open challenge, we plan to explore broader prompt strategies in future work.
  3. Chain-of-Thought Explanation. We have clarified this term and provided its full name and definition in Sections 2.1 and 3.2. “CoT” refers to “Chain-of-Thought,” a prompting strategy for stepwise reasoning; the “reasoning steps c” are the logical steps generated in this process.
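As an illustration of the structured output described in point 1 above, a hypothetical JSON payload might look as follows (the field names and Prolog snippets are invented; the paper does not specify an exact schema):

```python
import json

# Hypothetical ProCDS-style structured output: a diagnostic label plus the
# facts and rules that support it (all names invented for illustration).
result = {
    "diagnosis": "High-Risk OSAHS",
    "supporting_facts": ["bmi(p1, 32).", "snoring(p1)."],
    "supporting_rules": [
        "high_risk_osahs(P) :- bmi(P, B), B > 30, snoring(P)."
    ],
}
print(json.dumps(result, indent=2))
```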




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    N/A

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


