Abstract
Delays in processing urgent cancer referrals hinder compliance with the Faster Diagnostic Standard (FDS), with manual extraction of patient data (demographics, symptoms and test results) remaining a bottleneck in colorectal two-week wait (2WW) pathways. We evaluate generative AI (GenAI) for automating structured data extraction from colorectal cancer (CRC) 2WW referrals, comparing the reasoning capabilities of GPT-4o-Mini and DeepSeek-R1 against clinician-led extraction. Both models achieved near-human precision (GPT-4o-Mini: 94.83%, DeepSeek-R1: 93.72%) while reducing processing time 10-fold. Key challenges included non-deterministic output, OCR noise (e.g. handwritten annotations, overlapping text), and contextual ambiguity, notably misclassified checkboxes, symptom misattribution, and numerical inconsistencies (e.g. fecal immunochemical test (FIT) unit conversions). We also propose an uncertainty quantification mechanism that flags uncertain extractions for human review. Despite residual limitations, GenAI shows the potential to improve efficiency, standardisation, and equity in cancer pathways by alleviating administrative burdens. Future work should prioritise hybrid AI-clinician workflows, domain-specific fine-tuning, and real-world validation to ensure reliable clinical integration.
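As a quick illustration of the confidence-based flagging idea described in the abstract, the minimal sketch below routes low-confidence fields to human review. The field names, confidence values, and threshold are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed field names and threshold) of routing uncertain
# extractions to clinician review; not the authors' released code.

REVIEW_THRESHOLD = 0.80  # assumed cut-off below which a field is flagged

def flag_for_review(extraction: dict) -> list:
    """Return the extracted fields whose confidence falls below the threshold."""
    return [
        field
        for field, result in extraction.items()
        if result.get("confidence", 0.0) < REVIEW_THRESHOLD
    ]

# Example: per-field values and confidences as they might come out of the LLM step.
extraction = {
    "fit_value_ug_g": {"value": 42, "confidence": 0.66},   # ambiguous unit conversion
    "age": {"value": 71, "confidence": 0.98},
    "rectal_bleeding": {"value": True, "confidence": 0.91},
}

print(flag_for_review(extraction))  # -> ['fit_value_ug_g']
```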
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4862_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/bilalcodehub/swiftcare-ai
Link to the Dataset(s)
N/A
BibTex
@InProceedings{AbiSof_RAPTOR_MICCAI2025,
author = { Abioye, Sofiat and Ashraf, Shazad and Qadir, Junaid and Byfield, Adam and Jose, Anusha and Poulett, William and Wallace, Ben and Butt, Adil and Forde, Colm and Mottershead, Marcus and Fallis, Simon and Beggs, Andrew and Bhangu, Aneel and Akanbi, Lukman and Bilal, Muhammad},
title = { { RAPTOR: Generative AI for Parsing Colorectal Cancer Referrals to Streamline Faster Diagnostic Standard Pathways } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors of this manuscript develop a method (RAPTOR) for semi-structured data extraction from synthetic colorectal cancer two-week wait referrals. The authors quantify the time and cost savings of their method against the current non-automated workflow for processing referrals in the NHS. The manuscript also presents a robust risk quantification of the method for use in both research and clinical practice.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors present very robust risk quantification of their method and focus greatly on the clinical impact, two extremely important areas which are missing from a great deal of MICCAI papers. The overall method seems sound.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Lack of detail about the integrity of their data splits, lack of information about their dataset, and overly strong claims about the potential impact of their method once deployed.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Please be sure to define all acronyms (e.g., CRC, 2WW, UOM) upon first use in the main text, regardless of whether they were defined in the abstract.
- I would recommend the title of the manuscript be amended to include “colorectal cancer” or some other term that indicates that this framework has only been developed and tested on colorectal cancer data. The authors make overly strong claims about the potential for their method to transfer well to other domains, but don’t provide any evidence for this statement.
- Much more information needs to be given on the synthetic dataset used for model development and validation. This is possibly best suited for supplementary information, but it needs to be present so the reader can contextualize the reported results. How many synthetic cases did each clinician contribute? Was there any sort of review of the synthetic cases by other clinicians? How was the quality of the synthetic cases judged? Was there any consideration of congruency between patient demographics and symptoms/diagnostic testing/medical history? The results on this dataset being indicative of performance in the wild would be much more convincing with some more details about how carefully it was generated to mirror real referrals.
- If the authors used synthetic data to avoid complications of dealing with PHI and have reportedly already deposited their data into an online repository, why are the data not being released upon publication of this paper? A compelling justification is needed.
- The tables in this manuscript are very poorly formatted and hard to read.
- This manuscript would be enhanced by the use of a figure to describe the referral process both with and without RAPTOR, showing how the method can be integrated into the clinical workflow and the benefits it provides in the diagnostic pathway of colorectal cancer patients.
- The authors state they “fine-tuned Google Document AI” OCR for their purposes but fail to provide details on which dataset(s) were used for fine-tuning, training configurations, image preprocessing, etc.
- It seems that with a synthetic dataset, much more data could have been generated for use than the 111 examples. Unless this dataset was generated for another use (e.g., teaching), in which case the authors need to specify the intended use of the dataset, as opposed to how they are making use of it.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Unless the authors can clarify the integrity of their data splits, provide more information about their dataset (possibly through supplemental information) and pull back some of the premature claims they make about the impact of their method once deployed, I do not think this manuscript is acceptable for presentation at MICCAI. With these changes applied during the rebuttal period, I believe this manuscript could be made acceptable.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have prepared reasonable responses to the critiques from me and the other reviewers. They address their splitting strategy, have elected to limit the scope of their manuscript to colorectal cancer through amendment of their title, and have agreed to release their dataset upon publication. I believe this paper is well-suited for acceptance to MICCAI at this point.
Review #2
- Please describe the contribution of the paper
The authors investigate the use of LLMs for automated extraction of structured data from CRC 2WW referrals. The paper demonstrates a 10-fold reduction in processing time compared with manual extraction of structured information by clinicians. The method uses uncertainty quantification to flag low-quality extractions, which allows clinicians to intervene and prioritise reviews. The research is well designed, taking account of the risks of deploying AI solutions and building on the NHS’s AI Quality Community of Practice (AIQCoP).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors show that the method performs well, achieving a good level of accuracy and ultimately delivering a significant 10-fold speed-up over human extraction. The overall architecture is well designed, scalable, and adaptable to other referrals, so there is clearly potential for broad application in clinical workflows. Many models do not quantify the uncertainty in their predictions. The proposed method uses uncertainty quantification to identify low-quality extractions, which represents a sensible way of managing risk when deploying AI models in practice.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The work relies on synthetic data generated by experienced clinicians. Given the privacy considerations, this seems like a reasonable approach, but it would still be important to evaluate the method on real-world data. The data are colorectal cancer two-week wait referrals, and depending on referrals from a single domain is another limitation of the work. Robustness and generalisation would be improved by evaluating on different referral domains. Only two models were used, GPT-4o-Mini and DeepSeek-R1. These are reasonable choices, but there are many others, including open-source models; why were only these two chosen for evaluation?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is clear and well written. Despite the limitations, I think this is a significant contribution. The work is well designed, the experiments are implemented well and the results are significant.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper evaluates the use of generative AI, specifically GPT-4o-Mini and DeepSeek-R1, for automating the extraction of structured clinical data from colorectal cancer (CRC) two-week wait (2WW) referral forms. The authors simulate a real-world triage environment using a synthetically generated dataset that captures the variability and challenges of NHS referrals, including OCR noise, inconsistent terminology, and handwritten annotations.
The study demonstrates that both models achieve near-human accuracy, with GPT-4o-Mini slightly outperforming DeepSeek-R1 overall. Importantly, the authors propose a hybrid human-AI workflow, incorporating confidence-based uncertainty quantification to flag low-confidence outputs for clinical review. This layered approach balances efficiency with safety and trust, key factors in clinical AI deployment.
The paper also offers a structured methodology for AI-assisted referral parsing and risk mitigation strategies aligned with NHS governance protocols, making it a timely and practical contribution to the field of clinical AI implementation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
One of the strongest aspects of this paper is its clear focus on a real-world clinical challenge, delays in processing urgent cancer referrals, and its practical application of generative AI to streamline this process. The study addresses a critical bottleneck in the NHS Faster Diagnostic Standard (FDS) pathway and provides a thoughtful, well-executed evaluation of two large language models (GPT-4o-Mini and DeepSeek-R1) for structured data extraction from CRC 2WW referrals.
The use of a synthetically generated dataset that mirrors real referral complexities (including handwritten notes, ambiguous checkbox states, and inconsistent formats) is a strength, as it allows for rigorous testing without compromising patient privacy. The paper goes beyond performance metrics by incorporating risk-aware design, uncertainty quantification, and human-in-the-loop verification, which are crucial for clinical safety and adoption.
Another strength is the level of technical detail provided. The pipeline is clearly described, from OCR adaptation and prompt tuning to output structuring and field-level analysis. The side-by-side comparison of model performance is thorough, and the authors’ breakdown of task-specific strengths between models (e.g., temporal reasoning versus lab data accuracy) demonstrates a nuanced understanding of LLM behavior in healthcare settings.
Finally, the collaboration with NHS England’s AI Quality Community of Practice and the proactive integration of clinical assurance protocols show that the authors are thinking beyond the lab and toward scalable, responsible deployment. This real-world orientation adds substantial value to the work.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While the paper is well-motivated and methodologically sound, a few limitations are worth noting. First, although the authors use a carefully constructed synthetic dataset, the absence of real-world referral data limits the ability to fully assess how the system would perform under live deployment conditions. Clinical documents often contain messier and more unpredictable content, and without live testing, it’s hard to know how well the models would generalize.
Second, the models were tested in zero-shot settings without domain-specific fine-tuning. While this is helpful for assessing generalizability, it’s also a constraint. Incorporating a few-shot or fine-tuned baseline would have helped contextualize the models’ current limits and future potential.
Additionally, while uncertainty quantification is introduced and discussed, the actual impact of confidence scores on clinical workflow decisions is not deeply evaluated. For instance, it remains unclear how often flagged extractions would truly benefit from human intervention or how this would affect overall triage efficiency.
Lastly, although the collaboration with NHS England is commendable, the paper could benefit from more detailed discussion around end-user feedback. Direct input from clinicians using the system in practice or even a simulation of such use would strengthen the argument for real-world readiness.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
This paper addresses an important real-world challenge in urgent cancer diagnostics and presents a thoughtful application of generative AI to streamline referral processing. The clinical context is clearly explained, and the technical pipeline is well-documented. The hybrid approach—balancing automation with human oversight—is particularly appropriate given the high-stakes nature of cancer diagnosis.
The use of a synthetic dataset is justified, but future work involving real clinical data or clinician feedback would help further validate the system’s effectiveness and safety. Additionally, expanding on how uncertainty scores impact clinical decision-making could add clarity around how the system would function in practice.
Overall, this is a strong and timely contribution that blends AI capability with clinical governance considerations, and it has clear potential for real-world impact.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper makes a valuable contribution by applying generative AI to a meaningful and urgent clinical use case. It offers a practical solution to a common bottleneck in colorectal cancer referrals and demonstrates how large language models can be integrated into triage workflows while maintaining safety and clinician oversight. The performance metrics are strong, the risk analysis is thoughtful, and the discussion of limitations is transparent.
While the study is based on synthetic data and would benefit from further real-world validation, the authors have laid a solid foundation for future deployment. The hybrid human-AI model proposed is realistic, scalable, and aligned with ongoing efforts to responsibly implement AI in healthcare. For these reasons, I recommend acceptance.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ rebuttal effectively addressed the key concerns raised during the initial review. They acknowledged the limitations of using synthetic data and clarified that steps are underway to validate the approach with real NHS referral data, which aligns with their goal of responsible clinical translation. Additionally, their explanation of the uncertainty quantification pipeline and how flagged cases would be routed for human verification added clarity to the proposed hybrid model. While fine-tuning and domain-specific training were not explored in the current work, the authors rightly emphasized the benefits of zero-shot generalizability as a baseline and indicated plans for future iterations that include fine-tuning.
The authors also clarified the involvement of end-users in system design and piloting stages, which reinforces the clinical relevance and translational readiness of the proposed solution. Given the strength of the original submission, the clear clinical motivation, and the well-considered responses to reviewer feedback, I maintain my strong recommendation for acceptance.
Author Feedback
We sincerely thank the reviewers for their thoughtful and constructive feedback. In the responses below, we clarify the concerns raised, provide additional details where appropriate, and outline our plans to enhance the robustness and clinical applicability of RAPTOR.
Synthetic Data & Real-World Generalization: Reviewers raised concerns about using 111 synthetic colorectal cancer (CRC) 2WW referrals. This number mirrors a typical monthly volume (~100) at a major NHS hospital, enabling detailed per-field evaluation without oversampling synthetic patterns; it was created solely for this study to prioritize depth and realism over bulk generation. In Section 2.1 (p. 3), three NHS clinicians (two colorectal surgeons, one gastroenterologist; ≥ 5 years post-GMC) each contributed ~37 cases, stratified by risk (46 low, 7 intermediate, 58 high). Forms were blinded and cross-reviewed, with consensus resolving discrepancies. Quality control enforced consistency between demographics, symptoms, and investigations; applied NICE NG12 criteria; included clear and ambiguous cases; and ensured demographic diversity. Each PDF features typed and handwritten text, checkboxes, and deliberate “noise” to emulate PHI-free clinical variability. For OCR, we fine-tuned Google Document AI on a CRC-risk–stratified 70:30 train:test split to rigorously evaluate recognition performance. Upon acceptance, we will release code and the synthetic dataset. We are preparing an IRB-approved study on 10,000 de-identified CRC referrals in the hospital’s Secure Data Environment (SDE) to validate RAPTOR on real-world data and will expand the corpus for broader testing.
Domain & Model Scope: Reviewers noted our focus on CRC 2WW and two large language models (LLMs). In Section 2.2 (pp. 3–4), we justify selecting GPT-4o-Mini and DeepSeek-R1 for clinical performance, inference efficiency, and availability. We acknowledge this zero-shot setup is a limitation; we will note plans to benchmark few-shot/fine-tuned baselines in this paper if accepted. Future work will extend RAPTOR to breast, lung, and prostate pathways, piloting on those forms. Early investigation shows similar OCR challenges across domains, but formal validation is needed as highlighted. We also plan to evaluate open-source models (e.g., Meditron, Mistral) with few-shot/fine-tuning contexts.
Reproducibility & Transparency: Requests for methodological detail are addressed in Sections 2.2–2.3. We report fine-tuning Google Document AI for OCR (F1 90.5%; precision 93.5%, recall 87.8%) and describe our four-step pipeline—OCR processing, data cleaning/normalization, zero-shot LLM JSON generation, and output structuring (pp. 3–4). Model outputs were blindly compared to expert annotations, with two CRC triage consultants adjudicating discrepancies to ensure rigorous validation. We intend to make the code available after acceptance including the synthetic dataset.
Claims & Impact: Our Abstract reports GPT-4o-Mini at 94.83% accuracy and DeepSeek-R1 at 93.72%, both achieving a 10× speed-up over clinicians. We acknowledge the need for real-world validation and are evaluating RAPTOR on 10,000 de-identified CRC referrals in the SDE in another study; this will appear in Future Work. We will adjust cross-domain claims in the camera-ready version accordingly. We also plan to integrate RAPTOR in a clinician-facing UI to pilot our uncertainty-based flagging and gather feedback which will also described in our future works section.
Minor Points: In the camera-ready version, we will (i) restructure Tables 1–2 into sub-tables with grouped fields, consistent fonts, and optimized widths; (ii) define key acronyms at first use; (iii) add a workflow diagram contrasting the standard and RAPTOR-assisted pipelines; and (iv) update the title to reflect the colorectal cancer focus.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A