Abstract
Cardiac surgery is associated with the risk of acute kidney injury (AKI), which can lead to prolonged hospital stays and increased mortality. Accurate prediction of AKI before its onset could significantly improve patient outcomes. However, existing AKI prediction models primarily focus on numerical features such as laboratory values and vital signs, while overlooking textual features, including preoperative diagnoses and surgical procedures.
To address this limitation, we propose MedICL, which applies in-context learning (ICL) to the cardiac surgery domain. By leveraging the powerful comprehension and reasoning capabilities of large language models, MedICL enables the integration of textual and numerical features for AKI prediction. Nevertheless, the performance of ICL is highly sensitive to the quality of the provided examples, potentially limiting its effectiveness. To overcome this challenge, we introduce a Semantic Matching Unit (SMU), which selects semantically relevant examples for each sample, thereby significantly enhancing the model’s performance.
Furthermore, we observed that ICL-based AKI predictions often suffer from instability and exhibit suboptimal performance on downstream tasks. To address these issues, we developed the Task Adaptability Enhancer (TAE), which calibrates the prediction probabilities generated by ICL on the validation set. This approach not only stabilizes the model’s outputs but also enhances its adaptability to specific task scenarios. A series of experiments on the datasets collected from West China Hospital (WCH) demonstrated that MedICL achieved state-of-the-art performance. These results highlight the indispensable role of medical text data in AKI prediction for cardiac surgery scenarios, showcasing its potential to improve clinical practice.
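The Semantic Matching Unit described in the abstract selects semantically relevant training examples for each query. A minimal sketch of how such cosine-similarity-based demonstration selection might work, assuming each sample's textual part has already been embedded as a vector (the function names, embedding dimensions, and value of k here are hypothetical, not the paper's implementation):

```python
import numpy as np

def select_examples(query_emb, train_embs, k=5):
    """Return indices of the k training samples most semantically
    similar to the query under cosine similarity, best first.
    (Hypothetical SMU-style sketch; the paper's exact procedure
    may differ.)"""
    # Normalize rows so a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q
    return np.argsort(-sims)[:k]

# Toy demo with random stand-in "embeddings".
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 8))   # 100 training samples, dim 8
query = rng.normal(size=8)
idx = select_examples(query, train, k=3)
print(idx)
```

The selected samples would then be formatted as in-context demonstrations ahead of the query in the prompt.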
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5250_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{SuChe_MedICL_MICCAI2025,
author = { Su, Chenyang and Wang, Yishun and Xu, Boqiang and Feng, Rong and Du, Lei and Liu, Hongbin and Meng, Gaofeng},
title = { { MedICL: In-Context Learning for Semantically Enhanced AKI Prediction in Cardiac Surgery } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15970},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
- In-context learning is applied to the acute kidney injury prediction problem.
- The semantic matching unit is proposed to select the most relevant examples for each sample as demonstrations.
- The Task Adaptability Enhancer is leveraged to calibrate the probability distribution for classification.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- One major strength is that the semantic matching unit is adapted to the target problem.
- The benefits/usefulness of the proposed modules are clearly validated.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In Section 2.2, it is not clear why the authors chose to use x_i^{text} instead of q in the example.
- The details (such as the formats) of the textual data, including medical history, preoperative medications, and intraoperative details, are not sufficiently elaborated.
- The implementations are not clear. As one example, how many data samples (the absolute value) are used?
- Statistical analysis of the results is missing, so it is not clear whether the improvements are statistically significant.
- All mathematical symbols should be clearly defined, with their dimensions explicitly given.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The scientific contribution seems minor; the authors should justify the novelty of the proposed semantic matching unit and Task Adaptability Enhancer.
- The compared methods are out of date, having been published in 2019 and 2021.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I hold the previous opinion due to the following two points. (1) The authors do not respond to concerns about the minor contributions, which is a very important issue. (2) The reason why recent methods are not compared is not convincing.
Review #2
- Please describe the contribution of the paper
This paper presents MedICL, a novel framework that applies in-context learning (ICL) based on large language models (LLMs) for the prediction of acute kidney injury (AKI) following cardiac surgery. By integrating semantic matching (SMU) to retrieve personalized relevant examples and a Task Adaptability Enhancer (TAE) to calibrate output probabilities, the proposed approach improves prediction accuracy and robustness in real-world clinical datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The SMU component leverages semantic similarity for demonstration selection, addressing the sensitivity of ICL to prompt quality and enhancing contextual relevance.
- The proposed TAE introduces a calibrated probability output, improving both prediction stability and adaptability to downstream evaluation.
- Evaluation on a real-world dataset (ACSD) with comprehensive perioperative data confirms its utility and superiority over traditional ML methods.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The baseline methods used in this study are primarily traditional machine learning models (e.g., logistic regression, random forest, XGBoost), without including standard deep neural network classifiers that are widely used in medical prediction tasks (e.g., multilayer perceptrons or Transformer-based models). Since the proposed method leverages general-purpose LLMs through ICL, which are relatively complex and resource-intensive, the absence of a head-to-head comparison with task-specific supervised deep learning baselines limits the ability to justify the practical advantages and necessity of using ICL in this context.
- While GPT-4 and o1-mini are compared, the paper lacks evaluation of MedICL across diverse medical tasks or hospitals, raising concerns about transferability.
- Using cosine similarity without redundancy control might lead to repetitive or highly similar examples, risking overfitting or semantic collapse.
- The paper does not compare MedICL with emerging domain-specific LLMs like BioGPT, GatorTron, or Med-PaLM, which could challenge its performance claims.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The main strength lies in its focus on a clinically meaningful task—AKI prediction in cardiac surgery—using a real-world dataset and an emerging method, in-context learning (ICL), which makes the study interesting and relevant. However, a major concern is the lack of comparison with stronger deep learning baselines. While the paper compares MedICL with traditional machine learning models (e.g., Logistic Regression, Random Forest), it does not evaluate against task-specific supervised deep learning models, which are widely used in medical prediction tasks and may perform better than general-purpose LLM-based ICL. Without this comparison, it is difficult to assess the practical advantage of MedICL over more commonly used neural approaches.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
1) In this study, the authors proposed a novel framework, MedICL, which integrates textual and numerical features by utilizing large language models for AKI prediction. 2) In addition, the authors proposed the SMU and TAE modules to address key challenges in the practical application of ICL, further enhancing model stability and adaptability. 3) Experimental results showed that MedICL achieved significant improvements over traditional ML methods, reaching state-of-the-art performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) Machine learning methods have been widely applied to AKI prediction, but studies on AKI prediction in populations undergoing cardiac surgery remain limited. Traditional ML methods primarily rely on numerical data. In this study, the authors present the first work applying an in-context learning (ICL) framework, MedICL, which integrates both textual and numerical data for AKI prediction. 2) The authors proposed a Semantic Matching Unit (SMU), which selects the most semantically relevant examples from the training set based on similarity to address substantial fluctuations in prediction results. 3) The authors proposed a Task-Adaptive Enhancement (TAE) module, which adjusts the probability distribution to ensure robust and reliable outputs. The distribution is further calibrated to improve stability and adaptability to downstream tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) The Task-Adaptability Enhancer section is not sufficiently clear. Is the ‘validation set’ mentioned here consistent with the ‘validation’ split in the dataset partitioning? 2) Additionally, how are the prompts P_v generated? 3) In the ablation study section, a new component ‘PA’ is mentioned. What does it represent, and how is it technically implemented? 4) In the “2.2 Overall Structure” section, there should be a space before citation [24].
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This study presented the first application of in-context learning (ICL) for AKI prediction by incorporating popular LLMs to integrate clinical text data, offering limited but meaningful novelty. However, the baseline comparisons are predominantly traditional machine learning methods. Although the proposed MedICL framework shows outstanding performance, its convincing power remains limited due to insufficient comparative benchmarks.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank the reviewers for their time and insightful feedback on our paper. Below, we address the main comments and suggestions.

Reviewer #1, Comments 1 and 4: Comparison Methods and LLMs
We explored deep learning models such as MLPs but found they underperformed traditional machine learning methods due to the high-dimensional, uneven data with limited sample sizes. As noted in the introduction, most prior AKI prediction research used machine learning methods. These two factors motivated our decision not to include MLP as a baseline. For large language models, we selected o1-mini for its reasoning capabilities and GPT-4 for its performance on the Open Medical LLM Leaderboard. We believe these models are sufficiently representative for evaluating MedICL and already demonstrate strong performance. We will consider including MLP comparisons and evaluations with additional domain-specific LLMs in the camera-ready version.

Comment 2: Generalizability
Our research focuses on AKI prediction in cardiac surgery, limiting the evaluation of MedICL on other tasks. However, we are collecting data from additional hospitals to validate our approach across multiple datasets in future work.

Comment 3: Redundancy Control in ICL
Selecting semantically relevant examples is vital for improving ICL, so we prioritize semantic relevance as the primary criterion for example selection. In Section 4, we conduct experiments in which the number of examples is gradually increased from 0 to 15; to some extent, this also examines the impact of diversity on performance.

Reviewer #2, Comments 1 and 2: Task Adaptability Enhancer (TAE)
The "validation set" refers to the dataset's validation split. In the TAE, we perform probabilistic calibration on this set to optimize the parameters A and b, which are then applied to the test set. The prompts P_v consist of a prompt template (including task descriptions, key points, etc.) and examples selected from the training samples through the SMU. We will clarify these details in the camera-ready version.

Comment 3: Probability Averaging (PA)
As stated in the caption of Table 2, we enhance MedICL's robustness by sampling multiple times and averaging the probability distributions, which significantly boosts performance.

Comment 4 and Reviewer #3's Comments 1 and 5: Formatting Issues
We acknowledge the formatting issues raised. In the camera-ready version, we will review and correct the mathematical symbols, ensuring they are clearly defined and consistently formatted. Specifically, we use q to represent the query and x for example samples, with the superscript "text" indicating the textual part involved in semantic matching.

Reviewer #3, Comments 1 and 5: Formatting Issues
Kindly refer to the text above.

Comments 2 and 3: Data Used
We processed the textual data into a structured keyword format for consistency. Key information was standardized, with medical history as predefined fields, intraoperative details summarized into structured events, etc. Due to space and privacy considerations, we did not provide detailed examples, but we plan to release a portion of de-identified data to support the research community. Additionally, in the "Dataset" section of the "Experiments" chapter, we explicitly specify the absolute number of data samples.

Comment 4: Statistical Analysis
We conducted extensive experiments to demonstrate the effectiveness of our method. Due to space constraints, detailed statistical analyses are not included but will be added in the camera-ready version. For instance, as shown in Table 1, under the Text-Augmented setting our method achieved an F1 score of 0.77, outperforming Random Forest (0.72). A paired t-test across ~1000 test samples yielded a p-value < 0.01, confirming that the improvement is statistically significant, and the 95% confidence interval for the F1-score improvement was [0.04753, 0.05247], further supporting the reliability of our approach.
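The rebuttal's description of the TAE — calibrating ICL output probabilities on the validation split by optimizing parameters A and b, then applying them to the test set — resembles Platt scaling. A minimal sketch under that assumption; the function names, optimizer, and toy data below are hypothetical, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(p_val, y_val, lr=0.1, steps=2000):
    """Fit Platt-scaling parameters A and b on validation-set
    probabilities by gradient descent on the negative log-likelihood.
    (Hypothetical sketch of a TAE-style calibration step.)"""
    z = np.log(p_val / (1 - p_val))   # logits of the raw probabilities
    A, b = 1.0, 0.0                   # identity calibration as start
    for _ in range(steps):
        q = sigmoid(A * z + b)
        g = q - y_val                 # dNLL/d(A*z + b) per sample
        A -= lr * np.mean(g * z)
        b -= lr * np.mean(g)
    return A, b

def calibrate(p, A, b):
    z = np.log(p / (1 - p))
    return sigmoid(A * z + b)

# Toy demo: raw probabilities with a systematic miscalibration.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
raw = np.clip(0.5 + 0.45 * (y - 0.5) + rng.normal(0, 0.2, 500), 0.01, 0.99)
A, b = fit_platt(raw, y)
cal = calibrate(raw, A, b)
```

The probability-averaging step mentioned for Table 2 would then simply mean sampling the LLM several times and averaging the calibrated probabilities, e.g. `np.mean(np.stack(runs), axis=0)`.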
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The application of text augmentation processes to numerical tabular data yields statistically significant improvements in classification model performance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
In my view, critical comments have been sufficiently addressed in the rebuttal stage. While a lack of comparison against advanced deep learning methods seems to be a weakness of this work, I agree with R1’s comments on this work being a clinically motivated and meaningful task.