Abstract
The rapid advancement of generative AI in medical imaging has introduced both significant opportunities and serious challenges, especially the risk that fake medical images could undermine healthcare systems. These synthetic images pose serious risks, such as diagnostic deception, financial fraud, and misinformation. However, research on medical forensics to counter these threats remains limited, and there is a critical lack of comprehensive datasets specifically tailored for this field. Additionally, existing media forensics methods, which are primarily designed for natural or facial images, are inadequate for capturing the distinct characteristics and subtle artifacts of AI-generated medical images. To tackle these challenges, we introduce MedForensics, a large-scale medical forensics dataset encompassing six medical modalities and twelve state-of-the-art medical generative models. We also propose DSKI, a novel Dual-Stage Knowledge Infusing detector that constructs a vision-language feature space tailored for the detection of AI-generated medical images. DSKI comprises two core components: 1) a cross-domain fine-trace adapter (CDFA) for extracting subtle forgery clues from both spatial and noise domains during training, and 2) a medical forensic retrieval module (MFRM) that boosts detection accuracy through few-shot retrieval during testing. Experimental results demonstrate that DSKI significantly outperforms both existing methods and human experts, achieving superior accuracy across multiple medical modalities.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2888_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LiShu_Toward_MICCAI2025,
author = { Li, Shuaibo and Xing, Zhaohu and Wang, Hongqiu and Hao, Pengfei and Li, Xingyu and Liu, Zekai and Zhu, Lei},
title = { { Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {638--648}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper presents a new large dataset of real and AI-generated image pairs covering 6 imaging modalities and 12 generative models. The dataset is designed to help train and test systems that detect fake medical images. The paper also introduces a deepfake detection model that learns low-level and domain-specific artifacts in real and fake images.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The large dataset of real and fake medical images would be very useful for other researchers, especially if it is made open-source. Its variety across image types and AI models makes it a strong tool for building and testing fake image detectors.
The visual Turing test with medical experts shows that even professionals struggle to tell real and fake images apart, underscoring why tools for detecting fake medical images are important and needed in healthcare.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
A major concern is with the evaluation. The authors use the same image generation models to create both the training and test data, which can cause bias. This might explain the very high accuracy reported. While they do test the model on 1,000 fake and 1,000 real images from a new generator, this test is too limited. They don’t say which imaging type (modality) was used, and they also don’t compare how other detection methods perform on this new data.
This raises an important point: is it useful to train the model on fake images from one generator (like DiffEcho) and test it on another (like EFN)? In medical forensics, detection models must work well on fake images from new or different AI models; otherwise, a model might fail in real situations where the fake images come from sources it hasn’t seen before.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The major concern lies with the evaluation strategy as explained.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper makes two main contributions. First, it presents MedForensics, a large dataset for medical deepfake detection. Second, it proposes DSKI, a novel dual-stage knowledge-infusing detector that learns subtle forgery clues and improves detection accuracy.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper introduces the MedForensics dataset, a large-scale multi-modality dataset specifically designed for medical forensics. This dataset will be a great contribution to the medical forensics community if it becomes publicly available.
- The proposed DSKI detector is based on a dual-stage knowledge-infusing approach to distinguish AI-generated medical images. The method leverages a cross-domain fine-trace adapter (CDFA) during training to extract subtle forgery clues from both spatial and noise domains.
- The proposed medical forensic retrieval module (MFRM) in the testing stage further boosts detection accuracy through few-shot retrieval from a cache feature bank of real and synthetic medical images.
- Experimental results demonstrate that DSKI significantly outperforms existing state-of-the-art methods in natural, facial, and medical image forensics across all six modalities in the MedForensics dataset.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- This work seems focused on deepfake generation of the entire medical image. What about detecting partial deepfakes, where a real medical image is modified in a local area with an image generation method? This is a common and important detection problem in real-life scenarios. The authors should discuss this issue for both the proposed dataset and the proposed DSKI method.
- There is little discussion of model complexity and inference speed for the proposed model. How does it compare to the other models used in the experimental comparison?
- How is the generalization capability of the proposed model? The authors should apply their model to another dataset to demonstrate its generalization capability.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The paper does not mention if they will open-source the dataset and the implementation of the proposed method. The paper will give higher impact if the authors can make the dataset and code publicly available. If the authors plan to do it, it would be great to see the promise from the authors in the rebuttal.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper tackles the increasing threat of AI-generated fake medical images, which can cause significant problems for healthcare systems. It presents a dataset and a novel method for medical forensics. The contributions are significant to the community if the dataset and code are open-sourced.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper introduces a large-scale dataset specifically designed for use in medical forensics. MedForensics includes a wide range of medical imaging modalities and synthetic images generated by twelve state-of-the-art generative models, offering a comprehensive benchmark that reflects the diversity and complexity of real-world clinical scenarios.
In addition to the dataset, the authors propose a novel detection method, DSKI (Dual-Stage Knowledge Infusing Detector), which uses the CLIP vision-language model for medical image forensics. The model enhances CLIP’s discriminative capacity through a fine-tuned adapter module (CDFA) that captures spatial and noise-domain forensic cues, and further improves robustness via a retrieval-based module (MFRM) at test time. The combination of these techniques results in strong performance gains across all modalities, significantly outperforming prior methods and even human experts in a Visual Turing test.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
A key strength of the paper is the introduction of MedForensics, a large and diverse dataset tailored for detecting AI-generated medical images, which fills a major gap in the field. The proposed DSKI method is also novel in its dual-stage design, combining fine-grained artifact detection during training with a retrieval-based module for adaptability at test time. The model shows strong generalization to unseen generative models with minimal samples. Overall, the method achieves consistently high performance across multiple modalities and surpasses expert-level accuracy.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
One limitation of the paper is that it does not mention whether the proposed MedForensics dataset will be publicly released. Given that the dataset is a core contribution and essential for reproducibility and future research in medical deepfake detection, the lack of clarity on its availability is a significant drawback.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The introduction of a large-scale dataset and a novel dual-stage detection framework addresses an important and timely problem in medical image forensics. While the method is well-motivated and shows impressive results, the lack of clarity regarding dataset release slightly limits its impact and reproducibility. Overall, the strengths outweigh the weaknesses, and the paper offers meaningful value to the community.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have adequately addressed my concerns, and I have decided to recommend acceptance.
Author Feedback
We sincerely thank the reviewers for their time and constructive suggestions. We are encouraged that our dataset and method were recognized for their novelty (R1, R3) and substantial contributions (R1, R2, R3). Below, we address the reviewers’ concerns in detail.

Reviewer #1
Q1: Detecting partial forgery. A: Thanks. We have followed your suggestion and tested performance on detecting partial deepfakes. First, we built a new testing set of 1,500 partially edited images across modalities using DRDM (CT), RadiomicsFill (X-ray), and Diffuse-Gen (endoscopy). We then compared methods: our DSKI (Acc/AP: 83.2/85.7) significantly outperforms the best compared method, UniFD (72.6/74.1).
Q2: Model complexity and inference time. A: Thanks for the good suggestion! We compared the complexity and inference speed of the methods on an RTX 3090. Our DSKI has ~442M parameters, 81 GFLOPs, and a 17 ms inference time, only slightly higher than the best baseline, UniFD (~427M / 77 GFLOPs / 15 ms), while achieving substantial improvements of 19.8 in Acc and 20.2 in AP. This efficiency is enabled by inserting lightweight adapters (CDFA) into only three transformer blocks, along with a retrieval module (MFRM) that adds minimal overhead.
Q3: Generalization to another dataset. A: Thanks for the valuable suggestion. We constructed a new testing set using unseen generators (MT-DDPM, XReal, HistoDiffusion), each contributing 500 synthetic images with paired real samples. We compared DSKI with baselines; the results (Acc/AP) are: UniFD: 78.4/78.2; DFH: 76.3/77.6; Ours: 88.1/90.4. These results demonstrate that DSKI maintains strong generalization across unseen generation methods.
Q4: Reproducibility and release plan. A: Thanks for highlighting this point. We promise to release both the dataset and code upon acceptance to support further research in medical image forensics. We will incorporate the above results and the necessary discussion into the revised version.

Reviewer #2
Q1: Clarification of the new generator. A: Thanks. The modality of the new generator in the paper is UWF fundus imaging. The results (Acc/AP) of our method and the top two baselines are: UniFD: 76.1/80.7; DFH: 73.1/76.8; Ours: 86.6/89.4. Clearly, our method outperforms UniFD and DFH. We will add this description and these results to the revision.
Q2: Issues with evaluation strategies. A: Thank you for the insightful suggestion. We agree that cross-generator evaluation is crucial for assessing real-world generalization. Following your suggestion, we conducted cross-generator experiments across different modalities. First, we trained models on DiffEcho and tested on EFN. The top three Acc/AP results are: UniFD: 80.3/76.7; DFH: 69.3/70.8; Ours: 91.6/90.2. Our DSKI again achieves the best results, demonstrating superior generalization. Moreover, inspired by evaluation protocols in natural image forensics, we incorporated a comprehensive cross-generator evaluation strategy into our benchmark. Specifically, for ultrasound (U-KAN-U, EFN, DiffEcho), we trained three models, one on each generator, and evaluated each across all three, averaging nine results per method. Under this setting, DSKI (88.2/87.5) outperformed the best baseline, UniFD (76.0/73.3). Following the same setup, we conducted two additional cross-generator experiments on endoscopy and MRI, using different generators for training and testing. DSKI consistently outperformed all compared methods on endoscopy (86.2/85.9) and MRI (80.3/82.2) in both accuracy and AP. We will strengthen the evaluation, include these experiments along with the necessary discussion in the revision, and hope this supports reconsideration.

Reviewer #3
Q: Dataset and code release. A: We thank the reviewer for raising this important point. We shall release the MedForensics dataset, all code, and additional data upon paper acceptance. We also plan to regularly expand MedForensics with samples from emerging medical generative models to support the research community.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A