Abstract
Adapting vision transformer (ViT) foundation models with parameter-efficient fine-tuning (PEFT) has become increasingly popular in medical imaging, enabling efficient adaptation while updating only a small subset of parameters. However, existing PEFT methods process tokens independently, overlooking cross-token dependencies and limiting their ability to capture global contextual information. To address these limitations, we propose FreqFiT, a novel Frequency-based Fine-Tuning module inserted between ViT blocks to enhance model adaptability. FreqFiT is effective and integrates seamlessly with existing PEFT methods to improve their performance. We evaluate FreqFiT across 2D and 3D medical imaging datasets, including PAPILA, HAM10000, ADNI-1.5T, and COVID-CT-MD. It improves accuracy by 9% and AUC by 10%, surpassing the original PEFT methods on both MedMAE and DINOv2 backbones. Despite using less than 1.2% of the parameters updated in full fine-tuning, FreqFiT achieves state-of-the-art medical imaging adaptation efficiently.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3066_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/tsly123/FreqFiT_medical
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LySon_Frequency_MICCAI2025,
author = {Ly, Son T. and Nguyen, Hien V.},
title = {{Frequency Strikes Back: Boosting Parameter-Efficient Foundation Model Adaptation for Medical Imaging}},
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {262--272}
}
Reviews
Review #1
- Please describe the contribution of the paper
The main contribution of this paper is a novel frequency-based fine-tuning module, FreqFiT, for enhancing the adaptability of Vision Transformer (ViT) foundation models in the medical imaging domain with minimal parameter overhead. Unlike traditional parameter-efficient fine-tuning (PEFT) methods that process tokens independently in the spatial domain, FreqFiT operates in the frequency domain and is able to capture global dependencies and subtle anatomical features that are easily overlooked by spatial-domain methods. The module can be seamlessly integrated into existing PEFT methods (e.g., LoRA, AdaLoRA, BOFT, and FourierFT) and significantly improves performance on multiple 2D and 3D medical image datasets (e.g., PAPILA, HAM10000, ADNI-1.5T, and COVID-CT-MD), while using less than 1.2% of the model's parameters. The authors also provide theoretical proofs that FreqFiT can achieve feature transformations that cannot be reproduced by spatial-domain methods, and show its advantages relative to full fine-tuning and in few-shot learning scenarios.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The core technical innovation of this paper is FreqFiT, a frequency-domain fine-tuning module that captures global dependencies by modulating the interactions between tokens in the frequency domain. This is novel because most existing PEFT methods operate only in the spatial domain and treat tokens independently. In contrast, FreqFiT achieves global context modeling without a significant increase in computational overhead, which is particularly important for medical images where fine-grained structural information matters.
- FreqFiT is an enhancement module that integrates seamlessly with existing PEFT methods such as LoRA, AdaLoRA, BOFT, and FourierFT. The authors provide theoretical proof that combining FreqFiT with spatial-domain PEFT methods enables feature transformations that are not achievable by either approach alone, making an important contribution to efficient model adaptation research.
- The authors provide rigorous theoretical analysis demonstrating that the feature transformations achieved by FreqFiT cannot be replicated by spatial-domain PEFT methods. This makes the method not only an empirical improvement but also one with theoretical depth and credibility.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Although the method was tested on multiple datasets, the article does not demonstrate its integration into real clinical workflows or any prospective validation on data from clinical settings. There is also no discussion of how the outputs of FreqFiT would be used by physicians in practice (e.g., for assisted diagnosis or to improve segmentation accuracy), which diminishes its clinical relevance.
- Some of the datasets used in this paper (e.g., PAPILA, HAM10000) are small and have been extensively studied in academia, so they may not adequately represent real-world challenges of domain shift and generalization. Evaluations on larger and more complex datasets (e.g., MIMIC-CXR or DeepLesion) would better demonstrate the scalability and robustness of the method.
- Although FreqFiT operates in the frequency domain, the paper does not provide any interpretability analysis or visualization of the learned filters or frequency components. Showing what frequency features FreqFiT actually learns, and how these differ from spatial-domain methods, would help build confidence in the model for medical scenarios.
- Despite the paper's high technical value, the most important issue is that it does not adequately demonstrate why frequency-domain fine-tuning is particularly applicable to medical images. The paper lacks an in-depth analysis of the characteristics of medical images, such as high inter-patient variability, weak lesion features, and significant differences between modalities, which could have motivated the introduction of frequency modeling. The absence of a motivational analysis focused on medical imaging scenarios weakens the medical relevance and persuasiveness of this work.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend acceptance of this paper on the grounds of its notable technical novelty, good generalizability, and systematic, comprehensive empirical evaluation. The proposed FreqFiT module introduces a theoretically supported frequency-domain fine-tuning mechanism that not only complements existing spatial-domain PEFT methods but also significantly improves their performance across multiple tasks. The authors validate the effectiveness of the method through theoretical analysis and extensive experiments (covering multiple 2D/3D medical image datasets and different PEFT methods), which show significant improvements in both accuracy and AUC; performance remains strong even under few-shot learning conditions. In addition, the method is practical in terms of tuning efficiency, fine-tuning less than 1.2% of the model's parameters to achieve near-state-of-the-art performance, which is particularly attractive for clinical or resource-constrained environments. Although the paper has shortcomings, such as the lack of discussion of actual clinical deployment and insufficient analysis of its motivation for medical imaging, these do not diminish its core contribution.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper proposes FreqFiT, a novel frequency-based fine-tuning module inserted between ViT blocks to improve the adaptation of foundation models for medical imaging. The key idea is to transform and manipulate feature tokens in the frequency domain, thereby capturing global context and subtle high-frequency image patterns that standard PEFT methods might miss. The main contribution is a simple yet effective add-on module that boosts PEFT on medical image tasks, demonstrated through substantial gains in accuracy and AUC on multiple 2D and 3D medical imaging datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- FreqFiT is conceptually simple – leveraging a Fourier-based transformation to mix token information – yet yields significant performance improvements. This simplicity makes it easy to implement and integrate with existing fine-tuning pipelines. The performance jump with minimal modification is impressive.
- FreqFiT consistently outperforms baseline PEFT methods across diverse datasets (PAPILA, HAM10000, ADNI-1.5T, COVID-CT-MD). These extensive experiments show that the proposed method generalizes well.
- The method adds only about 1% extra trainable parameters relative to full fine-tuning, which means it retains the low-resource advantages of PEFT.
- The structure of the paper is clear and easy to follow. The illustration of the method (Fig. 1) is straightforward.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- A key concern is that the proposed method is not inherently specific to medical imaging. The frequency-based adaptation module addresses a general problem (capturing cross-token dependencies and high-frequency details) that is equally relevant to natural image tasks. As a result, the paper might be better suited to a general vision or machine learning venue (e.g., CVPR, ICCV, NeurIPS)
- The idea of using frequency domain transformations to enhance model training is not entirely new. Prior works like FNet and FouRA have explored frequency-based feature mixing or adaptation in other contexts.
- While the paper presents a solid array of experiments, there are a couple of aspects where the evaluation could be more exhaustive to fully convince the reader of the method's advantages: for example, a comparison against full fine-tuning, a qualitative or frequency-spectrum analysis to support the claim of capturing "subtle patterns", and a visualization showing that FreqFiT indeed emphasizes high-frequency components.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, I am positive about this paper. It presents a simple yet effective method that yields substantial improvements in the adaptation of foundation models for medical imaging. The strengths, especially the significant performance gains on multiple datasets with minimal additional parameters, and the thorough experimentation, strongly outweigh the weaknesses. The paper is well-executed, with a sound motivation (addressing the limitations of current PEFT methods in capturing global context) and convincing experiments that the proposed solution works.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper introduces FreqFiT, a parameter-efficient fine-tuning (PEFT) technique designed for adapting foundation models. FreqFiT operates by transforming the feature map at a selected layer into the frequency domain, applying a learnable convolutional filter, performing an inverse Fourier transform, and then fusing the result with the original feature map via a skip connection. The method is evaluated on two foundation models—DINOv2 and MedMAE—across four datasets. Experiments are conducted by augmenting existing PEFT methods with FreqFiT and comparing them against the standalone PEFT baselines. Results show that FreqFiT consistently enhances the performance of existing PEFT methods across all tested scenarios. Additionally, few-shot experiments in 1-, 5-, and 10-shot settings further highlight the effectiveness of FreqFiT in low-data regimes.
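The step-by-step description above maps naturally onto a few lines of code. Below is a minimal PyTorch sketch of such a module, written from that description only; the class name, the square-patch-grid assumption, the initialization scale, and the class-token handling are illustrative choices, not the authors' actual implementation (see the linked repository for that).
```python
# Minimal sketch of the FreqFiT idea as described above: FFT the token grid,
# apply a learnable filter, inverse-FFT, then fuse via a skip connection.
# Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn


class FreqFiTSketch(nn.Module):
    def __init__(self, h: int, w: int, dim: int):
        super().__init__()
        self.h, self.w = h, w
        # One learnable complex weight per rFFT frequency bin and channel;
        # these are the only trainable parameters the module introduces.
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) patch tokens with N = h * w (the class token, if any,
        # is assumed to be split off and re-attached by the caller).
        b, n, d = x.shape
        tokens = x.view(b, self.h, self.w, d)
        # 1) Transform the feature map into the frequency domain.
        freq = torch.fft.rfft2(tokens, dim=(1, 2), norm="ortho")
        # 2) Apply the learnable filter; one element-wise complex product
        #    here mixes information across all spatial tokens at once.
        freq = freq * torch.view_as_complex(self.filter)
        # 3) Inverse Fourier transform back to the token domain.
        filtered = torch.fft.irfft2(freq, s=(self.h, self.w),
                                    dim=(1, 2), norm="ortho")
        # 4) Fuse with the original feature map via a skip connection.
        return x + filtered.reshape(b, n, d)
```
Under these assumptions, a ViT-B/16 at 224x224 resolution (14x14 patch grid, D = 768) gives a filter of 14 x 8 x 768 x 2 = 172,032 values per insertion point, small against the roughly 86M parameters of the frozen backbone.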
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The use of frequency domain filtering to capture spatial dependencies—unlike many existing PEFT methods that treat each token independently—is a novel and well-motivated idea.
- The experimental results clearly demonstrate the effectiveness of the proposed method when combined with existing PEFT techniques.
- The method is evaluated across four diverse datasets and with two foundation models (DINOv2 and MedMAE), providing strong evidence of its robustness and generalizability.
- The inclusion of ablation studies in low-shot settings (1-, 5-, and 10-shot) further supports the method’s effectiveness in data-scarce scenarios.
- The paper is clearly written and well-organized, making it easy to follow the proposed approach and its contributions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1- Missing Standalone Performance of FreqFiT: The performance of FreqFiT when combined with existing PEFT methods is impressive. However, it would be valuable to include results showing the standalone performance of FreqFiT. Even if its performance is not superior to standard PEFT methods when used in isolation, presenting these results would help clarify the method’s individual contribution. Moreover, since FreqFiT introduces minimal additional parameters, its utility as a complementary module remains strong. Including standalone results would enhance the paper’s completeness and provide better insight into the independent capabilities and limitations of FreqFiT.
2- Theorems: The theorems provide valuable intuition but are not fully formalized or supported by detailed proofs, so referring to them as theorems may be somewhat misleading. A more appropriate alternative might be to present them as propositions.
3- Incorrectly Highlighted Results in Table 1: The caption of Table 1 states that bolded results indicate cases where FreqFiT improves the original method in both ACC and AUC. However, there are six instances where values are highlighted in bold despite not showing improvements in “both” metrics. This is potentially misleading, and either the caption explanation or the highlighted values should be corrected for consistency.
4- Application to broader tasks such as segmentation: It would be interesting to explore how the proposed method performs on other tasks, such as segmentation, as the current work focuses solely on classification.
5- Generalization performance after fine-tuning: Foundation models are known for their strong generalization capabilities, including in medical imaging tasks [1]. An interesting additional experiment would be to evaluate the domain generalization performance of models fine-tuned with FreqFiT, to assess how well the method preserves or enhances this property.
[1] Cekmeceli et al. “Do Vision Foundation Models Enhance Domain Generalization in Medical Image Segmentation?”
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I find the core idea presented in the paper both interesting and novel. The authors support their method with thorough and well-designed experiments that clearly demonstrate its effectiveness. I do not identify any major weaknesses that would warrant rejection. Therefore, I recommend acceptance of this work.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their thoughtful feedback and suggestions. We appreciate the opportunity to clarify key points. Due to space constraints, we organize our responses by theme and denote reviewer comments as R#–C# (e.g., R1–C1 refers to Reviewer 1, Comment 1).
Clinical Relevance and Dataset Choice (R1–C1, C2, C4) The focus of this work is to improve parameter-efficient adaptation for medical imaging under limited supervision. The selected datasets (PAPILA, HAM10000, ADNI, COVID-CT-MD) were chosen to reflect a diverse range of imaging modalities, anatomical regions, and task difficulty. While not large-scale, these benchmarks are widely used in the community and representative of real-world constraints, such as limited annotations or subtle visual cues. As shown in Sec. 5 and Fig. 2–3, FreqFiT improves consistently across all settings and backbones, indicating strong robustness. While clinical deployment is beyond the current scope, FreqFiT is designed for easy integration into low-resource medical pipelines, given its minimal overhead, PEFT seamless compatibility, and plug-and-play nature.
Motivation for Frequency-Based Adaptation in Medical Imaging (R1–C3, C4; R2–C1, C2) The rationale for operating in the frequency domain is grounded in the structural properties of medical images. Unlike natural images, medical images often contain sparse, weak, or globally distributed features that are difficult to capture through local, token-wise PEFT alone. FreqFiT addresses this by aggregating token information via frequency modulation, an operation that is theoretically full-rank (Theorem 1) and complementary to spatial PEFT (Theorem 2). Prior frequency methods (e.g., FNet, FourierFT) lack this targeted integration with ViT-based PEFT. Our design, illustrated in Fig. 1 and Sec. 3, enables cross-token interaction and improved feature alignment under frozen-backbone constraints. An additional motivation for using the frequency domain is computational efficiency: applying the FFT to input tokens X of shape H×W×D enables global token interaction with a computational complexity of O(N log N), where N = H×W. This provides an efficient mechanism for modeling long-range dependencies with minimal computational overhead.
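As a rough illustration of the complexity argument in this rebuttal, the self-contained snippet below compares the per-channel cost of FFT-based token mixing, O(N log N), with the O(N^2) token-token cost of self-attention. The token counts are typical ViT values chosen for illustration; the operation counts are order-of-magnitude only, not measurements from the paper.
```python
# Back-of-the-envelope comparison of the rebuttal's complexity claim:
# FFT-based global token mixing scales as O(N log N) per channel, while
# self-attention's token-token interaction scales as O(N^2) per head.
# Illustrative operation counts only, not measurements from the paper.
import math

for n in (196, 576, 1024):          # 14x14, 24x24, 32x32 patch grids
    fft_ops = n * math.log2(n)      # FFT mixing over N = H * W tokens
    attn_ops = n * n                # attention score matrix
    print(f"N={n:5d}  FFT ~{fft_ops:9.0f}  attention ~{attn_ops:9.0f}  "
          f"ratio ~{attn_ops / fft_ops:.1f}x")
```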
Standalone Utility and Visual Interpretability (R2–C3; R3–C1, C5) FreqFiT is designed to enhance existing PEFT methods. As shown in Table 1 and Fig. 3, its integration consistently improves performance across all PEFT baselines, datasets, and few-shot settings. Although standalone results were omitted due to space, we note that FreqFiT alone offers modest gains but is most effective in combination with spatial PEFTs, as supported by Theorem 2. Regarding interpretability, performance improvements on structure-sensitive modalities (e.g., MRI, CT) reflect its ability to emphasize informative frequency patterns. We agree that visualizing learned frequency responses is a valuable future direction.
Broader Applicability and Domain Generalization (R2–C1; R3–C4, C5) While this work focuses on classification, FreqFiT is model- and task-agnostic. Its insertion between ViT blocks makes it naturally extendable to segmentation and detection pipelines, particularly under frozen-backbone setups. On generalization, the robustness of FreqFiT in few-shot settings (Fig. 3) reflects its ability to preserve domain-transferable representations, especially under data scarcity. This aligns with the reviewer’s suggestion of assessing domain generalization, which we plan to explore further.
Theoretical Framing and Table Highlighting (R3–C2, C3) We acknowledge the reviewer's point regarding terminology. Due to the space limit, we do not present the full proofs. In the final version, we will revise "Theorems" to "Propositions" to better reflect their role in offering formal design justification rather than rigorous proof. Regarding Table 1, we thank the reviewer for the careful observation. We will correct the table in the final version.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A