Abstract

Recent advances in generative self-supervised learning, particularly Masked Autoencoders (MAE), have shown significant promise in medical image pre-training. However, ultrasound poses unique challenges due to its intrinsic low signal-to-noise ratio. While previous studies have enhanced MAE with deblurring for improved performance, their static deblurring strategy fails to consider domain discrepancies arising from variations in ultrasound imaging. To overcome these limitations, we propose D^2MAE—a Diffusional Deblurring-enhanced MAE framework that seamlessly integrates a diffusional deblurring objective into MAE pre-training, simultaneously optimizing both deblurring and masked image reconstruction within a unified framework. Furthermore, we introduce an optimal blurriness-aware fine-tuning strategy that dynamically adjusts blurriness through an optimal blurriness search procedure, effectively accommodating the inherent domain discrepancies in ultrasound images. Extensive experiments across multiple ultrasound datasets, including thyroid, pancreas, and ovary, demonstrate that D^2MAE outperforms state-of-the-art methods, significantly enhancing generalizability and diagnostic performance across diverse ultrasound tasks. Our results establish D^2MAE as a superior approach for ultrasound imaging pre-training, paving the way for improved ultrasound image analysis. The code and pre-trained models are publicly available on GitHub.
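As a rough illustration of the pre-training objective summarized above, the following minimal PyTorch-style sketch blurs the input at a level sampled from a fixed σ schedule, conditions the network on a blurriness-level embedding, and reconstructs the original sharp image from masked patches. It is not the authors' released implementation: the model interface, the kernel size, the σ range, and names such as pretrain_step and blur_embed are assumptions for illustration only.

    import torch
    import torchvision.transforms.functional as TF

    T = 10                                   # number of blurriness levels (assumed)
    sigmas = torch.linspace(0.1, 2.0, T)     # fixed-step sigma schedule (assumed range)

    def pretrain_step(model, blur_embed, images, mask_ratio=0.75):
        # Sample a blurriness level t and blur the clean input with sigma_t.
        t = int(torch.randint(0, T, (1,)))
        blurred = TF.gaussian_blur(images, kernel_size=9, sigma=float(sigmas[t]))
        # Condition the network on the blurriness level via an embedding ("blur token").
        level_emb = blur_embed(torch.tensor([t], device=images.device))
        # Mask patches of the blurred image and reconstruct the original sharp image,
        # so deblurring and masked image modeling share a single objective.
        # Assumed interface: returns a reconstructed image and a mask (1 = masked pixel,
        # broadcastable to the image shape).
        recon, mask = model(blurred, level_emb, mask_ratio=mask_ratio)
        per_pixel = (recon - images) ** 2
        return (per_pixel * mask).sum() / mask.sum().clamp(min=1)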

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0490_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{KanQin_D2MAE_MICCAI2025,
        author = { Kang, Qingbo and Gao, Jun and Zhao, Hongkai and He, Zhu and Li, Kang and Lao, Qicheng},
        title = { { D2MAE: Diffusional Deblurring MAE for Ultrasound Image Pre-training } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        pages = {106--116}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    A self-supervised learning framework for ultrasound image pre-training that integrates a diffusional deblurring objective into the Masked Autoencoder (MAE) paradigm. It jointly optimizes masked image reconstruction and progressive deblurring, using blurriness-level embeddings.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Addresses the specific challenges of low signal-to-noise ratio and domain shift (varying blurriness) in ultrasound imaging within the effective MAE pre-training framework.
    2. It proposes an original Optimal Blurriness Search (OBS) procedure used during fine-tuning. This allows the model to dynamically adapt the level of deblurring to the specific characteristics of the downstream dataset.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. While DeblurrMAE and DiffMAE are included, the comparison lacks engagement with some concurrent or very recent advances in generative SSL for medical/ultrasound imaging. For instance, UltraMAE proposed a unified MAE for ultrasound images and videos, and various diffusion-model approaches exist for self-supervised denoising/reconstruction of medical images.
    2. Component Clarity: The core method integrates known concepts (MAE, scheduled blurring, timestep embeddings inspired by diffusion models). While the integration is novel, the “diffusional deblurring” mechanism itself is not an iterative diffusion process, and its specific advantages over simpler deblurring strategies within MAE (like the cited DeblurrMAE) or more complex true diffusion models require clearer comparison.
    3. Evaluating sensitivity to the number of blur steps and providing a more detailed per-image analysis of OBS adaptivity would strengthen the paper.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Concerns about the completeness of the SOTA comparisons and the novelty of the core “diffusional” component. If the authors can add experiments and further clarifications, I would be happy to revisit the score.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel self-supervised pre-training framework (D2MAE), which integrates a diffusional deblurring process with the masked image reconstruction paradigm of MAE for ultrasound image pre-training. By adaptively adjusting the deblurring level during both pre-training and fine-tuning, the proposed method mitigates the challenges posed by domain discrepancies in ultrasound imaging. Experimental results on several ultrasound datasets show that the proposed method outperforms state-of-the-art methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A novel generative SSL framework that integrates a diffusional deblurring process with MAE to jointly optimize progressive deblurring and masked image reconstruction.
    2. This paper introduces an optimal blurriness-aware fine-tuning strategy, which employs an optimal blurriness search to dynamically select the most appropriate deblurring level for downstream tasks.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. In formula (5), the blurred image x_b(t*) corresponding to the minimum loss is used as input for downstream fine-tuning of the supervised model f_θ. Why does the model select x_b(t*) as input, and why are the other blurred images not suitable for downstream fine-tuning?
    2. In Figure 1, why are the values of σ_1, …, σ_t, …, σ_T fixed? If the values were random, what would the drawbacks be?
    3. The experiments lack a comparison of model parameters and computational costs during training and inference.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The paper lacks sufficient discussion and analysis of related work.
    2. The paper lacks detailed analysis of the proposed method.
    3. The innovation of the proposed method is not particularly significant.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Although parts of the rebuttal are not fully convincing, the experimental results show that the proposed method is somewhat interesting. If the authors are able to make the revisions promised in the rebuttal, it is recommended that the paper be accepted.



Review #3

  • Please describe the contribution of the paper

    This paper proposes D2MAE, a masked image modeling framework for ultrasound image pre-training that incorporates a diffusional deblurring objective. D2MAE unifies denoising diffusion-inspired deblurring with masked image reconstruction during pre-training to better address the inherently low signal-to-noise ratio of ultrasound images. Furthermore, it introduces an optimal blurriness-aware fine-tuning strategy, where a novel Optimal Blurriness Search (OBS) dynamically determines the most appropriate blurriness level for each image during fine-tuning. This strategy effectively mitigates domain discrepancies arising from variations in imaging protocols, devices, and anatomical characteristics.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. By incorporating a progressive, diffusion-inspired blurring process into masked image reconstruction, D2MAE enhances the model’s capacity to learn robust representations from inherently noisy ultrasound images. This strategy directly addresses the low signal-to-noise ratio, a fundamental challenge in ultrasound imaging.

    2. The proposed Optimal Blurriness Search (OBS) strategy dynamically determines the most suitable blurriness level for each image during downstream fine-tuning, effectively adapting to domain-specific variations in imaging protocols, devices, and anatomy.

    3. The paper presents comprehensive ablation studies—including analyses of blurriness range, embedding strategies, OBS configurations, and search steps—which clearly demonstrate the contribution of each design component to the overall performance gains.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Incremental Novelty: While the integration of diffusion-inspired deblurring with MAE is well-motivated, both diffusion models and deblurring-enhanced MAEs have been explored in prior work (DeblurrMAE in MICCAI’23 and DiffMAE in ICCV’23). Although the integration is sound, it may not represent a fundamentally new methodological advance.

    2. Lack of generalizability evaluation: The downstream datasets used for the thyroid and pancreas tasks overlap with the data seen during pre-training, raising concerns about the validity of the generalization evaluation. This overlap makes it difficult to assess how well the proposed D2MAE framework transfers to unseen domains or datasets. Moreover, it remains unclear whether the denoising mechanism and the Optimal Blurriness Search (OBS) strategy are truly generalizable improvements, or whether they simply overfit to the characteristics of the specific datasets used. While denoising-based pre-training is a reasonable and intuitive approach for ultrasound imaging, it remains unclear whether its effectiveness would generalize to other domains or imaging modalities beyond ultrasound.

    3. Computational Complexity and Efficiency Not Addressed: Although the paper claims that the Optimal Blurriness Search (OBS) is performed only once with negligible overhead, it still requires multiple forward passes per image to determine the optimal blurriness level. This additional computational cost may pose challenges for real-world clinical deployment, particularly in time-sensitive scenarios or zero-shot inference settings.

    4. Narrow Scope of Clinical Validation: The evaluation is limited to three ultrasound datasets, all in the context of classification tasks. Masked image modeling has been shown to learn position-related representations, which has significant implications for pixel-level downstream tasks. Why not apply it to segmentation tasks?

    5. Lack of linear-probing and few-shot evaluation results.

    6. Lack of Joint Ablation Experiments: The current ablation studies treat the blurriness embedding and OBS settings independently, despite their potential interaction. Since these components may influence each other’s effectiveness, joint ablation experiments should be conducted to better understand their combined impact on model performance.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the incremental novelty of the approach (similar methods such as DeblurrMAE and DiffMAE have been explored) and concerns about generalizability due to the overlap between downstream datasets and pre-training data, the paper has some limitations. The computational cost of the Optimal Blurriness Search (OBS) and the narrow clinical validation scope also raise concerns about real-world deployment. However, the paper still presents a solid methodology, and the ideas have potential. Thus, a weak accept is recommended, with the suggestion to address the mentioned weaknesses in future work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors promise to note the lack of segmentation, small sample sizes, and linear probing as limitations, and plan to explore them further in follow-up work.

    I hope the authors will open-source the code to promote the development of the medical community.




Author Feedback

We thank all reviewers for the constructive comments. Our responses are as follows:

R1:

  1. More blurred images for fine-tuning: Including more blurred images than optimal leads to a train-test distribution mismatch, as the test-set images use OBS-optimized blur. This mismatch degrades transfer performance (F1 drops from 89.84% to 87.87% in our preliminary experiment); see the illustrative OBS sketch after this list.
  2. Random values for σ_1, …, σ_T: We adopt a fixed-step σ schedule to ensure uniform blur coverage, reduce computation, and facilitate OBS. Fully random sampling would complicate optimal blur-level selection.
  3. Model parameters and computational costs: D2MAE adds a non-trainable 768-d blur token, increasing ViT-Base’s parameters by 0.0009%. The only extra operations (OBS and Gaussian blur) add <0.2 s per image on a V100 GPU. We will include this analysis and discussion in the revision.

R2:

  4. Comparisons with more SOTAs: Although UltraMAE was not directly compared due to unavailable code and pretrained weights, our preliminary experiments show that our method pre-trained on breast images achieved 89.42% accuracy on the BUSI dataset, outperforming UltraMAE’s reported 87.21% by 2.21%. Furthermore, our approach is orthogonal to and fully compatible with recent generative SSL methods, aiming primarily to show that diffusional deblurring improves ultrasound MIM, as exemplified by its integration with MAE.
  5. Component clarity: D2MAE is inspired by diffusion models but designed as a lightweight alternative that avoids costly iterative sampling. It uses a progressive, scheduled blurring strategy that efficiently emulates the deblurring trajectory. Compared to DeblurrMAE’s fixed σ, D2MAE samples variable σ across training steps, which better reflects the diverse noise characteristics inherent in ultrasound and thereby promotes robustness to real-world variability. Unlike true diffusion models, D2MAE jointly optimizes deblurring and MIM under a single end-to-end objective, avoiding multi-step denoising or pixel-only losses. This enables learning of noise-invariant yet semantically rich features, enhancing downstream performance. These distinctions will be clarified in the revision.
  6. Sensitivity to blur steps: We evaluate OBS sensitivity through an ablation on its search step. As shown in Table 3d, OBS maintains robust performance under moderate deviations: F1 drops by only 0.10% and 0.89% when deviating by ±1 and ±2 steps, respectively. This indicates strong per-image adaptivity.

R3:

  7. Novelty: D2MAE departs from prior works in both design and motivation. Unlike DeblurrMAE (fixed blur) or DiffMAE (pixel-focused diffusion conditioning), D2MAE applies progressive blurring tailored to ultrasound noise and unifies deblurring with masked reconstruction under a single MIM objective. OBS further enables image-specific adaptive blur selection. This is not a simple combination of prior methods, but a modality-driven redesign for ultrasound. We will emphasize this distinction in the revision.
  8. Generalizability and data overlap: To clarify, there is no data overlap between pre-training and fine-tuning: the former uses only unlabeled GE4K and LEPset data, while the latter uses disjoint labeled splits. Thus, our multi-organ evaluation reflects genuine transfer. Moreover, our diffusional deblurring and OBS are designed to address the unique noise characteristics of ultrasound in a principled and modality-specific manner. We also plan to extend this approach to other noise-sensitive imaging domains.
  9. Computational efficiency: OBS adds <0.2 s/image (V100 GPU) and runs only once before fine-tuning. We acknowledge the importance of speed in time-sensitive settings and will explore faster OBS variants or fixed-σ presets in future work.
  10. More evaluation tasks and settings: We appreciate these suggestions. Due to space limits, we focused on multi-organ classification. We will note the lack of segmentation, few-shot, and linear-probing evaluations as limitations and plan to explore them in follow-up work.
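As referenced in point 1 above, the following rough sketch illustrates how the Optimal Blurriness Search could select x_b(t*) before fine-tuning, based only on the descriptions in the reviews and rebuttal: each level of the fixed σ schedule is scored with the frozen pre-trained model's reconstruction loss, and the level with the lowest loss is kept. The interface of pretrained_mae, the kernel size, and the helper names are hypothetical rather than the authors' code.

    import torch
    import torchvision.transforms.functional as TF

    @torch.no_grad()
    def optimal_blurriness_search(pretrained_mae, blur_embed, image, sigmas, mask_ratio=0.75):
        # Score every blurriness level with the frozen pre-trained model and keep the
        # level t* whose blurred input yields the lowest reconstruction loss.
        best_t, best_loss = 0, float("inf")
        for t, sigma in enumerate(sigmas):
            blurred = TF.gaussian_blur(image, kernel_size=9, sigma=float(sigma))
            level_emb = blur_embed(torch.tensor([t], device=image.device))
            # Assumed interface: reconstructed image and mask (1 = masked pixel).
            recon, mask = pretrained_mae(blurred, level_emb, mask_ratio=mask_ratio)
            loss = ((recon - image) ** 2 * mask).sum() / mask.sum().clamp(min=1)
            if loss.item() < best_loss:
                best_t, best_loss = t, loss.item()
        # x_b(t*): the blurred image then passed to the downstream supervised model.
        x_b_star = TF.gaussian_blur(image, kernel_size=9, sigma=float(sigmas[best_t]))
        return best_t, x_b_star

Run once per image before fine-tuning, the search costs one forward pass per blur level, which matches the rebuttal's characterization of OBS as a one-time overhead of under 0.2 s per image.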




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This work has certain limitations but may still be of interest to the conference audience. The authors are encouraged to release their code as open source.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Please try to make revisions as promised in the rebuttal and release the code to the community.


