Abstract
While Whole Slide Imaging (WSI) scanners remain the gold standard for digitizing pathology samples, their high cost limits accessibility in many healthcare settings. Other low-cost solutions also face critical limitations: automated microscopes struggle with consistent focus across varying tissue morphology, traditional auto-focus methods require time-consuming focal stacks, and existing deep-learning approaches either need multiple input images or lack generalization across tissue types and staining protocols. We introduce an automated microscopy system powered by DeepAf, a novel auto-focus framework that uniquely combines spatial and spectral features through a hybrid architecture for single-shot focus prediction. The network regresses the distance to the optimal focal point from the extracted spatiospectral features and adjusts the control parameters for optimal image outcomes. Our system transforms conventional microscopes into efficient slide scanners, reducing focusing time by 80% compared to stack-based methods while achieving a focus accuracy of 0.18 μm on same-lab samples—matching the performance of dual-image methods (0.19 μm) with half the input requirements. DeepAf demonstrates robust cross-lab generalization, with only 0.72% false focus predictions and 90% of predictions within the depth of field. In an extensive clinical study of 536 brain tissue samples, our system achieves 0.90 AUC in cancer classification at 4× magnification, a notable achievement at far lower magnification than the 20× typical of WSI scans. The result is a comprehensive hardware-software design enabling accessible, real-time digital pathology in resource-constrained settings while maintaining diagnostic accuracy.
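A minimal sketch (PyTorch) of the single-shot spatiospectral idea the abstract describes: parallel spatial and spectral branches whose pooled features feed a regression head that outputs the signed distance to the optimal focal plane. Layer counts, channel widths, and the log-magnitude FFT input are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class SpatioSpectralAF(nn.Module):
    def __init__(self):
        super().__init__()
        # Two parallel feature extractors: the raw patch and its FFT magnitude
        self.spatial = nn.Sequential(conv_block(1, 32), conv_block(32, 64), conv_block(64, 128))
        self.spectral = nn.Sequential(conv_block(1, 32), conv_block(32, 64), conv_block(64, 128))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(256, 1)  # regresses the signed defocus distance

    def forward(self, x):
        # x: (B, 1, H, W) low-resolution grayscale frame
        spec = torch.log1p(torch.abs(torch.fft.fftshift(torch.fft.fft2(x))))
        f = torch.cat([self.pool(self.spatial(x)), self.pool(self.spectral(spec))], dim=1)
        return self.head(f.flatten(1)).squeeze(1)  # predicted z offset (µm)
```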
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0544_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0544_supp.zip
Link to the Code Repository
https://deepautofocus.github.io/
Link to the Dataset(s)
Incoherent Dataset: https://doi.org/10.6084/m9.figshare.5936881
BibTex
@InProceedings{YegYou_DeepAf_MICCAI2025,
author = { Yeganeh, Yousef and Frantzen, Maximilian and Lee, Michael and Hsing-Yu, Kun and Navab, Nassir and Farshad, Azade},
title = { { DeepAf: One-Shot Spatiospectral Auto-Focus Model for Digital Pathology } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {108 -- 118}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper describes a method that uses both spatial and frequency-domain features to predict autofocus distance from a single image. The system first applies simple thresholding to discard empty frames, then runs a low-resolution image through the network to predict the optimal focusing distance. The system is designed to work on a custom-made motorized platform. Its performance is demonstrated to be competitive with a dual-image approach. An ablation is given and shows that the spatiospectral variant of the model works best. Using this system, the authors collected a dataset of brain tissue and trained a cancer/no-cancer classifier that achieves a high AUC.
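A hedged sketch of the pipeline this description outlines: threshold away empty frames, then make a single corrective stage move from one low-resolution capture. The camera, stage, and model interfaces and the threshold value are hypothetical, not from the paper.

```python
import numpy as np

EMPTY_VAR_THRESHOLD = 5.0  # illustrative variance cutoff for empty frames

def scan_tile(camera, stage, model):
    frame = camera.capture_lowres()            # single low-res grayscale frame
    if np.var(frame) < EMPTY_VAR_THRESHOLD:    # simple thresholding: skip empty tiles
        return None
    dz = model.predict(frame)                  # one-shot signed focus offset
    stage.move_z(dz)                           # single corrective move, no focal stack
    return camera.capture_highres()
```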
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed system seems performant, with high accuracy compared to larger, more complicated systems.
- The technical choices presented are sound. The models seem to have been trained properly.
- The utility of the system is demonstrated with a dataset collection.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Generalization is one of the main claims of this paper. However, so far generalization has only been evaluated on a single additional data source comprising a single tissue type. It would be helpful to see a breakdown of how many kinds of tissue the system can handle, and to test rigorously across various stain types. These constraints can be severely limiting in real-world use cases, especially in the low-resource settings mentioned, where the input is likely to be highly varied.
- A low-resource setting was one of the claimed use cases, but there is no study of run time or of how reliable the system is (does it break after scanning 1,000 slides? 10,000?). It would be worth expanding on this point so the reader understands the major challenges faced in low-resource settings.
- While the ablation is very useful, it does seem strange that the fully spatial model is larger than the spatiospectral model. Shouldn't the former be a subset of the latter? This raises some questions about the fairness of the comparison.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While I find the result of this paper very encouraging, generalization is an important issue that is not addressed and can limit the usefulness of this work for the community.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The authors present DeepAf, a CNN-based framework designed to predict the optimal focal plane from a single low-resolution image. The architecture consists of a feature extractor that integrates both spatial and spectral information, followed by a regression head that estimates the optimal focus position. The proposed method is evaluated against two relevant baselines (Table 1), demonstrating performance comparable to the dual-image method of Dastidar et al. and outperforming the single-image approach of Jiang et al. Additionally, the authors conduct ablation studies to assess the individual contributions of spatial and spectral features, and they include a clinical study to support the practical relevance of the acquired images.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The results indicate that the proposed approach can achieve focal-position estimation accuracy comparable to a dual-image method, while requiring only a single input image.
- Although the combination of spatial and spectral features—via parallel processing and concatenation—is relatively straightforward, and the spectral features contribute only marginally to performance, the overall approach remains novel and contributes a fresh perspective to the problem.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The clinical case study lacks critical details, such as information on training procedures and data preparation, and is not compared against any existing studies. As a result, its ability to substantiate the effectiveness of the proposed method is extremely limited, if not negligible.
- The manuscript suffers from several phrasing issues and omits key methodological information, which compromises the clarity and readability of certain sections (see specific comments below). In particular, the method for determining the z-position of the low-resolution input image is not clearly described. Given that this choice directly impacts performance, the lack of explanation raises concerns about the reproducibility and reliability of the results.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Major Comments:
- There appears to be an error in the definition of focus error (FE). It is described as a mean absolute error, but the specific quantity being measured is not clearly stated (presumably the absolute difference between the predicted and optimal focal positions). More critically, FE values are shown as negative in Figure 2, which contradicts the definition of a mean absolute error and suggests either a mistake in the computation or a mislabeling in the figure.
- The rationale behind doubling the spatial and spectral blocks in the ablation experiments is unclear. Why are the blocks not simply removed rather than duplicated? This assumption arises from the differing parameter counts reported in Table 2. One would expect the spatio-spectral model to contain the same number of parameters as the two components combined, yet this is not the case. This raises the question of whether the improved performance is due to architectural changes or merely an increase in model capacity.
- Distances are typically defined as non-negative values. If an oriented (signed) distance is being used, this should be explicitly stated and clarified in the text.
- Since focal position varies along the z-axis, references to positional changes should use terms like “up” and “down” rather than “left” and “right,” as currently used in several parts of the paper. The existing terminology is spatially misleading.
- The meaning of “the median of all patches of one image” is ambiguous. Does this refer to the pixel-wise median across patches, or to the median of the model’s predictions for each patch? This should be clarified and corrected.
- The loss function used in training should be formally defined, including its mathematical formulation and justification.
- The data augmentation strategy should be described in more detail, including the types of transformations applied and their parameters.
- The function B() is described as a “bottleneck,” but this is misleading. It would be more accurate to describe it as a feature extractor with a bottleneck, since it depends on the full preceding network, not just the final bottleneck layer.
- The symbols X, Y, and Z are inconsistently defined. For example, Z is written to be a direction (i.e., a signed scalar or vector?), while X and Y appear to denote sets from which coordinates (x,y) are drawn. This inconsistency is confusing and should be resolved by defining all symbols clearly and using them consistently throughout the paper.
Minor Comments
- Several figures, especially Figures 1 and 2, are too small and difficult to read. Improving their size and resolution would greatly aid comprehension.
- In Figure 1a, the indices i and j appear to be mislabeled and should likely be x and y.
- The expression x={0,xmax} in Figure 1a is incorrect; the equality sign should be replaced with ∈ to correctly express set membership.
- The symbol N used in the formula on page 4 is undefined—presumably, it refers to the number of pixels, but this should be explicitly stated.
- Including the unit “mm” after δz is insufficient without providing a typical value or range. Quantitative context should be added or the units removed.
- The method for selecting the threshold τ (page 4) is not explained and should be added.
- There is an extra closing parenthesis in the formula defining the function f.
- Replace “control hats” with “control heads.”
- Replace “optical focal point” with “optimal focal point.”
- Replace “generated images” with “acquired images.”
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed approach is both practically relevant and sufficiently novel; however, the current level of formal and methodological clarity is inadequate and requires significant improvement before the paper can be considered for acceptance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
It seems the authors will address the main issues in the final revision.
Review #3
- Please describe the contribution of the paper
The paper proposes a focus-prediction model for improving the efficiency of using a microscope to acquire whole-slide images. The prediction model uses two encoders: one for the low-resolution image, and another for its 2D FFT. The model is able to predict the optimal focus in one shot.
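The 2D-FFT input mentioned here can be formed in a few lines; using the shifted log-magnitude spectrum is my assumption about how the transform is fed to the second encoder, not a detail from the paper.

```python
import numpy as np

def spectral_input(frame: np.ndarray) -> np.ndarray:
    spectrum = np.fft.fftshift(np.fft.fft2(frame))  # center the DC component
    return np.log1p(np.abs(spectrum))               # compress the dynamic range
```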
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed approach could increase the scanning efficiency of a microscope significantly by avoiding taking a full z stack of images to determine the optimal focus.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
How does one justify that a single out-of-focus input image contains enough information to infer the optimal focus point?
The large delta in FE in Table 1 between same-lab and different-lab results suggests the model may have overfitted.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
It would be helpful to provide examples of the input low res images that are used for inferring the focus point.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
See above.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We are glad that the reviewers found our work novel (R1), high-performing (R2), yet straightforward (R1), technically sound, well-trained (R2), and focus-optimized (R3). Here, we address the concerns:
Generalization (R1W1, R2W1): We evaluated DeepAf on a public autofocusing benchmark in histology, with superior focus error (0.18±0.17 μm vs 0.21±0.21 μm same lab; 0.32±0.36 μm vs 0.43±0.49 μm diff. lab) compared to [a] (pub. Nov. 2024), using three-branch training, MobileOne with DoG preprocessing, and larger batch size (128 vs 32). [a] compares against 8 methods (2018-23), which could be cited and included in our results, as DeepAf decisively outperforms all of them (same lab: 0.51-0.21 μm, diff. lab: 1.30-0.43 μm). Our single-shot architecture shows superior accuracy and generalization without complex preprocessing. Our clinical case further shows how DeepAf with a light architecture can be employed within a microscopy system for simultaneous low-cost scanning and cancer classification. [a] Chen et al. “Microscope Autofocus Based on Difference of Gaussians and Triplet Loss.” ICCSI 2024.
Reproducibility: The network is trained with an L1 regression loss (R1W8) between the GT and predicted z values, obtained from the network’s final layer (R1W2, Sec. 2.2). At inference, the final prediction is the median of the model’s predictions for each patch (R1W7). We will clarify the text and release the source code with augmentation details (R1W9) and microscope blueprints for reproducibility. Our system comprises a Raspberry Pi (exp. lifespan ~>24/7 x 3-5 yr), stepper motors (exp. lifespan ~>10k hrs), and 3D-printed nylon gears (replaceable with metal/injected plastic). Despite no formal durability stats, our system has completed >2k scans without failure, owing to the low resistance torque on the microscope adjustment gears (R2W2). The open-source code and microscopy system can potentially democratize WSI in low-resource settings.
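For concreteness, a minimal sketch (PyTorch) of the two details clarified above: training with an L1 loss on the predicted z value, and taking the per-image prediction as the median over patch-level predictions. Function names, shapes, and the optimizer handling are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def training_step(model, patches, z_true, optimizer):
    # patches: (B, 1, H, W) low-res patches; z_true: (B,) GT defocus distances
    optimizer.zero_grad()
    loss = F.l1_loss(model(patches), z_true)  # L1 regression loss on z
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict_image_z(model, image_patches):
    # Final per-image prediction: median of the model's per-patch predictions
    return model(image_patches).median().item()
```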
Definitions (R1): The focus error as a metric is the mean absolute error in all experiments, while Fig. 2 visualizes the mean error in z-values, which can also be negative (W3, W5). The left/right direction was used in reference to the distance from the optimal focal point (Fig. 2) and the motor movement direction (W6), via the control HAT (Hardware Attached on Top) (W19). We will resolve the ambiguities (W10, W11, MC) in the final version.
Parameters size (R1W4, R2W3): The efficiency of DeepAf comes from its architectural design. At the bottleneck, a U-Net processes 256 filters (256×256×3×3), while ours processes two streams of 128 filters each (2×128×128×3×3), halving parameters at this critical layer rather than doubling them. Throughout the network, similar factorization reduces parameters at each level.
Single Image (R3W1): A single out-of-focus image contains enough information to infer the optimal focus due to the relationship between defocus and frequency-domain characteristics (p. 5). Defocusing creates unique signatures in the cut-off frequency and spectral distribution (cf. Fig. 1d, [14]), which correlate with the defocus distance. Our spatiospectral model leverages this by extracting both spatial (tissue morphology, edges) and spectral features from a single image, maximizing the utilized information and allowing for accurate inference of the optimal focus.
Low-res images (R3W3): Some samples are shown in Fig. 1; we will add more details in the final version.
Lab Performance Gap (R3W2): The gap reflects a domain-shift problem rather than overfitting. DeepAf shows substantially better generalization to diff. lab settings compared to the fully-spatial variant or prior work [14]. The high DoF rate and low false-direction rate further validate that DeepAf captures generalizable focus cues rather than memorizing dataset-specific patterns. Tab. 2 demonstrates the significant cross-domain generalization improvement when combining spatial and spectral features, suggesting our model learns key relationships between image characteristics and focus distance, rather than lab-specific artifacts.
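The parameter claim above is plain arithmetic; a minimal check (Python) under the stated 3×3-kernel bottleneck sizes, ignoring biases:

```python
# Weights of one 256-filter 3x3 conv layer vs. two independent 128-filter
# streams, as in the rebuttal's bottleneck comparison (biases ignored).
single_stream = 256 * 256 * 3 * 3      # 589,824 weights
two_streams = 2 * 128 * 128 * 3 * 3    # 294,912 weights, i.e., half
print(single_stream, two_streams)      # 589824 294912
```

The single-image argument can likewise be illustrated numerically: blurring (a crude stand-in for defocus) suppresses high spatial frequencies, so the spectrum's effective cut-off tracks the blur strength. The toy script below only illustrates that physics, not the paper's method; the Gaussian blur model, the 95%-energy cut-off definition, and all values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sharp = rng.random((256, 256))  # stand-in for an in-focus texture

def blur(img, sigma):
    # Gaussian low-pass applied in the frequency domain
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    h = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * h))

def cutoff_radius(img, energy_frac=0.95):
    # Smallest radial frequency containing `energy_frac` of spectral energy
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    yy, xx = np.indices(spec.shape)
    r = np.hypot(yy - 128, xx - 128).astype(int)
    e = np.bincount(r.ravel(), weights=spec.ravel())
    return int(np.argmax(np.cumsum(e) > energy_frac * e.sum()))

for sigma in (0.0, 2.0, 4.0):  # larger sigma ~ farther from focus
    print(f"sigma={sigma}: 95%-energy radius = {cutoff_radius(blur(sharp, sigma))}")
```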
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The reviewers have highlighted several areas that require improvement, including the need for clearer articulation of the technical contributions, more detailed explanation of the methodology, and stronger comparisons with existing approaches. Additionally, both qualitative and quantitative analyses need to be more comprehensive. The supplementary video also does not clearly demonstrate how your approach functions on whole slide images; it appears that certain conditions or setups may have been predetermined. To enhance the manuscript, it is important to clarify the novelty of the technical contributions, elaborate on the clinical relevance of the work, and provide robust statistical and comparative analyses. Please refer to the reviewers’ comments for more details.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
My recommendation is based on the following: Vox populi (though it was admittedly not a decisive vox): 2 out of 3 reviewers voted “for”. One of the “for” votes was mild. The “against” vote came from an experienced researcher, so my recommendation is a bit of a gamble. In the paper’s favor are the utility of the method and the ingenious use of physics to do what is, at first glance, an impossible task. If the paper is accepted, I urge the authors to insert, early on as motivation, their explanation from the rebuttal of the physics whereby a single slice can contain actionable information about the focus in other slices.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The paper introduces DeepAf, a CNN-based framework that predicts the optimal focal plane from a single low-resolution image by leveraging both spatial and spectral features, achieving performance comparable to dual-image methods. The approach is validated through ablation studies and a clinical study, demonstrating its effectiveness in practical applications such as automated microscopy and medical imaging. According to the reviews, I think most of the issues have been addressed. The reviewer with negative comments did not give post-rebuttal feedback, and the confidence of that review is also low. To my understanding, the issues raised by this reviewer have been properly answered in the rebuttal. Therefore, my recommendation is accept.