Abstract
Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet’s superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.
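Since no implementation details are given on this page, the following is a minimal, hypothetical PyTorch sketch of how a gated cross-view attention fusion of global and local features could be structured, based only on the abstract above; the class name, dimensions, and gating design are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GatedCrossViewAttention(nn.Module):
    """Hypothetical sketch: fuse global (panoramic ROI) features with
    local (cropped-tooth patch) features via cross-attention plus a
    learned gate. Names and dimensions are assumptions, not the paper's
    implementation."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Global-view tokens act as queries; local-view tokens supply keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate decides, per channel, how much cross-view context to admit.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, Ng, C) region-proposal features from the global branch
        # local_feat:  (B, Nl, C) patch features from the matched cropped tooth
        attended, _ = self.cross_attn(query=global_feat, key=local_feat, value=local_feat)
        g = self.gate(torch.cat([global_feat, attended], dim=-1))  # gating weights in [0, 1]
        fused = self.norm(global_feat + g * attended)              # gated residual fusion
        return fused

if __name__ == "__main__":
    fuse = GatedCrossViewAttention(dim=256)
    global_roi = torch.randn(2, 49, 256)    # e.g., 7x7 pooled ROI tokens
    local_patch = torch.randn(2, 196, 256)  # e.g., 14x14 ViT patch tokens
    print(fuse(global_roi, local_patch).shape)  # torch.Size([2, 49, 256])
```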
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0146_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/ShanghaiTech-IMPACT/DVCTNet
Link to the Dataset(s)
https://github.com/ShanghaiTech-IMPACT/DVCTNet
BibTex
@InProceedings{LuoTao_Adapting_MICCAI2025,
author = { Luo, Tao and Wu, Han and Yang, Tong and Shen, Dinggang and Cui, Zhiming},
title = { { Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15975},
month = {September},
pages = {44 -- 53}
}
Reviews
Review #1
- Please describe the contribution of the paper
In the manuscript titled “Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training”, the authors address two key limitations of current dental caries detection approaches: (1) the lack of high-quality and comprehensive dental caries datasets; and (2) the inability of existing generic computer vision models to perform dual-view reasoning or to align with clinical diagnostic workflows, which leads to suboptimal performance. To overcome these challenges, the authors introduce a new DVCT dataset and propose a dual-view co-training framework (DVCTNet) that adapts foundation models for dental caries detection through joint reasoning over the global (panoramic) and local (cropped tooth-level) views.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
(1) The paper is well-organized, with a clear articulation of its motivation, methodology, and clinical relevance. The proposed approach demonstrates strong potential for practical application in real-world dental diagnostic settings. (2) The dual-branch co-training framework is thoughtfully designed, incorporating a feature pyramid network for global view detection and a region proposal module for local tooth-level refinement. This combination enhances both spatial granularity and diagnostic precision.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
(1) The training strategy for the dual-view pretraining stage requires further clarification. Although the authors mention self-supervised learning on unlabeled data, it is unclear what loss function is used in this stage. Additionally, the paper appears to incorporate a pre-trained tooth detector and uses labeled data during training, and the evaluation of the proposed method is also performed on labeled data; how these components relate to the self-supervised pretraining stage should be clarified.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors propose a newly curated, high-precision dental caries detection dataset and a co-training design with dual-view reasoning that aligns with clinical diagnostic workflows.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This submission (#146) makes two key contributions:
- Gated Cross-View Attention Framework: The authors introduce a novel Gated Cross-View Attention mechanism designed to bridge two vision transformer branches, enabling the capture of both global and local features. This framework dynamically aggregates complementary information through learned gating weights, enhancing performance for the downstream detection task.
- Comprehensive Dataset with Refined Annotations: The authors present a new dataset featuring significantly more detailed and precise annotations compared to existing public benchmarks. They assert that this dataset addresses limitations in prior annotation granularity, enabling improved model training and evaluation for fine-grained detection.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper exhibits several notable strengths:
- Effective Integration of Local and Global Features: The authors present a novel strategy for aggregating local and global contextual information through their Gated Cross-View Attention mechanism. This systematic approach to bridging multi-scale features demonstrably enhances detection accuracy, addressing a key challenge in the field.
- Methodological Clarity: The framework is well described, including theoretically grounded architectural choices and implementation details.
- High-Quality Dataset with Granular Annotations: The authors introduce a meticulously curated dataset featuring annotations of significantly higher granularity than existing alternatives. This resource provides a robust foundation for training and evaluating fine-grained detection models, with potential broader utility for the research community.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Absence of Related Works Section: The omission of a dedicated related works section limits the paper’s ability to situate its contributions within the broader field. By not formally discussing prior approaches to multi-scale feature aggregation, the authors miss opportunities to clarify the novelty of their method and justify key design choices. This undermines readers’ ability to evaluate the incremental advancement offered by the proposed framework.
- Insufficient Benchmarking Against Multi-Scale Baselines: While the method claims advantages in aggregating local and global features, the authors provide only limited comparisons with other multi-scale-focused architectures (e.g., CrossViT, MPViT). This lack of rigorous benchmarking against state-of-the-art approaches makes it difficult to objectively assess the technical advancement of the proposed gated attention mechanism relative to existing paradigms.
- Incomplete Figure Documentation: Several figures are accompanied by captions that lack technical specificity. For instance, the caption of the architectural diagram does not describe the architecture itself. Likewise, the dataset examples fail to highlight the annotation-granularity improvements: regions are marked in colors that are never explained in the text or the caption. These omissions reduce the figures’ utility as explanatory tools.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The color distinction between yellow and orange in Figures 1 and 4 may pose challenges for readers, particularly in grayscale print formats. To improve accessibility, consider using higher-contrast color pairs (e.g., blue/orange) or supplementing colors with distinct patterns. Additionally, the dashed border’s relationship to the zoomed view is unclear: the caption states that dashed regions represent zoomed areas, yet only one region is dashed. Clarifying this with labels or insets would help non-experts interpret the figure’s purpose and the annotation differences being highlighted.
Figure 2’s dense visual structure could benefit from a more detailed caption and annotations. The authors could consider adding a brief description of each colored section and its purpose in the model.
Table 1 would be strengthened by including references for the baseline methods (e.g., ‘Mask R-CNN [1]’).
The absence of a related works section severely limits the paper’s ability to contextualize how the proposed gated cross-view attention differs from prior multi-scale fusion strategies.
It is unclear to the reviewer whether DINOv2 was benchmarked. If DINOv2 is used in the ablation study, the paper must clarify how it is modified (e.g., “We remove cross-view gating from our framework to replicate a vanilla DINOv2 setup”).
The authors introduce more granular annotations for the object detection task, yet the experimental results do not clearly demonstrate an advantage from this additional detail. While it is intuitive that segmentation tasks benefit from more detailed annotations, for object detection such granularity might not be necessary. Could the authors clarify why more granular data is expected to be beneficial in this context? Specifically, how do the finer annotations contribute to improved detection performance? A comparison between the benefits of granular versus coarser annotations would help justify the additional annotation effort and strengthen the overall contribution of this work.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The work presents a clinically relevant contribution with a clear, well-motivated framework for improving detection accuracy through multi-scale feature aggregation. However, the impact is hindered by several issues. The figures lack clarity, particularly in color contrast (e.g., yellow/orange distinctions in Figures 1/3) and caption specificity, which limits accessibility for non-experts. Revisions should employ high-contrast color schemes and detailed captions explicitly stating each figure’s purpose. A dedicated related works section is also needed to clarify the novelty of the gated cross-view attention mechanism relative to established multi-scale approaches (e.g., Feature Pyramid Networks). Finally, the experimental evaluation requires stronger benchmarking against state-of-the-art models or a clear justification for their omission. Addressing these gaps would strengthen the paper’s rigor and translational value.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
- The paper proposed a method (DVCTNet) for dental caries detection with global- and local-view co-learning, adapting large foundation models for improved performance.
- The paper proposed a large-scale dataset with 500,000 panoramic X-ray images collected from eight centers. 2,000 of the X-rays were well annotated.
- The proposed method was validated and compared with other recent ones using a public dataset and the proposed dataset.
- Ablation study was performed to validate the key components of the proposals (global and local view pretraining and dual-view fusion module).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well-motivated and well-organized.
- The proposed method utilized large foundation models to capture global-view and local-view features.
- The paper proposed a large-scale dataset with 500,000 X-rays collected from multiple centers. 2,000 of the X-rays are associated with high-quality annotations.
- The proposed method consistently improved accuracy over existing methods on two datasets.
- Ablation study results suggested positive effects of the key components in the proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The training and inference cost was not assessed. For example, the number of model parameters and the memory consumption were not discussed. As the method utilizes and trains large-scale foundation models, this cost may be a notable limitation for clinical applications.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is a strong paper with moderate weaknesses. I recommend acceptance.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
N/A
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A