Abstract

Rapidly advancing multi-modal learning shows great promise in medical image analysis, but challenges remain in the detection of jawbone lesions. Existing general-purpose models fail to capture the relationships between anatomical contexts and spatial locations in CBCT images, and the complexity of these models hinders interpretability. We propose PolarDETR, a novel framework combining anatomical priors and multi-modal alignment through: 1) Polar Text-Position Encoding (PTPE), which links text to spatial coordinates via polar mapping, 2) Anatomical Constraint Learning, ensuring lesion detection within anatomically plausible regions, and 3) Position Matching Optimization for spatial consistency. Evaluated on 180 clinical cases (6929 CBCT slices), our method achieves a state-of-the-art mAP of 93.66%, outperforming both single-modal (e.g., DETR at 89.35%) and multi-modal models (e.g., CORA at 91.52%). Additionally, PolarDETR excels in interpretability, with an ACS of 84.12% and PMS of 80.45%, demonstrating its potential to enhance both detection performance and clinical usability in real-world applications. Our code is available at https://github.com/Cxxxsky/PolarDETR.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2840_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Cxxxsky/PolarDETR

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YanYux_PolarDETR_MICCAI2025,
        author = { Yang, Yuxuan and Zhong, Chen and Zhang, Xinyue and Ma, Ruohan and Li, Gang and Guo, Yong and Li, Jupeng},
        title = { { PolarDETR: Enhancing Interpretability in Multi-modal Methods for Jawbone Lesion Detection in CBCT } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {505 -- 514}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents PolarDETR, a novel multi-modal framework for jawbone lesion detection on CBCT images, incorporating clinical text. The method introduces Polar Text-Position Encoding (PTPE) to encode anatomical priors in polar coordinates, and Anatomical Constraint and Position Matching Loss (AC-PML) to align predictions with anatomically plausible regions. Two new interpretability metrics—Anatomical Consistency Score (ACS) and Position Matching Score (PMS)—are proposed. Evaluated on an in-house dataset, the method achieves superior mAP (93.66%) and interpretability scores over state-of-the-art unimodal and multimodal baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    · Well-motivated and timely: The paper targets a real-world and clinically significant task—detecting jawbone lesions in CBCT—where multimodal fusion is especially relevant due to the spatial ambiguity of lesions.
    · Novel polar coordinate design: The use of polar representation to match dental anatomy is well-justified and aligns with clinical language and structure.
    · Interpretability focus: The authors make a clear and credible effort to go beyond detection accuracy, proposing new evaluation metrics (ACS, PMS) that bridge the gap between AI predictions and clinical usability.
    · Strong empirical results: The proposed method clearly outperforms both image-only and prior multi-modal baselines in mAP, ACS, and PMS.
    · Ablation study: The contribution of each module (PTPE, AC-PML) is dissected effectively, supporting the claimed benefits.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I think this paper is technically solid, but I have a few questions:

    1. Only CLIP and CORA are selected as multi-modal baselines. More recent or domain-specific methods (e.g., BioViL-T or MedCLIP+ variants) could strengthen comparison.
    2. The authors did not use an anonymous GitHub link, which violates the double-blind requirement of the MICCAI reviewing procedure… Please use an anonymous GitHub link.
    3. I find there is no corresponding code in the released repository, so reproducibility is still a problem.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-motivated, technically sound, and clinically meaningful contribution. The paper demonstrates strong results and offers new ideas for anatomical grounding and interpretability in multimodal detection. However, it falls slightly short on dataset scale, baseline breadth, and transparency of some critical components. These can be addressed in revision or supplementary material. Technically speaking, the paper is a solid addition to MICCAI and should be accepted. However, this paper seems to violate the double-blind rule. I will raise my score if providing the official GitHub link is deemed acceptable.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have rectified the prior concerns regarding code accessibility.



Review #2

  • Please describe the contribution of the paper

    The primary contribution of this paper is the introduction of PolarDETR, a multi-modal framework designed for jawbone lesion detection in CBCT images by fusing textual clinical data and image-based features. This fusion is achieved through three key innovations:

    1. Polar Text-Position Encoding (PTPE): A novel mechanism that maps anatomically relevant text descriptors (e.g., quadrant, distance from the mental foramen) into a polar coordinate system, aligning textual clues with spatial positions in the CBCT volume more naturally than traditional Cartesian coordinates.

    2. Anatomical Constraint Learning: The model enforces plausible lesion locations through predefined jawbone anatomical masks, ensuring that predicted lesions fall within clinically relevant regions.

    3. Position Matching Optimization: A custom loss function component that aligns bounding boxes with the polar-encoded text, reinforcing interpretability and minimizing mismatches between the text-derived region-of-interest and the model’s predicted detection.
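
    The polar mapping described in point 1 could be sketched roughly as follows. This is an illustrative sketch only, not the paper's actual encoding: the quadrant names, angular sectors, and normalization radius are all assumptions.

    ```python
    import math

    # Assumed mapping of jaw quadrants to angular sectors (radians); the paper's
    # real PTPE design may differ substantially.
    QUADRANT_ANGLES = {
        "upper_right": (0.0, math.pi / 2),
        "upper_left": (math.pi / 2, math.pi),
        "lower_left": (math.pi, 3 * math.pi / 2),
        "lower_right": (3 * math.pi / 2, 2 * math.pi),
    }

    def polar_encode(quadrant: str, distance_mm: float, max_radius_mm: float = 60.0):
        """Map a text-derived descriptor (quadrant + distance from a reference
        landmark such as the mental foramen) to a normalized (theta, r) pair."""
        lo, hi = QUADRANT_ANGLES[quadrant]
        theta = (lo + hi) / 2.0                    # sector midpoint as the angular estimate
        r = min(distance_mm / max_radius_mm, 1.0)  # normalize radius to [0, 1]
        return theta, r

    def to_cartesian(theta: float, r: float):
        """Convert back to (x, y) for comparison with predicted box centers."""
        return r * math.cos(theta), r * math.sin(theta)
    ```

    The appeal of such a scheme is that clinical phrases like "lower left quadrant, 30 mm from the mental foramen" decompose directly into an angle and a radius, whereas a Cartesian grid has no natural slot for either cue.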

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Major strengths are as follows:

    1. Novel formulation via Polar Coordinate Encoding for CBCT: The Polar Text-Position Encoding (PTPE) introduces a clinically motivated coordinate system that better reflects jawbone geometry compared to Cartesian grids.

    2. Novel constraints and metrics on text alignment to image: PolarDETR incorporates an anatomical constraint learning stage, ensuring that predicted lesions conform to likely jawbone regions (e.g., alveolar bone, mandibular canal). The paper also presents custom metrics-Anatomical Consistency Score (ACS) and Position Matching Score (PMS)-quantifying how well the model output aligns with clinically relevant anatomy and text-derived location descriptions.

    3. Strong empirical performance comparing to SOTAs like DETR or CLIP-DETR
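
    A position-matching score like the PMS described in point 2 could plausibly be computed as the fraction of predictions whose centers fall inside the text-derived polar sector. The exact ACS/PMS definitions are not reproduced here; the sector test and score below are assumptions for illustration only.

    ```python
    import math

    def center_in_sector(box, theta_range, r_range, origin=(0.0, 0.0)):
        """Check whether a box center (x1, y1, x2, y2) falls inside a polar
        sector defined by angle and radius intervals around a reference origin."""
        cx = (box[0] + box[2]) / 2.0 - origin[0]
        cy = (box[1] + box[3]) / 2.0 - origin[1]
        theta = math.atan2(cy, cx) % (2 * math.pi)  # fold angle into [0, 2*pi)
        r = math.hypot(cx, cy)
        return theta_range[0] <= theta <= theta_range[1] and r_range[0] <= r <= r_range[1]

    def position_matching_score(boxes, sectors):
        """Fraction of predictions whose centers land in their text-derived
        sector; sectors are (theta_range, r_range) pairs, one per box."""
        if not boxes:
            return 0.0
        hits = sum(center_in_sector(b, s[0], s[1]) for b, s in zip(boxes, sectors))
        return hits / len(boxes)
    ```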

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. While the authors emphasize interpretability, the proposed metrics (Anatomical Consistency Score and Position Matching Score) merely evaluate how well text-derived anatomical descriptors—already provided as inputs—match the model’s bounding-box predictions. In other words, the system aligns its output with what is essentially part of its input, rather than generating an explanation of why those detections are deemed lesions. Thus, although the method demonstrates strong input-output consistency, it does not offer deeper clinical insights into the features or reasoning that lead to a specific lesion detection.

    2. Because the proposed approach relies heavily on rich textual descriptors (e.g., distances, quadrant identifiers), a robust Named Entity Recognition (NER) process is crucial. The study should report the NER pipeline’s accuracy and clarify how errors or incomplete references in the text would affect the model’s performance. This is especially important given that the Polar Text-Position Encoding (PTPE) contributes substantially to detection accuracy, as evidenced by the ablation study.

    3. Figure 4 lacks sufficient description. It is unclear which columns correspond to PolarDETR and which represent the competing models.

    4. “Interpretability” issue is also reflected in the authors’ writing—for example, in the conclusion: “Results show PolarDETR outperforms existing models, offering greater transparency for AI-assisted diagnosis.” As noted in Comment 1, this suggests the authors equate improved text-image alignment with transparency. However, this primarily reflects model prediction accuracy rather than providing true interpretability or insight into clinical decision-making.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Github link provided is not anonymous and source code is not released

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution, a multi-modal framework designed for jawbone lesion detection in CBCT images, is novel and interesting. However, the interpretability claims are exaggerated and not fully valid.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The manuscript proposes PolarDETR, a multimodal approach that combines clinical text and CBCT images to improve jaw lesion detection. By incorporating PTPE and AC-PML, the model enhances detection accuracy, anatomical consistency, and interpretability. Results demonstrate that PolarDETR outperforms existing models, providing greater transparency for AI-assisted diagnosis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Multi-modal learning shows great potential in medical image analysis, but challenges remain in jawbone lesion detection. To address issues such as the inefficiency of general-purpose models and reduced interpretability with increasing complexity, this manuscript proposes PolarDETR, a novel method that improves both accuracy and interpretability. By incorporating clinical text-derived location information into the model’s query space, PolarDETR aligns anatomical knowledge with CBCT imaging in a polar coordinate framework, enhancing lesion localization while preserving interpretability.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Writing Weakness: The architectural overview in Figure 1 remains unclear, even after reviewing the methodology section. The distinction between the offline and online stages needs to be explicitly explained. Additionally, the handling and definition of ground truth data should be clarified. Figure 4 also lacks sufficient annotations and should be labeled more comprehensively.

    Contribution Weakness: While many existing studies have explored multimodal data for medical image segmentation, the manuscript should clearly highlight how the proposed method differs from previous work. Moreover, it should describe how the interpretability of the model is evaluated to support claims of improved transparency.

    Comments on Experiments: The manuscript should include a comparison between the proposed method and SAM-Med2D to provide a more comprehensive performance evaluation.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The four main modules of the multimodal deep learning system—image encoder, text encoder, fusion module, and pretraining objectives—should be described more clearly. The contribution of the text modality to detection and segmentation performance needs further clarification. Additionally, the manuscript should clearly define interpretability and explain how its correctness is evaluated to support claims of improved model transparency.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The contribution is acceptable. The responses to the reviewers are acceptable. The manuscript needs to be revised according to the reviewers’ comments.




Author Feedback

Dear Reviewers: We appreciate your thorough review and valuable feedback. Below, we address each of your comments in detail.

Lack of Comparison [R2Q1&R4Q4]: Following your suggestion, we have conducted in-depth studies comparing field-specific methods such as BioViL-T, MedCLIP+, and SAM-Med2D. Due to the limitations of the rebuttal process, we are unable to elaborate on the experimental results here, but we will further explore the comparison of these methods in future work.

Double-Blind and Code Availability [R2Q2, Q3&R3]: The GitHub repository has now been updated to remove personal information, and the source code is now available. We hope this resolves the concerns.

Misuse of Transparency and Interpretability Concerns [R3Q1, Q4&R4Q2]: You are right to point out the confusion between “transparency” and “interpretability”. Our goal is to enhance interpretability by aligning the model with anatomical descriptors and improving its credibility in clinical decision-making. We will replace “transparency” with “interpretability” in the revised version to more accurately reflect our contribution. We recognize that ACS and PMS provide interpretability only indirectly, by evaluating the alignment of model outputs with anatomical descriptors and thereby helping clinicians understand why certain areas are identified as lesions. Although these indicators do not fully explain the reasoning process, they improve clinicians’ trust in the model in clinical practice. In future work, we plan to further explore the reasoning process and enhance interpretability by combining visualization techniques.

NER Accuracy and Impact [R3Q2]: Clinician-validated NER annotations ensured high-quality training data, which contributed to improved lesion detection performance. We will add an analysis in Section 2.2 on how annotation noise (e.g., from real-world uncurated text) could theoretically impact model generalizability.
Figure Clarity [R3Q3&R4Q1]: We will update Figure 4 to clearly label which model corresponds to each image for better comparison between PolarDETR and other methods. Regarding Figure 1, we will enhance it by adding more annotations to highlight key components (e.g., the NER model) and by improving the layout for better readability. These changes will clarify the architecture and make the process flow easier to follow.

Offline and Online Stages [R4Q1]: We will clarify the pipeline of the offline and online stages: in the offline stage, we first fine-tune the NER model on clinical text, extract polar coordinate information, and embed it into the position encoding of PolarDETR; in the online stage, the fine-tuned NER model infers position information from new text and uses it together with the image data for lesion detection.

GT Definition [R4Q1]: We defined the GT for lesion detection as the bounding box (BBox) predicted by the model, together with the sector-shaped area (for PMS) and the average jaw distribution (for ACS) set based on the FDI tooth position method. The GT for NER includes the tooth number, quadrant identifier, and distance to the relevant anatomical reference point. We will further clarify these definitions in the revision (Section 3.1).

Dataset Scale [R2]: Due to the difficulty of collecting annotated clinical data, our current dataset is small. However, we have collected 175 new cases and will continue to expand the dataset to facilitate further improvement and optimization of the model.

Comparison with Previous Work [R4Q2]: Unlike traditional medical image segmentation methods, our work enhances the model’s alignment with clinical anatomical descriptors by fusing position information in polar coordinates and incorporating an anatomical consistency design. This approach not only improves the accuracy of lesion detection but also enhances the interpretability of the model to a certain extent. We will further clarify this in the revised version.
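
The offline/online split the authors describe could be sketched as below. This is a hypothetical, self-contained illustration: the NER extraction, polar embedding, and detector are trivial stubs, and none of these function names come from the paper.

```python
import math

def extract_descriptor(report: str) -> dict:
    """Stub NER step: pull a quadrant keyword and a distance (mm) from free
    text. A real pipeline would use a fine-tuned NER model instead."""
    quadrant = "lower_left" if "lower left" in report else "unknown"
    distance = 30.0 if "30 mm" in report else 0.0
    return {"quadrant": quadrant, "distance_mm": distance}

def polar_embed(desc: dict) -> tuple:
    """Stub polar encoding: quadrant -> assumed sector midpoint angle,
    distance -> radius normalized by an assumed 60 mm maximum."""
    midpoints = {"lower_left": 5 * math.pi / 4, "unknown": 0.0}
    return (midpoints[desc["quadrant"]], desc["distance_mm"] / 60.0)

def detect(image, position_hint):
    """Stub detector: returns the polar hint it would condition queries on."""
    return {"position_hint": position_hint, "boxes": []}

# Offline stage: fine-tune NER on clinical text and precompute polar position
# embeddings for the detector (fine-tuning itself is omitted from this stub).
# Online stage: parse a new report and fuse its polar hint with the image.
hint = polar_embed(extract_descriptor(
    "lesion in the lower left jaw, 30 mm from the mental foramen"))
result = detect(image=None, position_hint=hint)
```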




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers have reached a consensus to accept the paper.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


