Abstract

One-shot detection of anatomical landmarks is gaining significant attention for its efficiency in using minimal labeled data to produce promising results. However, the success of current methods heavily relies on extensive unlabeled data to pre-train an effective feature extractor, which limits their applicability in scenarios where a substantial amount of unlabeled data is unavailable. In this paper, we propose the first foundation model-enabled one-shot landmark detection (FM-OSD) framework for accurate landmark detection in medical images, using solely a single template image without any additional unlabeled data. Specifically, we use the frozen image encoder of visual foundation models as the feature extractor, and introduce dual-branch global and local feature decoders to increase the resolution of the extracted features in a coarse-to-fine manner. The introduced feature decoders are efficiently trained with a distance-aware similarity learning loss to incorporate domain knowledge from the single template image. Moreover, a novel bidirectional matching strategy is developed to improve both the robustness and accuracy of landmark detection given the scattered similarity maps obtained from foundation models. We validate our method on two public anatomical landmark detection datasets. Using solely a single template image, our method demonstrates significant superiority over strong state-of-the-art one-shot landmark detection methods.
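
For concreteness, below is a minimal PyTorch sketch of the core one-shot matching idea described in the abstract: a frozen encoder produces coarse patch features for the template and query images, learnable decoders restore feature resolution, and each template landmark is located in the query via a cosine-similarity map. The encoder/decoder interfaces and all names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def detect_landmarks(encoder, decoder, template, query, template_pts):
    """One-shot landmark detection by dense feature matching (sketch).

    encoder      -- frozen foundation-model image encoder (assumed interface)
    decoder      -- learnable module that restores feature resolution
    template     -- (1, C, H, W) labeled template image
    query        -- (1, C, H, W) unlabeled query image
    template_pts -- list of (y, x) landmark coordinates on the template
    """
    with torch.no_grad():                              # the encoder stays frozen
        f_t, f_q = encoder(template), encoder(query)   # coarse patch features
    f_t = F.normalize(decoder(f_t), dim=1)             # (1, D, H, W), unit norm
    f_q = F.normalize(decoder(f_q), dim=1)

    preds = []
    for (y, x) in template_pts:
        v = f_t[0, :, y, x]                            # descriptor at the landmark
        sim = torch.einsum('d,dhw->hw', v, f_q[0])     # cosine similarity map
        flat = sim.flatten().argmax().item()
        preds.append(divmod(flat, sim.shape[1]))       # best (y, x) in the query
    return preds
```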

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0320_paper.pdf

SharedIt Link: https://rdcu.be/dV58q

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72120-5_28

Supplementary Material: N/A

Link to the Code Repository

https://github.com/JuzhengMiao/FM-OSD

Link to the Dataset(s)

https://github.com/MIRACLE-Center/Oneshot_landmark_detection

BibTex

@InProceedings{Mia_FMOSD_MICCAI2024,
        author = { Miao, Juzheng and Chen, Cheng and Zhang, Keli and Chuai, Jie and Li, Quanzheng and Heng, Pheng-Ann},
        title = { { FM-OSD: Foundation Model-Enabled One-Shot Detection of Anatomical Landmarks } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {297--307}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes the first foundation model-enabled one-shot landmark detection (FM-OSD) framework for medical images, achieving accurate landmark detection by utilizing only a single template image, without requiring any additional unlabeled images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper addresses multiple challenges in using features extracted from foundation models for landmark detection in medical images. For example, the features produced by foundation models have a much lower resolution than the input image, whereas precise landmark detection requires accurate, location-aware, high-resolution semantic information (see the sketch after this list).
    2. The paper applies pre-trained foundation models to the field of anatomical landmark detection in medicine, and proposes the first foundation model-enabled one-shot landmark detection framework.
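
To make the resolution gap in point 1 concrete, here is a back-of-the-envelope sketch; the 224-pixel input and 16-pixel patch size are illustrative assumptions, and actual foundation models vary:

```python
# A ViT-style encoder tokenizes the image into non-overlapping patches,
# so a 224x224 input with 16x16 patches yields only a 14x14 feature grid:
image_size, patch_size = 224, 16
feature_grid = image_size // patch_size   # 14 -- each feature spans 16 px,
# far coarser than the (sub-)millimetre precision landmark detection needs.
```
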
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Although the authors applied the Foundation models to anatomical landmark detection, the techniques used are not novel. The idea of global and local feature enhancement has been widely used, such as: [1] SegNetr: Rethinking the Local-Global Interactions and Skip Connections in U-Shaped Networks, MICCAI, 2023. [2] GL-Fusion: Global-Local Fusion Network for Multi-view Echocardiogram Video Segmentation, MICCAI, 2023. [3] Local-Global Dual Perception Based Deep Multiple Instance Learning for Retinal Disease Classification, MICCAI, 2021.

    2. Similarly, the bidirectional matching strategy is only an incremental improvement over standard template matching, and is not compelling enough on its own.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors should label the symbols of each encoder, decoder, and feature in Fig. 1 so that they correspond to the text.
    2. This paper uses a large model, which inevitably increases the parameter count and time complexity; these costs should be objectively compared with those of other methods and discussed.
    3. In the introduction section, this paper mentions the challenge of the domain gap between natural images and medical images, but no specific domain adaptation module is designed to address it.
    4. Beyond the application of the foundation model, the authors should highlight the technical innovations relative to other methods.
    5. I have noticed that existing methods tend to use a one-shot setting, which is a common task setup for fair comparison. However, for this task it is worth questioning whether such precision is clinically applicable. What about a few-shot setting? A few-shot setting does not add much cost, but the potential performance improvement may be large. This is just a suggestion, but exploring it further could bring greater clinical significance.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is not particularly innovative from a technical point of view. However, considering that this paper proposes the first foundation model-enabled one-shot landmark detection framework, which is instructive to the community, I give 4 points.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    To address the challenge of extensive unlabeled data in one-shot anatomical landmark detection, the authors investigate the efficacy of leveraging a pre-trained foundation model’s inherent feature extraction capabilities. They use dual-branch global and local feature decoders to refine feature resolution in a coarse-to-fine manner, vital for accurate landmark detection. These decoders undergo training with a distance-aware similarity learning loss, facilitating the integration of domain knowledge from a single labeled template image, essential for one-shot detection. Moreover, for precise landmark localization, the authors employ bidirectional matching, utilizing inverse matching error to ensure correspondence between points in the query and template images. Notably, ‘FM-OSD’ outperforms the state-of-the-art on both Head and Hand X-Ray datasets.
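
To illustrate the bidirectional matching idea summarized above, here is a hedged PyTorch sketch: among the top-k forward matches, the query point whose backward match lands closest to the original template landmark is kept. The candidate count and the error definition are assumptions, not necessarily the paper's exact procedure.

```python
import torch

def bidirectional_match(f_t, f_q, pt, k=5):
    """Bidirectional (forward + backward) matching for one landmark (sketch).

    f_t, f_q -- (D, H, W) unit-normalized template / query feature maps
    pt       -- (y, x) landmark location on the template
    Returns the query location with the smallest inverse matching error.
    """
    D, H, W = f_q.shape
    v = f_t[:, pt[0], pt[1]]
    fwd = torch.einsum('d,dhw->hw', v, f_q)        # template -> query
    candidates = fwd.flatten().topk(k).indices     # top-k forward matches

    best, best_err = None, float('inf')
    for idx in candidates.tolist():
        qy, qx = divmod(idx, W)
        bwd = torch.einsum('d,dhw->hw', f_q[:, qy, qx], f_t)  # query -> template
        ty, tx = divmod(bwd.flatten().argmax().item(), f_t.shape[2])
        err = (ty - pt[0]) ** 2 + (tx - pt[1]) ** 2  # inverse matching error
        if err < best_err:
            best, best_err = (qy, qx), err
    return best
```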

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper demonstrates clear coherence and readability, facilitating easy comprehension.
    2. FM-OSD stands out as the pioneering one-shot anatomical landmark detection method leveraging foundation models effectively, notably requiring zero unlabeled data and just one labeled template.
    3. The methodology section and implementation details are meticulously outlined, promising efficient reproducibility once the code is shared publicly.
    4. The results section meticulously compares FM-OSD’s performance with current relevant baselines in the literature, offering both qualitative and quantitative analyses.
    5. Notably, the conclusion highlights the potential extension of this work to 3D images, presenting promising avenues for future research.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited novelty: While the use of foundation models represents a notable aspect, the novelty of the approach appears constrained for the following reasons: Previous studies, such as “Towards Foundation Models Learned from Anatomy in Medical Imaging via Self-supervision,” have introduced self-supervised strategies for gradually decomposing and perceiving anatomy in a coarse-to-fine manner. Additionally, research like “Relative distance matters for one-shot landmark detection” has demonstrated the efficacy of using the cosine similarity between extracted features and ground-truth landmarks to improve landmark detection performance.

    2. The introduction section extensively cites applications and deep learning methods for landmark detection, but lacks detailed comparisons with previous approaches to facilitate understanding of the distinctive aspects of FM-OSD.

    3. Some related works report performance evaluations on datasets such as the Chest dataset for landmark detection, which are absent in this paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The methodology section provides thorough and detailed explanations, while implementation details are meticulously outlined. Once the code is made publicly available, it will greatly facilitate navigating through it and replicating the results with ease.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Upon careful examination of the results section of FM-OSD and the state-of-the-art (SOTA), some disparities have been identified in the reported results. We kindly encourage the authors to review the following points and consider necessary adjustments:

    1. Regarding SCP[16] - Table 2 of the paper [16] titled “Which images to label for few-shot medical landmark detection?”:
      (a) For the Head dataset:
      • The Mean Radial Error (MRE) reported in [16] is 2.74, matching that of FM-OSD.
      • The Success Detection Rate (SDR) reported for the 2 mm threshold is 43.79 in [16], consistent with FM-OSD.
      • However, discrepancies are noted in the SDR for other thresholds:
        • Threshold 2.5 mm: 53.05 in [16] compared to 48.04 in FM-OSD.
        • Threshold 3 mm: 64.12 in [16] compared to 60.02 in FM-OSD.
        • Threshold 4 mm: 79.05 in [16] compared to 77.72 in FM-OSD.
    2. Concerning EGTNLR[22] - Table 1 of the paper [22] titled “One-shot medical landmark localization by edge-guided transform and noisy landmark refinement”:
      (a) For the Head dataset:
      • The MRE reported in [22] is 2.13, slightly different from FM-OSD’s 2.27.
      • The SDR reported for various thresholds also displays discrepancies:
        • Threshold 2 mm: 54.69 in [22] compared to 49.45 in FM-OSD.
        • Threshold 2.5 mm: 67.47 in [22] compared to 63.07 in FM-OSD.
        • Threshold 3 mm: 77.85 in [22] compared to 74.70 in FM-OSD.
        • Threshold 4 mm: 90.02 in [22] compared to 88.91 in FM-OSD.
      (b) For the Hand dataset:
      • The MRE reported in [22] is 1.82.
      • The SDR reported for various thresholds:
        • Threshold 2 mm: 66.39 in [22] compared to 64.62 in FM-OSD.
        • Threshold 4 mm: 92.93 in [22] compared to 95.03 in FM-OSD.
        • Threshold 10 mm: 99.97 in [22], consistent with FM-OSD.

    It is worth noting that the reported results of other SOTA baselines align with those in their respective papers.
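
For readers cross-checking these numbers: MRE is the mean radial (Euclidean) distance in millimetres between predicted and ground-truth landmarks, and SDR at a given threshold is the percentage of landmarks falling within that distance. A minimal NumPy sketch (the function name and interface are hypothetical):

```python
import numpy as np

def mre_and_sdr(pred, gt, mm_per_pixel, thresholds=(2.0, 2.5, 3.0, 4.0)):
    """pred, gt: (N, 2) arrays of landmark coordinates in pixels."""
    radial_mm = np.linalg.norm((pred - gt) * mm_per_pixel, axis=1)
    mre = radial_mm.mean()                                   # mean radial error
    sdr = {t: 100.0 * (radial_mm <= t).mean() for t in thresholds}
    return mre, sdr
```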

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach has limited novelty, and there is a notable discrepancy between the Mean Radial Error (MRE) and Success Detection Rate (SDR) scores reported in this paper and those in the original papers, indicating potential limitations in replicating the results of the state-of-the-art methods.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes an algorithm to detect landmarks in medical images with one labeled image, utilizing encoders pre-trained on natural images and decoders to extract global and local features, trained with a similarity loss. Compared to the literature, no extra unlabeled images are needed. Experiments were conducted on two public X-ray datasets, where performance was compared with the state-of-the-art.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is organized in a clear and informative manner. The figure illustrates the methodology in detail, and the experimental design is described thoroughly. Performance was compared with the state-of-the-art on a set of evaluation metrics.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Null.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Null.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of requiring no extra unlabeled images for landmark detection is what makes this paper stand out.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the AC and reviewers for their time and valuable feedback. We are delighted and encouraged that the reviewers find our work to be “the first foundation model-enabled one-shot landmark detection framework” (R1, R3), note that it “notably requir[es] zero unlabeled data” (R3, R4), observe that it “meticulously compares FM-OSD’s performance with current relevant baselines” (R3, R4), and praise its “clear coherence and readability” (R3, R4). We would like to further clarify several questions raised by the reviewers.

  1. Label the symbols in Fig. 1 (R1). Thanks for the suggestion. We will include the symbols of different components in Fig. 1 in our final version for better clarity.

  2. Details of tackling the domain gap between natural images and medical images (R1). As stated in our abstract and in the last paragraph of the introduction section, we use learnable decoders supervised by a distance-aware similarity learning loss to narrow the domain gap between natural and medical images and to integrate, to some extent, domain knowledge from the single labeled template image (one plausible form of such a loss is sketched after this list).

  3. Distinctive aspects of the proposed FM-OSD compared with previous deep learning methods in the introduction section (R3). The deep learning methods mentioned in the first paragraph of the introduction section typically require a large number of high-quality labeled data to train their models and achieve accurate detection results. However, it is extremely time-consuming and difficult to obtain such high-quality labeled data from domain experts. By contrast, our proposed method utilizes solely a single labeled template image and does not require any additional unlabeled data.

  4. Performance evaluations on datasets such as the Chest dataset for landmark detection are absent (R3). We will include references that discuss performance evaluations on additional datasets in the introduction section of our final version.

  5. Disparities between results reported in this paper and those in the original papers (R3). We appreciate the reviewer checking the consistency of the reported results of the SOTA baselines against those documented in their respective papers. (1) For some of the SDR results of SCP, upon careful re-examination, we found that we accidentally referenced the data from the row above the target values for SCP in Table 2 of the original paper. We apologize for this and will update the correct values in our final version. Notably, with the correct values, our proposed method still outperforms SCP by a substantial margin, namely 77.92 (ours) vs. 53.05 (SCP), 84.59 (ours) vs. 64.12 (SCP), and 91.92 (ours) vs. 79.05 (SCP) in terms of SDR under 2.5 mm, 3 mm, and 4 mm, respectively, demonstrating the effectiveness of our proposed method. We have thoroughly double-checked all other numbers to ensure the correctness of the results in our final version. (2) Regarding EGTNLR, after careful examination we found that the template image used by EGTNLR on the Head dataset differs from the one used by other methods. Additionally, EGTNLR did not specify the template image used on the Hand dataset. Thus, as mentioned in the first paragraph of “Comparisons with State-of-the-arts” on Page 7 of our manuscript, we re-implemented EGTNLR using their released code and report its results under the same training and testing settings as the other comparison methods and ours on both datasets, for a fair comparison.
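
Regarding point 2 above, one plausible form of a distance-aware similarity learning loss (an illustrative assumption, not necessarily the paper's exact formulation) regresses the cosine-similarity map toward a target that decays smoothly with distance from the ground-truth landmark:

```python
import torch
import torch.nn.functional as F

def distance_aware_loss(feat, landmark, sigma=8.0):
    """Distance-aware similarity learning loss (illustrative sketch).

    feat     -- (D, H, W) decoded, unit-normalized template features
    landmark -- (y, x) ground-truth landmark position
    The similarity map is pushed toward a Gaussian centred on the landmark,
    so similarity falls off with spatial distance from the true location.
    """
    D, H, W = feat.shape
    anchor = feat[:, landmark[0], landmark[1]].detach()  # landmark descriptor
    sim = torch.einsum('d,dhw->hw', anchor, feat)        # cosine similarity map

    ys = torch.arange(H, device=feat.device, dtype=feat.dtype).view(H, 1)
    xs = torch.arange(W, device=feat.device, dtype=feat.dtype).view(1, W)
    d2 = (ys - landmark[0]) ** 2 + (xs - landmark[1]) ** 2
    target = torch.exp(-d2 / (2 * sigma ** 2))           # distance-aware target
    return F.mse_loss(sim, target)
```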




Meta-Review

Meta-review not available, early accepted paper.


