Abstract

We propose a self-supervised model producing 3D anatomical positional embeddings (APE) of individual medical image voxels. APE encodes voxels’ anatomical closeness, i.e., voxels of the same organ or of nearby organs always have closer positional embeddings than voxels of more distant body parts. In contrast to existing models of anatomical positional embeddings, our method efficiently produces a map of voxel-wise embeddings for a whole volumetric input image, which makes it well suited to a range of downstream applications. We train our APE model on 8400 publicly available CT images of the abdomen and chest regions. We demonstrate its superior performance compared with existing models on anatomical landmark retrieval and weakly-supervised few-shot localization of 13 abdominal organs. As a practical application, we show how to cheaply train APE to crop raw CT images to different anatomical regions of interest with 0.99 recall, while reducing the image volume by 10-100 times. The code and the pre-trained APE model are available at https://github.com/mishgon/ape.
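To make the core idea concrete, here is a minimal, hedged sketch in PyTorch of how an APE-style voxel-wise embedding map could be produced and queried for landmark retrieval. The tiny stand-in network, tensor sizes, and landmark coordinates are illustrative assumptions, not the released model or its API (see the repository above for the actual implementation).

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the APE network: conceptually, any UNet-like model with 3 output
# channels plays this role; this toy conv stack is a placeholder, not the real model.
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 3, kernel_size=3, padding=1),
)

ct = torch.randn(1, 1, 64, 96, 96)  # (batch, channel, z, y, x) CT volume
with torch.no_grad():
    emb = model(ct)                 # (1, 3, 64, 96, 96): one 3D embedding per voxel

# Landmark retrieval: the voxel whose embedding is nearest to a query embedding
# (e.g., taken at a labeled landmark in another scan) is the predicted match.
query = emb[0, :, 32, 48, 48]       # embedding at a hypothetical landmark
flat = emb[0].reshape(3, -1).T      # (n_voxels, 3)
idx = torch.cdist(query[None], flat).argmin().item()
z, y, x = np.unravel_index(idx, emb.shape[2:])
print(f"retrieved voxel: z={z}, y={y}, x={x}")
```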

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3539_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/mishgon/ape

Link to the Dataset(s)

https://zenodo.org/records/7262581
https://flare22.grand-challenge.org/
https://www.cancerimagingarchive.net/collection/nlst/

BibTex

@InProceedings{Gon_Anatomical_MICCAI2024,
        author = { Goncharov, Mikhail and Samokhin, Valentin and Soboleva, Eugenia and Sokolov, Roman and Shirokikh, Boris and Belyaev, Mikhail and Kurmukov, Anvar and Oseledets, Ivan},
        title = { { Anatomical Positional Embeddings } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a self-supervised model for generating 3D anatomical positional embeddings (APE) that encode the anatomical closeness of voxels within medical images. Unlike existing models, this new approach can efficiently map voxel-wise embeddings across entire volumetric input images, enhancing utility across various downstream applications. It is trained on a large dataset of CT images from abdominal and chest regions and has shown exceptional performance in tasks like anatomical landmark retrieval and few-shot organ localization. The provision of the pre-trained model and the code enhances the model’s accessibility and potential for further research and application.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Novelty in Embedding Generation: The introduction of a self-supervised approach to generate 3D anatomical positional embeddings is innovative. This methodology enables the precise modeling of spatial relationships within medical imaging, a significant advance over previous models that could not efficiently handle whole-image embeddings.
    • Utility and Efficiency: The ability of the APE model to process entire volumetric images for embedding generation is a notable strength. This makes it not only more efficient but also more practical for integration into medical imaging workflows where quick and comprehensive analysis is critical.
    • Broad Applicability and Superior Performance: The model demonstrates superior performance in anatomical landmark retrieval and few-shot organ localization, outperforming existing models like RPR. Its design also facilitates easier application to other potential uses in medical imaging, such as conditioning in generative models or enhancing image cropping techniques.
    • Practical Application Demonstrated: A standout application detailed in the paper is the use of APE to efficiently crop raw CT images to specific anatomical regions with high accuracy. This application not only showcases the model’s practical utility but also highlights its potential to significantly reduce image processing times and storage requirements.
    • Open Accessibility: By making the code and pre-trained models publicly available, the paper encourages further exploration and adoption of the APE model, potentially accelerating advancements in medical imaging technology.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Comparison with MedLSAM: The authors should consider including a comparison with the MedLAM model introduced by MedLSAM[1], which is a foundational localization model for 3D CT. Given that APE is also trained on a large dataset, I am curious about how it compares with other foundational models.
    2. Statistical Significance in Experimental Results: The authors should consider adding statistical significance to the experimental results to strengthen the validity of their findings.
    3. Lack of Average Results in Table 4: Table 4 is missing a comparison of average results. It would be beneficial to include this data to provide a clearer overview of the model’s performance across different metrics.
    4. Detailed Labels in Figure 2: It would be helpful if Figure 2 included direct labels for axial, coronal, and sagittal views, along with their respective value ranges to enhance clarity and understanding of the visualized data.

    [1] Lei, Wenhui, et al. “Medlsam: Localize and segment anything model for 3d medical images.” arXiv preprint arXiv:2306.14752 (2023).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Inclusion of Comparative Analysis: The paper could significantly benefit from including a direct comparison with existing models such as MedLSAM’s MedLAM, which is also a foundational model for 3D CT. Such a comparison would not only help in highlighting the strengths and potential improvements of the APE model but also establish a benchmark against existing methodologies. It would be valuable to see both qualitative and quantitative comparisons to understand where APE provides improvements or may require further refinement.
    2. Statistical Significance: Adding statistical significance tests to the experimental results would enhance the credibility of the findings. This addition is crucial for the academic community to assess the reliability of the results presented. This would help in substantiating the claims made about the superior performance of APE over other models.
    3. Enhancement of Result Tables: The omission of average results in Table 4 makes it difficult for readers to gauge overall performance at a glance. Including these averages would provide a more comprehensive view of the model’s performance across different tests and conditions. It is recommended to also include standard deviations or other measures of variability to give a clearer picture of the model’s consistency.
    4. Clarification in Visual Representations: In Figure 2, it would be more informative if axial, coronal, and sagittal planes were clearly labeled and the respective value ranges were included. This would not only help in better understanding the images but also in appreciating the model’s ability to handle different anatomical views. Such clarifications would make the figures more self-explanatory and accessible to readers not familiar with medical imaging.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Simplicity: The method presented is straightforward and elegantly solves the problem of anatomical positional embeddings in 3D medical imaging.
    2. Usability: The practical applications demonstrated, especially the effective cropping of CT images to anatomical regions with high recall, show that the model is not only theoretically sound but also highly usable in real-world scenarios.
    3. Future Potential: The model opens up numerous possibilities for future applications, including its integration into more complex diagnostic systems or its adaptation for other imaging modalities beyond abdominal and chest CT scans.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a computationally efficient UNet-like model to produce 3D anatomical positional embeddings for voxels in CT images. Three different training strategies are explored and the best strategy not only performs competitively with recent position regression and anatomical embedding works but also results in notably better inference speed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method, APE, is novel and shows great computational benefits over its patch-based and voxel-based competitors. It also notably outperforms in few-shot organ localization.
    • Evaluation: the two experiments (landmark retrieval and few-shot localization) are reasonable and informative. The selected baselines are also sensible and up-to-date.
    • The article is well-written and clear. The introduction concisely summarizes existing methods, their drawbacks, and how APE addresses these drawbacks. The methods section is organized and introduces the training strategies coherently. The results section gives helpful clarifying details.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The primary weakness of this paper stems from the questionable comparison against SAM in both experiments, which is arguably the most interesting comparison on this topic.
    • The anatomical landmark retrieval task was evaluated using the labeled portion of the FLARE2022 dataset. This comparison is unfair since APE was pretrained using the unlabeled but in-distribution data from FLARE2022, while SAM was used out of the box with SAM’s released weights. According to the SAM paper, it was pretrained on DeepLesion, NIH-Lymph Node (NIH-LN), and an in-house chest CT dataset. To the authors’ credit, they did mention this, but it is still an issue for performance comparisons.
    • The same concern applies to the few-shot localization task which also uses FLARE2022.
    • On a more minor note, APE shows no improvement over RPR in anatomical landmark retrieval. However, APE compensates for this by being more efficient.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors indicate that code and pretrained models will be publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Suggestions

    • NNs have been shown to be more biased toward textures than higher order features (e.g., shape). My concern is that by using only high resolution outputs, the model is over-biased toward local textures. The global embeddings from SAM were essential for filtering out all the false positives of local similarity. I’m unsure how this method addresses this issue. An explanation on this would be useful.
    • Should clarify: does this method assume the source and target volumes are reasonably aligned? Otherwise the embedded spatial information may not be appropriate. For instance, a keypoint at the bottom of the left lung in the source image will incorrectly correspond to the bottom of the right lung in the target image if it is horizontally flipped given the model’s reliance on local, high-resolution features.

    Other Feedback

    • In Table 4, adding in an ‘avg’ column like in Table 3 would help readers compare overall performances
    • Introduction: “outputing” is misspelled
    • Introduction: should be “2D to 3D” instead of “1D to 3D” no? BPR and Deep-index used axial slices which are 2D images. If the concept trying to be conveyed was that the dimensionality of the distances is increasing from 1D (distance between axial slices) to 3D (distance between x, y, and z patch coordinates), maybe adding this explanation would improve clarity a bit.
    • Intro, key contributions: fix “APE demonstrates a superior performance”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper is well-written and introduces a novel method that is both efficient and effective. My only reservation keeping me from marking this as accept is that the comparison experiments against SAM, the main SOTA competitor, are questionable: evaluation was performed on the labeled portion of FLARE2022, and APE was trained on the separate but in-distribution unlabeled portion of FLARE2022, while SAM was not exposed to FLARE2022 data in any form. With that said, this paper would still be a valuable addition to the medical imaging community given its novelty and improvements in efficiency. I recommend a weak accept.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a self-supervised embedding model that generates a 3-dimensional anatomical positional embedding for each voxel in a medical image. The model is evaluated on an anatomical landmark retrieval task and an abdominal organ localization task and achieves SOTA performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The overall design of the proposed embedding model is reasonable as it is principle-driven. The choice of a UNet-like architecture as the backbone of the embedding model is based on the following principles: 1) Each voxel should have a positional embedding so that downstream applications can use the voxel-wise embeddings for retrieval or localization. 2) Since human organs have very similar shapes and appearances in radiological images, the positional embedding should depend on the appearance of the voxel’s surroundings.

    The training strategy (augmentation, formulation of the training objective) is then driven by the following principle: the distance between the embeddings of two voxels should match the distance between the two voxels in the original image, regardless of any non-anatomical changes to the image (e.g., crop, translation, color change, etc.).
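    As an illustration of this principle, a distance-matching loss could be written as in the sketch below. This is one plausible instantiation under our own assumptions, not the paper’s exact objective; the function name and tensor layout are hypothetical.

```python
import torch

def distance_matching_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                           mm_a: torch.Tensor, mm_b: torch.Tensor) -> torch.Tensor:
    """Hypothetical instantiation of the principle: pairwise distances in
    embedding space should match pairwise physical distances (in mm) between
    the same voxels in the original, un-augmented image.

    emb_a, emb_b: (N, 3) embeddings of voxels sampled from two augmented views.
    mm_a, mm_b:   (N, 3) physical coordinates of those voxels in the original scan.
    """
    d_emb = torch.cdist(emb_a, emb_b)  # (N, N) distances between predicted embeddings
    d_mm = torch.cdist(mm_a, mm_b)     # (N, N) ground-truth physical distances
    return ((d_emb - d_mm) ** 2).mean()
```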

    The single-stage, voxel-wise nature of the embedding model achieves high resolution in the localization task as well as low inference latency.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The evaluation experiments (for the retrieval and localization tasks) sampled voxels and bounding boxes from the training dataset images, instead of from a standalone dataset (a subset of the test set). This may make the results over-optimistic. The embedding quality should be evaluated on a dataset that was not seen by the embedding model during training.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    If the use of the training dataset in the evaluation experiments is intended, I’d recommend the authors describe why this evaluates the performance of the embedding model without any risk of bias or overfitting.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The design logic of the proposed method is very clear and aligns with its stated principles. The evaluation results also show a clear advantage for the proposed model.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their thoughtful and generally positive feedback, with all reviewers recommending acceptance, whether unconditionally (R1) or contingent on the rebuttal (R3, R5). We appreciate that R3 and R5 praised the novelty of our model, and that R1, R3, and R5 noted the efficiency of our method.

Below we address the reviewers’ concerns.

[Comparison with MedLAM] We appreciate R3’s suggestion to compare APE with MedLAM. However, we would like to highlight that, according to the MICCAI rebuttal guidelines, we are not permitted to add new experimental results during the rebuttal process, and breaking this rule would lead to automatic rejection. Thus, we are unable to include a comparison with MedLAM in the current version of the paper. Nonetheless, we note that the component of MedLAM that produces anatomical embeddings is trained with the same loss as RPR, which we do compare against. Quantitatively, MedLAM reports comparable few-shot organ localization IoUs (though on a different dataset). Qualitatively, MedLAM predicts one positional embedding per patch, while APE produces embeddings for all individual voxels in one step.

[Comparison with SAM] We acknowledge R1 and R5’s valid concern about the potential impact of different pretraining datasets on the comparison between APE and SAM. While APE’s pretraining dataset includes the unlabeled part of FLARE2022, and SAM was pre-trained on a different data collection, we believe that the comparison remains valuable and informative for several reasons. First, the labeled and unlabeled parts of FLARE2022, used for testing and pretraining respectively, do not intersect, ensuring that APE is evaluated on an independent test set. Second, the organ bounding box predictors based on APE, SAM, and other baselines (Tables 3-4) were trained on the same few-shot cross-validation splits of the FLARE2022 labeled set, ensuring a fair comparison. Third, FLARE2022 constitutes only 24% of APE’s pretraining data, mitigating the potential impact of domain-specific biases. Recall that FLARE2022 is a diverse dataset, with an unlabeled set collected from 22 different centers. Finally, the LymphNode dataset used for SAM pretraining and the PancreasCT dataset (part of FLARE2022) are sourced from the same clinical center (the National Institutes of Health Clinical Center), suggesting similar domains. Moreover, APE and SAM exhibit qualitative differences: APE embeddings are three-dimensional, explicitly encoding anatomical positions, while SAM embeddings are high-dimensional and purposely redundant.

[Overfitting to local textures] R5 expressed concern that APE overfits to local image features, e.g., textures. As written in Section 2.1, our loss enforces APE embeddings to be equivariant w.r.t. augmentations, including masking out random image patches, which prevents overfitting to local features.
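For concreteness, the kind of patch-masking augmentation referred to above could look like the sketch below. The function name, parameter values, and tensor layout are our illustrative assumptions, not the paper’s exact settings.

```python
import torch

def mask_random_patches(volume: torch.Tensor, n_patches: int = 8,
                        patch_size: int = 16) -> torch.Tensor:
    """Zero out random 3D patches of a (B, C, D, H, W) volume, so that a
    distance-matching loss cannot be satisfied by memorizing local textures.
    Assumes every spatial dimension is at least `patch_size`."""
    out = volume.clone()
    _, _, D, H, W = out.shape
    for _ in range(n_patches):
        z = torch.randint(0, D - patch_size + 1, (1,)).item()
        y = torch.randint(0, H - patch_size + 1, (1,)).item()
        x = torch.randint(0, W - patch_size + 1, (1,)).item()
        out[..., z:z + patch_size, y:y + patch_size, x:x + patch_size] = 0
    return out
```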

[APE specifications] We appreciate R5’s request to clarify APE specifications, such as whether input images need to be aligned. The limitations of APE are already mentioned in Section 3.4: in its current version, it is not equivariant to flips and rotations, i.e., input images must be in canonical orientation, and it is trained only on chest and abdominal areas. However, in other aspects, APE is very robust, being equivariant w.r.t. input image shifts, crops, and changes in voxel spacing.

[Improving results presentation] Following R3’s suggestion, we have supplemented the results in Table 3 with statistical significance tests. We have added “We show that APE significantly outperforms all the baselines using Wilcoxon signed-rank test, with p-value < 1e-6 for all the baselines.” in the Results section.
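For readers who want to reproduce this kind of check, a paired Wilcoxon signed-rank test on per-case errors can be run with SciPy as sketched below; the error arrays are placeholder values for illustration, not numbers from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-case errors (e.g., landmark retrieval error in mm) for APE and one
# baseline on the same test cases. These values are illustrative placeholders.
ape_errors = np.array([3.1, 4.0, 2.5, 5.2, 3.8, 4.4, 2.9, 3.6])
baseline_errors = np.array([4.5, 5.1, 3.9, 6.0, 4.2, 5.5, 3.3, 4.8])

# One-sided test: are APE's errors systematically smaller than the baseline's?
stat, p = wilcoxon(ape_errors, baseline_errors, alternative="less")
print(f"Wilcoxon signed-rank statistic={stat}, p-value={p:.4g}")
```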

Additionally, per R3’s advice, we have titled the projections in Figure 2 as “Axial”, “Coronal”, “Sagittal”, illustrated the directions of the main body axes (frontal, sagittal, and longitudinal), and added color bars to show the range of APE values.

We also added a column with average results to Table 4, as requested by R3 and R5.




Meta-Review

Meta-review not available, early accepted paper.


