Abstract

Automated vertebrae classification in spine images is a crucial yet challenging task due to the repetitive appearance of adjacent vertebrae and limited fields of view (FoV). Unlike previous methods that leverage the serial ordering of vertebrae to optimize classification results, we propose VertFound, a framework that harnesses the inherent adaptability and versatility of foundation models for fine-grained vertebrae classification. Specifically, VertFound introduces a vertebral positioning with cross-model synergy (VPS) module that efficiently merges semantic information from CLIP with spatial features from SAM, yielding richer feature representations that capture vertebral spatial relationships. Moreover, a novel Wasserstein loss minimizes the disparity between the image and text feature distributions by continuously optimizing the transport distance between them, giving CLIP a more discriminative alignment capability for vertebral classification. Extensive evaluations on our vertebral MRI dataset show that VertFound achieves significant improvements in both identification rate (IDR) and identification accuracy (IRA), underscoring its efficacy and the remarkable potential of foundation models for fine-grained recognition tasks in the medical domain. Our code is available at https://github.com/inhaowu/VertFound.
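To make the transport-distance idea concrete, below is a minimal, self-contained Sinkhorn-style sketch of an entropy-regularized Wasserstein loss between image and text feature sets in PyTorch. It illustrates the general technique only, not the paper's implementation; the feature dimensions, uniform marginals, and hyperparameters (`eps`, `n_iters`) are assumptions.

```python
import torch

def sinkhorn_wasserstein(img_feats, txt_feats, eps=0.1, n_iters=50):
    """Entropy-regularized Wasserstein (Sinkhorn) distance between two
    L2-normalized feature sets. A generic sketch, not the paper's exact loss."""
    img = torch.nn.functional.normalize(img_feats, dim=-1)
    txt = torch.nn.functional.normalize(txt_feats, dim=-1)
    cost = 1.0 - img @ txt.t()              # cosine-distance cost matrix (n x m)
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)          # uniform marginal over image features
    nu = torch.full((m,), 1.0 / m)          # uniform marginal over text features
    K = torch.exp(-cost / eps)              # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                # Sinkhorn fixed-point iterations
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    T = u.unsqueeze(1) * K * v.unsqueeze(0) # transport plan
    return (T * cost).sum()                 # transport cost used as the loss

# Example: 10 vertebra region embeddings aligned against 19 class-text embeddings.
loss = sinkhorn_wasserstein(torch.randn(10, 512), torch.randn(19, 512))
```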

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3978_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/inhaowu/VertFound

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Tan_VertFound_MICCAI2024,
        author = { Tang, Jinzhou and Wu, Yinhao and Yao, Zequan and Li, Mingjie and Hong, Yuan and Yu, Dongdong and Gao, Zhifan and Chen, Bin and Zhao, Shen},
        title = { { VertFound: Synergizing Semantic and Spatial Understanding for Fine-grained Vertebrae Classification via Foundation Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a method for detecting and identifying vertebrae in 2D MRI scans by leveraging and adapting features from two foundation models (SAM and CLIP) via cross-attention mechanisms. The method is shown to be a significant improvement over other domain-agnostic detection algorithms (e.g. YOLO, DETR) on an in-house dataset, and the code is made publicly available. An ablation study of the different components of the feature-adaptation framework is performed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of using large pre-trained foundation models for medical imaging problems is an exciting one, and I believe the proposed method is a sensible way of doing this.

    The presented results show a substantial improvement over domain-agnostic object detection methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some of the figures could be a bit clearer (see detailed comments below).

    I may have misunderstood this, but it appears that the method requires annotated bounding boxes as input, meaning it is not fully automated.

    Results are only reported on a single in-house dataset, so it is difficult to tell how well it would generalize.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • The data used is not public.
    • The paper claims to release anonymized code; however, following the link led to an empty repo with no code at all.
    • Most parts of the method are well described.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • I could not see any code at the anonymised link, although I think adding this would really help the understanding of the method.

    • How does the proposed method deal with missed vertebral detections? These would be fairly common in clinical data (e.g. collapsed vertebrae), so some discussion of this is important. An experiment where some vertebra detections are masked out and the remaining vertebrae must be identified would be interesting, to check that the model is not simply learning to put all its classifications along a single diagonal (see the sketch after these comments).

    • I might have missed it, but please explain a bit more about the bounding boxes, as I am unsure where they come from. Do they require manual annotation? If so, this should be made clearer.

    • Figure 4 is very hard to read. I would suggest simply showing confusion matrices for each configuration instead of the confusion star plots shown.

    • Could the method be adapted to work with 3D MRI scans, which are far more common in clinical practice?
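A minimal sketch of the masking experiment suggested in the comments above, assuming simple box/label lists and a random drop; this is purely illustrative and not part of the paper.

```python
import random

def mask_detections(boxes, labels, drop_prob=0.2, seed=0):
    """Hypothetical robustness check: randomly drop a fraction of detected
    vertebra boxes so the classifier must identify the survivors without
    relying on an unbroken serial order. Purely illustrative."""
    rng = random.Random(seed)
    kept = [(b, l) for b, l in zip(boxes, labels) if rng.random() > drop_prob]
    return [b for b, _ in kept], [l for _, l in kept]

# Example: drop ~20% of three lumbar detections, then re-run identification
# on the remaining boxes and compare IDR against the unmasked run.
boxes, labels = mask_detections([(10, 40), (10, 80), (10, 120)], ["L1", "L2", "L3"])
```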

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the proposed method is very interesting and adapting foundation models for medical imaging tasks is an important research area as foundation models become more ubiquitous.

    However, I think some parts of the paper require better explanation as detailed above.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes VertFound for fine-grained vertebrae classification, built on two foundation models. The method mainly comprises a vertebral positioning with cross-model synergy (VPS) module that fuses image-level features from the CLIP model with region-level features from the SAM model, and a Wasserstein loss designed to further optimize the transport distance between the image and text feature distributions. Experimental results show that the proposed method outperforms several object detection methods and recent foundation models on the fine-grained vertebrae classification task.
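As a rough illustration of what such cross-model fusion could look like, the sketch below applies dot-product cross-attention so that SAM region features attend to CLIP image (patch) features. The query/key/value assignment, dimensions, and residual connection are assumptions for illustration, not the paper's exact VPS module.

```python
import torch
import torch.nn as nn

class CrossModelFusion(nn.Module):
    """Illustrative dot-product cross-attention in which SAM region features
    query CLIP image (patch) features. Dimensions and the residual connection
    are assumptions, not the paper's exact VPS design."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from SAM region features
        self.k = nn.Linear(dim, dim)   # keys from CLIP patch features
        self.v = nn.Linear(dim, dim)   # values from CLIP patch features
        self.scale = dim ** -0.5

    def forward(self, sam_regions, clip_patches):
        q, k, v = self.q(sam_regions), self.k(clip_patches), self.v(clip_patches)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return sam_regions + attn @ v  # residual, "position-enhanced" regions

# Example: 10 vertebra regions attending over 196 CLIP patch tokens.
fused = CrossModelFusion()(torch.randn(10, 256), torch.randn(196, 256))
```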

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method merges semantic information from CLIP and spatial features from SAM, leading to richer feature representations that capture vertebral spatial relationships.
    2. The Wasserstein loss is demonstrated to assist in reducing the differences between image and text feature distributions, thus improving the model’s ability to discriminate vertebrae with high similarity.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed VertFound model in Table 1 requires Bbox prompts as inputs; however, it is not clear whether the other compared models also use Bbox prompts as a priori knowledge, which matters for a fair comparison.
    2. Fig. 4 could be improved. The current version, with vague colorful lines, is difficult to comprehend; more explanation would be helpful.
    3. The proposed method is designed for a classification task; however, only object detection methods are listed as baselines in Table 1 for comparison.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    please see the weakness.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The VertFound framework proposed by the authors achieves state-of-the-art performance on the vertebral classification task, but the model requires additional Bbox prompts as input. Therefore, for the comparison experiments, the authors need to provide a more detailed experimental setup for the compared models to ensure that the comparison is fair.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper includes several contributions: 1) It proposes combining a frozen pre-trained CLIP image encoder and SAM to extract features at both the image and region level, followed by a module that improves the representation using a dot-product attention mechanism, resulting in position-enhanced regions. 2) It further introduces the use of categorical information via the CLIP text encoder and a modified Wasserstein loss to obtain a refined image-text alignment, and thus provide enhanced detection. 3) The paper evaluates this architecture on an in-house dataset and includes ablation studies to confirm the relative contribution of the main components of the method, as well as detection models and recent foundation models as baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    By using foundation models and a clever architecture, the authors are able to provide a well-performing vertebra identification approach. In particular, the combination of CLIP and SAM to obtain both image- and region-level features, enhanced by dot-product attention, is not only reasonable but seems to positively impact performance. Further integrating categorical text information, by combining the enhanced features with the CLIP text encoder to align image and text, together with the proposed Wasserstein loss, improves performance further and adds high-level semantic information. Overall, both aspects are novel formulations and, in my view, an original way to combine foundation models such as CLIP and SAM. The ablation studies and the comparison to state-of-the-art methods provide further insight into the strengths of the introduced approach.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Unfortunately, the approach is only evaluated on a single private dataset, which is not made available to the public. While public MRI datasets are scarce and the detection task is harder in MRI than in CT, the authors could have used any of the CT vertebrae datasets as a secondary dataset. Such an evaluation would make it possible to place the work better in the literature and further reduce the risk that the approach generalizes poorly.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Actually, there is a link, but the repository is empty. The authors do not mention that the dataset will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, the paper is well written both in terms of language and structure. The authors succeed in explaining a complex approach well enough for a non-expert to be able to follow. There are a few aspects that could be improved, though: 1) As mentioned before, it would be highly beneficial to evaluate the method on another spine dataset, ideally a CT one. 2) The bounding box prompts for SAM are not detailed; this makes it hard to reproduce the paper and keeps relevant information in the dark, so a short description would be beneficial. 3) The weighting between CEL and WSL is set to be equal (see the loss form sketched after this paragraph). What is the logic behind this? Put differently, was it fixed empirically after experimentation, or is it an educated guess at a good weighting of both losses? 4) In Tables 1 and 2, statistical significance between the proposed method and the baselines, or the ablated versions of VertFound, should be reported. Looking at the standard deviations and means, I would assume several of the differences are statistically significant.
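For point 3, the equal weighting presumably corresponds to a combined objective of the following form (the notation is a guess; CEL and WSL are the reviewer's abbreviations for the cross-entropy and Wasserstein losses):

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CEL}} + \lambda\,\mathcal{L}_{\mathrm{WSL}}, \qquad \lambda = 1
```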

    On a less relevant note: a) The introductory paragraph of the methodology (before 2.1) repeats most of the content from the end of the introduction; this can be improved. b) In Fig., the C after the FPN should be C tilde, if I am not mistaken. c) C'_l in Equation 1 is never mentioned explicitly in the paper. d) In 2.2, VPM should be replaced with VPS.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written in terms of language and clarity. The method seems, within my expertise, to be novel and appealing. I can imagine that it could be ported to other applications without much hassle. I do not give a full accept for two reasons: (1) I think the strength of the method can only be confirmed if it is applied to other datasets, which are abundantly available; (2) given my expertise, I am not confident that the paper's contributions are as novel as they seem to me.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We appreciate Reviewer #1's recognition of the significant improvement our method presents over domain-agnostic detection algorithms and the excitement around leveraging pre-trained foundation models in medical imaging. Reviewer #3's positive feedback on our approach of merging semantic information from CLIP and spatial features from SAM, as well as on the effectiveness of the Wasserstein loss, is greatly appreciated. Additionally, we are grateful to Reviewer #4 for acknowledging the novelty and impact of our proposed method, as well as the detailed ablation studies and comparisons provided. Below we address all the constructive comments in detail; we will incorporate this feedback and the code link in the revision.

(1) Clarity of Figures: We will revise Figure 4 to present confusion matrices for each configuration instead of the current confusion star plots to enhance readability. In addition, we have corrected minor errors, such as spelling mistakes, throughout the paper.

(2) Annotated Bounding Box Requirement and Comparison Models: During training, the annotated bounding boxes are input to SAM and used to train the object detectors as well as the trainable parameters in VertFound. During testing, the bounding boxes input to SAM are predicted by a pre-trained detector such as YOLOv8; in other words, the bounding boxes input to VertFound are those predicted by the trained object detectors. We will clarify this in the latest version of our work. Our comparison is therefore fair, as both our method and the compared models such as YOLOv8 and DETR rely on annotated bounding boxes for training and predicted bounding boxes for testing. Additionally, as our method requires the position predictions of different vertebrae to further determine their classes, our comparisons focus mainly on object detection methods.

(3) Generalization and Adaptation to 3D Scans: Thank you for these suggestions. We will extend our experiments to include additional datasets from different sources and modalities to test the generalization capability of our method. We also acknowledge the suggestion to adapt our method to 3D MRI scans, which are more common in clinical practice. This extension will be a focus of a future journal version of the paper, where we will further explore and validate our method's performance on diverse and comprehensive datasets, including 3D MRI and CT scans.

(4) Code Availability: We have now fully uploaded the code and detailed instructions to facilitate reproducibility. We apologize for the confusion regarding the initially empty code repository and assure you that it has now been updated with the complete code. We are committed to continuing to improve and extend this work, ensuring that our contributions to medical imaging research remain robust and impactful.

(5) Experiment Setup and Analysis: Thank you for pointing out the unclear explanation of the loss weights and of statistical significance. We will clarify these points in the latest version of our work. Furthermore, in the future journal version of the paper, we will explore the impact of missed detections on the final classification results, further deepening our research.

In conclusion, we express our gratitude to the reviewers for their detailed and constructive feedback. Your insights have been invaluable in identifying areas for improvement, and we are committed to addressing these issues in our revised submission. We appreciate your careful review and look forward to continuing to refine and enhance our work.
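The train/test bounding-box protocol described in point (2) can be summarized in a short sketch. All names below (`detector`, `sam_encoder`, `classifier`) are hypothetical stand-ins for the corresponding components, not APIs from the released repository.

```python
def classify_vertebrae(image, detector, sam_encoder, classifier):
    """Sketch of the protocol in point (2): annotated boxes are used only at
    training time; at test time a trained detector (e.g. YOLOv8) predicts the
    boxes, which are passed to SAM as prompts. All callables are stubs."""
    boxes = detector(image)                   # predicted, not ground-truth, boxes
    region_feats = sam_encoder(image, boxes)  # box prompts -> region features
    return classifier(region_feats)           # one vertebra label per box

# Stub usage showing only the data flow, with dummy stand-ins:
labels = classify_vertebrae(
    image=None,
    detector=lambda img: [(12, 30, 52, 70), (12, 80, 52, 120)],
    sam_encoder=lambda img, boxes: [[0.0] * 256 for _ in boxes],
    classifier=lambda feats: ["L1", "L2"][: len(feats)],
)
```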




Meta-Review

Meta-review not available, early accepted paper.


