Abstract
To restore proper blood flow in blocked coronary arteries via an angioplasty procedure, accurate placement of devices such as catheters, balloons, and stents under live fluoroscopy or diagnostic angiography is crucial. Identified balloon markers help enhance stent visibility in X-ray sequences, while the catheter tip aids precise navigation and co-registration of vessel structures, reducing the need for contrast in angiography. However, accurate detection of these devices in interventional X-ray sequences faces significant challenges, particularly occlusions from contrasted vessels and other devices and distractions from surrounding structures, which cause tracking of such small objects to fail. Most tracking methods rely on spatial correlation between past and current appearance; they often lack the strong motion comprehension essential for navigating these challenging conditions and fail to effectively detect multiple instances in the scene. To overcome these limitations, we propose a self-supervised learning approach that enhances the network's spatio-temporal understanding by incorporating supplementary cues and learning across multiple representation spaces on a large dataset. Building on this, we introduce a generic real-time tracking framework that effectively leverages the pretrained spatio-temporal network and also takes historical appearance and trajectory data into account. This results in enhanced localization of multiple instances of device landmarks. Our method outperforms state-of-the-art methods in interventional X-ray device tracking, especially in stability and robustness, achieving an 87% reduction in maximum error for balloon marker detection and a 61% reduction in maximum error for catheter tip detection.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2053_paper.pdf
SharedIt Link: pending
SpringerLink (DOI): pending
Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2053_supp.pdf
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{Isl_ANovel_MICCAI2024,
author = { Islam, Saahil and Murthy, Venkatesh N. and Neumann, Dominik and Cimen, Serkan and Sharma, Puneet and Maier, Andreas and Comaniciu, Dorin and Ghesu, Florin C.},
title = { { A Novel Tracking Framework for Devices in X-ray Leveraging Supplementary Cue-Driven Self-Supervised Features } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
year = {2024},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15006},
month = {October},
page = {pending}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a model for tracking surgical devices in X-ray fluoroscopy images. The main contribution is a spatiotemporal transformer that better models occlusions, enabling improved device tracking over time.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The model is clearly presented. Furthermore, evaluations are performed on a large dataset.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
While the authors compare against existing catheter and balloon tracking methods, they do not compare against the most advanced particle tracking models from general computer vision. The PIPs and PIPs++ models currently achieve state-of-the-art performance and handle occlusions better than any prior method. A comparison to these models would help elucidate whether the proposed model really sets a new standard (https://github.com/aharley/pips).
The proposed model has many components, and the ablations do not sufficiently demonstrate the importance of each component. For example, what happens if the weak label decoder and segmentation loss are removed?
The KDEs in Fig. 3 do not add any information and make the plots harder to interpret. Additionally, RMSE could be plotted on a log scale to better visualize differences between the models.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
No mentions of code release, model release, or data release are made in the manuscript.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
Small nitpick: some words are capitalized that should not be (e.g., Fluoroscopy, Angiography, Balloon, Catheter, Self-Supervised Learning). They’re not proper nouns and should all be lowercase.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Reject — could be rejected, dependent on rebuttal (3)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed model is very involved without much justification for the multiple components. Better ablations would help justify the design of the model. Furthermore, comparison to state-of-the-art particle trackers would help.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Weak Accept — could be accepted, dependent on rebuttal (4)
- [Post rebuttal] Please justify your decision
Thanks to the authors for the detailed comparison to PIPs++. It was illuminating! Happy to upgrade the paper to a weak accept.
Review #2
- Please describe the contribution of the paper
The authors propose a self-supervised learning approach utilizing contextual cues for balloon marker and catheter tip detection in X-ray.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed method holds significant clinical importance, and the motivation behind it is well-defined and evident.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The method is a relatively naïve self-supervised learning approach based on mask reconstruction and weak label prediction. The comparison is not comprehensive, lacking comparisons with baseline and state-of-the-art self-supervised learning methods such as MAE and DINOv2.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
no
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
The manuscript presents a valuable contribution to the field of device tracking in X-ray images through the use of supplementary cue-driven self-supervised features. The proposed method demonstrates clinical relevance, and the motivation behind it is clearly articulated. However, the self-supervised learning method employed in this study is relatively naïve, relying on mask reconstruction and weak label prediction. Additionally, the comparison appears incomplete, as it lacks a comparison with state-of-the-art self-supervised learning methods such as DINOv2.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Accept — could be accepted, dependent on rebuttal (4)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors are working on a real issue and have a pathway to solving it. Even though the method in use is not novel enough and the results could be more convincing, it is the best paper in my stack.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #3
- Please describe the contribution of the paper
- The authors combine temporal masked image modelling (self-supervised learning) and vessel segmentation (semi-supervised learning) to pre-train the encoder.
- The authors propose a temporal tracking framework based on the pre-trained encoder.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The combination of temporal MIM and semi-supervised segmentation (multi-task pre-training) for X-ray images is novel.
- The designed temporal tracking framework based on the pre-trained encoder is novel. The tracking framework outperforms other SOTA tracking models.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The figures should use the same symbols as the text for consistency.
- There are some typos in the paper.
- In a real application, the tracking framework accumulates past frames (0 to n) over time, which also increases GPU memory usage over time (as n grows).
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
N/A
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
- The task-specific model in Fig. 1 (a) is not mentioned in the text.
- In Section 2.2, "space-time attention (MHA)" does not seem correct. Should it be STA?
- In a real application, the tracking framework accumulates past frames (0 to n) over time, which also increases GPU memory usage over time (as n grows). How do the authors solve this issue? Do you set a maximum sequence length of past frames?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Accept — could be accepted, dependent on rebuttal (4)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed pretraining approach and tracking framework are novel. The raised real-application issue needs to be resolved.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Weak Accept — could be accepted, dependent on rebuttal (4)
- [Post rebuttal] Please justify your decision
The real-application issue has been clarified in the rebuttal. The authors state that the typos and figures will be corrected in the revised paper. However, the task-specific model is not explained in the rebuttal.
Therefore, I keep my weak accept.
Author Feedback
We thank the reviewers (R3, R4 and R5) for their feedback and R5 for noting the novelty of our approach. Please find clarifications for the main points raised below:
Lack of comparison to particle trackers (e.g., PIPs++) (R4): Particle tracking methods are trained on synthetic datasets of natural images using modules inspired by optical flow. They require large numbers of point correspondences between frames to learn unique features for various scene points, which is impractical in our interventional setup. For example, PointOdyssey, on which PIPs++ is trained, contains 18,700 trajectories. While these methods can technically be trained with sparse points, our early experiments with CoTracker (reported to outperform PIPs++) struggled to converge and performed 4x worse than our method for catheter tip tracking, likely due to the sparse trajectories in our data. Additionally, all particle tracking methods include a refinement stage that smooths trajectories temporally, making them unsuitable for real-time use. Although PIPs++ reports 55 FPS, it processes the entire video with non-overlapping windows (36 frames). In a practical real-time setting, overlapping sliding windows are used (as shown in supplementary Fig. 1) because we cannot wait to accumulate frames; in that case, PIPs++ runs at about 2 FPS. Due to these impracticalities, and to ensure a fair comparison, we focused on feature-based tracking methods, which have proven suitable for our problem.
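For intuition, the gap between the two throughput figures follows from simple amortization; the sketch below assumes one full 36-frame window pass per incoming frame in the online case and uses only the numbers quoted above.

    # Back-of-the-envelope throughput comparison (illustrative, assumption-based).
    window_frames = 36   # frames per PIPs++ window, as reported
    offline_fps = 55     # throughput with non-overlapping windows

    window_time = window_frames / offline_fps   # ~0.65 s to process one window
    online_fps = 1 / window_time                # one full window pass per new frame
    print(f"online sliding-window throughput: {online_fps:.1f} FPS")  # ~1.5 FPS, i.e., on the order of the ~2 FPS quoted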
Details on comparison with self-supervised learning methods and novelty (R3): One of the novelties of our SSL approach is the introduction of weak supervision, which forces the network to learn richer features. This idea can be applied to any recent SSL technique, such as MAE or DINO. We chose FIMAE, which showed superior results compared to other recent video-based SSL methods and is cited as [1] (recently published in a medical journal), to demonstrate our claim (Table 2). Our paper's major novelty is the downstream tracking framework, designed to effectively leverage SSL features for tracking in complex angiography scenes. A naive use of the SSL encoder gives inferior results, as shown in the first row of Table 3, where motion-aware feature matching and past trajectory are absent. Moreover, we show that matching in the space-time feature space (cross-attention) is more effective than traditional spatial feature matching using asymmetrical crops. This approach also results in flexible tracking that extends to multiple object instances, unlike existing tracking frameworks.
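As a rough illustration of space-time feature matching, the generic cross-attention sketch below (illustrative shapes and module choices only, not the exact architecture from the paper) lets a few query tokens attend over feature tokens gathered from several frames rather than a single spatial template.

    import torch
    import torch.nn as nn

    dim, heads = 256, 8  # hypothetical embedding size and head count
    cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

    queries = torch.randn(1, 2, dim)                  # e.g., two device-landmark query tokens
    space_time_tokens = torch.randn(1, 5 * 196, dim)  # 5 frames x 196 patch tokens from an encoder

    matched, attn_weights = cross_attn(queries, space_time_tokens, space_time_tokens)
    # `matched` could then be decoded into landmark locations for the current frame.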
Details on ablation analysis (R4): We refer to the baseline SSL as FIMAE and the version incorporating weak supervision as FIMAE-SC. Table 2 compares these models across downstream trackers (SimST and ours), showing that weak-supervision-based SSL enhances performance for both trackers. Table 3 presents ablations on the downstream tracker components, including space-time-aware feature matching (appearance tokens) and past trajectory. Supplementary Table 1(a) explores how varying the weights of reconstruction and mask segmentation (from weak supervision) impacts performance, and supplementary Table 1(b) highlights the effective feature size for matching. In conclusion, weak supervision requires a lower weight than reconstruction, and an appropriate feature size for matching is needed to balance accuracy against noisy predictions.
Other Clarifications (R5, R4): During inference, the tracker uses 5 frames: the current frame and the last 4 frames. As time progresses, the oldest frame in the sequence is dropped and the newest is added, ensuring the sequence always contains 5 frames (so memory does not grow with sequence length). This inference strategy is illustrated in supplementary Fig. 1. We empirically determined that 5 frames provide sufficient temporal context while keeping the runtime practical. The reported results were obtained on test sequences of up to 200 frames. We will fix the typos and make the figure easier to interpret in the final version.
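For concreteness, this fixed-length buffer can be sketched as follows; `tracker` is a hypothetical callable standing in for the actual model, and the names are illustrative only.

    from collections import deque

    WINDOW = 5  # current frame + last 4 frames, per the rebuttal

    def track_sequence(frames, tracker):
        """Run a tracker with a constant-size temporal window of frames."""
        buffer = deque(maxlen=WINDOW)  # oldest frame is dropped automatically
        predictions = []
        for frame in frames:
            buffer.append(frame)                        # newest frame in, oldest out
            predictions.append(tracker(list(buffer)))   # predict landmarks for the newest frame
        return predictions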
Meta-Review
Meta-review #1
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A