Abstract

Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations (slots). This enables effective reasoning about objects and events at a low computational cost and is thus applicable to critical healthcare applications, such as real-time interpretation of surgical video. The heterogeneous scenes in real-world applications like surgery are, however, difficult to parse into a meaningful set of slots. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. To address this challenge, we propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization. The model achieves state-of-the-art performance on multiple surgical databases, demonstrating that unsupervised object-centric methods can be applied to real-world data and become part of the common arsenal in healthcare applications.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4725_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/4725_supp.zip

Link to the Code Repository

https://github.com/PCASOlab/Xslot

Link to the Dataset(s)

Abdominal dataset: https://upenn.box.com/s/493licnenrssjukuvok5zkvc5cqmx1nh

Cholec dataset: https://upenn.box.com/s/ree79lv9fbibjbs2b8mkwzz207oqu6jj

Thoracic dataset: https://upenn.box.com/s/rxqoi81j5ar4l343ob6otdxxeusc3iwg

BibTex

@InProceedings{LiaGui_Future_MICCAI2025,
        author = { Liao, Guiqiu and Jogan, Matjaz and Hussing, Marcel and Zhang, Edward and Eaton, Eric and Hashimoto, Daniel A.},
        title = { { Future Slot Prediction for Unsupervised Object Discovery in Surgical Video } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose to use slot attention for the given task with some modifications. The results are compared thoroughly, and the method is somewhat justified.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Comprehensive comparison in the results
    2) Multiple datasets
    3) Clear methodology
    4) Well-written paper

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) A methodical comparison with Slot-BERT is missing.
    2) Why this model performs better than Slot-BERT and other methods should be discussed.
    3) Computational efficiency should be discussed.
    4) Statistical analysis with respect to previous works must be added.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major bottleneck is the Slot-BERT comparison. It should be resolved before anything else.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Although the rebuttal seems okay, it is important to incorporate the changes in the revised version.



Review #2

  • Please describe the contribution of the paper

    This paper applies unsupervised object-centric learning to complex surgical videos, aiming for low training cost and real-time interpretation. The proposed method demonstrates superior performance on the surgical instrument segmentation task across multiple datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Investigating unsupervised object discovery in surgical videos is quite encouraging to this community considering the complex surgical scenes.
    • The evaluations are extensive, covering three surgical datasets, and the ablation studies on long clips and the zero-shot scenario are quite useful.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The novelty of the method is somewhat limited. Specifically, the slot prediction design is quite similar to the rollout module in “Object-Centric Multiple Object Tracking”, published at ICCV 2023. It would be good to cite this paper and discuss the major differences.
    • How does slot prediction handle objects entering and leaving the scene if the predicted slots are used as the next frame's slot initialization?
    • Since instruments are easily occluded in surgical video, how does the slot merger handle occlusion when the occluded instrument may be more similar to other instruments in the scene than to itself in the last frame?
    • The training signal is the reconstruction loss. I believe slot attention is good at object reconstruction but bad at reconstructing complex backgrounds, especially the tissue, blood, and lesions in surgical video. What do you do to avoid such challenges during training?
    • In Table 1, it would be better to report other supervised (non-object-centric) methods or SOTA supervised methods as a reference.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic studied in this paper is inspiring to the surgical video community, and the work is solid and well evaluated.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes to achieve unsupervised segmentation using dynamic future slot prediction for abdominal and cholecystectomy surgery videos. A dynamic slot initialization module that predicts the number and allocation of slots in future frames via a masked auto-encoding objective is proposed.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper proposes a dynamic slot initialization module that predicts the number and allocation of slots in future frames via a masked auto-encoding objective.
    • The module can be used for bidirectional temporal reasoning using random masking in training.
    • The model outperforms the benchmark models, and the study is supported by ablation studies.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The literature review on object-centric learning for the temporal domain is very limited. The differences between the proposed approach and these methods, and its contributions over them, should be discussed in more depth.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes to achieve unsupervised segmentation using dynamic future slot prediction for abdominal and cholecystectomy surgery videos. A dynamic slot initialization module that predicts the number and allocation of slots in future frames via a masked auto-encoding objective is proposed. The model outperforms the benchmark models, and the study is supported by ablation studies. However, an in-depth discussion of the literature on object-centric learning for the temporal domain is lacking. The differences between the proposed approach and these methods, and its contributions over them, should be discussed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my concerns about the paper.




Author Feedback

We appreciate the time and effort of the AC and the reviewers. Below we clarify the main concerns (C) and reply to other questions (Q) listed by the reviewers.

C-R1|R3: Limited discussion in the temporal domain, especially R3: Slot-BERT.

A: Dynamic slot initialization and temporal reasoning are a significant improvement over existing architectures, including Slot-BERT. Key differences: 1) Dynamic Slot Prediction: A fixed number of slots per video (e.g., Slot-BERT) limits the ability to adapt to dynamic scenes. We dynamically predict the number of slots per frame. 2) Our dynamic slot prediction allows for a much more refined temporal slot initialization beyond contextual learning of temporal coherence as in Slot-BERT and other temporal learning architectures. We can expand on this in the discussion.

R3-Q2: Why do we outperform Slot-BERT and other methods?

A: The performance gain is due to dynamic slot allocation and more refined temporal reasoning. E.g., in Fig. 3, Slot-BERT with a fixed slot count either over-segments (11 slots) or under-segments (5 slots). Our dynamic-slot approach balances granularity and completeness, resulting in better delineation of instruments and tissues.

R3-Q3: Computational efficiency

A: The forward time per frame on a single NVIDIA RTX A6000 GPU is 5.6 ms, supporting real-time downstream tasks. Merging and future prediction add a small overhead: a fixed-slot-count method such as Slot-BERT requires 1.7 ms per frame. This overhead (3.9 ms) is minor given the current context window, as temporal reasoning operates in latent space. We could increase the context window by 20× and still keep latency below 100 ms.
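
The latency figures above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope projection: the function name `projected_latency_ms` is hypothetical, and it assumes (this is not stated in the rebuttal) that only the merging and future-prediction overhead grows with the context window, and that it grows linearly.

```python
# Per-frame timings quoted in the rebuttal (ms, single NVIDIA RTX A6000).
DTST_MS = 5.6        # proposed model, forward pass
SLOT_BERT_MS = 1.7   # fixed-slot-count baseline (Slot-BERT)


def projected_latency_ms(window_scale: float) -> float:
    """Project per-frame latency for a context window `window_scale` times larger.

    ASSUMPTION (not from the rebuttal): the base cost stays at the Slot-BERT
    figure and only the merging/prediction overhead scales, linearly, with
    the context window.
    """
    overhead_ms = DTST_MS - SLOT_BERT_MS  # about 3.9 ms
    return SLOT_BERT_MS + overhead_ms * window_scale


if __name__ == "__main__":
    print(f"overhead at current window: {DTST_MS - SLOT_BERT_MS:.1f} ms")
    print(f"projected latency at 20x window: {projected_latency_ms(20):.1f} ms")
```

Under this linear-scaling assumption, a 20× window gives roughly 1.7 + 20 × 3.9 ≈ 79.7 ms per frame, which is consistent with the stated bound of less than 100 ms. A different scaling law, for example quadratic attention cost, would give a different projection.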

R3-Q4: Statistical analysis

A: One-tailed t-tests of the reported results against Slot-BERT (second best) on mBO-V, mBO-F, mHD, ARI, and CorLoc are significant (p<0.01 with Bonferroni correction) on Thoracic (all p<1E-4), Cholec (p<{0.004, 0.008, 1E-5, 0.007, 0.003}), and Abdominal (except CorLoc) (p<{0.002, 0.005, 0.001, 0.004, 0.03}). We can add this to the final version.
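
As a reading aid, the quoted bounds can be tabulated against the 0.01 level. This sketch makes two assumptions that the rebuttal does not confirm: that the values in each set follow the metric order mBO-V, mBO-F, mHD, ARI, CorLoc, and that they are already Bonferroni-adjusted, so they are compared with 0.01 directly. The name `non_significant` is hypothetical.

```python
ALPHA = 0.01  # significance level quoted in the rebuttal
METRICS = ["mBO-V", "mBO-F", "mHD", "ARI", "CorLoc"]

# Upper bounds on p as quoted in the rebuttal.
# ASSUMPTION: values are listed in the order of METRICS and are already
# Bonferroni-adjusted, so they are compared with ALPHA without rescaling.
P_BOUNDS = {
    "Thoracic": [1e-4] * 5,
    "Cholec": [0.004, 0.008, 1e-5, 0.007, 0.003],
    "Abdominal": [0.002, 0.005, 0.001, 0.004, 0.03],
}


def non_significant(p_bounds, alpha=ALPHA):
    """Return, per dataset, the metrics whose quoted bound exceeds alpha."""
    return {
        dataset: [m for m, p in zip(METRICS, bounds) if p > alpha]
        for dataset, bounds in p_bounds.items()
    }


if __name__ == "__main__":
    for dataset, metrics in non_significant(P_BOUNDS).items():
        print(dataset, "not significant on:", metrics or "none")
```

Under these assumptions the only bound above 0.01 is the last Abdominal value, which matches the "except CorLoc" remark.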

R4-Q1: Novelty compared to Slot-Roll-Out in Zhao et al.

A: Slot-Roll-Out learns to roll out duplicated slots but requires sparse supervision. Our method performs unsupervised merging of redundant slots and integration of object parts purely based on appearance and dynamics. A citation & discussion will be added.

R4-Q2: Handling in-and-out object transitions

A: Our model handles object transitions via the merging mechanism in both the slot refinement and initialization stages. There are always slots available (before slot refinement) and if a new object enters, a redundant slot that would otherwise be merged will capture the object. When an object exits, a redundant slot will be merged with one encompassing existing objects. This clarification will be added.
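
The general idea of merging redundant slots can be illustrated with a toy example. This is a generic sketch and not the authors' implementation: the greedy grouping, the cosine-similarity criterion, the 0.9 threshold, and the names `cosine` and `merge_redundant_slots` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_redundant_slots(slots, threshold=0.9):
    """Greedily group slots whose similarity to a group centroid is >= threshold.

    Hypothetical illustration only. Returns one averaged vector per group,
    so the number of returned slots varies with the scene content.
    """
    groups = []  # each group is a list of slot vectors
    for slot in slots:
        for group in groups:
            centroid = [sum(v) / len(group) for v in zip(*group)]
            if cosine(slot, centroid) >= threshold:
                group.append(slot)
                break
        else:
            groups.append([slot])  # no similar group found: keep as its own slot
    return [[sum(v) / len(g) for v in zip(*g)] for g in groups]


if __name__ == "__main__":
    # Two near-duplicate slots plus one distinct slot: the duplicates merge.
    print(len(merge_redundant_slots([[1.0, 0.0], [0.98, 0.1], [0.0, 1.0]])))
    # A formerly redundant slot now represents a new object: nothing merges.
    print(len(merge_redundant_slots([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])))
```

In the first call the two near-duplicate slots collapse into one, leaving two slots. In the second call the third slot has drifted to a distinct embedding, as it might when a new object enters, so all three slots are kept.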

R4-Q3: How is occlusion handled?

A: The slot attention mechanism jointly embeds both appearance and location in the latent space, which allows for better slot tracking even when an object is partially or temporarily obscured. Extreme overlaps of similar objects might still be difficult to disentangle. E.g., two long instruments will be correctly outlined after they intersect (due to location bias), but overlaps lasting longer than the context window (i.e., >11 sec) might split the slots. Exploring solutions with a longer context window and a specialized training set could be future work.

R4-Q4: Reconstruction in complex backgrounds

A: We agree that reconstructing full images in cluttered scenes is challenging. Following prior work ([12], [18]), our method reconstructs feature representations instead of raw pixels, which is demonstrably more robust on complex backgrounds.

R4-Q5: Other supervised baselines should be reported

A: We agree that results from supervised baselines could be informative. We could add Per-SAM and Surgical SAM (Yue et al., 2023) on Abdominal EndoVis data (mIoU scores 60.26 and 80.33, respectively) and Nwoye et al., 2019 (CorLoc 38.2 on Cholec).




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    N/A

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


