Abstract

Analyzing temporal developments is crucial for the accurate prognosis of many medical conditions. Temporal changes that occur over short time scales are key to assessing the health of physiological functions, such as the cardiac cycle. Moreover, tracking longer-term developments that occur over months or years in evolving processes, such as age-related macular degeneration (AMD), is essential for accurate prognosis. Despite the importance of both short- and long-term analysis to clinical decision making, they remain understudied in medical deep learning. State-of-the-art methods for spatiotemporal representation learning, developed for short natural videos, prioritize the detection of temporal constants rather than temporal developments. Moreover, they do not account for varying time intervals between acquisitions, which are essential for contextualizing observed changes. To address these issues, we propose two approaches. First, we combine clip-level contrastive learning with a novel temporal embedding to adapt to irregular time series. Second, we propose masking and predicting latent frame representations of the temporal sequence. Our two approaches outperform all prior methods on temporally dependent tasks including cardiac output estimation and three prognostic AMD tasks. Overall, this enables the automated analysis of temporal patterns which are typically overlooked in applications of deep learning to medicine.
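The abstract does not specify the form of the temporal embedding, so the following is only an illustrative sketch of one common way to encode irregular acquisition times: a sinusoidal encoding evaluated at real-valued timestamps instead of integer frame indices. The function name `time_embedding`, the parameters `dim` and `max_period`, and the example visit times are all assumptions made for illustration and are not taken from the paper or its code repository.

```python
import math


def time_embedding(times, dim=8, max_period=10000.0):
    """Sinusoidal embedding of real-valued acquisition times.

    Hypothetical helper: `times` holds one timestamp per image (e.g. months
    since the first visit). Unlike an index-based positional encoding, two
    visits that are far apart in time receive embeddings that differ
    accordingly, even if they are adjacent in the sequence.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, as in standard Transformer encodings.
    freqs = [max_period ** (-k / half) for k in range(half)]
    embeddings = []
    for t in times:
        sines = [math.sin(t * f) for f in freqs]
        cosines = [math.cos(t * f) for f in freqs]
        embeddings.append(sines + cosines)
    return embeddings


# Made-up visit times in months; note the irregular gaps between visits.
visits = [0.0, 1.0, 2.5, 9.0]
emb = time_embedding(visits)
```

In a setup like this, each embedding vector would typically be added to the corresponding per-image feature before the sequence is passed to a temporal model; whether the authors do exactly this is not stated here.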

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1006_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1006_supp.pdf

Link to the Code Repository

https://github.com/Leooo-Shen/tvrl

Link to the Dataset(s)

N/A

BibTex

@InProceedings{She_Spatiotemporal_MICCAI2024,
        author = { Shen, Chengzhi and Menten, Martin J. and Bogunović, Hrvoje and Schmidt-Erfurth, Ursula and Scholl, Hendrik P. N. and Sivaprasad, Sobha and Lotery, Andrew and Rueckert, Daniel and Hager, Paul and Holland, Robbie},
        title = { { Spatiotemporal Representation Learning for Short and Long Medical Image Time Series } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a contrastive learning-based method to adjust deep temporal representations for time series data. Two options were examined: one employing clip-level contrastive learning and the other involving masked tokens in the embeddings. The method was applied to a magnetic resonance (MR) video dataset and an optical coherence tomography (OCT) image time series dataset, demonstrating the effectiveness of incorporating these spatiotemporal considerations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed strategy to encode the temporal information of the data appears to be innovative. Furthermore, when employing the CL approach, the resulting method demonstrates consistency and exhibits outstanding performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the analyzed approaches are primarily designed for modeling temporal data, most of the results focus on performance on OCT data. This is pertinent, but one would expect more discussion of the MR dataset given its inherently long temporal nature; instead, the discussion section says little about these results. Moreover, although the contrastive approach appears superior to other methods, the results show that the proposed TVRL method is not superior to the baseline CL.

    The manuscript is divided into numerous sections, which makes the discussion and methodology hard to follow. In particular, the results section has many subsections.

    The results were calculated using only two random seeds, which appears to be a limitation from a statistical standpoint.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The work has some areas for improvement. With respect to the structure of the manuscript, the third point in the introduction should be reevaluated, as it does not qualify as a contribution. Additionally, the organization and readability of the paper could be improved, and there are minor typos. Furthermore, the experimentation section could benefit from additional details, for example regarding the type of augmentations used for the CL task. Lastly, a more thorough assessment of the results from Table 2 is necessary.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The organization of the paper and the evaluation of its initial stated contributions are not totally adequate.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a simple clip-level contrastive learning strategy that leverages time embeddings in irregular and variable length time series, and a new temporally-variant approach that explicitly models frame-level variation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Previous spatiotemporal learning methods failed to track long term developments and disease trajectories in longitudinal series. To address this, this paper proposes a simple clip-level contrastive learning strategy that leverages time embeddings in irregular and variable length time series, and a new temporally-variant approach that explicitly models frame-level variation. (2) Comparative experiments on spatiotemporal feature learning have been relatively sufficient.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) No visual results plots are available to illustrate the modeling and inference concerning historical change trajectories. (2) In figure 2, the illustration about the frame-level predictive approach is not clear. (3) The motivation for choosing a masking ratio of 0.15 is unclear. Are there any ablation experiments to support this choice? (4) The detail about the temporal transformer used in Spatiotemporal encoder is not clear. (5) The absence of an overview diagram impedes the understanding of the system’s operation from a high perspective. (6) In Table 1, the contrast between SimCLR and SimCLR+TE is confusing. Can you provide more analysis on why SimCLR+TE is sometimes lower than SimCLR?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No code is provided to ensure reproducibility

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. A brief summary of the paper: The paper proposes a simple clip-level contrastive learning strategy that leverages time embeddings in irregular and variable length time series, and a new temporally-variant approach that explicitly models frame-level variation.
    2. The assessment of the reviewer about the major strengths of the paper. (1) Although some ablation experiments are lacking, comparative experiments are sufficient. (2) Previous spatiotemporal learning methods failed to track long term developments and disease trajectories in longitudinal series. This paper solves a relatively novel problem.
    3. The assessment of the reviewer about the major weaknesses of the paper. (1) No visual results plots are available to illustrate the modeling and inference concerning historical change trajectories. (2) In figure 2, the illustration about the frame-level predictive approach is not clear. (3) The motivation for choosing a masking ratio of 0.15 is unclear. Are there any ablation experiments to support this choice? (4) The detail about the temporal transformer used in Spatiotemporal encoder is not clear. (5) The absence of an overview diagram impedes the understanding of the system’s operation from a high perspective. (6) In Table 1, the contrast between SimCLR and SimCLR+TE is confusing. Can you provide more analysis on why SimCLR+TE is sometimes lower than SimCLR? (7) Lack of presentation of intermediate results, such as the reconstructed masked tokens.
    4. The assessment of the reviewer about the clarity of presentation, paper organization and other stylistic aspects of the paper. (1) The paper is easy to read. (2) The illustration of the proposed method is not clear enough, and there is a lack of visualization of the experimental results.
    5. Comment on the reproducibility of the paper. No code is provided to ensure reproducibility.
    6. Detailed constructive comments. (1) Improvements can be made corresponding to the major weaknesses of the paper mentioned previously. (2) In addition to aiming to expand the number and diversity of tasks, future work can also consider further innovation in spatiotemporal feature learning and verification of real-time performance.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    (1) The original contrastive learning strategy was based on images, and the authors extended it to video clips. This novelty is relatively limited. (2) The contrast between SimCLR and SimCLR+TE in Table 1 is confusing. (3) In Table 2, the proposed method shows little improvement over the baseline.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel method that combines the efficacy of clip-level contrastive learning with a frame-level latent feature prediction task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposes a novel method that combines the efficacy of clip-level contrastive learning with a frame-level latent feature prediction task.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experiments are insufficient. The ViT encoder may not be efficient for time series features.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The methodology sounds good, but details on components such as the ViT and the finetuning protocol are insufficient.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes a novel method that combines the efficacy of clip-level contrastive learning with a frame-level latent feature prediction task.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    This paper proposes a novel method that combines the efficacy of clip-level contrastive learning with a frame-level latent feature prediction task.



Review #4

  • Please describe the contribution of the paper

    The paper “Spatiotemporal representation learning for short and long medical image time series” proposes novel methods in representation learning that accommodate key characteristics of longitudinal medical imaging data.

    Prior work in spatiotemporal contrastive representation learning usually aligns different sub-sequences (also called clips, i.e., series of images) of the same sequence. This encourages the representation to focus on features that remain static between time points. In medical machine learning, however, prognosis of various diseases often depends on the evolution of images over time.

    The main contribution of the authors is the formulation of a contrastive learning approach that preserves and emphasizes the temporal evolution in the latent representation. Additionally, the authors propose a time embedding to accommodate for the fact that medical images are often acquired at irregular intervals.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed methods perform better than previous representation learning schemes, when evaluated on downstream prognostic and diagnostic tasks (using linear probing). In particular, the proposed representation learning scheme outperforms other models on data acquired at long irregular intervals.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In Section 3.3, contrastive learning alignment is performed between clips and augmented versions of the same clips. However, the extent and type of augmentations are not described. This information is quite important, given that this key characteristic is where the proposed method departs from previous work.

    In Section 4.1, the authors state that the TVRL model is evaluated both with and without the temporal embedding. From Tables 1 and 2, it seems that this analysis is missing, and it is not completely clear whether the temporal embedding was used for the final model.

    In Section 4.1, it is not clear whether “Parametrized by a ViT-S” indicates that the spatial model was initialized with the pretrained weights of the ViT article or just employed an identical architecture.

    In Section 4.2, it is not completely clear how the clip of 8 images is selected for inference when the series contains more than 8 images.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I would recommend writing the forward model equations explicitly to avoid confusion.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The article proposes a general methodology to deal with time series of images acquired at irregular intervals. This is especially relevant, firstly because such datasets are frequent in the medical domain, and secondly because the temporal evolution may contain significant diagnostic and prognostic information, which would otherwise be lost if each time point were analysed in isolation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my major concerns regarding clarity in methodology.




Author Feedback

We thank the reviewers for their time and constructive reviews. We appreciate that all reviewers see merit in our approaches that model temporal variation in both short- and long-term time series of medical images. In this revision, we address their remaining concerns:

  1. Contributions and novelty (R5, R6) Our work identifies and addresses fundamental limitations in widely adopted video representation learning methods, which were introduced during their original extension from image-based applications (R6). We strongly believe that it is important to study, report, and address these limitations. While straightforward, our clip-level contrastive strategy is beneficial for preserving large temporal changes in spatiotemporal medical data (R5). Current approaches cannot handle such changes, especially in irregularly sampled long-term sequences, which is why we focus on the performance benefits shown in Table 1 (R5). We have updated the Introduction to reflect these contributions more precisely.

  2. Methodological details (R4, R6, R7) We now provide more details on methodology. We would like to clarify that the temporal Transformer shares the same architecture as the ViT described in Section 4.1 (R6), and both are randomly initialized before pretraining (R7). Given a sequence of images, we employ a spatial ViT, E_s, to extract a feature vector per image, zs_i = E_s(img_i), and a temporal Transformer, E_t, to generate a global representation of the sequence, zt = E_t(zs_1…zs_i) (R6, R7). While zt is used for the contrastive loss in Eq. 1, E_t also predicts a subset of the feature tokens that are masked specifically for the frame-level prediction task. We have included these forward equations in Section 3.2 (R7) and added an overview diagram to Figure 2 (R6). Our two-encoder design efficiently operates on spatiotemporal data by reducing attention token length instead of processing all video patches at once (R4).

  3. Inference and experimental details (R5, R7) We have added details regarding inference (R7). In sequences with more than eight images, we adopt the common practice in evaluating video models [13,28] and apply a sliding window with 50% overlap to contiguous clips before averaging these predictions. We also clarify that TVRL does not use time embeddings (TE), and have added these details to Section 4.1. We have repeated our experiments with five random seeds during finetuning and find performance consistent with that currently reported in both tables (R5). We have now added the missing augmentation details to Section 4.1, which are standard SimCLR augmentations (R5, R7).

  4. Results discussion (R4, R5, R6) The reviewers asked for more experimental detail and a more thorough discussion of the results. We present TE as a well-motivated option for incorporating irregular intervals, but hypothesize that the model could overfit on TE during contrastive pretraining (R6), and may benefit from temporal augmentations. Given the inherent difficulty posed by the frame-level prediction task, we found lower mask ratios to be beneficial for TVRL (R6). We also welcome the suggestion of R6 to plot these intermediate reconstruction results using Eq. 2, and have visualized the feature trajectories using their first two PCA components (R6). These findings have been added to Section 4.4. Regarding Table 2, on video data with regular frame intervals, it is expected that TE does not provide any benefit over positional embeddings (R5, R6). Similarly, TVRL performs comparably to contrastive baselines, as cardiac videos contain limited temporal change within a single cardiac cycle (R5). We have clarified these points in results Section 4.4. Overall, experiments on two distinct spatiotemporal datasets including eight diverse tasks demonstrate our success in modeling temporal variations for diagnosis and prognosis (R4). Our approach excels on long, irregularly acquired data (Table 1) and maintains competitive performance on short videos (Table 2).
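The sliding-window inference described in point 3 of the rebuttal (clips of eight images, 50% overlap, averaged predictions) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `sliding_windows` and `predict_sequence`, the `predict_clip` callable, and the handling of short sequences and of trailing frames are assumptions that the rebuttal does not specify.

```python
def sliding_windows(n_frames, clip_len=8, overlap=0.5):
    """Return (start, end) index pairs of contiguous clips over a sequence."""
    # Assumption: sequences no longer than one clip are used whole.
    if n_frames <= clip_len:
        return [(0, n_frames)]
    stride = max(1, int(clip_len * (1 - overlap)))
    starts = list(range(0, n_frames - clip_len + 1, stride))
    # Assumption: add a final window flush with the end of the sequence
    # so that trailing frames are not dropped.
    if starts[-1] != n_frames - clip_len:
        starts.append(n_frames - clip_len)
    return [(s, s + clip_len) for s in starts]


def predict_sequence(frames, predict_clip, clip_len=8, overlap=0.5):
    """Average the clip-level predictions over all sliding windows.

    `predict_clip` is a hypothetical callable that maps one clip (a list of
    frames) to a single numeric prediction.
    """
    windows = sliding_windows(len(frames), clip_len, overlap)
    preds = [predict_clip(frames[a:b]) for a, b in windows]
    return sum(preds) / len(preds)


# Toy usage: "frames" are integers and the "model" returns the clip mean.
frames = list(range(16))
score = predict_sequence(frames, lambda clip: sum(clip) / len(clip))
```

With 16 frames, this yields the windows (0, 8), (4, 12), and (8, 16), so every frame except those at the two ends contributes to two clip-level predictions.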




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    There are still some major concerns which are not well addressed, such as limited novelty, and marginal and unclear experimental comparison. Therefore, I am leaning to rejection for this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I believe the rebuttal has addressed most of the points raised by all the reviewers (R5 and R6 in particular). However, there are still several points that I would like the authors to bear in mind for the final revision of the paper. A thorough assessment of Table 2 was not provided; looking at the results, I believe the contribution of the time embedding (TE) on standard SimCLR versus the primary loss of the proposed method was not discussed properly (as commented by R5). Additionally, try to distinguish between SimCLR, SimCLR+TE, and TVRL when discussing Tables 1 (R6) and 2. The structure of the paper can be improved (R5), and please address all the minor points raised by R6.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After reading the paper, reviews, meta-reviews and rebuttal, as well as the overall rankings, I believe this paper should be accepted as it demonstrates strong points with regards to the originality of the method for time-series data.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



