Abstract

We present a knowledge augmentation strategy for assessing diagnostic groups and gait impairment from monocular gait videos. Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos through collective learning across three distinct modalities: gait videos, class-specific descriptions, and numerical gait parameters. Our specific contributions are two-fold: First, we adopt a knowledge-aware prompt tuning strategy that uses class-specific medical descriptions to guide the text prompt learning. Second, we integrate the paired gait parameters in the form of numerical texts to enhance the numeracy of the textual representation. Results demonstrate that our model not only significantly outperforms state-of-the-art methods in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions using the vocabulary of quantitative gait parameters. The code and the model will be made available at our project page: https://lisqzqng.github.io/GaitAnalysisVLM/.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2283_paper.pdf

SharedIt Link: https://rdcu.be/dV17D

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72086-4_24

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2283_supp.pdf

Link to the Code Repository

https://lisqzqng.github.io/GaitAnalysisVLM/

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Wan_Enhancing_MICCAI2024,
        author = { Wang, Diwei and Yuan, Kun and Muller, Candice and Blanc, Frédéric and Padoy, Nicolas and Seo, Hyewon},
        title = { { Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {251 -- 261}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper utilizes a Vision-Language Model (VLM) for gait video analysis in the context of neurodegenerative diseases. By employing an integrative training approach that incorporates video, text, and tabular data, the model achieves high performance in the experiments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper achieves high data efficiency through the use of a pretrained VLM, attaining commendable performance in the classification of neurodegenerative diseases with just over a hundred data points.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The most significant concern with the paper involves its organization and clarity. The clinical setting being considered is not clearly defined throughout the document, for example, whether numerical data is available during inference remains unclear. It seems the authors intend to train a high-performance vision model using paired video, text, and numerical data during training, with only the video modality used for classification at inference. Yet, on page 6, the section on Cross-modal contrastive learning suggests that video and numerical data are not fully paired, which further muddies the waters. These critical details of the research framework should be explicitly stated early on, ideally in the Introduction. Despite piecing together this overarching approach, how classification is conducted during inference remains unanswered. Is the process akin to CLIP’s strategy of leveraging the similarity between vision features and category text features? Additionally, despite being listed as a key contribution, the description of the decoder is vague, and the decoder module is notably absent from Fig. 1.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In addition to the foundational settings, the design and description of the methodology require further refinement for conciseness and clarity. For instance, the process of numerical text encoding is currently ambiguous. Based on the authors’ description and Fig. 3, it appears that the text is first segmented and processed through a CLIP encoder, and then concatenated and passed through another CLIP encoder. This approach seems problematic, as a frozen CLIP encoder is typically not used in such a recurrent manner. Is it possible that the authors intended to describe an embedding process, rather than the use of an encoder, in the first step? If the design does indeed involve recurrent usage of a CLIP encoder, then a thorough justification and explanation of the motivation behind this unconventional choice are necessary.

    Furthermore, towards the end of Page 5, the manuscript claims that the proposed numerical features outperform others, yet there is a lack of clear explanation on how the features used for comparison were processed. The manuscript appears to conflate ‘encoding’ and ‘embedding’—two distinct operations. Clarifying and consistently applying these terms would enhance the technical precision of the document.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Reject — must be rejected due to major flaws (1)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper showcases some innovative elements, the manuscript, in its current form, falls short of the clarity and rigor expected for publication. The topic of classifying Neurodegenerative Diseases using RGB-video is indeed engaging. I believe that with meticulous revisions and polishing of the article, there is potential for this paper to be considered for future submission to other top-tier conferences or journals.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a method for gait classification that leverages the large vision-language model, CLIP. The method is enhanced with a knowledge-aware prompt and numerical text, providing a comprehensive approach to gait analysis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is the first attempt to adopt vision-language model to gait classification. The proposed numerical text encoding paradigm is novel.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In the first section, the author should clarify the limitations of existing Vision-Language Models (VLMs) and distinguish between their proposed method and these VLMs.
    2. Regarding the 92 gait videos, are these a part of the author’s indoor dataset? If so, how do they differ from the other 28 videos in terms of variables like age? Could these differences potentially introduce bias into the training process? Additionally, the paper lacks implementation details such as dataset split and parameter specifications.
    3. In Section 2.2, while the authors may not need to delve into the specifics of KAPT and KEPLER, a brief introduction to these concepts would be beneficial for the reader’s understanding.
    4. The novelty of the proposed method is somewhat unclear, as it appears to be a combination of several different modules. The author should highlight the unique aspects of their approach.
    5. Table 1 (a) indicates that the proposed numerical text embedding performs worse than the baseline when used independently. Could the authors provide an analysis or explanation for this outcome?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper could benefit from a more professional presentation, with a more detailed description of the method and the addition of an implementation section to provide greater clarity and depth.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I have concerns regarding the novelty of this approach, as similar strategies have been employed in contrastive learning.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a knowledge augmentation strategy for gait analysis in neurodegenerative diseases, leveraging a large-scale pre-trained Vision Language Model (VLM) to learn visual, textual, and numerical representations of patient gait videos. The proposed model achieves state-of-the-art performance in video-based gait classification tasks by incorporating class-specific descriptions and paired gait parameters into the VLM through knowledge-aware prompt tuning and numerical text embeddings. Additionally, the learned class-specific text features can be decoded into natural language descriptions using quantitative gait parameters.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Novel formulation: The paper proposes a new knowledge augmentation strategy for gait analysis, which incorporates class-specific medical descriptions and paired gait parameters into the VLM through knowledge-aware prompt tuning and numerical text embeddings. This approach demonstrates a novel way to leverage textual and numerical information to enhance visual representations in the context of gait analysis.
    2) Original way to use data: The paper presents an original method to encode and align numerical gait parameters with the text representation using a dedicated embedding base. This approach enables the model to learn and decode quantitative features from textual data, representing an original way to incorporate numerical data into text-based models.
    3) Demonstration of clinical feasibility: The paper demonstrates the clinical feasibility of the proposed method by achieving state-of-the-art performance on gait scoring and dementia subtyping tasks using video-based gait analysis. This shows that the method is effective in the context of real clinical applications.
    4) Novel application: The paper represents the first attempt to deploy a large-scale pre-trained Vision Language Model for the analysis of pathological gait videos, showcasing a novel application of VLMs in the context of medical video analysis.
    5) Strong evaluation: The paper conducts comprehensive evaluations on two gait classification tasks, demonstrating significant performance gains over several strong state-of-the-art baselines. The ablation studies also provide insights into the contributions of each component of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Limited dataset: The paper relies on a relatively small dataset of 120 gait videos, which may limit the generalizability of the proposed method. The performance gains achieved may not be as significant on larger datasets. 2) Lack of comparison with VLM baselines: While the paper compares with state-of-the-art gait analysis methods, it does not directly compare with baseline VLMs or other VLM fine-tuning approaches. This limits the evaluation of the proposed knowledge augmentation strategy.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper’s reliance on a small dataset of 120 gait videos may hinder the generalizability of the proposed method, and the lack of comparison with VLM baselines limits the evaluation of the knowledge augmentation strategy.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic and methods of this work are both very novel.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Accept — must be accepted due to excellence (6)

  • [Post rebuttal] Please justify your decision

    The authors have addressed all my concerns.




Author Feedback

We thank the reviewers for their valuable comments. We are happy they appreciated our work: the paper is well written (Rev#3 & Rev#4), novel (Rev#3 & Rev#4), and original (Rev#4), showcases some innovative elements (Rev#1), and could be considered for future submission to top-tier conferences or journals (Rev#1). We organize our rebuttal around the following main points:

Clarity of description

  • Clinical setting (Rev#1) We acknowledge your point and will revise the content on page 6: We use gait parameters to boost the numeracy of category text features via contrastive learning. Of these, 117 gait parameter sets had diagnosis labels, and 75 were linked to videos and thus had gait score labels. Only the video modality is required during inference.
  • How classification is conducted (Rev#1) Yes, the process is akin to CLIP’s strategy of leveraging the cosine similarity between vision features and category text features, classifying the videos into category labels (a minimal sketch of this similarity-based inference is given after this list).
  • The decoder (Rev#1) The decoder is not part of our vision-language training; rather, it serves as a validation tool to investigate whether the cross-modal alignment is formed through training. We trained a 4-layer transformer decoder to reconstruct numerical text features into sentences (Sec3.2). Sec3.3 shows that the learned category text features can be decoded into a sentence with gait parameters (an illustrative decoder sketch also follows this list).
  • ‘encoding’ vs ‘embedding’ (Rev#1) We understand that these two operations are distinct, but, as in the VLM, our ‘encoder’ performs the embedding. To prevent potential confusion, we can rename Fig.3 to “Numerical text embedding pipeline deploying the frozen CLIP text encoder”.
  • How the features were processed (Fig.1 in Supp. Mat.) (Rev#1) We compare different ways of generating token embeddings for integer values [value]. Our method (a) normalizes [value] and multiplies it with the vector [NUM] (Sec2.3); (b) utilizes positional embedding; (c) and (d) represent [value] using digits (“0”, “1”) and number words (“zero”, “one”) respectively, then generate CLIP token embeddings.
  • Novelty is somewhat unclear (Rev#3) We pioneer the clinical adaptation of pretrained VLMs from action recognition tasks to pathological gait classification. Beyond using prompt tuning with external clinical descriptions and the KAPT text model, we innovatively align the gait parameter modality with the video and text modalities, which enables a stronger visual representation and an interpretable textual representation (Sec3.3).
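
The following is a minimal, hypothetical sketch (not the authors' released code) of the CLIP-style inference described above: a tuned video encoder produces a clip-level feature, which is compared to the learned category text features by cosine similarity. The names video_encoder and class_text_features are placeholders for the adapted Vita-CLIP visual branch and the learned per-class text features; only the video modality enters this computation, consistent with the clarified clinical setting.

    import torch
    import torch.nn.functional as F

    def classify_gait_video(video_frames: torch.Tensor,
                            video_encoder: torch.nn.Module,
                            class_text_features: torch.Tensor,
                            temperature: float = 0.01) -> torch.Tensor:
        """video_frames: (T, 3, H, W); class_text_features: (num_classes, D)."""
        with torch.no_grad():
            video_feat = video_encoder(video_frames.unsqueeze(0))  # (1, D) clip-level feature
        video_feat = F.normalize(video_feat, dim=-1)               # unit-norm video feature
        text_feat = F.normalize(class_text_features, dim=-1)       # unit-norm class text features
        logits = video_feat @ text_feat.t() / temperature          # cosine similarities, scaled
        return logits.softmax(dim=-1)                              # per-class probabilities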
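
The validation decoder mentioned above could, under our reading, look roughly like the sketch below: a small 4-layer transformer decoder that takes a learned text feature as memory and reconstructs a token sequence describing gait parameters. The class name NumericTextDecoder, the dimensions, and the vocabulary size are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class NumericTextDecoder(nn.Module):
        def __init__(self, feat_dim=512, vocab_size=10000, num_layers=4, nhead=8, max_len=64):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, feat_dim)
            self.pos_emb = nn.Parameter(torch.zeros(max_len, feat_dim))
            layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
            self.lm_head = nn.Linear(feat_dim, vocab_size)

        def forward(self, text_feature: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
            # text_feature: (B, D) learned class or numerical text feature, used as memory
            # target_tokens: (B, L) token ids of the gait-parameter sentence (teacher forcing)
            L = target_tokens.shape[1]
            tgt = self.token_emb(target_tokens) + self.pos_emb[:L]
            memory = text_feature.unsqueeze(1)                                   # (B, 1, D)
            causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
            return self.lm_head(hidden)                                          # (B, L, vocab_size)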

Concatenated usage of the frozen CLIP text encoder (Rev#1). We aim for embeddings of gait parameters that preserve numerical continuity. To achieve this, the frozen CLIP text encoder is used twice. First, it encodes the token embeddings of the gait parameter description into F_gp, so that clustering based on textual similarity can be avoided in the final F^num. The token embedding [IS] is treated separately to help preserve the relational information between F_gp and the number embedding. Second, it generates a feature vector from F_gp, [IS], and the number embedding, ensuring that F^num aligns with the pretrained CLIP text-visual embedding space. We propose to redraw Fig.3 to show the tokenization of the raw text, to better present the idea. A rough sketch of this two-pass pipeline is given below.
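
The sketch below illustrates one reading of this two-pass use of the frozen text encoder, combined with option (a) above (normalized value multiplied with the [NUM] vector). The callable clip_text_encode_embeddings stands for a frozen CLIP text transformer applied directly to token embeddings; all names, shapes, and the value normalization are assumptions made for illustration, not the exact implementation.

    import torch

    def embed_gait_parameter(param_description_tokens: torch.Tensor,  # (L, D) token embeddings
                             value: float,
                             value_range: tuple,
                             num_vector: torch.Tensor,                # (D,) learned [NUM] base vector
                             is_embedding: torch.Tensor,              # (D,) token embedding of "is"
                             clip_text_encode_embeddings) -> torch.Tensor:
        # First pass: encode the textual description of the gait parameter into F_gp,
        # so that the final feature does not cluster purely by textual similarity.
        f_gp = clip_text_encode_embeddings(param_description_tokens.unsqueeze(0))  # (1, D)

        # Numerical continuity: normalize the raw value and scale the [NUM] vector by it,
        # instead of tokenizing digits or number words.
        lo, hi = value_range
        norm_value = (value - lo) / (hi - lo)
        number_embedding = norm_value * num_vector                                  # (D,)

        # Second pass: re-encode the sequence <F_gp> [IS] <number> so that the resulting
        # feature F^num lies in the pretrained CLIP text-visual embedding space.
        sequence = torch.stack([f_gp.squeeze(0), is_embedding, number_embedding])   # (3, D)
        f_num = clip_text_encode_embeddings(sequence.unsqueeze(0))                  # (1, D)
        return f_num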

Experiments

  • Numerical text embedding alone performs worse than baseline (Rev#3) The numerical text embedding (NTE) quantifies the criteria within the category description. Therefore, without KAPT, linking the gait parameter information to the class becomes challenging, especially in gait scoring, where the class label is directly derived from the per-class description.
  • Lack of comparison with VLM baselines (Rev#4) While we think our experiments in Table 1 based on Vita-CLIP were appropriate for this submission, we can provide comparisons using other VLMs.

Generalizability of the method (Rev#4) Our work demonstrates how to enhance representation learning efficiently, even with a small dataset. It offers a new alternative to incorporate patient metadata, which often comes in the form of tabular data.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes innovative approaches for an interesting application. The authors are encouraged to address the reviewer’s comments in their final submission.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although the underlying techniques are not new by themselves, reviewers agree that the design and application of them to gait analysis are novel. The major concerns were mainly on the clarity of writing, which were largely addressed in the rebuttal and mostly addressable in the final manuscript.



