Abstract

Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLMs) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at https://github.com/lhqqq/TG-MILNet

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3185_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/lhqqq/TG-MILNet

Link to the Dataset(s)

https://github.com/ShiqiYu/OpenGait

BibTex

@InProceedings{LiHai_TextGuided_MICCAI2025,
        author = { Li, Haiqing and Guo, Yuzhi and Jiang, Feng and Dang, Thao M. and Ma, Hehuan and Zhou, Qifeng and Gao, Jean and Huang, Junzhou},
        title = { { Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        page = {661 -- 671}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work, the authors proposed TG-MILNet, a framework for scoliosis detection based on text guidance and multi-instance learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The task is interesting. The approach follows the trends in computer vision. Better results than SOTA.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Only one dataset to evaluate the performance of the model. Generalisation is not evaluated. The performance of the DTW-based clustering is not evaluated. How can this be affecting the final performance of the method? What are the steps of this stage of your framework? What are example outputs? Is it reliable? I would expect some visual examples of correct vs wrongly classified samples. Also, what does the text look like? Is it consistent among samples? Does it vary considerably?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I miss details about the methodology. If the code is not made publicly available, this project cannot be reproduced. I miss visual examples and an analysis of the impact of the different steps of the proposed framework.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I am satisfied with the rebuttal by the authors.



Review #2

  • Please describe the contribution of the paper

    The paper proposes TG-MILNet, a novel pipeline to analyse gait videos integrating multi-instance learning with expert text guidance for non-invasive scoliosis screening. The main contributions are the use of Dynamic Time Warping (DTW) clustering to treat temporal misalignment, an Inter-Bag Temporal Attention (IBTA) module to focus on relevant gait phases, and a Boundary-Aware Model (BAM) to enhance borderline-case detection. The paper offers a solution to improve gait deviations detection integrating relevant visual and semantic cues.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Combination of visual features and textual guidance: this approach enables a better focus on slight gait deviations and this is shown in the results where the model is better at classifying neutral cases compared to baselines. 2) Novel Architectural Design: the Inter-Bag Temporal Attention proved successful at extracting relevant information from a sequence of gait phases and the Boundary-Aware Model, by focusing on edge cases, effectively reduces false positives. 3) Robustness to class imbalance: the model proves performant under class imbalance from low to severe. In real-case scenarios, this is an interesting validation as we know negative cases are generally overrepresented. 4) Interpretability: the combination of textual guidance and semantic cues enhances the interpretability of the model’s predictions.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Ablation studies: although ablation experiments are provided for the influence of BAM and textual guidance, further exploration isolating the influence of each module on performance could provide more insights into model understanding. The paper would benefit from a more thorough exploration of text guidance. 2) Computational requirements: for the goal of model deployment, we need to consider memory requirements, training and inference times. A detailed analysis of the computational requirements is missing. 3) Lack of qualitative visualisations: qualitative visual comparisons and the effect of text guidance are missing. The authors could add visual attention maps to understand the model’s decision-making. Qualitative comparison of the effect of text guidance would enhance the validity of the paper.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    To tackle the problem of non-invasive scoliosis screening, TG-MILNet includes a multi-instance approach that effectively combines temporal analysis, attention mechanisms, and domain-specific knowledge to improve gait deviation detection. It proved performant at detecting neutral cases where past work would typically fail. The novel integration of text and visual features, as well as a robust validation are the main strengths of the paper. However, a better understanding of the effect of each component of the model would enhance the interpretability. Furthermore, a detailed efficiency analysis is needed for potential application in clinics as an early-stage scoliosis detection tool.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes TG-MILNet, a multimodal and multi-instance learning framework for non-invasive scoliosis screening using gait video analysis. The main contributions are as follows:

    1. Introduction of a dynamic time warping (DTW)-based clustering method to address temporal misalignment in gait video sequences.

    2. Implementation of a boundary-aware model (BAM) and inter-bag temporal attention (IBTA) mechanism to improve the detection of borderline cases.

    3. Integration of expert knowledge and GPT-4o-generated textual insights into the model training process.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Innovative methodological framework: The paper introduces multiple novel methods, notably the DTW-based clustering and IBTA mechanism, effectively addressing the temporal misalignment inherent in gait data.

    1. The attempt to improve sensitivity to borderline cases and address class imbalance is meaningful.

    2. The proposed method is evaluated on the Scoliosis1K dataset with ablation studies to assess the contribution of each component.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. Limited effectiveness of GPT-4o-based textual guidance: The claimed contribution of GPT-4o-generated textual insights to model performance improvement is minimal. According to the ablation results, removing textual guidance leads to only marginal changes in specificity (from 86.4% to 85.1%) and F1-score (from 90.2% to 89.4%), suggesting that the use of textual guidance contributes less than 1% improvement. This questions the necessity and effectiveness of incorporating such guidance.

    2. Redundancy between expert knowledge and GPT-generated text: Although expert review appears to have been conducted, the generated textual insights largely repeat well-established medical knowledge (e.g., stride length differences, torso sway) and do not provide novel information. The justification for relying on GPT-4o for this purpose remains weak, especially given the limited performance impact.

    3. Limited dataset validation: The model evaluation is restricted to the Scoliosis1K dataset. The generalizability of the approach to other independent datasets has not been demonstrated.

    4. Specificity trade-off issue: While the model focuses on improving sensitivity for borderline cases, this results in a trade-off with lower specificity. Traditional screening methods such as the Adams Test and Scoliometer maintain higher specificity by applying conservative diagnostic thresholds. The increased false positive rate of the proposed model could lead to overdiagnosis and unnecessary follow-up examinations in clinical practice. This important issue is not sufficiently discussed in the paper.

    5. Lack of comprehensive performance metrics: The evaluation is limited to accuracy, sensitivity, and specificity. Other relevant metrics such as ROC curves and AUC scores are not provided, making it difficult to fully assess the model’s discriminative ability.
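
The ablation figures quoted in item 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the four percentages cited in the review, and the helper name `deltas` is hypothetical. It reports each change both in absolute percentage points and relative to the no-text baseline, since "less than 1%" can be read either way.

```python
def deltas(with_text, without_text):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = with_text - without_text
    relative = 100.0 * absolute / without_text
    return round(absolute, 2), round(relative, 2)


# Figures quoted in the review (with vs. without textual guidance), in percent.
specificity_gain = deltas(86.4, 85.1)
f1_gain = deltas(90.2, 89.4)

print("specificity (points, relative %):", specificity_gain)
print("F1          (points, relative %):", f1_gain)
```

Under this reading, the F1 change is below one point while the specificity change is slightly above one point, so the two metrics are best reported separately.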

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the methodological approach is interesting, the limited contribution of GPT-4o-based textual guidance and the insufficient discussion on the specificity trade-off significantly weaken the paper’s impact. Considering these points, I recommend a weak accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their thoughtful comments and for recognizing the strengths of our work, including interpretability, novelty, handling of class imbalance, and clinical relevance.

[R1, R4] Data and Evaluation. We agree that broader dataset validation is important. At present, Scoliosis1K remains the only public dataset for video-based scoliosis screening, particularly for challenging borderline cases [Ref.10]. To assess generalizability, we have previously evaluated our approach on static datasets (Universe-Roboflow & Mendeley; X-ray/RGB). Notably, adding textual guidance (without video-specific modules) improved F1 from 98.3% to 99.5% and specificity from 95.2% to 99.8%. This suggests improved sensitivity to subtle spinal cues. We plan to construct a multi-modal benchmark to support future validation.

[R1, R2] DTW Clustering: Evaluation and Clarity. We previously conducted ablation experiments and briefly mentioned the choice of K=4 in Sec.3.2. To evaluate reliability, we manually sampled and labeled gait frames to construct reference clusterings (K=2~5). We then evaluated clustering quality and found that K=4 yielded the best alignment (NMI=0.81, ARI=0.71), corresponding to forward/backward walking, turning, and a residual phase capturing ambiguous frames. This setup yielded a 10% relative gain over other methods (e.g., K-means, fixed-interval). Fig. 2 (b1~bK) shows visualizations. For DTW-based clustering, we extract frame-wise optical flow features and compute DTW distances across frames. Hierarchical clustering is then applied to segment the sequence into K gait phases, which serve as MIL bags. We will revise the text for clarity.
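
The DTW-plus-hierarchical-clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are toy 1-D values rather than optical flow, DTW is computed between short windows of frames (the rebuttal only says "across frames"), average linkage is an arbitrary choice, and all function names are hypothetical.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frame-level feature vectors.
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def agglomerative(dist, k):
    """Average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the closest pair of clusters (p < q, so deleting q is safe).
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def segment_into_bags(windows, k):
    """Group windows into k bags using pairwise DTW distances."""
    n = len(windows)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = dtw_distance(windows[i], windows[j])
    return [sorted(c) for c in agglomerative(dist, k)]


# Toy windows: two motion patterns, each appearing once at normal speed
# and once time-stretched (hypothetical values, not real gait features).
windows = [
    [(0.0,), (0.1,), (0.2,)],
    [(0.0,), (0.0,), (0.1,), (0.2,)],
    [(5.0,), (5.1,), (5.2,)],
    [(5.0,), (5.1,), (5.1,), (5.2,)],
]
bags = segment_into_bags(windows, k=2)
print(bags)
```

In this toy case the time-stretched copies have a DTW distance of zero to their originals, so they land in the same bag despite having different lengths, which is the misalignment tolerance that motivates DTW over a fixed-interval split.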

[R1, R2, R4] Effectiveness of Textual Guidance. We are pleased this component drew interest. This is our first attempt to integrate clinical guidelines (text) into scoliosis screening, and we are encouraged by its effectiveness. Existing scoliosis guidelines (NEJM’08 & AAOS’15) are based on static imaging. We input video (frame-level) to GPT-4o to generate dynamic textual guidance that supplements static modalities (e.g., CoM deviation, postural drift), rather than simply repeating existing knowledge. The text is fixed: the 3-class text emphasizes abnormalities, while the binary text highlights borderline differences (see Sec.2.3; text example in Fig.2). While the overall improvement may appear modest, we emphasize that the guidance improves the F1 on borderline cases by 5.3% (1:1:8) and up to 10% (1:1:16). We will revise the text for clarity. Further research and expert-driven video-based guideline construction will be included in the revision.
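
To illustrate why a per-class F1 on borderline cases can move much more than an overall metric under class imbalance, here is a small sketch. The labels and predictions are invented for illustration, the class names and the reading of 1:1:8 as positive:neutral:negative are assumptions, and `per_class_f1` is a hypothetical helper.

```python
def per_class_f1(y_true, y_pred, labels):
    """Compute one-vs-rest F1 for each class from two label lists."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical 1:1:8 split: 2 positive, 2 neutral (borderline), 16 negative.
y_true = ["positive"] * 2 + ["neutral"] * 2 + ["negative"] * 16
# One neutral case is missed and one negative is flagged as neutral.
y_pred = ["positive", "positive", "neutral", "negative"] + ["negative"] * 15 + ["neutral"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_f1(y_true, y_pred, ["positive", "neutral", "negative"])
print(accuracy, scores)
```

With only two errors, overall accuracy stays at 0.9 while the neutral-class F1 drops to 0.5, so a change in how borderline cases are handled is far more visible in the per-class score than in an aggregate one.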

[R1, R2] Visualisation. We have conducted such analyses, but did not include them in the paper due to space limitations. We would like to share our insights here: we observed that shoulder asymmetry during turning often leads to misclassification (solution: a consensus-based strategy). Moreover, textual guidance expands the model’s focus from limb movement to clinically relevant torso cues. We will clarify this and include additional visualizations in the revision.

[R2] Computational requirements. Our backbone (TPAMI’23) and Perceiver IO module (ICLR’22) are designed to be lightweight. The text encoder is frozen, and DTW clustering is performed offline. Training completes in approx. 1.5 hours on a single A6000 GPU (~20 GB memory usage), and inference runs at over 1,000 FPS. We will revise the text to clarify this.

[R4] Specificity and metrics. Thank you for highlighting this clinical concern. We acknowledge the specificity trade-off, and our method is designed for efficient early screening (not diagnosis) to flag cases needing further X-ray examination. This is especially valuable in low-resource settings, enabling timely prevention and reducing disease progression risk. We agree that ROC and AUC would enhance evaluation and will include them in the revision. We focus on sensitivity/specificity as they better align with clinical standards.
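
The relationship between the threshold-dependent metrics (sensitivity, specificity) and the threshold-free AUC can be sketched as below. The scores and labels are made up for illustration, the function names are hypothetical, and both classes are assumed to be present so that no denominator is zero.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(scores, threshold):
    """Binarize scores: 1 (flag for follow-up) when score >= threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical screening scores: 3 scoliosis cases (1) and 5 healthy cases (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

strict = sensitivity_specificity(y_true, predict(scores, 0.5))
lenient = sensitivity_specificity(y_true, predict(scores, 0.3))
auc = roc_auc(y_true, scores)
print(strict, lenient, auc)
```

Lowering the threshold from 0.5 to 0.3 raises sensitivity at the cost of specificity, which is the screening trade-off discussed above, while the AUC summarizes ranking quality across all thresholds and does not change.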

[R1] Code availability. Code will be released publicly.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have addressed the concerns raised by all three reviewers, each of whom now leans toward acceptance. It is recommended that the authors further refine the current version in accordance with the reviewers’ suggestions.


