Abstract

We propose an autism spectrum disorder (ASD) screening framework that integrates an expert vision-language model (VLM), CARE-VL, with a large language model (LLM)-based aggregation module to assess children’s social interactions and derive subject-level ASD/typical development (TD) classifications.
Our framework processes video data collected using social interaction-inducing content (SIIC), in which medical experts annotated predefined query-response (Q-R) intervals based on key social indicators (such as response to name, eye contact, imitation behavior, social smiling, and pointing) by marking correct responses and assigning subject-level ASD/TD classifications. To adapt the general-purpose VLM to the ASD screening domain, we constructed a synthetic instruction-tuning dataset by applying a label-guided reasoning method to these clinical tags, fine-tuning the model to generate detailed captions and multiple-choice question-answer (MC-QA) pairs that capture children's critical social behaviors.
CARE-VL processes Q-R intervals to produce clip-level MC-QA results and descriptive captions, which are then aggregated by an LLM to derive the final ASD/TD classification and its clinical reasoning.
Our end-to-end framework combines visual understanding and linguistic reasoning, achieving 84.6% accuracy for clip-level response prediction and 75.8% accuracy for subject-level ASD/TD classification. These results demonstrate the potential of our framework as a practical and interpretable tool for early ASD screening and behavioral assessment. The code is publicly available at https://github.com/etri/AI4ASD.
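To make the two-stage flow concrete, here is a minimal sketch of the pipeline described above. Every name in it (ClipResult, run_carevl, query_llm, aggregate_with_llm) and the prompt wording are hypothetical stand-ins for illustration only; they are not the API of the released AI4ASD code.

```python
# Hypothetical sketch of the two-stage CARE-VL pipeline: stage 1 produces
# clip-level MC-QA answers and captions, stage 2 aggregates them with an LLM.
from dataclasses import dataclass


@dataclass
class ClipResult:
    indicator: str  # e.g., "response to name", "eye contact", "pointing"
    answer: str     # clip-level MC-QA outcome, e.g., "correct response"
    caption: str    # descriptive caption of the child's behavior


def run_carevl(clip_path: str, indicator: str) -> ClipResult:
    """Stage 1 (stand-in): the fine-tuned VLM answers an MC-QA item and
    produces a descriptive caption for one Q-R interval."""
    raise NotImplementedError("replace with the actual CARE-VL inference call")


def query_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API call."""
    raise NotImplementedError("replace with the actual LLM call")


def aggregate_with_llm(results: list[ClipResult]) -> str:
    """Stage 2 (stand-in): merge clip-level outputs into one prompt and ask
    an LLM for a subject-level ASD/TD decision with clinical reasoning."""
    lines = [f"- {r.indicator}: {r.answer}. {r.caption}" for r in results]
    prompt = (
        "Clip-level observations from a social-interaction screening session:\n"
        + "\n".join(lines)
        + "\nBased on these indicators, classify the child as ASD or TD "
        "and explain the clinical reasoning."
    )
    return query_llm(prompt)
```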

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0486_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0486_supp.zip

Link to the Code Repository

https://github.com/etri/AI4ASD

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YooChe_CAREVL_MICCAI2025,
        author = { Yoo, Cheol-Hwan and Yoo, Jang-Hee and Jang, Jaeyoon},
        title = { { CARE-VL: A Domain-Specialized Vision-Language Model for Early ASD Screening } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {56--65}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an autism spectrum disorder (ASD) screening framework with a fine-tuned vision-language model (VLM), CARE-VL, and a large language model (LLM)-based aggregation module. The proposed CARE-VL manages to outperform baseline VLM methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of bringing recently advanced techniques like VLMs and LLMs to ASD screening is novel.
    2. The paper is well organized, and figures and the supplementary video are very helpful in terms of explaining the idea behind this work.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Experiments are only conducted on non-public datasets. To further validate the effectiveness of the proposed method, major experiments should also be conducted on at least one more public dataset (if available).
    2. The paper claims that the proposed method is interpretable, but this is not validated in experiments.
    3. The experiments of CARE-VL only compare with general-purpose baseline methods, i.e., VLMs without fine-tuning. As far as I’m concerned, an effective comparison here should be between the proposed CARE-VL and other fine-tuned VLMs (probably with other fine-tuning methods).
    4. The LLM aggregation part of the proposed framework does not have any baseline method to compare with.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of utilizing VLMs and LLMs is very interesting, and the paper is very well organized as well. However, the weaknesses listed above significantly reduce my confidence in the contribution of this paper. If my concerns in the weakness part can be addressed properly during the rebuttal, I would be happy to change my mind.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal basically managed to resolve my concerns.



Review #2

  • Please describe the contribution of the paper

    The paper presents a structured framework leveraging a Vision-Language Model (VLM) and a Large Language Model (LLM) for early autism spectrum disorder (ASD) screening. This study constructs a synthetic instruction-tuning dataset to fine-tune a VLM, CARE-VL.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The proposed ASD early screening framework has a two-stage structure. In the first stage, a VLM is fine-tuned to provide detailed captions of video clips and to answer questions (MC-QA) for each social indicator. In the second stage, an LLM is used for subject-level classification reasoning based on the output captions and MC-QAs.

    2) This study is an exploration of the video understanding capability of VLMs in the ASD screening field. It also evaluates the reasoning ability of an LLM for ASD/TD classification given detailed textual descriptions and MC-QAs for each indicator.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The computational demand of this method is huge. The authors did not discuss its clinical translation potential.

    2) From the motivation perspective, is this work the first method using a VLM and an LLM for ASD screening? If so, previous work (manual screening, traditional deep learning methods, etc.) should be described in the paper, along with the baseline accuracy. If not, a comparison with ASD screening domain VLMs/LLMs is missing. The authors should illustrate the motivation for using a VLM/LLM.

    3) After the first stage of visual understanding, based on the output MC-QAs, why not simply use a traditional machine learning classifier, or train a deep neural network as the classifier? That would make the pipeline end-to-end more easily. Since the ASD/TD accuracy is 75%, which is not very high, this is not sufficient to convince me that an LLM is better than ML methods (which are much cheaper and more explainable).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I suggest redrawing Figs. 1 and 2 for better illustration.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation for using a VLM/LLM is not well clarified. In addition, there is a lack of description of, and comparison with, ASD-specific VLMs and traditional DL baselines. I would suggest Weak Reject.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I thank the authors for their kind response. However, my primary concern, the motivation for using a VLM, was not addressed. I understand this paper might be the first VLM method for ASD screening. Given the expensive cost of VLMs, however, I expected to see superior performance compared to traditional ML methods (and the human expert accuracy of 85%), which is not shown in the paper (~75% accuracy and no key comparison). The authors argued that "Prior DL methods rely on a single indicator and act as black-box classifiers," which does not convince me. Why can ML/DL methods not use multiple indicators (I would like to see references)? What would happen if these methods used multiple indicators (for a fair comparison)? And why would using a VLM naturally yield better interpretability than traditional classifiers (I do not believe this is common sense, and no references are given)?

    In summary, I appreciate the authors' exploration of new techniques. I understand that MICCAI's page limit might make it difficult to include sufficient experiments, clarification, and discussion. However, given the paper's content as it is, I would like to suggest Weak Reject.



Review #3

  • Please describe the contribution of the paper

    This paper presents a VLM fine-tuned with domain-specific knowledge for early ASD screening.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Demonstrated the feasibility of a label-guided reasoning method that leverages domain-specific hints with a standard LLM to create a dataset for early ASD screening.

    • The authors have conducted comprehensive experiments to showcase the added value of domain-specific fine-tuning of VLMs, with improved accuracy and better captions.

    • The paper is well-structured and easy to follow.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Lack of innovation in the method side but it’s reasonable as main contribution lies in the application.

    • The LLM aggregation could be investigated further: the video clip-level accuracy already seems quite good and the caption quality is nice, but the final ASD/TD classification accuracy is not that high. There may be substantial room for improvement, since few-shot prompting is already very helpful. A backbone with better reasoning capacity may also be an option, as some privacy implications might be lifted compared to raw video clips.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Fig. 2 might be clearer if the annotation phase and the fine-tuning phase were more distinct. Maybe consider adding a snowflake and a fire icon to distinguish the "VLM" on the left/right.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall this is a nice paper with a clear contribution to the field. It shows largely increased reliability in video clip-level ASD screening with domain-specific knowledge, achievable with reasonable resources.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors provided a convincing response to the questions I raised during the rebuttal. Though the network structure follows an existing open-source solution, the authors explore a low-cost approach for domain-specific fine-tuning with very reasonable resources and conducted detailed experiments to support their claims. I think this could be a good contribution to the field for its well-considered clinical transfer capabilities.




Author Feedback

We thank all reviewers for their thoughtful feedback. The comments raised many key points and have greatly improved the paper.

[R2] Use of public dataset: To the best of our knowledge, there is no public video dataset including DSM-5-based indicators (e.g., pointing, eye contact, social smiling). Therefore, in collaboration with hospitals, we collected our own dataset using SIIC protocols, designed to elicit social responses. We believe this dataset meaningfully contributes to the field. Upon acceptance, we will release anonymized data and code to support reproducibility.

[R2] Interpretability: Interpretability in our work refers to the model's ability to express intermediate reasoning through natural language, which is crucial for clinical trust. We support this with both qualitative and quantitative evaluations. Figure 4 shows clip-level captions that go beyond yes/no responses, offering clinically meaningful descriptions of the child's behavior. Caption quality is also assessed with LLaVA-Critic, where CARE-VL outperforms all baselines (Table 1). At the subject level, Table 2 (B.ASD. The child's…) lists the full textual rationales from the LLM, revealing the intermediate reasoning behind the final ASD/TD decision.

[R2] Comparison with other fine-tuned VLMs: We compared against general-purpose VLMs because CARE-VL is the first domain-specialized VLM fine-tuned for ASD screening via our label-guided reasoning strategy. To validate its effectiveness, we included a controlled baseline using the same backbone (LLaVA-OV-7B) without fine-tuning; CARE-VL clearly outperforms it (Table 1).

[R2&R5] LLM aggregation justification: ASD screening is challenging; clinical studies report that even trained experts reach 85% accuracy. Our 75.8% reflects a realistic benchmark on expert-labeled data. In our framework, the LLM serves as a tool to merge clip-level textual outputs (captions and MC-QA) and provide reasoning for the final decision (Table 2, B.ASD. The child's…). Unlike traditional classifiers (e.g., MLP, SVM), which require manual feature encoding and cannot operate directly on language, LLMs can process these outputs in their native form and articulate clinical justifications. This interpretability is crucial for building clinical trust and usability. To this end, we focused on a controlled comparison between a zero-shot baseline and few-shot prompting (Table 3.b). Exploring alternative aggregation strategies is a promising next step.
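As a rough illustration of the zero-shot versus few-shot comparison mentioned above, the sketch below shows one plausible way a worked exemplar could be prepended to the aggregation prompt. The exemplar text and the function name (build_prompt) are invented for illustration and do not reproduce the prompts actually used in the paper.

```python
# Hypothetical sketch of zero-shot vs. few-shot aggregation prompts;
# the exemplar below is invented, not taken from the paper.
FEW_SHOT_EXEMPLAR = (
    "Example case:\n"
    "- response to name: no response. The child keeps playing with blocks.\n"
    "- social smiling: absent. The child does not smile back at the examiner.\n"
    "Decision: ASD. Reduced social orienting is seen across indicators.\n\n"
)


def build_prompt(clip_summaries: list[str], few_shot: bool = False) -> str:
    """Merge clip-level caption/MC-QA strings into one aggregation prompt;
    with few_shot=True, a worked exemplar precedes the new case."""
    head = FEW_SHOT_EXEMPLAR if few_shot else ""
    body = "\n".join(f"- {s}" for s in clip_summaries)
    return (
        head
        + "New case:\n"
        + body
        + "\nClassify the child as ASD or TD and justify the decision."
    )
```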

[R4] Methodological contribution: Thank you for the positive feedback. Using label-guided reasoning, we turn sparse clinical labels into rich supervision, allowing ASD-specific VLM fine-tuning. We expect this system could serve as a valuable decision-support tool for early ASD diagnosis.

[R4] Potential for stronger LLMs: We agree stronger LLMs could improve subject-level accuracy and plan to explore this. We also aim to refine SIIC to elicit clearer indicator-level responses for sharper ASD/TD distinction.

[R5] Computational cost and clinical applicability: Our system is designed for offline analysis, so real-time inference is not a strict requirement. Since SIIC-based videos are short (<6 min), using large models such as VLMs/LLMs remains feasible. In practice, full analysis per subject completes in under 90 s on a single GPU, supporting clinical applicability.

[R5] Motivation for using VLM/LLM: To the best of our knowledge, this is the first VLM/LLM-based ASD screening method. Prior DL methods rely on a single indicator and act as black-box classifiers. In contrast, our method mirrors clinical protocols by modeling multiple DSM-5 indicators (e.g., pointing, eye contact) via the VLM, and integrates these into a subject-level classification through LLM-based reasoning. This allows both indicator-level and subject-level transparency, aligned with how clinicians make decisions. As no prior benchmark exists for this setting, we report 75.8% accuracy against expert manual labels, establishing a baseline for future work.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is an interesting paper in which the authors claim the first use of a VLM for ASD screening. However, I completely agree with R5 that the justification for using a VLM is missing in the paper. MICCAI is not only about applying new (but existing) techniques to a problem, but also about presenting a scientific justification. Similarly, a comparison with other methods (conventional ML and deep ML) is missing. Hence, I lean towards rejecting this paper.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


