Abstract
Due to the complex and resource-intensive nature of diagnosing Autism Spectrum Condition (ASC), several computer-aided diagnostic support methods have been proposed to detect autism by analyzing behavioral cues in patient video data. While these models show promising results on some datasets, they struggle with poor gaze feature performance and a lack of real-world generalizability. To tackle these challenges, we analyze a standardized video dataset comprising 168 participants with ASC (46% female) and 157 non-autistic participants (46% female), making it, to our knowledge, the largest and most balanced dataset available. We conduct a multimodal analysis with a primary focus on gaze behaviour, complemented by facial expressions, voice prosody, head motion, and heart rate variability (HRV). Addressing previous limitations in gaze modeling, we introduce novel statistical descriptors that quantify variability in eye gaze angles, improving gaze-based classification accuracy from 64% to 69% and aligning computational findings with clinical research on gaze aversion in ASC. Using late fusion, we achieve a classification accuracy of 74%, demonstrating the effectiveness of integrating behavioral markers across multiple modalities. Our findings highlight the potential for scalable, video-based screening tools to support autism assessment. To facilitate reproducibility, we share our code on GitHub: https://github.com/mbp-lab/miccai25_sit_autism_classification
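The statistical descriptors are not specified in the abstract itself; as a rough illustration only, here is a minimal sketch of variability descriptors over per-frame gaze angles, assuming OpenFace-style yaw/pitch gaze-angle outputs (the statistics and feature names are assumptions, not the authors' exact feature set):

    import numpy as np

    def gaze_variability_descriptors(gaze_angles):
        """Summarize variability of per-frame gaze angles.

        gaze_angles: (n_frames, 2) array of [yaw, pitch] in radians,
        e.g. as produced by OpenFace's gaze_angle_x / gaze_angle_y.
        Returns simple dispersion statistics; the paper's actual
        descriptors may differ.
        """
        yaw, pitch = gaze_angles[:, 0], gaze_angles[:, 1]
        feats = {}
        for name, series in (("yaw", yaw), ("pitch", pitch)):
            feats[f"{name}_std"] = np.std(series)
            # interquartile range: 75th minus 25th percentile
            feats[f"{name}_iqr"] = np.subtract(*np.percentile(series, [75, 25]))
            # frame-to-frame angular change captures gaze-shift dynamics
            feats[f"{name}_mean_abs_diff"] = np.mean(np.abs(np.diff(series)))
        return feats

Dispersion statistics of this kind tend to separate steady, screen-directed gaze from frequent gaze shifts, which is the clinical intuition behind gaze-aversion markers.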
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4692_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/mbp-lab/miccai25_sit_autism_classification
Link to the Dataset(s)
N/A
BibTex
@InProceedings{SaaWil_Improving_MICCAI2025,
author = { Saakyan, William and Norden, Matthias and Eversmann, Lola and Kirsch, Simon and Lin, Muyu and Guendelman, Simon and Dziobek, Isabel and Drimalla, Hanna},
title = { { Improving Autism Detection with Multimodal Behavioral Analysis } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
A study on behavioral analysis is presented in this paper; autism detection is the goal of the application.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- a relevant application is studied
- a new dataset is collected and analyzed
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- classification models are simple
- presentation of results must be improved (see for instance Fig 1 and Fig 2)
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- a very relevant application
- the analysis is weak, as is the discussion of the results
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper presents a multimodal behavioural analysis approach for the computer-aided detection of Autism Spectrum Condition (ASC) in adults, using standardised video-recorded social interactions. The authors develop and analyse a large dataset combining facial expressions, gaze behaviour, head motion, voice prosody, and heart rate variability. The main technical contribution lies in the refinement of gaze feature descriptors, which led to improved classification accuracy. The paper further explores unimodal vs multimodal fusion models, provides explainability via SHAP, and outlines directions for extending the approach toward real-world clinical use. Overall, it is an interesting read.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper addresses a clinically significant and timely application: the computer-aided assessment of Autism Spectrum Condition (ASC) in adults using behavioural signals captured in standardised video interactions.
- Their study makes use of a relatively large and gender-balanced dataset, with ASC diagnoses confirmed by clinicians.
- The authors propose refined gaze features based on a transformation to a screen-centred coordinate system (illustrated in the sketch after this list). This appears to better capture gaze aversion patterns and contributes to improved performance relative to previous approaches.
- The study includes both unimodal and multimodal behavioural markers (gaze, facial expressions, head motion, voice prosody, HRV) in classifications, comparing early and late fusion strategies, and provides SHAP-based feature analyses to support interpretability.
- Their evaluation uses a participant-based leave-one-out cross-validation approach to make the most of their data. The authors also report on misclassification analysis by gender, recording setting (lab vs home), and Autism Spectrum Quotient scores, contributing to a more nuanced understanding of model behaviour.
- The authors have mentioned potential future work, with relevant suggestions such as temporal modelling and exploring differential diagnosis across overlapping conditions.
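To make the screen-centred transformation concrete, a minimal geometric sketch follows; the coordinate conventions and variable names are assumptions for illustration, not the paper's exact procedure:

    import numpy as np

    def screen_centred_gaze_angle(gaze_dir, eye_pos, screen_centre):
        """Angle between the gaze ray and the eye-to-screen-centre ray.

        gaze_dir:      (3,) gaze direction in camera coordinates
        eye_pos:       (3,) estimated 3D eye position in camera coordinates
        screen_centre: (3,) 3D position of the screen centre
        Returns the angular deviation in radians; larger values mean
        gaze directed away from the screen (a proxy for gaze aversion).
        """
        to_screen = screen_centre - eye_pos
        to_screen = to_screen / np.linalg.norm(to_screen)
        gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
        cos_angle = np.clip(np.dot(gaze_dir, to_screen), -1.0, 1.0)
        return np.arccos(cos_angle)

Expressing gaze relative to the screen rather than the camera makes the descriptor invariant to where the participant sits in the frame, which is plausibly why it generalises better across recording setups.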
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While the study is framed as a multimodal behavioural analysis, it places considerable emphasis on gaze features, with limited discussion of the role and outcomes of other modalities (e.g., facial expressions, voice prosody, HRV). The analyses seem to have been conducted appropriately across modalities, but the manuscript does not sufficiently describe them in the introduction or discuss their findings beyond gaze-related features.
- Performance improvements are primarily seen in the gaze and head motion modalities, with facial and audio features showing little to no change relative to prior work. While this is not necessarily a negative outcome, a brief discussion and explanation of these results would have been welcome; this way, they could gain further interest and understanding from the reader.
- Throughout the paper, the dataset is repeatedly described as “context-balanced” and “naturalistic”, yet the majority of recordings were conducted in clinical settings (~78%). Moreover, simply being video-recorded or loosely scripted (such as in SIT paradigms) does not necessarily make an interaction naturalistic. In this study, the procedure appears at least partly scripted and constrained, particularly given that the interaction is simulated and not live. Thus, the wording seems inappropriate and misleading; it would be best if the authors removed mentions of “context-balanced” and “naturalistic” or explained their case better.
- Please avoid using the word “significant” unless it refers to statistical significance, which is not the case here. The reported +5% accuracy improvement for gaze is promising, but as a reader, it’s hard to know whether this reflects a robust gain or could fall within expected variability from cross-validation. Consider briefly explaining this improvement; for example, noting whether it was consistent across participants or CV folds could help readers interpret its practical relevance (one way to test this is sketched after this list).
- The conclusion section does not fully express the multimodal nature of the study, focusing almost exclusively on gaze. A brief mention of the limitations or relative performance of the other modalities would improve transparency and better align the narrative with the study’s framing.
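One way to probe whether the +5% gain exceeds cross-validation noise is a paired test on the per-participant LOOCV outcomes, e.g. McNemar's test; a minimal sketch with hypothetical inputs, not the authors' analysis:

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def compare_loocv_outcomes(correct_a, correct_b):
        """McNemar's test on paired per-participant LOOCV outcomes.

        correct_a, correct_b: boolean arrays (n_participants,) marking
        whether model A / model B classified each held-out participant
        correctly. Tests whether the two error patterns differ.
        """
        a, b = np.asarray(correct_a), np.asarray(correct_b)
        # 2x2 agreement/disagreement table between the two models
        table = [[np.sum(a & b), np.sum(a & ~b)],
                 [np.sum(~a & b), np.sum(~a & ~b)]]
        return mcnemar(table, exact=True).pvalue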
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
This is a timely and clinically relevant study on machine learning-based ASC assessment using behavioural features extracted from video-recorded social interactions. I appreciated the thoughtful feature engineering, particularly around gaze, the inclusion of SHAP-based model interpretation and ROC curves, and the attention to future clinical applicability. The paper has potential, but could be improved through greater clarity and transparency in several areas, outlined below, section by section.
Before I begin with each individual section though, I would like to mention that a consistent issue I noticed throughout the paper was the primary focus on gaze due to its strongest contribution, although the study is framed as a multimodal approach (which would make it more interesting). Don’t hesitate to highlight more modest performance gains from other modalities you analysed, because when appropriately interpreted and discussed, they can actually generate further interest in understanding the conditions under which different behavioural signals add value.
The abstract focuses heavily on gaze, while the contributions from other modalities (e.g., facial expressions, voice prosody, head motion, HRV) are not mentioned, even though they were analysed. Consider providing a more balanced overview or clarifying that gaze is the central focus of your contribution.
Secondly, the claim of sharing code is appreciated, but the GitHub link provided is a placeholder and does not lead to an anonymised repository.
A detailed background (introduction section) on gaze and HRV is offered, but the rest of the introduction provides little motivation for including the other modalities, such as facial expressions and prosody. Moreover, the authors should aim to make the link between all of those modalities and ASC clearer, briefly addressing, for example: “why are you examining those modalities in relation to ASC?”
Furthermore, it was unclear until later in the paper that the dataset was newly collected by the authors. Please consider stating this earlier to avoid confusion with reused datasets.
In the methods section, please report the total number of dataset stimuli and/or interaction segments (e.g., how many stimuli were produced per participant); omitting this limits the reader’s understanding of your data structure and affects transparency for reproducibility purposes.
Additionally, it’s important for the authors to clarify and justify their decision on limited exclusion criteria in the methods section. Conditions like ADHD or anxiety, which often co-occur with ASC, could also influence the behavioural signals you analyse (e.g., prosody, gaze). A brief explanation of why these were not excluded would enhance transparency and help readers assess the purity of the study’s ASC data.
Since features were extracted from different interaction phases (e.g., HRV during listening, audio during speaking), please clarify by briefly mentioning the distribution of segments across these phases.
The gaze transformation section is well explained; this level of transparency is appreciated, well done. The classification and analysis sub-section is also well documented overall, with mention of hyperparameters and fusion models. Nevertheless, it would have been nice to see a justification for why logistic regression and the fusion models were chosen over other methods, but this is not a major issue.
In the results and discussion sections, the authors report that late fusion outperformed early fusion, which is interesting. Could this be due to differences in how modalities contribute, or differences in feature compatibility across modalities (e.g., different scales, distributions, or noise levels)? If so, a brief explanation could help readers understand why one fusion approach worked better than the other.
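For reference, a minimal sketch of the two fusion strategies under discussion, using scikit-learn with hypothetical per-modality feature matrices (not the authors' exact pipeline):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def early_fusion_fit(modality_features, y):
        """Early fusion: concatenate all modality features, train one model."""
        X = np.hstack(modality_features)  # list of (n, d_m) arrays
        return LogisticRegression(max_iter=1000).fit(X, y)

    def late_fusion_predict(modality_features, y, test_features):
        """Late fusion: one classifier per modality, average probabilities.

        Scale, distribution, and noise differences between modalities stay
        contained within each per-modality model, which is one plausible
        reason late fusion can outperform simple feature concatenation.
        """
        probs = []
        for X_tr, X_te in zip(modality_features, test_features):
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y)
            probs.append(clf.predict_proba(X_te)[:, 1])
        return np.mean(probs, axis=0)  # fused probability of ASC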
Furthermore, the SHAP and misclassification analyses are well presented, well done. The Autism Spectrum Quotient distribution analysis with Mann–Whitney is a nice touch, showing you have considered the dimensional nature of ASC.
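For completeness, that AQ comparison reduces to a single test call; a minimal sketch with hypothetical group variables (e.g., AQ scores of correctly vs incorrectly classified participants):

    from scipy.stats import mannwhitneyu

    def aq_misclassification_test(aq_correct, aq_misclassified):
        """Two-sided Mann-Whitney U test comparing AQ distributions of
        correctly vs incorrectly classified participants."""
        stat, p = mannwhitneyu(aq_correct, aq_misclassified,
                               alternative="two-sided")
        return stat, p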
Overall, the “Contribution of each modality” sub-section is well-written too. However, since you note that HRV improved multimodal performance despite its low standalone accuracy, it may be worth briefly mentioning why HRV might add complementary value (e.g., its sensitivity to autonomic nervous system differences not captured by other features?).
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper presents a study on computer-aided ASC detection using machine learning applied to behavioural signals collected from standardised video interactions. Strengths include the use of clinician-confirmed ASC diagnoses, the refinement of gaze descriptors, the evaluation using Leave-One-Out Cross-Validation (LOOCV), and inclusion of interpretability analysis (SHAP). The work also analyses multiple behavioural modalities (facial expressions, voice prosody, gaze behaviour, head motion, and heart rate variability), though the introduction, interpretation and discussion focus largely on gaze, with the rest receiving limited or no attention in the narrative.
While the dataset is described as large and context-balanced, the actual number of stimuli is not reported, and the clinical setting accounts for the majority (78%) of recordings. This raises questions about the justification of terms like “naturalistic” and “context-balanced”, which appear misleading relative to the study’s actual conditions. Reported performance gains are presented without accompanying information about variability across LOOCV folds or participants, which limits the reader’s ability to assess their consistency or robustness.
These issues do not indicate major methodological flaws but do affect the paper’s transparency and may impact reproducibility. My score of 4 reflects that this is a potentially valuable contribution that could be suitable for acceptance with minor revisions and an effective rebuttal.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have adequately addressed the concerns raised in my first review.
Review #3
- Please describe the contribution of the paper
This work presents a multimodal dataset and conducts an extensive analysis to investigate the role of multimodal features—such as gaze, facial expressions, and voice—in the detection of autism. The study focuses on evaluating multimodal behavioral markers and aims to address the limitations of existing gaze-based models. To this end, the authors introduce novel statistical descriptors that capture variability in eye gaze angles, leading to improved classification accuracy in gaze-based autism detection.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The study addresses an important and impactful problem—standardized and generalized autism detection from video data, with a particular focus on facial expressions and gaze. The authors present a solid and clearly articulated methodology, conduct well-designed experiments, and provide thoughtful discussion of the results. The paper concludes with a coherent and well-supported summary of findings, enhancing its overall clarity and contribution to the field.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
I did not identify any major weaknesses in the paper. The methodology is sound, the experiments are well-executed, and the presentation is clear and complete.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I believe this is a strong paper for the reasons outlined above. It addresses an important and timely problem with a well-motivated methodology, clear experimental design, and thoughtful discussion. The results are well-presented and the conclusions are well-supported. I did not find any major issues or concerns, which led to my positive overall evaluation.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I recommended acceptance in my initial review, and after reading the rebuttal, I maintain my recommendation to accept the paper.
Author Feedback
We thank all reviewers for their thoughtful feedback and are pleased that the clinical relevance, balanced dataset, and refined gaze descriptors were recognized. Below, we address reviewer comments and outline planned revisions.
Emphasis on Gaze Modality vs Multimodal Framing (R3) We thank the reviewer for the thoughtful feedback. We agree the manuscript strongly emphasizes the gaze modality. This was deliberate, as our main contribution aimed to improve gaze-based features, previously among the weakest in SIT-based autism detection, despite the altered gaze behaviour of individuals with autism described in the clinical literature. While gaze provided the largest improvements, other modalities (facial expressions, vocal prosody) offered meaningful complementary value in multimodal fusion. Facial and head features received modest enhancements (e.g., AU onset frequency, nod detection), but these led to limited impact individually. Given that our primary methodological novelty was in gaze refinement and HRV integration, we focused our analysis accordingly. Our study uses a multimodal framework to reflect the clinical reality of diverse behavioral cues and to ensure comparability across feature types. We will revise the abstract, introduction, and conclusion to position gaze as the central contribution while explicitly acknowledging HRV and head movement cues. We will also expand the discussion of fusion strategies and modality interactions (e.g., HRV’s complementary role).
Dataset Terminology (R3) We agree that the terms “naturalistic” and “context-balanced” may overstate the actual properties of our dataset. We will revise this terminology to accurately reflect our dataset, emphasizing diverse contexts (home vs. clinical settings) without overstating their characteristics.
Justification of Classifier and Fusion Strategy (R3, R1) We chose XGBoost and logistic regression for fusion because of their transparency, comparability with prior work, strong performance on structured data, and more comprehensive interpretability (critical for clinical applications). Complex models (DNNs, Transformers) were tested but underperformed, likely due to the limited dataset size, and were thus excluded here. We plan to revisit them in future work, particularly as larger datasets become available.
Clarification of “Significance” and Performance Reporting (R3) We thank the reviewer for pointing this out. We will revise “significant” to explicitly refer only to statistically tested results. Regarding performance variability: our Leave-One-Out Cross-Validation approach yields a binary outcome (correct/incorrect) per held-out participant. Thus, fold-level performance variability (e.g., standard deviation) cannot be meaningfully computed. We chose this method for its conservativeness and appropriateness for smaller samples.
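Although fold-level variance is indeed degenerate under LOOCV, participant-level uncertainty can still be conveyed; a minimal sketch of a percentile bootstrap over the per-participant outcomes (an illustrative alternative, not part of the paper's reported analysis):

    import numpy as np

    def loocv_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap CI for LOOCV accuracy.

        correct: boolean array (n_participants,), one outcome per
        held-out participant (True = classified correctly).
        """
        rng = np.random.default_rng(seed)
        correct = np.asarray(correct)
        n = len(correct)
        # resample participants with replacement, recompute accuracy
        accs = [rng.choice(correct, size=n, replace=True).mean()
                for _ in range(n_boot)]
        lo, hi = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return correct.mean(), (lo, hi)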
Reproducibility and Code Availability (R1, R2, R3): Due to anonymization, a placeholder GitHub link was provided. We confirm that all code and a detailed README will be made publicly available upon acceptance.
Additional Clarifications (R1, R3) The SIT paradigm is standardized; all participants received identical passive (listening) and active (speaking) segments with fixed topics. We will clarify in the introduction that the dataset was newly collected by the authors. We will clarify that comorbidities (e.g., ADHD, Social Anxiety Disorder, Major Depression) were documented but not excluded, as our focus was on ASC-related interaction traits in a clinically representative sample. We will emphasize this in the discussion and note that future work will extend to transdiagnostic settings and examine how interaction markers differ across overlapping conditions. We will improve the clarity of Figures 1 and 2 by enlarging labels and adjusting layout for readability.
We thank the reviewers for their constructive and appreciative feedback and believe that our revisions will strengthen the paper and enhance its contribution to the community.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
This paper has received positive overall reviews but I would like to invite the authors to respond to the thoughtful comments of R3.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
While the reviewers all lean toward acceptance of this work, I have concerns about its fit within the scope of MICCAI. It is a very well presented study on using multimodal features from video, audio, and heart rate to perform autism classification. I would classify this work as an application study, as it uses standard technical approaches. My concerns include:
1. While gaze and natural video data can be considered “medical” image modalities, they are not traditional medical imaging modalities (such as CT, MRI, PET, etc.), and thus the work may not reach the right audience.
2. The study essentially reports on refined feature engineering of inputs to an XGBoost classifier, in which the features are extracted using standard tools. As such, there is no technical novelty in this work. As an application study, this may be fine, but I would then expect either a novel application or extensive validation. The types of modalities and features have been used before for autism classification, so the application of these specific feature types and this machine learning method is not new. And while I appreciate the difficulty of acquiring a gender-balanced autism dataset, the dataset is still small (a few hundred participants) and from a single provider, so the results of this study may not be generalizable.
3. As an application study, I would expect more comparisons as evidence for the utility of the approach, but there is only a comparison to one other method, which uses a different feature set and treats modality information separately.
In summary, while the work is solid in describing the study that was performed and the results, I do not feel it would be of interest to the MICCAI community.