Abstract

Screening for Autism Spectrum Disorder (ASD) is an important yet challenging task. Traditional screening tools, such as questionnaires and other specialized instruments, are difficult to deploy at scale in settings such as primary healthcare and home monitoring. To address this issue, we develop a smartphone application that elicits the atypical viewing behaviors of children with ASD and extract multi-modal features, including eye movements, head pose, and emotional expressions, from the recorded smartphone videos to characterize each subject’s viewing behavior. Additionally, we propose a multi-modal progressive fusion framework to comprehensively integrate the relationships between different modalities. The progressive fusion strategy combines multi-modal features at multiple scales to achieve attention-based deep fusion. Moreover, we develop a global intra- and inter-modality interaction (GIIMI) module to enhance competition and interaction within and between modalities. In our experiments, we construct a smartphone video dataset of 124 children aged 3 to 6 years and validate the performance advantages of the proposed algorithm.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3536_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhoWen_MultiModal_MICCAI2025,
        author = { Zhong, Wenqi and Li, Bohan and Xia, Chen and Li, Kuan and Zhang, Dingwen},
        title = { { Multi-Modal Progressive Fusion for ASD Screening Using Smartphone Video } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {394--404}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a mobile-based method for diagnosing children with autism spectrum disorder using extracted multi-modal features, including eye movement, head pose, and emotional expressions. Their contributions also include: 1) providing a new way to fuse different modalities; and 2) collecting a new dataset for this task.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The application itself has strong motivation: by using mobile-based diagnostic tools, it may help regions with limited medical resources or enable early-stage testing before visiting a hospital. Such attempts should be encouraged in the community.
    2. In addition, the paper contributes by investigating the fusion of multi-modal features.
    3. The presentation of this work is very neat and clear. Figures such as 1 and 2 are helpful for readers to understand the context and the details of the methodology.
    4. The experiments are comprehensive, including comparisons between statistical methods, deep learning methods, and the proposed method. The ablation study is also complete.
    5. The authors also include necessary background knowledge on autism and explain how the extracted features align with diagnostic criteria.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Major:

    1. In Section 3.1, when explaining how you extracted information from videos, did you use any software or algorithms developed by others? If yes, please cite them more clearly (e.g., we implement gaze estimation following [cite])—I couldn’t find such citations. If not, you need to provide more details on how you implemented this part. If there isn’t enough space in the main text, this information should be included in the supplementary material.
    2. There is a lack of detail in the methodology section. For example, in Equations 1 and 2, the outputs for gaze and head estimation are described as having the shape of T×2 or T×3. Please specify exactly what data you are extracting and why the shape is structured that way.

    Minor weaknesses:

    • No full term is given for TD (also DSM-V, in the first paragraph of Section 2).
    • In Section 2: “In the testing phase, each subject watches a 2-minute video. Children with ASD and TD children exhibit significant differences in attention to geometric patterns and social scenes [21].” → Is this also used as a standard diagnostic criterion in autism practice? The authors should clarify this to better connect the logic between how the data is collected and standardized diagnostic criteria.
    • Page 6, under Equation 4: “where DownSampling(·) represents the downsampling operation using convolutional layers to reduce the resolution of features.” → It would be more accurate to say “dimension” of features rather than “resolution.” Also, please specify what dimensions the features are downsampled to.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Just wondering—will you be releasing the dataset or code? It would be a great resource for the community. I put this here as this doesn’t affect my evaluation of the paper in any way.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has a well-structured format with a solid motivation. It combines new techniques with medical knowledge of autism, providing sufficient background information for researchers from both fields. The presentation of the idea is very clear and easy to follow. The experiments are comprehensive. While some details still need to be added to the methodology section, overall, the paper is sufficiently prepared for publication.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed my concerns in the rebuttal. My recommendation for this paper is acceptance, due to its clear structure and promising results.



Review #2

  • Please describe the contribution of the paper

    The main contribution of this paper is the proposal of a multi-modal progressive fusion framework for Autism Spectrum Disorder (ASD) screening using smartphone videos. The framework involves designing a child-friendly eye-tracking experiment to record children’s viewing behaviors via smartphone cameras and extracting multi-modal features, including eye movements, head pose, and emotional expressions, to comprehensively characterize viewing behavior. Additionally, the authors introduce a progressive fusion strategy that integrates relationships between different modalities through multi-scale feature fusion and the use of Global Intra- and Inter-Modality Interaction (GIIMI) and Emotion-Enhanced Fusion (EEF) modules, thereby improving the accuracy of ASD screening.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Cost-effectiveness and Scalability: The method is implemented on a smartphone platform, which is cost-effective and scalable. The widespread use of smartphones makes it possible to deploy this method in resource-limited areas (such as rural or remote regions).

    Clinical Feasibility: By collaborating with multiple hospitals and rehabilitation centers, the authors constructed a smartphone video dataset of 124 children and validated the method’s effectiveness. The experimental results demonstrate a high accuracy rate (86.96%) for ASD screening, indicating strong potential for clinical application.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Insufficient Discussion of Multi-Modality Interactions: The authors introduce the GIIMI module to integrate intra- and inter-modality interactions, but the implementation relies on simple feature concatenation and attention mechanisms, lacking in-depth discussion of complex interaction relationships. For instance, intra-modality interactions may involve feature consistency and complementarity, while inter-modality interactions may involve semantic associations between different modalities. The authors do not adequately discuss the complexity of these interactions and how to handle them with more sophisticated methods.

    Insufficient Comparison to State-of-the-Art: In the experimental section, the authors only compare their method to some baseline methods but do not compare it to the latest relevant studies.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I have a few suggestions for the authors to consider. First, regarding the representativeness of the dataset, the authors used a limited age range (3 to 6 years) and a relatively small sample size (63 ASD children and 61 TD children). This dataset may not effectively reflect the behavior patterns of children across different age groups and regions. I recommend expanding the sample size and considering children from various age groups and regions in future experiments. Second, concerning the in-depth analysis of experimental results, the authors provided accuracy, precision, recall, and F1-score metrics but lacked a detailed analysis of the results. I suggest further exploring the complementarity and synergy between different modalities and how these factors influence the final screening performance. Lastly, I recommend validating the method’s performance in a multi-modal environment in future experiments to demonstrate its scalability. I hope these suggestions can help the authors further improve their research.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My overall score for this paper is 4, based on several key factors. First, the experimental design has limitations. Second, the analysis of experimental results is not in-depth enough. Considering these factors, I believe that the paper needs further improvements in experimental design, result analysis, and method scalability to meet the acceptance criteria.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The paper proposes a smartphone-based multi-modal ASD screening framework that leverages eye movements, head pose, and emotional expressions to achieve high accuracy in ASD detection. The method is innovative and has strong potential for clinical application. The authors have addressed most of the concerns raised in the initial review, including the comparison to state-of-the-art methods and the significance test, which strengthens the paper’s credibility. They have also provided detailed explanations and clarifications regarding the implementation details and dataset limitations.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a smartphone-based multi-modal ASD identification paradigm, which is a low-cost ASD screening method. A multi-modal progressive fusion framework is designed for identifying ASD. A progressive fusion module is designed to fuse multi-modal correlations at different time scales. A mobile-based ASD dataset is built, on which the method achieves better results than existing approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper presents a smartphone-based multimodal ASD screening method, which is a relatively low-cost screening approach. The novelty lies in the extraction of multimodal features, including eye gaze, head pose, and emotion recognition features. Another innovation is the proposal of a progressive fusion module that achieves deep multimodal interaction across multiple time scales. The method achieves good performance and is compared with the results reported by other methods. Ablation experiments were conducted on the proposed method to demonstrate the effectiveness of the multimodal features and each of its components.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There are many multimodal fusion methods, and the superiority of the proposed multimodal progressive fusion method over existing methods is not very clear. The comparative methods are relatively limited. The number of samples in the dataset is relatively small, and the results may be overly optimistic. A larger number of samples is needed for validation. A significance test is needed.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper provides a very interesting application for ASD screening. It designs a smartphone-based multimodal ASD screening method that is low-cost and easy to use.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their valuable time and comments. We are encouraged by the positive comments regarding novelty (R1-R3), performance (R1&R3), and writing (R2).

Common Questions of R1 & R3:

Comparison to SOTA Methods: We have conducted experiments incorporating two recent SOTA approaches: SOFTS (Han et al., 2024) and iTransformer (Liu et al., 2024), which achieve accuracies of 73.91% and 78.26%, respectively. Our method consistently outperforms these models, further demonstrating its superior effectiveness.

Relatively Small Dataset: Recruiting participants with ASD, particularly young children, presents significant ethical and practical challenges, making large-scale data collection difficult. Consequently, most multi-modal or eye-tracking-based studies on ASD involve around 100 participants. To ensure robust results, we employed 5-fold cross-validation on all 124 subjects in every experiment. Following the reviewers’ suggestions, we are actively expanding the dataset to include a broader range of age groups and regions in future work.
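
For reference, subject-level stratified 5-fold cross-validation of the kind described here can be set up as in the sketch below; the placeholder feature array and the exact split settings are assumptions for illustration, not the authors’ code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical label array for 124 subjects: 63 ASD (1) and 61 TD (0),
# matching the group sizes reported in the reviews.
labels = np.array([1] * 63 + [0] * 61)
X = np.zeros((len(labels), 1))  # placeholder features; the real pipeline uses video features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, labels)):
    # Training and evaluation on the actual multi-modal features would go here.
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test subjects")
```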

R1:

Significance Test: We have performed a t-test, and our method significantly outperforms the other methods (p<0.05).
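
A minimal sketch of such a test on per-fold accuracies is given below; the numbers are made up for illustration, and the paired variant is an assumption, since the rebuttal does not specify which t-test was used.

```python
from scipy import stats

# Hypothetical per-fold accuracies; the actual values are not reported in the rebuttal.
ours = [0.88, 0.85, 0.87, 0.86, 0.89]
baseline = [0.80, 0.78, 0.79, 0.77, 0.81]

# Paired t-test across folds (one common choice for cross-validated results).
t_stat, p_value = stats.ttest_rel(ours, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant at p < 0.05
```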

R2:

Details of Implementation: We have cited the gaze estimation method used (Section 3.1, first paragraph; Ref. [28]) and will include references for head pose estimation and emotion recognition. Head pose is estimated using a landmark-based model, while emotion recognition is performed using a ResNet-18 model trained on the FER2013 dataset.
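
As an illustration of the emotion branch described above, a per-frame extractor built on a ResNet-18 backbone might look like the sketch below; the 7-class FER2013 head, input resolution, and weight handling are assumptions rather than the authors’ released code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EmotionFeatureExtractor(nn.Module):
    """ResNet-18 backbone with a 7-way head for the FER2013 emotion classes."""
    def __init__(self, num_emotions: int = 7):
        super().__init__()
        backbone = resnet18(weights=None)  # FER2013-trained weights would be loaded here
        backbone.fc = nn.Linear(backbone.fc.in_features, num_emotions)
        self.backbone = backbone

    def forward(self, face_crops: torch.Tensor) -> torch.Tensor:
        # face_crops: (T, 3, 224, 224) cropped face images, one per video frame
        logits = self.backbone(face_crops)  # (T, 7) emotion logits
        return logits.softmax(dim=-1)       # (T, 7) per-frame emotion probabilities

frames = torch.randn(8, 3, 224, 224)  # dummy batch of 8 frames
print(EmotionFeatureExtractor()(frames).shape)  # torch.Size([8, 7])
```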

Experimental Stimuli: The rationale behind our experimental design is grounded in well-established behavioral traits of ASD. Social videos assess deficits in social attention, while geometric videos target restricted, repetitive interests. Prior studies (e.g., Pierce et al., 2016) show that individuals with ASD exhibit distinct visual attention to geometric vs. social scenes, making such stimuli widely employed in eye-tracking research on ASD.

Details of the Methodology: We represent gaze data as a matrix of size T×2, where T denotes the number of frames, and each row contains the (x, y) gaze coordinates corresponding to a stimulus frame. Similarly, head pose data is represented as a T×3 matrix, capturing pitch, yaw, and roll angles for each frame.
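
Concretely, these per-subject inputs can be held in plain tensors, as in the minimal sketch below (the frame count T is an illustrative value):

```python
import torch

T = 300  # number of stimulus frames (illustrative value)

gaze = torch.zeros(T, 2)       # row i holds the (x, y) gaze coordinates at frame i
head_pose = torch.zeros(T, 3)  # row i holds the (pitch, yaw, roll) angles at frame i

print(gaze.shape, head_pose.shape)  # torch.Size([300, 2]) torch.Size([300, 3])
```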

Missing Full Terms: We have added the full forms of the following terms: Typical Development (TD) and Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-V).

Downsampling Operation: We have revised the expression for improved clarity. The feature dimensions after downsampling are 128 and 64 (Implementation Details in Sec. 4).
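
For illustration, downsampling a temporal feature sequence with strided convolutions to widths of 128 and then 64 could look like the sketch below; the input width, kernel size, and stride are assumptions, and treating 128/64 as channel widths is one reading of the rebuttal.

```python
import torch
import torch.nn as nn

# Strided 1-D convolutions that halve the temporal resolution at each step
# while reducing the channel dimension to 128 and then 64.
down1 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, stride=2, padding=1)
down2 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 256, 300)   # (batch, channels, T) fused multi-modal features
x1 = down1(x)                  # (1, 128, 150) first downsampled scale
x2 = down2(x1)                 # (1, 64, 75)   second downsampled scale
print(x1.shape, x2.shape)
```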

R3:

Insufficient Discussion of Multi-Modality Interaction: For intra-modality interaction, we adopt a progressive multi-scale strategy to capture both local transient changes and global behavioral patterns, enabling more effective modeling of asynchronous relationships across different temporal scales. For inter-modality interaction, we leverage the GIIMI and EEF modules to explore the complementarity among distinct modalities. Specifically, eye movement reflects object-driven attention, while head pose captures overall viewing orientation. GIIMI uses cross-attention to model semantic associations between modalities. Finally, our model jointly integrates intra- and inter-modality interaction with learnable adaptive weighting, rather than simple concatenation. The effectiveness of these components is validated through the ablation study in Table 3. Furthermore, the significant improvement of the multi-modal approach over single-modality inputs in Table 2 also demonstrates the importance of the proposed fusion strategy.
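
As a rough illustration of cross-attention with learnable weighting of intra- and inter-modality streams (not the authors’ GIIMI implementation; all layer sizes and the gating form are assumptions):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy fusion block: one modality attends to itself (intra) and to another
    modality (inter); the two streams are mixed by a learnable weight."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable adaptive weight

    def forward(self, gaze_feat, head_feat):
        # Intra-modality interaction: gaze features attend to themselves.
        intra, _ = self.self_attn(gaze_feat, gaze_feat, gaze_feat)
        # Inter-modality interaction: gaze queries attend to head-pose keys/values.
        inter, _ = self.cross_attn(gaze_feat, head_feat, head_feat)
        w = torch.sigmoid(self.alpha)  # adaptive mixing instead of concatenation
        return w * intra + (1.0 - w) * inter

gaze = torch.randn(1, 75, 64)  # (batch, T, dim) gaze stream
head = torch.randn(1, 75, 64)  # (batch, T, dim) head-pose stream
print(CrossModalFusion()(gaze, head).shape)  # torch.Size([1, 75, 64])
```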

Validation in Multi-modal Environments: We appreciate the suggestion to evaluate the model in broader multi-modal settings. This work focuses on commonly used modalities to develop a simple yet effective mobile-based AI model for ASD recognition. We plan to incorporate additional modalities in our future work for further validation.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    While this paper has received overall positive feedback, I would like to invite the authors to respond to the comments of R2 and R3. These reviewers have raised some serious concerns, and I would appreciate it if these were addressed.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I have read the manuscript, the review comments, and the rebuttal letter. All reviewers recommend acceptance (after rebuttal). This meta-reviewer believes that the authors did a good job in addressing the concerns.


