Background: Early detection of autism spectrum disorder (ASD) is critical for improving long-term developmental outcomes. Conventional screening, however, relies on time-consuming, expert-driven behavioral assessments and scales poorly. Automated video-based analysis offers a noninvasive, objective way to extract behavioral biomarkers from naturalistic recordings.
Methods: We developed a modular multimodal framework that integrates motion-based video analysis and facial feature extraction for ASD versus typically developing (TD) classification. The system processes RGB videos, skeleton/stickman representations, and motion trajectory streams. A comprehensive set of kinematic features was extracted, including joint trajectories, velocity and acceleration profiles, posture variability, movement smoothness, and bilateral asymmetry. Repetitive stereotyped behaviors were characterized by frequency-domain (FFT) analysis within the 0.3–7.0 Hz band. Facial expression features, derived from normalized face crops and landmark-based morphological descriptors, were integrated as complementary modalities. Features were fused at the feature level after z-score normalization, and classification used a Random Forest with stratified 5-fold cross-validation. GPU acceleration enabled near real-time inference.
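The kinematic and frequency-domain steps above can be sketched as follows. This is a minimal illustration, not the paper's exact feature set: the function name, the specific summary statistics, and the 30 fps default are assumptions; only the 0.3–7.0 Hz band is taken from the text.

```python
import numpy as np

def kinematic_features(traj, fps=30.0):
    """Summarize one joint trajectory (T x 2 array of x,y positions).

    Returns a small feature vector: mean speed, mean acceleration
    magnitude, a jerk-based smoothness proxy, and the fraction of
    spectral power in the 0.3-7.0 Hz band (a proxy for repetitive,
    stereotyped movement). Illustrative only.
    """
    dt = 1.0 / fps
    vel = np.diff(traj, axis=0) / dt    # finite-difference velocity
    acc = np.diff(vel, axis=0) / dt     # acceleration
    jerk = np.diff(acc, axis=0) / dt    # jerk (smoothness proxy)
    speed = np.linalg.norm(vel, axis=1)

    # FFT of the mean-removed speed signal; power in 0.3-7.0 Hz
    spec = np.abs(np.fft.rfft(speed - speed.mean())) ** 2
    freqs = np.fft.rfftfreq(len(speed), d=dt)
    band = (freqs >= 0.3) & (freqs <= 7.0)
    band_power = spec[band].sum() / max(spec.sum(), 1e-12)

    return np.array([
        speed.mean(),
        np.linalg.norm(acc, axis=1).mean(),
        np.linalg.norm(jerk, axis=1).mean(),
        band_power,
    ])
```

Per-video descriptors would then be obtained by pooling such vectors across joints and time windows before fusion.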
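The fusion and evaluation protocol described above (z-score normalization, feature-level concatenation, Random Forest, stratified 5-fold cross-validation) can be expressed with scikit-learn. The number of trees and the random seed are assumptions not given in the abstract:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fused_classifier_cv(motion_feats, face_feats, labels, seed=0):
    """Concatenate per-video motion and face feature matrices,
    z-score each dimension, and evaluate a Random Forest with
    stratified 5-fold cross-validation. Returns (mean, std) accuracy.
    """
    X = np.hstack([motion_feats, face_feats])  # feature-level fusion
    pipe = make_pipeline(
        StandardScaler(),                      # z-score normalization
        RandomForestClassifier(n_estimators=200, random_state=seed),
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(pipe, X, labels, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```

Wrapping the scaler and classifier in one pipeline ensures normalization statistics are re-fit inside each fold, avoiding leakage from validation data.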
Results: The motion-based ComplexVideos pipeline achieved a cross-validated accuracy of 94.2 ± 2.1% with an area under the ROC curve (AUC) of 0.93. Skeleton-based KinectStickman inputs showed moderate performance (60–80% accuracy), while facial-only models reached approximately 60% accuracy. Feature-level fusion of the modalities improved classification robustness and reduced false negatives, outperforming single-modality models. Mean inference time remained below one second per video frame under standard operating conditions.
Conclusions: These results demonstrate that combining multimodal cues, motion and facial features, enables effective and efficient video-based ASD screening. The proposed framework offers a scalable, extensible, and computationally efficient solution that can support early screening in clinical and remote assessment settings.