Deep learning models have shown promising results in recognizing depressive states from video-based facial expressions. While successful models typically leverage 3D-CNNs or video distillation techniques, the varying use of pretraining, data augmentation, preprocessing, and optimization techniques across experiments makes fair architectural comparisons difficult. We propose instead to enhance two simple ResNet-50-based models that use only static spatial information, by applying two specific face alignment methods together with improved data augmentation, optimization, and scheduling techniques. Our extensive experiments on benchmark datasets show that each single stream obtains results comparable to those of sophisticated spatio-temporal models, while the score-level fusion of the two streams outperforms state-of-the-art methods. Our findings suggest that specific modifications to the preprocessing and training process produce noticeable differences in model performance and could hide the actual gains originally attributed to the use of different neural network architectures.
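The score-level fusion mentioned above can be sketched as a weighted average of the per-sample predictions of the two streams. This is a minimal illustration, not the paper's exact procedure: the function name and the equal weighting are assumptions, and in practice the weight would be tuned on a validation set.

```python
def fuse_scores(scores_a, scores_b, w=0.5):
    """Score-level fusion of two model streams by weighted averaging.

    scores_a, scores_b: per-sample predictions from the two streams.
    w: weight given to the first stream (hypothetical default; in
       practice chosen on a validation set).
    """
    return [w * a + (1 - w) * b for a, b in zip(scores_a, scores_b)]

# Example: predicted depression-severity scores from two streams
# on three samples (illustrative values only).
print(fuse_scores([10.0, 4.0, 7.0], [8.0, 6.0, 7.0]))  # [9.0, 5.0, 7.0]
```

With `w=0.5` this reduces to plain averaging; unequal weights let the fusion favor the stronger stream.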