Facial expression spotting is the preliminary step for micro- and macro-expression analysis. The task of reliably spotting such expressions in video sequences is currently unsolved. The current best systems depend upon optical flow methods to extract regional motion features, before categorising that motion into a specific class of facial movement. Optical flow is susceptible to drift error, which introduces a serious problem for motions with long-term dependencies, such as macro-expressions in high frame-rate video. We propose a purely deep learning solution which, rather than tracking frame-differential motion, compares each frame, via a convolutional model, with two temporally local reference frames. Reference frames are sampled according to calculated micro- and macro-expression durations. We show that our solution achieves state-of-the-art performance (F1-score of 0.126) on a dataset of high frame-rate (200 fps) long video sequences (SAMM-LV) and is competitive on a low frame-rate (30 fps) dataset (CAS(ME)2). In this paper, we document our deep learning model and parameters, including our use of local contrast normalisation, which we show is critical for optimal results. We surpass a limitation of existing methods and advance the state of deep learning in the domain of facial expression spotting.
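To make the two ideas named above concrete, the following is a minimal Python sketch of (a) sampling two temporally local reference frames per frame according to computed micro- and macro-expression durations, and (b) per-frame local contrast normalisation. The function names, the Gaussian-window LCN variant, the offset-and-clamp sampling rule, and all parameter values are illustrative assumptions, not the paper's specification.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalise(frame, sigma=2.0, eps=1e-6):
    """Assumed LeCun-style local contrast normalisation: subtract a
    Gaussian-weighted local mean, then divide by the Gaussian-weighted
    local standard deviation."""
    frame = frame.astype(np.float64)
    local_mean = gaussian_filter(frame, sigma)
    centred = frame - local_mean
    local_std = np.sqrt(gaussian_filter(centred ** 2, sigma))
    # Clamp the divisor so flat image regions do not amplify noise.
    return centred / np.maximum(local_std, eps)

def reference_indices(t, micro_len, macro_len):
    """Assumed sampling rule: pair frame t with one reference frame a
    micro-expression duration back and one a macro-expression duration
    back, clamped to the start of the sequence."""
    return max(t - micro_len, 0), max(t - macro_len, 0)

def build_input(frames, t, micro_len, macro_len):
    """Stack the current frame with its two normalised reference frames;
    a convolutional model would then compare the three channels rather
    than track frame-differential motion."""
    i, j = reference_indices(t, micro_len, macro_len)
    return np.stack([local_contrast_normalise(frames[k])
                     for k in (t, i, j)], axis=0)
```

As a usage note under these assumptions: on a 200 fps sequence (SAMM-LV), a hypothetical macro-expression duration of 0.5 s would give macro_len = 100 frames; the durations actually used are computed from the datasets, as the abstract states.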