Gesture recognition is essential for the interaction of autonomous vehicles with humans. While current approaches focus on combining several modalities such as image features, keypoints and bone vectors, we present a neural network architecture that delivers state-of-the-art results using only body skeleton input data. We propose the spatio-temporal multilayer perceptron for gesture recognition in the context of autonomous vehicles. Given 3D body poses over time, we define temporal and spatial mixing operations to extract features in both domains. Additionally, the importance of each time step is re-weighted with Squeeze-and-Excitation layers. An extensive evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach. Furthermore, we deploy our model on our autonomous vehicle to show its real-time capability and stable execution.
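The mixing and re-weighting operations described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify layer sizes, normalization, activation, or how the blocks are ordered, so everything here (the MLP-Mixer-style residual layout, the GELU activation, the hidden widths, the reduction ratio, and the names `st_block`, `squeeze_excite_time`, `init_params` and `forward`) is an assumption chosen for illustration. Weights are random and untrained; the point is only to show which axis each operation acts on.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation (an assumed choice)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    # normalize over the last axis, without learned scale or shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    # two-layer perceptron applied along the last axis
    return gelu(x @ w1) @ w2

def squeeze_excite_time(x, w1, w2):
    # x: (T, D). Squeeze: average every time step over its features -> (T,)
    s = x.mean(axis=1)
    # Excite: bottleneck + sigmoid gives one weight in (0, 1) per time step
    h = np.maximum(s @ w1, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(h @ w2)))
    return x * gate[:, None]

def st_block(x, p):
    # x: (T, D) = time steps x per-frame skeleton features
    # spatial mixing: MLP over the feature axis of each frame (residual)
    x = x + mlp(layer_norm(x), p["ws1"], p["ws2"])
    # temporal mixing: MLP over the time axis of each feature channel (residual)
    x = x + mlp(layer_norm(x).T, p["wt1"], p["wt2"]).T
    # re-weight the importance of each time step
    return squeeze_excite_time(x, p["se1"], p["se2"])

def init_params(rng, T, J, D, n_classes, n_blocks=2, reduction=4):
    # hypothetical sizes: hidden widths 2*D and 2*T, SE bottleneck T // reduction
    def w(*shape):
        return 0.1 * rng.standard_normal(shape)
    Tr = max(T // reduction, 1)
    blocks = [dict(ws1=w(D, 2 * D), ws2=w(2 * D, D),
                   wt1=w(T, 2 * T), wt2=w(2 * T, T),
                   se1=w(T, Tr), se2=w(Tr, T)) for _ in range(n_blocks)]
    return dict(embed=w(J * 3, D), blocks=blocks, out=w(D, n_classes))

def forward(poses, params):
    # poses: (T, J, 3) array of 3D joint positions over T time steps
    T = poses.shape[0]
    x = poses.reshape(T, -1) @ params["embed"]   # per-frame embedding -> (T, D)
    for p in params["blocks"]:
        x = st_block(x, p)
    return x.mean(axis=0) @ params["out"]        # pool over time -> class logits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, J, D, n_classes = 24, 17, 64, 4           # illustrative sizes only
    params = init_params(rng, T, J, D, n_classes)
    logits = forward(rng.standard_normal((T, J, 3)), params)
    print(logits.shape)
```

In this sketch the temporal MLP has weights of shape `(T, 2*T)`, so the sequence length is fixed at construction time; clips of other lengths would have to be resampled or padded to `T` frames first.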