Automatic action identification from video and kinematic data is an important machine learning problem with applications ranging from robotics to smart health. Most existing works focus on identifying coarse actions such as running, climbing, or cutting a vegetable, which have relatively long durations. This is an important limitation for applications that require the identification of subtle motions at high temporal resolution. For example, in stroke recovery, quantifying rehabilitation dose requires differentiating motions with sub-second durations. Our goal is to bridge this gap. To this end, we introduce a large-scale, multimodal dataset, StrokeRehab, as a new action-recognition benchmark that includes subtle short-duration actions labeled at high temporal resolution. These short-duration actions, called functional primitives, consist of reaches, transports, repositions, stabilizations, and idles. The dataset consists of high-quality inertial measurement unit (IMU) sensor data and video of 41 stroke-impaired patients performing activities of daily living such as feeding and brushing teeth. We show that current state-of-the-art segmentation-based models produce noisy predictions when applied to these data, which often leads to overcounting of actions. To address this, we propose a novel approach to high-resolution action identification, inspired by speech-recognition techniques, which is based on a sequence-to-sequence model that directly predicts the sequence of actions. This approach outperforms current state-of-the-art methods on the StrokeRehab dataset, as well as on the standard benchmark datasets 50Salads, Breakfast, and JIGSAWS.
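To make the sequence-to-sequence idea concrete, below is a minimal PyTorch sketch of the general setup: an encoder reads a long, high-rate stream of sensor features and an autoregressive decoder emits the much shorter sequence of functional primitives directly, rather than producing frame-wise segmentation labels. All architectural details (GRU encoder/decoder, hidden sizes, the feature dimension, the token vocabulary layout) are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

# Hypothetical token vocabulary: the five functional primitives plus
# start/end-of-sequence markers (layout is an assumption for this sketch).
PRIMITIVES = ["<sos>", "<eos>", "reach", "transport", "reposition",
              "stabilize", "idle"]

class Seq2SeqActionModel(nn.Module):
    def __init__(self, feat_dim=77, hidden=256, vocab=len(PRIMITIVES)):
        super().__init__()
        # Encoder summarizes the high-temporal-resolution feature stream.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.embed = nn.Embedding(vocab, hidden)
        # Decoder predicts one primitive token at a time.
        self.decoder = nn.GRU(hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, features, target_tokens):
        # features: (B, T, feat_dim) IMU/video features at a high frame rate.
        # target_tokens: (B, L) primitive indices, used for teacher forcing.
        _, h = self.encoder(features)          # h: (2, B, hidden)
        # Concatenate the two directions into one initial decoder state.
        h = h.permute(1, 0, 2).reshape(1, features.size(0), -1)
        dec_in = self.embed(target_tokens[:, :-1])   # shift right
        dec_out, _ = self.decoder(dec_in, h.contiguous())
        return self.out(dec_out)               # (B, L-1, vocab) logits

# Usage sketch: 2 clips of 500 time steps map to 10 primitive tokens each.
model = Seq2SeqActionModel()
feats = torch.randn(2, 500, 77)
tgt = torch.randint(2, len(PRIMITIVES), (2, 10))
tgt[:, 0] = 0                                  # prepend <sos>
logits = model(feats, tgt)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
print(logits.shape, float(loss))
```

Note the key design point this sketch illustrates: the decoder's output length is decoupled from the input length, so the model cannot overcount actions by fragmenting one primitive into many short segments, which is the failure mode of frame-wise segmentation described above.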