Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written query (e.g., a sentence) and a large collection of sign language videos, the objective is to find the signing video in the collection that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labeled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.
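To make the cross-modal embedding approach concrete, the sketch below shows one common way such a retrieval system can be set up: text queries and sign videos are projected into a shared embedding space, trained with a symmetric InfoNCE-style contrastive loss, and retrieval ranks videos by cosine similarity to the query. This is a minimal, hypothetical illustration; the feature dimensions, projection heads, loss, and hyperparameters here are assumptions for exposition, not the paper's exact architecture.

```python
# Minimal sketch of text-to-sign-video retrieval via a joint embedding space.
# All dimensions and modules are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project precomputed text and video features into a shared space."""
    def __init__(self, text_dim=768, video_dim=1024, embed_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, text_feats, video_feats):
        # L2-normalize so a dot product equals cosine similarity.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return t, v

def contrastive_loss(t, v, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (text, video) pairs."""
    logits = t @ v.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(len(t), device=t.device) # matched pairs on diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

model = JointEmbedding()

# Training step on a dummy batch of 32 matched (text, video) feature pairs.
t, v = model(torch.randn(32, 768), torch.randn(32, 1024))
loss = contrastive_loss(t, v)

# Retrieval: rank every video in the collection against one written query.
query_feat = torch.randn(1, 768)     # stand-in for an encoded sentence
collection = torch.randn(500, 1024)  # stand-in for encoded sign videos
with torch.no_grad():
    t, v = model(query_feat, collection)
    ranking = (t @ v.T).argsort(dim=-1, descending=True)
print("Top-5 video indices:", ranking[0, :5].tolist())
```

In a deployed search system, the video side of the embedding would typically be precomputed offline for the whole collection, so that answering a query only requires embedding the sentence and running a nearest-neighbor search.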