Speech encodes a wealth of information related to human behavior and has been used in a variety of automated behavior recognition tasks. However, extracting behavioral information from speech remains challenging, in part because of inadequate training data resources stemming from the often low occurrence frequencies of specific behavioral patterns. Moreover, supervised behavioral modeling typically relies on domain-specific construct definitions and corresponding manually annotated data, which makes generalizing across domains difficult. In this paper, we exploit the stationary properties of human behavior within an interaction and present a representation learning method to capture behavioral information from speech in an unsupervised way. We hypothesize that nearby segments of speech share the same behavioral context and hence map onto similar underlying behavioral representations. We present an encoder-decoder based Deep Contextualized Network (DCN) as well as a Triplet-Enhanced DCN (TE-DCN) framework to capture the behavioral context and derive a manifold representation in which speech frames with similar behaviors lie closer together while frames of different behaviors maintain larger distances. The models are trained on movie audio data and validated on diverse domains, including a couples therapy corpus and other publicly collected data (e.g., stand-up comedy). With encouraging results, our proposed framework demonstrates the feasibility of unsupervised learning for cross-domain behavioral modeling.
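To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of how a triplet-enhanced encoder-decoder of this kind could be set up in PyTorch. The GRU backbone, all layer sizes, and the module names are assumptions for illustration; the essential elements from the abstract are (i) an encoder-decoder that reconstructs a speech segment from a compact embedding and (ii) a triplet loss that pulls temporally nearby segments (assumed to share behavioral context) together and pushes distant segments apart.

```python
# Illustrative sketch of a triplet-enhanced encoder-decoder for speech
# segments. Module names, feature sizes, and the GRU choice are assumptions.
import torch
import torch.nn as nn

class TripletEncoderDecoder(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=128, emb_dim=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_emb = nn.Linear(hidden_dim, emb_dim)      # behavioral embedding
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.to_feat = nn.Linear(hidden_dim, feat_dim)    # frame reconstruction

    def embed(self, x):                                   # x: (B, T, feat_dim)
        _, h = self.encoder(x)                            # h: (1, B, hidden_dim)
        return self.to_emb(h.squeeze(0))                  # (B, emb_dim)

    def forward(self, x):
        z = self.embed(x)                                 # segment embedding
        z_seq = z.unsqueeze(1).expand(-1, x.size(1), -1)  # repeat per frame
        out, _ = self.decoder(z_seq)
        return self.to_feat(out), z                       # reconstruction + emb

model = TripletEncoderDecoder()
recon_loss = nn.MSELoss()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Toy batch: anchor and positive stand in for nearby segments (hypothesized
# to share behavioral context); the negative stands in for a distant segment.
anchor, positive, negative = (torch.randn(8, 50, 40) for _ in range(3))
recon, z_a = model(anchor)
z_p, z_n = model.embed(positive), model.embed(negative)
loss = recon_loss(recon, anchor) + triplet_loss(z_a, z_p, z_n)
loss.backward()
```

The reconstruction term encourages the embedding to retain segment-level information, while the triplet term shapes the manifold so that segments with similar behavioral context are closer than segments with differing context, matching the stationarity hypothesis described above.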