The human perception system closely monitors audio-visual cues during multiparty interactions in order to react in a timely and natural manner. Learning to predict the timing and type of reaction responses during human-human interactions may help enrich human-computer interaction applications. In this paper we consider a presenter-audience setting and define an audience response prediction task from the presenter's textual speech. The task is formulated as a binary classification problem: the occurrence or absence of a response following the presenter's textual speech. We use the BERT model as our classifier and investigate models with different textual contexts under causal and non-causal prediction settings. While the non-causal textual context, one sentence preceding and one sentence following the response event, greatly improves prediction accuracy, we show that longer textual contexts under the causal setting attain UAR and $F1$-Score improvements that match and exceed the non-causal textual context performance in experimental evaluations on the OPUS and TED datasets.
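As an illustrative sketch only, not the paper's exact pipeline, the snippet below shows how such a binary response-occurrence classifier could be set up with a pretrained BERT model from the Hugging Face `transformers` library. The example context strings, the `[RESPONSE]` marker, and the label ordering are assumptions made for illustration; a model of this form would need to be fine-tuned on OPUS/TED-style labels before its predictions are meaningful.

```python
# Sketch of the task formulation: classify whether an audience response
# follows a given textual context (binary: no-response vs. response).
# Assumes PyTorch and Hugging Face `transformers`; the contexts, the
# [RESPONSE] marker, and the label order are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

# Causal context: only sentences preceding the potential response event.
causal_context = ("Thank you all for coming. "
                  "Let me start with a story from my childhood.")
# Non-causal context: one sentence before and one after the response event.
noncausal_context = ("Let me start with a story from my childhood. "
                     "[RESPONSE] And that story changed how I think about design.")

for name, context in [("causal", causal_context), ("non-causal", noncausal_context)]:
    inputs = tokenizer(context, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 2): [no-response, response]
    p_response = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"{name} context -> P(response) = {p_response:.3f}")
```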