What if emotion could be captured in a general and subject-agnostic fashion? Is it possible, for instance, to design general-purpose representations that detect affect solely from the pixels and audio of a human-computer interaction video? In this paper we address these questions by evaluating the capacity of deep learned representations to predict affect using only the audiovisual information of videos. We assume that the pixels and audio of an interactive session embed the information necessary to detect affect. We test this hypothesis in the domain of digital games, evaluating the degree to which deep classifiers and deep preference learning algorithms can learn to predict player arousal based solely on video footage of gameplay. Our results across four dissimilar games suggest that general-purpose representations can be built across games, as the arousal models reach average accuracies of up to 85% under the challenging leave-one-video-out cross-validation scheme. The dissimilar audiovisual characteristics of the tested games showcase both the strengths and the limitations of the proposed method.
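To make the leave-one-video-out evaluation protocol concrete, the snippet below is a minimal sketch of that cross-validation scheme with a small stand-in classifier. It is illustrative only: the features, labels, video ids, and the MLP model are hypothetical placeholders (the paper's own models are deep audiovisual networks trained on gameplay footage), and the random data exists solely so the loop runs end to end.

```python
# Minimal sketch of leave-one-video-out cross-validation, assuming
# frame-level audiovisual features (X), the id of the gameplay video
# each frame came from (video_id), and a binary high/low arousal label
# (y). All data here is randomly generated and purely illustrative.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))            # placeholder audiovisual features per frame
video_id = rng.integers(0, 10, size=200)  # which of 10 videos each frame belongs to
y = rng.integers(0, 2, size=200)          # placeholder binary arousal label

logo = LeaveOneGroupOut()
accuracies = []
for train_idx, test_idx in logo.split(X, y, groups=video_id):
    # Every frame of the held-out video is excluded from training,
    # so the model is never tested on footage it has already seen.
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean leave-one-video-out accuracy: {np.mean(accuracies):.2f}")
```

The design choice this scheme encodes is that generalization is measured across whole interaction sessions rather than across shuffled frames, which is what makes the reported accuracies a meaningful test of subject- and session-agnostic representations.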