Despite the great impact of lies on human societies and a meager 54% human accuracy in Deception Detection (DD), Machine Learning systems that perform automated DD are still not viable for real-life application due to data scarcity. Few publicly available DD datasets exist, and the creation of new datasets is hindered by the conceptual distinction between low-stakes and high-stakes lies. In theory, the two kinds of lies are so distinct that a dataset of one kind could not support applications targeting the other. Although low-stakes deception data is easier to acquire, since such lies can be simulated (faked) in controlled settings, these lies lack the significance and depth of genuine high-stakes lies, which are much harder to obtain and are the ones of practical interest to automated DD systems. To investigate whether this distinction holds in practice, we design several experiments comparing a high-stakes DD dataset with a low-stakes DD dataset, evaluating both on a Deep Learning classifier that works exclusively from video data. In our experiments, a network trained on low-stakes lies classified high-stakes deception more accurately than low-stakes deception, although using low-stakes lies as an augmentation strategy for the high-stakes dataset decreased its accuracy.