This work proposes a data-driven learning model for the synthesis of keystroke biometric data. The proposed method is compared with two statistical approaches based on Universal and User-dependent models. The three approaches are validated on the bot detection task, where the synthetic keystroke data is used to improve the training of keystroke-based bot detection systems. Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects. We analyze the performance of the three synthesis approaches through qualitative and quantitative experiments. Different bot detectors are considered, based on several supervised classifiers (Support Vector Machine, Random Forest, Gaussian Naive Bayes, and a Long Short-Term Memory network) and a learning framework that includes both human and synthetic samples. The experiments demonstrate the realism of the synthetic samples. The classification results suggest that in scenarios with large amounts of labeled data, these synthetic samples can be detected with high accuracy. In few-shot learning scenarios, however, detecting them remains an important challenge. These results also show the great potential of the presented models.