Recent advances in text mining and natural language processing technology have enabled researchers to detect an author's identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing variation in linguistic characteristics. However, applying such techniques in the wild, i.e., in both cybercriminal and regular online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied to social media communications typically has no control over the dataset size: the number of available communications will vary across users. Hence, the system has to be robust to limited data availability. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant of a certain degree of linguistic noise (for example, abbreviations, non-standard language use, spelling variations and errors). Finally, in the context of cybercriminal fora, it has to be robust to deceptive or adversarial behaviour, i.e., offenders who attempt to hide their criminal intentions (obfuscation) or who assume a false digital persona (imitation), potentially using coded language. In this work we present a comprehensive survey that discusses the problems already addressed in the current literature and reviews potential solutions. Additionally, we highlight which areas need to be given more attention.