In this research, we study how the performance of machine learning (ML) classifiers changes when various linguistic preprocessing methods are applied to a dataset, with a specific focus on linguistically-backed embeddings in Convolutional Neural Networks (CNNs). Moreover, we study the concept of Feature Density and confirm its potential to comparatively predict the performance of ML classifiers, including CNNs. The research was conducted on a Formspring dataset provided in a Kaggle competition on automatic cyberbullying detection. The dataset was re-annotated by objective experts (psychologists), as the importance of professional annotation in cyberbullying research has been indicated multiple times. The study confirmed the effectiveness of Neural Networks in cyberbullying detection and the correlation between classifier performance and Feature Density, while also proposing a new approach of training various linguistically-backed embeddings for Convolutional Neural Networks.
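The abstract does not define Feature Density, so the following is only a minimal sketch under an assumed definition: the number of unique features in a corpus divided by the total number of feature occurrences. The function name `feature_density` and the toy sentences are hypothetical and serve only to illustrate how different preprocessing methods (here, raw tokens versus lemmas) could yield different density values for the same data.

```python
from collections import Counter


def feature_density(documents):
    """Assumed definition: unique features / total feature occurrences.

    `documents` is a list of documents, each already preprocessed into a
    list of features (tokens, lemmas, and so on).
    """
    counts = Counter(feature for doc in documents for feature in doc)
    total = sum(counts.values())
    # An empty corpus has no features; return 0.0 instead of dividing by zero.
    return len(counts) / total if total else 0.0


# Hypothetical toy corpus under two preprocessing variants.
tokens = [["you", "were", "mean"], ["you", "are", "mean"]]
lemmas = [["you", "be", "mean"], ["you", "be", "mean"]]

fd_tokens = feature_density(tokens)  # 4 unique features over 6 occurrences
fd_lemmas = feature_density(lemmas)  # 3 unique features over 6 occurrences
```

In this toy case lemmatization merges "were" and "are" into one feature, so the lemmatized variant has the lower density. Comparing such values across preprocessing variants is the kind of comparative signal the abstract refers to, although the exact procedure used in the study is not given here.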