We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesise that estimating dataset complexity allows for a reduction in the number of required experiment iterations. This way we can optimize the resource-intensive training of ML models, which is becoming a serious issue due to increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The constantly increasing need for more powerful computational resources also affects the environment, due to the alarming growth of CO2 emissions caused by training large-scale ML models. The research was conducted on multiple datasets, including popular ones such as the Yelp business review dataset used for training typical sentiment analysis models, as well as more recent datasets addressing the problem of cyberbullying, which, being a serious social problem, is also a much more sophisticated problem from the point of view of linguistic representation. We use cyberbullying datasets collected for multiple languages, namely English, Japanese and Polish. The difference in linguistic complexity across these datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
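As a minimal sketch, Feature Density could be computed as the ratio of unique features to the total number of feature occurrences in a dataset (the commonly used definition); the `tokenize` parameter and the toy documents below are hypothetical placeholders standing in for the linguistically-backed preprocessing methods discussed above.

```python
from collections import Counter

def feature_density(documents, tokenize=str.split):
    """Estimate Feature Density (FD) of a dataset, assuming
    FD = |unique features| / |all feature occurrences|.
    `tokenize` stands in for any linguistically-backed preprocessing
    (e.g. lemmatization or POS-based filtering) applied before counting."""
    counts = Counter()
    total = 0
    for doc in documents:
        features = tokenize(doc)
        counts.update(features)
        total += len(features)
    return len(counts) / total if total else 0.0

# Hypothetical usage: compare FD under two preprocessing schemes.
docs = ["the cats sat on the mat", "the cat sat on a mat"]
print(feature_density(docs))  # raw surface tokens
print(feature_density(docs, tokenize=lambda d: [w.rstrip("s") for w in d.split()]))  # crude normalization
```

In this toy example the normalized variant yields a lower FD, illustrating how different preprocessing choices change the complexity estimate that is then related to expected classifier performance.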