One of the major downsides of Deep Learning is its supposed need for vast amounts of training data. As such, these techniques appear ill-suited for NLP areas where annotated data is limited, such as less-resourced languages or emotion analysis, with its many nuanced and hard-to-acquire annotation formats. We conduct a questionnaire study indicating that, indeed, the vast majority of researchers in emotion analysis deem neural models inferior to traditional machine learning when training data is limited. In stark contrast to those survey results, we provide empirical evidence for English, Polish, and Portuguese that commonly used neural architectures can be trained on surprisingly few observations, outperforming $n$-gram based ridge regression with as few as 100 data points. Our analysis suggests that high-quality, pre-trained word embeddings are a main factor for achieving these results.
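To make the baseline concrete, the following is a minimal sketch of an $n$-gram based ridge regression setup for emotion regression, assuming a scikit-learn implementation; the toy texts, valence scores, and hyperparameters are illustrative assumptions, not the paper's actual configuration or data.

```python
# Minimal sketch of an n-gram ridge regression baseline for emotion
# regression (predicting a continuous valence score from text).
# NOTE: toy data and hyperparameters are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical training examples: short texts with valence in [-1, 1].
texts = [
    "what a wonderful surprise",
    "this is absolutely terrible",
    "I feel calm and content",
    "the news left me furious",
]
valence = [0.9, -0.8, 0.5, -0.7]

# Word uni- and bigram TF-IDF features feeding a ridge regressor.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    Ridge(alpha=1.0),
)
model.fit(texts, valence)
print(model.predict(["a calm and wonderful day"]))
```

With only around 100 training points, such a sparse $n$-gram model has no access to prior lexical knowledge, which is one plausible reason neural models initialized with pre-trained word embeddings can overtake it in the low-data regime described above.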