Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical or machine learning models trained on a small subset of manually coded text answers. Recently, pre-training a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pre-trained language model, is more effective at automated coding of answers to open-ended questions than non-pre-trained statistical learning approaches. First, we found that fine-tuning the pre-trained BERT parameters is essential; otherwise, BERT is not competitive. Second, we found that fine-tuned BERT barely beats the non-pre-trained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT's relative advantage increases rapidly when more manually coded observations (e.g., 200-400) are available for training. We conclude that, for automated coding of answers to open-ended questions, BERT is preferable to non-pre-trained models such as support vector machines and boosting.
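To make the comparison concrete, the sketch below shows one way a fine-tuned BERT classifier and a non-pre-trained baseline (TF-IDF features with a linear support vector machine) could be set up for coding open-ended answers. It is a minimal illustration, not the authors' pipeline: the checkpoint bert-base-uncased, the hyperparameters, and the placeholder data are assumptions, and it relies on the Hugging Face transformers, PyTorch, and scikit-learn libraries.

```python
# Minimal sketch: fine-tuned BERT vs. a non-pre-trained SVM baseline
# for classifying (coding) answers to open-ended questions.
# Placeholder data and hyperparameters are illustrative assumptions.

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast, BertForSequenceClassification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Placeholder coded answers; replace with real manually coded observations.
texts = ["answer mentioning topic one ...", "answer mentioning topic two ..."] * 50
labels = [0, 1] * 50
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.3, random_state=0)

# --- Non-pre-trained baseline: TF-IDF features + linear SVM ---
vec = TfidfVectorizer()
svm = LinearSVC().fit(vec.fit_transform(X_tr), y_tr)
svm_acc = svm.score(vec.transform(X_te), y_te)

# --- Pre-trained BERT, fine-tuned on the small coded subset ---
class AnswerDataset(Dataset):
    """Tokenized answers plus their category codes."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(set(labels)))
train_loader = DataLoader(AnswerDataset(X_tr, y_tr, tok), batch_size=16, shuffle=True)
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                    # a few epochs of fine-tuning
    for batch in train_loader:
        optim.zero_grad()
        out = model(**batch)              # loss is returned because "labels" is present
        out.loss.backward()
        optim.step()

# Evaluate the fine-tuned model on the held-out answers.
model.eval()
test_loader = DataLoader(AnswerDataset(X_te, y_te, tok), batch_size=16)
correct = 0
with torch.no_grad():
    for batch in test_loader:
        logits = model(**batch).logits
        correct += (logits.argmax(dim=-1) == batch["labels"]).sum().item()

print(f"SVM accuracy:  {svm_acc:.3f}")
print(f"BERT accuracy: {correct / len(y_te):.3f}")
```

Freezing the BERT parameters and training only a classification head on top would correspond to the non-fine-tuned variant discussed above; in that setting the abstract reports BERT is not competitive, which is why the sketch updates all parameters during fine-tuning.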