Cant is important for understanding advertising, comedy, and dog-whistle politics. However, computational research on cant is hindered by a lack of available datasets. In this paper, we propose a large and diverse Chinese dataset for creating and understanding cant from a computational linguistics perspective. We formulate a task for cant understanding and provide both quantitative and qualitative analyses of word embedding similarity methods and pretrained language models. Experiments suggest that the task requires deep language understanding, common sense, and world knowledge, and thus can serve as a good testbed for pretrained language models and help models perform better on other tasks. The code is available at https://github.com/JetRunner/dogwhistle. The data and leaderboard are available at https://competitions.codalab.org/competitions/30451.