Context-Gloss Augmentation for Improving Arabic Target Sense Verification

The Arabic language lacks semantic datasets and sense inventories. The most common semantically labeled dataset for Arabic is ArabGlossBERT, a relatively small dataset of 167K context-gloss pairs (about 60K positive and 107K negative pairs) collected from Arabic dictionaries. This paper presents an enrichment of the ArabGlossBERT dataset, augmenting it with (Arabic-English-Arabic) machine back-translation. Augmentation increased the dataset size to 352K pairs (149K positive and 203K negative pairs). We measure the impact of augmentation using different data configurations to fine-tune BERT on the target sense verification (TSV) task. Overall, accuracy ranges from 78% to 84% across data configurations. Although our approach performed on par with the baseline, we did observe improvements for some POS tags in some experiments. Furthermore, our fine-tuned models are trained on a larger dataset covering a larger vocabulary and more contexts. We provide an in-depth analysis of the accuracy for each part-of-speech (POS).
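The back-translation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say which side of a pair (context, gloss, or both) is back-translated, so this sketch assumes each side is back-translated on its own and that the label is carried over unchanged. The names `Pair`, `back_translate`, `augment`, and `toy_translate` are hypothetical, the example strings are English placeholders rather than real Arabic data, and `toy_translate` is a stand-in for a real machine translation system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    """One context-gloss pair; label 1 means the gloss fits the target word in context."""
    context: str
    gloss: str
    label: int


# Assumed translator signature: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   lang: str = "ar", pivot: str = "en") -> str:
    """Translate lang -> pivot -> lang to obtain a paraphrase of the input."""
    return translate(translate(text, lang, pivot), pivot, lang)


def augment(pairs: Iterable[Pair], translate: Translator) -> List[Pair]:
    """Return the original pairs plus back-translated variants, without duplicates.

    Assumption: context and gloss are back-translated independently and the
    original label is kept. A round trip that reproduces the input adds nothing.
    """
    seen = set()
    out: List[Pair] = []
    for p in pairs:
        candidates = [
            p,
            Pair(back_translate(p.context, translate), p.gloss, p.label),
            Pair(p.context, back_translate(p.gloss, translate), p.label),
        ]
        for c in candidates:
            if c not in seen:
                seen.add(c)
                out.append(c)
    return out


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stand-in for an MT system, used only to make the sketch runnable."""
    if (src, tgt) == ("ar", "en"):
        return "<en>" + text
    if (src, tgt) == ("en", "ar"):
        body = text[len("<en>"):] if text.startswith("<en>") else text
        # Mimic the paraphrasing effect of a round trip through the pivot language.
        return body.replace("big", "large")
    raise ValueError(f"unsupported direction: {src}->{tgt}")
```

In practice `toy_translate` would be replaced by a call to an actual translation model or service. A real pipeline would also need to check that the target word is still identifiable in the back-translated context, which this sketch does not attempt.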
