Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction

Skills play a central role in the job market and many human resources (HR) processes. In the wake of other digital experiences, candidates in today's online job market expect to see the right opportunities based on their skill set. Similarly, enterprises increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured information about skills is often missing, and processes that build on self- or manager-assessment have been shown to struggle with the adoption, completeness, and freshness of the resulting data. Extracting skills is a highly challenging task, given the many thousands of possible skill labels, which may be mentioned explicitly or merely described implicitly, and the lack of finely annotated training corpora. Previous work on skill extraction either oversimplifies the task to explicit entity detection or builds on manually annotated training data, an approach that would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction, based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements, and that combining three different strategies in one model further increases performance, by up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for research purposes to stimulate further research on the task.
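The three ingredients named in the abstract (distant supervision through literal matching, negatives drawn from related skills in a taxonomy, and RP@K evaluation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the actual procedure, `TOY_TAXONOMY` is an invented two-category stand-in rather than real ESCO data, and the function names are hypothetical. RP@K is taken to be the usual R-Precision@K, i.e. the number of relevant labels in the top K divided by min(K, number of gold labels).

```python
import random
import re

# Hypothetical toy taxonomy (parent category -> skill labels); NOT real ESCO data.
TOY_TAXONOMY = {
    "programming": ["python", "java", "sql"],
    "communication": ["public speaking", "negotiation"],
}


def literal_match_labels(sentence, taxonomy):
    """Distant supervision: a skill counts as positive if its name appears verbatim."""
    text = sentence.lower()
    return [
        skill
        for skills in taxonomy.values()
        for skill in skills
        if re.search(r"\b" + re.escape(skill) + r"\b", text)
    ]


def related_negatives(skill, taxonomy, k, rng, exclude=()):
    """Sample up to k negative skills that share a parent category with `skill`.

    Skills listed in `exclude` (e.g. other positives for the same sentence)
    are never returned as negatives.
    """
    for skills in taxonomy.values():
        if skill in skills:
            siblings = [s for s in skills if s != skill and s not in exclude]
            return rng.sample(siblings, min(k, len(siblings)))
    return []


def rp_at_k(ranked, gold, k):
    """R-Precision@K: relevant labels in the top k, divided by min(k, |gold|)."""
    if not gold:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in gold)
    return hits / min(k, len(gold))


if __name__ == "__main__":
    rng = random.Random(0)
    positives = literal_match_labels("We need strong Python and SQL experience.", TOY_TAXONOMY)
    negatives = related_negatives("python", TOY_TAXONOMY, 2, rng, exclude=positives)
    print(positives, negatives)
    print(rp_at_k(["python", "java", "sql"], set(positives), 5))
```

In this toy setting, sibling skills act as hard negatives because they are close to the positive label in the taxonomy; whether this mirrors the strategies evaluated in the paper is not stated in the abstract.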
