Skill Extraction (SE) is an important and widely-studied task useful for gaining insights into labor market dynamics. However, there is a lacuna of datasets and annotation guidelines: available datasets are few, and contain crowd-sourced span-level labels or labels from a predefined skill inventory. To address this gap, we introduce SKILLSPAN, a novel SE dataset consisting of 14.5K sentences and over 12.5K annotated spans. We release its annotation guidelines, created over three different sources annotated for hard and soft skills by domain experts. We introduce a BERT baseline (Devlin et al., 2019). To improve upon this baseline, we experiment with language models optimized for long spans (Joshi et al., 2020; Beltagy et al., 2020), continuous pre-training on the job posting domain (Han and Eisenstein, 2019; Gururangan et al., 2020), and multi-task learning (Caruana, 1997). Our results show that the domain-adapted models significantly outperform their non-adapted counterparts, and single-task outperforms multi-task learning.
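Span-level SE with a BERT baseline is typically cast as BIO sequence tagging, where predicted tag sequences are decoded back into skill spans. A minimal sketch of such a decoder (the tag names `SKILL`/`KNOWLEDGE` are illustrative assumptions, not necessarily the dataset's actual tag set):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (start, end, label) spans, end exclusive.

    A span opens at a B- tag, extends over matching I- tags, and closes at
    O, at a new B- tag, or at a label mismatch.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open span first
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            continue                        # span continues
        else:
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:                   # span running to sequence end
        spans.append((start, len(tags), label))
    return spans

# Hypothetical tagged sentence: "experience with B-SKILL I-SKILL ..."
tags = ["O", "B-SKILL", "I-SKILL", "O", "B-KNOWLEDGE"]
print(bio_to_spans(tags))  # [(1, 3, 'SKILL'), (4, 5, 'KNOWLEDGE')]
```

Decoding at the span level (rather than scoring individual tags) is what allows span-F1 evaluation of the extracted skills.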