Occupational data mining and analysis is an important task in understanding today's industry and job market. Various machine learning techniques have been proposed and are gradually being deployed to improve companies' operations in downstream tasks such as employee churn prediction, career trajectory modelling and automated interviewing. Job title analysis and embedding are fundamental building blocks, and therefore crucial upstream tasks, for addressing these occupational data mining and analysis problems. In this work, we present the Industrial and Professional Occupations Dataset (IPOD), which consists of over 190,000 job titles crawled from over 56,000 LinkedIn profiles. We also illustrate the usefulness of IPOD by addressing two challenging upstream tasks: (i) proposing Title2vec, a contextual job title vector representation based on a bidirectional Language Model (biLM) approach; and (ii) addressing the important occupational Named Entity Recognition problem using Conditional Random Fields (CRF) and bidirectional Long Short-Term Memory with CRF (LSTM-CRF). Both CRF and LSTM-CRF outperform human performance and the baselines in both exact-match accuracy and F1 score. The dataset and pre-trained embeddings are available at https://www.github.com/junhua/ipod.
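To make the two reported metrics concrete, the following is a minimal sketch of how exact-match accuracy and a token-level micro-averaged F1 score could be computed for tagged job titles. The tag names (`RES`, `FUN`, `LOC`, `O`), the example sequences, and the choice of token-level micro-averaging are assumptions made for illustration only; they are not taken from the abstract and may differ from the evaluation protocol actually used.

```python
from typing import List

TagSeq = List[str]


def exact_match_accuracy(gold: List[TagSeq], pred: List[TagSeq]) -> float:
    """Fraction of titles whose predicted tag sequence equals the gold one."""
    matches = sum(1 for g, p in zip(gold, pred) if g == p)
    return matches / len(gold)


def micro_f1(gold: List[TagSeq], pred: List[TagSeq], outside: str = "O") -> float:
    """Token-level micro-averaged F1 over all tags except the outside tag."""
    tp = fp = fn = 0
    for g_seq, p_seq in zip(gold, pred):
        for g, p in zip(g_seq, p_seq):
            if p != outside and p == g:
                tp += 1  # correctly predicted entity tag
            else:
                if p != outside:
                    fp += 1  # predicted an entity tag that is wrong
                if g != outside:
                    fn += 1  # missed or mislabelled a gold entity tag
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags for two job titles (illustrative labels only).
gold = [["RES", "FUN"], ["FUN", "O", "LOC"]]
pred = [["RES", "FUN"], ["FUN", "O", "O"]]

print(exact_match_accuracy(gold, pred))  # 0.5
print(round(micro_f1(gold, pred), 4))    # 0.8571
```

Exact-match accuracy is the stricter of the two: a single wrong tag makes the whole title count as incorrect, whereas F1 still credits the tags that were predicted correctly.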