The U.S. Bureau of Labor Statistics makes much of the data collected through its Occupational Requirements Survey (ORS) publicly available. This data can be used to draw inferences about the requirements of various jobs and job classes within the United States workforce. However, the dataset contains many missing observations and estimates, which limits its utility. Here, we propose a method for imputing these missing values that leverages features inherent in the survey data, such as known population limits and correlations between occupations and tasks. We fit an iterative regression, implemented with a recent version of XGBoost, across a set of simulated values drawn from the distribution described by the known values and their standard deviations as reported in the survey. This yields a distribution of predicted values for each missing estimate, from which we calculate a mean prediction and a 95% confidence interval. We discuss the use of our method and how the resulting imputations can inform future areas of study stemming from the data collected in the ORS. Finally, we conclude with an outline of WIGEM, a generalized version of our weighted, iterative imputation algorithm that could be applied in other contexts.