Combining many cross-sectional stock return predictors, as in machine learning applications, often requires imputing missing values. We compare imputation using the expectation-maximization (EM) algorithm with simple ad hoc methods. Surprisingly, EM and ad hoc methods lead to similar results. This similarity arises because the predictors are largely independent: correlations cluster near zero, and more than 10 principal components (PCs) are required to span 50% of total variance. Independence implies that observed predictors are uninformative about missing predictors, making ad hoc methods valid. In an out-of-sample PC regression test, 50 PCs are required to capture equal-weighted long-short expected returns (30 PCs value-weighted), regardless of the imputation method.
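The independence argument can be illustrated with a small simulation. This is a sketch under assumed parameters (not the paper's data or predictor set): predictors with near-zero pairwise correlations are generated, and we count how many principal components of their correlation matrix are needed to span 50% of total variance. When predictors are nearly independent, no small set of PCs dominates, so observed predictors carry little information about missing ones.

```python
import numpy as np

# Hypothetical setup: 100 predictors, nearly independent (a weak common
# factor plus idiosyncratic noise), observed for 5000 firm-months.
rng = np.random.default_rng(0)
n_obs, n_pred = 5000, 100
factor = rng.standard_normal((n_obs, 1))
X = 0.1 * factor + rng.standard_normal((n_obs, n_pred))

# Eigen-decompose the sample correlation matrix, sort eigenvalues in
# descending order, and find the smallest number of PCs whose cumulative
# eigenvalue share reaches 50% of total variance.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
cum_share = np.cumsum(eigvals) / eigvals.sum()
n_pcs_50 = int(np.argmax(cum_share >= 0.5) + 1)
print(f"PCs needed for 50% of variance: {n_pcs_50} of {n_pred}")
```

With near-independent predictors the eigenvalues of the correlation matrix cluster around one, so the count lands well above 10, consistent with the abstract's claim; a strong factor structure would instead concentrate variance in the first few PCs.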