Combining 100+ cross-sectional predictors requires either dropping 90% of the data or imputing missing values. We compare imputation using the expectation-maximization algorithm with simple ad-hoc methods. Surprisingly, expectation-maximization and ad-hoc methods lead to similar results. This similarity happens because predictors are largely independent: Correlations cluster near zero and more than 10 principal components are required to span 50% of total variance. Independence implies observed predictors are uninformative about missing predictors, making ad-hoc methods valid. In an out-of-sample principal components (PC) regression test, 50 PCs are required to capture equal-weighted long-short expected returns (30 PCs value-weighted), regardless of the imputation method.
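
The independence argument can be illustrated with a small simulation. The sketch below is a toy example on synthetic Gaussian data, not the paper's data or code. It assumes the mean and covariance are known, so it runs only the conditional-expectation step that expectation-maximization would iterate, rather than a full EM fit. The function names `pcs_to_span` and `conditional_mean_impute` are hypothetical helpers introduced for illustration. When the covariance is diagonal, the conditional expectation of a missing predictor given the observed ones collapses to its unconditional mean, so the two imputations coincide; with strongly correlated predictors they differ.

```python
import numpy as np


def pcs_to_span(X, share=0.5):
    """Number of principal components needed to reach `share` of total variance."""
    corr = np.corrcoef(X, rowvar=False)
    eig = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, largest first
    cum = np.cumsum(eig) / eig.sum()
    return int(np.searchsorted(cum, share) + 1)


def conditional_mean_impute(X, mu, sigma):
    """Fill NaNs with E[missing | observed] under a Gaussian with given (mu, sigma).

    Assumption: mu and sigma are known. A full EM fit would alternate this
    step with re-estimating mu and sigma from the filled data.
    """
    out = X.copy()
    for i in range(X.shape[0]):
        m = np.isnan(X[i])  # missing entries in row i
        if not m.any():
            continue
        o = ~m              # observed entries in row i
        if not o.any():
            out[i, m] = mu[m]  # nothing observed: fall back to the mean
            continue
        s_mo = sigma[np.ix_(m, o)]
        s_oo = sigma[np.ix_(o, o)]
        out[i, m] = mu[m] + s_mo @ np.linalg.solve(s_oo, X[i, o] - mu[o])
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 2000, 100
    mu = np.zeros(p)
    for label, rho in [("independent", 0.0), ("correlated", 0.9)]:
        # Equicorrelated covariance: rho off the diagonal, 1 on it.
        sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
        X = rng.multivariate_normal(mu, sigma, size=n)
        X_miss = X.copy()
        X_miss[rng.random(X.shape) < 0.3] = np.nan  # 30% missing at random

        mean_fill = np.where(np.isnan(X_miss), mu, X_miss)  # ad-hoc method
        cond_fill = conditional_mean_impute(X_miss, mu, sigma)
        gap = np.abs(mean_fill - cond_fill).max()
        print(label, "PCs for 50% variance:", pcs_to_span(X),
              "max imputation gap:", gap)
```

In the independent case the gap is zero and many components are needed to span half the variance; in the correlated case a single component suffices and the conditional imputation departs from the mean. Missingness here is completely at random, which is a simplifying assumption of the sketch.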