We characterize the structure and origins of missingness for 159 cross-sectional return predictors and study missing value handling for portfolios constructed using machine learning. Simply imputing with cross-sectional means performs well compared to rigorous expectation-maximization methods. This stems from three facts about predictor data: (1) missingness occurs in large blocks organized by time, (2) cross-sectional correlations are small, and (3) missingness tends to occur in blocks organized by the underlying data source. As a result, observed data provide little information about missing data. Sophisticated imputations introduce estimation noise that can lead to underperformance if machine learning is not carefully applied.
翻译:暂无翻译