Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees gave the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict survey response rates with high accuracy. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging; we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without sacrificing predictive performance relative to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our Python implementation is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.
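As a rough sketch of the model class named in the abstract (the notation $f_j$, $f_{jk}$, $\lambda$, $n$, and $p$ is assumed here for illustration and is not taken from the paper), an additive model with pairwise interactions over $p$ covariates can be written as

$$
f(x) = \sum_{j=1}^{p} f_j(x_j) + \sum_{1 \le j < k \le p} f_{jk}(x_j, x_k),
$$

and one generic $\ell_0$-regularized least-squares fit on $n$ observations $(x_i, y_i)$ penalizes the number of nonzero components:

$$
\min_{f} \; \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 + \lambda \Bigl( \sum_{j=1}^{p} \mathbf{1}[f_j \neq 0] + \sum_{1 \le j < k \le p} \mathbf{1}[f_{jk} \neq 0] \Bigr).
$$

Under this reading, a hierarchically structured variant would additionally allow $f_{jk} \neq 0$ only when the corresponding main effects are selected (both $f_j$ and $f_k$ under strong hierarchy, at least one under weak hierarchy). The paper's actual objective may differ, for example by using separate penalties for main effects and interactions or by adding smoothness terms.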