Brute-force cross-validation (CV) is a method for predictive assessment and model selection that is general and applicable to a wide range of Bayesian models. However, in many cases brute-force CV is too computationally burdensome to form part of interactive modeling workflows, especially when inference relies on Markov chain Monte Carlo (MCMC). In this paper we present a method for conducting fast Bayesian CV by massively parallel MCMC. On suitable accelerator hardware, for many applications our approach is about as fast (in wall-clock time) as a single full-data model fit. Parallel CV is more flexible than existing fast CV approximation methods because it can easily exploit a wide range of scoring rules and data partitioning schemes. This is particularly useful for CV methods designed for non-exchangeable data. Our approach also delivers accurate estimates of Monte Carlo and CV uncertainty. In addition to parallelizing computations, parallel CV speeds up inference by reusing information from earlier MCMC adaptation and inference obtained during initial model fitting and checking of the full-data model. We propose MCMC diagnostics for parallel CV applications, including a summary of MCMC mixing based on the popular potential scale reduction factor ($\hat{R}$) and MCMC effective sample size ($\widehat{ESS}$) measures. Furthermore, we describe a method for determining whether an $\hat{R}$ diagnostic indicates approximate stationarity of the chains, which may be of more general interest for applications beyond parallel CV. For parallel CV to work on memory-constrained computing accelerators, we show that parallel CV and associated diagnostics can be implemented using online (streaming) algorithms that are well suited to parallel computing environments with limited memory. Constant-memory algorithms allow parallel CV to scale up to very large blocking designs.
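To illustrate what a constant-memory (online) diagnostic can look like, the following is a minimal sketch, not the implementation described in the paper. It assumes the classic (non-split, non-rank-normalized) form of $\hat{R}$ for a single scalar quantity and uses Welford's running mean and variance updates, so that only three numbers per chain are kept regardless of the number of draws. The class name `OnlineRhat` and its methods are hypothetical names chosen for this example.

```python
import numpy as np


class OnlineRhat:
    """Streaming estimate of the classic R-hat for one scalar quantity.

    Hypothetical helper for illustration only. Memory use is O(num_chains),
    independent of the number of draws processed.
    """

    def __init__(self, num_chains):
        self.n = 0                         # draws seen so far (per chain)
        self.mean = np.zeros(num_chains)   # running mean of each chain
        self.m2 = np.zeros(num_chains)     # running sum of squared deviations

    def update(self, draws):
        """Absorb one draw per chain (array of shape (num_chains,))."""
        draws = np.asarray(draws, dtype=float)
        self.n += 1
        delta = draws - self.mean
        self.mean += delta / self.n
        # Welford's update: uses the deviation from the old and the new mean.
        self.m2 += delta * (draws - self.mean)

    def rhat(self):
        """Return R-hat; requires at least 2 draws and at least 2 chains."""
        n = self.n
        within = np.mean(self.m2 / (n - 1))        # W: mean within-chain variance
        between = n * np.var(self.mean, ddof=1)    # B: scaled variance of chain means
        var_plus = (n - 1) / n * within + between / n
        return float(np.sqrt(var_plus / within))
```

In use, `update` would be called once per MCMC iteration with the current value from each chain, and `rhat` queried at any point. Because the chains are updated with elementwise array operations, the same pattern carries over to accelerator array libraries, although that port is not shown here.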