Cumulative probability models (CPMs) are a robust alternative to linear models for continuous outcomes. However, they are not feasible for very large datasets because their running time and memory usage grow with the sample size, the number of predictors, and the number of distinct outcomes. We describe three approaches to address this problem. In the divide-and-combine approach, we divide the data into subsets, fit a CPM to each subset, and then aggregate the information. In the binning and rounding approaches, the outcome variable is redefined to have a greatly reduced number of distinct values. We consider rounding to a decimal place and rounding to significant digits, both with a refinement step to help achieve the desired number of distinct outcomes. We show with simulations that these approaches perform well and that their parameter estimates are consistent. We investigate how running time and peak memory usage are influenced by the sample size, the number of distinct outcomes, and the number of predictors. As an illustration, we apply the approaches to a large publicly available dataset on matrix multiplication runtimes with nearly one million observations.
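To make the divide-and-combine idea concrete, the following is a minimal Python sketch, not the paper's exact procedure. It uses statsmodels' OrderedModel as a stand-in CPM fitter and combines only the subset-level regression coefficients by unweighted averaging; the function name divide_and_combine, the random partitioning scheme, and the simple average are illustrative assumptions (the actual method may also aggregate the intercept function and weight the subset estimates).

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

def divide_and_combine(y, X, n_subsets, seed=0):
    """Fit a CPM (here, an ordinal logit) to each random subset of the
    data and combine the regression coefficients by simple averaging.

    y : 1-D numpy array of continuous outcomes
    X : 2-D numpy array of predictors
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    betas = []
    for part in np.array_split(idx, n_subsets):
        # Treat the continuous outcome as an ordered categorical so the
        # CPM places an intercept at each distinct value in the subset.
        endog = pd.Series(pd.Categorical(y[part], ordered=True))
        res = OrderedModel(endog, X[part], distr="logit").fit(
            method="bfgs", disp=False)
        # The first X.shape[1] parameters are the regression
        # coefficients; the rest parameterize the intercepts.
        betas.append(np.asarray(res.params)[: X.shape[1]])
    return np.mean(betas, axis=0)
```

Averaging the coefficients is sensible here because, under the CPM, every subset shares the same regression parameters even though each subset has its own set of intercepts.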
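The rounding approach can be sketched similarly. In the hypothetical helpers below, round_sig rounds values to a chosen number of significant digits, and round_to_target is an assumed refinement loop that picks the fewest digits yielding at least a desired number of distinct outcomes; the paper's actual refinement step may differ.

```python
import numpy as np

def round_sig(y, digits):
    """Round each value of y to `digits` significant digits."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    nz = y != 0  # zeros stay zero; log10 is undefined at 0
    scale = 10.0 ** (digits - 1 - np.floor(np.log10(np.abs(y[nz]))))
    out[nz] = np.round(y[nz] * scale) / scale
    return out

def round_to_target(y, target, max_digits=12):
    """Refinement loop: return the coarsest rounding (fewest significant
    digits) that keeps at least `target` distinct outcome values."""
    for digits in range(1, max_digits + 1):
        rounded = round_sig(y, digits)
        if np.unique(rounded).size >= target:
            return rounded, digits
    return round_sig(y, max_digits), max_digits
```

For example, round_to_target(y, 1000) would replace a million-observation outcome having hundreds of thousands of distinct values with roughly a thousand distinct values, which directly shrinks the number of intercepts the CPM must estimate.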