Fitting Semiparametric Cumulative Probability Models for Big Data

Cumulative probability models (CPMs) are a robust alternative to linear models for continuous outcomes. However, they are not feasible for very large datasets due to elevated running time and memory usage, which depend on the sample size, the number of predictors, and the number of distinct outcomes. We describe three approaches to address this problem. In the divide-and-combine approach, we divide the data into subsets, fit a CPM to each subset, and then aggregate the information. In the binning and rounding approaches, the outcome variable is redefined to have a greatly reduced number of distinct values. We consider rounding to a decimal place and rounding to significant digits, both with a refinement step to help achieve the desired number of distinct outcomes. We show with simulations that these approaches perform well and their parameter estimates are consistent. We investigate how running time and peak memory usage are influenced by the sample size, the number of distinct outcomes, and the number of predictors. As an illustration, we apply the approaches to a large publicly available dataset investigating matrix multiplication runtime with nearly one million observations.
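
To make the two rounding approaches concrete, here is a minimal Python sketch of how rounding to a decimal place or to significant digits reduces the number of distinct outcome values. The function names (`round_to_decimal`, `round_to_significant`, `count_distinct`) and the sample data are hypothetical and chosen for illustration only; the refinement step mentioned in the abstract is not specified there and is not reproduced here.

```python
import math


def round_to_decimal(value, places):
    """Round a value to a fixed number of decimal places."""
    return round(value, places)


def round_to_significant(value, digits):
    """Round a value to a fixed number of significant digits."""
    if value == 0:
        return 0.0
    # Position of the leading digit, e.g. 2 for 123.45 and -2 for 0.012.
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


def count_distinct(values):
    """Number of distinct outcome values."""
    return len(set(values))


# Hypothetical outcomes spanning several orders of magnitude.
outcomes = [0.012345, 0.012351, 1.2345, 1.2351, 123.45, 123.51, 123.49]

by_decimal = [round_to_decimal(v, 1) for v in outcomes]
by_significant = [round_to_significant(v, 2) for v in outcomes]

print("original distinct values:        ", count_distinct(outcomes))
print("after rounding to 1 decimal:     ", count_distinct(by_decimal))
print("after rounding to 2 sig. digits: ", count_distinct(by_significant))
```

Rounding to a decimal place applies the same absolute resolution everywhere, so small values collapse together first, whereas rounding to significant digits keeps a roughly constant relative resolution across magnitudes. Which is preferable depends on the scale of the outcome.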

