Big data has attracted great attention across many fields in recent years. Under computer memory constraints, how to perform regression on big data streams while handling outliers appropriately is worth studying. With this as a starting point, this article proposes an Online Updating Huber Robust Regression algorithm. By integrating Huber regression into the online updating framework, the algorithm continuously updates the estimate built from historical data using key features extracted from each new data subset, and it is robust to heavy-tailed distributions, heterogeneous errors, and outliers. The resulting Online Updating estimator is asymptotically equivalent to the Oracle estimator computed from the entire data set and has lower computational complexity. We also conduct simulations and a real data analysis. The experimental results show that our algorithm performs strongly compared with five other algorithms in both estimation and computational efficiency, making it feasible for real applications.
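To make the idea concrete, the following is a minimal Python sketch of one way an online updating Huber estimator could be organized. It is not the article's algorithm: the aggregation rule (a Hessian-weighted average of per-batch Huber fits), the fixed threshold `delta = 1.345`, the assumption of unit error scale, and the names `huber_irls` and `OnlineHuber` are all assumptions made for illustration. Only a p-by-p matrix and a p-vector are kept between batches, which is what makes such a scheme attractive under memory limits.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50, tol=1e-8):
    """Fit Huber regression on one batch by iteratively reweighted least squares.
    Assumes the error scale is roughly 1 (no scale estimation is done here)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        abs_r = np.maximum(np.abs(r), 1e-12)
        # Huber weights: 1 for small residuals, delta/|r| for large ones
        w = np.where(abs_r <= delta, 1.0, delta / abs_r)
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

class OnlineHuber:
    """Hypothetical online updater: keeps only summary statistics per batch."""

    def __init__(self, p, delta=1.345):
        self.delta = delta
        self.A = np.zeros((p, p))  # running sum of batch curvature matrices A_k
        self.b = np.zeros(p)       # running sum of A_k @ beta_k

    def update(self, X, y):
        beta_k = huber_irls(X, y, self.delta)
        r = y - X @ beta_k
        inlier = np.abs(r) <= self.delta
        # Curvature of the Huber loss: only inlying points contribute x x^T
        A_k = X[inlier].T @ X[inlier]
        self.A += A_k
        self.b += A_k @ beta_k
        return self.coef()

    def coef(self):
        # Weighted average of batch estimates (assumed aggregation rule)
        return np.linalg.solve(self.A, self.b)
```

Each call to `update` consumes one data subset and then discards it, so the raw historical data never need to be held in memory. A fuller treatment would also estimate the error scale per batch rather than assuming it is 1.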