Large-scale rare events data are commonly encountered in practice. To handle such massive data, we propose a novel distributed estimation method for logistic regression. A distributed framework poses two challenges. The first is how to distribute the data; here we investigate two distribution strategies, the RANDOM strategy and the COPY strategy. The second is how to choose the type of objective function so that the best asymptotic efficiency is achieved; here we consider the under-sampled (US) and inverse probability weighted (IPW) objective functions. Our results suggest that the COPY strategy combined with the IPW objective function is the best solution for distributed logistic regression with rare events. The finite sample performance of the distributed methods is demonstrated through simulation studies and the real-world Sweden Traffic Sign dataset.
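The two design choices described above can be illustrated with a small simulation. The abstract does not define the strategies or the objective functions, so the sketch below rests on assumptions that may differ from the paper's exact formulation: RANDOM is taken to mean shuffling all rows across workers; COPY is taken to mean replicating every rare positive case on each worker while splitting the negatives; IPW is taken to mean weighting each negative by the inverse of its sampling probability (here `K`, the number of workers); and US is taken to mean an unweighted fit followed by the standard case-control intercept correction of `log(K)`. All function names (`fit_logistic`, `simulate`, `random_strategy`, `copy_strategy`) are hypothetical, and the final estimator is a plain average of the worker estimates.

```python
import numpy as np


def fit_logistic(X, y, w=None, n_iter=100, tol=1e-8):
    """Weighted logistic regression via Newton's method (X includes an intercept column)."""
    n, p = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - mu))                        # weighted score
        hess = (X * (w * mu * (1.0 - mu))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def simulate(n, beta, rng):
    """Draw covariates and binary responses from a logistic model with a very negative intercept."""
    X = np.column_stack([np.ones(n), rng.standard_normal((n, len(beta) - 1))])
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    y = (rng.random(n) < prob).astype(float)
    return X, y


def random_strategy(X, y, K, rng):
    """RANDOM (assumed): shuffle all rows, split evenly across K workers, average the fits."""
    parts = np.array_split(rng.permutation(len(y)), K)
    return np.mean([fit_logistic(X[i], y[i]) for i in parts], axis=0)


def copy_strategy(X, y, K, rng, objective="ipw"):
    """COPY (assumed): every worker holds all positives plus a 1/K share of the negatives."""
    pos = np.flatnonzero(y == 1)
    neg = rng.permutation(np.flatnonzero(y == 0))
    estimates = []
    for chunk in np.array_split(neg, K):
        idx = np.concatenate([pos, chunk])
        if objective == "ipw":
            # Each negative is kept with probability 1/K, so it receives weight K.
            w = np.where(y[idx] == 1, 1.0, float(K))
            b = fit_logistic(X[idx], y[idx], w)
        else:
            # US (assumed): unweighted fit, then undo the log(K) intercept shift
            # caused by under-sampling the negatives.
            b = fit_logistic(X[idx], y[idx])
            b[0] -= np.log(K)
        estimates.append(b)
    return np.mean(estimates, axis=0)


rng = np.random.default_rng(0)
beta_true = np.array([-6.0, 1.0, -0.5])  # intercept chosen so that events are rare
X, y = simulate(200_000, beta_true, rng)
print("event rate:", y.mean())
print("RANDOM    :", np.round(random_strategy(X, y, 5, rng), 3))
print("COPY + IPW:", np.round(copy_strategy(X, y, 5, rng, "ipw"), 3))
print("COPY + US :", np.round(copy_strategy(X, y, 5, rng, "us"), 3))
```

A single run like this only shows that all three estimators recover the coefficients approximately; comparing their efficiency, as the abstract does, would require repeating the simulation many times and comparing the resulting mean squared errors.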