收集大规模IPIP地址数据统计数据的有效算法 (Efficient algorithms for collecting the statistics of large-scale IP address data)

Compiling the statistics of large-scale IP address data is an essential task in network traffic measurement. The statistical results are used to evaluate the potential impact of user behaviors on network traffic. This requires algorithms that are capable of storing and retrieving a high volume of IP addresses within time and memory constraints. In this paper, we present two efficient algorithms for collecting the statistics of large-scale IP addresses that balance time efficiency and memory consumption. The proposed solutions take into account the sparse nature of the statistics of IP addresses while building the hash function and maintain a dynamic balance among layered memory blocks. There are two layers in the first proposed method, each of which contains a limited number of memory blocks. Each memory block contains 256 elements of size $256 \times 8$ bytes for a 64-bit system. In contrast to built-in hash mapping functions, the proposed solution completely avoids expensive hash collisions while retaining the linear time complexity of hash-based solutions. Moreover, the mechanism dynamically determines the hash index length according to the range of IP addresses, and can balance the time and memory constraints. In addition, we propose an efficient parallel scheme to speed up the collection of statistics. The experimental results on several synthetic datasets show that the proposed method substantially outperforms the baselines with respect to time and memory space efficiency.

翻译：计算大规模 IP 地址数据的统计是网络交通量测量的一项基本任务。统计结果用于评估用户行为对网络交通的潜在影响。这要求算法能够在时间和内存限制内存储和检索大量 IP 地址。在本文中, 我们提出两种有效的算法, 用于收集大型 IP 地址的统计数据, 以平衡时间效率和内存消耗。提议的解决方案考虑到 IP 地址统计数据的稀少性质, 同时建立散列功能并保持多层记忆区块之间的动态平衡。第一个拟议方法有两层, 每一层都有有限的内存区块。每个内存区块都包含256 美元和 8 美元的256 个大小的内存区块。每个内存区块包含64位系统的256 256 美元 8 美元的内存区块。与内建版图功能不同的是, 拟议的解决方案完全避免了昂贵的散列碰撞, 同时保留了基于散列解决方案的线性时间复杂性。此外, 机制动态地决定了IP 地址范围的指数长度长度长度,, 并可以平衡时间和内存限制。此外, 我们提议一个高效的平行的实验性模型,,,, 将与以快速收集。