Distributed systems store data objects redundantly to balance the data access load over multiple nodes. Load balancing performance depends mainly on 1) the level of storage redundancy and 2) the assignment of data objects to storage nodes. We analyze the performance implications of these design choices by considering four practical storage schemes that we refer to as the clustering, cyclic, block, and random designs. We formulate the load-balancing problem as that of maintaining the load on every node below a given threshold. Regarding the level of redundancy, we find that the desired load balance can be achieved in a system of $n$ nodes only if the replication factor $d = \Omega(\log(n)^{1/3})$, which is a necessary condition for any storage design. For the clustering and cyclic designs, $d = \Omega(\log(n))$ is necessary and sufficient. For the block and random designs, $d = \Omega(\log(n))$ is sufficient but not necessary. Whether $d = \Omega(\log(n)^{1/3})$ is sufficient remains open. The assignment of objects to nodes essentially determines which objects share the access capacity on each node. We refer to the number of nodes jointly shared by a set of objects as the \emph{overlap} between those objects. We find that many consistently small overlaps between objects (block, random) are better for load balancing than a few occasionally large overlaps (clustering, cyclic). However, when the demand is skewed beyond a certain level, the impact of overlaps is reversed. We derive our results by connecting the load-balancing problem to mathematical constructs that have been used to study other problems. For a class of storage designs containing the clustering and cyclic designs, we express load balance in terms of the maximum of moving sums of i.i.d. random variables, which is known as the scan statistic. For the random design, we express load balance using the occupancy metric for random allocation with complexes.
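To make the placement rules and the overlap quantity concrete, the following Python sketch gives one plausible reading of the clustering, cyclic, and random designs for $n$ objects replicated on $n$ nodes with replication factor $d$; the function names and the exact assignments are illustrative assumptions, not the paper's definitions, and the block design is omitted because it rests on a combinatorial construction not spelled out in the abstract.

```python
import random

def clustering_design(n, d):
    """Clustering (assumed reading): nodes are partitioned into n // d disjoint
    groups of size d, objects are partitioned the same way, and object i is
    replicated on every node of group i // d (assumes d divides n)."""
    return [set(range((i // d) * d, (i // d) * d + d)) for i in range(n)]

def cyclic_design(n, d):
    """Cyclic (assumed reading): object i is replicated on the d consecutive
    nodes i, i+1, ..., i+d-1, with indices taken modulo n."""
    return [{(i + j) % n for j in range(d)} for i in range(n)]

def random_design(n, d):
    """Random (assumed reading): object i is replicated on d nodes chosen
    uniformly at random without replacement, independently across objects."""
    return [set(random.sample(range(n), d)) for _ in range(n)]

def overlap(placement, objects):
    """Overlap of a set of objects: number of nodes that store all of them."""
    return len(set.intersection(*(placement[o] for o in objects)))

if __name__ == "__main__":
    n, d = 12, 3
    for name, design in [("clustering", clustering_design),
                         ("cyclic", cyclic_design),
                         ("random", random_design)]:
        placement = design(n, d)
        # Two adjacent objects share a whole node group under clustering
        # (overlap d), d - 1 nodes under the cyclic design, and typically
        # few or no nodes under the random design.
        print(name, overlap(placement, [0, 1]))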