Modern stateful web services and distributed SDN controllers rely on log replication to omit data loss in case of fail-stop failures. In single-leader execution, the leader replica is responsible for ordering log updates and the initiation of distributed commits, in order to guarantee log consistency. Network congestions, resource-heavy computation, and imbalanced resource allocations may, however, result in inappropriate leader election and increased cluster response times. We present SEER, a logically centralized approach to performance prediction and efficient leader election in leader-based consensus systems. SEER autonomously identifies the replica that minimizes the average cluster response time, using prediction models trained dynamically at runtime. To balance the exploration and exploitation, SEER explores replicas' performance and updates their prediction models only after detecting significant system changes. We evaluate SEER in a traffic management scenario comprising [3..7] Raft replicas, and well-known data-center and WAN topologies. Compared to the Raft's uniform leader election, SEER decreases the mean control plane response time by up to ~32%. The benefit comes at the expense of the minimal adaptation of Raft election procedure and a slight increase in leader reconfiguration frequency, the latter being tunable with a guaranteed upper bound. No safety properties of Raft are invalidated by SEER.
翻译:最新的网络服务和分布式 SDN 控制器依靠日志复制来省略数据损失, 以防失败失败。 在单一领导执行中, 领导复制负责命令日志更新和启动分布式承诺, 以保证日志的一致性。 然而, 网络拥挤、 资源过剩的计算和资源分配不平衡, 可能导致不适当的领导人选举和增加集束响应时间。 我们介绍SEER, 这是在基于领导共识的系统中进行业绩预测和有效领导选举的逻辑集中化方法。 SER 自主地确定将平均集束响应时间减少到最低程度的复制件, 使用动态的预测模型。 为了平衡勘探和开发, SEER 探索复制品的性能并更新其预测模型, 只有在发现重大系统变化之后才能进行。 我们用交通管理情景来评价SEER, 包括 [3.7.7] Raft 复制品, 以及众所周知的数据中心和广效的广域网。 与Raft 统一领导选举相比, SER 将中平均控制平面响应时间降低到~ 32 %。 好处在于以最低程度的频率调整, 和最后的SEEEV 安全性变换为轻微增加。