Randomized election timeouts are a simple and effective liveness heuristic for Raft, but they become brittle under long-tail latency, jitter, and partition recovery, where repeated split votes can prolong unavailability. This paper presents BALLAST, a lightweight online adaptation mechanism that replaces static timeout heuristics with contextual bandits. BALLAST selects from a discrete set of timeout "arms" using efficient linear contextual bandits (LinUCB variants), and augments learning with safe exploration to cap risk during unstable periods. We evaluate BALLAST in a reproducible discrete-event simulation with long-tail delay, loss, correlated bursts, node heterogeneity, and partition/recovery turbulence. Across challenging WAN regimes, BALLAST substantially reduces recovery time and unwritable time compared to standard randomized timeouts and common heuristics, while remaining competitive in stable LAN/WAN settings.
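To make the arm-selection idea concrete, the following is a minimal sketch of disjoint LinUCB over a discrete set of election timeouts. It is an illustration under stated assumptions, not BALLAST's implementation: the class names, the context features, the reward signal, and the safe-exploration rule (falling back to the longest timeout whenever the caller flags the cluster as unstable) are all hypothetical choices made here for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


class LinUCBArm:
    """Disjoint LinUCB state for a single timeout arm."""

    def __init__(self, dim, ridge=1.0):
        # Keep the inverse of A = ridge * I + sum(x x^T) directly,
        # so no matrix inversion is needed at selection time.
        self.a_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # reward-weighted sum of contexts

    def ucb(self, x, alpha):
        a_inv_x = matvec(self.a_inv, x)
        theta = matvec(self.a_inv, self.b)  # ridge-regression estimate
        width = math.sqrt(max(dot(x, a_inv_x), 0.0))
        return dot(theta, x) + alpha * width

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^-1 for A += x x^T.
        a_inv_x = matvec(self.a_inv, x)
        denom = 1.0 + dot(x, a_inv_x)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]


class TimeoutSelector:
    """Chooses an election timeout (one arm per candidate value)."""

    def __init__(self, timeouts_ms, dim, alpha=1.0):
        self.timeouts_ms = list(timeouts_ms)
        self.arms = [LinUCBArm(dim) for _ in self.timeouts_ms]
        self.alpha = alpha
        # Assumption: the longest timeout is treated as the conservative arm.
        self.safe_index = max(range(len(self.timeouts_ms)),
                              key=self.timeouts_ms.__getitem__)

    def select(self, x, unstable=False):
        # Hypothetical safety rule: skip exploration while flagged unstable.
        if unstable:
            return self.safe_index
        scores = [arm.ucb(x, self.alpha) for arm in self.arms]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm_index, x, reward):
        self.arms[arm_index].update(x, reward)


# Example: context = [bias term, normalized recent RTT] (hypothetical features);
# the reward could be, for instance, the negative of the observed recovery time.
selector = TimeoutSelector([150, 300, 600], dim=2)
context = [1.0, 0.5]
arm = selector.select(context)
selector.update(arm, context, reward=-0.2)
```

Keeping one small linear model per arm means each decision costs a handful of matrix-vector products in the feature dimension, which is what makes this family of bandits cheap enough to run inside a consensus node's timer path.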