NVM-based systems are natural candidates for periodic checkpointing (or snapshotting). Checkpointing increases the reliability of the system, makes it more resilient to power failures, and reduces wasted work, especially in an HPC setup. The traditional line of thinking is to design a system that is conceptually similar to transactional memory, where we log updates all the time and minimize the wasted work or, alternatively, the MTTR (mean time to recovery). Such ``instant recovery'' systems allow the system to recover from a point that is quite close to the point of failure. The penalty that we pay is a prohibitive number of additional writes to the NVM. In this paper, we propose a paradigmatically different approach: we argue that in most practical settings, such as regular HPC workloads or neural network training, there is no need for such instant recovery. This means that we can afford to lose some work, take periodic software-initiated checkpoints, and still meet the goals of the application. The key benefit of our scheme is that we reduce write amplification (WA) substantially; this extends the life of NVMs by roughly the same factor. We go a step further and design an adaptive system that can minimize the WA given a target checkpoint latency, and show that our control algorithm almost always performs near-optimally. Our scheme reduces the WA by 2.3-96\% as compared to the nearest competing work.