In scientific computing and data science, it is often necessary to share application workflows and repeat results. Current tools containerize an application workflow and share the resulting container so that others can repeat the results. Containerization improves sharing, but it does not improve the efficiency of replay. In this paper, we present the multiversion replay problem, which arises when multiple versions of an application are containerized and each version must be replayed to repeat results. To avoid executing each version separately, we develop CHEX, which checkpoints program state and determines when it is permissible to reuse program state across versions. It does so using system call-based execution lineage. This ability to identify common computations across versions lets us optimize replay with an in-memory cache built on a checkpoint-restore-switch system. We show that the multiversion replay problem is NP-hard and propose efficient heuristics for it. CHEX reduces overall replay time by sharing common computations while avoiding the storage of a large number of checkpoints. We demonstrate that CHEX maintains lightweight package sharing and reduces the total time of multiversion replay by 50% on average.
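To make the idea of sharing common computations concrete, the following is a minimal sketch and not CHEX's actual algorithm. It assumes each version can be written as an ordered list of step identifiers, and that two versions with the same prefix of identifiers reach the same program state (in CHEX that decision is made from execution lineage, which is not modeled here). The function names and the example step names are hypothetical, and the cache is treated as unbounded, which sidesteps the NP-hard cache-constrained case.

```python
from typing import Dict, List, Tuple

Version = List[str]  # hypothetical: one identifier per computation step


def build_prefix_tree(versions: List[Version]) -> Dict:
    """Merge all versions into a trie so each shared prefix appears once."""
    root: Dict = {}
    for steps in versions:
        node = root
        for step in steps:
            node = node.setdefault(step, {})
    return root


def naive_steps(versions: List[Version]) -> int:
    """Steps executed when every version is replayed separately."""
    return sum(len(steps) for steps in versions)


def replay_shared(versions: List[Version]) -> int:
    """Steps executed when shared prefixes are run once.

    The immutable `state` tuple stands in for a checkpoint: at a branch
    point, each child resumes from the same saved state instead of
    re-running the prefix. Assumes an unbounded checkpoint cache.
    """
    executed = 0

    def visit(node: Dict, state: Tuple[str, ...]) -> None:
        nonlocal executed
        for step, child in node.items():
            executed += 1                  # run this step once
            visit(child, state + (step,))  # children restore from this state

    visit(build_prefix_tree(versions), ())
    return executed


if __name__ == "__main__":
    versions = [
        ["load", "clean", "train_a"],
        ["load", "clean", "train_b"],
        ["load", "filter", "train_a"],
    ]
    print(naive_steps(versions), replay_shared(versions))
```

With a bounded cache, the replay order and the choice of which checkpoints to keep both affect the total cost, which is where the heuristics mentioned above come in.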