Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
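As a rough illustration of the kind of program the abstract has in mind, the sketch below combines a data-dependent loop, a data-dependent conditional, and automatic differentiation through both. It uses TensorFlow's public `tf.while_loop`, `tf.cond`, and `tf.GradientTape` APIs. The function name `clipped_power`, the threshold, and the input values are illustrative assumptions, not code from the paper, and distributed device placement is not shown.

```python
import tensorflow as tf


@tf.function  # trace into a dataflow graph so the loop and conditional become graph ops
def clipped_power(x, n):
    """Multiply 1 by x a total of n times, then apply a data-dependent clip.

    Hypothetical example function; not taken from the paper.
    """

    def keep_going(i, acc):
        # Loop predicate: depends on the runtime value of n.
        return i < n

    def step(i, acc):
        # Loop body: one multiplication per iteration.
        return i + 1, acc * x

    # parallel_iterations bounds how many iterations may be in flight at once.
    _, y = tf.while_loop(
        keep_going,
        step,
        loop_vars=(tf.constant(0), tf.ones_like(x)),
        parallel_iterations=10,
    )

    # Only the branch selected by the runtime value of y is executed.
    return tf.cond(y > 100.0, lambda: tf.constant(100.0), lambda: y)


x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = clipped_power(x, tf.constant(3))  # 2.0 ** 3 = 8.0, below the clip threshold

# Gradient flows back through the conditional and every loop iteration:
# d(x**3)/dx = 3 * x**2 = 12.0 at x = 2.0.
grad = tape.gradient(y, x)
```

Because the trip count `n` is passed as a tensor, the number of iterations is decided at run time rather than when the graph is built, which is the situation the abstract calls dynamic control flow.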