分布式TensorFlow入门指南

2017 年 11 月 28 日 机器学习研究会

Distributed TensorFlow allows us to share parts of a TensorFlow graph between multiple processes, possibly each on a different machine.

Why might we want to do this? The classic use case is to harness the power of multiple machines for training, with shared parameters between all machines. Even if we're just running on a single machine, though, I've come across two examples from reinforcement learning where sharing between processes has been necessary:

  • In A3C, multiple agents run in parallel in multiple processes, exploring different copies of the environment at the same time. Each agent generates parameter updates, which must also be sent to the other agents in different processes.

  • In Deep Reinforcement Learning from Human Preferences, one process runs an agent exploring the environment, with rewards calculated by a network trained from human preferences about agent behaviour. This network is trained asynchronously from the agent, in a separate process, so that the agent doesn't have to wait for each training cycle to complete before it can continue exploring.

Unfortunately, the official documentation on Distributed TensorFlow rather jumps in at the deep end. For a slightly more gentle introduction - let's run through some really basic examples with Jupyter.

If you'd like to follow along with this notebook interactively, sources can be found at GitHub.

(Note: some of the explanations here are my own interpretation of empirical results/TensorFlow documentation. If you see anything that's wrong, give me a shout!)

Introduction

import tensorflow as tf

Let's say we want multiple processes to have shared access to some common parameters. For simplicity, suppose this is just a single variable:

var = tf.Variable(initial_value=0.0)

As a first step, we can imagine that each process would need its own session. (Pretend session 1 is created in one process, and session 2 in another.)

sess1 = tf.Session()sess2 = tf.Session()sess1.run(tf.global_variables_initializer())sess2.run(tf.global_variables_initializer())

Each call to tf.Session() creates a separate execution engine, then connects the session handle to the execution engine. The execution engine is what actually stores variable values and runs operations.

Normally, execution engines in different processes are unlinked. Changing var in one session (on one execution engine) won't affect var in the other session.

print("Initial value of var in session 1:", sess1.run(var))print("Initial value of var in session 2:", sess2.run(var))sess1.run(var.assign_add(1.0))print("Incremented var in session 1")print("Value of var in session 1:", sess1.run(var))print("Value of var in session 2:", sess2.run(var))
Initial value of var in session 1: 0.0
Initial value of var in session 2: 0.0
Incremented var in session 1
Value of var in session 1: 1.0
Value of var in session 2: 0.0

Distributed TensorFlow

In order to share variables between processes, we need to link the different execution engines together. Enter Distributed TensorFlow.

With Distributed TensorFlow, each process runs a special execution engine: a TensorFlow server. Servers are linked together as part of a cluster. (Each server in the cluster is also known as a task.)

The first step is to define what the cluster looks like. We start off with the simplest possible cluster: two servers (two tasks), both on the same machine; one that will listen on port 2222, one on port 2223.

tasks = ["localhost:2222", "localhost:2223"]

Each task is associated with a job, which is a collection of related tasks. We associate both tasks with a job called "local".

jobs = {"local": tasks}

This completes the definition of the cluster.

cluster = tf.train.ClusterSpec(jobs)

We can now launch the servers, specifying which server in the cluster definition each server corresponds to. Each server starts immediately, listening on the port specified in the cluster definition.

# "This server corresponds to the the first task (task_index=0)# of the tasks associated with the 'local' job."server1 = tf.train.Server(cluster, job_name="local", task_index=0)server2 = tf.train.Server(cluster, job_name="local", task_index=1)

With the servers linked together in the same cluster, we can now experience the main magic of Distributed TensorFlow: any variable with the same name will be shared between all servers.

The simplest example is to run the same graph on all servers, each graph with just one variable, as before:

tf.reset_default_graph()var = tf.Variable(initial_value=0.0, name='var')sess1 = tf.Session(server1.target)sess2 = tf.Session(server2.target)

Modifications made to the variable on one server will now be mirrored on the second server.

sess1.run(tf.global_variables_initializer())sess2.run(tf.global_variables_initializer())print("Initial value of var in session 1:", sess1.run(var))print("Initial value of var in session 2:", sess2.run(var))sess1.run(var.assign_add(1.0))print("Incremented var in session 1")print("Value of var in session 1:", sess1.run(var))print("Value of var in session 2:", sess2.run(var))
Initial value of var in session 1: 0.0
Initial value of var in session 2: 0.0
Incremented var in session 1
Value of var in session 1: 1.0
Value of var in session 2: 1.0

(Note that because we only have one variable, and that variable is shared between both sessions, the second run of global_variables_initializer here is redundant.)

Placement

A question that might be in our minds at this point is: which server does the variable actually get stored on? And for operations, which server actually runs them?

Empirically, it seems that by default, variables and operations get placed on the first task in the cluster.

def run_with_location_trace(sess, op):
    # From https://stackoverflow.com/a/41525764/7832197
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(op, options=run_options, run_metadata=run_metadata)
    for device in run_metadata.step_stats.dev_stats:
      print(device.device)
      for node in device.node_stats:
        print("  ", node.node_name)

For example, if we do something to var using the session connected to the first task, everything happens on that task:

run_with_location_trace(sess1, var)
/job:local/replica:0/task:0/device:CPU:0
   _SOURCE
   var
run_with_location_trace(sess1, var.assign_add(1.0))
/job:local/replica:0/task:0/device:CPU:0
   _SOURCE
   AssignAdd_1/value
   var
   AssignAdd_1

But if we try and try do something to var using the session connected to the second task, the graph nodes still get run on the first task.

run_with_location_trace(sess2, var)
/job:local/replica:0/task:1/device:CPU:0
   _SOURCE
/job:local/replica:0/task:0/device:CPU:0
   _SOURCE
   var

To fix a variable or an operation to a specific task, we can use tf.device:

with tf.device("/job:local/task:0"):
    var1 = tf.Variable(0.0, name='var1')with tf.device("/job:local/task:1"):
    var2 = tf.Variable(0.0, name='var2')
    # (This will initialize both variables)sess1.run(tf.global_variables_initializer())

Now var1 runs on the first task, as before.

run_with_location_trace(sess1, var1)
/job:local/replica:0/task:0/device:CPU:0
   _SOURCE
   var1

But var2 runs on the second task. Even if we try to evaluate it using the session connected to the first task, it still runs on the second task.

run_with_location_trace(sess1, var2)
/job:local/replica:0/task:0/device:CPU:0
   _SOURCE
/job:local/replica:0/task:1/device:CPU:0
   _SOURCE
   var2

And vice-versa with var2.

run_with_location_trace(sess2, var2)
/job:local/replica:0/task:1/device:CPU:0
   _SOURCE
   var2
run_with_location_trace(sess2, var1)
/job:local/replica:0/task:1/device:CPU:0
   _SOURCE
/job:local/replica:0/task:0/device:CPU:0
   _SOURCE
   var1

Graphs

There are a couple of things to note about how graphs work with Distributed TensorFlow.



转自:Amid Fish


完整内容请点击“阅读原文”

登录查看更多
4

相关内容

【陈天奇】TVM:端到端自动深度学习编译器,244页ppt
专知会员服务
86+阅读 · 2020年5月11日
Sklearn 与 TensorFlow 机器学习实用指南,385页pdf
专知会员服务
129+阅读 · 2020年3月15日
专知会员服务
161+阅读 · 2020年1月16日
【干货】大数据入门指南:Hadoop、Hive、Spark、 Storm等
专知会员服务
95+阅读 · 2019年12月4日
开源书:PyTorch深度学习起步
专知会员服务
50+阅读 · 2019年10月11日
强化学习最新教程,17页pdf
专知会员服务
174+阅读 · 2019年10月11日
机器学习入门的经验与建议
专知会员服务
92+阅读 · 2019年10月10日
TensorFlow 2.0 学习资源汇总
专知会员服务
66+阅读 · 2019年10月9日
机器学习相关资源(框架、库、软件)大列表
专知会员服务
39+阅读 · 2019年10月9日
PyTorch  深度学习新手入门指南
机器学习算法与Python学习
9+阅读 · 2019年9月16日
移动端机器学习资源合集
专知
8+阅读 · 2019年4月21日
浅显易懂的分布式TensorFlow入门教程
专知
7+阅读 · 2018年6月22日
Python机器学习教程资料/代码
机器学习研究会
8+阅读 · 2018年2月22日
【推荐】MXNet深度情感分析实战
机器学习研究会
16+阅读 · 2017年10月4日
【推荐】GAN架构入门综述(资源汇总)
机器学习研究会
10+阅读 · 2017年9月3日
【推荐】Python机器学习生态圈(Scikit-Learn相关项目)
机器学习研究会
6+阅读 · 2017年8月23日
【推荐】TensorFlow手把手CNN实践指南
机器学习研究会
5+阅读 · 2017年8月17日
【推荐】图像分类必读开创性论文汇总
机器学习研究会
14+阅读 · 2017年8月15日
【推荐】(Keras)LSTM多元时序预测教程
机器学习研究会
24+阅读 · 2017年8月14日
A Survey on Bayesian Deep Learning
Arxiv
63+阅读 · 2020年7月2日
Arxiv
45+阅读 · 2019年12月20日
A Survey on Deep Transfer Learning
Arxiv
11+阅读 · 2018年8月6日
Arxiv
4+阅读 · 2018年4月30日
Arxiv
3+阅读 · 2017年11月20日
VIP会员
相关VIP内容
【陈天奇】TVM:端到端自动深度学习编译器,244页ppt
专知会员服务
86+阅读 · 2020年5月11日
Sklearn 与 TensorFlow 机器学习实用指南,385页pdf
专知会员服务
129+阅读 · 2020年3月15日
专知会员服务
161+阅读 · 2020年1月16日
【干货】大数据入门指南:Hadoop、Hive、Spark、 Storm等
专知会员服务
95+阅读 · 2019年12月4日
开源书:PyTorch深度学习起步
专知会员服务
50+阅读 · 2019年10月11日
强化学习最新教程,17页pdf
专知会员服务
174+阅读 · 2019年10月11日
机器学习入门的经验与建议
专知会员服务
92+阅读 · 2019年10月10日
TensorFlow 2.0 学习资源汇总
专知会员服务
66+阅读 · 2019年10月9日
机器学习相关资源(框架、库、软件)大列表
专知会员服务
39+阅读 · 2019年10月9日
相关资讯
PyTorch  深度学习新手入门指南
机器学习算法与Python学习
9+阅读 · 2019年9月16日
移动端机器学习资源合集
专知
8+阅读 · 2019年4月21日
浅显易懂的分布式TensorFlow入门教程
专知
7+阅读 · 2018年6月22日
Python机器学习教程资料/代码
机器学习研究会
8+阅读 · 2018年2月22日
【推荐】MXNet深度情感分析实战
机器学习研究会
16+阅读 · 2017年10月4日
【推荐】GAN架构入门综述(资源汇总)
机器学习研究会
10+阅读 · 2017年9月3日
【推荐】Python机器学习生态圈(Scikit-Learn相关项目)
机器学习研究会
6+阅读 · 2017年8月23日
【推荐】TensorFlow手把手CNN实践指南
机器学习研究会
5+阅读 · 2017年8月17日
【推荐】图像分类必读开创性论文汇总
机器学习研究会
14+阅读 · 2017年8月15日
【推荐】(Keras)LSTM多元时序预测教程
机器学习研究会
24+阅读 · 2017年8月14日
Top
微信扫码咨询专知VIP会员