Distributed TensorFlow

Todo list:

Introduction to Distributed TensorFlow
Deploying and running Distributed TensorFlow
Comparing the results of multi-GPU training on three hosts with multi-GPU training on two hosts

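Distributed TensorFlow aims to load a model onto the GPUs of multiple hosts in order to speed up training.

Distributed TensorFlow can run larger models faster. It can run on a distributed cluster, and it can also run on ...

Distributed TensorFlow is an improvement on DistBelief. DistBelief has two kinds of processes: the Parameter Server (PS) and the worker replicas.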

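The PS is responsible for storing the model state (that is, the parameter values after each update) and updating it with the gradients that arrive afterwards. In effect, it connects the graphs of the individual workers.

The worker is responsible for computing the gradients of the weights.

TensorFlow borrows this design and makes the code friendlier to write: in DistBelief the worker and the PS are processes that run two different programs, whereas in TensorFlow the worker and the PS run exactly the same code.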
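The division of labour between the PS and the workers can be sketched in plain Python, without TensorFlow. This is only a toy, single-process illustration: the class ParameterServer, the function worker_gradient, the one-parameter model y = w * x and the learning rate are all assumptions made for the example, not part of DistBelief or TensorFlow.

```python
# Toy illustration of the PS / worker split (no TensorFlow, single process).
# Model: y = w * x with squared-error loss. All names are illustrative.


class ParameterServer:
    """Holds the model state and applies the gradients pushed by workers."""

    def __init__(self, w, learning_rate):
        self.w = w
        self.learning_rate = learning_rate

    def pull(self):
        # A worker reads the current parameter value.
        return self.w

    def push(self, grad):
        # Plain gradient-descent update of the stored parameter.
        self.w -= self.learning_rate * grad


def worker_gradient(w, batch):
    """The worker's job: mean gradient of (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


ps = ParameterServer(w=0.0, learning_rate=0.05)
batch = [(1.0, 3.0), (2.0, 6.0)]  # samples drawn from y = 3x

for _ in range(200):
    ps.push(worker_gradient(ps.pull(), batch))

print(ps.w)  # converges towards 3.0
```

The PS never computes a gradient and the worker never stores the parameter; that separation is the point of the sketch.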

Worker Replication

There are two ways to replicate workers: in-graph and between-graph.

In-graph:

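Different parts of the model's computation graph are placed on different machines for execution.

In-graph mode extends the computation from a single machine with multiple GPUs to multiple machines with multiple GPUs, but the data is still distributed from a single node. The advantage is simple configuration: each of the other multi-GPU compute nodes exposes a network interface and waits there for tasks. Such an interface is used just like a GPU device on the local machine: specify tf.device("/job:worker/task:n") and the ops are placed there. The PS is responsible for the join operation.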

Between-graph:

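This is data parallelism: every machine uses exactly the same computation graph. In between-graph mode the trained parameters are stored on the parameter server and the data does not need to be distributed. The data is stored in shards on the compute nodes, each node computes on its own shard, and when it is done it tells the parameter server which parameters to update; the parameter server then applies the update. The advantage of this mode is that no training data has to be distributed, which saves a great deal of time when the data reaches the TB scale, so between-graph mode is the recommended choice for deep learning on big data.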
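As a rough illustration of "each node computes on its own shard", the snippet below splits a dataset by task index. The helper shard_for_worker and the strided split are assumptions chosen for the example; in a real deployment the shards would typically already live on each node's disk.

```python
def shard_for_worker(samples, task_index, num_workers):
    """Return the part of `samples` owned by worker `task_index`.

    Uses a strided split, so worker i gets samples i, i + n, i + 2n, ...
    """
    if not 0 <= task_index < num_workers:
        raise ValueError("task_index must be in [0, num_workers)")
    return samples[task_index::num_workers]


samples = list(range(10))
shards = [shard_for_worker(samples, i, 2) for i in range(2)]

for i, shard in enumerate(shards):
    print("worker %d trains on %s" % (i, shard))
```

Every sample lands in exactly one shard, so the workers never process the same example twice within a pass over the data.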

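Both modes support synchronous and asynchronous updates.

With synchronous updates, every gradient update waits until all the dispatched data has been processed and the results have come back; the gradients are then summed and averaged, and only after that are the parameters updated. The advantage is that the loss decreases steadily. The drawback is just as obvious: the processing speed is set by the slowest shard.

With asynchronous updates, every compute node works on its own and updates the parameters with its own result. The advantages are speed and full use of the compute resources; the drawback is that the loss decreases unsteadily and jitters a lot.

Synchronous mode is recommended when the data volume is small and the nodes have comparable compute power; asynchronous mode is recommended when the data volume is large and the machines differ widely in performance.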
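The difference between the two update rules can be sketched with a toy one-parameter model. The functions sync_round and async_round are illustrative names, and the asynchronous case is simplified: all workers are assumed to read the same (stale) parameter at the start of a round and then apply their gradients one after another.

```python
def shard_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_round(w, shards, lr):
    # Wait for every worker, average the gradients, apply ONE update.
    grads = [shard_gradient(w, s) for s in shards]
    return w - lr * sum(grads) / len(grads)


def async_round(w, shards, lr):
    # Each worker computes its gradient from the value it read earlier
    # (stale_w) and applies it immediately, without waiting for the others.
    stale_w = w
    for s in shards:
        w = w - lr * shard_gradient(stale_w, s)
    return w


shards = [[(1.0, 3.0)], [(2.0, 6.0)]]  # two workers, samples of y = 3x
w_sync = sync_round(0.0, shards, lr=0.05)
w_async = async_round(0.0, shards, lr=0.05)
print(w_sync, w_async)
```

In this sketch one synchronous round produces a single averaged update, while one asynchronous round produces one update per worker, which is why asynchronous training moves faster but less smoothly.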

How do you deploy distributed TensorFlow?

Demo:

Environment:

3 x Ubuntu 16.04 servers, ip=[172.16.60.114, 172.16.60.107, 172.16.50.111]

CUDA 8.0, cuDNN 6

Tensorflow 1.10.0

Anaconda3 | Python 3.6

Test file

For the full code see GitHub: Leechen2014/tec4tensorflow

Walkthrough:

How to use the distributed API

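cluster = tf.train.ClusterSpec({'ps': ['ps_host:port'], 'worker': ['worker0_host:port', 'worker1_host:port']})  # each job maps to a list of addresses

server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)  # job_name is this process's own role: 'ps' or 'worker'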
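tf.train.ClusterSpec accepts a dict that maps each job name to a list of host:port addresses. A minimal sketch of building that dict from comma-separated strings follows. The helper build_cluster_dict is hypothetical, and the ps address 172.16.60.114:22221 is an assumption inferred from the logs (which only show localhost:22221 on host 114).

```python
def build_cluster_dict(ps_hosts, worker_hosts):
    """Turn comma-separated host:port strings into {job_name: [addresses]}."""

    def split(hosts):
        return [h.strip() for h in hosts.split(",") if h.strip()]

    return {"ps": split(ps_hosts), "worker": split(worker_hosts)}


cluster_dict = build_cluster_dict(
    "172.16.60.114:22221",  # assumed ps address
    "172.16.60.107:22221,172.16.50.111:22221",
)
print(cluster_dict)

# With TensorFlow 1.x installed, the dict would be passed on like this:
#   cluster = tf.train.ClusterSpec(cluster_dict)
```

The position of an address in its list is what TensorFlow uses as that task's index.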

The ps process only needs to call:

server.join()

Multi-GPU implementation:

with tf.device(tf.train.replica_device_setter(cluster=cluster)):  # alternatively, set worker_device = '/job:worker/task:%d/gpu:0' on each worker, which is a bit more tedious
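By default, tf.train.replica_device_setter places variables on the ps tasks in round-robin order and leaves the remaining ops on the worker device. The plain-Python sketch below only mimics that default assignment to show the idea; round_robin_placement and the variable names are made up for the example, and this is not the library's actual implementation.

```python
def round_robin_placement(variable_names, num_ps_tasks):
    """Assign each variable to a ps task in round-robin order."""
    if num_ps_tasks < 1:
        raise ValueError("need at least one ps task")
    return {
        name: "/job:ps/task:%d" % (i % num_ps_tasks)
        for i, name in enumerate(variable_names)
    }


placement = round_robin_placement(["w1", "b1", "w2", "b2"], num_ps_tasks=2)
for name, device in placement.items():
    print(name, "->", device)
```

With a single ps task, as in the demo, every variable ends up on /job:ps/task:0.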

How to run:

# Start the gRPC service on the ps host with the following command:

CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=ps --task_index=0

# On 107, run:

CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0

# On 111, run:

CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1

Notes:

Passwordless SSH login does not need to be set up.
Because the code assigns devices with

with tf.device(tf.train.replica_device_setter(cluster=XXX)):

the task_index numbering and the startup order should both match the order of the hosts listed in worker_hosts:

flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221','Comma-separated list of hostname:port pairs')
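One way to avoid a mismatch is to derive each worker's task_index from its position in worker_hosts instead of typing it by hand. The helper task_index_for below is hypothetical; the host list and the script name are the ones used in this demo.

```python
WORKER_HOSTS = "172.16.60.107:22221,172.16.50.111:22221"


def task_index_for(worker_hosts, address):
    """Return the position of `address` in the comma-separated host list.

    Raises ValueError if the address is not listed.
    """
    hosts = [h.strip() for h in worker_hosts.split(",")]
    return hosts.index(address)


for address in WORKER_HOSTS.split(","):
    index = task_index_for(WORKER_HOSTS, address)
    print("%s: python TestDistributed.py --job_name=worker --task_index=%d"
          % (address, index))
```

Each printed command is the one to run on the corresponding host.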

Results:

# 114 is the ps; it starts the gRPC service

2018-09-12 16:07:55.938936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1

2018-09-12 16:07:55.938944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y

2018-09-12 16:07:55.938949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N

2018-09-12 16:07:55.940175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 10403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0d:00.0, compute capability: 6.1)

2018-09-12 16:07:56.080591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:1 with 10403 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0e:00.0, compute capability: 6.1)

2018-09-12 16:07:56.742461: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:22221}

2018-09-12 16:07:56.742526: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 172.16.60.107:22221, 1 -> 172.16.50.111:22221}

2018-09-12 16:07:56.764061: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:375] Started server with target: grpc://localhost:22221

------------------------

# 107 is worker 0

1536739841.883745: Worker 0: traing step 7599 dome (global step:9986)

1536739841.897058: Worker 0: traing step 7600 dome (global step:9988)

1536739841.910197: Worker 0: traing step 7601 dome (global step:9990)

1536739841.923900: Worker 0: traing step 7602 dome (global step:9992)

1536739841.936971: Worker 0: traing step 7603 dome (global step:9994)

1536739841.950250: Worker 0: traing step 7604 dome (global step:9996)

1536739841.964122: Worker 0: traing step 7605 dome (global step:9998)

1536739841.978155: Worker 0: traing step 7606 dome (global step:10000)

Training ends @ 1536739841.978258

Training elapsed time:98.617033 s

After 10000 training step(s), validation cross entropy = 1141.94

----------------------------

# 111 is worker 1

1536739841.872289: Worker 1: traing step 2389 dome (global step:9985)

1536739841.885433: Worker 1: traing step 2390 dome (global step:9987)

1536739841.898431: Worker 1: traing step 2391 dome (global step:9989)

1536739841.911799: Worker 1: traing step 2392 dome (global step:9991)

1536739841.924894: Worker 1: traing step 2393 dome (global step:9993)

1536739841.938620: Worker 1: traing step 2394 dome (global step:9995)

1536739841.952448: Worker 1: traing step 2395 dome (global step:9997)

1536739841.966328: Worker 1: traing step 2396 dome (global step:9999)

1536739841.979593: Worker 1: traing step 2397 dome (global step:10001)

Training ends @ 1536739841.979693

Training elapsed time:41.149895 s

After 10000 training step(s), validation cross entropy = 1141.94

D0912 16:10:42.498070727   37760 dns_resolver.cc:280]        Start resolving.

The output above shows that 114 started the gRPC service but never shut it down. A solution to this problem has already been posted on Stack Overflow (Shut down server in TensorFlow). For details on gRPC, see [^using-grpc-in-python]: using-grpc-in-python

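Remarks:

The ps and a worker can coexist on the same host, much as the master and a slave can coexist in Hadoop. To avoid port conflicts, the ps and the worker on the same host should use different ports.
There can be more than one ps; list them the same way as the workers.
To stress it once more: because the code uses with tf.device(tf.train.replica_device_setter(cluster=XXX)):, starting the workers in an order different from the one written in flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221','Comma-separated list of hostname:port pairs') results in an OS Error.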
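The port rule from the first remark can be checked before launching anything. The sketch below scans a cluster dict for addresses that appear more than once; find_conflicts is a hypothetical helper, and the ps address 172.16.60.114:22221 is an assumption inferred from the logs.

```python
def find_conflicts(cluster):
    """Return (address, first_task, second_task) for every reused host:port."""
    seen = {}
    conflicts = []
    for job, addresses in cluster.items():
        for index, address in enumerate(addresses):
            task = (job, index)
            if address in seen:
                conflicts.append((address, seen[address], task))
            else:
                seen[address] = task
    return conflicts


good = {
    "ps": ["172.16.60.114:22221"],  # assumed ps address
    "worker": ["172.16.60.107:22221", "172.16.50.111:22221",
               "172.16.60.114:22222"],  # worker on 114 uses another port
}
bad = {
    "ps": ["172.16.60.114:22221"],
    "worker": ["172.16.60.114:22221"],  # same host AND same port as the ps
}
print(find_conflicts(good))
print(find_conflicts(bad))
```

The same port number on different hosts is fine; only an identical host:port pair is reported.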

To make the ps host run a worker process as well:

Change line 20: flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221', 'Comma-separated list of hostname:port pairs')

Add the IP and port of 114, so that it becomes: flags.DEFINE_string('worker_hosts', '172.16.60.107:22221,172.16.50.111:22221,172.16.60.114:22222', 'Comma-separated list of hostname:port pairs')

Then simply rerun, paying attention to the startup order.

Results:

##############114 ps##################################

h strength 1 edge matrix:

2018-09-12 16:38:41.432822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1

2018-09-12 16:38:41.432830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y

2018-09-12 16:38:41.432835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N

2018-09-12 16:38:41.433475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 10403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0d:00.0, compute capability: 6.1)

2018-09-12 16:38:41.949217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:1 with 10403 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0e:00.0, compute capability: 6.1)

2018-09-12 16:38:42.086615: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:22221}

2018-09-12 16:38:42.086674: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 172.16.60.107:22221, 1 -> 172.16.50.111:22221, 2 -> 172.16.60.114:22222}

2018-09-12 16:38:42.094741: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:375] Started server with target: grpc://localhost:22221

###############107 worker 0##########################

#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=0

1536741807.352432: Worker 0: traing step 3305 dome (global step:9997)

1536741807.388893: Worker 0: traing step 3306 dome (global step:10000)

Training ends @ 1536741807.388980

Training elapsed time:80.524482 s

After 10000 training step(s), validation cross entropy = 1127

####################111 worker 1###################################

#CUDA_VISIBLE_DEVICES='5,6' python TestDistributed.py --job_name=worker --task_index=1

1536741807.370341: Worker 1: traing step 3222 dome (global step:9998)

1536741807.398533: Worker 1: traing step 3223 dome (global step:10002)

Training ends @ 1536741807.398634

Training elapsed time:79.786702 s

After 10000 training step(s), validation cross entropy = 1127

#################114 worker2 #############

#CUDA_VISIBLE_DEVICES='0,1' python TestDistributed.py --job_name=worker --task_index=2

1536741807.346162: Worker 2: traing step 3474 dome (global step:9996)

1536741807.359073: Worker 2: traing step 3475 dome (global step:10000)

Training ends @ 1536741807.359174

Training elapsed time:79.858818 s

After 10000 training step(s), validation cross entropy = 1127

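Comparison of results

A preliminary comparison can be made from the logs:

With two workers the average elapsed time was about 69.88 s, with validation cross entropy = 1141.94; with three workers the average elapsed time was about 80.06 s, with validation cross entropy = 1127.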
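As a cross-check, the averages can be recomputed directly from the "Training elapsed time" lines in the logs above. The values below are copied from those lines; the helper mean is just for the example.

```python
# "Training elapsed time" values copied from the logs, in seconds.
two_workers = [98.617033, 41.149895]
three_workers = [80.524482, 79.786702, 79.858818]


def mean(values):
    return sum(values) / len(values)


print("two workers:   %.3f s" % mean(two_workers))
print("three workers: %.3f s" % mean(three_workers))
```

Note that in the two-worker run the two elapsed times differ widely (98.6 s against 41.1 s), so the average depends heavily on when each worker was started.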
