CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2018-12-05 22:18:24.565303: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376} 2018-12-05 22:18:24.565372: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331} 2018-12-05 22:18:24.569212: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc: 2018-12-05 22:18:26.170901: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376} 2018-12-05 22:18:26.170969: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331} 2018-12-05 22:18:26.174856: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3330 2018-12-05 22:18:27.177003: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376} 2018-12-05 22:18:27.177071: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331} 2018-12-05 22:18:27.180980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3331 2018-12-05 22:18:34.625459: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0 2018-12-05 22:18:34.625513: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1 2018-12-05 22:18:36.231936: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0 2018-12-05 22:18:36.231971: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1 2018-12-05 22:18:37.235899: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0 2018-12-05 22:18:37.235952: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
另外一种情况也会导致该问题的发生,从TensorFlow-1.4开始,分布式会自动使用环境变量中的代理去连接,如果运行的节点之间不需要代理互连,那么将代理的环境变量移除即可,在脚本的开始位置添加代码: 注意这段代码必须写在import tensorflow as tf或者import moxing.tensorflow as mox之前
1 2 3
import os os.enrivon.pop('http_proxy') os.enrivon.pop('https_proxy')
/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. from ._conv import register_converters as _register_converters Traceback (most recent call last): File "trainer.py", line 14, in <module> import sklearn.datasets File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/__init__.py", line 134, in <module> from .base import clone File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 11, in <module> from scipy import sparse File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/__init__.py", line 229, in <module> from .csr import * File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/csr.py", line 15, in <module> from ._sparsetools import csr_tocsc, csr_tobsr, csr_count_blocks, ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/_sparsetools.cpython-36m-x86_64-linux-gnu.so)
2018-12-07 15:40:05.167922: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3333} 2018-12-07 15:40:05.167970: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.100.36:3333, 1 -> 192.168.100.37:3333} 2018-12-07 15:40:05.171857: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3333 Parameter server: waiting for cluster connection... 2018-12-07 15:40:05.213496: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:0 returned error: Unavailable: OS Error 2018-12-07 15:40:05.213645: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error Traceback (most recent call last): File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call return fn(*args) File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _run_fn self._extend_graph() File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1352, in _extend_graph tf_session.ExtendSession(self._session) tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "trainer.py", line 364, in <module> tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "trainer.py", line 70, in main num_classes=num_classes) File "trainer.py", line 138, in parameter_server sess.run(tf.report_uninitialized_variables()) File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr) File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata) File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.UnavailableError: OS Error
如上所示,运行多机分布式 tensorflow 的 parameter server 进程时,出现这个错误。 这里说道:
This has been troubling me for a while. I found out that the problem is that GRPC uses the native “epoll” polling engine for communication. Changing this to a portable polling engine solved this issue for me. The way to do is to set the environment variable, “GRPC_POLL_STRATEGY=poll” before running the tensorflow programs. This solved this issue for me. For reference, see, https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.