文章目录

  • 参考文献
  • 本文记录笔者在Tensorflow使用上的一些错误的集锦,方便后来人迅速查阅解决问题。

    我是留白。

    我是留白。

    CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    2018-12-05 22:18:24.565303: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
    2018-12-05 22:18:24.565372: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
    2018-12-05 22:18:24.569212: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc:
    2018-12-05 22:18:26.170901: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
    2018-12-05 22:18:26.170969: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
    2018-12-05 22:18:26.174856: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3330
    2018-12-05 22:18:27.177003: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
    2018-12-05 22:18:27.177071: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
    2018-12-05 22:18:27.180980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3331
    2018-12-05 22:18:34.625459: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
    2018-12-05 22:18:34.625513: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
    2018-12-05 22:18:36.231936: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
    2018-12-05 22:18:36.231971: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
    2018-12-05 22:18:37.235899: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
    2018-12-05 22:18:37.235952: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

    首先保证job_name,task_index,ps_hosts,worker_hosts这四个参数都是正确的,考虑以下这种情况是不正确的:
    在一个IP为192.168.1.100的机器上启动ps或worker进程:

    1
    2
    3
    4
    --job_name=worker
    --task_index=1
    --ps_hosts=192.168.1.100:2222,192.168.1.101:2222
    --worker_hosts=192.168.1.100:2223,192.168.1.101:2223

    因为该进程启动位置是192.168.1.100,但是运行参数中指定的task_index为1,对应的IP地址是ps_hosts或worker_hosts的第二项(第一项的task_index为0),也就是192.168.1.101,和进程本身所在机器的IP不一致。

    另外一种情况也会导致该问题的发生,从TensorFlow-1.4开始,分布式会自动使用环境变量中的代理去连接,如果运行的节点之间不需要代理互连,那么将代理的环境变量移除即可,在脚本的开始位置添加代码:
    注意这段代码必须写在import tensorflow as tf或者import moxing.tensorflow as mox之前

    1
    2
    3
    import os
    os.enrivon.pop('http_proxy')
    os.enrivon.pop('https_proxy')

    — 摘自(https://bbs.huaweicloud.com/blogs/463145f7a1d111e89fc57ca23e93a89f)

    ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9’ not found

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    大专栏  Tensorflow 错误集锦class="line">11
    12
    13
    14
    /home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
    from ._conv import register_converters as _register_converters
    Traceback (most recent call last):
    File "trainer.py", line 14, in <module>
    import sklearn.datasets
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/__init__.py", line 134, in <module>
    from .base import clone
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 11, in <module>
    from scipy import sparse
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/__init__.py", line 229, in <module>
    from .csr import *
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/csr.py", line 15, in <module>
    from ._sparsetools import csr_tocsc, csr_tobsr, csr_count_blocks,
    ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/_sparsetools.cpython-36m-x86_64-linux-gnu.so)

    系统的库文件较老,不含CXXABI_1.3.9,可将<Anaconda_PATH>/lib加入LD_LIBRARY_PATH中,像这样:

    1
    export LD_LIBRARY_PATH=/home/../anaconda3/lib:$

    如此,系统会先找到anaconda里面的lib,从而满足要求。

    参考:Stackoverflow. 问题2

    分布式Tensorflow, ps端运行出现tensorflow.python.framework.errors_impl.UnavailableError: OS Error

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    2018-12-07 15:40:05.167922: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3333}
    2018-12-07 15:40:05.167970: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.100.36:3333, 1 -> 192.168.100.37:3333}
    2018-12-07 15:40:05.171857: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3333
    Parameter server: waiting for cluster connection...
    2018-12-07 15:40:05.213496: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:0 returned error: Unavailable: OS Error
    2018-12-07 15:40:05.213645: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error
    Traceback (most recent call last):
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _run_fn
    self._extend_graph()
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1352, in _extend_graph
    tf_session.ExtendSession(self._session)
    tensorflow.python.framework.errors_impl.UnavailableError: OS Error During handling of the above exception, another exception occurred: Traceback (most recent call last):
    File "trainer.py", line 364, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
    File "trainer.py", line 70, in main
    num_classes=num_classes)
    File "trainer.py", line 138, in parameter_server
    sess.run(tf.report_uninitialized_variables())
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
    File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.UnavailableError: OS Error

    如上所示,运行多机分布式 tensorflow 的 parameter server 进程时,出现这个错误。
    这里说道:

    This has been troubling me for a while. I found out that the problem
    is that GRPC uses the native “epoll” polling engine for communication.
    Changing this to a portable polling engine solved this issue for me.
    The way to do is to set the environment variable,
    “GRPC_POLL_STRATEGY=poll” before running the tensorflow programs. This
    solved this issue for me. For reference, see,
    https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.

    按照其所属,在环境变量中新增一条:

    1
    export GRPC_POLL_STRATEGY=poll

    成功解决问题。

    参考文献

    Tensorflow 错误集锦的更多相关文章

    1. SVN下错误集锦

      SVN下错误集锦 一SVN下的文件被locked不能update和commit 最近做项目的时候,遇到这个问题,SVN下的文件被locked不能update和commit.其提示如下: 解决办法:执行 ...

    2. (转)Hadoop之常见错误集锦

       Hadoop之常见错误集锦            下文中没有特殊说明,环境都是CentOS下Hadoop 2.2.0.1.伪分布模式下执行start-dfs.sh脚本启动HDFS时出现如下错误:   ...

    3. 在Hadoop 2.3上运行C++程序各种疑难杂症(Hadoop Pipes选择、错误集锦、Hadoop2.3编译等)

      首记 感觉Hadoop是一个坑,打着大数据最佳解决方案的旗帜到处坑害良民.记得以前看过一篇文章,说1TB以下的数据就不要用Hadoop了,体现不 出太大的优势,有时候反而会成为累赘.因此Hadoop的 ...

    4. drp错误集锦---“Cannot return from outside a function or method”

      好久都不动的项目,今天打开项目突然是红色感叹号.详细错误表现为: 也就是说,如今MyEclipse已经不识别在JSP页面中使用的return方法了(并且不止一处这种警告),那怎么办?????顿时闹钟一 ...

    5. django 2.0 xadmin 错误集锦

      转载 django 2.0 xadmin 错误集锦 2018-03-26 10:39:18 Snail0Li 阅读数 5188更多 分类专栏: python   1.django2.0把from dj ...

    6. Tensorflow ARM交叉编译错误集锦

      版权声明:本文为博主(Jimchen)原创文章,未经博主允许不得转载. ttps://www.cnblogs.com/jimchen1218/p/11611975.html 前言: Tensorflo ...

    7. Python:常见错误集锦(持续更新ing)

      初学Python,很容易与各种错误不断的遭遇.通过集锦,可以快速的找到错误的原因和解决方法. 1.IndentationError:expected an indented block 说明此处需要缩 ...

    8. Tensorflow异常集锦

      一.tensorflow checkpoint报错 在调用tf.train.Saver#save时,如果使用的路径是绝对路径,那么保存的checkpoint里面用的就是绝对路径:如果使用的是相对路径, ...

    9. TensorFlow错误ValueError: No gradients provided for any variable

      使用TensorFlow训练神经网络的时候,出现以下报错信息: Traceback (most recent call last):   File "gan.py", line 1 ...

    随机推荐

    1. java复制对象,复制对象属性,只可复制两个对象想同的属性名。也可自定义只复制需要的属性。

      注意:使用时copy()方法只会复制相同的属性.常用的copy()方法.以下为封装的工具和使用方式. 1.封装类 import java.util.Map; import java.util.Weak ...

    2. 2.node。框架express

      node.js就是内置的谷歌V8引擎,封装了一些对文件操作,http请求处理的方法 使你能够用js来写后端代码 用node.js开发脱离浏览器的js程序,主要用于工具活着服务器,比如文件处理. 用最流 ...

    3. 学习spring第一天

      Spring第一天笔记   1. 说在前面 怎样的架构的程序,我们认为是一个优秀的架构? 我们考虑的标准:可维护性好,可扩展性好,性能. 什么叫可扩展性好? 答:就是可以做到,不断的增加代码,但是可以 ...

    4. 18)PHP,可变函数,匿名函数 变量的作用域

      (1)可变函数: 可变函数,就是函数名“可变”——其实跟可变变量一样的道理. $str1 = “f1”;   //只是一个字符串,内容为”f1” $v1 = $str1(3, 4);   //形式上看 ...

    5. [原]调试实战——使用windbg调试崩溃在ComFriendlyWaitMtaThreadProc

      原调试debugwindbgcrash崩溃COM 前言 这是几年前在项目中遇到的一个崩溃问题,崩溃在了ComFriendlyWaitMtaThreadProc()里,没有源码.耗费了我很大精力,最终通 ...

    6. Codeforces1303F Number of Components

      Description link 题意:给一个全\(0\)矩阵,每次支持一个修改,修改不还原(这要是还原了不就成\(A\)题了) 然后询问每一次修改完了当前矩阵的连通块个数 每一个修改的值单调不降 修 ...

    7. Cell theory|Bulk RNA-seq|Cellar heterogeneity|Micromanipulation|Limiting dilution|LCM|FACS|MACS|Droplet|10X genomics|Human cell atlas|Spatially resolved transcriptomes|ST|Slide-seq|SeqFISH|MERFISH

      生物信息学 Cell theory:7个要点 All known living things are made up of one or more cells. All living cells ar ...

    8. redis数据库写入数据时提示redis.exceptions.ResponseError错误

      今天运行Django项目在redis数据库写入数据时提示如下错误: ERROR log 228 Internal Server Error: /image_code/cf9ccd75-d274-45c ...

    9. Sass入门指南

      转自:http://www.imooc.com/article/1413 css预处理器已经算不上一个新鲜的词了,当前比较有代表性的css预处理器有sass.less.stylus.关于三者选择问题一 ...

    10. PAT甲级——1011 World Cup Betting

      PATA1011 World Cup Betting With the 2010 FIFA World Cup running, football fans the world over were b ...