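Without further ado, here is the setup that finally worked: OS => CentOS 7, CUDA => 10.0, cuDNN => 7.5, NCCL => built from source, TensorFlow => latest version, built from source.

First attempt: CUDA => 10.1, cuDNN => 7.5, NCCL => 2.4.2

1. CUDA: download the *.run package and run sh ./*.run, then follow the prompts. Choosing the default path /usr/local/cuda makes the later steps easier.

Configure the environment by appending the following to the end of /etc/profile: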

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
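A typo in these two lines is easy to miss, so here is a minimal sketch that checks, after re-logging in or running source /etc/profile, whether the expected directories really ended up in the variables. The helper name missing_dirs is made up for this example; the two directories are the ones used above.

```python
import os
import posixpath


def missing_dirs(path_value, required):
    """Return the entries of `required` that are absent from a colon-separated path string."""
    present = {posixpath.normpath(p) for p in path_value.split(":") if p}
    return [d for d in required if posixpath.normpath(d) not in present]


if __name__ == "__main__":
    # Directories taken from the /etc/profile lines above.
    print("PATH is missing:",
          missing_dirs(os.environ.get("PATH", ""), ["/usr/local/cuda/bin"]))
    print("LD_LIBRARY_PATH is missing:",
          missing_dirs(os.environ.get("LD_LIBRARY_PATH", ""), ["/usr/local/cuda/lib64"]))
```

An empty list for both variables means the exports took effect in the current shell.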

2. cuDNN: the archive extracts to a folder named cuda. Copy the header and library files into the matching CUDA directories:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

Make the copied files readable by all users:

sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
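To confirm that the header that was copied really is the cuDNN version you intended, the version macros can be read straight out of cudnn.h. This is a small sketch, assuming the cuDNN 7.x layout in which cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; the function name is made up for this example.

```python
import os
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None if a macro is missing."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % macro, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    header = "/usr/local/cuda/include/cudnn.h"  # destination used in the cp command above
    if os.path.exists(header):
        with open(header) as handle:
            print(parse_cudnn_version(handle.read()))
    else:
        print("cudnn.h not found at", header)
```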

Check that nvcc works:

nvcc --version
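Since the CUDA version turns out to matter a lot later on, it can help to extract the release number from the nvcc output instead of reading it by eye. A minimal sketch, assuming the usual "release X.Y" wording in the last line of nvcc --version; the function name is made up for this example.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Return (major, minor) from `nvcc --version` output, or None if no release is found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    if shutil.which("nvcc"):
        result = subprocess.run(["nvcc", "--version"],
                                stdout=subprocess.PIPE, universal_newlines=True)
        print(parse_nvcc_release(result.stdout))
    else:
        print("nvcc is not on PATH")
```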

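3. Install NCCL

The official site currently only offers *.rpm packages. I could not find the deb packages mentioned online, so I could not test them and installed with rpm: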

rpm -ivh nccl*.rpm

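However, this step only unpacks files into /var/nccl*, which contains three rpm files. Install each of them with rpm in turn.

4. Install Bazel

Building TensorFlow requires Google's Bazel. Tutorials online say to download bazel-0.24.1-dist.zip, unzip it, and compile: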

./compile.sh 

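This failed with an error saying cmake needed to be installed (see step 5 below).

After that the build failed again. I no longer remember the error, and searching turned up nothing, so I downloaded the installer version, bazel-0.24.1-installer-linux-x86_64.sh, and ran it directly. Success!

5. Install cmake

Download a cmake release newer than 3.4, extract it, then build and install: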

./configure
gmake
make install

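Set the environment variable (this path assumes cmake was configured with --prefix=/usr/local/cmake; with the default prefix the binaries go to /usr/local/bin):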

PATH=/usr/local/cmake/bin:$PATH
export PATH

6. Build TensorFlow

Run ./configure in the TensorFlow source tree and answer the prompts for paths and optional features:

Please specify the location of python. [Default is /usr/bin/python]:
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:10.1
Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:
Please specify the location where cuDNN library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.1]:
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]: 2.4.2
Please specify the location where NCCL library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]:
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N

Build command:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package 

Error:

Cuda Configuration Error: No library found under: /usr/local/cuda-10.1/lib64/libcublas.so.10.1, /usr/local/cuda-10.1/lib64/stubs/libcublas.so.10.1, /usr/local/cuda-10.1/lib/powerpc64le-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x86_64-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x64/libcublas.so.10.1, /usr/local/cuda-10.1/lib/libcublas.so.10.1, /usr/local/cuda-10.1/libcublas.so.10.1
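The error lists every location the build looked in, so a reasonable first step is to find out where the installer actually put libcublas. The sketch below walks a few directory trees and reports files whose names start with a given prefix. The function name and the choice of roots are assumptions for illustration only.

```python
import os


def find_files(roots, prefix):
    """Return sorted paths of files under any of `roots` whose name starts with `prefix`."""
    hits = []
    for root in roots:
        # os.walk silently yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            hits.extend(os.path.join(dirpath, name)
                        for name in filenames if name.startswith(prefix))
    return sorted(hits)


if __name__ == "__main__":
    # Roots chosen for illustration: the toolkit directory from the error message
    # and the system library directory on CentOS.
    for path in find_files(["/usr/local/cuda-10.1", "/usr/lib64"], "libcublas.so"):
        print(path)
```

If the library only shows up outside the directories named in the error, the toolkit layout does not match what this TensorFlow version expects.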

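After searching, I found that most people consider CUDA 10.1 not yet usable with TensorFlow, so I had to give up on it. Along the way I tried adding a symlink, as suggested in https://github.com/tensorflow/tensorflow/issues/26289: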

sudo ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105 /usr/lib64/libcublas.so.10.0

Building again produced a new error:

Cuda Configuration Error: None of the libraries match their SONAME: /home/bernard/opt/cuda_test/cuda/lib64/libcublas.so.10.1

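I decided to uninstall 10.1 and install 10.0 instead.

Second attempt: CUDA => 10.0, cuDNN => 7.5, NCCL => 2.4.2

1. Download the CUDA 10.0 installer; everything else stays the same.

2. Building TensorFlow now fails with a new error: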

fatal error: nccl.h: No such file or directory

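nccl.h cannot be found, which means the rpm installation above did not work.

Searching suggested installing libnccl2, libnccl-dev, and libnccl-static, but the tutorials online are all for Ubuntu and use apt-get. CentOS only has yum, and trying it there fails: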

No package "libnccl" available

3. Uninstall the rpm-installed NCCL, then build and install NCCL from source

Clone the nccl project from GitHub and build it:

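cd nccl
make -j src.build
# On CentOS, build rpm packages. The pkg.debian.build target and its
# build-essential/devscripts/debhelper prerequisites are for Debian/Ubuntu.
yum install rpm-build rpmdevtools
make pkg.redhat.build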
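Because the earlier failure was a missing nccl.h, it is worth checking that the header and the shared library exist before pointing TensorFlow at an NCCL location. This is a minimal sketch, assuming the source build writes its output to a build directory containing include/nccl.h and lib/libnccl.so; the function name and the example path are hypothetical.

```python
import os


def nccl_layout_problems(root):
    """Return the expected NCCL files (relative to `root`) that are missing."""
    expected = [os.path.join("include", "nccl.h"), os.path.join("lib", "libnccl.so")]
    return [rel for rel in expected if not os.path.exists(os.path.join(root, rel))]


if __name__ == "__main__":
    # Hypothetical location: the build directory inside the cloned nccl checkout.
    root = os.path.join(os.getcwd(), "build")
    missing = nccl_layout_problems(root)
    print("NCCL layout looks complete" if not missing else "missing: %s" % missing)
```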

4. Rebuild TensorFlow

Please specify the location of python. [Default is /usr/bin/python]:
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:
Please specify the location where CUDA 10.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:
Please specify the location where cuDNN library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]:
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]:
Please specify the location where NCCL library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]:
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N
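Answering the same prompts again on every retry gets tedious. As far as I know, TensorFlow 1.x's configure script reads its answers from environment variables when they are set, so the answers above can be written down once. The variable names below are an assumption based on my reading of configure.py; please verify them against the configure.py in your own checkout. The values mirror the answers shown above.

```python
import shlex

# Assumed environment variable names (check configure.py in your checkout);
# values follow the answers given in the interactive session above.
ANSWERS = {
    "PYTHON_BIN_PATH": "/usr/bin/python",
    "TF_NEED_CUDA": "1",
    "TF_CUDA_VERSION": "10.0",
    "CUDA_TOOLKIT_PATH": "/usr/local/cuda",
    "TF_CUDNN_VERSION": "7",
    "CUDNN_INSTALL_PATH": "/usr/local/cuda-10.0",
    "TF_NCCL_VERSION": "2",
    "TF_CUDA_COMPUTE_CAPABILITIES": "6.1",
    "TF_CUDA_CLANG": "0",
    "GCC_HOST_COMPILER_PATH": "/usr/bin/gcc",
    "TF_ENABLE_XLA": "0",
    "TF_NEED_TENSORRT": "0",
    "TF_NEED_MPI": "0",
    "CC_OPT_FLAGS": "-march=native",
    "TF_SET_ANDROID_WORKSPACE": "0",
}


def export_lines(answers):
    """Render the answers as shell `export` lines, quoting values where needed."""
    return ["export %s=%s" % (key, shlex.quote(value))
            for key, value in sorted(answers.items())]


if __name__ == "__main__":
    print("\n".join(export_lines(ANSWERS)))
```

The printed lines can be saved to a file and sourced before running ./configure. Any prompt that has no matching variable will still be asked interactively.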

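Compared with the first attempt, only the CUDA version and NCCL version answers changed (both left at their defaults); everything else is the same. The build finished after roughly an hour.

Build the whl package: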

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install it with pip:

pip install /tmp/tensorflow_pkg/*.whl

5. Test whether TensorFlow can use the GPU

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
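Creating a session with log_device_placement=True prints the device mapping to the log. If you would rather get a plain list of GPU devices, the following sketch may help. It relies on device_lib.list_local_devices() from tensorflow.python.client, whose entries carry name and device_type fields; the helper gpu_device_names is made up for this example.

```python
def gpu_device_names(devices):
    """Return the names of all devices whose device_type is "GPU"."""
    return [device.name for device in devices if device.device_type == "GPU"]


if __name__ == "__main__":
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not importable in this interpreter")
    else:
        names = gpu_device_names(device_lib.list_local_devices())
        print(names if names else "no GPU device visible to TensorFlow")
```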

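This raised a very strange error.

At first I thought the tensorboard dependency had not been built, but the source shows that nothing needs to be downloaded separately. In the end I checked the timestamps of the tensorboard files and found that an earlier installation had not been removed cleanly. After removing it with pip uninstall and installing again, everything worked.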
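The timestamp check that exposed the stale tensorboard can be scripted. The sketch below lists site-packages entries with a given name prefix, oldest first, so leftovers from an earlier install stand out. The function name is made up, and using sysconfig to locate site-packages is my own choice rather than something from the original troubleshooting.

```python
import os
import sysconfig
import time


def entries_by_mtime(site_dir, prefix):
    """Return names in `site_dir` starting with `prefix`, sorted oldest first."""
    rows = []
    for name in os.listdir(site_dir):
        if name.startswith(prefix):
            rows.append((os.path.getmtime(os.path.join(site_dir, name)), name))
    return [name for _mtime, name in sorted(rows)]


if __name__ == "__main__":
    paths = sysconfig.get_paths()
    for site_dir in sorted({paths["purelib"], paths["platlib"]}):
        if not os.path.isdir(site_dir):
            continue
        for name in entries_by_mtime(site_dir, "tensor"):
            stamp = time.ctime(os.path.getmtime(os.path.join(site_dir, name)))
            print(stamp, os.path.join(site_dir, name))
```

An entry much older than the freshly installed tensorflow package is a candidate for pip uninstall.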

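Summary

In fact, once CUDA and cuDNN are installed you can simply pip install tensorflow-gpu and skip the build entirely (so cmake and Bazel are not needed either). I assumed the latest version was not available that way, so I compiled it myself, and only later found that the prebuilt package is itself built against CUDA 10.0. Still, a build tailored to your own system is always nice to have, haha!

With the directly installed package, AVX2 is not used, so compiling it yourself is still the better option.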
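Whether a custom build can gain anything from AVX2 depends on the CPU supporting it in the first place. On Linux this can be read from the flags line of /proc/cpuinfo. A minimal sketch follows; the function name is made up for this example.

```python
import os


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from the first `flags` line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as handle:
            flags = cpu_flags(handle.read())
        for wanted in ("avx", "avx2", "fma"):
            print(wanted, "supported" if wanted in flags else "not reported")
    else:
        print("/proc/cpuinfo is not available on this system")
```

If avx2 is reported, building with the default -march=native optimization flag lets the compiler target it.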
