Horovod Install
Horovod documentation
安装
【Step1】安装Open MPI
注意: Open MPI 3.1.3 安装有些问题, 可以安装 Open MPI 3.1.2 或者 Open MPI 4.0.0.
【Step2】安装 TensorFlow
- pip install tensorflow 确保 g++-4.8.5 或者 g++-4.9
- 也可以用conda 安装
【Step3】安装 horovod
cpu
pip install horovod
GPUs with NCCL:
$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod
Docker 文档:
https://horovod.readthedocs.io/en/stable/docker.html
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
CPU-Dockerfile
FROM ubuntu:18.04
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV MXNET_VERSION=1.6.0
# Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python}
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]
RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
# Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION}
RUN pip install mxnet==${MXNET_VERSION}
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi
# Install Horovod
RUN HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod
# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn
WORKDIR "/examples"
GPU-Dockerfile
FROM nvidia/cuda:10.1-devel-ubuntu18.04
# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0
# Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python}
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]
RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
# Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow-gpu==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION}
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi
# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig
# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn
WORKDIR "/examples"
Horovod Install的更多相关文章
- 机器学习分布式框架horovod安装 (Linux环境)
1.openmi 下载安装 下载连接: https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz 安装命令 1 ...
- 安装 openmpi 4.0 用于 horovod 编译
最近编译 horovod框架过程中,需要使用openmpi 4.0但是环境中的openmpi版本比较低,所以在手动安装openmpi4.0 用于编译,下面对过程进行简要记录,进行备忘: curl -O ...
- Horovod 分布式深度学习框架相关
最近需要 Horovod 相关的知识,在这里记录一下,进行备忘: 分布式训练,分为数据并行和模型并行两种: 模型并行:分布式系统中的不同GPU负责网络模型的不同部分.神经网络模型的不同网络层被分配到不 ...
- [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator
[源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator 目录 [源码解析] 深度学习分布式训练框架 horovod (19) --- kub ...
- OEL上使用yum install oracle-validated 简化主机配置工作
环境:OEL 5.7 + Oracle 10.2.0.5 RAC 如果你正在用OEL(Oracle Enterprise Linux)系统部署Oracle,那么可以使用yum安装oracle-vali ...
- org.jboss.deployment.DeploymentException: Trying to install an already registered mbean: jboss.jca:service=LocalTxCM,name=egmasDS
17:34:37,235 INFO [Http11Protocol] Starting Coyote HTTP/1.1 on http-0.0.0.0-8080 17:34:37,281 INFO [ ...
- 如何使用yum 下载 一个 package ?如何使用 yum install package 但是保留 rpm 格式的 package ? 或者又 如何通过yum 中已经安装的package 导出它,即yum导出rpm?
注意 RHEL5 和 RHEL6 的不同 How to use yum to download a package without installing it Solution Verified - ...
- Install and Configure SharePoint 2013 Workflow
这篇文章主要briefly introduce the Install and configure SharePoint 2013 Workflow. Microsoft 推出了新的Workflow ...
- Basic Tutorials of Redis(1) - Install And Configure Redis
Nowaday, Redis became more and more popular , many projects use it in the cache module and the store ...
随机推荐
- nginx+php-fpm docker镜像合二为一
一.概述 在上一篇文章介绍了nginx+php-fpm,链接如下: https://www.cnblogs.com/xiao987334176/p/12918413.html nginx和php-fp ...
- 翻译:《实用的Python编程》03_00_Overview
目录 | 上一节 (2 处理数据) | 下一节 (4 类和对象) 3. 程序组织 到目前为止,我们已经学习了一些 Python 基础知识并编写了一些简短的脚本.但是,当开始编写更大的程序时,我们会想要 ...
- 中心化-ESB
服务调用者与服务提供者通过企业服务总线相连接: ESB成为瓶颈:无论在性能上还是成本消耗上,ESB都会导致瓶颈出现.
- virtualbox多个网卡添加(第5-8块儿)
virtualbox多个网卡添加(第5-8块儿) virtualbox默认只能启用4块网卡,如果4块网卡不够则需要通过命令添加.最多可以增加至8块 创建一个文件run.bat,添加如下内容到文件中,然 ...
- springboot的4种属性注入
1.Autowired注入 2.构造方法注入 3.@Bean方法形参注入 4.直接在@Bean方法上使用注解@ConfigurationProperties(prefix="jdbc&quo ...
- java 流程控制学习
https://www.kuangstudy.com/course 用户交互Scanner import java.util.Scanner; public class Demo01 { public ...
- C++图论算法——图的储存方式
使用二维数组邻接矩阵储存图 无向图: 图G 定义图G[101][101],G[i][j]的值表示从结点vi到vj是否有边或弧,若有,取值为1或权值,若无,则取值为0或∞.以下是图G用邻接矩阵表示的列表 ...
- Java 并发编程小册整理好了
Java 有并发,并发知识之大,一口吃不下 这曾是我不愿意触碰的知识角 多次一头扎进并发,无功而返 为应对面试,临时苦苦记忆,不成体系 这一次我决定从基础开始,攻克它 12,0000 字 68Mb 高 ...
- Java 重入锁和读写锁
本文部分摘自<Java 并发编程的艺术> 重入锁 重入锁 ReentrantLock,顾名思义,就是支持重进入的锁,它表示该锁能够支持一个线程对资源的重复加锁.除此之外,该锁还支持获取锁时 ...
- git配置,以及简单的命令
在 window 平台需要安装对应的客户端 git 配置全局用户名git config --global user.name "xxx"配置全局邮箱git config --glo ...