Horovod documentation

安装

【Step1】安装Open MPI

注意: Open MPI 3.1.3 安装有些问题, 可以安装 Open MPI 3.1.2 或者 Open MPI 4.0.0.

【Step2】安装 TensorFlow

  • pip install tensorflow 确保 g++-4.8.5 或者 g++-4.9
  • 也可以用conda 安装

【Step3】安装 horovod

cpu

pip install horovod

GPUs with NCCL:

$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod

Docker 文档:

https://horovod.readthedocs.io/en/stable/docker.html

https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu

CPU-Dockerfile

FROM ubuntu:18.04

ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION}
RUN pip install mxnet==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod
RUN HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

GPU-Dockerfile

FROM nvidia/cuda:10.1-devel-ubuntu18.04

# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow-gpu==${TENSORFLOW_VERSION} \
keras \
h5py RUN pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

Horovod Install的更多相关文章

  1. 机器学习分布式框架horovod安装 (Linux环境)

    1.openmi 下载安装 下载连接: https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz 安装命令 1 ...

  2. 安装 openmpi 4.0 用于 horovod 编译

    最近编译 horovod框架过程中,需要使用openmpi 4.0但是环境中的openmpi版本比较低,所以在手动安装openmpi4.0 用于编译,下面对过程进行简要记录,进行备忘: curl -O ...

  3. Horovod 分布式深度学习框架相关

    最近需要 Horovod 相关的知识,在这里记录一下,进行备忘: 分布式训练,分为数据并行和模型并行两种: 模型并行:分布式系统中的不同GPU负责网络模型的不同部分.神经网络模型的不同网络层被分配到不 ...

  4. [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator

    [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator 目录 [源码解析] 深度学习分布式训练框架 horovod (19) --- kub ...

  5. OEL上使用yum install oracle-validated 简化主机配置工作

    环境:OEL 5.7 + Oracle 10.2.0.5 RAC 如果你正在用OEL(Oracle Enterprise Linux)系统部署Oracle,那么可以使用yum安装oracle-vali ...

  6. org.jboss.deployment.DeploymentException: Trying to install an already registered mbean: jboss.jca:service=LocalTxCM,name=egmasDS

    17:34:37,235 INFO [Http11Protocol] Starting Coyote HTTP/1.1 on http-0.0.0.0-8080 17:34:37,281 INFO [ ...

  7. 如何使用yum 下载 一个 package ?如何使用 yum install package 但是保留 rpm 格式的 package ? 或者又 如何通过yum 中已经安装的package 导出它,即yum导出rpm?

    注意 RHEL5 和 RHEL6 的不同 How to use yum to download a package without installing it Solution Verified - ...

  8. Install and Configure SharePoint 2013 Workflow

    这篇文章主要briefly introduce the Install and configure SharePoint 2013 Workflow. Microsoft 推出了新的Workflow ...

  9. Basic Tutorials of Redis(1) - Install And Configure Redis

    Nowaday, Redis became more and more popular , many projects use it in the cache module and the store ...

随机推荐

  1. 1071 Speech Patterns——PAT甲级真题

    1071 Speech Patterns People often have a preference among synonyms of the same word. For example, so ...

  2. CentOS7上安装伪分布式Hadoop

    1.下载安装包 下载hadoop安装包 官网地址:https://hadoop.apache.org/releases.html 版本:建议使用hadoop-2.7.3.tar.gz 系统环境:Cen ...

  3. go 动态数组 二维动态数组

    go使用动态数组还有点麻烦,比python麻烦一点,需要先定义. 动态数组申明 var dynaArr []string 动态数组添加成员 dynaArr = append(dynaArr, &quo ...

  4. 2020年12月-第02阶段-前端基础-CSS Day04

    1. 浮动(float) 记忆 能够说出 CSS 的布局的三种机制 理解 能够说出普通流在布局中的特点 能够说出我们为什么用浮动 能够说出我们为什么要清除浮动 应用 能够利用浮动完成导航栏案例 能够清 ...

  5. windows电脑上传ipa到appstore的详细流程

    在使用H5混合开发的app打包后,需要将ipa文件上传到appstore进行发布,就需要去苹果开发者中心进行发布. 但是在苹果开发者中心无法直接上传ipa文件,它要求我们使用xcode或transpo ...

  6. Python开发环境从零搭建-02-代码编辑器Sublime

    想要从零开始搭建一个Python的开发环境说容易也容易 说难也能难倒一片开发人员,在接下来的一系列视频中,会详细的讲解如何一步步搭建python的开发环境 本文章是搭建环境的第2篇 讲解的内容是:安装 ...

  7. android 调用js,js调用android

    Java调用JavaScript   1.main.xml 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <?xml v ...

  8. 利用flex解决input定位的问题

    用简单的布局搞定input框使用fixed下输入的问题 最近在做移动端H5聊天应用发现,当input框在最底部并且使用 position:fixed 属性的时候在苹果手机中会出现不兼容的情况 ​

  9. Ignatius and the Princess III HDU - 1028

    题目传送门:https://vjudge.net/problem/HDU-1028 思路:整数拆分构造母函数的模板题 1 //#include<bits/stdc++.h> 2 #incl ...

  10. P1739_表达式括号匹配(JAVA语言)

    思路:刚开始想用stack,遇到'('就push,遇到')'就pop,后来发现其实我们只需要用到栈里'('的个数,所以我们用一个变量统计'('的个数就好啦~ 题目描述 假设一个表达式有英文字母(小写) ...