Horovod documentation

安装

【Step1】安装Open MPI

注意: Open MPI 3.1.3 安装有些问题, 可以安装 Open MPI 3.1.2 或者 Open MPI 4.0.0.

【Step2】安装 TensorFlow

  • pip install tensorflow 确保 g++-4.8.5 或者 g++-4.9
  • 也可以用conda 安装

【Step3】安装 horovod

cpu

pip install horovod

GPUs with NCCL:

$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod

Docker 文档:

https://horovod.readthedocs.io/en/stable/docker.html

https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu

CPU-Dockerfile

FROM ubuntu:18.04

ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION}
RUN pip install mxnet==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod
RUN HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

GPU-Dockerfile

FROM nvidia/cuda:10.1-devel-ubuntu18.04

# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow-gpu==${TENSORFLOW_VERSION} \
keras \
h5py RUN pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

Horovod Install的更多相关文章

  1. 机器学习分布式框架horovod安装 (Linux环境)

    1.openmi 下载安装 下载连接: https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz 安装命令 1 ...

  2. 安装 openmpi 4.0 用于 horovod 编译

    最近编译 horovod框架过程中,需要使用openmpi 4.0但是环境中的openmpi版本比较低,所以在手动安装openmpi4.0 用于编译,下面对过程进行简要记录,进行备忘: curl -O ...

  3. Horovod 分布式深度学习框架相关

    最近需要 Horovod 相关的知识,在这里记录一下,进行备忘: 分布式训练,分为数据并行和模型并行两种: 模型并行:分布式系统中的不同GPU负责网络模型的不同部分.神经网络模型的不同网络层被分配到不 ...

  4. [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator

    [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator 目录 [源码解析] 深度学习分布式训练框架 horovod (19) --- kub ...

  5. OEL上使用yum install oracle-validated 简化主机配置工作

    环境:OEL 5.7 + Oracle 10.2.0.5 RAC 如果你正在用OEL(Oracle Enterprise Linux)系统部署Oracle,那么可以使用yum安装oracle-vali ...

  6. org.jboss.deployment.DeploymentException: Trying to install an already registered mbean: jboss.jca:service=LocalTxCM,name=egmasDS

    17:34:37,235 INFO [Http11Protocol] Starting Coyote HTTP/1.1 on http-0.0.0.0-8080 17:34:37,281 INFO [ ...

  7. 如何使用yum 下载 一个 package ?如何使用 yum install package 但是保留 rpm 格式的 package ? 或者又 如何通过yum 中已经安装的package 导出它,即yum导出rpm?

    注意 RHEL5 和 RHEL6 的不同 How to use yum to download a package without installing it Solution Verified - ...

  8. Install and Configure SharePoint 2013 Workflow

    这篇文章主要briefly introduce the Install and configure SharePoint 2013 Workflow. Microsoft 推出了新的Workflow ...

  9. Basic Tutorials of Redis(1) - Install And Configure Redis

    Nowaday, Redis became more and more popular , many projects use it in the cache module and the store ...

随机推荐

  1. DOM的理解

    https://www.cnblogs.com/djtang/p/11538420.html  dom的理解 https://blog.csdn.net/jiuqiyuliang/article/de ...

  2. MySQL如何搭建主库从库(Docker)

    目录 MySQL主从搭建 一.主从配置原理 二.操作步骤 1.创建主库和从库容器 2.启动主从库容器 3.远程连接并操作主从库 4.测试主从同步 MySQL主从搭建 一.主从配置原理 mysql主从配 ...

  3. xscan的安装和使用(作业整理)

    1.将学习通上下载的xscan.rar进行解压. 2.将缺少的.dll文件粘贴到软件解压目录中. 3.点击打开软件. 3.1在运行中除了发现缺少.dll文件的问题,我电脑又出现类似问题, 采取了关闭防 ...

  4. javascript中的闭包closure详解

    目录 简介 函数中的函数 Closure闭包 使用闭包实现private方法 闭包的Scope Chain 闭包常见的问题 闭包性能的问题 总结 简介 闭包closure是javascript中一个非 ...

  5. 13. Vue CLI脚手架

    一. Vue CLI 介绍 1. 什么是Vue CLI? Vue CLI 是一个基于 Vue.js 进行快速开发的完整系统.Vue CLI 致力于将 Vue 生态中的工具基础标准化.它确保了各种构建工 ...

  6. PTA1071 - Speech Patterns - map计算不同单词个数

    题意 输出给定字符串出现最多的字符串(小写输出)和出现次数. 所求字符串要求:字符中可以含有A-Z.0-9. 比如说题目给出的Can1,我们可以转换成can1,can1就算一个字符串整体,而不是单独的 ...

  7. 【MaixPy3文档】写好 Python 代码!

    本文是给有一点 Python 基础但还想进一步深入的同学,有经验的开发者建议跳过. 前言 上文讲述了如何认识开源项目和一些编程方法的介绍,这节主要来说说 Python 代码怎么写的一些演化过程和可以如 ...

  8. 死磕Spring之IoC篇 - Spring 应用上下文 ApplicationContext

    该系列文章是本人在学习 Spring 的过程中总结下来的,里面涉及到相关源码,可能对读者不太友好,请结合我的源码注释 Spring 源码分析 GitHub 地址 进行阅读 Spring 版本:5.1. ...

  9. WPF 基础 - 图片之界面截图

    1. 功能 系统截图. 2. 实现 2.1 思路 控件继承自 System.Windows.Media.Visual, 通过 System.Windows.Media.Imaging.RenderVi ...

  10. WPF 基础 - Window 启动动画

    <Window ... WindowStyle="None" AllowsTransparency="True" RenderTransformOrigi ...