1. Launching parallel training on a single machine with multiple GPUs

PaddlePaddle 2.0 adds the paddle.distributed.spawn function for launching single-machine multi-GPU training, while the original paddle.distributed.launch approach remains available.

  • paddle.distributed.launch takes the program file to run and starts multiple processes at the file level to implement synchronized multi-GPU training. The AIStudio script-task documentation used to recommend this method for launching multi-GPU jobs. The launch approach places higher demands on process management.
  • paddle.distributed.spawn starts multiple processes at the function level to implement multi-GPU synchronization. It gives better control over the processes and is friendlier for log printing and training shutdown. This is the currently recommended approach.

The two methods are introduced below.

1.1 Single-machine multi-GPU launch method 1: launch

1.1.1 Using the high-level API

  • When training with the paddle.Model high-level API, launching single-machine multi-GPU training is very simple: the code needs no changes at all; just add the argument -m paddle.distributed.launch when starting the program.

    # Single-machine single-GPU launch, uses GPU 0 by default
    $ python train.py
    # Single-machine multi-GPU launch, uses all currently visible GPUs by default
    $ python -m paddle.distributed.launch train.py
    # Single-machine multi-GPU launch on GPUs 0 and 1, selected via --selected_gpus
    $ python -m paddle.distributed.launch --selected_gpus='0,1' train.py
    # Single-machine multi-GPU launch on GPUs 0 and 1, selected via environment variable
    $ export CUDA_VISIBLE_DEVICES='0,1'
    $ python -m paddle.distributed.launch train.py
  • Below is a high-level API example. Running the cell directly writes hapitrain.py to the root directory, after which you can start training with python.

%%writefile hapitrain.py
import paddle
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# LeNet inherits from paddle.nn.Layer and is the network; paddle.Model wraps it with training functionality
lenet = paddle.vision.models.LeNet()
model = paddle.Model(lenet)

# Configure the optimizer, loss, and metric needed for training
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2))
)

# Start training
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)

# Start evaluation
model.evaluate(test_dataset, log_freq=100, batch_size=64)

Single-machine single-GPU launch, using GPU 0 by default

# Single-machine single-GPU launch, using GPU 0 by default
!python hapitrain.py
Begin to download

Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz
Begin to download
........
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz
Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz
Begin to download
..
Download finished
W0628 15:25:11.488023 114 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:25:11.614305 114 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0555 - acc_top1: 0.9217 - acc_top2: 0.9649 - 50ms/step
step 800/938 - loss: 0.0300 - acc_top1: 0.9454 - acc_top2: 0.9782 - 39ms/step
step 938/938 - loss: 0.0213 - acc_top1: 0.9498 - acc_top2: 0.9803 - 38ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0057 - acc_top1: 0.9731 - acc_top2: 0.9927 - 28ms/step
step 157/157 - loss: 0.0013 - acc_top1: 0.9785 - acc_top2: 0.9945 - 28ms/step
Eval samples: 10000

Single-machine multi-GPU launch, using all currently visible GPUs by default

# Single-machine multi-GPU launch, using all currently visible GPUs by default
!python -m paddle.distributed.launch hapitrain.py
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
----------- Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:26:17,473 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:26:17,475 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:35079 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:35079 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-06-28 15:26:17,475 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
W0628 15:26:24.305920 285 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:26:24.311555 285 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0586 - acc_top1: 0.9130 - acc_top2: 0.9611 - 38ms/step
step 800/938 - loss: 0.0288 - acc_top1: 0.9397 - acc_top2: 0.9759 - 39ms/step
step 938/938 - loss: 0.0545 - acc_top1: 0.9448 - acc_top2: 0.9785 - 40ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0035 - acc_top1: 0.9677 - acc_top2: 0.9911 - 36ms/step
step 157/157 - loss: 0.0057 - acc_top1: 0.9723 - acc_top2: 0.9929 - 36ms/step
Eval samples: 10000
INFO 2021-06-28 15:27:26,569 launch.py:240] Local processes completed.

Single-machine multi-GPU launch on GPUs 0 and 1; AIStudio runs this even with only a single GPU, which shows how tolerant launch is

# Single-machine multi-GPU launch on GPUs 0 and 1; AIStudio runs this even with only a single GPU, which shows how tolerant launch is
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch hapitrain.py
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:28:10,632 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:28:10,637 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:46909 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:46909 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-06-28 15:28:10,637 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
W0628 15:28:19.819196 448 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:28:19.905493 448 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0376 - acc_top1: 0.9136 - acc_top2: 0.9610 - 37ms/step
step 800/938 - loss: 0.0159 - acc_top1: 0.9423 - acc_top2: 0.9764 - 35ms/step
step 938/938 - loss: 0.0444 - acc_top1: 0.9479 - acc_top2: 0.9791 - 35ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0039 - acc_top1: 0.9767 - acc_top2: 0.9939 - 36ms/step
step 157/157 - loss: 0.0029 - acc_top1: 0.9815 - acc_top2: 0.9952 - 35ms/step
Eval samples: 10000
INFO 2021-06-28 15:29:19,766 launch.py:240] Local processes completed.

1.1.2 Using the basic API

  • If you launch single-machine multi-GPU training from a program written with the basic API, the single-GPU code needs 3 modifications. Compare the unmodified and modified versions below:

The three modifications:

  • Change 1: import the distributed package

import paddle.distributed as dist

  • Change 2: initialize the parallel environment

dist.init_parallel_env()

  • Change 3: wrap the model with paddle.DataParallel

net = paddle.DataParallel(paddle.vision.models.LeNet())

import paddle  # unmodified version
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=lenet.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data, y_data = data
            predicts = lenet(x_data)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            acc = paddle.metric.accuracy(predicts, y_data, k=1)
            avg_acc = paddle.mean(acc)
            loss.backward()
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))
            adam.step()
            adam.clear_grad()

# Start training
train()
epoch: 0, batch_id: 0, loss is: [2.7922328], acc is: [0.15625]
epoch: 0, batch_id: 400, loss is: [0.10373791], acc is: [0.96875]
epoch: 0, batch_id: 800, loss is: [0.01435608], acc is: [1.]

This is the basic-API version with the 3 modifications.
As before, the %%writefile normaltrain.py command saves the file to the root directory.

%%writefile normaltrain.py
import paddle  # version with the 3 modifications
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # Change 1: import the distributed package

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()
    # Change 3: wrap the model with paddle.DataParallel
    # (the official manual does not give the full module path of LeNet here)
    net = paddle.DataParallel(paddle.vision.models.LeNet())
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly writes avg_loss here
            adam.step()
            adam.clear_grad()

# Start training
train()

# Single-machine single-GPU launch, default GPU 0; running the modified code on a single GPU this way raises an error
# !python normaltrain.py

# Single-machine multi-GPU launch, using all currently visible GPUs by default
!python -m paddle.distributed.launch normaltrain.py

# Single-machine multi-GPU launch on GPUs 0 and 1; it automatically uses all available GPUs, so even a single GPU will not raise an error
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch normaltrain.py

1.2 Single-machine multi-GPU launch method 2: spawn (recommended!)

Just as you put an item in a box to ship it, you only need to hand the train function to paddle.distributed.spawn. The commands are:

import paddle.distributed as dist

# Launch multi-process training of train, using all visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train)

# Launch train with 2 processes, using the first 2 visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train, nprocs=2)

# Launch train with 2 processes, using GPUs 4 and 5
if __name__ == '__main__':
    dist.spawn(train, nprocs=2, selected_gpus='4,5')
  • Basic API scenario (whether or not the code is modified as for launch): an error is raised in the AIStudio notebook, but it runs normally in a real multi-GPU environment and from the AIStudio command line.
  • High-level API scenario: an error is raised in the AIStudio notebook; it runs normally from the AIStudio command line.
%%writefile normal3spawn.py
import paddle  # version with the 3 modifications
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # Change 1: import the distributed package

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()
    # Change 3: wrap the model with paddle.DataParallel
    # (the official manual does not give the full module path of LeNet here)
    net = paddle.DataParallel(paddle.vision.models.LeNet())
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly writes avg_loss here
            adam.step()
            adam.clear_grad()

# Launch multi-process training of train, using all visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train)
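The script saved above is then started with plain python, with no launch module involved; as noted in the bullets above, prefer the AIStudio terminal (or a cell prefixed with !) if the notebook's process management gets in the way:

    # spawn creates the worker processes itself, so a plain python start is enough
    python normal3spawn.py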

1.3 Brief summary of single-machine multi-GPU training

The errors seen with spawn inside the notebook are presumably caused by the notebook's process-management restrictions. From the command line, or when running the cell with a leading exclamation mark, there is no problem.

spawn does not require changing the internals of the code; you only add the dist.spawn(train) call, which effectively puts a multi-process shell around the training code. It is simple and convenient, and is the recommended way to set up single-machine multi-GPU training!

Only when spawn is not supported should you consider using launch to start single-machine multi-GPU training.
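For reference, here is a minimal sketch of forwarding arguments to the training function through spawn. It assumes the args and nprocs parameters of paddle.distributed.spawn in this Paddle version, so double-check against your installed API:

import paddle
import paddle.distributed as dist

def train(epochs=1, lr=0.001):
    # Same three changes as in normal3spawn.py: initialize the parallel env and wrap the network
    dist.init_parallel_env()
    net = paddle.DataParallel(paddle.vision.models.LeNet())
    adam = paddle.optimizer.Adam(learning_rate=lr, parameters=net.parameters())
    # ... training loop as in normal3spawn.py ...

if __name__ == '__main__':
    # args is forwarded positionally to train(); nprocs sets the number of worker processes (assumed API, see note above)
    dist.spawn(train, args=(1, 0.001), nprocs=2)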

PaddlePaddle's complete set of parallelism modes:

  • Data parallelism: for the data-parallel mode most widely used in industry, PaddlePaddle has refined several key techniques around real business needs. It provides both a collective-communication architecture and a parameter-server architecture, supports the synchronous and asynchronous training mechanisms common in industrial practice, and offers distributed optimization algorithms with guaranteed convergence.
  • Pipeline parallelism: targeting heterogeneous hardware, pipeline parallelism splits the model computation across different devices and fully pipelines it, greatly improving the overall utilization of heterogeneous hardware.
  • Model parallelism: for extremely large-scale classification problems, PaddlePaddle provides model parallelism that parallelizes computation and storage at the same time, solving problems a single GPU cannot handle.

1.4 Launching distributed jobs with fleetrun

1.4.1 Launching a distributed job with fleetrun

Paddle provides the command-line launcher fleetrun. Combined with Paddle's high-level distributed API paddle.distributed.fleet, it makes it easy to start distributed jobs in either collective-communication mode or parameter-server mode. fleetrun can be used in both static-graph and dynamic-graph scenarios.

Note: currently, paddle.distributed.fleet only supports collective communication mode for launching dynamic-graph distributed training; parameter-server mode is not supported.

  • GPU training on a single machine with multiple GPUs

To launch a single-machine 4-GPU job, simply specify 4 idle GPUs via --gpus:

    fleetrun --gpus=0,1,2,3 train.py

Note: if export CUDA_VISIBLE_DEVICES=0,1,2,3 has been set, you can simply run:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun train.py
  • GPU training on multiple machines with multiple GPUs

[Example 1] 2 machines, 8 GPUs (4 GPUs per node)

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1,2,3 train.py

Note: if export CUDA_VISIBLE_DEVICES=0,1,2,3 has been set on every machine, you can start directly on each node with:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py

[Example 2] 2 machines, 16 GPUs (8 GPUs per node, assuming 8 GPUs are available on each machine)

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py

1.4.2 Single-machine multi-GPU training with Fleet

Using the Fleet API for dynamic-graph distributed training is really quite simple; a basic-API program only needs 3 modifications:

  • Import the paddle.distributed.fleet package

      from paddle.distributed import fleet
  • Initialize the fleet environment

      fleet.init(is_collective=True)
  • Obtain the distributed optimizer and distributed model through fleet

      strategy = fleet.DistributedStrategy()
      adam = fleet.distributed_optimizer(adam, strategy=strategy)
      dp_layer = fleet.distributed_model(layer)

The example provided in the Fleet manual

%%writefile train_fleet.py
# -*- coding: UTF-8 -*-
import paddle
import paddle.nn as nn
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

# Define a fully connected network; it must inherit from nn.Layer
class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)

    def forward(self, x):
        return self._linear2(self._linear1(x))

# 1. Enable dynamic-graph mode
paddle.disable_static()

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

# 2. Define the network object, loss function, and optimizer
layer = LinearNet()
loss_fn = nn.MSELoss()
adam = paddle.optimizer.Adam(
    learning_rate=0.001, parameters=layer.parameters())

# Distributed step 3: obtain the distributed optimizer and distributed model through fleet
strategy = fleet.DistributedStrategy()
adam = fleet.distributed_optimizer(adam, strategy=strategy)
dp_layer = fleet.distributed_model(layer)

for step in range(20):
    # 3. Run the forward pass
    inputs = paddle.randn([10, 10], 'float32')
    outputs = dp_layer(inputs)
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(outputs, labels)
    print("step:{}\tloss:{}".format(step, loss.numpy()))
    # 4. Run backpropagation and update parameters
    loss.backward()
    adam.step()
    adam.clear_grad()
!fleetrun --gpus=0 train_fleet.py
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
----------- Configuration Arguments -----------
gpus: 0
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: train_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:56:16,986 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:56:16,990 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:47263 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:47263 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-06-28 15:56:16,991 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
W0628 15:56:18.760403 1539 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:56:18.826562 1539 device_context.cc:372] device: 0, cuDNN Version: 7.6.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py:633: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
"It is recommended to use DistributedStrategy "
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py:423: UserWarning: The program will return to single-card operation. Please check 1, whether you use spawn or fleetrun to start the program. 2, Whether it is a multi-card program. 3, Is the current environment multi-card.
warnings.warn("The program will return to single-card operation. "
step:0 loss:[2.747072]
step:1 loss:[3.9464068]
step:2 loss:[3.3363562]
step:3 loss:[1.7597802]
step:4 loss:[2.4984336]
step:5 loss:[1.3766874]
step:6 loss:[3.3678422]
step:7 loss:[1.8410085]
step:8 loss:[1.6417965]
step:9 loss:[4.009201]
step:10 loss:[1.7387416]
step:11 loss:[1.6013482]
step:12 loss:[1.6388085]
step:13 loss:[3.7573469]
step:14 loss:[0.9461777]
step:15 loss:[2.4906065]
step:16 loss:[2.613153]
step:17 loss:[2.8367076]
step:18 loss:[2.170548]
step:19 loss:[2.2705061]
INFO 2021-06-28 15:56:35,049 launch.py:240] Local processes completed.

2. Handwritten-digit recognition with Fleet: multiple API versions

2.1 Basic-API Fleet version of handwritten-digit recognition

%%writefile normal_fleet.py
import paddle  # version with the 3 modifications
from paddle.vision.transforms import ToTensor
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

def train():
    epochs = 1
    net = paddle.vision.models.LeNet()
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    # Distributed step 3: obtain the distributed optimizer and distributed model through fleet
    strategy = fleet.DistributedStrategy()
    adam = fleet.distributed_optimizer(adam, strategy=strategy)
    net = fleet.distributed_model(net)
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly writes avg_loss here
            adam.step()
            adam.clear_grad()

if __name__ == '__main__':
    train()
!fleetrun --gpus=0 normal_fleet.py
 +=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:42501 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:42501 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
epoch: 0, batch_id: 0, loss is: [2.5425684], acc is: [0.234375]
epoch: 0, batch_id: 400, loss is: [0.05207598], acc is: [1.]
epoch: 0, batch_id: 800, loss is: [0.04818164], acc is: [1.]

2.2 High-level API Fleet version of handwritten-digit recognition

%%writefile hapi_fleet.py
import paddle
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# LeNet inherits from paddle.nn.Layer and is the network; paddle.Model wraps it with training functionality
lenet = paddle.vision.models.LeNet()
model = paddle.Model(lenet)

# Configure the optimizer, loss, and metric needed for training
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.1, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2))
)

def train():
    # Start training, with VisualDL visualization
    callback = paddle.callbacks.VisualDL(log_dir='visualdl_log')
    model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)
    # Without VisualDL visualization:
    # model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
    # Start evaluation
    # model.evaluate(test_dataset, log_freq=20, batch_size=64)

if __name__ == '__main__':
    train()
!fleetrun hapi_fleet.py

2.3 Multi-machine multi-GPU handwritten-digit recognition

Moving from single-machine multi-GPU to multi-machine multi-GPU training requires no code changes at all; only the launch command changes. Taking 2 machines with 4 GPUs as an example:

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 dygraph_fleet.py

Run the above launch command on each of the 2 machines; fleetrun starts a multi-process job in the background on each one and performs the distributed multi-machine training. You will see the console log output on the machine with IP xx.xx.xx.xx.

Below we again use AIStudio to demonstrate multi-machine multi-GPU; run directly:

!fleetrun --ips="127.0.0.1" --gpus=0 normal_fleet.py

3. Summary of parallel computing in PaddlePaddle 2.0

PaddlePaddle 2.0 offers a complete solution for parallel computing and is a training framework proven on extremely large-scale business data. Parallel computing really is this simple!

3.1 For single-machine multi-GPU training, spawn is the preferred method

The advantages of spawn: almost no code changes are needed; just import the module and, at the end, call spawn on the training function. spawn also gives better control over the processes and is friendlier for log printing and training shutdown.

The program only needs these two additions:

    import paddle.distributed as dist

    if __name__ == '__main__':
        dist.spawn(train)

Then simply start training with python train.py.

3.2 For multi-machine multi-GPU training, use fleet.

A regular basic-API program needs 3 corresponding modifications:

  • Import the paddle.distributed.fleet package

      from paddle.distributed import fleet
  • Initialize the fleet environment

      fleet.init(is_collective=True)
  • Obtain the distributed optimizer and distributed model through fleet

      strategy = fleet.DistributedStrategy()
      adam = fleet.distributed_optimizer(adam, strategy=strategy)
      dp_layer = fleet.distributed_model(layer)
  • Then run the command:

      fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 train.py

3.3 With high-level API code, no program changes are needed; just run the fleetrun command.

4. Visualizing parallel training with VisualDL

VisualDL is a visualization tool designed for deep-learning tasks. It presents data with rich charts so users can inspect data characteristics and trends more intuitively and clearly, which helps with analyzing data, spotting errors early, and improving the design of the neural network model. If you like it, go give it a star.

AI Studio Notebook projects (Paddle 1.8.0 and above) already integrate the VisualDL tool for your convenience; the VisualDL service can be started from the Visualization tab.

4.1 VisualDL visualization

In a high-level API program, you only need to add the line callback = paddle.callbacks.VisualDL(log_dir='visualdl_log') and pass callbacks=callback to model.fit, i.e.: model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400), as in the short sketch below.
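For reference, here is a minimal, self-contained sketch assembled from the hapitrain.py pieces shown earlier; the log_dir value is simply the example directory name used throughout this article:

import paddle
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
model = paddle.Model(paddle.vision.models.LeNet())
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2))
)

# The only VisualDL-specific additions: create the callback and pass it to fit()
callback = paddle.callbacks.VisualDL(log_dir='visualdl_log')
model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)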

The hapi_fleet.py code above already includes the VisualDL statements, and the earlier cell has already executed !fleetrun hapi_fleet.py, so you can now open the visualization directly in AIStudio:

Open the Visualization tab in the left sidebar -> set logdir -> click Add -> select visualdl_log/ -> click Start VisualDL service -> click Open VisualDL. In the page that opens you can see the training loss/acc statistics.

4.2 Sharing visualization results with VisualDL Service

  • This feature was added in VisualDL 2.0.4, so VisualDL 2.0.4 or later must be installed. A single command, visualdl service upload, uploads your log files to the remote service.

  • This feature is highly recommended: once the files are uploaded, you no longer need to keep them locally; just visit the generated link, which is very convenient!

  • If VisualDL 2.0.4+ is not installed, install it with pip install visualdl==2.0.5.

  • After running the code below, visit the generated link; anyone can then inspect and analyze the training process.

!pip install -U visualdl -q # ==2.0.5

!visualdl service upload --logdir visualdl_log
