ARM CPU自动调度神经网络
ARM CPU自动调度神经网络
对特定设备和工作负载进行自动调度,对于获得最佳性能至关重要。通过RPC使用自动调度器为ARM CPU调度整个神经网络。
为了自动调度神经网络,将网络划分为小的子图,进行独立调度。每个子图被视为一个搜索任务。任务调度程序对时间进行分片,为这些任务动态分配时间资源。任务调度程序预测每个任务对端到端执行时间的影响,确定最大程度地减少执行时间的任务的优先级。
对于每个子图,使用compute声明tvm/python/topi,获取张量表达式形式的计算DAG。使用自动调度器来构造此DAG的搜索空间,搜索良好的调度(低级优化)。
与手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何调度模板。换句话说,自动调度程序仅在其中使用计算声明tvm/python/topi,而不使用现有的调度模板。
本文无法在Windows或最新版本的macOS上运行。要使其运行,需要将本文的内容包装在一个if __name__ == "__main__":块中。
import numpy as np
import tvm
from tvm import relay, auto_scheduler
import tvm.relay.testing
from tvm.contrib import graph_runtime
from tvm.contrib.utils import tempdir
定义网络
首先,需要使用Relay中继前端API定义网络。从加载一些预定义的网络tvm.relay.testing。还从MXNet,ONNX,PyTorch和TensorFlow加载模型。
对于卷积神经网络,尽管自动调度程序在任何布局下正常工作,使用NHWC布局通常实现最佳性能。使用自动调度程序对NHWC布局实施了更多优化。建议将模型转换为NHWC布局以使用自动调度程序。使用ConvertLayout传递在TVM中进行布局转换。
def get_network(name, batch_size, layout="NHWC", dtype="float32"):
"""Get the symbol definition and random weight of a network"""
# auto-scheduler prefers NHWC layout
if layout == "NHWC":
image_shape = (224, 224, 3)
elif layout == "NCHW":
image_shape = (3, 224, 224)
else:
raise ValueError("Invalid layout: " + layout)
input_shape = (batch_size,) + image_shape
output_shape = (batch_size, 1000)
if name.startswith("resnet-"):
n_layer = int(name.split("-")[1])
mod, params = relay.testing.resnet.get_workload(
num_layers=n_layer,
batch_size=batch_size,
layout=layout,
dtype=dtype,
image_shape=image_shape,
)
elif name.startswith("resnet3d-"):
n_layer = int(name.split("-")[1])
mod, params = relay.testing.resnet.get_workload(
num_layers=n_layer,
batch_size=batch_size,
layout=layout,
dtype=dtype,
image_shape=image_shape,
)
elif name == "mobilenet":
mod, params = relay.testing.mobilenet.get_workload(
batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
)
elif name == "squeezenet_v1.1":
assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
mod, params = relay.testing.squeezenet.get_workload(
version="1.1",
batch_size=batch_size,
dtype=dtype,
image_shape=image_shape,
)
elif name == "inception_v3":
input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
elif name == "mxnet":
# an example for mxnet model
from mxnet.gluon.model_zoo.vision import get_model
assert layout == "NCHW"
block = get_model("resnet50_v1", pretrained=True)
mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
net = mod["main"]
net = relay.Function(
net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
)
mod = tvm.IRModule.from_expr(net)
return mod, params, input_shape, output_shape
启动RPC跟踪器
TVM使用RPC会话与ARM板进行通信。在调度期间,调谐器会将生成的代码发送到电路板上,测量电路板上的代码速度。
为了扩大调度范围,TVM使用RPC Tracker来管理分布式设备。RPC跟踪器是一个集中式控制器节点。将所有设备注册到跟踪器。例如,如果有10部电话,全部注册到跟踪器,并行运行10次测量,从而加快了调谐过程。
要启动RPC跟踪器,在主机上运行此命令。在整个调度过程中都需要使用跟踪器,因此需要为此命令打开一个新终端:
python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
预期的输出是
INFO:RPCTracker:bind to 0.0.0.0:9190
将设备注册到RPC跟踪器
现在,将设备注册到跟踪器。第一步是为ARM设备构建TVM运行时。
- 对于Linux:遵循本节在设备上构建TVM运行时,以在设备上构建TVM运行时。然后通过以下方式将设备注册到跟踪器
- python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rasp4b-64
(用[HOST_IP]主机的IP地址代替)
- 对于Android:在Android设备上安装TVM RPC APK。确保通过android rpc测试。已经注册了设备。在调度过程中,必须转到开发人员选项,启用“更改时保持屏幕唤醒”并为手机充电以使其稳定。
注册设备后,通过查询rpc_tracker进行确认
python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
例如,如果有2个Huawei mate10 pro,11个具有64位操作系统的Raspberry Pi 4B和2个rk3399,则输出是
Queue Status
----------------------------------
key total free pending
----------------------------------
mate10pro 2 2 0
rk3399 2 2 0
rasp4b-64 11 11 0
----------------------------------
将多个设备注册到跟踪器,以加快调谐过程中的测量速度。
设置调度选项
调度之前,应该应用一些配置。以带有64位操作系统(Ubuntu 20.04)的Raspberry Pi 4b 4GB主板为例。在设置中,应该相应地修改目标和device_key。如果使用的是Android手机,use_ndk设置为True。
#### DEVICE CONFIG ####
# Replace "aarch64-linux-gnu" with the correct target of your board.
# This target is used for cross compilation. You can query it by :code:`gcc -v` on your device.
# FIXME(tmoreau89, merrymercy): We leave '-device=arm_cpu' out of the target string
# because we're sharing x86 op strategy.
target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+neon")
# Also replace this with the device key in your tracker
device_key = "rasp4b-64"
# Set this to True if you use ndk tools for cross compiling
# And also set the environment variable below to point to the cross compiler
use_ndk = False
# os.environ["TVM_NDK_CC"] = "/usr/bin/aarch64-linux-gnu-g++"
#### TUNING OPTION ####
network = "mobilenet"
batch_size = 1
layout = "NHWC"
dtype = "float32"
log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)
提取搜索任务
接下来,从网络中提取搜索任务及其权重。任务的权重是该任务的子图在整个网络中的出现次数。通过使用权重,将网络的端到端延迟近似为,sum(latency[t] * weight[t])其中是任务的延迟,latency[t]weight[t]是任务的权重。任务调度程序将仅优化此目标。
# Extract tasks from the network
print("Extract tasks...")
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
for idx, task in enumerate(tasks):
print("========== Task %d (workload key: %s) ==========" % (idx, task.workload_key))
print(task.compute_dag)
出去:
Extract tasks...
========== Task 0 (workload key: ["d7b65649a4dd54becea0a52aabbc5af5", 1, 1000, 1, 1000]) ==========
placeholder = PLACEHOLDER [1, 1000]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))
T_softmax_expsum(i0) += T_softmax_exp[i0, k]
T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])
========== Task 1 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 1, 1024, 1000, 1024, 1000, 1, 1000]) ==========
placeholder = PLACEHOLDER [1, 1024]
placeholder = PLACEHOLDER [1000, 1024]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [1000]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 2 (workload key: ["69115f188984ae34ede37c3b8ca40b43", 1, 7, 7, 1024, 1, 1, 1, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 1024]
tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))
========== Task 3 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 7, 7, 1024, 1, 1, 1024, 1024, 1, 1, 1, 1024, 1, 7, 7, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 4 (workload key: ["06fce76bd84cb904eee50b905ca9449a", 1, 7, 7, 1024, 3, 3, 1024, 1, 1, 1, 1, 1024, 1, 7, 7, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 1024]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 1024, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 5 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 7, 7, 512, 1, 1, 512, 1024, 1, 1, 1, 1024, 1, 7, 7, 1024]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 6 (workload key: ["c87ba68bc180312f5716af09a77ca15b", 1, 14, 14, 512, 3, 3, 512, 1, 1, 1, 1, 512, 1, 7, 7, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 512]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 512, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i*2) + di), ((j*2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 7 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 14, 14, 512, 1, 1, 512, 512, 1, 1, 1, 512, 1, 14, 14, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 8 (workload key: ["06fce76bd84cb904eee50b905ca9449a", 1, 14, 14, 512, 3, 3, 512, 1, 1, 1, 1, 512, 1, 14, 14, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 512]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 512, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 9 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 14, 14, 256, 1, 1, 256, 512, 1, 1, 1, 512, 1, 14, 14, 512]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 10 (workload key: ["c87ba68bc180312f5716af09a77ca15b", 1, 28, 28, 256, 3, 3, 256, 1, 1, 1, 1, 256, 1, 14, 14, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 256]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 256, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i*2) + di), ((j*2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 11 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 28, 28, 256, 1, 1, 256, 256, 1, 1, 1, 256, 1, 28, 28, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 12 (workload key: ["06fce76bd84cb904eee50b905ca9449a", 1, 28, 28, 256, 3, 3, 256, 1, 1, 1, 1, 256, 1, 28, 28, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 256]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 256, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 13 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 28, 28, 128, 1, 1, 128, 256, 1, 1, 1, 256, 1, 28, 28, 256]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 14 (workload key: ["c87ba68bc180312f5716af09a77ca15b", 1, 56, 56, 128, 3, 3, 128, 1, 1, 1, 1, 128, 1, 28, 28, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 128]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 128, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i*2) + di), ((j*2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 15 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 56, 56, 128, 1, 1, 128, 128, 1, 1, 1, 128, 1, 56, 56, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 16 (workload key: ["06fce76bd84cb904eee50b905ca9449a", 1, 56, 56, 128, 3, 3, 128, 1, 1, 1, 1, 128, 1, 56, 56, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 128]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 128, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 17 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 56, 56, 64, 1, 1, 64, 128, 1, 1, 1, 128, 1, 56, 56, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 18 (workload key: ["c87ba68bc180312f5716af09a77ca15b", 1, 112, 112, 64, 3, 3, 64, 1, 1, 1, 1, 64, 1, 56, 56, 64]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 64]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 113)) && (i2 >= 1)) && (i2 < 113)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 64, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, ((i*2) + di), ((j*2) + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 19 (workload key: ["6b7583cf23c7c37d3212cad9d06e58c1", 1, 112, 112, 32, 1, 1, 32, 64, 1, 1, 1, 64, 1, 112, 112, 64]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 32]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 32, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 20 (workload key: ["06fce76bd84cb904eee50b905ca9449a", 1, 112, 112, 32, 3, 3, 32, 1, 1, 1, 1, 32, 1, 112, 112, 32]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 32]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 113)) && (i2 >= 1)) && (i2 < 113)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 32, 1]
DepthwiseConv2d(b, i, j, c) += (PaddedInput[b, (i + di), (j + dj), c]*placeholder[di, dj, c, 0])
placeholder = PLACEHOLDER [1, 1, 1, 32]
T_add(ax0, ax1, ax2, ax3) = (DepthwiseConv2d[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
========== Task 21 (workload key: ["98418eda02701ddd175ad50e364a0638", 1, 224, 224, 3, 3, 3, 3, 32, 1, 112, 1, 1, 1, 112, 1, 1, 1, 112, 112, 32]) ==========
placeholder = PLACEHOLDER [1, 224, 224, 3]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 225)) && (i2 >= 1)) && (i2 < 225)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 3, 32]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 112, 1, 1]
T_multiply(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3]*placeholder[ax0, ax1, 0, 0])
placeholder = PLACEHOLDER [1, 112, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, 0, 0])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
调优与评估
设置一些选项来优化和启动搜索任务
- num_measure_trials是在调度期间使用的测量试验的次数。将其设置为较小的数字(例如200)以进行快速演示。实际上,建议将其设置为800 * len(tasks),通常足以使搜索收敛。例如,resnet-50中有29个任务,因此将其设置为20000。根据时间预算调度。
- 此外,还用RecordToFile将测量记录转储到日志文件中,些测量记录可用于最好地查询历史记录,恢复搜索以及以后进行更多分析。
- 有关更多参数auto_scheduler.TuningOptions, 参见auto_scheduler.LocalRunner。
自动调度后,使用发现的最佳调度来编译网络。在自动调度期间,所有测量记录都将转储到日志文件中,因此读取日志文件并加载最佳调度。
def tune_and_evaluate():
print("Begin tuning...")
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
num_measure_trials=200, # change this to 20000 to achieve the best performance
runner=auto_scheduler.RPCRunner(
device_key,
host="0.0.0.0",
port=9191,
timeout=30,
repeat=1,
min_repeat_ms=200,
enable_cpu_cache_flush=True,
),
measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)
# Compile with the history best
print("Compile...")
with auto_scheduler.ApplyHistoryBest(log_file):
with tvm.transform.PassContext(
opt_level=3, config={"relay.backend.use_auto_scheduler": True}
):
lib = relay.build(mod, target=target, params=params)
# Export library
tmp = tempdir()
if use_ndk:
from tvm.contrib import ndk
filename = "net.so"
lib.export_library(tmp.relpath(filename), ndk.create_shared)
else:
filename = "net.tar"
lib.export_library(tmp.relpath(filename))
# Upload module to device
print("Upload...")
remote = auto_scheduler.utils.request_remote(device_key, "0.0.0.0", 9191, timeout=10000)
remote.upload(tmp.relpath(filename))
rlib = remote.load_module(filename)
# Create graph runtime
ctx = remote.cpu()
module = graph_runtime.GraphModule(rlib["default"](ctx))
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input("data", data_tvm)
# Evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
prof_res = np.array(ftimer().results) * 1e3 # convert to millisecond
print(
"Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res))
)
# We do not run the tuning in our webpage server since the server doesn't have a Raspberry Pi,
# or device tracker running.
# Uncomment the following line to run it by yourself.
# tune_and_evaluate()
笔记
调度期间解释打印的信息
在调度期间,控制台上会打印很多信息。用于调试目的。最重要的信息是任务调度程序的输出。下表是示例输出。
----------------------------------------------------------------------
------------------------------ [ Task Scheduler ]
----------------------------------------------------------------------
| ID | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
| 0 | 0.013 | 0.31 | 64 |
| 1 | 0.845 | 2.43 | 448 |
| 2 | 0.046 | -0.00 | 64 |
| 3 | 4.194 | 24.53 | 2112 |
| 4 | 0.109 | 9.21 | 64 |
| 5 | 1.759 | 29.27 | 896 |
| 6 | 0.083 | 6.01 | 64 |
| 7 | 3.084 | 33.38 | 7680 |
| 8 | 0.136 | 14.78 | 384 |
| 9 | 1.349 | 38.23 | 768 |
| 10 | 0.133 | 7.55 | 128 |
| 11 | 2.747 | 37.56 | 1536 |
| 12 | 0.338 | 11.87 | 192 |
| 13 | 1.295 | 40.00 | 704 |
| 14 | 0.482 | 4.16 | 256 |
| 15 | 2.686 | 38.56 | 1344 |
| 16 | 0.884 | 9.08 | 448 |
| 17 | 1.332 | 39.18 | 704 |
| 18 | 1.045 | 3.84 | 576 |
| 19 | 1.391 | 38.09 | 704 |
| 20 | 0.777 | 10.34 | 448 |
| 21 | 0.739 | 30.97 | 448 |
-------------------------------------------------
Estimated total latency: 38.347 ms Trials: 19992 Used time : 19260 s Next ID: 3
下表列出了所有任务的延迟和(估计)速度。列出了所有任务的测量试验分配。最后一行显示这些任务的总加权延迟,粗略估计网络的端到端执行时间。最后一行还显示测量试验的总数,自动调度所花费的总时间,以及要调度的下一个任务的ID。
将出现一些“ dmlc :: Error”错误,因为自动调度程序将尝试一些无效的调度。如果继续进行调度,则放心地忽略,因为这些错误与主要过程是隔离的。
笔记
提前终止调度
通过强制终止此过程来提前终止调度。只要在日志文件中为每个任务至少获得一个有效的调度,就应该能够进行编译(下面的部分)。
其他技巧
- 在调度过程中,自动调度器需要编译许多程序并从中提取功能。该部分占用大量CPU,建议使用具有多个内核的高性能CPU,以加快搜索速度。
- 用来提取大型日志文件,而仅保存最有用的记录。python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json
- 从上一个日志文件继续搜索。load_log_file在function中创建任务调度程序时,只需添加一个新参数run_tuning。
tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
- 如果有多个目标CPU,则全部用于测量以并行化测量。检查本节 以了解如何使用RPC跟踪器和RPC服务器。要在自动调度使用RPC跟踪, 用auto_scheduler.RPCRunner更换转轮TuningOptions。
ARM CPU自动调度神经网络的更多相关文章
- 为x86 CPU自动调度神经网络
为x86 CPU自动调度神经网络 对特定设备和工作负载进行自动调试对于获得最佳性能至关重要.这是有关如何使用自动调度器为x86 CPU调试整个神经网络的文档. 为了自动调试神经网络,将网络划分为小的子 ...
- NVIDIA GPU自动调度神经网络
NVIDIA GPU自动调度神经网络 对特定设备和工作负载进行自动调整对于获得最佳性能至关重要.这是有关如何使用自动调度器为NVIDIA GPU调整整个神经网络. 为了自动调整神经网络,将网络划分为小 ...
- NVIDIA GPU的神经网络自动调度
NVIDIA GPU的神经网络自动调度 针对特定设备和工作负载的自动调整对于获得最佳性能至关重要.这是一个关于如何使用自动调度器为NVIDIA GPU调整整个神经网络的资料. 为了自动调整一个神经网络 ...
- CPU的自动调度矩阵乘法
CPU的自动调度矩阵乘法 这是一个有关如何对CPU使用自动调度程序的文档. 与依靠手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何模板.用户只需要编写计算声明,而无需任何调度 ...
- TVM自动调度器
TVM自动调度器 随着模型大小,算子多样性和硬件异构性的不断增长,优化深度神经网络的执行速度非常困难.从计算的角度来看,深度神经网络只是张量计算的一层又一层.这些张量计算(例如matmul和conv2 ...
- arm cpu的架构及分类说明
今天在编译mplayer for mx27ads的时候, 碰到了armv5te与armv6优化的问题. 默认的交叉编译器支持armv5te也支持armv6,就默认使用了mplayer中mpeg4的ar ...
- ARM CPU大小端
ARM CPU大小端: 大端模式:低位字节存在高地址上,高位字节存在低地址上 小端模式:高位字节存在高地址上,低位字节存在低地址上 STM32属于小端模式,简单的说,比如u32 temp=0X1234 ...
- ARM CPU与Intel x86 CPU性能比较
Qualcomm ARM CPU与Intel x86 CPU性能比较 随着移动互联网时代的到来,Qualcomm(高通).Texas Instruments(德州仪器)等基于ARM架构的CPU受到越来 ...
- 10 -- 深入使用Spring -- 5... 实现任务的自动调度
10.5 实现任务的自动调度 10.5.1 使用Quartz 10.5.2 在Spring中使用Quartz
随机推荐
- 06- Linux Ubuntu下sublime下载与使用与安装包
我有Linux下全套的安装包,如sublime,pycharm等.不用再苦逼的下软件了.需要的加我微信:chimu sublime text 3文本编辑器 启动命令:Linux命令行输入subl 或者 ...
- 【ElasticSearch】ElasticSearch集群扫盲
Cluster 集群 ⼀个 Elasticsearch 集群由⼀个或多个节点(Node)组成,每个集群都有⼀个共同的集群名称作为标识. Node节点 ⼀个 Elasticsearch 实例即⼀个 ...
- 【Nginx(五)】Nginx配置Https证书
大致的流程如下 1.申请Https证书,绑定域名信息; 由于自己的服务器是腾讯云服务器, 这里就在腾讯云上申请SSL证书, 申请地址: https://console.cloud.tencent.co ...
- java之I/O流
I/O流的使用情况多种多样,首先它的数据源就可能是文件.控制台.服务器等,它的单位可能是按字节.按字符.按行等.为了涵盖所有的可能,java类库中创建了大量的类,如此多的类让我们在使用时感觉有点难以选 ...
- 【Redis】启动redis提示Could not connect to Redis at 127.0.0.1:6379: Connection refused 已解决
1.配置redis.conf文件,将daemonize no 为 daemonize yes即可(让redis作为守护进程运行)
- 【转】python SQLAlchemy
数据库表是一个二维表,包含多行多列. 把一个表的内容用Python的数据结构表示出来的话,可以用一个list表示多行,list的每一个元素是tuple,表示一行记录,比如,包含id和name的user ...
- Pytest自动化测试-简易入门教程(01)
我们今天主讲的内容,就是测试框架Pytest,讲到这个测试框架对于没有做过Web自动化的伙伴来说,会觉得这个东西是陌生的,那么到底什么是框架呢?什么又是自动化呢?自动化为什么又要用框架呢? 难道我学自 ...
- 什么是 Mock 测试?
什么是 Mock? 作为动词,Mock 是模拟.模仿的意思. 作为名词,Mock 是能够模仿真实对象行为的模拟对象. 那么,在软件测试中,Mock 所模拟的对象是什么呢? 模拟的是 SUT(Syste ...
- C#是怎么跑起来的
解释流程前,需要了解一些基本的概念. 基本概念解释: CPU :中央处理器,计算机的大脑,内部由数百万至数亿个晶体管组成,是解释和运行最终转换成机器语言(二进制代码)的地方.机器语言是通过CPU内存的 ...
- 在Visual Studio 中使用git——浏览版本库(七)
在Visual Studio 中使用git--什么是Git(一) 在Visual Studio 中使用git--给Visual Studio安装 git插件(二) 在Visual Studio 中使用 ...