如何使用TensorCores优化卷积

本文将演示如何在TVM中使用TensorCores编写高性能的卷积计划。假设卷积的输入有大量数据。首先介绍如何在GPU上优化卷积

TensorCore简介

每个Tensor核心都提供一个4x4x4的矩阵处理阵列,该阵列可以运行 ,其中A,B,C和D是4x4矩阵,如图所示。矩阵乘法输入A和B是FP16矩阵,而累加矩阵C和D可以是FP16或FP32矩阵。D = A * B + C

但是,CUDA程序员只能使用扭曲级原语,在张量核上执行16x16x16半精度矩阵乘法。在调用矩阵乘法之前,程序员必须将内存中的数据显式地加载到寄存器中。NVCC编译器将该原语转换为多个内存加载指令。在运行时runtime,每个线程从矩阵A加载16个元素,从矩阵B加载16个元素。wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag)wmma::load_matrix_sync

准备和算法

将固定大小用于256通道和14 x 14尺寸的输入张量。批处理大小为256。卷积过滤器包含512个大小为3 x 3的过滤器。对于卷积,使用步幅大小1和填充大小1。在示例中,使用NHWCnc内存布局。以下代码定义了TVM中的卷积算法。

import tvm
from tvm import te
import numpy as np
from tvm.contrib import nvcc
 
# The sizes of inputs and filters
batch_size = 256
height = 14
width = 14
in_channels = 256
out_channels = 512
kernel_h = 3
kernel_w = 3
pad_h = 1
pad_w = 1
stride_h = 1
stride_w = 1
 
# TensorCore shape
block_size = 16
 
assert batch_size % block_size == 0
assert in_channels % block_size == 0
assert out_channels % block_size == 0
 
# Input feature map: (N, H, W, IC, n, ic)
data_shape = (
    batch_size // block_size,
    height,
    width,
    in_channels // block_size,
    block_size,
    block_size,
)
# Kernel: (H, W, IC, OC, ic, oc)
kernel_shape = (
    kernel_h,
    kernel_w,
    in_channels // block_size,
    out_channels // block_size,
    block_size,
    block_size,
)
# Output feature map: (N, H, W, OC, n, oc)
output_shape = (
    batch_size // block_size,
    height,
    width,
    out_channels // block_size,
    block_size,
    block_size,
)
 
# Reduction axes
kh = te.reduce_axis((0, kernel_h), name="kh")
kw = te.reduce_axis((0, kernel_w), name="kw")
ic = te.reduce_axis((0, in_channels // block_size), name="ic")
ii = te.reduce_axis((0, block_size), name="ii")
 
# Algorithm
A = te.placeholder(data_shape, name="A", dtype="float16")
W = te.placeholder(kernel_shape, name="W", dtype="float16")
Apad = te.compute(
    (
        batch_size // block_size,
        height + 2 * pad_h,
        width + 2 * pad_w,
        in_channels // block_size,
        block_size,
        block_size,
    ),
    lambda n, h, w, i, nn, ii: tvm.tir.if_then_else(
        tvm.tir.all(h >= pad_h, h - pad_h < height, w >= pad_w, w - pad_w < width),
        A[n, h - pad_h, w - pad_w, i, nn, ii],
        tvm.tir.const(0.0, "float16"),
    ),
    name="Apad",
)
Conv = te.compute(
    output_shape,
    lambda n, h, w, o, nn, oo: te.sum(
        Apad[n, h * stride_h + kh, w * stride_w + kw, ic, nn, ii].astype("float32")
        * W[kh, kw, ic, o, ii, oo].astype("float32"),
        axis=[ic, kh, kw, ii],
    ),
    name="Conv",
)
 
s = te.create_schedule(Conv.op)
s[Apad].compute_inline()

存储范围

在传统的GPU计划中,具有全局,共享和本地内存范围。为了支持TensorCores,添加了另外三个特殊的存储范围:wmma.matrix_a, wmma.matrix_b和wmma.accumulator。在硬件上,所有片段作用域存储在片上寄存器级别,与本地存储器位于同一位置。

# Designate the memory hierarchy
AS = s.cache_read(Apad, "shared", [Conv])
WS = s.cache_read(W, "shared", [Conv])
AF = s.cache_read(AS, "wmma.matrix_a", [Conv])
WF = s.cache_read(WS, "wmma.matrix_b", [Conv])
ConvF = s.cache_write(Conv, "wmma.accumulator")

定义张量特征

实际上,TensorCore是一种特殊的硬件操作。因此,可以使用Tensorize用TensorCore指令替换计算单位。首先,需要定义张量特征。

有四种基本的操作TensorCore: ,, 。由于都用于矩阵乘法,因此可以编写以下三个内部函数。fill_fragmentload_matrixmma_syncstore_matrixfill_fragmentmma_sync

def intrin_wmma_load_matrix(scope):
    n = 16
    A = te.placeholder((n, n), name="A", dtype="float16")
    BA = tvm.tir.decl_buffer(A.shape, A.dtype, scope="shared", data_alignment=32, offset_factor=256)
    C = te.compute((n, n), lambda i, j: A[i, j], name="C")
    BC = tvm.tir.decl_buffer(C.shape, C.dtype, scope=scope, data_alignment=32, offset_factor=256)
 
    def intrin_func(ins, outs):
        ib = tvm.tir.ir_builder.create()
 
        BA = ins[0]
        BC = outs[0]
        ib.emit(
            tvm.tir.call_intrin(
                "handle",
                "tir.tvm_load_matrix_sync",
                BC.data,
                n,
                n,
                n,
                BC.elem_offset // 256,
                BA.access_ptr("r"),
                n,
                "row_major",
            )
        )
        return ib.get()
 
    return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, C: BC})
 
 
def intrin_wmma_gemm():
    n = 16
    A = te.placeholder((n, n), name="A", dtype="float16")
    B = te.placeholder((n, n), name="B", dtype="float16")
    k = te.reduce_axis((0, n), name="k")
    C = te.compute(
        (n, n),
        lambda ii, jj: te.sum(A[ii, k].astype("float") * B[k, jj].astype("float"), axis=k),
        name="C",
    )
    BA = tvm.tir.decl_buffer(
        A.shape, A.dtype, name="BA", scope="wmma.matrix_a", data_alignment=32, offset_factor=256
    )
    BB = tvm.tir.decl_buffer(
        B.shape, B.dtype, name="BB", scope="wmma.matrix_b", data_alignment=32, offset_factor=256
    )
    BC = tvm.tir.decl_buffer(
        C.shape, C.dtype, name="BC", scope="wmma.accumulator", data_alignment=32, offset_factor=256
    )
 
    def intrin_func(ins, outs):
        BA, BB = ins
        (BC,) = outs
 
        def init():
            ib = tvm.tir.ir_builder.create()
            ib.emit(
                tvm.tir.call_intrin(
                    "handle", "tir.tvm_fill_fragment", BC.data, n, n, n, BC.elem_offset // 256, 0.0
                )
            )
            return ib.get()
 
        def update():
            ib = tvm.tir.ir_builder.create()
            ib.emit(
                tvm.tir.call_intrin(
                    "handle",
                    "tir.tvm_mma_sync",
                    BC.data,
                    BC.elem_offset // 256,
                    BA.data,
                    BA.elem_offset // 256,
                    BB.data,
                    BB.elem_offset // 256,
                    BC.data,
                    BC.elem_offset // 256,
                )
            )
            return ib.get()
 
        return update(), init(), update()
 
    return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})
 
 
def intrin_wmma_store_matrix():
    n = 16
    A = te.placeholder((n, n), name="A", dtype="float32")
    BA = tvm.tir.decl_buffer(
        A.shape, A.dtype, scope="wmma.accumulator", data_alignment=32, offset_factor=256
    )
    C = te.compute((n, n), lambda i, j: A[i, j], name="C")
    BC = tvm.tir.decl_buffer(C.shape, C.dtype, scope="global", data_alignment=32, offset_factor=256)
 
    def intrin_func(ins, outs):
        ib = tvm.tir.ir_builder.create()
        BA = ins[0]
        BC = outs[0]
        ib.emit(
            tvm.tir.call_intrin(
                "handle",
                "tir.tvm_store_matrix_sync",
                BA.data,
                n,
                n,
                n,
                BA.elem_offset // 256,
                BC.access_ptr("w"),
                n,
                "row_major",
            )
        )
        return ib.get()
 
    return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, C: BC})

调度计算

要在TVM中使用TensorCores,必须将计算调度到特定的结构中以匹配张量特征。与传统的GPU程序一样,可以使用共享内存来提高速度。如果对阻塞和共享内存有任何疑问,请参阅如何在GPU上优化卷积

在此示例中,每个块包含2x4变形,并且每个变形调用4x2 TensorCore指令。因此,每个warp的输出形状为64x32,每个块输出128x128标题。由于共享内存空间的限制,一次只能加载2个块(2x128x128个图块)。

warp操作

请注意,所有TensorCore指令均为warp级指令,这意味着warp中的所有32个线程应同时执行此指令。使theadIdx.x范围= 32是解决此问题的最简单方法之一。然后可以将threadIdx.x绑定到任何循环,除了那些直接或间接包含TensorCore内部函数的循环。还要注意,这不是唯一的解决方案。唯一要做的是确保warp中的所有线程可以同时调用TensorCore。

# Define tiling sizes
block_row_warps = 4
block_col_warps = 2
warp_row_tiles = 2
warp_col_tiles = 4
warp_size = 32
chunk = 2
 
block_x = te.thread_axis("blockIdx.x")
block_y = te.thread_axis("blockIdx.y")
block_z = te.thread_axis("blockIdx.z")
thread_x = te.thread_axis("threadIdx.x")
thread_y = te.thread_axis("threadIdx.y")
thread_z = te.thread_axis("threadIdx.z")
 
nc, hc, wc, oc, nnc, ooc = Conv.op.axis
block_k = s[Conv].fuse(hc, wc)
s[Conv].bind(block_k, block_z)
nc, nci = s[Conv].split(nc, factor=warp_row_tiles)
block_i, nc = s[Conv].split(nc, factor=block_row_warps)
oc, oci = s[Conv].split(oc, factor=warp_col_tiles)
block_j, oc = s[Conv].split(oc, factor=block_col_warps)
s[Conv].reorder(block_k, block_i, block_j, nc, oc, nci, oci, nnc, ooc)
s[Conv].bind(block_i, block_x)
s[Conv].bind(block_j, block_y)
s[Conv].bind(nc, thread_y)
s[Conv].bind(oc, thread_z)
 
# Schedule local computation
s[ConvF].compute_at(s[Conv], oc)
n, h, w, o, nnf, oof = ConvF.op.axis
ko, ki = s[ConvF].split(ic, factor=chunk)
s[ConvF].reorder(ko, kh, ki, kw, n, o, nnf, oof, ii)
 
# Move intermediate computation into each output compute tile
s[AF].compute_at(s[ConvF], kw)
s[WF].compute_at(s[ConvF], kw)
 
# Schedule for A's share memory
s[AS].compute_at(s[ConvF], kh)
n, h, w, i, nn, ii = AS.op.axis
tx, xo = s[AS].split(n, nparts=block_row_warps)
ty, yo = s[AS].split(xo, nparts=block_col_warps)
t = s[AS].fuse(nn, ii)
to, ti = s[AS].split(t, factor=warp_size)
s[AS].bind(tx, thread_y)
s[AS].bind(ty, thread_z)
s[AS].bind(ti, thread_x)
 
# Schedule for W's share memory
s[WS].compute_at(s[ConvF], kh)
kh, kw, ic, o, ii, oo = WS.op.axis
tx, xo = s[WS].split(o, nparts=block_row_warps)
ty, yo = s[WS].split(xo, nparts=block_col_warps)
t = s[WS].fuse(ii, oo)
to, ti = s[WS].split(t, nparts=warp_size)
s[WS].bind(tx, thread_y)
s[WS].bind(ty, thread_z)
s[WS].bind(to, thread_x)
s[WS].vectorize(ti)
print(tvm.lower(s, [A, W, Conv], simple_mode=True))

输出:

primfn(A_1: handle, W_1: handle, Conv_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
             W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
             A: Buffer(A_2: Pointer(float16), float16, [16, 14, 14, 16, 16, 16], [])}
  buffer_map = {A_1: A, W_1: W, Conv_1: Conv} {
  attr [IterVar(blockIdx.z: int32, (nullptr), "ThreadIndex", "blockIdx.z")] "thread_extent" = 196;
  attr [Conv.wmma.accumulator: Pointer(float32)] "storage_scope" = "wmma.accumulator";
  allocate(Conv.wmma.accumulator, float32, [2048]);
  attr [Apad.shared: Pointer(float16)] "storage_scope" = "shared";
  allocate(Apad.shared, float16, [12288]);
  attr [W.shared: Pointer(float16)] "storage_scope" = "shared";
  allocate(W.shared, float16, [12288]);
  attr [Apad.shared.wmma.matrix_a: Pointer(float16)] "storage_scope" = "wmma.matrix_a";
  allocate(Apad.shared.wmma.matrix_a, float16, [512]);
  attr [W.shared.wmma.matrix_b: Pointer(float16)] "storage_scope" = "wmma.matrix_b";
  allocate(W.shared.wmma.matrix_b, float16, [1024]);
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 2;
  attr [IterVar(blockIdx.y: int32, (nullptr), "ThreadIndex", "blockIdx.y")] "thread_extent" = 4;
  attr [IterVar(threadIdx.y: int32, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 4;
  attr [IterVar(threadIdx.z: int32, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 2 {
    for (n.c.init: int32, 0, 2) {
      for (o.c.init: int32, 0, 4) {
        for (nn.c.init: int32, 0, 16) {
          for (oo.c.init: int32, 0, 16) {
            Conv.wmma.accumulator[((((n.c.init*1024) + (o.c.init*256)) + (nn.c.init*16)) + oo.c.init)] = 0f32
          }
        }
      }
    }
    for (ic.outer: int32, 0, 8) {
      for (kh: int32, 0, 3) {
        for (ax2: int32, 0, 3) {
          for (ax3: int32, 0, 2) {
            for (ax4.ax5.fused.outer: int32, 0, 8) {
              attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
              Apad.shared[((((((threadIdx.y*3072) + (threadIdx.z*1536)) + (ax2*512)) + (ax3*256)) + (ax4.ax5.fused.outer*32)) + threadIdx.x)] = @tir.if_then_else(((((1 <= (floordiv(blockIdx.z, 14) + kh)) && ((floordiv(blockIdx.z, 14) + kh) < 15)) && (1 <= (ax2 + floormod(blockIdx.z, 14)))) && ((ax2 + floormod(blockIdx.z, 14)) < 15)), (float16*)A_2[(((((((((((blockIdx.x*6422528) + (threadIdx.y*1605632)) + (threadIdx.z*802816)) + (kh*57344)) + (blockIdx.z*4096)) + (ax2*4096)) + (ic.outer*512)) + (ax3*256)) + (ax4.ax5.fused.outer*32)) + threadIdx.x) - 61440)], 0f16, dtype=float16)
            }
          }
        }
        for (ax1: int32, 0, 3) {
          for (ax2_1: int32, 0, 2) {
            attr [IterVar(threadIdx.x, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
            W.shared[ramp((((((ax1*4096) + (ax2_1*2048)) + (threadIdx.y*512)) + (threadIdx.z*256)) + (threadIdx.x*8)), 1, 8)] = (float16x8*)W_2[ramp(((((((((kh*393216) + (ax1*131072)) + (ic.outer*16384)) + (ax2_1*8192)) + (blockIdx.y*2048)) + (threadIdx.y*512)) + (threadIdx.z*256)) + (threadIdx.x*8)), 1, 8)]
          }
        }
        for (ic.inner: int32, 0, 2) {
          for (kw: int32, 0, 3) {
            for (ax0: int32, 0, 2) {
              for (ax4: int32, 0, 16) {
                for (ax5: int32, 0, 16) {
                  Apad.shared.wmma.matrix_a[(((ax0*256) + (ax4*16)) + ax5)] = (float16*)Apad.shared[((((((threadIdx.y*3072) + (ax0*1536)) + (kw*512)) + (ic.inner*256)) + (ax4*16)) + ax5)]
                }
              }
            }
            for (ax3_1: int32, 0, 4) {
              for (ax4_1: int32, 0, 16) {
                for (ax5_1: int32, 0, 16) {
                  W.shared.wmma.matrix_b[(((ax3_1*256) + (ax4_1*16)) + ax5_1)] = (float16*)W.shared[((((((kw*4096) + (ic.inner*2048)) + (threadIdx.z*1024)) + (ax3_1*256)) + (ax4_1*16)) + ax5_1)]
                }
              }
            }
            for (n.c: int32, 0, 2) {
              for (o.c: int32, 0, 4) {
                for (nn.c: int32, 0, 16) {
                  for (oo.c: int32, 0, 16) {
                    for (ii: int32, 0, 16) {
                      Conv.wmma.accumulator[((((n.c*1024) + (o.c*256)) + (nn.c*16)) + oo.c)] = ((float32*)Conv.wmma.accumulator[((((n.c*1024) + (o.c*256)) + (nn.c*16)) + oo.c)] + (cast(float32, (float16*)Apad.shared.wmma.matrix_a[(((n.c*256) + (nn.c*16)) + ii)])*cast(float32, (float16*)W.shared.wmma.matrix_b[(((o.c*256) + (ii*16)) + oo.c)])))
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
    for (n.inner: int32, 0, 2) {
      for (o.inner: int32, 0, 4) {
        for (nn: int32, 0, 16) {
          for (oo: int32, 0, 16) {
            Conv_2[(((((((((blockIdx.x*12845056) + (threadIdx.y*3211264)) + (n.inner*1605632)) + (blockIdx.z*8192)) + (blockIdx.y*2048)) + (threadIdx.z*1024)) + (o.inner*256)) + (nn*16)) + oo)] = (float32*)Conv.wmma.accumulator[((((n.inner*1024) + (o.inner*256)) + (nn*16)) + oo)]
          }
        }
      }
    }
  }
}

降低算力

最后一个阶段是通过将2D卷积映射到张量特征,来将计算循环降低到TensorCore硬件特征

s[AF].tensorize(AF.op.axis[-2], intrin_wmma_load_matrix("wmma.matrix_a"))
s[WF].tensorize(WF.op.axis[-2], intrin_wmma_load_matrix("wmma.matrix_b"))
s[Conv].tensorize(nnc, intrin_wmma_store_matrix())
s[ConvF].tensorize(nnf, intrin_wmma_gemm())
print(tvm.lower(s, [A, W, Conv], simple_mode=True))

输出:

primfn(A_1: handle, W_1: handle, Conv_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
             W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
             A: Buffer(A_2: Pointer(float16), float16, [16, 14, 14, 16, 16, 16], [])}
  buffer_map = {A_1: A, W_1: W, Conv_1: Conv} {
  attr [IterVar(blockIdx.z: int32, (nullptr), "ThreadIndex", "blockIdx.z")] "thread_extent" = 196;
  attr [Conv.wmma.accumulator: Pointer(float32)] "storage_scope" = "wmma.accumulator";
  allocate(Conv.wmma.accumulator, float32, [2048]);
  attr [Apad.shared: Pointer(float16)] "storage_scope" = "shared";
  allocate(Apad.shared, float16, [12288]);
  attr [W.shared: Pointer(float16)] "storage_scope" = "shared";
  allocate(W.shared, float16, [12288]);
  attr [Apad.shared.wmma.matrix_a: Pointer(float16)] "storage_scope" = "wmma.matrix_a";
  allocate(Apad.shared.wmma.matrix_a, float16, [512]);
  attr [W.shared.wmma.matrix_b: Pointer(float16)] "storage_scope" = "wmma.matrix_b";
  allocate(W.shared.wmma.matrix_b, float16, [1024]);
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 2;
  attr [IterVar(blockIdx.y: int32, (nullptr), "ThreadIndex", "blockIdx.y")] "thread_extent" = 4;
  attr [IterVar(threadIdx.y: int32, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 4;
  attr [IterVar(threadIdx.z: int32, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 2 {
    for (n.c.init: int32, 0, 2) {
      for (o.c.init: int32, 0, 4) {
        @tir.tvm_fill_fragment(Conv.wmma.accumulator, 16, 16, 16, ((n.c.init*4) + o.c.init), 0f32, dtype=handle)
      }
    }
    for (ic.outer: int32, 0, 8) {
      for (kh: int32, 0, 3) {
        for (ax2: int32, 0, 3) {
          for (ax3: int32, 0, 2) {
            for (ax4.ax5.fused.outer: int32, 0, 8) {
              attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
              Apad.shared[((((((threadIdx.y*3072) + (threadIdx.z*1536)) + (ax2*512)) + (ax3*256)) + (ax4.ax5.fused.outer*32)) + threadIdx.x)] = @tir.if_then_else(((((1 <= (floordiv(blockIdx.z, 14) + kh)) && ((floordiv(blockIdx.z, 14) + kh) < 15)) && (1 <= (ax2 + floormod(blockIdx.z, 14)))) && ((ax2 + floormod(blockIdx.z, 14)) < 15)), (float16*)A_2[(((((((((((blockIdx.x*6422528) + (threadIdx.y*1605632)) + (threadIdx.z*802816)) + (kh*57344)) + (blockIdx.z*4096)) + (ax2*4096)) + (ic.outer*512)) + (ax3*256)) + (ax4.ax5.fused.outer*32)) + threadIdx.x) - 61440)], 0f16, dtype=float16)
            }
          }
        }
        for (ax1: int32, 0, 3) {
          for (ax2_1: int32, 0, 2) {
            attr [IterVar(threadIdx.x, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
            W.shared[ramp((((((ax1*4096) + (ax2_1*2048)) + (threadIdx.y*512)) + (threadIdx.z*256)) + (threadIdx.x*8)), 1, 8)] = (float16x8*)W_2[ramp(((((((((kh*393216) + (ax1*131072)) + (ic.outer*16384)) + (ax2_1*8192)) + (blockIdx.y*2048)) + (threadIdx.y*512)) + (threadIdx.z*256)) + (threadIdx.x*8)), 1, 8)]
          }
        }
        for (ic.inner: int32, 0, 2) {
          for (kw: int32, 0, 3) {
            for (ax0: int32, 0, 2) {
              @tir.tvm_load_matrix_sync(Apad.shared.wmma.matrix_a, 16, 16, 16, ax0, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float16), Apad.shared, ((((threadIdx.y*3072) + (ax0*1536)) + (kw*512)) + (ic.inner*256)), 256, 1, dtype=handle), 16, "row_major", dtype=handle)
            }
            for (ax3_1: int32, 0, 4) {
              @tir.tvm_load_matrix_sync(W.shared.wmma.matrix_b, 16, 16, 16, ax3_1, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float16), W.shared, ((((kw*4096) + (ic.inner*2048)) + (threadIdx.z*1024)) + (ax3_1*256)), 256, 1, dtype=handle), 16, "row_major", dtype=handle)
            }
            for (n.c: int32, 0, 2) {
              for (o.c: int32, 0, 4) {
                @tir.tvm_mma_sync(Conv.wmma.accumulator, ((n.c*4) + o.c), Apad.shared.wmma.matrix_a, n.c, W.shared.wmma.matrix_b, o.c, Conv.wmma.accumulator, ((n.c*4) + o.c), dtype=handle)
              }
            }
          }
        }
      }
    }
    for (n.inner: int32, 0, 2) {
      for (o.inner: int32, 0, 4) {
        @tir.tvm_store_matrix_sync(Conv.wmma.accumulator, 16, 16, 16, ((n.inner*4) + o.inner), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), Conv_2, (((((((blockIdx.x*12845056) + (threadIdx.y*3211264)) + (n.inner*1605632)) + (blockIdx.z*8192)) + (blockIdx.y*2048)) + (threadIdx.z*1024)) + (o.inner*256)), 256, 2, dtype=handle), 16, "row_major", dtype=handle)
      }
    }
  }
}

生成CUDA内核

最后,使用TVM生成和编译CUDA内核,并评估卷积的延迟。由于TensorCores仅在具有Compute Capability 7.0或更高版本的NVIDIA GPU中受支持,因此它可能无法在构建服务器上运行

ctx = tvm.gpu(0)
if nvcc.have_tensorcore(ctx.compute_version):
    with tvm.transform.PassContext(config={"tir.UnrollLoop": {"auto_max_step": 16}}):
        func = tvm.build(s, [A, W, Conv], "cuda")
    a_np = np.random.uniform(size=data_shape).astype(A.dtype)
    w_np = np.random.uniform(size=kernel_shape).astype(W.dtype)
    a = tvm.nd.array(a_np, ctx)
    w = tvm.nd.array(w_np, ctx)
    c = tvm.nd.array(np.zeros(output_shape, dtype=Conv.dtype), ctx)
    evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
    print("conv2d with tensor core: %f ms" % (evaluator(a, w, c).mean * 1e3))

输出:

conv2d with tensor core: 8.329637 ms

概要

本文演示了如何使用TVM调度原语在特定GPU上调用TensorCore。

https://tvm.apache.org/docs/tutorials/optimize/opt_conv_tensorcore.html

如何使用TensorCores优化卷积的更多相关文章

  1. TensorFlow之CNN:运用Batch Norm、Dropout和早停优化卷积神经网络

    学卷积神经网络的理论的时候,我觉得自己看懂了,可是到了用代码来搭建一个卷积神经网络时,我发现自己有太多模糊的地方.这次还是基于MINIST数据集搭建一个卷积神经网络,首先给出一个基本的模型,然后再用B ...

  2. GPU上如何优化卷积

    GPU上如何优化卷积 本文将演示如何在TVM中编写高性能卷积实现.我们以平方大小的输入张量和滤波器为例,假设卷积的输入是大批量的.在本例中,使用不同的布局来存储数据,以实现更好的数据局部性.缓冲区布局 ...

  3. 如何在GPU上优化卷积

    本文将演示如何在TVM中编写高性能的卷积实现.以平方大小的输入张量和滤波器为例,并假设卷积的输入量很大.使用不同的布局来存储数据,以实现更好的数据局部性.缓冲区布局为HWCN,代表高度,宽度,通道,批 ...

  4. 使用Auto TensorCore CodeGen优化Matmul

    使用Auto TensorCore CodeGen优化Matmul 本文将演示如何使用TVM Auto TensorCore CodeGen在Volta / Turing GPU上编写高性能matmu ...

  5. 用Auto-TensorCore代码生成优化matmul

    用Auto-TensorCore代码生成优化matmul 将演示如何使用TVM Auto TensorCore CodeGen在Volta/Turing GPU上编写高性能matmul调度.这是一个透 ...

  6. 嵌入式设备上卷积神经网络推理时memory的优化

    以前的神经网络几乎都是部署在云端(服务器上),设备端采集到数据通过网络发送给服务器做inference(推理),结果再通过网络返回给设备端.如今越来越多的神经网络部署在嵌入式设备端上,即inferen ...

  7. 卷积理论 & 高维FWT学习笔记

    之前做了那么多生成函数和多项式卷积的题目,结果今天才理解了优化卷积算法的实质. 首先我们以二进制FWT or作为最简单的例子入手. 我们发现正的FWT or变换就是求$\hat{a}_j=\sum_{ ...

  8. 【翻译】借助 NeoCPU 在 CPU 上进行 CNN 模型推理优化

    本文翻译自 Yizhi Liu, Yao Wang, Ruofei Yu.. 的  "Optimizing CNN Model Inference on CPUs" 原文链接: h ...

  9. 再探快速傅里叶变换(FFT)学习笔记(其三)(循环卷积的Bluestein算法+分治FFT+FFT的优化+任意模数NTT)

    再探快速傅里叶变换(FFT)学习笔记(其三)(循环卷积的Bluestein算法+分治FFT+FFT的优化+任意模数NTT) 目录 再探快速傅里叶变换(FFT)学习笔记(其三)(循环卷积的Blueste ...

随机推荐

  1. 【Springboot项目启动异常】项目启动,数据库连接异常

    今天使用Springboot 整合 MybatisPlus 准备写一个Demo,在项目启动时,频繁出错,在此记录整个问题的解决过程 问题如下图 人工翻译一遍,主要意思就是没有检测到数据库驱动,也就是说 ...

  2. Hydra暴力破解工具的用法

    目录 Hydra 常见参数 破解SSH 破解FTP 破解HTTP 破解3389远程登录 Kali自带密码字典 dirb dirbuster fern-wifi metasploit wfuzz Hyd ...

  3. Python脚本爬取网站美女照片

    上次无意之中看到一个网站,里面全是美女的照片,我就心想,哪天有时间了得把这网站的所有美女照片都得爬下来.今天有时间,写了点代码,爬去了网站的所有照片.附上战果!图片实在是太多了,爬半个多小时 先附上所 ...

  4. DLL注入-APC注入

    APC注入 APC注入的原理是利用当线程被唤醒时APC中的注册函数会被执行的机制,并以此去执行我们的DLL加载代码,进而完成DLL注入的目的,其具体流程如下:     1)当EXE里某个线程执行到Sl ...

  5. canvas绘制折线路径动画

    最近有读者加我微信咨询这个问题: 其中的效果是一个折线路径动画效果,如下图所示: 要实现以上路径动画,一般可以使用svg的动画功能.或者使用canvas绘制,结合路径数学计算来实现. 如果用canva ...

  6. PowerBI开发 第十九篇:基于Page创建Tooltip

    在PowerBI 报表中,常规的Tooltip是一段文本,当光标悬停在Visual上,Visual上方会自动显示Tooltip的文本.PowerBI 支持用户自定义内容丰富的Tooltip,用户通过创 ...

  7. 将项目连接到远程仓库git

    方式一: git clone "git中的项目地址",此时会生成一个新的项目2.该步骤用于生成一个本地仓库 将需要提交的所有文件除了node_module.git以及输出文件dis ...

  8. Js的变量、作用域与内存

    变量.作用域与内存 1 .原始值与引用值 Undefined.Null.Boolean.Number. String和Symbol.保存原始值的变量是按值(by value)访问的 引用值是保存在内存 ...

  9. 一行代码解决JS数字大于2^53精度错误的问题

    服务端使用长整型(Int64)的数字,在浏览器端使用JS的number类型接收时,当这个实际值超过 (2^53-1)时,JS变量的值和实际值就会出现不相等的问题.常见场景比如使用雪花算法生成Id. 在 ...

  10. Python分支结构你真的搞定了吗?

    分支结构 分支结构能够让计算机像人一样进行思考,应对不同的场景做出不同的回应. Python中不支持switch语法,目前仅支持if/else形式,但是在Python3.10的测试版本中,貌似支持了s ...