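Auto-scheduling a Convolution Layer for GPU

This is a tutorial on how to use the auto-scheduler for GPUs.

Unlike the template-based autotvm, which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration, without any schedule commands or templates. The auto-scheduler can automatically generate a large search space and find a good schedule in that space.

This tutorial uses a convolution layer as an example.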

Note that this tutorial will not run as-is on Windows or recent versions of macOS. To get it to run there, wrap the body of the tutorial in an if __name__ == "__main__": block.
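
A minimal sketch of that wrapping is shown here. The function name main and its body are placeholders, not part of the tutorial; the real tutorial code would go inside it. The likely reason the guard matters is that these platforms start child processes by re-importing the script, so unguarded top-level code would run again in every child.

```python
def main():
    # Placeholder for the tutorial body (imports, task creation, tuning, ...).
    # It only returns a string here so the sketch runs without TVM or a GPU.
    return "tutorial body finished"


if __name__ == "__main__":
    # Code under this guard runs only when the file is executed as a script,
    # not when the module is imported (for example by a spawned child process).
    print(main())
```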

import os
 
import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python

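Define the computation

To begin with, define the computation of a convolution layer. The function should return the list of input/output tensors. From these tensors, the auto-scheduler can get the whole computational graph.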

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]
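
To make the arithmetic behind conv2d_layer explicit, the following is a naive pure-Python reference for NCHW convolution followed by bias add and ReLU. It is only an illustrative sketch: it works on nested lists, takes the bias as a flat list of length CO rather than a (1, CO, 1, 1) tensor, assumes dilation 1 and symmetric zero padding, and is far too slow for the real layer sizes.

```python
def conv2d_bias_relu(data, kernel, bias, stride=1, padding=1):
    """Naive NCHW conv2d + bias + ReLU on nested lists (reference only)."""
    n, ci, h, w = len(data), len(data[0]), len(data[0][0]), len(data[0][0][0])
    co, kh, kw = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    oh = (h + 2 * padding - kh) // stride + 1
    ow = (w + 2 * padding - kw) // stride + 1
    out = [[[[0.0] * ow for _ in range(oh)] for _ in range(co)] for _ in range(n)]
    for b in range(n):
        for f in range(co):
            for y in range(oh):
                for x in range(ow):
                    acc = bias[f]
                    for c in range(ci):
                        for ry in range(kh):
                            for rx in range(kw):
                                iy = y * stride + ry - padding
                                ix = x * stride + rx - padding
                                # Positions outside the input act as zero padding.
                                if 0 <= iy < h and 0 <= ix < w:
                                    acc += data[b][c][iy][ix] * kernel[f][c][ry][rx]
                    out[b][f][y][x] = max(acc, 0.0)  # ReLU
    return out


# Tiny example: a 1x1x3x3 input of ones, a 3x3 kernel of ones, bias -5.
ones = [[[[1.0] * 3 for _ in range(3)]]]
result = conv2d_bias_relu(ones, ones, [-5.0])
print(result[0][0])
```

With padding 1, a corner position sees 4 valid inputs, an edge sees 6 and the center sees 9, so after adding the bias of -5 and applying ReLU only the edges and the center remain positive.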

Create the search task

We then create a search task for the last convolution layer in ResNet.

target = tvm.target.Target("cuda")
 
# Use the last layer in ResNet-50
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)
 
# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)

Output:

Computational DAG:
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
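
The shapes and bounds in this DAG can be checked with a little arithmetic. The sketch below uses the standard output-size formula for a convolution with dilation 1; the helper name conv2d_output_size is introduced here for illustration only.

```python
def conv2d_output_size(size, kernel, stride, padding):
    """Output extent of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


N, H, W, CO, CI, KH, KW = 1, 7, 7, 512, 512, 3, 3
stride, padding = 1, 1

# pad_temp adds one row/column of zeros on each side: 7 + 2 * 1 = 9.
padded_h = H + 2 * padding
# Rows 1..7 of the padded tensor hold real data, matching (i2 >= 1) && (i2 < 8).
valid_rows = list(range(padding, padding + H))

OH = conv2d_output_size(H, KH, stride, padding)
OW = conv2d_output_size(W, KW, stride, padding)

# One multiply-accumulate per (output element, input channel, kernel tap).
macs = N * CO * OH * OW * CI * KH * KW

print("padded height:", padded_h)
print("output shape:", (N, CO, OH, OW))
print("multiply-accumulates:", macs)
```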

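Next, we set parameters for the auto-scheduler. These parameters mainly specify how measurements are made during the search.

  • measure_ctx launches a separate process for measurement to provide isolation. It protects the main process from GPU crashes during measurement and avoids other runtime conflicts.
  • min_repeat_ms defines the minimum duration of one "repeat" in every measurement. This warms up the GPU and is necessary to get accurate measurement results. Typically, a value >= 300 ms is recommended.
  • num_measure_trials is the number of measurement trials that can be used during the search. This tutorial makes only 10 trials for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can run more trials according to your time budget.
  • In addition, RecordToFile is used to dump measurement records into the file conv2d.json. The measurement records can be used to query the best result in the history, resume the search, and do more analysis later.
  • For more parameters, see auto_scheduler.TuningOptions and auto_scheduler.LocalRPCMeasureContext.
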
log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,  # change this to 1000 to achieve the best performance
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

Output:

Get devices for measurement successfully!
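
To illustrate what a minimum repeat duration means, here is a small CPU-only sketch. It is not how TVM implements min_repeat_ms; the function time_with_min_repeat and the doubling rule are assumptions made for this example. The idea is simply that one "repeat" keeps calling the workload until it has run for at least the requested time, so that a very short kernel is not timed from a single cold launch.

```python
import time


def time_with_min_repeat(fn, min_repeat_ms=300.0):
    """Return (mean_ms_per_call, calls) for one repeat lasting >= min_repeat_ms."""
    number = 1
    while True:
        start = time.perf_counter()
        for _ in range(number):
            fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms >= min_repeat_ms:
            return elapsed_ms / number, number
        number *= 2  # assumed growth rule; a real runner may pick the count differently


def workload():
    # Cheap CPU stand-in for a GPU kernel launch.
    return sum(i * i for i in range(1000))


# A short threshold keeps this demo fast; the tutorial uses 300 ms on the GPU.
mean_ms, calls = time_with_min_repeat(workload, min_repeat_ms=20.0)
print(f"{calls} calls in one repeat, {mean_ms:.4f} ms per call")
```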

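Run the search

All inputs are now ready. We can start the search and let the auto-scheduler do its work. After some measurement trials, the best schedule can be loaded from the log file and applied.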
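
The measure, record, and pick-the-best workflow can be pictured with a toy loop. Everything in this sketch is hypothetical: it uses plain random search over a single made-up "tile" parameter and a made-up cost function, and its JSON lines do not follow the format of conv2d.json. The real auto-scheduler explores a far larger space and measures on hardware.

```python
import json
import os
import random
import tempfile


def toy_cost(tile):
    # Hypothetical cost standing in for a real hardware measurement.
    return abs(tile - 24) + 1.0


def toy_tune(log_path, num_trials=10, seed=0):
    """Try random candidates and append one JSON record per trial."""
    rng = random.Random(seed)
    with open(log_path, "a") as f:
        for _ in range(num_trials):
            tile = rng.randint(1, 64)
            f.write(json.dumps({"tile": tile, "cost": toy_cost(tile)}) + "\n")


def load_best(log_path):
    """Return the lowest-cost record found in the log."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return min(records, key=lambda r: r["cost"]), len(records)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "toy_log.json")
    toy_tune(log_path, num_trials=10, seed=0)
    # Appending to the same file keeps earlier records available to later runs.
    toy_tune(log_path, num_trials=10, seed=1)
    best, num_records = load_best(log_path)

print(num_records, "records, best:", best)
```

Because every trial is appended to the log, the best candidate can be recovered later without repeating any measurement.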

# Run auto-tuning (search)
task.tune(tune_option)
# Apply the best schedule
sch, args = task.apply_best(log_file)
 
# Kill the measurement process
del measure_ctx

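We can lower the schedule to see the IR after auto-scheduling. The auto-scheduler correctly performs optimizations including multi-level tiling, cooperative fetching, unrolling and operator fusion.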
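
Two of these optimizations can be checked by hand against the lowered TIR printed below. The constants in this sketch are read from that particular output (28 blocks, 64 threads, a 14-element local accumulator, and a 3072-element kernel.shared buffer); a different tuning run would most likely produce different numbers. The stride-64 access pattern is inferred from the kernel.shared stores in that output.

```python
# Constants read from the lowered TIR printed in this section.
N, CO, OH, OW = 1, 512, 7, 7
NUM_BLOCKS = 28            # thread_extent of blockIdx.x
THREADS_PER_BLOCK = 64     # thread_extent of threadIdx.x
OUTPUTS_PER_THREAD = 14    # size of the local accumulator compute_3
KERNEL_SHARED_SIZE = 3072  # elements allocated for kernel.shared

# Multi-level tiling: blocks x threads x per-thread accumulators
# should cover every output element exactly once.
total_outputs = N * CO * OH * OW
covered = NUM_BLOCKS * THREADS_PER_BLOCK * OUTPUTS_PER_THREAD

# Cooperative fetching: thread t stores kernel.shared[t + 64 * k],
# so the 64 threads together fill the shared buffer with no overlap.
loads_per_thread = KERNEL_SHARED_SIZE // THREADS_PER_BLOCK
written = sorted(
    t + THREADS_PER_BLOCK * k
    for t in range(THREADS_PER_BLOCK)
    for k in range(loads_per_thread)
)

print("outputs:", total_outputs, "covered:", covered)
print("shared-memory stores per thread:", loads_per_thread)
print("buffer fully written:", written == list(range(KERNEL_SHARED_SIZE)))
```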

print("Lowered TIR:")
print(tvm.lower(sch, args, simple_mode=True))

Output:

Lowered TIR:
primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
             bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
             kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], []),
             data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], [])}
  buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 28;
  attr [compute_3: Pointer(float32)] "storage_scope" = "local";
  allocate(compute_3, float32, [14]);
  attr [pad_temp.shared: Pointer(float32)] "storage_scope" = "shared";
  allocate(pad_temp.shared, float32, [72]);
  attr [kernel.shared: Pointer(float32)] "storage_scope" = "shared";
  allocate(kernel.shared, float32, [3072]);
  attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
    compute_3[0] = 0f32
    compute_3[1] = 0f32
    compute_3[2] = 0f32
    compute_3[3] = 0f32
    compute_3[4] = 0f32
    compute_3[5] = 0f32
    compute_3[6] = 0f32
    compute_3[7] = 0f32
    compute_3[8] = 0f32
    compute_3[9] = 0f32
    compute_3[10] = 0f32
    compute_3[11] = 0f32
    compute_3[12] = 0f32
    compute_3[13] = 0f32
    for (rc.outer.outer: int32, 0, 64) {
      for (ry.outer.outer: int32, 0, 3) {
        attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64 {
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[(threadIdx.x_1*4)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv((threadIdx.x_1*4), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
          }
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 1), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32)
          }
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 2), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0f32, dtype=float32)
          }
          if @tir.likely((threadIdx.x_1 < 18), dtype=bool) {
            pad_temp.shared[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((1 <= (ry.outer.outer + floormod(blockIdx.x, 7))) && ((ry.outer.outer + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((threadIdx.x_1*4) + 3), 9))) && (floormod(((threadIdx.x_1*4) + 3), 9) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(((threadIdx.x_1*4) + 3), 9)*49)) + (ry.outer.outer*7)) + (floormod(blockIdx.x, 7)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)], 0f32, dtype=float32)
          }
        }
        attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[threadIdx.x_2] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 64)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 64), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 128)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 128), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 192)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 36864)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 256)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 256), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 320)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 320), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 384)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 73728)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 448)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 448), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 512)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 512), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 576)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 110592)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 640)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 640), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 704)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 704), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 768)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 147456)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 832)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 832), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 896)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 896), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 960)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 184320)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1024)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1024), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1088)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1088), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1152)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 221184)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1216)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1216), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1280)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1280), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1344)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 258048)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1408)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1408), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1472)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1472), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1536)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 294912)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1600)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1600), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1664)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1664), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1728)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 331776)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1792)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1792), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1856)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1856), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1920)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 368640)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 1984)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 1984), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2048)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2048), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2112)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 405504)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2176)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2176), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2240)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2240), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2304)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 442368)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2368)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2368), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2432)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2432), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2496)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 479232)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2560)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2560), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2624)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2624), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2688)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 516096)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2752)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2752), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2816)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2816), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2880)] = (float32*)kernel_2[(((((((floordiv(blockIdx.x, 7)*589824) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + (ry.outer.outer*3)) + floormod(threadIdx.x_2, 3)) + 552960)]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 2944)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 2944), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 16), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 1), 3))]
        attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
        kernel.shared[(threadIdx.x_2 + 3008)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*589824) + (floordiv((threadIdx.x_2 + 3008), 24)*4608)) + (rc.outer.outer*72)) + (floordiv(floormod((threadIdx.x_2 + 8), 24), 3)*9)) + (ry.outer.outer*3)) + floormod((threadIdx.x_2 + 2), 3))]
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[0]*(float32*)kernel.shared[(threadIdx.x*48)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[9]*(float32*)kernel.shared[((threadIdx.x*48) + 3)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[(threadIdx.x*48)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 3)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[(threadIdx.x*48)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 3)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[(threadIdx.x*48)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 3)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[(threadIdx.x*48)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 3)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[(threadIdx.x*48)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 3)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[(threadIdx.x*48)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 3)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[0]*(float32*)kernel.shared[((threadIdx.x*48) + 24)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[9]*(float32*)kernel.shared[((threadIdx.x*48) + 27)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[((threadIdx.x*48) + 24)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 27)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 24)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 27)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 24)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 27)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 24)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 27)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 24)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 27)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 24)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 27)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[((threadIdx.x*48) + 1)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 4)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 1)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 4)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 1)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 4)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 1)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 4)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 1)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 4)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 1)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 4)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 1)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 4)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[1]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[10]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 25)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 28)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[8]*(float32*)kernel.shared[((threadIdx.x*48) + 2)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[17]*(float32*)kernel.shared[((threadIdx.x*48) + 5)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[2]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[11]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[3]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[12]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[4]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[13]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[5]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[14]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[6]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[15]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[7]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[16]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[8]*(float32*)kernel.shared[((threadIdx.x*48) + 26)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[17]*(float32*)kernel.shared[((threadIdx.x*48) + 29)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[18]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[27]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 6)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 9)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[18]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[27]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 30)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 33)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 7)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 10)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[19]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[28]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 31)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 34)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[26]*(float32*)kernel.shared[((threadIdx.x*48) + 8)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[35]*(float32*)kernel.shared[((threadIdx.x*48) + 11)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[20]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[29]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[21]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[30]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[22]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[31]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[23]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[32]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[24]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[33]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[25]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[34]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[26]*(float32*)kernel.shared[((threadIdx.x*48) + 32)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[35]*(float32*)kernel.shared[((threadIdx.x*48) + 35)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[36]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[45]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 12)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 15)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[36]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[45]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 36)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 39)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 13)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 16)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[37]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[46]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 37)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 40)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[44]*(float32*)kernel.shared[((threadIdx.x*48) + 14)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[53]*(float32*)kernel.shared[((threadIdx.x*48) + 17)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[38]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[47]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[39]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[48]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[40]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[49]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[41]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[50]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[42]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[51]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[43]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[52]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[44]*(float32*)kernel.shared[((threadIdx.x*48) + 38)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[53]*(float32*)kernel.shared[((threadIdx.x*48) + 41)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[54]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[63]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 18)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 21)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[54]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[63]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 42)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 45)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 19)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 22)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[55]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[64]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 43)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 46)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[62]*(float32*)kernel.shared[((threadIdx.x*48) + 20)]))
        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[71]*(float32*)kernel.shared[((threadIdx.x*48) + 23)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[56]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[65]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[57]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
        compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[66]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[58]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
        compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[67]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[59]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
        compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[68]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[60]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
        compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[69]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[61]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
        compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[70]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[62]*(float32*)kernel.shared[((threadIdx.x*48) + 44)]))
        compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[71]*(float32*)kernel.shared[((threadIdx.x*48) + 47)]))
      }
    }
    for (i1.inner: int32, 0, 2) {
      for (i3.inner: int32, 0, 7) {
        compute_2[(((((floordiv(blockIdx.x, 7)*6272) + (threadIdx.x*98)) + (i1.inner*49)) + (floormod(blockIdx.x, 7)*7)) + i3.inner)] = max(((float32*)compute_3[((i1.inner*7) + i3.inner)] + (float32*)bias_2[(((floordiv(blockIdx.x, 7)*128) + (threadIdx.x*2)) + i1.inner)]), 0f32)
      }
    }
  }
}

Check correctness and evaluate performance

Build the binary, then check its correctness and measure its performance.

func = tvm.build(sch, args, target)
 
# Check correctness
data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)
conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)
out_np = np.maximum(conv_np + bias_np, 0.0)
 
ctx = tvm.gpu()
data_tvm = tvm.nd.array(data_np, ctx=ctx)
weight_tvm = tvm.nd.array(weight_np, ctx=ctx)
bias_tvm = tvm.nd.array(bias_np, ctx=ctx)
out_tvm = tvm.nd.empty(out_np.shape, ctx=ctx)
func(data_tvm, weight_tvm, bias_tvm, out_tvm)
 
# Check results
np.testing.assert_allclose(out_np, out_tvm.asnumpy(), rtol=1e-3)
 
# Evaluate execution time
evaluator = func.time_evaluator(func.entry_name, ctx, min_repeat_ms=500)
print(
    "Execution time of this operator: %.3f ms"
    % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
)

Output:

Execution time of this operator: 0.417 ms
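
The correctness check above compares the tuned kernel against a NumPy reference for conv2d + bias + ReLU. As a rough illustration of what such a reference computes, here is a minimal NumPy-only sketch. The helpers `conv2d_nchw_ref` and `conv_bias_relu_ref` are hypothetical names written for this illustration and are not the TVM `conv2d_nchw_python` implementation; the sketch assumes an integer stride, symmetric zero padding, and no dilation, and it uses small channel counts so the naive loops finish quickly.

```python
import numpy as np


def conv2d_nchw_ref(data, kernel, stride=1, padding=1):
    """Naive NCHW convolution (hypothetical helper, for illustration only)."""
    n, ci, h, w = data.shape
    co, ci_k, kh, kw = kernel.shape
    assert ci == ci_k, "input channels of data and kernel must match"
    # Zero-pad only the two spatial dimensions.
    padded = np.pad(data, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    oh = (h + 2 * padding - kh) // stride + 1
    ow = (w + 2 * padding - kw) // stride + 1
    out = np.zeros((n, co, oh, ow), dtype=data.dtype)
    for y in range(oh):
        for x in range(ow):
            # Input window of shape (n, ci, kh, kw) under output position (y, x).
            window = padded[:, :, y * stride : y * stride + kh, x * stride : x * stride + kw]
            # Contract over input channels and the kernel window -> shape (n, co).
            out[:, :, y, x] = np.tensordot(window, kernel, axes=([1, 2, 3], [1, 2, 3]))
    return out


def conv_bias_relu_ref(data, kernel, bias, stride=1, padding=1):
    """Reference for the fused conv2d + bias + ReLU workload."""
    return np.maximum(conv2d_nchw_ref(data, kernel, stride, padding) + bias, 0.0)


# Small shapes for a quick run (the tutorial itself uses 512 channels).
rng = np.random.default_rng(0)
data_small = rng.uniform(size=(1, 4, 7, 7)).astype(np.float32)
kernel_small = rng.uniform(size=(8, 4, 3, 3)).astype(np.float32)
bias_small = rng.uniform(size=(1, 8, 1, 1)).astype(np.float32)
out_small = conv_bias_relu_ref(data_small, kernel_small, bias_small)
print(out_small.shape)  # (1, 8, 7, 7)
```

With a 3x3 kernel, stride 1, and padding 1, the output keeps the 7x7 spatial size of the input, which is why `out_np` and `out_tvm` in the tutorial share the same shape.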

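Using the record file

During the search, all measurement records are dumped into the record file "conv2d.json". The measurement records can be used to re-apply search results, resume the search, and perform other analyses.

Here is an example that loads the best schedule from the file and prints the equivalent Python schedule API calls and the CUDA source code. Both are useful for debugging and for learning how the auto-scheduler behaves.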
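
To make the idea of a measurement log concrete, the following is a toy sketch of an append-only record file with one JSON object per line, from which the best valid measurement can be recovered later. The field names (`config`, `costs`, `error`), the helpers `append_record` and `load_best`, and the file name are all assumptions made up for this illustration; they do not reproduce the actual record schema written by the auto-scheduler.

```python
import json
import os
import tempfile


def append_record(path, config, costs, error=0):
    """Append one measurement as a single JSON line (toy format, not TVM's schema)."""
    with open(path, "a") as f:
        f.write(json.dumps({"config": config, "costs": costs, "error": error}) + "\n")


def load_best(path):
    """Return the valid record with the lowest mean cost, or None if there is none."""
    best = None
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["error"] != 0 or not rec["costs"]:
                continue  # skip failed measurements
            mean_cost = sum(rec["costs"]) / len(rec["costs"])
            if best is None or mean_cost < best[0]:
                best = (mean_cost, rec)
    return None if best is None else best[1]


log_path = os.path.join(tempfile.mkdtemp(), "toy_records.json")
append_record(log_path, {"tile": 4}, [0.9, 1.1])
append_record(log_path, {"tile": 8}, [0.4, 0.6])
append_record(log_path, {"tile": 16}, [], error=1)  # a failed trial
best = load_best(log_path)
print(best["config"])  # {'tile': 8}
```

Because records are only ever appended, a later run can add more trials to the same file and the best entry can be looked up again without repeating earlier measurements.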

print("Equivalent python schedule:")
print(task.print_best(log_file, print_mode="schedule"))
 
print("CUDA source code:")
print(task.print_best(log_file, print_mode="cuda"))

Output:

Equivalent python schedule:
pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis)
compute_nn, compute_ff, compute_yy, compute_xx, compute_rc, compute_ry, compute_rx = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis)
compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
s[T_add].compute_inline()
compute_nn_o_i, compute_nn_i = s[compute].split(compute_nn, factor=1)
compute_nn_o_o_i, compute_nn_o_i = s[compute].split(compute_nn_o_i, factor=1)
compute_nn_o_o_o_i, compute_nn_o_o_i = s[compute].split(compute_nn_o_o_i, factor=1)
compute_nn_o_o_o_o, compute_nn_o_o_o_i = s[compute].split(compute_nn_o_o_o_i, factor=1)
compute_ff_o_i, compute_ff_i = s[compute].split(compute_ff, factor=1)
compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=2)
compute_ff_o_o_o_i, compute_ff_o_o_i = s[compute].split(compute_ff_o_o_i, factor=64)
compute_ff_o_o_o_o, compute_ff_o_o_o_i = s[compute].split(compute_ff_o_o_o_i, factor=1)
compute_yy_o_i, compute_yy_i = s[compute].split(compute_yy, factor=1)
compute_yy_o_o_i, compute_yy_o_i = s[compute].split(compute_yy_o_i, factor=1)
compute_yy_o_o_o_i, compute_yy_o_o_i = s[compute].split(compute_yy_o_o_i, factor=1)
compute_yy_o_o_o_o, compute_yy_o_o_o_i = s[compute].split(compute_yy_o_o_o_i, factor=1)
compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=1)
compute_xx_o_o_i, compute_xx_o_i = s[compute].split(compute_xx_o_i, factor=7)
compute_xx_o_o_o_i, compute_xx_o_o_i = s[compute].split(compute_xx_o_o_i, factor=1)
compute_xx_o_o_o_o, compute_xx_o_o_o_i = s[compute].split(compute_xx_o_o_o_i, factor=1)
compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=2)
compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=4)
compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=1)
compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=1)
compute_rx_o_i, compute_rx_i = s[compute].split(compute_rx, factor=1)
compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=3)
s[compute].reorder(compute_nn_o_o_o_o, compute_ff_o_o_o_o, compute_yy_o_o_o_o, compute_xx_o_o_o_o, compute_nn_o_o_o_i, compute_ff_o_o_o_i, compute_yy_o_o_o_i, compute_xx_o_o_o_i, compute_nn_o_o_i, compute_ff_o_o_i, compute_yy_o_o_i, compute_xx_o_o_i, compute_rc_o_o, compute_ry_o_o, compute_rx_o_o, compute_rc_o_i, compute_ry_o_i, compute_rx_o_i, compute_nn_o_i, compute_ff_o_i, compute_yy_o_i, compute_xx_o_i, compute_rc_i, compute_ry_i, compute_rx_i, compute_nn_i, compute_ff_i, compute_yy_i, compute_xx_i)
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=64)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[compute].compute_at(s[compute], compute_i3_o_i)
kernel_shared = s.cache_read(kernel, "shared", [compute])
kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis)
s[kernel_shared].compute_at(s[compute], compute_rx_o_o)
pad_temp_shared = s.cache_read(pad_temp, "shared", [compute])
pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis)
s[pad_temp_shared].compute_at(s[compute], compute_rx_o_o)
s[pad_temp].compute_inline()
compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o)
s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x"))
compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i)
s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread"))
compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=64)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
s[compute].pragma(compute_nn_o_o_o_o, "auto_unroll_max_step", 512)
s[compute].pragma(compute_nn_o_o_o_o, "unroll_explicit", True)
 
CUDA source code:
extern "C" __global__ void default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
  float compute1[14];
  __shared__ float pad_temp_shared[72];
  __shared__ float kernel_shared[3072];
  compute1[(0)] = 0.000000e+00f;
  compute1[(1)] = 0.000000e+00f;
  compute1[(2)] = 0.000000e+00f;
  compute1[(3)] = 0.000000e+00f;
  compute1[(4)] = 0.000000e+00f;
  compute1[(5)] = 0.000000e+00f;
  compute1[(6)] = 0.000000e+00f;
  compute1[(7)] = 0.000000e+00f;
  compute1[(8)] = 0.000000e+00f;
  compute1[(9)] = 0.000000e+00f;
  compute1[(10)] = 0.000000e+00f;
  compute1[(11)] = 0.000000e+00f;
  compute1[(12)] = 0.000000e+00f;
  compute1[(13)] = 0.000000e+00f;
  for (int rc_outer_outer = 0; rc_outer_outer < 64; ++rc_outer_outer) {
    for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) {
      __syncthreads();
      if (((int)threadIdx.x) < 18) {
        pad_temp_shared[((((int)threadIdx.x) * 4))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= ((((int)threadIdx.x) * 4) % 9))) && (((((int)threadIdx.x) * 4) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + (((((int)threadIdx.x) * 4) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + ((((int)threadIdx.x) * 4) % 9)) - 8))] : 0.000000e+00f);
      }
      if (((int)threadIdx.x) < 18) {
        pad_temp_shared[(((((int)threadIdx.x) * 4) + 1))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 1) % 9))) && ((((((int)threadIdx.x) * 4) + 1) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 1) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 1) % 9)) - 8))] : 0.000000e+00f);
      }
      if (((int)threadIdx.x) < 18) {
        pad_temp_shared[(((((int)threadIdx.x) * 4) + 2))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 2) % 9))) && ((((((int)threadIdx.x) * 4) + 2) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 2) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 2) % 9)) - 8))] : 0.000000e+00f);
      }
      if (((int)threadIdx.x) < 18) {
        pad_temp_shared[(((((int)threadIdx.x) * 4) + 3))] = (((((1 <= (ry_outer_outer + (((int)blockIdx.x) % 7))) && ((ry_outer_outer + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((((int)threadIdx.x) * 4) + 3) % 9))) && ((((((int)threadIdx.x) * 4) + 3) % 9) < 8)) ? data[(((((((rc_outer_outer * 392) + ((((((int)threadIdx.x) * 4) + 3) / 9) * 49)) + (ry_outer_outer * 7)) + ((((int)blockIdx.x) % 7) * 7)) + (((((int)threadIdx.x) * 4) + 3) % 9)) - 8))] : 0.000000e+00f);
      }
      kernel_shared[(((int)threadIdx.x))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 64))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 64) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 128))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 128) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 192))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36864))];
      kernel_shared[((((int)threadIdx.x) + 256))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 256) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 320))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 320) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 384))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 73728))];
      kernel_shared[((((int)threadIdx.x) + 448))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 448) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 512))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 512) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 576))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 110592))];
      kernel_shared[((((int)threadIdx.x) + 640))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 640) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 704))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 704) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 768))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 147456))];
      kernel_shared[((((int)threadIdx.x) + 832))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 832) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 896))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 896) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 960))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 184320))];
      kernel_shared[((((int)threadIdx.x) + 1024))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1024) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1088))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1088) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1152))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 221184))];
      kernel_shared[((((int)threadIdx.x) + 1216))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1216) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1280))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1280) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1344))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 258048))];
      kernel_shared[((((int)threadIdx.x) + 1408))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1408) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1472))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1472) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1536))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 294912))];
      kernel_shared[((((int)threadIdx.x) + 1600))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1600) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1664))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1664) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1728))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 331776))];
      kernel_shared[((((int)threadIdx.x) + 1792))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1792) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1856))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1856) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 1920))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 368640))];
      kernel_shared[((((int)threadIdx.x) + 1984))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 1984) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2048))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2048) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2112))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 405504))];
      kernel_shared[((((int)threadIdx.x) + 2176))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2176) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2240))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2240) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2304))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 442368))];
      kernel_shared[((((int)threadIdx.x) + 2368))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2368) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2432))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2432) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2496))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 479232))];
      kernel_shared[((((int)threadIdx.x) + 2560))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2560) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2624))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2624) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2688))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 516096))];
      kernel_shared[((((int)threadIdx.x) + 2752))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2752) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2816))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2816) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 2880))] = kernel[(((((((((((int)blockIdx.x) / 7) * 589824) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 552960))];
      kernel_shared[((((int)threadIdx.x) + 2944))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 2944) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3)))];
      kernel_shared[((((int)threadIdx.x) + 3008))] = kernel[((((((((((int)blockIdx.x) / 7) * 589824) + (((((int)threadIdx.x) + 3008) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3)))];
      __syncthreads();
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(0)] * kernel_shared[((((int)threadIdx.x) * 48))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(9)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(1)] * kernel_shared[((((int)threadIdx.x) * 48))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(2)] * kernel_shared[((((int)threadIdx.x) * 48))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(3)] * kernel_shared[((((int)threadIdx.x) * 48))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(4)] * kernel_shared[((((int)threadIdx.x) * 48))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(5)] * kernel_shared[((((int)threadIdx.x) * 48))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(6)] * kernel_shared[((((int)threadIdx.x) * 48))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 3))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(0)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(9)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 24))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 27))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 1))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 4))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(1)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(10)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 25))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 28))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(8)] * kernel_shared[(((((int)threadIdx.x) * 48) + 2))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(17)] * kernel_shared[(((((int)threadIdx.x) * 48) + 5))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(2)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(11)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(3)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(12)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(4)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(13)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(5)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(14)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(6)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(15)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(7)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(16)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(8)] * kernel_shared[(((((int)threadIdx.x) * 48) + 26))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(17)] * kernel_shared[(((((int)threadIdx.x) * 48) + 29))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(18)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(27)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 6))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 9))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(18)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(27)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 30))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 33))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 7))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 10))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(19)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(28)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 31))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 34))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(26)] * kernel_shared[(((((int)threadIdx.x) * 48) + 8))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(35)] * kernel_shared[(((((int)threadIdx.x) * 48) + 11))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(20)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(29)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(21)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(30)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(22)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(31)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(23)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(32)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(24)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(33)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(25)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(34)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(26)] * kernel_shared[(((((int)threadIdx.x) * 48) + 32))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(35)] * kernel_shared[(((((int)threadIdx.x) * 48) + 35))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(36)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(45)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 12))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 15))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(36)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(45)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 36))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 39))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 13))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 16))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(37)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(46)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 37))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 40))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(44)] * kernel_shared[(((((int)threadIdx.x) * 48) + 14))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(53)] * kernel_shared[(((((int)threadIdx.x) * 48) + 17))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(38)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(47)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(39)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(48)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(40)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(49)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(41)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(50)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(42)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(51)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(43)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(52)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(44)] * kernel_shared[(((((int)threadIdx.x) * 48) + 38))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(53)] * kernel_shared[(((((int)threadIdx.x) * 48) + 41))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(54)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(63)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 18))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 21))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(54)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(63)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 42))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 45))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 19))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 22))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(55)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(64)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 43))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 46))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
      compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
      compute1[(1)] = (compute1[(1)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
      compute1[(2)] = (compute1[(2)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
      compute1[(3)] = (compute1[(3)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
      compute1[(4)] = (compute1[(4)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
      compute1[(5)] = (compute1[(5)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(62)] * kernel_shared[(((((int)threadIdx.x) * 48) + 20))]));
      compute1[(6)] = (compute1[(6)] + (pad_temp_shared[(71)] * kernel_shared[(((((int)threadIdx.x) * 48) + 23))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(56)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
      compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(65)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(57)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
      compute1[(8)] = (compute1[(8)] + (pad_temp_shared[(66)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(58)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
      compute1[(9)] = (compute1[(9)] + (pad_temp_shared[(67)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(59)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
      compute1[(10)] = (compute1[(10)] + (pad_temp_shared[(68)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(60)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
      compute1[(11)] = (compute1[(11)] + (pad_temp_shared[(69)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(61)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
      compute1[(12)] = (compute1[(12)] + (pad_temp_shared[(70)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(62)] * kernel_shared[(((((int)threadIdx.x) * 48) + 44))]));
      compute1[(13)] = (compute1[(13)] + (pad_temp_shared[(71)] * kernel_shared[(((((int)threadIdx.x) * 48) + 47))]));
    }
  }
  for (int i1_inner = 0; i1_inner < 2; ++i1_inner) {
    for (int i3_inner = 0; i3_inner < 7; ++i3_inner) {
      compute[(((((((((int)blockIdx.x) / 7) * 6272) + (((int)threadIdx.x) * 98)) + (i1_inner * 49)) + ((((int)blockIdx.x) % 7) * 7)) + i3_inner))] = max((compute1[(((i1_inner * 7) + i3_inner))] + bias[(((((((int)blockIdx.x) / 7) * 128) + (((int)threadIdx.x) * 2)) + i1_inner))]), 0.000000e+00f);
    }
  }
}
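
The final store in the kernel above packs the block index, thread index and the two inner loop variables into one flat offset. The sketch below reproduces that index arithmetic in plain Python and checks that every element of the 1 x 512 x 7 x 7 output is written exactly once. The launch configuration (28 blocks of 64 threads) is an assumption inferred from the strides 6272 = 128 * 49 and 98 = 2 * 49; it is not stated in this part of the listing. The helper names are hypothetical.

```python
# Assumed launch configuration, inferred from the index strides in the kernel.
NUM_BLOCKS = 28   # 4 channel groups of 128 channels x 7 output rows
NUM_THREADS = 64  # each thread handles 2 channels of one output row
CO, H, W = 512, 7, 7


def output_index(block, thread, i1_inner, i3_inner):
    """Flat offset into `compute`, copied from the kernel's final store."""
    return ((block // 7) * 6272 + thread * 98 + i1_inner * 49
            + (block % 7) * 7 + i3_inner)


def bias_index(block, thread, i1_inner):
    """Offset into `bias`, which is the output channel being written."""
    return (block // 7) * 128 + thread * 2 + i1_inner


def check_mapping():
    """Return how many distinct output elements the assumed grid writes."""
    seen = set()
    for block in range(NUM_BLOCKS):
        for thread in range(NUM_THREADS):
            for i1_inner in range(2):
                for i3_inner in range(7):
                    idx = output_index(block, thread, i1_inner, i3_inner)
                    channel, rest = divmod(idx, H * W)
                    row, col = divmod(rest, W)
                    # The bias channel must agree with the channel implied by idx.
                    assert channel == bias_index(block, thread, i1_inner)
                    # Each block owns one output row; i3_inner walks the columns.
                    assert (row, col) == (block % 7, i3_inner)
                    assert idx not in seen  # no element is written twice
                    seen.add(idx)
    return len(seen)


covered = check_mapping()
print(covered == CO * H * W)
```

Under these assumptions, each block writes one output row for a group of 128 channels, and each thread writes 2 channels x 7 columns from its private `compute1` accumulator.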

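A more involved example is resuming a search. In this case, you need to create the search policy and cost model yourself and restore their state from the log file. The example below restores that state and runs 5 more trials.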

# Rebuild the cost model and train it on the records already in the log
cost_model = auto_scheduler.XGBModel()
cost_model.update_from_file(log_file)
# Preload the measured states so the policy starts from the previous search
search_policy = auto_scheduler.SketchPolicy(
    task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
)
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=5,  # number of additional trials in this run
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],  # write new records to the same log
)
task.tune(tune_option, search_policy=search_policy)

# Kill the measurement process
del measure_ctx

Output:

Get devices for measurement successfully!
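
One way to confirm that the resumed run added records to the log is to count the records before and after tuning. The sketch below assumes the log holds one JSON record per line, which is how the auto-scheduler log appears to be laid out; treat that as an assumption rather than a documented guarantee. `count_records` is a hypothetical helper, not part of TVM.

```python
import json


def count_records(path):
    """Count the non-empty lines of a log file that parse as JSON.

    Assumption: each measurement record occupies exactly one line.
    """
    count = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises ValueError if the line is corrupted
            count += 1
    return count
```

Calling `count_records(log_file)` before and after `task.tune(...)` should show the count growing by roughly the number of new trials.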

Total running time of the script: (1 minute 34.433 seconds)

https://tvm.apache.org/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py
