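Auto-scheduling a Convolution Layer for GPU

This tutorial uses the auto-scheduler on a GPU.

Unlike the template-based autotvm, which relies on manual templates to define the search space, the auto-scheduler does not require any templates. Users only need to write the computation declaration, without any schedule commands or templates. The auto-scheduler can automatically generate a large search space and find a good schedule in that space.

This tutorial uses a convolution layer as the example.

Note that this tutorial will not run as-is on Windows or recent versions of macOS. To get it to run, wrap the body of the tutorial in an if __name__ == "__main__": block.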
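A minimal sketch of that wrapping is given here. The likely reason it is needed is that Windows and recent macOS Python builds start child processes with the "spawn" method, which re-imports the main module, so unguarded top-level code would run again in every child. The function name main and its return value are placeholders, not part of the tutorial.

```python
def main():
    # Placeholder: the body of the tutorial (imports, task creation,
    # tuning, evaluation) would go here.
    return "tuning finished"


if __name__ == "__main__":
    # Processes started with the "spawn" method re-import this module.
    # The guard keeps them from re-running the whole script.
    print(main())
```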

import os

import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python

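Define the computation

First, define the computation of a convolution layer. The function should return the list of input/output tensors. From these tensors, the auto-scheduler can obtain the whole computational graph.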

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    # Input, weight, and bias placeholders in NCHW / OIHW layout
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    # Convolution followed by bias add and ReLU
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]
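To make the declared computation concrete, here is a plain-Python reference for the same convolution + bias + ReLU, written as a sketch that does not use TVM. The helper names conv_out_size and conv2d_bias_relu_ref are made up for illustration, the bias is passed as a flat list with one value per output channel instead of a (1, CO, 1, 1) tensor, and dilation is assumed to be 1.

```python
def conv_out_size(size, kernel, stride, pad):
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * pad - kernel) // stride + 1


def conv2d_bias_relu_ref(data, kernel, bias, stride, pad):
    """Naive NCHW conv2d + bias + ReLU on nested lists.

    data:   [N][CI][H][W]
    kernel: [CO][CI][KH][KW]
    bias:   [CO], one value per output channel (illustrative layout)
    """
    n_batch, ci = len(data), len(data[0])
    h, w = len(data[0][0]), len(data[0][0][0])
    co, kh, kw = len(kernel), len(kernel[0][0]), len(kernel[0][0][0])
    oh = conv_out_size(h, kh, stride, pad)
    ow = conv_out_size(w, kw, stride, pad)
    out = [[[[0.0] * ow for _ in range(oh)] for _ in range(co)] for _ in range(n_batch)]
    for n in range(n_batch):
        for f in range(co):
            for y in range(oh):
                for x in range(ow):
                    acc = 0.0
                    for c in range(ci):
                        for ry in range(kh):
                            for rx in range(kw):
                                iy = y * stride + ry - pad
                                ix = x * stride + rx - pad
                                # Positions outside the input are zero padding.
                                if 0 <= iy < h and 0 <= ix < w:
                                    acc += data[n][c][iy][ix] * kernel[f][c][ry][rx]
                    out[n][f][y][x] = max(acc + bias[f], 0.0)
    return out


# The 7x7 input with a 3x3 kernel, stride 1, padding 1 keeps its spatial size.
print(conv_out_size(7, 3, 1, 1))
```

For the layer tuned in this tutorial (H = W = 7, KH = KW = 3, stride 1, padding 1) the output keeps the 7x7 spatial size, which matches the [1, 512, 7, 7] output buffer in the lowered IR.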

Create the search task

Then create a search task for the last convolution layer in ResNet.

target = tvm.target.Target("cuda")

# Use the last layer in ResNet-50
N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)

# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)

Output:

Computational DAG:
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
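The pad_temp stage in the DAG can be read as a guarded lookup: indices 1 to 7 map back to the original 7x7 input, and everything else is zero. The following plain-Python sketch mirrors that rule for a single channel. The function name pad_temp is borrowed from the DAG, while the nested-list data layout and the sample values are assumptions for illustration.

```python
def pad_temp(data, i2, i3, pad=1):
    """Padded read of a 2-D plane: zero outside, data[i2 - pad][i3 - pad] inside."""
    h, w = len(data), len(data[0])
    if pad <= i2 < h + pad and pad <= i3 < w + pad:
        return data[i2 - pad][i3 - pad]
    return 0.0


# A 7x7 plane holding the values 1..49, padded by 1 on each side to 9x9.
data = [[float(r * 7 + c + 1) for c in range(7)] for r in range(7)]
padded = [[pad_temp(data, i2, i3) for i3 in range(9)] for i2 in range(9)]

for row in padded:
    print(row)
```

The 9x9 padded plane has 81 elements, which is the factor 81 that appears in the floormod and floordiv index expressions of the lowered IR.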

Next, set the parameters for the auto-scheduler. These parameters mainly specify how measurements are made during the search.

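  • measure_ctx launches a separate process for measurement to provide isolation. It protects the main process from GPU crashes during measurement and avoids other runtime conflicts.
  • min_repeat_ms defines the minimum duration of one "repeat" in every measurement. This warms up the GPU and is necessary to get accurate measurement results. Typically, a value >= 300 ms is recommended.
  • num_measure_trials is the number of measurement trials that can be used during the search. Only 10 trials are made in this tutorial for a fast demonstration. In practice, 1000 is a good value for the search to converge. You can do more trials according to your time budget.
  • In addition, RecordToFile is used to dump the measurement records into the file conv2d.json. The measurement records can be used to query the best result in the history, resume the search, and do more analyses later.
  • See auto_scheduler.TuningOptions and auto_scheduler.LocalRPCMeasureContext for more parameters.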
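The idea behind min_repeat_ms can be illustrated without TVM: keep increasing the number of runs in one repeat until the repeat lasts at least the requested time, then report the average per run. This is only a sketch of the concept. The function time_one_repeat and its doubling strategy are assumptions for illustration and are not how TVM's measurement runner is implemented.

```python
import time


def time_one_repeat(func, min_repeat_ms, number=1):
    """Time `number` back-to-back runs of func, doubling `number` until one
    repeat lasts at least min_repeat_ms. Returns (avg_ms_per_run, number)."""
    while True:
        start = time.perf_counter()
        for _ in range(number):
            func()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms >= min_repeat_ms:
            return elapsed_ms / number, number
        # The repeat was too short to be trusted: run more iterations next time.
        number *= 2


avg_ms, number = time_one_repeat(lambda: sum(range(1000)), min_repeat_ms=20)
print(f"{number} runs per repeat, {avg_ms:.4f} ms per run")
```

A longer repeat averages over more runs, so one-off costs such as warm-up weigh less in the reported time. That is the motivation given above for the 300 ms recommendation.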

log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,  # change this to 1000 to achieve the best performance
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

Output:

Get devices for measurement successfully!

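Run the search

Now all inputs are ready. Start the search and let the auto-scheduler do its work. After some measurement trials, the best schedule can be loaded from the log file and applied.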

# Run auto-tuning (search)
task.tune(tune_option)

# Apply the best schedule
sch, args = task.apply_best(log_file)

# Kill the measurement process
del measure_ctx


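The schedule can be lowered to inspect the IR after auto-scheduling. The auto-scheduler correctly performs optimizations including multi-level tiling, cooperative fetching, unrolling, and operator fusion.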
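The multi-level tiling can be sanity-checked with simple arithmetic on the extents that appear in the lowered TIR listed in this tutorial. The sketch below copies those extents by hand (16 blocks, 112 threads, 14 accumulators per thread, 32 outer reduction steps, and the two shared-memory sizes). A different tuning run will generally produce a different schedule, so these numbers apply to this listing only. The function name check_tiling is made up for illustration.

```python
def check_tiling():
    # Workload shape used in this tutorial.
    N, CO, H, W, CI, KH, KW, pad = 1, 512, 7, 7, 512, 3, 3, 1

    # Extents copied by hand from the lowered TIR of this particular run.
    num_blocks = 16          # blockIdx.x extent
    threads_per_block = 112  # threadIdx.x extent
    outputs_per_thread = 14  # size of the per-thread local buffer compute_3
    rc_outer_outer = 32      # extent of the outermost reduction loop
    pad_temp_shared = 1296   # floats allocated for pad_temp.shared
    kernel_shared = 4608     # floats allocated for kernel.shared

    # blocks x threads x per-thread outputs matches the number of output elements.
    assert num_blocks * threads_per_block * outputs_per_thread == N * CO * H * W

    ci_per_step = CI // rc_outer_outer    # input channels staged per outer step
    co_per_block = CO // num_blocks       # output channels handled by one block
    padded_plane = (H + 2 * pad) * (W + 2 * pad)

    # Shared-memory tiles hold exactly one outer step's worth of data.
    assert pad_temp_shared == ci_per_step * padded_plane
    assert kernel_shared == co_per_block * ci_per_step * KH * KW
    return {
        "ci_per_step": ci_per_step,
        "co_per_block": co_per_block,
        "padded_plane": padded_plane,
    }


print(check_tiling())
```

Read this way, each block stages 16 input channels of the padded 9x9 plane and the matching kernel slice for 32 output channels in shared memory, which is the cooperative fetching mentioned above.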

print("Lowered TIR:")
print(tvm.lower(sch, args, simple_mode=True))

Output:

Lowered TIR:

primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()

attr = {"global_symbol": "main", "tir.noalias": True}

buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),

kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], []),

bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),

data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], [])}

buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {

attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 16;

attr [compute_3: Pointer(float32)] "storage_scope" = "local";

allocate(compute_3, float32, [14]);

attr [pad_temp.shared: Pointer(float32)] "storage_scope" = "shared";

allocate(pad_temp.shared, float32, [1296]);

attr [kernel.shared: Pointer(float32)] "storage_scope" = "shared";

allocate(kernel.shared, float32, [4608]);

attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

compute_3[0] = 0f32

compute_3[7] = 0f32

compute_3[1] = 0f32

compute_3[8] = 0f32

compute_3[2] = 0f32

compute_3[9] = 0f32

compute_3[3] = 0f32

compute_3[10] = 0f32

compute_3[4] = 0f32

compute_3[11] = 0f32

compute_3[5] = 0f32

compute_3[12] = 0f32

compute_3[6] = 0f32

compute_3[13] = 0f32

for (rc.outer.outer: int32, 0, 32) {

attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[threadIdx.x_1] = @tir.if_then_else(((((9 <= floormod(threadIdx.x_1, 81)) && (floormod(threadIdx.x_1, 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(threadIdx.x_1, 81)*49)) + (floordiv(floormod(threadIdx.x_1, 81), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 112)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 31), 81)) && (floormod((threadIdx.x_1 + 31), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 112), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 31), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 224)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 62), 81)) && (floormod((threadIdx.x_1 + 62), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 224), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 62), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 336)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 12), 81)) && (floormod((threadIdx.x_1 + 12), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 336), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 12), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 448)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 43), 81)) && (floormod((threadIdx.x_1 + 43), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 448), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 43), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 560)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 74), 81)) && (floormod((threadIdx.x_1 + 74), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 560), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 74), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 672)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 24), 81)) && (floormod((threadIdx.x_1 + 24), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 672), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 24), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 55), 81)) && (floormod((threadIdx.x_1 + 55), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 784), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 55), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 896)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 5), 81)) && (floormod((threadIdx.x_1 + 5), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 896), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 5), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 1008)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 36), 81)) && (floormod((threadIdx.x_1 + 36), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 1008), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 36), 81), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

pad_temp.shared[(threadIdx.x_1 + 1120)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 67), 81)) && (floormod((threadIdx.x_1 + 67), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 1120), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 67), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)

attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;

if @tir.likely((threadIdx.x_1 < 64), dtype=bool) {

pad_temp.shared[(threadIdx.x_1 + 1232)] = @tir.if_then_else((((floormod((threadIdx.x_1 + 17), 81) < 72) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1 + 1232), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 17), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)

}

attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[(threadIdx.x_2*4)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 36)*4608)) + (rc.outer.outer*144)) + (floormod(threadIdx.x_2, 36)*4))]

kernel.shared[((threadIdx.x_2*4) + 1)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 1), 144))]

kernel.shared[((threadIdx.x_2*4) + 2)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 2), 144))]

kernel.shared[((threadIdx.x_2*4) + 3)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 3), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 448)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 448), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 16), 144))]

kernel.shared[((threadIdx.x_2*4) + 449)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 449), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 17), 144))]

kernel.shared[((threadIdx.x_2*4) + 450)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 450), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 18), 144))]

kernel.shared[((threadIdx.x_2*4) + 451)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 451), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 19), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 896)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 896), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 32), 144))]

kernel.shared[((threadIdx.x_2*4) + 897)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 897), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 33), 144))]

kernel.shared[((threadIdx.x_2*4) + 898)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 898), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 34), 144))]

kernel.shared[((threadIdx.x_2*4) + 899)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 899), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 35), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 1344)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1344), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 48), 144))]

kernel.shared[((threadIdx.x_2*4) + 1345)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1345), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 49), 144))]

kernel.shared[((threadIdx.x_2*4) + 1346)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1346), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 50), 144))]

kernel.shared[((threadIdx.x_2*4) + 1347)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1347), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 51), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 1792)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1792), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 64), 144))]

kernel.shared[((threadIdx.x_2*4) + 1793)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1793), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 65), 144))]

kernel.shared[((threadIdx.x_2*4) + 1794)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1794), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 66), 144))]

kernel.shared[((threadIdx.x_2*4) + 1795)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 1795), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 67), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 2240)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2240), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 80), 144))]

kernel.shared[((threadIdx.x_2*4) + 2241)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2241), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 81), 144))]

kernel.shared[((threadIdx.x_2*4) + 2242)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2242), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 82), 144))]

kernel.shared[((threadIdx.x_2*4) + 2243)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2243), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 83), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 2688)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2688), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 96), 144))]

kernel.shared[((threadIdx.x_2*4) + 2689)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2689), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 97), 144))]

kernel.shared[((threadIdx.x_2*4) + 2690)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2690), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 98), 144))]

kernel.shared[((threadIdx.x_2*4) + 2691)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 2691), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 99), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 3136)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3136), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 112), 144))]

kernel.shared[((threadIdx.x_2*4) + 3137)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3137), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 113), 144))]

kernel.shared[((threadIdx.x_2*4) + 3138)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3138), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 114), 144))]

kernel.shared[((threadIdx.x_2*4) + 3139)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3139), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 115), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 3584)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3584), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 128), 144))]

kernel.shared[((threadIdx.x_2*4) + 3585)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3585), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 129), 144))]

kernel.shared[((threadIdx.x_2*4) + 3586)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3586), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 130), 144))]

kernel.shared[((threadIdx.x_2*4) + 3587)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 3587), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 131), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

kernel.shared[((threadIdx.x_2*4) + 4032)] = (float32*)kernel_2[(((((blockIdx.x*147456) + (floordiv((threadIdx.x_2*4), 144)*4608)) + (rc.outer.outer*144)) + (floormod(threadIdx.x_2, 36)*4)) + 129024)]

kernel.shared[((threadIdx.x_2*4) + 4033)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 4033), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 1), 144))]

kernel.shared[((threadIdx.x_2*4) + 4034)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 4034), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 2), 144))]

kernel.shared[((threadIdx.x_2*4) + 4035)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 4035), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 3), 144))]

}

attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {

if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {

kernel.shared[((threadIdx.x_2*4) + 4480)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 4480), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 16), 144))]

}

if @tir.likely(((threadIdx.x_2*4) < 127), dtype=bool) {

if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {

kernel.shared[((threadIdx.x_2*4) + 4481)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 4481), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 17), 144))]

}

}

if @tir.likely(((threadIdx.x_2*4) < 126), dtype=bool) {

if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {

kernel.shared[((threadIdx.x_2*4) + 4482)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 4482), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 18), 144))]

}

}

if @tir.likely(((threadIdx.x_2*4) < 125), dtype=bool) {

if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {

kernel.shared[((threadIdx.x_2*4) + 4483)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(((threadIdx.x_2*4) + 4483), 144)*4608)) + (rc.outer.outer*144)) + floormod(((threadIdx.x_2*4) + 19), 144))]

}

}

}

for (rc.outer.inner: int32, 0, 4) {

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9))]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36))]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9))]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2304)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36))]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 1)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2304)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36))]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 2)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2304)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36))]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 3)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2304)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36))]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 4)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2304)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36))]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 5)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2304)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36))]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 6)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2304)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 1)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 1)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 1)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2305)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 2)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 1)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 2)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2305)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 3)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 1)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 3)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2305)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 4)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 1)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 4)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2305)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 5)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 1)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 5)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2305)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 6)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 1)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 6)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2305)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 7)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 1)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 7)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2305)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 2)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 2)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2306)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 3)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 3)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2306)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 4)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 4)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2306)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 5)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 5)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2306)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 6)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 6)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2306)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 7)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 7)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2306)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 8)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 8)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2306)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 81)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 9)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 81)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2313)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 82)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 9)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 82)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2313)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 83)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 9)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 83)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2313)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 84)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 9)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 84)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2313)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 85)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 9)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 85)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2313)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 86)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 9)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 86)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2313)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 87)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 9)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 87)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2313)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 82)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 10)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 82)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2314)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 83)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 10)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 83)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2314)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 84)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 10)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 84)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2314)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 85)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 10)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 85)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2314)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 86)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 10)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 86)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2314)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 87)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 10)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 87)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2314)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 88)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 10)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 88)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2314)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 83)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 11)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 83)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2315)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 84)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 11)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 84)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2315)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 85)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 11)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 85)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2315)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 86)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 11)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 86)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2315)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 87)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 11)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 87)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2315)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 88)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 11)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 88)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2315)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 89)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 11)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 89)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2315)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 162)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 18)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 162)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2322)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 163)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 18)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 163)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2322)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 164)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 18)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 164)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2322)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 165)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 18)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 165)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2322)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 166)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 18)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 166)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2322)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 167)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 18)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 167)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2322)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 168)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 18)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 168)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2322)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 163)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 19)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 163)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2323)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 164)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 19)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 164)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2323)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 165)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 19)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 165)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2323)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 166)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 19)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 166)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2323)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 167)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 19)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 167)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2323)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 168)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 19)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 168)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2323)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 169)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 19)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 169)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2323)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 164)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 20)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 164)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2324)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 165)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 20)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 165)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2324)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 166)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 20)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 166)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2324)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 167)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 20)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 167)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2324)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 168)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 20)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 168)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2324)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 169)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 20)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 169)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2324)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 170)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 20)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 170)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2324)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 243)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 27)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 243)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2331)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 244)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 27)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 244)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2331)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 245)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 27)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 245)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2331)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 246)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 27)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 246)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2331)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 247)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 27)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 247)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2331)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 248)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 27)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 248)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2331)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 249)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 27)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 249)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2331)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 244)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 28)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 244)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2332)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 245)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 28)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 245)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2332)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 246)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 28)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 246)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2332)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 247)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 28)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 247)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2332)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 248)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 28)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 248)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2332)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 249)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 28)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 249)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2332)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 250)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 28)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 250)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2332)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 245)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 29)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 245)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2333)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 246)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 29)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 246)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2333)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 247)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 29)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 247)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2333)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 248)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 29)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 248)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2333)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 249)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 29)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 249)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2333)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 250)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 29)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 250)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2333)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 251)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 29)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 251)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2333)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 9)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 3)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 9)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2307)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 10)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 3)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 10)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2307)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 11)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 3)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 11)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2307)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 12)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 3)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 12)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2307)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 13)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 3)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 13)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2307)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 14)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 3)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 14)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2307)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 15)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 3)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 15)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2307)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 10)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 4)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 10)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2308)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 11)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 4)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 11)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2308)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 12)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 4)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 12)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2308)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 13)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 4)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 13)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2308)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 14)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 4)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 14)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2308)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 15)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 4)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 15)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2308)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 16)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 4)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 16)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2308)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 11)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 5)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 11)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2309)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 12)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 5)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 12)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2309)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 13)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 5)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 13)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2309)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 14)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 5)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 14)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2309)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 15)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 5)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 15)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2309)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 16)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 5)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 16)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2309)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 17)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 5)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 17)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2309)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 90)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 12)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 90)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2316)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 91)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 12)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 91)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2316)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 92)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 12)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 92)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2316)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 93)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 12)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 93)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2316)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 94)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 12)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 94)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2316)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 95)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 12)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 95)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2316)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 96)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 12)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 96)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2316)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 91)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 13)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 91)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2317)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 92)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 13)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 92)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2317)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 93)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 13)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 93)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2317)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 94)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 13)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 94)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2317)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 95)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 13)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 95)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2317)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 96)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 13)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 96)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2317)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 97)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 13)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 97)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2317)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 92)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 14)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 92)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2318)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 93)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 14)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 93)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2318)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 94)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 14)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 94)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2318)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 95)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 14)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 95)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2318)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 96)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 14)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 96)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2318)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 97)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 14)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 97)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2318)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 98)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 14)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 98)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2318)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 171)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 21)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 171)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2325)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 172)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 21)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 172)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2325)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 173)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 21)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 173)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2325)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 174)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 21)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 174)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2325)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 175)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 21)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 175)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2325)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 176)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 21)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 176)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2325)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 177)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 21)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 177)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2325)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 172)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 22)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 172)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2326)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 173)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 22)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 173)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2326)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 174)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 22)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 174)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2326)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 175)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 22)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 175)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2326)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 176)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 22)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 176)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2326)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 177)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 22)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 177)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2326)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 178)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 22)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 178)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2326)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 173)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 23)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 173)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2327)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 174)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 23)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 174)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2327)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 175)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 23)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 175)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2327)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 176)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 23)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 176)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2327)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 177)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 23)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 177)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2327)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 178)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 23)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 178)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2327)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 179)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 23)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 179)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2327)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 252)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 30)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 252)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2334)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 253)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 30)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 253)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2334)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 254)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 30)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 254)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2334)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 255)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 30)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 255)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2334)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 256)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 30)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 256)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2334)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 257)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 30)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 257)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2334)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 258)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 30)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 258)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2334)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 253)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 31)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 253)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2335)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 254)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 31)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 254)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2335)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 255)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 31)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 255)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2335)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 256)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 31)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 256)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2335)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 257)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 31)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 257)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2335)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 258)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 31)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 258)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2335)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 259)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 31)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 259)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2335)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 254)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 32)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 254)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2336)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 255)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 32)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 255)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2336)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 256)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 32)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 256)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2336)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 257)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 32)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 257)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2336)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 258)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 32)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 258)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2336)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 259)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 32)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 259)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2336)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 260)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 32)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 260)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2336)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 18)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 6)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 18)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2310)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 19)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 6)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 19)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2310)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 20)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 6)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 20)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2310)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 21)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 6)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 21)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2310)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 22)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 6)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 22)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2310)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 6)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2310)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 6)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2310)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 19)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 7)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 19)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2311)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 20)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 7)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 20)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2311)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 21)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 7)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 21)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2311)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 22)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 7)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 22)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2311)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 7)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2311)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 7)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2311)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 25)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 7)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 25)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2311)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 20)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 20)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 21)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 21)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 22)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 22)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 25)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 25)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 26)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 26)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 99)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 99)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 107)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 107)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 180)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 180)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 188)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 188)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 261)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 261)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))

compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))

compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))

compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))

compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))

compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))

compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))

compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))

compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))

compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))

compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))

compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))

compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))

compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 269)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))

compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 269)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))

}

}

for (i3.inner: int32, 0, 7) {

compute_2[(((blockIdx.x*1568) + (threadIdx.x*7)) + i3.inner)] = max(((float32*)compute_3[i3.inner] + (float32*)bias_2[((blockIdx.x*32) + floordiv(threadIdx.x, 7))]), 0f32)

compute_2[((((blockIdx.x*1568) + (threadIdx.x*7)) + i3.inner) + 784)] = max(((float32*)compute_3[(i3.inner + 7)] + (float32*)bias_2[(((blockIdx.x*32) + floordiv(threadIdx.x, 7)) + 16)]), 0f32)

}

}

}
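The epilogue of the lowered code above writes two output elements per loop iteration, one at a flat index and one 784 elements later. As a rough sanity check of that index arithmetic, the pure-Python sketch below replays the two stores and confirms that they cover the whole output tensor. The launch dimensions (16 blocks of 112 threads) are an assumption derived from the split factors in the schedule, and `output_writes` is a hypothetical helper name, not part of TVM.

```python
# Index arithmetic copied from the epilogue of the lowered code above.
# Assumption: 16 blocks of 112 threads, as implied by the schedule's split factors.
NUM_BLOCKS = 16
NUM_THREADS = 112
CO, H, W = 512, 7, 7  # output shape is (1, CO, H, W) in NCHW layout


def output_writes():
    """Yield (flat_output_index, bias_index) for every store in the epilogue."""
    for block in range(NUM_BLOCKS):
        for thread in range(NUM_THREADS):
            for i3_inner in range(7):
                base = block * 1568 + thread * 7 + i3_inner
                bias = block * 32 + thread // 7
                yield base, bias              # first store
                yield base + 784, bias + 16   # second store, 16 channels further on


writes = list(output_writes())
indices = sorted(idx for idx, _ in writes)

# Every element of the output is written exactly once.
covers_output = indices == list(range(CO * H * W))

# With N = 1, the channel of a flat NCHW index is index // (H * W),
# so each store must read the bias of that same channel.
bias_matches_channel = all(idx // (H * W) == bias for idx, bias in writes)

print(covers_output, bias_matches_channel)
```

Under these assumptions, each block produces 32 output channels (1568 = 32 × 49), and each thread handles one 7-element row for two channels that are 16 apart.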

Check correctness and evaluate performance

Build the binary, then check its correctness and measure its performance.

func = tvm.build(sch, args, target)

# Check correctness

data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)

weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)

bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)

conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)

out_np = np.maximum(conv_np + bias_np, 0.0)

ctx = tvm.gpu()

data_tvm = tvm.nd.array(data_np, ctx=ctx)

weight_tvm = tvm.nd.array(weight_np, ctx=ctx)

bias_tvm = tvm.nd.array(bias_np, ctx=ctx)

out_tvm = tvm.nd.empty(out_np.shape, ctx=ctx)

func(data_tvm, weight_tvm, bias_tvm, out_tvm)

# Check results

np.testing.assert_allclose(out_np, out_tvm.asnumpy(), rtol=1e-3)

# Evaluate execution time

evaluator = func.time_evaluator(func.entry_name, ctx, min_repeat_ms=500)

print(

    "Execution time of this operator: %.3f ms"

    % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)

)

Output:

Execution time of this operator: 0.184 ms
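To put the measured time in context, it can be converted into an approximate arithmetic throughput. The sketch below does this for the layer shape used in this article. It assumes that one multiply-accumulate counts as two floating-point operations and ignores the bias addition and the ReLU, so the result is only a rough estimate.

```python
# Layer shape from the search task defined earlier in the article.
N, H, W, CO, CI, KH, KW = 1, 7, 7, 512, 512, 3, 3
stride, padding = 1, 1

# Output spatial size of a strided, padded convolution.
out_h = (H + 2 * padding - KH) // stride + 1
out_w = (W + 2 * padding - KW) // stride + 1

# Assumption: one multiply-accumulate = 2 FLOPs; bias add and ReLU are ignored.
macs = N * CO * out_h * out_w * CI * KH * KW
flops = 2 * macs

measured_ms = 0.184  # execution time reported in the output above
gflops_per_s = flops / (measured_ms * 1e-3) / 1e9

print(f"{flops} FLOPs -> {gflops_per_s:.0f} GFLOP/s")
```

For this layer the estimate works out to about 1.26 TFLOP/s. The figure depends on the GPU used and on the small number of measurement trials in this demonstration, so it should not be read as a benchmark.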

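Using the record file

During the search, all measurement records are dumped into the record file "conv2d.json". The measurement records can be used to re-apply search results, resume the search, and perform other analyses.

The following example loads the best schedule from the file and prints the equivalent Python schedule API calls and the CUDA source code. Both are useful for debugging and for learning how the auto-scheduler behaves.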

print("Equivalent python schedule:")

print(task.print_best(log_file, print_mode="schedule"))

print("CUDA source code:")

print(task.print_best(log_file, print_mode="cuda"))

Output:

Equivalent python schedule:

pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis)

compute_nn, compute_ff, compute_yy, compute_xx, compute_rc, compute_ry, compute_rx = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)

T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis)

compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)

s[T_add].compute_inline()

compute_nn_o_i, compute_nn_i = s[compute].split(compute_nn, factor=1)

compute_nn_o_o_i, compute_nn_o_i = s[compute].split(compute_nn_o_i, factor=1)

compute_nn_o_o_o_i, compute_nn_o_o_i = s[compute].split(compute_nn_o_o_i, factor=1)

compute_nn_o_o_o_o, compute_nn_o_o_o_i = s[compute].split(compute_nn_o_o_o_i, factor=1)

compute_ff_o_i, compute_ff_i = s[compute].split(compute_ff, factor=1)

compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=1)

compute_ff_o_o_o_i, compute_ff_o_o_i = s[compute].split(compute_ff_o_o_i, factor=16)

compute_ff_o_o_o_o, compute_ff_o_o_o_i = s[compute].split(compute_ff_o_o_o_i, factor=2)

compute_yy_o_i, compute_yy_i = s[compute].split(compute_yy, factor=1)

compute_yy_o_o_i, compute_yy_o_i = s[compute].split(compute_yy_o_i, factor=1)

compute_yy_o_o_o_i, compute_yy_o_o_i = s[compute].split(compute_yy_o_o_i, factor=7)

compute_yy_o_o_o_o, compute_yy_o_o_o_i = s[compute].split(compute_yy_o_o_o_i, factor=1)

compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=7)

compute_xx_o_o_i, compute_xx_o_i = s[compute].split(compute_xx_o_i, factor=1)

compute_xx_o_o_o_i, compute_xx_o_o_i = s[compute].split(compute_xx_o_o_i, factor=1)

compute_xx_o_o_o_o, compute_xx_o_o_o_i = s[compute].split(compute_xx_o_o_o_i, factor=1)

compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=4)

compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=4)

compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=1)

compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=3)

compute_rx_o_i, compute_rx_i = s[compute].split(compute_rx, factor=3)

compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=1)

s[compute].reorder(compute_nn_o_o_o_o, compute_ff_o_o_o_o, compute_yy_o_o_o_o, compute_xx_o_o_o_o, compute_nn_o_o_o_i, compute_ff_o_o_o_i, compute_yy_o_o_o_i, compute_xx_o_o_o_i, compute_nn_o_o_i, compute_ff_o_o_i, compute_yy_o_o_i, compute_xx_o_o_i, compute_rc_o_o, compute_ry_o_o, compute_rx_o_o, compute_rc_o_i, compute_ry_o_i, compute_rx_o_i, compute_nn_o_i, compute_ff_o_i, compute_yy_o_i, compute_xx_o_i, compute_rc_i, compute_ry_i, compute_rx_i, compute_nn_i, compute_ff_i, compute_yy_i, compute_xx_i)

compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)

compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)

compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)

compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)

compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=16)

compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=2)

compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)

compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)

compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)

compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)

compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)

compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)

s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)

s[compute].compute_at(s[compute], compute_i3_o_i)

kernel_shared = s.cache_read(kernel, "shared", [compute])

kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis)

s[kernel_shared].compute_at(s[compute], compute_rx_o_o)

pad_temp_shared = s.cache_read(pad_temp, "shared", [compute])

pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis)

s[pad_temp_shared].compute_at(s[compute], compute_rx_o_o)

s[pad_temp].compute_inline()

compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o)

s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis("blockIdx.x"))

compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i)

s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis("vthread"))

compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)

s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))

kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)

kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)

s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)

kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)

s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))

pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)

pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)

s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)

pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)

s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))

s[compute].pragma(compute_nn_o_o_o_o, "auto_unroll_max_step", 1024)

s[compute].pragma(compute_nn_o_o_o_o, "unroll_explicit", True)
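The schedule above is mostly a chain of `split` calls, so the launch configuration is not obvious at a glance. The sketch below replays the split factors of the output stage (the `compute_i0` to `compute_i3` axes) in plain Python and multiplies the extents that are fused and bound to each thread level. `split_extents` is a hypothetical helper written for this illustration, not a TVM function; it assumes that a split with `factor=f` yields an inner loop of length `f` and an outer loop of length `ceil(extent / f)`.

```python
from math import prod


def split_extents(extent, factors):
    """Replay chained split(outer, factor=f) calls; return extents, outermost first."""
    inner_extents = []
    outer = extent
    for f in factors:
        inner_extents.append(f)
        outer = -(-outer // f)  # ceiling division
    return [outer] + inner_extents[::-1]


# Axis extents of the output (N, CO, H, W) and the split factors, in call order,
# as they appear in the printed schedule for compute_i0 .. compute_i3.
axes = {
    "i0": (1, [1, 1, 1]),
    "i1": (512, [1, 16, 2]),
    "i2": (7, [1, 7, 1]),
    "i3": (7, [7, 1, 1]),
}
tiles = {name: split_extents(extent, factors) for name, (extent, factors) in axes.items()}

# The schedule fuses the same split level across all four axes and binds it:
# level 0 -> blockIdx.x, level 1 -> vthread, level 2 -> threadIdx.x, level 3 stays serial.
level_names = ["blockIdx.x", "vthread", "threadIdx.x", "inner"]
levels = {
    name: prod(tiles[axis][i] for axis in axes) for i, name in enumerate(level_names)
}

print(levels)
```

Under these assumptions the kernel runs 16 blocks of 112 threads, and each thread accumulates 2 × 7 = 14 values, which is consistent with the `float compute1[14]` declaration in the CUDA source.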

CUDA source code:

#ifdef _WIN32

using uint = unsigned int;

using uchar = unsigned char;

using ushort = unsigned short;

using int64_t = long long;

using uint64_t = unsigned long long;

#else

#define uint unsigned int

#define uchar unsigned char

#define ushort unsigned short

#define int64_t long

#define uint64_t ulong

#endif

extern "C" __global__ void default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {

float compute1[14];

__shared__ float pad_temp_shared[1296];

__shared__ float kernel_shared[4608];

compute1[(0)] = 0.000000e+00f;

compute1[(7)] = 0.000000e+00f;

compute1[(1)] = 0.000000e+00f;

compute1[(8)] = 0.000000e+00f;

compute1[(2)] = 0.000000e+00f;

compute1[(9)] = 0.000000e+00f;

compute1[(3)] = 0.000000e+00f;

compute1[(10)] = 0.000000e+00f;

compute1[(4)] = 0.000000e+00f;

compute1[(11)] = 0.000000e+00f;

compute1[(5)] = 0.000000e+00f;

compute1[(12)] = 0.000000e+00f;

compute1[(6)] = 0.000000e+00f;

compute1[(13)] = 0.000000e+00f;

for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {

__syncthreads();

pad_temp_shared[(((int)threadIdx.x))] = (((((9 <= (((int)threadIdx.x) % 81)) && ((((int)threadIdx.x) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 81) * 49)) + (((((int)threadIdx.x) % 81) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 112))] = (((((9 <= ((((int)threadIdx.x) + 31) % 81)) && (((((int)threadIdx.x) + 31) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 112) / 81) * 49)) + ((((((int)threadIdx.x) + 31) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 224))] = (((((9 <= ((((int)threadIdx.x) + 62) % 81)) && (((((int)threadIdx.x) + 62) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 224) / 81) * 49)) + ((((((int)threadIdx.x) + 62) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 336))] = (((((9 <= ((((int)threadIdx.x) + 12) % 81)) && (((((int)threadIdx.x) + 12) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 336) / 81) * 49)) + ((((((int)threadIdx.x) + 12) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 448))] = (((((9 <= ((((int)threadIdx.x) + 43) % 81)) && (((((int)threadIdx.x) + 43) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 448) / 81) * 49)) + ((((((int)threadIdx.x) + 43) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 560))] = (((((9 <= ((((int)threadIdx.x) + 74) % 81)) && (((((int)threadIdx.x) + 74) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 560) / 81) * 49)) + ((((((int)threadIdx.x) + 74) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 672))] = (((((9 <= ((((int)threadIdx.x) + 24) % 81)) && (((((int)threadIdx.x) + 24) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 672) / 81) * 49)) + ((((((int)threadIdx.x) + 24) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 784))] = (((((9 <= ((((int)threadIdx.x) + 55) % 81)) && (((((int)threadIdx.x) + 55) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 81) * 49)) + ((((((int)threadIdx.x) + 55) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 896))] = (((((9 <= ((((int)threadIdx.x) + 5) % 81)) && (((((int)threadIdx.x) + 5) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 896) / 81) * 49)) + ((((((int)threadIdx.x) + 5) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 1008))] = (((((9 <= ((((int)threadIdx.x) + 36) % 81)) && (((((int)threadIdx.x) + 36) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1008) / 81) * 49)) + ((((((int)threadIdx.x) + 36) % 81) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8))] : 0.000000e+00f);

pad_temp_shared[((((int)threadIdx.x) + 1120))] = (((((9 <= ((((int)threadIdx.x) + 67) % 81)) && (((((int)threadIdx.x) + 67) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1120) / 81) * 49)) + ((((((int)threadIdx.x) + 67) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8))] : 0.000000e+00f);

if (((int)threadIdx.x) < 64) {

pad_temp_shared[((((int)threadIdx.x) + 1232))] = ((((((int)threadIdx.x) < 55) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1232) / 81) * 49)) + (((((int)threadIdx.x) + 17) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8))] : 0.000000e+00f);

}

kernel_shared[((((int)threadIdx.x) * 4))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 36) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 36) * 4)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 1) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 2) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 3) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 448))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 448) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 16) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 449))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 449) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 17) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 450))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 450) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 18) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 451))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 451) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 19) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 896))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 896) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 32) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 897))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 897) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 33) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 898))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 898) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 34) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 899))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 899) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 35) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1344))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1344) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 48) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1345))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1345) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 49) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1346))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1346) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 50) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1347))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1347) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 51) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1792))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1792) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 64) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1793))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1793) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 65) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1794))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1794) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 66) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 1795))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 1795) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 67) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2240))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2240) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 80) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2241))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2241) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 81) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2242))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2242) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 82) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2243))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2243) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 83) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2688))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2688) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 96) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2689))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2689) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 97) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2690))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2690) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 98) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 2691))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 2691) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 99) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3136))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3136) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 112) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3137))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3137) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 113) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3138))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3138) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 114) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3139))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3139) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 115) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3584))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3584) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 128) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3585))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3585) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 129) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3586))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3586) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 130) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 3587))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 3587) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 131) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 4032))] = kernel[((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 36) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 36) * 4)) + 129024))];

kernel_shared[(((((int)threadIdx.x) * 4) + 4033))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 4033) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 1) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 4034))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 4034) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 2) % 144)))];

kernel_shared[(((((int)threadIdx.x) * 4) + 4035))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 4035) / 144) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) * 4) + 3) % 144)))];

if (((int)threadIdx.x) < 32) {

kernel_shared[(((((int)threadIdx.x) * 4) + 4480))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 4480) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) * 4) + 16)))];

}

if (((int)threadIdx.x) < 32) {

kernel_shared[(((((int)threadIdx.x) * 4) + 4481))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 4481) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) * 4) + 17)))];

}

if (((int)threadIdx.x) < 32) {

kernel_shared[(((((int)threadIdx.x) * 4) + 4482))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 4482) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) * 4) + 18)))];

}

if (((int)threadIdx.x) < 32) {

kernel_shared[(((((int)threadIdx.x) * 4) + 4483))] = kernel[(((((((int)blockIdx.x) * 147456) + ((((((int)threadIdx.x) * 4) + 4483) / 144) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) * 4) + 19)))];

}

__syncthreads();

for (int rc_outer_inner = 0; rc_outer_inner < 4; ++rc_outer_inner) {

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[(((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)))] * kernel_shared[((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[(((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2304))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 1))] * kernel_shared[((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 1))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2304))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 2))] * kernel_shared[((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 2))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2304))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 3))] * kernel_shared[((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 3))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2304))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 4))] * kernel_shared[((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 4))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2304))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 5))] * kernel_shared[((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 5))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2304))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 6))] * kernel_shared[((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 6))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2304))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 1))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 1))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 1))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2305))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 2))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 1))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 2))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2305))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 3))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 1))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 3))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2305))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 4))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 1))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 4))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2305))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 5))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 1))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 5))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2305))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 6))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 1))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 6))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2305))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 7))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 1))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 7))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2305))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 2))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 2))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2306))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 3))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 3))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2306))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 4))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 4))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2306))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 5))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 5))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2306))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 6))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 6))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2306))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 7))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 7))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2306))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 8))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 8))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2306))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 81))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 9))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 81))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2313))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 82))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 9))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 82))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2313))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 83))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 9))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 83))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2313))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 84))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 9))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 84))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2313))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 85))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 9))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 85))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2313))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 86))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 9))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 86))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2313))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 87))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 9))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 87))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2313))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 82))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 10))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 82))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2314))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 83))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 10))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 83))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2314))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 84))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 10))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 84))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2314))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 85))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 10))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 85))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2314))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 86))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 10))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 86))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2314))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 87))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 10))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 87))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2314))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 88))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 10))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 88))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2314))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 83))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 11))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 83))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2315))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 84))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 11))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 84))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2315))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 85))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 11))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 85))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2315))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 86))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 11))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 86))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2315))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 87))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 11))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 87))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2315))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 88))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 11))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 88))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2315))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 89))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 11))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 89))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2315))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 162))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 18))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 162))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2322))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 163))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 18))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 163))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2322))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 164))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 18))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 164))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2322))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 165))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 18))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 165))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2322))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 166))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 18))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 166))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2322))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 167))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 18))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 167))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2322))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 168))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 18))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 168))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2322))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 163))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 19))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 163))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2323))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 164))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 19))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 164))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2323))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 165))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 19))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 165))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2323))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 166))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 19))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 166))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2323))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 167))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 19))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 167))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2323))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 168))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 19))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 168))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2323))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 169))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 19))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 169))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2323))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 164))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 20))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 164))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2324))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 165))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 20))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 165))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2324))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 166))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 20))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 166))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2324))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 167))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 20))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 167))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2324))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 168))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 20))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 168))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2324))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 169))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 20))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 169))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2324))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 170))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 20))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 170))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2324))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 243))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 27))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 243))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2331))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 244))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 27))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 244))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2331))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 245))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 27))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 245))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2331))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 246))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 27))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 246))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2331))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 247))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 27))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 247))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2331))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 248))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 27))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 248))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2331))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 249))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 27))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 249))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2331))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 244))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 28))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 244))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2332))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 245))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 28))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 245))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2332))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 246))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 28))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 246))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2332))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 247))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 28))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 247))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2332))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 248))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 28))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 248))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2332))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 249))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 28))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 249))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2332))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 250))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 28))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 250))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2332))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 245))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 29))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 245))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2333))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 246))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 29))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 246))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2333))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 247))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 29))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 247))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2333))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 248))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 29))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 248))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2333))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 249))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 29))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 249))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2333))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 250))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 29))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 250))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2333))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 251))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 29))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 251))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2333))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 9))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 3))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 9))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2307))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 10))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 3))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 10))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2307))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 11))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 3))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 11))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2307))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 12))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 3))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 12))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2307))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 13))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 3))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 13))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2307))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 14))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 3))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 14))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2307))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 15))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 3))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 15))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2307))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 10))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 4))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 10))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2308))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 11))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 4))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 11))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2308))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 12))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 4))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 12))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2308))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 13))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 4))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 13))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2308))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 14))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 4))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 14))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2308))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 15))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 4))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 15))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2308))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 16))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 4))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 16))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2308))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 11))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 5))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 11))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2309))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 12))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 5))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 12))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2309))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 13))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 5))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 13))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2309))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 14))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 5))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 14))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2309))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 15))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 5))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 15))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2309))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 16))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 5))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 16))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2309))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 17))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 5))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 17))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2309))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 90))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 12))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 90))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2316))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 91))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 12))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 91))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2316))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 92))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 12))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 92))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2316))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 93))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 12))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 93))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2316))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 94))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 12))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 94))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2316))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 95))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 12))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 95))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2316))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 96))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 12))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 96))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2316))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 91))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 13))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 91))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2317))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 92))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 13))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 92))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2317))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 93))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 13))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 93))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2317))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 94))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 13))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 94))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2317))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 95))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 13))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 95))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2317))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 96))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 13))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 96))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2317))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 97))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 13))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 97))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2317))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 92))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 14))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 92))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2318))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 93))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 14))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 93))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2318))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 94))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 14))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 94))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2318))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 95))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 14))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 95))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2318))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 96))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 14))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 96))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2318))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 97))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 14))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 97))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2318))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 98))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 14))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 98))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2318))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 171))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 21))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 171))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2325))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 172))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 21))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 172))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2325))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 173))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 21))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 173))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2325))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 174))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 21))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 174))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2325))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 175))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 21))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 175))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2325))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 176))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 21))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 176))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2325))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 177))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 21))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 177))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2325))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 172))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 22))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 172))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2326))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 173))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 22))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 173))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2326))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 174))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 22))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 174))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2326))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 175))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 22))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 175))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2326))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 176))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 22))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 176))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2326))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 177))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 22))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 177))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2326))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 178))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 22))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 178))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2326))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 173))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 23))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 173))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2327))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 174))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 23))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 174))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2327))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 175))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 23))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 175))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2327))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 176))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 23))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 176))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2327))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 177))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 23))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 177))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2327))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 178))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 23))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 178))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2327))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 179))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 23))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 179))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2327))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 252))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 30))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 252))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2334))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 253))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 30))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 253))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2334))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 254))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 30))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 254))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2334))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 255))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 30))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 255))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2334))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 256))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 30))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 256))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2334))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 257))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 30))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 257))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2334))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 258))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 30))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 258))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2334))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 253))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 31))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 253))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2335))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 254))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 31))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 254))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2335))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 255))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 31))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 255))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2335))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 256))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 31))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 256))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2335))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 257))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 31))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 257))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2335))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 258))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 31))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 258))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2335))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 259))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 31))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 259))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2335))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 254))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 32))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 254))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2336))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 255))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 32))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 255))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2336))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 256))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 32))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 256))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2336))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 257))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 32))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 257))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2336))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 258))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 32))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 258))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2336))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 259))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 32))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 259))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2336))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 260))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 32))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 260))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2336))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 18))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 6))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 18))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2310))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 19))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 6))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 19))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2310))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 20))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 6))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 20))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2310))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 21))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 6))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 21))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2310))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 22))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 6))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 22))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2310))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 23))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 6))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 23))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2310))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 24))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 6))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 24))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2310))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 19))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 7))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 19))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2311))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 20))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 7))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 20))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2311))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 21))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 7))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 21))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2311))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 22))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 7))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 22))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2311))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 23))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 7))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 23))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2311))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 24))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 7))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 24))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2311))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 25))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 7))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 25))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2311))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 20))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 8))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 20))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2312))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 21))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 8))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 21))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2312))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 22))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 8))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 22))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2312))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 23))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 8))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 23))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2312))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 24))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 8))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 24))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2312))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 25))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 8))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 25))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2312))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 26))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 8))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 26))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2312))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 99))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 15))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 99))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2319))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 100))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 15))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 100))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2319))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 101))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 15))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 101))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2319))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 102))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 15))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 102))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2319))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 15))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2319))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 104))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 15))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 104))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2319))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 105))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 15))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 105))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2319))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 100))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 16))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 100))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2320))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 101))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 16))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 101))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2320))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 102))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 16))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 102))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2320))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 16))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2320))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 104))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 16))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 104))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2320))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 105))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 16))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 105))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2320))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 106))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 16))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 106))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2320))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 101))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 101))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 102))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 102))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 104))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 104))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 105))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 105))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 106))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 106))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 107))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 107))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 180))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 24))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 180))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2328))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 181))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 24))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 181))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2328))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 182))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 24))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 182))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2328))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 183))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 24))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 183))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2328))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 184))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 24))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 184))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2328))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 185))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 24))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 185))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2328))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 186))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 24))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 186))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2328))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 181))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 25))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 181))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2329))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 182))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 25))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 182))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2329))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 183))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 25))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 183))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2329))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 184))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 25))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 184))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2329))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 185))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 25))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 185))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2329))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 186))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 25))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 186))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2329))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 187))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 25))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 187))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2329))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 182))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 26))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 182))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2330))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 183))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 26))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 183))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2330))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 184))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 26))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 184))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2330))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 185))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 26))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 185))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2330))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 186))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 26))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 186))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2330))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 187))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 26))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 187))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2330))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 188))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 26))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 188))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2330))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 261))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 33))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 261))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2337))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 262))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 33))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 262))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2337))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 263))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 33))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 263))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2337))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 264))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 33))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 264))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2337))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 265))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 33))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 265))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2337))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 266))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 33))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 266))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2337))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 267))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 33))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 267))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2337))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 262))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 34))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 262))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2338))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 263))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 34))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 263))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2338))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 264))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 34))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 264))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2338))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 265))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 34))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 265))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2338))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 266))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 34))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 266))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2338))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 267))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 34))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 267))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2338))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 268))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 34))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 268))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2338))]));

compute1[(0)] = (compute1[(0)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 263))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 35))]));

compute1[(7)] = (compute1[(7)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 263))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2339))]));

compute1[(1)] = (compute1[(1)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 264))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 35))]));

compute1[(8)] = (compute1[(8)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 264))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2339))]));

compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 265))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 35))]));

compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 265))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2339))]));

compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 266))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 35))]));

compute1[(10)] = (compute1[(10)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 266))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2339))]));

compute1[(4)] = (compute1[(4)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 267))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 35))]));

compute1[(11)] = (compute1[(11)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 267))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2339))]));

compute1[(5)] = (compute1[(5)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 268))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 35))]));

compute1[(12)] = (compute1[(12)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 268))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2339))]));

compute1[(6)] = (compute1[(6)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 269))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 35))]));

compute1[(13)] = (compute1[(13)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 269))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2339))]));

}

}

for (int i3_inner = 0; i3_inner < 7; ++i3_inner) {

compute[((((((int)blockIdx.x) * 1568) + (((int)threadIdx.x) * 7)) + i3_inner))] = max((compute1[(i3_inner)] + bias[(((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 7)))]), 0.000000e+00f);

compute[(((((((int)blockIdx.x) * 1568) + (((int)threadIdx.x) * 7)) + i3_inner) + 784))] = max((compute1[((i3_inner + 7))] + bias[((((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 7)) + 16))]), 0.000000e+00f);

}

}
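The final loop of the kernel above fuses the bias addition and ReLU into the write-back to global memory. Its index arithmetic can be checked with a short Python sketch. The sketch assumes a launch configuration of 16 blocks with 112 threads each; this is inferred from the index expressions, not taken from the listing. The helper names `epilogue_writes` and `check_epilogue` are made up for illustration.

```python
# Assumed launch configuration (inferred from the index arithmetic,
# not shown in this part of the listing): 16 blocks x 112 threads.
NUM_BLOCKS = 16
NUM_THREADS = 112
CO, H, W = 512, 7, 7  # output shape is [1, CO, H, W] in NCHW layout


def epilogue_writes(block, thread):
    """Yield (flat_output_index, bias_index) pairs written by one thread."""
    for i3_inner in range(7):
        out_idx = block * 1568 + thread * 7 + i3_inner
        bias_idx = block * 32 + thread // 7
        yield out_idx, bias_idx             # first group of 16 channels
        yield out_idx + 784, bias_idx + 16  # second group, 16 channels later


def check_epilogue():
    """Count how many times each output element is written."""
    hits = [0] * (CO * H * W)
    for block in range(NUM_BLOCKS):
        for thread in range(NUM_THREADS):
            for out_idx, bias_idx in epilogue_writes(block, thread):
                # With N = 1, the channel of a flat NCHW index is idx // (H * W).
                assert out_idx // (H * W) == bias_idx
                hits[out_idx] += 1
    return hits


hits = check_epilogue()
print(all(h == 1 for h in hits))
```

Under these assumptions each block covers 32 output channels (1568 = 32 × 49 elements), each thread writes one row of 7 pixels for two channels that are 16 apart, and every output element is written exactly once with the bias of its own channel.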

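A more complicated example is to resume the search. In this case, you need to create the search policy and the cost model yourself, and restore their state from the log file. The example below restores the state and runs 5 more trials.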

def resume_search(task, log_file):

    print("Resume search:")

    # Rebuild the cost model from the measurements recorded in the log file

    cost_model = auto_scheduler.XGBModel()

    cost_model.update_from_file(log_file)

    # Preload the already measured states into the search policy

    search_policy = auto_scheduler.SketchPolicy(

        task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]

    )

    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)

    # Run 5 more trials and append the new records to the same log file

    tune_option = auto_scheduler.TuningOptions(

        num_measure_trials=5,

        runner=measure_ctx.runner,

        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],

    )

    task.tune(tune_option, search_policy=search_policy)

    # Kill the measurement process

    del measure_ctx

resume_search(task, log_file)

Output:

Resume search:

Get devices for measurement successfully!
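The resume pattern above (reload earlier measurements from the log, then spend the new trial budget only on candidates that have not been measured) can be illustrated without TVM. The following is a toy, dependency-free sketch, not TVM's implementation: the JSON-lines record layout, `toy_resume`, `load_measured`, and `fake_measure` are all hypothetical and do not correspond to the auto-scheduler's actual log format or API.

```python
import json
import os
import tempfile


def load_measured(log_path):
    """Return {candidate: cost} for every record already in the log."""
    measured = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                record = json.loads(line)
                measured[record["candidate"]] = record["cost"]
    return measured


def toy_resume(log_path, candidates, measure, num_trials):
    """Measure up to num_trials candidates that are not in the log yet."""
    measured = load_measured(log_path)  # restore state from the log
    new_trials = 0
    with open(log_path, "a") as f:
        for cand in candidates:
            if new_trials == num_trials:
                break
            if cand in measured:  # already measured in an earlier run
                continue
            cost = measure(cand)
            measured[cand] = cost
            # Append the new record so a later run can resume again
            f.write(json.dumps({"candidate": cand, "cost": cost}) + "\n")
            new_trials += 1
    return measured


def fake_measure(tile_size):
    """Hypothetical stand-in for a hardware measurement."""
    return abs(tile_size - 16) + 1.0


log_path = os.path.join(tempfile.mkdtemp(), "toy_log.json")
first = toy_resume(log_path, range(1, 33), fake_measure, num_trials=10)
resumed = toy_resume(log_path, range(1, 33), fake_measure, num_trials=5)
print(len(first), len(resumed))
```

The second call starts from the 10 logged records and adds 5 new ones, which mirrors how `num_measure_trials=5` in `resume_search` counts only the additional trials.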

Total running time of the script: (1 minute 37.345 seconds)
