In TensorFlow 1.8 and later, tensorflow.contrib ships a TensorRT component (TF-TRT). Its purpose is to let you load a pb file and have TensorRT compress the subgraphs it can handle, while the subgraphs it cannot handle are still executed by TensorFlow. This is a different way of working from exporting a pb file and then processing it separately with standalone TensorRT tools.
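As a minimal sketch of that workflow (assuming the contrib API bundled with TF 1.12; the pb path and output-node name here are placeholders, the full script below shows the real usage):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load a frozen graph ("frozen_graph.pb" is a placeholder path).
with tf.gfile.GFile('frozen_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# TF-TRT rewrites the supported subgraphs into TRTEngineOp nodes; everything
# else in the returned graph is still executed by TensorFlow.
trt_graph = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=['resnet_v1_50/predictions/Reshape_1'],
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode='FP32')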

TensorRT changes quite a lot between versions. This post is based on the NVIDIA article tensorrt-integration-speeds-tensorflow-inference; at the time of writing, TensorFlow 1.12 bundles TensorRT 4.0.1. If you want to use TensorRT 5.0, you are better off using TensorRT on its own.

Hardware and software environment:

  • TensorRT-4.0.1.6.Ubuntu-14.04.5.x86_64-gnu.cuda-9.0.cudnn7.1.tar.gz;
  • tensorflow-gpu 1.12.0;
  • CentOS 7.3

Below is my modified code. On a P40 card FP16 brings no speedup because the card does not support it; in my tests INT8 ran about twice as fast as FP32.

# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

r"""TF-TensorRT integration sample script.

1 - Specify the fraction of GPU memory allowed for TensorFlow. TensorRT can use
    the remaining memory.
2 - Let TensorRT analyze the TensorFlow graph, apply optimizations, and replace
    subgraphs with TensorRT nodes.
"""

import os
import sys
import time
import json
import os.path as osp
import argparse, itertools, datetime

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import data_flow_ops
from tensorflow.python.platform import gfile
from tensorflow.python.client import timeline
import tensorflow.contrib.tensorrt as trt

tf.logging.set_verbosity(tf.logging.INFO)


class TF2TensorRT(object):
    '''Load a pb model produced by TensorFlow and process it with TensorRT.'''

    def __init__(self, percent, batch_size, output_nodes):
        '''Use the new per_process_gpu_memory_fraction parameter of the GPUOptions
        function to specify the GPU memory fraction TensorRT can consume. This
        parameter should be set the first time the TensorFlow-TensorRT process
        starts. As an example, 0.67 would allocate 67% of GPU memory for TensorFlow,
        making the remaining 33% available for TensorRT engines.
        '''
        self.batch_size = batch_size
        self.output_nodes = output_nodes
        self.gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=percent)
        self.config = tf.ConfigProto(gpu_options=self.gpu_options)

    def read_pb(self, pb_path, graph, sess):
        '''Read the model from a pb file.'''
        self.pb_path = pb_path
        with graph.as_default():
            with gfile.FastGFile(pb_path, 'rb') as fr:
                graph_def = tf.GraphDef()
                graph_def.ParseFromString(fr.read())
        return graph_def

    def _write_pb(self, trt_graph, precision_mode):
        '''Write the converted model into a new pb file.'''
        dir_path, ext = osp.splitext(self.pb_path)
        newpb_filename = '{}{}{}'.format(dir_path, precision_mode, ext)
        with gfile.FastGFile(newpb_filename, 'wb') as fw:
            fw.write(trt_graph.SerializeToString())
        return newpb_filename

    def create_workspace(self):
        graph = tf.Graph()
        with graph.as_default():
            sess = tf.Session(graph=graph, config=self.config)
        return graph, sess

    def close_workspace(self, *args, sess=None):
        sess.close()

    def get_FPxx(self, graph, graph_def,
                 workspace_size=1 << 30,
                 precision_mode='FP32',
                 dump=True):
        '''You apply TensorRT optimizations to the frozen graph with the new
        create_inference_graph function. TensorRT then takes a frozen TensorFlow
        graph as input and returns an optimized graph with TensorRT nodes.

        You should use the per_process_gpu_memory_fraction and max_workspace_size_bytes
        parameters together for best overall application performance. For example,
        set the per_process_gpu_memory_fraction parameter to (12 - 4) / 12 = 0.67
        and the max_workspace_size_bytes parameter to 4000000000 for a 12GB GPU
        in order to allocate ~4GB for the TensorRT engines.

        TensorRT automatically uses Tensor Cores in Volta GPUs for inference when using
        half-precision arithmetic. The peak performance of Tensor Cores on the NVIDIA
        Tesla V100 is about an order of magnitude (10x) faster than double precision (FP64)
        and about 4 times faster than single precision (FP32). Just use FP16 as value for
        the precision_mode parameter in the create_inference_graph function to enable
        half precision.
        ---
        frozen_graph_def: frozen TensorFlow graph
        output_node_name: list of strings with names of output nodes
            e.g. ["resnet_v1_50/predictions/Reshape_1"]
        max_batch_size: integer, size of input batch e.g. 16
        max_workspace_size_bytes: integer, maximum GPU memory size available for TensorRT
        precision_mode: string, allowed values FP32, FP16 or INT8
        '''
        with graph.as_default():
            trt_graph = trt.create_inference_graph(graph_def, self.output_nodes,
                                                   max_batch_size=self.batch_size,
                                                   max_workspace_size_bytes=workspace_size,
                                                   precision_mode=precision_mode)
        if dump:
            newpb_path = self._write_pb(trt_graph, precision_mode)
        else:
            newpb_path = ''
        return trt_graph, newpb_path

    def get_INT8(self, graph, calib_graph,
                 workspace_size=1 << 30,
                 precision_mode='INT8'):
        '''TensorRT provides capabilities to take models trained in single (FP32) and
        half (FP16) precision and convert them for deployment with INT8 quantization
        while minimizing accuracy loss.

        HOW TO CALIBRATE THE GRAPH WITH INT8?
        To convert models for deployment with INT8, you need to calibrate the trained
        FP32 model before applying TensorRT's optimizations described in the earlier
        sections. The remaining workflow remains unchanged.

        1 - First use the "create_inference_graph" function with the precision_mode
            parameter set to INT8 to calibrate the model. The output of this function
            is a frozen TensorFlow graph ready for calibration.

        2 - Next, execute the calibration graph with calibration data. TensorRT uses
            the distribution of node data to quantize the weights for the nodes. It is
            important to use calibration data that closely reflects the distribution of
            the problem dataset in production. We suggest checking for error accumulation
            during inference when first using models calibrated with INT8.

                trt_graph = trt.create_inference_graph(getNetwork(network_file_name), outputs,
                    max_batch_size=batch_size, max_workspace_size_bytes=workspace_size,
                    precision_mode="INT8")

        3 - After executing the graph on calibration data, apply TensorRT optimizations
            to the calibration graph with the "calib_graph_to_infer_graph" function. This
            function also replaces the TensorFlow subgraph with a TensorRT node optimized
            for INT8. The output of the function is a frozen TensorFlow graph that can be
            used for inference as usual.

                trt_graph = trt.calib_graph_to_infer_graph(calibGraph)

        4 - And that's it! These two commands enable INT8 precision inference with your
            TensorFlow model.
        '''
        with graph.as_default():
            trt_graph = trt.calib_graph_to_infer_graph(calib_graph)
        newpb_path = self._write_pb(trt_graph, precision_mode)
        return trt_graph, newpb_path

    def convert_NHWC2NCHW(self, graph, sess, tensor_input):
        with graph.as_default():
            tensor_output = tf.transpose(tensor_input, perm=(0, 3, 1, 2))
            tensor_output = sess.run(tensor_output)
        return tensor_output

    def read_tensor_from_image_file(self, graph, sess, file_name,
                                    input_height=224, input_width=224,
                                    input_mean=0, input_std=255,
                                    input_name="file_reader",
                                    output_name="normalized"):
        """Read a jpg image file and return a tensor."""
        with graph.as_default():
            file_reader = tf.read_file(file_name, input_name)
            image_reader = tf.image.decode_png(file_reader, channels=3, name='jpg_reader')
            float_caster = tf.cast(image_reader, tf.float32)
            dims_expander = tf.expand_dims(float_caster, 0)
            resized = tf.image.resize_bilinear(dims_expander, [input_height, input_width])
            normalized = tf.divide(tf.subtract(resized, [input_mean]), [input_std])
            normalized_NHWC = sess.run(normalized)
            normalized_NCHW = self.convert_NHWC2NCHW(graph, sess, normalized_NHWC)
        return normalized_NHWC, normalized_NCHW

    def run(self, graph, graph_def, sess, num_loops, tensor_input):
        tf.logging.info('Starting execution')
        with graph.as_default():
            # The following lines must be added; otherwise the TRTEngineOp
            # placement error shown later in this post is raised.
            inc = tf.constant(tensor_input, dtype=tf.float32)
            dataset = tf.data.Dataset.from_tensors(inc)
            dataset = dataset.repeat()
            iterator = dataset.make_one_shot_iterator()
            next_element = iterator.get_next()

            output = tf.import_graph_def(graph_def=graph_def,
                                         input_map={"input": next_element},
                                         return_elements=self.output_nodes)
            # This line is specific to resnet_v1_50; it needs to be changed when
            # loading e.g. InceptionV3.
            output = output[0].outputs[0]

            # Simulated workload: time num_loops runs of the imported graph.
            for i in range(num_loops):
                st = time.time()
                ans = sess.run(output)
                print('the {} run take {} seconds'.format(i, time.time() - st))
        return ans


def topX(arr, X):
    ind = np.argsort(arr)[:, -X:][:, ::-1]
    ind = ind.squeeze()
    return arr[np.arange(np.shape(arr)[0])[:, np.newaxis], ind], ind


def getLabels(labels, ids):
    return [labels[str(x + 1)] for x in ids]


if "__main__" == __name__:
    parser = argparse.ArgumentParser(prog="convert pb model file into uff!")
    parser.add_argument('--FP32', action='store_true')
    parser.add_argument('--FP16', action='store_true')
    parser.add_argument('--INT8', action='store_true')
    parser.add_argument('--native', action='store_true')
    parser.add_argument('--num_loops', type=int, default=20)
    parser.add_argument('--data_dir', type=str, default='./data')
    parser.add_argument('--pb_path', type=str, default='resnetV150_frozen.pb')
    parser.add_argument('--output_nodes', action='append', default=['InceptionV3/Predictions/Reshape_1:0'])
    parser.add_argument('--mem_percent', type=float, default=0.5)
    parser.add_argument('--topN', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=1)
    parser.add_argument('--workspace_size', type=int, default=1 << 10, help="workspace size in MB")

    f, unparsed = parser.parse_known_args()
    batch_size = f.batch_size
    pb_path = f.pb_path
    mem_percent = f.mem_percent
    workspace_size = f.workspace_size

    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    print('===============start==================')
    print("Starting at", datetime.datetime.now())

    output_nodes = f.output_nodes
    output_nodes = ['resnet_v1_50/predictions/Reshape_1']
    print(output_nodes)
    tft = TF2TensorRT(mem_percent, batch_size, output_nodes)

    # For better independence, each branch below deliberately contains redundant
    # code (re-reading the image, closing the session, and so on).
    if f.native:
        print('=== native mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(graph, sess, imageName,
                                                      input_height=224,
                                                      input_width=224,
                                                      input_mean=0,
                                                      input_std=1.0)
        image_input = image_input[0]
        ans = tft.run(graph, graph_def, sess, 2, image_input)
        tft.close_workspace(graph, graph_def, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])

    if f.FP32:
        print('=== FP32 mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        trt_graph_FP32, newpb_path = tft.get_FPxx(graph, graph_def,
                                                  workspace_size=1 << 30,
                                                  precision_mode='FP32')
        tft.close_workspace(graph, graph_def, trt_graph_FP32, sess=sess)

        # Read the converted pb file.
        graph, sess = tft.create_workspace()
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(graph, sess, imageName,
                                                      input_height=224,
                                                      input_width=224,
                                                      input_mean=0,
                                                      input_std=1.0)
        image_input = image_input[0]
        graph_def_FP32 = tft.read_pb(newpb_path, graph, sess)
        ans = tft.run(graph, graph_def_FP32, sess, 2, image_input)
        tft.close_workspace(graph, graph_def_FP32, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])

    if f.FP16:
        print('=== FP16 mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        trt_graph_FP16, newpb_path = tft.get_FPxx(graph, graph_def,
                                                  workspace_size=1 << 30,
                                                  precision_mode='FP16')
        tft.close_workspace(graph, graph_def, trt_graph_FP16, sess=sess)

        # Read the converted pb file.
        graph, sess = tft.create_workspace()
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(graph, sess, imageName,
                                                      input_height=224,
                                                      input_width=224,
                                                      input_mean=0,
                                                      input_std=1.0)
        image_input = image_input[0]
        graph_def_FP16 = tft.read_pb(newpb_path, graph, sess)
        ans = tft.run(graph, graph_def_FP16, sess, 2, image_input)
        tft.close_workspace(graph, graph_def_FP16, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])

    if f.INT8:
        print('=== INT8 mode')
        graph, sess = tft.create_workspace()
        graph_def = tft.read_pb(pb_path, graph, sess)
        # The calibration graph should be fed plenty of production samples.
        print('=========reading the pb file,then creating the calibGraph')
        calibGraph, _ = tft.get_FPxx(graph, graph_def,
                                     workspace_size=1 << 30,
                                     precision_mode='INT8',
                                     dump=False)
        print("==========Running Calibration")
        print('Calibration means running the code below on many production samples; '
              'TensorRT calibrates internally from each layer\'s activations')
        print('Here a single image is run 20 times to simulate the calibration process')
        print('The normal workflow is: 1) change the 20 runs below to 1; '
              '2) loop over many production samples to complete the calibration')
        imageName = 'grace_hopper.jpg'
        image_input = tft.read_tensor_from_image_file(graph, sess, imageName,
                                                      input_height=224,
                                                      input_width=224,
                                                      input_mean=0,
                                                      input_std=1.0)
        image_input = image_input[0]
        ans = tft.run(graph, calibGraph, sess, 20, image_input)
        print('Calibration done, ready to build the final inference model')
        print("=========Creating inference graph")
        int8Graph, newpb_path = tft.get_INT8(graph, calibGraph, workspace_size)
        tft.close_workspace(graph, graph_def, calibGraph, int8Graph, sess=sess)

        # Read the converted pb file.
        graph, sess = tft.create_workspace()
        graph_def_INT8 = tft.read_pb(newpb_path, graph, sess)
        ans = tft.run(graph, graph_def_INT8, sess, 2, image_input)
        tft.close_workspace(graph, graph_def_INT8, sess=sess)
        ans_topX = topX(ans, 1)
        print('the result id is: ', ans_topX[1])

If the input-feeding code above (the tf.data iterator in run()) is omitted, the following error appears; the cause is explained in Visualize Optimized Graph in TensorBoard:

INFO:tensorflow:Starting execution
2019-03-15 05:59:37.410106: E tensorflow/core/common_runtime/executor.cc:623] Executor failed to create kernel. Not found: No registered 'TRTEngineOp' OpKernel for CPU devices compatible with node {{node import/resnet_v1_50/my_trt_op_0}} = TRTEngineOp[InT=[DT_FLOAT], OutT=[DT_FLOAT], cached_engine_batches=[4], calibration_data="", fixed_input_size=true, input_shapes=[[?,3,230,230]], max_cached_engines_count=1, output_shapes=[[?,1000,1,1]], precision_mode="FP32", segment_funcdef_name="resnet_v1_50/my_trt_op_0_native_segment", serialized_segment="8\177\224\...00\000\000", static_engine=true, workspace_size_bytes=2147483648](import/resnet_v1_50/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer)
. Registered: device='GPU' [[{{node import/resnet_v1_50/my_trt_op_0}} = TRTEngineOp[InT=[DT_FLOAT], OutT=[DT_FLOAT], cached_engine_batches=[4], calibration_data="", fixed_input_size=true, input_shapes=[[?,3,230,230]], max_cached_engines_count=1, output_shapes=[[?,1000,1,1]], precision_mode="FP32", segment_funcdef_name="resnet_v1_50/my_trt_op_0_native_segment", serialized_segment="8\177\224\...00\000\000", static_engine=true, workspace_size_bytes=2147483648](import/resnet_v1_50/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer)]]
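The error says no CPU kernel is registered for TRTEngineOp; the workaround used in this post is to import the frozen graph with its input mapped to a tf.data iterator (the lines marked as required in run() above). In outline, assuming graph_def and image_input come from the surrounding script:

# Sketch of the input mapping used in run(); graph_def and image_input are
# produced earlier in the script above.
inc = tf.constant(image_input, dtype=tf.float32)
dataset = tf.data.Dataset.from_tensors(inc).repeat()
next_element = dataset.make_one_shot_iterator().get_next()

output = tf.import_graph_def(graph_def,
                             input_map={"input": next_element},
                             return_elements=['resnet_v1_50/predictions/Reshape_1'])
output_tensor = output[0].outputs[0]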

Below is the log of an INT8 run:

python tf_trt.py  --INT8
===============start==================
Starting at 2019-03-15 07:00:05.756805
['resnet_v1_50/predictions/Reshape_1']
2019-03-15 07:00:05.758165: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-15 07:00:06.554246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:84:00.0
totalMemory: 22.38GiB freeMemory: 22.22GiB
2019-03-15 07:00:06.554439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-15 07:00:07.119839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-15 07:00:07.119905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-15 07:00:07.119921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-15 07:00:07.120522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11459 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
WARNING:tensorflow:From tf_trt.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
=========reading the pb file,then creating the calibGraph
INFO:tensorflow:Running against TensorRT version 4.0.1
2019-03-15 07:00:07.936861: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2019-03-15 07:00:07.938337: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-03-15 07:00:07.939184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-15 07:00:07.939224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-15 07:00:07.939242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-15 07:00:07.939294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-15 07:00:07.939869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11459 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
2019-03-15 07:00:09.016877: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope 'resnet_v1_50/', converted to graph
2019-03-15 07:00:09.016966: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-03-15 07:00:35.699442: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine resnet_v1_50/my_trt_op_0 creation for segment 0, composed of 452 nodes succeeded.
2019-03-15 07:00:36.704760: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-03-15 07:00:36.944306: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-03-15 07:00:37.046735: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-03-15 07:00:37.046820: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 461 nodes (-267), 477 edges (-267), time = 476.292ms.
2019-03-15 07:00:37.046852: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 468 nodes (7), 479 edges (2), time = 127.892ms.
2019-03-15 07:00:37.046865: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 17 nodes (-451), 12 edges (-467), time = 26932.1719ms.
2019-03-15 07:00:37.046877: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 12 nodes (-5), 12 edges (0), time = 114.593ms.
2019-03-15 07:00:37.046889: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 12 nodes (0), 12 edges (0), time = 266.66ms.
2019-03-15 07:00:37.046909: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: resnet_v1_50/my_trt_op_0_native_segment
2019-03-15 07:00:37.046921: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 453 nodes (0), 468 edges (0), time = 282.458ms.
2019-03-15 07:00:37.046941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-03-15 07:00:37.046952: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 453 nodes (0), 468 edges (0), time = 35.437ms.
2019-03-15 07:00:37.046969: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 453 nodes (0), 468 edges (0), time = 204.084ms.
2019-03-15 07:00:37.046984: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 453 nodes (0), 468 edges (0), time = 36.173ms.
==========Running Calibration
INFO:tensorflow:Starting execution
2019-03-15 07:00:43.482560: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:578] Starting calibration thread on device 0, Calibration Resource @ 0x7f794c001850
====take 6.6967267990112305 seconds===
====take 0.011368751525878906 seconds===
====take 0.05899786949157715 seconds===
====take 0.06058168411254883 seconds===
====take 0.060442447662353516 seconds===
====take 0.06051158905029297 seconds===
====take 0.060460805892944336 seconds===
====take 0.060431480407714844 seconds===
====take 0.06432700157165527 seconds===
====take 0.06402254104614258 seconds===
====take 0.06392884254455566 seconds===
====take 0.06446218490600586 seconds===
====take 0.06404638290405273 seconds===
====take 0.0639350414276123 seconds===
====take 0.06392097473144531 seconds===
====take 0.06390523910522461 seconds===
====take 0.06399869918823242 seconds===
====take 0.06429791450500488 seconds===
====take 0.06387209892272949 seconds===
====take 0.06392908096313477 seconds===
=========Creating inference graph
2019-03-15 07:00:48.772447: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:155] Starting Calib Conversion
2019-03-15 07:00:48.845717: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:161] Construction of static int8 engine is not implemented yet!. Dynamic engine will be constructed
==================================================
2019-03-15 07:01:48.746487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-15 07:01:48.746545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-15 07:01:48.746555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-15 07:01:48.746563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-15 07:01:48.747006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11459 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
INFO:tensorflow:Starting execution
2019-03-15 07:01:55.221824: I tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:502] import/resnet_v1_50/my_trt_op_0 Constructing a new engine with batch size 1
====take 48.35376954078674 seconds===
====take 0.0026242733001708984 seconds===
====take 0.002024412155151367 seconds===
====take 0.0019381046295166016 seconds===
====take 0.0018923282623291016 seconds===
====take 0.0019183158874511719 seconds===
====take 0.001911163330078125 seconds===
====take 0.0019626617431640625 seconds===
====take 0.001909494400024414 seconds===
====take 0.001890420913696289 seconds===
====take 0.0018913745880126953 seconds===
====take 0.0019071102142333984 seconds===
====take 0.001940011978149414 seconds===
====take 0.001964569091796875 seconds===
====take 0.0019214153289794922 seconds===
====take 0.0019118785858154297 seconds===
====take 0.0018911361694335938 seconds===
====take 0.00193023681640625 seconds===
====take 0.0019140243530273438 seconds===
====take 0.0019001960754394531 seconds===
==================================================
(array([[0.47768646]], dtype=float32), array([[457]]))

If you run into the error shown below (the screenshot is missing here), it is most likely because the CPU build of TensorFlow was installed instead of the GPU build.
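A quick way to check is to ask TensorFlow itself whether it was built with CUDA and can see a GPU (a minimal sketch, assuming the TF 1.x API used throughout this post):

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_built_with_cuda())   # False -> a CPU-only build is installed
print(tf.test.is_gpu_available())     # False -> no usable GPU device is visible
print([d.name for d in device_lib.list_local_devices()])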
