2025: A fairly detailed summary of the TensorRT Python API: environment setup, model conversion, and static and dynamic inference


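First, an introduction to TensorRT excerpted from the web:

TensorRT is NVIDIA's acceleration library for its own platforms. It does two main things to make models run faster.

TensorRT supports INT8 and FP16 computation. Deep learning networks are usually trained with 32-bit or 16-bit data. At inference time TensorRT can run the network at a lower precision than that, which speeds up inference.
TensorRT restructures the network, merging operations that can be merged and optimizing for the characteristics of the GPU. Most deep learning frameworks do little GPU-specific performance optimization, so NVIDIA, as the company that makes the GPUs, released TensorRT as the acceleration tool for its own hardware. In an unoptimized model, a convolution layer, a bias layer and a ReLU layer require three separate calls to the corresponding cuDNN APIs, even though the three can be implemented as a single fused operation. TensorRT fuses layers like these wherever it can. A typical Inception block is the usual illustration of this kind of fusion.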
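
To make the fusion idea concrete, here is a minimal pure-Python sketch. It is not TensorRT code: the function names and the 1-D convolution are assumptions chosen for illustration. It computes convolution, bias and ReLU first as three separate passes, each producing an intermediate list, and then as one fused pass that yields the same result.

```python
def conv1d(x, kernel):
    """Valid 1-D cross-correlation: one output per full window."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def add_bias(x, bias):
    return [v + bias for v in x]


def relu(x):
    return [max(v, 0.0) for v in x]


def unfused(x, kernel, bias):
    # Three passes and two intermediate lists, analogous to three kernel launches.
    return relu(add_bias(conv1d(x, kernel), bias))


def fused(x, kernel, bias):
    # One pass: each output is accumulated, biased and clamped before moving on.
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        acc = bias
        for j in range(k):
            acc += x[i + j] * kernel[j]
        out.append(max(acc, 0.0))
    return out


signal = [1.0, -2.0, 3.0, -4.0, 5.0]
weights = [0.5, -1.0, 0.25]
print(unfused(signal, weights, 0.5))
print(fused(signal, weights, 0.5))
```

The fused version never stores the intermediate convolution or bias results, which is roughly where the saving comes from on a GPU: fewer kernel launches and fewer round trips to memory.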

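TensorRT is used to optimize model inference, and it also has a Python API. In my tests, inference through the Python API runs at about the same speed as through C++. This post records, in some detail, how to use the TensorRT Python API, from environment setup to model conversion to inference. INT8 quantization will be covered as well when time permits. I use TensorRT 7.0, which supports inference with dynamic input sizes, so the sections below are split into static and dynamic inference.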

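TensorRT environment setup

Setting up TensorRT is simple: register on the official site, fill in the survey, and download the package. I used TensorRT-7.0.0.11.CentOS-7.6.x86_64-gnu.cuda-9.0.cudnn7.6.tar.gz. Extract it in the directory where you want to keep it, then add the prebuilt libraries under lib to the library path and, just as importantly, set up the CUDA environment.

    tar -zxvf TensorRT-7.0.0.11.CentOS-7.6.x86_64-gnu.cuda-9.0.cudnn7.6.tar.gz
    vim ~/.bashrc
    # Add the lines below; change them to your own TensorRT lib path and CUDA path
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/caidou/A/TensorRT-7.0.0.11/lib
    export C_INCLUDE_PATH=/usr/local/cuda-9.0/include/:${C_INCLUDE_PATH}
    export CPLUS_INCLUDE_PATH=/usr/local/cuda-9.0/include/:${CPLUS_INCLUDE_PATH}
    # Apply the changes
    source ~/.bashrc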


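Then use pip to install the tensorrt wheel that matches your Python version from the python directory of the extracted package, and pip install pycuda. If both imports succeed, the setup is done.

    import tensorrt
    import pycuda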
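
If you would rather check the installation from a script without crashing on a missing package, a small sketch along these lines may help. The helper name find_missing is my own, not part of TensorRT or PyCUDA. Note that it only checks whether Python can locate the packages; a shared library missing from LD_LIBRARY_PATH will still only show up at the actual import.

```python
import importlib.util


def find_missing(modules):
    """Return the names in `modules` that Python cannot locate."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


missing = find_missing(["tensorrt", "pycuda"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("tensorrt and pycuda can both be located")
```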

TensorRT model conversion

There are two main ways to convert a model. The first is to export a model trained in PyTorch, Keras or similar to ONNX and let TensorRT parse the ONNX file directly. This sometimes fails, either because of particular layer operations or because of version differences introduced during the ONNX export. In that case you can rewrite the network with TensorRT's own API and build the TRT model from that. Only the first approach is recorded here, split into conversion with static input sizes and with dynamic input sizes. The official tutorials cover conversion through the API.

Converting to a dynamic-shape TRT model

    import os

    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit
    import common  # common.py from the official TensorRT Python samples


    def build_engine(onnx_file_path, engine_file_path):
        """Takes an ONNX file and creates a TensorRT engine to run inference with"""
        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        with trt.Builder(TRT_LOGGER) as builder, \
                builder.create_network(common.EXPLICIT_BATCH) as network, \
                trt.OnnxParser(network, TRT_LOGGER) as parser:
            builder.max_workspace_size = 1 << 28  # 256MiB
            builder.max_batch_size = 1
            config = builder.create_builder_config()
            config.max_workspace_size = common.GiB(6)
            # Optimization profile: (min, opt, max) shapes for the dynamic input
            profile = builder.create_optimization_profile()
            profile.set_shape("input_1_0", (1, 100, 100, 3), (1, 1024, 1024, 3), (1, 2048, 2048, 3))
            config.add_optimization_profile(profile)
            # Parse model file
            if not os.path.exists(onnx_file_path):
                print('ONNX file {} not found, please run yolov3_to_onnx.py first to generate it.'.format(onnx_file_path))
                exit(0)
            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                if not parser.parse(model.read()):
                    print('ERROR: Failed to parse the ONNX file.')
                    for error in range(parser.num_errors):
                        print(parser.get_error(error))
                    return None
            print('Completed parsing of ONNX file')
            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
            # Dynamic shapes: build with the config that carries the profile
            engine = builder.build_engine(network, config=config)
            print("Completed creating Engine")
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
            return engine


    if __name__ == "__main__":
        onnx_path1 = '/home/caidou/project/trt_python/mode1_1_-1_-1_3.onnx'
        engine_path = '/home/caidou/trt_python/model_1_-1_-1_3.engine'
        build_engine(onnx_path1, engine_path)
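
The three shapes passed to profile.set_shape are the minimum, optimal and maximum input sizes. As a rough sketch of what they imply, the pure-Python helpers below (hypothetical names, no TensorRT needed) check that a profile is internally consistent and whether a given runtime shape falls inside it. As far as I understand, TensorRT rejects a runtime shape outside the min/max range.

```python
def is_dynamic(shape):
    """A dimension of -1 marks an axis whose size is only known at run time."""
    return any(dim == -1 for dim in shape)


def profile_is_consistent(min_shape, opt_shape, max_shape):
    """min <= opt <= max must hold on every axis."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        return False
    return all(lo <= mid <= hi
               for lo, mid, hi in zip(min_shape, opt_shape, max_shape))


def shape_fits_profile(shape, min_shape, max_shape):
    """True if every axis of `shape` lies within [min, max]."""
    if len(shape) != len(min_shape):
        return False
    return all(lo <= dim <= hi
               for dim, lo, hi in zip(shape, min_shape, max_shape))


# The profile used in the build script above.
MIN_SHAPE = (1, 100, 100, 3)
OPT_SHAPE = (1, 1024, 1024, 3)
MAX_SHAPE = (1, 2048, 2048, 3)

print(is_dynamic((1, -1, -1, 3)))
print(profile_is_consistent(MIN_SHAPE, OPT_SHAPE, MAX_SHAPE))
print(shape_fits_profile((1, 512, 768, 3), MIN_SHAPE, MAX_SHAPE))
print(shape_fits_profile((1, 4096, 4096, 3), MIN_SHAPE, MAX_SHAPE))
```

Running a check like shape_fits_profile before calling the engine gives a clearer error message than a failed inference call.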

The common module imported here is common.py from the official TensorRT Python samples.
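
If you do not have the samples at hand, the workspace sizes in the script are easy to reproduce. The GiB helper below follows my recollection of the one in the samples' common.py, so treat it as an assumption; MiB is a hypothetical companion that I added for comparison.

```python
def GiB(val):
    """Gibibytes to bytes (assumed to match the helper in the samples' common.py)."""
    return val * (1 << 30)


def MiB(val):
    """Mebibytes to bytes (hypothetical helper, added for comparison)."""
    return val * (1 << 20)


# builder.max_workspace_size = 1 << 28 in the script is 256 MiB.
print((1 << 28) == MiB(256))
# config.max_workspace_size = common.GiB(6) is this many bytes.
print(GiB(6))
```

The script sets a workspace size on both the builder and the config. As far as I understand, when the engine is built with a config, the value on the config is the one that applies.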

Converting to a static-shape TRT model

    import os

    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit
    import common  # common.py from the official TensorRT Python samples


    def build_engine(onnx_file_path, engine_file_path):
        """Takes an ONNX file and creates a TensorRT engine to run inference with"""
        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        with trt.Builder(TRT_LOGGER) as builder, \
                builder.create_network(common.EXPLICIT_BATCH) as network, \
                trt.OnnxParser(network, TRT_LOGGER) as parser:
            builder.max_workspace_size = 1 << 28  # 256MiB
            builder.max_batch_size = 1
            # Parse model file
            if not os.path.exists(onnx_file_path):
                print('ONNX file {} not found, please run yolov3_to_onnx.py first to generate it.'.format(onnx_file_path))
                exit(0)
            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                if not parser.parse(model.read()):
                    print('ERROR: Failed to parse the ONNX file.')
                    for error in range(parser.num_errors):
                        print(parser.get_error(error))
                    return None
            print('Completed parsing of ONNX file')
            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
            # Static shapes: no builder config or optimization profile is needed
            engine = builder.build_cuda_engine(network)
            print("Completed creating Engine")
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
            return engine


    if __name__ == "__main__":
        onnx_path1 = '/home/caidou/project/trt_python/model4_256_256.onnx'
        engine_path = '/home/caidou/project/trt_python/model4_256_256.engine'
        build_engine(onnx_path1, engine_path)

The static version differs only in that it does not set a shape range or the other builder config options. Pay attention to which API builds the engine: the dynamic script calls builder.build_engine(network, config), while the static script calls builder.build_cuda_engine(network). Using the wrong one raises an error.

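TensorRT model inference

Inference is again split into dynamic-shape and fixed-shape cases. For dynamic inference there is plenty of material on the C++ API but little on the Python API. For fixed-shape inference the official samples include demos for both asynchronous and synchronous execution, although in my own measurements the speed difference between the two was very small, for reasons I have not worked out.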
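
When comparing synchronous and asynchronous execution, it helps to time both the same way. Below is a minimal timing sketch under my own assumptions: benchmark and fake_infer are hypothetical names, and fake_infer is a CPU placeholder for a real inference call. For the asynchronous path, the timed callable has to include the stream synchronization, otherwise only the launch is measured.

```python
import time


def benchmark(infer, warmup=10, iterations=100):
    """Time a zero-argument callable and return simple latency statistics."""
    for _ in range(warmup):
        infer()  # warm-up runs are not timed
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return {
        "mean_ms": 1000 * sum(timings) / len(timings),
        "median_ms": 1000 * timings[len(timings) // 2],
        "runs": len(timings),
    }


def fake_infer():
    # Placeholder workload standing in for copy, execute and copy-back.
    sum(i * i for i in range(1000))


stats = benchmark(fake_infer, warmup=2, iterations=20)
print(stats)
```

If the host-to-device and device-to-host copies dominate the total time, synchronous and asynchronous execution of a single stream will look nearly identical, which may be part of what I observed.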

Inference through the Python API takes its input as NumPy arrays.

Dynamic inference

    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit
    import numpy as np
    import cv2


    def load_engine(engine_path):
        # TRT_LOGGER = trt.Logger(trt.Logger.WARNING)  # or INFO
        TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
        with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())


    # No specific model is used as the inference example here.
    path = '/home/caidou/trt_python/model_1_-1_-1_3.engine'

    # 1. Load the model and create the execution context
    engine = load_engine(path)
    context = engine.create_execution_context()
    context.active_optimization_profile = 0

    # 2. Read the data and shape it to match the network input;
    #    preprocessing can be added here
    imgpath = '/home/caidou/test/aaa.jpg'
    image = cv2.imread(imgpath)
    image = np.expand_dims(image, 0)  # Add batch dimension.
    # Assumption: the engine input is float32 (cv2.imread returns uint8)
    image = np.ascontiguousarray(image, dtype=np.float32)

    # 3. Allocate memory and copy the data from CPU to GPU.
    #    With dynamic shapes the input shape must be set on every call.
    #    Binding 0 is the input; the output binding depends on the network
    #    and can be any of 1, 2, 3, ... for a model with several heads.
    context.set_binding_shape(0, image.shape)
    d_input = cuda.mem_alloc(image.nbytes)  # input buffer
    output_shape = tuple(context.get_binding_shape(1))
    buffer = np.empty(output_shape, dtype=np.float32)
    d_output = cuda.mem_alloc(buffer.nbytes)  # output buffer
    cuda.memcpy_htod(d_input, image)
    bindings = [int(d_input), int(d_output)]

    # 4. Run inference and copy the result from GPU to CPU
    context.execute_v2(bindings)  # synchronous; an asynchronous variant also exists
    cuda.memcpy_dtoh(buffer, d_output)
    output = buffer.reshape(output_shape)

    # 5. Post-process the result. This is only a simple example; it can be
    #    fleshed out with the help of the official static yolov3 sample.

The overall pipeline is steps 1 to 5 above.
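
Step 3 is where dynamic shapes cost the most care, because the buffer size changes with every input shape. The sketch below shows the arithmetic only, with no GPU involved. binding_nbytes is a hypothetical helper, and the float32 item size is an assumption about the engine's bindings.

```python
import math

FLOAT32_BYTES = 4  # assumption: the bindings hold float32 data


def binding_nbytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed for a tensor of the given, fully resolved shape."""
    if any(dim < 0 for dim in shape):
        # A -1 means the shape is not resolved yet, for example because
        # the input shape has not been set on the context.
        raise ValueError("unresolved dynamic dimension in {}".format(shape))
    return math.prod(shape) * itemsize


# Smallest and largest inputs allowed by the profile used at build time.
print(binding_nbytes((1, 100, 100, 3)))
print(binding_nbytes((1, 2048, 2048, 3)))
```

The largest allowed input needs about 48 MiB, so allocating once for the maximum shape and reusing that buffer is a reasonable alternative to allocating on every call.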

Static inference

Static inference is much the same as dynamic inference, except that the input and output buffers do not have to be allocated on every call.

    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit
    import numpy as np
    import cv2
    import tqdm


    def load_engine(engine_path):
        # Same helper as in the dynamic inference example
        TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
        with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())


    path = '/home/caidou/trt_python/model_1_4_256_256.engine'
    engine = load_engine(path)
    imgpath = 'aaa.jpg'
    context = engine.create_execution_context()

    image1 = cv2.imread(imgpath)
    image1 = cv2.resize(image1, (256, 256))
    image2 = image1.copy()
    image3 = image1.copy()
    image4 = image1.copy()
    image = np.concatenate((image1, image2, image3, image4))
    image = image.reshape(-1, 256, 256)
    # image = np.expand_dims(image, axis=1)
    image = image.astype(np.float32)
    image = image.ravel()  # flatten; the element count must match the engine input

    outshape = tuple(context.get_binding_shape(1))
    output = np.empty(outshape, dtype=np.float32)

    # Allocate the device buffers once, outside the loop
    d_input = cuda.mem_alloc(1 * image.size * image.dtype.itemsize)
    d_output = cuda.mem_alloc(1 * output.size * output.dtype.itemsize)
    bindings = [int(d_input), int(d_output)]
    stream = cuda.Stream()  # not used by the synchronous execute_v2 call below

    for i in tqdm.tqdm(range(600)):
        cuda.memcpy_htod(d_input, image)
        context.execute_v2(bindings)
        cuda.memcpy_dtoh(output, d_output)
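
The static loop allocates its buffers once, while the dynamic example allocates on every call. A middle ground for dynamic shapes is a buffer that only grows. The class below is a sketch of that pattern under my own naming: ReusableBuffer is hypothetical, and bytearray stands in for a device allocator such as cuda.mem_alloc so that the example runs without a GPU.

```python
class ReusableBuffer:
    """Grow-only buffer: reallocates only when a request exceeds capacity."""

    def __init__(self, allocate):
        self._allocate = allocate  # stand-in for a device allocator
        self._buffer = None
        self.capacity = 0
        self.allocations = 0

    def get(self, nbytes):
        """Return a buffer that holds at least `nbytes` bytes."""
        if nbytes > self.capacity:
            self._buffer = self._allocate(nbytes)
            self.capacity = nbytes
            self.allocations += 1
        return self._buffer


pool = ReusableBuffer(bytearray)
for size in (1024, 512, 1024, 4096, 2048):
    pool.get(size)
print(pool.allocations)  # prints 2: one for 1024 bytes, one for 4096 bytes
```

Five requests lead to only two allocations here. With a real device allocator the same idea avoids most of the per-call allocation cost once the largest input has been seen.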

TensorRT model quantization

To be continued...

This part will be added when time permits.
