TensorRT Environment Setup, YOLOv5 Model Conversion, and INT8 Quantized Deployment

The TensorRT workflow in this post requires OpenCV to be built in the environment, so the first step is compiling OpenCV.

I. Compiling OpenCV

1. Install dependencies

    sudo apt-get install cmake
    sudo apt-get install build-essential libgtk2.0-dev libavcodec-dev libavformat-dev libjpeg-dev libswscale-dev libtiff5-dev
    sudo apt-get install libgtk2.0-dev

2. Download the version you need

https://opencv.org/releases/

After unzipping, put it wherever you like. Inside the opencv-4.5.0 directory, create a build folder and enter it from a terminal:

    unzip opencv-4.5.0.zip
    cd opencv-4.5.0/
    mkdir build
    cd build/

3. Configure with CMake

3.1 Install CMake and dependency libraries

    sudo apt-get install cmake
    sudo apt-get install build-essential libgtk2.0-dev libavcodec-dev libavformat-dev libjpeg-dev libtiff5-dev libswscale-dev libjasper1 libjasper-dev

Error: E: Unable to locate package libjasper-dev

    sudo add-apt-repository "deb http://security.ubuntu.com/ubuntu xenial-security main"
    sudo apt update
    sudo apt install libjasper1 libjasper-dev

This solved the problem; libjasper1 is a dependency of libjasper-dev.

3.2 Run the CMake configuration

$ cmake ..
# or, with explicit options:
$ cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local ..

Tacking on the long lists of extra flags you find online tends to trigger all sorts of errors; this plain invocation has always worked smoothly for me with no errors.

Then just wait for the configuration to finish; if the final output shows no errors, you are halfway to success.

3.3 Build

$ sudo make    # or: sudo make -j4 to build with 4 parallel jobs

Be patient here; the build takes a while.

3.4 Install

$ sudo make install 

The corresponding uninstall is:

$ sudo make uninstall 

Check the installed OpenCV version:

opencv_version 或者 pkg-config --modversion opencv4

(For OpenCV 4 the pkg-config module is named opencv4, and the .pc file is only generated when CMake is run with -D OPENCV_GENERATE_PKGCONFIG=ON.)

At this point, OpenCV has been built from source successfully.
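If the Python bindings were also built (OpenCV generates them by default when NumPy is available), you can double-check from Python as well; a minimal sketch:

    import cv2
    print(cv2.__version__)   # should print 4.5.0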

II. Downloading and Installing TensorRT

1. Check your CUDA, cuDNN, and OS versions

$ nvcc -V                                                      # check the CUDA version
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2  # check the cuDNN version
$ cat /etc/issue                                               # check the Ubuntu version

My versions match the TensorRT package used below: Ubuntu 18.04, CUDA 11.1, cuDNN 8.0.

2. Download

TensorRT download page: https://developer.nvidia.com/tensorrt

Click Download to get to the release options. In general, pick a release that a relatively large number of users are on.


Note: be sure to download the build that matches your CUDA version and operating system exactly, neither higher nor lower than your environment, or you will be in for a world of pain (I didn't check carefully the first time and kept hitting errors).

3. Install TensorRT

(1) Extract the downloaded archive:

$ tar zxvf  TensorRT-7.2.2.3.Ubuntu-18.04.x86_64-gnu.cuda-11.1.cudnn8.0.tar.gz

(2) Environment configuration:

Extraction yields the TensorRT directory; add the absolute path of its lib subdirectory to the environment variable:

vim ~/.bashrc

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/software/TensorRT-7.2.2.3/lib

Reload the configuration: source ~/.bashrc
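To confirm the new library path is picked up, a minimal sketch (libnvinfer.so.7 is the core TensorRT 7 runtime library shipped in the lib directory above):

    import ctypes
    ctypes.CDLL("libnvinfer.so.7")   # raises OSError if LD_LIBRARY_PATH is not set correctly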

(3) Install the TensorRT Python wheel

$ cd TensorRT-7.2.2.3/python/
$ pip install tensorrt-7.2.2.3-cp37-none-linux_x86_64.whl 

From my own testing, your Python version is not tightly constrained here; wheels for py35, py36, and py37 are all provided. Installing the one matching your own Python version is the safest.

# Install UFF, to support TensorFlow model conversion
$ cd TensorRT-7.2.2.3/uff/
$ pip install uff-0.6.9-py2.py3-none-any.whl 

# Install graphsurgeon, to support custom network structures
$ cd TensorRT-7.2.2.3/graphsurgeon
$ pip install graphsurgeon-0.4.5-py2.py3-none-any.whl 

With that, TensorRT is fully installed. Launch python and check that these modules can be imported; if import tensorrt succeeds, the installation is good.

PS: import uff requires the tensorflow module to be installed first:

pip install tensorflow-gpu==2.4.0
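A minimal import check for everything installed so far:

    import tensorrt as trt
    import uff            # needs tensorflow, see above
    import graphsurgeon
    print(trt.__version__)   # should print 7.2.2.3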

(4) Install PyCUDA
PyCUDA is the Python binding for NVIDIA CUDA; it maps the full CUDA API into Python.
Install:

     pip3 install pycuda==2021.1 
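A quick PyCUDA smoke test, a minimal sketch that creates a context and prints the GPU info:

    import pycuda.autoinit           # creates a CUDA context on the default device
    import pycuda.driver as cuda
    dev = cuda.Device(0)
    print(dev.name(), dev.compute_capability())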

III. Deploying yolov5s (v5.0) with TensorRT

Reference: https://blog.csdn.net/xingtianyao/article/details/111353568

What we end up with is the default FP16 engine deployment from the yolov5 (tensorrtx) project; running yolov5_trt.py is enough to see the results, and with a little rework it can run on video.
The reworked code is listed below:

"""
An example that uses TensorRT's Python api to make inferences.
"""
import ctypes
import os
import shutil
import random
import sys
import threading
import time
import cv2
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import torchvision
import argparse
 
CONF_THRESH = 0.5
IOU_THRESHOLD = 0.4
 
 
def get_img_path_batches(batch_size, img_dir):
    ret = []
    batch = []
    for root, dirs, files in os.walk(img_dir):
        for name in files:
            if len(batch) == batch_size:
                ret.append(batch)
                batch = []
            batch.append(os.path.join(root, name))
    if len(batch) > 0:
        ret.append(batch)
    return ret
 
def plot_one_box(x, img, color=None, label=None, line_thickness=None):
    """
    description: Plots one bounding box on image img,
                 this function comes from YoLov5 project.
    param: 
        x:      a box like [x1,y1,x2,y2]
        img:    an OpenCV image object
        color:  color to draw rectangle, such as (0,255,0)
        label:  str
        line_thickness: int
    return:
        no return
    """
    tl = (
        line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1
    )  # line/font thickness
    color = color or [random.randint(0, 255) for _ in range(3)]
    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))
    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)
    if label:
        tf = max(tl - 1, 1)  # font thickness
        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]
        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled
        cv2.putText(
            img,
            label,
            (c1[0], c1[1] - 2),
            0,
            tl / 3,
            [225, 255, 255],
            thickness=tf,
            lineType=cv2.LINE_AA,
        )
 
 
class YoLov5TRT(object):
    """
    description: A YOLOv5 class that wraps TensorRT ops, preprocess and postprocess ops.
    """
 
    def __init__(self, engine_file_path):
        # Create a Context on this device,
        self.ctx = cuda.Device(0).make_context()
        stream = cuda.Stream()
        TRT_LOGGER = trt.Logger(trt.Logger.INFO)
        runtime = trt.Runtime(TRT_LOGGER)
 
        # Deserialize the engine from file
        with open(engine_file_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        context = engine.create_execution_context()
 
        host_inputs = []
        cuda_inputs = []
        host_outputs = []
        cuda_outputs = []
        bindings = []
 
        for binding in engine:
            print('binding:', binding, engine.get_binding_shape(binding))
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(cuda_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                self.input_w = engine.get_binding_shape(binding)[-1]
                self.input_h = engine.get_binding_shape(binding)[-2]
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)
 
        # Store
        self.stream = stream
        self.context = context
        self.engine = engine
        self.host_inputs = host_inputs
        self.cuda_inputs = cuda_inputs
        self.host_outputs = host_outputs
        self.cuda_outputs = cuda_outputs
        self.bindings = bindings
        self.batch_size = engine.max_batch_size
 
    def infer(self, input_image_path):
        threading.Thread.__init__(self)
        # Make self the active context, pushing it on top of the context stack.
        self.ctx.push()
        self.input_image_path = input_image_path
        # Restore
        stream = self.stream
        context = self.context
        engine = self.engine
        host_inputs = self.host_inputs
        cuda_inputs = self.cuda_inputs
        host_outputs = self.host_outputs
        cuda_outputs = self.cuda_outputs
        bindings = self.bindings
        # Do image preprocess
        batch_image_raw = []
        batch_origin_h = []
        batch_origin_w = []
        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])
 
        input_image, image_raw, origin_h, origin_w = self.preprocess_image(input_image_path)

        batch_origin_h.append(origin_h)
        batch_origin_w.append(origin_w)
        np.copyto(batch_input_image, input_image)
        batch_input_image = np.ascontiguousarray(batch_input_image)
 
        # Copy input image to host buffer
        np.copyto(host_inputs[0], batch_input_image.ravel())
        start = time.time()
        # Transfer input data  to the GPU.
        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
        # Run inference.
        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
        # Synchronize the stream
        stream.synchronize()
        end = time.time()
        # Remove any context from the top of the context stack, deactivating it.
        self.ctx.pop()
        # Here we use only the first output row, since batch_size = 1
        output = host_outputs[0]
        # Do postprocess
        result_boxes, result_scores, result_classid = self.post_process(
            output, origin_h, origin_w)
        # Draw rectangles and labels on the original image
        for j in range(len(result_boxes)):
            box = result_boxes[j]
            plot_one_box(
                box,
                image_raw,
                label="{}:{:.2f}".format(
                    categories[int(result_classid[j])], result_scores[j]
                ),
            )
        return image_raw, end - start
 
    def destroy(self):
        # Remove any context from the top of the context stack, deactivating it.
        self.ctx.pop()
        
    def get_raw_image(self, image_path_batch):
        """
        description: Read an image from image path
        """
        for img_path in image_path_batch:
            yield cv2.imread(img_path)
        
    def get_raw_image_zeros(self, image_path_batch=None):
        """
        description: Prepare data for warmup
        """
        for _ in range(self.batch_size):
            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)
 
    def preprocess_image(self, input_image_path):
        """
        description: Convert BGR image to RGB,
                     resize and pad it to target size, normalize to [0,1],
                     transform to NCHW format.
        param:
            input_image_path: despite its name, the raw image (a numpy BGR array) in this reworked version
        return:
            image:  the processed image
            image_raw: the original image
            h: original height
            w: original width
        """
        image_raw = input_image_path  # already an image array, not a path
        h, w, c = image_raw.shape
        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)
        # Calculate width and height and paddings
        r_w = self.input_w / w
        r_h = self.input_h / h
        if r_h > r_w:
            tw = self.input_w
            th = int(r_w * h)
            tx1 = tx2 = 0
            ty1 = int((self.input_h - th) / 2)
            ty2 = self.input_h - th - ty1
        else:
            tw = int(r_h * w)
            th = self.input_h
            tx1 = int((self.input_w - tw) / 2)
            tx2 = self.input_w - tw - tx1
            ty1 = ty2 = 0
        # Resize the image with long side while maintaining ratio
        image = cv2.resize(image, (tw, th))
        # Pad the short side with (128,128,128)
        image = cv2.copyMakeBorder(
            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, value=(128, 128, 128)
        )
        image = image.astype(np.float32)
        # Normalize to [0,1]
        image /= 255.0
        # HWC to CHW format:
        image = np.transpose(image, [2, 0, 1])
        # CHW to NCHW format
        image = np.expand_dims(image, axis=0)
        # Convert the image to row-major order, also known as "C order":
        image = np.ascontiguousarray(image)
        return image, image_raw, h, w
 
    def xywh2xyxy(self, origin_h, origin_w, x):
        """
        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
        param:
            origin_h:   height of original image
            origin_w:   width of original image
            x:          A boxes tensor, each row is a box [center_x, center_y, w, h]
        return:
            y:          A boxes tensor, each row is a box [x1, y1, x2, y2]
        """
        y = torch.zeros_like(x) if isinstance(x, torch.Tensor) else np.zeros_like(x)
        r_w = self.input_w / origin_w
        r_h = self.input_h / origin_h
        if r_h > r_w:
            y[:, 0] = x[:, 0] - x[:, 2] / 2
            y[:, 2] = x[:, 0] + x[:, 2] / 2
            y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2
            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2
            y /= r_w
        else:
            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2
            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2
            y[:, 1] = x[:, 1] - x[:, 3] / 2
            y[:, 3] = x[:, 1] + x[:, 3] / 2
            y /= r_h
 
        return y
 
    def post_process(self, output, origin_h, origin_w):
        """
        description: postprocess the prediction
        param:
            output:     A tensor like [num_boxes, cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]
            origin_h:   height of original image
            origin_w:   width of original image
        return:
            result_boxes: final boxes, a boxes tensor, each row is a box [x1, y1, x2, y2]
            result_scores: final scores, a tensor, one score per box
            result_classid: final class ids, a tensor, one class id per box
        """
        # Get the num of boxes detected
        num = int(output[0])
        # Reshape to a two-dimensional ndarray
        pred = np.reshape(output[1:], (-1, 6))[:num, :]
        # to a torch Tensor
        pred = torch.Tensor(pred).cuda()
        # Get the boxes
        boxes = pred[:, :4]
        # Get the scores
        scores = pred[:, 4]
        # Get the classid
        classid = pred[:, 5]
        # Choose those boxes that score > CONF_THRESH
        si = scores > CONF_THRESH
        boxes = boxes[si, :]
        scores = scores[si]
        classid = classid[si]
        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]
        boxes = self.xywh2xyxy(origin_h, origin_w, boxes)
        # Do nms
        indices = torchvision.ops.nms(boxes, scores, iou_threshold=IOU_THRESHOLD).cpu()
        result_boxes = boxes[indices, :].cpu()
        result_scores = scores[indices].cpu()
        result_classid = classid[indices].cpu()
        return result_boxes, result_scores, result_classid
 
 
class inferThread(threading.Thread):
    def __init__(self, yolov5_wrapper):
        threading.Thread.__init__(self)
        self.yolov5_wrapper = yolov5_wrapper
    def infer(self , frame):
        batch_image_raw, use_time = self.yolov5_wrapper.infer(frame)
        # for i, img_path in enumerate(self.image_path_batch):
        #     parent, filename = os.path.split(img_path)
        #     save_name = os.path.join('output', filename)
        #     # Save image
        #     cv2.imwrite(save_name, batch_image_raw[i])
        # print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))
        return batch_image_raw,use_time*1000
 
class warmUpThread(threading.Thread):
    def __init__(self, yolov5_wrapper):
        threading.Thread.__init__(self)
        self.yolov5_wrapper = yolov5_wrapper
 
    def run(self):
        batch_image_raw, use_time = self.yolov5_wrapper.infer(next(self.yolov5_wrapper.get_raw_image_zeros()))  # feed one zero frame
        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw.shape, use_time * 1000))
 
 
 
if __name__ == "__main__":
    # load custom plugins
    PLUGIN_LIBRARY = "/home/module/TR/yolov5-5.0/yolov5/build/libmyplugins.so"
    engine_file_path = "/home/module/TR/yolov5-5.0/yolov5/build/yolov5s.engine"

    if len(sys.argv) > 1:
        engine_file_path = sys.argv[1]
    if len(sys.argv) > 2:
        PLUGIN_LIBRARY = sys.argv[2]
 
    ctypes.CDLL(PLUGIN_LIBRARY)
 
    # load labels; these must match the classes the engine was trained on
    categories = ["bus", "truck", "car", ]
    # categories = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light",
    #         "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
    #         "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
    #         "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
    #         "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
    #         "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
    #         "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone",
    #         "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    #         "hair drier", "toothbrush"]
    # a YoLov5TRT instance
    yolov5_wrapper = YoLov5TRT(engine_file_path)
    url="/opt/nvidia/deepstream/deepstream-5.1/samples/streams/test6.mp4"
    cap = cv2.VideoCapture(url)
    try:
        thread1 = inferThread(yolov5_wrapper)
        # note: inferThread does not override run(), so start()/join() is a no-op;
        # actual inference happens through the direct infer() calls below
        thread1.start()
        thread1.join()
        while 1:
            st = time.time()
            ret, frame = cap.read()
            if not ret:  # end of stream or failed read
                break
            img, t = thread1.infer(frame)
            cv2.imshow("result", img)
            print("time->{:.2f}ms".format(t))
            if cv2.waitKey(1) & 0xFF == ord('q'):  # quit on 'q'
                break
            if time.time() - st > 0:
                print("====fps", 1 / (time.time() - st))
 
    finally:
        # destroy the instance
        cap.release()            # release the capture device
        cv2.destroyAllWindows()  # close the windows and free any associated memory
        yolov5_wrapper.destroy() # pop the CUDA context
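The script takes the engine path and plugin path as optional command-line arguments (see the argv handling above), e.g. python yolov5_trt.py build/yolov5s.engine build/libmyplugins.so. For a still-image test instead of video, a minimal sketch reusing the same wrapper (the image path is just an example):

    img = cv2.imread("samples/bus.jpg")   # any test image
    result_img, t = yolov5_wrapper.infer(img)
    print("inference time: {:.2f} ms".format(t * 1000))
    cv2.imwrite("output.jpg", result_img)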

IV. INT8 Quantized Deployment with TensorRT

Many readers ask how to squeeze out more frame rate and detection speed, and the usual answer is to deploy an INT8 model. But the INT8 quantization recipes floating around online are all over the place, and it is hard to know where to start, so here I record the simplest workflow that worked for me after stepping through the pitfalls. The final detection quality was not ideal, probably because my calibration dataset was small, but the process itself is sound. If you hit problems during development, email me and I will reply when I see it.

Note: before doing this, make sure the previous step, converting the .wts to a .engine with TensorRT, succeeded. That confirms your environment is healthy, and the INT8 quantization that follows will then be very simple.

1. Open yolov5.cpp

The default there is USE_FP16; all you need to do is change USE_FP16 to USE_INT8 by hand.

2. Provide INT8 calibration images

Step 1: create the build directory, then create a coco_calib folder inside it; the exact folder name it expects can be found around line 109 of your yolov5.cpp.

Step 2: copy your dataset images into coco_calib; these are the images used for calibration.

Step 3: build and run the calibration:

    cmake ..
    make
    ./yolov5 -s yolov5s.wts yolov5s.engine s

Finally this produces an int8calib.table and a yolov5s.engine file; swap the engine path in yolov5_trt.py and you can test.
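Before wiring the new engine into yolov5_trt.py, a minimal sketch to confirm it deserializes (paths follow the examples above; the custom plugin library must be loaded first):

    import ctypes
    import tensorrt as trt

    ctypes.CDLL("build/libmyplugins.so")      # plugin library built above
    logger = trt.Logger(trt.Logger.INFO)
    runtime = trt.Runtime(logger)
    with open("yolov5s.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    for binding in engine:                    # print each binding and its shape
        print(binding, engine.get_binding_shape(binding))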
That concludes this post. If you spot anything poorly written or any mistakes, please point them out and I will improve it.
