Slow CNN inference on nVidia Jetson Nano

data-mining keras cnn performance
2022-02-17 04:41:42

I'm running what I believe to be a very lightweight CNN on an nVidia Jetson Nano with JetPack 4.4. nVidia claims the Nano can run ResNet-50 at 36 fps, so I expected my much smaller network to easily manage 30+ fps.

In practice, though, each forward pass takes 160-180 ms, so I get at best 5-6 fps. In production the predictions have to run in real time on a live camera stream, so I can't improve per-sample throughput by using a batch size larger than 1.

Is there something fundamentally wrong with my inference code? Am I mistaken in thinking that this architecture should compute far faster than, say, ResNet-50? And what can I do to find out what exactly is taking so much time?

My CNN:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lambda (Lambda)              (None, 210, 848, 3)       0
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 210, 282, 3)       0
_________________________________________________________________
conv2d (Conv2D)              (None, 102, 138, 16)      2368
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 51, 69, 16)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 33, 32)        12832
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 16, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 4, 6, 64)          51264
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 2, 3, 64)          0
_________________________________________________________________
flatten (Flatten)            (None, 384)               0
_________________________________________________________________
dropout (Dropout)            (None, 384)               0
_________________________________________________________________
dense (Dense)                (None, 64)                24640
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0
_________________________________________________________________
elu (ELU)                    (None, 64)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65
=================================================================
Total params: 91,169
Trainable params: 91,169
Non-trainable params: 0
_________________________________________________________________

Code:

import numpy as np
import cv2
import time
import tensorflow as tf
from tensorflow import keras

model_name = 'v9_small_FC_epoch_3'
loaded_model = keras.models.load_model('/home/jetson/notebooks/trained_models/' + model_name + '.h5')
loaded_model.summary()

# Single test frame with a batch dimension so the shape matches the model input.
frame = cv2.imread('/home/jetson/notebooks/frame1.jpg')
test_data = np.expand_dims(frame, axis=0)

# Time 10 consecutive single-sample predictions; the first pass also pays
# for graph construction, so it is expected to be slower.
for i in range(10):
    start = time.time()
    predictions = loaded_model.predict(test_data)
    end = time.time()
    print(predictions[0][0])
    print("Inference took {}s".format(end - start))

Results:

4.7763316333293915
Inference took 10.111131191253662s
4.7763316333293915
Inference took 0.1822071075439453s
4.7763316333293915
Inference took 0.17330455780029297s
4.7763316333293915
Inference took 0.18085694313049316s
4.7763316333293915
Inference took 0.16646790504455566s
4.7763316333293915
Inference took 0.1703803539276123s
4.7763316333293915
Inference took 0.1788337230682373s
4.7763316333293915
Inference took 0.17131853103637695s
4.7763316333293915
Inference took 0.1660606861114502s
4.7763316333293915
Inference took 0.18377089500427246s
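
To separate predict()'s own per-call overhead from the actual forward pass, I can also time a direct call on the model, which Keras documents as the faster path for small one-off inputs. A minimal variant of the loop above, reusing loaded_model and test_data (the explicit float32 cast is an assumption on my part, since predict() casts implicitly):

import time
import tensorflow as tf

# Direct __call__ skips predict()'s per-call batching/callback machinery.
test_tensor = tf.constant(test_data, dtype=tf.float32)

for i in range(10):
    start = time.time()
    predictions = loaded_model(test_tensor, training=False)
    end = time.time()
    print(float(predictions[0][0]))
    print("Direct call took {}s".format(end - start))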

Edit: To make sure I wasn't simply underestimating my own network, I replaced it with a network consisting of nothing but a single input and a single output neuron. As expected, the initial model load was noticeably faster, but after that inference was almost exactly as slow.

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lambda (Lambda)              (None, 1, 1, 1)           0         
_________________________________________________________________
dense (Dense)                (None, 1, 1, 1)           2         
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
2021-01-06 20:44:22.361558: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Inference took 1.9230175018310547s
Inference took 0.17112112045288086s
Inference took 0.16610288619995117s
Inference took 0.1768038272857666s
Inference took 0.16962003707885742s
Inference took 0.16416263580322266s
Inference took 0.17536258697509766s
Inference took 0.16603755950927734s
Inference took 0.16376280784606934s
Inference took 0.16828060150146484s
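
For reference, the control model above can be rebuilt in a couple of lines (the identity Lambda is a stand-in, since the summary doesn't show what the original Lambda does):

from tensorflow import keras

# Two-parameter control model matching the summary above.
tiny = keras.Sequential([
    keras.layers.Lambda(lambda x: x, input_shape=(1, 1, 1)),
    keras.layers.Dense(1),  # 1 weight + 1 bias = 2 params
])
tiny.summary()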

On my desktop (i5-2500K, GTX 1070 Ti), even the first prediction takes only about 26 ms:

Inference took 0.02569293975830078s
Inference took 0.026061534881591797s
Inference took 0.023118019104003906s
Inference took 0.023060083389282227s
Inference took 0.02504444122314453s
Inference took 0.02664470672607422s
1 Answer

For me, converting to TensorRT appears to have improved performance by more than 10x (!), which I did not expect at all.

On the downside, loading the TensorRT model now takes over 2 minutes, and for reasons I haven't been able to pin down the script occupies 2.2 GB of RAM. Getting the conversion process to work was also quite painful. I'll open a separate Q&A on that topic, since a lot of people seem to end up giving up on it.
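
For reference, the conversion went roughly along these lines with TF-TRT (a minimal sketch, assuming the TF 2.x API shipped with JetPack 4.4; the directory names are placeholders):

import tensorflow as tf
from tensorflow import keras
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# TF-TRT converts SavedModels, so export the Keras .h5 model first.
model = keras.models.load_model('/home/jetson/notebooks/trained_models/v9_small_FC_epoch_3.h5')
model.save('v9_small_FC_saved', save_format='tf')

# FP16 suits the Nano's Maxwell GPU. The parameter API differs slightly
# between TF versions; this is the TF 2.0-2.3 pattern.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode='FP16')
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='v9_small_FC_saved',
    conversion_params=params)
converter.convert()
converter.save('v9_small_FC_trt')

# At inference time the converted model is loaded through its signature:
trt_model = tf.saved_model.load('v9_small_FC_trt')
infer = trt_model.signatures['serving_default']

If the very long first pass in the log below is the TensorRT engines being built at runtime, converter.build(input_fn=...) can in principle move that cost to conversion time instead.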

The TensorRT model seems to need some warm-up (~100 passes) before it settles at its final inference speed, which in my case is about 15-17 ms per pass (roughly 58-66 fps). Quite an amazing improvement, I have to say.

Inference took 100.2991828918457s
Inference took 0.2558176517486572s
Inference took 0.04433894157409668s
Inference took 0.037764787673950195s
Inference took 0.03640627861022949s
Inference took 0.04129934310913086s
Inference took 0.024821043014526367s
Inference took 0.0219266414642334s
...
Inference took 0.0170745849609375s
Inference took 0.016851186752319336s
Inference took 0.016122817993164062s
Inference took 0.01502084732055664s
Inference took 0.015442371368408203s
Inference took 0.01560211181640625s
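
In production the warm-up can simply be burned off at startup, before the camera stream is attached (a sketch; the 210x848x3 input shape is inferred from the model summary in the question):

import numpy as np
import tensorflow as tf

trt_model = tf.saved_model.load('v9_small_FC_trt')
infer = trt_model.signatures['serving_default']

# Run ~100 dummy frames so the first real frame is served at steady-state speed.
dummy = tf.constant(np.zeros((1, 210, 848, 3), dtype=np.float32))
for _ in range(100):
    infer(dummy)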

Without TensorRT, not only did inference generally take longer, there were also occasional much slower passes, in some cases up to 750 ms, which is a deal-breaker for a real-time application.

With TensorRT the inference times are very stable; I haven't seen more than 15% variation across 10K passes.
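
That stability is straightforward to quantify by collecting per-pass latencies and looking at the spread, e.g.:

import time
import numpy as np

latencies = []
for _ in range(10000):
    start = time.time()
    infer(dummy)
    latencies.append(time.time() - start)

# Drop the warm-up passes, then report the distribution in milliseconds.
lat = np.array(latencies[100:]) * 1000
print("median {:.1f} ms, p99 {:.1f} ms, max {:.1f} ms".format(
    np.median(lat), np.percentile(lat, 99), lat.max()))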