I am running what I believe to be a very lightweight CNN on an NVIDIA Jetson Nano with JetPack 4.4. NVIDIA claims the Nano can run ResNet-50 at 36 fps, so I expected my much smaller network to easily manage 30+ fps.
In reality, however, each forward pass takes 160-180 ms, so I get at most 5-6 fps. In production the predictions have to be made live on a real-time camera stream, so I cannot improve per-sample throughput by using a batch size larger than 1.
Is there something fundamentally wrong with my inference code? Am I mistaken in thinking that this architecture should be far cheaper to compute than, say, ResNet-50? What can I do to find out what exactly is taking so much time?
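For reference, this is the timing pattern I'm using, as a minimal framework-agnostic sketch (`forward_pass` is a placeholder standing in for the real model call): it discards the first warm-up invocation, uses a monotonic clock, and keeps printing out of the measured region.

```python
import time

def forward_pass(x):
    # Placeholder for the real model call, e.g. loaded_model.predict(x)
    return sum(x)

test_data = [0.0] * 1000

# Warm-up call: the very first invocation often includes one-time setup cost
forward_pass(test_data)

timings = []
for _ in range(10):
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = forward_pass(test_data)
    end = time.perf_counter()
    timings.append(end - start)

# Report outside the timed loop so printing doesn't inflate the measurements
print("min: {:.6f}s  mean: {:.6f}s".format(min(timings), sum(timings) / len(timings)))
```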
My CNN:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lambda (Lambda) (None, 210, 848, 3) 0
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 210, 282, 3) 0
_________________________________________________________________
conv2d (Conv2D) (None, 102, 138, 16) 2368
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 51, 69, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 24, 33, 32) 12832
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 16, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 4, 6, 64) 51264
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 2, 3, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 384) 0
_________________________________________________________________
dropout (Dropout) (None, 384) 0
_________________________________________________________________
dense (Dense) (None, 64) 24640
_________________________________________________________________
dropout_1 (Dropout) (None, 64) 0
_________________________________________________________________
elu (ELU) (None, 64) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Total params: 91,169
Trainable params: 91,169
Non-trainable params: 0
_________________________________________________________________
Code:
import numpy as np
import cv2
import time
import tensorflow as tf
from tensorflow import keras

# Load the trained model from disk
model_name = 'v9_small_FC_epoch_3'
loaded_model = keras.models.load_model('/home/jetson/notebooks/trained_models/' + model_name + '.h5')
loaded_model.summary()

# Read a single test frame and add a batch dimension
frame = cv2.imread('/home/jetson/notebooks/frame1.jpg')
test_data = np.expand_dims(frame, axis=0)

# Run ten timed forward passes on the same input
for i in range(10):
    start = time.time()
    predictions = loaded_model.predict(test_data)
    print(predictions[0][0])
    end = time.time()
    print("Inference took {}s".format(end - start))
Results:
4.7763316333293915
Inference took 10.111131191253662s
4.7763316333293915
Inference took 0.1822071075439453s
4.7763316333293915
Inference took 0.17330455780029297s
4.7763316333293915
Inference took 0.18085694313049316s
4.7763316333293915
Inference took 0.16646790504455566s
4.7763316333293915
Inference took 0.1703803539276123s
4.7763316333293915
Inference took 0.1788337230682373s
4.7763316333293915
Inference took 0.17131853103637695s
4.7763316333293915
Inference took 0.1660606861114502s
4.7763316333293915
Inference took 0.18377089500427246s
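One thing I'm wondering is whether `model.predict` itself adds per-call overhead on a batch of 1; as far as I understand, in TF 2.x Keras the model can also be invoked directly as a callable, which skips `predict`'s batching and data-handling machinery. A self-contained sketch (using a tiny stand-in model rather than my actual loaded `.h5`):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Tiny stand-in model so the snippet runs on its own;
# in my case this would be the loaded .h5 model instead.
model = keras.Sequential([keras.Input(shape=(3,)), keras.layers.Dense(1)])

x = np.ones((1, 3), dtype=np.float32)

# model.predict: convenient, but sets up batching/data handling on each call
p1 = model.predict(x)

# Direct call: returns a tf.Tensor and bypasses predict's per-call machinery
p2 = model(x, training=False)

print(np.allclose(p1, np.asarray(p2)))  # the two paths should agree numerically
```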
Edit: To make sure I wasn't simply underestimating my own network, I replaced it with a network consisting of nothing but a single input and a single output neuron. As expected, the initial model load was noticeably faster, but after that, inference was almost exactly as slow.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lambda (Lambda) (None, 1, 1, 1) 0
_________________________________________________________________
dense (Dense) (None, 1, 1, 1) 2
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
2021-01-06 20:44:22.361558: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Inference took 1.9230175018310547s
Inference took 0.17112112045288086s
Inference took 0.16610288619995117s
Inference took 0.1768038272857666s
Inference took 0.16962003707885742s
Inference took 0.16416263580322266s
Inference took 0.17536258697509766s
Inference took 0.16603755950927734s
Inference took 0.16376280784606934s
Inference took 0.16828060150146484s
On my desktop (i5-2500k, GTX 1070Ti), even the first prediction takes only about 26 ms:
Inference took 0.02569293975830078s
Inference took 0.026061534881591797s
Inference took 0.023118019104003906s
Inference took 0.023060083389282227s
Inference took 0.02504444122314453s
Inference took 0.02664470672607422s