How do I calculate the GPU memory needed to run a deep learning network?

artificial-intelligence deep-learning tensorflow training memory
2021-11-14 06:38:14

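In general, how do I calculate the GPU memory needed to run a deep learning network?

I'm asking because I run out of memory when training certain network configurations.

If TensorFlow only stored the memory needed for the trainable parameters, and I have about 8 million of them, I would expect the required RAM to be:

RAM = 8,000,000 * 8 (float64) / 1,000,000 (scaled to MB)

RAM = 64 MB, right?

Does TensorFlow need additional memory to store the image at each layer?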

By the way, these are my GPU specs:

  • Nvidia GeForce 1050 4GB

Network topology

  • Network
  • Input shape (256,256,4)
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 256, 256, 4) 0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 256, 256, 64) 2368        input_1[0][0]                    
__________________________________________________________________________________________________
dropout (Dropout)               (None, 256, 256, 64) 0           conv2d[0][0]                     
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 256, 256, 64) 36928       dropout[0][0]                    
__________________________________________________________________________________________________
max_pooling2d (MaxPooling2D)    (None, 128, 128, 64) 0           conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 128, 128, 128 73856       max_pooling2d[0][0]              
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 128, 128, 128 0           conv2d_2[0][0]                   
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 128, 128, 128 147584      dropout_1[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 64, 64, 128)  0           conv2d_3[0][0]                   
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 64, 64, 256)  295168      max_pooling2d_1[0][0]            
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 64, 64, 256)  0           conv2d_4[0][0]                   
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 64, 64, 256)  590080      dropout_2[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)  (None, 32, 32, 256)  0           conv2d_5[0][0]                   
__________________________________________________________________________________________________
conv2d_6 (Conv2D)               (None, 32, 32, 512)  1180160     max_pooling2d_2[0][0]            
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 32, 32, 512)  0           conv2d_6[0][0]                   
__________________________________________________________________________________________________
conv2d_7 (Conv2D)               (None, 32, 32, 512)  2359808     dropout_3[0][0]                  
__________________________________________________________________________________________________
conv2d_transpose (Conv2DTranspo (None, 64, 64, 256)  524544      conv2d_7[0][0]                   
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 64, 64, 512)  0           conv2d_transpose[0][0]           
                                                                 conv2d_5[0][0]                   
__________________________________________________________________________________________________
conv2d_8 (Conv2D)               (None, 64, 64, 256)  1179904     concatenate[0][0]                
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 64, 64, 256)  0           conv2d_8[0][0]                   
__________________________________________________________________________________________________
conv2d_9 (Conv2D)               (None, 64, 64, 256)  590080      dropout_4[0][0]                  
__________________________________________________________________________________________________
conv2d_transpose_1 (Conv2DTrans (None, 128, 128, 128 131200      conv2d_9[0][0]                   
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 128, 128, 256 0           conv2d_transpose_1[0][0]         
                                                                 conv2d_3[0][0]                   
__________________________________________________________________________________________________
conv2d_10 (Conv2D)              (None, 128, 128, 128 295040      concatenate_1[0][0]              
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 128, 128, 128 0           conv2d_10[0][0]                  
__________________________________________________________________________________________________
conv2d_11 (Conv2D)              (None, 128, 128, 128 147584      dropout_5[0][0]                  
__________________________________________________________________________________________________
conv2d_transpose_2 (Conv2DTrans (None, 256, 256, 64) 32832       conv2d_11[0][0]                  
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 256, 256, 128 0           conv2d_transpose_2[0][0]         
                                                                 conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_12 (Conv2D)              (None, 256, 256, 64) 73792       concatenate_2[0][0]              
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, 256, 256, 64) 0           conv2d_12[0][0]                  
__________________________________________________________________________________________________
conv2d_13 (Conv2D)              (None, 256, 256, 64) 36928       dropout_6[0][0]                  
__________________________________________________________________________________________________
conv2d_14 (Conv2D)              (None, 256, 256, 1)  65          conv2d_13[0][0]                  
==================================================================================================
Total params: 7,697,921
Trainable params: 7,697,921
Non-trainable params: 0

This is the error I get.

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-17-d4852b86b8c1> in <module>
     23 # Train the model, doing validation at the end of each epoch.
     24 epochs = 30
---> 25 result_model = model.fit(train_gen, epochs=epochs, validation_data=val_gen, callbacks=callbacks)

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\keras\engine\training.py in _method_wrapper(self, *args, **kwargs)
    106   def _method_wrapper(self, *args, **kwargs):
    107     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
--> 108       return method(self, *args, **kwargs)
    109 
    110     # Running inside `run_distribute_coordinator` already.

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1096                 batch_size=batch_size):
   1097               callbacks.on_train_batch_begin(step)
-> 1098               tmp_logs = train_function(iterator)
   1099               if data_handler.should_sync:
   1100                 context.async_wait()

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\eager\def_function.py in __call__(self, *args, **kwds)
    778       else:
    779         compiler = "nonXla"
--> 780         result = self._call(*args, **kwds)
    781 
    782       new_tracing_count = self._get_tracing_count()

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
    838         # Lifting succeeded, so variables are initialized and we can run the
    839         # stateless function.
--> 840         return self._stateless_fn(*args, **kwds)
    841     else:
    842       canon_args, canon_kwds = \

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\eager\function.py in __call__(self, *args, **kwargs)
   2827     with self._lock:
   2828       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2829     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   2830 
   2831   @property

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\eager\function.py in _filtered_call(self, args, kwargs, cancellation_manager)
   1846                            resource_variable_ops.BaseResourceVariable))],
   1847         captured_inputs=self.captured_inputs,
-> 1848         cancellation_manager=cancellation_manager)
   1849 
   1850   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1922       # No tape is watching; skip to running the function.
   1923       return self._build_call_outputs(self._inference_function.call(
-> 1924           ctx, args, cancellation_manager=cancellation_manager))
   1925     forward_backward = self._select_forward_and_backward_functions(
   1926         args,

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    548               inputs=args,
    549               attrs=attrs,
--> 550               ctx=ctx)
    551         else:
    552           outputs = execute.execute_with_cancellation(

~\Anaconda3\envs\tf23\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

ResourceExhaustedError:  OOM when allocating tensor with shape[8,64,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node gradient_tape/functional_1/conv2d_14/Conv2D/Conv2DBackpropInput (defined at <ipython-input-17-d4852b86b8c1>:25) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_17207]

Function call stack:
train_function

Is there any kind of mistake in the network definition? How can I improve the network to fix this problem?

3 Answers

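In fact, I don't know how to calculate the GPU memory needed to run a neural network, but I do have a solution for GPU allocation problems when using the TensorFlow framework.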

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 2GB * 2 of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048 * 2)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

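You can set a memory limit on the GPU, which sometimes resolves memory allocation problems. As shown above, you can set the "memory_limit" parameter to whatever your configuration needs.

Also be careful to use the right framework. If you want to use the code above to set the memory limit, you must build your neural network from TensorFlow with the Keras backend.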

from tensorflow.keras.models import Sequential

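You can work out the memory requirement analytically, but in practice nothing beats a physical test, because too many unknown variables in the system can take up GPU memory. For example, TensorFlow may decide to store the gradients, and then you have to account for their memory usage as well.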
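As a rough illustration of the analytic side, here is a minimal sketch that adds up the layer output sizes listed in the question's model summary. It assumes float32 tensors (the OOM message reports "type float"), a batch size of 8 (the leading dimension of the OOM tensor shape [8,64,256,256]), and that every layer output is kept in memory for backpropagation. The helper names are made up for this example, and the result ignores gradients, optimizer state, and workspace buffers, so treat it as an estimate rather than a measurement.

```python
# Rough analytic estimate; assumptions are stated in the comments below.
BYTES_PER_FLOAT32 = 4      # assumption: float32, as in the OOM message ("type float")
BATCH_SIZE = 8             # assumption: leading dim of the OOM tensor [8,64,256,256]
TOTAL_PARAMS = 7_697_921   # "Total params" from the model summary

# Per-sample output shapes from the model summary -> number of layers with that shape.
LAYER_OUTPUTS = {
    (256, 256, 4): 1,
    (256, 256, 64): 7,
    (256, 256, 128): 1,
    (256, 256, 1): 1,
    (128, 128, 64): 1,
    (128, 128, 128): 7,
    (128, 128, 256): 1,
    (64, 64, 128): 1,
    (64, 64, 256): 7,
    (64, 64, 512): 1,
    (32, 32, 256): 1,
    (32, 32, 512): 3,
}


def num_elements(shape):
    """Number of scalar values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def activation_elements_per_sample(layer_outputs):
    """Sum of all layer output sizes for a single input sample."""
    return sum(num_elements(shape) * count for shape, count in layer_outputs.items())


def to_mib(num_bytes):
    return num_bytes / (1024 ** 2)


weights_bytes = TOTAL_PARAMS * BYTES_PER_FLOAT32
oom_tensor_bytes = num_elements((8, 64, 256, 256)) * BYTES_PER_FLOAT32
activation_bytes = (
    activation_elements_per_sample(LAYER_OUTPUTS) * BYTES_PER_FLOAT32 * BATCH_SIZE
)

print(f"weights:            {to_mib(weights_bytes):8.1f} MiB")
print(f"single OOM tensor:  {to_mib(oom_tensor_bytes):8.1f} MiB")
print(f"all layer outputs:  {to_mib(activation_bytes):8.1f} MiB")
```

Under these assumptions the weights come to about 29 MiB, the single tensor from the error message to 128 MiB, and the layer outputs for one batch to roughly 2.1 GiB, which suggests that the activations, not the parameters, dominate on a 4 GB card.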

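The way I do this is to set the GPU memory limit to a high value, say 1 GB, and test the model's inference speed. Then I repeat the process with half the memory. I keep going until the model refuses to run or its speed drops. For example, I start at 1 GB, then 512 MB, then 256 MB, and eventually reach 32 MB, where the model slows down. At 16 MB the model refuses to run. So I know that 64 MB is the minimum I should use for my model. If I wanted a more precise number, I would repeat the binary search a few more times between 64 MB and 32 MB.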

You can see how to limit GPU memory here: https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
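The halving procedure described above can be sketched as a small search loop. This is only an illustration: `runs_ok` is a hypothetical callback that you would have to write yourself, for example by launching a fresh Python process that applies the given memory limit and reports whether the model ran at acceptable speed. A fresh process is assumed because, as the comment in the snippet above notes, virtual devices must be set before the GPU has been initialized.

```python
def smallest_working_limit_mb(runs_ok, start_mb=1024, min_mb=1):
    """Halve the memory limit until runs_ok(limit) fails.

    runs_ok is a hypothetical callable taking a limit in MB and returning
    True if the model ran acceptably under that limit.
    Returns the last limit that passed, or None if even start_mb failed.
    """
    limit = start_mb
    last_ok = None
    while limit >= min_mb and runs_ok(limit):
        last_ok = limit
        limit //= 2
    return last_ok


# Stand-in for a real measurement: pretend the model needs at least 64 MB.
def fake_runs_ok(limit_mb):
    return limit_mb >= 64


print(smallest_working_limit_mb(fake_runs_ok))
```

With the stand-in check, the loop tries 1024, 512, 256, 128, 64 and 32 MB and reports 64 MB, matching the example in the text. Refining the result between the last passing and first failing value would be an ordinary bisection over that interval.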

Interestingly, TensorFlow can keep not only the weights but also the training data in video RAM.

with tf.device('/gpu:0'):
    tensorflow_dataset = tf.constant(numpy_dataset)

Feeding the training data and weights to the GPU for matrix multiplication is faster from video RAM than from regular RAM.

Video RAM required = Number of params * sizeof(weight type) +
                     Training data amount in bytes

However, I think the required video RAM should be at least 1.5 times the value above, to make sure everything works.
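A minimal sketch of that rule of thumb follows. The function name is made up for this example, the 1.5 factor is the margin suggested above, and the dataset size is purely hypothetical (1,000 float32 images of shape 256x256x4 held on the GPU); only the parameter count comes from the question's model summary.

```python
def estimate_vram_bytes(num_params, bytes_per_weight, training_data_bytes,
                        safety_factor=1.5):
    """Rule of thumb: (weights + training data held on the GPU) * safety margin."""
    return safety_factor * (num_params * bytes_per_weight + training_data_bytes)


# Parameter count from the question's model summary, stored as float32.
num_params = 7_697_921
bytes_per_weight = 4

# Hypothetical dataset: 1,000 float32 images of shape (256, 256, 4).
training_data_bytes = 1_000 * 256 * 256 * 4 * 4

estimate = estimate_vram_bytes(num_params, bytes_per_weight, training_data_bytes)
print(f"estimated video RAM: {estimate / 1024 ** 2:.0f} MiB")
```

Note that this formula does not include the intermediate layer outputs that training also keeps on the GPU, so the real requirement can be considerably higher.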