计算科学 - 意外的 CUDA 处理时间依赖于线程数 - 吾爱随笔录

调用内核函数时，每个块的线程数理想情况下应该是 warp 大小的倍数。这样可以更有效地利用资源并缩短处理时间。但是，似乎还有另一个因素会周期性地减少处理时间。如下所示，处理时间每 32 个线程偏移一次，而每个块每 11 个线程的倍数会发生一次额外的加速。这种行为背后的原因是什么？

有问题的 GPU 是 GeForce GT 730，运行本文底部附加的内核函数。出于计时目的，它在循环中使用：

kernel_generate_image[(16,16),(1,i+1)](px, 32)

在哪里px = np.zeros([1024,1024])

@cuda.jit
def kernel_generate_image(image, T):

    # Calculate the thread's absolute position within the grid
    x = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    y = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y

    # Set stride equal to the number of threads we have available in either direction
    stride_x = cuda.gridDim.x * cuda.blockDim.x
    stride_y = cuda.gridDim.y * cuda.blockDim.y

    for i in range(x, image.shape[0], stride_x):
        for j in range(y, image.shape[1], stride_y):
            image[i, j] = (sin(i*2*pi/T+1)*sin(j*2*pi/T+1)*0.25)