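I think you are comparing high-level libraries built on GPGPU technologies (such as C++ AMP) with programming SIMD at the lowest possible level, in assembly language or intrinsics. That is not a fair comparison.
Since you specifically mentioned C++ AMP, let me use a basic example to show that your premise is incorrect. The introductory AMP example shows how to use AMP to parallelize the following simple vector addition: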
    #include <iostream>

    void StandardMethod() {
        int aCPP[] = {1, 2, 3, 4, 5};
        int bCPP[] = {6, 7, 8, 9, 10};
        int sumCPP[5];

        for (int idx = 0; idx < 5; idx++)
        {
            sumCPP[idx] = aCPP[idx] + bCPP[idx];
        }

        for (int idx = 0; idx < 5; idx++)
        {
            std::cout << sumCPP[idx] << "\n";
        }
    }
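To do that with AMP, you have to use AMP library functions and write the loop in a specific way. What if we want to parallelize this loop using SIMD instructions on the CPU instead? We only need to compile it with a reasonably modern compiler and suitable flags!
Compiling the code above with gcc 4.8.3 on Linux and disassembling the result, I get: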
    $ gcc -O3 -c vectest.cpp && objdump -M intel -d vectest.o

    vectest.o: file format elf64-x86-64

    Disassembly of section .text:

    0000000000000000 <_Z14StandardMethodv>:
    0: 55 push rbp
    1: 53 push rbx
    2: 48 83 ec 68 sub rsp,0x68
    6: c7 04 24 01 00 00 00 mov DWORD PTR [rsp],0x1
    d: c7 44 24 04 02 00 00 mov DWORD PTR [rsp+0x4],0x2
    14: 00
    15: 48 8d 5c 24 40 lea rbx,[rsp+0x40]
    1a: c7 44 24 08 03 00 00 mov DWORD PTR [rsp+0x8],0x3
    21: 00
    22: c7 44 24 0c 04 00 00 mov DWORD PTR [rsp+0xc],0x4
    29: 00
    2a: 48 8d 6c 24 54 lea rbp,[rsp+0x54]
    2f: 66 0f 6f 04 24 movdqa xmm0,XMMWORD PTR [rsp]
    34: c7 44 24 20 06 00 00 mov DWORD PTR [rsp+0x20],0x6
    3b: 00
    3c: c7 44 24 24 07 00 00 mov DWORD PTR [rsp+0x24],0x7
    43: 00
    44: c7 44 24 28 08 00 00 mov DWORD PTR [rsp+0x28],0x8
    4b: 00
    4c: c7 44 24 2c 09 00 00 mov DWORD PTR [rsp+0x2c],0x9
    53: 00
    54: 66 0f fe 44 24 20 paddd xmm0,XMMWORD PTR [rsp+0x20]
    5a: 66 0f 7f 44 24 40 movdqa XMMWORD PTR [rsp+0x40],xmm0
    60: c7 44 24 10 05 00 00 mov DWORD PTR [rsp+0x10],0x5
    67: 00
    68: c7 44 24 30 0a 00 00 mov DWORD PTR [rsp+0x30],0xa
    6f: 00
    70: c7 44 24 50 0f 00 00 mov DWORD PTR [rsp+0x50],0xf
    77: 00
    78: 8b 33 mov esi,DWORD PTR [rbx]
    7a: bf 00 00 00 00 mov edi,0x0
    7f: 48 83 c3 04 add rbx,0x4
    83: e8 00 00 00 00 call 88 <_Z14StandardMethodv+0x88>
    88: ba 01 00 00 00 mov edx,0x1
    8d: be 00 00 00 00 mov esi,0x0
    92: 48 89 c7 mov rdi,rax
    95: e8 00 00 00 00 call 9a <_Z14StandardMethodv+0x9a>
    9a: 48 39 eb cmp rbx,rbp
    9d: 75 d9 jne 78 <_Z14StandardMethodv+0x78>
    9f: 48 83 c4 68 add rsp,0x68
    a3: 5b pop rbx
    a4: 5d pop rbp
    a5: c3 ret

    Disassembly of section .text.startup:

    0000000000000000 <_GLOBAL__sub_I__Z14StandardMethodv>:
    0: 48 83 ec 08 sub rsp,0x8
    4: bf 00 00 00 00 mov edi,0x0
    9: e8 00 00 00 00 call e <_GLOBAL__sub_I__Z14StandardMethodv+0xe>
    e: ba 00 00 00 00 mov edx,0x0
    13: be 00 00 00 00 mov esi,0x0
    18: bf 00 00 00 00 mov edi,0x0
    1d: 48 83 c4 08 add rsp,0x8
    21: e9 00 00 00 00 jmp 26 <_GLOBAL__sub_I__Z14StandardMethodv+0x26>
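As you can see, the compiler has automatically used SIMD instructions to optimize your loop, without any CPU-specific or library-specific annotations. Strictly speaking these are SSE2 instructions rather than MMX: movdqa loads the first four elements into the 128-bit register xmm0, and a single paddd adds four 32-bit integers at once. The fifth sum does not fit in that register, so the compiler simply stores the precomputed constant 0xf (5 + 10 = 15). So I would say your claim that using SIMD from high-level code is harder than using GPGPU technologies is incorrect. If anything, the opposite is true.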
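The example above uses tiny, fixed-size arrays, so here is a minimal sketch of the same idea for arrays whose length is only known at run time. The function name AddVectors is hypothetical and not part of the original example. The point is that the loop is still plain C++ with no intrinsics or annotations. Whether the compiler actually vectorizes it depends on the compiler version, the optimization flags (for example -O3, as above) and the target CPU, so it is worth checking the disassembly the same way as before.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original example): element-wise sum of
// two equally sized vectors, written as an ordinary scalar loop.
// An optimizing compiler may turn this loop into SIMD instructions on its
// own; nothing in the source requests it explicitly.
std::vector<int> AddVectors(const std::vector<int>& a,
                            const std::vector<int>& b) {
    // Assumption for this sketch: both inputs have the same length.
    assert(a.size() == b.size());

    std::vector<int> sum(a.size());
    for (std::size_t idx = 0; idx < a.size(); idx++) {
        sum[idx] = a[idx] + b[idx];
    }
    return sum;
}
```

Calling AddVectors({1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}) returns the same five sums as StandardMethod. With recent GCC versions, the -fopt-info-vec option should report which loops were vectorized, which saves reading the disassembly by hand.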