Computational Science - MPI_Test not progressing MPI_Isend/MPI_Irecv requests

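I am running a CFD simulation on a mesh with 200,000 vertices. I have decomposed the mesh into 2 load-balanced subdomains to test my parallel implementation. In the particular function I am timing, each subdomain must send a 3D gradient vector of MPI_DOUBLEs for each of the ~2000 vertices located on the parallel communication boundary. The vertex list is sorted so that all vertices involved in parallel communication sit at the beginning of the list (ivertCommStart is the parallel vertex with the largest index in the list). Here is a simplified version of my code: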

```
MPI_Request sendRequ; // Local variable in each thread
MPI_Request recvRequ; // Local variable in each thread
for (int ivert = 0; ivert < mvert; ++ivert)
{
  // Perform costly calculations for each vertex.....

  if (ivert == ivertCommStart)
  {
    // All parallel vertices are now computed: load their data and
    // start the exchange with the other thread in one go.
    MPI_Isend(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &sendRequ);
    MPI_Irecv(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &recvRequ);
  }
  else if (ivert > ivertCommStart)
  {
    // Poll both requests once per remaining vertex, hoping to drive progress.
    MPI_Test(&sendRequ, ..., ...);
    MPI_Test(&recvRequ, ..., ...);
  }
}

MPI_Wait(&sendRequ, ...); // Send is still not completed at this point
MPI_Wait(&recvRequ, ...); // Recv is still not completed at this point
```

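Considering that the total number of parallel communication vertices in each subdomain is ~4000 (~4% of the total number of vertices), and considering the cost of the calculations performed for each vertex, I expected the data transfer to be completely masked by the computation on the remaining non-parallel vertices. However, this is not the case. Note that once the MPI_Wait calls finally return, I have confirmed that the exchanged data is what I expect (the parallel simulation produces the same results as the serial one). However, the cost of the MPI_Wait() calls makes my code scale very poorly. Can anyone tell me why MPI_Test() is not progressing my send and receive requests?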

Edit: Apologies for any confusion. I should clarify that I communicate the data for all parallel vertices in one go.
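To make "in one go" concrete, here is a minimal sketch of how the boundary gradients might be packed into a single contiguous buffer before the send. The `Vec3` type and the `pack_boundary_gradients` function are hypothetical names used only for illustration; the real data layout is not shown in this post.

```c
#include <stddef.h>

/* Hypothetical per-vertex gradient type; the real layout is an assumption. */
typedef struct { double x, y, z; } Vec3;

/* Copy the gradients of vertices 0..ivertCommStart (inclusive) into a flat
 * array of doubles. Returns the number of doubles written, which is the
 * element count one would pass along with MPI_DOUBLE to a single send.
 * sendBuf must have room for 3 * (ivertCommStart + 1) doubles. */
size_t pack_boundary_gradients(const Vec3 *grad, size_t ivertCommStart,
                               double *sendBuf)
{
    size_t n = 0;
    for (size_t iv = 0; iv <= ivertCommStart; ++iv) {
        sendBuf[n++] = grad[iv].x;
        sendBuf[n++] = grad[iv].y;
        sendBuf[n++] = grad[iv].z;
    }
    return n;
}
```

With roughly 2000 boundary vertices this gives about 6000 doubles, i.e. around 48,000 bytes per message if a double is 8 bytes.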

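Edit 2: I have found that these MPI_Test calls do not let me overlap communication and computation, and they carry a large overhead. In fact, the total time of the for loop plus the MPI_Wait calls is the same as when I wait until the end of the for loop and only then communicate the parallel data (the code below has the same run time as the code above). So I am seeing no benefit from non-blocking communication. I am new to MPI, and I would appreciate any insight into what might be happening here.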

```
for (int ivert = 0; ivert < mvert; ++ivert)
{
  // Perform costly calculations for each vertex.....
}

// Load data from ALL parallel vertices and communicate with other thread.
MPI_Request sendRequ;
MPI_Request recvRequ;
MPI_Isend(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &sendRequ);
MPI_Irecv(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &recvRequ);
// Waiting immediately makes the exchange effectively blocking.
MPI_Wait(&sendRequ, ...);
MPI_Wait(&recvRequ, ...);
```
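Since polling on every vertex seems to be where the overhead comes from, one variation that might be worth timing is to poll only every few vertices and to stop once the requests report completion. The sketch below shows just that loop structure, with the MPI call hidden behind a callback so it can be read and run on its own. `PollFn`, `compute_with_sparse_polling`, `countdown_poll` and the `stride` parameter are all hypothetical names; in real code the callback might wrap something like `MPI_Testall` on the two requests. Whether this actually improves overlap will depend on the MPI implementation.

```c
#include <stddef.h>

/* Hypothetical callback: returns nonzero once communication has completed.
 * In an MPI code it might wrap MPI_Testall on the send and receive requests. */
typedef int (*PollFn)(void *ctx);

/* Loop over the non-parallel vertices ivertCommStart+1 .. mvert-1 and poll
 * only every `stride` vertices (stride must be > 0), stopping the polling
 * as soon as the callback reports completion.
 * Returns the number of poll calls that were made. */
size_t compute_with_sparse_polling(size_t ivertCommStart, size_t mvert,
                                   size_t stride, PollFn poll, void *ctx)
{
    size_t polls = 0;
    int done = 0;
    for (size_t iv = ivertCommStart + 1; iv < mvert; ++iv) {
        /* ... costly per-vertex calculation would go here ... */
        if (!done && (iv - ivertCommStart) % stride == 0) {
            done = poll(ctx);
            ++polls;
        }
    }
    return polls;
}

/* Stand-in callback for experimenting without MPI: reports completion
 * once the counter pointed to by ctx has been decremented to zero. */
int countdown_poll(void *ctx)
{
    int *remaining = (int *)ctx;
    *remaining -= 1;
    return *remaining <= 0;
}
```

The stride is a tuning knob: a small value polls often and pays more call overhead, while a large value polls rarely and may leave the transfer idle for longer between calls.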