I am trying to solve a large system with the help of PETSc. Because of the size of the problem I am using a matrix-free approach, in which the operator is just a shell. I also supply my own preconditioner matrix (not a shell), and I use an ILU(2) factorization of that preconditioner.
The problem: the setup phase of the solver (see the relevant code block below) takes a very long time. I suspect the main culprit is the ILU of the preconditioner. I know ILU is expected to take time, but what worries me is this: when I solve the same problem with a direct solver (computing the LU factorization with MKL LAPACKE and then inverting, outside of PETSc), the LU is 10 times faster. I would have expected PETSc's ILU to take a time comparable to a full LU factorization, so it seems strange that it is 10 times slower. (You might ask why I bother with an iterative solver at all if I can use LU; this example is not as large as the problems I actually want to run, for which a direct solver will no longer be an option.)
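For reference, the direct-solver comparison above is done roughly along these lines; this is only a sketch of the MKL LAPACKE calls involved (the function direct_solve_sketch and its arguments are placeholders, not my actual code):
#include <stdlib.h>
#include <mkl_lapacke.h>
/* Sketch: dense LU with MKL LAPACKE, followed by an explicit inverse
   computed from the factors. n and a (column-major, n*n, complex double)
   come from the calling code. */
void direct_solve_sketch(lapack_int n, lapack_complex_double *a)
{
    lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
    lapack_int  info;
    info = LAPACKE_zgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);   /* LU factorization */
    if (info == 0)
        info = LAPACKE_zgetri(LAPACK_COL_MAJOR, n, a, n, ipiv);  /* invert using the LU factors */
    free(ipiv);
}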
Here is the code snippet relevant to the problem:
MatCreateShell(comm, Nu, Nu, Nu, Nu, ctx, &A_shell);
MatShellSetOperation(A_shell, MATOP_MULT, (void(*)(void))usermult);
KSPCreate(comm, &solver);
KSPSetOperators(solver, A_shell, PreconditionerMatrix);
KSPSetInitialGuessNonzero(solver, PETSC_TRUE);
KSPSetNormType(solver, KSP_NORM_UNPRECONDITIONED);
KSPSetFromOptions(solver);
KSPSetUp(solver);
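For completeness, usermult above is the matrix-free multiply callback. It follows the standard shell-matrix multiply signature, roughly like this (the actual application of the operator is problem-specific and omitted):
/* Sketch of the shell multiply callback registered via MatShellSetOperation.
   The real operator application lives in my own code and uses ctx. */
PetscErrorCode usermult(Mat A_shell, Vec x, Vec y)
{
  void           *ctx;
  PetscErrorCode ierr;
  PetscFunctionBeginUser;
  ierr = MatShellGetContext(A_shell, &ctx); CHKERRQ(ierr);
  /* ... apply the operator to x using the data in ctx, writing the result into y ... */
  PetscFunctionReturn(0);
}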
Things I know / have tried:
- The condition number of the matrix can get large, but I don't think my problem has anything to do with that because, again, the time sink is in the setup. If the condition number were the issue it would also show up when I do the full LU factorization, and it doesn't.
- I know I have to provide a large enough guess for the fill factor of the factored matrix, and I have set the option -pc_factor_fill to 3 (the same thing can be set from code with PCFactorSetFill; see the sketch after this list). Running the code with -info, I confirmed that this is enough to prevent any memory reallocation. Interesting side note: when I run with -info, it reports the required fill factor quite quickly. Does that mean it actually performs the ILU as quickly as expected, but then gets stuck somewhere else? Am I barking up the wrong tree? Here is what it reports:
[0] PetscCommDuplicate(): Using internal PETSc communicator 7412512 20851120
[0] PetscCommDuplicate(): Using internal PETSc communicator 7412512 20851120
[0] PCSetUp(): Setting up PC for first time
[0] PetscCommDuplicate(): Using internal PETSc communicator 7412512 20851120
[0] PetscCommDuplicate(): Using internal PETSc communicator 7412512 20851120
[0] PetscCommDuplicate(): Using internal PETSc communicator 7412512 20851120
[0] PetscCommDuplicate(): Using internal PETSc communicator 7412512 20851120
[0] MatILUFactorSymbolic_SeqAIJ(): Reallocs 0 Fill ratio:given 3. needed 1.7385
[0] MatILUFactorSymbolic_SeqAIJ(): Run with -[sub_]pc_factor_fill 1.7385 or use
[0] MatILUFactorSymbolic_SeqAIJ(): PCFactorSetFill([sub]pc,1.7385);
[0] MatILUFactorSymbolic_SeqAIJ(): for best performance.
[0] MatSeqAIJCheckInode_FactorLU(): Found 2030 nodes of 6096. Limit used: 5. Using Inode routines
After that it hangs for a long time... so maybe I set the fill factor too large? I tried the same thing with a fill factor of 2, but it made no difference.
- I have made sure I am not using a debugging install of PETSc; whenever I time the code I definitely use --with-debugging=0 in the PETSc configure step.
- I am not using any parallelization.
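As noted above, the factorization options can also be set programmatically instead of through the option table. A minimal sketch, assuming solver is the KSP created in the code block above:
/* Sketch: programmatic equivalents of -pc_type ilu, -pc_factor_levels 2
   and -pc_factor_fill 3 for the solver created earlier. */
PC pc;
KSPGetPC(solver, &pc);
PCSetType(pc, PCILU);
PCFactorSetLevels(pc, 2);   /* ILU(2) */
PCFactorSetFill(pc, 3.0);   /* expected ratio of nonzeros in the factors to nonzeros in the original matrix */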
Here is the output generated with -log_view:
Using Petsc Release Version 3.7.6, Apr, 24, 2017
Max Max/Min Avg Total
Time (sec): 4.057e+02 1.00000 4.057e+02
Objects: 7.050e+02 1.00000 7.050e+02
Flops: 2.161e+11 1.00000 2.161e+11 2.161e+11
Flops/sec: 5.327e+08 1.00000 5.327e+08 5.327e+08
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 4.0567e+02 100.0% 2.1612e+11 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 624 1.0 3.2336e+01 1.0 4.23e+10 1.0 0.0e+00 0.0e+00 0.0e+00 8 20 0 0 0 8 20 0 0 0 1307
MatMultAdd 2480 1.0 5.4122e-02 1.0 5.00e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 923
MatMultTranspose 3100 1.0 2.8153e+00 1.0 9.05e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 4 0 0 0 1 4 0 0 0 3215
MatSolve 608 1.0 4.6020e+01 1.0 7.41e+10 1.0 0.0e+00 0.0e+00 0.0e+00 11 34 0 0 0 11 34 0 0 0 1611
MatLUFactorNum 1 1.0 4.4004e+01 1.0 9.31e+10 1.0 0.0e+00 0.0e+00 0.0e+00 11 43 0 0 0 11 43 0 0 0 2115
MatILUFactorSym 1 1.0 3.4659e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0
MatAssemblyBegin 27 1.0 1.5497e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 27 1.0 6.8570e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRow 10650098 1.0 6.6575e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatGetRowIJ 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 1.3018e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 39 1.0 1.5883e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMult 2 1.0 3.5977e+00 1.0 5.97e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 1660
MatMatMultSym 2 1.0 7.2353e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMultNum 2 1.0 2.8741e+00 1.0 5.97e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 2078
VecMDot 302 1.0 4.2068e-02 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 4907
VecNorm 621 1.0 1.4746e-02 1.0 3.03e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2054
VecScale 314 1.0 2.0843e-03 1.0 7.66e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 3674
VecCopy 636 1.0 1.0492e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 4096 1.0 4.1316e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 5278 1.0 9.9347e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1977
VecAYPX 306 1.0 4.3933e-03 1.0 7.46e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1698
VecMAXPY 608 1.0 5.9476e-02 1.0 4.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 6993
VecAssemblyBegin 2493 1.0 1.5733e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2493 1.0 1.5340e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 314 1.0 1.0375e-02 1.0 2.30e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2214
KSPGMRESOrthog 302 1.0 7.3104e-02 1.0 4.13e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 5647
KSPSetUp 1 1.0 1.2398e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 4 1.0 7.8504e+01 1.0 1.17e+11 1.0 0.0e+00 0.0e+00 0.0e+00 19 54 0 0 0 19 54 0 0 0 1491
PCSetUp 1 1.0 7.8663e+01 1.0 9.31e+10 1.0 0.0e+00 0.0e+00 0.0e+00 19 43 0 0 0 19 43 0 0 0 1183
PCApply 608 1.0 4.6022e+01 1.0 7.41e+10 1.0 0.0e+00 0.0e+00 0.0e+00 11 34 0 0 0 11 34 0 0 0 1611
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 33 30 1306648980 0.
Vector 663 663 65247472 0.
Krylov Solver 1 1 35264 0.
Preconditioner 1 1 1008 0.
Viewer 2 0 0 0.
Index Set 5 5 87624 0.
========================================================================================================================
Average time to get PetscTime(): 5.96046e-07
#PETSc Option Table entries:
-ksp_atol 1e-8
-ksp_converged_reason
-ksp_monitor
-ksp_monitor_true_residual
-ksp_rtol 1e-8
-log_view
-pc_factor_fill 3
-pc_factor_levels 2
-pc_type ilu
#End of PETSc Option Table entries
Compiled with FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 16 sizeof(PetscInt) 4
Configure options: PETSC_ARCH=arch-linux2-cxx-nodebug --with-scalar-type=complex --with-fortran-kernels=1 --with-clanguage=c++ --with-debugging=0 --with-cxx=g++ CXXOPTFLAGS=-O3 COPTFLAGS=O3 FOPTFLAGS=-O3 --download-openmpi --with-blaslapack-dir=/opt/intel/mkl
Questions:
- I realize there may be no simple/obvious solution to this, but at the very least I would like to understand why it happens and what PETSc is doing internally that takes so long.
- If there is no clear fix, what steps could I take to mitigate this or to investigate it further? (One thing I am considering is shown in the sketch after this list.)
- Or is this expected/normal, and should I stop worrying about it and just suck it up?
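One investigative step I am considering (not yet tried) is putting the setup into its own logging stage so that -log_view reports it separately from the solves. A minimal sketch; the stage name is just a placeholder:
/* Sketch: wrap the solver/preconditioner setup in its own logging stage so
   its cost shows up as a separate stage in -log_view. */
PetscLogStage setup_stage;
PetscLogStageRegister("SolverSetup", &setup_stage);
PetscLogStagePush(setup_stage);
KSPSetUp(solver);
PetscLogStagePop();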
Sorry for the length of this question, and thanks for your time!