Computational Science - How do I debug a segmentation fault in a large problem?

I apologize if this question seems off-topic or opinion-based, but I don't know how else to approach it.

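I am currently working on a 100k x 100k positive definite linear system and trying to solve it with the Cholesky factorization in ScaLAPACK 2.1.0 (pdpotrf). My problem is that the program runs fine when I solve the same system for the 50k x 50k matrix (i.e. A(1:50000,1:50000)), but the 100k run ends in a segmentation fault: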

Backtrace for this error:
#0  0x7f8126942b9a in ???
#1  0x7f8126941dc3 in ???
#2  0x7f8125e4824f in ???
#3  0x45409d in ???
#4  0x44382c in pdpotf2_
        at /home/user/sclpk_solver/scalapack-2.1.0/SRC/pdpotf2.f:250
#5  0x414502 in pdpotrf_
        at /home/user/sclpk_solver/scalapack-2.1.0/SRC/pdpotrf.f:262

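Normally I would reach for record-and-replay or a similar debugging tool, but given the size of the problem and the fact that it takes about 10 hours for the failure to appear, I don't even know how to start debugging it. Exactly the same code with exactly the same data runs fine with Intel MKL (version 2018 Update 3). The GNU compilers (versions 4.8 and 7.1), however, produce the problem.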

Any input is welcome.

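Update: I just found an important detail: my program works with more than 4 cores and segfaults with 4 or fewer. I think this is because, for the 100k case, 4 or fewer cores means each core holds about 50k x 50k = 2.5e9 matrix elements or more, which is above the largest 32-bit signed integer (2^31 - 1, roughly 2.147e9). So there is probably an integer overflow somewhere. This suspicion is reinforced by the fact that, when compiling with Intel MKL, the -i8 flag is required for the numroc function to work correctly.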
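To make the overflow suspicion concrete, here is a minimal Python sketch that estimates the largest local block per process and compares it with the 32-bit limit. The `numroc` function below is a re-implementation of the arithmetic of ScaLAPACK's NUMROC tool routine, not a call into the library. The block size of 64 and the process grids are assumptions chosen for illustration, since my actual values are not shown here.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows or columns owned by process iproc in a block-cyclic layout
    (same arithmetic as ScaLAPACK's NUMROC)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of whole blocks
    count = (nblocks // nprocs) * nb       # blocks every process gets
    extra = nblocks % nprocs               # leftover whole blocks
    if mydist < extra:
        count += nb                        # one extra whole block
    elif mydist == extra:
        count += n % nb                    # the final partial block
    return count


def max_local_elements(n, nb, nprow, npcol):
    """Largest local element count over an nprow x npcol process grid."""
    rows = max(numroc(n, nb, p, 0, nprow) for p in range(nprow))
    cols = max(numroc(n, nb, q, 0, npcol) for q in range(npcol))
    return rows * cols


def wrap_int32(x):
    """Value obtained if x were stored in a 32-bit two's-complement integer."""
    return (x + 2**31) % 2**32 - 2**31


N, NB = 100_000, 64  # NB = 64 is an assumed block size
for nprow, npcol in [(1, 1), (2, 2), (2, 3), (3, 3)]:  # hypothetical grids
    count = max_local_elements(N, NB, nprow, npcol)
    status = "overflows int32" if count > INT32_MAX else "fits in int32"
    print(f"{nprow}x{npcol}: {count} local elements, {status}, "
          f"as int32: {wrap_int32(count)}")
```

Under these assumptions, the 1x1 and 2x2 grids exceed the 32-bit limit while the 2x3 and 3x3 grids do not, which lines up with the observed threshold between 4 and 5 cores. It does not show which index computation actually overflows.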

I also tried compiling ScaLAPACK with the "-g -Og" flags. This time the program fails with the following output:

{*****,    0}:  On entry to PDPOTRF parameter number    1 had an illegal value
<< Warning >> pdpotrf FAILED      -1
{*****,    0}:  On entry to PDPOTRF parameter number    1 had an illegal value
<< Warning >> pdpotrf FAILED      -1
{    0,    0}:  On entry to PDPOTRS parameter number    1 had an illegal value
<< Warning >> pdpotrs FAILED
��������
 Solved! predicting test values
{    1,    0}:  On entry to PDPOTRS parameter number    1 had an illegal value
<< Warning >> pdpotrs FAILED
��������
 TIme for solving (sec)   9.0224840000000008E-003
{    0,    0}:  On entry to PDPOTRS parameter number    1 had an illegal value
<< Warning >> pdpotrs FAILED
��������
 Solved! delta test values
 TIme for solving (sec)   9.1293109999999993E-003
{    1,    0}:  On entry to PDPOTRS parameter number    1 had an illegal value
<< Warning >> pdpotrs FAILED
��������
~

PBLAS ERROR 'Illegal DESCA[INB_] = 0, DESCA[INB_] must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.

PBLAS ERROR 'Illegal DESCA[NB_] = 0, DESCA[NB_] must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.

PBLAS ERROR 'Illegal DESCA[RSRC_] = 100000, DESCA[RSRC_] must be either -1, or >= 0 and < 2'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.

PBLAS ERROR 'Illegal DESCA[M_] = 0, it must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.

PBLAS ERROR 'Illegal DESCX[INB_] = 0, DESCX[INB_] must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.
...
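One possible reading of the PBLAS messages, offered as a hypothesis rather than a diagnosis: they look like what happens when an array descriptor filled with 8-byte integers is read by a library expecting 4-byte integers. The Python sketch below packs a type-1 ScaLAPACK descriptor as 64-bit values and reinterprets the same bytes as 32-bit values. The descriptor contents (block size 64, local leading dimension 50016) are hypothetical, and whether my build actually mixes integer widths is an assumption.

```python
import struct

# Field order of a ScaLAPACK type-1 array descriptor (9 entries).
FIELDS = ["DTYPE_", "CTXT_", "M_", "N_", "MB_", "NB_", "RSRC_", "CSRC_", "LLD_"]


def reinterpret_as_int32(desc64):
    """Store the descriptor as little-endian 64-bit integers, then read the
    first nine 32-bit words, as a 4-byte-integer library would."""
    raw = struct.pack(f"<{len(desc64)}q", *desc64)
    words = struct.unpack(f"<{len(raw) // 4}i", raw)
    return dict(zip(FIELDS, words[:len(FIELDS)]))


# Hypothetical descriptor for a 100000 x 100000 matrix, context 0.
desc = [1, 0, 100_000, 100_000, 64, 64, 0, 0, 50_016]
seen = reinterpret_as_int32(desc)
for name in FIELDS:
    print(f"{name:7s} written={desc[FIELDS.index(name)]:>6d}  read={seen[name]:>6d}")
```

With these values the misread descriptor has M_ = 0, NB_ = 0 and RSRC_ = 100000, the same values reported in the error messages above. That match is suggestive only; it would need to be checked against the integer flags used for both the ScaLAPACK build and the calling program.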

Any suggestions now?