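I apologize if this question seems off-topic or opinion-based, but I don't know how to go about this.
I am currently working on a 100k x 100k positive definite linear system and trying to solve it with the Cholesky factorization in ScaLAPACK 2.1.0 (pdpotrf). The program runs fine when I run the same system on a 50k x 50k matrix (i.e. A(1:50000,1:50000)), but the 100k run ends in a segmentation fault: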
Backtrace for this error:
#0 0x7f8126942b9a in ???
#1 0x7f8126941dc3 in ???
#2 0x7f8125e4824f in ???
#3 0x45409d in ???
#4 0x44382c in pdpotf2_
at /home/user/sclpk_solver/scalapack-2.1.0/SRC/pdpotf2.f:250
#5 0x414502 in pdpotrf_
at /home/user/sclpk_solver/scalapack-2.1.0/SRC/pdpotrf.f:262
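Normally I would use record-and-replay or a similar debugging tool to track this down, but given the size of the problem and the fact that it takes about 10 hours to reach the failure, I don't even know how to start debugging it. The exact same code with the exact same data runs fine with Intel MKL (version 2018 Update 3). It only fails when built with the GNU compilers (versions 4.8 and 7.1).
Any comments are welcome.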
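Update: I just found an important detail: my program works with more than 4 cores and segfaults with 4 or fewer. I think this is because, for 100k, 4 or fewer cores means each core holds roughly 50k x 50k = 2.5e9 elements or more, which exceeds the 32-bit signed integer maximum of 2,147,483,647. So there is an integer overflow somewhere. This is reinforced by the fact that, when compiling with Intel MKL, the -i8 flag is needed for the numroc function to work properly.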
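The overflow hypothesis can be checked with a little arithmetic. The sketch below re-implements the NUMROC counting rule in Python (it is not the ScaLAPACK routine itself) and compares the largest local block against the 32-bit limit. The block size of 64 and the 2 x 2 and 2 x 3 process grids are assumptions for illustration; the post does not state them.

```python
INT32_MAX = 2**31 - 1  # largest value a 4-byte signed integer can hold


def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows (or columns) of a block-cyclically distributed dimension of
    size n, block size nb, owned by process iproc out of nprocs."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of whole blocks
    count = (nblocks // nprocs) * nb       # blocks every process gets
    extra = nblocks % nprocs               # leftover whole blocks
    if mydist < extra:
        count += nb                        # one more whole block
    elif mydist == extra:
        count += n % nb                    # the final partial block
    return count


def max_local_elements(n, nb, nprow, npcol):
    """Largest local array size (rows * cols) over an nprow x npcol grid."""
    rows = max(numroc(n, nb, r, 0, nprow) for r in range(nprow))
    cols = max(numroc(n, nb, c, 0, npcol) for c in range(npcol))
    return rows * cols


if __name__ == "__main__":
    NB = 64  # assumed block size, not taken from the post
    for n, nprow, npcol in [(50000, 2, 2), (100000, 2, 2), (100000, 2, 3)]:
        local = max_local_elements(n, NB, nprow, npcol)
        status = "overflows" if local > INT32_MAX else "fits in"
        print(f"n={n}, grid={nprow}x{npcol}: {local} local elements, "
              f"{status} a 32-bit index")
```

With these assumed parameters the 100k matrix on a 2 x 2 grid exceeds the limit, while the 50k matrix on 2 x 2 and the 100k matrix on 2 x 3 stay below it, which matches the pattern described above.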
I also tried compiling ScaLAPACK with the "-g -Og" flags; this time the program failed with the following output:
{*****, 0}: On entry to PDPOTRF parameter number 1 had an illegal value
<< Warning >> pdpotrf FAILED -1
{*****, 0}: On entry to PDPOTRF parameter number 1 had an illegal value
<< Warning >> pdpotrf FAILED -1
{ 0, 0}: On entry to PDPOTRS parameter number 1 had an illegal value
<< Warning >> pdpotrs FAILED
(unprintable characters)
Solved! predicting test values
{ 1, 0}: On entry to PDPOTRS parameter number 1 had an illegal value
<< Warning >> pdpotrs FAILED
(unprintable characters)
TIme for solving (sec) 9.0224840000000008E-003
{ 0, 0}: On entry to PDPOTRS parameter number 1 had an illegal value
<< Warning >> pdpotrs FAILED
(unprintable characters)
Solved! delta test values
TIme for solving (sec) 9.1293109999999993E-003
{ 1, 0}: On entry to PDPOTRS parameter number 1 had an illegal value
<< Warning >> pdpotrs FAILED
(unprintable characters)
PBLAS ERROR 'Illegal DESCA[INB_] = 0, DESCA[INB_] must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.
PBLAS ERROR 'Illegal DESCA[NB_] = 0, DESCA[NB_] must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.
PBLAS ERROR 'Illegal DESCA[RSRC_] = 100000, DESCA[RSRC_] must be either -1, or >= 0 and < 2'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.
PBLAS ERROR 'Illegal DESCA[M_] = 0, it must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.
PBLAS ERROR 'Illegal DESCX[INB_] = 0, DESCX[INB_] must be at least 1'
from {0,0}, pnum=0, Contxt=0, in routine 'PDGEMV'.
...
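One hypothesis that would produce descriptor values like these is an integer-width mismatch: the caller stores the array descriptor as 8-byte integers while the library reads it as 4-byte integers. The sketch below only illustrates that hypothesis and is not a diagnosis. It assumes a little-endian machine, and the context handle 0, block size 64, and local leading dimension 50016 are placeholder values not taken from the post.

```python
import struct

# ScaLAPACK array-descriptor fields, in storage order.
FIELDS = ["DTYPE_", "CTXT_", "M_", "N_", "MB_", "NB_", "RSRC_", "CSRC_", "LLD_"]


def reinterpret_as_int32(desc):
    """Store a 9-entry descriptor as 8-byte integers, then read the first
    nine 4-byte integers from the same memory (little-endian assumed)."""
    raw = struct.pack("<9q", *desc)          # what the caller would write
    as32 = struct.unpack("<18i", raw)        # what the library would read
    return dict(zip(FIELDS, as32[:9]))


# Hypothetical descriptor for a 100000 x 100000 matrix:
# DTYPE_=1, CTXT_=0, M_=N_=100000, MB_=NB_=64, RSRC_=CSRC_=0, LLD_=50016.
desc = [1, 0, 100000, 100000, 64, 64, 0, 0, 50016]
seen = reinterpret_as_int32(desc)

for name, value in seen.items():
    print(f"DESCA[{name}] = {value}")
```

Under these assumptions the reinterpreted descriptor has M_ = 0, NB_ = 0 and RSRC_ = 100000, the same values the PBLAS messages complain about. If that is what is happening, the integer width used to build the calling code and the one used to build ScaLAPACK would be worth comparing.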
Any suggestions now?