求解 Ax=B,其中 B 是并行矩阵

计算科学 稀疏矩阵 矩阵 矩阵分解
2021-11-30 00:43:32

我尝试解决问题Ax=B在哪里A是一个大稀疏n×n矩阵,和B是稠密的n×m矩阵(这里n=754850m=182)。反斜杠运算符产生正确的解x = A\B这显然会减慢进程。这里有什么问题?

我还尝试了 LU 分解A然后求解中的每一列B,但lu(A)似乎没有被并行化(尽管我使用的是 version R2014b)。我看到人们认为lu(A)是并行化的,所以我错过了一些更新还是什么?

通过执行

spparms('spumoni',2)
x = A\b;
spparms

我得到以下输出

sp\: bandwidth = 60795+1+60795.
sp\: is A diagonal? no.
sp\: is band density (0.00) > bandden (0.50) to try banded solver? no.
sp\: is A triangular? no.
sp\: is A morally triangular? no.
sp\: is A a candidate for Cholesky (symmetric, real positive diagonal)? no.
sp\: use Unsymmetric MultiFrontal PACKage with automatic reordering.
UMFPACK V5.4.0 (May 20, 2009), Control:
    Matrix entry defined as: double complex
    Int (generic integer) defined as: UF_long

    0: print level: 2
    1: dense row parameter:    0.2
        "dense" rows have    > max (16, (0.2)*16*sqrt(n_col) entries)
    2: dense column parameter: 0.2
        "dense" columns have > max (16, (0.2)*16*sqrt(n_row) entries)
    3: pivot tolerance: 0.1
    4: block size for dense matrix kernels: 32
    5: strategy: 0 (auto)
    6: initial allocation ratio: 0.7
    7: max iterative refinement steps: 2
    12: 2-by-2 pivot tolerance: 0.01
    13: Q fixed during numerical factorization: 0 (auto)
    14: AMD dense row/col parameter:    10
       "dense" rows/columns have > max (16, (10)*sqrt(n)) entries
        Only used if the AMD ordering is used.
    15: diagonal pivot tolerance: 0.001
        Only used if diagonal pivoting is attempted.
    16: scaling: 1 (divide each row by sum of abs. values in each row)
    17: frontal matrix allocation ratio: 0.5
    18: drop tolerance: 0
    19: AMD and COLAMD aggressive absorption: 1 (yes)

    The following options can only be changed at compile-time:
    8: BLAS library used:  Fortran BLAS.  size of BLAS integer: 8
    9: compiled for MATLAB
    10: CPU timer is POSIX times ( ) routine.
    11: compiled for normal operation (debugging disabled)
    computer/operating system: Linux
    size of int: 4 UF_long: 8 Int: 8 pointer: 8 double: 8 Entry: 16 (in bytes)

    sp\: UMFPACK's factorization was successful.
    sp\: UMFPACK's solve was successful.
    UMFPACK V5.4.0 (May 20, 2009), Info:
        matrix entry defined as:          double complex
        Int (generic integer) defined as: UF_long
        BLAS library used: Fortran BLAS.  size of BLAS integer: 8
        MATLAB:                           yes.
        CPU timer:                        POSIX times ( ) routine.
        number of rows in matrix A:       754850
        number of columns in matrix A:    754850
        entries in matrix A:              86456682
        memory usage reported in:         16-byte Units
        size of int:                      4 bytes
        size of UF_long:                  8 bytes
        size of pointer:                  8 bytes
        size of numerical entry:          16 bytes

        strategy used:                    symmetric
        ordering used:                    amd on A+A'
        modify Q during factorization:    no
        prefer diagonal pivoting:         yes
        pivots with zero Markowitz cost:               0
        submatrix S after removing zero-cost pivots:
            number of "dense" rows:                    0
            number of "dense" columns:                 0
            number of empty rows:                      0
            number of empty columns                    0
            submatrix S square and diagonal preserved
        pattern of square submatrix S:
            number rows and columns                    754850
            symmetry of nonzero pattern:               1.000000
            nz in S+S' (excl. diagonal):               85701832
            nz on diagonal of matrix S:                754850
            fraction of nz on diagonal:                1.000000
        AMD statistics, for strict diagonal pivoting:
            est. flops for LU factorization:           4.16517e+14
            est. nz in L+U (incl. diagonal):           7435413184
            est. largest front (# entries):            896703025
            est. max nz in any column of L:            29945
            number of "dense" rows/columns in S+S':    0
        symbolic factorization defragmentations:       0
        symbolic memory usage (Units):                 199337590
        symbolic memory usage (MBytes):                3041.7
        Symbolic size (Units):                         1961813
        Symbolic size (MBytes):                        30
        symbolic factorization CPU time (sec):         15.57
        symbolic factorization wallclock time(sec):    15.45

        matrix scaled: yes (divided each row by sum of abs values in each row)
        minimum sum (abs (rows of A)):              3.25924e-02
        maximum sum (abs (rows of A)):              6.95838e+01

        symbolic/numeric factorization:      upper bound               actual      %
        variable-sized part of Numeric object:
            initial size (Units)               226394213            225639362   100%
            peak size (Units)               101780904057           8946021598     9%
            final size (Units)               98119112401           7543694062     8%
        Numeric final size (Units)           98124018962           7548223198     8%
        Numeric final size (MBytes)            1497253.7             115176.7     8%
        peak memory usage (Units)           101792135841           8957253382     9%
        peak memory usage (MBytes)             1553224.7             136676.8     9%
        numeric factorization flops          2.95433e+16          4.20989e+14     1%
        nz in L (incl diagonal)              36831301536           3820083550    10%
        nz in U (incl diagonal)              58996420369           3718115473     6%
        nz in L+U (incl diagonal)            95826967055           7537444173     8%
        largest front (# entries)            13417023050            896703025     7%
        largest # rows in front                    93703                29945    32%
        largest # columns in front                144673                29945    21%

        initial allocation ratio used:                 0.0942
        # of forced updates due to frontal growth:     0
        number of off-diagonal pivots:                 38
        nz in L (incl diagonal), if none dropped       3820083550
        nz in U (incl diagonal), if none dropped       3718115473
        number of small entries dropped                0
        nonzeros on diagonal of U:                     754850
        min abs. value on diagonal of U:               8.14e-09
        max abs. value on diagonal of U:               1.42e+02
        estimate of reciprocal of condition number:    5.75e-11
        indices in compressed pattern:                 13860039
        numerical values stored in Numeric object:     7537500101
        numeric factorization defragmentations:        3
        numeric factorization reallocations:           0
        costly numeric factorization reallocations:    0
        numeric factorization CPU time (sec):          106790.54
        numeric factorization wallclock time (sec):    7368.85
        numeric factorization mflops (CPU time):       3942.19
        numeric factorization mflops (wallclock):      57130.86
        symbolic + numeric CPU time (sec):             106806.11
        symbolic + numeric mflops (CPU time):          3941.62
        symbolic + numeric wall clock time (sec):      7384.30
        symbolic + numeric mflops (wall clock):        57011.33

        solve flops:                                   1.86839e+11
        iterative refinement steps taken:              1
        iterative refinement steps attempted:          2
        sparse backward error omega1:                  4.93e-16
        sparse backward error omega2:                  0.00e+00
        solve CPU time (sec):                          248.79
        solve wall clock time (sec):                   246.02
        solve mflops (CPU time):                       750.99
        solve mflops (wall clock time):                759.45

        total symbolic + numeric + solve flops:        4.21176e+14
        total symbolic + numeric + solve CPU time:     107054.90
        total symbolic + numeric + solve mflops (CPU): 3934.20
        total symbolic+numeric+solve wall clock time:  7633.09
        total symbolic+numeric+solve mflops(wallclock) 55177.60

    SParse MONItor output level 2.
    mmd: threshold = 1.1 * mindegree + 1,
         using approximate degrees in A'*A,
         supernode amalgamation every 3 stages,
         row reduction every 3 stages,
         withhold rows at least 50% dense in colmmd.
    Minimum degree orderings used with v4 chol, lu, and qr in \ and /.
    Approximate minimum degree orderings used with CHOLMOD and UMFPACK in \ and /.
    Pivot tolerance of 0.1 used by UMFPACK in \ and /.
    Backslash uses band solver if band density is > 0.5
    UMFPACK used for lu in \ and /.
    Symmetric pivot tolerance of 0.001 used by UMFPACK in \ and /.
    Pivot tolerance of 0.01 used by MA57 in \ and /.
1个回答

当输入矩阵是稀疏、正方形和不对称时,MATLAB \ 运算符使用 UMFPACK。UMFPACK 不是并行稀疏求解器。但是,UMFPACK 分解步骤在其计算内核中调用 BLAS 3 级例程。BLAS 的大多数现代实现(例如 MATLAB 中的英特尔 MKL)都支持多核架构,因此 UMFPACK 可以在分解步骤中获得相当显着的加速。由于 UMFPACK 中没有任何高级并行性,并且并行化的 BLAS 在求解步骤中没有多大帮助,因此在多核机器上没有任何显着的加速。

当使用以下形式时,MATLAB lu 函数调用较旧的 LU 实现 ( https://www.mathworks.com/help/pdf_doc/otherdocs/simax.pdf ):

[L,U,P] = lu(A);

和这种形式的 UMFPACK:

[L,U,P,Q] = lu(A);

较旧的实现不支持 BLAS 3 级调用,因此在多核机器上没有加速。UMFPACK 版本的 lu 以与反斜杠相同的方式支持并行 BLAS。

Tim Davis 的这篇论文中包含了关于 MATLAB 反斜杠运算符和 lu 函数的有趣讨论: http ://www.cise.ufl.edu/research/sparse/techreports/factorize.pdf