计算科学 - 并行稀疏矩阵和密集向量乘法的最新技术是什么？ - 吾爱随笔录

当然，在这方面已经有很多高度优化的库。但是我正在研究无矩阵的上下文，因为问题大小不允许显式存储稀疏矩阵元素。我很想了解人们如何处理这类问题，尤其是如何缓解内存瓶颈。

==========================更新========================

正如其他评论员所建议的，我将进一步阐述我的情况。

我目前的实现如下，这只是各种教科书中出现的基本实现。

void SpMVMtply(vector<det>& basis, ColVec& x, ColVec& b)
{

    /*
         "basis" is the orthonormal basis that spans the physical space 
         "x" ,"b" are just those in Ax = b
    */
    int len = x.size(); 

    #pragma omp parallel
    {
        #pragma omp for nowait 
        for (int i=0; i<len; i++)
        {           
            vector<det> pool_det; 
            vector<cdouble> pool_amp; 

            /*
               this function generates the non-zero off diagonal elements 
               with respect to the given basis vector. pool_det is the 
               collection of generated basis vectors, and pool_map collection 
               of the amplitude. 
            */

            OffDiagGen(basis[i], pool_det, pool_amp);

            // this function generates the diagonal element
            cdouble sum = x(i)* DiagCal(basis[i]);

            /*
                Below, basis_pos is an unordered_map with basis vector as key 
                and that basis vector's position in "basis" array as value. 

            */
            for (int j=0; j< pool_det.size(); j++) 
                sum += x(basis_pos[pool_det[j]]) * conj(pool_amp[j]); 

            b(i) = sum ;  
        }
    }
}

向量的大小可以达到 10 亿，因此稀疏矩阵的显式存储是不可能的。粗略的分析表明，x(basis_pos[pool_det[j]]) 中内存的非顺序访问是最耗时的部分。在使用 12 个内核的 Xeon Gold 6242 上，对于具有 300 万个元素的向量，该部分需要 6.2 秒，而将该部分替换为常数只需要 1.5 秒。这就是为什么我称之为内存瓶颈。我主要担心的是，如果这样的瓶颈是不可避免的，因为即使显式存储了稀疏矩阵，这种非顺序的内存访问也会出现。根据我的知识，我还没有找到避免它的方法。