Data mining - Why does my master node run out of heap memory when computing the inverse of a square matrix with the built-in SVD API in Apache Spark? - 吾爱随笔录

I am using a three-node system:

Master node: $64$ GB of memory
2 worker nodes: $32$ GB of memory

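I am computing the inverse of a square matrix with Apache Spark's built-in SVD function. When I call it, the master node's $64$ GB of memory is completely consumed and the system starts using swap, which makes execution very slow and eventually fills the heap.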

If the matrix size is $3000 \times 3000$ there is no problem, but for any size larger than $3000$ (such as $3500$ or $5000$) the problem described above occurs.

Note: even MATLAB can handle a $10K \times 10K$ matrix on my machine with 32 GB of RAM, yet Apache Spark runs out of memory when I use its built-in SVD function on the system above.
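
For scale, here is a rough estimate of the raw input size. It assumes the matrix is stored densely in double precision ($8$ bytes per entry), which is my assumption and not something Spark reports:

$$
\text{bytes}(n) = 8n^2, \qquad
\text{bytes}(3000) = 72\ \text{MB}, \quad
\text{bytes}(5000) = 200\ \text{MB}, \quad
\text{bytes}(10000) = 800\ \text{MB}
$$

So the input matrix alone is far smaller than the $64$ GB available on the master node.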

The sample code is below:

import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,SingularValueDecomposition,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def computeInverse(X: RowMatrix): DenseMatrix = {
  val nCoef = X.numCols.toInt
  val svd = X.computeSVD(nCoef, computeU = true)
  if (svd.s.size < nCoef) {
    sys.error(s"RowMatrix.computeInverse called on singular matrix.")
  }

  // Build the inverse diagonal matrix from the singular values in S
  val invS = DenseMatrix.diag(new DenseVector(svd.s.toArray.map(x => 1.0 / x)))

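  // U must be a local matrix here, so collect its rows onto the driver.
  // The rows are flattened in row-major order while DenseMatrix reads values
  // in column-major order, so swapping the dimensions yields transpose(U).
  val U = new DenseMatrix(svd.U.numCols().toInt, svd.U.numRows().toInt, svd.U.rows.collect.flatMap(x => x.toArray))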

  // If V could be made distributed, that might be better. However, it is already local, so this may be fine.
  val V = svd.V
  // inv(X) = V * inv(S) * transpose(U); the local U above already holds the transpose.
  (V.multiply(invS)).multiply(U)
}
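
The identity the function relies on can be written out as follows. This is a sketch that assumes $X$ is square and non-singular, so that every singular value $\sigma_i$ is non-zero, and that $U$ and $V$ have orthonormal columns:

$$
X = U \Sigma V^{T}
\quad\Longrightarrow\quad
X^{-1} = V \Sigma^{-1} U^{T},
\qquad
\Sigma^{-1} = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \dots, \frac{1}{\sigma_n}\right)
$$

In the code, `invS` plays the role of $\Sigma^{-1}$, `V` is $V$, and the local `U` holds $U^{T}$.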