机器算法验证 - Lawson 和 Hanson 如何解决无约束最小二乘问题？ - 吾爱随笔录

Lawson 和 Hanson 如何解决无约束最小二乘问题？

机器算法验证 r 回归最小二乘线性的

2022-03-30 05:54:44

对于非负最小二乘问题min(b - a %*% x)，x >= 0Lawson-Hanson fortran77 实现（fortran 代码，R 包）通常针对相同的无约束变量集提供与现代方法不同的解决方案。

现代基准方法可能包括multiway::fnnls、来自 NNLM 的顺序坐标下降 nnls和TNT-NN。请注意，有界变量最小二乘法 ( bvls::bvls ) 使用 Lawson/Hanson 源代码并且通常返回相同的结果。

技术背景： 非负最小二乘问题由两个独立的子问题组成：

确定哪些变量将被约束为零（活动集）
求解所有无约束变量的最小二乘解

这个问题涉及第二个子问题。但是，请注意，这些子问题在坐标下降中是模糊的，其操作b是为了最小化错误而不是主动集选择。但是，NNLM::nnlm无法复制 LH77 结果并使用梯度下降。

功能

library(nnls)
library(multiway)

# my own take on tnt-nn
fastnnls <- function(a, b){
  x <- solve(a, b)
  while(any(x < 0)){
    ind <- which(x > 0)
    x <- rep(0, length(b))
    x[ind] <- solve(a[ind, ind], b[ind])
  }
  as.vector(x)
}

可重现的设置

set.seed(123)
X <- matrix(rnorm(2000),100,20)
y <- X %*% runif(20) + rnorm(100)*5
a <- crossprod(X)
b <- as.vector(crossprod(X, y))

计算解和残差

x_fnnls <- as.vector(multiway::fnnls(a, b))
x_lh <- nnls::nnls(a, b)$x
x_fastnnls <- fastnnls(a, b)

all.equal(x_fnnls, x_fastnnls)
# [1] TRUE
all.equal(x_fnnls, x_lh)
# [1] "Mean relative difference: 0.08109434"
all.equal(x_fastnnls, x_lh)
# [1] "Mean relative difference: 0.08109434"

请注意，multiway::fnnls理论上应该给出与 Lawson-Hanson 实现相同的结果，但很少这样做。然而，它确实找到了相同的活动集，但解决方案不同——而且通常更糟。

残差

# using mean l2-norm

mean(abs(b - as.vector(a %*% x_fnnls)))
# [1] 6.515709
mean(abs(b - as.vector(a %*% x_fastnnls)))
# [1] 6.515709
mean(abs(b - as.vector(a %*% x_lh)))
# [1] 9.086841

在这个模拟随机示例的情况下，Lawson-Hanson 算法的误差更大，但我经常看到它明显低于相同活动集的其他方法。

那么，这是为什么呢？base::solveLH77 求解器与 R 、 Armadilloarma::solve等中使用的现代 Cholesky 对称正定求解器给出不同的结果如何？我试图阅读 Fortran 代码，不得不认输。

想法：

这是否归结为他们用来求解方程的方法？例如，他们是否使用 QR 而不是 Cholesky 分解 -> 前向 -> 后向替换，这有关系吗？
LH77 在找到最终的活动集后是否使用梯度优化，如果是，为什么坐标下降求解器NNLM::nnlm无法匹配 LH77 在良好确定系统中的性能？
似乎有时 LH77 活动集实际上也有所不同，这会导致结果略有不同（并且几乎总是更好）。因此，这种差异似乎完全是他们求解器的一个特征。
是否有一些特定于 NNLS 的方法来求解 Lawson/Hanson 发现所有后续研究都未能复制的最小二乘方程，因为我们喜欢开箱即用的求解器？
NNLS 的真正解决方案是否实际上无法通过坐标下降来解决，而我们能得到的最好的结果是近似值，而 LH77 以某种方式找到了真正的最小值？

我知道，这是一个很复杂的问题，但我赌我的名声在这附近某个地方的某个人的帽子里有一只兔子。

1个回答

TNT神经网络

对于 TNT-NN，请参见：

迈尔，乔 M.，等人。“TNT-NN：一种解决大型非负最小二乘问题的快速主动集方法。” Procedia 计算机科学 108 (2017): 755-764。

https://doi.org/10.1016/j.procs.2017.05.194

为了形成活动集，TNT-NN 首先解决了一个无约束的最小二乘问题。违反非负约束的变量被添加到活动集中。一旦找到可行的解决方案，其中没有违反任何非负约束，残差的 2 范数被用作适应度的度量，并且该解决方案被保存为当前的“最佳”解决方案。

TNT-NN 试图通过迭代地将一些变量从活动集中移回无约束集来修改活动集。活动集变量根据它们的梯度分量进行排序。通过将其中一些从活动集中移动到无约束集中来测试显示最大正梯度分量的变量。重要的是要注意，最初可以在单个测试中移动大量变量。如果新解决方案的适应度没有提高，则拒绝该解决方案并测试较小的变量集。如果可以从活动集中删除一组变量，并找到一个“更好”的新可行解，则保存该解，算法开始新的迭代。当活动集不能再被修改时，算法达到收敛。

您的实施只是此解决方案的一半。它是搜索新的可行集以测试从活动集中删除变量的部分。在下面的代码中进行了演示。我已经复制了你的fastnnls函数并将其变成feasible_set了算法的一部分。

在 Myre 等人的文章中谈到“通过将其中一些从活动集中移动到无约束集中。重要的是要注意，最初可以在单个测试中移动大量变量”。但是我不确定他们是如何在代码中这样做的，我一直在将它们一一添加。可能有一些额外的技巧可以更快地选择一次添加的大型组，而不是尝试所有变量的 for 循环。

`multiway::fnnls`和之间的区别`nnls::nnls`

由于比较中的一个小错误，您会有所不同。一个函数需要矩阵和向量，另一个函数需要矩阵和向量。您已将后者用于这两个功能。在下面的代码中，我给出了一个示例输出。 $X$ $Y$ $X^TX$ $X^Ty$

示例代码

### finding the feasible set
feasible_set <- function(a, b, ind){
  x <- rep(0, length(b))
  x[ind] <- solve(a[ind, ind], b[ind])
  while(any(x < 0)){
    ind <- which(x > 0)
    x <- rep(0, length(b))
    x[ind] <- solve(a[ind, ind], b[ind])
  }
  as.vector(x)
}

### finding the gradients
gradients <- function(b,y,X) {
  current_y <- X %*% b
  d_y <- y-current_y
  gradients <- t(X) %*% d_y
  return(gradients)
}

### The algorithm that repeatedly updates the active set
### The updates are done by removing the variable with the highest positive gradient
fastnnls <- function(y,X) {
  
  ### Initiation
  a <- crossprod(X)
  b <- as.vector(crossprod(X, y))
  current_active <-   rep(TRUE,length(X[1,])) ### start with all variables in active set
  current_s <- rep(0,length(X[1,])) ### initial conditions
  current_y <- X %*% current_s
  current_loss <- sum((y-current_y)^2)

  ### algorithm that stops untill no improvement can be made
  cont <- TRUE
  while (cont) {
    ### add variables based on gradients 
    ### in these four lines the gradients are found and ordered
    gradients <- gradients(current_s,y,X)
    testing <- which(gradients*current_active>0) ### find out which variables are active and have positive gradients
    ord <- order(gradients, decreasing = TRUE)
    ord <- ord[ord %in% testing] ### strip the negative or non-active variables
    
    ### keep adding variables in a loop while this improves the solution
    addition <- 0 ### itterative variable keeping track of the additions
    new_active <- current_active
    for (i in 1:length(ord)) {
      
      ### Try out a new active set with one variable removed
      new_active[ord[i]] <- FALSE  ### remove 'ord[i]' from active set
      new_s <- feasible_set(a,b, ind = which(new_active == FALSE))
      new_y <- X %*% new_s
      new_loss <- sum((y-new_y)^2) 
      
      ### Update the solution if the new trial is better
      if (new_loss < current_loss) {
        addition <- i
        current_active <- new_active
        current_loss <- new_loss
        current_s <- new_s
        current_y <- new_y
      } else {
        break ### skip loop to end
      }
    }
    
    if (addition == 0) { ### quit while when no addition is made
      cont = FALSE
    } else {
      new_active <- new_s == 0 ### in the for loop we had only been decreasing the active set
                               ### but the feasible_set function also increases the active it and we need to adapt accrdingly
    }
    if (sum(current_active) == 0) { ### quit if active set is empty (all variables positive) 
      cont = FALSE
    }
    
    ### The while loop continues by recomputing the gradients 
  }
  
  return(current_s)
}

set.seed(123)
X <- matrix(rnorm(2000),100,20)
y <- X %*% runif(20) + rnorm(100)*5

library(nnls)
library(multiway)


data.frame(multiway = multiway::fnnls(a, b),
           nnls = nnls::nnls(X, y)$x,
           manual = fastnnls(y,X))

输出

> data.frame(multiway = multiway::fnnls(a, b),
+            nnls = nnls::nnls(X, y)$x,
+            manual = fastnnls(y,X))
      multiway        nnls      manual
1  0.610802720 0.610802720 0.610802720
2  0.146121047 0.146121047 0.146121047
3  0.841809005 0.841809005 0.841809005
4  1.131040740 1.131040740 1.131040740
5  0.000000000 0.000000000 0.000000000
6  1.093652478 1.093652478 1.093652478
7  0.725590111 0.725590111 0.725590111
8  0.211525228 0.211525228 0.211525228
9  0.000000000 0.000000000 0.000000000
10 1.472333600 1.472333600 1.472333600
11 0.005740395 0.005740395 0.005740395
12 2.131277775 2.131277775 2.131277775
13 0.000000000 0.000000000 0.000000000
14 0.590923989 0.590923989 0.590923989
15 0.652530944 0.652530944 0.652530944
16 0.717713755 0.717713755 0.717713755
17 1.115162378 1.115162378 1.115162378
18 0.603304661 0.603304661 0.603304661
19 0.000000000 0.000000000 0.000000000
20 0.218073317 0.218073317 0.218073317

其它你可能感兴趣的问题

上一篇在对多个预测变量和结果变量之间的关系进行建模时，如何判断哪个变量更有意义？下一篇如何在回归中处理不同的重要变量，包括重叠信息？

Lawson 和 Hanson 如何解决无约束最小二乘问题？

TNT神经网络

multiway::fnnls和之间的区别nnls::nnls

示例代码

`multiway::fnnls`和之间的区别`nnls::nnls`