Using clustering and Lasso with CV

Tags: data mining, machine learning, clustering, predictive modeling

2022-03-03 20:07:59

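I ran a clustering algorithm on my dataset. Now, when I use LASSO with cross-validation to predict the response, one of the variables it considers is the cluster that a new point is assigned to. (I included the cluster variable as a predictor to see whether belonging to a particular group affects the response.) Since the information in all the variables has already been captured by the cluster variable, does using it again in the Lasso model alongside some of the other variables make it redundant or biased?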

2 Answers

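"Since the information in all the variables has already been captured by the cluster variable, does using it again in the Lasso model alongside some of the other variables make it redundant or biased?"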

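It does not become biased, at least not in the statistical sense. Depending on the clustering mechanism, the information may be redundant.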

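If you fit a regression with RMSE as the loss function and the clustering algorithm also uses squared error as its loss metric, then the information will be redundant.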
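One way to see what "redundant" means here is to check the rank of the design matrix. The sketch below is only an illustration under stated assumptions: the toy data and the `lookup` table mapping letters to clusters are hypothetical, standing in for any clustering whose label is fully determined by a categorical feature. In that case the cluster indicator columns are exact linear combinations of the letter indicator columns, so appending them adds no rank.

```r
# Hypothetical toy data: one categorical feature with six levels
set.seed(1)
letter <- factor(sample(letters[1:6], 200, replace = TRUE), levels = letters[1:6])
x <- model.matrix(~ letter - 1)    # one-hot matrix, one column per letter

# Hypothetical cluster assignment that depends only on the letter
lookup <- c(a = 1, b = 1, c = 2, d = 2, e = 3, f = 3)
cluster <- factor(lookup[as.character(letter)])
cx <- model.matrix(~ cluster - 1)  # one-hot matrix, one column per cluster

# Each cluster column is the sum of the letter columns mapped to that cluster,
# so adding the cluster columns does not increase the rank of the design matrix
rank_x <- qr(x)$rank
rank_total <- qr(cbind(x, cx))$rank
print(c(rank_x = rank_x, rank_total = rank_total))
```

When the two ranks agree, the cluster columns carry no linear information beyond the original features. If the clustering were built from a nonlinear function of continuous features, the ranks could differ.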

In any case, there is no harm in adding it to the model and testing. LASSO should be able to work out that the information is redundant.

Here is a simulation in R.

library(glmnet)
library(Matrix)

n <- 1e5
nclusters <- 5
set.seed(420)
# Root mean squared error
rmse <- function(y, yhat) {
  sqrt(mean((y - yhat)^2))
}
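# A single categorical feature: one random letter per observation
letter_df <- data.frame(letter = factor(sample(letters, n, replace = TRUE)))
# One-hot encode the letters as a sparse n x 26 design matrix
xs <- sparse.model.matrix(~ . - 1, data = letter_df)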

print(head(xs))  

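# Now let's run k-means on the one-hot encoded categories
out <- kmeans(as.matrix(xs), centers = nclusters)
# True coefficients, one per letter
bs <- runif(ncol(xs))

# Cluster assigned to each observation
clusterpred <- data.frame(cluster = out$cluster)
# Simulated response: linear signal plus standard normal noise
ys <- as.numeric(xs %*% bs) + rnorm(n)
print(table(clusterpred))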

# Now let's use a clustered data set to predict some outcome
cxs <- sparse.model.matrix(~.-1, data = data.frame(cluster = factor(clusterpred[,1])))
# Concatenating the original features and the assigned clusters
totalxs <- cbind2(xs, cxs)

head(cxs)

# Setting alpha = 1 implies LASSO for the GLMNET Package
model <- cv.glmnet(y=ys, x=xs, alpha=1)
cmodel <- cv.glmnet(y=ys, x=cxs, alpha=1)

# Running on the clusters and the original features
totalmodel <- cv.glmnet(y=ys, x=totalxs, alpha=1)

# Predictions (predict() on a cv.glmnet fit uses s = "lambda.1se" by default)
yhat <- predict(model, newx = xs)
yhatc <- predict(cmodel, newx = cxs)
yhatt <- predict(totalmodel, newx = totalxs)

# Comparing the RMSEs of the three models
print(rmse(ys, yhat))
print(rmse(ys, yhatc))
print(rmse(ys, yhatt))

# It seems to select most of the features and *one* cluster
print(coef(totalmodel, s='lambda.min'))

In this simulation, the model that includes the clusters performs slightly worse, but the results are quite similar.

I think that by doing this you are expanding the feature space, and I don't think the extra variable is redundant.

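The categorical cluster variable is derived from the other variables. When you apply Lasso to the dataset, the categorical variable is one-hot encoded, and for a given amount of regularization Lasso selects a subset of its levels. Suppose that, at the best-performing regularization parameter, Lasso picks clusters 1, 3 and 7 out of clusters 1-10. Running Lasso on the other variables alone, without the cluster variable, might not pick up these cluster pockets.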
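The scenario described above can be sketched in R. Everything in this example is an assumption made for illustration: the cluster labels are drawn at random rather than produced by a clustering algorithm, and the response is simulated so that only clusters 1, 3 and 7 shift it. The sketch only shows how the one-hot encoded levels enter `cv.glmnet` and how to list the columns that end up with nonzero coefficients.

```r
library(glmnet)

# Hypothetical data: two numeric features and a 10-level cluster label
set.seed(1)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
cluster <- factor(sample(1:10, n, replace = TRUE), levels = 1:10)

# One-hot encode the cluster label: one indicator column per cluster
cluster_dummies <- model.matrix(~ cluster - 1)
x <- cbind(x1 = x1, x2 = x2, cluster_dummies)

# Assumed data-generating process: only clusters 1, 3 and 7 shift the response
y <- x1 + 2 * (cluster == "1") - 2 * (cluster == "3") +
  1.5 * (cluster == "7") + rnorm(n)

# Cross-validated Lasso (alpha = 1), then list the columns it keeps
fit <- cv.glmnet(x = x, y = y, alpha = 1)
beta <- as.matrix(coef(fit, s = "lambda.1se"))
selected <- rownames(beta)[beta[, 1] != 0]
print(selected)
```

Which indicator columns survive depends on the data and on the chosen value of `s`, so the printed list should be read as the outcome of one simulated run and not as a general result.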
