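I am running a simple (first-order terms only) linear regression on data with several categorical variables (i.e., factors). For each factor, I would generally like one level to add nothing to the regression sum and every other level to add a positive amount. When I actually fit the regression, though, I often get a lot of negative coefficients.
Is there a non-manual way to choose which levels of each factor are used as regressors (that is, which level is left out as the baseline) so as to maximize the number of positive coefficients in the equation? In other words, how can I get R to do this (somewhat tedious) task for me?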
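Thanks to whuber's comments and Seb's answer, I put together the following function, which I believe does what I need. Hopefully it is useful to someone. Comments are welcome.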
# Take a data frame and re-level its factors so that lm() assigns
# non-negative coefficients to the non-reference levels.
# NOTE: this currently only works for model forms that don't include
# interaction terms, and it expects factors to appear in the formula under
# their bare column names (f, not factor(f)).
auto_relevel <- function(df, model_form)
{
    # indices of the categorical (factor) columns in df
    catvar_indices <- which(sapply(df, is.factor))
    df_colnames <- names(df)
    # loop over the categorical variables
    for (i in catvar_indices) {
        catvar_name <- df_colnames[i]
        all_levels <- levels(df[[i]])
        # Fit with the intercept kept, so that each level's coefficient is
        # measured relative to the current reference level of its factor
        temp_model <- lm(as.formula(model_form), data = df)
        # lm() names a level's coefficient <factor name><level name>; matching
        # the full name avoids hitting longer names that share a prefix
        coef_names <- paste0(catvar_name, all_levels)
        coeffs <- coef(temp_model)[coef_names]
        # remove NAs (reference level, aliased levels, factor not in the model)
        coeffs2 <- coeffs[!is.na(coeffs)]
        # If at least one of the levels' coefficients is less than zero, then
        # choose the one with the minimum coefficient as the new base level
        if (any(coeffs2 < 0)) {
            chosen_level_name <- names(coeffs2)[which.min(coeffs2)]
            # find out where this level is in all_levels
            min_level_index <- match(chosen_level_name, coef_names)
            df[[i]] <- relevel(df[[i]], ref = min_level_index)
        }
    }
    return(df)
}
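For what it's worth, in a purely additive model the same result can probably be had from a single fit: each level's coefficient is its offset from the current reference level, so the level with the smallest offset (counting the reference itself as 0) is the one to promote. Below is a rough sketch of that idea, not a tested replacement for the function above. The name `relevel_to_min` and the toy data are made up for illustration, and the sketch assumes default treatment contrasts, a formula with an intercept and no interactions, and factors referred to by their bare column names.

```r
# Hypothetical helper: make the level with the smallest fitted effect the
# reference level of every unordered factor used in an additive lm() formula.
relevel_to_min <- function(df, model_form) {
  fit <- lm(as.formula(model_form), data = df)
  cf <- coef(fit)
  # fit$xlevels lists, for each factor term in the model, the levels used
  for (term in names(fit$xlevels)) {
    # skip terms that are not plain unordered factor columns of df
    if (!is.factor(df[[term]]) || is.ordered(df[[term]])) next
    lv <- fit$xlevels[[term]]
    eff <- cf[paste0(term, lv)]  # offsets relative to the current reference
    eff[1] <- 0                  # the reference level itself contributes 0
    eff[is.na(eff)] <- Inf       # never promote an aliased level
    df[[term]] <- relevel(df[[term]], ref = lv[which.min(eff)])
  }
  df
}

# Toy data (invented): level "b" of f and level "v" of g have the smallest effects
set.seed(1)
n <- 300
toy <- data.frame(
  f = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  g = factor(sample(c("u", "v"), n, replace = TRUE)),
  x = rnorm(n)
)
f_effect <- c(a = 3, b = 0, c = 1)[as.character(toy$f)]
g_effect <- c(u = 2, v = 0)[as.character(toy$g)]
toy$y <- unname(1 + 2 * toy$x + f_effect + g_effect + rnorm(n, sd = 0.1))

toy2 <- relevel_to_min(toy, "y ~ x + f + g")
fit2 <- lm(y ~ x + f + g, data = toy2)
coef(fit2)
```

Because the offsets within one factor do not depend on the reference level chosen for another factor (as long as there are no interactions), one pass over the factors should be enough here.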
Here is an attempt at doing what you want.
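# Set up some sample data
library(dummies)
df <- data.frame(categorial = rep(c(1, 2, 3), each = 20), x = rnorm(60))
flevels <- dummy(df$categorial)  # 60 x 3 indicator matrix, one column per level
df$categorial <- factor(df$categorial)
# the true level effects are 3, 1 and 2, so level 2 has the smallest one
df$y <- as.vector(20 + df$x * 3 + flevels %*% c(3, 1, 2) + rnorm(60) * 2)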
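In case the dummies package is not installed, the same kind of indicator matrix can be built with base R's `model.matrix()`. This is only a sketch of an equivalent setup: the names `sim` and `flevels_base` and the seed are my own choices, not part of the example above.

```r
set.seed(42)  # arbitrary seed, only so that the sketch is repeatable
sim <- data.frame(categorial = rep(c(1, 2, 3), each = 20), x = rnorm(60))
sim$categorial <- factor(sim$categorial)
# Dropping the intercept (- 1) gives one 0/1 column per factor level
flevels_base <- model.matrix(~ categorial - 1, data = sim)
# Same data-generating recipe as above: level effects 3, 1 and 2
sim$y <- as.vector(20 + sim$x * 3 + flevels_base %*% c(3, 1, 2) + rnorm(60) * 2)
head(flevels_base)
```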
I use a regression to find the factor level with the smallest coefficient and then re-level:
# Start by finding the minimum category. Note, however, that this does not work in every context!
summary(helpreg <- lm(y ~ x + factor(categorial) - 1, data = df))
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
x                     2.9944     0.2334   12.83   <2e-16 ***
factor(categorial)1  22.9640     0.4472   51.35   <2e-16 ***
factor(categorial)2  21.0720     0.4390   48.00   <2e-16 ***
factor(categorial)3  22.1300     0.4364   50.71   <2e-16 ***
Then I pick out the minimum:
factors <- grep('categorial', names(coef(helpreg)))  # replace 'categorial' with your variable name
minimumf <- which.min(coef(helpreg)[factors])  # position of the smallest level coefficient
and re-level:
df$categorial <- relevel(df$categorial, ref = minimumf)
In my case it works; maybe it will work for you too...
summary(lm(y~x+factor(categorial), data=df))
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          21.0720     0.4390  48.003  < 2e-16 ***
x                     2.9944     0.2334  12.828  < 2e-16 ***
factor(categorial)1   1.8920     0.6341   2.984  0.00421 **
factor(categorial)3   1.0580     0.6193   1.708  0.09310 .
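The reason this works can be read straight off the two summaries: once level 2 is the reference, the intercept takes over its zero-intercept coefficient and every other level's coefficient becomes its difference from that value. A small sketch of the arithmetic, using the numbers quoted above (the variable names are mine):

```r
# Level coefficients from the zero-intercept fit quoted above
level_coefs <- c("1" = 22.9640, "2" = 21.0720, "3" = 22.1300)
# The level with the smallest coefficient becomes the reference
baseline <- names(which.min(level_coefs))
# After re-leveling, each coefficient is the offset from that reference
offsets <- level_coefs - level_coefs[baseline]
baseline
round(offsets, 4)
```

The offsets for levels 1 and 3 agree with the 1.8920 and 1.0580 in the second summary, and since the reference has the smallest coefficient, none of the offsets can be negative.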
Comments are of course very welcome!