How can I automatically choose factor levels in R to maximize the number of positive coefficients in a regression model?

Cross Validated  r  regression  categorical-data
2022-04-18 13:04:18

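I am running a simple linear regression (first-order terms only) on data with several categorical variables (i.e. factors). For each factor, I would usually like one level to contribute nothing to the regression sum and the other levels to contribute positive values. However, when I run the regression I often get a lot of negative coefficients.

Is there a non-manual way to choose which levels of the factors should be used as regressors, so as to maximize the number of positive coefficients in the equation? In other words, how can I get R to do this (somewhat tedious) task for me?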

2 Answers

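Thanks to whuber's comments and Seb's answer, I put together the following function, which I believe does what I need. I hope it is useful to someone. Comments are welcome.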

# Take a data frame and re-level its factors so that, for each factor, the level
# with the smallest estimated coefficient becomes the base level; lm() then
# assigns non-negative coefficients to the remaining levels.
# NOTE: this currently only works for model forms that don't include
#       interaction terms. model_form must be a character string such as
#       "y ~ x + f" that refers to the factors by their column names.

# get_catvar_indices() was not shown in the original post; this one-liner is an
# assumed equivalent that returns the positions of the factor columns.
get_catvar_indices <- function(df) which(vapply(df, is.factor, logical(1)))

auto_relevel <- function(df, model_form)
{
    # get list of categorical variables in df
    catvar_indices <- get_catvar_indices(df)

    # loop over categorical variables
    df_colnames <- names(df)
    model_form_zeroicept <- as.formula(paste(model_form, "- 1"))
    for (i in catvar_indices) {
        catvar_name <- df_colnames[i]
        all_levels <- levels(df[[i]])
        # refit on every pass, because earlier passes may have re-levelled df
        temp_model <- lm(model_form_zeroicept, data=df)
        cf <- coef(temp_model)

        # lm() names each coefficient <column name><level name>; match exactly
        # so that longer variable or level names are not picked up by accident
        coef_names <- paste0(catvar_name, all_levels)
        # a level without a coefficient is the current base level (effect 0);
        # aliased levels stay NA and are skipped by which.min()
        coeffs <- ifelse(coef_names %in% names(cf), cf[coef_names], 0)

        # choose the level w/min coeff to be the new base level
        min_level_index <- which.min(coeffs)
        if (length(min_level_index) == 1 && min_level_index != 1) {
            df[[i]] <- relevel(df[[i]], ref=all_levels[min_level_index])
        }
    }

    return(df)
}
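
The idea behind the function above can be tried out on simulated data. What follows is only a minimal sketch under stated assumptions: `relevel_to_min()` and the columns `f1`, `f2`, `x` and `y` are hypothetical names made up for the illustration, and this variant keeps the intercept, so the base level of every factor is simply treated as having coefficient 0.

```r
# Sketch only: relevel_to_min() and the columns f1, f2, x, y are hypothetical
# names used for illustration; they are not part of the answer above.
relevel_to_min <- function(df, model_form) {
    factor_cols <- names(df)[vapply(df, is.factor, logical(1))]
    for (col in factor_cols) {
        # refit each time, since earlier iterations may have changed df
        cf <- coef(lm(model_form, data = df))
        lv <- levels(df[[col]])
        # coefficient of each level relative to the current base level;
        # the base level has no coefficient of its own, so it counts as 0
        est <- unname(cf[paste0(col, lv)])
        est[1] <- 0
        best <- which.min(est)  # NA entries (aliased levels) are skipped
        if (best != 1) df[[col]] <- relevel(df[[col]], ref = lv[best])
    }
    df
}

set.seed(1)
n <- 200
sim <- data.frame(
    f1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    f2 = factor(sample(c("u", "v"), n, replace = TRUE)),
    x  = rnorm(n)
)
# simulated level effects: a = 3, b = 1, c = 2 and u = 5, v = 0
eff1 <- c(a = 3, b = 1, c = 2)
eff2 <- c(u = 5, v = 0)
sim$y <- 10 + 2 * sim$x + unname(eff1[as.character(sim$f1)]) +
    unname(eff2[as.character(sim$f2)]) + rnorm(n)

fit <- lm(y ~ x + f1 + f2, data = relevel_to_min(sim, y ~ x + f1 + f2))
level_coefs <- coef(fit)[grep("^f[12]", names(coef(fit)))]
print(level_coefs)
```

In an additive model the differences between the levels of one factor do not depend on the base levels chosen for the other factors, which is why a single pass over the factors is enough here. With interaction terms that no longer holds.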

Here is an attempt at doing what you want.

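 # Setting up some sample data
 df <- data.frame(categorial=rep(c(1,2,3), each=20), x=rnorm(60))
 df$categorial <- factor(df$categorial)
 # indicator (dummy) matrix with one column per level; base R's model.matrix()
 # is used here so that no extra package is needed
 flevels <- model.matrix(~ categorial - 1, data=df)

 # drop() turns the one-column matrix product into a plain numeric vector
 df$y <- 20 + df$x*3 + drop(flevels %*% c(3,1,2)) + rnorm(60)*2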

I use a regression without an intercept to find the factor level with the smallest coefficient, and then re-level:

 # Now we try to find the minimum category. Note, however, that this does not work in every context!
 summary(helpreg <- lm(y~x+factor(categorial) - 1, data=df))

     Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
     x                     2.9944     0.2334   12.83   <2e-16 ***
     factor(categorial)1  22.9640     0.4472   51.35   <2e-16 ***
     factor(categorial)2  21.0720     0.4390   48.00   <2e-16 ***
     factor(categorial)3  22.1300     0.4364   50.71   <2e-16 ***
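
Dropping the intercept is what makes the levels directly comparable in this output: without it, R gives the factor one indicator column per level instead of folding the first level into the intercept. A small sketch of the difference, using a made-up three-level factor (`toy` and `g` are illustrative names only):

```r
# toy and g are illustrative names, not part of the answer above
toy <- data.frame(g = factor(c("a", "b", "c", "a", "b", "c")))

# with an intercept, the first level is absorbed into the intercept
with_icept <- colnames(model.matrix(~ g, data = toy))
# without an intercept, every level gets its own column
no_icept <- colnames(model.matrix(~ g - 1, data = toy))

print(with_icept)  # "(Intercept)" "gb" "gc"
print(no_icept)    # "ga" "gb" "gc"
```

With several factors in the formula, only the first one is expanded this way. The others are still coded relative to a base level, which is one reason the trick does not carry over to every context unchanged.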

Then I pick out the minimum:

 factors <- grep('categorial', names(coef(helpreg))) # --- replace categorial with your variable name

 minimumf <- which(coef(helpreg)[factors]==min(coef(helpreg)[factors]))

Then re-level:

 df$categorial <- relevel(df$categorial, ref=minimumf)
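
The position returned by which() lines up with the factor's levels only because the grep() pattern matches exactly this factor's coefficients, in level order. A slightly more defensive sketch, assuming the coefficient names have the form `factor(categorial)<level>`, looks up the level name instead and passes that to relevel(). The data is simulated purely for illustration, and `d`, `fit0`, `fit1`, `prefix` and `min_level` are names introduced here.

```r
# Simulated data for illustration only; the variable names mirror the answer.
set.seed(42)
d <- data.frame(categorial = factor(rep(c(1, 2, 3), each = 20)), x = rnorm(60))
d$y <- 20 + 3 * d$x + c(3, 1, 2)[as.integer(d$categorial)] + rnorm(60) * 2

fit0 <- lm(y ~ x + factor(categorial) - 1, data = d)
cf <- coef(fit0)

# keep only the coefficients whose names start with the literal prefix
prefix <- "factor(categorial)"
level_cf <- cf[startsWith(names(cf), prefix)]

# name of the level with the smallest coefficient, with the prefix stripped
min_level <- substring(names(level_cf)[which.min(level_cf)], nchar(prefix) + 1)

# relevel() accepts a level name as well as a position
d$categorial <- relevel(d$categorial, ref = min_level)
fit1 <- lm(y ~ x + factor(categorial), data = d)
print(coef(fit1))
```

Matching on a literal prefix avoids the case where the grep() pattern also hits another variable whose name contains the same string.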

In my case it works, and it may work for you as well:

 summary(lm(y~x+factor(categorial), data=df))

                           Estimate Std. Error t value Pr(>|t|)    
       (Intercept)          21.0720     0.4390  48.003  < 2e-16 ***
       x                     2.9944     0.2334  12.828  < 2e-16 ***
       factor(categorial)1   1.8920     0.6341   2.984  0.00421 ** 
       factor(categorial)3   1.0580     0.6193   1.708  0.09310 . 

Comments are of course very much appreciated!