The original df looks like this:
     title salary
1 engineer  51000
2  manager  33700
3    sales  26800
4 engineer  53700
5    sales  36800
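With dummy.data.frame(df) from the dummies package, the character column "title" is recoded as "one-hot" columns (also known as dummy variables): one column per distinct title, equal to 1 if the row has that title and 0 otherwise.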
  titleengineer titlemanager titlesales salary
1             1            0          0  51000
2             0            1          0  33700
3             0            0          1  26800
4             1            0          0  53700
5             0            0          1  36800
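To make the recoding rule concrete, here is a minimal base-R sketch that builds the same three columns by hand. It is only an illustration of the idea, not how dummy.data.frame() is implemented; the names lvls, onehot and manual are made up for this example.

```r
# Same toy data as in this answer
df <- data.frame(
  title  = c("engineer", "manager", "sales", "engineer", "sales"),
  salary = c(51000, 33700, 26800, 53700, 36800)
)

# One 0/1 column per distinct title: 1 where the row matches, 0 elsewhere
lvls <- sort(unique(df$title))
onehot <- sapply(lvls, function(l) as.integer(df$title == l))
colnames(onehot) <- paste0("title", lvls)

# Put the indicator columns next to the untouched numeric column
manual <- data.frame(onehot, salary = df$salary)
manual
```

Every row has exactly one 1 among the three title columns, which is where the name "one-hot" comes from.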
An alternative encoding is to keep "title" as a factor with three levels:
df$title = as.factor(df$title)
However, in many cases ML algorithms cope better with the one-hot form, and some only accept numeric input.
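For comparison, a short sketch of what the factor encoding holds internally. It assumes R's default treatment contrasts (the out-of-the-box setting); with other contrast options the last call would print a different matrix.

```r
title <- factor(c("engineer", "manager", "sales", "engineer", "sales"))

levels(title)      # the three levels, in alphabetical order
as.integer(title)  # the integer codes stored behind the labels
contrasts(title)   # the 0/1 columns a model function derives from the factor
```

The contrasts matrix has no column for the first level (engineer). That level acts as the baseline, which is why the regressions further down report only titlemanager and titlesales.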
A simple example:
# Build the example data frame
title <- c('engineer','manager','sales','engineer','sales')
salary <- c(51000, 33700, 26800, 53700, 36800)
df = data.frame(title, salary)
df
# One-hot encode the "title" column (requires the dummies package)
library(dummies)
df2 = dummy.data.frame(df)
df2
OLS model with the factor:
df$title = as.factor(df$title)
ols = lm(salary~., data=df)
summary(ols)
Output:
Call:
lm(formula = salary ~ ., data = df)

Residuals:
         1          2          3          4          5
-1.350e+03  5.684e-14 -5.000e+03  1.350e+03  5.000e+03

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     52350       3662  14.295  0.00486 **
titlemanager   -18650       6343  -2.940  0.09882 .
titlesales     -20550       5179  -3.968  0.05804 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5179 on 2 degrees of freedom
Multiple R-squared:  0.8992,  Adjusted R-squared:  0.7983
F-statistic: 8.918 on 2 and 2 DF,  p-value: 0.1008
OLS model with the dummies (titleengineer is left out, so engineer is the baseline):
ols2 = lm(salary~titlemanager+titlesales, data=df2)
summary(ols2)
Output:
Call:
lm(formula = salary ~ titlemanager + titlesales, data = df2)

Residuals:
         1          2          3          4          5
-1.350e+03  5.684e-14 -5.000e+03  1.350e+03  5.000e+03

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     52350       3662  14.295  0.00486 **
titlemanager   -18650       6343  -2.940  0.09882 .
titlesales     -20550       5179  -3.968  0.05804 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5179 on 2 degrees of freedom
Multiple R-squared:  0.8992,  Adjusted R-squared:  0.7983
F-statistic: 8.918 on 2 and 2 DF,  p-value: 0.1008
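One way to read the coefficients in both outputs: the intercept is the mean salary of the baseline title (engineer), and each remaining coefficient is the difference between that title's mean and the baseline. The sketch below checks this by hand on the same data; the variable name means is made up for the illustration.

```r
df <- data.frame(
  title  = c("engineer", "manager", "sales", "engineer", "sales"),
  salary = c(51000, 33700, 26800, 53700, 36800)
)

# Mean salary per title
means <- tapply(df$salary, df$title, mean)
means

# Intercept = mean of the baseline level (engineer);
# the other coefficients = difference from that baseline
means["manager"] - means["engineer"]  # matches titlemanager: -18650
means["sales"] - means["engineer"]    # matches titlesales:   -20550
```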
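Summary:
The results are identical. Dummies are just a different representation of an R factor. Sometimes you need to pass the columns explicitly as dummies; in that case model.matrix() is usually helpful (see its documentation).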
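A minimal sketch of the model.matrix() route on the same toy data, using base R only. The names X_full and X_lm are made up for the illustration.

```r
df <- data.frame(
  title  = c("engineer", "manager", "sales", "engineer", "sales"),
  salary = c(51000, 33700, 26800, 53700, 36800)
)

# Full one-hot matrix: "- 1" drops the intercept, so every level keeps its own column
X_full <- model.matrix(~ title - 1, data = df)
X_full

# Design matrix as lm() would build it: intercept plus dummies
# for every level except the baseline (engineer)
X_lm <- model.matrix(salary ~ title, data = df)
X_lm
```

X_full is the form to hand to functions that expect a plain numeric matrix, while X_lm shows the columns behind the two regressions above.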