Why does fixed-effects OLS need unique time elements?

Cross Validated: r, econometrics, fixed-effects-model, plm

2022-03-19 17:03:10

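The plm() function from the plm library in R is giving me grief over duplicate time-id pairs, even though I am running a model that I don't think should need a time variable at all (see the reproducible example below).

I can think of three possibilities:

1. My understanding of fixed-effects regression is wrong, and they really do require unique time indices (or time indices at all!).
2. plm() is being too picky here and should relax this requirement.
3. The particular estimation technique that plm() uses (the within transformation) requires time indices, even though their order doesn't seem to matter, and the less computationally efficient version (including dummy variables in a straight OLS model) doesn't need them.

Any thoughts?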

set.seed(1)
n <- 1000
test <- data.frame( grp = as.factor(rep( letters, (n/length(letters))+1 ))[seq(n)], x = runif(n), z = runif(n) )
test$y <- with( test, 2*x + 3*z + rnorm(n) )
lm( y ~ x + z, data = test )
lm( y ~ x + z + grp, data = test )

require(plm)
# Model fails if I don't specify a time index, despite effect = "individual"
plm( y ~ x + z, data = test, model = "within", effect="individual", index = "grp" ) 
# Create time variable and add it to the index but still specify individual FE not time FE also
library(plyr)
test <- ddply( test, .(grp), function(dat) transform( dat, t = seq(nrow(dat)) ) )
# Now plm() works; note coefficients clearly include the fixed effects, as they match the lm() version above
plm( y ~ x + z, data = test, model = "within", effect="individual", index = c("grp","t") ) 
# Scramble time variables and show they don't matter as long as they're unique within a cluster
test <- ddply( test, .(grp), function(dat) transform( dat, t = sample(t) ) )
plm( y ~ x + z, data = test, model = "within", effect="individual", index = c("grp","t") ) 
# Add a duplicate time entry and show that it causes plm() to fail
test[ 2, "t" ] <- test[ 1, "t" ] 
plm( y ~ x + z, data = test, model = "within", effect="individual", index = c("grp","t") )

Why this matters

I am trying to bootstrap my model, and the requirement that id-time pairs be unique causes headaches that seem unnecessary if (2) is true.

2 Answers

Your understanding of fixed-effects regression seems to be fine. When you apply the within transformation

$y_{it} - \overline{y}_{i} = (X_{it} - \overline{X}_{i})\beta + \epsilon_{it} - \overline{\epsilon}_{i}$

to obtain the fixed-effects estimates, the sort order of time does not matter, because

$\overline{y}_{i} = \frac{1}{T}\sum^{T}_{t=1}y_{it}$, $\overline{X}_{i} = \frac{1}{T}\sum^{T}_{t=1}X_{it}$ and $\overline{\epsilon}_{i} = \frac{1}{T}\sum^{T}_{t=1}\epsilon_{it}$

sum over the time component no matter what the sort order is within each individual (or firm/country/whatever your $i$ subscript is).
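The point about the group means can be checked numerically. The following is a minimal base-R sketch, not plm's internal code: it simulates data like the question's, demeans by group with ave(), and compares the result with the dummy-variable regression and with a row-shuffled copy. The names d, demean, within_fit, dummy_fit and shuffled_fit are assumptions made for this illustration.

```r
# Illustrative data in the spirit of the question (names are assumptions).
set.seed(1)
n <- 1000
d <- data.frame(grp = factor(rep(letters, length.out = n)),
                x = runif(n), z = runif(n))
d$y <- 2 * d$x + 3 * d$z + rnorm(n)

# Within transformation: subtract each group's mean. No time index is involved.
demean <- function(v, g) v - ave(v, g)

d$y_w <- demean(d$y, d$grp)
d$x_w <- demean(d$x, d$grp)
d$z_w <- demean(d$z, d$grp)
within_fit <- lm(y_w ~ 0 + x_w + z_w, data = d)

# Least-squares dummy-variable version, as in the question.
dummy_fit <- lm(y ~ x + z + grp, data = d)

# Shuffle the rows and redo the demeaning: group means do not depend on order.
s <- d[sample(nrow(d)), c("grp", "x", "z", "y")]
s$y_w <- demean(s$y, s$grp)
s$x_w <- demean(s$x, s$grp)
s$z_w <- demean(s$z, s$grp)
shuffled_fit <- lm(y_w ~ 0 + x_w + z_w, data = s)

# Slope estimates side by side.
cbind(within   = unname(coef(within_fit)),
      dummies  = unname(coef(dummy_fit)[c("x", "z")]),
      shuffled = unname(coef(shuffled_fit)))
```

The three columns should agree up to floating-point error, since only group membership enters the demeaning step.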

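I'm not an R person, but in Stata you run into the same problem with duplicate time values in the time variable. Again, this doesn't matter for the fixed-effects estimates; in fact, you don't even need to specify a time variable.

For example,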

webuse nlswork
xtset idcode
xtreg ln_wage age hours, fe

will give you the same estimates as

xtset idcode year
xtreg ln_wage age hours, fe

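However, the sort order of the time values sometimes matters for inference. If you use the xtserial command after the fixed-effects regression above, Stata will tell you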

xtserial age
time variable not set, use -tsset varname ...

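if you haven't run xtset idcode year beforehand. For this kind of test it can be a problem if you have 2 observations on an individual in a given year but don't know whether one observation comes before or after the other (e.g. if a month or quarter variable is missing).

I'm sure this isn't your case, but sometimes people specify the time variable as yearly when they actually have monthly data. If they want to run the regression on yearly data, they first need to aggregate the data to the yearly level. Otherwise, to solve the duplicate time values problem, they need to generate a new time variable for the year-month combinations. The within estimator itself does not need a time component to be specified.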
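As a small illustration of the year-month idea in R (the question's language), the sketch below builds a hypothetical monthly panel and shows that year alone produces duplicate id-time pairs while a year-month counter does not. The data frame panel and the column ym are made up for this example.

```r
# Hypothetical monthly panel: two individuals, two years, twelve months each.
panel <- expand.grid(month = 1:12, year = 2000:2001, id = c("a", "b"))

# With year as the time variable, each (id, year) pair occurs twelve times.
has_dup_year <- anyDuplicated(panel[c("id", "year")]) > 0

# A year-month counter is unique within each individual.
panel$ym <- panel$year * 12L + panel$month
has_dup_ym <- anyDuplicated(panel[c("id", "ym")]) > 0

c(year = has_dup_year, year_month = has_dup_ym)
```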

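Indeed, plm will not let you run an FE model when there are lower-level units (i.e. you want household effects instead of individual ones, country instead of state, etc.). There is nothing wrong with doing what you want to do.

In this case, the trick is simply to make the time variable unique by crossing it with the sub-level unit: create a time-individual variable if you work at the household level, or a state-year variable if you work at the region level. See a similar post: https://stackoverflow.com/questions/43510067/fixed-effects-plm-package-r-multiple-observations-per-year-id/43573731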

library(plm)
#> Loading required package: Formula
data("Produc", package = "plm")


Produc$year_state <- paste(Produc$year, Produc$state, sep="_")

## will throw warning
Produc_plm <- pdata.frame(Produc, index = c("region", "year"))
#> Warning in pdata.frame(Produc, index = c("region", "year")): duplicate couples (id-time) in resulting pdata.frame
#>  to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")

## will throw error:
reg_plm_1 <- plm(gsp ~ pcap, data = Produc_plm)
#> Warning: non-unique values when setting 'row.names': '1-1970', '1-1971',
#> '1-1972', '1-1973', '1-1974', '1-1975', '1-1976', '1-1977', '1-1978',
#> '1-1979', '1-1980', '1-1981', '1-1982', '1-1983', '1-1984', '1-1985',

#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed

Use the trick instead:

Produc_plm2 <- pdata.frame(Produc, index = c("region", "year_state"))
reg_plm_2 <- plm(gsp ~ pcap, data = Produc_plm2)

Let's check against the lfe package that this is correct:

library(lfe)
#> Loading required package: Matrix
#> 
#> Attaching package: 'lfe'
#> The following object is masked from 'package:plm':
#> 
#>     sargan
library(broom)
reg_lfe_1 <- felm(gsp ~ pcap|region, data = Produc)
all.equal(as.data.frame(tidy(reg_plm_2)), 
          as.data.frame(tidy(reg_lfe_1)))
#> [1] TRUE
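The same idea can be carried over to the bootstrap mentioned in the question: after resampling, rebuild a within-group counter so that every id-time pair is unique again. The following is a rough base-R sketch under stated assumptions. The helper resample_with_index is hypothetical, and resampling individual rows is only one possible bootstrap scheme; whether it suits a given panel is a separate question.

```r
# Data as in the question's example (simulated).
set.seed(1)
n <- 1000
test <- data.frame(grp = factor(rep(letters, length.out = n)),
                   x = runif(n), z = runif(n))
test$y <- 2 * test$x + 3 * test$z + rnorm(n)

# Hypothetical helper: draw one bootstrap sample of rows, then rebuild a
# within-group counter t so that each (id, t) pair occurs exactly once.
resample_with_index <- function(dat, id = "grp") {
  boot <- dat[sample(nrow(dat), replace = TRUE), , drop = FALSE]
  boot <- boot[order(boot[[id]]), , drop = FALSE]
  boot$t <- ave(seq_len(nrow(boot)), boot[[id]], FUN = seq_along)
  rownames(boot) <- NULL
  boot
}

boot <- resample_with_index(test)
n_dup <- anyDuplicated(boot[c("grp", "t")])  # 0 means no duplicated pairs
n_dup
```

A data frame built this way could then be passed to plm() with index = c("grp", "t"), as in the question's working call.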
