我面临一个问题,我需要弄清楚两件事:
- 在几个相关的预测变量中,哪个预测变量在其对预测变量的影响/预测能力方面是最有意义的。
- 这些不同预测变量的意义顺序(从最有意义到最不重要)。
由于我没有关于这项调查的先验假设,我认为我应该进行某种多元回归分析。然后,也许我应该从模型中提取术语,看看哪一个是最有意义的。我已经知道过去p-value不是正确的方式。那是什么?
例子
假设我想调查哪些因素会影响城市居民的幸福感。我对来自纽约市和旧金山的人(居民)进行抽样,并要求他们评价:
- 他们对城市的普遍满意度。
- 他们的城市有多干净(在他们看来)
- 学校的教育水平有多好
- 公共交通有多好。
在我看来,这里有 3 个相关变量(清洁度、教育和交通)可能与总体满意度有关。我想模拟这种关系,并得出结论纽约市与旧金山有多么不同。例如,就整体满意度而言,纽约市的影响是教育 > 交通 > 清洁度,而旧金山交通 > 教育 > 清洁度的影响是什么?
这里有一些玩具数据来演示。
my_df <- structure(list(location = c("sf", "nyc", "nyc", "sf", "nyc",
"nyc", "nyc", "nyc", "nyc", "sf", "nyc", "sf", "sf", "sf", "nyc",
"sf", "sf", "nyc", "nyc", "nyc", "sf", "sf", "sf", "sf", "sf",
"nyc", "sf", "sf", "nyc", "sf", "nyc", "nyc", "nyc", "sf", "nyc",
"nyc", "nyc", "sf", "nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc",
"nyc", "nyc", "nyc", "nyc", "sf", "nyc", "nyc", "sf", "sf", "nyc",
"nyc", "nyc", "nyc", "sf", "sf", "nyc", "sf", "nyc", "nyc", "sf",
"nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", "sf", "nyc", "nyc",
"nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", "sf", "sf", "nyc",
"nyc", "nyc", "sf", "sf", "sf", "nyc", "sf", "sf", "sf", "nyc",
"nyc", "nyc", "nyc", "nyc", "nyc"),
satisfied = c(5, 1, 7, 5,
7, 1, 5, 5, 5, 7, 7, 4, 1, 3, 5, 6, 7, 7, 6, 4, 4, 5, 6, 5, 5,
7, 5, 6, 5, 4, 7, 7, 5, 5, 4, 7, 7, 5, 6, 6, 3, 6, 5, 7, 5, 7,
6, 5, 4, 3, 6, 5, 7, 3, 5, 5, 7, 5, 6, 7, 7, 7, 7, 5, 4, 7, 6,
7, 7, 6, 6, 6, 5, 7, 5, 4, 6, 4, 7, 5, 6, 6, 5, 5, 6, 7, 6, 5,
1, 5, 2, 7, 7, 7, 7, 7, 1, 3, 7, 7),
clean = c(4, 1,
7, 3, 4, 1, 6, 6, 7, 4, 5, 1, 1, 1, 4, 6, 6, 1, 6, 1, 4, 2, 2,
7, 3, 5, 2, 4, 1, 1, 4, 6, 3, 5, 1, 4, 5, 2, 5, 5, 4, 5, 4, 7,
3, 6, 5, 4, 5, 4, 5, 4, 5, 1, 5, 2, 6, 5, 7, 6, 3, 7, 5, 6, 4,
6, 6, 5, 5, 5, 4, 1, 4, 4, 5, 1, 3, 1, 2, 2, 6, 4, 3, 6, 7, 7,
5, 2, 4, 1, 3, 1, 5, 3, 5, 5, 1, 1, 5, 6),
edu = c(5,
1, 7, 4, 4, 1, 6, 6, 6, 4, 5, 4, 4, 3, 3, 5, 4, 1, 1, 3, 5, 6,
5, 5, 3, 6, 2, 4, 4, 4, 6, 3, 4, 7, 1, 4, 7, 5, 6, 5, 5, 5, 4,
7, 3, 7, 6, 5, 5, 4, 5, 3, 4, 4, 4, 7, 5, 4, 6, 6, 4, 7, 4, 2,
5, 6, 6, 7, 6, 7, 3, 3, 2, 6, 6, 2, 5, 3, 6, 5, 6, 4, 4, 5, 6,
7, 3, 3, 4, 5, 4, 1, 3, 4, 4, 6, 5, 1, 4, 6),
transportation = c(1,
1, 7, 5, 7, 1, 4, 6, 6, 6, 6, 1, 1, 1, 5, 5, 4, 7, 6, 6, 7, 5,
2, 7, 3, 6, 1, 4, 7, 5, 6, 7, 4, 3, 2, 6, 4, 2, 6, 5, 4, 7, 6,
7, 3, 7, 4, 4, 5, 4, 6, 3, 5, 2, 7, 3, 7, 7, 7, 6, 7, 7, 7, 5,
3, 5, 4, 7, 6, 6, 4, 2, 4, 4, 5, 6, 5, 2, 6, 2, 6, 6, 3, 4, 7,
7, 7, 4, 5, 4, 5, 3, 7, 5, 7, 7, 7, 1, 6, 6)),
row.names = c(NA, -100L),
class = c("tbl_df", "tbl", "data.frame"))
library(magrittr)
library(effectsize)
my_df %>%
lm(satisfied ~ clean*location + edu*location + transportation*location, data = .) %>%
effectsize() %>%
plot()
#> Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
#> use `guide = "none"` instead.

由reprex 包于 2021-08-15 创建 (v2.0.0 )
为了在城市之间进行比较,我在每个//变量和变量之间添加edu了交互clean项。然后我过去常常从模型中获取估计值。但我能从这些估计中得出什么结论?transportationlocationeffectsize::effectsize()
如果我完全弄错了,请告知我应该采取什么其他途径来解决这个问题。
谢谢!


