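Is it valid to eliminate features based on their Pearson correlation with the target variable in a classification problem?
For example, I have a dataset in the following format, where the target variable takes the value 1 or 0: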
>>> dt.head()
   ID  var3  var15  imp_ent_var16_ult1  imp_op_var39_comer_ult1  \
0   1     2     23                   0                        0
1   3     2     34                   0                        0
2   4     2     23                   0                        0
3   8     2     37                   0                      195
4  10     2     39                   0                        0

   imp_op_var39_comer_ult3  imp_op_var40_comer_ult1  TARGET
0                        0                        0       0
1                        0                        0       0
2                        0                        0       0
3                      195                        0       0
4                        0                        0       0
Computing the correlation matrix gives the following values:
|ID|var3|var15|imp_ent_var16_ult1|imp_op_var39_comer_ult1|imp_op_var39_comer_ult3|imp_op_var40_comer_ult1|TARGET
ID|1.0|-0.00102533166614|-0.00213549813966|-0.00311137548461|-0.00143645708778|-0.00413114484307|-0.00727672024906|0.0031484687227
var3|-0.00102533166614|1.0|-0.00445177129541|0.0018681447614|0.00598903116859|0.00681691701467|0.00151753041397|0.00447479817554
var15|-0.00213549813966|-0.00445177129541|1.0|0.0437222608106|0.0947624170998|0.101177078747|0.0427540973727|0.101322098561
imp_ent_var16_ult1|-0.00311137548461|0.0018681447614|0.0437222608106|1.0|0.0412213212518|0.0348787079026|0.00989582043194|-1.74602537678e-05
imp_op_var39_comer_ult1|-0.00143645708778|0.00598903116859|0.0947624170998|0.0412213212518|1.0|0.886476049204|0.342709191344|0.0103531295754
imp_op_var39_comer_ult3|-0.00413114484307|0.00681691701467|0.101177078747|0.0348787079026|0.886476049204|1.0|0.316671244555|0.0035169224417
imp_op_var40_comer_ult1|-0.00727672024906|0.00151753041397|0.0427540973727|0.00989582043194|0.342709191344|0.316671244555|1.0|0.00311938694896
TARGET|0.0031484687227|0.00447479817554|0.101322098561|-1.74602537678e-05|0.0103531295754|0.0035169224417|0.00311938694896|1.0
Is it valid to eliminate every feature whose correlation with the target falls below a threshold (for example, 0.1)?
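To make the filter in question concrete, here is a minimal sketch assuming pandas. The helper name `low_corr_features` and the toy data are made up for illustration and are not taken from the dataset above; the comparison uses the absolute value of the correlation, which is one possible reading of "below a threshold".

```python
import pandas as pd


def low_corr_features(df, target, threshold=0.1):
    """Return the features whose |Pearson r| with `target` is below `threshold`."""
    # Correlation of every column with the target, minus the target itself.
    corr_with_target = df.corr(method="pearson")[target].drop(target)
    return corr_with_target[corr_with_target.abs() < threshold].index.tolist()


# Hypothetical toy data: "a" tracks the target, "b" is linearly uncorrelated with it.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, -1, 0, 1, -1, 0],
    "TARGET": [0, 0, 0, 1, 1, 1],
})

candidates = low_corr_features(toy, "TARGET", threshold=0.1)
```

Applied to the matrix above with the same 0.1 threshold, this kind of filter would keep only var15 (about 0.101) and flag every other column, which is exactly the outcome the question is asking about.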
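What if there is strong inter-attribute correlation, up to 1, between attributes that are continuous variables? Does that mean these features carry redundant information for the learner? Can I safely remove one of them without risking a loss of information?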
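The redundancy check described here could be sketched as follows, again assuming pandas and NumPy. The helper `redundant_features`, the 0.95 cutoff, and the toy columns are hypothetical choices for illustration; the sketch only flags candidates and does not settle whether dropping them is safe.

```python
import numpy as np
import pandas as pd


def redundant_features(df, target, threshold=0.95):
    """List features whose |Pearson r| with an earlier feature exceeds `threshold`."""
    corr = df.drop(columns=[target]).corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and the diagonal of ones is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Hypothetical toy data: "x_scaled" is an exact linear copy of "x".
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "y": [5.0, 1.0, 4.0, 2.0, 3.0],
    "TARGET": [0, 1, 0, 1, 0],
})

to_drop = redundant_features(toy, "TARGET", threshold=0.95)
```

In the matrix above, the pair imp_op_var39_comer_ult1 and imp_op_var39_comer_ult3 (about 0.886) would be flagged only if the cutoff were lowered below that value, for example to 0.85.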