Converting a string into dummy-coded variables

data-mining categorical-data
2022-02-23 19:53:15

Here is the data:

PlayerID, Characters, Win or Lose

I can make it look like this:

8PYPY0LLQ,valkyrie5 ,  chr_witch4 ,  hog_rider5 ,  zapMachine1 ,  mega_minion3 ,  baby_dragon2 ,  bomber7 ,  skeleton_horde1, 0

or like this:

2GRG822L9,"barbarians8, valkyrie5, chr_balloon3, fire_spirits8, minion8, firespirit_hut6, rage4, skeleton_horde3,",1

The second column is a combination of 8 characters drawn from a pool of 70+ possible characters.

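I need to encode this variable as dummy variables so that each character gets its own column. Is there a way to do this in Python or R? I assume the second column has to be kept as a single string, rather than writing out a CSV file that looks like this: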

2GRG822L9,barbarians8, valkyrie5, chr_balloon3, fire_spirits8, minion8, firespirit_hut6, rage4, skeleton_horde3,1
8PYPY0LLQ,valkyrie5 ,  chr_witch4 ,  hog_rider5 ,  zapMachine1 ,  mega_minion3 ,  baby_dragon2 ,  bomber7 ,  skeleton_horde1,0

Before dummy coding, it should look like this (I can remove the commas inside the string if needed):

2GRG822L9,"barbarians8, valkyrie5, chr_balloon3, fire_spirits8, minion8, firespirit_hut6, rage4, skeleton_horde3,",1
8PYPY0LLQ,"valkyrie5 ,  chr_witch4 ,  hog_rider5 ,  zapMachine1 ,  mega_minion3 ,  baby_dragon2 ,  bomber7 ,  skeleton_horde1,",0

1 Answer

This is straightforward in Python with pandas:

from pandas import DataFrame
data = [('2GRG822L9',"barbarians8,valkyrie5,chr_balloon3,fire_spirits8,minion8,firespirit_hut6,rage4,skeleton_horde3",1), ('8PYPY0LLQ',"valkyrie5,chr_witch4,hog_rider5,zapMachine1,mega_minion3,baby_dragon2,bomber7,skeleton_horde1",0)]
df = DataFrame.from_records(data, columns=('PlayerID', 'Characters', 'Result'))
# Split Characters on commas into one 0/1 column per character, then replace the original column with them.
df = df.drop(columns='Characters').join(df['Characters'].str.get_dummies(','))

Result:

    PlayerID  Result  baby_dragon2  barbarians8  bomber7  chr_balloon3  \
0  2GRG822L9       1             0            1        0             1   
1  8PYPY0LLQ       0             1            0        1             0   

   chr_witch4  fire_spirits8  firespirit_hut6  hog_rider5  mega_minion3  \
0           0              1                1           0             0   
1           1              0                0           1             1   

   minion8  rage4  skeleton_horde1  skeleton_horde3  valkyrie5  zapMachine1  
0        1      1                0                1          1            0  
1        0      0                1                0          1            1
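
Note that the rows in the question also carry stray spaces around the commas and a trailing comma inside the quoted field. Since `str.get_dummies` splits on the separator without trimming, `"valkyrie5 "` and `" valkyrie5"` could otherwise end up as separate columns. Below is a minimal sketch of one way to handle that, assuming the file uses the quoted layout from the question and has no header row. The names `raw`, `cleaned`, `dummies`, and `encoded` are only illustrative, and the inline string stands in for the real file.

```python
import io

import pandas as pd

# Two rows in the quoted layout from the question, including the stray
# spaces and the trailing comma inside the quoted field (stand-in for a file).
raw = '''2GRG822L9,"barbarians8, valkyrie5, chr_balloon3, fire_spirits8, minion8, firespirit_hut6, rage4, skeleton_horde3,",1
8PYPY0LLQ,"valkyrie5 ,  chr_witch4 ,  hog_rider5 ,  zapMachine1 ,  mega_minion3 ,  baby_dragon2 ,  bomber7 ,  skeleton_horde1,",0
'''

# The double quotes keep the whole deck in one field, so read_csv sees
# exactly three columns per row.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["PlayerID", "Characters", "Result"],
)

# Drop all whitespace and any leading/trailing commas so that every
# spelling of a character name maps to the same token.
cleaned = (
    df["Characters"]
    .str.replace(r"\s+", "", regex=True)
    .str.strip(",")
)

# One 0/1 indicator column per distinct character.
dummies = cleaned.str.get_dummies(sep=",")
encoded = df.drop(columns="Characters").join(dummies)
print(encoded)
```

Removing all whitespace is only safe here because none of the character names contain spaces. If yours do, split on the comma first and strip each piece instead. So the commas inside the string do not need to be removed beforehand, as they serve as the separator.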