Extracting sub-features from within a df cell?

data-mining machine-learning python pandas feature-extraction
2022-02-16 16:13:27

I have a dataframe containing multiple features in the following form:

Id  Acol                    Bcol    Ccol            Dcol
1   X:0232,Y:10332,Z:23891  E:1222  F:12912,G:1292  V:1281
2   X:432,W:2932            R:2392  T:292,U:29203   Q:2392
3   Y:29320,W:2392          R:2932  G:239,T:2392    Q:2391

... about 10,000 Ids in total, where:
  • 1, 2, 3 are the Ids.
  • Acol, Bcol, Ccol and Dcol are the feature columns.
  • X, Y, Z, W are sub-features of the feature "Acol", and so on.

How can I extract the sub-features/features from this kind of dataframe?

1 Answer

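You did not specify what shape you want the data to end up in, so I will simply show how to break the features and sub-features out into a long format that can then be reshaped as needed: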

Code:

The key step is to take a feature column and break out its sub-features. This can be done as follows:

import pandas as pd
from io import StringIO  # used by the test code


def get_sub_features(feature_col):
    # split each cell on commas, then each piece on the colon
    feature_df = feature_col.str.split(',').apply(
        lambda feature: pd.Series(
            dict([sub_feature.split(':') for sub_feature in feature]),
            name=feature_col.name))

    # add a feature name column to use as an index
    feature_df['feature'] = feature_col.name

    # name the columns as sub-feature for later stacking
    feature_df.columns.names = ['sub-feature']

    # return dataframe with id/feature_name index
    new_index = [feature_df.index.name, 'feature']
    return feature_df.reset_index().set_index(new_index)
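
To make the parsing step concrete, here is a minimal sketch of what the lambda does to a single cell, written in plain Python. The helper name `parse_cell` is hypothetical and is not part of the function above; it assumes every piece of the cell has the form `name:value`.

```python
def parse_cell(cell):
    """Turn 'X:0232,Y:10332' into {'X': '0232', 'Y': '10332'}."""
    # Split the cell on commas, then split each piece once on the colon.
    pairs = (item.split(":", 1) for item in cell.split(","))
    return {name: value for name, value in pairs}


parsed = parse_cell("X:0232,Y:10332,Z:23891")
print(parsed)  # {'X': '0232', 'Y': '10332', 'Z': '23891'}
```

The values stay as strings at this point, which is why the stacked result reports `dtype: object`.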

Test code:

df = pd.read_fwf(StringIO(u"""
    Id  Acol                    Bcol    Ccol            Dcol
    1   X:0232,Y:10332,Z:23891  E:1222  F:12912,G:1292  V:1281
    2   X:432,W:2932            R:2392  T:292,U:29203   Q:2392
    3   Y:29320,W:2392          R:2932  G:239,T:2392    Q:2391"""
                          ), header=1).set_index(['Id'])
print(df)

feature_cols = ['Acol', 'Bcol', 'Ccol', 'Dcol']
stacked = pd.concat(get_sub_features(df[f]).stack() for f in feature_cols)
print(stacked)

Results:

                      Acol    Bcol            Ccol    Dcol
Id                                                        
1   X:0232,Y:10332,Z:23891  E:1222  F:12912,G:1292  V:1281
2             X:432,W:2932  R:2392   T:292,U:29203  Q:2392
3           Y:29320,W:2392  R:2932    G:239,T:2392  Q:2391

Id  feature  sub-feature
1   Acol     X               0232
             Y              10332
             Z              23891
2   Acol     W               2932
             X                432
3   Acol     W               2392
             Y              29320
1   Bcol     E               1222
2   Bcol     R               2392
3   Bcol     R               2932
1   Ccol     F              12912
             G               1292
2   Ccol     T                292
             U              29203
3   Ccol     G                239
             T               2392
1   Dcol     V               1281
2   Dcol     Q               2392
3   Dcol     Q               2391
dtype: object

Accessing the data:

Some examples:

print(stacked.xs('T', level=2))
print(stacked.iloc[stacked.index.get_level_values('sub-feature') == 'T'])

Results:

               0
Id feature      
2  Ccol      292
3  Ccol     2392

                           0
Id feature sub-feature      
2  Ccol    T             292
3  Ccol    T            2392
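
If a one-row-per-Id feature matrix is what is needed in the end, the long format can be reshaped with `unstack`. The following is only a sketch: `stacked_demo` is a small hand-built stand-in with the same three index levels as `stacked`, and it assumes every value is an integer-like string.

```python
import pandas as pd

# Hand-built stand-in for a few rows of `stacked` (same three index levels).
index = pd.MultiIndex.from_tuples(
    [
        (1, "Acol", "X"),
        (1, "Acol", "Y"),
        (2, "Acol", "X"),
        (2, "Ccol", "T"),
        (3, "Ccol", "T"),
    ],
    names=["Id", "feature", "sub-feature"],
)
stacked_demo = pd.Series(["0232", "10332", "432", "292", "2392"], index=index)

# Assumption: values are integer-like strings; note that "0232" becomes 232.
numeric = pd.to_numeric(stacked_demo)

# Move the feature and sub-feature levels into the columns: one row per Id.
wide = numeric.unstack(["feature", "sub-feature"])
print(wide)
```

Sub-features that an Id does not have come out as NaN in the wide table, so how to fill them (zero, a sentinel, or left missing) depends on the downstream model. If leading zeros such as `0232` are meaningful, skip the `pd.to_numeric` step and keep the strings.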