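You did not say exactly what shape you want the data to end up in, so I will simply show how to break the features and sub-features out into a format that can then be converted into whatever form you need:
Code:
The most important step is taking a feature column and splitting out its sub-features. That can be done like this: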
import pandas as pd

def get_sub_features(feature_col):
    # split each cell on commas, then split each piece on its colon
    feature_df = feature_col.str.split(',').apply(
        lambda feature: pd.Series(
            dict([sub_feature.split(':') for sub_feature in feature]),
            name=feature_col.name))

    # add a feature name column to use as an index
    feature_df['feature'] = feature_col.name

    # name the columns as sub-feature for later stacking
    feature_df.columns.names = ['sub-feature']

    # return dataframe with id/feature_name index
    new_index = [feature_df.index.name, 'feature']
    return feature_df.reset_index().set_index(new_index)
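To make the splitting step concrete, here is a minimal sketch of what happens to a single cell. The helper name `parse_cell` is hypothetical and used only for illustration; the function above does the same thing inline for every row of a column.

```python
# Hypothetical helper for illustration only; not part of the answer's code.
def parse_cell(cell):
    """Turn 'X:0232,Y:10332' into {'X': '0232', 'Y': '10332'}."""
    # split on commas to get 'name:value' pieces, then split each piece on its colon
    return dict(pair.split(':', 1) for pair in cell.split(','))

parsed = parse_cell('X:0232,Y:10332,Z:23891')
print(parsed)
```

Note that the values stay strings at this point, so a leading zero such as the one in `0232` is preserved.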
Test code:
from io import StringIO

df = pd.read_fwf(StringIO(u"""
    Id    Acol                     Bcol     Ccol            Dcol
    1     X:0232,Y:10332,Z:23891   E:1222   F:12912,G:1292  V:1281
    2     X:432,W:2932             R:2392   T:292,U:29203   Q:2392
    3     Y:29320,W:2392           R:2932   G:239,T:2392    Q:2391"""
    ), header=1).set_index(['Id'])

print(df)

feature_cols = ['Acol', 'Bcol', 'Ccol', 'Dcol']
stacked = pd.concat(get_sub_features(df[f]).stack() for f in feature_cols)
print(stacked)
Result:
                      Acol    Bcol            Ccol    Dcol
Id
1   X:0232,Y:10332,Z:23891  E:1222  F:12912,G:1292  V:1281
2             X:432,W:2932  R:2392   T:292,U:29203  Q:2392
3           Y:29320,W:2392  R:2932    G:239,T:2392  Q:2391

Id  feature  sub-feature
1   Acol     X               0232
             Y              10332
             Z              23891
2   Acol     W               2932
             X                432
3   Acol     W               2392
             Y              29320
1   Bcol     E               1222
2   Bcol     R               2392
3   Bcol     R               2932
1   Ccol     F              12912
             G               1292
2   Ccol     T                292
             U              29203
3   Ccol     G                239
             T               2392
1   Dcol     V               1281
2   Dcol     Q               2392
3   Dcol     Q               2391
dtype: object
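Since the stacked result is meant as a starting point for other shapes, here is a rough sketch of two possible conversions. It assumes a Series with the same three index level names as the output above; `sample` is a small hand-built stand-in, not the full `stacked` result, and the names `numeric`, `wide` and `flat` are only for illustration.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the stacked result (assumption: same level names)
index = pd.MultiIndex.from_tuples(
    [(1, 'Acol', 'X'), (1, 'Acol', 'Y'), (2, 'Acol', 'X'),
     (2, 'Ccol', 'T'), (3, 'Ccol', 'T')],
    names=['Id', 'feature', 'sub-feature'])
sample = pd.Series(['0232', '10332', '432', '292', '2392'], index=index)

# the values are strings after splitting; convert them if numbers are needed
numeric = pd.to_numeric(sample)

# wide form: one row per (Id, feature), one column per sub-feature,
# with NaN where a sub-feature is absent
wide = numeric.unstack('sub-feature')

# flat form: ordinary columns instead of index levels
flat = numeric.rename('value').reset_index()

print(wide)
print(flat)
```

Converting to numbers drops leading zeros (`'0232'` becomes `232`), so skip that step if the values are really identifiers rather than quantities.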
Accessing the data:
Some examples:
print(stacked.xs('T', level=2))
print(stacked.iloc[stacked.index.get_level_values('sub-feature') == 'T'])
Result:
              0
Id feature
2  Ccol     292
3  Ccol    2392

                          0
Id feature sub-feature
2  Ccol    T            292
3  Ccol    T           2392
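A few more lookups in the same spirit, sketched on a small hand-built Series that mimics the index layout above (`sample` and the result names are assumptions for illustration, not part of the original code):

```python
import pandas as pd

# Hand-built stand-in with the same index level names as the stacked result
index = pd.MultiIndex.from_tuples(
    [(1, 'Acol', 'X'), (1, 'Acol', 'Y'), (2, 'Acol', 'X'),
     (2, 'Ccol', 'T'), (3, 'Ccol', 'T')],
    names=['Id', 'feature', 'sub-feature'])
sample = pd.Series(['0232', '10332', '432', '292', '2392'], index=index)

# everything recorded for Id 2, selecting the level by name instead of position
by_id = sample.xs(2, level='Id')

# a single value via the full (Id, feature, sub-feature) key
one_value = sample.loc[(2, 'Ccol', 'T')]

# a boolean mask with .loc keeps all three index levels in the result
only_t = sample.loc[sample.index.get_level_values('sub-feature') == 'T']

print(by_id)
print(one_value)
print(only_t)
```

Referring to levels by name (`level='Id'`) rather than by number tends to be easier to read and survives a reordering of the index levels.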