Converting a single-index pandas DataFrame to a MultiIndex

data-mining python pandas indexing

2022-02-14 16:14:42

I have a DataFrame with the following structure:

df.columns
Index(['first_post_date', 'followers_count', 'friends_count',
       'last_post_date','min_retweet', 'retweet_count', 'screen_name',
       'tweet_count',  'tweet_with_max_retweet', 'tweets', 'uid'],
        dtype='object')

In the tweets column, each cell is itself a DataFrame containing all of that user's tweets.

df.tweets[0].columns
Index(['created_at', 'id', 'retweet_count', 'text'], dtype='object')

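I want to convert this DataFrame into a MultiIndex frame, essentially by exploding the cells that contain the tweets. One index level would be uid, the other the id of each tweet.

How can I do this?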

Link to sample data

1 Answer

One way to pull the embedded DataFrames up into the main DataFrame and build a MultiIndex is as follows:

Code:

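def expand_tweets(tweets_df):
    tweets = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    for uid, sub_df in tweets_df.set_index('uid').tweets.items():
        # assign() returns a tagged copy, leaving the nested frame untouched.
        tweets.append(sub_df.assign(uid=uid))
    # Stack all tweets, merge the user-level columns back in on uid,
    # then index by (uid, id).
    return pd.concat(tweets).merge(
        tweets_df.drop('tweets', axis=1).reset_index(),
        how='outer', on='uid').set_index(['uid', 'id'])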

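How it works:

Pull every tweets DataFrame out of the main DataFrame with uid as the index, tag each one with its uid, and concat() them.
Then merge the main DataFrame into the concatenated tweets DataFrame.
Set the desired index.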
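
The three steps above can also be followed one at a time on a small frame. This is only a sketch on made-up data: the uid, screen_name, id and text values are invented for illustration, and the nested frames are built the same way as in the test code.

```python
import pandas as pd

# Hypothetical toy data: two users, each with a list of tweet records.
users = pd.DataFrame({
    'uid': [10, 20],
    'screen_name': ['alice', 'bob'],
    'tweets': [
        [{'id': 1, 'text': 'a1'}, {'id': 2, 'text': 'a2'}],
        [{'id': 3, 'text': 'b1'}],
    ],
})
# Turn each cell into its own DataFrame, mirroring the structure in the question.
users['tweets'] = users['tweets'].apply(lambda rows: pd.DataFrame(rows))

# Step 1: tag each nested frame with its owner's uid and stack them.
tagged = [sub_df.assign(uid=uid)
          for uid, sub_df in users.set_index('uid')['tweets'].items()]
stacked = pd.concat(tagged, ignore_index=True)

# Step 2: merge the user-level columns back in on uid.
merged = stacked.merge(users.drop(columns='tweets'), how='outer', on='uid')

# Step 3: set the two-level index.
result = merged.set_index(['uid', 'id'])
print(result)
```

Each row of result is one tweet, keyed by (uid, id), with the user-level columns repeated on every row belonging to that user.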

Test code:

import json
import pandas as pd
with open('test.json') as f:
    df = pd.DataFrame(json.load(f))
df['tweets'] = df.tweets.apply(lambda x: pd.DataFrame(x))

print(expand_tweets(df).text.head())

Result:

uid         id                
1153859336  655060275025047552    Article on my new Haunted Stevenage book Paran...
            653912439940120576    Big thank you to @bobfmuk for interviewing me ...
            643709869908996096    Another interesting non-toadstool tweet today,...
            547107275681579008    @sisax67 Thanks, Simon. All the best to you &a...
            546693940024733696    Paul Adams @SkySportsDarts The Wanderer from W...
Name: text, dtype: object
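
One thing to watch for with the columns listed in the question: retweet_count appears both in the outer frame and in each tweets frame, so the merge keeps both and by default renames them retweet_count_x and retweet_count_y. A minimal sketch on made-up data, assuming you would rather name them explicitly through the suffixes parameter of merge (the suffix strings and the values are illustrative only):

```python
import pandas as pd

# Made-up stand-ins for the concatenated tweets and the user-level frame.
tweets = pd.DataFrame({'uid': [10, 10], 'id': [1, 2], 'retweet_count': [5, 7]})
users = pd.DataFrame({'uid': [10], 'retweet_count': [12]})

# The first suffix goes on the tweet-level column (left frame),
# the second on the user-level column (right frame).
merged = tweets.merge(users, how='outer', on='uid',
                      suffixes=('_tweet', '_user')).set_index(['uid', 'id'])
print(merged.columns.tolist())
```

The same suffixes argument could be passed to the merge call inside expand_tweets if the default _x and _y names are too easy to mix up.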
