Dataset from a sequence of messages

data-mining python dataset pandas
2022-02-15 10:50:34

I have a dataset sorted by timestamp that looks like this:

user    message
A       Hi.             
B       Hello.
B       How are you?
A       I am stuck.
B       How can I help you?

What I want is to create a pandas DataFrame that looks like this:

user  message       reply
A     Hi.           Hello.
A     Hi.           How are you?
B     Hello.        I am stuck.
B     How are you?  I am stuck.
A     I am stuck.   How can I help you?

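For each message, I want to find all of its replies. By a reply I mean a message that comes after the current message and before the current user's next message, but was sent by the other user. How can I do this with pandas? Let's consider only the binary case of two users, A and B.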

1 Answer

First, find the points where the user changes, and give each block of consecutive messages from the same user its own id:

df['group_id'] = (df['user'] != df['user'].shift()).cumsum()

user              message  group_id
   A                  Hi.         1
   B               Hello.         2
   B         How are you?         2
   A          I am stuck.         3
   B  How can I help you?         4
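
To see why this works, it can help to look at the intermediate values. The sketch below rebuilds the sample data from the question; the standalone variables `changed` and `group_id` are used only for illustration. Comparing each user with the previous row marks the start of every new block, and the cumulative sum turns those marks into block numbers.

```python
import pandas as pd

# Sample data from the question, already sorted by timestamp.
df = pd.DataFrame({
    "user": ["A", "B", "B", "A", "B"],
    "message": ["Hi.", "Hello.", "How are you?",
                "I am stuck.", "How can I help you?"],
})

# True wherever the user differs from the previous row.
# The first row is compared with a missing value, so it is also True.
changed = df["user"] != df["user"].shift()
print(changed.tolist())   # [True, True, False, True, True]

# Each True adds 1 to the running total, so rows in the same
# block of consecutive messages share one id.
group_id = changed.cumsum()
print(group_id.tolist())  # [1, 2, 2, 3, 4]
```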

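Then group by group_id and aggregate the messages of each group into a list. Shifting these lists by -1 lines up each group_id with the messages of the following group, which are its replies: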

df_reply = df.groupby('group_id')['message'].agg(list)
df_reply = df_reply.shift(-1).reset_index().rename(columns={'message': 'reply'})

 group_id                   reply
        1  [Hello., How are you?]
        2           [I am stuck.]
        3   [How can I help you?]
        4                     NaN

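The replies can then be merged back into the original DataFrame. The reply lists are exploded so that each row holds a single reply, and the last group, which has no reply, is dropped: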

df.merge(df_reply, on='group_id').explode('reply').drop('group_id', axis=1).dropna()

Final result:

user       message                reply
   A           Hi.               Hello.
   A           Hi.         How are you?
   B        Hello.          I am stuck.
   B  How are you?          I am stuck.
   A   I am stuck.  How can I help you?
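
Putting the steps together, a minimal end-to-end sketch might look like the following. The function name `pair_messages_with_replies` is made up for this example, and it assumes the frame has `user` and `message` columns and is already sorted by timestamp.

```python
import pandas as pd


def pair_messages_with_replies(df: pd.DataFrame) -> pd.DataFrame:
    """Pair every message with each message in the next block
    sent by the other user (hypothetical helper for illustration)."""
    out = df.copy()

    # Number the blocks of consecutive messages from the same user.
    out["group_id"] = (out["user"] != out["user"].shift()).cumsum()

    # Collect each block's messages into a list, then shift by -1 so that
    # every group_id is lined up with the messages of the following block.
    replies = (
        out.groupby("group_id")["message"]
        .agg(list)
        .shift(-1)
        .rename("reply")
        .reset_index()
    )

    # Attach the reply lists, expand them to one reply per row, and
    # drop the final block, which has nothing after it.
    return (
        out.merge(replies, on="group_id")
        .explode("reply")
        .dropna(subset=["reply"])
        .drop(columns="group_id")
        .reset_index(drop=True)
    )


df = pd.DataFrame({
    "user": ["A", "B", "B", "A", "B"],
    "message": ["Hi.", "Hello.", "How are you?",
                "I am stuck.", "How can I help you?"],
})

result = pair_messages_with_replies(df)
print(result)
```

Working on a copy leaves the caller's DataFrame without the helper `group_id` column, and `dropna(subset=["reply"])` removes only rows that lack a reply rather than any row containing a missing value.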