How do I group this DataFrame in Python?

data-mining python pandas dataframe
2022-02-18 00:34:10

I have this problem:

import pandas as pd

stripline = "----------------------------"

rawData = {
    'order number': ['11xa', '11xa', '11xa', '21xb', '31xc'],
    'working area': ['LLA', 'LLE', 'LLS', 'MLA', 'MLE'],
    'time': ['1', '6', '13', '35', '24']
}

df = pd.DataFrame(rawData)
print("original data:")
print(df.head())

print(stripline)

rawData2 = {
    'order number': ['11xa', '21xb', '31xc'],
    'working area': ['LLS', 'MLA', 'MLE'],
    'time': ['20', '35', '24']
}
df2 = pd.DataFrame(rawData2)

print("expected result:")
print("group after order number, sum all times to that order and choose working field with the biggest time")
print(df2.head())

How can I manipulate my DataFrame df to get df2?

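I want to sum all the values in the time column that belong to the same order number. For the working area, I want to use the one with the largest time. In particular, I want to keep the rest of the data. The new DataFrame has three rows; the old one has five.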

2 Answers

This line of code should do it for you:

df.groupby(["order number", "working area"])['time'].agg(sum)

1) Convert the time column to integers:

df['time'] = df['time'].astype(int)

2) For each order number, find the working area with the largest time:

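for index, row in df.iterrows():
    # Look only at rows of the same order, so an equal time in another order cannot match
    same_order = df[df['order number'] == row['order number']]
    df.at[index, 'max working area'] = same_order.loc[same_order['time'].idxmax(), 'working area']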

3) Aggregate the time column:

df2 = df.groupby(['order number', 'max working area'], as_index=False)['time'].sum()
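
The same three steps can probably also be done without iterating over rows. The following is only a sketch: it assumes that a tie for the largest time within an order may be resolved by taking the first matching row (which is how idxmax behaves), and the names grouped, longest, totals and result are illustrative, not taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'order number': ['11xa', '11xa', '11xa', '21xb', '31xc'],
    'working area': ['LLA', 'LLE', 'LLS', 'MLA', 'MLE'],
    'time': ['1', '6', '13', '35', '24'],
})

# The times are stored as strings, so convert them before comparing or summing.
df['time'] = df['time'].astype(int)

grouped = df.groupby('order number')['time']

# idxmax gives the row label of the largest time within each order number;
# use those labels to pick the matching working area.
longest = df.loc[grouped.idxmax(), ['order number', 'working area']]

# Total time per order number.
totals = grouped.sum().reset_index()

# One row per order: working area of the longest entry plus the summed time.
result = longest.merge(totals, on='order number')
print(result)
```

Here the time column in result holds integers, whereas the expected df2 in the question stores them as strings; result['time'].astype(str) would convert them back if that matters.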

Is this what you wanted?