Optimizing row-wise iteration and summation in pandas

data-mining pandas optimization dataframe processing

2022-01-19 20:19:27

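I was wondering whether anyone could offer some advice on speeding up a pandas calculation.

What I want to obtain is, for each row of a second table (the UUID table), the sum of the matching rows from another table (the player table). Functionally, each UUID row needs to sum the player-table rows whose IDs appear in its Active column, and the UUID becomes the index of the resulting row.

My initial idea was to loop row by row and build up the result, but this is quite slow, and I suspect it is not the best way to do it. With the version below, I estimate a total time of about 66 minutes for the full dataset. Running it on a 10,000-row subsample takes about 20 seconds.

Does anyone have a better solution for computing these results?

Thanks in advance!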

UUID table

This is a subset of the full table

shape = (2060590, 2)

Player ID table

This is a subset of the full table

shape = (39, 8)

Final table

Code

# executes in ~20 seconds
df = None
for ix, i in enumerate(uuid_df[["UUID", "Active"]].sample(10000).itertuples(index=False)):
    # Get UUID for row
    _uuid = i[0]
    # Get list of "Active items" (these are the ones that will be summed)
    _active = i[1]

    # Create new frame by selecting everything from points table where the ID is in the Active List.
    # Sum selected values, convert to a dataframe with UUID as index and transpose
    _dff = points_table_df.loc[points_table_df.index.isin(_active)].sum().to_frame(_uuid).T

    # Check if first dataframe, if not concat to existing one
    if df is None:
        df = _dff
    else:
        df = pd.concat([df, _dff])

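2 Answers

This can actually be done quickly and intuitively with linear algebra.

Think of your Active lists as a label-binarized array (this can be done with MultiLabelBinarizer), so you would expect an array of 0s and 1s of size (2060590, 39). Arrange its columns in the same order as the rows of your player table (or the other way round, whichever is easier), so that the first column of the new matrix corresponds to the first player in the player table, and so on. Finally, just apply matrix multiplication and you're done.

Here is an example using generated data, but hopefully it gets the idea across.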

import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

sample_active = pd.Series([[100,50,150,200],
                           [100,50,150],
                           [100,50],
                           [100]])
sample_df = pd.DataFrame()

sample_df['id'] = ['fadfsadsa', 'dsafsadf', 'dfsafsda', 'dasfasdfsaf']
sample_df['active'] = sample_active
## sample_df should look close to your original df

classes = [50,100,150,200]

player_df = pd.DataFrame({cl : np.random.uniform(0,1,size=5) for cl in classes}).T
player_df.columns = ['A','B','C','D','E'] 

sample_transformed = mlb.fit_transform(sample_active.values) ##apply multilabel binarizer

output = sample_transformed.dot(player_df.loc[mlb.classes_]) ##matrix multiply and get your required answer, use loc so the order will be similar as your binarized matrix.


new_df = pd.concat([sample_df['id'], pd.DataFrame(output)], axis = 1)
new_df.columns = ['id'] + list(player_df.columns)

For your case, I think this should work:

mlb = MultiLabelBinarizer()
active_transformed = mlb.fit_transform(uuid_df['Active'])
# The product is a plain NumPy array, so wrap it in a DataFrame before concatenating
output = pd.DataFrame(
    active_transformed.dot(points_table_df.loc[mlb.classes_].to_numpy()),
    index=uuid_df.index,
    columns=points_table_df.columns,
)
df = pd.concat([uuid_df[['UUID']], output], axis=1)

Give it a try!
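With about two million rows, the dense 0/1 matrix can take a noticeable amount of memory. One possible variation, sketched here on hypothetical toy data (the `goals` and `assists` columns and the `u1`..`u3` IDs are made up for illustration), is to ask `MultiLabelBinarizer` for sparse output and multiply that instead. This is an assumption about what might help, not something measured on the original data.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy stand-ins for `uuid_df` and `points_table_df`
uuid_df = pd.DataFrame({
    "UUID": ["u1", "u2", "u3"],
    "Active": [[100, 50], [150], [50, 100, 150]],
})
points_table_df = pd.DataFrame(
    {"goals": [1.0, 2.0, 4.0], "assists": [10.0, 20.0, 40.0]},
    index=[50, 100, 150],
)

# sparse_output=True returns a SciPy sparse matrix instead of a dense array
mlb = MultiLabelBinarizer(sparse_output=True)
active_sparse = mlb.fit_transform(uuid_df["Active"])

# Reorder the player rows to match the binarizer's column order, then multiply
players = points_table_df.loc[mlb.classes_].to_numpy()
totals = active_sparse @ players  # dense array of shape (n_rows, n_stats)

result = pd.DataFrame(
    totals,
    index=pd.Index(uuid_df["UUID"], name="UUID"),
    columns=points_table_df.columns,
)
print(result)
```

Note that `points_table_df.loc[mlb.classes_]` raises a `KeyError` if any ID in `Active` is missing from the player table, so this sketch assumes every active ID has a row there.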

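If I understand your DataFrame schema correctly, this should do it. All of these operations are vectorized, so they should be much faster than iterating over the DataFrame.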

# "explode" the values in the `Active` column to get a dataframe of (UUID, player_id) pairs
uuid_df = uuid_df.explode('Active').rename(columns={
    'Active': 'player_id'
})

# inner join with the player stats dataframe 
# (this assumes that `player_id` is the index of `points_table_df`)
joined_df = uuid_df.merge(points_table_df, left_on='player_id', right_index=True)

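# group by the UUID and sum to get aggregate stats
# (drop `player_id` first so the IDs themselves are not summed)
stats_df = joined_df.drop(columns='player_id').groupby('UUID').sum()

# `groupby('UUID')` already makes the UUID the index, so no `set_index` call is needed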
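To check that the explode/merge/groupby route gives the same totals as the row-by-row loop from the question, here is a small sketch on hypothetical toy data (the column names `goals` and `assists` and the IDs are invented for illustration). The explicit cast of `player_id` is an extra precaution added here, since `explode` leaves an object column.

```python
import pandas as pd

# Hypothetical toy stand-ins for `uuid_df` and `points_table_df`
uuid_df = pd.DataFrame({
    "UUID": ["u1", "u2", "u3"],
    "Active": [[100, 50], [150], [50, 100, 150]],
})
points_table_df = pd.DataFrame(
    {"goals": [1.0, 2.0, 4.0], "assists": [10.0, 20.0, 40.0]},
    index=[50, 100, 150],
)

# Vectorized: one row per (UUID, player_id) pair, join, then sum per UUID
pairs = uuid_df.explode("Active").rename(columns={"Active": "player_id"})
# Cast so the key matches the integer index of the player table
pairs["player_id"] = pairs["player_id"].astype("int64")
joined = pairs.merge(points_table_df, left_on="player_id", right_index=True)
vectorized = joined.drop(columns="player_id").groupby("UUID").sum()

# Reference: the row-by-row approach from the question
looped = pd.DataFrame({
    row.UUID: points_table_df.loc[points_table_df.index.isin(row.Active)].sum()
    for row in uuid_df.itertuples(index=False)
}).T
looped.index.name = "UUID"

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(vectorized.sort_index(), looped.sort_index())
print(vectorized)
```

Two edge cases differ from the loop and may matter on the real data: a row with an empty `Active` list disappears from the inner join (the loop would give a row of zeros), and an ID repeated within one `Active` list is counted once by `isin` but once per occurrence after `explode`.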
