Pandas merge question

data-mining pandas

2022-03-02 08:49:53

I have a dataset in which each date should have two values, say A and B, but some dates are missing one of them.

Date       C1   C2
1/1/2019    A   10
1/1/2019    B   20
1/2/2019    A   25
1/2/2019    B   30
1/3/2019    A   23
1/4/2019    A   32

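I want to make sure that every date has a row for both A and B. One way I can think of is to build a driver table that contains every distinct date paired with both A and B, and then left join my data onto it. Is there a better way to do this with pandas, either by merging or by iterating over the rows? I can't think of one, but I feel there might be a better approach. Thanks!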

1 Answer

I think you have already listed the main ways to do it: you can either iterate or merge. Which one is "best" depends on your use case.
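For comparison, the merge route from the question could look roughly like the sketch below. It is only an illustration under some assumptions: the sample table is retyped by hand, the dates are read as month/day/year, and the names `data`, `driver` and `complete` are made up for this example.

```python
import pandas as pd

# The question's sample data, retyped by hand (dates assumed month/day/year).
data = pd.DataFrame({
    "Date": pd.to_datetime(
        ["1/1/2019", "1/1/2019", "1/2/2019", "1/2/2019", "1/3/2019", "1/4/2019"],
        format="%m/%d/%Y",
    ),
    "C1": ["A", "B", "A", "B", "A", "A"],
    "C2": [10, 20, 25, 30, 23, 32],
})

# Driver table: every distinct date paired with both labels.
driver = pd.MultiIndex.from_product(
    [data["Date"].unique(), ["A", "B"]], names=["Date", "C1"]
).to_frame(index=False)

# The left join keeps every (Date, C1) pair from the driver table;
# pairs that are absent from the data end up with NaN in C2.
complete = driver.merge(data, on=["Date", "C1"], how="left")
print(complete)
```

The rows that come back with NaN in C2 are exactly the missing (Date, C1) pairs, so they can be filled afterwards with whatever rule suits the data.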

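Here is one way to do it by iterating over the data frame. This gives you more control over what gets filled in, because you can add further conditions to decide the fill value. I first create a new data frame that contains every date twice, with alternating A, B, A, B values in column C1: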

import pandas as pd
import numpy as np

dates = pd.date_range(start="1/1/2019", end="1/10/2019")
repeated_dates = np.repeat(dates, 2)
df = pd.DataFrame(index=repeated_dates, columns=["C1", "C2"])
df["C1"] = (len(df) // 2) * ["A", "B"]

# See first 5 rows
print(df.head())

           C1   C2
2019-01-01  A  NaN
2019-01-01  B  NaN
2019-01-02  A  NaN
2019-01-02  B  NaN
2019-01-03  A  NaN

We will fill in the values of the C2 column later.

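Next, make a data frame that stands in for your starting data by dropping a few rows from the "result" data frame above. Note that because each date appears twice in the index, dropping by label removes both rows for the selected dates (2019-01-02 and 2019-01-05):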

df_missing = df.drop(df.index[[3, 9]])

C2_col = []

# grouping by the index (i.e. the date) gives us two rows at a time
for date, group in df.groupby(df.index):
    try:
        # See which values your data has for this date and extract them.
        # Unpacking into two names assumes both rows (A and B) are present.
        day = df_missing.loc[date, ["C1", "C2"]]
        C2_A, C2_B = day.C2.values

    # If the date wasn't there, we can catch the error and give any values we want
    except KeyError:
        # Could now use more conditions, e.g. on the date or previous values
        C2_A = C2_B = "was_missing"

    # Keep the values in a list
    C2_col.extend([C2_A, C2_B])

# Overwrite the column that was full of NaN values
df["C2"] = C2_col

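In the final result we can see that all dates and the A, B, A, B pattern are present, and that we can insert whatever we like for the dates whose values were missing: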

print(df)

           C1           C2
2019-01-01  A          NaN
2019-01-01  B          NaN
2019-01-02  A  was_missing
2019-01-02  B  was_missing
2019-01-03  A          NaN
2019-01-03  B          NaN
2019-01-04  A          NaN
2019-01-04  B          NaN
2019-01-05  A  was_missing
2019-01-05  B  was_missing
2019-01-06  A          NaN
2019-01-06  B          NaN
2019-01-07  A          NaN
2019-01-07  B          NaN
2019-01-08  A          NaN
2019-01-08  B          NaN
2019-01-09  A          NaN
2019-01-09  B          NaN
2019-01-10  A          NaN
2019-01-10  B          NaN
