How can pandas fill missing values in one column when the values in another column are equal?

data-mining python data-cleaning pandas
2021-10-11 05:14:33

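I have a DataFrame in which I need to fill the missing values in one column (paid_date) using values from rows that have the same value in a different column (id). It is guaranteed that each id has at most one non-null value in the paid_date column, and that the non-null value always comes before the nulls. For example: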

index     id       paid_date           
6         25220    2017-01-05 00:00:00
9         30847    None               
11        30847    None               
14        29369    2017-06-21 00:00:00
17        31232    2017-08-31 00:00:00
20        26196    2017-02-20 00:00:00
21        26196    None               
24        28303    2017-05-09 00:00:00
25        28303    None               

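How can I replace the None values in the paid_date column with the paid_date from a row with a matching id, if such a row exists? The desired result is: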

index     id       paid_date          
6         25220    2017-01-05 00:00:00
9         30847    None               
11        30847    None               
14        29369    2017-06-21 00:00:00
17        31232    2017-08-31 00:00:00
20        26196    2017-02-20 00:00:00
21        26196    2017-02-20 00:00:00
24        28303    2017-05-09 00:00:00
25        28303    2017-05-09 00:00:00

I tried using fillna with a dictionary that maps ids to paid_dates, and I tried using pd.Series.map, but neither worked.

paid_dates = df[pd.notnull(df['paid_date'])]
pds = pd.Series(data=paid_dates['paid_date'].values, index=paid_dates['id'])
pds_dict = pds.to_dict()

# doesn't work
df['paid_date'].fillna(value=pds_dict)

# also doesn't work
df['paid_date'].map(pds_dict)

1 Answer

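"It is guaranteed that each id has at most one non-null value in the paid_date column, and that the non-null value always comes before the nulls."

Given that guarantee, a forward fill within each id group is enough: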

In [117]: df['paid_date'] = pd.to_datetime(df['paid_date'], errors='coerce')

In [118]: df
Out[118]:
   index     id  paid_date
0      6  25220 2017-01-05
1      9  30847        NaT
2     11  30847        NaT
3     14  29369 2017-06-21
4     17  31232 2017-08-31
5     20  26196 2017-02-20
6     21  26196        NaT
7     24  28303 2017-05-09
8     25  28303        NaT

In [119]: df.groupby('id').ffill()
Out[119]:
   index     id  paid_date
0      6  25220 2017-01-05
1      9  30847        NaT
2     11  30847        NaT
3     14  29369 2017-06-21
4     17  31232 2017-08-31
5     20  26196 2017-02-20
6     21  26196 2017-02-20
7     24  28303 2017-05-09
8     25  28303 2017-05-09
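
For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same grouped forward fill. It rebuilds the sample data from the question (the variable name `filled` is my own) and assigns only the paid_date column. This is an assumption-driven choice: depending on the pandas version, calling ffill() on the whole grouped frame may leave the grouping column out of the result, whereas filling a single column avoids the issue.

```python
import pandas as pd

# Sample data copied from the question; the original row labels are kept as the index.
df = pd.DataFrame(
    {
        "id": [25220, 30847, 30847, 29369, 31232, 26196, 26196, 28303, 28303],
        "paid_date": ["2017-01-05", None, None, "2017-06-21", "2017-08-31",
                      "2017-02-20", None, "2017-05-09", None],
    },
    index=[6, 9, 11, 14, 17, 20, 21, 24, 25],
)
df["paid_date"] = pd.to_datetime(df["paid_date"], errors="coerce")

# Forward fill within each id group, touching only the paid_date column.
# This relies on the non-null value coming before the nulls for each id.
filled = df.copy()
filled["paid_date"] = df.groupby("id")["paid_date"].ffill()

print(filled)
```

Rows with id 30847 stay NaT because no row with that id carries a date.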

If that is not guaranteed, we can do this instead:

In [111]: df['paid_date'] = pd.to_datetime(df['paid_date'], errors='coerce')

In [112]: df
Out[112]:
   index     id  paid_date
0      6  25220 2017-01-05
1      9  30847        NaT
2     11  30847        NaT
3     14  29369 2017-06-21
4     17  31232 2017-08-31
5     20  26196 2017-02-20
6     21  26196        NaT
7     24  28303 2017-05-09
8     25  28303        NaT

In [113]: df.loc[df.paid_date.isnull(), 'paid_date'] = \
              df.loc[df.paid_date.isnull(), 'id'].map(df.loc[df.paid_date.notnull()] \
                .set_index('id')['paid_date'])

In [114]: df
Out[114]:
   index     id  paid_date
0      6  25220 2017-01-05
1      9  30847        NaT
2     11  30847        NaT
3     14  29369 2017-06-21
4     17  31232 2017-08-31
5     20  26196 2017-02-20
6     21  26196 2017-02-20
7     24  28303 2017-05-09
8     25  28303 2017-05-09
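
As a side note on why the attempts in the question did not work: for a Series, fillna with a dict treats the keys as index labels rather than id values, and df['paid_date'].map(pds_dict) looks up the dates themselves in a dictionary keyed by id. The lookup has to start from the id column. Below is a minimal sketch of that idea, combined with fillna. The names `known` and `result` are my own, and the drop_duplicates step is an added assumption to keep the lookup index unique in case an id ever carries more than one date.

```python
import pandas as pd

# Sample data copied from the question; the original row labels are kept as the index.
df = pd.DataFrame(
    {
        "id": [25220, 30847, 30847, 29369, 31232, 26196, 26196, 28303, 28303],
        "paid_date": ["2017-01-05", None, None, "2017-06-21", "2017-08-31",
                      "2017-02-20", None, "2017-05-09", None],
    },
    index=[6, 9, 11, 14, 17, 20, 21, 24, 25],
)
df["paid_date"] = pd.to_datetime(df["paid_date"], errors="coerce")

# Lookup Series: id -> known paid_date.
# drop_duplicates keeps the first date if an id has several (an assumption, not in the answer).
known = (
    df.dropna(subset=["paid_date"])
      .drop_duplicates(subset="id")
      .set_index("id")["paid_date"]
)

# Map each row's id to its known date, and use it only where paid_date is missing.
result = df.copy()
result["paid_date"] = df["paid_date"].fillna(df["id"].map(known))

print(result)
```

Unlike the forward fill, this does not depend on row order, so a date that appears after the null rows for the same id is still picked up.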