如何计算熊猫(时间)系列中序列小于特定持续时间的次数

数据挖掘 时间序列 熊猫
2021-10-08 03:20:54

给定一个熊猫系列,其中一个值为 {0, +1, -1} 并且序列始终以 +1 开头并以 -1 结尾。

例如,

2019-01-02 15:19:00    0.0
2019-01-02 15:20:00    0.0
2019-01-02 15:21:00    1.0       # sequence start
2019-01-02 15:22:00    0.0       # sequence continue, it's only a 0
2019-01-02 15:23:00   -1.0       # sequence finish, let's look for the next +1 value 
2019-01-02 15:24:00    0.0
2019-01-02 15:25:00    1.0       # sequence start
2019-01-02 15:26:00   -1.0       # sequence finish 
2019-01-02 15:27:00    0.0
2019-01-02 15:28:00    0.0
2019-01-02 15:29:00    1.0       # sequence start
2019-01-02 15:30:00   -1.0       # sequence finish 

如何计算“序列开始”和“序列结束”的间隔小于 5 分钟的序列总数。

例如,

2019-01-02 15:19:00    0.0
2019-01-02 15:20:00    0.0
2019-01-02 15:21:00    1.0       # sequence start
2019-01-02 15:22:00    0.0
2019-01-02 15:23:00    0.0
2019-01-02 15:24:00    0.0
2019-01-02 15:25:00   -1.0       # sequence finish (sequence start to sequence finish is less than 5 minutes, so increment counter) 
2019-01-02 15:26:00    1.0       # sequence start
2019-01-02 15:27:00    0.0
2019-01-02 15:28:00    0.0
2019-01-02 15:29:00    0.0
2019-01-02 15:30:00    0.0
2019-01-02 15:31:00    0.0
2019-01-02 15:32:00    0.0
2019-01-02 15:33:00    0.0
2019-01-02 15:34:00    0.0
2019-01-02 15:35:00    0.0
2019-01-02 15:36:00    0.0
2019-01-02 15:37:00    0.0
2019-01-02 15:38:00    0.0
2019-01-02 15:39:00    0.0
2019-01-02 15:40:00   -1.0       # sequence finish (longer than 5 minutes, don't increment) 

从上面的例子中,我们得到一个总数:

counter: 1
2个回答

这是我的想法:

# dataframe example
b = [1,0,0,0,0,0,0-1,1,0,0,-1,0,0,0,0,0,0,1,0,0,-1]
minutes = ['2019-01-02 15:' + str(x) + ':00' for x in range(len(b))]
df = pd.DataFrame({'a': minutes, 'b': b})

# keep only rows which are equal to 1 or -1
df_tmp = df[df['b'] != 0]

# convert date column to datetime if needed
df_tmp['a'] = pd.to_datetime(df['a'])

# create a new column which is the time difference between each row
df_tmp['diff'] = df_tmp['a'].diff().astype('timedelta64[m]')

# keep only the rows equal to -1
df_tmp = df_tmp[df_tmp['b'] == -1]

# count the values which are smaller or equal to 5
(df_tmp['diff'] <= 5).sum()

使用 pandas merge_asof 可以在这里工作:

my_df[['date_hour', 'ind']].head(5)
    date_hour   ind
0   2019-01-02 15:19:00 0
1   2019-01-02 15:20:00 0
2   2019-01-02 15:21:00 1
3   2019-01-02 15:22:00 0
4   2019-01-02 15:23:00 0

starts = my_df[my_df['ind'] == 1][['date_hour', 'ind']]
ends = my_df[my_df['ind'] == -1][['date_hour', 'ind']]
ends['end_date_hour'] = ends['date_hour']
##merging ends to starts by closest date hour
merged_df = pd.merge_asof(starts, ends,on='date_hour', direction='forward',
         suffixes=('_start', '_end'), tolerance=pd.Timedelta(10000, 's'))
##calculating time diff for matching starts ends
merged_df['time_diff'] = merged_df['end_date_hour'] - merged_df['date_hour']
##get total of segments that fit the 5 minutes condition
print(sum((merged_df['time_diff'].dt.seconds / 60) < 5))

请注意,公差参数的 10000 集只需要足够大即可。