How to count, row by row, the occurrences of values within a specific range

data-mining  python  pandas

2022-01-29 15:14:21

I have a DataFrame with 3000 rows and 101 columns, as shown below:

Time   id0   id1   id2   ...   id99
1      1.71  6.99  4.01  ...   4.98
2      1.72  6.78  3.15  ...   4.97
...
3000   0.36  0.23  0.14  ...   0.28

Using Python, how can I add a column that counts, for each row, the number of values (in columns id0 to id99) that fall within a specific range?

2 Answers

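You can use the apply method to apply a function to each row of the DataFrame. Inside the applied function, first convert the row into a boolean array, using either the between method or the standard relational operators, and then count the True values in that array with the sum method.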

import pandas as pd

df = pd.DataFrame({
    'id0': [1.71, 1.72, 1.72, 1.23, 1.71],
    'id1': [6.99, 6.78, 6.01, 8.78, 6.43],
    'id2': [3.11, 3.11, 4.99, 0.11, 2.88]})


def count_values_in_range(series, range_min, range_max):

    # "between" returns a boolean Series equivalent to left <= series <= right.
    # NA values will be treated as False.
    return series.between(left=range_min, right=range_max).sum()

    # Alternative approach:
    # return ((range_min <= series) & (series <= range_max)).sum()


range_min, range_max = 1.72, 6.43

df["n_values_in_range"] = df.apply(
    func=lambda row: count_values_in_range(row, range_min, range_max), axis=1)

print(df)

Resulting DataFrame:

    id0   id1   id2  n_values_in_range
0  1.71  6.99  3.11                  1
1  1.72  6.78  3.11                  2
2  1.72  6.01  4.99                  3
3  1.23  8.78  0.11                  0
4  1.71  6.43  2.88                  2
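
For a frame as large as the one in the question, the same count can also be computed without apply, by comparing the whole block of columns at once. The following is only a minimal sketch under a few assumptions: the Time column and the id0 to id2 columns stand in for the question's layout, and count_in_range with its prefix parameter is a hypothetical helper introduced here, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": [1, 2, 3, 4, 5],
    "id0": [1.71, 1.72, 1.72, 1.23, 1.71],
    "id1": [6.99, 6.78, 6.01, 8.78, 6.43],
    "id2": [3.11, 3.11, 4.99, 0.11, 2.88],
})


def count_in_range(frame, range_min, range_max, prefix="id"):
    """Hypothetical helper: per-row count of values in [range_min, range_max]."""
    # Keep only the id* columns so that Time is not counted by accident.
    id_cols = [c for c in frame.columns if str(c).startswith(prefix)]
    values = frame[id_cols]
    # Element-wise inclusive comparison; NaN compares as False on both sides.
    in_range = (values >= range_min) & (values <= range_max)
    # Summing booleans along axis=1 counts the True values in each row.
    return in_range.sum(axis=1)


df["n_values_in_range"] = count_in_range(df, 1.72, 6.43)
print(df)
```

Selecting the id columns explicitly matters here: with apply over the full frame, a Time value that happens to fall inside the range would be counted as well.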

IIUC, you can use the DataFrame.isin() method:

Data:

In [41]: given_set = {3,8,11,18,22,24,35,36,42,47}

In [42]: df
Out[42]:
    a   b   c   d   e
0  36  38  27  12  35
1  45  33   8  41  18
4  32  14   4  14   9
5  43   1  31  11   3
6  16   8   3  17  39

Solution:

In [44]: df['new'] = df.isin(given_set).sum(1)

In [45]: df
Out[45]:
    a   b   c   d   e  new
0  36  38  27  12  35    2
1  45  33   8  41  18    2
4  32  14   4  14   9    0
5  43   1  31  11   3    2
6  16   8   3  17  39    2

Explanation:

In [49]: df.isin(given_set)
Out[49]:
       a      b      c      d      e
0   True  False  False  False   True
1  False  False   True  False   True
4  False  False  False  False  False
5  False  False  False   True   True
6  False   True   True  False  False

In [50]: df.isin(given_set).sum(1)
Out[50]:
0    2
1    2
4    0
5    2
6    2
dtype: int64

If you only want to check whether any value is present, rather than count the matches, you can do this:

In [6]: df.isin(given_set).any(axis=1)
Out[6]:
0     True
1     True
4    False
5     True
6     True
dtype: bool

In [7]: df.isin(given_set).any(axis=1).astype(np.uint8)
Out[7]:
0    1
1    1
4    0
5    1
6    1
dtype: uint8
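
The session above can be reproduced as a standalone script. This is a sketch rather than the answerer's exact code: the frame is rebuilt from the printed output, the value_cols list and the has_match column name are assumptions introduced here, and axis is passed by keyword, which is accepted across pandas versions.

```python
import numpy as np
import pandas as pd

given_set = {3, 8, 11, 18, 22, 24, 35, 36, 42, 47}

# Frame rebuilt from the printed output above, including its index labels.
df = pd.DataFrame(
    {
        "a": [36, 45, 32, 43, 16],
        "b": [38, 33, 14, 1, 8],
        "c": [27, 8, 4, 31, 3],
        "d": [12, 41, 14, 11, 17],
        "e": [35, 18, 9, 3, 39],
    },
    index=[0, 1, 4, 5, 6],
)

# Fix the columns to test up front, so that columns added later
# (such as "new") are not themselves checked against the set.
value_cols = ["a", "b", "c", "d", "e"]
mask = df[value_cols].isin(list(given_set))

df["new"] = mask.sum(axis=1)                          # number of matches per row
df["has_match"] = mask.any(axis=1).astype(np.uint8)   # 1 if the row has any match
print(df)
```

Computing the mask once and reusing it for both columns avoids running isin twice, and keeps the count and the flag consistent with each other.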
