数据挖掘 - 基于两个条件的最佳异常检测算法 - 吾爱随笔录

基于两个条件的最佳异常检测算法

数据挖掘机器学习 Python 异常检测

2021-09-23 03:05:27

目前，选择正确的异常检测算法似乎相当困难。可能是因为我被很多替代方案轰炸了，比如聚类、K-Means DBSCAN 等等。

在我这边，我有一个包含数千行的 csv 文件，列的标题显示文件的名称和功能。

我是我的情况，这就是文件的样子，请记住文件要大得多。

值列是我要检查异常的列，在这种情况下，我会得到很多异常，因为这些数字属于不同的单位-

所以首先我需要过滤Unit让我们说米，然后检查值列数据是否有任何异常

我将不胜感激有关解决此问题的最佳方法的一些建议

1个回答

由于您的数据是一维和数字的，我认为您不需要任何花哨的聚类技术。当您的数据点具有多个属性时，聚类很有用。当只有一个属性时，您所需要的只是对“异常”的良好定义。

例如，假设您认为异常是与平均值相差超过两个标准差的任何点。使用 pandas 很容易找到这样的点：

import numpy as np
import pandas as pd
from scipy import stats

threshold = 2.0  # in standard deviations
input_file = "path/to/my/file.csv"
target_unit = "meters"

# read the file into a pandas DataFrame
df = pandas.read_csv(input_file)
# filter to only include the target unit
df = df[df['Units']==target_unit]
# compute z-scores for the `values` columns
df['values_z'] = np.absolute(stats.zscore(df['values'].values))
# threshold z-score to identify anomalies
is_anomaly = df['values_z'] > threshold
anomalies = df[is_anomaly]

# `anomalies` is now a DataFrame that contains the anomalous points
# you can consume this however seems appropriate
# for example, you can write the anomalies to a separate file:
anomalies.to_csv('path/to/output.csv')

```

其它你可能感兴趣的问题

上一篇使随机森林估计器与决策树完全相同下一篇寻找异常值的机器学习方法