Best anomaly detection algorithm based on two conditions
Data mining
Machine learning
Python
Anomaly detection
2021-09-23 03:05:27
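1 answer
Since your data is one-dimensional and numeric, I don't think you need any fancy clustering technique. Clustering is useful when your data points have multiple attributes. With only one attribute, all you need is a good definition of "anomaly".
For example, suppose you consider an anomaly to be any point that lies more than two standard deviations from the mean. Such points are easy to find with pandas: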
```python
import numpy as np
import pandas as pd
from scipy import stats

threshold = 2.0  # in standard deviations
input_file = "path/to/my/file.csv"
target_unit = "meters"

# read the file into a pandas DataFrame
df = pd.read_csv(input_file)

# filter to only include the target unit
# (.copy() so the new column below is added to an independent frame)
df = df[df['Units'] == target_unit].copy()

# compute absolute z-scores for the `values` column
df['values_z'] = np.absolute(stats.zscore(df['values'].values))

# threshold z-score to identify anomalies
is_anomaly = df['values_z'] > threshold
anomalies = df[is_anomaly]

# `anomalies` is now a DataFrame that contains the anomalous points
# you can consume this however seems appropriate
# for example, you can write the anomalies to a separate file:
anomalies.to_csv('path/to/output.csv')
```
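To try the idea without a CSV file, here is a minimal self-contained sketch of the same z-score rule on a small in-memory DataFrame. The helper name `find_anomalies` and the sample numbers are made up for illustration; only the `Units` and `values` column names come from the snippet above. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd


def find_anomalies(df, unit, threshold=2.0):
    """Return the rows for `unit` whose value lies more than
    `threshold` standard deviations from that unit's mean."""
    # first condition: keep only the rows measured in the requested unit
    subset = df[df["Units"] == unit].copy()
    mean = subset["values"].mean()
    std = subset["values"].std(ddof=0)  # population std, as in scipy.stats.zscore
    # second condition: absolute z-score above the threshold
    subset["values_z"] = (subset["values"] - mean).abs() / std
    return subset[subset["values_z"] > threshold]


# hypothetical sample data: eight readings in meters, two in feet
sample = pd.DataFrame({
    "Units": ["meters"] * 8 + ["feet"] * 2,
    "values": [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0, 3.0, 4.0],
})

anomalies = find_anomalies(sample, "meters")
print(anomalies)
```

With this sample, only the 50.0 reading in meters is flagged. Note that on very small groups a z-score threshold can never fire: with the population standard deviation, the largest possible absolute z-score among n points is sqrt(n - 1), so a threshold of 2.0 needs at least six points per unit.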
