Data mining - How to handle outliers in a time series dataset?

I have read the following article on how to handle outliers in a dataset: http://napitupulu-jon.appspot.com/posts/outliers-ud120.html

Basically, he fits a model, then removes the 10% of points whose y values deviate most from the predictions:

import numpy as np

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
    """
    # Flatten the inputs so that both (n,) and (n, 1) arrays work
    ages = np.asarray(ages).ravel()
    net_worths = np.asarray(net_worths).ravel()
    predictions = np.asarray(predictions).ravel()

    # Squared residual for each point
    errors = (net_worths - predictions) ** 2

    # Sort by error, largest first, then drop the top 10%
    cleaned_data = sorted(zip(ages, net_worths, errors),
                          key=lambda x: x[2], reverse=True)
    limit = int(len(net_worths) * 0.1)

    return cleaned_data[limit:]

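But how can I apply this technique to a time series dataset, where the rows are correlated with one another and cannot simply be dropped?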
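
To make the question concrete, here is a rough sketch of what a time-series variant might look like. It is not from the linked article, and everything in it is an assumption: the helper name `mask_outliers_rolling`, the centered rolling median used as the "prediction", the window of 11 points, and the choice to interpolate over flagged points instead of deleting rows (so that the time index stays regular). The synthetic series is only there for illustration.

```python
import numpy as np
import pandas as pd

def mask_outliers_rolling(series, window=11, drop_fraction=0.1):
    """Hypothetical helper (not from the article).

    Flags the drop_fraction of points with the largest squared
    residual from a centered rolling median, then fills them by
    linear interpolation so no timestamps are removed.
    """
    series = series.astype(float)
    # Local baseline plays the role of the regression prediction
    baseline = series.rolling(window, center=True, min_periods=1).median()
    errors = (series - baseline) ** 2

    # Same 10% rule as outlierCleaner, but keep the rows in place
    limit = int(len(series) * drop_fraction)
    flagged = errors.nlargest(limit).index

    cleaned = series.copy()
    cleaned.loc[flagged] = np.nan
    cleaned = cleaned.interpolate(limit_direction="both")
    return cleaned, flagged

# Synthetic daily series with two injected spikes (illustrative data only)
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=100, freq="D")
values = np.sin(np.linspace(0, 6, 100)) + rng.normal(0, 0.05, 100)
values[[20, 60]] += 5.0
series = pd.Series(values, index=idx)

cleaned, flagged = mask_outliers_rolling(series)
```

Like the original function, this always flags a fixed fraction of the points, so some ordinary observations get smoothed over along with the real spikes. Whether a rolling median is a sensible baseline depends on the seasonality and trend of the series.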