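I have read the following article on how to handle outliers in a dataset: http://napitupulu-jon.appspot.com/posts/outliers-ud120.html
Basically, he removes all the points whose y differs greatly from the rest (the 10% with the largest residual errors):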
def outlierCleaner(predictions, ages, net_worths):
    """
    Clean away the 10% of points that have the largest
    residual errors (difference between the prediction
    and the actual net worth).

    Return a list of tuples named cleaned_data where
    each tuple is of the form (age, net_worth, error).
    """
    # Compute the squared errors, sort them in descending order, and keep the remaining 90%.
    # x[2][0] assumes each error is a length-1 array, i.e. inputs of shape (n, 1).
    errors = (net_worths - predictions) ** 2
    cleaned_data = zip(ages, net_worths, errors)
    cleaned_data = sorted(cleaned_data, key=lambda x: x[2][0], reverse=True)
    limit = int(len(net_worths) * 0.1)
    return cleaned_data[limit:]
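For comparison, the same trimming step can be sketched in vectorized NumPy form. This is only an illustration, not the article's code: the name `trim_largest_residuals`, the `fraction` parameter, and the synthetic data are all assumptions. Unlike the function above, which returns the points sorted by error, this version keeps the surviving rows in their original order.

```python
import numpy as np

def trim_largest_residuals(predictions, ages, net_worths, fraction=0.1):
    """Drop the `fraction` of points with the largest squared residuals (hypothetical helper)."""
    # Flatten so that both (n,) and (n, 1) inputs are accepted.
    predictions = np.asarray(predictions, dtype=float).ravel()
    ages = np.asarray(ages, dtype=float).ravel()
    net_worths = np.asarray(net_worths, dtype=float).ravel()

    errors = (net_worths - predictions) ** 2
    n_drop = int(len(errors) * fraction)

    # Indices in ascending order of error; discard the last n_drop of them.
    order = np.argsort(errors, kind="stable")
    keep = np.sort(order[: len(errors) - n_drop])  # restore original row order
    return list(zip(ages[keep], net_worths[keep], errors[keep]))

# Synthetic data, made up for illustration: a linear trend plus one gross outlier.
ages = np.arange(20, 40, dtype=float)
net_worths = 6.0 * ages
net_worths[5] += 500.0  # inject an outlier at age 25

# Fit a straight line and compute its predictions.
slope, intercept = np.polyfit(ages, net_worths, 1)
predictions = slope * ages + intercept

cleaned = trim_largest_residuals(predictions, ages, net_worths)
print(len(cleaned), "of", len(ages), "points kept")
```

Keeping the original row order is a deliberate choice here, since the order of the rows carries meaning once the data is a sequence.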
However, how can I apply this technique to a time series dataset, where the rows are correlated?