How can we standardize an array that contains NaN values?

data-mining machine-learning python scikit-learn normalization
2021-09-16 06:23:14

So far I have used StandardScaler() to standardize my data, but that does not work with NaNs. None of the other methods I know of (MinMaxScaler, RobustScaler, MaxAbsScaler) work with NaNs either. Is there any other way?

My search turned up one solution:

df['col']=(df['col']-df['col'].min())/(df['col'].max()-df['col'].min())

But that only works with pandas DataFrames (which have column names). Is there a way to achieve the same thing on a plain matrix, without column headers?
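For a plain array I imagine something like the sketch below (np.nanmin / np.nanmax ignore NaNs; the example array is made up), but I am not sure whether this is the recommended approach:

import numpy as np

X = np.array([[3.4, 1.2, 5.5],
              [np.nan, 1.0, 4.5],
              [3.5, 2.0, np.nan]])

# column-wise min-max scaling; nanmin/nanmax skip the NaNs,
# and the NaN entries simply stay NaN in the result
X_scaled = (X - np.nanmin(X, axis=0)) / (np.nanmax(X, axis=0) - np.nanmin(X, axis=0))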

import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'sepal_length': [3.4, 4.5, 3.5],
                     'sepal_width': [1.2, 1, 2],
                     'petal_length': [5.5, 4.5, 4.7],
                     'petal_width': [1.2, 1, 3],
                     'species': ['setosa', 'verginica', 'setosa']})

#Shuffle the data and reset the index
from sklearn.utils import shuffle
data = shuffle(data).reset_index(drop = True)  

#Create Independent and dependent matrices
X = data.iloc[:, [0, 1, 2, 3]].values 
y = data.iloc[:, 4].values

#train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1, random_state = 0)


#Insert missing values at random
prop = int(X_train.size * 0.5) #Number of entries to replace (50% of the train set)
prop1 = int(X_test.size * 0.5)

a = [random.choice(range(X_train.shape[0])) for _ in range(prop)] #Randomly choose indices of the numpy array
b = [random.choice(range(X_train.shape[1])) for _ in range(prop)]
c = [random.choice(range(X_test.shape[0])) for _ in range(prop1)] #Randomly choose indices of the numpy array
d = [random.choice(range(X_test.shape[1])) for _ in range(prop1)]
X_train[a, b] = np.nan
X_test[c, d] = np.nan

This is where I get the error: Input contains NaN, infinity or a value too large for dtype('float64').

from sklearn.preprocessing import StandardScaler # import the library used for feature scaling

sc_X = StandardScaler() # created an object with the scaling class

X_train = sc_X.fit_transform(X_train)  # Here we fit and transform the X_train matrix
X_test = sc_X.transform(X_test)
4 Answers

This is no longer the case; as of sklearn 0.20.0, missing values are ignored by these preprocessors in fit and silently passed through in transform:
https://scikit-learn.org/stable/whats_new/v0.20.html#id37 (fourth bullet point)
https://github.com/scikit-learn/scikit-learn/issues/10404
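A minimal sketch of this behaviour, assuming scikit-learn >= 0.20 (the statistics are computed from the non-NaN entries, and NaNs are passed through unchanged):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [3.0, 5.0]])

sc = StandardScaler()
Xz = sc.fit_transform(X)  # no error is raised; the NaN stays NaN in Xz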

Standardization (subtracting the mean and dividing by the standard deviation, column by column) can be done with numpy:

Xz = (X - np.nanmean(X, axis=0))/np.nanstd(X, axis=0) 

where X is a matrix (containing NaNs) and Xz is the standardized version of X. Hope this helps.

Edit:

For a train/test scenario, the mean and standard deviation can be stored in their own variables:

m         = np.nanmean(X_train, axis=0)
s         = np.nanstd(X_train, axis=0)
X_train_z = (X_train - m)/s 
X_test_z  = (X_test - m)/s
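As a quick sanity check (with a small made-up X_train containing a NaN), the nan-aware column means of X_train_z come out close to 0 and the standard deviations close to 1:

import numpy as np

X_train = np.array([[3.4, np.nan],
                    [4.5, 1.0],
                    [3.5, 2.0]])

m = np.nanmean(X_train, axis=0)
s = np.nanstd(X_train, axis=0)
X_train_z = (X_train - m) / s

print(np.nanmean(X_train_z, axis=0))  # approximately [0. 0.]
print(np.nanstd(X_train_z, axis=0))   # approximately [1. 1.]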

You can use sklearn.preprocessing.Imputer.

Demo:

import numpy as np
from sklearn import datasets as ds
from sklearn.model_selection import train_test_split

# load Iris data set    
data = ds.load_iris()

X = data.data
y = data.target

# artificially set a number of entries in X (33% of the row count) to NaN
X.ravel()[np.random.choice(X.size, int(X.shape[0]*.33), replace=False)] = np.nan

Output:

In [137]: X
Out[137]:
array([[5.1, 3.5, nan, nan],
       [nan, 3. , 1.4, 0.2],
       [4.7, nan, 1.3, 0.2],
       ...,
       [6.5, 3. , nan, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , nan, 1.8]])

Now we can impute and standardize it:

from sklearn.preprocessing import Imputer, StandardScaler

imp = Imputer(strategy="mean", axis=0)
scale = StandardScaler()

In [139]: X_new = scale.fit_transform(imp.fit_transform(X))

Result:

In [160]: X_new
Out[160]:
array([[-1.03733263e+00,  1.22587069e+00, -1.37398311e-15, -3.17837019e-16],
       [ 1.18191646e-15, -5.32987255e-02, -1.43399195e+00, -1.35522269e+00],
       [-1.56962048e+00,  2.27226133e-15, -1.49587065e+00, -1.35522269e+00],
       ...,
       [ 8.25674859e-01, -5.32987255e-02, -1.37398311e-15,  1.22131653e+00],
       [ 4.26458969e-01,  9.70036804e-01,  1.04115598e+00,  1.65073974e+00],
       [ 2.72430791e-02, -5.32987255e-02, -1.37398311e-15,  9.35034396e-01]])

Demo2, using a Pipeline:

from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

#...    

estimator = Pipeline([("impute", Imputer(strategy="mean", axis=0)),
                      ("scale", StandardScaler()),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])

estimator.fit(X_train, y_train)
#...    
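Note that in newer scikit-learn releases (0.20 and later) Imputer has been replaced by sklearn.impute.SimpleImputer; a roughly equivalent pipeline, sketched under that assumption (SimpleImputer always works column-wise, so there is no axis argument), would be:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

estimator = Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler()),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])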

Working with NaNs is always a bit difficult. It can be useful to enrich the NaN values, for example by averaging the feature in question over groups such as age classes. If only a few records have NaN values, you can simply drop them (pandas dropna).
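A small sketch of both ideas with pandas (the column names age_class and income are purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age_class': ['20-30', '20-30', '30-40', '30-40'],
                   'income': [2000.0, np.nan, 3500.0, 4000.0]})

# enrich: fill each NaN with the mean of the group the row belongs to
df['income'] = df['income'].fillna(df.groupby('age_class')['income'].transform('mean'))

# or, if only a few rows are affected, simply drop them
df = df.dropna(subset=['income'])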