So far I have been using StandardScaler() to standardize my data, but it does not work with NaNs. None of the other scalers I know of (MinMaxScaler, RobustScaler, MaxAbsScaler) work with NaNs either. Is there another way?
My search turned up one solution:
df['col'] = (df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min())
but that only works on a pandas DataFrame (which has column names). Is there a way to apply the same idea to a plain matrix (a NumPy array without column headers)?
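The closest I have come to a matrix version is the sketch below, which uses NumPy's nan-aware reductions np.nanmin / np.nanmax to scale each column while leaving NaNs in place (nan_min_max_scale is just a name I made up, not a library routine); I am not sure this is the right approach:
import numpy as np
def nan_min_max_scale(X):
    # Per-column min-max scaling; NaNs are ignored when computing the
    # column min/max and stay NaN in the result
    col_min = np.nanmin(X, axis=0)
    col_max = np.nanmax(X, axis=0)
    return (X - col_min) / (col_max - col_min)
For context, here is the full code that leads up to the error: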
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({'sepal_length': [3.4, 4.5, 3.5],
                     'sepal_width': [1.2, 1, 2],
                     'petal_length': [5.5, 4.5, 4.7],
                     'petal_width': [1.2, 1, 3],
                     'species': ['setosa', 'verginica', 'setosa']})
#Shuffle the data and reset the index
from sklearn.utils import shuffle
data = shuffle(data).reset_index(drop = True)
#Create Independent and dependent matrices
X = data.iloc[:, [0, 1, 2, 3]].values
y = data.iloc[:, 4].values
#train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1, random_state = 0)
#Insert missing values (NaN) at random
prop = int(X_train.size * 0.5)  #Number of training entries to replace with NaN
prop1 = int(X_test.size * 0.5)  #Number of test entries to replace with NaN
a = [random.choice(range(X_train.shape[0])) for _ in range(prop)]  #Random row indices for X_train
b = [random.choice(range(X_train.shape[1])) for _ in range(prop)]  #Random column indices for X_train
c = [random.choice(range(X_test.shape[0])) for _ in range(prop1)]  #Random row indices for X_test
d = [random.choice(range(X_test.shape[1])) for _ in range(prop1)]  #Random column indices for X_test
X_train[a, b] = np.nan
X_test[c, d] = np.nan
This is where I get the error: Input contains NaN, infinity or a value too large for dtype('float64').
from sklearn.preprocessing import StandardScaler  #Import the class used for feature scaling
sc_X = StandardScaler()  #Create a scaler object
X_train = sc_X.fit_transform(X_train)  #Fit on X_train and transform it
X_test = sc_X.transform(X_test)  #Transform X_test with the statistics learned from X_train
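One workaround I am considering is to standardize manually with NumPy's nan-aware statistics (np.nanmean / np.nanstd), fitting the statistics on X_train and reusing them on X_test, roughly as in the sketch below (variable names are just illustrative). Would that be a reasonable substitute for StandardScaler, or is there a proper way to do this?
# Manual, NaN-aware standardization (sketch): NaNs are ignored when the
# statistics are computed and remain NaN in the scaled output
train_mean = np.nanmean(X_train, axis=0)  #Per-column mean of the training set
train_std = np.nanstd(X_train, axis=0)    #Per-column standard deviation of the training set
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std  #Reuse the training statistics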