How do I use the SimpleImputer class in Python to replace missing values with the mean?

data-mining python scikit-learn pandas missing-data
2021-10-03 01:42:41

Here is my code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Importing Dataset
dataset = pd.read_csv('C:/Users/Rupali Singh/Desktop/ML A-Z/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/Data.csv')
print(dataset)
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values
#Missing Data

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values= np.nan, strategy='mean')
X.fit[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print(X)

My dataset:

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes

Error message:

File "C:/Users/Rupali Singh/PycharmProjects/Machine_Learning/data_preprocessing_Template.py", line 15, in <module>
    X.fit[:, 1:3] = imputer.fit_transform(X[:, 1:3])
AttributeError: 'numpy.ndarray' object has no attribute 'fit'
4 Answers

Your error comes from calling fit on the NumPy array itself; fit and fit_transform are methods of the imputer, not of the array. Here is how I use an imputer on a DataFrame:

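# The older Imputer class was removed in scikit-learn 0.22; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer
imr = SimpleImputer(strategy='median')  # missing_values defaults to np.nan
imr = imr.fit(data[['age']])
data['age'] = imr.transform(data[['age']]).ravel()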
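For readers who want to try the DataFrame approach without the CSV file, here is a minimal sketch. The inline DataFrame `df` is an assumption: it copies the Age and Salary columns from the question's dataset, and it uses the median strategy as in this answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Inline copy of the Age and Salary columns from the question (an assumption,
# standing in for the CSV file that is not available here).
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# fit_transform returns a 2-D NumPy array with the same column order,
# so it can be assigned straight back to the selected columns.
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

print(df)
```

Selecting the columns by name keeps the text columns out of the imputer, which only accepts numeric input for the mean and median strategies.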

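X.fit[...] = imputer.fit_transform(...) is wrong. fit() is a method of the imputer, not of X, so you cannot assign to X.fit. A NumPy array has no fit attribute at all, which is exactly what the error says.

Use X[:, 1:3] = imputer.fit_transform(X[:, 1:3]) instead.

Hope this helps!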
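The corrected assignment can be sketched end to end as follows. The inline array `X` is an assumption that stands in for `dataset.iloc[:, :-1].values` from the question, since the CSV file is not available here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Stand-in for dataset.iloc[:, :-1].values: an object array mixing text and numbers.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
    ["Germany", 40.0, np.nan],
    ["France", 35.0, 58000.0],
    ["Spain", np.nan, 52000.0],
    ["France", 48.0, 79000.0],
    ["Germany", 50.0, 83000.0],
    ["France", 37.0, 67000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Index the array itself (X[:, 1:3]), not a non-existent X.fit attribute.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

print(X)
```

Only the numeric columns at indices 1 and 2 are passed to the imputer; the Country column is left untouched.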

SimpleImputer also works fine when fit and transform are called separately.

from sklearn.impute import SimpleImputer 
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])

This gives the result:

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
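To see where the filled-in values 38.78 and 63777.78 come from, the sketch below compares the statistics learned by the imputer with the column means computed while ignoring NaN. The inline `age` and `salary` lists are an assumption copied from the question's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary values copied from the question's dataset.
age = [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37]
salary = [72000, 48000, 54000, 61000, np.nan,
          58000, 52000, 79000, 83000, 67000]
cols = np.column_stack([age, salary])

imputer = SimpleImputer(strategy="mean").fit(cols)

# statistics_ holds the per-column fill value learned during fit.
print(imputer.statistics_)

# The same numbers, computed directly as NaN-ignoring column means.
print(np.nanmean(cols, axis=0))
```

Each mean is taken over the nine observed values in its column: 349 / 9 for Age and 574000 / 9 for Salary.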

# Taking care of missing data

from sklearn.impute import SimpleImputer 
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])

Result:

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

You can also simplify the code to:

from sklearn.impute import SimpleImputer 
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(X[:,1:])
X[:,1:]=imputer.transform(X[:,1:])

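X[:, 1:] selects every column from index 1 (the second column) through the last one. Because X here has only three columns, this covers the same Age and Salary columns as X[:, 1:3].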
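A small sketch of why the two slices are interchangeable here. The two-row array `X` is a made-up stand-in with the same three-column layout as in the question.

```python
import numpy as np

# Hypothetical two-row array with the question's layout: Country, Age, Salary.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", np.nan, 52000.0],
], dtype=object)

open_ended = X[:, 1:]   # from column 1 to the end
explicit = X[:, 1:3]    # columns 1 and 2 only

# Both slices pick out the same two columns when X has three columns.
print(open_ended.shape, explicit.shape)
```

The open-ended slice is only safe while every column after index 1 is numeric. If X gained a text column at the end, the mean strategy would fail on it, and the explicit 1:3 slice would be the better choice.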