Data mining - Data science Python data cleaning - 吾爱随笔录


Tags: data mining, Python, data cleaning, data

2022-03-14 05:11:28

I am preparing a dataset for a model, but for some reason the code does not run properly.

The main error is:

File "/Users/liangjulia/Desktop/UW DS Certificate Learning Material/untitled6.py", line 61
  'income2' = pd.to_numeric(Adult.income, errors='coerce')
                                                        ^
SyntaxError: can't assign to literal

Code:

# import statement
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
# loading dataset, it is a combination of categorical and numerical data
hp = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",header=0,sep=',')
hp.columns = ['age','workclass','income','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','salary-range'] 
# dataset basics
hp.head()
hp.shape
hp.dtypes

# account for all value '?'
hp.replace('?','na')
hp.isnull().sum()

#Remove onsolete data point in income
hp('income').dropna()

# replace all aberrant values
hp.replace('nan', 0)
hp.replace('NULL', 0)

# change data type of certain data point to numerical
number = LabelEncoder()
hp['income'] = number.fit_transform(hp['income'.astype('str')])
hp['capital-gain'] = number.fit_transform(hp['capital-gain'.astype('str')])
hp['capital-loss'] = number.fit_transform(hp['capital-loss'.astype('str')])

# Choose the datapoint 'income' to perform the data cleaning and remove outliers
LimitHi=np.mean('income') + 2*np.std('income')
LimitLo=np.mean('income') + 2*np.std('income')
BadIncome = ('income' > LimitHi) & ('income' < LimitLo)

# Replace outliars
RightIncome = ~BadIncome
x[BadIncome] = np.mean(x[RightIncome])

# normalize the Income Column using numpy
#'income2' = pd.to_numeric(Adult.income, errors='coerce')
minmaxscaled =('income' - min('income'))/(max('income') - min('income'))

# bin age data into several ranges
hp['bin'] = pd.cut(hp['age'], [15,30,45,60,75,90])

# construct new categorical data point with existing data point 
hp['EvalonInvestment'] = 'zzz'
hp.loc[(hp['capital-gain'] >= 50000), 'loc2'] = 'investmentking'
hp.loc[(hp['capital-gain'] > 10000) & (hp['capital-gain'] < 50000), 'loc2'] = 'good-investment'
hp.loc[(hp['capital-gain'] > 0) & (hp['capital-gain'] <= 10000), 'loc2'] = 'ok-investment'

print(hp)

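2 Answers

It should be hp['income2'], because a string literal such as 'income2' is not a valid assignment target. You can assign to a variable name or to a DataFrame column, but not to a literal.

Line 61 should read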
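A minimal sketch of the difference, using a small made-up DataFrame in place of the question's hp (the three values are illustrative assumptions). A string literal on the left of = is rejected when the source is compiled, while indexing the DataFrame with the same string creates a column.

```python
import pandas as pd

# Hypothetical stand-in for the question's `hp` DataFrame.
hp = pd.DataFrame({"income": ["77516", "83311", "?"]})

# A string literal is not a valid assignment target, so this source
# is rejected at compile time, before any of it runs.
bad_source = "'income2' = 1"
try:
    compile(bad_source, "<example>", "exec")
    literal_assignment_failed = False
except SyntaxError:
    literal_assignment_failed = True

# Indexing the DataFrame with the string assigns a new column instead.
hp["income2"] = pd.to_numeric(hp["income"], errors="coerce")

print(literal_assignment_failed)
print(hp)
```

The exact wording of the SyntaxError message differs between Python versions, so the sketch only checks that one is raised.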

income2 = pd.to_numeric(hp['income'], errors = 'coerce')

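Let's break it down.

You want to assign the output of to_numeric, a function in the pandas module (imported as pd), to a new variable income2.

The main parameters of to_numeric are arg (a list, tuple, 1-d array, or Series), errors (a choice of how to handle values that cannot be parsed), and an optional downcast. You can find this with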

help(pd.to_numeric)

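Line 61 originally passed Adult.income as arg. That would need to be a list, tuple, 1-d array, or Series, but Adult is never defined, so it cannot be a valid arg.

Try

type(hp['income'])

You get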

pandas.core.series.Series

This is a valid argument.
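To illustrate the arg, errors, and downcast parameters, here is a small sketch on made-up Series (the values are assumptions for illustration, not taken from the dataset). With errors='coerce', entries that cannot be parsed become NaN, and downcast='integer' asks for the smallest integer dtype that can hold the result.

```python
import pandas as pd

# Made-up data: one entry cannot be parsed as a number.
raw = pd.Series(["10", "20", "?", "40"])

# A Series is a valid `arg` for to_numeric.
print(type(raw))

# errors='coerce' replaces unparseable values with NaN instead of raising.
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced)

# downcast='integer' requests the smallest integer dtype that fits the data.
small = pd.to_numeric(pd.Series(["1", "2", "3"]), downcast="integer")
print(small.dtype)
```

With the default errors='raise', the same call on raw would raise an exception at the '?' entry.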

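This line is not the only problem in your code. I suggest using IPython so that you can see all the other errors. Or just use the command line and paste in a few lines at a time; you will find new error messages to resolve.

Bugs in code are like cockroaches: they rarely travel alone.

Good luck!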
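Several of the other problems are visible in the question's script: the results of replace and dropna are discarded, hp('income') calls the DataFrame instead of indexing it, the string 'income' is used where the column is meant, LimitLo is computed with + rather than -, the two outlier conditions are joined with & so they can never both hold, and the investment labels are written to 'loc2' rather than 'EvalonInvestment'. The following is one possible cleanup, not the only one. It assumes what the asker intended at each step and runs on a few made-up rows instead of the downloaded file.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the question's `hp` (assumption: the real
# data would come from the read_csv call in the question).
hp = pd.DataFrame({
    "age": [25, 38, 52, 61, 44, 33, 29, 70],
    "income": ["100", "110", "?", "95", "105", "90", "5000", "98"],
    "capital-gain": [0, 2174, 0, 60000, 15000, 0, 500, 0],
})

# replace() returns a new object, so the result is assigned back.
hp = hp.replace("?", np.nan)

# Columns are selected with square brackets; drop rows with no income.
hp = hp.dropna(subset=["income"]).reset_index(drop=True)
hp["income"] = pd.to_numeric(hp["income"], errors="coerce").astype(float)

# Outlier limits: mean minus/plus two standard deviations of the column.
income = hp["income"]
limit_hi = income.mean() + 2 * income.std()
limit_lo = income.mean() - 2 * income.std()
bad_income = (income > limit_hi) | (income < limit_lo)

# Replace outliers with the mean of the remaining values.
hp.loc[bad_income, "income"] = income[~bad_income].mean()

# Min-max scale the cleaned column to the range [0, 1].
income = hp["income"]
hp["income_scaled"] = (income - income.min()) / (income.max() - income.min())

# Bin ages into the ranges used in the question.
hp["age_bin"] = pd.cut(hp["age"], [15, 30, 45, 60, 75, 90])

# Write every label to the same column; 'zzz' is the question's placeholder.
hp["EvalonInvestment"] = "zzz"
gain = hp["capital-gain"]
hp.loc[gain >= 50000, "EvalonInvestment"] = "investmentking"
hp.loc[(gain > 10000) & (gain < 50000), "EvalonInvestment"] = "good-investment"
hp.loc[(gain > 0) & (gain <= 10000), "EvalonInvestment"] = "ok-investment"

print(hp)
```

The column names income_scaled and age_bin are chosen here for illustration. The LabelEncoder step is left out because the columns it was applied to are already numeric once parsed. Note also that a gain of exactly 10000 falls in 'ok-investment', following the question's boundaries.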
