ValueError: operands could not be broadcast together with shapes (60002,39) (38,) during pca.transform

data-mining machine-learning python scikit-learn random-forest pca
2022-02-04 23:45:30

I am trying to solve the San Francisco Crime problem on Kaggle. First, here is my code:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

data = pd.read_csv('/home/limafoxtrottango/Downloads/sf_crime_train.csv')
data_test = pd.read_csv('/home/limafoxtrottango/Downloads/Mozilla Downloads/Test.csv')

#pre-processing training data
le = LabelEncoder()
data['Category'] = le.fit_transform(data['Category'])

categorical_variables = pd.get_dummies(data[['DayOfWeek','PdDistrict','Resolution']])  

data = data.join(categorical_variables)

scaler1 = StandardScaler()
scaler2 = StandardScaler()
scaler3 = StandardScaler()
scaler4 = StandardScaler()

data['X'] = scaler1.fit_transform(data[['X']])
data['Y'] = scaler2.fit_transform(data[['Y']])

data['Date'], data['Time'] = data['Dates'].str.split(' ', 1).str
data['Year'], data['Month'], data['Day'] = data['Date'].str.split('-', 2).str
data['Hours'], data['Minutes'], data['Seconds'] = data['Time'].str.split(':', 2).str
data["Hours"] = data["Hours"].map(str) + data["Minutes"]

del data['Minutes']
del data['Seconds']
del data['Time']
del data['Day']
del data['Year']
del data['Id']
del data['Address']
del data['Descript']
del data['PdDistrict']
del data['Resolution']
del data['DayOfWeek']
del data['Date']
del data['Dates']

data['Month'] = scaler1.fit_transform(data[['Month']])
data['Hours'] = scaler2.fit_transform(data[['Hours']])

labels = data.columns[1:]
train = data.loc[:, labels].values

#doing pca
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(train)
principalDf = pd.DataFrame(data = principalComponents, columns = ['pc 1', 'pc 2', 'pc 3'])

final_data_train = pd.concat([data[['Category']], principalDf], axis = 1)

target = np.array(final_data_train['Category'])
features = final_data_train.drop('Category', axis = 1)

# Saving feature names for later use
feature_list = list(final_data_train.columns)

# Convert to numpy array
features = np.array(features)

# Instantiate model with 10 decision trees
rf = RandomForestClassifier(n_estimators = 10, random_state = 0)

# Train the model on training data
rf.fit(features, target);

#pre-processing testing data
categorical_variables = pd.get_dummies(data_test[['DayOfWeek','PdDistrict','Resolution']])  

data_test = data_test.join(categorical_variables)

scaler5 = StandardScaler()
scaler6 = StandardScaler()
scaler7 = StandardScaler()
scaler8 = StandardScaler()

data_test['X'] = scaler5.fit_transform(data_test[['X']])
data_test['Y'] = scaler6.fit_transform(data_test[['Y']])


data_test['Date'], data_test['Time'] = data_test['Dates'].str.split(' ', 1).str
data_test['Year'], data_test['Month'], data_test['Day'] = data_test['Date'].str.split('-', 2).str
data_test['Hours'], data_test['Minutes'], data_test['Seconds'] = data_test['Time'].str.split(':', 2).str
data_test["Hours"] = data_test["Hours"].map(str) + data_test["Minutes"]

row_id_test = data_test['Id']

del data_test['Minutes']
del data_test['Seconds']
del data_test['Time']
del data_test['Day']
del data_test['Year']
del data_test['Address']
del data_test['PdDistrict']
del data_test['Resolution']
del data_test['DayOfWeek']
del data_test['Date']
del data_test['Dates']

data_test['Month'] = scaler7.fit_transform(data_test[['Month']])
data_test['Hours'] = scaler8.fit_transform(data_test[['Hours']])

data_test_trans = pca.transform(data_test)

predictions = rf.predict_proba(data_test_trans)

final_predictions_file = pd.concat([row_id_test, pd.DataFrame(predictions)], axis = 1)

np.savetxt("predictions.csv", final_predictions_file, delimiter=",")

I ran PCA on my training data and then used the fitted object to transform the test data with pca.transform(). But I get the following error:

ValueError: operands could not be broadcast together with shapes (60002,39) (38,) during pca.transform

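Can someone point out what I am doing wrong? When I use the training data itself as the test data, without dropping the target variable from the data frame, the program runs fine. When I use the actual test data, which has no target column, this error is raised.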

I have only just started learning machine learning, so please forgive me if this question seems naive. Thanks!

1 Answer

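This type of error usually occurs when the training data used to fit the model and the data passed in for prediction have a different number of features (columns).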
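To make the mismatch concrete, here is a minimal sketch that reproduces the situation with random arrays. The array names and sizes are made up for illustration; only the 38-versus-39 column counts mirror the question. The exact wording of the exception depends on the scikit-learn version (older releases report the broadcasting failure, newer ones report the feature count), but in both cases it is a ValueError.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 38 features at fit time, 39 at transform time.
X_train = rng.normal(size=(100, 38))
X_test = rng.normal(size=(20, 39))

pca = PCA(n_components=3).fit(X_train)

# PCA stores one mean per training feature, so this vector has length 38.
print(pca.mean_.shape)

try:
    pca.transform(X_test)  # 39 columns cannot be centered with a 38-long mean
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The (38,) in the original error is most plausibly this per-feature mean vector, which transform subtracts from every row of the input.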

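In this case, the training data has shape (x, 38), while the test data has shape (y, 39), so the test frame carries one more column than the PCA object saw when it was fitted.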
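One way to find and remove the extra column is to compare the test columns against the training feature columns and then reindex the test frame to match. The sketch below uses small toy frames (hypothetical data, not the Kaggle files), and the names train_features, test_features, and feature_columns are introduced here for illustration. Judging from the posted code, Id is saved to row_id_test but never deleted from data_test, so it may well be the 39th column; that is an inference from the question, not something confirmed above.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy stand-ins for the real frames (hypothetical data).
train_df = pd.DataFrame({
    "Category": [0, 1, 0, 1],
    "X": [0.1, 0.2, 0.3, 0.4],
    "Y": [1.0, 0.9, 0.8, 0.7],
    "DayOfWeek": ["Mon", "Tue", "Mon", "Wed"],
})
test_df = pd.DataFrame({
    "Id": [10, 11, 12],
    "X": [0.15, 0.25, 0.35],
    "Y": [0.95, 0.85, 0.75],
    "DayOfWeek": ["Tue", "Mon", "Mon"],  # "Wed" never occurs in the test set
})

# Training features: everything except the target, with dummies expanded.
train_features = pd.get_dummies(
    train_df.drop(columns=["Category"]), columns=["DayOfWeek"]
)
feature_columns = train_features.columns

# Keep the ids for the submission file, but drop them from the features.
row_id_test = test_df["Id"]
test_features = pd.get_dummies(
    test_df.drop(columns=["Id"]), columns=["DayOfWeek"]
)

# Diagnose any remaining differences between the two column sets.
extra = set(test_features.columns) - set(feature_columns)
missing = set(feature_columns) - set(test_features.columns)
print("extra in test:", extra)
print("missing in test:", missing)

# Force the test frame into the training layout: same columns, same order,
# with zeros for dummy columns that did not occur in the test data.
test_features = test_features.reindex(columns=feature_columns, fill_value=0)

pca = PCA(n_components=2).fit(train_features.to_numpy(dtype=float))
test_pcs = pca.transform(test_features.to_numpy(dtype=float))
print(test_pcs.shape)
```

The reindex step also guards against a second source of mismatch: pd.get_dummies builds its columns from the categories present in each frame, so calling it separately on the training and test data can produce different column sets even after Id is dropped.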