Predicting game scores using sklearn

data-mining scikit-learn random-forest prediction one-hot-encoding
2022-02-23 07:09:20

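I am using one-hot encoding and RandomForestRegressor to predict the scores of a set of football matches. How should I use it to predict? I am sure I am doing it wrong at the moment, because all of my predicted values come out as 1 (probably because I filled all NaN values with 1 in order to split and fit).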

When encoding a few columns and then transforming them, which dataset should I pass in?

My code is as follows:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
    
# pandas needs to be told which columns hold dates before the file is read,
# hence the parse_dates argument below.
date_col = ['Date']
df = pd.read_csv(
    r'C:\Users\harsh\Documents\My Dream\Desktop\Machine Learning\Attempt1\Historical Data\Concat_Cleaned.csv'
    , parse_dates=date_col, skiprows=0, low_memory=False)

# Clean dataset by dropping null rows
data = df.dropna(axis=0)

# Column that you want to predict = y
y = data.Full_Time_Home_Goals

# Columns that are fed into the model to make predictions (independent variables); cannot include column y
features = ['HomeTeam', 'AwayTeam', 'Full_Time_Away_Goals', 'Full_Time_Result']
# Create X
X = data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
    
# Define and train OneHotEncoder to transform categorical data to a numeric array
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_X, train_y)

transformed_train_X = enc.transform(train_X)
transformed_val_X = enc.transform(val_X)

# Build a Random Forest model and train it on all of X and y.
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor()

# Define columns we want to use for prediction
columns = ['Home_Team', 'Away_Team']
test_data = test_data[columns]
# Renaming Column Names to match with training dataset
test_data = test_data.rename({'Home_Team': 'HomeTeam', 'Away_Team': 'AwayTeam'}, axis=1)
# Adding NaN columns to dataset to match the training dataset
test_data['Full_Time_Result'] = np.nan
test_data['Full_Time_Away_Goals'] = np.nan
test_data['Full_Time_Home_Goals'] = np.nan
# Aligning dataframe to model defined
test_data_features = test_data[features]
# Filling all NA values as Encoder cannot handle nan values
df = test_data.fillna(1)

# Define Y for Fitting
Y = df

# We need nY as that would be the column used for splitting
ny = df.Full_Time_Home_Goals

# The encoder cannot handle NaN, so all NaN values were converted to 1 above. Because the
# val_X names are confusing, the new split uses an n_ prefix.
train_n_X, val_n_X, train_n_y, val_n_y = train_test_split(Y, ny, random_state=1)

# Since we have text again, we will need fitting and transforming the data
enc.fit(train_n_X, train_n_y)
transformed_train_n_X = enc.transform(train_n_X)
transformed_val_n_X = enc.transform(val_n_X)

# Fitting and then we will be using predict
rf_model_on_full_data.fit(transformed_train_n_X, train_n_y)

# Predicting. This step needs correction, as predict should be run on the new dataset and not just on one column.
test_preds = rf_model_on_full_data.predict(transformed_val_n_X)

print(test_preds)

What should go into predict() to get the result I want?

The files used are available here.

1 Answer

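If I understand correctly, you are trying to predict the target variable 'Full_Time_Home_Goals' from four features, ['HomeTeam', 'AwayTeam', 'Full_Time_Away_Goals', 'Full_Time_Result'] (the first two are categorical and one-hot encoded).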

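In your test set, you only receive (or only use) the first two, and you fill the last two (which are naturally the more important ones) with the value 1.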

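Now suppose your model decides that the teams do not matter and predicts only from the last two variables... but in your test set those variables all have the same value, so every row gets the same prediction.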
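To see why the filled-in columns cannot help, one can count how many distinct values each feature takes in the test set. The sketch below is an illustration only: the three fixtures are made-up data, and `filled`, `distinct_values` and `uninformative` are hypothetical names that do not appear in the question's code.

```python
import numpy as np
import pandas as pd

features = ["HomeTeam", "AwayTeam", "Full_Time_Away_Goals", "Full_Time_Result"]

# Made-up fixtures, shaped like the question's test_data after renaming.
test_data = pd.DataFrame({
    "HomeTeam": ["A", "C", "B"],
    "AwayTeam": ["C", "B", "A"],
})

# Same steps as in the question: add the missing columns as NaN, then fill with 1.
test_data["Full_Time_Away_Goals"] = np.nan
test_data["Full_Time_Result"] = np.nan
filled = test_data.fillna(1)

# A feature with a single distinct value cannot distinguish one row from another.
distinct_values = filled[features].nunique()
uninformative = distinct_values[distinct_values <= 1].index.tolist()

print(uninformative)  # ['Full_Time_Away_Goals', 'Full_Time_Result']
```

Any feature listed in `uninformative` looks identical for every test row, so a model that leans on those features has nothing to tell the rows apart.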

The lesson here is: when you train a model, do not use features that are (mostly or entirely) missing from the test set.
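A minimal sketch of one way to apply this, under stated assumptions: the model is trained only on the two team columns, which are the only features the fixtures to be predicted actually contain. The `history` frame is toy data standing in for Concat_Cleaned.csv, `fixtures` is a hypothetical stand-in for the question's `test_data`, and `n_estimators=50` is an arbitrary choice.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy historical results (assumption: stands in for the real CSV).
history = pd.DataFrame({
    "HomeTeam": ["A", "B", "C", "A", "B", "C"],
    "AwayTeam": ["B", "C", "A", "C", "A", "B"],
    "Full_Time_Home_Goals": [2, 0, 1, 3, 1, 0],
})

# Use only features that will also be known for future fixtures.
features = ["HomeTeam", "AwayTeam"]

# The pipeline fits the encoder and the forest together on historical data.
# handle_unknown="ignore" encodes unseen teams as all zeros instead of failing.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestRegressor(n_estimators=50, random_state=1),
)
model.fit(history[features], history["Full_Time_Home_Goals"])

# Hypothetical upcoming fixtures, renamed to match the training columns.
fixtures = pd.DataFrame({"Home_Team": ["A", "C"], "Away_Team": ["C", "B"]})
fixtures = fixtures.rename(columns={"Home_Team": "HomeTeam", "Away_Team": "AwayTeam"})

# Only predict is called on the new data; nothing is refitted here.
test_preds = model.predict(fixtures[features])
print(test_preds)
```

In this sketch the encoder and the model are fitted once, on the historical results, and the new fixtures are passed to `predict()` with exactly the columns used for training. No placeholder columns or `fillna(1)` are needed, because no feature is missing at prediction time.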