Why am I getting a prediction score of 1, i.e. 100%?

data-mining machine-learning python scikit-learn regression pandas
2022-01-20 21:58:07

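I'm reading in some parameters and trying to predict a target value using linear regression and gradient boosting (GB). Surprisingly, the score on my test data is 1. How is that possible? Can anyone tell me what is wrong with this code?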

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("prod_data_for_ML.csv",header = 0)

#Data Pre-processing
# Drop columns that are not used as features (the positional axis argument is not accepted by recent pandas)
data = dataset.drop(columns=['organization_id', 'status', 'city'])

# Fill NaNs in these columns with the column median
median_cols = ['zip', 'role_id', 'specialty_id', 'latitude', 'longitude']
data[median_cols] = data[median_cols].fillna(data[median_cols].median())

# Fill missing years_of_experience with 0
data['years_of_experience'] = data['years_of_experience'].fillna(0)

#Start training

labels = dataset.location_id
train1 = data
reg = LinearRegression()
x_train , x_test , y_train , y_test = train_test_split(train1 , labels , test_size = 0.20,random_state =1)

# x_train.to_csv("x_train.csv", sep=',', encoding='utf-8')
# x_test.to_csv("x_test.csv", sep=',', encoding='utf-8')

reg.fit(x_train,y_train)
reg.score(x_test,y_test)
```
1 Answer

You are using the target variable `location_id` as a feature. You need to drop it from `data` / `train1`, i.e. from your X variable.

In other words, you are trying to predict `location_id` from `location_id`.

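If you inspect `reg.coef_` (or `feature_importances_` on the gradient boosting model; `LinearRegression` itself has no such attribute), you will see that `location_id` accounts for essentially 100% of the prediction, while the other features have no effect on the result.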
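One way to check this yourself is sketched below, again on made-up synthetic data rather than the original CSV. Note that `LinearRegression` exposes `coef_`, while `feature_importances_` is available on tree ensembles such as `GradientBoostingRegressor`. The column names are borrowed from the question, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real data; all values are made up.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "role_id": rng.integers(1, 10, size=n).astype(float),
    "years_of_experience": rng.integers(0, 30, size=n).astype(float),
    "location_id": rng.integers(1, 500, size=n).astype(float),
})
y = X["location_id"]  # the target is also a feature column: this is the leak

# Linear model: look at the fitted coefficient of each column.
lin = LinearRegression().fit(X, y)
coefs = pd.Series(lin.coef_, index=X.columns)

# Tree ensemble: look at the impurity-based feature importances.
gb = GradientBoostingRegressor(random_state=0).fit(X, y)
importances = pd.Series(gb.feature_importances_, index=X.columns)

print(coefs.round(6))
print(importances.round(4))
```

A coefficient near 1 on the leaked column with near-zero coefficients elsewhere, or a single feature carrying almost all of the importance, is a strong hint that the target has slipped into the feature matrix.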