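I am trying to use Python to compare six algorithms on the KDD Cup 99 and NSL-KDD datasets, and I am running into a problem when building and evaluating the models on both datasets.
Whenever I run the algorithms on a dataset, I get the following error: "could not convert string to float: S0"
The error is raised while evaluating the six models: Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, Classification and Regression Trees, Gaussian Naive Bayes, and Support Vector Machines.
This is the code I use to evaluate the dataset: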
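# Imports needed by the code below. sklearn.cross_validation is the pre-0.20
# scikit-learn module; newer releases provide these helpers in sklearn.model_selection.
import pandas
import matplotlib.pyplot as plt
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load KDD dataset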
# Column names listed in the order the fields appear in the data file
# (flag is the 4th field, as in the sample rows further down)
dataset = pandas.read_csv('Datasets/KDDCUP 99/kddcup.csv', names = ['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent',
    'hot','num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations',
    'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count',
    'serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate',
    'dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','class'])
# split data into X and y
array = dataset.values
X = array[:,0:41]
Y = array[:,41]
# Split-out validation dataset
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Test options and evaluation metric
num_folds = 7
num_instances = len(X_train)
seed = 7
scoring = 'accuracy'
# Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    # The error is raised by the next call
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)  # "could not convert string to float" happens here; scoring is passed as a string
    results.append(cv_results)
    names.append(name)
    # Multiply by 100 to report percentages
    msg = "%s: %f (%f)" % (name, cv_results.mean() * 100, cv_results.std() * 100)
    print(msg)
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)  # label each box with its model name, not the class values
plt.show()
Here are 3 sample rows from the KDDCup99 dataset:
0 tcp http SF 215 45076 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 normal.
0 tcp http SF 162 4528 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 normal.
0 tcp http SF 236 1228 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 2 2 1 0 0.5 0 0 0 0 0 normal.
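To show which fields in these rows are text rather than numbers, here is a minimal sketch that parses the three rows exactly as pasted above and lists the columns pandas cannot read as numeric. It assumes whitespace-separated fields, as they appear in this post, and uses positional column indices (0 to 41) instead of names; `SAMPLE`, `sample`, and `text_columns` are names chosen only for this illustration.

```python
import io

import pandas as pd

# The three sample rows as pasted in the post (whitespace-separated here).
SAMPLE = """\
0 tcp http SF 215 45076 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 normal.
0 tcp http SF 162 4528 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 normal.
0 tcp http SF 236 1228 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 2 2 1 0 0.5 0 0 0 0 0 normal.
"""

# No header row; columns are numbered 0..41 by position.
sample = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None)

# Columns that did not parse as numbers, i.e. the ones holding text.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(sample.shape[1], "columns in total")
print("text-valued column positions:", text_columns)
```

Positions 1, 2, and 3 are the protocol, service, and flag fields, and the last position is the class label, so three of the 41 feature columns contain strings.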
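I tried label encoding, but it still produces the same error. While browsing the sklearn website, I noticed that the scoring parameter takes a string; is that the cause of the problem? If not, is there something wrong with the way I load the dataset?
Edit: I tried removing the scoring value from the code, but I still get the same error.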
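For reference, a minimal sketch of what encoding the text-valued feature columns could look like, as opposed to encoding only the class column. It uses a toy frame with a handful of the column names from the code above; the rows are made up for illustration, and `pandas.get_dummies` (one-hot encoding) is just one possible choice, not necessarily the right one for these models.

```python
import pandas as pd

# Toy frame with a few of the post's columns; the values are illustrative only.
df = pd.DataFrame({
    "duration": [0, 0, 0],
    "protocol_type": ["tcp", "tcp", "udp"],
    "service": ["http", "http", "private"],
    "flag": ["SF", "S0", "SF"],
    "src_bytes": [215, 0, 105],
    "class": ["normal.", "neptune.", "normal."],
})

features = df.drop(columns="class")

# One-hot encode the text-valued feature columns so every feature is numeric.
X = pd.get_dummies(features, columns=["protocol_type", "service", "flag"], dtype=float)

# Integer-code the target labels (comparable to label-encoding the class column).
y, classes = pd.factorize(df["class"])

# This conversion now succeeds because no strings remain among the features.
X_array = X.to_numpy(dtype=float)

print(X.columns.tolist())
print(y.tolist(), list(classes))
```

In this sketch the feature matrix ends up with one numeric column per distinct protocol, service, and flag value, so a value such as `S0` never reaches the model as a string.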
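Since removing `scoring` changes nothing, the message looks like it comes from converting the feature array to floats rather than from the scoring string. The sketch below reproduces the same kind of error with NumPy alone. It assumes, without having checked the scikit-learn source, that estimators perform a comparable float conversion on `X` internally; the single row is made up to resemble the data.

```python
import numpy as np

# One row shaped like the data: numeric fields mixed with a text field.
# dtype=object mirrors what dataset.values returns when a frame has string columns.
row = np.array([[0, 1, 2, "S0", 215, 45076]], dtype=object)

try:
    np.asarray(row, dtype=float)  # fails on the first value that is not a number
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

No model and no scoring argument is involved here, which suggests the string values in `X` are what need attention.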