Could not convert string to float error on the KDDCup99 dataset

data-mining machine-learning python scikit-learn pandas
2021-09-17 01:08:04

I am trying to compare six algorithms on the KDD Cup 99 and NSL-KDD datasets using Python, and I am running into a problem when building and evaluating the models against both datasets.

Whenever I try to run the algorithms on the dataset, I get the following error: "could not convert string to float: S0".

The error is raised while evaluating the six models: Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, Classification and Regression Trees, Gaussian Naive Bayes, and Support Vector Machines.

Here is the code I am using to evaluate the dataset:

# Imports used by the code below
import pandas
import matplotlib.pyplot as plt
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load KDD dataset
dataset = pandas.read_csv('Datasets/KDDCUP 99/kddcup.csv', names = ['duration','protocol_type','service','src_bytes','dst_bytes','flag','land','wrong_fragment','urgent',
'hot','num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations',
'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','serror_rate',
'rerror_rate','same_srv_rate','diff_srv_rate','srv_count','srv_serror_rate','srv_rerror_rate','srv_diff_host_rate',
'dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate',
'dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','class'])


# split data into X and y
array = dataset.values
X = array[:,0:41]
Y = array[:,41]

# Split-out validation dataset
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, Y, test_size=validation_size, random_state=seed)

# Test options and evaluation metric
num_folds = 7
num_instances = len(X_train)
seed = 7
scoring = 'accuracy'

#  Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

    # Here is where the error is spit out
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring) # "could not convert string to float" happens here; scoring uses a string
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100) # multiplying by 100 to show percentage
    print(msg)

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Here are 3 sample rows from the KDDCup99 dataset:

0   tcp http    SF  215 45076   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   1   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   normal.
0   tcp http    SF  162 4528    0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   2   2   0   0   0   0   1   0   0   1   1   1   0   1   0   0   0   0   0   normal.
0   tcp http    SF  236 1228    0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   1   1   0   0   0   0   1   0   0   2   2   1   0   0.5 0   0   0   0   0   normal.

I tried using label encoding, but it still spits out the same error. When browsing the sklearn website I noticed that the scoring values are of string type; is that the cause of the problem? If not, is there something wrong with the way I am loading the dataset?

EDIT: I tried removing the scoring value from the code, but I still get the same error.

2 Answers

I noticed you mentioned that you used label encoding, but I did it myself and the code ran fine. I used the 10% version of the dataset. After loading the dataset, just add this code:

from sklearn.preprocessing import LabelEncoder

for column in dataset.columns:
    if dataset[column].dtype == object:  # encode only the string-valued columns
        le = LabelEncoder()
        dataset[column] = le.fit_transform(dataset[column])
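
As a quick illustration of what the loop does, LabelEncoder simply maps each distinct string in a column to an integer code (toy values for the flag column, not taken from the dataset):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['SF', 'S0', 'REJ', 'SF']))  # classes are sorted alphabetically, so this prints [2 1 0 2]
print(le.classes_)                                  # ['REJ' 'S0' 'SF']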

After label encoding, you should use a One-Hot Encoder to improve the performance of some of the algorithms. You should also avoid the cross_validation module, as it is deprecated and will be removed in version 0.20.
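
For example, here is a minimal sketch of both suggestions, assuming scikit-learn 0.18 or newer and the dataset, column names and models list from the question (everything else is illustrative, not part of the original answer):

import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# One-hot encode the categorical feature columns; keep the class label as the target
X = pd.get_dummies(dataset.drop('class', axis=1), columns=['protocol_type', 'service', 'flag'])
Y = dataset['class']

X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=0.20, random_state=7)

# model_selection replaces the deprecated cross_validation module
kfold = KFold(n_splits=7, shuffle=True, random_state=7)
for name, model in models:
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    print("%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100))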

Let's use label encoding:

from sklearn import preprocessing    
def convert(data):
    number = preprocessing.LabelEncoder()
    data['column_name'] = number.fit_transform(data['column_name'])
    data = data.fillna(-9999)
    return data

test = convert(test) #where test is your dataframe
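
Since the KDDCup99 frame from the question has several string-valued columns, a generalized variant of this helper (hypothetical, not part of the answer above) could be applied column by column:

from sklearn import preprocessing

def convert_column(data, column_name):
    # Encode one string-valued column in place and fill any missing values
    number = preprocessing.LabelEncoder()
    data[column_name] = number.fit_transform(data[column_name])
    return data.fillna(-9999)

# String columns in the question's dataset
for col in ['protocol_type', 'service', 'flag', 'class']:
    dataset = convert_column(dataset, col)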