I am doing multi-class document classification. I split the original dataset (188,334 documents stored as a list of strings, where each element of the list is one document string) into 70% training and 30% test.
Then, on the 70% training set, I trained the models using sklearn 5-fold cross-validation. I used three models: the first is Gaussian Naive Bayes, the second is Random Forest, and the third is a multi-class SVM trained with Stochastic Gradient Descent.
Stochastic Gradient Descent gave the highest cross-validation accuracy, 0.85. But when I test the same model on the 30% test set, the accuracy is 9%. Why is that? Isn't the cross-validation error a measure/estimate of the test error / generalization error?
Thanks.
Edit:
This is how I create the 70/30 train/test split:
def split(docs_list, target_recoded):
    """This function samples the dataset into training and testing"""
    # Splitting into training and test.
    from sklearn.cross_validation import train_test_split
    train_X, test_X, train_Y, test_Y = train_test_split(docs_list, target_recoded, test_size=0.30, random_state=42)
    return train_X, test_X, train_Y, test_Y
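For completeness, the call looks roughly like this (a sketch; docs_list is the list of document strings and target_recoded the corresponding recoded labels mentioned above):

# Illustrative call; variable names match the description above
train_X, test_X, train_Y, test_Y = split(docs_list, target_recoded)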
After the initial NLP preprocessing (stop-word removal, stemming, etc.), I have a clean list of document strings. From this I create the bag of words with the function below: I first pass the 70% training data, and then pass the 30% test data as the argument to the same function.
def bagofWords(X, Y, max_feature=5000, type="count"):
    """This function creates a Bag of Features vectors from the original documents"""
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.
    if type == "count":  # To choose between count or tf-idf bag of words model
        vectorizer = CountVectorizer(analyzer="word", max_features=max_feature)
    else:
        vectorizer = TfidfVectorizer(analyzer="word", max_features=max_feature)
    X = vectorizer.fit_transform(X)
    return X, np.array(Y)
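Concretely, the two calls look roughly like this (a sketch; the variable names on the left are just illustrative, and the train/test inputs come from split() above):

# First on the 70% training data, then on the 30% test data, as described above
train_X_bow, train_Y_arr = bagofWords(train_X, train_Y, max_feature=5000, type="count")
test_X_bow, test_Y_arr = bagofWords(test_X, test_Y, max_feature=5000, type="count")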
This is how I train the SGD classifier:
def SGD(self):
    """Method to implement Multi-class SVM using Stochastic Gradient Descent"""
    import numpy as np
    from sklearn.linear_model import SGDClassifier
    scores_sgd = []
    for train_indices, test_indices in self.k_fold:
        train_X_cv = self.train_X[train_indices].todense()
        train_Y_cv = self.train_Y[train_indices]
        test_X_cv = self.train_X[test_indices].todense()
        test_Y_cv = self.train_Y[test_indices]
        self.sgd = SGDClassifier(loss='hinge', penalty='l2')
        scores_sgd.append(self.sgd.fit(train_X_cv, train_Y_cv).score(test_X_cv, test_Y_cv))
    print("The mean accuracy of Stochastic Gradient Descent Classifier on CV data is:", np.mean(scores_sgd))
And this is how I check performance on the test data:
def test_performance(self, test_X, test_Y):
    """This method checks the performance of each algorithm on test data."""
    from sklearn import metrics
    # For SGD
    print("The accuracy of SGD on test data is:", self.sgd.score(test_X, test_Y))
    print('Classification Metrics for SGD')
    print(metrics.classification_report(test_Y, self.sgd.predict(test_X)))
    print("Confusion matrix")
    print(metrics.confusion_matrix(test_Y, self.sgd.predict(test_X)))
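The call itself is just something like this (a sketch; clf stands for the instance of my classifier class, and test_X_bow/test_Y_arr are the vectorized 30% test data from bagofWords above):

# Evaluate the already-fitted SGD model on the held-out 30% test set
clf.test_performance(test_X_bow, test_Y_arr)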