Data mining - Accuracy drops sharply when using TruncatedSVD with HashingVectorizer

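I have about 800,000 product descriptions, each labeled with a category. There are about 280 categories. I want to train a model on this dataset so that I can later predict the category of a given product description. Because the dataset is large, I cannot use TF-IDF on it; it throws a MemoryError.

I found that HashingVectorizer is recommended for large data. But when I applied HashingVectorizer, the resulting data had 1048576 features. Training an SGD model on it takes about 1 hour and gives 78% accuracy.

Code: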

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV

data_file = "Product_description.txt"
# Read the tab-separated dataset and drop rows with missing values
data = pd.read_csv(data_file, header=0, delimiter="\t", quoting=3, encoding="utf8")
data = data.dropna()

# Stratified 70/30 split; train_data and test_data are Series of descriptions
train_data, test_data, train_label, test_label = train_test_split(
    data.Product_description, data.Category,
    test_size=0.3, random_state=100, stratify=data.Category)

# max_iter replaces the n_iter argument of older scikit-learn releases
sgd_model = SGDClassifier(loss='hinge', max_iter=20, class_weight="balanced", n_jobs=-1,
                          random_state=42, alpha=1e-06, verbose=1)
vectorizer = HashingVectorizer(ngram_range=(1, 3))
# The split already returns the description column, so pass the Series directly
data_features = vectorizer.fit_transform(train_data)
sgd_model.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
Output_predict = sgd_model.predict(test_data_feature)
print(accuracy_score(test_label, Output_predict))

Output:

Accuracy 77.01%

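Since the dimensionality is very high, I thought reducing it would improve accuracy and shorten training time. I used TruncatedSVD to reduce the dimensionality; this cut the training time to 10 minutes but drastically lowered the prediction accuracy.

Code 2: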

from sklearn.decomposition import TruncatedSVD
# Fit a 100-component SVD on the hashed training features
clf = TruncatedSVD(100)
clf.fit(data_features)

Output:

Accuracy 14%
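
The snippet above only fits the SVD; it does not show how the reduced features reach the classifier. As a rough sketch of one way to wire the steps together, the three stages can be chained in a scikit-learn pipeline so that the same fitted projection is applied to both the training and the test text. The toy texts, labels, `n_features=2 ** 10` and `n_components=2` are made up for illustration and are not the settings from the question.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in for the product descriptions (hypothetical data).
train_texts = [
    "red cotton shirt", "blue cotton shirt", "green linen shirt",
    "steel kitchen knife", "ceramic kitchen knife", "steel bread knife",
]
train_labels = ["clothing", "clothing", "clothing",
                "kitchen", "kitchen", "kitchen"]
test_texts = ["yellow cotton shirt", "steel paring knife"]

# Hashing -> SVD -> linear classifier, chained so that the SVD fitted on
# the training features is reused unchanged on the test features.
pipeline = make_pipeline(
    HashingVectorizer(ngram_range=(1, 3), n_features=2 ** 10),
    TruncatedSVD(n_components=2, random_state=42),
    SGDClassifier(loss="hinge", random_state=42),
)
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)

# Share of the variance kept by the retained components; a small value
# means the projection throws away most of the variance in the features.
svd = pipeline.named_steps["truncatedsvd"]
retained = svd.explained_variance_ratio_.sum()
print(predictions, retained)
```

Checking `explained_variance_ratio_` on the real data may help judge whether 100 components keep enough of the hashed feature space to be useful.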

Edit:

When I tried TruncatedSVD with 1000 components, it raised a MemoryError, so I had to settle for 100.
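
A back-of-the-envelope calculation may explain why 1000 components do not fit in memory. The fitted `components_` array of TruncatedSVD is dense with shape (n_components, n_features), and the transformed data is dense with shape (n_samples, n_components). The sketch below assumes float64 values and about 560,000 training rows (70% of roughly 800,000); both numbers are assumptions, and temporary arrays used during fitting are ignored.

```python
# Rough size of the dense arrays TruncatedSVD has to hold.
N_FEATURES = 2 ** 20      # 1048576 hashed features, as reported in the question
N_TRAIN_ROWS = 560_000    # assumption: 70% of roughly 800k descriptions
BYTES_PER_FLOAT64 = 8     # assumption: float64 values


def dense_gib(rows, cols):
    """Size in GiB of a dense float64 array with the given shape."""
    return rows * cols * BYTES_PER_FLOAT64 / 2 ** 30


for n_components in (100, 1000):
    components = dense_gib(n_components, N_FEATURES)  # components_ matrix
    reduced = dense_gib(N_TRAIN_ROWS, n_components)   # transformed training data
    print(n_components, round(components, 2), round(reduced, 2))
```

Under these assumptions the `components_` array alone grows from under 1 GiB at 100 components to about 7.8 GiB at 1000, before counting the transformed data.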

The scikit-learn site says that reducing n_features in HashingVectorizer can cause collisions: "there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice, this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems)."
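
For reference, a minimal sketch of how n_features changes the width of the hashed matrix. The default is 2 ** 20, which matches the 1048576 features reported above; the two sample texts are made up.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["red cotton shirt", "steel kitchen knife"]  # hypothetical sample texts

# Default width (2 ** 20) versus the smaller width mentioned in the docs.
default_vec = HashingVectorizer(ngram_range=(1, 3))
small_vec = HashingVectorizer(ngram_range=(1, 3), n_features=2 ** 18)

# HashingVectorizer is stateless, so transform works without fitting.
default_shape = default_vec.transform(docs).shape
small_shape = small_vec.transform(docs).shape
print(default_shape, small_shape)
```

The matrix stays sparse in both cases, so the smaller width mainly shrinks the downstream model and any dense decomposition built on top of it.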

I got the best accuracy with n-grams in the range 1 to 3, so that is the only setting I used.
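
To make the n-gram setting concrete, here is a small sketch of the terms that `ngram_range=(1, 3)` produces for one short description before they are hashed. The sample text is made up.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# build_analyzer returns the function that turns raw text into terms.
analyzer = HashingVectorizer(ngram_range=(1, 3)).build_analyzer()
tokens = analyzer("Red cotton shirt")
print(tokens)
```

A three-word description yields six terms (three unigrams, two bigrams and one trigram), which is part of why the hashed feature space fills up quickly on long descriptions.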