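Question:
I need to classify whether or not a document has been inspected. I only have a text dataset from inspected documents; any document that does not contain inspection data can automatically be classified as not inspected.
I found that a One-Class SVM can do this. Below are the dataset and the program I implemented.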
Code:
from sklearn import svm
import pandas as pd
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer

stop = set(stopwords.words('english'))

def clean(series):
    # Drop English stop words, then replace punctuation with spaces and collapse whitespace.
    series = series.apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop))
    return series.str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)

# The training data holds only the known class; the test data holds one known and one unknown keyword.
data = pd.DataFrame({"Keywords": ["Vessel Name", "Barge ID", "Barge/Vessel", "VESSEL"]})
data_test = pd.DataFrame({"Keywords": ["Barge ID", "Check"]})
data['Keywords'] = clean(data['Keywords'])
data_test['Keywords'] = clean(data_test['Keywords'])

vectorizer = TfidfVectorizer(max_features=200, ngram_range=(1, 3), sublinear_tf=True)
data_features = vectorizer.fit_transform(data['Keywords'])

clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(data_features)

test_data_features = vectorizer.transform(data_test['Keywords'])
print(clf.predict(test_data_features))  # 1 = inlier, -1 = outlier
I tried this on a small sample, but it does not work. It incorrectly predicts -1 for "Barge ID" and 1 for "Check".
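
To narrow down where the unexpected labels come from, it may help to look at what the model actually receives. The sketch below is only a diagnostic under a few assumptions: it reuses the same keywords and the same `TfidfVectorizer` and `OneClassSVM` settings, skips the NLTK stop-word step for brevity (none of these keywords should be affected by it), and introduces the names `train_docs`, `test_docs`, `nonzero_counts` and `scores` purely for illustration. It prints, for each test keyword, how many TF-IDF features are non-zero together with the raw `decision_function` score behind the predicted label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Same keywords as in the question (stop-word removal omitted for brevity).
train_docs = ["Vessel Name", "Barge ID", "Barge/Vessel", "VESSEL"]
test_docs = ["Barge ID", "Check"]

vectorizer = TfidfVectorizer(max_features=200, ngram_range=(1, 3), sublinear_tf=True)
train_features = vectorizer.fit_transform(train_docs)
test_features = vectorizer.transform(test_docs)

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(train_features)

# Terms learned from the training keywords (lower-cased unigrams to trigrams).
print(sorted(vectorizer.vocabulary_))

# Count the non-zero TF-IDF entries per test document; the data is tiny,
# so converting to a dense array is fine here.
nonzero_counts = (test_features.toarray() != 0).sum(axis=1)

# Signed score from the model: positive means inlier, negative means outlier.
scores = clf.decision_function(test_features)
predictions = clf.predict(test_features)

for doc, nnz, score, label in zip(test_docs, nonzero_counts, scores, predictions):
    print(f"{doc!r}: known terms={nnz}, score={score:.4f}, predicted={label}")
```

Because `transform` ignores terms that were not seen during `fit`, "Check" has no known terms and reaches the SVM as an all-zero vector. Whether that vector falls inside or outside the learned boundary depends on `nu` and `gamma`, so comparing the two scores while varying those parameters is one way to see how sensitive the labels are with only four training samples.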