Question / main goal / TL;DR: train a classifier, then feed it a random review and get the corresponding predicted review rating (1 to 5 stars) - only 60% accuracy! :(
I have a large dataset of about 48,000 technical product reviews (from many different authors and about different products - which is not that important(?)) with the corresponding ratings (1 to 5 stars). I randomly picked some reviews from each class:
1 star: 173 reviews (could not pick 1000 because there were only 173)
2 stars: 1000 reviews
3 stars: 1000 reviews
4 stars: 1000 reviews
5 stars: 1000 reviews
Total: 4173 reviews - this data is organized in one file (all_reviews_labeled.txt) as tuples, one review and its rating per line:
('review text', 'x star')
('review text', 'x star')
('review text', 'x star')
('review text', 'x star')
…
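To load it I do something like this minimal sketch (it assumes each line really is a Python tuple literal, as above):

import ast

# Build the (review, rating) pairs consumed by the steps below;
# assumes one ('review text', 'x star') tuple literal per line.
tuples_labeled_reviews = []
with open('all_reviews_labeled.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            tuples_labeled_reviews.append(ast.literal_eval(line))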
My first "naive" approach was:
Tokenize review text
POS tagging
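For these two steps I just use nltk's defaults - a sketch (it assumes the punkt and averaged_perceptron_tagger models are downloaded):

import nltk

# review_text is one raw review string from tuples_labeled_reviews
tokens = nltk.word_tokenize(review_text)  # step 1: tokenize
text = nltk.pos_tag(tokens)               # step 2: (word, POS-tag) pairs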
Get the most frequent bigrams that follow some POS-tag rules over the most frequent trigrams (I have seen these rules - using these POS patterns - in "Automatic Star-rating Generation from Text Reviews", p. 7, a paper by Chong-U Lim, Pablo Ortiz and Sang-Woo Jun):
import nltk

# POS tag sets used by the extraction rules below
ADJ = {'JJ', 'JJR', 'JJS'}
ADV = {'RB', 'RBR', 'RBS'}
NOUN = {'NN', 'NNS'}
VERB = {'VB', 'VBD', 'VBG', 'VBN'}

bigrams = []
# text is the list of (word, POS-tag) pairs from nltk.pos_tag()
for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(text):
    if t1 in ADJ and t2 in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in ADV and t2 in ADJ and t3 not in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in ADJ and t2 in ADJ and t3 not in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in NOUN and t2 in ADJ and t3 not in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in ADV and t2 in VERB:
        bigrams.append(w1 + ' ' + w2)
    elif t1 == 'DT' and t2 in ADJ:
        bigrams.append(w1 + ' ' + w2)
    elif t1 == 'VBZ' and t2 in ADJ:
        bigrams.append(w1 + ' ' + w2)
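The "most frequent" part comes from a frequency distribution over the extracted bigrams - a minimal sketch, assuming nltk.FreqDist (word_features is the name the next step consumes; treating it as per-review is my reading of the pipeline):

# Rank the bigrams extracted from one review by frequency;
# word_features feeds review_features() in the next step.
word_features = nltk.FreqDist(bigrams).most_common()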
Extract features (here is where I have the most doubts - should I only look at these two features?):
def review_features(review):
    # word_features: the (bigram, frequency) pairs built above for this review
    features = {}
    for bigram, freq in word_features:
        features['contains({})'.format(bigram)] = True
        features['count({})'.format(bigram)] = freq
    return features
featuresets = [(review_features(review), rating) for (review, rating) in tuples_labeled_reviews]
Split the data into training and testing sets (90% training - 10% testing):
numtrain = int(len(tuples_labeled_reviews) * 90 / 100)
train_set, test_set = featuresets[:numtrain], featuresets[numtrain:]
Train the classifier (LinearSVC):
import nltk.classify
from sklearn.svm import LinearSVC

classifier = nltk.classify.SklearnClassifier(LinearSVC())
classifier.train(train_set)
Evaluate the classifier:
errors = 0
correct = 0
for review, rating in test_set:
    tagged_rating = classifier.classify(review)
    if tagged_rating == rating:
        correct += 1
        print("Correct")
        print("Guess:", tagged_rating)
        print("Correct:", rating)
    else:
        errors += 1
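(nltk also has a helper that computes the same number directly:)

print("Accuracy:", nltk.classify.accuracy(classifier, test_set))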
So far I only get 60% accuracy... What can I do to improve my prediction results? Is there some text/review preprocessing (e.g. removing stopwords/punctuation?) that I am missing beforehand? Can you suggest some other approaches? I am also still a bit confused about whether this is really a classification problem or a regression problem... :/
Please keep the explanations simple, or point me to a "machine learning for dummies" link, or be my mentor - I promise to learn fast! My background in machine learning / language processing / data mining is shallow; I have played with Weka (Java) a few times, but now I need to stick with Python (nltk + scikit-learn)!
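On the classification-vs-regression doubt: the regression framing I have in mind would be something like this sketch (DictVectorizer + Ridge with rounding back to 1-5; the names and the '4 star' label format are my illustrative assumptions, not code I have run):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge

# Same feature dicts as before, but the target is the numeric star value.
vec = DictVectorizer()
X_train = vec.fit_transform([feats for feats, rating in train_set])
y_train = [int(rating[0]) for feats, rating in train_set]  # '4 star' -> 4

model = Ridge()
model.fit(X_train, y_train)

# Predict a real number, then clamp/round it back onto the 1-5 scale.
X_test = vec.transform([feats for feats, rating in test_set])
predictions = [min(5, max(1, int(round(p)))) for p in model.predict(X_test)]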
EDIT:
- I now also extract unigrams as features: unigrams whose POS tag is 'JJ', 'NN', 'VB' or 'RB' (a sketch of both edits is below). This improved the accuracy a little, to 65%.
- I also applied stemming and lemmatization to the text before POS tagging. This raised the accuracy to 70%+.
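A sketch of the two edits above (WordNetLemmatizer/PorterStemmer and the tag filter are my concrete choices; the wordnet corpus must be downloaded):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Lemmatize and stem the tokens, then POS-tag the normalized text
tokens = [stemmer.stem(lemmatizer.lemmatize(tok)) for tok in tokens]
text = nltk.pos_tag(tokens)

# Unigram features: keep words tagged as adjective, noun, verb or adverb
unigrams = [w for w, t in text if t.startswith(('JJ', 'NN', 'VB', 'RB'))]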
EDIT 2:
I have now fed the classifier all of my reviews, the full 48,000, split 90% training / 10% testing, and the accuracy is 91%.
Now I have 32,000 new reviews (also labeled) and fed them all in as a test set; the average accuracy is 62%... My confusion matrix is the image below (I also broke the errors down by how far off they are: ±1 star, ±2, ±3 - it is just an illustration):
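(For reference, the matrix itself comes from something like this sketch - new_reviews is my placeholder name for the 32,000 labeled pairs:)

from sklearn.metrics import confusion_matrix

# new_reviews: placeholder for the 32,000 labeled (review, rating) pairs
new_test_set = [(review_features(r), rating) for r, rating in new_reviews]
y_true = [rating for feats, rating in new_test_set]
y_pred = [classifier.classify(feats) for feats, rating in new_test_set]
print(confusion_matrix(y_true, y_pred))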
