Question / main goal / TL;DR: train a classifier, then feed it a random review and get the corresponding predicted review rating (1 to 5 stars) - only 60% accuracy! :(
I have a large dataset of about 48,000 technical product reviews (from many different authors and about different products - which is not that important(?)) with the corresponding ratings (1 to 5 stars). I randomly picked some reviews from each class:
1 star: 173 reviews (could not pick 1000 because there were only 173)
2 stars: 1000 reviews
3 stars: 1000 reviews
4 stars: 1000 reviews
5 stars: 1000 reviews
Total: 4173 reviews - this data is organized in one file (all_reviews_labeled.txt) as tuples, one review and its rating per line:
('review text', 'x star')
('review text', 'x star')
('review text', 'x star')
('review text', 'x star')
…
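To load it I do something like this minimal sketch (it assumes each line really is a Python tuple literal, as above):

import ast

# Build the (review, rating) pairs consumed by the steps below;
# assumes one ('review text', 'x star') tuple literal per line.
tuples_labeled_reviews = []
with open('all_reviews_labeled.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            tuples_labeled_reviews.append(ast.literal_eval(line))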
My first "naive" approach was:
Tokenize review text
POS tagging
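For these two steps I just use nltk's defaults - a sketch (it assumes the punkt and averaged_perceptron_tagger models are downloaded):

import nltk

# review_text is one raw review string from tuples_labeled_reviews
tokens = nltk.word_tokenize(review_text)  # step 1: tokenize
text = nltk.pos_tag(tokens)               # step 2: (word, POS-tag) pairs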
Get the most frequent bigrams that follow some POS-tag rules over the most frequent trigrams (I have seen these rules - using these POS patterns - in "Automatic Star-rating Generation from Text Reviews", p. 7, a paper by Chong-U Lim, Pablo Ortiz and Sang-Woo Jun):
import nltk

# POS tag sets used by the extraction rules below
ADJ = {'JJ', 'JJR', 'JJS'}
ADV = {'RB', 'RBR', 'RBS'}
NOUN = {'NN', 'NNS'}
VERB = {'VB', 'VBD', 'VBG', 'VBN'}

bigrams = []
# text is the list of (word, POS-tag) pairs from nltk.pos_tag()
for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(text):
    if t1 in ADJ and t2 in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in ADV and t2 in ADJ and t3 not in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in ADJ and t2 in ADJ and t3 not in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in NOUN and t2 in ADJ and t3 not in NOUN:
        bigrams.append(w1 + ' ' + w2)
    elif t1 in ADV and t2 in VERB:
        bigrams.append(w1 + ' ' + w2)
    elif t1 == 'DT' and t2 in ADJ:
        bigrams.append(w1 + ' ' + w2)
    elif t1 == 'VBZ' and t2 in ADJ:
        bigrams.append(w1 + ' ' + w2)
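The "most frequent" part comes from a frequency distribution over the extracted bigrams - a minimal sketch, assuming nltk.FreqDist (word_features is the name the next step consumes; treating it as per-review is my reading of the pipeline):

# Rank the bigrams extracted from one review by frequency;
# word_features feeds review_features() in the next step.
word_features = nltk.FreqDist(bigrams).most_common()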
Extract features (here is where I have the most doubts - should I only look at these two features?):
def review_features(review):
    # word_features: the (bigram, frequency) pairs built above for this review
    features = {}
    for bigram, freq in word_features:
        features['contains({})'.format(bigram)] = True
        features['count({})'.format(bigram)] = freq
    return features
featuresets = [(review_features(review), rating) for (review, rating) in tuples_labeled_reviews]
Split the data into training and testing sets (90% training - 10% testing):
numtrain = int(len(tuples_labeled_reviews) * 90 / 100)
train_set, test_set = featuresets[:numtrain], featuresets[numtrain:]
Train the classifier (LinearSVC):
import nltk.classify
from sklearn.svm import LinearSVC

classifier = nltk.classify.SklearnClassifier(LinearSVC())
classifier.train(train_set)
Evaluate the classifier:
errors = 0
correct = 0
for review, rating in test_set:
    tagged_rating = classifier.classify(review)
    if tagged_rating == rating:
        correct += 1
        print("Correct")
        print("Guess:", tagged_rating)
        print("Correct:", rating)
    else:
        errors += 1
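(nltk also has a helper that computes the same number directly:)

print("Accuracy:", nltk.classify.accuracy(classifier, test_set))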
So far I only get 60% accuracy... What can I do to improve my prediction results? Is there some text/review preprocessing (e.g. removing stopwords/punctuation?) that I am missing beforehand? Can you suggest some other approaches? I am also still a bit confused about whether this is really a classification problem or a regression problem... :/
Please keep the explanations simple, or point me to a "machine learning for dummies" link, or be my mentor - I promise to learn fast! My background in machine learning / language processing / data mining is shallow; I have played with Weka (Java) a few times, but now I need to stick with Python (nltk + scikit-learn)!
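On the classification-vs-regression doubt: the regression framing I have in mind would be something like this sketch (DictVectorizer + Ridge with rounding back to 1-5; the names and the '4 star' label format are my illustrative assumptions, not code I have run):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge

# Same feature dicts as before, but the target is the numeric star value.
vec = DictVectorizer()
X_train = vec.fit_transform([feats for feats, rating in train_set])
y_train = [int(rating[0]) for feats, rating in train_set]  # '4 star' -> 4

model = Ridge()
model.fit(X_train, y_train)

# Predict a real number, then clamp/round it back onto the 1-5 scale.
X_test = vec.transform([feats for feats, rating in test_set])
predictions = [min(5, max(1, int(round(p)))) for p in model.predict(X_test)]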
EDIT:
- I now also extract unigrams as features: unigrams whose POS tag is 'JJ', 'NN', 'VB' or 'RB' (a sketch of both edits is below). This improved the accuracy a little, to 65%.
- I also applied stemming and lemmatization to the text before POS tagging. This raised the accuracy to 70%+.
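A sketch of the two edits above (WordNetLemmatizer/PorterStemmer and the tag filter are my concrete choices; the wordnet corpus must be downloaded):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Lemmatize and stem the tokens, then POS-tag the normalized text
tokens = [stemmer.stem(lemmatizer.lemmatize(tok)) for tok in tokens]
text = nltk.pos_tag(tokens)

# Unigram features: keep words tagged as adjective, noun, verb or adverb
unigrams = [w for w, t in text if t.startswith(('JJ', 'NN', 'VB', 'RB'))]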
EDIT 2:
I have now fed the classifier all of my reviews, the full 48,000, split 90% training / 10% testing, and the accuracy is 91%.
Now I have 32,000 new reviews (also labeled) and fed them all in as a test set; the average accuracy is 62%... My confusion matrix is the image below (I also broke the errors down by how far off they are: ±1 star, ±2, ±3 - it is just an illustration):
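(For reference, the matrix itself comes from something like this sketch - new_reviews is my placeholder name for the 32,000 labeled pairs:)

from sklearn.metrics import confusion_matrix

# new_reviews: placeholder for the 32,000 labeled (review, rating) pairs
new_test_set = [(review_features(r), rating) for r, rating in new_reviews]
y_true = [rating for feats, rating in new_test_set]
y_pred = [classifier.classify(feats) for feats, rating in new_test_set]
print(confusion_matrix(y_true, y_pred))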
