What features can I select from text?

Tags: data mining, machine learning, classification, neural networks, text mining, feature selection

2021-10-03 12:20:45

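Hello, I am very new to data science, machine learning, and Stack Overflow. Please forgive me if I am unclear or ask naive questions.

My question is as follows:

Given any document, I am trying to use a neural network to classify it by the emotion it evokes in the reader. However, I am having trouble choosing features. I am considering using NLTK and RAKE to extract keywords, but I don't know how to turn them into features. Should I hash the keywords into a single feature? Or should I find a dictionary of English words (i.e. WordNet) and use every word in the dictionary as a feature?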

2 Answers

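Using NLTK in Python, you should first tokenize each sentence into words. You can also build the bag of words from n-grams (2-grams or 3-grams). The reason I suggest n-grams: suppose you have a sentence such as I am not happy with this product. A plain word-by-word bag of words separates "not" from "happy" and loses the negation, whereas a 2-gram tokenization gives ['not happy', 'happy with', 'with this', 'this product'] (here I and am are assumed to be STOPWORDS). With HashingTF you can hash a sentence into a feature vector of the form ['word position': frequency of word, ...], which is a highly sparse vector. For hashing in PySpark, see the PySpark documentation.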
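To make the two ideas above concrete, here is a minimal sketch in plain Python of building 2-grams and hashing them into a fixed-size sparse vector. It is not how PySpark's HashingTF is implemented; the helper names `bigrams` and `hash_features`, the bucket count of 1024, and the use of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def bigrams(tokens):
    """Join each pair of adjacent tokens into one 2-gram string."""
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]


def hash_features(terms, num_features=1024):
    """Map each term to a bucket index and count hits per bucket.

    hashlib is used instead of the built-in hash(), because hash() on
    strings is randomized per process and would give unstable indices.
    Only non-zero buckets are stored, so the result is a sparse vector.
    """
    counts = Counter()
    for term in terms:
        digest = hashlib.sha256(term.encode('utf-8')).hexdigest()
        index = int(digest, 16) % num_features
        counts[index] += 1
    return dict(counts)


# Tokens of "I am not happy with this product" with "I" and "am" removed.
tokens = ['not', 'happy', 'with', 'this', 'product']
terms = bigrams(tokens)
sparse_vector = hash_features(terms)

print(terms)
print(sparse_vector)
```

Different terms can land in the same bucket, so a larger `num_features` trades memory for fewer collisions.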

The Python code below will help you tokenize text into a bag of words:

import string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Requires the NLTK tokenizer models and the 'stopwords' corpus
# (install them with nltk.download).
PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

example = ['Hello Krishna Prasad, this is test file for spark testing',
           'Another episode of star',
           'There are far and away many stars',
           'A galloping horse using two coconuts',
           'My kingdom for a horse',
           'A long time ago in a galaxy far']

def tokenize(text):
    # Split the text into word tokens and lowercase them
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    # Strip punctuation characters from every token
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    # Drop stopwords, stem what is left, and discard empty strings
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

tokenized_word = [tokenize(text) for text in example]
for word in tokenized_word:
    print(word)

The code above produces:

$ python WordFrequencyHash.py
['hello', 'krishna', 'prasad', 'test', 'file', 'spark', 'test']
['anoth', 'episod', 'star']
['far', 'away', 'mani', 'star']
['gallop', 'hors', 'use', 'two', 'coconut']
['kingdom', 'hors']
['long', 'time', 'ago', 'galaxi', 'far']

You can also use word2vec or CountVectorizer to turn the text into feature vectors.

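A feature extractor is simply a function that returns a feature value for a given target instance. For example, given the input sentence "I friggin hate regex", you can run a tokenizer over it to break it into a list of words. You can then have a feature extractor function "hasCurseWord(tokens)" that returns true or false to indicate the presence of a curse word (you can keep a predefined dictionary of curse words to compare against). Similarly, you can write a feature extractor numCurseWords that returns the number of curse words in the text. By analogy, you can do the same with positive and negative words.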
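As a rough sketch of the two extractors just described: the word list below is a tiny hypothetical placeholder, and the names `CURSE_WORDS`, `has_curse_word`, `num_curse_words`, and `extract_features` are illustrative rather than part of any library.

```python
# Hypothetical word list for illustration only; a real one would be much larger.
CURSE_WORDS = {'friggin', 'darn', 'heck'}


def has_curse_word(tokens):
    """Return True if at least one token is in the curse-word list."""
    return any(t.lower() in CURSE_WORDS for t in tokens)


def num_curse_words(tokens):
    """Return how many tokens are in the curse-word list."""
    return sum(1 for t in tokens if t.lower() in CURSE_WORDS)


def extract_features(tokens):
    """Collect the individual extractors into one feature dictionary."""
    return {
        'hasCurseWord': has_curse_word(tokens),
        'numCurseWords': num_curse_words(tokens),
    }


# Whitespace splitting stands in for a real tokenizer here.
tokens = 'I friggin hate regex'.split()
features = extract_features(tokens)
print(features)  # {'hasCurseWord': True, 'numCurseWords': 1}
```

Returning a dictionary keeps each feature named, which makes it easy to merge the output of several extractors before handing it to a classifier.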

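So, in answer to your query: in addition to unigrams and possibly phrase n-grams (as described in the other answer), you will also want to add custom feature extractors to improve the accuracy of your emotion classifier. Curse words, for instance, are good emotion indicators.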

Here is what a feature extractor that counts developer-defined negative words looks like in Python:

import re

# Developer-defined negative terms. Each one is used as a case-insensitive
# regular expression and matches anywhere inside a token, so 'horrib'
# matches both 'horrible' and 'horribly'.
NEGATIVE_PATTERNS = ['hate', 'stuck', 'smh', 'angry', 'mad', 'blow', 'trash',
                     'garbage', 'bad', 'worst', '<<<', 'dead', 'die', 'boo',
                     'horrib', 'terrib', 'annoy', 'wrong', 'dump', 'mess']

def featx_negative(tokens, args=None):
    # Count the tokens that match each negative pattern
    num_neg = 0
    for pattern in NEGATIVE_PATTERNS:
        num_neg += len([w for w in tokens if re.search(pattern, w, re.IGNORECASE)])
    # Expose the result as a single boolean feature
    features = {}
    features['Has(NEGATIVE)'] = num_neg > 0
    return features

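The approach above can be called a dictionary approach: in essence, each feature extractor compares the input sentence against an externally defined dictionary of terms. Its limitation is that you have to create the dictionaries by hand yourself (or find some external data source of positive/negative words). In addition, the dictionaries are static (what happens when a new slang term appears?) and they give essentially the same weight to every term.

An alternative is to use a machine learning algorithm trained on manually pre-labeled sentences (for example pos/neg). This approach lets you weight words according to their actual distribution, using for example tf-idf or chi-squared. For more details, see the answer on creating training data.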
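To illustrate what weighting words by their distribution can look like, here is a minimal tf-idf sketch in plain Python. It uses the basic unsmoothed formula (term frequency times log of inverse document frequency); libraries typically use smoothed variants, and the function name `tf_idf` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights.

    docs is a list of token lists; the result is one {term: weight}
    dictionary per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [
    ['this', 'product', 'is', 'bad'],
    ['this', 'product', 'is', 'great'],
    ['this', 'movie', 'is', 'bad'],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Terms that occur in every document, such as "this" and "is", get a weight of zero, while a term unique to one document, such as "great", gets the highest weight. Unlike the dictionary approach, the weights come from the data rather than from a hand-made list.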
