I collected roughly 500,000 unlabeled comments from news pages. The news pages have an anti-foreigner slant, so a relatively large share of the comments contains hate speech.
Do you know how I could start building a dictionary from the comments? The dictionary should contain hate-related words; later I want to use it for a text analysis of the hate.
Since no such German dictionary exists, I tried a few things. For example, I checked how often sentiment-related words occur in the comments, but it turned out they appear about as often as any other word. I also thought about bag-of-words and a few other ideas, but I'm stuck.
I am using Python 3 and R for the analysis.
Many thanks in advance!
If you look at plain word-occurrence probabilities, you will mostly get stop words and other common words. What you are interested in are words that occur more often in the comments (and are therefore presumably hate-related) than in normal usage.
Get a neutral corpus (e.g. German newspapers, the German Wikipedia, maybe Google ngrams for German). Compute the word probabilities in the neutral corpus and in the comments, and look for words with high lift. These are the words that are disproportionately popular in the comments.
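The lift idea above can be sketched in a few lines. This is a minimal illustration, assuming both corpora have already been tokenized into word lists (the toy tokens below are made up); in practice you would also lowercase, strip punctuation, and use a much larger neutral corpus:

```python
from collections import Counter

def lift_scores(comment_tokens, neutral_tokens, min_count=5):
    """Lift of each word: P(word | comments) / P(word | neutral corpus)."""
    c_counts = Counter(comment_tokens)
    n_counts = Counter(neutral_tokens)
    c_total = sum(c_counts.values())
    n_total = sum(n_counts.values())
    scores = {}
    for word, count in c_counts.items():
        if count < min_count:
            continue  # rare words give unreliable ratios
        p_comments = count / c_total
        # add-one smoothing so words unseen in the neutral corpus
        # don't cause a division by zero
        p_neutral = (n_counts[word] + 1) / (n_total + len(n_counts))
        scores[word] = p_comments / p_neutral
    return scores

# toy corpora: stop words appear in both, hate-related tokens only in comments
comments = "hass hass hetze der die das der die".split()
neutral = "der die das der die das und oder".split()

top = sorted(lift_scores(comments, neutral, min_count=1).items(),
             key=lambda kv: -kv[1])
```

Stop words such as "der" end up with lift near 1 and sink to the bottom, while comment-specific words rise to the top; the `min_count` threshold keeps one-off typos out of the dictionary.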
As @chi wrote, many repositories can both give you a head start and help you tune the lift threshold you need (you probably want words that appear much more often in the comments).
After this phase you might need to do a finer analysis. For example, I would guess that politicians' names will appear more often in the comments. See here for a possible approach.
It is not entirely clear whether your dataset carries any kind of markup (such as 'comment', 'neutral', 'positive'). From my point of view and experience, to get a reasonably precise dictionary of any kind, you should take human insight as the source and stick with supervised learning algorithms.
If your dataset does contain such information, you may use Dan Levin's approach, which seems quite promising and comprehensive probability-wise.
Alternatively, you may use advanced vector-space representations of words (word2vec) in the following manner:
Anyway, keep us posted on achieved results :)
You could restate your problem as text classification, "hate vs. neutral or compassion". The standard text classification methods then apply. Get yourself a neutral or "compassion" corpus and label its elements as such. Then run a classification learner pipeline. Its feature dictionary for the "hate" category will be what you are looking for.
If that does not work out of the box, or you don't have a contrasting corpus, you could try to emulate the classifier and do the selection manually. Run the texts through a vectorizer with German stop words; try both TfidfVectorizer and CountVectorizer. Then sort the resulting dictionary by weight, descending, and collect the words manually.