I collected roughly 500,000 unlabeled comments from news pages. The news pages have an anti-foreigner slant, so a relatively large share of the comments contains hate speech.
Do you know how I could start building a dictionary from the comments? The dictionary should contain hate-related words; later I want to use it for a text analysis of the hate.
Since no such German dictionary exists, I tried a few things. For example, I checked how often sentiment-related words occur in the comments, but it turned out they appear about as often as any other word. I also thought about bag-of-words and a few other ideas, but I'm stuck.
I am using Python 3 and R for the analysis.
Many thanks in advance!
If you look at plain word-occurrence probabilities, you will mostly get stop words and other common words. What you are interested in are words that occur more often in the comments (and are therefore presumably hate-related) than in normal usage.
Get a neutral corpus (e.g. German newspapers, the German Wikipedia, maybe Google ngrams for German). Compute the word probabilities in the neutral corpus and in the comments, and look for words with high lift. These are the words that are disproportionately popular in the comments.
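The lift idea above can be sketched in a few lines. This is a minimal illustration, assuming both corpora have already been tokenized into word lists (the toy tokens below are made up); in practice you would also lowercase, strip punctuation, and use a much larger neutral corpus:

```python
from collections import Counter

def lift_scores(comment_tokens, neutral_tokens, min_count=5):
    """Lift of each word: P(word | comments) / P(word | neutral corpus)."""
    c_counts = Counter(comment_tokens)
    n_counts = Counter(neutral_tokens)
    c_total = sum(c_counts.values())
    n_total = sum(n_counts.values())
    scores = {}
    for word, count in c_counts.items():
        if count < min_count:
            continue  # rare words give unreliable ratios
        p_comments = count / c_total
        # add-one smoothing so words unseen in the neutral corpus
        # don't cause a division by zero
        p_neutral = (n_counts[word] + 1) / (n_total + len(n_counts))
        scores[word] = p_comments / p_neutral
    return scores

# toy corpora: stop words appear in both, hate-related tokens only in comments
comments = "hass hass hetze der die das der die".split()
neutral = "der die das der die das und oder".split()

top = sorted(lift_scores(comments, neutral, min_count=1).items(),
             key=lambda kv: -kv[1])
```

Stop words such as "der" end up with lift near 1 and sink to the bottom, while comment-specific words rise to the top; the `min_count` threshold keeps one-off typos out of the dictionary.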
As @chi wrote, many repositories can both give you a head start and help you tune the lift threshold you need (you probably want words that appear much more often in the comments).
After this phase you might need to do a finer analysis. For example, I would guess that politicians' names will appear more often in the comments. See here for a possible approach.
It is not entirely clear whether your dataset carries any kind of markup (such as 'comment', 'neutral', 'positive'). From my point of view and experience, to get a reasonably precise dictionary of any kind, you should take human insight as the source and stick with supervised learning algorithms.
If your dataset does contain such information, you may use Dan Levin's approach, which seems quite promising and comprehensive probability-wise.
Alternatively, you may use advanced vector-space representations of words (word2vec) in the following manner:
Anyway, keep us posted on achieved results :)
You could restate your problem as text classification, "hate vs. neutral or compassion". The standard text classification methods then apply. Get yourself a neutral or "compassion" corpus and label its elements as such. Then run a classification learner pipeline. Its feature dictionary for the "hate" category will be what you are looking for.
If that does not work out of the box, or you don't have a contrasting corpus, you could try to emulate the classifier and do the selection manually. Run the texts through a vectorizer with German stop words; try both TfidfVectorizer and CountVectorizer. Then sort the resulting dictionary by weight, descending, and collect the words manually.