我建议测试句子或推文的极性。这可以使用textblob库来完成。它可以安装为pip install -U textblob. 一旦找到文本数据极性,就可以将其分配为数据框中的单独列。随后,可以使用句子极性进行进一步分析。
极性和主观性定义为;
极性是 [-1.0 到 1.0] 范围内的浮点值,其中 0 表示中性,+1 表示非常积极的情绪,-1 表示非常消极的情绪。
主观性是 [0.0 到 1.0] 范围内的浮点值,其中 0.0 非常客观,1.0 非常主观。主观句表达一些个人的感受、观点、信念、意见、指控、愿望、信念、怀疑和推测,而客观句是事实。
数据
import pandas as pd
# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
"ID":[1,2,3,4,5],
"Tweet":["I Hate Migrants",
"#trump said he is ok", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)
注意,情绪列是一个元组。所以我们可以把它分成两列,比如,df1=pd.DataFrame(df['sentiment'].tolist(), index= df.index)。现在,我们可以创建一个新的数据框,我将在其中添加拆分列,如图所示;
df_new = df
df_new['polarity'] = df1['polarity']
df_new.polarity = df1.polarity.astype(float)
df_new['subjectivity'] = df1['subjectivity']
df_new.subjectivity = df1.polarity.astype(float)
接下来,根据之前找到的句子极性,我们现在可以在数据帧中添加一个标签,这将表明推文/句子是假的,不是假的或中性的。
import numpy as np
conditionList = [
df_new['polarity'] == 0,
df_new['polarity'] > 0,
df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
结果将如下所示;
结果
Date ID Tweet sentiment polarity subjectivity label
0 1/10/2020 1 I Hate Migrants (-0.8, 0.9) -0.8 -0.8 fake
1 2/10/2020 2 #trump said he is ok (0.5, 0.5) 0.5 0.5 not_fake
2 3/10/2020 3 the sky is blue (0.0, 0.1) 0.0 0.0 neutral
3 4/10/2020 4 the weather is bad (-0.68, 0.66) -0.7 -0.7 fake
4 5/10/2020 5 i love apples (0.5, 0.6) 0.5 0.5 not_fake
完整代码
import pandas as pd
import numpy as np
from textblob import TextBlob
data = {"Date":["1/10/2020","2/10/2020","3/10/2020","4/10/2020","5/10/2020"],
"ID":[1,2,3,4,5],
"Tweet":["I Hate Migrants",
"#trump said he is ok", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)
# print(df)
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
# print(df)
# split the sentiment column into two
df1=pd.DataFrame(df['sentiment'].tolist(), index= df.index)
# append cols to original dataframe
df_new = df
df_new['polarity'] = df1['polarity']
df_new.polarity = df1.polarity.astype(float)
df_new['subjectivity'] = df1['subjectivity']
df_new.subjectivity = df1.polarity.astype(float)
# print(df_new)
# add label to dataframe based on condition
conditionList = [
df_new['polarity'] == 0,
df_new['polarity'] > 0,
df_new['polarity'] < 0]
choiceList = ['neutral', 'not_fake', 'fake']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)