I am working on a small exercise to determine whether an email is spam. My dataset is as follows:
Email Spam
0 Hi, I am Andrew and I want too buy VIAGRA 1
1 Dear subscriber, your account will be closed 1
2 Please click below to verify and access email restore 1
3 Hi Anne, I miss you so much! Can’t wait to see you 0
4 Dear Professor Johnson, I was unable to attend class today 0
5 I am pleased to inform you that you have won our grand prize. 1
6 I can’t help you with that cuz it’s too hard. 0
7 I’m sorry to tell you but im sick and will not be able to come to class. 0
8 Can I see an example before all are shipped or will that cost extra? 0
9 I appreciate your assistance and look forward to hearing back from you. 0
where 1 means spam and 0 means not spam. Here is what I have tried:
import string
from nltk.corpus import stopwords

# Tokenization
def fun(text):
    # Remove punctuation
    remove_punc = [c for c in text if c not in string.punctuation]
    remove_punc = ''.join(remove_punc)
    # Remove stop words
    cleaned = [w for w in remove_punc.split() if w.lower() not in stopwords.words('english')]
    return cleaned
I then applied the function with df['Email'].apply(fun) and converted the text to a matrix as follows:
from sklearn.feature_extraction.text import CountVectorizer
mex = CountVectorizer(analyzer= fun).fit_transform(df['Email'])
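To make the matrix step concrete, here is a small sketch of what CountVectorizer produces for two of the emails above. It is only an illustration under some assumptions: it uses scikit-learn's built-in stop_words='english' list instead of the fun analyzer so that the snippet is self-contained, and it stores the fitted vectorizer in a variable (named vectorizer here, a name that is not in the code above) so that the learned vocabulary stays available.

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "Hi, I am Andrew and I want too buy VIAGRA",
    "Dear subscriber, your account will be closed",
]

# Assumption: built-in English stop-word list instead of the custom analyzer.
# Keeping a reference to the vectorizer preserves the fitted vocabulary.
vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(emails)

# vocabulary_ maps each kept token to its column index in the matrix.
print(sorted(vectorizer.vocabulary_))
# One row per email, one column per token; each entry is a token count.
print(matrix.toarray())
```

Each row of the matrix corresponds to one email, and each column to one token of the vocabulary learned during fitting.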
and split the dataset into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(mex, df['Spam'], test_size=0.25, random_state=0)
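I applied a classifier (I will eventually use logistic regression to determine whether an email is spam, but for now I am only using Naive Bayes):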
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
Finally, I apply the classifier first to the training set and then to the test set:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
pred = classifier.predict(X_test)
print(classification_report(y_test ,pred ))
print('Confusion Matrix: \n', confusion_matrix(y_test,pred))
print()
print('Accuracy: ', accuracy_score(y_test,pred))
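The code works, but I would like to know how to see directly, for a new email, whether it gets label 1 or 0. For example, if I have a new email 'Hi, my name is Christopher and I like VIAGRA', how can I determine its label/class?
I feel that I am missing something, or that I may be going about this the wrong way.
My question is the following:
Given this new email: Hi, my name is Christopher and I like VIAGRA, how can I see whether it is spam? I have thought about classification, but my approach may be wrong. I would like something like this: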
Email Spam
...
Hi, my name is Christopher and I like VIAGRA 1
because it is very similar to the email 'Hi, I am Andrew and I want too buy VIAGRA' (provided that email is in the training set or is predicted correctly in the test set).
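A minimal sketch of one way this could be checked is shown below. It rests on several assumptions that are not part of my code above: the ten emails are held in plain lists rather than a DataFrame, all ten are used for training because the data set is so small, scikit-learn's built-in stop-word list replaces the custom analyzer, and the names vectorizer, new_emails and X_new are purely illustrative. The key point is that the new text is passed through transform on the already fitted vectorizer, so that it is encoded with the same vocabulary the classifier was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The ten labelled emails from the data set (1 = spam, 0 = not spam).
emails = [
    "Hi, I am Andrew and I want too buy VIAGRA",
    "Dear subscriber, your account will be closed",
    "Please click below to verify and access email restore",
    "Hi Anne, I miss you so much! Can’t wait to see you",
    "Dear Professor Johnson, I was unable to attend class today",
    "I am pleased to inform you that you have won our grand prize.",
    "I can’t help you with that cuz it’s too hard.",
    "I’m sorry to tell you but im sick and will not be able to come to class.",
    "Can I see an example before all are shipped or will that cost extra?",
    "I appreciate your assistance and look forward to hearing back from you.",
]
labels = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# Fit the vectorizer once and keep it, so the vocabulary can be reused.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)

# Assumption: train on all ten rows, since this is only an illustration.
classifier = MultinomialNB()
classifier.fit(X, labels)

new_emails = ["Hi, my name is Christopher and I like VIAGRA"]
# transform (not fit_transform) encodes new text with the learned vocabulary;
# words never seen during fitting are simply ignored.
X_new = vectorizer.transform(new_emails)

predicted = classifier.predict(X_new)
probabilities = classifier.predict_proba(X_new)

# Print each new email with its predicted label and class probabilities.
for text, label, proba in zip(new_emails, predicted, probabilities):
    print(text, label, proba)
```

With so few training emails the predicted label and the probabilities should be read with caution; predict_proba at least shows how confident the model is in either class.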
Maybe what I am trying to do only needs a tf-idf algorithm or a different approach. Any advice will be greatly appreciated.
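To illustrate the tf-idf idea, here is a rough sketch that measures how similar the new email is to two of the known ones using TfidfVectorizer and cosine similarity. This is a similarity check, not a classifier, and the choice of the two comparison emails, the built-in stop-word list and the variable names are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_emails = [
    "Hi, I am Andrew and I want too buy VIAGRA",                   # labelled spam
    "Dear Professor Johnson, I was unable to attend class today",  # labelled not spam
]
new_email = "Hi, my name is Christopher and I like VIAGRA"

# Fit tf-idf weights on the known emails only.
tfidf = TfidfVectorizer(stop_words="english")
known_vectors = tfidf.fit_transform(known_emails)
# Encode the new email with the same vocabulary and weights.
new_vector = tfidf.transform([new_email])

# Cosine similarity between the new email and each known email.
similarities = cosine_similarity(new_vector, known_vectors)[0]
for text, score in zip(known_emails, similarities):
    print(f"{score:.3f}  {text}")
```

The new email shares tokens only with the first known email, so its similarity to that one is the higher of the two; a nearest-neighbour style rule could then borrow the label of the most similar email.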