How can I classify a new email as spam or not spam?

Tags: data mining, machine learning, Python, classification, scikit-learn, naive Bayes classifier

2021-10-07 02:53:48

I am working on a small exercise to determine whether an email is spam. My dataset looks like this:

                       Email                                                   Spam
0   Hi, I am Andrew and I want too buy VIAGRA                                   1
1   Dear subscriber, your account will be closed                                1
2   Please click below to verify and access email restore                       1
3   Hi Anne, I miss you so much! Can’t wait to see you                          0
4   Dear Professor Johnson, I was unable to attend class today                  0
5   I am pleased to inform you that you have won our grand prize.               1
6   I can’t help you with that cuz it’s too hard.                               0
7   I’m sorry to tell you but im sick and will not be able to come to class.    0
8   Can I see an example before all are shipped or will that cost extra?        0
9   I appreciate your assistance and look forward to hearing back from you.     0

Here 1 means spam and 0 means not spam. This is what I have tried:

# Tokenization
import string
from nltk.corpus import stopwords

def fun(text):
    # Remove punctuation
    remove_punc = [c for c in text if c not in string.punctuation]
    remove_punc = ''.join(remove_punc)

    # Remove stop words
    cleaned = [w for w in remove_punc.split() if w.lower() not in stopwords.words('english')]

    return cleaned
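To see what this kind of tokenizer returns, here is a minimal sketch that needs no nltk download. The name `tokenize` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins for `fun` and nltk's English stop-word list, so the real output will drop more words than this one does.

```python
import string

# Hypothetical, deliberately tiny stop-word set used only for this demo;
# the code above uses nltk's stopwords.words('english') instead.
DEMO_STOPWORDS = {"i", "am", "and", "to", "you", "so", "see"}

def tokenize(text, stop_words=DEMO_STOPWORDS):
    """Strip ASCII punctuation, split on whitespace, drop stop words."""
    no_punc = "".join(c for c in text if c not in string.punctuation)
    return [w for w in no_punc.split() if w.lower() not in stop_words]

tokens_spam = tokenize("Hi, I am Andrew and I want too buy VIAGRA")
tokens_ham = tokenize("Hi Anne, I miss you so much! Can’t wait to see you")

print(tokens_spam)
print(tokens_ham)
```

Note that `string.punctuation` only contains ASCII characters, so the typographic apostrophe in "Can’t" is not removed and the word survives as a single token.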

So I applied the function: df['Email'].apply(fun). Then I converted the text into a matrix as follows:

from sklearn.feature_extraction.text import CountVectorizer
mex = CountVectorizer(analyzer= fun).fit_transform(df['Email'])
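To make the matrix concrete, here is a small sketch on the first two emails. It is an illustration rather than the code above: it assumes `CountVectorizer`'s default analyzer (lowercasing, tokens of two or more word characters) instead of the custom `fun`, and the names `docs`, `demo_vectorizer` and `dense` are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Hi, I am Andrew and I want too buy VIAGRA",
    "Dear subscriber, your account will be closed",
]

# Default analyzer: lowercases the text and keeps tokens of 2+ word characters,
# so the single-letter word "I" is dropped.
demo_vectorizer = CountVectorizer()
counts = demo_vectorizer.fit_transform(docs)  # sparse matrix, one row per email

# vocabulary_ maps each term to its column index in the matrix.
vocab = sorted(demo_vectorizer.vocabulary_)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each row is one email and each column is one vocabulary term; the entry is how often that term occurs in that email.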

and split the dataset into training and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(mex, df['Spam'], test_size=0.25, random_state=0)  # target is the Spam column

I applied a classifier (I would like to use logistic regression to decide whether an email is spam, but for now I am only using naive Bayes):

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

Finally, I fit the classifier on the training set and then applied it to the test set:

from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

pred = classifier.predict(X_test)

print(classification_report(y_test ,pred ))
print('Confusion Matrix: \n', confusion_matrix(y_test,pred))
print()
print('Accuracy: ', accuracy_score(y_test,pred))

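The code works, but I would like to know how to check, for a new email, whether it gets label 1 or 0. For example, given the new email 'Hi, my name is Christopher and I like VIAGRA', how do I determine its label/class?

I feel that I am missing something, or that I may be going about this the wrong way.

My question is the following:

Given this new email: Hi, my name is Christopher and I like VIAGRA, how can I tell whether it is spam? I thought of classification, but my approach may be wrong. I would like something like this: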

Email                                        Spam 
... 
Hi, my name is Christopher and I like VIAGRA 1

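because it is very similar to the email 'Hi, I am Andrew and I want too buy VIAGRA' (provided that email is in the training set, or is predicted correctly in the test set).

Maybe what I want to do only needs the tf-idf algorithm or a different approach. Any suggestions would be appreciated.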

1 Answer

I modified your code so that it runs as a single block and is set up to predict new data:

import string

from nltk.corpus import stopwords
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
from sklearn.naive_bayes import MultinomialNB

#  Define training data
df = pd.DataFrame(data={'Email': [
"Hi, I am Andrew and I want too buy VIAGRA",
"Dear subscriber, your account will be closed",
"Please click below to verify and access email restore",
"Hi Anne, I miss you so much! Can’t wait to see you",
"Dear Professor Johnson, I was unable to attend class today",
"I am pleased to inform you that you have won our grand prize.",
"I can’t help you with that cuz it’s too hard.",
"I’m sorry to tell you but im sick and will not be able to come to class.",
"Can I see an example before all are shipped or will that cost extra?",
"I appreciate your assistance and look forward to hearing back from you.",], 
'Spam': [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]})

# Load the stop-word list once (requires a one-time nltk.download('stopwords'))
STOP_WORDS = set(stopwords.words('english'))

def fun(text):
    # Remove punctuation
    remove_punc = [c for c in text if c not in string.punctuation]
    remove_punc = ''.join(remove_punc)

    # Remove stop words
    cleaned = [w for w in remove_punc.split() if w.lower() not in STOP_WORDS]

    return cleaned

# Create a vectorizer object to enable both fit_transform and just transform
vectorizer = CountVectorizer(analyzer=fun)
X = vectorizer.fit_transform(df['Email'])

X_train, X_test, y_train, y_test = train_test_split(X, df['Spam'], test_size = 0.25, random_state = 0)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)

print(classification_report(y_test ,pred ))
print('Confusion Matrix: \n', confusion_matrix(y_test,pred))
print()
print('Accuracy: ', accuracy_score(y_test,pred))
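With only ten emails, `test_size=0.25` leaves just three emails in the test set, so the report above is very noisy. One option is stratified cross-validation, sketched below. This is an illustration under assumptions, not the code above: scikit-learn's built-in English stop-word list replaces nltk's, and a pipeline with the default tokenizer replaces `fun`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Hi, I am Andrew and I want too buy VIAGRA",
    "Dear subscriber, your account will be closed",
    "Please click below to verify and access email restore",
    "Hi Anne, I miss you so much! Can’t wait to see you",
    "Dear Professor Johnson, I was unable to attend class today",
    "I am pleased to inform you that you have won our grand prize.",
    "I can’t help you with that cuz it’s too hard.",
    "I’m sorry to tell you but im sick and will not be able to come to class.",
    "Can I see an example before all are shipped or will that cost extra?",
    "I appreciate your assistance and look forward to hearing back from you.",
]
labels = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# Assumption: sklearn's built-in English stop-word list stands in for nltk's.
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

# At most 4 folds, because the spam class has only 4 examples.
cv = StratifiedKFold(n_splits=4)
scores = cross_val_score(pipeline, emails, labels, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of every fold, so no vocabulary leaks from the held-out emails.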

Here is how to predict on new data:

# Given a new email
new_email = "Hi, my name is Christopher and I like VIAGRA"

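# Pass the raw string: the vectorizer already applies fun() through its analyzer.
# Calling fun() first would preprocess the text twice and merge all tokens into one unknown word.
X_new = vectorizer.transform([new_email])

# Predict the new email with the already trained classifier
print(classifier.predict(X_new))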
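To get the table-style output the question asks for, the prediction can be put next to the email text in a DataFrame. The sketch below is self-contained and makes two assumptions that differ from the code above: it uses scikit-learn's built-in English stop-word list instead of nltk, and it fits on all ten emails without a hold-out set. The `P(spam)` column name is made up for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Hi, I am Andrew and I want too buy VIAGRA",
    "Dear subscriber, your account will be closed",
    "Please click below to verify and access email restore",
    "Hi Anne, I miss you so much! Can’t wait to see you",
    "Dear Professor Johnson, I was unable to attend class today",
    "I am pleased to inform you that you have won our grand prize.",
    "I can’t help you with that cuz it’s too hard.",
    "I’m sorry to tell you but im sick and will not be able to come to class.",
    "Can I see an example before all are shipped or will that cost extra?",
    "I appreciate your assistance and look forward to hearing back from you.",
]
labels = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# The pipeline takes raw strings, so new emails need no manual preprocessing.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(emails, labels)

new_emails = ["Hi, my name is Christopher and I like VIAGRA"]

result = pd.DataFrame({
    "Email": new_emails,
    "Spam": model.predict(new_emails),
    # Column 1 of predict_proba is P(class 1), since classes_ is sorted as [0, 1].
    "P(spam)": model.predict_proba(new_emails)[:, 1],
})
print(result)
```

Since the question also mentions tf-idf and logistic regression, the same two pipeline steps could be swapped for `TfidfVectorizer()` and `LogisticRegression()` without changing the rest of the sketch.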
