Approaches to classifying text with natural language processing methods

data-mining machine-learning python nlp text-classification

2022-03-15 12:11:48

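I have a question about text classification/categorization. This task has been bugging me for several days, and since I am still new to AI and natural language processing (NLP), I am simply overwhelmed by the online content and the available tools/libraries (e.g. NLTK, Keras, spaCy). It would be great if you could give me some guidance or pointers on how to approach the problem.

Problem: Basically, I am trying to set up a tool that classifies text. I already have an extensive labelled dataset to work with. The input will always be some kind of list (think of an Excel file with 500 rows). Each row contains a single word or a combination of words, i.e. no sentences.

A simplified example of my labelled dataset, with the input on the left and the classification on the right: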

"dog" -> "animal"
"dog owner" -> "person"
"dog owner house" -> "building"
"owner" -> "person"
"dog food" -> "food"
"food court" -> "building"

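My existing labelled dataset has about 2,000 such classifications, with 50 unique categories in total. How can I set up an algorithm that scans the input for, say, the word "dog": if it is only "dog", the category is "animal"; if it is "dog" and "owner", the category is "person"; if it is "dog", "owner" and "house", the category is "building"; and so on?

If I set this up as a decision tree made of a large number of if-else statements, it would be tedious and opaque. Is there a way to solve this kind of problem with NLP?

Thank you very much! I am looking forward to your ideas. Please let me know if I need to be more specific in any way.

Best regards, pythoneer

3 Answers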

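This looks like a multi-class, multi-label problem, and the asker seems happy to build a detailed taxonomy. These two points lead the author to suggest the approach below. Note that a detailed explanation of it can be found in the linked article.

Steps to solve the problem:

1. Build the taxonomy file as a csv file. The column headers must match the names the code reads: Category2, Keywords, AndWords and NotWords.
2. Put all the content in another csv file. The code reads a slno column and takes the text from the second column.
3. In the Python code below, enter the path to the content file in the df line and the path to the taxonomy file in the df_tx line. Both lines appear near the comment "import data for mapping". Add another path for the output at the end of the code.
4. Run the Python code below. Note that the author ran this code on Python 2.7 on a Windows 10 machine. Please resolve any technical issues yourself, as the author may not be of much help with them.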

#Invoke Libraries
import pandas as pd
import numpy as np
import re

#import data for mapping
df = pd.read_csv("path to content csv")
df_tx = pd.read_csv("path to taxonomy csv")

#Build functions
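#function that identifies taxonomy words ending with (*) and treats the * as a wildcard character
def asterix_handler(asterixw, lookupw):
    #returns "T" if any taxonomy word ending in * is a prefix of any document word, else "F"
    #the prefix is matched literally, so regex special characters in a keyword are harmless
    for word in asterixw:
        if word.endswith("*"):
            prefix = word[:-1]
            for lword in lookupw:
                if lword.startswith(prefix):
                    return "T"
    return "F"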

#function that removes all punctuations. helps in creation of set of words
def remov_punct(withpunct):
    #raw string so that the backslash is taken literally
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    return "".join(char for char in withpunct if char not in punctuations)

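#function to remove just the quotes(""). This is for the taxonomy
def remov_quote(withquote):
    return withquote.replace('"', '')

#NOTE: quote_handler is called further down but is not defined in the posted code.
#The version below is an assumed reconstruction, not the author's original: it
#returns "T" if any multi-word taxonomy phrase occurs in the text as whole words.
def quote_handler(quotew, lookupw):
    text = " " + " ".join(str(lookupw).lower().split()) + " "
    for phrase in quotew:
        if " " in phrase and " " + " ".join(phrase.split()) + " " in text:
            return "T"
    return "F"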

#split each document by sentences and stack them one below the other for sentence level categorization and sentiment mapping
rows = []
for d in range(len(df)):
    doc = df.iloc[d,1].split('.')
    for s in doc:
        rows.append({'slno': df['slno'][d], 'text': s})
#build the frame once instead of concatenating inside the loop
sentence_data = pd.DataFrame(rows, columns=['slno','text'])

#drop empty text rows
#assign the result back: an in-place replace on a selected column may not update the frame in recent pandas
sentence_data['text'] = sentence_data['text'].replace('',np.nan)
sentence_data.dropna(subset=['text'], inplace=True)

data = sentence_data
cat2list = list(set(df_tx['Category2']))
data['Category'] = 0
mapped_data = pd.DataFrame(columns = ['slno','text','Category'])

for k in range(len(data)):        
    comment = remov_punct(data.iloc[k,1])
    #split() already drops empty strings; keep a list, because a Python 3 filter object can only be iterated once
    data_words = [x.lower() for x in str(comment).split()]
    output = []

    for l in range(len(df_tx)):
        key_flag = False
        and_flag = False
        not_flag = False
        if (str(df_tx['Keywords'][l])!='nan'):
            kw_clean = (remov_quote(df_tx['Keywords'][l]))
        else:
            #reset, so that the value from the previous taxonomy row is not reused
            kw_clean = df_tx['Keywords'][l]
        if (str(df_tx['AndWords'][l])!='nan'):
            aw_clean = (remov_quote(df_tx['AndWords'][l]))
        else:
            aw_clean = df_tx['AndWords'][l]
        if (str(df_tx['NotWords'][l])!='nan'):
            nw_clean = remov_quote(df_tx['NotWords'][l])
        else:
            nw_clean = df_tx['NotWords'][l]
        key_words = 'nan'
        key_words2 = 'nan'
        and_words = 'nan'
        and_words2 = 'nan'
        not_words = 'nan'
        not_words2 = 'nan'

        if(str(kw_clean)!='nan'):
            key_words = [str(x.strip()).lower() for x in kw_clean.split(',')]
            key_words2 = set(w.lower() for w in key_words)

        if(str(aw_clean)!='nan'):
            and_words = [str(x.strip()).lower() for x in aw_clean.split(',')]
            and_words2 = set(w.lower() for w in and_words)

        if(str(nw_clean)!= 'nan'):
            not_words = [str(x.strip()).lower() for x in nw_clean.split(',')]
            not_words2 = set(w.lower() for w in not_words)

        if(str(kw_clean) == 'nan'):
            key_flag = False        
        else:
            if set(data_words) & key_words2:
                key_flag = True
            elif(bool(re.search('"',df_tx['Keywords'][l]))==True and quote_handler(key_words, comment) == 'T'):
                key_flag = True            
            elif(asterix_handler(key_words2, data_words)=='T'):
                key_flag = True

        if(str(aw_clean)=='nan'):
            and_flag = True
        else:
            if set(data_words) & and_words2:
                and_flag = True
            elif(bool(re.search('"',df_tx['AndWords'][l]))==True and quote_handler(and_words, comment) == 'T'):
                and_flag = True            
            elif(asterix_handler(and_words2, data_words)=='T'):
                and_flag = True

        if(str(nw_clean) == 'nan'):
            not_flag = False
        else:
            if set(data_words) & not_words2:
                not_flag = True
            elif(bool(re.search('"',df_tx['NotWords'][l]))==True and quote_handler(not_words, comment) == 'T'):
                not_flag = True            
            elif(asterix_handler(not_words2, data_words)=='T'):
                not_flag = True

        if(key_flag == True and and_flag == True and not_flag == False):
            output.append(str(df_tx['Category2'][l]))            
            temp = {'slno': [data.iloc[k,0]], 'text': [data.iloc[k,1].strip()], 'Category': [df_tx['Category2'][l]]}
            mapped_data = pd.concat([mapped_data,pd.DataFrame(temp)], sort = False)

#output mapped data
mapped_data = mapped_data[['slno', 'text', 'Category']]   

mapped_data.to_csv("Path here/mapped_data.csv",index = False)

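The final output is a csv file with the columns slno, text and Category, containing one row for every sentence and each category it matched.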
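To relate the keyword / and-word / not-word logic of the script to the data in the question, here is a minimal sketch of the same rule applied to the six example phrases. The taxonomy rows and the function name categorize are made up for illustration; they are not taken from the author's taxonomy file, and the wildcard and quoted-phrase handling is left out.

```python
# Hypothetical taxonomy for the question's examples. The fields mirror the
# Keywords / AndWords / NotWords columns that the script above reads.
TAXONOMY = [
    {"category": "animal", "keywords": {"dog"},
     "and_words": set(), "not_words": {"owner", "food", "house"}},
    {"category": "person", "keywords": {"owner"},
     "and_words": set(), "not_words": {"house"}},
    {"category": "building", "keywords": {"house", "court"},
     "and_words": set(), "not_words": set()},
    {"category": "food", "keywords": {"food"},
     "and_words": set(), "not_words": {"court"}},
]


def categorize(text, taxonomy=TAXONOMY):
    """Return every category whose rule fires for the given text."""
    words = set(text.lower().split())
    matches = []
    for rule in taxonomy:
        key_ok = bool(words & rule["keywords"])
        # An empty and-word set imposes no extra condition.
        and_ok = not rule["and_words"] or bool(words & rule["and_words"])
        not_hit = bool(words & rule["not_words"])
        if key_ok and and_ok and not not_hit:
            matches.append(rule["category"])
    return matches


for phrase in ["dog", "dog owner", "dog owner house",
               "owner", "dog food", "food court"]:
    print(phrase, "->", categorize(phrase))
```

With roughly 2,000 rows and 50 categories, a hand-written rule set like this grows quickly, which is the main trade-off of the taxonomy approach compared with a learned model.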

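You are trying to map hyponyms (words for specific examples of a thing, such as dog) to hypernyms (words for general categories of things, such as animal). For most of your terms this has probably already been done for you in WordNet, so if you want to create a solution quickly (e.g. for business purposes), that is where I would start.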
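The idea of walking from a hyponym up to a hypernym can be sketched without any library. The small dictionary below is a hand-made stand-in for WordNet, and its entries, the target set and the function name to_category are assumptions for illustration only; a real solution would query WordNet itself (for example through NLTK) instead.

```python
# Toy hypernym links standing in for WordNet; the entries are illustrative only.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "animal",
    "owner": "person",
    "house": "building",
    "court": "building",
}

# The categories we want to end up with (taken from the question's examples).
TARGET_CATEGORIES = {"animal", "person", "building", "food"}


def to_category(word, hypernym=HYPERNYM, targets=TARGET_CATEGORIES):
    """Walk up the hypernym chain until a target category is reached."""
    seen = set()
    current = word
    while current not in targets:
        if current in seen or current not in hypernym:
            return None  # no path from this word to any target category
        seen.add(current)
        current = hypernym[current]
    return current


print(to_category("dog"))
print(to_category("unicorn"))
```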

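If this is something you want or need to build your own solution for, here are three suggestions. You could pick one as a baseline and try to improve on it:

Because of the way the English language works, you can probably get a long way by discarding everything except the last word of each training example. You could then take the GloVe embedding of that word and feed it into a small feed-forward neural network.
You could take a generative language model and feed it a dummy sentence fragment that includes the phrase you are trying to classify (e.g. "dog house" -> "A dog house is a type of"), then train on some embedding of the predicted next word.
You could embed all your example phrases using the hashing trick and then train a linear model from scratch. Never underestimate linear models!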
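As a rough illustration of the third suggestion, the sketch below hashes each word into a fixed number of buckets and trains a multi-class perceptron, which is one simple kind of linear model. The bucket count, the choice of crc32 as the hash function, the perceptron itself and all function names are assumptions made for this example; the answer does not prescribe any of them.

```python
import zlib

N_BUCKETS = 2 ** 16  # size of the hashed feature space (an arbitrary choice)

TRAIN = [
    ("dog", "animal"),
    ("dog owner", "person"),
    ("dog owner house", "building"),
    ("owner", "person"),
    ("dog food", "food"),
    ("food court", "building"),
]


def hash_features(phrase, n_buckets=N_BUCKETS):
    """Map a phrase to sparse {bucket: count} features via the hashing trick."""
    features = {}
    for token in phrase.lower().split():
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        features[index] = features.get(index, 0.0) + 1.0
    features[n_buckets] = 1.0  # bias feature in a slot no token can hash to
    return features


def score(weights, features):
    """Dot product between one class's weights and a sparse feature dict."""
    return sum(weights.get(i, 0.0) * v for i, v in features.items())


def predict(model, phrase):
    """Pick the highest-scoring label; sorted() keeps tie-breaking deterministic."""
    features = hash_features(phrase)
    return max(sorted(model), key=lambda label: score(model[label], features))


def train_perceptron(examples, max_epochs=500):
    """Multi-class perceptron: on a mistake, move weights toward the true label."""
    model = {label: {} for _, label in examples}
    for _ in range(max_epochs):
        mistakes = 0
        for phrase, gold in examples:
            guess = predict(model, phrase)
            if guess != gold:
                mistakes += 1
                for i, v in hash_features(phrase).items():
                    model[gold][i] = model[gold].get(i, 0.0) + v
                    model[guess][i] = model[guess].get(i, 0.0) - v
        if mistakes == 0:  # every training example is classified correctly
            break
    return model


model = train_perceptron(TRAIN)
for phrase, _ in TRAIN:
    print(phrase, "->", predict(model, phrase))
```

Because the features are hashed, no vocabulary has to be stored, and unseen words simply land in buckets whose weights are still zero.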

Good luck!

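The first thing I would try is simply to represent your features with a one-hot encoding. One-hot encoding consists of representing a sentence (in your case, a sequence of words) as a sparse vector. If you are familiar with Python, sklearn already implements this; it is called DictVectorizer.

# the vector length is the total number of distinct words across all your word sequences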
"dog owner" --> [0, 0, 0, 0, 1, 0, 0, 1, ....]

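Then I would train several models (random forest, naive Bayes, multilayer perceptron, support vector machine, etc.) to check which one works best. SVMs usually work well with sparse features, but that does not mean they will also give the best results in your case, which is why the only way to know is to train several models.

A more advanced technique (though not in terms of coding) is to use embedding vectors and deep learning. You can use pretrained vectors, GloVe vectors being the standard choice, as input to a convolutional neural network (this architecture usually works well for short texts and classification tasks, and it is fast to train).

As a final consideration, from what I see in your examples, the label seems to be related only to the last word of the sequence. If this is consistent across the whole dataset, then another trick that might avoid deep learning entirely is to still use embedding vectors but only compute some semantic similarity score, such as cosine similarity. If the labels are semantically different enough, it may be possible to predict the label just by computing the similarity between the final word of each sequence and each label, and then choosing the label with the highest score.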
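The one-hot representation described above, which is what each of those models would take as input, can be sketched in plain Python as follows. The function names are made up for this example, and the vocabulary is built from the six phrases in the question; in practice a library vectorizer would do the same job.

```python
PHRASES = ["dog", "dog owner", "dog owner house",
           "owner", "dog food", "food court"]


def build_vocabulary(phrases):
    """Assign each distinct word a column index, in alphabetical order."""
    words = sorted({word for phrase in phrases
                    for word in phrase.lower().split()})
    return {word: index for index, word in enumerate(words)}


def encode(phrase, vocabulary):
    """Binary bag-of-words vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word in phrase.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector


vocabulary = build_vocabulary(PHRASES)
print(vocabulary)                       # column index of every known word
print(encode("dog owner", vocabulary))  # 1 in the "dog" and "owner" columns
```

Note that this encoding ignores word order, so "dog owner" and "owner dog" get the same vector.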
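The last-word similarity trick could look roughly like this. The 4-dimensional vectors below are invented by hand so that the example runs on its own; they only stand in for real pretrained embeddings such as GloVe, and the function names are assumptions as well.

```python
import math

# Hand-made toy vectors standing in for pretrained embeddings (not real GloVe values).
TOY_VECTORS = {
    "animal":   (0.9, 0.1, 0.0, 0.1),
    "person":   (0.1, 0.9, 0.1, 0.0),
    "building": (0.0, 0.1, 0.9, 0.1),
    "food":     (0.1, 0.0, 0.1, 0.9),
    "dog":      (0.8, 0.2, 0.0, 0.1),
    "owner":    (0.1, 0.8, 0.2, 0.0),
    "house":    (0.0, 0.2, 0.8, 0.0),
    "court":    (0.0, 0.3, 0.7, 0.1),
}

LABELS = ["animal", "person", "building", "food"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_label(phrase, vectors=TOY_VECTORS, labels=LABELS):
    """Label the phrase by comparing its last word with every label word."""
    last_word = phrase.lower().split()[-1]
    if last_word not in vectors:
        return None  # no embedding available for this word
    return max(labels,
               key=lambda label: cosine(vectors[last_word], vectors[label]))


for phrase in ["dog", "dog owner", "dog owner house", "dog food", "food court"]:
    print(phrase, "->", predict_label(phrase))
```

No training is involved here, so the quality depends entirely on how well the embeddings separate the label words.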
