How do I use an NLTK Text object with the re library?

data-mining nlp nltk
2022-02-19 18:29:56

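I am trying to build a bag-of-words model from my text file. I want to use the re.sub function from the re library, but I get the following error:

TypeError: expected string or bytes-like object

I wrote the following code: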

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

""" importing data set """
from nltk.corpus import PlaintextCorpusReader
corpus_root = './'
wordlists = PlaintextCorpusReader(corpus_root, 'evidence1.txt')
wordlists.fileids()
wordlists.words('evidence1.txt')
stringx = wordlists.words(wordlists.fileids()[0])
print (stringx) 

""" cleaning the texts """
from nltk.text import Text
text = Text(stringx)
print(text)

import re
lowerx = re.sub('[^a-zA-Z0-9]','',text)

I think I need to pass the right kind of object to re.sub.

1 Answer

You are passing the variable text (which holds an nltk.Text object) to re.sub, which expects a string (str) object, not an nltk.Text object.

Contrary to what the name of the Text class might suggest, it is not itself a string-like object; it only stores strings. That may be what misled you.
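The difference can be reproduced without NLTK at all. The following is a minimal sketch that uses a plain list of strings as a stand-in for the tokens a Text object holds (the sample tokens are made up for illustration): passing the container to re.sub raises the same TypeError, while passing an actual string works.

```python
import re

# Hypothetical sample tokens, standing in for what a Text object stores.
tokens = ["Hello", ",", "world", "!"]

# Passing the container itself fails: re.sub needs a str (or bytes) to scan.
try:
    re.sub('[^a-zA-Z0-9]', '', tokens)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Passing a real string works as expected.
cleaned = re.sub('[^a-zA-Z0-9]', '', "Hello, world!")
print(cleaned)
```

The exact wording of the error message can vary slightly between Python versions, but it should always mention that a string or bytes-like object was expected.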

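The strings stored in the Text object are probably what you need when calling the re-related functions. You can access them in the usual Pythonic way by iterating over the instance with a for loop, e.g. for word in theText: .... See the code below for a complete example.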

import re
import random
from nltk.text import Text

# Initialize a dummy text whose words are integers.
# With probability 0.5, we append a non-alphanumeric
# character to each word so that we can test the regular
# expression from the original example.
text = Text([
    str(random.randrange(1, 100)) + (
        '' if random.random() >= 0.5 else 
        random.choice(['!', ',', '?', ':', ';'])
    )
    for _ in range(20)
])

# The tokens in the `Text` object can be accessed
# via the `tokens` attribute or by iterating over it
# (a `Text` instance is iterable).
print(text)
print(text.tokens[:10])
print(list(text)[:10])
print([x for x in text][:10])

# The original line is commented out to avoid the exception.
# We can't pass the variable `text` to `re.sub` because
# it expects a string, and `text` holds an instance of
# NLTK's `Text` class.
#lowerx = re.sub('[^a-zA-Z0-9]','',text)

# Test for the expected behavior:
for token in text:
    lowertoken = re.sub('[^a-zA-Z0-9]', '', token)
    print('<input="%s"  output="%s">' % (token, lowertoken))
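Since the question's goal was a bag-of-words model, the per-token cleaning above could feed directly into a word count. Here is a rough sketch, assuming that lowercasing and dropping tokens that become empty after cleaning is what you want; the helper name clean_tokens and the sample tokens are made up for illustration, and a plain list stands in for the Text object (which could be passed the same way, since the function only iterates over its argument).

```python
import re
from collections import Counter

def clean_tokens(tokens):
    """Strip non-alphanumeric characters, lowercase, and drop empty results."""
    cleaned = []
    for token in tokens:
        word = re.sub('[^a-zA-Z0-9]', '', token).lower()
        if word:  # pure punctuation tokens become '' and are skipped
            cleaned.append(word)
    return cleaned

# Hypothetical sample tokens; any iterable of strings works here.
tokens = ["The", "cat", ",", "the", "dog", "!", "Cat", "?"]

# A bag of words is just a mapping from each word to its count.
bag_of_words = Counter(clean_tokens(tokens))
print(bag_of_words)
```

Counter is only one way to hold the counts; if you later need a document-term matrix across several files, a vectorizer from a machine learning library may be a better fit.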