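You are passing the variable text, which holds an nltk.Text object, to re.sub, which expects a str object rather than an nltk.Text object.
Contrary to what the class name Text might suggest, it is not itself a string-like object; it only stores strings. That is probably what misled you.
Those strings are most likely what you need when calling the re functions. You can access them in the usual Pythonic way by iterating over the instance, for example for word in text: ... See the code below for a complete example.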
import re
import random
from nltk.text import Text
# Initialize a dummy text with integers as its words.
# With probability 0.5, we append a non-alphanumeric
# character to the word so that we can test the regular
# expression from the original example.
text = Text([
    str(random.randrange(1, 100)) + (
        '' if random.random() >= 0.5 else
        random.choice(['!', ',', '?', ':', ';'])
    )
    for _ in range(20)
])
# The tokens in the `Text` object can be accessed
# by iterating over it (the object is iterable), or
# directly through its `tokens` attribute.
print(text)
print(text.tokens[:10])
print(list(text)[:10])
print([x for x in text][:10])
# The original line is commented out to avoid the exception.
# We can't pass the variable `text` to `re.sub` because
# it expects a string and `text` stores an instance of
# NLTK's `Text` class.
# lowerx = re.sub('[^a-zA-Z0-9]', '', text)
# Test for the expected behavior:
for token in text:
    lowertoken = re.sub('[^a-zA-Z0-9]', '', token)
    print('<input="%s" output="%s">' % (token, lowertoken))
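
If what you want is a cleaned list of tokens (or a single cleaned string) rather than per-token printing, one option is to wrap the substitution in a small helper. The sketch below is only an illustration: clean_tokens and NON_ALNUM are hypothetical names, and a plain list of strings stands in for the Text object so that the snippet runs without NLTK. Since a Text instance can be iterated token by token, passing it in place of the list should behave the same way.

```python
import re

# Compiled once; matches any character that is not an ASCII letter or digit.
NON_ALNUM = re.compile(r'[^a-zA-Z0-9]')


def clean_tokens(tokens):
    """Hypothetical helper: strip non-alphanumeric characters from each
    token and drop tokens that end up empty."""
    cleaned = (NON_ALNUM.sub('', token) for token in tokens)
    return [token for token in cleaned if token]


# A plain list stands in for an nltk Text object here (an assumption made
# for illustration); any iterable of strings is accepted.
sample = ['42!', '7', '?', '13;', 'abc,']
result = clean_tokens(sample)
print(result)
print(' '.join(result))
```

Joining the cleaned tokens with spaces gives a single str, which can then be passed to any other re function if needed.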