I'd like to extend @Emre's answer with one more example: we will replace all tokenized words in "1984" (c) George Orwell (~120K words):
In [163]: %paste
import requests
import nltk
import pandas as pd
# source: https://github.com/dwyl/english-words
fn = r'D:\temp\.data\words.txt'
url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
r = requests.get(url)
# read words into Pandas DataFrame
df = pd.read_csv(fn, header=None, names=['word'])
# shuffle DF, so we will have random indexes
df = df.sample(frac=1)
# convert Pandas DF into dictionary: {'word1': unique_number1, 'word2': unique_number2, ...}
lkp = df.reset_index().set_index('word')['index'].to_dict()
# tokenize "1984" (c) George Orwell
words = nltk.tokenize.word_tokenize(r.text)
print('Word Dictionary size: {}'.format(len(lkp)))
print('We have tokenized {} words...'.format(len(words)))
## -- End pasted text --
Word Dictionary size: 354983
We have tokenized 120251 words...
In [164]: %timeit [lkp.get(w, 0) for w in words]
10 loops, best of 3: 66.3 ms per loop
Conclusion: building a list of 120K words, looking each one up in a dictionary with 354,983 entries, took about 66 ms.
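
For reference, here is a minimal, self-contained sketch of the same idea that does not depend on the local words.txt file (the variable names are mine, and the lookup dictionary is built from the book's own vocabulary instead of the dwyl word list):

import requests
import nltk

# word_tokenize needs the Punkt tokenizer models
# (newer NLTK versions may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)

url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
text = requests.get(url).text

# tokenize the book
words = nltk.tokenize.word_tokenize(text)

# build the lookup from the book's own vocabulary: {word: unique_number}
lkp = {w: i for i, w in enumerate(set(words))}

# replace every token with its number; 0 is the fallback for unknown tokens
encoded = [lkp.get(w, 0) for w in words]

print('Vocabulary size: {}'.format(len(lkp)))
print('Encoded {} words, first ten: {}'.format(len(encoded), encoded[:10]))

The timing behaves the same way: each token costs one average-O(1) dict lookup, so the total time grows roughly linearly with the number of tokens, not with the dictionary size.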