数据挖掘 - 如果 My Word 不在 Bert 模型词汇表中怎么办？ - 吾爱随笔录

我正在使用 Bert 模型进行 NER。我在我的数据集中遇到了一些不属于 bert 词汇表的单词，并且在将单词转换为 id 时遇到了同样的错误。有人可以帮助我吗？

下面是我用于 bert 的代码。

df = pd.read_csv("drive/My Drive/PA_AG_123records.csv",sep=",",encoding="latin1").fillna(method='ffill')

!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

import tensorflow_hub as hub
import tokenization
module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

tokens_list=['hrct',
 'heall',
 'government',
 'of',
 'hem',
 'snehal',
 'sarjerao',
 'nawale',
 '12',
 '12',
 '9999',
 'female',
 'mobile',
 'no',
 '1155812345',
 '3333',
 '3333',
 '3333',
 '41st',
 '3iteir',
 'fillow']

max_len =25
text = tokens_list[:max_len-2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print("After adding  flasges -[CLS] and [SEP]: ")
print(input_sequence)


tokens = tokenizer.convert_tokens_to_ids(input_sequence )
print("tokens to id ")
print(tokens)
```