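I am using a BERT model for NER. My dataset contains some words that are not in the BERT vocabulary, and I get an error when converting those words to ids. Can someone help me?
Below is the code I am using for BERT.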
```python
import pandas as pd

df = pd.read_csv("drive/My Drive/PA_AG_123records.csv", sep=",", encoding="latin1").ffill()
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tensorflow_hub as hub
import tokenization
module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
tokens_list=['hrct',
'heall',
'government',
'of',
'hem',
'snehal',
'sarjerao',
'nawale',
'12',
'12',
'9999',
'female',
'mobile',
'no',
'1155812345',
'3333',
'3333',
'3333',
'41st',
'3iteir',
'fillow']
max_len = 25
text = tokens_list[:max_len - 2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print("After adding flags [CLS] and [SEP]:")
print(input_sequence)
# This call fails for words that are not in the BERT vocabulary.
tokens = tokenizer.convert_tokens_to_ids(input_sequence)
print("Token ids:")
print(tokens)
```
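
For context, here is a minimal sketch of what seems to be going on, under some assumptions. `convert_tokens_to_ids` does a plain vocabulary lookup, so it expects WordPiece tokens, not raw words. Words such as `hrct` or `snehal` are probably not vocabulary entries, so they would first need to be split into subword pieces (with the real tokenizer, `tokenizer.tokenize(word)` does this). The toy `vocab`, the `wordpiece` helper, and the `encode` helper below are hypothetical and only illustrate the greedy longest-match idea; they are not the library's implementation.

```python
# Toy vocabulary (hypothetical; the real BERT vocabulary is far larger).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "government": 4, "of": 5, "hr": 6, "##ct": 7, "female": 8}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes [UNK].
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def encode(words, vocab, max_len=25):
    """Split words into pieces, add [CLS]/[SEP], then look up the ids."""
    tokens = []
    for w in words:
        tokens.extend(wordpiece(w.lower(), vocab))
    tokens = ["[CLS]"] + tokens[:max_len - 2] + ["[SEP]"]
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode(["hrct", "government", "of", "snehal"], vocab)
print(tokens)  # ['[CLS]', 'hr', '##ct', 'government', 'of', '[UNK]', '[SEP]']
print(ids)     # [2, 6, 7, 4, 5, 1, 3]
```

Note that for NER a word can become several pieces, so the per-word labels would have to be aligned with the pieces (for example, by labeling only the first piece of each word).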