While experimenting with the Transformers library's TFBertForSequenceClassification and BertTokenizer, I noticed that BertTokenizer:
from transformers import BertTokenizer
transformer_bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizes text differently from the tokenizer I built for my BERT model:
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tokenization
FullTokenizer = tokenization.FullTokenizer
and then:
import tensorflow_hub as hub
BERT_MODEL_HUB = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(BERT_MODEL_HUB, trainable=True)
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
tokenizer = FullTokenizer(vocabulary_file, to_lower_case)
For example:
sequence = "Systolic arrays are cool. This 🐳 is cool too."
transformer_bert_tokenizer.tokenize(sequence)
# output: ['s', '##ys', '##to', '##lic', 'array', '##s', 'are', 'cool', '.', 'this', '[UNK]', 'is', 'cool', 'too', '.']
tokenizer.tokenize(sequence)
# output: ['sy', '##sto', '##lic', 'arrays', 'are', 'cool', '.', 'this', '[UNK]', 'is', 'cool', 'too', '.']
Does anyone know why the outputs differ? Don't the two tokenizers use the same vocabulary? And which one is preferred?
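To investigate the vocabulary question, one way would be to dump both word lists and diff them. Below is a minimal sketch of that idea, assuming the transformer_bert_tokenizer and vocabulary_file objects built above are available; get_vocab() is the Transformers accessor for the token-to-id mapping, and vocabulary_file is the bytes path to the vocab.txt asset that ships with the hub module:

# Compare the two vocabularies (sketch, using the objects defined above).
hf_vocab = set(transformer_bert_tokenizer.get_vocab().keys())

# vocabulary_file is a bytes path to the hub module's vocab.txt asset.
with open(vocabulary_file.decode("utf-8"), encoding="utf-8") as f:
    hub_vocab = set(line.strip() for line in f)

print(len(hf_vocab), len(hub_vocab))
print(sorted(hf_vocab - hub_vocab)[:20])  # tokens only in the transformers vocab
print(sorted(hub_vocab - hf_vocab)[:20])  # tokens only in the hub vocab

If the two sets come out identical, the difference would have to lie in the tokenization code itself (e.g. basic tokenization / lower-casing settings) rather than in the vocabulary.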