How do I get sentence embeddings from BERT?
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = 'I really enjoyed this movie a lot.'

# 1. Tokenize the sequence:
tokens = tokenizer.tokenize(sentence)
print(tokens)
print(type(tokens))
2. Add the [CLS] and [SEP] tokens:
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(" Tokens are \n {} ".format(tokens))
3. Pad the input:
T = 15  # fixed sequence length to pad to
padded_tokens = tokens + ['[PAD]' for _ in range(T - len(tokens))]
print("Padded tokens are \n {} ".format(padded_tokens))
attn_mask = [1 if token != '[PAD]' else 0 for token in padded_tokens]
print("Attention mask is \n {} ".format(attn_mask))
4. Maintain a list of segment tokens:
seg_ids = [0 for _ in range(len(padded_tokens))]  # all 0s, since there is only one segment
print("Segment tokens are \n {}".format(seg_ids))
5. Get the indices of the tokens in BERT's vocabulary:
sent_ids = tokenizer.convert_tokens_to_ids(padded_tokens)
print("sentence indexes \n {} ".format(sent_ids))

# Convert everything to tensors and add a batch dimension
token_ids = torch.tensor(sent_ids).unsqueeze(0)
attn_mask = torch.tensor(attn_mask).unsqueeze(0)
seg_ids = torch.tensor(seg_ids).unsqueeze(0)
Feed them to BERT:
from transformers import BertModel
bert_model = BertModel.from_pretrained('bert-base-uncased')
# return_dict=False keeps the tuple-style output unpacked below
hidden_reps, cls_head = bert_model(token_ids, attention_mask=attn_mask, token_type_ids=seg_ids, return_dict=False)
print(type(hidden_reps))
print(hidden_reps.shape)  # hidden states of every token in the input sequence
print(cls_head.shape)     # pooled output for the [CLS] token
Output:
hidden_reps size
torch.Size([1, 15, 768])
cls_head size
torch.Size([1, 768])
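For reference, I believe the manual preprocessing above (special tokens, padding, attention mask, segment ids, tensor conversion) can be done with a single tokenizer call in recent versions of transformers; this is only a sketch under that assumption, not something I have relied on above:

# Sketch: one tokenizer call replacing steps 1-5 (assumes a recent transformers version)
encoded = tokenizer(sentence, padding='max_length', max_length=15, truncation=True, return_tensors='pt')
hidden_reps, cls_head = bert_model(encoded['input_ids'],
                                   attention_mask=encoded['attention_mask'],
                                   token_type_ids=encoded['token_type_ids'],
                                   return_dict=False)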
Which vector represents the sentence embedding here: hidden_reps or cls_head?
Is there any other way to get sentence embeddings from BERT, so that I can run similarity checks against other sentences?
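For context, this is roughly the kind of similarity check I want to run; the mean pooling in the embed helper below is just my assumption about how to collapse the token vectors into a single sentence vector, not something I know to be correct:

import torch.nn.functional as F

def embed(text):
    # Hypothetical helper: mean-pool the token hidden states, ignoring [PAD] positions
    enc = tokenizer(text, padding='max_length', max_length=15, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden, _ = bert_model(enc['input_ids'], attention_mask=enc['attention_mask'], return_dict=False)
    mask = enc['attention_mask'].unsqueeze(-1)            # [1, 15, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # [1, 768] sentence vector

score = F.cosine_similarity(embed('I really enjoyed this movie a lot.'),
                            embed('The movie was a lot of fun.'))
print(score)

If mean pooling is the wrong choice here, pointers to a better pooling strategy would also be welcome.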