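Word2vec: Word2vec provides a vector for each token/word, and these vectors encode the meaning of the word. Although the vectors are not human-interpretable, their meaning can be understood by comparing them with other vectors (for example, the vector of `dog` will be most similar to the vector of `cat`) and through various interesting equations (for example, `king - man + woman = queen`, which shows how well these vectors capture the semantics of words).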
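To make the analogy arithmetic concrete, here is a minimal sketch in plain Python. The 3-d vectors are hypothetical values I picked by hand (the axes loosely mean royalty, male, female); they are not taken from any trained Word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT real Word2vec output.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "prince": [0.8, 0.8, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c].

    The three input words are excluded from the candidates, as is usual
    for this kind of query.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", toy_vectors))
```

With real embeddings the same two steps apply: build the target vector by addition and subtraction, then pick the nearest remaining word by cosine similarity.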
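The problem with word2vec is that each word has only one vector, but in the real world each word has different meanings depending on the context, and sometimes the meanings can be totally different (for example, `bank` as a financial institute vs. the `bank` of a river).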
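Bert: One important difference between Bert/ELMo (dynamic word embeddings) and Word2vec is that these models take the context into account, so every occurrence of a token gets its own vector.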
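The difference in what each kind of model returns can be sketched as two lookup tables. Everything below is a hypothetical illustration: the sentences, the 2-d numbers, and the helper names `static_lookup` and `contextual_lookup` are made up, and this says nothing about how Bert computes its vectors internally.

```python
# Static embedding (Word2vec-style): one vector per word type.
static_vectors = {
    "bank": [0.2, 0.7],
    "river": [0.1, 0.9],
    "credit": [0.8, 0.1],
}

# Contextual embedding (Bert/ELMo-style): one vector per token occurrence,
# keyed here by (sentence index, token position). Values are made up.
sentences = [
    ["the", "bank", "creates", "credit"],
    ["the", "river", "bank", "flooded"],
]
contextual_vectors = {
    (0, 1): [0.7, 0.2],  # "bank" in the financial sentence
    (1, 2): [0.2, 0.8],  # "bank" in the river sentence
}


def static_lookup(word):
    """A static model ignores the sentence: same word, same vector."""
    return static_vectors[word]


def contextual_lookup(sentence_idx, position):
    """A contextual model returns a vector for one specific occurrence."""
    return contextual_vectors[(sentence_idx, position)]


same_static = static_lookup(sentences[0][1]) == static_lookup(sentences[1][2])
same_contextual = contextual_lookup(0, 1) == contextual_lookup(1, 2)
print("static vectors identical:", same_static)
print("contextual vectors identical:", same_contextual)
```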
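Now the question is: do the vectors from Bert have the behaviours of word2Vec, and do they also solve the meaning disambiguation problem (since this is a contextual word embedding)?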
Experiment
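To get the vectors from Google's pre-trained model, I used the bert-embedding-1.0.1 library. I first tried to see whether it has the similarity property. To test, I took the first paragraphs from the Wikipedia pages for Dog, Cat, and Bank (financial institute). The tokens most similar to `dog` are:
('dog', 1.0)
('wolf', 0.7254540324211121)
('domestic', 0.6261438727378845)
('cat', 0.6036421656608582)
('canis', 0.57225221395492565321)
('mammal', 26956521)
The first element is the token and the second is the similarity.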
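For reference, the ranking step can be sketched roughly as below. I am assuming the embedding step yields one `(tokens, vectors)` pair per sentence; the function name `most_similar` and the 2-d toy vectors are hypothetical stand-ins for real model output, so the scores it prints are not the ones listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, embedded_sentences, topn=5):
    """Rank every token occurrence by similarity to one query occurrence.

    query is (sentence_index, token_index); embedded_sentences is a list of
    (tokens, vectors) pairs (an assumed output shape, see the note above).
    The query occurrence itself is included, so it ranks first with ~1.0.
    """
    sentence_idx, token_idx = query
    query_vec = embedded_sentences[sentence_idx][1][token_idx]
    scored = []
    for tokens, vectors in embedded_sentences:
        for token, vec in zip(tokens, vectors):
            scored.append((token, cosine(query_vec, vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors standing in for real model output.
toy_output = [
    (["the", "dog", "barks"], [[0.1, 0.1], [0.9, 0.3], [0.6, 0.1]]),
    (["a", "cat", "sleeps"], [[0.1, 0.2], [0.8, 0.5], [0.2, 0.9]]),
]

top = most_similar((0, 1), toy_output, topn=3)
for token, score in top:
    print((token, score))
```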
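Now for the disambiguation test: along with Dog, Cat and Bank (financial institute), I added a paragraph about River bank from Wikipedia. This is to check whether Bert can differentiate between the two different types of `bank`. The hope here is that the vector of the token `bank` (of a river) will be close to the vectors of `river` or `water`, and far from the vectors of `bank` (financial institute), `credit`, `financial`, etc.
Here is the result. The second element is the sentence that shows the context.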
('bank', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 1.0)
('bank', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.7796692848205566)
('bank', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.7275459170341492)
('bank', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.7121304273605347)
('bank', 'the bank consists of the sides of the channel , between which the flow is confined .', 0.6965076327323914)
('banks', 'markets to their importance in the financial stability of a country , banks are highly regulated in most countries .', 0.6590269804000854)
('banking', 'most nations have institutionalized a system known as fractional reserve banking under which banks hold liquid assets equal to only a', 0.6490173935890198)
('banks', 'most nations have institutionalized a system known as fractional reserve banking under which banks hold liquid assets equal to only a', 0.6224181652069092)
('financial', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.614281952381134)
('banks', 'stream banks are of particular interest in fluvial geography , which studies the processes associated with rivers and streams and the deposits', 0.6096583604812622)
('structures', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.5771245360374451)
('financial', 'markets to their importance in the financial stability of a country , banks are highly regulated in most countries .', 0.5701562166213989)
('reserve', 'most nations have institutionalized a system known as fractional reserve banking under which banks hold liquid assets equal to only a', 0.5462549328804016)
('institution', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.537483811378479)
('land', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.5331911444664001)
('of', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.527492105960846)
('water', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.5234918594360352)
('banks', 'bankfull discharge is a discharge great enough to fill the channel and overtop the banks .', 0.5213838815689087)
('lending', 'lending activities can be performed either directly or indirectly through due capital .', 0.5207482576370239)
('deposits', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.5131596922874451)
('stream', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.5108630061149597)
('bankfull', 'bankfull discharge is a discharge great enough to fill the channel and overtop the banks .', 0.5102289915084839)
('river', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.5099104046821594)
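These are the vectors most similar to `bank` in the river-bank sense. The query token is taken from the context in the first row, which is why its similarity score is 1.0, so the second row is the closest other vector. From the result it can be seen that the meaning and context of that closest token are very different from the query. Even the tokens `river`, `water` and `stream` have lower similarity.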
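To put numbers on this, the singular `bank` rows can be grouped by sense. The scores below are copied from the results above (the query row with 1.0 is left out); the sense labels are my own reading of each context sentence, not something the model produced.

```python
from statistics import mean

# (sense label assigned by hand, similarity to the river-sense query 'bank')
bank_rows = [
    ("financial", 0.7796692848205566),  # "a bank is a financial institution ..."
    ("river", 0.7275459170341492),      # "... a stream bank or river bank ..."
    ("river", 0.7121304273605347),      # second 'bank' in the same sentence
    ("river", 0.6965076327323914),      # "the bank consists of the sides ..."
]

# Collect the scores for each sense.
by_sense = {}
for sense, score in bank_rows:
    by_sense.setdefault(sense, []).append(score)

for sense, scores in by_sense.items():
    print(sense, "mean:", round(mean(scores), 4), "max:", round(max(scores), 4))
```

The single financial-sense `bank` scores higher than every river-sense `bank`, which is the opposite of what I hoped for.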
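So it seems that the vectors do not really disambiguate the meaning. Why is that? Isn't a contextual token vector supposed to disambiguate the meaning of a word?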