Are there any pretrained models for finding similar word n-grams, where n > 1?
FastText, for example, seems to work only on unigrams:
from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)
[('dogs', 0.8463464975357056),
('puppy', 0.7873005270957947),
('pup', 0.7692237496376038),
('canine', 0.7435278296470642),
...
But it fails on longer n-grams:
model.nearest_neighbors('Gone with the Wind', k=2000)
[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
0.71047443151474),
Or:
model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
0.5197194218635559),
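For comparison, here is a minimal sketch of one workaround I could try: instead of passing the whole phrase as a single token, average the vectors of its individual words and rank candidate phrases by cosine similarity. The toy `VECTORS` table and the helper names `phrase_vector`, `cosine` and `rank_phrases` are hypothetical and only illustrate the idea; with a real model the per-word vectors would come from the loaded model, and the candidate phrases would have to be supplied separately.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# With a real model these would be looked up from the pretrained embeddings.
VECTORS = {
    "star": [0.90, 0.10, 0.00],
    "wars": [0.80, 0.20, 0.10],
    "trek": [0.85, 0.15, 0.05],
    "gone": [0.10, 0.90, 0.20],
    "wind": [0.00, 0.80, 0.30],
}


def phrase_vector(phrase, vectors):
    """Average the vectors of the words in `phrase` that appear in `vectors`."""
    words = [w for w in phrase.lower().split() if w in vectors]
    if not words:
        raise ValueError(f"no known words in {phrase!r}")
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_phrases(query, candidates, vectors):
    """Return (candidate, similarity) pairs, most similar to `query` first."""
    q = phrase_vector(query, vectors)
    scored = [(c, cosine(q, phrase_vector(c, vectors))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_phrases("Star Wars", ["Gone with the Wind", "Star Trek"], VECTORS)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

This only ranks phrases from a candidate list that I already have; it does not discover similar n-grams across the whole vocabulary, which is what I am actually looking for.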