Spacy word embeddings for a sentence

data-mining machine-learning python nlp word-embeddings spacy
2022-02-03 20:34:26

Spacy provides pretrained vectors for words, but I noticed that you can also get a vector for a whole sentence:

spacy_nlp('hello I').has_vector == True

But I don't know how it computes the sentence vector from the word2vec vectors. I tried:

spacy_nlp('hello I').vector == spacy_nlp('hello').vector + spacy_nlp('I').vector

which is False. I also tried:

spacy_nlp('hello I').vector/spacy_nlp('hello I').vector_norm == spacy_nlp('hello').vector/spacy_nlp('hello').vector_norm + spacy_nlp('I').vector/spacy_nlp('I').vector_norm

which is also False.

I can't seem to find or figure out how spacy computes the sentence's w2v.
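For reference, a minimal runnable version of the checks above; the model name en_core_web_md is an assumption, and any pipeline that ships word vectors should behave similarly. Note that == on NumPy arrays compares element-wise, so np.allclose is used to get a single boolean:

import numpy as np
import spacy

spacy_nlp = spacy.load('en_core_web_md')  # assumed: a pipeline with word vectors

doc = spacy_nlp('hello I')
print(doc.has_vector)  # True

# attempt 1: is the sentence vector the sum of the word vectors?
summed = spacy_nlp('hello').vector + spacy_nlp('I').vector
print(np.allclose(doc.vector, summed))  # expected: False

# attempt 2: is the normalised sentence vector the sum of the normalised word vectors?
normed = (spacy_nlp('hello').vector / spacy_nlp('hello').vector_norm
          + spacy_nlp('I').vector / spacy_nlp('I').vector_norm)
print(np.allclose(doc.vector / doc.vector_norm, normed))  # expected: False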


a = spacy_nlp('hello').vector
a

array([ 2.1919045 , -1.3554063 , -2.0530818 , -1.4123821 ,  0.73116064,
       -0.24243775, -1.238019  , -1.038872  , -3.8119905 ,  0.3023836 ,
        2.0082908 , -0.4146578 ,  0.52871764, -4.171281  , -4.014127  ,
        3.5551465 ,  3.5740273 ,  0.5369273 , -0.92361224,  1.4550962 ,
        2.1736908 , -0.05514041,  0.02151388, -2.1722403 ,  0.81322104,
        3.5877275 , -1.0136521 ,  4.6003613 , -0.19145766,  5.403145  ,
       -1.9958102 ,  0.80248785, -2.3566568 ,  2.15387   ,  0.26684093,
        1.8178961 ,  3.594517  , -2.9950802 ,  2.5587099 , -5.6746616 ,
       -3.7259517 ,  4.0144114 , -1.4814405 ,  1.5888698 , -0.2371515 ,
        0.5498152 ,  0.9527153 , -4.1197095 , -4.252441  , -0.36907774,
       -4.510469  ,  1.2669985 , -0.91693896, -3.0032263 , -4.037157  ,
       -1.986922  ,  1.8322158 , -0.9520336 , -2.6739838 ,  0.368276  ,
        0.5881702 ,  1.4819605 ,  2.1771026 ,  0.20011072, -0.20952749,
       -1.7966032 ,  4.412916  , -0.8781664 ,  3.0670204 ,  3.92986   ,
       -0.7381511 , -0.07432494, -3.6973615 , -3.546731  ,  1.6010978 ,
       -4.0834403 ,  1.7816883 ,  0.8037724 ,  0.40344352, -1.2090104 ,
       -3.3253288 ,  4.6769385 ,  1.3193885 , -1.1775286 , -1.2436512 ,
       -0.29471165,  1.9998071 ,  1.1338542 ,  5.747326  , -0.10331005,
        1.6050186 ,  2.6961374 , -1.9422164 , -3.0807574 , -1.1481779 ,
        7.1367517 ], dtype=float32)

b = spacy_nlp('I').vector
b

array([ 1.9940598e+00, -2.7776110e+00,  8.4717870e-01, -2.1956882e+00,
       -1.6103275e+00,  1.2993972e-01,  8.3826280e-01,  8.7950850e-01,
       -3.5490465e+00,  4.4254961e+00, -1.4894485e+00,  4.4692218e-01,
       -6.0040636e+00,  3.4809113e-01,  7.5852954e-01, -5.0149399e-01,
       -1.9669157e+00,  8.8114321e-01,  5.3964740e-01,  1.6436796e+00,
       -4.3819084e+00,  7.1328688e-01, -8.9688343e-01, -1.2563754e+00,
       -2.6987386e-01,  3.3273227e+00,  7.1929336e-01,  1.2008041e-01,
        2.8758078e+00, -8.6590099e-01,  5.6435466e-01, -5.4331255e-01,
       -3.3853512e+00, -2.0917976e+00, -1.1649452e+00,  8.6632729e+00,
        9.1355121e-01, -3.9117950e-01, -6.3341379e-01, -3.4170332e+00,
        3.2871642e+00,  4.5229197e-03, -4.0161700e+00,  2.6399128e+00,
       -2.4242992e+00, -1.2012237e-01, -1.1977488e-01, -1.6422987e-01,
        7.7170479e-01, -1.5015860e+00, -3.0203837e-01,  1.9385589e+00,
       -2.9229348e+00, -2.8134599e+00, -6.1340892e-01, -2.5029099e+00,
       -6.6817325e-01, -8.4735197e-01,  4.2243872e+00,  2.8358276e+00,
       -2.7096636e+00,  6.3791027e+00,  1.3461562e+00, -3.9387980e+00,
        1.0648534e+00,  5.3636909e-01,  4.1285772e+00, -2.8879738e+00,
        1.3546917e+00, -1.9005369e+00, -3.7411542e+00, -4.8598945e-02,
       -1.4411114e+00,  1.3436056e+00,  1.1946709e+00,  2.3972931e+00,
        2.1032238e+00,  1.8248746e+00, -2.1880054e+00, -1.4601905e+00,
       -1.9771397e+00,  9.3115008e-01, -3.7088573e+00, -4.9041757e-01,
        1.0846795e+00,  2.2863836e+00,  3.5038524e+00,  1.0964345e+00,
        3.6875091e+00, -1.6266774e+00,  1.4012933e-02,  2.7396250e+00,
        3.9477596e+00, -3.5737205e+00,  3.1862993e+00,  2.2955155e+00],
      dtype=float32)

c = spacy_nlp('hello I').vector
c

array([ 2.4846857 , -1.9697192 , -0.09456831, -1.5198507 , -1.6889997 ,
       -0.7867774 , -1.1812011 ,  0.01011622, -2.9120972 ,  3.59254   ,
        1.3454058 , -0.305678  , -2.1474035 , -3.110804  , -0.6446719 ,
        1.9236953 ,  0.88007987,  0.4077559 ,  0.27990723,  0.36027157,
        1.214731  , -0.27636862,  0.33037317, -1.4009418 , -1.7570219 ,
        2.0057924 ,  0.1711272 ,  0.65295005, -0.6732832 ,  1.5165039 ,
       -1.8387947 , -0.49002886, -2.529176  ,  1.0543746 ,  0.13975173,
        6.3513803 ,  3.1074045 , -1.8838222 ,  1.707653  , -3.5569887 ,
        0.02888358,  1.4662569 , -1.4711913 ,  1.6238092 , -0.996526  ,
        0.29157495,  0.7459268 , -2.6089895 , -1.4595604 , -1.6607146 ,
       -1.9626031 ,  0.0429309 , -2.2927856 , -2.7657444 , -2.2093186 ,
       -1.8635755 ,  1.1076405 , -0.87808686, -0.8882728 , -0.20140225,
       -0.14074779,  1.5494955 ,  2.2195954 , -0.8879056 ,  0.16175044,
       -0.47926584,  6.069929  , -2.2804523 ,  1.389133  ,  2.3614829 ,
       -1.6746982 , -0.65907   , -0.88322634, -0.35415757,  1.2424103 ,
       -1.3832704 ,  1.74179   ,  2.0219522 , -0.3940425 , -1.076731  ,
       -3.0649443 ,  2.6106696 , -0.03948617,  0.03465301,  0.6218431 ,
        0.8250919 ,  1.7428303 ,  0.8449378 ,  3.0572054 ,  0.29650444,
        0.4229828 ,  0.38575757,  0.20896101, -0.91772854,  0.3865456 ,
        4.248111  ], dtype=float32)
1 Answer

To build the sentence embedding, Spacy simply averages the word embeddings. I don't have access to Spacy right now or I would demonstrate it, but you can try:

spacy_nlp('hello I').vector == (spacy_nlp('hello').vector + spacy_nlp('I').vector) / 2

If this also gives False, it is because the floating-point values may not be exactly equal after the computation. Just print the two vectors out separately and you will see that they are very close.
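Since exact float equality is unlikely, a quicker way to confirm the averaging behaviour is numpy.allclose. A minimal sketch, again assuming a pipeline with static word vectors such as en_core_web_md:

import numpy as np
import spacy

spacy_nlp = spacy.load('en_core_web_md')  # assumed: a pipeline with static word vectors

doc = spacy_nlp('hello I')

# Doc.vector defaults to the average of the token vectors of the same doc
token_mean = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, token_mean))  # expected: True

# the same check built from two separate single-word docs, as suggested above;
# with static vectors each single-word doc vector equals that word's vector,
# so this should also be close (small models without static vectors may differ)
manual_mean = (spacy_nlp('hello').vector + spacy_nlp('I').vector) / 2
print(np.allclose(doc.vector, manual_mean))  # expected: True for static-vector models

Doc.vector defaults to the average of the token vectors, which is what the first check above confirms.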