Data mining - Understanding sklearn FeatureHasher - 吾爱随笔录

To understand the "hashing trick", I wrote the following test code:

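import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Eight single-letter categories in one column
test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
# Hash every category into a 4-column feature space
h = FeatureHasher(n_features=4, input_type='string')
# input_type='string' expects each sample to be an iterable of strings, so wrap each value in a list
f = h.transform([[t] for t in test['type']])
print(f.toarray())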

In the example above, I map 8 categories to 4 columns, and the output is:

[[ 0.  0.  1.  0.]<-a
 [ 0. -1.  0.  0.]<-b
 [ 0. -1.  0.  0.]<-c
 [ 0.  0.  0.  1.]<-d
 [ 0.  0.  0.  1.]<-e
 [ 0.  0.  0.  1.]<-f
 [ 0.  0. -1.  0.]<-g
 [ 0. -1.  0.  0.]]<-h

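In the resulting matrix I can see repeated rows: several categories are represented in exactly the same way. Why is that? With a binary encoding, 8 distinct categories would fit into 4 columns.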
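As far as I understand it, the hashing trick works roughly like the sketch below. This is only an illustration under my own assumptions, not scikit-learn's implementation: `hashed_row` is a hypothetical helper, and `hashlib.md5` stands in for the MurmurHash3 function that scikit-learn actually uses, so the rows it prints will not match the output above. The point is that the column and the sign come straight from the hash value, with no lookup table to keep categories apart.

```python
import hashlib


def hashed_row(token, n_features):
    """Map one string token to a signed one-hot row via a hash (illustration only)."""
    # md5 is a stand-in here; scikit-learn uses a different hash function (MurmurHash3).
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big")
    # The column is chosen by the hash value, not by a per-category lookup table.
    column = value % n_features
    # A further bit of the hash decides whether the entry is +1 or -1.
    sign = 1.0 if digest[4] % 2 == 0 else -1.0
    row = [0.0] * n_features
    row[column] = sign
    return row


categories = ["a", "b", "c", "d", "e", "f", "g", "h"]
rows = {c: hashed_row(c, 4) for c in categories}
for c, row in rows.items():
    print(c, row)

# Distinct categories can yield identical rows, because nothing prevents collisions.
distinct = {tuple(r) for r in rows.values()}
print(len(distinct), "distinct rows for", len(categories), "categories")
```

With 4 columns and 2 signs there are only 8 possible rows, so a hash that scatters 8 categories more or less at random is likely to reuse some of them. A binary encoding, by contrast, needs the full list of categories up front in order to assign each one a unique code.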

Could someone explain the output of this technique, and perhaps elaborate on how it works?