Nearest neighbors for high-dimensional, mixed data types

data-mining machine-learning python scikit-learn similarity distance

2021-09-22 06:42:36

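I'd like to use nearest neighbors to find, in a dataset with continuous, categorical, and text features, the samples most similar to a given subclass of samples (think treated vs. untreated).

Toy dataset: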

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD    

np.set_printoptions(suppress=True)

a = [20, 100, 10000, 500]
b = [1, 2, 3, 2]
c = ['dog', 'cat', 'foo', 'cat']
d = ['apple apple fruit',
     'mercedes bmw chevrolet',
     'monster dragon snake',
     'mercedes chevrolet bmw buick']

z = [a,b,c,d]

names = {0: 'col1', 1:'col2', 2:'col3', 3:'col4'}

X = pd.DataFrame(z).T
X = X.rename(names, axis='columns')
X

This creates:

   col1    col2 col3    col4
0   20      1   dog     apple apple fruit
1   100     2   cat     mercedes bmw chevrolet
2   10000   3   foo     monster dragon snake
3   500     2   cat     mercedes chevrolet bmw buick

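As we can see, samples 1 and 3 are the most related. They share much of the same vocabulary, they share two labels (col2 + col3), and given the distribution of col1 they are very close. We transform them into a feature array and ask for the nearest neighbors, like so: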

numeric = ['col1']
numeric_transformer = Pipeline(steps=[('scaler', QuantileTransformer())])

cat = ['col2', 'col3']
cat_transformer = Pipeline(steps=[('onehot', OneHotEncoder())])

text = ['col4']
text_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                   ('svd', TruncatedSVD(n_components=2))])

prep = ColumnTransformer(transformers=[('num', numeric_transformer, numeric), 
                                       ('cat', cat_transformer, cat),
                                       ('text', text_transformer, 'col4')], sparse_threshold=0)

X_transformed = prep.fit_transform(X)

nn = NearestNeighbors(n_neighbors=2)
nn.fit(X_transformed)

d, i = nn.kneighbors(X_transformed[1].reshape(1,-1))
i

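This correctly returns array([[1, 3]], dtype=int64): 1 is the self-match and 3 is the nearest neighbor.

But on a real dataset spanning 100+ dimensions, should I be using a custom distance function? With sklearn's nearest neighbors, we could use cosine distance on the text columns, another distance across the high-dimensional one-hot encoded variables, and yet another distance calculation for the truly continuous transformed variable (col1).

But is this... hacky? Without it, is there a way to handle heterogeneous data in a nearest-neighbor search? I worry that my decisions about how to weight each of these three kinds of variables will make the final result very subjective and open to criticism.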

2 Answers

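Conceptually, there is no reason this "couldn't" work, but in practice there may be better approaches than KNN.

Applicability of KNN

An immediate problem with KNN is that each new instance must be compared against every existing reference instance before a class can be determined. As the number of reference instances grows, so does the time needed to classify a new instance. For high-dimensional data, you will need a great many reference instances.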
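The cost described above can be sketched with a plain brute-force search. This is a minimal illustration and not how any particular library implements it: `brute_force_neighbors`, the array sizes, and the random data are all assumptions made for the example.

```python
import numpy as np

def brute_force_neighbors(reference, query, k):
    """Hypothetical helper: compare the query against every reference row."""
    # One distance per reference row, so the work grows linearly with
    # len(reference) (and with the number of dimensions).
    distances = np.linalg.norm(reference - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 100))   # 1000 reference rows, 100 dimensions
query = reference[7]                       # reuse a known row as the query

nearest, dists = brute_force_neighbors(reference, query, k=2)
# nearest[0] is the query's own row; nearest[1] is its closest other row
```

Doubling the number of reference rows doubles the number of distances computed per query, which is the scaling concern raised in the answer.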

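Dimensionality of the data

If you do use KNN, you will want to do everything you can to reduce the number of dimensions, especially for the categorical columns. In your example, you would be better off reducing the animals to a few broader groups, such as [pet, domesticated, wild] or biological families. Otherwise, once you encode the categorical values, you could end up with thousands (to millions) of columns for the animals alone. If you don't have many examples, you will end up matching on everything except the animal, because: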

$\text{Distance between dog and cat} = 1$

$\text{Distance between dog and giraffe} = 1$

This means that other attributes in your data may end up looking more similar, yet be uninformative to the classification process. For example, the number of legs.
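A small sketch of why one-hot encoded categories carry no notion of "closer" or "farther". The animal list and the two helper functions are hypothetical and written out by hand for illustration. Under a simple match/mismatch distance every pair of distinct categories is 1 apart, and under Euclidean distance every pair is the same $\sqrt{2}$ apart.

```python
import numpy as np

# Hypothetical categories, one-hot encoded by hand for illustration.
animals = ["dog", "cat", "giraffe"]
one_hot = {name: np.eye(len(animals))[i] for i, name in enumerate(animals)}

def mismatch_distance(u, v):
    """0 if both vectors encode the same category, otherwise 1."""
    return 0.0 if np.array_equal(u, v) else 1.0

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

pairs = [("dog", "cat"), ("dog", "giraffe"), ("cat", "giraffe")]
mismatch = {p: mismatch_distance(one_hot[p[0]], one_hot[p[1]]) for p in pairs}
euclid = {p: euclidean_distance(one_hot[p[0]], one_hot[p[1]]) for p in pairs}
# Every pair gets the same value in each dictionary: the encoding cannot
# express that a dog is "more like" a cat than like a giraffe.
```

Grouping categories into broader classes, as suggested above, is one way to put some of that lost similarity back into the encoding.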

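On the plus side, KNN may cope with the introduction of new attributes better than other methods. Classification is based on whatever data is currently available, so adding a new column is not like a neural network, which has to be retrained before it can accept the new dataset.

Conclusion

KNN has other strengths and weaknesses, and there are ways to optimize its performance. Still, based on your description of the data, and making a few assumptions, KNN does not sound like a good fit for your classification problem.

Most models will perform better if you can reduce the number of dimensions, but if the number and nature of the attributes are relatively static, I would start with a Decision Tree. The benefit of starting with a decision tree is that you get an idea of how important each column is to the assigned class.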
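As a rough sketch of that last point, scikit-learn's `DecisionTreeClassifier` exposes a `feature_importances_` attribute after fitting. The synthetic data below is an assumption made purely for illustration: one column fully determines the label and the other is noise.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)    # this column determines the class
noise = rng.normal(size=n)          # this column is unrelated to the class
X_demo = np.column_stack([informative, noise])
y_demo = (informative > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Importances are normalized to sum to 1; higher means the column
# contributed more to the tree's splits.
importances = dict(zip(["informative", "noise"], tree.feature_importances_))
```

On real data the importances are rarely this clear-cut, but they give a first indication of which columns are worth keeping.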

It's been a while, but I eventually landed on a solution:

from scipy.spatial.distance import euclidean, cosine, dice

cat_vars = ['col2', 'col3']
# one-hot encoding produces one column per distinct category value
cat_vars_width = sum(X[c].nunique() for c in cat_vars)
con_vars = ['col1']
# one column per continuous variable
con_vars_width = len(con_vars)

# Assumption: X_normal is the dense transformed feature matrix
# (X_transformed in the question), with columns ordered
# [continuous | one-hot | text]
X_normal = X_transformed

# use a custom distance func to apply appropriate
# distance calcs to specific columns

con_vars_indices = list(range(0, con_vars_width))
cat_vars_indices = list(range(con_vars_width, con_vars_width + cat_vars_width))
tfidf_indices = list(range(con_vars_width + cat_vars_width, X_normal.shape[1]))

def custom_distance_maker(con_vars_indices, cat_vars_indices, tfidf_indices):
    def custom_distance(x, y):
        # slice each sample into its continuous, one-hot and text blocks
        x_con = x[con_vars_indices]
        y_con = y[con_vars_indices]
        x_cat = x[cat_vars_indices]
        y_cat = y[cat_vars_indices]
        x_tfidf = x[tfidf_indices]
        y_tfidf = y[tfidf_indices]

        # one distance per block, summed with equal weight
        d_con = euclidean(x_con, y_con)
        d_cat = dice(x_cat, y_cat)
        d_tfidf = cosine(x_tfidf, y_tfidf)
        return d_con + d_cat + d_tfidf
    return custom_distance

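So we use our knowledge of how many columns the one-hot encoding will create, and how many columns are truly continuous variables, to apply the distance calculation suited to each data type to the matching slice of the big array. It's slow, but honestly it has given some really good results that made my clients happy.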
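The answer does not show how the function is handed to the search itself. One way, sketched here under assumptions, is scikit-learn's support for a callable `metric` in `NearestNeighbors`. The matrix `X_demo`, the index lists, and `block_distance` are hypothetical stand-ins for the transformed data and the custom distance above.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, dice
from sklearn.neighbors import NearestNeighbors

# Hypothetical, hand-made feature matrix laid out as
# [1 continuous | 3 one-hot | 2 text-embedding] columns.
X_demo = np.array([
    [0.00, 1, 0, 0, 0.9, 0.1],
    [0.40, 0, 1, 0, 0.1, 0.9],
    [1.00, 0, 0, 1, 0.7, 0.7],
    [0.60, 0, 1, 0, 0.2, 0.8],
])
con_idx, cat_idx, txt_idx = [0], [1, 2, 3], [4, 5]

def block_distance(x, y):
    """Sum of one distance per block, mirroring the approach above."""
    d_con = euclidean(x[con_idx], y[con_idx])
    # dice is defined on boolean vectors, so threshold the one-hot block
    d_cat = dice(x[cat_idx] > 0.5, y[cat_idx] > 0.5)
    d_txt = cosine(x[txt_idx], y[txt_idx])
    return d_con + d_cat + d_txt

# A callable metric is evaluated pair by pair in Python, which is why
# this approach is slow on large datasets.
nn_demo = NearestNeighbors(n_neighbors=2, algorithm="brute", metric=block_distance)
nn_demo.fit(X_demo)
distances, indices = nn_demo.kneighbors(X_demo[[1]])
```

In this toy matrix, row 1 matches itself first and row 3 second, since the two rows share a category and have similar continuous and text values.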

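The custom distance function is also easy to modify, which lets you "weight" the different variables you want to match on. For example, if you care more about text matches, just multiply the text term by a factor greater than 1 to amplify any differences in cosine distance.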
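A minimal sketch of that weighting idea. `weighted_distance_maker`, the weight parameters `w_con`, `w_cat`, `w_txt`, and the three toy vectors are assumptions introduced for illustration, not part of the original solution.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, dice

def weighted_distance_maker(con_idx, cat_idx, txt_idx,
                            w_con=1.0, w_cat=1.0, w_txt=1.0):
    """Hypothetical helper: scale each block's distance by its own weight."""
    def weighted_distance(x, y):
        d_con = euclidean(x[con_idx], y[con_idx])
        # dice is defined on boolean vectors, so threshold the one-hot block
        d_cat = dice(x[cat_idx] > 0.5, y[cat_idx] > 0.5)
        d_txt = cosine(x[txt_idx], y[txt_idx])
        return w_con * d_con + w_cat * d_cat + w_txt * d_txt
    return weighted_distance

# Layout: [1 continuous | 2 one-hot | 2 text-embedding] columns.
con_idx, cat_idx, txt_idx = [0], [1, 2], [3, 4]
query  = np.array([0.5, 1, 0, 1.0, 0.0])
cand_a = np.array([0.5, 1, 0, 0.0, 1.0])   # same category, unrelated text
cand_b = np.array([0.8, 0, 1, 1.0, 0.0])   # different category, same text

equal = weighted_distance_maker(con_idx, cat_idx, txt_idx)
text_heavy = weighted_distance_maker(con_idx, cat_idx, txt_idx, w_txt=5.0)
```

With equal weights `cand_a` is the closer match to `query`. Once the text term is multiplied by 5, `cand_b` becomes the closer one. The weights remain a subjective choice, which is the concern raised in the question.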

I hope others find this hacky solution useful for similarity search over heterogeneous data.
