Gower distance with R functions: "gower.dist" and "daisy"

Cross Validated: r, clustering, distance, gower-similarity

2022-03-25 01:38:40

I have 9 numeric variables and 5 binary (0-1) variables, with 73 samples in my data set. I know that Gower distance is a good metric for data sets with mixed variables.

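I tried both the daisy (cluster) and gower.dist (StatMatch) functions. Weights can be assigned in both functions, and I assigned them as follows: a weight of 5 for the numeric attributes and a weight of 1 for the binary attributes.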

However, they give different distance matrices. Shouldn't they give the same result?
These are my features and the first sample.

A    B     C  D  E  F  G  H  I     J     K    L     M     N
800  1200  0  0  0  0  1  2  0.31  0.33  0.1  0.62  0.35  0.44

A: numeric (square feet); B: numeric (dollars); C, D, E, F, G: binary (yes/no); H: numeric (number of children); I, J, K, L, M, N: numeric (percentages)

2 Answers

In fact, they do give the same results. I am not sure how you are comparing them, but here is an example:

# Create example data
set.seed(123)
# create nominal variable
nom <- factor(rep(letters[1:3], each=10))
# create numeric variables
vars <- as.matrix(replicate(17, rnorm(30)))
df <- data.frame(nom, vars)

library(cluster)
daisy.mat <- as.matrix(daisy(df, metric="gower"))

library(StatMatch)
gower.mat <- gower.dist(df)

# you can look directly to see the numbers are the same
head(daisy.mat, 3)
head(gower.mat, 3)

# identical() nevertheless returns FALSE. Why?
identical(daisy.mat, gower.mat)
[1] FALSE

# Because the two functions return numbers that differ by
# extremely small floating-point amounts
max(abs(daisy.mat - gower.mat))
[1] 5.551115e-17

# all.equal() compares within a numerical tolerance instead
all.equal(daisy.mat, gower.mat, check.attributes = FALSE)
[1] TRUE
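
To make it clearer what both functions are computing here, the following is a minimal sketch of the unweighted Gower distance written out by hand for factors and numeric columns only. The helper `gower_by_hand` and the `toy` data frame are made up for illustration and are not part of either package; the sketch assumes no missing values and no constant columns.

```r
library(cluster)  # provides daisy()

set.seed(1)
toy <- data.frame(
  grp = factor(rep(c("a", "b", "c"), length.out = 8)),
  x   = rnorm(8),
  y   = runif(8)
)

# Hypothetical helper: unweighted Gower distance, factors and numerics only
gower_by_hand <- function(d) {
  n <- nrow(d)
  total <- matrix(0, n, n)
  for (v in names(d)) {
    col <- d[[v]]
    if (is.factor(col)) {
      # nominal variable: 0 if the levels match, 1 otherwise
      part <- outer(as.character(col), as.character(col), "!=") * 1
    } else {
      # numeric variable: absolute difference divided by the variable's range
      part <- abs(outer(col, col, "-")) / diff(range(col))
    }
    total <- total + part
  }
  # every variable contributes equally, so average over the columns
  total / ncol(d)
}

hand_mat  <- gower_by_hand(toy)
daisy_mat <- as.matrix(daisy(toy, metric = "gower"))

# compare within numerical tolerance, as above
all.equal(hand_mat, daisy_mat, check.attributes = FALSE)
```

If the hand-rolled matrix agrees with `daisy` within tolerance, any remaining disagreement between two implementations comes down to how column types and weights are interpreted rather than to the formula itself.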

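Now that I know you are passing an additional argument to the daisy function, there is still a solution. It is in the documentation of gower.dist. The key part is in the first section of the documentation: columns of mode logical are treated as binary asymmetric variables. So you need to make sure your data is structured appropriately.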

set.seed(123)
# create nominal variable
nom <- factor(rep(letters[1:3], each=10))
# create binary variables
bin <- as.matrix(replicate(5, sample(c(0,1), 30, replace=TRUE)))
# create numeric variables
vars <- as.matrix(replicate(9, rnorm(30)))
df <- data.frame(nom, bin, vars)

# You can see that the columns are not 'logical' types
# We need to change this
str(df)
> str(df)
'data.frame':   30 obs. of  15 variables:
     $ nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
     $ X1  : num  0 1 0 1 1 0 1 1 1 0 ...
     $ X2  : num  1 1 1 1 0 0 1 0 0 0 ...
     $ X3  : num  1 0 0 0 1 0 1 1 1 0 ...
     $ X4  : num  0 1 0 1 0 0 1 0 0 1 ...
     $ X5  : num  1 0 0 0 0 1 0 0 0 1 ...
     $ X1.1: num  1.026 -0.285 -1.221 0.181 -0.139 ...
     $ X2.1: num  -0.045 -0.785 -1.668 -0.38 0.919 ...
     $ X3.1: num  1.13 -1.46 0.74 1.91 -1.44 ...
     $ X4.1: num  0.298 0.637 -0.484 0.517 0.369 ...
     $ X5.1: num  1.997 0.601 -1.251 -0.611 -1.185 ...
     $ X6  : num  0.0597 -0.7046 -0.7172 0.8847 -1.0156 ...
     $ X7  : num  -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
     $ X8  : num  0.134 0.221 1.641 -0.219 0.168 ...
     $ X9  : num  0.704 -0.106 -1.259 1.684 0.911 ...


# make columns logical
df[2:6] <- lapply(df[2:6], function(x) x == 1)

# now the columns are the correct types
> str(df)
'data.frame':   30 obs. of  15 variables:
     $ nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
     $ X1  : logi  FALSE TRUE FALSE TRUE TRUE FALSE ...
     $ X2  : logi  TRUE TRUE TRUE TRUE FALSE FALSE ...
     $ X3  : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
     $ X4  : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
     $ X5  : logi  TRUE FALSE FALSE FALSE FALSE TRUE ...
     $ X1.1: num  1.026 -0.285 -1.221 0.181 -0.139 ...
     $ X2.1: num  -0.045 -0.785 -1.668 -0.38 0.919 ...
     $ X3.1: num  1.13 -1.46 0.74 1.91 -1.44 ...
     $ X4.1: num  0.298 0.637 -0.484 0.517 0.369 ...
     $ X5.1: num  1.997 0.601 -1.251 -0.611 -1.185 ...
     $ X6  : num  0.0597 -0.7046 -0.7172 0.8847 -1.0156 ...
     $ X7  : num  -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
     $ X8  : num  0.134 0.221 1.641 -0.219 0.168 ...
     $ X9  : num  0.704 -0.106 -1.259 1.684 0.911 ...


# now you can do your calls
daisy.mat <- as.matrix(daisy(df, metric="gower", type=list(asymm=c(2,3,4,5,6))))
gower.mat <- gower.dist(df)

# and you can see that the results are the same
all.equal(daisy.mat, gower.mat, check.attributes = FALSE)
[1] TRUE
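
As a rough illustration of why the symmetric/asymmetric distinction changes the numbers, here is a small base R sketch for two made-up observations `a` and `b` on five binary variables. It follows the usual Gower convention that, for an asymmetric binary variable, a joint absence (0, 0) is dropped from both the numerator and the denominator; the vectors are hypothetical and not taken from the data above.

```r
# two hypothetical observations on five binary variables
a <- c(1, 0, 0, 1, 0)
b <- c(1, 1, 0, 0, 0)

mismatch  <- a != b            # per-variable distance: TRUE where they differ
both_zero <- a == 0 & b == 0   # joint absences

# symmetric binary: every variable counts, so 2 mismatches out of 5
sym_dist <- sum(mismatch) / length(a)

# asymmetric binary: joint absences are ignored, so 2 mismatches out of 3
asym_dist <- sum(mismatch[!both_zero]) / sum(!both_zero)

sym_dist   # 0.4
asym_dist  # 0.6666667
```

So if one function treats the 0-1 columns as asymmetric binary and the other treats them as plain numeric or symmetric binary, the two distance matrices will differ even though both are "Gower" distances.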

Yes, as cdeterman demonstrated, they give the same results.

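One difference I would like to mention here is that "gower.dist" effectively uses an equal-weighting approach (the so-called weights in the function's documentation can only be 0 or 1), whereas "daisy" lets you supply your own per-variable weights through the "weights" argument.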
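
For reference, here is a minimal sketch of what per-variable weights do in the Gower formula, using the weights from the question (5 for numeric columns, 1 for binary ones). The data frame `d` and the helper `weighted_gower` are made up for illustration; the binary columns are declared as symmetric binary in `daisy`, which is an assumption about how the asker wants them treated.

```r
library(cluster)  # provides daisy()

set.seed(42)
n <- 10
d <- data.frame(
  n1 = rnorm(n),
  n2 = runif(n),
  b1 = rep(c(0, 1), 5),           # 0-1 variable
  b2 = rep(c(0, 0, 1, 1, 1), 2)   # 0-1 variable
)
w <- c(5, 5, 1, 1)  # weights from the question: numeric = 5, binary = 1

# Hypothetical helper: weighted Gower distance for numeric and
# symmetric 0-1 columns (no missing values, no constant columns)
weighted_gower <- function(d, w) {
  total <- matrix(0, nrow(d), nrow(d))
  for (k in seq_along(d)) {
    col <- d[[k]]
    # range-scaled absolute difference; for a 0-1 column the range is 1,
    # so this reduces to a plain 0/1 mismatch
    total <- total + w[k] * abs(outer(col, col, "-")) / diff(range(col))
  }
  total / sum(w)
}

hand_w  <- weighted_gower(d, w)
daisy_w <- as.matrix(daisy(d, metric = "gower",
                           type = list(symm = 3:4), weights = w))

all.equal(hand_w, daisy_w, check.attributes = FALSE)
```

The weighted distance is simply a weighted mean of the per-variable distances, so with weights of 5 and 1 the numeric columns dominate the result.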

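Conclusion: if you want a more flexible way to compute Gower dissimilarity, I prefer "daisy" from the "cluster" package. If your main interest is in constructing a synthetic data set, go with "gower.dist"; using "NND.hotdeck" directly will save you a lot of time.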
