Similarity between two scatter plots

data-mining clustering
2022-02-28 18:51:12

I would like to know whether there is a metric for computing the similarity between two scatter plots.

2 Answers

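The simplest approach is to compute the Euclidean distance between the barycenters (centroids) of the two distributions; however, this does not take into account how differently the two distributions are spread.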

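If you want something more accurate, you can complement this distance with the spread of the distributions; among other options, you can compute the spread as the standard deviation of the distances between the barycenter (already computed) and each point.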
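The spread described in the paragraph above can be sketched as follows. This is a minimal illustration, not the code from this answer: the function name `radial_spread`, the fixed seed, and the sample sizes are assumptions made for the example.

```python
import numpy as np

def radial_spread(points):
    """Standard deviation of the distances from each point to the barycenter."""
    points = np.asarray(points, dtype=float)
    barycenter = points.mean(axis=0)
    # Euclidean distance of every point to the barycenter
    distances = np.linalg.norm(points - barycenter, axis=1)
    return distances.std()

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
tight = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
wide = rng.normal(loc=10.0, scale=5.5, size=(100, 2))

print("spread of the tight cloud:", radial_spread(tight))
print("spread of the wide cloud: ", radial_spread(wide))
```

Note that this quantity measures how much the point-to-barycenter distances vary, so points lying on a perfect ring around the barycenter would give a spread of zero. If that matters for your data, the mean of those distances may be a more suitable measure of size.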

You can also consider a joint metric that combines the previous two (the barycenter distance and the spread distance).

An example in Python. Take the three distributions in the figure: (Figure: the dummy distributions created)

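As you can see, distributions A and B are centered on the same point, but they have different spreads. Distributions A and C have different centers, but their spreads are more similar.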

Here is the code that computes the distances I described. The smaller the distance, the more similar the distributions.

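import numpy as np
import matplotlib.pyplot as plt

# create three dummy 2D distributions of 100 points each
dist_a=[]
dist_b=[]
dist_c=[]
for i in range(100):
    dist_a.append(np.random.randn(2)+10)      # centered at (10, 10), standard deviation 1
    dist_c.append(np.random.randn(2)+25)      # centered at (25, 25), standard deviation 1
    dist_b.append(np.random.randn(2)*5.5+10)  # centered at (10, 10), standard deviation 5.5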

plt.scatter([a for a, _ in dist_a], [b for _, b in dist_a], label='distribution a')
plt.scatter([a for a, _ in dist_b], [b for _, b in dist_b], label='distribution b')
plt.scatter([a for a, _ in dist_c], [b for _, b in dist_c], label='distribution c')

plt.legend()
plt.show()

# calculate the barycenters (mean of the points along each coordinate)
bc_a=np.mean(dist_a, axis=0)
bc_b=np.mean(dist_b, axis=0)
bc_c=np.mean(dist_c, axis=0)

# calculate the Euclidean distance between the barycenters
dist_a_b=np.linalg.norm(bc_a-bc_b)
dist_a_c=np.linalg.norm(bc_a-bc_c)
dist_b_c=np.linalg.norm(bc_b-bc_c)

print("baricenter distante between distribution A and distribution B=", dist_a_b)
print("baricenter distante between distribution A and distribution C=", dist_a_c )
print("baricenter distante between distribution B and distribution C=", dist_b_c )
print ("\n")

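# calculate the spread of the distributions, e.g. their standard deviation
# note: np.std without an axis argument takes the standard deviation over all coordinate values at once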
spread_a=np.std(dist_a)
spread_b=np.std(dist_b)
spread_c=np.std(dist_c)


dist_spread_a_b=np.abs(spread_a-spread_b)
dist_spread_a_c=np.abs(spread_a-spread_c)
dist_spread_b_c=np.abs(spread_b-spread_c)

print("spread distance between distribution A and distribution B=", dist_spread_a_b)
print("spread distance between distribution A and distribution C=", dist_spread_a_c)
print("spread distance between of distribution B and distribution C=", dist_spread_b_c)
print ("\n")

# combine both into a single metric. NB: the weighting parameter alpha is subjective and depends on the use case
# alpha=0 : ignore the Euclidean distance between the barycenters
# alpha=1 : ignore the difference in spread between the distributions

alpha=0.3
joint_metric_a_b=alpha*dist_a_b + (1-alpha)*dist_spread_a_b
joint_metric_a_c=alpha*dist_a_c + (1-alpha)*dist_spread_a_c
joint_metric_b_c=alpha*dist_b_c + (1-alpha)*dist_spread_b_c


print("joined metric distance between distribution A and distribution B=", joint_metric_a_b)
print("joined metric distance between distribution A and distribution C=", joint_metric_a_c)
print("joined metric distance between distribution B and distribution C=", joint_metric_b_c)

Output:

baricenter distante between distribution A and distribution B= 0.22454217332627005
baricenter distante between distribution A and distribution C= 21.028862497007008
baricenter distante between distribution B and distribution C= 20.98580645790957


spread distance between distribution A and distribution B= 4.153324630270008
spread distance between distribution A and distribution C= 0.004700454831506384
spread distance between of distribution B and distribution C= 4.158025085101515


joined metric distance between distribution A and distribution B= 2.974689893186887
joined metric distance between distribution A and distribution C= 6.311949067484157
joined metric distance between distribution B and distribution C= 9.206359496943932

According to the first metric (the Euclidean distance between barycenters), distributions A and B are the most similar.

According to the second metric (the spread distance), distributions A and C are the most similar.

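The third metric is tunable: the parameter alpha accepts values between 0 and 1. Depending on your use case, you may care more that the distributions lie around the same point, in which case the distance between their barycenters matters most, or that the distributions have the same spread even if their barycenters are slightly shifted. You therefore have to tune alpha for your situation.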
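To make the effect of alpha concrete, the sketch below re-applies the same weighted sum to the distances printed in the output above (rounded to four decimals) for a few values of alpha. The helper name `joint_metric` and the chosen alpha values are assumptions for illustration only.

```python
def joint_metric(barycenter_dist, spread_dist, alpha):
    """Weighted sum of the two distances; alpha must lie in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * barycenter_dist + (1.0 - alpha) * spread_dist

# (barycenter distance, spread distance), rounded from the printed output
pairs = {
    "A-B": (0.2245, 4.1533),
    "A-C": (21.0289, 0.0047),
    "B-C": (20.9858, 4.1580),
}

for alpha in (0.0, 0.3, 1.0):
    scores = {name: joint_metric(b, s, alpha) for name, (b, s) in pairs.items()}
    closest = min(scores, key=scores.get)
    print(f"alpha={alpha}: closest pair is {closest}, scores={scores}")
```

With these numbers, alpha=0 ranks A and C as the closest pair (only the spread counts), while alpha=0.3 and alpha=1 rank A and B as the closest.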

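A quick approach that needs minimal preprocessing is to convert the scatter plots into 2D histograms and then compare the histograms using a distance between them. Histogram distance metrics (for example, the Hellinger distance) are described in this post: https://datascience.stackexchange.com/a/33007/52089.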

The smaller the distance between the histograms, the more similar the scatter plots :)
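A minimal sketch of the histogram approach, assuming NumPy's `np.histogram2d` for the binning and the Hellinger distance as the comparison. The function name `hellinger_2d`, the choice of 10 bins per axis, and the test data are assumptions, not part of the linked post. Both histograms must share the same bin edges, otherwise the bins are not comparable.

```python
import numpy as np

def hellinger_2d(points_p, points_q, bins=10):
    """Hellinger distance between the 2D histograms of two point clouds."""
    points_p = np.asarray(points_p, dtype=float)
    points_q = np.asarray(points_q, dtype=float)

    # Use one common range so both histograms have identical bin edges
    both = np.vstack([points_p, points_q])
    value_range = [[both[:, 0].min(), both[:, 0].max()],
                   [both[:, 1].min(), both[:, 1].max()]]

    hist_p, _, _ = np.histogram2d(points_p[:, 0], points_p[:, 1],
                                  bins=bins, range=value_range)
    hist_q, _, _ = np.histogram2d(points_q[:, 0], points_q[:, 1],
                                  bins=bins, range=value_range)

    # Normalize the counts to probabilities
    p = hist_p / hist_p.sum()
    q = hist_q / hist_q.sum()

    # 0 means identical histograms, 1 means no overlapping bins
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
cloud_a = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
cloud_b = rng.normal(loc=10.0, scale=5.5, size=(100, 2))
cloud_c = rng.normal(loc=25.0, scale=1.0, size=(100, 2))

print("A vs B:", hellinger_2d(cloud_a, cloud_b))
print("A vs C:", hellinger_2d(cloud_a, cloud_c))
```

The result depends on the number of bins: too few bins hide differences in shape, while too many make even similar clouds look disjoint, so it is worth trying a few values on your own data.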