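I would like to know whether there are any metrics for computing the similarity between two scatter plots.
Similarity between two scatter plots
data mining
clustering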
2022-02-28 18:51:12
2 Answers
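The simplest approach is to compute the Euclidean distance between the barycenters (centroids) of the two distributions; however, this ignores differences in how spread out the distributions are.
If you want something more accurate, you can add the spread of the distributions to this distance; among other options, you can compute the spread as the standard deviation of the distances between the barycenter (already computed) and each point.
You can also consider a joint metric that combines the first two (barycenter distance and spread distance).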
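The spread described above (standard deviation of the distances from the barycenter to each point) could be sketched as follows. The function name `radial_spread` and the sample data are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

def radial_spread(points):
    """Standard deviation of the distances from the barycenter to each point."""
    points = np.asarray(points, dtype=float)
    barycenter = points.mean(axis=0)
    # Euclidean distance of every point to the barycenter
    radii = np.linalg.norm(points - barycenter, axis=1)
    return radii.std()

# Hypothetical sample data: same center, different spread
rng = np.random.default_rng(0)
tight = rng.standard_normal((100, 2)) + 10
wide = rng.standard_normal((100, 2)) * 5.5 + 10

print("spread of tight cloud:", radial_spread(tight))
print("spread of wide cloud:", radial_spread(wide))
```

Under this definition a perfect ring of points has zero spread, because every point is equally far from the barycenter. If that matters for your data, the mean distance to the barycenter is a possible alternative.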
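As you can see, distributions A and B are centered on the same point, but they have different spreads. Distributions A and C have different centers, but their spreads are more similar.
Here is the code to compute the distances I described. The smaller the distance, the more similar the distributions.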
import numpy as np
import matplotlib.pyplot as plt

# create three dummy distributions
dist_a = []
dist_b = []
dist_c = []
for i in range(100):
    dist_a.append(np.random.randn(2) + 10)
    dist_c.append(np.random.randn(2) + 25)
    dist_b.append(np.random.randn(2) * 5.5 + 10)

plt.scatter([a for a, _ in dist_a], [b for _, b in dist_a], label='distribution a')
plt.scatter([a for a, _ in dist_b], [b for _, b in dist_b], label='distribution b')
plt.scatter([a for a, _ in dist_c], [b for _, b in dist_c], label='distribution c')
plt.legend()
plt.show()

# calculate barycenters
bc_a = np.mean(dist_a, axis=0)
bc_b = np.mean(dist_b, axis=0)
bc_c = np.mean(dist_c, axis=0)

# calculate the distance between barycenters
dist_a_b = np.linalg.norm(bc_a - bc_b)
dist_a_c = np.linalg.norm(bc_a - bc_c)
dist_b_c = np.linalg.norm(bc_b - bc_c)
print("barycenter distance between distribution A and distribution B=", dist_a_b)
print("barycenter distance between distribution A and distribution C=", dist_a_c)
print("barycenter distance between distribution B and distribution C=", dist_b_c)
print()

# calculate the spread of the distributions, here their standard deviation
# (np.std without an axis is taken over all coordinate values at once)
spread_a = np.std(dist_a)
spread_b = np.std(dist_b)
spread_c = np.std(dist_c)
dist_spread_a_b = np.abs(spread_a - spread_b)
dist_spread_a_c = np.abs(spread_a - spread_c)
dist_spread_b_c = np.abs(spread_b - spread_c)
print("spread distance between distribution A and distribution B=", dist_spread_a_b)
print("spread distance between distribution A and distribution C=", dist_spread_a_c)
print("spread distance between distribution B and distribution C=", dist_spread_b_c)
print()

# put both in a single metric. NB: the parameter of this combination is subjective and depends on the use case
# alpha=0 : ignore the Euclidean distance between the barycenters
# alpha=1 : ignore the spread distance between the distributions
alpha = 0.3
joint_metric_a_b = alpha * dist_a_b + (1 - alpha) * dist_spread_a_b
joint_metric_a_c = alpha * dist_a_c + (1 - alpha) * dist_spread_a_c
joint_metric_b_c = alpha * dist_b_c + (1 - alpha) * dist_spread_b_c
print("joint metric distance between distribution A and distribution B=", joint_metric_a_b)
print("joint metric distance between distribution A and distribution C=", joint_metric_a_c)
print("joint metric distance between distribution B and distribution C=", joint_metric_b_c)

Output (exact values vary between runs, since the random seed is not fixed):
barycenter distance between distribution A and distribution B= 0.22454217332627005
barycenter distance between distribution A and distribution C= 21.028862497007008
barycenter distance between distribution B and distribution C= 20.98580645790957

spread distance between distribution A and distribution B= 4.153324630270008
spread distance between distribution A and distribution C= 0.004700454831506384
spread distance between distribution B and distribution C= 4.158025085101515

joint metric distance between distribution A and distribution B= 2.974689893186887
joint metric distance between distribution A and distribution C= 6.311949067484157
joint metric distance between distribution B and distribution C= 9.206359496943932
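According to the first metric (Euclidean distance between barycenters), distributions A and B are more similar.
According to the second metric (spread distance), distributions A and C are more similar.
The third metric is tunable through the parameter alpha, which takes values between 0 and 1. Depending on your use case, you may care more that the distributions lie around the same point, in which case the distance between their barycenters matters most, or that the distributions have the same spread even if their barycenters are slightly shifted. You therefore have to tune the alpha parameter to your situation.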
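To see how the choice of alpha changes which pair counts as "more similar", the joint metric can be wrapped in a function and evaluated for several values. This is a minimal sketch: the function name `joint_distance`, the seeded sample data, and the chosen alpha values are assumptions for illustration. It measures spread with `np.std` over all coordinate values, as the code above does.

```python
import numpy as np

def joint_distance(p, q, alpha):
    """Weighted sum of barycenter distance and spread distance (alpha in [0, 1])."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    center_dist = np.linalg.norm(p.mean(axis=0) - q.mean(axis=0))
    spread_dist = abs(p.std() - q.std())
    return alpha * center_dist + (1 - alpha) * spread_dist

# Hypothetical data mirroring the three dummy distributions
rng = np.random.default_rng(0)
dist_a = rng.standard_normal((100, 2)) + 10
dist_b = rng.standard_normal((100, 2)) * 5.5 + 10
dist_c = rng.standard_normal((100, 2)) + 25

for alpha in (0.0, 0.3, 1.0):
    ab = joint_distance(dist_a, dist_b, alpha)
    ac = joint_distance(dist_a, dist_c, alpha)
    closer = "B" if ab < ac else "C"
    print(f"alpha={alpha}: A-B={ab:.3f}, A-C={ac:.3f}, A is closer to {closer}")
```

Because the two terms are on different scales (here the barycenter distances are much larger than the spread distances), it may be worth normalizing each term before combining them, so that alpha reflects the intended weighting.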
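A quick approach that needs minimal preprocessing is to convert the scatter plots into 2D histograms and then compare the histograms using a distance measure. Histogram distance metrics (for example, the Hellinger distance) are described in this post: https://datascience.stackexchange.com/a/33007/52089.
The smaller the distance between the histograms, the more similar the scatter plots :)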
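A minimal sketch of this idea, assuming NumPy's `np.histogram2d` for the binning and the Hellinger distance as the comparison. The function name `hellinger_2d`, the bin count, and the sample data are assumptions for illustration. Both scatter plots are binned on the same grid so that the histograms are comparable.

```python
import numpy as np

def hellinger_2d(points_p, points_q, bins=10):
    """Hellinger distance between the 2D histograms of two point sets (0 = identical, 1 = disjoint)."""
    p = np.asarray(points_p, dtype=float)
    q = np.asarray(points_q, dtype=float)
    # Use one common bounding box so both histograms share the same bins
    both = np.vstack([p, q])
    bin_range = [[both[:, 0].min(), both[:, 0].max()],
                 [both[:, 1].min(), both[:, 1].max()]]
    hist_p, _, _ = np.histogram2d(p[:, 0], p[:, 1], bins=bins, range=bin_range)
    hist_q, _, _ = np.histogram2d(q[:, 0], q[:, 1], bins=bins, range=bin_range)
    # Normalize counts to probabilities
    hist_p /= hist_p.sum()
    hist_q /= hist_q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(hist_p) - np.sqrt(hist_q)) ** 2))

# Hypothetical sample data
rng = np.random.default_rng(0)
scatter_a = rng.standard_normal((500, 2)) + 10
scatter_a2 = rng.standard_normal((500, 2)) + 10   # same distribution as scatter_a
scatter_c = rng.standard_normal((500, 2)) + 25    # shifted center

d_near = hellinger_2d(scatter_a, scatter_a2)
d_far = hellinger_2d(scatter_a, scatter_c)
print("same distribution:", d_near)
print("shifted distribution:", d_far)
```

The result depends on the number of bins: with too many bins and few points, even two samples from the same distribution share few bins and appear far apart, so the bin count should be chosen with the sample size in mind.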
