考虑汉明距离——两个相等长度的字符串之间的汉明距离是对应符号不同的位置数。从这个定义可以看出,我们可以根据汉明距离生成具有聚类但变量之间没有相关性的数据。
下面是一个使用 Mathematica 的示例。
创建一些分类数据(4 个字符的均匀随机抽样的 3 个符号长序列):
chs = CharacterRange["a", "d"];
words = StringJoin @@@ Union[Table[RandomChoice[chs, 3], 40]];
Length[words]
words
(* 29 *)
(* {"aac", "aad", "abb", "aca", "acb", "acd", "adb", "adc", "baa", "bab", "bac", "bad", "bcc", "bcd", "caa", "cab", "cac", "cad", "cbb", "ccb", "cda", "cdb", "dab", "dba", "dbb", "dbd", "dca", "dcc", "dcd"} *)
对变量之间的关系使用镶嵌图(来自不同列的值对的条件概率):
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"]
wordSeqs = Characters /@ words;
opts = {ColorRules -> {2 -> ColorData[7, "ColorList"]}, ImageSize -> 400};
Grid[{{MosaicPlot[wordSeqs[[All, {1, 2}]],
"ColumnNames" -> {"column 1", "column 2"}, opts],
MosaicPlot[wordSeqs[[All, {2, 3}]],
"ColumnNames" -> {"column 2", "column 3"}, opts],
MosaicPlot[wordSeqs[[All, {1, 3}]],
"ColumnNames" -> {"column 1", "column 3"}, opts]}}, Dividers -> All]
我们可以看到没有相关性。
查找集群:
cls = FindClusters[words, 3, DistanceFunction -> HammingDistance]
(* {{"aac", "aad", "adc", "bac"}, {"abb", "acb", "adb", "baa", "bab", "bad",
"caa", "cab", "cac", "cad", "cbb", "ccb", "cda", "cdb", "dab",
"dbb"}, {"aca", "acd", "bcc", "bcd", "dba", "dbd", "dca", "dcc", "dcd"}} *)
如果我们用一个整数替换每个字符,我们可以从这个图中看到集群是如何用汉明距离形成的:
esrules = Thread[chs -> Range[Length[chs]]]; gr1 =
ListPointPlot3D[Characters[cls] /. esrules,
PlotStyle -> {PointSize[0.02]}, PlotLegends -> Automatic,
FaceGrids -> {Bottom, Left, Back}];
gr2 = Graphics3D[
Map[Text[#, Characters[#] /. esrules, {1, 1}] &, Flatten[cls]]];
Show[gr1, gr2]
进一步聚类
让我们通过连接汉明距离为 1 的单词来制作一个图:
mat = Clip[Outer[HammingDistance, words, words], {0, 1}, {0, 0}];
nngr = AdjacencyGraph[mat,
VertexLabels -> Thread[Range[Length[words]] -> words]]
现在让我们找到社区集群:
CommunityGraphPlot[nngr]
将图簇与找到的图簇FindClusters
(被迫找到 3)进行比较。我们可以看到“bac”是高度集中的,“aad”可以属于绿色簇,对应于 3D 图中的簇 1。
图表数据
这是 的边缘列表nngr
:
{1 <-> 2, 1 <-> 8, 1 <-> 11, 1 <-> 17, 2 <-> 6, 2 <-> 12, 2 <-> 18,
3 <-> 5, 3 <-> 7, 3 <-> 19, 3 <-> 25, 4 <-> 5, 4 <-> 6, 4 <-> 27,
5 <-> 6, 5 <-> 7, 5 <-> 20, 6 <-> 14, 6 <-> 29, 7 <-> 8, 7 <-> 22,
9 <-> 10, 9 <-> 11, 9 <-> 12, 9 <-> 15, 10 <-> 11, 10 <-> 12,
10 <-> 16, 10 <-> 23, 11 <-> 12, 11 <-> 13, 11 <-> 17, 12 <-> 14,
12 <-> 18, 13 <-> 14, 13 <-> 28, 14 <-> 29, 15 <-> 16, 15 <-> 17,
15 <-> 18, 15 <-> 21, 16 <-> 17, 16 <-> 18, 16 <-> 19, 16 <-> 20,
16 <-> 22, 16 <-> 23, 17 <-> 18, 19 <-> 20, 19 <-> 22, 19 <-> 25,
20 <-> 22, 21 <-> 22, 23 <-> 25, 24 <-> 25, 24 <-> 26, 24 <-> 27,
25 <-> 26, 26 <-> 29, 27 <-> 28, 27 <-> 29, 28 <-> 29}