After setting up a 2-node Hadoop cluster, getting familiar with Hadoop and Python, and starting from a naive implementation, I ended up with the following code:
import numpy as np

# randomize_centroids, has_converged and euclidean_dist are my own
# helper functions, defined elsewhere.
def kmeans(data, k, c=None):
    # use the centroids passed in, or pick k random ones from the data
    if c is not None:
        centroids = c
    else:
        centroids = []
        centroids = randomize_centroids(data, centroids, k)

    old_centroids = [[] for i in range(k)]
    iterations = 0

    while not has_converged(centroids, old_centroids, iterations):
        iterations += 1
        clusters = [[] for i in range(k)]

        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)

        # recalculate centroids as the mean of each cluster
        for index, cluster in enumerate(clusters):
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()

    print("The total number of data instances is: " + str(len(data)))
Serial execution works fine. How do I distribute it across Hadoop? In other words, what should go to the reducer and what to the mapper?

Please note that, if possible, I would like to follow the style of the tutorial, since it is the one I understand.
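To make the question concrete, here is my rough understanding of the split, based on the tutorial: the mapper assigns each point to its nearest centroid, and the reducer averages the points belonging to each centroid. This is only an untested sketch assuming Hadoop Streaming; the file name centroids.txt and the input format (one comma-separated point per line) are chosen just for illustration.

mapper.py:

    #!/usr/bin/env python
    # mapper.py - assign each input point to its nearest centroid.
    # The current centroids are shipped to every node as a side file
    # "centroids.txt" (e.g. via the -files option), one comma-separated
    # centroid per line.
    import sys
    import numpy as np

    centroids = np.loadtxt("centroids.txt", delimiter=",", ndmin=2)

    for line in sys.stdin:
        point = np.array([float(x) for x in line.strip().split(",")])
        # key = index of the closest centroid, value = the point itself
        idx = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
        print("%d\t%s" % (idx, ",".join(str(x) for x in point)))

reducer.py:

    #!/usr/bin/env python
    # reducer.py - recompute each centroid as the mean of its points.
    # Hadoop Streaming sorts mapper output by key, so all points that
    # belong to one centroid arrive consecutively.
    import sys
    import numpy as np

    current_key = None
    points = []

    def emit(key, pts):
        centroid = np.mean(np.array(pts), axis=0)
        print("%s\t%s" % (key, ",".join(str(x) for x in centroid)))

    for line in sys.stdin:
        key, value = line.strip().split("\t")
        if current_key is not None and key != current_key:
            emit(current_key, points)
            points = []
        current_key = key
        points.append([float(x) for x in value.split(",")])

    if current_key is not None:
        emit(current_key, points)

If I understand correctly, the while loop from my kmeans function would then become the driver: run one streaming job per iteration, read the reducer output back as the new centroids.txt, and stop when has_converged says so. Is that the right way to split it?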