Data mining - Replicating the randomForest proximity matrix from the R package in Python

I am trying to port this R code to Python:

rf <- randomForest(features, proximity = T, oob.prox = T, ntree = 2000)
dists <- as.dist(1 - rf$proximity)

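with the parameters:
oob.prox: should proximity be calculated only on "out-of-bag" data?
proximity: if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency with which pairs of data points end up in the same terminal nodes).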

I am currently trying to use sklearn.ensemble.RandomTreesEmbedding for this task, but it has no proximity matrix functionality. I found the following developer comment:

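We don't implement a proximity matrix in Scikit-Learn (yet). However, this could be done by relying on the apply function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest (through forest.estimators_) and count the number of times they fall in the same leaf, i.e., the number of times apply gives the same node id for both samples in the pair.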
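The counting described in that comment can be sketched without a Python-level loop over sample pairs. The following is only a sketch under stated assumptions: the helper name proximity_matrix and the random placeholder data X_demo are hypothetical, and the counts are divided by the number of trees so that values lie in [0, 1]. It relies on the forest's own apply method, which returns one leaf index per sample and tree.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding


def proximity_matrix(forest, X):
    """Fraction of trees in which each pair of rows of X shares a leaf.

    Hypothetical helper, not part of scikit-learn.
    """
    # apply() returns an (n_samples, n_trees) array of leaf indices.
    leaves = forest.apply(X)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        col = leaves[:, t]
        # Broadcasted comparison marks every pair that shares a leaf in tree t.
        prox += col[:, None] == col[None, :]
    return prox / n_trees


# Placeholder data (an assumption, standing in for the real feature matrix).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
forest = RandomTreesEmbedding(n_estimators=50, max_depth=5, random_state=0).fit(X_demo)
prox = proximity_matrix(forest, X_demo)
```

Looping over trees rather than over sample pairs keeps the work in NumPy. The full matrix still needs memory proportional to the square of the number of samples.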

So I tried using SciPy's pdist() function together with a custom distance (or, in this case, proximity) measure. I still have several problems:

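The proximity function is very slow.
How do I handle the out-of-bag behavior?
How do I recreate the exact behavior of as.dist(1 - rf$proximity)? I think I need to normalize my count matrix, subtract it from 1, and then compute the Euclidean distance between its rows!?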
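On the last point, as.dist() in R only reinterprets a square matrix as a distance object and does not compute Euclidean distances between rows. A minimal sketch of the equivalent step in Python follows, assuming the proximities are already normalized to [0, 1]. The 3x3 matrix prox_demo is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Made-up, already normalized proximity matrix (an assumption for illustration).
prox_demo = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Dissimilarity as in the R snippet: 1 - proximity, with a zero diagonal.
dist_square = 1.0 - prox_demo

# Condensed (one entry per pair) form, the layout pdist() also returns.
dist_condensed = squareform(dist_square, checks=True)
```

The condensed vector can be passed on to routines that expect pdist-style input, such as scipy.cluster.hierarchy.linkage.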

So far my code looks like this:

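import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn import ensemble

# xdata: the (n_samples, n_features) feature array, assumed to be loaded already

# grow a random forest from points
rf = ensemble.RandomTreesEmbedding(
    n_estimators=200,
    random_state=0,
    max_depth=5,
)
rfdata = rf.fit_transform(xdata)


# define an affinity measure function to use with SciPy's pdist
def treeprox(u, v):
    # apply() expects 2-D input, so reshape the single samples
    u = u.reshape(1, -1)
    v = v.reshape(1, -1)
    a = rf.apply(u)
    b = rf.apply(v)
    # count the number of trees in which both samples fall in the same leaf
    # (NumPy compares the leaf indices element-wise)
    return np.sum(a == b)


# pairwise co-leaf counts (a proximity, not yet a distance);
# squareform leaves the diagonal at 0
distm = pdist(xdata, treeprox)
distm = squareform(distm)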
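For the out-of-bag question, one possible approach is to draw the bootstrap samples by hand so that the out-of-bag rows of every tree are known. This is a sketch under several assumptions and not a reproduction of the R package: the helper oob_proximity, the demo data, and the labels y_demo are all hypothetical. The R call above has no response and therefore runs randomForest in unsupervised mode, which this sketch does not replicate. Counting a pair only in trees where both samples are out of bag, and dividing by that count, is my reading of oob.prox.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def oob_proximity(X, y, n_trees=200, max_depth=5, seed=0):
    """Proximity counted only in trees where both samples are out of bag.

    Hypothetical helper; hand-rolled bagging of scikit-learn decision trees.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    same_leaf = np.zeros((n, n))
    both_oob = np.zeros((n, n))
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
        oob = np.ones(n, dtype=bool)
        oob[idx] = False  # rows never drawn are out of bag for this tree
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            max_features="sqrt",
            random_state=int(rng.integers(2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        leaves = tree.apply(X)  # leaf index of every row in this tree
        pair_oob = oob[:, None] & oob[None, :]
        both_oob += pair_oob
        same_leaf += pair_oob & (leaves[:, None] == leaves[None, :])
    # Pairs that were never jointly out of bag keep a proximity of 0.
    prox = np.divide(same_leaf, both_oob,
                     out=np.zeros_like(same_leaf), where=both_oob > 0)
    np.fill_diagonal(prox, 1.0)
    return prox


# Placeholder data and labels (assumptions for illustration only).
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
prox_oob = oob_proximity(X_demo, y_demo, n_trees=100)
```

With few trees, some pairs may never be out of bag together, so a larger n_trees gives more stable estimates.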

I guess there must be a better way, since the R package randomForest provides this functionality out of the box.
Any suggestions?
TIA