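I have implemented my own solution. I wrote two functions:
prox_matrix(df, target, features, cluster_dimension, trees = 10)
Parameters
- df: the input DataFrame
- target: the dependent variable you are trying to predict with the random forest
- features: list of independent variables
- cluster_dimension: the dimension you would like to cluster/pool before adding it to your list of features
- trees: the number of trees to use in your random forest
Returns
- D: DataFrame holding the proximity matrix for cluster_dimension (stored as a distance, 1 - proximity, so smaller values mean more similar levels)
Code below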
def prox_matrix(df, target, features, cluster_dimension, trees=10):
    # https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    import pandas as pd

    # Initialize dataframe for independent variables
    independent = pd.DataFrame()

    # Handle categoricals: this should really be added to RandomForestRegressor
    for column, data_type in df[features].dtypes.items():
        try:
            independent[column] = pd.to_numeric(df[column], downcast='integer')
        except ValueError:
            # Non-numeric column: one-hot encode it, with a NaN indicator if it has nulls
            contains_nulls = df[column].isnull().values.any()
            dummies = pd.get_dummies(df[column], prefix=column, dummy_na=contains_nulls, drop_first=True)
            for name in dummies.columns:
                independent[name] = dummies[name]

    if len(independent.index) != len(df.index):
        raise Exception('independent variables not stored properly')

    # Train model
    clf = RandomForestRegressor(n_estimators=trees, n_jobs=-1)
    clf.fit(independent, df[target])

    # Final leaf for each tree: array of shape (n_records, trees)
    leaves = clf.apply(independent)
    # Value in cluster dimension
    labels = df[cluster_dimension].values

    # Count matching leaves for every pair of records, pooled by label pair
    numerator_matrix = {}
    for i, value_i in enumerate(labels):
        for j, value_j in enumerate(labels):
            if i >= j:
                numerator_matrix[(value_i, value_j)] = numerator_matrix.get((value_i, value_j), 0) + np.count_nonzero(leaves[i] == leaves[j])
                numerator_matrix[(value_j, value_i)] = numerator_matrix[(value_i, value_j)]

    # Normalize by the total number of possible matching leaves
    prox = {key: 1.0 - float(x) / (trees * np.count_nonzero(labels == key[0]) * np.count_nonzero(labels == key[1])) for key, x in numerator_matrix.items()}

    # Make sorted dataframe
    levels = np.unique(labels)
    D = pd.DataFrame(data=[[prox[(i, j)] for i in levels] for j in levels], index=levels, columns=levels)
    return D
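To make the leaf-matching idea concrete, here is a minimal sketch of the record-level proximity that prox_matrix() pools by label. The `leaves` array is a made-up stand-in for the output of `clf.apply(...)`, not real model output, and it only needs NumPy.

```python
import numpy as np

# Hypothetical stand-in for clf.apply(X): one row per record, one column
# per tree, each entry the index of the leaf that record lands in.
leaves = np.array([
    [3, 7, 2],
    [3, 7, 5],
    [4, 1, 5],
])

# same_leaf[i, j, t] is True when records i and j share a leaf in tree t
same_leaf = leaves[:, None, :] == leaves[None, :, :]

# Proximity is the fraction of trees in which two records share a leaf
proximity = same_leaf.mean(axis=2)

# The function above works with 1 - proximity, i.e. a distance
distance = 1.0 - proximity
print(distance)
```

Records 0 and 1 share a leaf in two of the three trees, so they end up closer to each other than either is to record 2.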
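kMedoids(D, k, tmax=100)
Parameters
- D: the proximity/distance matrix
- k: the number of clusters
- tmax: the maximum number of iterations to check for convergence of the clustering
Returns
- M: list of medoids
- C: dictionary mapping each medoid to the levels clustered around it
- S: silhouette of each cluster, for evaluating performance
Code below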
def kMedoids(D, k, tmax=100):
    # https://www.researchgate.net/publication/272351873_NumPy_SciPy_Recipes_for_Data_Science_k-Medoids_Clustering
    import numpy as np
    import pandas as pd

    # Determine dimensions of distance matrix D
    m, n = D.shape
    if m != n:
        raise Exception('matrix not square')
    if sum(D.columns.values != D.index.values):
        raise Exception('rows and columns do not match')
    if k > n:
        raise Exception('too many medoids')

    # Some distance matrices will not have a 0 diagonal, so work on a copy with one
    values = D.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(values, 0)
    Dtemp = pd.DataFrame(values, index=D.index, columns=D.columns)

    # Randomly initialize a list of k medoid labels
    M = list(Dtemp.sample(k).index.values)

    # Initialize a dictionary to represent clusters
    Cnew = {}
    for t in range(tmax):
        # Determine mapping to clusters: nearest medoid for every level
        J = Dtemp.loc[M].idxmin(axis='index')
        # Fill dictionary with cluster members
        C = {kappa: J[J == kappa].index.values for kappa in J.unique()}
        # Update cluster medoids: member with the smallest mean distance to its cluster
        Cnew = {Dtemp.loc[C[kappa], C[kappa]].mean().idxmin(): C[kappa] for kappa in C.keys()}
        # Update medoid list
        M = list(Cnew.keys())
        # Check for convergence (i.e. same clusters)
        if set(C.keys()) == set(Cnew.keys()):
            if not sum(set(C[kappa]) != set(Cnew[kappa]) for kappa in C.keys()):
                break
    else:
        # Only reached when the loop ran tmax times without hitting break
        print('did not converge')

    # Calculate silhouette
    S = {}
    for kappa_same in Cnew.keys():
        a = Dtemp.loc[Cnew[kappa_same], Cnew[kappa_same]].mean().mean()
        b = np.min([Dtemp.loc[Cnew[kappa_other], Cnew[kappa_same]].mean().mean() for kappa_other in Cnew.keys() if kappa_other != kappa_same])
        S[kappa_same] = (b - a) / max(a, b)

    # Return results
    return M, Cnew, S
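The per-cluster silhouette at the end of kMedoids() can be checked by hand on a small example. The sketch below restates the same formula in plain NumPy; the helper name `cluster_silhouettes`, the toy distance matrix, and the cluster names are made up for illustration. It assumes at least two clusters, since b is undefined otherwise.

```python
import numpy as np

def cluster_silhouettes(dist, clusters):
    """Per-cluster silhouette: dist is a square distance array with a zero
    diagonal, clusters maps a cluster name to a list of row positions."""
    scores = {}
    for name, members in clusters.items():
        # a: mean distance inside the cluster (zero diagonal included)
        a = dist[np.ix_(members, members)].mean()
        # b: smallest mean distance from any other cluster to this one
        b = min(
            dist[np.ix_(other, members)].mean()
            for other_name, other in clusters.items()
            if other_name != name
        )
        scores[name] = (b - a) / max(a, b)
    return scores

# Toy distance matrix with two obvious groups: rows 0-1 and rows 2-3
dist = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])
scores = cluster_silhouettes(dist, {"left": [0, 1], "right": [2, 3]})
print(scores)
```

Values close to 1 mean a cluster is tight compared with its distance to the nearest other cluster; values near 0 or below suggest the levels are not well separated.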
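Notes:
- There are links to the theory documents in the code.
- I used all records, not strictly the out-of-bag (OOB) records. Follow-up here
- The prox_matrix() method is very slow. I have done a few things to speed it up, but most of the cost comes from the double loop over records. Updates welcome.
- The diagonal of the proximity matrix does not have to be zero. I force it to zero in the kMedoids method so that the clustering converges.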
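On the speed note: one possible way to drop the double loop over records is to count, per tree, how many records of each label land in each leaf and multiply that count table by its own transpose. This is only a sketch, not a tested replacement. The helper name `group_distance_from_leaves` is made up, and it takes the `leaves` array (the result of `clf.apply(...)`) and the `labels` array as inputs instead of fitting the forest itself. The diagonal correction is there to reproduce the double loop's convention of visiting each pair of records once (i >= j).

```python
import numpy as np

def group_distance_from_leaves(leaves, labels):
    """Label-by-label distance (1 - proximity) from leaf assignments,
    looping over trees instead of over all pairs of records."""
    leaves = np.asarray(leaves)
    labels = np.asarray(labels)
    n_trees = leaves.shape[1]
    levels, group_idx = np.unique(labels, return_inverse=True)
    n_groups = len(levels)
    sizes = np.bincount(group_idx, minlength=n_groups)

    matches = np.zeros((n_groups, n_groups))
    for t in range(n_trees):
        # counts[g, l] = number of records with label g in leaf l of tree t
        _, leaf_idx = np.unique(leaves[:, t], return_inverse=True)
        counts = np.zeros((n_groups, leaf_idx.max() + 1))
        np.add.at(counts, (group_idx, leaf_idx), 1)
        # Sums leaf matches over all ordered pairs of records
        matches += counts @ counts.T

    # Within a label the double loop counts each pair once plus the
    # self-matches, so halve (ordered total + self-matches) on the diagonal
    diag = np.arange(n_groups)
    matches[diag, diag] = (matches[diag, diag] + n_trees * sizes) / 2.0

    return levels, 1.0 - matches / (n_trees * np.outer(sizes, sizes))

# Small made-up example: 6 records, 2 trees, 2 labels
leaves = np.array([[1, 4], [1, 5], [2, 4], [2, 5], [1, 4], [3, 6]])
labels = np.array(["a", "a", "b", "b", "a", "b"])
levels, dist = group_distance_from_leaves(leaves, labels)
print(levels)
print(dist)
```

The cost is now driven by the number of trees, labels, and leaves rather than by the square of the number of records. The diagonal is still generally nonzero, as with the double loop.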