为了使您的估计器更加稳健,您可以将您的评级建模为高斯混合模型 (GMM),它是两个高斯 rv 的混合:1) 真实评级,2) 等于 1 的垃圾评级。Scikit-learn 已经有一个罐装 GMM 分类器:http ://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_classifier.html#example-mixture-plot-gmm-classifier-py
再深入一点,一个简单的方法是让 scikit-learn 将您的评级划分为两个高斯。如果其中一个分区的平均值接近 1,那么我们可以丢弃这些评级。或者,更优雅地,我们可以将另一个非近一高斯的均值作为真实评级均值。
这是执行此操作的 ipython 笔记本的一些代码:
from sklearn.mixture import GMM
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import collections
def make_ratings(mean,std,rating_cnt):
rating_sample = np.random.randn(rating_cnt)*std + mean
return np.clip(rating_sample,1,5).astype(int)
def make_collection(true_mean,true_std,true_cnt,junk_count):
true_ratings = make_ratings(true_mean,true_std,true_cnt)
junk_ratings = make_ratings(1,0,junk_count)
return np.hstack([true_ratings,junk_ratings])[:,np.newaxis]
def robust_mean(X, th = 2.5, agg_th=2.5, default_agg=np.mean):
classifier = GMM(n_components=2)
classifier.fit(X)
if np.min(classifier.means_) > th or default_agg(X)<agg_th:
return default_agg(X)
else:
return np.max(classifier.means_)
r_mean = 4.2
X = make_collection(r_mean,2,40,10)
plt.hist(X,5)
classifier = GMM(n_components=2)
classifier.fit(X)
plt.show()
print "vars =",classifier.covars_.flatten()
print "means = ",classifier.means_.flatten()
print "mean = ",np.mean(X)
print "median = ",np.median(X)
print "robust mean = ", robust_mean(X)
print "true mean = ", r_mean
print "prob(rating=1|class) = ",classifier.predict_proba(1).flatten()
print "prob(rating=true_mean|class) = ",classifier.predict_proba(r_mean).flatten()
print "prediction: ", classifier.predict(X)
一次运行的输出如下所示:
vars = [ 0.22386589 0.56931527]
means = [ 1.32310978 4.00603523]
mean = 2.9
median = 3.0
robust mean = 4.00603523034
true mean = 4.2
prob(rating=1|class) = [ 9.99596493e-01 4.03507425e-04]
prob(rating=true_mean|class) = [ 1.08366762e-08 9.99999989e-01]
prediction: [1 0 1 0 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0
1 1 1 0 0 0 0 0 0 0 0 0 0]
我们可以通过一些蒙特卡罗试验来模拟这将如何工作:
true_means = np.arange(1.5,4.5,.2)
true_ratings = 40
junk_ratings = 10
true_std = 1
m_out = []
m_in = []
m_reg = []
runs = 40
for m in true_means:
Xs = [make_collection(m,true_std,true_ratings,junk_ratings) for x in range(runs)]
m_in.append([[m]*runs])
m_out.append([[robust_mean(X, th = 2.5, agg_th=2,default_agg=np.mean) for X in Xs]])
m_reg.append([[np.mean(X) for X in Xs]])
m_in = np.array(m_in).T[:,0,:]
m_out = np.array(m_out).T[:,0,:]
m_reg = np.array(m_reg).T[:,0,:]
plt.plot(m_in,m_out,'b.',alpha=.25)
plt.plot(m_in,m_reg,'r.',alpha=.25)
plt.plot(np.arange(0,5,.1),np.arange(0,5,.1),'k.')
plt.xlim([0,5])
plt.ylim([0,5])
plt.xlabel('true mean')
plt.ylabel('predicted mean')
plt.title("true_ratings=" + str(true_ratings)
+ "; junk_ratings=" + str(junk_ratings)
+ "; std="+str(true_std))
输出粘贴在下面。红色是平均评级,蓝色是建议评级。您可以调整参数以获得略有不同的行为。