I entered the Kaggle Iceberg Classifier Challenge, where the task is to classify whether an object appearing in a radar image is an iceberg or a ship. I'm currently trying to implement stochastic gradient descent to get a better feel for how mini-batch training works, since the dataset won't fit in memory once I apply data augmentation.
scikit-learn provides two estimators for logistic regression: sklearn.linear_model.LogisticRegression and sklearn.linear_model.SGDClassifier. After experimenting for a while, I've noticed that the SGD classifier never performs better than the logistic regression classifier, and there is some other behavior in the SGD results that makes me wonder whether I've implemented it incorrectly or whether it's inherently flawed (I'm fairly sure it's the former).
The figure below summarizes the performance of the SGDClassifier models against LogisticRegression on the iceberg dataset. The colored curves are SGDClassifier training runs with different mini-batch sizes, ranging from 50 up to 1604 (the total number of training samples in the dataset); each one is trained with SGDClassifier.partial_fit(). The horizontal lines show the baseline accuracy from always predicting 1, the logistic regression score, and the score of an SGD model fit to all 1604 training samples in a single pass.
I'm struggling to understand the following:
Why does LogisticRegression perform so much better, even though both should be fitting the same model to the same data?

Why does SGDClassifier's performance drop so sharply when it makes 50 passes over the data instead of a single pass? The drop would make sense to me if it were simply overfitting the dataset over repeated passes, though it would be good to know how to prevent that when training with mini-batches.
Why do the results vary so much, and non-monotonically, with mini-batch size even after 50 passes over the data? Since the data is shuffled between passes, I would expect any noise in the fit to be smoothed out after that many iterations.
Here is the code used to generate the results:
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load training data
df_train = pd.read_csv('train_data.csv', sep = ',', header = 0, index_col = 0)
# Set input names; 'band_i_j' is pixel j from input image i
img_size = 75*75
inputs = ['inc_angle'] + ['band_1_' + str(i) for i in range(img_size)] + ['band_2_' + str(i) for i in range(img_size)]
# Set output name
output = 'is_iceberg'
# Set na values to mean of column
inc_angle_mean = df_train['inc_angle'].mean()
df_train['inc_angle'].fillna(inc_angle_mean, inplace = True)
# Get num training samples
N_train = len(df_train)
# Calculate mean, max, min of each column; standardize inputs
train_means = df_train[inputs].mean()
train_maxes = df_train[inputs].max()
train_mins = df_train[inputs].min()
df_train[inputs] = (df_train[inputs] - train_means)/(train_maxes - train_mins)
# Set 80% of data to belong to training set; reserve the rest for validation set
train_indices = list(range(int(0.8*N_train)))
valid_indices = list(range(int(0.8*N_train), N_train))
# Base score
print('base score', len(df_train[df_train[output]==1])/N_train)
# SGD full-batch
model = sklearn.linear_model.SGDClassifier(loss = 'log', alpha = 1, tol = 0.00001, max_iter = 1000, shuffle = False, random_state = 0)
model.fit(df_train[inputs].iloc[train_indices], df_train[output].iloc[train_indices])
print('SGD full-batch', model.score(df_train[inputs].iloc[valid_indices], df_train[output].iloc[valid_indices]))
# Log reg
model = sklearn.linear_model.LogisticRegression(C = 1, tol = 0.00001, random_state = 0)
model.fit(df_train[inputs].iloc[train_indices], df_train[output].iloc[train_indices])
print('Log reg', model.score(df_train[inputs].iloc[valid_indices], df_train[output].iloc[valid_indices]))
# SGD mini-batch
num_passes = 50
batch_size = N_train
model = sklearn.linear_model.SGDClassifier(loss = 'log', alpha = 1, tol = 0.00001, shuffle = False, random_state = 0)
sgd_minibatch_scores = []
for i in range(num_passes):
    np.random.shuffle(train_indices)
    # Slicing past the end of a list just returns the shorter tail, so
    # ceil() gives the right number of batches, including a final partial
    # batch when batch_size does not evenly divide the training set
    for j in range(int(np.ceil(len(train_indices)/batch_size))):
        batch_train_indices = train_indices[j*batch_size:(j+1)*batch_size]
        model.partial_fit(df_train[inputs].iloc[batch_train_indices], df_train[output].iloc[batch_train_indices], classes = [0,1])
    # Record the validation score after each full pass over the data
    sgd_minibatch_scores.append(model.score(df_train[inputs].iloc[valid_indices], df_train[output].iloc[valid_indices]))