SGDClassifier: online learning / partial_fit with previously unknown labels

data-mining, multiclass-classification, online-learning

2021-09-29 06:30:03

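My training set contains about 50k entries, which I use for initial training. Every week about 5k entries are added, but the same number "disappear" (it is user data, which has to be deleted after some time).

Therefore I use online learning, because I no longer have access to the full dataset later on. Currently I'm using an SGDClassifier, which works, but here is my big problem: new categories are appearing, and now I can't use my model any more because they were not in the initial fit.

Is there a way to handle this with SGDClassifier or some other model? Deep learning?

It doesn't matter if I have to start from scratch now (i.e. use something other than SGDClassifier), but I need something that enables online learning with new labels.

4 Answers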

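It sounds like you don't want to retrain the model from scratch every time a new label category shows up. The easiest way to retain the most information from past data is to train one classifier per category.

This way you can keep training each classifier incrementally ("online") with something like SGDClassifier, without having to retrain them. Whenever a new category appears, you add a new binary classifier for just that category. You then select the class with the highest probability/score among the set of classifiers.

This is also not very different from what you are doing today, since scikit-learn's SGDClassifier already handles the multiclass scenario by fitting multiple "One vs All" classifiers under the hood.

Of course, if lots of new categories keep appearing, this approach may become a bit unwieldy.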
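The per-category idea above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `GrowingOneVsRest`, the toy cluster centers, and the batch sizes are all made up here and are not part of scikit-learn or of the original answer. Note also that a freshly added classifier has seen fewer batches than the older ones, so its raw scores are not necessarily comparable to theirs.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


class GrowingOneVsRest:
    """Hypothetical helper: one binary SGDClassifier per label."""

    def __init__(self, random_state=0):
        self.random_state = random_state
        self.models = {}  # label -> binary SGDClassifier

    def partial_fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Create a classifier for every label that has not been seen before.
        for label in np.unique(y).tolist():
            if label not in self.models:
                self.models[label] = SGDClassifier(random_state=self.random_state)
        # Update every classifier on "this label (1) vs. everything else (0)".
        for label, clf in self.models.items():
            clf.partial_fit(X, (y == label).astype(int), classes=np.array([0, 1]))
        return self

    def predict(self, X):
        labels = list(self.models)
        # One column of decision scores per known label; pick the highest.
        scores = np.column_stack(
            [self.models[label].decision_function(X) for label in labels]
        )
        return np.asarray(labels)[scores.argmax(axis=1)]


# Toy stream: two categories at first, then a third one shows up.
rng = np.random.default_rng(0)
centers = {"a": (-4, 0), "b": (0, 4), "c": (4, 0)}


def sample(labels, n=200):
    X = np.vstack([rng.normal(centers[l], 1.0, size=(n, 2)) for l in labels])
    y = np.repeat(labels, n)
    return X, y


ovr = GrowingOneVsRest()
for _ in range(20):
    ovr.partial_fit(*sample(["a", "b"]))
for _ in range(20):
    ovr.partial_fit(*sample(["a", "b", "c"]))  # "c" gets its own classifier here

X_test, y_test = sample(["a", "b", "c"])
accuracy = (ovr.predict(X_test) == y_test).mean()
print("known labels:", sorted(ovr.models), "accuracy:", accuracy)
```

The existing classifiers are never re-created; they simply start receiving samples of the new category as negatives once it appears.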

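If new categories arrive rarely, I myself prefer the "one vs all" solution provided by @oW_. For each new category, you train a new model on X samples from the new category (class 1) and X samples from the rest of the categories (class 0).

However, if new categories arrive frequently and you want to use a single shared model, you can accomplish this with neural networks.

In short, when a new category arrives, we add a corresponding new node to the softmax layer with zero (or random) weights and keep the old weights intact, then we train the extended model on the new data.

[Figure: a hand-drawn sketch of this idea by the answer's author]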

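Here is an implementation of the complete scenario:

1. The model is trained on two categories,
2. a new category arrives,
3. the model and the target format are updated accordingly,
4. the model is trained on the new data.

Code: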

from keras import Model
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from sklearn.metrics import f1_score
import numpy as np


# Add a new node to the last place in Softmax layer
def add_category(model, pre_soft_layer, soft_layer, new_layer_name, random_seed=None):
    weights = model.get_layer(soft_layer).get_weights()
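    # weights is [kernel, bias]; the number of categories is the bias length,
    # not len(weights) (which is always 2)
    category_count = weights[1].shape[0]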
    # set 0 weight and negative bias for new category
    # to let softmax output a low value for new category before any training
    # kernel (old + new)
    weights[0] = np.concatenate((weights[0], np.zeros((weights[0].shape[0], 1))), axis=1)
    # bias (old + new)
    weights[1] = np.concatenate((weights[1], [-1]), axis=0)
    # New softmax layer
    softmax_input = model.get_layer(pre_soft_layer).output
    softmax = Dense(category_count + 1, activation='softmax', name=new_layer_name)(softmax_input)
    model = Model(inputs=model.input, outputs=softmax)
    # Set the weights for the new softmax layer
    model.get_layer(new_layer_name).set_weights(weights)
    return model


# Generate data for the given category sizes and centers
def generate_data(sizes, centers, label_noise=0.01):
    Xs = []
    Ys = []
    category_count = len(sizes)
    indices = range(0, category_count)
    for category_index, size, center in zip(indices, sizes, centers):
        X = np.random.multivariate_normal(center, np.identity(len(center)), size)
        # Smooth [1.0, 0.0, 0.0] to [0.99, 0.005, 0.005]
        y = np.full((size, category_count), fill_value=label_noise/(category_count - 1))
        y[:, category_index] = 1 - label_noise
        Xs.append(X)
        Ys.append(y)
    Xs = np.vstack(Xs)
    Ys = np.vstack(Ys)
    # shuffle data points
    p = np.random.permutation(len(Xs))
    Xs = Xs[p]
    Ys = Ys[p]
    return Xs, Ys


def f1(model, X, y):
    y_true = y.argmax(1)
    y_pred = model.predict(X).argmax(1)
    return f1_score(y_true, y_pred, average='micro')


seed = 12345
verbose = 0
np.random.seed(seed)

model = Sequential()
model.add(Dense(5, input_shape=(2,), activation='tanh', name='pre_soft_layer'))
model.add(Dense(2, activation='softmax', name='soft_layer'))
model.compile(loss='categorical_crossentropy', optimizer=Adam())

# In 2D feature space,
# first category is clustered around (-2, 0),
# second category around (0, 2), and third category around (2, 0)
X, y = generate_data([1000, 1000], [[-2, 0], [0, 2]])
print('y shape:', y.shape)

# Train the model
model.fit(X, y, epochs=10, verbose=verbose)

# Test the model
X_test, y_test = generate_data([200, 200], [[-2, 0], [0, 2]])
print('model f1 on 2 categories:', f1(model, X_test, y_test))

# New (third) category arrives
X, y = generate_data([1000, 1000, 1000], [[-2, 0], [0, 2], [2, 0]])
print('y shape:', y.shape)

# Extend the softmax layer to accommodate the new category
model = add_category(model, 'pre_soft_layer', 'soft_layer', new_layer_name='soft_layer2')
model.compile(loss='categorical_crossentropy', optimizer=Adam())

# Test the extended model before training
X_test, y_test = generate_data([200, 200, 0], [[-2, 0], [0, 2], [2, 0]])
print('extended model f1 on 2 categories before training:', f1(model, X_test, y_test))

# Train the extended model
model.fit(X, y, epochs=10, verbose=verbose)

# Test the extended model on old and new categories separately
X_old, y_old = generate_data([200, 200, 0], [[-2, 0], [0, 2], [2, 0]])
X_new, y_new = generate_data([0, 0, 200], [[-2, 0], [0, 2], [2, 0]])
print('extended model f1 on two (old) categories:', f1(model, X_old, y_old))
print('extended model f1 on new category:', f1(model, X_new, y_new))

Output:

y shape: (2000, 2)
model f1 on 2 categories: 0.9275
y shape: (3000, 3)
extended model f1 on 2 categories before training: 0.8925
extended model f1 on two (old) categories: 0.88
extended model f1 on new category: 0.91

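There are two points about this output that I should explain:

1. Merely adding a new node drops the model performance from 0.9275 to 0.8925. This is because the output of the new node is also included in the category selection. In practice, the output of a new node should be included only after the model has been trained on a sizeable sample. For example, at this stage we should pick the largest of the first two entries of [0.15, 0.30, 0.55], i.e. the second category.
2. The performance of the extended model on the two (old) categories, 0.88, is lower than that of the old model, 0.9275. This is normal, because the extended model now wants to assign an input to one of three categories instead of two. The same decrease is expected in the "one vs all" approach when we select among three binary classifiers rather than two.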
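The first point can be illustrated with a few lines of NumPy. This is a minimal sketch; the helper name `predict_trained_only` is hypothetical and simply restricts the argmax to the outputs of nodes that have already been trained.

```python
import numpy as np


def predict_trained_only(probs, n_trained):
    """Argmax restricted to the first n_trained softmax outputs."""
    probs = np.asarray(probs)
    return probs[..., :n_trained].argmax(axis=-1)


probs = np.array([0.15, 0.30, 0.55])
masked = predict_trained_only(probs, n_trained=2)
unmasked = probs.argmax()
print(masked)    # 1, i.e. the second category
print(unmasked)  # 2, the still-untrained third node would win otherwise
```

The same function works on a whole batch of softmax outputs with shape (n_samples, n_categories).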

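I have to say that I have not found any literature on this topic. As far as I know, what you are asking for is impossible. You should be aware of this, and so should the product owner. The reason is that any loss function relies on known labels, so there is no way to predict a label that is not in the training data. A machine learning algorithm that predicts something it has not been trained on is science fiction.

Having said that, I think there can be a workaround (let me point out that this is an opinion not based on formal literature). If the classifier is probabilistic, the output is the probability of each class being true, and the decision is the class with the highest probability. Maybe you can set a threshold on that probability, so that the model predicts "unknown" if all probabilities fall below the threshold. Let me give you an example.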

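Let $M(x)$ be a model that, given an input $x$, decides which of three categories $c_1, c_2, c_3$ it belongs to. The output of $M$ is a vector of probabilities $p$, and the decision is made by taking the highest entry of $p$. So an output $M(x) = p(x) = (0.2, 0.76, 0.04)$ corresponds to the decision that $x$ belongs to $c_2$. You can modify this decision by setting a threshold $\tau$: if no $p_i \geq \tau$, then the decision is that $x$ belongs to the unknown class.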
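A minimal sketch of this thresholded decision rule, assuming the model already produces a probability vector; the function name `decide`, the label strings, and the value of `tau` are chosen here only for illustration:

```python
import numpy as np


def decide(p, classes, tau):
    """Return the most probable class, or "unknown" if no p_i reaches tau."""
    p = np.asarray(p)
    best = int(p.argmax())
    return classes[best] if p[best] >= tau else "unknown"


classes = ["c1", "c2", "c3"]
confident = decide([0.10, 0.80, 0.10], classes, tau=0.6)
uncertain = decide([0.40, 0.35, 0.25], classes, tau=0.6)
print(confident)  # c2
print(uncertain)  # unknown
```

A suitable value for `tau` would have to be tuned on held-out data; too high a threshold rejects many valid predictions, too low a threshold never flags anything as unknown.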

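What you do with those unknowns depends on the business logic. If they are important, you can collect them in a pool and retrain the model with the available data. I think you could do some kind of "transfer learning" from the trained model by changing the dimension of the output. But this is something I have not faced myself, so I am only mentioning it.

Take into account that SGDClassifier uses a linear SVM underneath by default (loss='hinge'), which is not a probabilistic algorithm. According to the SGDClassifier documentation, you can set the loss parameter to modified_huber or log (named log_loss in scikit-learn 1.1 and later) to get probabilistic outputs.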
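As a rough sketch of what this looks like in practice, the snippet below trains an SGDClassifier with loss="modified_huber" via partial_fit and reads per-class probabilities from predict_proba. The toy data (three Gaussian clusters) is an assumption made for this example only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: three well-separated 2D clusters (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(c, 1.0, size=(100, 2)) for c in [(-4, 0), (0, 4), (4, 0)]]
)
y = np.repeat([0, 1, 2], 100)

# The default loss="hinge" has no predict_proba; modified_huber does.
clf = SGDClassifier(loss="modified_huber", random_state=0)
clf.partial_fit(X, y, classes=np.array([0, 1, 2]))

proba = clf.predict_proba(X[:5])
print(proba.shape)        # (5, 3): one column per class
print(proba.sum(axis=1))  # each row sums to 1
```

These are the probabilities to which a threshold for the "unknown" decision could be applied.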

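There are two options:

1. Predict the chance that a data point belongs to an unknown, or unk, category. Any new category that appears in the stream should be predicted as unk. This is common in Natural Language Processing (NLP), because new word tokens always keep appearing in word streams.
2. Retrain the model every time a new category appears.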
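One possible reading of the first option, sketched under assumptions: reserve an "unk" label when the classes are first declared to partial_fit, and map every label outside the known set to it before each update. The names `UNK`, `known`, and `to_known`, and the toy data, are hypothetical and not taken from the answer.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

UNK = "unk"          # hypothetical placeholder label
known = {"a", "b"}   # categories fixed at the initial fit


def to_known(labels):
    """Replace every label outside the known set with UNK."""
    return np.array([l if l in known else UNK for l in labels])


mapped = to_known(["a", "c", "b", "d"])
print(mapped)  # ['a' 'unk' 'b' 'unk']

# All classes, including UNK, must be declared on the first partial_fit call.
classes = np.array(sorted(known | {UNK}))
rng = np.random.default_rng(0)
centers = {"a": (-4, 0), "b": (0, 4), "c": (4, 0)}

clf = SGDClassifier(random_state=0)
for batch_labels in (["a", "b"], ["a", "b", "c"]):  # "c" is new in batch 2
    X = np.vstack([rng.normal(centers[l], 1.0, size=(50, 2)) for l in batch_labels])
    y = np.repeat(batch_labels, 50)
    clf.partial_fit(X, to_known(y), classes=classes)

predictions = clf.predict(X)
```

With this scheme the model never learns to distinguish the new categories from each other; it only learns that they are not one of the known ones.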

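Since you mention SGDClassifier, I assume you use scikit-learn. Scikit-learn does not support online learning very well. It would be better to switch to a framework that has better support for streaming and online learning, such as Spark.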
