Data mining: How do I train a continuous/soft classification model?

The classic classification problem amounts to finding a function $F:\mathbb{R}^n\to\{0,1\}$. The label set would look like [Apple, Banana, Banana, ..., Apple].

What if I want to train a function $F:\mathbb{R}^n\to[0,1]$ instead? My samples might look like "this sample has an 80% chance of being an apple and a 20% chance of being a banana".

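A multi-output neural network seems to work, since we can pair a softmax output with a cross-entropy loss, and cross-entropy accepts probability vectors as targets. What about random forests or other algorithms? I tried some of the common algorithms in scikit-learn, but had no luck.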

For example, this code:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_FEATURES = 10
N_SAMPLES = 1000
N_CLASSES = 2

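# Random features, plus random soft labels normalized so each row sums to 1.
train_x = np.random.rand(N_SAMPLES, N_FEATURES)
train_y = np.random.rand(N_SAMPLES, N_CLASSES)
train_y = train_y / train_y.sum(axis=1, keepdims=True)

# Fails here: the classifier rejects continuous (non-discrete) targets.
model = RandomForestClassifier(n_estimators=10).fit(train_x, train_y)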

raises ValueError: Unknown label type: 'continuous-multioutput'.
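One possible workaround, offered only as a sketch and not as an established answer to the question: scikit-learn classifiers expect discrete labels, but the soft labels can be treated as a multi-output regression target and passed to RandomForestRegressor. Note that this swaps the technique: the trees split on squared error, not cross-entropy. The seeded generator, the random_state value, and the final clip-and-renormalize step are assumptions added for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

N_FEATURES = 10
N_SAMPLES = 1000
N_CLASSES = 2

# Same synthetic setup as the question, with a fixed seed (an assumption).
rng = np.random.default_rng(0)
train_x = rng.random((N_SAMPLES, N_FEATURES))
train_y = rng.random((N_SAMPLES, N_CLASSES))
train_y /= train_y.sum(axis=1, keepdims=True)  # each row sums to 1

# Assumption: fit the soft labels as a multi-output regression target.
# The forest minimizes squared error here, not cross-entropy.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(train_x, train_y)

pred = model.predict(train_x[:5])

# Defensive post-processing: keep outputs in [0, 1] and renormalize rows
# so that each prediction can be read as a probability vector.
pred = np.clip(pred, 0.0, 1.0)
pred /= pred.sum(axis=1, keepdims=True)
print(pred)
```

Each row of pred can then be read as estimated class probabilities. Whether squared-error splitting is acceptable depends on the application; if a classifier is required, another option to explore is repeating each sample once per class and passing the soft label as sample_weight to fit.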