Spark MLlib: multiclass logistic regression, how to get the probabilities of all classes rather than only the top one?

Tags: data-mining, classification, apache-spark, multiclass-classification

2021-10-01 15:25:33

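I am using LogisticRegressionWithLBFGS to train a multiclass classifier.

When I test the model on new, unseen samples, is there a way to get the probabilities of all classes (not only the top candidate class)?

P.S. I do not necessarily have to use the LBFGS classifier, but I do want to use logistic regression for my problem. So if there is a solution that uses another type of LR classifier, I will go with it.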

2 Answers

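I was working with a Random Forest classifier, whose predictions include a probability column. In PySpark, if you inspect the result of predictions = model.transform(testData), you get the probability of each label. See the code below and the output of print(predictions):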

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# `data` is assumed to be an existing DataFrame with a "label" column and a
# "features" vector column; it is not defined in this snippet.

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a Random Forest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=12, maxDepth=10)

# Chain RF in a Pipeline
pipeline = Pipeline(stages=[rf])

# Train model.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

This is where your part begins. Try printing the predictions DataFrame and its values:

print(predictions)

Output:

DataFrame[label: double, features: vector, indexed: double, rawPrediction: vector, probability: vector, prediction: double]

So the DataFrame has a probability column that holds the probability of each indexed label. I also checked it with:

predictions.show(3)

Output:

+-----+--------------------+-------+--------------------+--------------------+----------+
|label|            features|indexed|       rawPrediction|         probability|prediction|
+-----+--------------------+-------+--------------------+--------------------+----------+
|  5.0|(2000,[141,260,50...|    0.0|[34.8672584923246...|[0.69734516984649...|       0.0|
|  5.0|(2000,[109,126,18...|    0.0|[34.6231572522266...|[0.69246314504453...|       0.0|
|  5.0|(2000,[185,306,34...|    0.0|[34.5016453103805...|[0.69003290620761...|       0.0|
+-----+--------------------+-------+--------------------+--------------------+----------+
only showing top 3 rows

For the probability column only:

print(predictions.select('probability').take(2))

Output:

[Row(probability=DenseVector([0.6973, 0.1889, 0.0532, 0.0448, 0.0157])), Row(probability=DenseVector([0.6925, 0.1825, 0.0579, 0.0497, 0.0174]))]

In my case there are 5 indexed labels, so the probability vector has length 5. I hope this helps you get the probability of each label in your problem.
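To read such a vector per label, position i in the probability vector can be paired with the label whose index is i. Below is a minimal sketch in plain Python. The helper name label_probabilities and the label names are hypothetical; the numbers are the rounded values from the first row of the output above.

```python
def label_probabilities(labels, probability):
    """Pair each label with the probability stored at the same index."""
    probs = [float(p) for p in probability]
    if len(labels) != len(probs):
        raise ValueError("expected exactly one probability per label")
    return dict(zip(labels, probs))


# Hypothetical label names, ordered by label index (0.0, 1.0, ...).
labels = ["a", "b", "c", "d", "e"]
# Rounded values copied from the first row of the output above.
first_row = [0.6973, 0.1889, 0.0532, 0.0448, 0.0157]

by_label = label_probabilities(labels, first_row)
# The label with the highest probability is the predicted one.
top_label = max(by_label, key=by_label.get)
```

With a real row you would probably pass row.probability.toArray() as the second argument. If the labels were produced by a StringIndexer, the fitted model's labels attribute should give the names in index order; that is an assumption about your pipeline, not something shown in the answer.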

P.S. You can probably get probabilities from decision trees and logistic regression as well. Just try inspecting the output of model.transform(testData).

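To get the probabilities of all classes rather than only the predicted class, there is so far (as of Spark 2.0) no explicit method in Spark MLlib or ML. You can, however, extend the Logistic Regression class from the MLlib source code to obtain these probabilities.

An example code snippet can be found in this answer.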
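As a rough illustration of what such an extension would compute, here is a minimal sketch in plain Python. It is not the code from the linked answer. It assumes the layout that the multinomial mllib LogisticRegressionModel appears to use: the weights form a flattened (numClasses - 1) x rowSize matrix, class 0 is the pivot with its margin fixed at 0, and when an intercept is present it is the last entry of each row. The function name class_probabilities and the example numbers are hypothetical.

```python
import math


def class_probabilities(weights, num_classes, features):
    """Return P(class = k | features) for every k in 0..num_classes-1."""
    rows = num_classes - 1
    if rows < 1 or len(weights) % rows != 0:
        raise ValueError("weights do not match num_classes")
    row_size = len(weights) // rows
    n = len(features)
    if row_size not in (n, n + 1):
        raise ValueError("weights do not match the feature size")
    has_intercept = row_size == n + 1

    # Class 0 is the pivot class: its margin is fixed at 0.
    margins = [0.0]
    for i in range(rows):
        row = weights[i * row_size:(i + 1) * row_size]
        margin = sum(w * x for w, x in zip(row, features))
        if has_intercept:
            margin += row[n]  # assumed: intercept is the last entry of the row
        margins.append(margin)

    # Softmax over the margins, shifted by the max for numerical stability.
    top = max(margins)
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-class model with 2 features and one intercept per row.
weights = [1.0, 0.0, 0.0,   # class 1: w1, w2, intercept
           0.0, 1.0, 0.0]   # class 2: w1, w2, intercept
probs = class_probabilities(weights, 3, [0.0, 0.0])
shifted = class_probabilities(weights, 3, [2.0, 0.0])
```

With a trained model you would likely pass list(model.weights.toArray()) and model.numClasses. Check the resulting argmax against model.predict on a few samples before relying on the numbers, since the weight layout is an assumption here.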
