Data mining - What is the difference between featuresCol, labelCol, predictionCol, and probabilityCol in PySpark?

I am trying to train a random forest classifier (pyspark.ml.classification.RandomForestClassifier) on a large dataset (~70 GB). However, I am not sure what to pass to each of featuresCol, labelCol, predictionCol, and probabilityCol.

From the documentation, this is what I gather:

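featuresCol is the list of features in your data frame
labelCol is the target feature
predictionCol is also the target feature, but as generated by the model (not sure). Do I need to set it before training?
probabilityCol is the probability of each class, as a vector. Is this similar to sklearn's class_weight? That is, does the model take low class diversity into account? If so, how?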
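
To make the class_weight comparison concrete: sklearn's class_weight is an input that changes how training treats each class, whereas probabilityCol only names a column that the fitted model writes. The plain-Python toy below sketches how such a probability vector and the resulting prediction could be derived for a single row. It assumes the forest sums each tree's class distribution and normalizes the result; the per-tree numbers are hypothetical, and this is an illustration, not Spark's actual code.

```python
# Hypothetical per-tree outputs for ONE row: each tree gives a class distribution.
tree_outputs = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
]

num_classes = len(tree_outputs[0])

# Sum the per-tree distributions (assumed to play the role of the raw prediction).
raw = [sum(tree[c] for tree in tree_outputs) for c in range(num_classes)]

# Normalize so the entries sum to 1 (the kind of vector stored in probabilityCol).
total = sum(raw)
probability = [r / total for r in raw]

# The predicted label is the index of the largest probability (stored in predictionCol).
prediction = float(max(range(num_classes), key=lambda c: probability[c]))

print(probability, prediction)
```

Nothing in this calculation reweights classes, which is why the column by itself does not address class imbalance.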

Also, is there an option I can set to get an OOB score?

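from pyspark.ml.classification import RandomForestClassifier

clf = RandomForestClassifier(featuresCol=feature_cols, labelCol=target_col, numTrees=300, maxDepth=15, impurity='gini', maxMemoryInMB=2**10)
clf_t = clf.fit(train)

y_train_pred = clf_t.transform(train)  # predictions on the training set
y_test_pred = clf_t.transform(test)    # predictions on the held-out test set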

Here is the link to the documentation: https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#module-pyspark.ml.classification