I am unable to save a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model, Spark raises an error related to its maximum RPC message size:
scala> val mod = pipeline.fit(df)
mod: org.apache.spark.ml.PipelineModel = pipeline_936bcade4716
scala> mod.write.overwrite().save(modelPath.concat("model"))
18/01/08 10:00:32 WARN TaskSetManager: Stage 8 contains a task of very large size
(755610 KB). The maximum recommended task size is 100 KB.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task
2606:0 was 777523713 bytes, which exceeds max allowed: spark.rpc.message.maxSize
(134217728 bytes). Consider increasing spark.rpc.message.maxSize
or using broadcast variables for large values.
Assume the following about the problem:
- it is not possible to reduce the size of the model, and
- it is not possible to raise the maximum message size to the point where the whole pipeline fits in a single message (see the sketch after this list for what that change would look like).
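For concreteness, here is a minimal sketch of the configuration change ruled out by the second assumption (the app name is hypothetical; spark.rpc.message.maxSize is specified in MiB, and Spark rejects values above 2047, so it cannot be raised arbitrarily):

import org.apache.spark.sql.SparkSession

// Must be set when the SparkSession is created; it cannot be changed
// on a running session.
val spark = SparkSession.builder()
  .appName("large-pipeline-save")               // hypothetical app name
  .config("spark.rpc.message.maxSize", "1024")  // in MiB; values above 2047 are rejected
  .getOrCreate()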
Is there any way I can successfully save this pipeline to HDFS?