Data mining - PySpark matrix transformation

Suppose I have the following DataFrame in PySpark:

Customer    |  product  |   rating
customer1   |  product1 |   0.2343
customer1   |  product2 |   0.4440
customer2   |  product3 |   0.3123
customer3   |  product1 |   0.7430

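There can be many customer-product combinations, but each combination is unique. I would like to achieve the following result as efficiently as possible: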

Customer (Index) | product1 | product2 | product3
customer1        |  0.2343  |  0.4440  |  0.0000
customer2        |  0.0000  |  0.0000  |  0.3123
customer3        |  0.7430  |  0.0000  |  0.0000

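Every combination that does not appear in the first table should be set to zero. This has to be efficient because the output matrix is 59578 rows × 21521 columns, and I want to keep the computational cost as low as possible.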
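To put the size in perspective, here is a rough back-of-the-envelope sketch in plain Python, independent of Spark. It assumes 8-byte floats for the dense estimate. The names `customer_index`, `product_index`, `entries`, and `rating_at` are hypothetical helpers for illustration. The second half shows the coordinate-style idea of storing only the ratings that exist and treating everything else as zero.

```python
# Dense size of the target matrix, assuming one 8-byte float per cell.
n_rows, n_cols = 59578, 21521
dense_cells = n_rows * n_cols
dense_gib = dense_cells * 8 / 2**30
print(f"{dense_cells} cells, about {dense_gib:.2f} GiB if stored densely")

# The sample rows from the first table, as (customer, product, rating) triples.
triples = [
    ("customer1", "product1", 0.2343),
    ("customer1", "product2", 0.4440),
    ("customer2", "product3", 0.3123),
    ("customer3", "product1", 0.7430),
]

# Map each distinct customer and product to an integer row / column index.
customer_index = {c: i for i, c in enumerate(sorted({c for c, _, _ in triples}))}
product_index = {p: j for j, p in enumerate(sorted({p for _, p, _ in triples}))}

# Keep only the entries that exist, keyed by (row, column).
entries = {(customer_index[c], product_index[p]): r for c, p, r in triples}


def rating_at(customer: str, product: str) -> float:
    """Return the stored rating, or 0.0 for a combination that never occurred."""
    key = (customer_index[customer], product_index[product])
    return entries.get(key, 0.0)


print(rating_at("customer1", "product2"))  # a stored entry
print(rating_at("customer2", "product1"))  # an implicit zero
```

The point of the sketch is only that the zeros never need to be materialized. Whether a sparse layout is the right choice depends on what is done with the matrix afterwards, which the question does not say.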

Is there a solution for this? So far I have not found a good one online.

Thank you for your help.