Data mining - Sampling with replacement, with specified probabilities - 吾爱随笔录

I am trying to do sampling with replacement in Scala/Spark, with a probability defined for each class.

This is how I do it in R.

# Vector to sample from
x <- c("User1","User2","User3","User4","User5")

# Occurrences from which to obtain sampling probabilities
y <- c(2,4,4,3,2)

# Calculate sampling probabilities
p <- y / sum(y)

# Draw sample with replacement of size 10
s <- sample(x, 10, replace = TRUE, prob = p)

# Which yields (for example):
[1] "User5" "User1" "User1" "User5" "User2" "User4" "User4" "User2" "User1" "User3"

How can I do the same thing in Scala / Spark?

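import scala.reflect.ClassTag
import scala.util.Random

// Inverse-CDF sampling: draw r in [0, 1) and pick the first bucket whose cumulative probability exceeds r.
// The ClassTag context bound is needed because Array.fill must build an Array[T].
def weightedSampleWithReplacement[T: ClassTag](data: Array[T], weights: Array[Double], n: Int, random: Random): Array[T] = {
  val cumWeights = weights.scanLeft(0.0)(_ + _)        // 0, w1, w1 + w2, ...
  val cumProbs = cumWeights.map(_ / cumWeights.last)   // normalized so the last entry is 1.0
  Array.fill(n) {
    val r = random.nextDouble()
    data(cumProbs.indexWhere(r < _) - 1)               // -1 offsets the leading 0.0
  }
}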
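To tie this back to the R example, here is a minimal sketch that runs the same inverse-CDF idea on the data from the question. The object name `WeightedSamplingDemo`, the method name `weightedSample`, the two `require` checks, and the fixed seed are my own assumptions for illustration and are not part of the answer above. The function is restated inside the object only so the snippet runs on its own. It samples from a local array on the driver and does not touch any Spark API.

```scala
import scala.reflect.ClassTag
import scala.util.Random

object WeightedSamplingDemo {
  // Hypothetical restatement of the function above, with basic input checks added.
  def weightedSample[T: ClassTag](data: Array[T], weights: Array[Double], n: Int, random: Random): Array[T] = {
    require(data.length == weights.length, "data and weights must have the same length")
    require(weights.forall(_ >= 0.0) && weights.sum > 0.0, "weights must be non-negative with a positive sum")
    val cumWeights = weights.scanLeft(0.0)(_ + _)
    val cumProbs = cumWeights.map(_ / cumWeights.last)
    Array.fill(n) {
      val r = random.nextDouble()
      data(cumProbs.indexWhere(r < _) - 1)
    }
  }

  // Same values as x and y in the R snippet; raw counts work because the function normalizes them.
  val users: Array[String] = Array("User1", "User2", "User3", "User4", "User5")
  val counts: Array[Double] = Array(2.0, 4.0, 4.0, 3.0, 2.0)

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L) // fixed seed (an assumption) only to make runs repeatable

    // Equivalent of sample(x, 10, replace = TRUE, prob = p)
    val s = weightedSample(users, counts, 10, rng)
    println(s.mkString(" "))

    // Sanity check: over many draws the observed shares should approach counts / counts.sum.
    val big = weightedSample(users, counts, 100000, rng)
    users.zip(counts).foreach { case (u, c) =>
      val observed = big.count(_ == u).toDouble / big.length
      println(f"$u expected ${c / counts.sum}%.3f observed $observed%.3f")
    }
  }
}
```

Note that `indexWhere` is a linear scan, so each draw costs time proportional to the number of classes. That is fine for a handful of users; for many classes a binary search over the cumulative probabilities would probably be the better choice.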