机器算法验证 - 如何在不重复排列的情况下在 R 中重新采样？ - 吾爱随笔录

如何在不重复排列的情况下在 R 中重新采样？

机器算法验证 r 采样组合学重采样

2022-03-06 06:10:23

在 R 中，如果我 set.seed()，然后使用示例函数随机化一个列表，我可以保证我不会生成相同的排列吗？

IE...

set.seed(25)
limit <- 3
myindex <- seq(0,limit)
for (x in seq(1,factorial(limit))) {
    permutations <- sample(myindex)
    print(permutations)
}

这产生

[1] 1 2 0 3
[1] 0 2 1 3
[1] 0 3 2 1
[1] 3 1 2 0
[1] 2 3 0 1
[1] 0 1 3 2

打印的所有排列都是唯一排列吗？还是有一些机会，根据它的实现方式，我可以得到一些重复？

我希望能够做到这一点而不会重复，保证。我该怎么做？

（我还想避免使用像 permn() 这样的函数，它有一种非常机械的方法来生成所有排列——它看起来并不随机。）

另外，旁注——如果我没记错的话，这个问题看起来是 O((n!)!)。

3个回答

这个问题有很多有效的解释。注释——尤其是表示需要 15 个或更多元素的排列（15！= 1307674368000 越来越大）——表明需要的是一个相对较小的随机样本，无需替换，所有 n！= n*(n-1) (n-2) ...*2*1 1:n 的排列。如果这是真的，那么存在（有些）有效的解决方案。

下面的函数rperm接受两个参数n（要采样的排列的大小）和m（要绘制的大小为 n 的排列的数量）。如果 m 接近或超过 n!，该函数将花费很长时间并返回许多 NA 值：它适用于当 n 相对较大（例如，8 或更大）且 m 远小于 n! 时。它的工作原理是缓存到目前为止找到的排列的字符串表示，然后（随机）生成新的排列，直到找到新的排列。它利用 R 的关联列表索引功能来快速搜索先前找到的排列列表。

rperm <- function(m, size=2) { # Obtain m unique permutations of 1:size

    # Function to obtain a new permutation.
    newperm <- function() {
        count <- 0                # Protects against infinite loops
        repeat {
            # Generate a permutation and check against previous ones.
            p <- sample(1:size)
            hash.p <- paste(p, collapse="")
            if (is.null(cache[[hash.p]])) break

            # Prepare to try again.
            count <- count+1
            if (count > 1000) {   # 1000 is arbitrary; adjust to taste
                p <- NA           # NA indicates a new permutation wasn't found
                hash.p <- ""
                break
            }
        }
        cache[[hash.p]] <<- TRUE  # Update the list of permutations found
        p                         # Return this (new) permutation
    }

    # Obtain m unique permutations.
    cache <- list()
    replicate(m, newperm())  
} # Returns a `size` by `m` matrix; each column is a permutation of 1:size.

的本质replicate是将排列作为列向量返回；例如，以下复制了原始问题中的一个示例，转置：

> set.seed(17)
> rperm(6, size=4)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    4    4    3    4
[2,]    3    4    1    3    1    2
[3,]    4    1    3    2    2    3
[4,]    2    3    2    1    4    1

对于小到中等的 m 值（最高约 10,000），时序非常好，但对于较大的问题会降低。例如，在 10 秒内获得了 m = 10,000 个 n = 1000 个元素的排列的样本（1000 万个值的矩阵）；m = 20,000 个 n = 20 个元素的排列样本需要 11 秒，即使输出（400,000 个条目的矩阵）要小得多；并且计算样本 m = 100,000 个 n = 20 个元素的排列在 260 秒后被中止（我没有耐心等待完成）。这个缩放问题似乎与 R 的关联寻址中的缩放效率低下有关。可以通过生成 1000 个左右的样本组来解决这个问题，然后将这些样本组合成一个大样本并删除重复项。

编辑

我们可以通过将缓存分成两个缓存的层次结构来实现接近线性的渐近性能，这样 R 就不必搜索一个大列表。从概念上讲（尽管没有实现），创建一个由第一个索引的数组 $k$ 排列的元素。此数组中的条目是共享那些第一个排列的所有排列的列表 $k$ 元素。要检查是否已看到排列，请使用它的第一个 $k$ 元素在缓存中找到其条目，然后在该条目中搜索该排列。我们可以选择 $k$ 平衡所有列表的预期大小。实际实现不使用 $k$ -fold 数组，很难以足够的通用性进行编程，而是使用另一个列表。

以下是一系列排列大小和请求的不同排列数量的一些经过时间（以秒为单位）：

 Number Size=10 Size=15 Size=1000 size=10000 size=100000
     10    0.00    0.00      0.02       0.08        1.03
    100    0.01    0.01      0.07       0.64        8.36
   1000    0.08    0.09      0.68       6.38
  10000    0.83    0.87      7.04      65.74
 100000   11.77   10.51     69.33
1000000  195.5   125.5

（从 size=10 到 size=15 的明显异常加速是因为第一级缓存对于 size=15 更大，减少了第二级列表中的平均条目数，从而加快了 R 的关联搜索。在某些情况下在 RAM 中的成本，可以通过增加上层缓存大小来加快执行速度。例如，只需增加k.head1（将上层大小乘以 10）就可以rperm(100000, size=10)从 11.77 秒加快到 8.72 秒。缓存大 10 倍，但没有获得明显的增益，时钟为 8.51 秒。）

除了 10 个元素的 1,000,000 个独特排列的情况（所有 10 个中的很大一部分！= 大约 363 万个这样的排列），几乎没有检测到任何碰撞。在这种特殊情况下，有 169,301 次碰撞，但没有完全失败（实际上获得了 100 万个唯一排列）。

请注意，对于较大的排列大小（大于 20 左右），即使在 1,000,000,000 大的样本中获得两个相同排列的机会也非常小。因此，该解决方案主要适用于以下情况：(a) (b) 之间存在大量唯一排列 $n=5$ 和 $n=15$ 大约要生成元素，但即便如此，（c）比所有元素都少得多 $n!$ 需要置换。

工作代码如下。

rperm <- function(m, size=2) { # Obtain m unique permutations of 1:size
    max.failures <- 10

    # Function to index into the upper-level cache.
    prefix <- function(p, k) {    # p is a permutation, k is the prefix size
        sum((p[1:k] - 1) * (size ^ ((1:k)-1))) + 1
    } # Returns a value from 1 through size^k

    # Function to obtain a new permutation.
    newperm <- function() {
        # References cache, k.head, and failures in parent context.
        # Modifies cache and failures.        

        count <- 0                # Protects against infinite loops
        repeat {
            # Generate a permutation and check against previous ones.
            p <- sample(1:size)
            k <- prefix(p, k.head)
            ip <- cache[[k]]
            hash.p <- paste(tail(p,-k.head), collapse="")
            if (is.null(ip[[hash.p]])) break

            # Prepare to try again.
            n.failures <<- n.failures + 1
            count <- count+1
            if (count > max.failures) {  
                p <- NA           # NA indicates a new permutation wasn't found
                hash.p <- ""
                break
            }
        }
        if (count <= max.failures) {
            ip[[hash.p]] <- TRUE      # Update the list of permutations found
            cache[[k]] <<- ip
        }
        p                         # Return this (new) permutation
    }

    # Initialize the cache.
    k.head <- min(size-1, max(1, floor(log(m / log(m)) / log(size))))
    cache <- as.list(1:(size^k.head))
    for (i in 1:(size^k.head)) cache[[i]] <- list()

    # Count failures (for benchmarking and error checking).
    n.failures <- 0

    # Obtain (up to) m unique permutations.
    s <- replicate(m, newperm())
    s[is.na(s)] <- NULL
    list(failures=n.failures, sample=matrix(unlist(s), ncol=size))
} # Returns an m by size matrix; each row is a permutation of 1:size.

以unique正确的方式使用应该可以解决问题：

set.seed(2)
limit <- 3
myindex <- seq(0,limit)

endDim<-factorial(limit)
permutations<-sample(myindex)

while(is.null(dim(unique(permutations))) || dim(unique(permutations))[1]!=endDim) {
    permutations <- rbind(permutations,sample(myindex))
}
# Resulting permutations:
unique(permutations)

# Compare to
set.seed(2)
permutations<-sample(myindex)
for(i in 1:endDim)
{
permutations<-rbind(permutations,sample(myindex))
}
permutations
# which contains the same permutation twice

我将稍微绕过您的第一个问题，并建议如果您正在处理相对较短的向量，您可以简单地使用生成所有排列，permn然后它们随机排序使用sample：

x <- combinat:::permn(1:3)
> x[sample(factorial(3),factorial(3),replace = FALSE)]
[[1]]
[1] 1 2 3

[[2]]
[1] 3 2 1

[[3]]
[1] 3 1 2

[[4]]
[1] 2 1 3

[[5]]
[1] 2 3 1

[[6]]
[1] 1 3 2

其它你可能感兴趣的问题

上一篇重要性抽样产生的蒙特卡罗估计结果下一篇n-gram 在什么时候会适得其反？