Tool for generating correlated data sets

Cross Validated | Tags: correlation, mathematical-statistics, dataset, random-generation, software

2022-03-27 09:25:08

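Does anyone know of a tool I could use to generate a set of data with a known correlation (and, as icing on the cake, output it as JSON, CSV, TXT, or some other common format)?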

I'm working on some data visualizations and want to evaluate which of them make it easier for users to spot correlations visually.

3 Answers

You can do this almost anywhere: Excel, R, ... pretty much anything capable of basic statistical calculations.

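Population correlation. In the bivariate case this is simple: take two independent random variables with the same standard deviation and build a third variable from them that has the desired correlation with one of the two. If $X_1$ and $X_2$ are independent standard normal variables, then $Y=rX_2+\sqrt{1-r^2}X_1$ has correlation $r$ with $X_2$, because $\operatorname{Var}(Y)=r^2+(1-r^2)=1$ and $\operatorname{Cov}(Y,X_2)=r$.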

Here's an example in R:
```
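n = 10                        # sample size
r = 0.8                       # target population correlation
x1 = rnorm(n)                 # independent standard normal draws
x2 = rnorm(n)
y1 = r*x2+sqrt(1-r*r)*x1      # unit variance, Cov(y1, x2) = r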
```
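Here the underlying variables have a population correlation of the desired size, but the sample correlation will differ from it. (I ran the code just 3 times and got sample correlations of 0.938, 0.895 and 0.933.)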
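
To see how much the sample correlation wanders, the construction can be repeated many times at two sample sizes. This is only a sketch: the seed, the 200 replications and the helper name `sample_cor` are my own choices, not part of the recipe.

```r
set.seed(1)  # assumption: arbitrary seed so the run is repeatable
r <- 0.8

# Hypothetical helper: sample correlation of (y, x2) for one data set of size n
sample_cor <- function(n, r) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- r * x2 + sqrt(1 - r^2) * x1
  cor(y, x2)
}

small <- replicate(200, sample_cor(10, r))     # n = 10, as in the example above
large <- replicate(200, sample_cor(10000, r))  # much larger samples

range(small)  # spreads widely around r
range(large)  # stays close to r
```

With only 10 observations the sample value can land far from 0.8; the spread shrinks as n grows.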

This is easily done in Excel or any number of other packages.

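If you need more than two variables with some prespecified correlation matrix, you can use the Cholesky decomposition (the construction above is a special case). If $Z$ is a vector of length $k$ of independent random variables with unit (or at least constant) standard deviation, and $S$ is a correlation matrix with Cholesky decomposition $S=LL'$, then $LZ$ has population correlation $S$.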
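
A minimal sketch of the Cholesky route in base R, assuming an illustrative 3 x 3 target matrix `S` (the values, seed and sample size are made up for the example). Note that `chol()` returns the upper-triangular factor, so it has to be transposed to get $L$.

```r
set.seed(42)  # assumption: arbitrary seed for repeatability
n <- 10000    # number of observations (illustrative)

# Target correlation matrix S (illustrative values; must be positive definite)
S <- matrix(c(1.0, 0.8, 0.3,
              0.8, 1.0, 0.5,
              0.3, 0.5, 1.0), nrow = 3, byrow = TRUE)

# chol() returns the upper-triangular R with S = R'R, so L = t(R) gives S = LL'
L <- t(chol(S))

# Z: k x n matrix of independent standard normals, one column per observation
Z <- matrix(rnorm(nrow(S) * n), nrow = nrow(S))

# Each column of L %*% Z is one draw with population correlation S
X <- t(L %*% Z)   # n x k, one row per observation

round(cor(X), 2)  # close to S, but not exactly equal
```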
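
Sample correlation. For an exact sample correlation, you need samples with exactly zero sample correlation and identical sample variance before applying the trick above. There are several ways to get them, but a simple one is to take the residuals from a regression (which are uncorrelated with the x-variable in that regression) and then scale both variables to unit variance.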

Here's an example in R:
```
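n = 10
r = 0.8
x1 = rnorm(n)
x2 = rnorm(n)
# residuals of x1 ~ x2 have exactly zero sample correlation with x2;
# scale() then gives both pieces unit sample variance
y1 = scale(x2) * r  +  scale(residuals(lm(x1~x2))) * sqrt(1-r*r)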
```
which yields the correlation:
```
 cor(y1,x2)
     [,1]
[1,]  0.8
```
exactly as required.
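
The intermediate pieces can be checked one at a time, which shows why the result is exact. A sketch with an arbitrary seed; `as.vector()` is my addition to drop the one-column matrix that `scale()` returns.

```r
set.seed(7)  # assumption: arbitrary seed
n <- 10
r <- 0.8
x1 <- rnorm(n)
x2 <- rnorm(n)

e  <- residuals(lm(x1 ~ x2))  # the part of x1 that is orthogonal to x2
zx <- as.vector(scale(x2))    # x2 with sample mean 0 and sample variance 1
ze <- as.vector(scale(e))     # residuals with sample mean 0 and sample variance 1

# Preconditions of the trick: zero sample correlation, unit sample variances
c(cor = cor(zx, ze), var_x = var(zx), var_e = var(ze))

y1 <- r * zx + sqrt(1 - r^2) * ze
cor(y1, x2)  # equals r up to floating-point rounding
```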

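So now it's just a matter of writing the result out in whatever format you prefer (all of the formats you mentioned are easy). For example, for a csv file you can call write.csv: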

write.csv(data.frame(y=y1,x=x2),file="myfile.csv")

This creates a file named "myfile.csv" in the current working directory, with the contents:

"","y","x"
"1",0.743433299251026,0.617686871809365
"2",0.527604385327034,-0.113047553664104
"3",-0.397333571358269,0.196447643803443
"4",-0.875264248799599,-1.57628371273354
"5",-0.225441433921137,-0.107919886825751
"6",0.0817573026498336,0.370207951209058
"7",-2.15935431462587,-1.21145928947767
"8",1.46638207013879,1.10215217029937
"9",0.311683673588212,-0.470550477344661
"10",0.526532837749974,-0.104382608454622
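
For the other formats, base R's `write.table` covers plain text. The sketch below uses placeholder data and writes to `tempdir()` (both my assumptions) and also shows `row.names = FALSE`, which drops the unnamed index column seen in the listing above.

```r
set.seed(3)  # assumption: arbitrary seed
d <- data.frame(y = rnorm(5), x = rnorm(5))  # placeholder for data.frame(y=y1, x=x2)

csv_path <- file.path(tempdir(), "myfile.csv")
txt_path <- file.path(tempdir(), "myfile.txt")

# CSV without the "" row-name column
write.csv(d, file = csv_path, row.names = FALSE)

# Tab-delimited text file with a header row
write.table(d, file = txt_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Reading the CSV back recovers the values up to the printed precision
back <- read.csv(csv_path)
all.equal(back, d)
```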

The R package mvtnorm generates random multivariate normals. You can specify the correlations.

If M is your matrix of random normals, write.csv(M, file="mydata.csv") writes it to a file.
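
A short sketch of that route, assuming the mvtnorm package is installed. The target correlation, sample size, seed, column names and the use of `tempdir()` are illustrative choices of mine.

```r
library(mvtnorm)  # assumption: installed, e.g. via install.packages("mvtnorm")

set.seed(123)     # assumption: arbitrary seed
S <- matrix(c(1.0, 0.8,
              0.8, 1.0), nrow = 2, byrow = TRUE)  # unit variances, so S is also the covariance

# 1000 draws from a bivariate normal with mean 0 and covariance S
M <- rmvnorm(n = 1000, mean = c(0, 0), sigma = S)
colnames(M) <- c("y", "x")

cor(M)  # sample correlation: near 0.8, not exactly 0.8

write.csv(M, file = file.path(tempdir(), "mydata.csv"), row.names = FALSE)
```

As with the first recipe, this fixes the population correlation only; the sample correlation varies from draw to draw.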

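Just to guard against specifying a set of correlations that is "impossible" as a whole (the correlation matrix can become non-positive-definite; for example, you cannot define two variables that are almost perfectly correlated and a third that is close to one of them but far from the other), it may be more useful to start from a "factor loadings" matrix, which describes the composition of the random variables as linear (regression) equations. This may not look very "natural" at first, but one can get used to it.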
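
An "impossible" matrix of this kind is easy to detect numerically. In the sketch below (the values are made up for illustration) A and B are nearly identical, C is close to A yet unrelated to B; each pairwise value looks harmless, but together they fail the positive-definiteness check.

```r
# Illustrative, jointly inconsistent pairwise correlations
S_bad <- matrix(c(1.00, 0.95, 0.90,
                  0.95, 1.00, 0.00,
                  0.90, 0.00, 1.00), nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# A valid correlation matrix has no negative eigenvalues
ev <- eigen(S_bad, symmetric = TRUE, only.values = TRUE)$values
ev
min(ev) < 0

# The Cholesky factorisation refuses such a matrix
chol_failed <- inherits(try(chol(S_bad), silent = TRUE), "try-error")
chol_failed
```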
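
The following can probably be done similarly in R, perhaps better, but I show it here in my own matrix-tool language, MatMate, because I have no experience with R. It could be written much more briefly (for quantities such as nv you could just insert the values), but for the sake of documentation I have written it out in a more heavily commented form. The example has: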

3 hidden common factors and
6 item-specific error factors (normally distributed)
6 "empirical" variables

measured in N = 1000 cases.

//==============================================================   
N = 1000       
nv = 6          // set number of empirical variables               
ncf,nef = 3,nv  // set number of common factors, error-factors               
nf = ncf+nef    // needed uncorrelated random-factors                  

// create a hidden ("unknown") loadingsmatrix, which describes the 
// composition of our empirical data by the "unknown" factors
// remember we want ncf=3 common factors and nef=nv=6 error factors
ulad = {{ 10.0 ,  1,  0}, _
        {  9   ,  0,  1}, _
        {  0   , 11,  0}, _
        {  1   , 12,  1}, _
        {  0.2 , -1, 11}, _
        { -0.3 ,  1, 10}}

ulad = ulad || 2*einh(nef)  // append a scaled identity matrix as definition of the
                            // error variance
                            // make the item-specific variance a bit bigger
                            // than the spurious cross-factor loadings in
                            // the ulad loadings matrix
     chk = ulad * ulad'     // check the expected covariancematrix
     list chk               // print it out
     chk = covtocorr (chk)  // look at it as correlation-matrix
     list chk               // print it out


// Now generate random data for nf uncorrelated normally-distributed factors
 set randomstart=41  // seed the random generator to get reproducible random data
 rn = randomn(nf,N)  // fix a basic datamatrix of random numbers (normal dist)
    chk = (rn *' - N*einh(nf))*1e3  // we find spurious correlations of 1e-3

 ufac=unkorrzl(rn)        // refine data in rn: remove spurious correlations
    // the process still leaves spurious correlations of 1e-12
    chk = (ufac *' - N*einh(nf))*1e12  // still spurious correlations of 1e-12

    // repeat for higher precision
    ufac=zvaluezl(abwzl(ufac))  // correct again for more exact z-values
    ufac=unkorrzl(ufac)   // remove spurious correlations again
    chk = (ufac *' - N*einh(nf))*1e18  // spurious correlations of 1e-18

// create "empirical" dataset with N=1000 measures
//       having the desired composition of the random factors
data = ulad * ufac

// ========= end of the empirically unobservable mechanism ============

// now you can proceed with regression, factoranalysis or whatever on
// that data      
// .....................
// or you can write out the data in a csv-file or into the clipboard
matwrite csv("mydata.csv",10,6) = data   // write in csv-format, cases along row
                                         // max 10 digits, 6 of them decimals
matwrite csv("mydata.csv",10,6) = data'  // cases along column
matwrite csv("clip",10,6) = data'  // write it directly into clipboard
                                   // to insert it, for instance, in Excel
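
For readers who work in R, here is a rough sketch of the same idea. It is not a translation of the MatMate routines: the QR step is a substitute I chose for `unkorrzl`/`zvaluezl` to get exactly uncorrelated, unit-variance factor scores, the seed produces a different random stream than `randomstart=41`, and cases run along rows instead of columns. The loadings are the ones from the script above.

```r
set.seed(41)  # assumption: echoes randomstart=41, but the random stream differs
N <- 1000

# Loadings of the 6 variables on the 3 common factors (values from the script above)
ulad <- matrix(c(10.0,  1,  0,
                  9.0,  0,  1,
                  0.0, 11,  0,
                  1.0, 12,  1,
                  0.2, -1, 11,
                 -0.3,  1, 10), ncol = 3, byrow = TRUE)
nv <- nrow(ulad)
A  <- cbind(ulad, 2 * diag(nv))  # append item-specific error loadings
nf <- ncol(A)                    # 3 common + 6 error factors

expected_cov <- A %*% t(A)             # covariance implied by the loadings
expected_cor <- cov2cor(expected_cov)  # the same, as a correlation matrix

# Factor scores with exactly zero sample means, unit variances and zero
# correlations: centre random normals, then orthonormalise their columns
raw  <- scale(matrix(rnorm(N * nf), nrow = N), scale = FALSE)
ufac <- qr.Q(qr(raw)) * sqrt(N - 1)    # N x nf

data <- ufac %*% t(A)                  # N x nv "empirical" data set

max(abs(cor(data) - expected_cor))     # zero up to floating-point rounding
```

Because the factor scores are exactly orthonormal in the sample, the sample correlation matrix of `data` matches the one implied by the loadings, and it is valid by construction. `write.csv(data, file = "mydata.csv")` would then write it out.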
