机器算法验证 - 为什么 R 的 ccf 和 SciPy 的 correlate 的结果不同？ - 吾爱随笔录

我有两个 10000 元素时间序列，我想找到它们之间的互相关（Here & Here）。我正在用 R 进行数据探索（我是新手）并用 python 编写我的程序。R 的 ccf 似乎产生了与 SciPy 的相关函数不同的结果。为什么是这样？这是每个的代码：

代码：

library(readr)
CsI <- read_csv("/local_data0/collaborations/APT/data/180323-5/CsI.dat",col_names = FALSE)
WLS <- read_csv("/local_data0/collaborations/APT/data/180323-5/WLS.dat",col_names = FALSE)
result = ccf(CsI, WLS, type='correlation', lag.max = 10)
result$acf

输出：

              [,1]
 [1,]  0.032645497
 [2,] -0.033755607
 [3,] -0.037785705
 [4,]  0.022958839
 [5,]  0.057945119
 [6,] -0.014065906
 [7,] -0.092227126
 [8,] -0.041183831
 [9,]  0.078548696
[10,]  0.071090395
[11,] -0.009364837
[12,] -0.090044781
[13,] -0.037137735
[14,]  0.063409838
[15,]  0.077286233
[16,] -0.052529829
[17,] -0.103558852
[18,] -0.016650386
[19,]  0.102380818
[20,]  0.092912381
[21,] -0.022470694

蟒蛇代码：

import scipy.signal as ss
import numpy as np
import matplotlib.pyplot as plt

maxlags = 10
CsI = np.loadtxt('CsI.dat')
WLS = np.loadtxt('WLS.dat')
result = ss.correlate(CsI, WLS, method='direct') #This is 19999 elements in length
lo = (len(result)-1)//2-10 #just get +/- 10 elements around lag 0
hi = (len(result)-1)//2+11

locs = np.arange(lo, hi)
for loc in locs:
    print(str(loc)+'\t:\t'+str(result[loc]))

#Make a plot like ccf
f, ax = plt.subplots()
ax.stem(np.arange(-10,11), result[lo:hi], '-.')
ax.set_xticks(np.arange(-10,11))
plt.show()

蟒蛇输出：

9989    :       0.0011603199999999998
9990    :       -4.864000000000013e-05
9991    :       -0.00012224000000000002
9992    :       0.0009836799999999998
9993    :       0.0016211199999999998
9994    :       0.00031039999999999963
9995    :       -0.00111232
9996    :       -0.00018240000000000004
9997    :       0.00199808
9998    :       0.0018617599999999998
9999    :       0.0003961599999999999
10000   :       -0.0010732800000000002
10001   :       -0.00010944000000000012
10002   :       0.0017215999999999998
10003   :       0.0019744
10004   :       -0.0003904000000000001
10005   :       -0.0013203199999999998
10006   :       0.0002623999999999999
10007   :       0.0024300800000000003
10008   :       0.00225728
10009   :       0.0001561599999999998

这些显然是不同的。

应该注意的是，计算具有最大可能滞后的 R ccf 会产生与 SciPy 的相关图相似的图：

R ccf with lag.max=10000

SciPy 的相关性：

一般特征相似，但细节不同。整体尺度也有很大不同（x3 差异）。造成这些差异的原因是什么，它们是在 python 中重现 R 的 ccf 结果的一种方法吗？