How to test whether the variances of two distributions differ when the distributions are not normal

Cross Validated: distributions, statistical significance, variance

2022-03-20 17:54:32

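I am studying two geographically isolated populations of the same species. Looking at the distributions, I see that both are bimodal (there is some seasonality to their occurrence), but the peaks in one population are much higher and much narrower (i.e., the variance of the local peaks is smaller).

What kind of statistical test would be appropriate to determine whether these differences are significant?

To clarify, my y-axis is the number of individuals identified in a trap on a given day, and the x-axis is the Julian day.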

3 Answers

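What are these distributions over time? Counts, perhaps? (If so, you may need something quite different from what has been discussed so far.)

What you describe does not sound like it is best understood as a difference in the variance of the distributions.

It sounds like you are loosely describing something like this (ignore the numbers on the axes; this is only meant to convey the general pattern you seem to be describing):

If that is right, then consider:

While each peak of the blue curve is narrower around its local center, the overall variances of the red and blue distributions hardly differ.

If you identify the modes and antimodes beforehand, you can then measure the local variability.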
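A minimal sketch of that idea, assuming the data are daily counts n indexed by day t and that the antimode (the trough between the two seasonal peaks) has been fixed in advance. The counts and the cut at day 45 are made up for illustration. Each side of the antimode is treated as its own distribution of capture dates, and the count-weighted standard deviation of the day measures its local spread.

```matlab
% Illustrative daily counts with two seasonal peaks (made-up numbers).
t = (1:100)';                                   % day index, stand-in for Julian day
n = round(1e3*exp(-(t-25).^2/20) + 5e2*exp(-(t-65).^2/40));
antimode = 45;                                  % assumed to be chosen beforehand

% Count-weighted mean and standard deviation of the day within a segment.
wmean = @(t,n) sum(t.*n) / sum(n);
wsd   = @(t,n) sqrt(sum(n.*(t - wmean(t,n)).^2) / sum(n));

early = t <  antimode;                          % days belonging to the first peak
late  = t >= antimode;                          % days belonging to the second peak
sd_early = wsd(t(early), n(early));             % spread of the first peak, in days
sd_late  = wsd(t(late),  n(late));              % spread of the second peak, in days
fprintf('local SD: early %.2f days, late %.2f days\n', sd_early, sd_late);
```

Computing the same two numbers for each population gives local spreads that can be compared directly, which the overall variance cannot do.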

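First, I think you should look at the seasonal distributions separately, because the bimodal distribution is probably the result of two fairly separate processes. The two distributions may be governed by different mechanisms, so that, for example, the winter distribution may be more sensitive to the yearly climate. If you want to look at population differences and the reasons for them, I think it is more useful to study the seasonal distributions separately.

As for a test, you could try Levene's test (essentially a test of homoscedasticity), which is used to compare variances between groups. Bartlett's test is an alternative, but Levene's test should be more robust to non-normality (especially when the median is used as the center). In R, Levene's test is available as leveneTest in library(car), and Bartlett's test as bartlett.test in the base stats package.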
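For illustration, here is a minimal MATLAB sketch of the median-centered Levene test (often called the Brown-Forsythe test), written with base functions only. It assumes each group is a plain vector of observations; the samples g1 and g2 are simulated placeholders, not data from the question. If the Statistics Toolbox is available, vartestn with 'TestType','BrownForsythe' should perform the same test.

```matlab
% Brown-Forsythe (median-centered Levene) test for two groups, base MATLAB only.
rng(1);                              % reproducible placeholder data
g1 = 2*randn(60,1);                  % simulated group 1 (SD about 2)
g2 = 6*randn(60,1) + 1;              % simulated group 2 (SD about 6)

% Absolute deviations from each group's own median.
z1 = abs(g1 - median(g1));
z2 = abs(g2 - median(g2));

% One-way ANOVA F statistic computed on the deviations.
n1 = numel(z1);  n2 = numel(z2);  N = n1 + n2;  k = 2;
zbar = mean([z1; z2]);
ss_between = n1*(mean(z1) - zbar)^2 + n2*(mean(z2) - zbar)^2;
ss_within  = sum((z1 - mean(z1)).^2) + sum((z2 - mean(z2)).^2);
F = (ss_between/(k-1)) / (ss_within/(N-k));

% Upper-tail p-value of F(k-1, N-k) via the regularized incomplete beta function.
d1 = k - 1;  d2 = N - k;
p = betainc(d2/(d2 + d1*F), d2/2, d1/2);
fprintf('F = %.3f, p = %.4g\n', F, p);
```

A small p-value indicates that the spread around the median differs between the groups. Note that the test expects samples of a variable, such as capture dates within one season, rather than a curve of counts against time.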

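I agree with what others have said, namely that "variance" is probably the wrong word (since the function you are considering is not a probability distribution but a time series).

I think you may want to approach this problem from a different angle: just fit the two time series with LOWESS curves. You can compute 95% confidence intervals and comment qualitatively on their shapes. I'm not sure you need to do anything fancier than this.

I've written some MATLAB code below to illustrate what I'm saying. I'm in a bit of a hurry, but can provide clarifications soon. Much of what I did can be taken directly from here: http://blogs.mathworks.com/loren/2011/01/13/data-driven-fitting/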

%% Generate Example data
npts = 200;
x = linspace(1,100,npts)';
y1 = (1e3*exp(-(x-25).^2/20) + 5e2*exp(-(x-65).^2/40));
y1_noisy = 50*randn(npts,1) + y1;
y2 = (1e3*exp(-(x-25).^2/60) + 5e2*exp(-(x-65).^2/100));
y2_noisy = 50*randn(npts,1) + y2;

figure; hold on
plot(x,y1_noisy,'ob')
plot(x,y2_noisy,'or')
title('raw data'); ylabel('count'); xlabel('time')
legend('y1','y2')

You may want to normalize the two time series to compare their relative trends rather than their absolute levels.

%% Normalize data sets
figure; hold on
Y1 = y1_noisy./norm(y1_noisy);
Y2 = y2_noisy./norm(y2_noisy);
plot(x,Y1,'ob')
plot(x,Y2,'or')
title('normalized data'); ylabel('normalized count'); xlabel('time')
legend('Y1','Y2')

Now for the LOWESS fits...

%% Make figure with lowess fits
figure; hold on
plot(x,Y1,'o','Color',[0.5 0.5 1])
plot(x,Y2,'o','Color',[1 0.5 0.5])
plot(x,mylowess([x,Y1],x,0.15),'-b','LineWidth',2)
plot(x,mylowess([x,Y2],x,0.15),'-r','LineWidth',2)
title('fit data'); ylabel('normalized count'); xlabel('time')

Finally, you can create 95% confidence bands as follows:

%% Use Bootstrapping to determine 95% confidence bands
figure; hold on
plot(x,Y1,'o','Color',[0.75 0.75 1])
plot(x,Y2,'o','Color',[1 0.75 0.75])

f = @(xy) mylowess(xy,x,0.15);
yboot_1 = bootstrp(1000,f,[x,Y1])';
yboot_2 = bootstrp(1000,f,[x,Y2])';
meanloess(:,1) = mean(yboot_1,2);
meanloess(:,2) = mean(yboot_2,2);
upper(:,1) = quantile(yboot_1,0.975,2);
upper(:,2) = quantile(yboot_2,0.975,2);
lower(:,1) = quantile(yboot_1,0.025,2);
lower(:,2) = quantile(yboot_2,0.025,2);

plot(x,meanloess(:,1),'-b','LineWidth',2);
plot(x,meanloess(:,2),'-r','LineWidth',2);
plot(x,upper(:,1),':b');
plot(x,upper(:,2),':r');
plot(x,lower(:,1),':b');
plot(x,lower(:,2),':r');
title('fit data -- with confidence bands'); ylabel('normalized count'); xlabel('time')

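Now you can interpret the final figure however you like, and you have the LOWESS fits to back up your hypothesis that the peaks in the red curve are in fact wider than those in the blue curve. If you have a better idea of what the underlying function is, you could do a non-linear regression instead.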
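As a rough illustration of that last remark, the sketch below fits a sum of two Gaussian bumps by least squares with fminsearch (base MATLAB). The two-Gaussian form is an assumption that happens to match how the example data were generated; real trap counts may call for a different model. The starting guess p0 is hypothetical, the kind of values one would read off a plot.

```matlab
% Least-squares fit of a two-Gaussian curve to one noisy example series.
rng(0);                                         % reproducible noise
npts = 200;
x = linspace(1,100,npts)';
y = 1e3*exp(-(x-25).^2/20) + 5e2*exp(-(x-65).^2/40) + 50*randn(npts,1);

% Parameters p = [a1 m1 s1 a2 m2 s2]: height, center and SD of each bump.
model = @(p,x) p(1)*exp(-(x-p(2)).^2/(2*p(3)^2)) + ...
               p(4)*exp(-(x-p(5)).^2/(2*p(6)^2));
sse   = @(p) sum((y - model(p,x)).^2);          % sum of squared errors

p0   = [900 23 4 450 67 5];                     % rough starting guess (assumed)
opts = optimset('MaxFunEvals',1e4,'MaxIter',1e4);
phat = fminsearch(sse, p0, opts);
phat([3 6]) = abs(phat([3 6]));                 % the sign of an SD is arbitrary here

fprintf('peak 1: center %.1f, SD %.2f\n', phat(2), phat(3));
fprintf('peak 2: center %.1f, SD %.2f\n', phat(5), phat(6));
```

The fitted SDs phat(3) and phat(6) are direct estimates of peak width, so the same fit on the second series gives numbers that can be compared between populations.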

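Edit: Based on some helpful comments below, I'm adding some more detail on estimating peak widths explicitly. First, you need to come up with a definition of what you consider a "peak" to be. Perhaps any bump that rises above some threshold (something like 0.05 in the plots I made above). The basic principle is that you should find a way to separate "real" or "significant" peaks from noise.

Then, for each peak, you can measure its width in a couple of ways. As I mentioned in the comments below, I think it is reasonable to look at the "half-max width", but you could also look at the total time the peak stays above your threshold. Ideally, you should use several different measures of peak width and report how consistent your results are across those choices.

Whatever metric you choose, you can use bootstrapping to compute a confidence interval for each peak in each trace.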

f = @(xy) mylowess(xy,x,0.15);
N_boot = 1000;
yboot_1 = bootstrp(N_boot,f,[x,Y1])';
yboot_2 = bootstrp(N_boot,f,[x,Y2])';

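This code creates 1000 bootstrapped fits for the blue and red traces in the plots above. One detail I will gloss over is the choice of the smoothing factor 0.15; you can choose this parameter to minimize cross-validation error (see the link I posted). Now all you have to do is write a function that isolates the peaks and estimates their widths: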

function [t_peaks,heights,widths] = getPeaks(t,Y)
%% Computes a list of times, heights, and widths, for each peak in a time series Y
%% (column vector) with associated time points t (column vector).

% The implementation of this function will be problem-specific...
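One hypothetical way to fill in this stub for the example data is sketched below. It is an assumption, not the implementation the answer has in mind: a peak is taken to be any contiguous run of the smoothed curve above a fixed threshold (0.05, the value suggested earlier for the normalized plots), and its width is the full width at half maximum, found by linear interpolation on each flank.

```matlab
function [t_peaks,heights,widths] = getPeaks(t,Y)
% Hypothetical sketch of getPeaks: a "peak" is any contiguous run of Y above
% a fixed threshold; its width is the full width at half maximum (FWHM).
t = t(:);  Y = Y(:);
thresh = 0.05;                               % assumed cut-off for the normalized example data
above  = double(Y > thresh);
starts = find(diff([0; above]) == 1);        % first index of each run above thresh
stops  = find(diff([above; 0]) == -1);       % last index of each run above thresh
nPk    = numel(starts);
t_peaks = nan(1,nPk);  heights = nan(1,nPk);  widths = nan(1,nPk);
for k = 1:nPk
    idx = (starts(k):stops(k))';
    [heights(k), j] = max(Y(idx));
    ipk  = idx(j);
    t_peaks(k) = t(ipk);
    half = heights(k)/2;
    % Walk outwards from the maximum to the first sample at or below half height.
    iL = ipk;  while iL > 1        && Y(iL) > half, iL = iL - 1; end
    iR = ipk;  while iR < numel(Y) && Y(iR) > half, iR = iR + 1; end
    % Interpolate the half-height crossing on each flank; if the curve never
    % drops to half height before the edge of the data, use the edge itself.
    tL = t(iL);  tR = t(iR);
    if Y(iL) <= half
        tL = t(iL) + (half - Y(iL))/(Y(iL+1) - Y(iL))*(t(iL+1) - t(iL));
    end
    if Y(iR) <= half
        tR = t(iR-1) + (Y(iR-1) - half)/(Y(iR-1) - Y(iR))*(t(iR) - t(iR-1));
    end
    widths(k) = tR - tL;
end
end
```

The function returns row vectors with one entry per detected peak, so the number of peaks can differ from one bootstrapped curve to the next; any code that stores the results in fixed-size arrays has to allow for that.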

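Then you run this function on the 1000 curves for each data set and compute the 2.5th and 97.5th percentiles of the width of each peak. I'll illustrate this on the Y1 time series; you would do the same for the Y2 time series or any other data set of interest.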

N_peaks = 2;  % two peaks in example data
t_peaks = nan(N_boot,N_peaks);
heights = nan(N_boot,N_peaks);
widths = nan(N_boot,N_peaks);
for aa = 1:N_boot
  [t_peaks(aa,:),heights(aa,:),widths(aa,:)] = getPeaks(x,yboot_1(:,aa));
end

quantile(widths(:,1),[0.025 0.975]) % confidence interval for the width of first peak
quantile(widths(:,2),[0.025 0.975]) % same for second peak width

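If you like, you can perform hypothesis tests rather than computing confidence intervals. Note that the code above is simplistic: it assumes every bootstrapped lowess curve has exactly 2 peaks. This assumption may not always hold, so be careful. I am only trying to illustrate the approach I would take.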
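A minimal sketch of the comparison step, assuming the loop above has been run for both series and the results stored in widths_1 and widths_2 (hypothetical names for N_boot-by-N_peaks matrices). The numbers generated here are random stand-ins so that the snippet runs on its own. Replicates where peak detection failed are assumed to be NaN and are dropped. Because the two series are resampled independently, the bootstrap differences can be formed row by row.

```matlab
% Percentile interval for the difference in first-peak width between two series.
rng(2);
N_boot   = 1000;
widths_1 = 7.5  + 0.4*randn(N_boot,2);   % stand-in for bootstrap widths of Y1
widths_2 = 13.0 + 0.6*randn(N_boot,2);   % stand-in for bootstrap widths of Y2

d = widths_2(:,1) - widths_1(:,1);       % width difference for peak 1
d = sort(d(~isnan(d)));                  % drop failed replicates, then sort
m = numel(d);

% Empirical 2.5th and 97.5th percentiles (nearest rank, base MATLAB only).
ci = [d(max(1,floor(0.025*m))), d(min(m,ceil(0.975*m)))];
fprintf('95%% interval for width difference: [%.2f, %.2f]\n', ci(1), ci(2));

differs = ci(1) > 0 || ci(2) < 0;        % true if the interval excludes zero
```

An interval for the difference that excludes zero suggests that the two peak widths differ; this is an informal bootstrap check rather than a calibrated test.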

Note: the "mylowess" function is given in the link I posted above. This is what it looks like...

function ys=mylowess(xy,xs,span)
%MYLOWESS Lowess smoothing, preserving x values
%   YS=MYLOWESS(XY,XS) returns the smoothed version of the x/y data in the
%   two-column matrix XY, but evaluates the smooth at XS and returns the
%   smoothed values in YS.  Any values outside the range of XY are taken to
%   be equal to the closest values.

if nargin<3 || isempty(span)
  span = .3;
end

% Sort and get smoothed version of xy data
xy = sortrows(xy);
x1 = xy(:,1);
y1 = xy(:,2);
ys1 = smooth(x1,y1,span,'loess');

% Remove repeats so we can interpolate
t = diff(x1)==0;
x1(t)=[]; ys1(t) = [];

% Interpolate to evaluate this at the xs values
ys = interp1(x1,ys1,xs,'linear',NaN);

% Some of the original points may have x values outside the range of the
% resampled data.  Those are now NaN because we could not interpolate them.
% Replace NaN by the closest smoothed value.  This amounts to extending the
% smooth curve using a horizontal line.
if any(isnan(ys))
  ys(xs<x1(1)) = ys1(1);
  ys(xs>x1(end)) = ys1(end);
end
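The answer leaves the choice of span to cross-validation. A rough sketch of that idea follows, under loud assumptions: it uses a simple local-linear smoother with tricube weights written inline as a stand-in for mylowess, a hypothetical grid of candidate spans, and leave-one-out prediction error as the criterion. The span it picks need not match what smooth(...,'loess') would prefer.

```matlab
% Leave-one-out cross-validation over candidate spans (stand-in smoother).
rng(3);
npts = 200;
x = linspace(1,100,npts)';
y = 1e3*exp(-(x-25).^2/20) + 5e2*exp(-(x-65).^2/40) + 50*randn(npts,1);

spans = [0.05 0.10 0.15 0.30 0.50];           % hypothetical candidate spans
cvErr = zeros(size(spans));
for s = 1:numel(spans)
    k = max(3, round(spans(s)*npts));         % neighbours used in each local fit
    res = zeros(npts,1);
    for i = 1:npts
        keep = true(npts,1);  keep(i) = false;    % leave point i out
        xt = x(keep);  yt = y(keep);
        [ds, order] = sort(abs(xt - x(i)));
        nb = order(1:k);                          % k nearest neighbours
        w  = (1 - (ds(1:k)/(ds(k) + eps)).^3).^3; % tricube weights
        A  = [ones(k,1), xt(nb) - x(i)];          % local line centered at x(i)
        W  = diag(w);
        beta = (A'*W*A) \ (A'*W*yt(nb));
        res(i) = y(i) - beta(1);                  % error at the left-out point
    end
    cvErr(s) = mean(res.^2);
end
[~, best] = min(cvErr);
fprintf('span with lowest CV error: %.2f\n', spans(best));
```

With narrow peaks like those in the example, large spans flatten the peaks and are penalized heavily, which is why the choice of span matters when the quantity of interest is peak width.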
