如何使用线性混合模型分析双峰分布数据

机器算法验证 混合模式 lme4-nlme 二项分布 实验设计 嵌套数据
2022-03-31 04:04:07

我正在进行的实验具有以下设计:

ABCDEF
BADEFC
ABEFCD
BAFCDE

  • 每个字母代表本实验分析的称为“系统”的单一因素的不同水平。该数据集包含八年,我们正在分析的因变量是yield
    A 和 B 可以根据系统类型组合在一起,C 到 F 也可以组合在一起。我知道 AB 组和 CDEF 组之间缺少随机化,这是由于法规所必需的,以及这两个组中缺少随机化,遗憾的是,这根本没有进行。
  • 我正在调查系统之间的产量是否存在显着差异 (AF)

我的数据如下所示:

> str(data)
'data.frame':   192 obs. of  6 variables:
 $ year  : Factor w/ 8 levels "2012","2013",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ type  : Factor w/ 2 levels "org","pest": 1 1 1 1 1 1 1 1 1 1 ...
 $ system: Factor w/ 6 levels "dgst_org","cc_pest",..: 3 3 3 3 5 5 5 5 6 6 ...
 $ row   : Factor w/ 4 levels "row_1","row_2",..: 1 2 3 4 2 3 4 1 3 4 ...
 $ column: Factor w/ 6 levels "column_1","column_2",..: 6 5 4 3 6 5 4 3 6 5 ...
 $ yield : num  26.2 41.4 43.4 45 40.8 52.3 47.1 47.2 40.1 42.4 ...

> summary(data)
      year      type             system      row          column       yield       
 2012   :24   org :128   dgst_org   :32   row_1:48   column_1:32   Min.   : 26.20  
 2013   :24   pest: 64   cc_pest    :32   row_2:48   column_2:32   1st Qu.: 52.30  
 2014   :24              cc_org     :32   row_3:48   column_3:32   Median : 62.95  
 2015   :24              manure_pest:32   row_4:48   column_4:32   Mean   : 73.79  
 2016   :24              manure_org :32              column_5:32   3rd Qu.:103.83  
 2017   :24              fmyd_org   :32              column_6:32   Max.   :127.10  

> head(data,20)
    year type     system   row   column yield
377 2012  org     cc_org row_1 column_6  26.2
378 2012  org     cc_org row_2 column_5  41.4
379 2012  org     cc_org row_3 column_4  43.4
380 2012  org     cc_org row_4 column_3  45.0
417 2012  org manure_org row_2 column_6  40.8
418 2012  org manure_org row_3 column_5  52.3
419 2012  org manure_org row_4 column_4  47.1
420 2012  org manure_org row_1 column_3  47.2
461 2012  org   fmyd_org row_3 column_6  40.1
462 2012  org   fmyd_org row_4 column_5  42.4
463 2012  org   fmyd_org row_1 column_4  39.5
464 2012  org   fmyd_org row_2 column_3  35.7
505 2012  org   dgst_org row_4 column_6  57.8
506 2012  org   dgst_org row_1 column_5  48.8
507 2012  org   dgst_org row_2 column_4  52.3
508 2012  org   dgst_org row_3 column_3  64.1
537 2013  org     cc_org row_1 column_6  41.2
538 2013  org     cc_org row_2 column_5  43.3
539 2013  org     cc_org row_3 column_4  57.2
540 2013  org     cc_org row_4 column_3  51.1

我试图提出一个适当的线性混合效应模型,但由于实验设计不佳而遇到了一些问题。

产率呈双峰分布,正如预期的系统类型的影响。 直方图

我知道这不是问题,只要模型的残差是正态分布的,它们是

> m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:column) + (1|year:row), data = data)
> summary(m1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:system) + (1 | year:column) +      (1 | year:row)
   Data: data

REML criterion at convergence: 1262.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2604 -0.4993  0.0596  0.5585  2.3880 

Random effects:
 Groups      Name        Variance Std.Dev.
 year:column (Intercept)  0.01384 0.1176  
 year:system (Intercept) 43.85302 6.6222  
 year:row    (Intercept)  2.27887 1.5096  
 year        (Intercept) 22.30702 4.7230  
 Residual                26.42919 5.1409  
Number of obs: 192, groups:  year:column, 48; year:system, 48; year:row, 32; year, 8

Fixed effects:
                  Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)         62.981      3.028  27.986  20.801  < 2e-16 ***
systemcc_pest       46.566      3.552  34.309  13.110 6.42e-15 ***
systemcc_org        -9.744      3.552  33.574  -2.743  0.00969 ** 
systemmanure_pest   47.147      3.552  34.309  13.274 4.49e-15 ***
systemmanure_org    -8.369      3.552  33.574  -2.356  0.02444 *  
systemfmyd_org     -10.722      3.552  33.574  -3.019  0.00482 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r
systmcc_pst -0.587                                          
systemcc_rg -0.587  0.500                                   
systmmnr_ps -0.587  0.500     0.500                         
systmmnr_rg -0.587  0.500     0.500     0.500               
systmfmyd_r -0.587  0.500     0.500     0.500      0.500  

QQ剧情

  1. 然后我的第一个想法是将整个数据集分成两个数据集(AB 和 CDEF),每个数据集都具有正态分布的数据,并检查系统之间的显着差异,首先是分开,然后是一起检查。
    我的 CDEF 组的 lmer 模型是:
    m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row) + (1|year:column))
    我尝试添加一个额外的随机效应来解释行和列之间的交互, +(1|row:column)
    但收到一条错误消息:boundary (singular) fit: see ?isSingular
    AB 组的模型是:
    m2 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row))
    因为只有单个复制的行。我检查了 emmeans 包,如果组之间存在显着差异,发现 F 之间存在显着差异,产量较高,CDE 产量较低。系统 A 和 B 之间没有发现差异。之后我不知道如何继续和比较两组。
  1. 我的第二个想法是添加一个考虑系统类型的分组变量,并创建一个可以同时比较整个实验的模型。
    我提出的 lmer 模型是:
    m3 <- lmer(yield ~ type + system + (1|year) + (1|year:system) + (1|year:type) + (1|year:row))
    我又遇到了一些问题,我不知道如何正确嵌套我的固定效果,因为它们显然是嵌套的以及如何考虑列。

正如Russ Lenth在评论中提到的那样,分裂人口是没有意义的,因为这是治疗的影响

因此,我的问题是:

  • 我是否应该划分我的数据集并分别分析两种系统类型(AB 和 CDEF),如果是这样,我如何在 AB 模型中包含列,之后我有什么可能比较 AB 和 CDEF?

  • 或者我应该制作一个模型来统治它们并为系统类型创建一个新的分组变量并正确嵌套它们并忽略列的随机效应?

  • 或者你有任何其他想法如何处理这个设计?

新型号

> m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row), data = data)
> summary(m1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:system) + (1 | year:row)
   Data: data

REML criterion at convergence: 1262.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2609 -0.4988  0.0592  0.5590  2.3885 

Random effects:
 Groups      Name        Variance Std.Dev.
 year:system (Intercept) 43.868   6.623   
 year:row    (Intercept)  2.276   1.509   
 year        (Intercept) 22.305   4.723   
 Residual                26.442   5.142   
Number of obs: 192, groups:  year:system, 48; year:row, 32; year, 8

Fixed effects:
                  Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)         62.981      3.028  28.260  20.799  < 2e-16 ***
systemcc_pest       46.566      3.552  35.000  13.108  4.6e-15 ***
systemcc_org        -9.744      3.552  35.000  -2.743  0.00954 ** 
systemmanure_pest   47.147      3.552  35.000  13.272  3.2e-15 ***
systemmanure_org    -8.369      3.552  35.000  -2.356  0.02421 *  
systemfmyd_org     -10.722      3.552  35.000  -3.018  0.00472 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r
systmcc_pst -0.587                                          
systemcc_rg -0.587  0.500                                   
systmmnr_ps -0.587  0.500     0.500                         
systmmnr_rg -0.587  0.500     0.500     0.500               
systmfmyd_r -0.587  0.500     0.500     0.500      0.500   


> m2 <- lmer(yield ~ system + (1|year) + (1|year:row) +  (1|year:column), data = data)
> summary(m2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:row) + (1 | year:column)
   Data: data

REML criterion at convergence: 1302.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0617 -0.5748  0.1023  0.5824  2.7636 

Random effects:
 Groups      Name        Variance Std.Dev.
 year:column (Intercept) 27.2467  5.2198  
 year:row    (Intercept)  0.2432  0.4932  
 year        (Intercept) 25.0757  5.0076  
 Residual                38.6421  6.2163  
Number of obs: 192, groups:  year:column, 48; year:row, 32; year, 8

Fixed effects:
                  Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)         62.981      2.281  12.319  27.616 1.87e-12 ***
systemcc_pest       46.566      2.229  75.612  20.889  < 2e-16 ***
systemcc_org        -9.744      1.554 116.002  -6.270 6.39e-09 ***
systemmanure_pest   47.147      2.229  75.612  21.149  < 2e-16 ***
systemmanure_org    -8.369      1.554 116.002  -5.385 3.84e-07 ***
systemfmyd_org     -10.722      1.554 116.002  -6.899 2.93e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r
systmcc_pst -0.405                                          
systemcc_rg -0.341  0.349                                   
systmmnr_ps -0.405  0.757     0.349                         
systmmnr_rg -0.341  0.349     0.500     0.349               
systmfmyd_r -0.341  0.349     0.500     0.349      0.500 




1个回答

我试着总结一下我从评论中学到的东西来结束这个问题:

  1. 线性混合效应模型不一定需要正态分布的数据;这是另一个处理相同问题的帖子的链接
  2. 不是数据本身,而是模型的残差应该是正态分布的
  3. 使用 lme 模型时要注意的最重要的事情之一是找到正确代表您的实验的正确模型语法,帮助我找到以下资源的资源: