编辑:我最初认为 OP 知道哪些观察来自哪个物种。OP的编辑清楚地表明我原来的方法是不可行的。我会把它留给后代,但另一个答案要好得多。作为安慰,我在 Stan 中编写了一个混合模型。我并不是说贝叶斯方法在这种情况下特别好,但它只是我可以贡献的一些巧妙的东西。
斯坦代码
data{
//Number of data points
int N;
real y[N];
real x[N];
}
parameters{
//mixing parameter
real<lower=0, upper =1> theta;
//Regression intercepts
real beta_0[2];
//Regression slopes.
ordered[2] beta_1;
//Regression noise
real<lower=0> sigma[2];
}
model{
//priors
theta ~ beta(5,5);
beta_0 ~ normal(0,1);
beta_1 ~ normal(0,1);
sigma ~ cauchy(0,2.5);
//mixture likelihood
for (n in 1:N){
target+=log_mix(theta,
normal_lpdf(y[n] | beta_0[1] + beta_1[1]*x[n], sigma[1]),
normal_lpdf(y[n] | beta_0[2] + beta_1[2]*x[n], sigma[2]));
}
}
generated quantities {
//posterior predictive distribution
//will allow us to see what points belong are assigned
//to which mixture
matrix[N,2] p;
matrix[N,2] ps;
for (n in 1:N){
p[n,1] = log_mix(theta,
normal_lpdf(y[n] | beta_0[1] + beta_1[1]*x[n], sigma[1]),
normal_lpdf(y[n] | beta_0[2] + beta_1[2]*x[n], sigma[2]));
p[n,2]= log_mix(1-theta,
normal_lpdf(y[n] | beta_0[1] + beta_1[1]*x[n], sigma[1]),
normal_lpdf(y[n] | beta_0[2] + beta_1[2]*x[n], sigma[2]));
ps[n,]= p[n,]/sum(p[n,]);
}
}
从 R 运行 Stan 模型
library(tidyverse)
library(rstan)
#Simulate the data
N = 100
x = rnorm(N, 0, 3)
group = factor(sample(c('a','b'),size = N, replace = T))
y = model.matrix(~x*group)%*% c(0,1,0,2)
y = as.numeric(y) + rnorm(N)
d = data_frame(x = x, y = y)
d %>%
ggplot(aes(x,y))+
geom_point()
#Fit the model
N = length(x)
x = as.numeric(x)
y = y
fit = stan('mixmodel.stan',
data = list(N= N, x = x, y = y),
chains = 8,
iter = 4000)
结果
虚线是基本事实,实线是估计的。
原始答案
如果您知道哪个样品来自哪个品种的水仙花,您可以估计品种和土壤 PH 之间的相互作用。
你的模型看起来像
y=β0+β1variety+β2PH+β3variety⋅PH
这是 R 中的一个示例。我生成了一些如下所示的数据:
显然是两条不同的线,而且这些线对应于两个物种。以下是如何使用线性回归估计线条。
library(tidyverse)
#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)
y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)
d = data_frame(ph = ph, species = species, y = y)
#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)
结果是
> summary(model)
Call:
lm(formula = y ~ species * ph, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16
对于标记为 0 的物种,这条线大约是
y=19+1⋅PH
对于标记为 1 的物种,这条线大约是
y=40−2⋅PH