我想建议一个(标准)初步分析,以消除(a)用户之间的变化,(b)所有用户对变化的典型反应,以及(c)从一个时间段到下一个时间段的典型变化的主要影响.
一个简单(但绝不是最好)的方法是对数据执行几次“中值抛光”迭代,以清除用户中值和时间段中值,然后随着时间的推移平滑残差。确定变化很大的平滑:它们是您要在图形中强调的用户。
因为这些是计数数据,所以使用平方根重新表达它们是个好主意。
作为可能产生的结果的一个例子,这里是一个模拟的 60 周数据集,包含 240 个用户,他们通常每周执行 10 到 20 个操作。在第 40 周之后,所有用户都发生了变化。其中三个被“告知”对变化做出负面反应。左图显示了原始数据:随时间变化的用户操作计数(用户以颜色区分)。正如问题中所说,这是一团糟。右图显示了此 EDA 的结果——颜色与之前相同——自动识别并突出显示响应异常的用户。标识——虽然它有点特别——是完整和正确的(在这个例子中)。
这是R
生成这些数据并进行分析的代码。它可以通过多种方式进行改进,包括
尽管如此,测试表明该解决方案适用于广泛的用户数量,12 - 240 或更多。
n.users <- 240 # Number of users (here limited to 657, the number of colors)
n.periods <- 60 # Number of time periods
i.break <- 40 # Period after which change occurs
n.outliers <- 3 # Number of greatly changed users
window <- 1/5 # Temporal smoothing window, fraction of total period
response.all <- 1.1 # Overall response to the change
threshold <- 2 # Outlier detection threshold
# Create a simulated dataset
set.seed(17)
base <- exp(rnorm(n.users, log(10), 1/2))
response <- c(rbeta(n.users - n.outliers, 9, 1),
rbeta(n.outliers, 5, 45)) * response.all
actual <- cbind(base %o% rep(1, i.break),
base * response %o% rep(response.all, n.periods-i.break))
observed <- matrix(rpois(n.users * n.periods, actual), nrow=n.users)
# ---------------------------- The analysis begins here ----------------------------#
# Plot the raw data as lines
set.seed(17)
colors = sample(colors(), n.users) # (Use a different method when n.users > 657)
par(mfrow=c(1,2))
plot(c(1,n.periods), c(min(observed), max(observed)), type="n",
xlab="Time period", ylab="Number of actions", main="Raw data")
i <- 0
apply(observed, 1, function(a) {i <<- i+1; lines(a, col=colors[i])})
abline(v = i.break, col="Gray") # Mark the last period before a change
# Analyze the data by time period and user by sweeping out medians and smoothing
x <- sqrt(observed + 1/6) # Re-express the counts
mean.per.period <- apply(x, 2, median)
residuals <- sweep(x, 2, mean.per.period)
mean.per.user <- apply(residuals, 1, median)
residuals <- sweep(residuals, 1, mean.per.user)
smooth <- apply(residuals, 1, lowess, f=window) # Smooth the residuals
smooth.y <- sapply(smooth, function(s) s$y) # Extract the smoothed values
ends <- ceiling(window * n.periods / 4) # Prepare to drop near-end values
range <- apply(smooth.y[-(1:ends), ], 2, function(x) max(x) - min(x))
# Mark the apparent outlying users
thick <- rep(1, n.users)
thick[outliers <- which(range >= threshold * median(range))] <- 3
type <- ifelse(thick==1, 3, 1)
cat(outliers) # Print the outlier identifiers (ideally, the last `n.outliers`)
# Plot the residuals
plot(c(1,n.periods), c(min(smooth.y), max(smooth.y)), type="n",
xlab="Time period", ylab="Smoothed residual root", main="Residuals")
i <- 0
tmp <- lapply(smooth,
function(a) {i <<- i+1; lines(a, lwd=thick[i], lty=type[i], col=colors[i])})
abline(v = i.break, col="Gray")