How can I aggregate a week of per-minute data into hourly averages?

Cross Validated r time-series aggregation
2022-01-30 10:12:21

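How would you get hourly means for several data columns over the same time period, and display the results for twelve "hosts" in the same graph? That is, I'd like to plot a 24-hour cycle from a week's worth of data. The eventual goal is to compare two sets of this data, before and after sampling.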

                dates     Host     CPUIOWait CPUUser CPUSys
1 2011-02-11 23:55:12     db       0         14      8
2 2011-02-11 23:55:10     app1     0          6      1
3 2011-02-11 23:55:09     app2     0          4      1

I've been able to run xyplot(CPUUser ~ dates | Host) to good effect. However, rather than showing every date in the week, I'd like the X axis to be the hour of the day.

Trying to get this data into an xts object results in errors such as:

"order.by requires an appropriate time-based object"

Here is the str() of the data frame:

'data.frame':   19720 obs. of  5 variables:
$ dates    : POSIXct, format: "2011-02-11 23:55:12" "2011-02-11 23:55:10" ...
$ Host     : Factor w/ 14 levels "app1","app2",..: 9 7 5 4 3 10 6 8 2 1 ...  
$ CPUIOWait: int  0 0 0 0 0 0 0 0 0 0 ...
$ CPUUser  : int  14 6 4 4 3 10 4 3 4 4 ...
$ CPUSys   : int  8 1 1 1 1 3 1 1 1 1 ...

Update: Just for future reference, I decided to go with a boxplot to show both the median and the "outliers".

Essentially:

Data$hour <- as.POSIXlt(Data$dates)$hour  # extract hour of the day
boxplot(Data$CPUUser ~ Data$hour)         # for a subset with one host or for all hosts
xyplot(Data$CPUUser ~ Data$hour | Data$Host, panel=panel.bwplot, horizontal=FALSE)
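
For reference, here is a minimal base-R sketch of the hour-of-day extraction used above. The data frame `demo` and its values are made up for illustration, and the timestamps are assumed to be in UTC.

```r
# Hypothetical sample readings at known times (UTC), spread over several dates.
demo <- data.frame(
  dates   = as.POSIXct(c("2011-02-11 23:55:12", "2011-02-12 00:05:00",
                         "2011-02-12 23:10:00", "2011-02-13 00:45:00"),
                       tz = "UTC"),
  CPUUser = c(14, 6, 10, 2)
)

# as.POSIXlt() exposes the hour of the day (0-23) as a list component.
demo$hour <- as.POSIXlt(demo$dates)$hour

# Mean CPUUser per hour of day, pooled over all dates.
hourly_mean <- tapply(demo$CPUUser, demo$hour, mean)
print(hourly_mean)
```

Because the hour is taken in the time zone stored with the POSIXct column, it may be worth checking that time zone before comparing hosts.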
4 Answers

Here's one approach that uses cut() to create the appropriate hourly factor and ddply() from the plyr library to calculate the means.

library(lattice)
library(plyr)

## Create a record and some random data for every 5 seconds 
## over two days for two hosts.
dates <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
             as.POSIXct("2011-01-02 23:59:55", tz = "GMT"),
             by = 5)
hosts <- c(rep("host1", length(dates)), rep("host2", 
           length(dates)))
x1    <- sample(0:20, 2*length(dates), replace = TRUE)
x2    <- rpois(2*length(dates), 2)
Data  <- data.frame(dates = dates, hosts = hosts, x1 = x1, 
                    x2 = x2)

## Calculate the mean for every hour using cut() to define 
## the factors and ddply() to calculate the means. 
## getmeans() is applied for each unique combination of the
## hosts and hour factors.
getmeans  <- function(Df) c(x1 = mean(Df$x1), 
                            x2 = mean(Df$x2))
Data$hour <- cut(Data$dates, breaks = "hour")
Means <- ddply(Data, .(hosts, hour), getmeans)
Means$hour <- as.POSIXct(Means$hour, tz = "GMT")

## A plot for each host.
xyplot(x1 ~ hour | hosts, data = Means, type = "o",
       scales = list(x = list(relation = "free", rot = 90)))
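
The hourly means above run along the full timeline. If the goal is a single 24-hour profile, one possible follow-up step is to average those hourly means by hour of day. The sketch below uses a small made-up data frame, `Means_demo`, shaped like `Means` for a single host; the names `hod` and `profile` are illustrative only.

```r
# Hypothetical hourly means for one host over two days (UTC), shaped like Means.
Means_demo <- data.frame(
  hour = as.POSIXct(c("2011-01-01 10:00:00", "2011-01-01 11:00:00",
                      "2011-01-02 10:00:00", "2011-01-02 11:00:00"),
                    tz = "UTC"),
  x1   = c(8, 12, 10, 14)
)

# Hour of day (0-23) for each hourly bin.
Means_demo$hod <- as.integer(format(Means_demo$hour, "%H"))

# Average the hourly means across days: one value per hour of day.
profile <- aggregate(x1 ~ hod, data = Means_demo, FUN = mean)
print(profile)
```

Note that averaging hourly means gives each day equal weight; it matches the mean of the pooled raw observations only when every hourly bin contains the same number of records.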

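Aggregation also works without using zoo (here with random data from 2 variables for 3 days and 4 hosts, like JWM's example). I assume that you have data from all hosts for every hour.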

nHosts <- 4  # number of hosts
dates  <- seq(as.POSIXct("2011-01-01 00:00:00"),
              as.POSIXct("2011-01-03 23:59:30"), by=30)
hosts  <- factor(sample(1:nHosts, length(dates), replace=TRUE),
                 labels=paste("host", 1:nHosts, sep=""))
x1     <- sample(0:20, length(dates), replace=TRUE)  # data from 1st variable
x2     <- rpois(length(dates), 2)                    # data from 2nd variable
Data   <- data.frame(dates=dates, hosts=hosts, x1=x1, x2=x2)

I'm not entirely sure whether you want to average just within each hour, or within each hour of the day across all days. I'll do both.

Data$hFac <- droplevels(cut(Data$dates, breaks="hour"))
Data$hour <- as.POSIXlt(dates)$hour  # extract hour of the day

# average both variables over days within each hour and host
# formula notation was introduced in R 2.12.0 I think
res1 <- aggregate(cbind(x1, x2) ~ hour + hosts, data=Data, FUN=mean)
# only average both variables within each hour and host
res2 <- aggregate(cbind(x1, x2) ~ hFac + hosts, data=Data, FUN=mean)

The results look like this:

> head(res1)
  hour hosts        x1       x2
1    0 host1  9.578431 2.049020
2    1 host1 10.200000 2.200000
3    2 host1 10.423077 2.153846
4    3 host1 10.241758 1.879121
5    4 host1  8.574713 2.011494
6    5 host1  9.670588 2.070588

> head(res2)
                 hFac hosts        x1       x2
1 2011-01-01 00:00:00 host1  9.192308 2.307692
2 2011-01-01 01:00:00 host1 10.677419 2.064516
3 2011-01-01 02:00:00 host1 11.041667 1.875000
4 2011-01-01 03:00:00 host1 10.448276 1.965517
5 2011-01-01 04:00:00 host1  8.555556 2.074074
6 2011-01-01 05:00:00 host1  8.809524 2.095238

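I'm also not entirely sure what kind of graph you want. Here is a bare-bones version for the first variable only, with a separate data line for each host.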

# using the data that is averaged over days as well
res1L <- split(subset(res1, select="x1"), res1$hosts)
mat1  <- do.call(cbind, res1L)
colnames(mat1) <- levels(hosts)
rownames(mat1) <- 0:23
# pass the hours as x so the axis labels 0-23 line up with the plotted points
matplot(0:23, mat1, main="x1 per hour, avg. over days", xaxt="n", type="o", pch=16, lty=1)
axis(side=1, at=seq(0, 23, by=2))
legend(x="topleft", legend=colnames(mat1), col=1:nHosts, lty=1)
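
As a possible alternative to the split() and cbind() steps, tapply() with two grouping variables returns the hour-by-host matrix directly. This is only a sketch on a made-up data frame, `res_demo`, shaped like `res1`; it is not the approach used in the code above.

```r
# Hypothetical long-format averages: 2 hours x 2 hosts, shaped like res1.
res_demo <- data.frame(
  hour  = c(0, 1, 0, 1),
  hosts = factor(c("host1", "host1", "host2", "host2")),
  x1    = c(9.5, 10.2, 8.1, 11.0)
)

# Two grouping variables give a matrix: rows are hours, columns are hosts.
# Each cell holds a single value here, so mean() just returns that value.
mat_demo <- tapply(res_demo$x1, list(res_demo$hour, res_demo$hosts), mean)
print(mat_demo)
```

The resulting matrix can be handed to matplot() in the same way as mat1. A host with no data for some hour would show up as NA in its cell, which relaxes the assumption that every host has data for every hour.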

The same graph for the data that is averaged only within each hour.

res2L <- split(subset(res2, select="x1"), res2$hosts)
mat2  <- do.call(cbind, res2L)
colnames(mat2) <- levels(hosts)
rownames(mat2) <- levels(Data$hFac)
matplot(mat2, main="x1 per hour", type="o", pch=16, lty=1)
legend(x="topleft", legend=colnames(mat2), col=1:nHosts, lty=1)

You can check out the aggregate.zoo function from the zoo package: http://cran.r-project.org/web/packages/zoo/zoo.pdf

Charlie

Given that your times are in POSIXct format (or can be converted with as.POSIXct(time)), all you need is cut() and aggregate().

Try this:

split_hour = cut(as.POSIXct(temp$time), breaks = "60 mins") # bin timestamps into 60-minute intervals
temp$hour = split_hour # store the bin as an hourly grouping variable
ag = aggregate(. ~ hour, temp, mean) # mean of every other column within each bin

In this case, temp looks like this:

1  0.6 0.6 0.0 0.350 0.382 0.000 2020-04-13 18:30:42
2  0.0 0.5 0.5 0.000 0.304 0.292 2020-04-13 19:56:02
3  0.0 0.2 0.2 0.000 0.107 0.113 2020-04-13 20:09:10
4  0.6 0.0 0.6 0.356 0.000 0.376 2020-04-13 20:11:57
5  0.0 0.3 0.2 0.000 0.156 0.148 2020-04-13 20:12:07
6  0.0 0.4 0.4 0.000 0.218 0.210 2020-04-13 22:02:49
7  0.2 0.2 0.0 0.112 0.113 0.000 2020-04-13 22:31:43
8  0.3 0.0 0.3 0.155 0.000 0.168 2020-04-14 03:19:03
9  0.4 0.0 0.4 0.219 0.000 0.258 2020-04-14 03:55:58
10 0.2 0.0 0.0 0.118 0.000 0.000 2020-04-14 04:25:25
11 0.3 0.3 0.0 0.153 0.160 0.000 2020-04-14 05:38:20
12 0.0 0.7 0.8 0.000 0.436 0.493 2020-04-14 05:40:02
13 0.0 0.0 0.2 0.000 0.000 0.101 2020-04-14 05:40:44
14 0.3 0.0 0.3 0.195 0.000 0.198 2020-04-14 06:09:26
15 0.2 0.2 0.0 0.130 0.128 0.000 2020-04-14 06:17:15
16 0.2 0.0 0.0 0.144 0.000 0.000 2020-04-14 06:19:36
17 0.3 0.0 0.4 0.177 0.000 0.220 2020-04-14 06:23:43
18 0.2 0.0 0.0 0.110 0.000 0.000 2020-04-14 06:25:19
19 0.0 0.0 0.0 1.199 1.035 0.251 2020-04-14 07:05:24
20 0.2 0.2 0.0 0.125 0.107 0.000 2020-04-14 07:21:46

and ag looks like this:

1  2020-04-13 18:30:00 0.60000000 0.6000000 0.0000000 0.3500000 0.38200000 0.00000000
2  2020-04-13 19:30:00 0.15000000 0.2500000 0.3750000 0.0890000 0.14175000 0.23225000
3  2020-04-13 21:30:00 0.00000000 0.4000000 0.4000000 0.0000000 0.21800000 0.21000000
4  2020-04-13 22:30:00 0.20000000 0.2000000 0.0000000 0.1120000 0.11300000 0.00000000
5  2020-04-14 02:30:00 0.30000000 0.0000000 0.3000000 0.1550000 0.00000000 0.16800000
6  2020-04-14 03:30:00 0.30000000 0.0000000 0.2000000 0.1685000 0.00000000 0.12900000
7  2020-04-14 05:30:00 0.18750000 0.1500000 0.2125000 0.1136250 0.09050000 0.12650000
8  2020-04-14 06:30:00 0.10000000 0.1000000 0.0000000 0.6620000 0.57100000 0.12550000
9  2020-04-14 07:30:00 0.00000000 0.3000000 0.2000000 0.0000000 0.16200000 0.11800000
10 2020-04-14 19:30:00 0.20000000 0.3000000 0.0000000 0.1460000 0.19000000 0.00000000
11 2020-04-14 20:30:00 0.06666667 0.2000000 0.2666667 0.0380000 0.11766667 0.17366667
12 2020-04-14 22:30:00 0.20000000 0.3000000 0.0000000 0.1353333 0.18533333 0.00000000
13 2020-04-14 23:30:00 0.00000000 0.5000000 0.5000000 0.0000000 0.28000000 0.32100000
14 2020-04-15 01:30:00 0.25000000 0.2000000 0.4500000 0.1355000 0.11450000 0.26100000
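
The bins in ag start at 18:30:00 rather than on the clock hour, because breaks = "60 mins" starts counting from the first timestamp rounded down to the minute, whereas breaks = "hour" rounds down to the hour. A small sketch with made-up timestamps (assumed UTC) shows the difference:

```r
# Hypothetical timestamps (UTC); the earliest one falls at 18:30:42.
tm <- as.POSIXct(c("2020-04-13 18:30:42", "2020-04-13 19:10:00",
                   "2020-04-13 19:56:02"), tz = "UTC")

by_60min <- cut(tm, breaks = "60 mins")  # first bin starts at 18:30:00
by_hour  <- cut(tm, breaks = "hour")     # first bin starts at 18:00:00

# 19:10:00 shares a 60-minute bin with 18:30:42,
# but shares a clock-hour bin with 19:56:02.
print(table(by_60min))
print(table(by_hour))
```

If the averages are meant to line up with clock hours, for example to compare hosts or days, breaks = "hour" is probably the safer choice.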