R summarise with conditions

data-mining r dplyr

2022-02-21 18:22:36

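I have customer data with the products each customer purchased and the purchase dates.

I want to extract a result that shows each customer together with the first two fruits they purchased.

My actual data set has 90,000 rows and 9,000 unique customers. I have tried the group_by and summarise functions, but I would like to use summarise with conditions, the same way we use SELECT with a WHERE clause. Thanks for your suggestions.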

3 Answers

Here is an example using the iris data set:

t(sapply(by(iris$Sepal.Length,iris$Species,function(x){x[1:2]}),as.numeric))

Here Species plays the role of your customer, and Sepal.Length plays the role of your fruit.
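The same base R idea can be sketched on purchase data instead of iris. The data frame below is a small illustrative one, and the column names customer, fruit and date are assumptions about how your data is laid out. Sorting first matters: without it, "first two" means the first two rows as stored, not the two earliest purchases.

```r
# Illustrative data; the column names customer, fruit and date are assumptions.
purchases <- data.frame(
  customer = c("A", "A", "C", "C", "B", "B", "C", "B", "A"),
  fruit    = c("orange", "apple", "banana", "orange", "apple",
               "banana", "banana", "apple", "banana"),
  date     = as.Date(c("2018-05-04", "2018-07-09", "2018-01-02", "2018-01-03",
                       "2018-01-02", "2018-04-05", "2018-01-02", "2018-01-06",
                       "2018-06-01")),
  stringsAsFactors = FALSE
)

# Sort by customer, then by purchase date, so "first two" means "earliest two".
purchases <- purchases[order(purchases$customer, purchases$date), ]

# For each customer take the first two fruits; x[1:2] pads with NA
# when a customer has fewer than two purchases.
first_two <- t(sapply(split(purchases$fruit, purchases$customer),
                      function(x) x[1:2]))
colnames(first_two) <- c("fruit1", "fruit2")
first_two
```

The result is a character matrix with one row per customer, so wrap it in as.data.frame() if you need a data frame afterwards.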

If you want a dplyr solution, you can try this:

library(dplyr)
library(tidyr)

yourdata %>% 
      mutate(date = paste(date, "-2018", sep = ""), # add year to date
             date = as.Date(date, format = "%d-%b-%Y")) %>% # save date in date format
      arrange(date) %>% # sort by date
      group_by(customer) %>%
      slice(1:2) %>% # keep only first two rows (fruits) per customer
      mutate(date = c("fruit1", "fruit2")) %>% # change date variable to fruit1/fruit2
      spread(key = date, value = fruit) # spread data

A shorter version of the code (condensing the mutate step):

yourdata %>% 
      mutate(date = as.Date(paste(date, "-2018", sep = ""),
                            format = "%d-%b-%Y")) %>%
      arrange(date) %>% # sort by date
      group_by(customer) %>%
      slice(1:2) %>% # keep only first two rows (fruits) per customer
      mutate(date = c("fruit1", "fruit2")) %>% # change date variable to fruit1/fruit2
      spread(key = date, value = fruit) # spread data
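The pipeline above assumes every customer has at least two purchases, because mutate() assigns exactly two labels per group. A possible variant, closer to the "summarise with a condition" the question asks about, picks elements by position inside summarise(). This is only a sketch: the tibble and its column names are illustrative, and the dates are assumed to be Date values already.

```r
library(dplyr)

# Illustrative data; customer B has only one purchase.
purchases <- tibble(
  customer = c("A", "A", "B", "C", "C", "C"),
  fruit    = c("orange", "banana", "apple", "banana", "orange", "apple"),
  date     = as.Date(c("2018-05-04", "2018-06-01", "2018-01-02",
                       "2018-01-02", "2018-01-03", "2018-02-01"))
)

result <- purchases %>%
  arrange(customer, date) %>%          # earliest purchases first within each customer
  group_by(customer) %>%
  summarise(fruit1 = first(fruit),     # first fruit bought
            fruit2 = nth(fruit, 2))    # second fruit, NA if there is none

result
```

Because nth() returns NA when the requested position does not exist, customers with a single purchase stay in the result instead of causing an error.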

Here is a solution using data.table.

First, sort the data.table by customer and date.

Then group by customer and select the first two fruits.

> df[order(customer,date)][,.(fruit1=fruit[1],fruit2=fruit[2]),by=customer] 
   customer fruit1 fruit2
1:        A orange banana
2:        B  apple  apple
3:        C banana banana

Sample data

> df <- data.table(
+ customer = c('A','A','C','C','B','B','C','B','A'),
+ fruit = c('orange','apple','banana','orange','apple','banana','banana','apple','banana'),
+ date = c(as.Date('2018-05-04'),as.Date('2018-07-09'),as.Date('2018-01-02'),as.Date('2018-01-03'),as.Date('2018-01-02'),
+ as.Date('2018-04-05'),as.Date('2018-01-02'),as.Date('2018-01-06'),as.Date('2018-06-01'))
+ )
> df
   customer  fruit       date
1:        A orange 2018-05-04
2:        A  apple 2018-07-09
3:        C banana 2018-01-02
4:        C orange 2018-01-03
5:        B  apple 2018-01-02
6:        B banana 2018-04-05
7:        C banana 2018-01-02
8:        B  apple 2018-01-06
9:        A banana 2018-06-01
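If a long result (one row per purchase) is acceptable, the data.table approach can be generalised to the first n purchases with head(.SD, n). The sketch below uses its own small illustrative table; the names purchases, first_n and n are not from the answer above.

```r
library(data.table)

# Illustrative data; customer B has only one purchase.
purchases <- data.table(
  customer = c("A", "A", "A", "B", "C", "C"),
  fruit    = c("orange", "apple", "banana", "apple", "banana", "orange"),
  date     = as.Date(c("2018-05-04", "2018-07-09", "2018-06-01",
                       "2018-01-02", "2018-01-02", "2018-01-03"))
)

n <- 2  # how many purchases to keep per customer

# Sort by customer and date, then keep the first n rows of each group.
first_n <- purchases[order(customer, date), head(.SD, n), by = customer]
first_n
```

Customers with fewer than n purchases simply contribute fewer rows, so no NA padding is needed in this layout.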
