Finding aggregate information about a data set

data-mining r python data
2021-10-11 07:46:23

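I am new to data science. I have a data set of about 200,000 records with 5 columns. There is a field called class. For each class, there are one or more divisions. I have to do the following:

  1. Filter the data set so that only those classes that have at least 5 divisions appear.

  2. For each division, calculate attendance from another column.

  3. Each class has a minimum attendance. I have to find the percentage of divisions in each class with the minimum attendance.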

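I started by importing the data in Python using Pandas and began writing loops to process it, but I am sure this is not the right way to do it. Can you give me some ideas? Could I do this in an Excel pivot table?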

1 Answer

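It is a bit difficult to solve your problem without the data, but I'll give it a try based on how I think the data is encoded. I use R with the data.table package. You can read data into a data.table with fread().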

Step 1

 library(data.table)

 # Assume sample_data is a data.table with the following columns:
 #   class: the class
 #   division: the division
 #   num_people: the attendance for a match
 #
 # I assume the table is in long format, i.e. multiple rows exist per class,
 # covering one or more divisions per class.

 # Make the list of classes with at least 5 distinct divisions.
 classes_of_interest <- 
   sample_data[, 
               .(num_divisions = uniqueN(division)),
               by = class][num_divisions >= 5, class]
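
Since the question started out in pandas, the same filter can be sketched there without loops. This is a minimal sketch, not the method of the answer above; the sample data and the column names `class`, `division` and `num_people` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data in long format (column names are assumptions).
sample_data = pd.DataFrame({
    "class": ["A"] * 6 + ["B"] * 3,
    "division": ["a1", "a2", "a3", "a4", "a5", "a5", "b1", "b2", "b2"],
    "num_people": [10, 20, 30, 40, 50, 5, 15, 25, 35],
})

# Count the distinct divisions per class and broadcast that count to every row.
num_divisions = sample_data.groupby("class")["division"].transform("nunique")

# Keep only the rows whose class has at least 5 distinct divisions.
filtered = sample_data[num_divisions >= 5]
classes_of_interest = sorted(filtered["class"].unique())
print(classes_of_interest)
```

Here class A has five distinct divisions and class B only two, so only the rows of class A survive the filter.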

Step 2

 # Only consider the classes that have at least 5 divisions, and sum the
 # per-match attendance (num_people) for each division within each class.
 attandance_by_division <- 
   sample_data[class %in% classes_of_interest, 
               .(attendance = sum(num_people)),
               by = list(division, class)]
 setkey(attandance_by_division, "class")
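
A rough pandas counterpart of this aggregation, again only a sketch under assumptions: the data, the column name `num_people`, and the hard-coded `classes_of_interest` list are made up for illustration.

```python
import pandas as pd

# Hypothetical per-match records (column names are assumptions).
sample_data = pd.DataFrame({
    "class": ["A", "A", "A", "B", "B"],
    "division": ["a1", "a1", "a2", "b1", "b1"],
    "num_people": [10, 5, 20, 15, 25],
})
# Classes kept by the step 1 filter, hard-coded here for illustration.
classes_of_interest = ["A"]

# Restrict to the classes of interest, then sum attendance per (class, division).
attendance_by_division = (
    sample_data[sample_data["class"].isin(classes_of_interest)]
    .groupby(["class", "division"], as_index=False)["num_people"]
    .sum()
    .rename(columns={"num_people": "attendance"})
)
print(attendance_by_division)
```

The result has one row per division, with columns `class`, `division` and `attendance`.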

Step 3

 # Merge the data set with a data set that contains the required
 # attendance per class.
 # attendance_requirements is assumed to be a data.table in this format:
 #   class: the class
 #   minimum_attendance: the minimum attendance
 attendance_data <- 
   merge(attendance_requirements, 
         attandance_by_division, by = "class")

 # Here I exploit the fact that the TRUE/FALSE condition is converted
 # to 1 and 0, so I can sum and divide by the number of rows (.N) in each
 # subset created by aggregating on 'class'. The result is a fraction
 # between 0 and 1. I use >= so that a division exactly at the minimum
 # counts as meeting it.
 pct_of_division <- 
   attendance_data[, 
                   .(pct_with_min_attendance = (sum(attendance >= minimum_attendance)
                                                / .N)),
                   by = class]
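
The merge and the share of divisions meeting the minimum can be sketched in pandas as well. The two small tables below are invented for illustration, and the sketch assumes that a division whose attendance equals the minimum counts as meeting it.

```python
import pandas as pd

# Hypothetical attendance per division (as produced by step 2).
attendance_by_division = pd.DataFrame({
    "class": ["A", "A", "A", "A", "B", "B"],
    "division": ["a1", "a2", "a3", "a4", "b1", "b2"],
    "attendance": [15, 20, 8, 30, 40, 5],
})
# Hypothetical minimum attendance per class.
attendance_requirements = pd.DataFrame({
    "class": ["A", "B"],
    "minimum_attendance": [10, 10],
})

# Attach the class-level minimum to every division row.
attendance_data = attendance_by_division.merge(attendance_requirements, on="class")

# The comparison yields booleans; their mean is the fraction of True values.
meets_minimum = attendance_data["attendance"] >= attendance_data["minimum_attendance"]
pct_with_min_attendance = meets_minimum.groupby(attendance_data["class"]).mean() * 100
print(pct_with_min_attendance)
```

In this made-up data, three of the four divisions in class A and one of the two in class B reach the minimum, giving 75 and 50 percent.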

Hope this helps.