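It is hard to solve your problem without seeing the data, but I'll take a stab at it based on how I assume the data is encoded. I'm using R with the data.table package. You can read your data into a data.table with fread().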
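As a minimal sketch of the fread() step: the inline lines below stand in for a real file, and the column names (class, division, attendance) are my assumption about the layout, matching the format assumed in Step 1.

```r
library(data.table)

# Inline text stands in for a real file. The column names and values are
# assumptions for illustration, not the asker's actual data.
sample_data <- fread(text = c(
  "class,division,attendance",
  "A,1,50",
  "A,2,80",
  "B,1,300"
))

# fread() returns a data.table, so the steps below can use it directly.
str(sample_data)
```

With a file on disk, the call would instead take the file path as its first argument.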
Step 1
library(data.table)
# Assume sample_data has the following format:
# class: the class
# division: the division
# attendance: the attendance for a match
#
# I assume the table is in long format, i.e. there are multiple rows per
# class, and each class appears in one or more divisions.
# Make the list of classes that appear in at least 5 divisions.
classes_of_interest <-
  sample_data[,
              .(num_divisions = uniqueN(division)),
              by = class][num_divisions >= 5, class]
Step 2
# Only consider the classes that were in at least 5 divisions, and total
# the attendance per division within each class.
attandance_by_division <-
  sample_data[class %in% classes_of_interest,
              .(attendance = sum(attendance)),
              by = .(division, class)]
setkey(attandance_by_division, class)
Step 3
# Merge the data set with a data set that contains the required minimum
# attendance per class (attendance_requirements).
# The format is as follows:
# class: the class
# minimum_attendance: the minimum attendance
attendance_data <-
  merge(attendance_requirements,
        attandance_by_division, by = "class")
# Here I exploit the fact that the TRUE/FALSE condition is converted to
# 1 and 0, so its mean within each class is the fraction of divisions that
# exceed the minimum (the same as summing and dividing by .N, the group size).
# Use >= instead of > if exactly meeting the minimum should count.
pct_of_division <-
  attendance_data[,
                  .(pct_with_min_attendance = mean(attendance > minimum_attendance)),
                  by = class]
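To check the three steps end to end, here is a small self-contained run on made-up data. Everything in sample_data and attendance_requirements below is hypothetical (I don't have the real tables), so treat it as a sketch of the approach under the column layout assumed above rather than a result for your data.

```r
library(data.table)

# Hypothetical toy data: class "A" appears in 5 divisions, class "B" in 2.
sample_data <- data.table(
  class      = c(rep("A", 6), rep("B", 2)),
  division   = c(1, 1, 2, 3, 4, 5, 1, 2),
  attendance = c(50, 60, 80, 120, 90, 200, 300, 400)
)

# Hypothetical requirements table: one minimum attendance per class.
attendance_requirements <- data.table(
  class              = c("A", "B"),
  minimum_attendance = c(100, 100)
)

# Step 1: classes with at least 5 distinct divisions.
classes_of_interest <-
  sample_data[, .(num_divisions = uniqueN(division)), by = class][
    num_divisions >= 5, class]

# Step 2: total attendance per division and class, for those classes only.
attandance_by_division <-
  sample_data[class %in% classes_of_interest,
              .(attendance = sum(attendance)),
              by = .(division, class)]

# Step 3: attach the minimum and compute the fraction of divisions above it.
attendance_data <-
  merge(attendance_requirements, attandance_by_division, by = "class")

pct_of_division <-
  attendance_data[,
                  .(pct_with_min_attendance = mean(attendance > minimum_attendance)),
                  by = class]

# Class "A": division totals are 110, 80, 120, 90, 200, so 3 of 5 exceed 100.
print(pct_of_division)
```

Class "B" drops out in Step 1 because it only has two divisions, so the final table has a single row for class "A" with a value of 0.6.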
Hope this helps.