How to count the most frequent combinations of values

data-mining r clustering k-means
2022-01-25 17:38:45

I have the following CSV data:

shot_id,round_id,hole,shotType,clubType,desiredShape,lineDirection,shotQuality,note
48,2,1,tee,driver,straight,straight,good,
49,2,1,approach,iron,straight,right,bad,
50,2,1,approach,wedge,straight,straight,bad,
51,2,1,approach,wedge,straight,straight,bad,
52,2,1,putt,putter,straight,straight,good,
53,2,1,putt,putter,straight,straight,good,
54,2,2,tee,driver,draw,straight,good,
55,2,2,approach,iron,draw,straight,good,
56,2,2,putt,putter,straight,straight,good,
57,2,2,putt,putter,straight,straight,good,
58,2,3,tee,driver,draw,straight,good,
59,2,3,approach,iron,straight,right,good,
60,2,3,chip,wedge,straight,straight,good,
61,2,3,putt,putter,straight,straight,good,
62,2,4,tee,iron,straight,straight,good,
63,2,4,putt,putter,straight,straight,good,
64,2,4,putt,putter,straight,straight,good,
65,2,5,tee,driver,straight,left,good,
66,2,5,approach,wedge,straight,straight,good,
67,2,5,putt,putter,straight,straight,bad,
68,2,5,putt,putter,straight,straight,good,
69,2,6,tee,driver,draw,straight,bad,
70,2,6,approach,hybrid,draw,straight,good,
71,2,6,putt,putter,straight,straight,good,
72,2,6,putt,putter,straight,straight,good,
73,2,7,tee,driver,straight,straight,good,
74,2,7,approach,wood,fade,straight,good,
75,2,7,approach,wedge,straight,straight,bad,long
76,2,7,putt,putter,straight,straight,good,
77,2,7,putt,putter,straight,straight,good,
78,2,8,tee,iron,straight,right,bad,
79,2,8,approach,wedge,straight,straight,good,
80,2,8,putt,putter,straight,straight,bad,
81,2,9,tee,driver,straight,straight,good,
82,2,9,approach,iron,straight,straight,good,
83,2,9,approach,wedge,straight,straight,bad,
84,2,9,putt,putter,straight,straight,good,
85,2,9,putt,putter,straight,straight,good,
86,2,10,tee,driver,straight,left,good,
87,2,10,approach,iron,straight,left,good,
88,2,10,chip,wedge,straight,straight,good,
89,2,10,putt,putter,straight,straight,good,
90,2,10,putt,putter,straight,straight,good,
91,2,11,tee,driver,draw,straight,good,
92,2,11,approach,iron,draw,straight,good,
93,2,11,putt,putter,straight,straight,good,
94,2,11,putt,putter,straight,straight,good,
95,2,12,tee,iron,draw,straight,good,
96,2,12,putt,putter,straight,straight,good,
97,2,12,putt,putter,straight,straight,good,
98,2,13,tee,driver,draw,straight,good,
99,2,13,approach,wood,straight,straight,bad,topped
100,2,13,putt,putter,straight,straight,good,
101,2,13,putt,putter,straight,straight,good,
102,2,14,tee,driver,draw,straight,good,
103,2,14,approach,wood,straight,straight,bad,
104,2,14,approach,iron,draw,straight,good,
105,2,14,approach,wedge,straight,straight,bad,
106,2,14,putt,putter,straight,straight,bad,
107,2,14,putt,putter,straight,straight,good,
108,2,15,tee,iron,draw,right,bad,
109,2,15,approach,wedge,straight,straight,good,
110,2,15,putt,putter,straight,straight,good,
111,2,15,putt,putter,straight,straight,good,
112,2,16,tee,driver,draw,right,good,
113,2,16,approach,iron,straight,left,bad,
114,2,16,approach,wedge,straight,left,bad,
115,2,16,putt,putter,straight,straight,good,
116,2,17,tee,driver,straight,straight,good,
117,2,17,approach,wood,straight,right,bad,
118,2,17,approach,wedge,straight,straight,good,
119,2,17,putt,putter,straight,straight,good,
120,2,17,putt,putter,straight,straight,good,
121,2,18,tee,driver,fade,right,bad,
122,2,18,approach,wedge,straight,straight,good,
123,2,18,approach,wedge,straight,straight,good,
124,2,18,putt,putter,straight,straight,good,
125,2,18,putt,putter,straight,straight,good,

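I would like to be able to determine which combinations of values occur most often.

  • Club type: driver, wood, iron, wedge, putter
  • Shot type: tee, approach, chip, putt
  • Line direction: left, center, right
  • Shot quality: good, bad, neutral

Ideally, I would be able to identify a sweet spot (no pun intended) combination: “driver”+“tee”+“straight”+“good”

I intend to measure this only against a static dataset, not against any future values or predictions. So my thinking is that this is probably a clustering/k-means problem. Is that right?

If so, how would I get started with a k-means analysis on these kinds of values in R?

If this is not a k-means problem, then what is it?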

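1 Answer

If I understand your question, you want to know which combination is most common, or how frequent each combination is relative to the others. This is a static approach that counts every unique combination in the data (i.e. combinations across all five columns).

The plyr package has a nice utility, ddply(), for grouping a data.frame by unique combinations of columns. We specify the names of the columns to group by and then the function to run for each combination. In this case, we specify the columns that describe your golf shots and use the function nrow to count the rows in each subset of the larger data.frame where those columns are identical.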

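# You need this library for the ddply() function
library(plyr)

# Assumption: the CSV above has been saved to a file (the file name here is
# hypothetical) and is loaded into a data.frame called golf
golf <- read.csv("golf.csv", stringsAsFactors = FALSE)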

# These are the columns that determine a unique situation (change this if you need)
qualities <- c("shotType","clubType","desiredShape","lineDirection","shotQuality")

# The call to ddply() actually gives us what we want, which is the number 
# of times that combination is present in the dataset
countedCombos <- ddply(golf,qualities,nrow)

# To be nice, let's give that newly added column a meaningful name
names(countedCombos) <- c(qualities,"count")

# Finally, you probably want to order it (decreasing, in this case)
countedCombos <- countedCombos[with(countedCombos, order(-count)),]

Now take a look at the result. The last column holds the count for each unique combination of the columns you supplied to ddply:

head(countedCombos)
   shotType clubType desiredShape lineDirection shotQuality count
16     putt   putter     straight      straight        good    30
10 approach    wedge     straight      straight        good     6
9  approach    wedge     straight      straight         bad     5
19      tee   driver         draw      straight        good     5
22      tee   driver     straight      straight        good     4
2  approach     iron         draw      straight        good     3
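
The same counts can probably be obtained without plyr. The following is a minimal sketch using base R's aggregate(), not the method used above; it also adds each combination's share of all shots, since the goal includes comparing combinations with one another. To keep it short it reads only the first ten rows of the question's CSV inline, and the names baseCombos and share are made up for illustration.

```r
# First ten rows copied from the question's CSV, just to keep the sketch short
golf <- read.csv(text = "shot_id,round_id,hole,shotType,clubType,desiredShape,lineDirection,shotQuality,note
48,2,1,tee,driver,straight,straight,good,
49,2,1,approach,iron,straight,right,bad,
50,2,1,approach,wedge,straight,straight,bad,
51,2,1,approach,wedge,straight,straight,bad,
52,2,1,putt,putter,straight,straight,good,
53,2,1,putt,putter,straight,straight,good,
54,2,2,tee,driver,draw,straight,good,
55,2,2,approach,iron,draw,straight,good,
56,2,2,putt,putter,straight,straight,good,
57,2,2,putt,putter,straight,straight,good,", stringsAsFactors = FALSE)

qualities <- c("shotType", "clubType", "desiredShape", "lineDirection", "shotQuality")

# Add a column of ones and sum it within each unique combination
golf$count <- 1
baseCombos <- aggregate(golf["count"], by = golf[qualities], FUN = sum)

# Relative frequency of each combination among all shots
baseCombos$share <- baseCombos$count / sum(baseCombos$count)

# Most frequent combinations first
baseCombos <- baseCombos[order(-baseCombos$count), ]
print(baseCombos)
```

With the full dataset loaded into golf instead of the ten-row sample, the count column should match the ddply() result.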

To see the results for a particular cross-section (for example, the driver clubType):

countedCombos[which(countedCombos$clubType=="driver"),]
   shotType clubType desiredShape lineDirection shotQuality count
19      tee   driver         draw      straight        good     5
22      tee   driver     straight      straight        good     4
21      tee   driver     straight          left        good     2
17      tee   driver         draw         right        good     1
18      tee   driver         draw      straight         bad     1
20      tee   driver         fade         right         bad     1
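
If the aim is a "sweet spot" for every club rather than just the driver, one possible shortcut is to keep only the first row for each clubType once the table is sorted by descending count. This is a sketch under that assumption; the data.frame below is rebuilt by hand from the six rows printed by head(countedCombos) above, and topPerClub is a name invented for illustration.

```r
# Rows copied from the head(countedCombos) output shown earlier
countedCombos <- data.frame(
  shotType      = c("putt", "approach", "approach", "tee", "tee", "approach"),
  clubType      = c("putter", "wedge", "wedge", "driver", "driver", "iron"),
  desiredShape  = c("straight", "straight", "straight", "draw", "straight", "draw"),
  lineDirection = rep("straight", 6),
  shotQuality   = c("good", "good", "bad", "good", "good", "good"),
  count         = c(30, 6, 5, 5, 4, 3),
  stringsAsFactors = FALSE
)

# Make sure the rows are sorted by descending count
countedCombos <- countedCombos[order(-countedCombos$count), ]

# duplicated() flags every repeat of a clubType after its first appearance,
# so negating it keeps the highest-count row for each club
topPerClub <- countedCombos[!duplicated(countedCombos$clubType), ]
print(topPerClub)
```

Ties are broken by whichever row happens to come first after sorting, so this is only a quick look, not a ranking.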

As a bonus, you can use ddply again to dig deeper into these results. For example, if you want to see the ratio of "good" to "bad" shotQuality by shotType and clubType:

shotPerformance <- ddply(countedCombos, c("shotType","clubType"),
    function(x){
        # x holds the countedCombos rows for one shotType/clubType pair
        total <- length(x$shotQuality)
        good <- length(which(x$shotQuality=="good"))
        bad <- length(which(x$shotQuality=="bad"))
        c(total, good, bad, good/(good+bad))
    }
)
# ddply() puts the grouping columns first, followed by the four computed values
names(shotPerformance) <- c("shotType","clubType","total","good","bad","goodPct")

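This gives you a new breakdown with some math performed on counts of a character field (shotQuality), and shows you how to write a custom function for ddply. Of course, you can still order the result any way you like.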

head(shotPerformance)
  shotType clubType total good bad   goodPct
1 approach   hybrid     1    1   0 1.0000000
2 approach     iron     6    4   2 0.6666667
3 approach    wedge     3    1   2 0.3333333
4 approach     wood     3    1   2 0.3333333
5     chip    wedge     1    1   0 1.0000000
6     putt   putter     2    1   1 0.5000000
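
Note that this breakdown is computed on countedCombos, so total, good and bad count distinct combinations rather than individual shots: the putt/putter row reports a goodPct of 0.5 even though the good putt combination alone accounts for 30 shots. If a shot-level percentage is what you are after, one option is to weight by the count column. The sketch below uses base R rather than ddply and is rebuilt from the six rows printed by head(countedCombos) earlier; goodShots, badShots and shotLevel are names invented for illustration.

```r
# Rows copied from the head(countedCombos) output shown earlier
countedCombos <- data.frame(
  shotType      = c("putt", "approach", "approach", "tee", "tee", "approach"),
  clubType      = c("putter", "wedge", "wedge", "driver", "driver", "iron"),
  desiredShape  = c("straight", "straight", "straight", "draw", "straight", "draw"),
  lineDirection = rep("straight", 6),
  shotQuality   = c("good", "good", "bad", "good", "good", "good"),
  count         = c(30, 6, 5, 5, 4, 3),
  stringsAsFactors = FALSE
)

# Split each combination's count into good shots and bad shots
countedCombos$goodShots <- ifelse(countedCombos$shotQuality == "good", countedCombos$count, 0)
countedCombos$badShots  <- ifelse(countedCombos$shotQuality == "bad",  countedCombos$count, 0)

# Sum the shot counts within each shotType/clubType pair
shotLevel <- aggregate(cbind(goodShots, badShots) ~ shotType + clubType,
                       data = countedCombos, FUN = sum)

# Share of good shots among all good and bad shots for that pair
shotLevel$total   <- shotLevel$goodShots + shotLevel$badShots
shotLevel$goodPct <- shotLevel$goodShots / shotLevel$total
print(shotLevel)
```

Because only six combinations are included here, the numbers are illustrative; running the same steps on the full countedCombos table gives the percentages for the whole round.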