机器算法验证 - 如何从聚合频率表中找到平均年龄和中位年龄 - 吾爱随笔录

如何从聚合频率表中找到平均年龄和中位年龄

机器算法验证擅长意思是中位数分箱年龄

2022-03-04 12:10:16

我正在使用 excel，我试图找到平均年龄和中位年龄。我有两列。1代表类别，另一个代表每个类别的人数。

Under 5       6,360
5-9           6,360
10-14        10,986
15-17         5,204
18-24         7,886
25-34         9,463
35-44        17,349
45-54        18,926
55-64        13,406
65-74         6,309
75 and over   5,520
      total 107,769

2个回答

一般评论：如果所有箱的并集是一个有限区间，您可以在某些假设下计算平均值，或者您可以在没有任何假设的情况下获得界限。一个常见的（尽管通常是站不住脚的）假设是垃圾箱内的一致性。如果箱子不宽，它仍然是一个有用的近似值。

使用开放的高端箱（75+），如果没有一些强有力的假设，您将无法计算平均值。探索平均估计对这些假设的敏感性将是有用的。

通常你可以计算一个中值category，这很简单，所以让我们从那里开始。

中位数：年龄中位数是“中间人”的年龄（如果有偶数，则中间两人之间的任何值 - 使用分箱数据，您希望这两个人在同一个分箱中；幸运的是，分箱边界很少见在它们之间，在这种情况下，任何一个 bin 都可以被视为中值 bin；在这种情况下，您可以选择边界本身作为中值）。

对于 107769 人，(107769 + 1)/2 = 53885-th 最老的人的年龄是中位年龄。

      Agegroup Count cumsum(age$Count)
1      Under 5  6360              6360
2      5-9      6360             12720
3      10-14   10986             23706
4      15-17    5204             28910
5      18-24    7886             36796
6      25-34    9463             46259
7      35-44   17349             63608
8      45-54   18926             82534
9      55-64   13406             95940
10     65-74    6309            102249
11 75 and over  5520            107769

34岁及以下的有46259人，44岁以下的有63608人，因此中位年龄组为35-44岁。

您可以通过做出一些假设来进一步估计其中的年份 - 例如，如果您假设箱内的年龄分布均匀，则中位年龄将为 (53885-46259)/17349 = 43.96%该年龄组的年龄范围，这表明平均年龄约为 39.4 岁。但是，您需要评估该假设的合理性。接近模式，看起来（并且可能是）相当平滑的分布，粗略近似的假设可能不是那么糟糕]

有些书给出了计算中位数估计值的公式，这相当于我刚才所做的，例如这样的公式：中位数 =（其中是包含中位数的 bin 的下限，是该 bin 的宽度，是总人口，是直到（结束前一个 bin），而是中值 bin 中的计数（频率），它的作用几乎相同（除了 (n+1)/2 与 n/2 之外，它是相同的）。 $L + w\frac{(\frac{n}{2} − c)}{f}$ $L$ $w$ $n$ $c$ $L$ $f$

平均值：平均值通常通过将数据视为发生在 bin 中心来计算。对于均值，这相当于假设数据在每个 bin 中均匀分布。

显然，这对没有上限的最后一个类别提出了问题。即使你强加了一个（“好吧，假设没有人活过 120”），中点仍然是对组内平均值的可怕估计。你可以做一些事情，比如假设分布与某些人口相似，并从生命表中获取估计值（许多国家提供生命表，这允许计算每个年龄的人活着的比例）。

你也可以简单地假设一些平均值（比如 80 或 85），然后看看它有多大的不同。来自一个西方国家（平均寿命比美国更长）的 9 岁（左右）数字表明 75 岁以上男性的平均年龄为 82.2 - 如果您无法获得合适的数字，我认为假设 82 和尝试 80 和 85 来了解对假设的敏感性是合理的。

（比这里描述的更复杂的假设是可能的，但不常用）

我意识到这是一篇旧帖子，但由于它是在我寻求找到一个 VBA 脚本来计算一系列值（箱中的值）中的中位数的过程中出现的，就像原始海报所要求的那样，我想我会发布我的解决方案与他人分享。

我是 VB 的初学者，所以可能有更简单的方法来做我所做的一些事情。我已经包含注释来记录我的脚本的每个步骤正在做什么。

您需要在电子表格/值范围中至少有 2 列。第一列必须是范围的开头（即，如果它是用于前两个范围的收入值

$0 to $9,9999 $10,000 to $14,9999

那么第一个 colmun 的前 2 行应该是 0 和 10000）。附加屏幕截图中的示例（突出显示的部分将是我的输入）。

该脚本提示输入值的范围。您可以有多个要获取中位数的总体列。它还将提示您要输入输出中值的单元格。

这是脚本

Sub GetMedian()
'have user select the range for the input data
'the first column must be the number for the beginning of the range
'for the bin
'the second column (can have more than one if doing mulitple areas)
'has the population for that bin
Dim UserRange As Range

Prompt = "Select the input range." & vbCrLf & _
     vbCrLf & "The first column must be the beginning of the " & _
     vbCrLf & "range for the bin. The second column has the  " & _
     vbCrLf & "population for the bin. You can have more than " & _
     vbCrLf & "one column of populations." & _
     vbCrLf
Title = "Select a range"

'Display the Input Box

Set UserRange = Application.InputBox( _
    Prompt:=Prompt, _
    Title:=Title, _
    Default:=Selection.Address, _
    Type:=8) 'Range selection

'get current selected range
Dim myString As String
'myString = Selection.Address
myString = UserRange.Address
'go though the columns

'how many columns are in the UserRange?
NoOfCol = Range(myString).Columns.count

'get the output range from the user
Dim OutRange As Range

Prompt = "Select a cell for the output" & vbCrLf & _
vbCrLf & "Values for multiple medians will be " & _
vbCrLf & "returned in a row beginning with the " & _
vbCrLf & "selected cell " & _
vbCrLf
'   Display the Input Box
On Error Resume Next
Set OutRange = Application.InputBox( _
    Prompt:=Prompt, _
    Title:=Title, _
    Default:=ActiveCell.Address, _
    Type:=8) 'Range selection

'now get to work

'for loop starts with the first population column (2nd in range)
'end is how many pop columns are in the user input range

For col = 2 To NoOfCol

    'set range as an array
    Dim myArray() As Variant
    myArray = Range(myString).Value

    'get sum of pop and divide by 2 to get halfpoint
    'initialize a total to aggregate values
    popSum = 0
    For i = 1 To UBound(myArray)
        popSum = popSum + myArray(i, col)
    Next
    'MsgBox popSum
    MedianIndex = popSum / 2
    'MsgBox MedianIndex
    'initialize for running total
    runtotal = 0

    'step through each row
    For i = 1 To UBound(myArray)

        'add the cumulative total pop
        runtotal = runtotal + myArray(i, col)
        'check if the runtotal exceeds the medianIndex
        If runtotal > MedianIndex Then
            'get the cumulative pop for the bin just before we exceeded
            'so subtract the current array value from the current runTotal
            prevTotal = runtotal - myArray(i, col)
            'determine how much into the bin it is to reach the medianIndex
            howMany = MedianIndex - prevTotal
            'get the pop value in the previous bin
            Binpop = myArray(i, col)
            'determine the pct of how far into this bin the medianIndex falls
            pctInto = howMany / Binpop
            'determine the span of the bin
            'and multiply it by the pct into the bin
            MultiSpan = (myArray(i, 1) - myArray(i - 1, 1)) * pctInto

            'calculate the median by adding the result to
            'the number for the beginning of hte range
            Median = myArray(i, 1) + MultiSpan

            'put the median in the output cell
            OutRange.Offset(0, col - 2).Value = Median

            Exit For
        End If
    Next

Next

End Sub

其它你可能感兴趣的问题

上一篇K-medoids和PAM之间的区别下一篇关于负 AIC 值