之前的两个答案都采用了一种有趣的方法,但都没有真正解决我的问题;如何以统计严格的方式比较给定术语频率随时间的变化,并与同一时期的其他术语进行比较,该方法考虑了术语频率随集合大小的幂律分布。
下面的答案是我迄今为止在这个问题上的工作。无论如何,这不是一个完整的答案,但是我将其发布在这里以征求反馈和改进建议。
以齐夫定律的定义为例,
f(r)∼zmaxr−α
here, r is the rank of a given term, zmax is the frequency of the most frequent term, α is Zipf's exponent, and f(r) is the frequency of the r-th ranked term.
Taking the log of both sides, you get a linear relationship between log(f) and log(r), with gradient −α and intercept zmax
log(f)∼log(zmax)−αlog(r)

We want to find the expected frequency of a single term over a given period. First, split the period into equal sized windows. For the first window, i, we know the frequency of our term fi, and the frequency of the largest term zmax,i and α. We can use these to find the rank of our term,
r∼(fi(r)zmax,i)−α
This rank can then be used to calculate the 'expected' frequency for all subsequent time windows, which I'll indicate with f^,
f^i+1(r)∼zmax,i+1rα(i+1)
The ratio f^f represents the normalised frequency.
Some observations:
- For a stationary underlying distribution, this approach identifies minor deviations from the expected frequency of a term. A moving window could be used if the underlying text distribution changes dramatically, or exhibits seasonal behaviour.
- The initial choice of r could be calculated for the entire period, rather than the first window.
- α could be calculated from the raw term and total frequency counts using heap's law (see Lu et al. 2010 for a derivation of Heap's law from Zipf's).
- This approach assumes a linear relationship between f and r in log space.