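The script below uses the lsa package and a modified IDF weighting to cut the parameters out of the templates. The idea is that every term whose IDF is above some threshold is treated as a parameter, and its weight is reset to zero. The threshold can be approximated by the average number of times a template occurs in the corpus. With the parameters removed, records of the same template have a distance of zero.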
library(tm)
library(lsa)
## toy data: three templates (A, B, C), three records each
df <- data.frame(
  TEMPLATE = c(rep("A", 3), rep("B", 3), rep("C", 3)),
  TEXT = c(
    paste("Temperature today is", c(28, 24, 20), "centigrades"),
    paste("Temperature today is", c(82, 75, 68), "Fahrenheit"),
    ## paste() already inserts a space, so no trailing space after "are"
    paste("Her eyes are", c("blue", "black", "green"), "and hair", c("grey", "brown", "white"))),
  stringsAsFactors = FALSE)
> df
TEMPLATE TEXT
1 A Temperature today is 28 centigrades
2 A Temperature today is 24 centigrades
3 A Temperature today is 20 centigrades
4 B Temperature today is 82 Fahrenheit
5 B Temperature today is 75 Fahrenheit
6 B Temperature today is 68 Fahrenheit
7 C Her eyes are blue and hair grey
8 C Her eyes are black and hair brown
9 C Her eyes are green and hair white
corpus <- Corpus(VectorSource(df$TEXT))
td <- as.matrix(TermDocumentMatrix(corpus,control=list(wordLengths = c(1, Inf)) ))
> td
             Docs
Terms         1 2 3 4 5 6 7 8 9
  20          0 0 1 0 0 0 0 0 0
  24          0 1 0 0 0 0 0 0 0
  28          1 0 0 0 0 0 0 0 0
  68          0 0 0 0 0 1 0 0 0
  75          0 0 0 0 1 0 0 0 0
  82          0 0 0 1 0 0 0 0 0
  and         0 0 0 0 0 0 1 1 1
  are         0 0 0 0 0 0 1 1 1
  black       0 0 0 0 0 0 0 1 0
  blue        0 0 0 0 0 0 1 0 0
  brown       0 0 0 0 0 0 0 1 0
  centigrades 1 1 1 0 0 0 0 0 0
  eyes        0 0 0 0 0 0 1 1 1
  fahrenheit  0 0 0 1 1 1 0 0 0
  green       0 0 0 0 0 0 0 0 1
  grey        0 0 0 0 0 0 1 0 0
  hair        0 0 0 0 0 0 1 1 1
  her         0 0 0 0 0 0 1 1 1
  is          1 1 1 1 1 1 0 0 0
  temperature 1 1 1 1 1 1 0 0 0
  today       1 1 1 1 1 1 0 0 0
  white       0 0 0 0 0 0 0 0 1
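Before the weighting step, it may help to see roughly what IDF values the three kinds of terms in this matrix get. The following is only a base-R sketch: it assumes gw_idf() computes log2(N / df) + 1 (which is consistent with the distances printed further down, but check ?gw_idf for your lsa version), and the names parameter, template_word and shared_word are illustrative labels, not part of the original script.

```r
# Hand computation of IDF for the toy corpus (9 records).
# Assumption: gw_idf() uses log2(N / df) + 1.
n_docs <- 9

# Document frequencies seen in the term-document matrix:
#   parameter     - e.g. "28" or "blue", appears in 1 record
#   template_word - e.g. "centigrades" or "hair", appears in 3 records
#   shared_word   - e.g. "temperature", appears in 6 records
doc_freq <- c(parameter = 1, template_word = 3, shared_word = 6)

idf <- log2(n_docs / doc_freq) + 1
print(round(idf, 3))

# Terms whose IDF exceeds the cutoff would be zeroed out as parameters.
template_freq <- 3
print(idf > template_freq)
```

Under that assumption only the terms that occur in a single record land above the cutoff of 3, which is why the parameter values drop out while the fixed template words keep their weight.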
## suppress terms with idf higher than template frequency
## those terms are considered as parameters
template_freq <- 3
idf <- gw_idf(td)   # compute the global weights once
tdw <- lw_bintf(td) * ifelse(idf > template_freq, 0, idf)
## named d so that the result does not shadow stats::dist()
d <- dist(t(tdw))
> d
         1        2        3        4        5        6        7        8
2 0.000000
3 0.000000 0.000000
4 3.655689 3.655689 3.655689
5 3.655689 3.655689 3.655689 0.000000
6 3.655689 3.655689 3.655689 0.000000 0.000000
7 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341
8 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000
9 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000 0.000000
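The distance matrix clearly shows that records 1, 2 and 3 come from the same template (the distance is exactly 0 because the data is synthetic; in real cases a small threshold should be used instead). The same holds for records 4, 5, 6 and 7, 8, 9.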
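One way to apply such a small threshold is to cut a single-linkage clustering of the distance matrix at that height. The sketch below is self-contained base R and does not use tm or lsa: tokenisation is a plain whitespace split, the IDF is assumed to have the form log2(N / df) + 1, and eps is a hypothetical tolerance chosen for illustration, not a value from the original script.

```r
# Self-contained base-R sketch: group records whose distance is below eps.
texts <- c(
  paste("Temperature today is", c(28, 24, 20), "centigrades"),
  paste("Temperature today is", c(82, 75, 68), "Fahrenheit"),
  paste("Her eyes are", c("blue", "black", "green"),
        "and hair", c("grey", "brown", "white"))
)

# Simple tokenisation: lower-case, split on whitespace.
tokens <- strsplit(tolower(texts), "\\s+")
terms  <- sort(unique(unlist(tokens)))

# Binary term-document matrix: rows are terms, columns are records.
td <- sapply(tokens, function(tk) as.numeric(terms %in% tk))
rownames(td) <- terms

# IDF in the assumed form log2(N / df) + 1; zero out terms above the cutoff.
idf        <- log2(ncol(td) / rowSums(td)) + 1
idf_cutoff <- 3                      # same value as template_freq above
tdw        <- td * ifelse(idf > idf_cutoff, 0, idf)

# Hypothetical tolerance: records closer than eps are put in the same group.
eps    <- 0.1
groups <- cutree(hclust(dist(t(tdw)), method = "single"), h = eps)
print(groups)
```

On the toy data this should put records 1 to 3, 4 to 6 and 7 to 9 into three separate groups. On real logs eps would need tuning, since leftover noise terms make same-template distances small rather than exactly zero.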