Identifying templates with parameters in text fragments

data-mining machine-learning r nlp

2021-09-17 14:47:03

I have a data set of text fragments that follow a fixed structure and may contain parameters. Examples are:

 Temperature today is 20 centigrades
 Temperature today is 28 centigrades

or

 Her eyes are blue and hair black.
 Her eyes are green and hair brown.

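The first example shows a template with one numeric parameter. The second is a template with two categorical (factor) parameters.

Neither the number of templates nor the number of parameters is known in advance.

The problem is to identify the templates and to assign each text fragment to its corresponding template.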

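The obvious first idea is clustering. The distance measure is defined as the number of non-matching words: the two records in example one are at distance 1, the two records in example two are at distance 2, and a record from example one is at distance 7 from a record in example two. This approach works fine provided the number of clusters is known, but it is not, so the approach is of no use here.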
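The distance described above can be sketched in base R as follows. This is only one possible reading of "number of non-matching words": words are compared position by position, and positions missing from the shorter fragment count as mismatches. The helper name mismatch_distance is hypothetical.

```r
# Word-level mismatch distance between two text fragments (illustrative helper).
# Words are compared position by position; positions that exist in only one
# fragment count as mismatches.
mismatch_distance <- function(a, b) {
  wa <- strsplit(trimws(a), "\\s+")[[1]]
  wb <- strsplit(trimws(b), "\\s+")[[1]]
  n <- max(length(wa), length(wb))
  length(wa) <- n  # pad the shorter vector with NA
  length(wb) <- n
  sum(is.na(wa) | is.na(wb) | wa != wb)
}

texts <- c(
  "Temperature today is 20 centigrades",
  "Temperature today is 28 centigrades",
  "Her eyes are blue and hair black.",
  "Her eyes are green and hair brown."
)

# Full pairwise distance matrix.
n <- length(texts)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mismatch_distance(texts[i], texts[j])
  }
}
print(d)
```

On these four fragments the matrix reproduces the distances quoted in the question: 1 within the first example, 2 within the second, and 7 across the two.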

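I can imagine a programmatic approach that scans the distance matrix, searching for records with many neighbours at distance 1 (or 2, 3, ...), but I am curious whether some unsupervised machine learning algorithm could be applied to the problem. R is preferred, but not required.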
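A minimal sketch of that scanning idea, in base R. The 4 x 4 matrix below is filled in by hand from the distances quoted in the question (1, 2 and 7); the helper names count_neighbours and neighbourhood_key are hypothetical, and treating "same neighbourhood" as "same template" is an assumption made for illustration.

```r
# Pairwise distances between four records, taken from the figures in the question:
# records 1-2 share one template (distance 1), records 3-4 another (distance 2),
# and records from different templates are 7 apart.
d <- matrix(7, nrow = 4, ncol = 4)
d[1, 2] <- d[2, 1] <- 1
d[3, 4] <- d[4, 3] <- 2
diag(d) <- 0

# For a radius r, count how many other records lie within distance r.
# Subtracting 1 drops the record itself (distance 0).
count_neighbours <- function(d, r) rowSums(d <= r) - 1

# Records that see exactly the same neighbourhood at radius r are treated
# as candidates for one template (an illustrative grouping rule).
neighbourhood_key <- function(d, r) {
  apply(d <= r, 1, function(row) paste(which(row), collapse = ","))
}

for (r in 1:3) {
  cat("radius", r, ": neighbours =", count_neighbours(d, r), "\n")
}

key <- neighbourhood_key(d, 2)
candidate_template <- match(key, unique(key))
print(candidate_template)
```

With radius 2 the two pairs of records fall into two candidate groups without the number of groups being supplied, but the radius itself still has to be chosen.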

3 Answers

The basic rationale behind the following suggestion is to associate "eigenvectors" with "templates".

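In particular, one could run LSA on the whole corpus, based on a bag-of-words representation. The resulting eigenvectors would serve as surrogate templates; they should not be directly affected by the number of words in each template. The scores could then be used to cluster the documents following a standard procedure (e.g. $k$-means in conjunction with AIC). As an alternative to LSA, one could use NNMF. Let me point out that LSA (or NNMF) would probably need to be applied to the TF-IDF-transformed matrix rather than to the raw word-count matrix.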
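A rough base-R sketch of this pipeline on a toy corpus of nine fragments (three per template). Several things here are assumptions rather than part of the answer: the IDF form log2(N / df) + 1, the choice of k = 3 latent dimensions, and the number of clusters, which is fixed by hand instead of being selected with AIC. Whether three dimensions are enough to separate the templates depends on the weighting and on the corpus, because rare parameter words receive a high IDF.

```r
# Toy corpus: three templates with three fragments each.
texts <- c(
  paste("Temperature today is", c(28, 24, 20), "centigrades"),
  paste("Temperature today is", c(82, 75, 68), "Fahrenheit"),
  paste("Her eyes are", c("blue", "black", "green"),
        "and hair", c("grey", "brown", "white"))
)

# Bag-of-words term-document matrix (terms in rows, documents in columns).
tokens <- strsplit(tolower(texts), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
tf <- vapply(tokens,
             function(tok) as.numeric(table(factor(tok, levels = vocab))),
             numeric(length(vocab)))
rownames(tf) <- vocab

# TF-IDF weighting; this particular IDF form is an assumption.
idf   <- log2(ncol(tf) / rowSums(tf > 0)) + 1
tfidf <- tf * idf  # row i of tf is scaled by idf[i]

# LSA via SVD, truncated to k dimensions; each document gets k scores.
k <- 3  # assumed number of latent dimensions
s <- svd(tfidf)
scores <- s$v[, 1:k] %*% diag(s$d[1:k])

# k-means on the document scores; the cluster count is fixed by hand here.
set.seed(1)
clusters <- kmeans(scores, centers = 3, nstart = 25)$cluster
print(clusters)
```

Selecting the number of clusters (for example by comparing an information criterion across several values) is deliberately left out of the sketch.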

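You could look into using word2vec to identify phrases in the corpus. The presence of a phrase (as opposed to single tokens) would likely indicate a "template".

From there, the tokens most similar to your template phrase are likely to be the values of your parameters.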

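The script below uses the lsa package and a transformed IDF to cut the parameters off from the templates. The idea is that every term whose IDF exceeds some threshold is considered a parameter, and its frequency is reset to zero. The threshold can be approximated by the average number of occurrences of a template in the corpus. Once the parameters are removed, records of the same template are at distance zero from each other.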

 library(tm)
 library(lsa)
 df <- data.frame(
   TEMPLATE = c(rep("A", 3), rep("B", 3), rep("C", 3)),
   TEXT = c(
     paste("Temperature today is", c(28, 24, 20), "centigrades"),
     paste("Temperature today is", c(82, 75, 68), "Fahrenheit"),
     paste("Her eyes are ", c("blue", "black", "green"), "and hair", c("grey", "brown", "white"))
   ),
   stringsAsFactors = FALSE
 )
> df     
   TEMPLATE                                TEXT
 1        A Temperature today is 28 centigrades
 2        A Temperature today is 24 centigrades
 3        A Temperature today is 20 centigrades
 4        B  Temperature today is 82 Fahrenheit
 5        B  Temperature today is 75 Fahrenheit
 6        B  Temperature today is 68 Fahrenheit
 7        C    Her eyes are  blue and hair grey
 8        C  Her eyes are  black and hair brown
 9        C  Her eyes are  green and hair white

 corpus <- Corpus(VectorSource(df$TEXT))
 td <- as.matrix(TermDocumentMatrix(corpus,control=list(wordLengths = c(1, Inf)) ))

 > td
              Docs
  Terms         1 2 3 4 5 6 7 8 9
   20          0 0 1 0 0 0 0 0 0
   24          0 1 0 0 0 0 0 0 0
   28          1 0 0 0 0 0 0 0 0
   68          0 0 0 0 0 1 0 0 0
   75          0 0 0 0 1 0 0 0 0
   82          0 0 0 1 0 0 0 0 0
   and         0 0 0 0 0 0 1 1 1
   are         0 0 0 0 0 0 1 1 1
   black       0 0 0 0 0 0 0 1 0
   blue        0 0 0 0 0 0 1 0 0
   brown       0 0 0 0 0 0 0 1 0
   centigrades 1 1 1 0 0 0 0 0 0
   eyes        0 0 0 0 0 0 1 1 1
   fahrenheit  0 0 0 1 1 1 0 0 0
   green       0 0 0 0 0 0 0 0 1
   grey        0 0 0 0 0 0 1 0 0
   hair        0 0 0 0 0 0 1 1 1
   her         0 0 0 0 0 0 1 1 1
   is          1 1 1 1 1 1 0 0 0
   temperature 1 1 1 1 1 1 0 0 0
   today       1 1 1 1 1 1 0 0 0
   white       0 0 0 0 0 0 0 0 1

 ## Suppress terms whose IDF is higher than the template frequency;
 ## those terms are considered to be parameters.
 template_freq <- 3
 idf <- gw_idf(td)  # global IDF weight per term, computed once
 tdw <- lw_bintf(td) * ifelse(idf > template_freq, 0, idf)
 dist <- dist(t(as.matrix(tdw)))

 > dist
          1        2        3        4        5        6        7        8
 2 0.000000                                                               
 3 0.000000 0.000000                                                      
 4 3.655689 3.655689 3.655689                                             
 5 3.655689 3.655689 3.655689 0.000000                                    
 6 3.655689 3.655689 3.655689 0.000000 0.000000                           
 7 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341                  
 8 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000         
 9 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000 0.000000

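The distance matrix clearly shows that records 1, 2 and 3 come from the same template (distance = 0 on this synthetic data; on real data some small threshold should be used instead). The same holds for records 4, 5, 6 and for records 7, 8, 9.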
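One way to turn such a distance matrix into template assignments without fixing the number of templates is to cut a single-linkage tree at a small height. The sketch below uses only base R; the matrix is rebuilt by hand from the three distinct values printed above, and the tolerance eps = 0.5 is a hypothetical choice, not something the script prescribes.

```r
# Template labels of the nine records, used only to lay out the toy distances.
grp <- rep(c("A", "B", "C"), each = 3)

# Rebuild the printed distance matrix: 0 within a template,
# 3.655689 between A and B, 6.901341 between C and the other two.
d <- outer(grp, grp, function(x, y) {
  ifelse(x == y, 0, ifelse(x == "C" | y == "C", 6.901341, 3.655689))
})

# Single linkage merges records that are chained together by small distances;
# cutting the tree at height eps yields groups without specifying their number.
eps <- 0.5  # hypothetical tolerance; must be tuned on real data
tree <- hclust(as.dist(d), method = "single")
template_id <- cutree(tree, h = eps)
print(template_id)
```

Each record receives an integer group label, and records 1 to 3, 4 to 6 and 7 to 9 end up in three separate groups.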
