Is this interpretation of sparsity accurate?

Cross Validated: r, text-mining, natural-language

2022-03-08 16:02:27

According to the documentation of the removeSparseTerms function from the tm package, this is what sparsity entails:

A term-document matrix where those terms from x are removed which have at least a sparse percentage of empty (i.e., terms occurring 0 times in a document) elements. I.e., the resulting matrix contains only terms with a sparse factor of less than sparse.

So, if sparse is equal to 0.99, is it correct to say that we are removing terms that appear in at most 1% of the data?

1 Answer

Yes, although your confusion here is understandable, since the term "sparsity" is hard to define clearly in this context.

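In the sense of the sparse argument to removeSparseTerms(), sparsity refers to a threshold on a term's relative document frequency, that is, the proportion of documents in which the term occurs: a term is removed when the proportion of documents in which it does not occur is at least sparse. As the help page for the command states (although not very clearly), the threshold becomes less restrictive as sparse approaches 1.0, so fewer terms are removed. (Note that sparse cannot take the values 0 or 1.0, only values strictly in between.)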

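So your interpretation is correct: with sparse = 0.99, only terms that are more sparse than 0.99 are removed. The exact interpretation for sparse = 0.99 is that, for term $j$, you retain all terms for which $df_j > N \times (1 - 0.99)$, where $df_j$ is the number of documents containing term $j$ and $N$ is the number of documents. In this case, probably all terms will be retained (see the example below).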
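
A minimal base-R sketch of this retention rule may make it concrete. It is not the tm source code: keep_by_sparsity is a hypothetical helper name, it works on a plain documents-by-terms count matrix rather than a DocumentTermMatrix, and the toy counts are chosen purely for illustration.

```r
# Hypothetical helper (not part of tm): keep the terms whose document
# frequency exceeds n_docs * (1 - sparse), the rule described above.
keep_by_sparsity <- function(counts, sparse) {
  stopifnot(is.matrix(counts), sparse > 0, sparse < 1)
  n_docs <- nrow(counts)
  doc_freq <- colSums(counts > 0)  # number of documents containing each term
  counts[, doc_freq > n_docs * (1 - sparse), drop = FALSE]
}

# Toy counts for illustration: three documents, three terms.
counts <- matrix(
  c(2, 1, 0,   # "brown" occurs in 2 of 3 documents
    2, 0, 0,   # "fox" occurs in 1 of 3 documents
    1, 1, 1),  # "the" occurs in all 3 documents
  nrow = 3,
  dimnames = list(Docs = c("doc1", "doc2", "doc3"),
                  Terms = c("brown", "fox", "the"))
)

colnames(keep_by_sparsity(counts, 0.99))  # "brown" "fox" "the"
colnames(keep_by_sparsity(counts, 0.5))   # "brown" "the"
colnames(keep_by_sparsity(counts, 0.01))  # "the"
```

With three documents the thresholds are 0.03, 1.5 and 2.97 respectively, which is why a value near 1 keeps everything and a value near 0 keeps only terms found in every document.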

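Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course, this depends on the number of terms and the number of documents; in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)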

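An example with a sparsity threshold of 0.99, where the second term occurs in (first example) fewer than 1% of the documents and (second example) just over 1% of the documents: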

> library(tm)
> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity           : 0%
Maximal term length: 2
Weighting          : term frequency (tf)
> 
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity           : 49%
Maximal term length: 2
Weighting          : term frequency (tf)
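
The arithmetic behind these two results, worked through with the retention rule $df_j > N \times (1 - sparse)$ and the second term of each matrix (written $df_2$ here):

$$
\begin{aligned}
\text{threshold:} \quad & N \times (1 - 0.99) = 101 \times 0.01 = 1.01 \\
\text{first example:} \quad & df_2 = 1 \le 1.01 \;\Rightarrow\; \text{term 2 is removed} \quad (1/101 \approx 0.0099) \\
\text{second example:} \quad & df_2 = 2 > 1.01 \;\Rightarrow\; \text{term 2 is retained} \quad (2/101 \approx 0.0198)
\end{aligned}
$$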

Here are some additional examples with actual text and terms:

> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
              "the sparse brown furry matrix",
              "the quick matrix")

> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .01))
    Terms
Docs the
   1   1
   2   1
   3   1
> as.matrix(removeSparseTerms(myTdm, .99))
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .5))
    Terms
Docs brown furry matrix quick the
   1     2     2      0     1   1
   2     1     1      1     0   1
   3     0     0      1     1   1

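In the last example, with sparse = 0.5, only terms occurring in at least two of the three documents were retained. (sparse = 0.34 would give the same result here, since both thresholds require a document frequency of at least 2.)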

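An alternative approach for trimming terms from a document-term matrix based on document frequency is the text analysis package quanteda. The same functionality there refers not to sparsity but directly to the document frequency of terms (as in tf-idf). The transcript below uses the quanteda interface as it existed when this answer was written; in later releases the trimming function is dfm_trim() with a min_docfreq argument.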

> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
     a  brown    fox  furry jumped matrix   over  quick second sparse    the 
     1      2      1      2      1      2      1      2      1      1      3 
> trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6 
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
       features
docs    brown furry the matrix quick
  text1     2     2   1      0     1
  text2     1     1   1      1     0
  text3     0     0   1      1     1

This usage seems much more straightforward to me.
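
To relate the two conventions, a sparsity threshold can be converted into a minimum document count. The sketch below assumes the rule $df_j > N \times (1 - sparse)$ given earlier in this answer; sparse_to_min_docfreq is a hypothetical helper name and belongs to neither tm nor quanteda.

```r
# Hypothetical helper: the smallest document count a term needs in order to
# survive removeSparseTerms(x, sparse), assuming df_j > n_docs * (1 - sparse).
sparse_to_min_docfreq <- function(n_docs, sparse) {
  stopifnot(n_docs >= 1, sparse > 0, sparse < 1)
  # Smallest integer strictly greater than the threshold.
  floor(n_docs * (1 - sparse)) + 1
}

sparse_to_min_docfreq(3, 0.5)     # 2
sparse_to_min_docfreq(3, 0.01)    # 3
sparse_to_min_docfreq(101, 0.99)  # 2
```

For the three-document corpus, sparse = 0.5 corresponds to a minimum document frequency of 2, which matches the five terms kept by both packages above.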
