What are VectorSource and VCorpus in the R "tm" (text mining) package?

Cross Validated · r · text-mining
2022-03-28 06:00:33

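I'm not quite sure what VectorSource and VCorpus in the "tm" package actually are.

The documentation is unclear on these. Can anyone explain them to me in simple terms?

2 Answers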

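A "corpus" is a collection of text documents.

VCorpus in tm refers to a "Volatile" corpus, meaning that the corpus is stored in memory and is destroyed when the R object containing it is destroyed.

Contrast this with PCorpus, or Permanent corpus, which is stored outside of memory, for example in a database.

To create a VCorpus with tm, we need to pass a "Source" object as an argument to the VCorpus function. You can list the available sources with -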
getSources()

[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"

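A Source abstracts the input location, such as a directory or a URI. VectorSource is for character vectors only: each element of the vector is treated as one document.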
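To make the idea of a Source more concrete, here is a minimal sketch that builds one corpus from a directory and one from a character vector. It assumes the tm package is installed and that the sample texts bundled with tm (the "texts/txt" folder used in the package vignette) are present; the variable names txtDir, dirCorpus and vecCorpus are only illustrative.

```r
library(tm)

# Assumption: tm ships sample text files under texts/txt (used in its vignette)
txtDir <- system.file("texts", "txt", package = "tm")

# DirSource: one document per file in the directory
dirCorpus <- VCorpus(DirSource(txtDir, encoding = "UTF-8"),
                     readerControl = list(language = "lat"))

# VectorSource: one document per element of a character vector
vecCorpus <- VCorpus(VectorSource(c("first document", "second document")))

length(dirCorpus)  # number of files found in txtDir
length(vecCorpus)  # number of elements in the vector
```

Either way the result is a VCorpus; only the place the text comes from differs.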

A simple example:

Say you have a character vector -

input <- c('This is the first line', 'This is the second line')

Create the source - vecSource <- VectorSource(input)

Then create the corpus - VCorpus(vecSource)
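Put together as one script, the steps above might look like the sketch below. It assumes tm is installed; the name corpus is just an illustrative variable, and the last three lines are only there to show what ended up in the object.

```r
library(tm)

# A character vector: each element becomes one document
input <- c("This is the first line", "This is the second line")

# Wrap the vector in a Source, then build an in-memory (volatile) corpus
vecSource <- VectorSource(input)
corpus <- VCorpus(vecSource)

length(corpus)        # number of documents in the corpus
class(corpus[[1]])    # each document is stored as a PlainTextDocument
content(corpus[[1]])  # the raw text of the first document
```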

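Hope this helps. You can read more here - https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Actually, there is a big difference between Corpus and VCorpus.

Corpus uses SimpleCorpus by default, which means that some features of VCorpus will not be available. One that is immediately obvious is that SimpleCorpus does not let you keep dashes, underscores, or other punctuation marks: SimpleCorpus (and therefore Corpus) removes them automatically, while VCorpus does not. You can find the other limitations of Corpus in the help page ?SimpleCorpus.

Here is an example: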

# Read a text file from internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)

# load the data as a corpus
C.mlk <- Corpus(VectorSource(text))
C.mlk
V.mlk <- VCorpus(VectorSource(text))
V.mlk

The output will be:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 46
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 46

If you inspect the objects:

# inspect the content of the document
inspect(C.mlk[1:2])
inspect(V.mlk[1:2])

You will notice that Corpus unpacks the text:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 2
[1]                                                                                                                                            
[2] And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.


<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2
[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 0
[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 139

VCorpus, on the other hand, keeps it inside the document objects.

Now suppose you convert both to a document-term matrix:

dtm.C.mlk <- DocumentTermMatrix(C.mlk)
length(dtm.C.mlk$dimnames$Terms)
# 168

dtm.V.mlk <- DocumentTermMatrix(V.mlk)
length(dtm.V.mlk$dimnames$Terms)
# 187

Finally, let's look at the content. This is from Corpus:

grep("[[:punct:]]", dtm.C.mlk$dimnames$Terms, value = TRUE)
# character(0)

And this is from VCorpus:

grep("[[:punct:]]", dtm.V.mlk$dimnames$Terms, value = TRUE)

 [1] "alabama,"       "almighty,"      "brotherhood."   "brothers."     
 [5] "california."    "catholics,"     "character."     "children,"     
 [9] "city,"          "colorado."      "creed:"         "day,"          
[13] "day."           "died,"          "dream."         "equal."        
[17] "exalted,"       "faith,"         "gentiles,"      "georgia,"      
[21] "georgia."       "hamlet,"        "hampshire."     "happens,"      
[25] "hope,"          "hope."          "injustice,"     "justice."      
[29] "last!"          "liberty,"       "low,"           "meaning:"      
[33] "men,"           "mississippi,"   "mississippi."   "mountainside," 
[37] "nation,"        "nullification," "oppression,"    "pennsylvania." 
[41] "plain,"         "pride,"         "racists,"       "ring!"         
[45] "ring,"          "ring."          "self-evident,"  "sing."         
[49] "snow-capped"    "spiritual:"     "straight;"      "tennessee."    
[53] "thee,"          "today!"         "together,"      "together."     
[57] "tomorrow,"      "true."          "york."

Look at the words with punctuation attached. That is a huge difference, isn't it?
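If you want to stay with VCorpus but do not want punctuation attached to the terms, one possible approach (not part of the example above, just a sketch) is to strip it explicitly, either through the control list of DocumentTermMatrix or with tm_map before building the matrix. The two short sentences in docs are made-up sample input so that the snippet runs without downloading anything.

```r
library(tm)

# Assumption: a tiny made-up sample instead of the downloaded speech
docs <- c("I have a dream.", "Let freedom ring, let freedom ring!")
V <- VCorpus(VectorSource(docs))

# Default behaviour for a VCorpus: punctuation stays attached to the terms
dtmRaw <- DocumentTermMatrix(V)
grep("[[:punct:]]", Terms(dtmRaw), value = TRUE)

# Option 1: remove punctuation while the matrix is being built
dtmCtl <- DocumentTermMatrix(V, control = list(removePunctuation = TRUE))

# Option 2: remove punctuation from the corpus first, then build the matrix
dtmMap <- DocumentTermMatrix(tm_map(V, removePunctuation))

# Neither matrix should contain terms with punctuation any more
grep("[[:punct:]]", Terms(dtmCtl), value = TRUE)
grep("[[:punct:]]", Terms(dtmMap), value = TRUE)
```

Doing it this way keeps the choice in your hands: hyphenated terms such as "self-evident" survive unless you ask for punctuation to be removed.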