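I'm not quite sure what VectorSource and VCorpus in the "tm" package actually are.
The documentation is unclear on these. Can anyone explain them to me in simple terms?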
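A "corpus" is a collection of text documents.
VCorpus in tm refers to a "volatile" corpus, which means the corpus is stored in memory and is destroyed when the R object containing it is destroyed.
Contrast this with PCorpus, or permanent corpus, which is stored outside memory, for example in a database.
To create a VCorpus with tm, we need to pass a "Source" object as an argument to the VCorpus function. You can list the available sources with: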
getSources()
[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"
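A Source abstracts the input location, such as a directory or a URI. VectorSource is only for character vectors.
A simple example:
Suppose you have a character vector:
input <- c('This is the first line', 'This is the second line')
Create the source: vecSource <- VectorSource(input)
Then create the corpus: VCorpus(vecSource)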
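To see the steps above end to end, here is a minimal sketch. The variable name corpus is my own choice, and I am assuming content() is available once tm is attached (tm depends on NLP, which provides it).

```r
library(tm)

# A character vector: each element becomes one document
input <- c("This is the first line", "This is the second line")

# Wrap the vector in a Source, then build the volatile corpus from it
vecSource <- VectorSource(input)
corpus <- VCorpus(vecSource)

# Number of documents in the corpus
nDocs <- length(corpus)

# Text of the first document; [[ ]] returns a PlainTextDocument
firstText <- content(corpus[[1]])

print(nDocs)
print(firstText)
```

Each element of the input vector ends up as its own PlainTextDocument, so corpus[[i]] gives you the i-th document and content() pulls the raw text back out.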
Hope this helps. You can read more here: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
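Actually, there is a big difference between Corpus and VCorpus.
Corpus defaults to SimpleCorpus, which means some features of VCorpus will not be available. One obvious difference is that SimpleCorpus does not let you keep dashes, underscores, or other punctuation; SimpleCorpus (or Corpus) removes them automatically, while VCorpus does not. You can find the other limitations of Corpus in the help page ?SimpleCorpus.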
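If you want to check this without downloading anything, here is a small sketch on a made-up two-sentence vector. The names txt, sc, vc, sTerms and vTerms are mine, and I am assuming a tm version (0.7 or later) in which Corpus() returns a SimpleCorpus for a VectorSource.

```r
library(tm)

# Toy input with punctuation attached to some words
txt <- c("I have a dream.", "Let freedom ring, let freedom ring!")

sc <- Corpus(VectorSource(txt))   # SimpleCorpus in tm >= 0.7 (assumption)
vc <- VCorpus(VectorSource(txt))  # always a VCorpus

print(class(sc))
print(class(vc))

# Terms of the document-term matrix built from each corpus
sTerms <- Terms(DocumentTermMatrix(sc))
vTerms <- Terms(DocumentTermMatrix(vc))

# Terms that still carry punctuation
print(grep("[[:punct:]]", sTerms, value = TRUE))
print(grep("[[:punct:]]", vTerms, value = TRUE))
```

With the VCorpus you should see terms such as "dream." keep their trailing punctuation; according to the behaviour described above, the SimpleCorpus side should come back without it.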
Here is an example:
# Read a text file from internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)
# load the data as a corpus
C.mlk <- Corpus(VectorSource(text))
C.mlk
V.mlk <- VCorpus(VectorSource(text))
V.mlk
The output will be:
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 46
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 46
If you inspect the objects:
# inspect the content of the document
inspect(C.mlk[1:2])
inspect(V.mlk[1:2])
you will notice that Corpus unpacks the text:
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 2
[1]
[2] And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 0
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 139
while VCorpus keeps it inside the document objects.
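To make the "unpacked" versus "kept inside the object" distinction concrete, here is a hedged sketch on a toy vector (txt, sc, vc, scText and vcDocs are names I made up). It assumes content() behaves as I understand it: a plain character vector for a SimpleCorpus, and a list of document objects for a VCorpus.

```r
library(tm)

# Toy input: an empty line followed by a sentence, similar to the speech file
txt <- c("", "I still have a dream.")

sc <- Corpus(VectorSource(txt))
vc <- VCorpus(VectorSource(txt))

# SimpleCorpus: the content is just the character vector
scText <- content(sc)
print(scText)

# VCorpus: the content is a list of PlainTextDocument objects
vcDocs <- content(vc)
print(class(vcDocs[[2]]))

# To reach the text you go one level deeper, into a single document
secondText <- content(vc[[2]])
print(secondText)
```

That extra level is why inspect() prints the text directly for the SimpleCorpus but only a summary (metadata and character count) per document for the VCorpus.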
Now suppose you convert both to a document-term matrix:
dtm.C.mlk <- DocumentTermMatrix(C.mlk)
length(dtm.C.mlk$dimnames$Terms)
# 168
dtm.V.mlk <- DocumentTermMatrix(V.mlk)
length(dtm.V.mlk$dimnames$Terms)
# 187
Finally, let's look at the content. This is from Corpus:
grep("[[:punct:]]", dtm.C.mlk$dimnames$Terms, value = TRUE)
# character(0)
From VCorpus:
grep("[[:punct:]]", dtm.V.mlk$dimnames$Terms, value = TRUE)
[1] "alabama," "almighty," "brotherhood." "brothers."
[5] "california." "catholics," "character." "children,"
[9] "city," "colorado." "creed:" "day,"
[13] "day." "died," "dream." "equal."
[17] "exalted," "faith," "gentiles," "georgia,"
[21] "georgia." "hamlet," "hampshire." "happens,"
[25] "hope," "hope." "injustice," "justice."
[29] "last!" "liberty," "low," "meaning:"
[33] "men," "mississippi," "mississippi." "mountainside,"
[37] "nation," "nullification," "oppression," "pennsylvania."
[41] "plain," "pride," "racists," "ring!"
[45] "ring," "ring." "self-evident," "sing."
[49] "snow-capped" "spiritual:" "straight;" "tennessee."
[53] "thee," "today!" "together," "together."
[57] "tomorrow," "true." "york."
Look at the words with punctuation attached. That is a huge difference, isn't it?
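If you prefer to stay with VCorpus but do not want punctuation glued to your terms, one option is to strip it yourself. This is only a sketch on toy data (txt, vc, vcClean, cleanTerms and ctrlTerms are my own names); it uses tm's removePunctuation transformation with tm_map, and alternatively the removePunctuation control option of DocumentTermMatrix.

```r
library(tm)

txt <- c("I have a dream.", "Let freedom ring, let freedom ring!")
vc <- VCorpus(VectorSource(txt))

# Option 1: transform the documents first, then build the matrix
vcClean <- tm_map(vc, removePunctuation)
cleanTerms <- Terms(DocumentTermMatrix(vcClean))

# Option 2: leave the corpus alone and strip punctuation while counting terms
ctrlTerms <- Terms(DocumentTermMatrix(vc, control = list(removePunctuation = TRUE)))

print(cleanTerms)
print(ctrlTerms)
```

Either way the decision is explicit with VCorpus, so you can still keep things like hyphenated words when you need them.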