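Take a look at this link.
Here, they walk you through loading unstructured text to create a wordcloud. You can adapt this strategy: instead of creating a wordcloud, create a frequency matrix of the terms used. The idea is to take the unstructured text and give it some structure. You change everything to lowercase (or uppercase), remove stop words, and find the frequent terms for each job function via Document Term Matrices. You also have the option of stemming the words. If you stem, different forms of a word are detected as the same word. For example, 'programmed' and 'programming' could both be stemmed to 'program'. You could then add the occurrence of these frequent terms as weighted features when training your ML model.
You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.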
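To make the phrase idea a bit more concrete, here is a minimal base-R sketch that counts 2-word groups in a single, already-cleaned string. The helper name `countNgrams` and the sample sentence are made up for illustration; they are not part of tm or SnowballC.

```r
# Hypothetical helper: count n-word phrases in one cleaned string
countNgrams = function(text, n = 2) {
  words = strsplit(text, " +")[[1]]
  words = words[words != ""]
  # Too few words to form a single phrase of length n
  if (length(words) < n) return(table(character(0)))
  starts = seq_len(length(words) - n + 1)
  grams = vapply(starts,
                 function(i) paste(words[i:(i + n - 1)], collapse = " "),
                 character(1))
  # Most frequent phrases first
  sort(table(grams), decreasing = TRUE)
}

# Made-up sample text: lowercase, punctuation already removed
sampleText = "test new software releases and test new database releases"
bigrams = countNgrams(sampleText, 2)
print(bigrams)
```

Calling `countNgrams(sampleText, 3)` would give 3-word groups instead. You would probably want to run this per job function and compare the top phrases.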
Example:
1) Load libraries and build sample data
library(tm)
library(SnowballC)
doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
job3 = "Software Engineer"
jobInfo = data.frame("text" = c(doc1,doc2,doc3),
                     "job" = c(job1,job2,job3),
                     stringsAsFactors = FALSE)  # keep text as character on R < 4.0
2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.
# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)
# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))
# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))
# Remove stop words
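jobInfo$text = sapply(jobInfo$text, function(x){
  words = strsplit(x," ")[[1]]
  # Filter with %in% rather than setdiff(): setdiff() also drops repeated
  # words, which would cap every term count at 1 per document
  paste(words[!words %in% stopwords()],collapse=" ")
})
# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x) {
  stems = wordStem(strsplit(x," ")[[1]])
  # Drop empty strings but keep repeats so term counts stay accurate
  paste(stems[stems != ""],collapse=" ")
})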
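As for a shorter way, one possibility is to build the corpus first and let tm's own transformations do the cleaning. The following is only a sketch under that assumption: the two sample texts and the name `punctToSpace` are made up here, and `VCorpus` is used in place of `Corpus` so that each text is kept as a separate plain-text document.

```r
library(tm)
library(SnowballC)  # stemDocument() relies on SnowballC for stemming

# Made-up sample texts standing in for jobInfo$text
texts = c("I am highly skilled in Java Programming.",
          "Tested new software releases and worked with relational databases.")

corpus = VCorpus(VectorSource(texts))

# Replace punctuation with a space, mirroring the gsub() step above.
# tm's built-in removePunctuation deletes it instead, which would turn
# "bug-tracking" into "bugtracking".
punctToSpace = content_transformer(function(x) gsub("[[:punct:]]", " ", x))

corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, punctToSpace)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a character vector
cleaned = sapply(corpus, as.character)
print(cleaned)
```

A corpus cleaned this way can be passed straight to `DocumentTermMatrix()`, so the separate `sapply` steps would not be needed.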
3) Make a corpus source and document term matrix.
# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))
# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)
# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)
Now we have the frequency matrix, jobFreq, which is a 3 x X matrix: 3 documents (rows) and X distinct terms (columns).
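Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and compute, for each job, the percentage of descriptions that use a word. Say, "java" might occur in 80% of 'Software Engineer' descriptions but in only 50% of 'Quality Assurance' descriptions.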
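Both options can be sketched in a few lines of base R. The matrix `freq` below is a made-up stand-in shaped like jobFreq (rows are documents, columns are stemmed terms), and the threshold `minCount` is an arbitrary value chosen for illustration.

```r
# Made-up stand-in for jobFreq: 3 documents x 4 stemmed terms
freq = matrix(c(1, 0, 2, 0,
                0, 3, 1, 0,
                2, 0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("java", "test", "releas", "debug")))
job = c("Software Engineer", "Quality Assurance", "Software Engineer")

# Option 1: keep only terms that occur at least minCount times overall
minCount = 2  # arbitrary threshold, tune for your data
common = freq[, colSums(freq) >= minCount, drop = FALSE]

# Option 2: share of each job's documents that mention a term at least once
present = (freq > 0) * 1                  # 1 if the term appears in the document
docsPerJob = table(job)                   # number of documents per job
jobShare = rowsum(present, group = job)   # documents per job containing each term
jobShare = jobShare / as.vector(docsPerJob[rownames(jobShare)])

print(common)
print(jobShare)
```

With real data you would replace `freq` with `jobFreq` and `job` with `jobInfo$job`. The columns of `common` could serve as model features, while the rows of `jobShare` give a simple per-job profile of term usage.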
Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.