I am struggling with a for loop in R. I have the following data frame of sentences and two dictionaries, one of positive and one of negative words:
library(stringr)
library(plyr)
library(dplyr)
library(stringi)
library(qdap)
library(qdapRegex)
library(reshape2)
library(zoo)
# Create data.frame with sentences
sent <- data.frame(words = c("great just great right size and i love this notebook", "benefits great laptop at the top",
"wouldnt bad notebook and very good", "very good quality", "bad orgtop but great",
"great improvement for that great improvement bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
number = c(1,1,1,1,1,1,1), stringsAsFactors=F)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
Now I replicate the original data frame to simulate a large data set:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
sent <- sent[rep(seq_len(nrow(sent)), 100000), ]
# Pad each sentence with a leading and trailing space so that whole words can be matched
sent$words <- paste0(" ", sent$words, " ")
rownames(sent) <- NULL
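For my approach I need to sort the dictionary words by length in descending order, so that multi-word entries such as "not good" are matched before shorter ones such as "good", and attach their sentiment scores (positive word = 1, negative word = -1).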
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1, stringsAsFactors = FALSE))
wordsDF$lengths <- nchar(wordsDF$words)          # nchar() is already vectorised
wordsDF <- wordsDF[order(-wordsDF$lengths), ]    # longest entries first
# Pad each dictionary entry with spaces so that only whole words match
wordsDF$words <- paste0(" ", wordsDF$words, " ")
rownames(wordsDF) <- NULL
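To illustrate why the ordering matters, here is a minimal sketch on a toy dictionary. The helper `score_in_order` and the toy inputs are made up for this illustration, and it uses fixed-string matching from stringi on the assumption that the dictionary entries are plain words rather than regular expressions.

```r
library(stringi)

# Hypothetical helper: score one space-padded sentence against the dictionary,
# processing the entries in exactly the order they are given.
score_in_order <- function(sentence, words, values) {
  score <- 0
  for (i in seq_along(words)) {
    score <- score + stri_count_fixed(sentence, words[i]) * values[i]
    # remove the matched entry so it cannot be counted again by a later entry
    sentence <- stri_replace_all_fixed(sentence, words[i], " ")
  }
  score
}

words    <- c(" good ", " not good ")
values   <- c(1, -1)
sentence <- " this is not good "

longest_first <- order(-nchar(words))

# "not good" is matched and removed before "good" gets a chance
score_in_order(sentence, words[longest_first], values[longest_first])  # -1

# "good" is matched first, so "not good" is never seen
score_in_order(sentence, words, values)                                # 1
```

With the shorter entry first, the sentence is wrongly scored as positive, which is why the dictionary is sorted by descending length.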
Then I have the following function with a for loop. For each dictionary entry it 1) matches the exact word, 2) counts the matches, 3) updates the score, and 4) removes the matched word from the sentence before the next iteration:
scoreSentence_new <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    sd <- function(text) {stri_count(text, regex=wordsDF[x,1])} # count matched words
    results <- sapply(sentence, sd, USE.NAMES=F) # count matched words
    score <- (score + (results * wordsDF[x,2])) # compute score
    sentence <- str_replace_all(sentence, wordsDF[x,1], " ") # remove matched words from sentence for next iteration
  }
  score
}
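A side note on the function above: `stri_count` is vectorised over its text argument, so the `sapply` wrapper is probably not needed. The small check below, on made-up toy sentences, compares the per-element call with a direct call on the whole vector. It only shows that the two give the same counts; it says nothing about how much time this saves on the real data.

```r
library(stringi)

# Toy inputs (hypothetical, not the real data)
sentences <- c(" great just great ", " bad orgtop but great ", " very good quality ")
pattern   <- " great "

# Per-element wrapper, in the style of the loop body above
count_one  <- function(text) stri_count(text, regex = pattern)
via_sapply <- sapply(sentences, count_one, USE.NAMES = FALSE)

# Direct call on the whole character vector
direct <- stri_count(sentences, regex = pattern)

identical(via_sapply, direct)  # TRUE
```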
When I call the function
SentimentScore_new <- scoreSentence_new(sent$words)
sent_new <- cbind(sent, SentimentScore_new)
sent_new$words <- str_trim(sent_new$words, side = "both")
it produces the desired output:
words                                                                             user  SentimentScore_new
great just great right size and i love this notebook                                 1                   4
benefits great laptop at the top                                                     1                   2
wouldnt bad notebook and very good                                                   3                   2
very good quality                                                                    4                   1
bad orgtop but great                                                                 5                   0
great improvement for that great improvement bad product but overall is not good     6                   0
notebook is not good but i love batterytop                                           7                   0
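In reality I am working with a dictionary of about 7,000 positive/negative words and 200,000 sentences. When I run my approach on just 1,000 sentences, it takes 45 minutes. Could anyone please help me with a faster approach, using vectorization or a parallel solution? I am a beginner in R and I am doing my best :-( Many thanks for any advice or solution.
I was wondering about something like this: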
n <- 1:nrow(wordsDF)
score <- 0
try_1 <- function(ttt) {
  sd <- function(text) {stri_count(text, regex=wordsDF[ttt,1])}
  results <- sapply(sent$words, sd, USE.NAMES=F)
  score <- (score + (results * wordsDF[ttt,2])) # compute score (count * sentValue)
  sent$words <- str_replace_all(sent$words, wordsDF[ttt,1], " ")
  score
}
a <- unlist(sapply(n, try_1))
apply(a,1,sum)
But it does not work :-(
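The most likely reason is that the assignments to `score` and `sent$words` inside `try_1` only change local copies, so a removed word is never carried over to the next dictionary entry. Each step depends on the sentences left over from the previous step, so the dictionary entries cannot be handled independently by `sapply`. What can be vectorised is the work across sentences. The following is only a sketch under some assumptions: the name `score_sentences_vec` and the toy data are made up, the dictionary is assumed to be space-padded and sorted longest first, and fixed-string matching is swapped in for the regex matching on the assumption that the entries are plain words. It has not been benchmarked on the full data.

```r
library(stringi)

# Hypothetical variant: loop over dictionary entries (the state must be carried
# from one entry to the next), but process all sentences at once in each step.
# dict needs columns `words` (space-padded) and `value`, sorted longest first.
score_sentences_vec <- function(sentences, dict) {
  score <- numeric(length(sentences))
  for (i in seq_len(nrow(dict))) {
    pat <- dict$words[i]
    # count matches in every sentence in one vectorised call
    score <- score + stri_count_fixed(sentences, pat) * dict$value[i]
    # remove the matches from every sentence before the next entry
    sentences <- stri_replace_all_fixed(sentences, pat, " ")
  }
  score
}

# Toy data (hypothetical)
dict <- data.frame(words = c(" very good ", " not good ", " good ", " bad "),
                   value = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)
sentences <- c(" very good quality ", " this is not good ", " bad but good ")

score_sentences_vec(sentences, dict)  # 1 -1 0
```

If this turns out to be fast enough per chunk, the sentences could additionally be split into chunks and scored in parallel, since the sentences are independent of each other even though the dictionary entries are not.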