Vectorizing a for loop in sentiment analysis

data-mining r
2021-10-03 20:55:30

I'm struggling with a for loop in R. I have the following data frame with sentences and two dictionaries of positive and negative words:

library(stringr)
library(plyr)
library(dplyr)
library(stringi)
library(qdap)
library(qdapRegex)
library(reshape2)
library(zoo)

# Create data.frame with sentences
sent <- data.frame(
  words = c("great just great right size and i love this notebook",
            "benefits great laptop at the top",
            "wouldnt bad notebook and very good",
            "very good quality",
            "bad orgtop but great",
            "great improvement for that great improvement bad product but overall is not good",
            "notebook is not good but i love batterytop"),
  user = c(1, 2, 3, 4, 5, 6, 7),
  number = c(1, 1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE)

# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
              "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
              "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")

Now I replicate the original data frame to simulate big data:

# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000, sent$words))
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]
# Pad each sentence with a leading and trailing space, so that the
# space-padded dictionary entries also match at the sentence boundaries
sent$words <- paste(c(""), sent$words, c(""), collapse = NULL)
rownames(sent) <- NULL

For my further approach I have to sort the dictionary words in descending order of length, together with their sentiment scores (pos word = 1, neg word = -1), so that longer phrases such as "great improvement" are matched and removed before their component words such as "great".

# Combine pos/negWords with their scores and order by descending word length
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar)) # word length in characters
wordsDF <- wordsDF[order(-wordsDF[, 3]), ]              # longest entries first
# Pad the entries with spaces so that only whole words are matched
wordsDF$words <- paste(c(""), wordsDF$words, c(""), collapse = NULL)
rownames(wordsDF) <- NULL

Then I have the following function with a for loop, which 1) matches the exact words, 2) counts them, 3) computes the score, and 4) removes the matched words from the sentence for the next iteration:

scoreSentence_new <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    sd <- function(text) {stri_count(text, regex=wordsDF[x,1])} # count matches of the current word (shadows base::sd)
    results <- sapply(sentence, sd, USE.NAMES=F)                # one count per sentence
    score <- (score + (results * wordsDF[x,2]))                 # compute score (count * sentiment value)
    sentence <- str_replace_all(sentence, wordsDF[x,1], " ")    # remove matched words for the next iteration
  }
  score
}

When I call the function

SentimentScore_new <- scoreSentence_new(sent$words)
sent_new <- cbind(sent, SentimentScore_new)
sent_new$words <- str_trim(sent_new$words, side = "both")

it produces the desired output:

                                                                             words user     SentimentScore_new
                             great just great right size and i love this notebook    1                  4
                                                 benefits great laptop at the top    2                  2
                                               wouldnt bad notebook and very good    3                  2
                                                                very good quality    4                  1
                                                             bad orgtop but great    5                  0
 great improvement for that great improvement bad product but overall is not good    6                  0
                                       notebook is not good but i love batterytop    7                  0

In reality I'm working with pos/neg dictionaries of about 7,000 words, and I have 200,000 sentences. When I ran my approach on just 1,000 sentences it took 45 minutes. Please, could anyone help me with a faster approach using vectorization or a parallel solution? I'm doing my best, but my R programming skills are those of a beginner :-( Many thanks in advance for any advice or solution.

I was wondering about something like this:

n <- 1:nrow(wordsDF)
score <- 0

try_1 <- function(ttt) {
  sd <- function(text) {stri_count(text, regex=wordsDF[ttt,1])}
  results <- sapply(sent$words, sd, USE.NAMES=F)
  score <- (score + (results * wordsDF[ttt,2])) # compute score (count * sentValue)
  sent$words <- str_replace_all(sent$words, wordsDF[ttt,1], " ")
  score
}

a <- unlist(sapply(n, try_1))
apply(a,1,sum)

but it doesn't work :-(
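A side note on why this fails: the assignments to score and sent$words inside try_1 modify only local copies, so the matched words are never actually removed between calls, and unlist() flattens the matrix that sapply() returns into a plain vector, so apply(a, 1, sum) has no rows to sum over. Dropping unlist() at least makes the summing work (a sketch that accepts double counting of sub-phrases, since the removal step never persists):

a <- sapply(n, try_1) # matrix: one row per sentence, one column per dictionary word
rowSums(a)            # per-sentence totals; "great improvement" still also counts "great"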

1 Answer


I can imagine it runs slowly :) Avoid very long for loops in R whenever you can. If you do need one, keep each iteration as simple as possible, be careful with slow search functions, and don't subset or edit large data structures inside the loop.
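As a concrete illustration of that advice (a minimal sketch, not from the original post): stri_count() is already vectorised over its first argument, so the inner sapply() in scoreSentence_new calls a slow search function once per sentence where a single call over the whole vector would do. Keeping the loop only over the dictionary looks like this:

scoreSentence_sketch <- function(sentence) {
  score <- 0
  for (x in 1:nrow(wordsDF)) {
    results <- stri_count(sentence, regex = wordsDF[x, 1])             # one vectorised call per word
    score <- score + results * wordsDF[x, 2]                           # compute score
    sentence <- stri_replace_all(sentence, " ", regex = wordsDF[x, 1]) # remove matched words
  }
  score
}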

Restructure your data

I was talking to a guy at a party who does similar analyses. He said they arrange the data in a matrix with one row per sentence (200,000) and one column per word (7,000). Each element counts how many times the given word occurs in the given sentence. If you need to do more analysis on the dataset later, any operation can be computed quickly from this layout. Your sentiment score is then simply the inner product of any row with the scoring vector (+1, +2, -1, the per-word scores).

If you use the bigmemory package and choose shorts (counts up to 255) instead of integers, the matrix takes about 2 GB. bigmemory works smoothly with multicore (foreach, doMC, etc.) and can modify individual elements faster than an R matrix. The matrix can be kept in shared memory or on disk. bigmemory uses C++ data structures, which makes it possible to perform some very fast operations with Rcpp, although that is a steep learning curve.
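A minimal sketch of that matrix layout on the toy data from the question, assuming everything fits in an ordinary dense matrix (at 200,000 × 7,000 you would switch to bigmemory as described above). Note it skips the remove-after-match step, so overlapping entries such as "great improvement" and "great" are counted independently:

library(stringi)

# rows = sentences, columns = dictionary words; each element is a count
counts <- vapply(wordsDF$words,
                 function(w) stri_count(sent$words, regex = w),
                 numeric(nrow(sent)))

# sentiment score = inner product of each row with the word-score vector
SentimentScore <- as.vector(counts %*% wordsDF$value)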

A sparse matrix would also be a good idea. Note, though, that the bigmemory package does not support Windows.
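If most sentence/word pairs are zero, a sparse representation keeps memory proportional to the nonzero counts only. A sketch using the Matrix package (my choice of package; the idea is only named above):

library(Matrix)

counts_sparse <- Matrix(counts, sparse = TRUE) # reuse the dense counts from the sketch above
SentimentScore <- as.numeric(counts_sparse %*% wordsDF$value)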

Maybe you can find some inspiration here:

https://stackoverflow.com/questions/10233087/sentiment-analysis-using-r