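Thanks for your reply. By chance I found that MicrosoftML ( https://msdn.microsoft.com/en-us/microsoft-r/microsoftml-introduction ) provides what I need, via the featurizeText function that ships with Microsoft R Client and R Server. In case it helps others: I found their example and was able to adapt it to my data (see the featurizeText() help under MicrosoftML).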
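# Requires the MicrosoftML package (bundled with Microsoft R Client / R Server)
library(MicrosoftML)

# Training data: short reviews with a TRUE/FALSE "like" label
trainReviews <- data.frame(review = c(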
"This is great",
"I hate it",
"Love it",
"Do not like it",
"Really like it",
"I hate it",
"I like it a lot",
"I kind of hate it",
"I do like it",
"I really hate it",
"It is very good",
"I hate it a bunch",
"I love it a bunch",
"I hate it",
"I like it very much",
"I hate it very much.",
"I really do love it",
"I really do hate it",
"Love it!",
"Hate it!",
"I love it",
"I hate it",
"I love it",
"I hate it",
"I love it"),
like = c(TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE, TRUE, FALSE, TRUE), stringsAsFactors = FALSE
)
testReviews <- data.frame(review = c(
"This is great",
"I hate it",
"Love it",
"Really like it",
"I hate it",
"I like it a lot",
"I love it",
"I do like it",
"I really hate it",
"I love it"), stringsAsFactors = FALSE)
# Fit a logistic regression on text features derived from the "review" column
outModel <- rxLogisticRegression(like ~ reviewTran, data = trainReviews,
    mlTransforms = list(featurizeText(vars = c(reviewTran = "review"),
        stopwordsRemover = stopwordsDefault(), keepPunctuations = FALSE)))
# 'hate' and 'love' have non-zero weights
summary(outModel)
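
In case it helps to see roughly what a text featurizer is doing, here is a minimal base-R sketch of bag-of-words counting with stop-word and punctuation removal. It is only an illustration under my own assumptions, not what featurizeText does internally: the `stopwords` vector and the `tokenize` helper are made up for this example, and the four reviews are a small subset of the ones above.

```r
# Base-R sketch of bag-of-words featurization (NOT MicrosoftML's implementation).
reviews <- c("This is great", "I hate it", "Love it!", "I really do hate it")

# Hypothetical, tiny stop-word list chosen for this illustration only.
stopwords <- c("i", "it", "is", "this", "do")

# Hypothetical helper: lower-case, strip punctuation, split on whitespace,
# then drop empty tokens and stop words.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]", "", text)
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stopwords)]
}

tokens <- lapply(reviews, tokenize)
vocab <- sort(unique(unlist(tokens)))

# One row per review, one column per vocabulary term; cells hold term counts.
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab
print(dtm)
```

Each column of `dtm` is one candidate feature, so a classifier trained on it can assign a weight to words like "hate" and "love", which is the kind of thing the summary above reports.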