数据挖掘 - 有没有办法对字母数字字符串进行分类？ - 吾爱随笔录

有没有办法对字母数字字符串进行分类？

数据挖掘机器学习文本分类

2022-02-12 15:07:09

我有包含各种项目的数据。每个项目都有一个与之关联的唯一字母数字代码（参见下面的示例）。

有没有办法根据字母数字代码预测项目类型？

数据：

Item          code          type
1       4S2BDANC5L3247151   book
2       1N4AL3AP1JC236284   book
3       3R4BTTGC3L3237430   book
4       KNMAT2MT1KP546287   book
5       97806773062273208   pen
6       07356196706378892   Pen
7       97807345361169253   pen
8       01008130715194136   chair
9       01076305063010CCE44 chair
etc

1个回答

在下面找到基于 n-gram 的（二进制）文本分类的最小实现。它是用基础 R 编写的。您首先需要提取 ngram 并应用一些模型（在我的例子中是 Lasso）来预测每个 ngram 的类。您可以通过相应地更改模型轻松地将其扩展为多类。请注意，这只是一个有改进空间的最小示例。

另请参阅：https ://github.com/Bixi81/R-ml/blob/master/NLP_ngram_short_text.R

在这里找到 GLMnet 的 Python 实现：https ://web.stanford.edu/~hastie/glmnet_python/

数据：

# Dummy data
# Note the little differences in the strings, however there is a clear pattern
df = data.frame(text=c("ab13ab12ab16","ag16ag16fg16","ab12ab12ab12","fg12fg16fg16","ab16ab12af12","fg16fg16fg16"),target=c(1,0,1,0,1,0))

head(df)

          text target
1 ab13ab12ab16      1
2 ag16ag16fg16      0
3 ab12ab12ab12      1
4 fg12fg16fg16      0
5 ab16ab12af12      1
6 fg16fg16fg16      0

Ngram：

# Set up lists to post results from loop
ngrams = list()
targets = list()
observation = list()

# Loop over rows of DF
counter = 1
for (row in 1:nrow(df)){
  # Loop over strings in first colum per row
  for (s in 1:nchar(as.character(df$text[[row]]))){
    # Get "ngram" (sequence of two letters/digits)
    substring=substring(as.character(df$text[[row]]), s, s+1)
    # Append if >1, also post row and target
    if (nchar(substring)>1){
      ngrams[[counter]]<-substring
      targets[[counter]]<-df$target[[row]]
      observation[[counter]]<-row
      counter = counter+1
    }
  }
}

# Lists to DFs
ngramdf=data.frame(ngram=matrix(unlist(ngrams), nrow=length(ngrams), byrow=T))
targets=data.frame(ngram=matrix(unlist(targets), nrow=length(targets), byrow=T))
obs=data.frame(obs=matrix(unlist(observation), nrow=length(observation), byrow=T))

# Get dummy encoding ("one hot") from all the ngrams
dummies = model.matrix(~ . -1 , data=ngramdf)

# Bind target and dummies to train DF
train = cbind(targets,dummies)

估计一些模型：

# Now apply Lasso to predict the classes
# https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
library(glmnet)
cvfit = cv.glmnet(as.matrix(train[,-1]), as.matrix(train[,1]), family = "binomial", type.measure = "class")

# Predict class per ngram
classes = predict(cvfit, newx = as.matrix(train[,-1]), s = "lambda.min", type = "class")

# predicted probability per ngram -> needs (type="response")
probs = predict(cvfit, newx = as.matrix(train[,-1]), s = "lambda.min", type = "response")

# Calculate average prob that a sequence (so ngrams per row) belongs to some class
result = cbind(probs,obs)
# Get mean per row
result = aggregate(. ~ result$obs, result[1], mean)
# Bind original text and target
result = cbind(result,df)
colnames(result)<-c("observation","estimated_prob", "text", "original_target")

结果：

  observation estimated_prob         text original_target
1           1      0.7173567 ab13ab12ab16               1
2           2      0.3020248 ag16ag16fg16               0
3           3      0.7674571 ab12ab12ab12               1
4           4      0.2922798 fg12fg16fg16               0
5           5      0.6815783 ab16ab12af12               1
6           6      0.2393033 fg16fg16fg16               0

其它你可能感兴趣的问题

上一篇A/B 测试结果与离线机器学习模型性能相矛盾下一篇如果我们将序列数据输入前馈网络会怎样？