General approach to extract key text from a sentence (nlp)

Tags: data-mining, machine-learning, nlp, text-mining, data-cleaning

2021-09-13 22:01:18

Given a sentence like:

Complimentary gym access for two for the length of stay ($12 value per person per day)

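What general approach can I take to identify the word gym or gym access?

3 Answers

Shallow Natural Language Processing techniques can be used to extract concepts from a sentence.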

------------------------------------------

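Shallow NLP technique steps:

1. Convert the sentence to lowercase
2. Remove stop words (these are common words found in a language; words like for, very, and, of, are, etc. are common stop words)
3. Extract n-grams, i.e. contiguous sequences of n items from a given sequence of text (simply increasing n lets the model store more context)
4. Assign a syntactic label (noun, verb, etc.)
5. Extract knowledge from the text through a semantic/syntactic analysis approach, i.e. try to retain the words that carry more weight in a sentence, such as nouns and verbs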

------------------------------------------
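
The first three steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the code used for this answer: the stop-word set is a tiny hand-picked assumption (real lists are much longer), the tokenizer simply keeps runs of letters (so "($12" is dropped), and the helper names tokenize and ngrams are made up for the example.

```python
import re

# Assumption: a tiny illustrative stop-word list; real lists are far longer.
STOPWORDS = {"for", "the", "of", "and", "are", "very"}

def tokenize(sentence):
    """Lowercase the sentence (step 1) and drop stop words (step 2)."""
    # Keep runs of letters only, so punctuation and "$12" are discarded.
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as one string (step 3)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("Complimentary gym access for two for the length of stay "
            "($12 value per person per day)")
tokens = tokenize(sentence)
print(ngrams(tokens, 2))
```

With n = 2 this should give ten 2-grams in sentence order, starting with "complimentary gym" and "gym access"; raising n trades a longer candidate list for more context per candidate.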

Let us examine the results of applying the above steps to the given sentence: Complimentary gym access for two for the length of stay ($12 value per person per day).

1-gram results: gym, access, length, stay, value, person, day

Summary of steps 1 through 4 of shallow NLP:

1-gram          PoS_Tag   Stopword (Yes/No)?    PoS Tag Description
-------------------------------------------------------------------    
Complimentary   NNP                             Proper noun, singular
gym             NN                              Noun, singular or mass
access          NN                              Noun, singular or mass
for             IN         Yes                  Preposition or subordinating conjunction
two             CD                              Cardinal number
for             IN         Yes                  Preposition or subordinating conjunction
the             DT         Yes                  Determiner
length          NN                              Noun, singular or mass
of              IN         Yes                  Preposition or subordinating conjunction
stay            NN                              Noun, singular or mass
($12            CD                              Cardinal number
value           NN                              Noun, singular or mass
per             IN                              Preposition or subordinating conjunction
person          NN                              Noun, singular or mass
per             IN                              Preposition or subordinating conjunction
day)            NN                              Noun, singular or mass

Step 5: Retaining only the nouns/verbs, we end up with gym, access, length, stay, value, person, day
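
The filtering in this last step can be sketched as follows, assuming the tags have already been produced by a tagger. The rows are copied from the 1-gram table above (with the stray brackets stripped from the tokens); the rule for which tags count as "noun/verb" is an assumption chosen so that the result matches the list above, and the name is_content_tag is hypothetical.

```python
# (token, PoS tag, is_stopword) rows copied from the 1-gram table above.
TAGGED = [
    ("complimentary", "NNP", False), ("gym", "NN", False),
    ("access", "NN", False), ("for", "IN", True), ("two", "CD", False),
    ("for", "IN", True), ("the", "DT", True), ("length", "NN", False),
    ("of", "IN", True), ("stay", "NN", False), ("$12", "CD", False),
    ("value", "NN", False), ("per", "IN", False), ("person", "NN", False),
    ("per", "IN", False), ("day", "NN", False),
]

def is_content_tag(tag):
    """Assumption: keep common nouns (NN, NNS) and verbs (VB*) only.

    The proper-noun tag NNP given to "complimentary" is left out,
    which is what reproduces the 1-gram list in the answer.
    """
    return tag in {"NN", "NNS"} or tag.startswith("VB")

keywords = [tok for tok, tag, stop in TAGGED
            if not stop and is_content_tag(tag)]
print(keywords)
# ['gym', 'access', 'length', 'stay', 'value', 'person', 'day']
```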

Let us increase n to store more context and remove stop words.

2-gram results: complimentary gym, gym access, length stay, stay value

Summary of steps 1 through 4 of shallow NLP:

2-gram              Pos Tag
---------------------------
access two          NN CD
complimentary gym   NNP NN
gym access          NN NN
length stay         NN NN
per day             IN NN
per person          IN NN
person per          NN IN
stay value          NN NN
two length          CD NN
value per           NN IN

Step 5: Retaining only the Noun/Verb combination we end up with complimentary gym, gym access, length stay, stay value
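
One way to express this selection in code is to slide a window of n tokens over the tagged sequence and keep a window only when every tag in it is a noun or verb tag. The sketch below is an assumption about the rule, not the code behind this answer; the tags are copied from the 1-gram table, "$12" is omitted as it is in the 2-gram table, and content_ngrams is a made-up name.

```python
# (token, PoS tag) pairs left after lowercasing and stop-word removal.
TAGGED = [
    ("complimentary", "NNP"), ("gym", "NN"), ("access", "NN"), ("two", "CD"),
    ("length", "NN"), ("stay", "NN"), ("value", "NN"), ("per", "IN"),
    ("person", "NN"), ("per", "IN"), ("day", "NN"),
]

def content_ngrams(tagged, n):
    """Keep the n-grams in which every token is tagged as a noun or verb."""
    kept = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        # Assumption: any tag starting with NN (noun) or VB (verb) counts,
        # so the NNP on "complimentary" is accepted here.
        if all(tag.startswith(("NN", "VB")) for _, tag in window):
            kept.append(" ".join(word for word, _ in window))
    return kept

print(content_ngrams(TAGGED, 2))
# ['complimentary gym', 'gym access', 'length stay', 'stay value']
```

Called with n = 3, the same strict rule keeps only complimentary gym access and length stay value. The 3-gram list later in this answer also retains person per day, which contains a preposition, so a looser rule appears to have been applied there.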

3-gram results: complimentary gym access, length stay value, person per day

Summary of steps 1 through 4 of shallow NLP:

3-gram                      Pos Tag
-------------------------------------
access two length           NN CD NN
complimentary gym access    NNP NN NN
gym access two              NN NN CD
length stay value           NN NN NN
per person per              IN NN IN
person per day              NN IN NN
stay value per              NN NN IN
two length stay             CD NN NN
value per person            NN IN NN


Step 5: Retaining only the Noun/Verb combination we end up with complimentary gym access, length stay value, person per day

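Things to remember:

Refer to the Penn Treebank to understand the PoS tag descriptions
Depending on your data and business context, you can decide the value of n used to extract n-grams from the sentence
Adding domain-specific stop words will improve the quality of concept/theme extraction
Deep NLP techniques will give better results: rather than n-grams, they detect relationships within the sentence and represent them as complex structures in order to retain the context. For additional information, see this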

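Tools:

You can consider using OpenNLP / StanfordNLP for part-of-speech tagging. Most programming languages have supporting libraries for OpenNLP/StanfordNLP, so you can choose the language you are most comfortable with. Below is the sample R code I used for PoS tagging.

Sample R code: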

# Point rJava at the local Java runtime; adjust this path to your installation.
Sys.setenv(JAVA_HOME = 'C:\\Program Files\\Java\\jre7') # for 32-bit version
library(rJava)
library(NLP)
library(openNLP)

s <- "Complimentary gym access for two for the length of stay $12 value per person per day"

# Tokenize x and tag every token with its part of speech.
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  # Treat the whole string as a single sentence annotation.
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  # Keep only the word annotations and pull out their POS feature.
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

tagged_str <- tagPOS(s)
tagged_str

#$POStagged
#[1] "Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN per/IN day/NN"
#
#$POStags
#[1] "NNP" "NN"  "NN"  "IN"  "CD"  "IN"  "DT"  "NN"  "IN"  "NN"  "$"   "CD" 
#[13] "NN"  "IN"  "NN"  "IN"  "NN"

Additional reading on shallow and deep NLP:

Integrating Shallow and Deep NLP for Information Extraction

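You need to analyze the sentence structure and extract the corresponding syntactic categories of interest (in this case, I think it would be the noun phrase, which is a phrasal category). For details, see the corresponding Wikipedia article and the "Analyzing Sentence Structure" chapter of the NLTK book.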
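
As a rough illustration of the idea, the sketch below groups consecutive noun-tagged tokens into candidate phrases. It is a deliberately crude stand-in for real noun-phrase chunking, not NLTK's or Stanford's method: it ignores determiners and adjectives, it takes the tagged string from the openNLP output posted in another answer here as its input, and the function name noun_chunks is made up.

```python
from itertools import groupby

# Tagged output copied from the openNLP run shown in another answer.
TAGGED = ("Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT "
          "length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN "
          "per/IN day/NN")

def noun_chunks(tagged_text):
    """Group maximal runs of consecutive noun-tagged tokens into phrases."""
    # Split each "word/TAG" item at its last slash so "$/$" also parses.
    pairs = [item.rsplit("/", 1) for item in tagged_text.split()]
    chunks = []
    for is_noun, run in groupby(pairs, key=lambda p: p[1].startswith("NN")):
        if is_noun:
            chunks.append(" ".join(word for word, _ in run))
    return chunks

print(noun_chunks(TAGGED))
# ['Complimentary gym access', 'length', 'stay', 'value', 'person', 'day']
```

The first chunk contains the gym access phrase the question asks about. A proper chunker or parser would also decide whether a word like Complimentary belongs in the phrase as a modifier.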

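As for available software tools that implement the above approach and others, I would suggest considering either NLTK (if you prefer Python) or the StanfordNLP software (if you prefer Java). For many other NLP frameworks, libraries, and programming support for various languages, see the corresponding (NLP) sections in this excellent curated list.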

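If you are an R user, there is a lot of good practical information here. Look at their text mining examples.
Also, take a look at the tm package.
This is also a good aggregation site.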
