I have, for example, two news articles reporting on the same event. The two texts are similar but not identical. I want to combine them into a single text that contains only the most "relevant" information.
My idea is to go through the texts paragraph by paragraph, rate each paragraph by its "information value", and then combine only the most "relevant" paragraphs.
Any suggestions on how to do this? Are there research papers on this topic?
What you describe corresponds to the NLP task of summarization, specifically multi-document summarization. It is an active research area: https://scholar.google.com/scholar?q=text+summarization
A simpler option is to just extract sentences from both articles: in that case the goal is not to produce text that reads like a story, only an enumeration of sentences.
Even then, you have to define how to measure "information value", which is not easy.
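As a rough illustration of the extractive option (a minimal sketch, not from this answer): one crude proxy for "information value" is a sentence's average TF-IDF weight over the pooled sentences of both articles. Here text1 and text2 are assumed to hold the two articles, and scikit-learn is assumed to be available:

# Score each sentence by its mean TF-IDF weight and keep the top-k
# sentences from the combined pool of both articles (rough sketch).
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def top_sentences(text1, text2, k=5):
    # naive sentence split; a real pipeline would use a proper sentence tokenizer
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text1 + ' ' + text2) if s.strip()]
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(sentences)
    # mean TF-IDF weight over each sentence's non-zero terms
    scores = tfidf.sum(axis=1).A1 / (tfidf.getnnz(axis=1) + 1e-9)
    ranked = sorted(zip(scores, sentences), reverse=True)
    return [s for _, s in ranked[:k]]

Note that this ignores redundancy between the two articles; in practice you would also want to drop near-duplicate sentences, e.g. with the string-distance ideas in the next answer.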
This may not be exactly what you are looking for (it depends on how similar the texts are), but you could approach the problem via "string distance". This will not necessarily detect semantic similarity, but it will detect similar ngrams, i.e. similar word sequences.
You could compare the paragraphs of the two texts, keep only one of each pair that is "similar", and keep both when they are "not similar", according to a predefined criterion. That will not give you a polished final text, but rather a digest of the (hopefully distinct) content.
Some time ago I used some code for this in R.
# 0) Load packages:
library(dplyr)       # data manipulation
library(tidytext)    # unnest_tokens() for ngram tokenization
library(fuzzyjoin)   # stringdist_join()
library(tokenizers)  # tokenizer backends used by tidytext
library(stringdist)  # amatch() and string-distance methods
library(pdftools)    # only needed if reading the texts from PDFs
library(parallel)    # optional, for parallelizing larger comparisons
# Global parameter settings
# Length of the word sequences (ngrams) to compare:
# too long  -> risk of missing nearly identical sequences
# too short -> many spurious matches
ngramlength = 7
# Maximum string distance allowed for a match (higher = more permissive)
maxd = 12
# String-distance method (see ?amatch); options include "osa", "lv", "dl",
# "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"
matchmethod = "osa"
###########################################################################
# 1) Read content:
# Read text
# Note: one sentence is very similar in both examples below (text2, last sentence):
# IN TEXT 1:
# The EU ETS was launched in 2005 and is the first - and still by far the largest - international
# system for trading greenhouse gas emission allowances covering over three-quarters of the allowances
# traded on the international carbon market.
# IN TEXT2:
# The EU Eemissions Trading System has been launched in 2005 and is the first international
# system covering over three-quarters of the allowances traded on the carbon market.
text1=as.character(
"European Union Emissions Trading System (EU ETS) is the cornerstone of the European Union's policy to tackle climate change
and its key tool for cost-effective reduction of emissions of carbon dioxide (CO2) and other greenhouse gases (GHG) in the power,
aviation and industrial sectors. The EU ETS was launched in 2005 and is the first - and still by far the largest - international
system for trading greenhouse gas emission allowances covering over three-quarters of the allowances traded on the international carbon market.
The EU ETS operates in the 31 countries of the European Economic Area (EEA). It limits emissions from nearly 11,000 power plants and manufacturing
installations as well as slightly over 500 aircraft operators flying between EEA's airports (Report from the Commission to the
European Parliament and to the Council, Report on the functioning of the European carbon market, 23 November 2017 (COM(2017) 693 final, p. 7)."
)
text2=as.character(
"The primary intended audience of this package is scholars and professionals in fields where the impact of news on society is a prime factor,
such as journalism, political communication and public relations (Baum and Groeling 2008; Boczkowski and De Santos 2007; Ragas 2014).
To what extent the content of certain sources is homogeneous or diverse has implications for central theories of media effects, such as
agenda-setting and the spiral of silence (Bennett and Iyengar 2008; Blumler and Kavanagh 1999). Identifying patterns in how news travels
from the initial source to the eventual audience is important for understanding who the most influential gatekeepers are (Shoemaker and Vos 2009).
Furthermore, the document similarity data enables one to study news values (Galtung and Ruge 1965) by analyzing what elements of news
predict their diffusion rate and patterns. The EU Eemissions Trading System has been launched in 2005 and is the first international
system covering over three-quarters of the allowances traded on the carbon market."
)
mytext1 = tibble(text1)  # data_frame() is deprecated in dplyr; tibble() is equivalent
mytext2 = tibble(text2)
###########################################################################
# 2) Generate ngrams:
ttext = unnest_tokens(mytext1, ngram, text1, token = "ngrams", n = ngramlength)
torig = unnest_tokens(mytext2, ngram, text2, token = "ngrams", n = ngramlength)
# Drop every second row (we may not need every overlapping ngram)
# ATTENTION: ngrams should not be too long, or dropping rows may skip matching sequences!
ttext <- ttext[seq(1, nrow(ttext), by = 2), ]
torig <- torig[seq(1, nrow(torig), by = 2), ]
###########################################################################
# 3) Compare each ngram in ttext to each in torig
# 3.1) Use the "stringdist" package
# https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
# With "generous" distance allowed, similarities arround 151-153 are detected
amatch(ttext$ngram,torig$ngram,maxDist=10, method = matchmethod)
# Store the results in a way that allows interpretation
results1 = cbind(ttext$ngram, torig$ngram[amatch(ttext$ngram, torig$ngram, maxDist = maxd, method = matchmethod)])
# Drop rows without a match (NAs)
results1 = results1[complete.cases(results1), ]
results1
# 3.2) "Join" similar ngrams (uses stringdist)
# THIS METHOD YIELDS SAME RESULT AS METHOD ABOVE
results2 = stringdist_join(ttext, torig,
by = "ngram",
mode = "left",
ignore_case = T,
method = matchmethod,
max_dist = maxd,
distance_col = "dist"
) %>%
group_by(ngram.x) %>%
top_n(1, -dist)
results2
How do you identify which words are relevant? If you already have a set of relevant words, you can use TF-IDF. TF-IDF treats keywords as features, checks them explicitly, and assigns each a score according to its mathematical formula. You can read more about it here: http://www.tfidf.com/.
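As a minimal sketch (assuming scikit-learn, which the original answer does not mention), TfidfVectorizer exposes these per-term scores directly:

# Terms frequent in one document but rare across the corpus get high TF-IDF weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
# print each term of the first document with its TF-IDF score
for term, score in zip(vec.get_feature_names_out(), X[0].toarray()[0]):
    if score > 0:
        print(term, round(score, 3))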
If you want to extract relevant keywords or a summary of the text, you can use gensim's summarizer.
To get a summary:
# Note: gensim.summarization was removed in gensim 4.0; this requires gensim < 4.0
from gensim.summarization.summarizer import summarize

text = '''Rice Pudding - Poem by Alan Alexander Milne
What is the matter with Mary Jane?
She's crying with all her might and main,
And she won't eat her dinner - rice pudding again -
What is the matter with Mary Jane?
What is the matter with Mary Jane?
I've promised her dolls and a daisy-chain,
And a book about animals - all in vain -
What is the matter with Mary Jane?
What is the matter with Mary Jane?
She's perfectly well, and she hasn't a pain;
But, look at her, now she's beginning again! -
What is the matter with Mary Jane?
What is the matter with Mary Jane?
I've promised her sweets and a ride in the train,
And I've begged her to stop for a bit and explain -
What is the matter with Mary Jane?
What is the matter with Mary Jane?
She's perfectly well and she hasn't a pain,
And it's lovely rice pudding for dinner again!
What is the matter with Mary Jane?'''

print(summarize(text))
And she won't eat her dinner - rice pudding again -
I've promised her dolls and a daisy-chain,
I've promised her sweets and a ride in the train,
And it's lovely rice pudding for dinner again!
To extract keywords:
from gensim.summarization import keywords

text = '''Challenges in natural language processing frequently involve
speech recognition, natural language understanding, natural language
generation (frequently from formal, machine-readable logical forms),
connecting language and machine perception, dialog systems, or some
combination thereof.'''

print(keywords(text).split('\n'))
[u'natural language', u'machine', u'frequently']
You can also combine gensim with TF-IDF: extract a summary or keywords with gensim, then compare the results using TF-IDF.
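One way to read that combination (a sketch of my own, with hypothetical variables article1 and article2; again, gensim.summarization requires gensim < 4.0):

# Summarize each article with gensim, then compare the two summaries
# with TF-IDF cosine similarity from scikit-learn.
from gensim.summarization.summarizer import summarize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

summary1 = summarize(article1)  # article1/article2: your two news texts (placeholders)
summary2 = summarize(article2)

tfidf = TfidfVectorizer().fit_transform([summary1, summary2])
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])  # 1.0 = identical summaries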
The Gensim NLP library is quite powerful and has a similarity API.
It is easy to install and test.
https://radimrehurek.com/gensim/tut3.html#similarity-interface
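The similarity interface from that tutorial looks roughly like this (a minimal sketch with two placeholder documents):

# Build a TF-IDF index over a small corpus and query it for similar documents.
from gensim import corpora, models, similarities

docs = [
    "the eu emissions trading system was launched in 2005",
    "the eu ets is the largest international carbon market",
]
tokenized = [d.lower().split() for d in docs]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(t) for t in tokenized]

tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

query = dictionary.doc2bow("eu carbon trading".lower().split())
print(index[tfidf[query]])  # similarity of the query to each document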
Good luck!