Text summarization can be divided into two categories: (1) extractive summarization and (2) abstractive summarization.
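- Extractive summarization: these methods extract several parts of a text, such as phrases and sentences, and stack them together to create a summary. Identifying the right sentences to include is therefore the most important step in an extractive method.
- Abstractive summarization: these methods select words based on semantic understanding, even words that do not appear in the source document. They aim to express the important material in a new way: they interpret and examine the text with advanced natural language techniques to generate a new, shorter text that conveys the most critical information from the original.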
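To make the extractive idea concrete, here is a minimal sketch in base R that scores each sentence by the average frequency of its words and keeps the top `n`. The function name `summarize_extractive`, the tiny stopword list, and the naive sentence splitter are all assumptions made for illustration; real packages use more careful tokenization and scoring.

```r
# Minimal frequency-based extractive summarizer (base R only).
# summarize_extractive and the stopword list are hypothetical, for illustration.
summarize_extractive <- function(text, n = 2) {
  stopwords <- c("a", "an", "and", "the", "was", "is", "of", "to", "in", "on")

  # Naive sentence split: break on whitespace that follows ., ! or ?
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

  # Lowercase, keep letters only, drop stopwords
  tokenize <- function(s) {
    w <- unlist(strsplit(tolower(gsub("[^[:alpha:]]+", " ", s)), " ", fixed = TRUE))
    w[nzchar(w) & !(w %in% stopwords)]
  }
  tokens <- lapply(sentences, tokenize)

  # Word frequencies over the whole text
  freq <- table(unlist(tokens))

  # Sentence score = mean frequency of its words
  score <- vapply(tokens, function(w) {
    if (length(w) == 0) 0 else sum(freq[w]) / length(w)
  }, numeric(1))

  # Take the n best sentences, then restore their original order
  keep <- sort(order(score, decreasing = TRUE)[seq_len(min(n, length(sentences)))])
  sentences[keep]
}

text <- "Cats sleep. Cats chase mice. The weather was mild today."
summarize_extractive(text, n = 2)
```

Nothing in the output is newly written: every returned sentence is copied verbatim from the input, which is what distinguishes extractive from abstractive methods.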
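What you are looking for is abstractive summarization. That said, since you are working in R, there is a nice package called lexRankr (note that LexRank itself is an extractive method). An example, taken from the linked source, looks like this: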
#load needed packages
library(xml2)
library(rvest)
library(lexRankr)
#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"
#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))
#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          #only 1 article; repeat same docid for all of input vector
                          docId = rep(1, length(page_text)),
                          #return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)
#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]
> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."
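For intuition about what the ranking step does, here is a rough base R sketch of the idea behind LexRank: build a sentence similarity graph and score sentences with a PageRank-style power iteration. It is not lexRankr's implementation. The helper `rank_sentences` is hypothetical, and it uses plain Jaccard word overlap as a stand-in for the idf-modified cosine similarity described for LexRank.

```r
# Sketch of graph-based sentence ranking in the spirit of LexRank.
# rank_sentences is a hypothetical helper, not part of lexRankr.
rank_sentences <- function(sentences, damping = 0.85, iters = 50) {
  # Unique lowercase words per sentence
  tokens <- lapply(sentences, function(s) {
    w <- unlist(strsplit(tolower(gsub("[^[:alpha:]]+", " ", s)), " ", fixed = TRUE))
    unique(w[nzchar(w)])
  })
  n <- length(sentences)

  # Pairwise Jaccard similarity (assumption: stand-in for idf-modified cosine)
  sim <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      u <- length(union(tokens[[i]], tokens[[j]]))
      if (i != j && u > 0) {
        sim[i, j] <- length(intersect(tokens[[i]], tokens[[j]])) / u
      }
    }
  }

  # Row-normalise into transition probabilities;
  # sentences with no neighbours jump uniformly to any sentence
  rs <- rowSums(sim)
  trans <- sim / ifelse(rs > 0, rs, 1)
  trans[rs == 0, ] <- 1 / n

  # Power iteration for PageRank-style centrality scores
  score <- rep(1 / n, n)
  for (k in seq_len(iters)) {
    score <- (1 - damping) / n + damping * as.vector(score %*% trans)
  }
  score
}

sentences <- c("Monsanto lobbyists banned from parliament",
               "Parliament banned Monsanto lobbyists after hearing",
               "The weather was mild today")
scores <- rank_sentences(sentences)
sentences[order(scores, decreasing = TRUE)]
```

Sentences that share vocabulary with many others end up with high scores, while an off-topic sentence such as the third one is ranked last.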
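Edit: here is how I like to think about abstractive summarization:
If you use an encoder-decoder architecture for the seq2seq problem (extended with transformers), you basically obtain an embedding of the text in which the same sentence can be embedded differently in different contexts, while still yielding the same or a similar output.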
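One common way to write this down, as a sketch in generic notation (the symbols $x$, $y$, $h$ and $\mathrm{Enc}$ are introduced here for illustration and are not tied to any particular model): the encoder maps the source tokens to contextual vectors, and the decoder generates the summary one token at a time, conditioned on those vectors.

```latex
\begin{aligned}
(h_1, \dots, h_n) &= \mathrm{Enc}(x_1, \dots, x_n) \\
p(y_1, \dots, y_m \mid x_1, \dots, x_n) &= \prod_{t=1}^{m} p\bigl(y_t \mid y_1, \dots, y_{t-1},\; h_1, \dots, h_n\bigr)
\end{aligned}
```

Because each $h_i$ is computed from the whole input rather than from $x_i$ alone, the same words can receive different embeddings in different contexts, and the decoder is free to emit words $y_t$ that never occur in the source.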