Regex to remove repeated words from a sentence

Data mining, Python, nlp, regex

2021-09-26 10:09:31

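I am new to regular expressions. I am working on a project where I need to replace a repeated word with a single occurrence of that word. For example: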

I need need to learn regex regex from scratch.

I need to change this to:

I need to learn regex from scratch.

I am able to identify the repeated words using the following regex: \b(\w+)\b[\s\r\n]*(\l[\s\r\n])+

To replace them, I need the word from the repeated-word phrase. pattern.sub(sentence, <what do i write here?>)

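2 Answers

Since you are using RegEx, I will provide a RegEx solution.

I will also show that you need to take care to remove punctuation first. (I will not go on to reinsert the punctuation where it was!)

RegEx solution: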

import re
sentence = 'I need need to learn regex... regex from scratch!'

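# remove punctuation
# the unicode flag makes \w cover non-ASCII letters (already the default for str in Python 3);
# it must be passed as flags=, because the fourth positional argument of re.sub is count
no_punc = re.sub(r'[^\w\s]', '', sentence, flags=re.UNICODE)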
print('No punctuation:', no_punc)

# remove duplicates
re_output = re.sub(r'\b(\w+)( \1\b)+', r'\1', no_punc)
print('No duplicates:', re_output)

Output:

No punctuation: I need need to learn regex regex from scratch
No duplicates: I need to learn regex from scratch
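
To see why the punctuation step matters, here is a minimal sketch that applies the same duplicate pattern directly to the raw sentence. The variable name raw_result is only illustrative. The "..." sits between the two occurrences of "regex", so the pattern does not treat them as adjacent and only "need need" is collapsed.

```python
import re

sentence = 'I need need to learn regex... regex from scratch!'
pattern = r'\b(\w+)( \1\b)+'

# Without stripping punctuation first, "regex... regex" is left untouched:
# the pattern requires exactly one space between the repeated words.
raw_result = re.sub(pattern, r'\1', sentence)
print(raw_result)  # I need to learn regex... regex from scratch!
```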

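\b : matches a word boundary
\w : matches any word character (letter, digit, or underscore)
\1 : refers back to whatever the first group (the first set of parentheses) matched; in the replacement string it inserts that word once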

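The parts in parentheses are called groups, and you can do things like name them and refer back to them later in the regex. Because of the + after the second group, the pattern matches any number of consecutive repeats in one go, so if a word appears 10 times in a row, the whole run is replaced with a single occurrence.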
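
As a hedged sketch of the named-group idea, the same pattern can be compiled with a group called word (a name chosen here purely for illustration). It also shows the argument order of a compiled pattern's sub method, which takes the replacement first and the input string second, and it checks the ten-in-a-row case.

```python
import re

# (?P<word>...) names the group; (?P=word) is the backreference inside the
# pattern, and \g<word> refers to the group in the replacement string.
pattern = re.compile(r'\b(?P<word>\w+)( (?P=word)\b)+')

# Ten copies of "very" in a row, followed by "good".
text = 'very ' * 9 + 'very good'

# pattern.sub(replacement, string): replacement comes first.
collapsed = pattern.sub(r'\g<word>', text)
print(collapsed)  # very good
```

Whether named groups are worth it for a pattern this small is a matter of taste; they mainly help once a pattern has several groups.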

For more detailed definitions of the regex patterns, see the documentation for Python's re module.

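A more Pythonic (looking) way

I have to say, the groupby method has a certain Python-zen feel about it! Simple, easy to read, beautiful.

Here I just show another way of removing the punctuation, making use of the string module, translating any punctuation characters into None (which removes them):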

from itertools import groupby
import string

sentence = 'I need need to learn regex... regex from scratch!'

# Remove punctuation: map every punctuation character to None
sent_map = str.maketrans(dict.fromkeys(string.punctuation))
sent_clean = sentence.translate(sent_map)
print('Clean sentence:', sent_clean)

# groupby yields one key per run of consecutive identical words
no_dupes = [k for k, _ in groupby(sent_clean.split())]
print('No duplicates:', no_dupes)

# Put the list back together into a sentence
groupby_output = ' '.join(no_dupes)
print('Final output:', groupby_output)

# At least for this toy example, the outputs are identical
# (re_output is the result of the RegEx solution above):
print('Identical output:', re_output == groupby_output)

Output:

Clean sentence: I need need to learn regex regex from scratch
No duplicates: ['I', 'need', 'to', 'learn', 'regex', 'from', 'scratch']
Final output: I need to learn regex from scratch
Identical output: True

Benchmark

Out of curiosity, I put the lines above into functions and ran a simple benchmark:

RegEx:

In [1]: %timeit remove_regex(sentence)
8.17 µs ± 88.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Groupby:

In [2]: %timeit remove_groupby(sentence)
5.89 µs ± 527 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

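I had read that regular expressions would be faster these days (with Python 3.6), but in this case sticking with the beautiful code seems to pay off!

Disclaimer: the example sentence was very short. This result might not scale to sentences with more or fewer repeated words and punctuation!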
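
The two benchmarked functions are not shown in the answer. The following is one plausible reconstruction, assuming they simply wrap the two snippets above; the bodies, the constant N_CALLS, and the use of the timeit module in place of IPython's %timeit are all assumptions. No timings are quoted here because they depend on the machine and Python version.

```python
import re
import string
import timeit
from itertools import groupby

def remove_regex(sentence):
    # Assumed body: the RegEx solution wrapped in a function.
    no_punc = re.sub(r'[^\w\s]', '', sentence)
    return re.sub(r'\b(\w+)( \1\b)+', r'\1', no_punc)

def remove_groupby(sentence):
    # Assumed body: the groupby solution wrapped in a function.
    table = str.maketrans(dict.fromkeys(string.punctuation))
    clean = sentence.translate(table)
    return ' '.join(k for k, _ in groupby(clean.split()))

sentence = 'I need need to learn regex... regex from scratch!'
N_CALLS = 10_000  # hypothetical loop count, smaller than %timeit's 100000

for func in (remove_regex, remove_groupby):
    seconds = timeit.timeit(lambda: func(sentence), number=N_CALLS)
    print(f'{func.__name__}: {seconds / N_CALLS * 1e6:.2f} microseconds per call')
```

Swapping in a longer sentence, or one with many repeats, is an easy way to check whether the ranking holds beyond this toy input.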

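As far as I understand, doing this with regex rules can be tricky and a bit slow. In Python, I use this and it works perfectly:

from itertools import groupby
line = 'I need need to learn regex regex from scratch'  # example input (assumed)
line_split = [k for k, v in groupby(line.split())]  # one word per consecutive run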
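
One thing worth keeping in mind is that groupby only merges words that sit directly next to each other, and it compares them exactly. The sketch below illustrates both points on a made-up sentence; the case-insensitive variant using key=str.lower is an assumed extension, not part of either answer.

```python
from itertools import groupby

line = 'the the cat saw the The dog dog'

# A new group starts whenever the word changes, so only back-to-back
# repeats are merged; the later "the" and the capitalised "The" both stay.
exact = ' '.join(k for k, _ in groupby(line.split()))
print(exact)   # the cat saw the The dog

# Assumed variant: compare case-insensitively and keep the first
# spelling that appears in each run.
folded = ' '.join(next(g) for _, g in groupby(line.split(), key=str.lower))
print(folded)  # the cat saw the dog
```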
