Facing a difficult regex issue when cleaning text data

data-mining python text-mining data-cleaning regex
2022-02-10 07:54:57

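I am trying to replace a sequence of words with some symbol inside long strings that appear across multiple documents. For example, suppose I want to remove: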

Decision and analysis and comments

from a long string. Let the string be:

s = Management's decision and analysis and comments is to be removed.

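I want to remove Decision and analysis and comments from s. The problem is that between the words Decision, and, analysis, and, comments there may be 0, 1, or more spaces and newlines (\n), varying across documents without any pattern. For example, one document shows: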

Management's decision  \n \n and analysis\n and \n comments is to be removed

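while another has a different pattern. How can I work around this and still remove the phrase from the string?

I tried the following, which of course did not work: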

import re
s = "Management's decision  \n \n and analysis\n and  \n comments is to be removed"
re.sub(r'Decision[\s\n]and[\s\n]analysis[\s\n]and[\s\n]comments', '', s)
1 Answer

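To match runs of several whitespace characters, you need [\s\n]+; note the + (match one or more). Since \s already matches \n, \s+ is equivalent. The match also has to be case-insensitive, because the pattern starts with Decision while the text contains decision.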
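As a quick check of that point, here is a minimal sketch comparing the two patterns on the string from the question. The variable names single and multi are only illustrative.

```python
import re

s = "Management's decision  \n \n and analysis\n and  \n comments is to be removed"

# Exactly one whitespace character per gap: fails, since the gaps are longer.
single = r'decision[\s\n]and[\s\n]analysis[\s\n]and[\s\n]comments'
# One or more whitespace characters per gap; \s already covers \n.
multi = r'decision\s+and\s+analysis\s+and\s+comments'

print(re.search(single, s, flags=re.IGNORECASE))       # no match found
print(repr(re.sub(multi, '', s, flags=re.IGNORECASE)))  # phrase removed
```

Note that this version leaves the spaces on both sides of the phrase in place, so the result contains a double space.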

Code:

Here is a function that builds the regex automatically from a text snippet:

import re

def remove_words(to_clean, words, flags=re.IGNORECASE):
    regex = r'[\s\n]+'.join([''] + words.split() + [''])
    # flags must be a keyword argument; the 4th positional argument of re.sub is count
    return re.sub(regex, ' ', to_clean, flags=flags)

Test code:

st = "Management's decision  \n \n and analysis\n " \
     "and  \n comments is to be removed"
print(remove_words(st, 'decision and analysis and comments'))

Result:

Management's is to be removed
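The question also mentions that there may be zero whitespace characters between the words. One possible variant for that case is sketched below. The name remove_phrase is hypothetical, and the sketch assumes the phrase starts and ends with a word character so that \b behaves as intended. It uses \s* between words and re.escape so that any regex metacharacters in the phrase are matched literally.

```python
import re

def remove_phrase(text, phrase, replacement=' '):
    # Escape each word so characters such as '.' or '(' are matched literally.
    words = [re.escape(w) for w in phrase.split()]
    # \s* tolerates zero or more whitespace characters between words;
    # \b keeps the match from starting or ending in the middle of a word.
    pattern = r'\s*\b' + r'\s*'.join(words) + r'\b\s*'
    return re.sub(pattern, replacement, text, flags=re.IGNORECASE)

st = "Management's decision  \n \n and analysis\n and  \n comments is to be removed"
print(remove_phrase(st, 'decision and analysis and comments'))
```

Allowing zero whitespace is more permissive: the words of the phrase may run together in the text and still be removed, so it is worth checking a few documents before applying it in bulk.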