Grouping similar-looking text

Data mining, Python, pandas, text

2022-03-04 13:34:27

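I have a data frame with two columns, 'title' and 'description'. The title column holds titles of clinical laboratory tests. Unfortunately, most of them are duplicates of the same test, but small variations in wording make each title appear unique.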

values = [('Complete blood picture', 'AB'), ('Complete BLOOD test', 'AB'), ('blood glucose', 'AB'), ('COMplete blood Profile', 'AB')]

labels = ['title', 'description']
import pandas as pd
labtest = pd.DataFrame.from_records(values, columns = labels) # Create data frame

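This is what the data frame looks like. [The actual data set has many such titles; this is just a sample for the purpose of this question.]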

Title                       Description
Complete blood test         AB
COMPLETE Blood test         AB
Blood glucose               AB
Complete blood picture      AB

This is what I would like the data frame to look like:

Title                       Description
Blood test                  AB
Blood test                  AB
Blood test                  AB
Blood test                  AB

I want to search each title for the word 'blood' and, if it is found, replace the whole title with 'Blood test'. Is there a way to do this?

3 Answers

One possible solution is the following:

import re
import pandas as pd

pattern = re.compile('blood', re.IGNORECASE)

def change(text):
    # Replace the whole title when it contains 'blood' in any case
    if pattern.search(text):
        return 'Blood test'
    else:
        return text

values = [('Complete blood picture', 'AB'), ('Complete BLOOD test', 'AB'), ('blood glucose', 'AB'), ('COMplete blood Profile', 'AB')]
labels = ['title', 'description']

# Create data frame
labtest = pd.DataFrame.from_records(values, columns=labels)
labtest['title'] = labtest['title'].apply(change)

print(labtest)

The output is:

        title description
0  Blood test          AB
1  Blood test          AB
2  Blood test          AB
3  Blood test          AB

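The first line imports Python's re (regular expression) module. The line:

pattern = re.compile('blood', re.IGNORECASE)

creates a regular expression that matches the string 'blood', ignoring case. The function change replaces the input text with 'Blood test' if 'blood' is found in it, and returns the text unchanged otherwise. Finally, the apply method of the pandas Series labtest['title'] transforms the column: as the name suggests, it "applies" the function change to every value in the 'title' column.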
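To see both branches of change in isolation, here is a minimal sketch that runs the same logic on a plain list, without pandas. The title 'Urine culture' is made up for illustration and is not part of the question's data.

```python
import re

# Same pattern as in the answer: matches 'blood' anywhere in the text, in any case.
pattern = re.compile('blood', re.IGNORECASE)

def change(text):
    """Return 'Blood test' if the text mentions blood, else the text unchanged."""
    return 'Blood test' if pattern.search(text) else text

# 'Urine culture' is a hypothetical title added to exercise the no-match branch.
titles = ['Complete BLOOD test', 'blood glucose', 'Urine culture']
changed = [change(t) for t in titles]
print(changed)  # ['Blood test', 'Blood test', 'Urine culture']
```

Because the pattern is a plain substring match, any title containing the letters 'blood' is rewritten, whatever surrounds them.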

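For more on using regular expressions in Python and the apply method in pandas, see the Python re documentation and the pandas apply documentation. If you want to learn more about text processing in Python, I suggest looking at the pointers in this question.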

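Pandas can do this string comparison directly and then use the result of the comparison to find the matching rows so that they can be set. This can be done in a single expression:

Code: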

labtest['title'][labtest['title'].str.contains('blood', case=False)] = 'Blood test'

How does this work?

Working from the inside out, we have:

Select the 'title' column and access it as a vector of strings:
```
labtest['title'].str
```

Convert the string vector into a boolean vector:

labtest['title'].str.contains('blood', case=False)

Use the boolean vector to select specific rows in the title column:

labtest['title'][labtest['title'].str.contains('blood', case=False)]

Assign the desired new value to those cells:

labtest['title'][labtest['title'].str.contains('blood', case=False)] = 'Blood test'
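The expression above indexes twice (first the column, then the rows). The pandas documentation warns that this kind of chained assignment can end up writing to a temporary copy, depending on the pandas version and its copy-on-write setting. A minimal sketch of the same steps with a single .loc call, which avoids that ambiguity, might look like this. The variable name mask and the shortened sample data are choices made for this sketch only.

```python
import pandas as pd

# Shortened sample; the last row contains a typo and should not match.
labtest = pd.DataFrame(
    [('Complete blood picture', 'AB'),
     ('Complete BLOOD test', 'AB'),
     ('bloud glucose', 'AB')],
    columns=['title', 'description'],
)

# Boolean Series: True where the title contains 'blood', ignoring case.
mask = labtest['title'].str.contains('blood', case=False)

# Select rows and column in one .loc call, then assign the new value.
labtest.loc[mask, 'title'] = 'Blood test'

print(labtest)
```

Keeping the mask in its own variable also makes it easy to inspect which rows will change before assigning.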

Test code:

import pandas as pd

values = [
    ('Complete blood picture', 'AB'),
    ('Complete BLOOD test', 'AB'),
    ('blood glucose', 'AB'),
    ('COMplete blood Profile', 'AB'),
    ('bloud glucose', 'AB'),
]
labels = ['title', 'description']

# Create data frame
labtest = pd.DataFrame.from_records(values, columns=labels)

labtest['title'][labtest['title'].str.contains('blood', case=False)] = 'Blood test'

print(labtest)

Test results:

           title description
0     Blood test          AB
1     Blood test          AB
2     Blood test          AB
3     Blood test          AB
4  bloud glucose          AB
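Note that row 4 ('bloud glucose') is left unchanged, because str.contains only finds exact substrings. If misspellings like this should be grouped as well, one possible approach (not part of the answer above) is approximate matching with difflib from the standard library. The helper name mentions_blood, the 0.75 cutoff, and the title 'Urine culture' are all assumptions chosen for this sketch and would need tuning on real data.

```python
import difflib

def mentions_blood(title, cutoff=0.75):
    """Hypothetical helper: True if any word in the title is close to 'blood'."""
    words = title.lower().split()
    # get_close_matches returns the words whose similarity ratio is >= cutoff.
    return bool(difflib.get_close_matches('blood', words, n=1, cutoff=cutoff))

titles = ['Complete blood picture', 'bloud glucose', 'Urine culture']
flags = [mentions_blood(t) for t in titles]
print(flags)  # [True, True, False]
```

A lower cutoff catches more typos but also raises the risk of grouping unrelated words, so the threshold is a trade-off rather than a fixed value.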

Another solution:

new_values = []
for tup in values:
    # Case-insensitive substring test on the title
    if tup[0].lower().find('blood') >= 0:
        new_values.append(['Blood test', tup[1]])
    else:
        new_values.append([tup[0], tup[1]])

This basically takes your list of values and builds a new_values list with the text replaced. See below:

values =     [('Complete blood picture', 'AB'), ('Complete BLOOD test', 'AB'), ('blood glucose', 'AB'), ('COMplete blood Profile', 'AB')]
new_values = [['Blood test', 'AB'], ['Blood test', 'AB'], ['Blood test', 'AB'], ['Blood test', 'AB']]

So now you can either build the data frame from new_values ( pd.DataFrame.from_records(new_values, columns = labels) ) or use it to replace values.
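The same loop can also be written as a list comprehension with tuple unpacking. This is only a sketch of the approach above, not a different method; the extra row ('Urine culture', 'AB') is hypothetical and is added to show that non-matching titles pass through unchanged.

```python
values = [
    ('Complete blood picture', 'AB'),
    ('Complete BLOOD test', 'AB'),
    ('blood glucose', 'AB'),
    ('Urine culture', 'AB'),  # hypothetical row without 'blood'
]

# Keep the description; replace the title only when it mentions 'blood'.
new_values = [
    ('Blood test', desc) if 'blood' in title.lower() else (title, desc)
    for title, desc in values
]
print(new_values)
```

Unpacking each tuple into title and desc avoids the positional tup[0] and tup[1] lookups and keeps the condition readable.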
