Data mining - Collapsing a list to the most common spelling

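I have a user-generated list of proper names. For the sake of discussion, imagine they are pet names.

The list contains many variations of the same name: misspellings, alternate spellings, and even random use of punctuation or whitespace.

For example: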

Fluffy
Fluffy the Dog
Flufy Dog
Fllufy

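I can count each spelling and see, for example, that Fluffy is the most common, appearing 2000 times, while each of the others appears fewer than 100 times.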
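The counting step could look roughly like the sketch below. It uses only the standard library; `raw_names` is a made-up sample for illustration, not the real data, so the counts are far smaller than the ones quoted above.

```python
from collections import Counter

# Hypothetical sample of raw user input (illustration only).
raw_names = [
    "Fluffy", "Fluffy", "Fluffy",
    "Fluffy the Dog", "Flufy Dog", "Fllufy",
]

# Count how often each exact spelling occurs.
counts = Counter(raw_names)

# The most frequent spelling and its count.
most_common_name, most_common_count = counts.most_common(1)[0]
print(most_common_name, most_common_count)
```

A `Counter` is just a dict of spelling to frequency, so it can be passed directly to whatever step decides which variants collapse into which name.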

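I tried collapsing the list with a Levenshtein distance package (python-Levenshtein), with mixed results. If the Levenshtein distance is < 3 and the "more likely" name has a higher count, I assign the "more likely" name to the variant. When I manually search for some known common names, I can separate the good Levenshtein matches from the bad ones by hand, and I also find variants that Levenshtein left off my list entirely (for example, Fluffy the Dog is a large distance of 8 from Fluffy).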
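The rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code actually used: `edit_distance` is a pure-Python stand-in for the distance function of python-Levenshtein, `collapse` is a hypothetical helper name, and the counts other than Fluffy's 2000 are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def collapse(counts: dict, max_distance: int = 3) -> dict:
    """Map each name to the most frequent name fewer than max_distance edits away."""
    mapping = {}
    for name in counts:
        best = name
        for other in counts:
            if counts[other] > counts[best] and edit_distance(name, other) < max_distance:
                best = other
        mapping[name] = best
    return mapping


# Hypothetical counts: only Fluffy's 2000 comes from the post.
counts = {"Fluffy": 2000, "Fluffy the Dog": 40, "Flufy Dog": 12, "Fllufy": 7}
mapping = collapse(counts)
for name, target in mapping.items():
    print(f"{name!r} -> {target!r}")
```

On this sample, Fllufy is folded into Fluffy, but Fluffy the Dog and Flufy Dog stay unmatched because every extra word adds its full length to the edit distance. That is the weakness described above. The nested loop is quadratic in the number of distinct names, which may matter for a large list.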

Is there a better technique I should try?

# Assumed import: fuzz and process as provided by the fuzzywuzzy package.
from fuzzywuzzy import fuzz, process

# df is indexed by name and has a 'count' column.
names = df.index

# For each name, find the top 10 (arbitrary) matching strings with a
# score >= 90, then pick the one with the highest count.
def find_top_match(name):
    my_matches = process.extract(name, names, limit=10, scorer=fuzz.token_set_ratio)
    my_matches = [t[0] for t in my_matches if t[1] >= 90]
    top_match = df.loc[my_matches].sort_values(by='count', ascending=False).index[0]
    return top_match

top_matches_set = []
for c in df.index:
    returned_match = find_top_match(c)
    top_matches_set.append(returned_match)

df['real_name'] = top_matches_set