Data mining - Document clustering to merge common labels - 吾爱随笔录

I'm building a recommender system and need to clean up some of the labels I have. Taking my data as an example,

df['resolution_modified'].value_counts()

gives

105829
It is recommended to replace scanner                                                                                                 1732
It is recommended to reboot station                                                                                                  1483
It is recommended to replace printer                                                                                                  881
It is recommended to replace keyboard                                                                                                 700
                                                                                                                                    ...  
It is recommended to update both computers in erc to ensure y be compliant with acme                                                    1
It is recommended to configure and i have verify alignement printer be work now corrado                                                 1
It is recommended to create rma for break devices please see tt for more information resolve this in favor of rma ticket create         1
It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update                        1
It is recommended to switch out dpi head from break printers                                                                            1

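Note that "It is recommended to replace keyboard" and "It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update" are very similar. Ideally, I'd like to converge on the more frequently occurring string, so the second string should be converted into the first.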

I'm considering document clustering for this. I tried fuzzywuzzy, but because I have so many strings, the process below is far too slow:

from fuzzywuzzy import fuzz

def replace_similars(input_list):
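    # Replace every string that is at least 90% similar to an earlier string
    for i in range(len(input_list)):
        # Start at i + 1 so each pair is compared once (same result as i < j)
        for j in range(i + 1, len(input_list)):
            if fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]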

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping
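
# Use a sorted Python list: slicing a NumPy array with [:] returns a view,
# so generate_mapping would modify its input and produce an identity mapping.
res = sorted(df['resolution_modified'].unique())
mapping = generate_mapping(res)
for k, v in mapping.items():
    if k != v:
        df.loc[df['resolution_modified'] == k, 'resolution_modified'] = v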

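I'd like to know whether there is a document clustering approach that weights strings by how often they occur, so that only the rare strings related to a common one are converged into the more frequent string. Does anyone have a suggestion for which method to use?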
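To make the goal concrete, here is a minimal sketch of the behavior described above. It is not a clustering method, only an illustration of "rare labels converge to frequent ones". The function name merge_into_frequent, the min_ratio parameter, and the rule "a label merges into a more frequent label that it extends word-for-word or closely resembles" are all assumptions made for this example. It uses only the standard library (difflib instead of fuzzywuzzy) and skips the empty label.

```python
from collections import Counter
from difflib import SequenceMatcher


def merge_into_frequent(labels, min_ratio=0.9):
    """Map each label to a more frequent label it extends or closely resembles.

    Hypothetical helper: the merge rule and the 0.9 threshold are assumptions.
    """
    counts = Counter(labels)
    canonical = []  # labels kept unchanged, most frequent first
    mapping = {}
    # Visit labels from most to least frequent, so a label can only be
    # merged into one that occurs at least as often.
    for label, _ in counts.most_common():
        target = label
        if label.strip():  # the empty label is never merged or merged into
            for kept in canonical:
                extends = label.startswith(kept + " ")
                similar = SequenceMatcher(None, kept, label).ratio() >= min_ratio
                if extends or similar:
                    target = kept
                    break
            if target == label:
                canonical.append(label)
        mapping[label] = target
    return mapping


labels = (
    [""] * 5
    + ["It is recommended to replace keyboard"] * 3
    + ["It is recommended to reboot station"] * 2
    + ["It is recommended to replace keyboard manually clear hd space"]
)
mapping = merge_into_frequent(labels)
```

With pandas, the resulting dictionary could then be applied in one pass with df['resolution_modified'].map(mapping) instead of a loop over df.loc. Each label is compared only against the labels kept so far, which is less work than comparing all pairs but can still be slow when many distinct labels survive.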

What I've tried so far:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
v = TfidfVectorizer()
x = v.fit_transform(df['resolution_modified'])
kmeans = KMeans(n_clusters=2).fit(x)
test_strings = ['It is recommended to replace keyboard', 'It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update']
kmeans.predict(v.transform(test_strings))

This gives

array([1, 0], dtype=int32)

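Clearly this isn't working yet, since the two test strings land in different clusters. I'll try increasing the number of clusters.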