Automatic semantic clustering and labeling of sentences using NLP

data-mining python nlp clustering word2vec
2022-03-07 05:42:46

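NLP analysis for keyword clustering

I have a set of search engine keywords, and I would like to write a Python script that classifies and labels them under categories that are not known in advance.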

To be clear, without knowing the categories beforehand (product, color, accessory, brand, ...), I would like to get output like this:
+----------------------+----------+-------+-----------+--------+
| Keyword              | Product  | Color | Accessory | Brand  |
+----------------------+----------+-------+-----------+--------+
| red shoes with heels | shoes    | red   | heels     |        |
| apple computer       | computer |       |           | apple  |
| Armani blue shoes    | shoes    | blue  |           | Armani |
| black mouse          | mouse    | black |           |        |
| gaming laptop        | computer |       |           |        |
+----------------------+----------+-------+-----------+--------+

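Any suggestions on how I could achieve this?

I am currently using Word2Vec to find similarities between words, plus some APIs to identify brands and entities in the keywords.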

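    from collections import defaultdict
    import operator

    import nltk
    import pandas as pd
    from gensim import models
    from nltk.cluster import KMeansClusterer

    def kw_in_vocab(word, model):
        # Assumed helper (its definition is not shown in the post):
        # True if the model has a vector for `word`.
        return word in model.wv

    # Load the pre-trained Word2Vec model and the keyword list (one keyword per line)
    model2 = models.Word2Vec.load('semantic_clustering/datasets/it/it.bin')
    with open('tmp/kw_msm.txt') as f:
        kwlist = [line.strip() for line in f]

    # Count word frequencies and collect the unique words
    # (equivalent: unique_words = list({word for kw in kwlist for word in kw.split()}))
    freq_words = defaultdict(int)
    words = set()
    for kw in kwlist:
        for word in kw.split():
            freq_words[word] += 1
            words.add(word)
    sorted_freq_words = sorted(freq_words.items(), key=operator.itemgetter(1), reverse=True)

    # Binary keyword-by-word matrix: one row per keyword, one column per word
    df = pd.DataFrame(columns=sorted(words))
    df['keyword'] = kwlist
    for i, row in df.iterrows():
        for w in row['keyword'].split():
            df.loc[i, w] = 1

    # Keep only the words the model has a vector for
    good_words = [w for w in sorted(words) if kw_in_vocab(w, model2)]

    # Cluster the word vectors with k-means under cosine distance.
    # Vectors are read through model2.wv, since recent gensim versions
    # no longer support indexing the model directly.
    KW_MODEL = model2.wv[good_words]
    NUM_CLUSTERS = 10
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
    assigned_clusters = kclusterer.cluster(KW_MODEL, assign_clusters=True)

    # Map each word to its cluster id and print the words grouped by cluster
    clusters = {}
    for i, w in enumerate(good_words):
        clusters[w] = assigned_clusters[i]
    sorted_clusters = sorted(clusters.items(), key=operator.itemgetter(1))
    for k in sorted_clusters:
        print(k)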

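This is the code I am using at the moment: it builds a sparse keyword-by-word matrix and clusters the words (the matrix columns) into a fixed number of clusters. It is only a first test.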
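
To make the goal concrete, here is a minimal sketch of how per-word cluster ids could be folded back into one row per keyword, which is roughly the shape of the table I am after. The function name `keywords_by_cluster` and the toy cluster assignments are made up for illustration; in practice the mapping would come from the clustering step. The cluster ids are still just numbers, so giving them names such as product, color or brand remains the open part.

```python
from collections import defaultdict


def keywords_by_cluster(keywords, word_to_cluster):
    """Return {keyword: {cluster_id: [words of that keyword in the cluster]}}."""
    rows = {}
    for kw in keywords:
        row = defaultdict(list)
        for word in kw.split():
            cluster_id = word_to_cluster.get(word)
            # Words without a cluster (e.g. not in the model vocabulary) are skipped.
            if cluster_id is not None:
                row[cluster_id].append(word)
        rows[kw] = dict(row)
    return rows


# Hand-made toy assignments standing in for the output of the clustering step.
toy_clusters = {
    "shoes": 0, "computer": 0, "mouse": 0,
    "red": 1, "blue": 1, "black": 1,
    "armani": 2, "apple": 2,
}
toy_keywords = ["red shoes", "armani blue shoes", "black mouse", "apple computer"]

table = keywords_by_cluster(toy_keywords, toy_clusters)
for kw, row in table.items():
    print(kw, row)
```

Each cluster id then plays the role of one unnamed column, and a keyword can contribute several words to the same column.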
