NLP analysis for keyword clustering
I have a set of search-engine keywords, and I want to write a Python script that classifies and tags them under categories that are not known in advance.
To be clear, without knowing the categories beforehand (product, color, accessory, brand, ...), I should get an output like this:
+----------------------+----------+-------+-----------+--------+
| keyword              | product  | color | accessory | brand  |
+----------------------+----------+-------+-----------+--------+
| red shoes with heels | shoes    | red   | heels     |        |
| apple computer       | computer |       |           | apple  |
| Armani blue shoes    | shoes    | blue  |           | Armani |
| black mouse          | mouse    | black |           |        |
| gaming laptop        | computer |       |           |        |
+----------------------+----------+-------+-----------+--------+
Any suggestions on how I could achieve this?
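For clarity, here is the same target output expressed as a pandas DataFrame, hand-filled purely for illustration (the category column names are exactly what I do not know in advance):

import pandas as pd

# Hand-filled illustration of the desired output; in the real task the
# category columns (product, color, accessory, brand) are unknown and
# should emerge from the keywords themselves.
target = pd.DataFrame(
    [
        ['red shoes with heels', 'shoes',    'red',   'heels', None],
        ['apple computer',       'computer', None,    None,    'apple'],
        ['Armani blue shoes',    'shoes',    'blue',  None,    'Armani'],
        ['black mouse',          'mouse',    'black', None,    None],
        ['gaming laptop',        'computer', None,    None,    None],
    ],
    columns=['keyword', 'product', 'color', 'accessory', 'brand'],
)
print(target.to_string(index=False))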
I am currently using Word2Vec to find similarities between words, plus some APIs to identify brands and entities in the keywords:
import operator
from collections import defaultdict

import nltk
import pandas as pd
from gensim import models
from nltk.cluster import KMeansClusterer

# Pre-trained Italian Word2Vec model
model2 = models.Word2Vec.load('semantic_clustering/datasets/it/it.bin')

with open('tmp/kw_msm.txt') as f:
    kwlist = f.readlines()
kwlist = [x.strip() for x in kwlist]
# unique_words = list(set(word for kw in kwlist for word in kw.split(' ')))

# Word frequencies across all keywords
freq_words = defaultdict(int)
words = set()
for kw in kwlist:
    for word in kw.split(' '):
        freq_words[word] += 1
        words.add(word)
sorted_freq_words = sorted(freq_words.items(), key=operator.itemgetter(1), reverse=True)

# Sparse keyword/word incidence matrix: one row per keyword, one column per word
df = pd.DataFrame(columns=list(words))
df['keyword'] = kwlist
for i, row in df.iterrows():
    for w in row['keyword'].split():
        df.loc[i, w] = 1

def kw_in_vocab(word, model):
    # Assumed helper (not shown in the original snippet): vocabulary
    # check using the gensim < 4.0 API
    return word in model.wv.vocab

# Keep only words covered by the Word2Vec vocabulary and fetch their vectors
good_words = [w for w in words if kw_in_vocab(w, model2)]
KW_MODEL = model2[good_words]

# k-means over the word vectors, using cosine distance
NUM_CLUSTERS = 10
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(KW_MODEL, assign_clusters=True)

# Map each word to its cluster id and print, sorted by cluster
clusters = {}
for i, w in enumerate(good_words):
    clusters[w] = assigned_clusters[i]
sorted_clusters = sorted(clusters.items(), key=operator.itemgetter(1))
for k in sorted_clusters:
    print(k)
This is the piece of code I am working with: it builds a sparse word matrix and clusters the word columns with a fixed number of clusters. It is just a first test.
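As a possible next step (a minimal sketch, assuming the clusters dict, kwlist, and NUM_CLUSTERS produced above), the word-level cluster assignments could be mapped back onto each keyword, so that every cluster becomes a provisional, still unnamed category column with the shape of the target table:

# Minimal sketch, not part of the snippet above: turn the word->cluster
# mapping into one provisional column per cluster, so each keyword row
# collects the words it contains from that cluster.
import pandas as pd

def tag_keywords(kwlist, clusters, num_clusters):
    rows = []
    for kw in kwlist:
        row = {'keyword': kw}
        for word in kw.split(' '):
            c = clusters.get(word)      # None if the word was out of vocabulary
            if c is not None:
                col = 'cluster_%d' % c  # provisional category name
                row[col] = (row.get(col, '') + ' ' + word).strip()
        rows.append(row)
    cols = ['keyword'] + ['cluster_%d' % c for c in range(num_clusters)]
    return pd.DataFrame(rows, columns=cols)

tagged = tag_keywords(kwlist, clusters, NUM_CLUSTERS)
print(tagged.head().to_string(index=False))

The cluster columns would still need human-readable names, e.g. by looking at the most frequent or most central words in each cluster, but the table would already have the shape of the desired output.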