Data mining - Topic modelling in an existing dataframe in Python - 吾爱随笔录

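I am trying to perform topic extraction on a pandas dataframe. I am using LDA topic modelling to extract the topics in the dataframe, and that part works fine.

However, I would like to apply the LDA topic model to each row of my dataframe.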

Current dataframe:

date	cust_id	words
2019-03-14	100001	samantha slip skirt pi ski
2020-01-21	10002	steel skirt solid green
2020-05-19	10003	arizona denim shirt d

The dataframe I am looking for:

date	cust_id	words	topic 0 words	topic 0 weight
2019-03-14	100001	samantha slip skirt pi ski	skirt	0.5
2020-01-21	10002	steel skirt solid green	greenish	0.2
2020-05-19	10003	arizona denim shirt d	denim	0.1

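from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tokens are words, dollar amounts, or any other run of non-whitespace characters
vectorizer = CountVectorizer(max_df=0.9, min_df=20, token_pattern=r'\w+|\$[\d.]+|\S+')

tf = vectorizer.fit_transform(features['words']).toarray()

# get_feature_names() was removed in scikit-learn 1.2
tf_feature_names = vectorizer.get_feature_names_out()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=1111)

model.fit(tf)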

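I tried to merge the two dataframes together, but it did not work.
How can I add each topic's words and each topic's weight as columns, so that they appear on every row?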

I posted this question on Stack Overflow: https://stackoverflow.com/questions/71476309/topic-modelling-in-an-existing-dataframe-in-python

import pandas as pd

def format_topics_sentences(ldamodel, corpus, texts):
    rows = []
    for row in ldamodel[corpus]:
        # Sort this document's (topic, probability) pairs, most probable first
        row = sorted(row, key=lambda x: x[1], reverse=True)
        if row:
            # Dominant topic, its share of the document, and its keywords
            topic_num, prop_topic = row[0]
            wp = ldamodel.show_topic(topic_num)
            topic_keywords = ", ".join(word for word, prop in wp)
            rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])

    # DataFrame.append was removed in pandas 2.0, so rows are collected in a list first
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the original text as the last column
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

# optimal_model, corpus and df are defined elsewhere; texts should be one-dimensional
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=df)

df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(5)
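One likely reason a merge or concat "does not work" here is index alignment: `pd.concat(..., axis=1)`, as used in the function above, pairs rows by index label rather than by position. If the texts come from a filtered or re-indexed dataframe, the labels no longer match the fresh 0..n-1 index of the results. The sketch below illustrates this with made-up values; `topics`, `texts` and the index labels 3, 7, 9 are hypothetical.

```python
import pandas as pd

# Hypothetical per-document results, built with a fresh 0..n-1 index.
topics = pd.DataFrame({
    "Dominant_Topic": [0, 1, 0],
    "Perc_Contribution": [0.5, 0.2, 0.1],
})

# Hypothetical texts taken from a filtered dataframe, so the index is 3, 7, 9.
texts = pd.Series(
    ["samantha slip skirt pi ski", "steel skirt solid green", "arizona denim shirt d"],
    index=[3, 7, 9],
    name="Text",
)

# The index labels do not match, so concat returns 6 rows padded with NaN.
misaligned = pd.concat([topics, texts], axis=1)

# Resetting the index makes both sides 0..n-1, so rows pair up by position.
aligned = pd.concat([topics, texts.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

If the original index has to be kept, another option is to set it on the results instead, for example `topics.index = texts.index`, before concatenating.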