Get row-wise frequency counts of words from a list in a pandas text column

data-mining nlp text-mining word-embeddings python-3.x

2022-02-26 18:11:18

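I have a DataFrame with a column of audio transcripts from customer service phone conversations. I have created a list containing words and phrases: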

words = ["rain", "buy new house", "tornado"]

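What I need to do is create a column in the DataFrame that checks the text column row by row for these words and, where any of them occur, records each word together with its frequency. For example, take the text in the first row: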

"I was going to buy new house last week but it was raining since then. Once the rain stops I'll go and buy new house"

The new column should contain:

{"buy new house",2}, {"rain",2}

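Alternatively, the row could be duplicated, with the part after the comma placed in the next row.

I am still quite new to this. How should I go about it?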

1 Answer

Here is one way to handle the core logic:

def count_phrases(string: str, phrases: list) -> dict:
    "Count the occurrences of each phrase in a string."
    # str.count matches substrings, so "rain" also matches "raining".
    return {phrase: string.count(phrase) for phrase in phrases}

string = "I was going to buy new house last week but it was raining since then. Once the rain stops I'll go and buy new house"
phrases = ["rain", "buy new house", "tornado"]

assert count_phrases(string, phrases) == {'rain': 2, 'buy new house': 2, 'tornado': 0}

The function can then be applied to the text column of a pandas DataFrame with .apply.
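
As a rough sketch of that last step, the snippet below applies count_phrases to a text column and keeps only the phrases that actually occur, which is closer to the output the question asks for. The column names "transcript" and "phrase_counts", the second sample row, and the long_df layout are assumptions made for illustration; they do not come from the original post.

```python
import pandas as pd


def count_phrases(string: str, phrases: list) -> dict:
    """Count substring occurrences of each phrase in a string."""
    return {phrase: string.count(phrase) for phrase in phrases}


phrases = ["rain", "buy new house", "tornado"]

# Hypothetical data: the column name "transcript" and the second row
# are assumptions, used only to make the example self-contained.
df = pd.DataFrame({
    "transcript": [
        "I was going to buy new house last week but it was raining since then. "
        "Once the rain stops I'll go and buy new house",
        "There was a tornado warning yesterday",
    ]
})

# One dict per row, dropping phrases whose count is zero.
df["phrase_counts"] = df["transcript"].apply(
    lambda text: {p: n for p, n in count_phrases(text, phrases).items() if n > 0}
)

# Alternative layout: one row per (original row, phrase) pair,
# similar to the "duplicate rows" idea in the question.
long_df = pd.DataFrame(
    [
        {"row": idx, "phrase": phrase, "count": n}
        for idx, counts in df["phrase_counts"].items()
        for phrase, n in counts.items()
    ]
)
```

Because str.count matches substrings, "raining" contributes to the count for "rain", which agrees with the expected output in the question. If whole-word matching is wanted instead, a regular expression with word boundaries would be one option.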
