Get row-wise frequency counts of words from a list in a pandas text column

data-mining nlp text-mining word-embeddings python-3.x
2022-02-26 18:11:18

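I have a dataframe with a column of audio transcriptions from customer service phone conversations. I created a list containing words and phrases: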

words = ["rain", "buy new house", "tornado"]

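What I need to do is create a column in the dataframe that checks the text column row by row for these words and, if any of them occur, records each word together with its frequency. For example, for the first row of text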

"I was going to buy new house last week but it was raining since then. Once the rain stops I'll go and buy new house"

the column should be

{"buy new house",2}, {"rain",2}

Alternatively, the row could be duplicated, with the part after the comma placed in the next row.

I am still quite new to this, so how should I go about it?

1 Answer

Here is one way to handle the core logic:

from typing import Dict, List

def count_phrases(string: str, phrases: List[str]) -> Dict[str, int]:
    """Count the occurrences of each phrase in a string (substring match)."""
    return {phrase: string.count(phrase) for phrase in phrases}

string = "I was going to buy new house last week but it was raining since then. Once the rain stops I'll go and buy new house"
phrases = ["rain", "buy new house", "tornado"]

assert count_phrases(string, phrases) == {'rain': 2, 'buy new house': 2, 'tornado': 0}

The function can then be applied to a Pandas DataFrame with .apply.
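As a minimal sketch of that last step, the example below applies a counting function to each row of a text column. The column names `transcript` and `phrase_counts`, the second sample row, and the helper `count_present_phrases` are assumptions made for illustration; they are not from the question. The helper also drops phrases with a zero count, which is one possible way to match the output format the question asks for.

```python
import pandas as pd

def count_present_phrases(string, phrases):
    """Count substring occurrences of each phrase, keeping only phrases that appear."""
    counts = {phrase: string.count(phrase) for phrase in phrases}
    return {phrase: n for phrase, n in counts.items() if n > 0}

phrases = ["rain", "buy new house", "tornado"]

# Hypothetical data: the column name "transcript" is an assumption.
df = pd.DataFrame({
    "transcript": [
        "I was going to buy new house last week but it was raining since then. "
        "Once the rain stops I'll go and buy new house",
        "There was a tornado warning yesterday",
    ]
})

# Series.apply calls the function once per row; extra positional
# arguments are passed through the args tuple.
df["phrase_counts"] = df["transcript"].apply(count_present_phrases, args=(phrases,))
```

Note that `str.count` matches substrings and is case-sensitive, so "raining" counts toward "rain" (which is why the first row gives 2 for "rain"), while "Rain" would not be counted. If that is not wanted, lowercasing the text first or matching on word boundaries would be the places to adjust.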