Data mining - How do I apply TF-IDF to a structured dataset in Python? - 吾爱随笔录

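I know that TF-IDF is an NLP technique for feature extraction.

I also know that there are libraries that compute TF-IDF directly from text.

That is not what I'm looking for.

In my case, the text dataset has already been converted to a bag of words.

The original dataset, which I can NOT access, looks like this: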

RepID     RepText
------------------
1         Doctor sys patient has diabetes and needs rest for ...
2         Patients history: broken arm, and ...
3         A dose of Metformin 2 times a day ...
4         Xray needed for the chest...
5         Covid-19 expectation and patient should have a rest ...

But my dataset looks like this:

RepID   Word         BOW
-------------------------
1       Doctor       3
1       diabetes     4
1       patient      1
.       .            .
.       .            .
2       patient      2
2       arm          7
.       .            .
.       .            .
5684    cough        9
5684    Xray         3
5684    Covid        5
.       .            .
.       .            .

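What I want is to compute the TF-IDF of each word in my dataset.

I'm considering converting my dataset back to an unstructured format,

so that it looks like this: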

RepID     RepText
------------------
1         Doctor Doctor Doctor diabetes diabetes diabetes diabetes ...
2         Patients patients arm arm arm arm arm arm arm ...
.
.
5684      cough cough cough cough cough cough cough cough cough Xray Xray

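So each word is repeated as many times as its BOW count.

But I don't think this is the best approach, since I would be converting a structured dataset back into an unstructured one.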

How can I compute TF-IDF from the structured dataset? Is there a library or an algorithm for this?

Note:

The dataset is stored in MS SQL Server, and I am using Python.

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# input data
df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5]
})

# count matrix
df = pd.pivot_table(df, index='RepID', columns='Word', values='BOW', aggfunc='sum')
df = df.fillna(value=0)
print(df)
# Word   Covid  Doctor  Xray  arm  cough  diabetes  patient
# RepID
# 1        0.0     3.0   0.0  0.0    0.0       4.0      1.0
# 2        0.0     0.0   0.0  7.0    0.0       0.0      2.0
# 5684     5.0     0.0   3.0  0.0    9.0       0.0      0.0

# tf-idf transform
X = TfidfTransformer().fit(df.values)
print(X.idf_)
# [1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.28768207]
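
The snippet above prints only the idf factors. As a possible follow-up, here is a minimal sketch of getting one TF-IDF weight per (RepID, Word) pair back in the original long layout. It assumes scikit-learn's default settings (smoothed idf and L2 normalisation of each report) and builds a SciPy sparse matrix instead of a dense pivot, which may matter with thousands of reports. The names `long_df`, `counts` and `result` are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Same toy long-format table as above (RepID, Word, BOW).
long_df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5],
})

# Map reports and words to integer row / column codes.
row_codes, reports = pd.factorize(long_df['RepID'])
col_codes, words = pd.factorize(long_df['Word'])

# Sparse report-by-word count matrix built straight from the long table.
counts = csr_matrix(
    (long_df['BOW'].to_numpy(), (row_codes, col_codes)),
    shape=(len(reports), len(words)),
)

# TF-IDF weights with scikit-learn defaults; COO format exposes row/col/data.
tfidf = TfidfTransformer().fit_transform(counts).tocoo()

# Back to long format: one TF-IDF weight per (RepID, Word) pair.
result = pd.DataFrame({
    'RepID': reports[tfidf.row],
    'Word': words[tfidf.col],
    'TFIDF': tfidf.data,
})
print(result)
```

With the default L2 normalisation the weights of each report have unit length, so they are comparable across reports of different sizes; passing `norm=None` to `TfidfTransformer` would instead give the raw count times idf products.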