Tokenization of data in a dataframe in Python

data-mining  python  dataframe  tokenization
2022-02-23 15:39:05

I am tokenizing each row in my dataframe, but only the first row actually gets tokenized. Can someone help me? Thank you. Here is my code:


import pandas as pd
import json
import nltk

nltk.download('punkt')
nltk.download('wordnet')
from nltk import sent_tokenize, word_tokenize


with open(r"C:\Users\User\Desktop\Coding\results.json" , encoding="utf8") as f:
     data = json.load(f)
df=pd.DataFrame(data['part'][0]['comment'])
split_data = df["comment"].str.split(" ")
data = split_data

print(data)

def tokenization_s(data):  # the same can be done for word tokens
    s_new = []
    for sent in data[:][0]:  # for NumPy: sentences[:]
        s_token = sent_tokenize(sent)
        if s_token != '':
            s_new.append(s_token)
    return s_new

print(tokenization_s(data))

My output is:

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
0                             [enjoy, a, lovely, moment]
1      [I, was, there, for, my, honeymoon., The, hote...
2      [Had, an, amazing, stay, for, 2, nights.\nThe,...
3                 [Had, a, good, time., Food, is, good.]
4      [A, highly, recommendable, hotel., Value, for,...
                             ...                        
131    [Wonderful, experience,, a, quite, different, ...
132                            [Was, a, paradise, stay.]
133    [It, was, really, a, place, to, be, for, relax...
134    [It, was, just, perfect, with, an, excellent, ...
135                               [It's, was, excellent]
Name: comment, Length: 136, dtype: object
[['enjoy'], ['a'], ['lovely'], ['moment']]

Process finished with exit code 0

What should I do so that every row in the dataframe gets tokenized?

1 Answer

The problem is in your loop: `data[:][0]` selects only the element at index 0 of the Series, so the function only ever walks over the words of the first comment. To tokenize every row, apply the tokenizer row by row instead. You can try this:

import pandas as pd
import nltk

df = pd.DataFrame({'frases': [
    'Do not let the day end without having grown a little,',
    'without having been happy, without having increased your dreams',
    'Do not let yourself be overcomed by discouragement.',
    'We are passion-full beings.'
]})

# Apply word_tokenize to every row of the column
df['tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['frases']), axis=1)
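Applied to your data, the same pattern would look roughly like the sketch below. This is a minimal sketch, not your exact setup: the two sample comments are taken from your printed output as a stand-in for the dataframe built from results.json. Note that `word_tokenize` and `sent_tokenize` expect a raw string, so apply them to the original `comment` column, not to the result of `.str.split(" ")`.

import pandas as pd
import nltk
from nltk import sent_tokenize, word_tokenize

nltk.download('punkt')

# Stand-in for the dataframe built from results.json in the question
df = pd.DataFrame({'comment': ['enjoy a lovely moment',
                               'Had a good time. Food is good.']})

# Tokenize every row, not just the first one
df['word_tokens'] = df['comment'].apply(word_tokenize)   # word tokens per row
df['sent_tokens'] = df['comment'].apply(sent_tokenize)   # sentence tokens per row

print(df)

Using `Series.apply` (or the `df.apply(..., axis=1)` form above) runs the tokenizer once per row, which is exactly what the loop over `data[:][0]` failed to do.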