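I have a column named "Plot" in my data frame, and I want to create a new column named "Keywords" that contains only the important words from the plot. Here is the code: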
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
df = pd.read_csv('IMDB_Top250Engmovies2_OMDB_Detailed.csv')
df = df[['Title','Genre','Director','Actors','Plot']]
df['Keywords'] = ''
for index,row in df.iterrows():
    plot = row['Plot']
    plot = re.sub('[^a-zA-Z]'," ", plot)
    plot = plot.lower()
    plot = plot.split()
    plot = [i for i in plot if not i in set(stopwords.words('english'))]
    plot = ' '.join(plot)
    row['Key_words'] = str(plot)
This is the output :(
Link to the CSV: https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7
Thanks!
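
A possible direction, sketched under assumptions rather than verified against the CSV above: the code writes to `row['Key_words']`, while the empty column created earlier is `Keywords`, and the rows yielded by `df.iterrows()` may be copies, so assigning to `row` is not guaranteed to write back into `df`. The sketch below builds the whole column at once with `apply`. The helper `extract_keywords`, the short `STOP_WORDS` set, and the two-row frame are hypothetical stand-ins so the example runs offline; with nltk the set would be `set(stopwords.words('english'))`, built once outside the loop.

```python
import re

import pandas as pd

# Hypothetical stand-in stopword list so the sketch runs without nltk data;
# in the question's setup this would be set(stopwords.words('english')).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to", "his", "by"}


def extract_keywords(plot, stop_words=STOP_WORDS):
    """Replace non-letters with spaces, lowercase, and drop stopwords."""
    words = re.sub(r"[^a-zA-Z]", " ", plot).lower().split()
    return " ".join(w for w in words if w not in stop_words)


# Made-up two-row frame standing in for the CSV from the question.
df = pd.DataFrame({
    "Title": ["Example A", "Example B"],
    "Plot": [
        "The life of a banker in prison.",
        "An aging patriarch transfers control to his son.",
    ],
})

# Assign the full column in one step instead of writing to `row`
# inside iterrows(), and use the same column name throughout.
df["Keywords"] = df["Plot"].apply(extract_keywords)
print(df[["Title", "Keywords"]])
```

If the loop has to stay, writing with `df.at[index, 'Keywords'] = plot` instead of `row['Key_words'] = ...` should have the same effect, since it targets the data frame itself and the existing column.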
