Data mining - Filling missing values of an embedded list in Python 3

I searched for a similar question but could not find one. I am new to this field, so I hope I can explain my problem clearly.

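I have a dataset made up of text data. I store it in a list, where each row holds a single string value. The rows are not the same length, though. I want them to have equal length so that I can use them in a self-attention model.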

A sample of my dataset

In [8]: myList
Out[8]: 
[
['the first line of my dataset'], 
['the second line'],
['the 3rd'],
['the 4th'],
['the 5th'],
['the 6th'],
['the 7th'],
]

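As you can see, the first row is longer than the others. I want to pad the shorter rows with some value, for example #, so that every row has the same number of words.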

Example output of what I want

In [8]: myList
Out[8]: 
[
['the first line of my dataset'], 
['the second line # # #'],
['the 3rd # # # #'],
['the 4th # # # #'],
['the 5th # # # #'],
['the 6th # # # #'],
['the 7th # # # #']
]

If this were a DataFrame, I could use the pandas fillna() function. I tried applying this:

train_X = pd.Series(train_X).fillna("#").values

But since it is an embedded (nested) list, I guess, it does not work. Is there a better way to do this?

Any suggestions are appreciated.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

my_list = [['the first line'], ['the 2nd line'], ['the 3rd line'], ['the 4th line'],
           ['the 5th line'], ['the'], ['the 5th line, this is']]

# Each row is a one-element list; unwrap it so the tokenizer sees plain strings
texts = [row[0] for row in my_list]

max_features = 10  # how many unique words you're using
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(texts)

# Convert every sentence to a list of integer word indices
sequences = tokenizer.texts_to_sequences(texts)

# Pad with 0 at the end of each sequence, up to the length of the longest one
padded = pad_sequences(sequences, maxlen=None, dtype='int32',
                       padding='post', truncating='post', value=0)
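
If you want the literal string output from the question (rows padded with # rather than integer sequences), plain Python may be enough. The following is only a sketch: `pad_rows` is a hypothetical helper name, and it assumes each row is a one-element list and that words are separated by whitespace.

```python
def pad_rows(rows, pad_token="#"):
    """Pad each single-string row with pad_token so all rows have the same word count."""
    # Split each row's string into words (whitespace tokenization is an assumption).
    tokenized = [row[0].split() for row in rows]
    # Length of the longest row, measured in words (0 for an empty input).
    max_len = max((len(words) for words in tokenized), default=0)
    # Append as many pad tokens as needed, then join back into one string per row.
    return [
        [" ".join(words + [pad_token] * (max_len - len(words)))]
        for words in tokenized
    ]


my_list = [
    ['the first line of my dataset'],
    ['the second line'],
    ['the 3rd'],
]

padded = pad_rows(my_list)
for row in padded:
    print(row)
```

The padding is counted in words, not characters, which matches the example output in the question. If the model expects integer inputs, the padded strings still need to be tokenized afterwards, so padding the integer sequences directly is usually the simpler route.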