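One approach (among many others) is to treat each element of a sequence as a word. In other words, if you think of each list as a sentence, you can extract n-grams from it.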
import nltk
from nltk import ngrams
a = [1, 15, 1, 1, 13, 14]
b = [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]
d = [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10]
# Turn each list into a comma-separated "sentence" of tokens such as "x1,x15,x1,..."
bb = [','.join('x' + str(e) for e in seq) for seq in (a, b, c, d)]
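I added the x prefix because CountVectorizer seems to ignore single digits/letters: its default token_pattern only keeps tokens of two or more alphanumeric characters. Let's do a word count, or you could also go on to use n-grams (see the sklearn documentation).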
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(bb)
X.toarray()
The output looks like this:
array([[3, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [5, 0, 4, 1, 0, 1, 1, 1, 0, 1, 0, 6, 2],
       [1, 0, 0, 0, 3, 0, 1, 9, 0, 0, 0, 0, 2],
       [3, 3, 0, 0, 0, 1, 0, 1, 1, 0, 3, 0, 1]])
The columns correspond to these words:
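# get_feature_names() was removed in scikit-learn 1.2; on older versions call it instead
print(vectorizer.get_feature_names_out().tolist())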
['x1', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x2', 'x3', 'x4', 'x6', 'x7', 'x9']
and the rows are your samples.
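If you would rather count n-grams than single words, as suggested earlier, CountVectorizer can do that through its ngram_range parameter. This is only a minimal sketch: it reuses the lists a and c from above, and the names docs, bigram_vectorizer and X_ngrams are mine, not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

a = [1, 15, 1, 1, 13, 14]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]

# Same "x" prefix trick as above, so single-digit items survive tokenization.
docs = [",".join("x" + str(e) for e in seq) for seq in (a, c)]

# ngram_range=(1, 2) counts single tokens and pairs of adjacent tokens.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_ngrams = bigram_vectorizer.fit_transform(docs)

# vocabulary_ maps each n-gram (e.g. "x2 x2") to its column index.
vocab = bigram_vectorizer.vocabulary_
print(sorted(vocab))
print(X_ngrams.toarray())
```

Bigram columns capture which items follow each other, which plain counts throw away. The price is a wider and sparser matrix, so with only a handful of samples it may or may not help the clustering.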
Now that you have a feature matrix, you can go on to clustering, for example with k-means:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
Result:
array([0, 1, 0, 0], dtype=int32)
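Once the model is fitted, a new sequence can be assigned to one of the clusters by encoding it with the same vectorizer. The sketch below is an assumption about how you might do that, not part of the original answer: to_doc and new_seq are hypothetical names, and n_init=10 is set explicitly only to keep behaviour stable across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

a = [1, 15, 1, 1, 13, 14]
b = [1, 1, 1, 1, 12, 1, 7, 11, 9, 11, 7, 11, 7, 11, 7, 4, 7, 7, 14, 15, 9, 2]
c = [13, 1, 13, 15, 13, 2, 9, 2, 9, 2, 2, 2, 2, 2, 2, 2]
d = [1, 2, 9, 1, 6, 10, 6, 1, 6, 10, 14, 3, 10]


def to_doc(seq):
    """Hypothetical helper: encode a list as a comma-separated 'sentence'."""
    return ",".join("x" + str(e) for e in seq)


vectorizer = CountVectorizer()
X = vectorizer.fit_transform([to_doc(s) for s in (a, b, c, d)])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Made-up new sequence; 99 never appeared in training, so transform() ignores it.
new_seq = [1, 1, 7, 11, 7, 99]
X_new = vectorizer.transform([to_doc(new_seq)])  # transform, not fit_transform
label = kmeans.predict(X_new)[0]
print(label)
```

Use transform rather than fit_transform for new data so the columns line up with the matrix the model was trained on. Items that were not in the training vocabulary are silently dropped.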