Fitting a classifier: object of type 'int' has no len()

data-mining python classification svm multilabel-classification lda
2022-03-13 08:26:09

We have LDA topic modeling, whose purpose is to generate a number of topics from a given set of documents, so each document can belong to more than one topic.

We can also evaluate the model we have created; one way to do this is with a classification method such as SVM. My goal is to evaluate the created model.

I have come across two pieces of code for building an LDA model.

Method one:

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)

With this approach I cannot use fit_transform.
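For context, my understanding is that gensim's counterpart to transforming documents is indexing the trained model with a bag-of-words corpus. A minimal sketch, assuming the lda and mm objects defined above:

# Per-document topic distributions: indexing the trained gensim model with a
# bag-of-words corpus yields a list of (topic_id, probability) pairs per document.
doc_topics = [lda[bow] for bow in mm]

# Equivalent call for a single document:
first_doc_topics = lda.get_document_topics(mm[0], minimum_probability=0.0)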

Method two:

tf_vectorizer = CountVectorizer(max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda_x = lda.fit_transform(tf)

In the first approach, the LDA model has no fit_transform method, and I don't know why, because I don't understand the difference between the two approaches.
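If it helps to frame the question: as far as I can tell, method one uses gensim's LdaModel, while method two uses scikit-learn's LatentDirichletAllocation, which follows the estimator API and therefore provides fit_transform. A dense document-topic matrix comparable to lda_x can, I believe, also be built from the gensim model; a minimal sketch, assuming the lda and mm objects from method one:

from gensim import matutils

# corpus2dense returns a (num_terms, num_docs) array, so transpose it to get
# an (n_documents, n_topics) matrix, roughly what fit_transform would return.
doc_topic_matrix = matutils.corpus2dense(lda[mm], num_terms=lda.num_topics).T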

In any case, I need to pass the LDA model created with the first approach to an SVM. (The reason I show both approaches here is that I know the second one runs without errors, probably because of fit_transform, but for other reasons I cannot use it.)

Full code:

import os
from gensim.models import ldamodel
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = {'a'}

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# Collect the raw documents from the data directory
lines = []
lisOfFiles = [x[2] for x in os.walk("data")]
fullPath = [x[0] for x in os.walk("data")]
for j in lisOfFiles[2]:
    with open(os.path.join(fullPath[2], j)) as f:
        lines.append(f.read())

for j in lisOfFiles[3]:
    with open(os.path.join(fullPath[3], j)) as f:
        lines.append(f.read())

for j in lisOfFiles[4]:
    with open(os.path.join(fullPath[4], j)) as f:
        lines.append(f.read())

# compile sample documents into a list
doc_set = lines
# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)

# Assigns the topics to the documents in corpus

dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
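# NOTE: `dictionary` and `corpus` above recreate `id2word` and `mm` and are
# not used again below.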


# creating the labels: every topic with probability > 0.005 becomes a label
lda_corpus = lda[mm]
label_y = []
for doc_topics in lda_corpus:
    # collect all sufficiently probable topic ids for this document
    new_y = [topic_id for topic_id, prob in doc_topics if prob > 0.005]
    label_y.append(new_y)  # one label list per document

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_df=2, min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(lda, label_y)  # <- raises: object of type 'int' has no len()

As you can see in my code, for the reasons above I used the first approach, but the last line raises an error (object of type 'int' has no len()). It seems the classifier cannot accept an LDA model created this way (I suspect because I never used fit_transform). How can I fix this error in my code?
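If I read the stack trace correctly, CountVectorizer iterates over whatever is passed as X (for doc in raw_documents), and iterating a gensim LdaModel falls back to its __getitem__, so the model is asked for lda[0], lda[1], and so on. gensim then treats the plain integer as a bag-of-words document and calls len() on it, which is where the TypeError comes from. A minimal sketch that reproduces the same error, assuming the lda model from the code above:

# Indexing the model with an int instead of a bag-of-words list of
# (token_id, count) pairs triggers the same failure as in the traceback.
lda[0]  # TypeError: object of type 'int' has no len()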

Stack trace:

/home/saria/tfwithpython3.6/bin/python /home/saria/PycharmProjects/TfidfLDA/test4.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/test4.py", line 92, in <module>
    classifier.fit(lda, label_y)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 760, in _count_vocab
    for doc in raw_documents:
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 1054, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 922, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 429, in inference
    if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
TypeError: object of type 'int' has no len()

Process finished with exit code 1