Number of features of the model must match the input. Model n_features is 740 and input n_features is 400

data-mining classification nlp scikit-learn random-forest
2022-02-12 13:59:51

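I get this error when predicting with a random forest classifier. Can anyone point out where I am going wrong?

(Background: yes, I am trying to do sentence classification with 2 labels.)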

```
#Initializing BoW
cv = CountVectorizer()

#Test-Train Split
X_train,X_test,y_train,y_test = train_test_split(experiment_df['Sentence'],experiment_df['Label'])

#Transform
train = cv.fit_transform(X_train)
test = cv.fit_transform(X_test)


#Train Classifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(train,y_train)

#Pred
y_pred = clf.predict(test)
```

Error:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-a6f8e9da0bb0> in <module>()
      1 clf = RandomForestClassifier(max_depth=2, random_state=0)
      2 clf.fit(train,y_train)
----> 3 y_pred = clf.predict(val)

3 frames
/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
    389                              "match the input. Model n_features is %s and "
    390                              "input n_features is %s "
--> 391                              % (self.n_features_, n_features))
    392 
    393         return X

ValueError: Number of features of the model must match the input. Model n_features is 740 and input n_features is 400 
```
1 Answer

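You are currently calling `fit_transform` on both the training set and the test set. This is incorrect: you should never fit anything on the test set, because (depending on the model) doing so can lead to overfitting, and it can also cause dataset shape problems whenever new columns are created from the values in the data (count vectorizers, dummy columns, and so on). That is what happens here: the second `fit_transform` learns a new vocabulary from the test sentences alone, so the test matrix has 400 columns while the model was trained on 740. The correct approach is to call `fit_transform` on the training data and then only `transform` on the test dataset: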

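```
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Initialize the bag-of-words vectorizer
cv = CountVectorizer()

# Train-test split (experiment_df is the asker's DataFrame)
X_train, X_test, y_train, y_test = train_test_split(experiment_df['Sentence'], experiment_df['Label'])

# Learn the vocabulary on the training set only, then reuse it for the test set
train = cv.fit_transform(X_train)
test = cv.transform(X_test)

# Train the classifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(train, y_train)

# Predict on the test set, which now has the same number of columns as the training set
y_pred = clf.predict(test)
```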
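
To see the mismatch in isolation, here is a minimal sketch on a toy corpus. The sentences, labels, and variable names are invented for illustration and are not the asker's data. It compares the column counts produced by refitting on the test set with those produced by reusing the fitted vectorizer, and then shows one possible way to avoid the mistake altogether by wrapping both steps in a scikit-learn pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy data, made up purely for illustration
train_sentences = [
    "the cat sat on the mat",
    "dogs chase the ball",
    "the cat chased a dog",
    "a ball is on the mat",
]
train_labels = [0, 1, 0, 1]
test_sentences = ["the dog sat", "an unseen word appears"]

cv = CountVectorizer()
train = cv.fit_transform(train_sentences)  # learns the vocabulary from the training text
test_ok = cv.transform(test_sentences)     # reuses that vocabulary; unseen words are ignored

# Refitting on the test text learns a different vocabulary, hence a different width
test_bad = CountVectorizer().fit_transform(test_sentences)

n_train_features = train.shape[1]
n_ok_features = test_ok.shape[1]
n_bad_features = test_bad.shape[1]
print(n_train_features, n_ok_features, n_bad_features)

# A pipeline fits the vectorizer inside fit() and only transforms inside predict(),
# so the test set cannot be fitted by accident
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(max_depth=2, random_state=0),
)
model.fit(train_sentences, train_labels)
y_pred = model.predict(test_sentences)
```

With the pipeline, raw sentences go straight into `fit` and `predict`, so there is no separate test matrix whose shape could drift from the training one.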