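I am trying to classify whether a book is fiction or non-fiction based on its title and summary.
These are two different kinds of information. Is there a way to keep the title and summary as separate features when feeding them to the model, rather than concatenating them?
For example: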
Title: "such a long journey"
Summary: "it is bombay in 1971, the year india went to..."
Label: "fiction" (where fiction = 1)
Current procedure:
What I have been doing so far is concatenating the two fields, so the example above becomes,
example = "such a long journey it is bombay in 1971, the year india went to..."
label = 1
followed by the usual setup, e.g.,
X.append(example)
y.append(label)
...
X = lemmatize(X)
...
X_train, X_test, y_train, y_test = split_data(X,y)
vectorizer = TfidfVectorizer(...)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)
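But feeding the model concatenated data feels intuitively wrong. Is there a better way?
If this can be done with libraries other than sklearn (keras, tensorflow), I am open to hearing about that as well.
Update
Going from,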
X = [['two'],['two'],['four'],['two'],['four'],['four']]
y = ['human','human','dog','human','dog','dog']
to,
X = [['two','hello'],['two','hello'],['four','bark'],['two','hi'],['four','bark'],['four','woof']]
y = ['human','human','dog','human','dog','dog']
causes an error to be thrown:
'list' object has no attribute 'lower' if X is a list, and 'numpy.ndarray' object has no attribute 'lower' if X is an array.
The error is thrown when I call,
X_train = vectorizer.fit_transform(X_train)
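For reference, here is a minimal reproduction of the error. It assumes a TfidfVectorizer with default settings (my real arguments are elided above), and the two-row X is just a cut-down version of the toy data. As far as I can tell, the vectorizer expects each sample to be a single string, so a nested list fails when it tries to lowercase the sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each sample is a list of two strings instead of one string.
X = [['two', 'hello'], ['four', 'bark']]

try:
    # Default settings are an assumption; the real arguments are not shown.
    TfidfVectorizer().fit_transform(X)
    error_message = None
except AttributeError as exc:
    # The default preprocessor calls .lower() on each sample.
    error_message = str(exc)

print(error_message)
```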
Is it possible to pass in a vector of features like this?
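One direction that might work is sketched below; I have not validated it on the real data. It uses sklearn's ColumnTransformer to run a separate TfidfVectorizer on each column and stack the results side by side. The name column_vectorizer, the default vectorizer settings, and the choice of LogisticRegression are all assumptions for illustration. Note that each column is selected with a scalar index (0 or 1), not a list, so that the vectorizer receives a 1-D sequence of strings.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data from the update: column 0 and column 1 are separate text fields.
X = np.array([['two', 'hello'], ['two', 'hello'], ['four', 'bark'],
              ['two', 'hi'], ['four', 'bark'], ['four', 'woof']])
y = ['human', 'human', 'dog', 'human', 'dog', 'dog']

# One vectorizer per column; a scalar column index passes a 1-D array
# of strings to each TfidfVectorizer.
column_vectorizer = ColumnTransformer([
    ('first_field', TfidfVectorizer(), 0),
    ('second_field', TfidfVectorizer(), 1),
])

# The feature matrix has one block of columns per field.
features = column_vectorizer.fit_transform(X)
print(features.shape)

# LogisticRegression is a placeholder for whatever classifier is in use.
model = make_pipeline(column_vectorizer, LogisticRegression())
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

With the real data, the two columns would be the title and the summary. Each field then gets its own vocabulary and its own TF-IDF weights, instead of sharing one vocabulary as in the concatenated version. Whether that actually classifies better than concatenation is something I would still need to measure.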