I have a dataset collected from Facebook with 10 classes and 2,500 posts per class. However, when I count the number of unique words in each class, the counts differ, as shown in the figure.
Does this count as an imbalance problem because of the differing word counts, or is the data balanced because every class has the same number of posts? If it is imbalanced, what is the best way to handle it?
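For reference, the per-class counts in the figure were produced roughly like this (a minimal sketch; it assumes the same data frame and columns as in the code below, and the simple whitespace split() is an approximation of my tokenization):

# Sketch: number of unique whitespace-separated tokens per class.
# Assumes `data` with columns "text" and "class1" as in the update below;
# the split() tokenizer is a simplification of what I actually used.
unique_words_per_class = (
    data.groupby('class1')['text']
        .apply(lambda posts: len(set(w for post in posts for w in post.split())))
)
print(unique_words_per_class)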
Update: my Python code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras.layers import Dense

# Load the posts and their class labels (columns are separated by "*")
data = pd.read_csv(r'E:\cluster data\One_File_nonnormalizenew2norm.txt', sep="*")
data.columns = ["text", "class1"]
data.dropna(inplace=True)

# Encode the class labels as integer codes
data['class1'] = data.class1.astype('category').cat.codes
text = data['text']
y = data['class1']
num_class = len(np.unique(data.class1.values))

sentences_train, sentences_test, y_train, y_test = train_test_split(
    text, y, test_size=0.25, random_state=1000)

# Bag-of-words features fitted on the training split only
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)

# Simple feed-forward classifier over the bag-of-words counts
model = Sequential()
model.add(Dense(512, input_shape=(X_train.shape[1],)))  # 60874 features in my run
model.add(Dense(num_class, activation='softmax'))  # one output unit per class
model.summary()

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_train, y_train, batch_size=150, epochs=10, verbose=2,
          validation_data=(X_test, y_test), shuffle=True)

predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)
accuracy_score(y_test, predicted)
0.9592031872509961
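Would something like class weighting be the right fix here, even though the number of posts per class is equal? A minimal sketch of what I have in mind, assuming the model and split from the code above (the 'balanced' heuristic from scikit-learn is just one option, and with 2,500 posts per class it would give nearly uniform weights):

# Sketch: weight classes inversely to their sample frequency and pass the
# weights to Keras; with equal post counts the weights are ~1.0 each, which
# is exactly why I am unsure this addresses the differing word counts.
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = {int(c): w for c, w in zip(classes, weights)}

model.fit(X_train, y_train, batch_size=150, epochs=10, verbose=2,
          validation_data=(X_test, y_test), class_weight=class_weights)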