word label_numeric
0 active 0
1 adventurous 0
2 aggressive 0
3 aggressively 0
4 ambitious 0
I trained a model with word2vec and converted each word into a 300-dimensional word vector. This is what the data looks like now:
0 0.058594 -0.016235 -0.174805 0.072266 -0.201172 0.073242 -0.074219 -0.149414 0.245117 -0.050049 -0.016357 -0.147461 -0.003311 0.071289 -0.008545 -0.179688 0.001686 -0.009949 -0.036621 0.048096 -0.033447 0.105957 -0.490234 0.249023 -0.199219 -0.025635 -0.248047 0.136719 -0.068848 -0.320312 0.259766 -0.053223 0.154297 -0.050537 0.110840 0.027100 0.000412 -0.133789 0.077148 0.058838 0.230469 -0.033203 -0.179688 -0.125977 -0.166992 -0.110352 -0.365234 -0.330078 -0.021729 -0.076660 0.124023 -0.107910 -0.051758 0.127930 0.192383 0.025024 0.033691 -0.386719 -0.006195 -0.074219 -0.175781 -0.088379 -0.341797 0.145508 -0.051758 0.099609 0.020874 -0.042969 -0.145508 0.090332 0.096191 0.061768 0.209961 0.314453 -0.080078 -0.304688 0.238281 -0.060791 0.146484 0.041504 -0.113281 0.019409 0.328125 0.300781 -0.153320 -0.174805 -0.347656 -0.002167 0.115723 0.104004 0.012817 -0.175781 0.088867 -0.291016 -0.092773 0.144531 -0.006256 -0.066406 -0.145508 -0.182617 -0.144531 0.074707 -0.157227 -0.025513 -0.013977 -0.289062 0.051514 -0.010559 0.121582 0.072754 0.005188 -0.162109 -0.246094 0.002014 -0.072266 -0.026733 0.143555 0.067383 0.398438 -0.212891 0.029663 -0.041748 -0.005157 0.337891 -0.192383 -0.135742 0.226562 -0.033691 -0.188477 0.322266 0.136719 -0.058594 -0.068359 0.136719 0.029175 -0.152344 -0.086426 0.021729 -0.005524 0.115723 0.106445 0.257812 0.000546 -0.161133 -0.046875 -0.049805 -0.058594 -0.110840 0.029907 -0.322266 -0.032715 -0.136719 -0.148438 0.125977 -0.205078 0.027222 -0.005219 -0.188477 0.318359 0.002792 0.155273 0.261719 -0.043457 0.113281 0.142578 0.170898 -0.202148 0.028687 0.239258 0.033203 -0.330078 -0.003647 -0.054199 -0.142578 0.201172 0.053467 -0.249023 -0.180664 0.147461 -0.036865 -0.015259 -0.107910 -0.134766 0.052002 0.109863 0.067871 0.022705 0.058838 -0.189453 -0.093262 -0.043945 -0.009216 0.020386 -0.232422 -0.083008 0.062500 0.016479 0.033936 0.041016 0.049805 0.071289 0.076660 -0.003937 -0.261719 -0.198242 -0.269531 -0.035889 -0.249023 -0.023071 
-0.091797 -0.093750 0.192383 -0.376953 0.170898 0.027832 0.023438 0.047363 -0.051270 0.020386 -0.029663 0.128906 0.044434 -0.199219 0.060547 0.138672 0.104980 0.314453 -0.125000 -0.075684 0.088379 0.109863 -0.058594 0.063477 -0.120117 -0.177734 0.017700 0.112793 -0.161133 -0.188477 -0.102051 -0.068848 -0.073730 0.168945 -0.042236 -0.024536 0.128906 -0.066406 -0.020996 0.087891 -0.224609 0.025146 -0.054932 -0.102539 -0.020142 0.123047 -0.171875 0.195312 -0.203125 -0.265625 -0.026367 0.154297 -0.235352 0.092773 0.032715 0.177734 0.063477 -0.168945 0.153320 -0.182617 0.101074 0.074219 0.031250 -0.038086 0.037598 0.035400 -0.150391 -0.108398 -0.071289 -0.080078 0.078613 0.022705 0.148438 -0.098633 -0.032471 0.083984 0.031494 -0.052002 -0.062988 0.316406 -0.105957 0.026733 0.018921 0.026855 -0.176758 -0.088379 0.127930 -0.104980 0.206055 -0.003296 0.184570 0
1 -0.068359 0.076660 -0.224609 0.292969 0.054688 -0.069824 0.028809 0.090332 -0.160156 0.080566 0.289062 -0.005615 0.074219 -0.071289 0.069824 0.032715 -0.036133 0.043457 0.084961 0.224609 -0.001160 0.100098 -0.090820 0.209961 0.101074 0.009949 0.038818 0.151367 0.209961 -0.157227 0.118652 0.247070 0.090332 0.244141 0.125000 -0.253906 0.204102 -0.234375 0.118652 -0.000603 0.253906 -0.146484 -0.077148 0.180664 -0.110840 0.018677 -0.113770 0.159180 0.245117 -0.033447 -0.041748 0.246094 0.018677 0.034180 0.103516 0.087891 0.339844 -0.357422 -0.230469 -0.051758 -0.038574 -0.281250 -0.218750 -0.210938 -0.150391 -0.040283 -0.049072 -0.292969 0.151367 0.143555 0.048340 -0.194336 -0.027344 0.038574 -0.086426 -0.003036 -0.095215 0.062500 -0.098145 0.085938 -0.099609 0.046875 0.039551 0.182617 -0.142578 0.189453 -0.261719 0.030273 0.056152 0.123535 -0.082520 -0.075684 -0.267578 0.014832 0.047852 -0.012451 0.131836 0.240234 -0.107910 -0.316406 0.081055 0.092285 0.014771 0.211914 0.062500 -0.143555 0.412109 -0.210938 -0.064453 -0.193359 0.051025 0.027954 0.026367 -0.109375 0.020752 -0.124512 0.198242 -0.105469 0.250000 -0.071289 -0.065430 -0.139648 -0.032959 0.386719 -0.185547 -0.166992 0.036621 0.001389 -0.090820 0.030396 -0.249023 -0.047363 -0.013245 0.318359 -0.150391 0.048340 -0.037354 0.125000 -0.053711 0.562500 0.005463 -0.067383 -0.345703 0.214844 0.044678 0.170898 -0.218750 0.243164 -0.165039 -0.259766 -0.158203 -0.275391 -0.138672 0.080566 -0.212891 -0.238281 -0.075684 0.015320 0.089844 -0.052490 0.031738 0.339844 0.035400 0.212891 0.127930 -0.033447 0.234375 0.130859 -0.209961 -0.106445 -0.236328 0.047607 -0.153320 -0.075195 0.048340 0.133789 -0.085449 0.122070 -0.187500 -0.172852 -0.137695 -0.392578 -0.028809 -0.177734 -0.131836 -0.141602 0.071777 -0.118652 -0.072754 -0.081543 -0.070312 0.033447 0.124023 -0.088379 -0.130859 0.131836 -0.010437 0.247070 -0.287109 0.077637 0.033203 0.032959 -0.136719 -0.079590 0.051758 -0.045898 -0.131836 -0.326172 -0.202148 -0.033203 
-0.176758 0.180664 -0.148438 0.227539 -0.212891 -0.143555 0.273438 0.134766 -0.261719 0.073242 -0.054688 0.027466 0.126953 0.234375 0.097168 0.259766 0.253906 -0.170898 -0.189453 0.239258 -0.173828 0.024536 0.002090 0.101074 0.351562 0.174805 0.162109 -0.146484 -0.103516 -0.037354 0.065430 -0.104004 0.108398 0.296875 0.172852 0.078613 -0.209961 -0.133789 0.037354 -0.125977 0.172852 -0.102539 0.034424 0.095215 0.158203 -0.291016 -0.047852 -0.161133 -0.024414 -0.162109 -0.161133 0.109375 0.003372 0.218750 -0.022339 0.057861 -0.351562 -0.113770 -0.247070 -0.108398 0.097656 0.083008 0.357422 0.347656 0.341797 -0.031006 0.056885 0.114746 0.083008 0.192383 0.335938 0.154297 -0.244141 -0.445312 0.166992 0.396484 -0.132812 0.077148 -0.108398 0.131836 0.063477 0.001221 -0.219727 -0.062988 -0.137695 -0.133789 0.223633 -0.069336 0.163086 0.236328 0
2 -0.003067 0.219727 -0.082520 0.255859 -0.209961 -0.117188 0.109863 0.107422 0.059570 0.007233 0.059082 -0.152344 0.208984 -0.095703 -0.096680 -0.312500 -0.154297 0.024780 0.032471 0.250000 0.090820 0.017944 0.105957 0.133789 -0.122070 0.199219 -0.073730 -0.142578 0.203125 0.047607 0.222656 0.019531 0.026123 -0.138672 0.061768 0.120605 -0.008789 -0.047852 0.269531 -0.182617 0.566406 -0.218750 -0.043457 -0.051270 -0.273438 -0.084961 -0.240234 -0.158203 0.221680 -0.043457 0.308594 0.221680 -0.112305 -0.014343 0.070312 0.174805 -0.090332 -0.384766 0.003281 -0.002808 -0.273438 -0.116211 -0.542969 -0.008057 -0.137695 0.209961 0.231445 -0.008484 -0.092285 0.226562 -0.021851 -0.083984 0.069336 0.277344 -0.217773 0.057129 0.269531 0.218750 0.137695 0.093750 -0.101562 0.281250 0.029785 0.126953 0.066406 -0.019775 -0.287109 0.267578 0.195312 -0.135742 0.012207 0.048828 -0.237305 0.101562 0.206055 -0.091309 -0.085938 0.112305 -0.008423 -0.037109 0.099121 0.018433 -0.108398 0.031982 0.202148 -0.273438 -0.007874 -0.179688 0.025879 -0.046387 -0.172852 -0.202148 -0.086426 -0.028564 -0.033447 -0.047852 0.184570 -0.146484 0.109863 -0.243164 -0.251953 -0.000456 -0.073730 0.199219 -0.248047 -0.265625 0.261719 0.003693 0.092285 -0.111816 -0.118652 -0.320312 0.121582 0.127930 -0.127930 -0.087402 0.229492 0.040527 -0.121094 0.233398 0.052734 0.213867 -0.111328 -0.030884 -0.084961 0.054932 -0.068848 0.133789 -0.121582 -0.235352 -0.031982 0.062500 -0.137695 0.244141 -0.070312 -0.090820 -0.050781 0.041748 0.166992 0.200195 0.016724 0.292969 0.023682 -0.232422 -0.113281 -0.032959 0.038330 -0.357422 0.187500 -0.034180 -0.157227 -0.213867 0.007233 0.136719 0.018433 0.040771 0.089355 0.162109 -0.051514 -0.109863 -0.142578 -0.292969 -0.043945 0.200195 -0.079102 -0.007172 0.131836 0.206055 -0.125977 -0.092285 0.118652 -0.042236 -0.054443 -0.082520 -0.238281 -0.078125 0.052979 0.003601 -0.045166 0.126953 0.064453 0.296875 0.145508 -0.006378 0.015869 -0.070312 0.036377 -0.277344 0.038574 
-0.112793 -0.224609 0.171875 -0.184570 0.062500 0.142578 -0.170898 0.189453 -0.067871 -0.239258 -0.110840 -0.043213 0.089844 0.069824 0.012512 0.162109 -0.194336 0.419922 -0.116699 0.170898 0.119141 -0.189453 0.102051 0.055420 0.026245 0.008545 0.052246 -0.088379 -0.236328 -0.041016 -0.125000 -0.051514 0.020020 0.051758 -0.137695 0.206055 -0.029297 -0.106445 -0.039062 0.285156 -0.018677 0.265625 -0.072266 -0.090820 -0.030640 -0.112793 -0.181641 -0.000690 -0.171875 -0.115234 -0.179688 0.114746 0.032227 -0.016235 -0.063477 0.054688 -0.033691 -0.242188 -0.292969 -0.229492 0.067871 0.006378 0.345703 0.024780 0.148438 0.119629 0.121582 0.024780 0.086914 0.066895 0.181641 0.120605 0.234375 0.034180 -0.306641 -0.124512 0.145508 0.025269 -0.138672 0.353516 -0.227539 -0.082520 -0.035645 0.066895 -0.085938 -0.159180 -0.087402 0.186523 0.289062 -0.075195 0.050781 0
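To make the layout above concrete, here is a minimal sketch of how such a table (300 vector columns followed by the label) could be assembled. The `embeddings` dictionary is a hypothetical stand-in for the trained word2vec lookup and is filled with random numbers; `build_feature_table` is an illustrative helper, not code from my project.

```python
import numpy as np
import pandas as pd

DIM = 300  # vector size stated in the post

# Hypothetical stand-in for a trained word2vec lookup (word -> vector).
rng = np.random.default_rng(0)
words = ["active", "adventurous", "aggressive", "aggressively", "ambitious"]
labels = [0, 0, 0, 0, 0]
embeddings = {w: rng.normal(scale=0.1, size=DIM) for w in words}


def build_feature_table(words, labels, embeddings, dim=DIM):
    """Stack one vector per word and append the label as the last column."""
    rows, kept_labels = [], []
    for word, label in zip(words, labels):
        vec = embeddings.get(word)
        if vec is None:  # skip out-of-vocabulary words
            continue
        rows.append(vec)
        kept_labels.append(label)
    table = pd.DataFrame(np.vstack(rows), columns=range(dim))
    table["label_numeric"] = kept_labels
    return table


features = build_feature_table(words, labels, embeddings)
print(features.shape)  # 5 rows, 300 vector columns plus 1 label column
```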
I have two labels, 0 and 1. I am now doing binary classification using the 300-dimensional word vectors as features.
# Split the dataset into training and test sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import train_test_split
train_X, test_X, train_Y, test_Y = train_test_split(jpsa_X_norm, jpsa_Y, test_size=0.30, random_state=42)
print("Total Sample size in Training {}\n".format(train_X.shape[0]))
print("Total Sample size in Test {}".format(test_X.shape[0]))
Total Sample size in Training 151
Total Sample size in Test 65
0 87
1 64
dtype: int64
So the training set is slightly imbalanced, with a class 0 to class 1 ratio of 87:64, roughly 1.36:1.
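As a quick check of that ratio, and of what the "balanced" class weighting would do with these counts, here is a small sketch using only the standard library. It applies the heuristic scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * count_of_class)`; the label list is rebuilt from the counts printed above rather than taken from my real data.

```python
from collections import Counter

# Training-set class counts reported in the post: 87 zeros and 64 ones.
train_labels = [0] * 87 + [1] * 64

counts = Counter(train_labels)
n_samples = len(train_labels)
n_classes = len(counts)

# Ratio of class 0 to class 1.
ratio = counts[0] / counts[1]

# "balanced" heuristic: n_samples / (n_classes * count_of_class).
balanced_weights = {
    cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())
}

print("class 0 : class 1 = {:.2f} : 1".format(ratio))
print({cls: round(weight, 3) for cls, weight in balanced_weights.items()})
```

With an imbalance this mild the two weights stay close to 1, so class weighting alone should not be expected to change the scores much.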
I am now running GridSearchCV for both an SVM and a random forest. In both algorithms, I set
This is my GridSearchCV function:
def grid_search(self):
    """This function does Cross Validation using Grid Search"""
    from sklearn.model_selection import GridSearchCV
    # 5-fold cross-validation over every combination in self.param_grid
    self.g_cv = GridSearchCV(estimator=self.estimator, param_grid=self.param_grid, cv=5)
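For anyone who wants to reproduce the shape of the output below, here is a self-contained sketch of the same workflow. It assumes synthetic data from `make_classification` in place of `jpsa_X_norm` and `jpsa_Y`, a reduced parameter grid, and `class_weight="balanced"` on the estimator. `return_train_score=True` is passed explicitly because recent scikit-learn versions do not compute the training scores by default.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the real feature matrix is not available here.
X, y = make_classification(
    n_samples=216, n_features=30, n_informative=10,
    weights=[0.58, 0.42], random_state=42,
)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Reduced grid for illustration only.
param_grid = [
    {"kernel": ["rbf"], "gamma": [1e-1, 1e-2], "C": [1, 10]},
    {"kernel": ["linear"], "C": [0.1, 1]},
]

search = GridSearchCV(
    estimator=SVC(class_weight="balanced"),
    param_grid=param_grid,
    cv=5,
    return_train_score=True,  # required for mean_train_score in cv_results_
)
search.fit(train_X, train_Y)

print("The mean train scores are", search.cv_results_["mean_train_score"])
print("The mean validation scores are", search.cv_results_["mean_test_score"])
print("Best cross-validated score:", search.best_score_)
print("Parameters for Best Score :", search.best_params_)
print("Accuracy on test data:", search.score(test_X, test_Y))
```

Note that `best_score_` is the largest mean validation score from cross-validation, not a score on a separate held-out set. In my SVM output below, the value printed as "score on held out data" (0.9073) equals the largest mean validation score, so it appears to be this quantity.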
I get the following results for the SVM.
The mean train scores are [ 0.57615906 0.57615906 0.57615906 0.57615906 0.93874475 0.57615906
0.57615906 0.57615906 1. 0.94867633 0.57615906 0.57615906
1. 1. 0.950343 0.57615906 0.81777921 0.99668044
1. 1. ]
The mean validation scores are [ 0.57615894 0.57615894 0.57615894 0.57615894 0.87417219 0.57615894
0.57615894 0.57615894 0.8807947 0.8807947 0.57615894 0.57615894
0.86754967 0.87417219 0.88741722 0.57615894 0.70860927 0.90728477
0.87417219 0.87417219]
The score on held out data is: 0.9072847682119205
Parameters for Best Score : {'C': 1, 'kernel': 'linear'}
The accuracy of svm on test data is: 0.8769230769230769
Classification Metrics for svm :
             precision    recall  f1-score   support
          0       0.87      0.92      0.89        37
          1       0.88      0.82      0.85        28
avg / total       0.88      0.88      0.88        65
The parameter grid of hyperparameter values passed to GridSearchCV for the SVM is:
grid_svm=[{'kernel': ['rbf'], 'gamma': [1e-1,1e-2,1e-3,1e-4],\
'C': [0.1, 1, 10, 100]},\
{'kernel': ['linear'], 'C': [0.1,1,10,100]}]
I get the following results for the random forest.
The mean train scores are [ 0.99009597 1. 0.99833333 1. 0.99833333 1.
0.99834711 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1.
1. ]
The mean validation scores are [ 0.79470199 0.85430464 0.8807947 0.87417219 0.8807947 0.85430464
0.83443709 0.82781457 0.86754967 0.84768212 0.88741722 0.87417219
0.81456954 0.86092715 0.85430464 0.83443709 0.8410596 0.8410596
0.83443709 0.86092715 0.85430464 0.83443709 0.84768212 0.82781457
0.82781457 0.82119205 0.85430464 0.81456954 0.82781457 0.85430464
0.82781457 0.84768212 0.83443709 0.86092715 0.87417219 0.86754967
0.86092715 0.86092715 0.8410596 0.86754967 0.86754967 0.8410596 ]
The score on held out data is: 0.8874172185430463
Parameters for Best Score : {'max_depth': 4, 'n_estimators': 600}
The accuracy of rf on test data is: 0.8307692307692308
Classification Metrics for rf :
             precision    recall  f1-score   support
          0       0.77      1.00      0.87        37
          1       1.00      0.61      0.76        28
avg / total       0.87      0.83      0.82        65
I have 42 combinations of hyperparameter values for the random forest, as follows:
grid_rf={'n_estimators': [30,100,250,500,600,900], 'max_depth':[2,4,7,8,9,10,13]}
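The number of combinations in each grid can be checked with scikit-learn's `ParameterGrid`, which expands a grid the same way `GridSearchCV` does. The sketch below copies the two grids from this post; the counts match the 20 scores printed for the SVM and the 42 printed for the random forest.

```python
from sklearn.model_selection import ParameterGrid

grid_svm = [
    {"kernel": ["rbf"], "gamma": [1e-1, 1e-2, 1e-3, 1e-4], "C": [0.1, 1, 10, 100]},
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
]
grid_rf = {
    "n_estimators": [30, 100, 250, 500, 600, 900],
    "max_depth": [2, 4, 7, 8, 9, 10, 13],
}

n_svm = len(ParameterGrid(grid_svm))  # 16 rbf combinations + 4 linear
n_rf = len(ParameterGrid(grid_rf))    # 6 n_estimators * 7 max_depth

print("SVM combinations:", n_svm)
print("RF combinations:", n_rf)
```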
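Now, if you look at the output of both the SVM and the random forest, my training accuracy is close to 99%, but the validation and test accuracy are nowhere near it. This should indicate overfitting, but I tuned the hyperparameters with grid search, and random forests usually do not overfit anyway.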
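The AUC of the ROC curve is also very good, close to 0.96. So the AUC is good but the accuracy is poor, and I understand that class imbalance might be playing a role. But I used the class weight parameter in both algorithms to deal with that. So why are my test and validation accuracy still not comparable to the training accuracy?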
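To illustrate how a high AUC and a lower accuracy can coexist, here is a toy sketch with made-up scores (not my model's output). AUC only measures how well the scores rank positives above negatives, while accuracy also depends on the decision threshold, so a perfectly ranked set of scores can still lose accuracy at the default cut-off of 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Illustrative labels and scores: every positive outranks every negative,
# but two positives score below 0.5.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.90])

auc = roc_auc_score(y_true, scores)

# Accuracy at the default threshold and at a threshold between the classes.
acc_default = accuracy_score(y_true, (scores >= 0.5).astype(int))
acc_shifted = accuracy_score(y_true, (scores >= 0.375).astype(int))

print("AUC:", auc)
print("Accuracy at threshold 0.5:", acc_default)
print("Accuracy at threshold 0.375:", acc_shifted)
```

Here the AUC is perfect while accuracy at 0.5 is only 6 out of 8, because the two lowest-scoring positives fall under the threshold.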
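I also added more data, so now I have 2000 samples of class 0 and 1000 of class 1. In each algorithm I use the "balanced" option of scikit-learn's class_weight parameter to handle the class imbalance.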
The mean train scores are [ 0.70347493 0.73347328 0.74070792 0.74368715 0.74609988 0.74772955
0.7476584 0.78035322 0.80624038 0.81432687 0.8194324 0.81581485
0.81773002 0.81929078 0.9497877 0.96858105 0.97283788 0.97524883
0.9759579 0.97567365 0.9751775 0.97851051 0.99099354 0.99248265
0.99489341 0.99468108 0.99538994 0.99595762 0.98999975 0.99794336
0.99872325 0.99893632 0.99872348 0.99914909 0.99907804 0.99687948
0.99957447 0.99978721 0.99957452 0.99978728 0.99971639 0.99978728
0.99985806 1. 1. 1. 1. 1. 1. ]
The mean validation scores are [ 0.68765957 0.71460993 0.7222695 0.71829787 0.71744681 0.72453901
0.71971631 0.7248227 0.73191489 0.74439716 0.74638298 0.74524823
0.74695035 0.74297872 0.75716312 0.77730496 0.78468085 0.78382979
0.79120567 0.78609929 0.7906383 0.75120567 0.77531915 0.78808511
0.78780142 0.79035461 0.79234043 0.78808511 0.75716312 0.7693617
0.78297872 0.78553191 0.78609929 0.77957447 0.78269504 0.75234043
0.77673759 0.77021277 0.7764539 0.76879433 0.77134752 0.77673759
0.74241135 0.75148936 0.75375887 0.75375887 0.75432624 0.75829787
The score on held out data is: 0.7923404255319149
Hyper-Parameters for Best Score : {'max_depth': 8, 'n_estimators': 700}
The accuracy of rf on test data is: 0.8022486772486772
Classification Metrics for rf :
             precision    recall  f1-score   support
          0       0.83      0.90      0.86       956
          1       0.71      0.64      0.67       433
          2       0.92      0.62      0.74       123
avg / total       0.80      0.80      0.80      1512
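This seems to have reduced the accuracy from 82% to 80%. Why is that? If the amount of data increased, why would the accuracy go down? The results show a training accuracy of 1, but the validation and test accuracy are close to 0.8. Why? Is this overfitting, given that the validation error is high and the training error is low? But random forests usually do not overfit.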
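One way to probe this is to compare the training accuracy with the out-of-bag (OOB) estimate at several depths. The sketch below does that on synthetic, deliberately noisy data from `make_classification` (an assumption, not my dataset), using `oob_score=True` on `RandomForestClassifier`. A forest of deep trees can score close to 1.0 on its own training data because each tree fits its bootstrap sample almost perfectly, so the training score on its own says little about generalization.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic noisy data: flip_y randomly reassigns a fraction of the labels.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8,
    flip_y=0.15, random_state=0,
)

results = {}
for depth in (2, 8, None):  # None lets each tree grow until its leaves are pure
    rf = RandomForestClassifier(
        n_estimators=200,
        max_depth=depth,
        oob_score=True,           # score each sample with trees that never saw it
        class_weight="balanced",
        random_state=0,
    )
    rf.fit(X, y)
    results[depth] = (rf.score(X, y), rf.oob_score_)
    print("max_depth={}: train={:.3f}, oob={:.3f}".format(depth, *results[depth]))
```

If the OOB or validation score stops improving as `max_depth` grows while the training score keeps climbing toward 1.0, the extra depth is fitting noise rather than signal.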