sklearn CountVectorizer token_pattern -- skip token if pattern matches

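Apologies if this question is in the wrong place; I am not sure whether it is more of a re question or a CountVectorizer question. I am trying to exclude any token that contains one or more digits.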

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import pandas as pd
>>> docs = ['this is some text', '0000th', 'aaa more 0stuff0', 'blahblah923']   
>>> vec = CountVectorizer()
>>> X = vec.fit_transform(docs)
>>> pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
   0000th  0stuff0  aaa  blahblah923  is  more  some  text  this
0       0        0    0            0   1     0     1     1     1
1       1        0    0            0   0     0     0     0     0
2       0        1    1            0   0     1     0     0     0
3       0        0    0            1   0     0     0     0     0

What I want is this:

   aaa  is  more  some  text  this
0    0   1     0     1     1     1
1    0   0     0     0     0     0
2    1   0     1     0     0     0
3    0   0     0     0     0     0

My idea was to use CountVectorizer's token_pattern parameter to supply a regex string that matches anything except one or more digits:

>>> vec = CountVectorizer(token_pattern=r'[^0-9]+')

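But the negated class also matches the surrounding text, whitespace included, so the result contains fragments rather than clean tokens: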

   aaa more   blahblah  stuff  th  this is some text
0          0         0      0   0                  1
1          0         0      0   1                  0
2          1         0      1   0                  0
3          0         1      0   0                  0
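The columns above can be reproduced with re alone, which suggests the behavior comes from the regex rather than from CountVectorizer itself. A small sketch, using the same docs as in the question (the variable name negated is just for illustration):

```python
import re

docs = ['this is some text', '0000th', 'aaa more 0stuff0', 'blahblah923']

# [^0-9]+ matches any run of non-digit characters, spaces included,
# so it never splits on whitespace and it keeps the letters around digits.
negated = re.compile(r'[^0-9]+')

for doc in docs:
    print(repr(doc), '->', negated.findall(doc))
```

Each whole sentence or digit-free fragment comes back as a single "token", which is why 'this is some text' and 'aaa more ' show up as column names.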

In addition, replacing the default pattern (?u)\b\w\w+\b obviously interferes with the tokenizer's normal behavior, which I want to keep.

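What I really want is to use the normal token_pattern, but then apply a secondary filter to those tokens so that only the strictly alphabetic ones are kept. How can this be done?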