Scikit-learn TfidfVectorizer: how to use a custom analyzer and still use token_pattern

The documentation states that token_pattern is only used if analyzer == 'word':

token_pattern : string

    Regular expression denoting what constitutes a “token”, only used if
    analyzer == 'word'. The default regexp selects tokens of 2 or more 
    alphanumeric characters (punctuation is completely ignored and always 
    treated as a token separator).

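Here is the pipeline I want:

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Default word analyzer (default token_pattern, no stop words)
analyzer = TfidfVectorizer().build_analyzer()
stemmer = SnowballStemmer('english')

def processed_words(doc):
    # Stem every token produced by the default analyzer
    return (stemmer.stem(w) for w in analyzer(doc))

vec = TfidfVectorizer(analyzer=processed_words, strip_accents='unicode',
            stop_words='english', token_pattern=r'\b[^_\d\W]+\b')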

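If I pass analyzer=processed_words, token_pattern is no longer applied, so I lose the ability to remove thousands of features that are numbers, underscores, or any other invalid character sequences excluded by the regular expression.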

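Is there a way to get both stemming and token_pattern matching? Or do I have to loop over all documents beforehand, filter them with the regular expression, and re-join the split tokens into documents?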
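One possible approach, offered as a minimal sketch rather than a definitive answer: configure a throwaway TfidfVectorizer with the desired token_pattern, strip_accents and stop_words, borrow its analyzer via build_analyzer(), and stem the tokens it returns. The names base_analyzer, stemmed_words and docs are hypothetical and only serve the illustration.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Throwaway vectorizer, used only to build an analyzer that already applies
# accent stripping, the custom token_pattern, and stop word removal.
base_analyzer = TfidfVectorizer(
    strip_accents='unicode',
    stop_words='english',
    token_pattern=r'\b[^_\d\W]+\b',
).build_analyzer()
stemmer = SnowballStemmer('english')

def stemmed_words(doc):
    # Tokens arrive already filtered by token_pattern; only stemming is added.
    return [stemmer.stem(w) for w in base_analyzer(doc)]

# The real vectorizer only needs the callable; its own token_pattern
# would be ignored anyway once analyzer is a callable.
vec = TfidfVectorizer(analyzer=stemmed_words)

# Hypothetical toy documents.
docs = ["Running runners ran 123 times", "the under_score value 42"]
vec.fit(docs)
vocabulary = sorted(vec.vocabulary_)
```

With this arrangement the regular expression runs inside the analyzer, so there should be no need for a separate filtering pass over the documents.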

And if I wanted to apply both stemming and lemmatization by using two analyzers (also including WordNetLemmatizer()), how would I handle that?
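A hedged sketch of one way to combine several normalizers: wrap a single base analyzer and apply each word-level function in turn. The helper chain_normalizers is hypothetical, not part of scikit-learn, and lemmatizing before stemming is an assumption about the desired order. Note that WordNetLemmatizer.lemmatize treats words as nouns by default and needs the WordNet corpus (nltk.download('wordnet')) before it is first called.

```python
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Analyzer that already applies the custom token_pattern and stop words.
base_analyzer = TfidfVectorizer(
    strip_accents='unicode',
    stop_words='english',
    token_pattern=r'\b[^_\d\W]+\b',
).build_analyzer()

def chain_normalizers(tokenize, *steps):
    """Return an analyzer that tokenizes, then applies each step per token."""
    def analyze(doc):
        tokens = tokenize(doc)
        for step in steps:
            tokens = [step(t) for t in tokens]
        return tokens
    return analyze

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Lemmatize first, then stem (assumed order; swap the arguments to reverse it).
lemma_then_stem = chain_normalizers(
    base_analyzer, lemmatizer.lemmatize, stemmer.stem
)

# Nothing touches the WordNet corpus until fit() or transform() is called.
vec = TfidfVectorizer(analyzer=lemma_then_stem)
```

Keeping one base analyzer and chaining word-level functions avoids running two separate tokenization passes, which two independent analyzers would require.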