Data mining - Removing Chinese characters as features

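I created a document-term matrix with TfidfVectorizer, but I just noticed that the features contain Chinese characters. Is it possible to remove them with a Python regular expression?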

I believe these features are one of the reasons my model's prediction accuracy is low.

Currently I use the following to preprocess my data:

    import re

    # Pre-processing the data
    def text_preprocess(data):
        # Convert to lower case
        data = data.lower()
        # Replace digits and non-word characters with a space
        data = re.sub(r"(\d|\W)+", " ", data)
        return data
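A likely reason the Chinese characters survive this step: in Python 3, `\W` on a `str` pattern is Unicode-aware, so Chinese characters count as word characters and are not replaced. Below is a minimal sketch of two possible adjustments. It assumes "Chinese characters" means the CJK Unified Ideographs block (U+4E00 to U+9FFF); the names `CJK_PATTERN` and `text_preprocess_ascii` are made up for illustration.

```python
import re

# Assumption: Chinese characters fall in the CJK Unified Ideographs block
# (U+4E00 to U+9FFF); extend the range if other scripts show up.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]+")

def text_preprocess(data):
    # Convert to lower case
    data = data.lower()
    # Replace runs of Chinese characters with a space
    data = CJK_PATTERN.sub(" ", data)
    # Replace digits and non-word characters with a space, as before
    data = re.sub(r"(\d|\W)+", " ", data)
    return data.strip()

def text_preprocess_ascii(data):
    # Alternative: with re.ASCII, \W matches every non-ASCII character,
    # so Chinese text (and accented letters too) is removed in one pass
    return re.sub(r"(\d|\W)+", " ", data.lower(), flags=re.ASCII).strip()
```

The first variant removes only the listed range and keeps other non-ASCII letters. The second is shorter but also strips accented letters, which may or may not be acceptable for the corpus.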

Also, note that I set stop_words='english' in my TfidfVectorizer.
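The `stop_words='english'` list contains only English words, so it would not filter Chinese tokens. Another possible route is the vectorizer's `token_pattern` parameter, whose default is `(?u)\b\w\w+\b`. The sketch below uses only `re` to compare that default with a hypothetical ASCII-letters-only pattern; `ASCII_TOKEN_PATTERN` and `tokenize` are illustrative names, and passing the pattern as `token_pattern=` is an assumption that has not been checked against the data in question.

```python
import re

# Default token pattern used by TfidfVectorizer, shown for comparison
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical replacement: tokens of two or more ASCII letters only
ASCII_TOKEN_PATTERN = r"(?u)\b[a-z]{2,}\b"

def tokenize(text, pattern):
    # Lower-case the text, then extract every match of the pattern
    return re.findall(pattern, text.lower())

sample = "Data 数据 mining"
default_tokens = tokenize(sample, DEFAULT_TOKEN_PATTERN)
ascii_tokens = tokenize(sample, ASCII_TOKEN_PATTERN)
```

With the ASCII-only pattern, tokens that mix Latin letters and Chinese characters are dropped entirely rather than split, because there is no word boundary between the two.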

Let me know if any further information is needed. (I'm new here and still learning.)