Each folder contains 200 unique *.txt files:
Each file holds the opening text of a lawsuit, and the files are grouped into folders by the legal area under which the lawsuit was publicly announced.
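To illustrate the layout (the area and file names below are made up, just to show the structure):

iniciais/
    legal_area_A/
        0001.txt
        0002.txt
        ... (200 files)
    legal_area_B/
        0001.txt
        ...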
I want to build training data from these files so I can predict the legal area of new lawsuits.
Last year I tried PHP-ML, but it consumed too much memory, so I want to migrate to Python.
I have started the code below, which loads each text file into a JSON-like structure, but I don't know what the next steps should be:
import os

import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# Root folder: one sub-folder per legal area, each holding the *.txt lawsuits
path = r'C:\wamp64\www\machine_learning\webroot\iniciais'

# files maps each legal area (folder name) to a list of lawsuit texts
files = {}
for directory in os.listdir(path):
    full_path = os.path.join(path, directory)
    if os.path.isdir(full_path):
        files[directory] = []
        for filename in os.listdir(full_path):
            full_filename = os.path.join(full_path, filename)
            if full_filename.endswith(".txt"):
                with open(full_filename, 'r', encoding='cp437') as f:
                    # read the whole file as one string per document
                    # (readlines() would give a list of lines, which is
                    # harder to feed into a vectorizer later)
                    files[directory].append(f.read())
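In case it helps to show what I am aiming for, here is a minimal sketch of the kind of next step I imagine, assuming a TF-IDF vectorizer and scikit-learn's LinearSVC (the classifier choice is just an assumption on my part, not something I already have working):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Flatten the {legal_area: [texts]} dict into a labelled DataFrame
rows = [(area, text) for area, texts in files.items() for text in texts]
df = pd.DataFrame(rows, columns=['area', 'text'])

# Hold out 20% of the lawsuits to check how well the model generalises
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['area'], test_size=0.2, random_state=42)

# TF-IDF features feeding a linear classifier
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out files

The Pipeline would keep the vectorizer and the classifier together, so a new lawsuit could then be classified with model.predict(['some new lawsuit text']), but I am not sure this is the right direction, which is why I am asking.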
Thanks in advance.