I've been looking for a way to read a very large CSV file. It's over 100 GB, so I need to know how to process it in chunks and how to make the concatenation faster.
%%time
import time
import pandas as pd

filename = "../code/csv/file.csv"
# count the total number of lines in the file
with open(filename) as f:
    lines_number = sum(1 for line in f)
lines_in_chunk = 100  # I don't know what chunk size is better
counter = 0
completed = 0
# chunksize makes read_csv return an iterator of DataFrames instead of one DataFrame
reader = pd.read_csv(filename, chunksize=lines_in_chunk)
CPU times: user 36.3 s, sys: 30.3 s, total: 1min 6s
Wall time: 1min 7s
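Most of that 1 min 7 s is likely the line count itself, since pd.read_csv with chunksize only builds a lazy iterator at this point and does not yet read the data. If the count is only needed for progress reporting, counting newlines in binary buffers is usually much faster than iterating decoded text lines. The helper below is a minimal sketch of that idea; count_lines is my own name, and it assumes plain newline-terminated rows with no embedded newlines inside quoted fields.

def count_lines(path, buf_size=1024 * 1024):
    # count b"\n" in fixed-size binary buffers instead of decoding text line by line
    lines = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            lines += buf.count(b"\n")
    return lines

lines_number = count_lines(filename)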
This part doesn't take very long, but the problem is the concat.
%%time
df = pd.concat(reader, ignore_index=True)
This step takes far too long and uses far too much memory. Is there a way to make this concat process faster and more memory-efficient?
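For reference, one pattern that avoids the memory blow-up entirely is to reduce each chunk as it arrives and keep only the small per-chunk results, so the full 100 GB table never has to exist in memory at once. The sketch below assumes the end goal is some groupby-style aggregation; the column names "key" and "value" are placeholders, not columns from the actual file.

partials = []
for chunk in pd.read_csv(filename, chunksize=1_000_000):
    # reduce each chunk to a small Series before moving on
    partials.append(chunk.groupby("key")["value"].sum())
# combine the per-chunk partial sums into the final result
result = pd.concat(partials).groupby(level=0).sum()

If the whole table really is needed as a single DataFrame, concat has to hold every row anyway, so a much larger chunksize (hundreds of thousands of rows rather than 100) mainly reduces per-chunk overhead rather than peak memory.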