Fastest free Python library for reading a CSV file with 1 to 3 columns of numbers

Software Recommendations: free, library, python, csv, performance

2021-10-23 21:33:23

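I am looking for the fastest Python library for reading a CSV file (1 or 3 columns, all integers or floats, if that matters; see the example) into a Python array, or into some object I can access in a similar way with similar access times. It should be free and run on Windows 7 and Ubuntu 12.04 with Python 2.7 x64.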

CSV with 1 column:

CSV with 3 columns:

9,52,1
52,91,0
91,135,0
135,174,0
174,218,0
218,260,0
260,301,0
301,341,0
341,383,0
383,423,0
423,466,0
466,503,0
503,547,0
547,583,0
583,629,0
629,667,0
667,713,0
713,754,0
754,796,0
796,839,1

4 Answers

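So I ended up writing a small benchmark using the libraries Steve Barnes pointed out. While searching I also came across the same question asked elsewhere, so it seems to be a common one. Some other ideas I have not tried yet: HDF5 for Python, PyTables, IOPro (not free).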

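In short, pandas.io.parsers.read_csv beats every other CSV parser, NumPy's loadtxt is very slow, and NumPy's fromfile and load are very fast. Note that fromfile and load read binary files written beforehand, not the CSV itself.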
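The gap between parsing text and loading a binary dump can be illustrated without NumPy. The following is only a minimal sketch that swaps in the standard library's array module for numpy.tofile/numpy.fromfile; the helper names (read_column_from_csv, dump_binary, load_binary) and the four sample values are made up for illustration and are not part of the benchmark.

```python
import array
import os
import tempfile


def read_column_from_csv(path):
    """Parse a 1-column text file of numbers into an array of doubles."""
    values = array.array('d')
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                values.append(float(line))
    return values


def dump_binary(values, path):
    """Write the raw doubles to disk (platform dependent, like numpy.tofile)."""
    with open(path, 'wb') as f:
        values.tofile(f)


def load_binary(path):
    """Read the raw doubles back; no text parsing is involved."""
    values = array.array('d')
    count = os.path.getsize(path) // values.itemsize
    with open(path, 'rb') as f:
        values.fromfile(f, count)
    return values


# Illustrative data only (hypothetical values, not the benchmark file).
csv_path = os.path.join(tempfile.mkdtemp(), '1col.csv')
with open(csv_path, 'w') as f:
    f.write('9\n52\n91\n135\n')

parsed = read_column_from_csv(csv_path)   # slow path: parse text once
dump_binary(parsed, csv_path + '.bin')    # cache as binary
reloaded = load_binary(csv_path + '.bin') # fast path on later runs
```

The same caveat as for numpy.fromfile applies: the dump carries no byte-order or type information, so it is only suitable as a local cache.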

Data (I should have generated it inside the benchmark, but I am out of time for now)

Code:

import csv
import os
import cProfile
import time
import numpy
import pandas
import warnings

# Make sure these files are in the same folder as benchmark_python.py
# As the names indicate:
# - '1col.csv' is a CSV file with 1 column
# - '3col.csv' is a CSV file with 3 columns
filename1 = '1col.csv'
filename3 = '3col.csv'
csv_delimiter = ' '  # set to ',' if your files are comma-separated like the sample in the question
debug = False

def open_with_python_csv(filename):
    '''
    https://docs.python.org/2/library/csv.html
    '''
    data =[]
    with open(filename, 'rb') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=csv_delimiter, quotechar='|')
        for row in csvreader:
            data.append(row)    
    return data

def open_with_python_csv_cast_as_float(filename):
    '''
    https://docs.python.org/2/library/csv.html
    '''
    data =[]
    with open(filename, 'rb') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=csv_delimiter, quotechar='|')
        for row in csvreader:
            data.append(map(float, row))    
    return data

def open_with_python_csv_list(filename):
    '''
    https://docs.python.org/2/library/csv.html
    '''
    data =[]
    with open(filename, 'rb') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=csv_delimiter, quotechar='|')
        data = list(csvreader)    
    return data


def open_with_numpy_loadtxt(filename):
    '''
    http://stackoverflow.com/questions/4315506/load-csv-into-2d-matrix-with-numpy-for-plotting
    '''
    data = numpy.loadtxt(open(filename,'rb'),delimiter=csv_delimiter,skiprows=0)
    return data

def open_with_pandas_read_csv(filename):
    df = pandas.read_csv(filename, sep=csv_delimiter)
    data = df.values
    return data    


def benchmark(function_name):
    # Time loading both files with the given loader and return the elapsed seconds
    start_time = time.clock()
    data = function_name(filename1)
    if debug: print data[0]
    data = function_name(filename3)
    if debug: print data[0]
    elapsed = time.clock() - start_time
    print function_name.__name__ + ': ' + str(elapsed), "seconds"
    return elapsed


def benchmark_numpy_fromfile():
    '''
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
    Do not rely on the combination of tofile and fromfile for data storage, 
    as the binary files generated are not platform independent.
    In particular, no byte-order or data-type information is saved.
    Data can be stored in the platform independent .npy format using
    save and load instead.
    
    Note that fromfile will create a one-dimensional array containing your data,
    so you might need to reshape it afterward.
    '''
    #ignore the 'tmpnam is a potential security risk to your program' warning
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)
        fname1 = os.tmpnam()
        fname3 = os.tmpnam()
        
    data = open_with_numpy_loadtxt(filename1)
    if debug: print data[0]
    data.tofile(fname1)
    data = open_with_numpy_loadtxt(filename3)
    if debug: print data[0]
    data.tofile(fname3)
    if debug: print data.shape
    fname3shape = data.shape
    start_time = time.clock()
    data = numpy.fromfile(fname1, dtype=numpy.float64) # you might need to switch to float32. List of types: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
    if debug: print len(data), data[0], data.shape
    data = numpy.fromfile(fname3, dtype=numpy.float64)
    data = data.reshape(fname3shape)
    if debug: print len(data), data[0], data.shape    
    elapsed = time.clock() - start_time
    print 'Numpy fromfile: ' + str(elapsed), "seconds"
    return elapsed

def benchmark_numpy_save_load():
    '''
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
    Do not rely on the combination of tofile and fromfile for data storage, 
    as the binary files generated are not platform independent.
    In particular, no byte-order or data-type information is saved.
    Data can be stored in the platform independent .npy format using
    save and load instead.
    
    Note that fromfile will create a one-dimensional array containing your data,
    so you might need to reshape it afterward.
    '''
    #ignore the 'tmpnam is a potential security risk to your program' warning
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)
        fname1 = os.tmpnam()
        fname3 = os.tmpnam()
        
    data = open_with_numpy_loadtxt(filename1)
    if debug: print data[0]    
    numpy.save(fname1, data)    
    data = open_with_numpy_loadtxt(filename3)
    if debug: print data[0]    
    numpy.save(fname3, data)    
    if debug: print data.shape
    fname3shape = data.shape
    start_time = time.clock()
    data = numpy.load(fname1 + '.npy')
    if debug: print len(data), data[0], data.shape
    data = numpy.load(fname3 + '.npy')
    #data = data.reshape(fname3shape)
    if debug: print len(data), data[0], data.shape    
    elapsed = time.clock() - start_time
    print 'Numpy load: ' + str(elapsed), "seconds"
    return elapsed


def main():
    number_of_runs = 20
    results = []
    
    benchmark_functions = ['benchmark(open_with_python_csv)', 
                           'benchmark(open_with_python_csv_list)',
                           'benchmark(open_with_python_csv_cast_as_float)',
                           'benchmark(open_with_numpy_loadtxt)',
                           'benchmark(open_with_pandas_read_csv)',
                           'benchmark_numpy_fromfile()',
                           'benchmark_numpy_save_load()']
    # Compute benchmark
    for run_number in range(number_of_runs):
        run_results = []
        for benchmark_function in benchmark_functions:
            run_results.append(eval(benchmark_function))
        # Store one row of timings per run (not once per benchmark function)
        results.append(run_results)
        
    # Display benchmark's results
    print results
    results = numpy.array(results)
    numpy.set_printoptions(precision=10) # http://stackoverflow.com/questions/2891790/pretty-printing-of-numpy-array
    numpy.set_printoptions(suppress=True)  # suppress suppresses the use of scientific notation for small numbers:
    print numpy.mean(results, axis=0)
    print numpy.std(results, axis=0)    
    
    #Another library, but not free: https://store.continuum.io/cshop/iopro/

if __name__ == "__main__":
    #cProfile.run('main()') # if you want to do some profiling
    main()

Windows 7:

Output:

open_with_python_csv: 1.57318865672 seconds
open_with_python_csv_list: 1.35567931732 seconds
open_with_python_csv_cast_as_float: 3.0801260484 seconds
open_with_numpy_loadtxt: 14.4942111801 seconds
open_with_pandas_read_csv: 0.371965476805 seconds
Numpy fromfile: 0.0130216095713 seconds
Numpy load: 0.0245501650124 seconds

To install all the libraries: Unofficial Windows Binaries for Python Extension Packages

Windows configuration:

Windows 7 SP1 x64 Ultimate
Python 2.7.6 x64
NumPy 1.7.1 (import numpy; numpy.version.version)
pandas 0.13.1 (import pandas as pd; pd.__version__)
MSI Computer Corp. laptop GE70 0ND-033US;9S7-175611-033 (with a Crucial M5 SSD)

Ubuntu 12.04:

Output:

open_with_python_csv: 1.93 seconds
open_with_python_csv_list: 1.52 seconds
open_with_python_csv_cast_as_float: 3.19 seconds
open_with_numpy_loadtxt: 7.47 seconds
open_with_pandas_read_csv: 0.35 seconds
Numpy fromfile: 0.01 seconds
Numpy load: 0.02 seconds
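To make the Ubuntu figures above easier to compare, the ratios relative to pandas.read_csv can be worked out with a few lines. This is just arithmetic on the timings already reported; the helper name slowdown_vs is made up for illustration.

```python
# Timings in seconds, copied from the Ubuntu 12.04 output above.
ubuntu_timings = {
    'open_with_python_csv': 1.93,
    'open_with_python_csv_list': 1.52,
    'open_with_python_csv_cast_as_float': 3.19,
    'open_with_numpy_loadtxt': 7.47,
    'open_with_pandas_read_csv': 0.35,
    'Numpy fromfile': 0.01,
    'Numpy load': 0.02,
}


def slowdown_vs(timings, baseline):
    """Return how many times slower each entry is than the baseline entry."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


ratios = slowdown_vs(ubuntu_timings, 'open_with_pandas_read_csv')
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print('%-40s %6.1fx' % (name, ratio))
```

On these numbers loadtxt comes out roughly 21 times slower than read_csv. The fromfile and load entries are not directly comparable, since they time reading a binary file that was written beforehand.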

To install all the libraries:

sudo apt-get install python-pip
sudo pip install numpy
sudo pip install pandas

If the libraries are already installed but need upgrading:

sudo apt-get install python-pip
sudo pip install numpy --upgrade
sudo pip install pandas --upgrade

Ubuntu configuration:

Ubuntu 12.04 x64
Python 2.7.3
NumPy 1.8.1 ( import numpy; numpy.version.version)
pandas 0.14.0 (import pandas as pd; pd.__version__)

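Obviously, feel free to improve the benchmark through comments, edits, and so on. I am sure there is plenty to improve:

Make sure the current loading functions are well optimized
Try new functions/libraries such as HDF5 for Python, PyTables, IOPro (not free)
Generate the CSV files inside the benchmark (so there is no need to download them)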
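The last item could be handled with a small generator along these lines. This is only a sketch: the function name generate_csv, the row count, and the value range are assumptions, since the size and contents of the original benchmark files are not given here. It runs on both Python 2.7 and Python 3.

```python
import os
import random
import tempfile


def generate_csv(path, n_rows, n_cols, delimiter=' ', seed=0):
    """Write n_rows lines of n_cols random integers, separated by delimiter."""
    rng = random.Random(seed)  # fixed seed so every run benchmarks the same data
    with open(path, 'w') as f:
        for _ in range(n_rows):
            row = [str(rng.randint(0, 1000)) for _ in range(n_cols)]
            f.write(delimiter.join(row) + '\n')


# Example: write both benchmark inputs into a temporary directory.
# 1000 rows is an arbitrary choice; use a larger count for meaningful timings.
out_dir = tempfile.mkdtemp()
path1 = os.path.join(out_dir, '1col.csv')
path3 = os.path.join(out_dir, '3col.csv')
generate_csv(path1, n_rows=1000, n_cols=1)
generate_csv(path3, n_rows=1000, n_cols=3)
```

The default delimiter matches the csv_delimiter value used in the benchmark; pass delimiter=',' for comma-separated files like the sample in the question.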

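I would like to contribute another library here, which I stumbled upon while facing a similar problem. I tested it with Franck Dernoncourt's benchmark code, and it outperformed both Python's standard csv module and pandas by a wide margin. I could not compare it against NumPy because my test file was a 24,000-row CSV containing both numeric and string values.

This fast library is actually built on the default csv implementation; it just uses TextIO to make it faster and to handle unicode strings correctly.

It is called fastcsv and was developed by Masaya Suzuki. You can clone it from GitHub or install it from PyPI. The simplest way is: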

pip install fastcsv

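At http://pythonhosted.org/fastcsv/ you can see more benchmark results, but for reading CSV let me repeat their results here:

[Figure: read benchmark with fastcsv]

It would be interesting to know how it performs on your data.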

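You have several options, depending on the size and complexity of the data and on what you will do with the result:

The csv library that ships with Python by default.
NumPy: the numpy.fromfile function reads into a NumPy array, so it is very powerful.
pandas: the pandas.io.parsers.read_csv function reads into a pandas DataFrame, is very powerful, and can handle huge data sets.

The first will probably be quicker to import, while the others are more powerful. All are free and cross-platform. The first is already part of your Python installation if you have a standard one.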
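For the first option, a minimal sketch of reading a 3-column numeric file with the built-in csv module might look like this. It is written for Python 3 (on Python 2.7, open the file in 'rb' mode without the newline argument); the function name read_numeric_csv and the column names are made up for illustration, and the rows are the first three from the sample in the question.

```python
import csv
import os
import tempfile


def read_numeric_csv(path, delimiter=','):
    """Read a CSV of numbers into a list of rows, each a list of floats."""
    rows = []
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=delimiter):
            if row:  # skip blank lines
                rows.append([float(value) for value in row])
    return rows


# Write the first three rows of the question's 3-column sample to a temp file.
sample_path = os.path.join(tempfile.mkdtemp(), '3col.csv')
with open(sample_path, 'w') as f:
    f.write('9,52,1\n52,91,0\n91,135,0\n')

rows = read_numeric_csv(sample_path)
# Transpose to get one tuple per column (hypothetical column names).
starts, ends, flags = zip(*rows)
```

This keeps everything in plain Python lists, so it needs no extra installation, at the cost of the parsing speed and memory compactness that NumPy and pandas offer.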

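There is a new package, pydatatable, with a very fast CSV reader based on the fread implementation in R's data.table.
Read more at https://github.com/h2oai/datatable. If you want to load the data into a pandas object, you can simply run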

import datatable as dt
pandas_dataframe = dt.fread(srcfile).to_pandas()
