一次将(大量)数据输入数据帧

数据挖掘 Python 熊猫 麻木的 数据框 数据争吵
2022-01-25 13:35:30

我正在使用python。一些 2D numpy 数组存储在 Series 的各个行中。它们是 30x30 的图像。它看起来像这样:

     pixels
0    [[23,4,54...],[54,6,7...],[........]]
1    [[65,54,255,55,...],[43,54,6...],[......]]
...
...
...
7000 [[........]]

对于系列中的每一行,我想获取这些二维数组,将它们展平为一维,获取值并将它们分配给数据帧中一行的列。每行将有 30x30 = 900pixels,存储每个像素的值。像这样:

    pixel1    pixel2    pixel3...    pixel900
0       23         4        54             77
1       65        54       255             33
...
...
... 

我正在使用一个精心制作的函数,它一次从系列中提取一行,展平数组,再次将其转换为系列,并将其附加到数据框。这需要很长时间。我相信一定有更快的方法。我正在使用这段代码:

def prep_pixels(X):
    # X is a series
    df = pd.DataFrame()
    for i in range(len(X.index)): #iterate through the whole series
        df = df.append(pd.Series(X[i].flatten()), ignore_index=True) 
    return df

编辑:根据用户的要求,我将提供代码,说明我最初是如何陷入这种困境的:D

#reading the files
filepath = 'dataset.pickle'
data_np = pd.read_pickle(filepath)
print(data_np[0])

输出:

 [array([[255, 248, 253, 255, 251, 253, 254, 236, 220, 217, 191, 145, 139,
        185, 216, 227, 252, 251, 254, 248, 251, 236, 221, 222, 213, 175,
        120,  75,  74, 209],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
         58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
         21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
         12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
         53,  58,  64, 124],
    ... 30 rows of 30 pixels
    ...
    ... last row coming up ...
    [255, 255, 254, 254, 253, 252, 253, 254, 255, 255, 254, 252, 249,
    249, 251, 213, 126, 178, 231, 252, 248, 250, 254, 254, 252, 253,
    255, 255, 255, 255]], dtype=uint8), 'क']

此列表中的最后一个符号是此图像所代表的字符。这是“标签”。它是使用 CNN 的监督学习。无论如何,我需要它们采用我描述的其他格式才能使用它们。这就是我处理这些数据的方式:

data = pd.DataFrame(data_np, columns=['pixels','labels'])
def prep_pixels(X):
    df = pd.DataFrame()
    for i in range(len(X.index)): #iterate through whole series
        df = df.append(pd.Series(X[i].ravel()), ignore_index=True)  
    return df

X = prep_pixels(data['pixels'])
y = data['labels']

编辑:用户建议我使用可变数据类型来执行此过程。他们说这可能会加快速度,因为计算不需要复制数据。我使用了一些嵌套的 for 循环,它将时间缩短了一半(1 分 22 秒而不是 3 分钟)。鉴于我的数据集只有 7000 个 30x30 像素的图像,我仍然觉得它很可悲。或者,也许我只是数据争论的新手。

这是我使用的代码。如果您有任何其他建议,请告诉我:

filepath = 'dataset.pickle'
data_np = pd.read_pickle(filepath)

df = pd.DataFrame()
for row in range(IMG_ROW):   
    for col in range(IMG_COL):
        pixel=[]
        for img in range(len(data_np)):
            pixel.append(data_np[img][0][row][col]) 
        columns = pd.Series(pixel, name=col)
        df = pd.concat([df, columns], ignore_index=True, axis=1)
2个回答

我使用以下方法获得timeit了大约 1/4 的结果:

flatX = X.apply(lambda x: x.flatten())
pd.DataFrame(item for item in flatX)

另请参阅 https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe 以获得第二行的一些可能更好的选项。

(关于我之前的评论,最后仅通过数据框化我并没有得到任何真正的节省。)

或者您遍历每一列,请参阅 Test3。总体思路:

结果:

  • Test1:232.96580633900157(@IsuShrestha,展平每一行,逐行追加)
  • Test2:6.919965944998694(@BenReiniger,将每一行展平)
  • 测试3:0.3909464059997845(逐列追加=一次性添加所有行=29个追加)

(AMD A8-3870 APU,带 Radeon(tm) 高清显卡,3000 MHz,4 核,4 个逻辑处理器,8 GB 内存,Windows 10 64 位)

import numpy as np
import pandas as pd
# https://stackoverflow.com/questions/7370801/how-to-measure-elapsed-time-in-python
from timeit import default_timer as timer


row = [np.array([
    [255, 248, 253, 255, 251, 253, 254, 236, 220, 217, 191, 145, 139,
    185, 216, 227, 252, 251, 254, 248, 251, 236, 221, 222, 213, 175,
    120,  75,  74, 209],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],       
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],    
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],         
    [255, 255, 254, 254, 253, 252, 253, 254, 255, 255, 254, 252, 249,
    249, 251, 213, 126, 178, 231, 252, 248, 250, 254, 254, 252, 253,
    255, 255, 255, 255]
    ], dtype='uint8'), "क"]



data_np = []
for i in range(7000):
    data_np.append(row)
    
    

data = pd.DataFrame(data_np, columns=['pixels','labels'])


# Test1 (@IsuShrestha, flatten each row, append row-wise)
def prep_pixels(X):
    df = pd.DataFrame()
    for i in range(len(X.index)): #iterate through whole series
        df = df.append(pd.Series(X[i].ravel()), ignore_index=True)  
    return df
start = timer()
test = prep_pixels(data['pixels'])
print(timer()-start)
print(test.shape)
# 232.96580633900157
# (7000, 900)


# Test2 (@BenReiniger, flatten each row)
def prep_pixels2(X):
    flatX = X.apply(lambda x: x.flatten())
    return pd.DataFrame(row for row in flatX)
start = timer()
test2 = prep_pixels2(data['pixels'])
print(timer()-start)
print(test2.shape)
# 6.919965944998694
# (7000, 900)


# Test3 (append column-wise = over all rows in one go = 29 appends)
def prep_pixels3(X):
    test = np.array([x[0] for x in X])
    for i in range(len(X[0])-1):
        # print(i)
        test = np.append(arr=test, values=np.array([x[i+1] for x in X]), axis=1)
    return pd.DataFrame(test)
start = timer()
test3 = prep_pixels3(data['pixels'].to_numpy())
print(timer()-start)
print(test3.shape)
# 0.3909464059997845
# (7000, 900)