Out-of-memory error when building a 2D list from two NumPy arrays

data-mining python numpy
2022-02-25 01:27:18

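I am working with lung CT images from the LUNA16 dataset, which provides 3D lung images plus labels in a CSV file. I have code that builds a 2D list from 25x25x25 3D arrays (the 3D image patches) and their labels, [0,1] or [1,0], taken from the CSV file. After creating the 2D list I want to save it to a NumPy file. Below is my code for creating the 2D list and saving it: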

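# Imports used below; ShowProcess, load_itk_image, resample and the path
# variables are assumed to be defined elsewhere.
import os
import time
import numpy as np
from tqdm import tqdm

def getIDlist(csv_Dist,Data_Dist):
    # Read the series IDs and marked coords from the CSV, and return each file's full path with its coords and label.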
    print('loading')
    data = np.loadtxt(csv_Dist, delimiter = ',', dtype = 'str')
    # skip the header row via 1:, take the ID, x, y, z, class columns via 0:5, and keep the first 10000 rows.
    ID_coords = data[1:,0:5][0:10000] # rows of 'seriesuid' 'coordX' 'coordY' 'coordZ' 'class' (without header).
    # define the output file.
    ID_dist = []

    print('start finding')
    process_bar = ShowProcess(len(ID_coords))

    for ID,x,y,z,label in ID_coords: 
        ID = ID +'.mhd' 
        found = 0       
        for parent, dirnames, filenames in os.walk(Data_Dist):
            for filename in filenames:  # loop over all files in this directory
                if ID == filename:  # ID + '.mhd' from the CSV matches a file name
                    process_bar.show_process()
                    ID = os.path.join(parent, ID)  # ID becomes the full path of the found file
                    ID_dist.append([ID,x,y,z,label])  # record the found file's path, coords and label
                    found = 1
                    #print("found: ", found)
                    break
            if found == 1:
                break
        if found == 1:
            continue

    process_bar.close()                 
    return ID_dist 

def get3Dmatrix(ID_dist):

    print('preparing the 3d matrix')
    matrixlist = []
    for Dist, xcoords, ycoords, zcoords, label in tqdm(ID_dist):
        # read the image
        imagearray,origin,spacing = load_itk_image(Dist)
        # resample to 1mm*1mm*1mm
        imagearray = resample(imagearray,spacing,(1,1,1))

        # transfer world coordinates to voxel-coordinates, divide new spacing 1mm
        z = int(round((float(zcoords)-float(origin[0]))/1))
        y = int(round((float(ycoords)-float(origin[1]))/1))
        x = int(round((float(xcoords)-float(origin[2]))/1))

        # get the 3D array with shape 25*25*25           
        imagearray = imagearray[z-13:z+12,y-13:y+12,x-13:x+12]

        #converting the label number into a one-hot-encoding
        if int(label) == 1: 
            label=np.array([0,1])
        elif int(label) == 0: 
            label=np.array([1,0])

        # put it into output file
        matrixlist.append([imagearray,label])  # 2D list of [3D array, label] for all cases
    return matrixlist 

def main():
    start_time = time.time()
    # get ID_list from the csv and data dist.
    ID_list = getIDlist(candidates_V2_Dist, Data_Dist)  # nested list: full file path + x, y, z, class
    # Data_set[i][0] is the 3D array, Data_set[i][1] is the label
    Data_set = get3Dmatrix(ID_list)  # 2D list of [3D array, label] for all cases
    print("Begin saving in numpy file")
    np.save(output_path+'np_ds(10000)-25-25-25(zyx)_one_hot.npy', Data_set)
    print("%s time takes in seconds" % (time.time() - start_time))

if __name__ == "__main__":
    main()

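My problems are:

1- After about 550 samples the RAM fills up and I get a memory error. I am using a Dell Inspiron laptop with a Core i7 and 16 GB of RAM.

2- Creating each sample takes 34 seconds, which seems like a very long time for a single sample.

I have searched Google extensively and asked on some other forums, but got no answer. Can anyone help me? I am really confused by this error. The error message was attached as an image.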

1 Answer

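I would suggest breaking the problem down so that less memory is in use at any one time.

The first part of your main function calls getIDlist. That seems fine, so leave it as it is.

I would then break that list into smaller chunks and call get3Dmatrix on each chunk in turn. Adapting your code, it might look something like this: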

import numpy as np    # should already be imported

# Get number of entries in ID list
N = len(ID_list)

# Break it down into a number of chunks, e.g. 4, based on your progress bar
num_chunks = 4           # you can play with this number, making it larger until you don't get memory errors
# num_chunks + 1 integer boundaries give num_chunks slices (float boundaries cannot be used as slice indices)
chunks = np.linspace(0, N, num_chunks + 1, dtype=int)

for i in range(len(chunks) - 1):
    this_sublist = ID_list[chunks[i] : chunks[i + 1]]
    sub_data_set = get3Dmatrix(this_sublist)

    # At this point, either save this sub_data_set, or try appending it to another list to make one final numpy matrix at the end before saving

...

print("Begin saving in numpy file")
np.save(output_path+'np_ds(10000)-25-25-25(zyx)_one_hot.npy', Data_set)
print("%s time takes in seconds" % (time.time() - start_time))
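The "save this sub_data_set" option mentioned in the loop comment could be sketched roughly as follows. This is only an illustration under assumptions: `fake_get3Dmatrix` is a hypothetical stand-in for your `get3Dmatrix`, and the helper names, the `chunk_NNN.npz` file naming and the use of `np.savez` are choices made here, not something taken from your code.

```python
import os
import numpy as np

def chunk_bounds(n_items, num_chunks):
    """Return integer (start, stop) pairs splitting range(n_items) into num_chunks parts."""
    edges = np.linspace(0, n_items, num_chunks + 1, dtype=int)
    return [(int(a), int(b)) for a, b in zip(edges[:-1], edges[1:])]

def fake_get3Dmatrix(sublist):
    # Hypothetical stand-in for get3Dmatrix: one zero-filled 25x25x25 patch
    # and one one-hot label per entry of the sublist.
    return [[np.zeros((25, 25, 25), dtype=np.float32), np.array([1, 0])]
            for _ in sublist]

def save_in_chunks(id_list, num_chunks, out_dir, build=fake_get3Dmatrix):
    """Build and save one chunk at a time so only one chunk is held in memory."""
    paths = []
    for k, (start, stop) in enumerate(chunk_bounds(len(id_list), num_chunks)):
        sub = build(id_list[start:stop])
        if not sub:          # skip empty chunks (more chunks than items)
            continue
        # Stack patches and labels into two regular arrays instead of a nested list.
        images = np.stack([img for img, _ in sub])
        labels = np.stack([lab for _, lab in sub])
        path = os.path.join(out_dir, f"chunk_{k:03d}.npz")
        np.savez(path, images=images, labels=labels)
        paths.append(path)
        del sub, images, labels   # drop references before building the next chunk
    return paths
```

In your case you would pass `ID_list`, your own `get3Dmatrix` as `build`, and a real output directory. Each file can later be read back with `np.load(path)["images"]` and `np.load(path)["labels"]`.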

Even from the traceback you added, it is hard to say exactly where in the code this happens.
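One place that may be worth checking (a guess, not something the traceback confirms): in `get3Dmatrix` the 25x25x25 patch is taken with basic slicing, and in NumPy basic slicing returns a view that keeps its parent array alive. If `resample` returns an ordinary NumPy array, each list entry could be holding a whole resampled scan in memory rather than just the patch. The sketch below uses a made-up 200x200x200 volume to show the difference between the view and an explicit `.copy()`.

```python
import numpy as np

# Stand-in for one resampled scan (assumed size and dtype, 32 MB).
volume = np.zeros((200, 200, 200), dtype=np.float32)

# Basic slicing gives a view: no data is copied, and the view references `volume`.
patch_view = volume[87:112, 87:112, 87:112]

# An explicit copy owns its own 25x25x25 buffer and does not reference `volume`.
patch_copy = patch_view.copy()

print(patch_view.base is volume)   # True: the full volume stays alive with the view
print(patch_copy.base is None)     # True: the copy is independent
print(volume.nbytes, patch_copy.nbytes)
```

If this turns out to apply, appending `imagearray.copy()` instead of the slice itself would let each full volume be freed once the loop moves on to the next file.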

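Looking roughly at the sizes you mention, running out of memory on a 16 GB machine does not seem plausible either, so I cannot quite tell how many images/patches are actually being saved.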
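As a rough check of that estimate, the memory needed for the patches alone can be computed directly. The 10000 patches and the 25x25x25 shape come from your code; the voxel dtype is an assumption, so a few common ones are shown.

```python
import numpy as np

def dataset_megabytes(n_patches, patch_shape, dtype):
    """Memory needed to hold n_patches arrays of patch_shape, in MB (10**6 bytes)."""
    voxels_per_patch = int(np.prod(patch_shape))
    return n_patches * voxels_per_patch * np.dtype(dtype).itemsize / 1e6

for dtype in ("int16", "float32", "float64"):
    mb = dataset_megabytes(10_000, (25, 25, 25), dtype)
    print(f"{dtype}: {mb:.1f} MB")
```

Even with 8-byte voxels this comes to 1250 MB, far below 16 GB, which suggests that something other than the patches themselves is being kept in memory.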