Returning rows from a list of indices in Python Pandas

data-mining python pandas
2022-03-15 17:58:13

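Context

I have a CSV containing two types of rows: an observation record and, on the row below it, an observation value associated with that record. The record row contains a four-letter code indicating the observation type. My goal is to create a new CSV containing only those observation records that match a specific list of codes, together with the associated observation value on the row below each.

Example from the file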

OBSERV\LTRC,CL1,0,10.00;
OBVAL\14,,0,V;
OBSERV\LTRC,CL1,10,20.00;
OBVAL\14,,0,V;
OBSERV\LTRC,CL1,20,30.00;
OBVAL\14,,0,V;
OBSERV\LTRC,CL1,30,40.00;
OBVAL\14,,0.5,V;

Code so far

import pandas as pd
data = pd.read_csv(r"CSVFILEPATH")
df = pd.DataFrame(data)
df.set_index
newdf = df.loc[df[0].str.contains('LLRT|LLTX|LRRT|LTRC|LV10|LV3|LEDR|LTRV|LES2|LES1', regex = True)]
# this returns all observation record rows I care about but I still need the associated observation values.

keep_ind = [] #This list will contain all indexes to keep 

observ_ind = ndf.index.values.tolist()#The list of observation record indexes to keep
keep_ind.append(observ_ind)#Added these to the keep_ind list

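Question

How can I take this list of indices (keep_ind), append a new list that is the same list with 1 added to every item (to get all the observation value rows beneath the records), and create a new dataframe containing the row for every index in the combined list?

So far I have tried: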

keep_ind.append(observ_ind + 1 for i in observ_ind)

But this gives the error:

generator object <genexpr> at 0x0000028057C33648>]
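1 Answer

I'd like to address two things. One is how to get the CSV you want, the other is the parts of your code that probably don't work as intended.

How to get the associated values

What you are trying to do is a nice task for the shift() method: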

# you probably don't need the semicolons at the end of each line, right?
# if you want to get rid of them, you can do:
df = pd.read_csv(
    r"CSVFILEPATH",    # your file
    engine='python',   # use the python engine instead of C so a regex can be the separator
    sep=r'[,;]',       # treat ; as a separator in addition to ,
    usecols=range(3),  # keep only the first three columns (drops the rest, including the empty field after the ;)
    names=range(3))    # assign names; a list of more descriptive column names works here too (this just uses numbers)

# create a copy of the original data frame
# in which every row is moved up by one position
df_shifted = df.shift(-1)

# concatenate it with the original data frame and assign unique column names
df_concat = pd.concat([df, df_shifted], axis='columns')
df_concat.columns = range(6)

# apply your filter to the concatenated frame, so the shifted columns come along
newdf = df_concat.loc[df_concat[0].str.contains('LLRT|LLTX|LRRT|LTRC|LV10|LV3|LEDR|LTRV|LES2|LES1', regex=True)]

This outputs:

In [44]: newdf
Out[44]: 
             0    1     2         3    4    5
0  OBSERV\LTRC  CL1   0.0  OBVAL\14  NaN  0.0
2  OBSERV\LTRC  CL1  10.0  OBVAL\14  NaN  0.0
4  OBSERV\LTRC  CL1  20.0  OBVAL\14  NaN  0.0
6  OBSERV\LTRC  CL1  30.0  OBVAL\14  NaN  0.5

based on the following test data, which you provided:

import io
import pandas as pd

raw=\
r"""OBSERV\LTRC,CL1,0,10.00;
OBVAL\14,,0,V;
OBSERV\LTRC,CL1,10,20.00;
OBVAL\14,,0,V;
OBSERV\LTRC,CL1,20,30.00;
OBVAL\14,,0,V;
OBSERV\LTRC,CL1,30,40.00;
OBVAL\14,,0.5,V;"""

df= pd.read_csv(io.StringIO(raw), engine='python', sep=r'[,;]', usecols=range(3), names=range(3))
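
If you would rather keep the original two-row layout (one record row followed by its value row) instead of one wide row per record, the same shift() idea can be applied to the filter mask instead of the data. This is only a minimal sketch, assuming record rows and value rows strictly alternate; the toy frame, the XXXX code and the two-item code list are made up for illustration:

```python
import pandas as pd

# Toy frame mirroring the sample file: each record row is followed by its value row.
# The contents (including the XXXX code) are invented for this example.
df = pd.DataFrame({
    0: ["OBSERV\\LTRC", "OBVAL\\14", "OBSERV\\XXXX", "OBVAL\\7", "OBSERV\\LV10", "OBVAL\\3"],
    1: ["CL1", None, "CL1", None, "CL2", None],
    2: [0.0, 0.0, 10.0, 0.0, 20.0, 0.5],
})

codes = ["LTRC", "LV10"]  # hypothetical subset of the codes of interest

# True on the record rows whose code is wanted
is_wanted = df[0].str.contains("|".join(codes), regex=True)

# Shifting the mask down by one also marks the row directly below each match
keep = is_wanted | is_wanted.shift(1, fill_value=False)

result = df[keep]
```

Here result holds the rows with index 0, 1, 4 and 5. Something like result.to_csv(path, index=False, header=False) could then write them out, where path is whatever output file you choose.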

Code that probably doesn't do what you intended

# the following line references a method, but doesn't call it:
df.set_index
# if you execute this line, it outputs:
df.set_index
Out[18]: 
<bound method DataFrame.set_index of              0    1     2
0  OBSERV\LTRC  CL1   0.0
1     OBVAL\14  NaN   0.0
...
# look at the <bound method part: this is the __repr__ string
# of the object (bound methods are objects themselves)
# if this line sits in the middle of a script, it has no effect
# at all (at most it slows execution down a tiny bit)
# because nothing is done with the returned object
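
For comparison, here is a minimal sketch of referencing the method versus calling it, assuming a call is what you had in mind. The column names code and value are invented for the example:

```python
import pandas as pd

# Tiny made-up frame; the column names are placeholders
df = pd.DataFrame({"code": ["LTRC", "LV10"], "value": [14, 3]})

# Only a reference to the bound method: nothing is called, df is untouched
method = df.set_index

# An actual call: returns a new DataFrame that uses "code" as its index
indexed = df.set_index("code")
```

Note that set_index returns a new DataFrame by default, so the result has to be assigned to a name to be of any use.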

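You wrote that you get the message generator object <genexpr> at 0x0000028057C33648>]. This is not an error message. Like the bound method message above, it is the __repr__ string of an object, in this case a generator object. If you call append on a list, the argument passed to append is added as one single element, so the generator object itself ends up in the list. I guess you would rather add the indices incremented by one to the existing list. This can be done with the following code: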

# create a copy of the list to avoid funny results
new_indices=list(observ_ind)
# add the elements produced by the generator object
# to the list (rather than the generator object itself)
new_indices.extend(i + 1 for i in observ_ind)
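
To round this off, here is a small sketch of the difference between append and extend, and of selecting the rows for the combined indices with loc. It assumes the frame has the default integer index (so label i + 1 is the row directly below label i) and that no match sits on the last row; the frame contents and the indices in observ_ind are made up:

```python
import pandas as pd

# Made-up frame standing in for the file: record rows alternate with value rows
df = pd.DataFrame({"line": ["rec A", "val A", "rec B", "val B", "rec C", "val C"]})

observ_ind = [0, 4]  # hypothetical indices of the matching record rows

# append adds its argument as ONE element, here the generator object itself
appended = list(observ_ind)
appended.append(i + 1 for i in observ_ind)  # list now has 3 elements

# extend consumes the generator and adds each produced value separately
keep_ind = list(observ_ind)
keep_ind.extend(i + 1 for i in observ_ind)  # [0, 4, 1, 5]

# Sort so each value row follows its record row, then select by index label
result = df.loc[sorted(keep_ind)]
```

Sorting before the lookup is a choice made here to preserve the file order; df.loc would also accept the unsorted list and return the rows in that order.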