Using pandas, check a column for matching text and update a new column if true

data-mining python pandas data-cleaning dataframe
2021-10-05 08:58:43

My objective: using pandas, check a column for matching text [not an exact match] and, if it matches, update a new column.

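From a CSV file, I create a data frame and check whether the values of a particular column, COLUMN_to_Check, match a text pattern such as 'PEA'. Depending on whether the pattern matches, I want to create a new column in the data frame containing YES or NO.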

I have the following data in the file DATA2.csv:

ASSIGNMENT,Open date,Resolved date,COLUMN_to_Check,NUMBER,Open Time,RESOLVED_GROUP,RESOLVED_TIME,SUBCATEGORY
GBL_IS_GRC_PROCESSCONTROL,3/1/2017 13:39,11/1/2017 13:09,APAC_LT-ERP-FICO-BOKADABISH_PRD,IM-17-001200,3/1/2017 13:39,GBL_GSO_MQG,11/1/2017 13:09,Security (breach or weakness)
RSP_SERVICEDESK,12/1/2017 0:08,12/1/2017 0:27,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006462,12/1/2017 0:08,RSP_SERVICEDESK,12/1/2017 0:27,failure
RSP_SERVICEDESK,10/1/2017 5:27,12/1/2017 0:52,APAC_LT-ERP-SUPPLY-PEA_PRD,IM-17-004667,10/1/2017 5:27,RSP_PCS_INCIDENTS,12/1/2017 0:52,failure
RSP_SERVICEDESK,12/1/2017 2:35,12/1/2017 3:03,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006483,12/1/2017 2:35,RSP_SERVICEDESK,12/1/2017 3:03,access
RSP_SAP_BI,10/1/2017 21:04,12/1/2017 6:01,APAC_LT-ERP-SALES-PEA_PRD,IM-17-005498,10/1/2017 21:04,RSP_SAP_SALES,12/1/2017 6:01,SAP Sales

And I am using this code:

import pandas as pd

df=pd.read_csv('DATA2.csv')

Search_for_These_values = ['PEA', 'DEF', 'XYZ'] #creating list

pattern = '|'.join(Search_for_These_values)     # joining list for comparison

IScritical=df['COLUMN_to_Check'].str.contains(pattern)
for CHECK in IScritical:
    if not CHECK:
        print CHECK
        df['NEWcolumn']='NO'
    else:
        print CHECK
        df['NEWcolumn']='YES'

df.to_csv('OUPUT.csv')

Printing the value of CHECK returns the correct values, i.e. the first row returns False.

C:\Users\ME\Documents\SandBox (master)
λ python numpytest_pub.py
False
True
True
True
True

But the output CSV file shows every value of 'NEWcolumn' as 'YES', whereas the value in row [0] of 'NEWcolumn' should be 'NO', because 'COLUMN_to_Check' in that row does not match the pattern.

,ASSIGNMENT,Open date,Resolved date,COLUMN_to_Check,NUMBER,Open Time,RESOLVED_GROUP,RESOLVED_TIME,SUBCATEGORY,NEWcolumn
0,GBL_IS_GRC_PROCESSCONTROL,3/1/2017 13:39,11/1/2017 13:09,APAC_LT-ERP-FICO-BOKADABISH_PRD,IM-17-001200,3/1/2017 13:39,GBL_GSO_MQG,11/1/2017 13:09,Security (breach or weakness),YES
1,RSP_SERVICEDESK,12/1/2017 0:08,12/1/2017 0:27,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006462,12/1/2017 0:08,RSP_SERVICEDESK,12/1/2017 0:27,failure,YES
2,RSP_SERVICEDESK,10/1/2017 5:27,12/1/2017 0:52,APAC_LT-ERP-SUPPLY-PEA_PRD,IM-17-004667,10/1/2017 5:27,RSP_PCS_INCIDENTS,12/1/2017 0:52,failure,YES
3,RSP_SERVICEDESK,12/1/2017 2:35,12/1/2017 3:03,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006483,12/1/2017 2:35,RSP_SERVICEDESK,12/1/2017 3:03,access,YES
4,RSP_SAP_BI,10/1/2017 21:04,12/1/2017 6:01,APAC_LT-ERP-SALES-PEA_PRD,IM-17-005498,10/1/2017 21:04,RSP_SAP_SALES,12/1/2017 6:01,SAP Sales,YES

I can sense that something is missing in the CHECK part, but I cannot figure out what. Can anyone help?

Let me know if the question needs to be rephrased for better understanding or for future community use.

4 Answers

You just need to do this:

df['NEWcolumn'] = df['COLUMN_to_Check'].str.contains(pattern)
df['NEWcolumn'] = df['NEWcolumn'].map({True: 'Yes', False: 'No'})
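As an alternative to the two-step map, the labels can also be produced in one expression with NumPy's `np.where`. This is a swapped-in technique, not the one used above, and the small inline data frame is a shortened stand-in for DATA2.csv, so treat it as a sketch:

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the COLUMN_to_Check values in DATA2.csv (illustrative only).
df = pd.DataFrame({
    "COLUMN_to_Check": [
        "APAC_LT-ERP-FICO-BOKADABISH_PRD",
        "APAC_LT-ERP-SALES-PEA_PRD",
        "APAC_LT-ERP-SUPPLY-PEA_PRD",
    ]
})
pattern = "PEA|DEF|XYZ"

# np.where picks "YES" where the mask is True and "NO" elsewhere, row by row.
df["NEWcolumn"] = np.where(df["COLUMN_to_Check"].str.contains(pattern), "YES", "NO")
labels = df["NEWcolumn"].tolist()
```

This yields the upper-case YES/NO labels the question asks for without an intermediate boolean column.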

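df['NEWcolumn']='NO' sets the entire column to the value 'NO'. Each iteration of the loop overwrites the whole column, so what you see is the result for the last row of the table spread across the entire column.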
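To make the overwrite visible, here is a minimal sketch on a three-row stand-in for the question's data (the inline values and the lower-case variable names are illustrative, not taken from DATA2.csv verbatim). It runs the same loop, records the result, and then contrasts it with an element-wise mapping:

```python
import pandas as pd

# Shortened stand-in for the question's data (illustrative only).
df = pd.DataFrame({
    "COLUMN_to_Check": [
        "APAC_LT-ERP-FICO-BOKADABISH_PRD",
        "APAC_LT-ERP-SALES-PEA_PRD",
        "APAC_LT-ERP-SUPPLY-PEA_PRD",
    ]
})
is_critical = df["COLUMN_to_Check"].str.contains("PEA|DEF|XYZ")

# Assigning a scalar to a column fills every row, so each iteration
# replaces the whole column and only the last iteration's value survives.
for check in is_critical:
    df["NEWcolumn"] = "YES" if check else "NO"
loop_result = df["NEWcolumn"].tolist()

# Mapping the boolean Series element-wise produces one label per row.
df["NEWcolumn"] = is_critical.map({True: "YES", False: "NO"})
mapped_result = df["NEWcolumn"].tolist()
```

Here `loop_result` is 'YES' in every row because the last row matches, while `mapped_result` keeps 'NO' for the first row.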

Here is one way to achieve what you want:

df.loc[IScritical, 'NEWcolumn'] = 'YES'    # rows where the pattern matched
df.loc[~IScritical, 'NEWcolumn'] = 'NO'    # all remaining rows

https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking

You can directly use the IScritical boolean Series you already created:

import pandas as pd

df=pd.read_csv('DATA2.csv')

Search_for_These_values = ['PEA', 'DEF', 'XYZ'] #creating list

pattern = '|'.join(Search_for_These_values)     # joining list for comparison

IScritical=df['COLUMN_to_Check'].str.contains(pattern)

df['NEWcolumn'] = IScritical.replace((True,False), ('YES','NO'))

You can add the column first with a default value of 'NO', then update the data frame using .loc:

df['NEWcolumn']='NO'
df.loc[df['COLUMN_to_Check'].str.contains(pattern), 'NEWcolumn'] = 'YES'
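Two details can matter with real data: `str.contains` treats the pattern as a regular expression by default, and it returns a missing value rather than False for empty cells, which cannot be used as a boolean mask. The sketch below assumes the search terms should be matched literally and that missing cells should count as 'NO'. The inline data, including the empty cell, is made up for illustration:

```python
import re
import pandas as pd

# Illustrative data with one missing cell; not taken from DATA2.csv.
df = pd.DataFrame({
    "COLUMN_to_Check": [
        "APAC_LT-ERP-SALES-PEA_PRD",
        None,
        "APAC_LT-ERP-FICO-BOKADABISH_PRD",
    ]
})
search_values = ["PEA", "DEF", "XYZ"]

# re.escape keeps any regex metacharacters in the search terms literal.
pattern = "|".join(re.escape(value) for value in search_values)

# na=False makes missing cells count as "no match", so the mask is strictly boolean.
mask = df["COLUMN_to_Check"].str.contains(pattern, na=False)

df["NEWcolumn"] = "NO"              # default for every row
df.loc[mask, "NEWcolumn"] = "YES"   # overwrite only the matching rows
result = df["NEWcolumn"].tolist()
```

If the match should ignore letter case, `str.contains` also accepts `case=False`.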