根据唯一列值从现有数据框创建新数据框

数据挖掘 Python 熊猫 数据框
2022-01-29 15:11:25

我有一个大型数据集(450 万行,35 列)。感兴趣的列是company_id(string) 和company_score(float)。大约有 10,000 个独特company_id的。

company_id    company_score    date_submitted    company_region
AA            .07              1/1/2017          NW
AB            .08              1/2/2017          NE
CD            .0003            1/18/2017         NW

我的目标是通过 unique 创建大约 10,000 个新company_id数据框,其中仅包含该数据框中的相关行。

我的第一个想法是创建如下所示的数据框集合,然后遍历原始数据集并根据标准附加新值。

company_dictionary = {}  
for company in df['company_id']:  
    company_dictionary[company_id] = pd.DataFrame([])

有没有更好的方法通过利用熊猫来做到这一点?即,有没有一种方法可以使用内置的 pandas 函数来创建仅包含相关行的新过滤数据框?

编辑:我尝试了一种新方法,但我现在遇到了我不理解的错误消息。

    [In]  unique_company_id = np.unique(df[['ID_BB_GLOBAL']].values)
    [In]  unique_company_id
    [Out] array(['BBG000B9WMF7', 'BBG000B9XBP9', 'BBG000B9ZG58', ..., 'BBG00FWZQ3R9',
       'BBG00G4XRQN5', 'BBG00H2MZS56'], dtype=object)
    [In]  for id in unique_company_id:
    [In]      new_df = df[df['id'] == id]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\get_loc(self, key, method, tolerance)
   2133             try:
-> 2134                 return self._engine.get_loc(key)
   2135             except KeyError:

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()

pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()

pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()

KeyError: 'id'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-50-dce34398f1e1> in <module>()
      1 for id in unique_bank_id:
----> 2     new_df = df[df['id'] == id]

C:\ in __getitem__(self, key)
   2057             return self._getitem_multilevel(key)
   2058         else:
-> 2059             return self._getitem_column(key)
   2060 
   2061     def _getitem_column(self, key):

C:\ in _getitem_column(self, key)
   2064         # get column
   2065         if self.columns.is_unique:
-> 2066             return self._get_item_cache(key)
   2067 
   2068         # duplicate columns & possible reduce dimensionality

C:\ in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

C:\ in get(self, item, fastpath)
   3541 
   3542             if not isnull(item):
-> 3543                 loc = self.items.get_loc(item)
   3544             else:
   3545                 indexer = np.arange(len(self.items))[isnull(self.items)]

C:\ in get_loc(self, key, method, tolerance)
   2134                 return self._engine.get_loc(key)
   2135             except KeyError:
-> 2136                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2137 
   2138         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()

pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()

pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()

KeyError: 'id'
1个回答

您可以列并将其结果转换为 DataFrame 的字典:groupby company_id

import pandas as pd

df = pd.DataFrame({
    "company_id": ["AA", "AB", "AA", "CD", "AB"],
    "company_score": [.07, .08, .06, .0003, .09],
    "company_region": ["NW", "NE", "NW", "NW", "NE"]})

# Approach 1
dict_of_companies = {k: v for k, v in df.groupby('company_id')}

# Approach 2
dict_of_companies = dict(tuple(df.groupby("company_id")))

import pprint
pprint.pprint(dict_of_companies)

输出:

{'AA':   company_id company_region  company_score
0         AA             NW           0.07
2         AA             NW           0.06,
 'AB':   company_id company_region  company_score
1         AB             NE           0.08
4         AB             NE           0.09,
 'CD':   company_id company_region  company_score
3         CD             NW         0.0003}