我有一个大型数据集(450 万行,35 列)。感兴趣的列是company_id
(string) 和company_score
(float)。大约有 10,000 个独特company_id
的。
company_id company_score date_submitted company_region
AA .07 1/1/2017 NW
AB .08 1/2/2017 NE
CD .0003 1/18/2017 NW
我的目标是通过 unique 创建大约 10,000 个新company_id
数据框,其中仅包含该数据框中的相关行。
我的第一个想法是创建如下所示的数据框集合,然后遍历原始数据集并根据标准附加新值。
company_dictionary = {}
for company in df['company_id']:
company_dictionary[company_id] = pd.DataFrame([])
有没有更好的方法通过利用熊猫来做到这一点?即,有没有一种方法可以使用内置的 pandas 函数来创建仅包含相关行的新过滤数据框?
编辑:我尝试了一种新方法,但我现在遇到了我不理解的错误消息。
[In] unique_company_id = np.unique(df[['ID_BB_GLOBAL']].values)
[In] unique_company_id
[Out] array(['BBG000B9WMF7', 'BBG000B9XBP9', 'BBG000B9ZG58', ..., 'BBG00FWZQ3R9',
'BBG00G4XRQN5', 'BBG00H2MZS56'], dtype=object)
[In] for id in unique_company_id:
[In] new_df = df[df['id'] == id]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
C:\get_loc(self, key, method, tolerance)
2133 try:
-> 2134 return self._engine.get_loc(key)
2135 except KeyError:
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()
KeyError: 'id'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-50-dce34398f1e1> in <module>()
1 for id in unique_bank_id:
----> 2 new_df = df[df['id'] == id]
C:\ in __getitem__(self, key)
2057 return self._getitem_multilevel(key)
2058 else:
-> 2059 return self._getitem_column(key)
2060
2061 def _getitem_column(self, key):
C:\ in _getitem_column(self, key)
2064 # get column
2065 if self.columns.is_unique:
-> 2066 return self._get_item_cache(key)
2067
2068 # duplicate columns & possible reduce dimensionality
C:\ in _get_item_cache(self, item)
1384 res = cache.get(item)
1385 if res is None:
-> 1386 values = self._data.get(item)
1387 res = self._box_item_values(item, values)
1388 cache[item] = res
C:\ in get(self, item, fastpath)
3541
3542 if not isnull(item):
-> 3543 loc = self.items.get_loc(item)
3544 else:
3545 indexer = np.arange(len(self.items))[isnull(self.items)]
C:\ in get_loc(self, key, method, tolerance)
2134 return self._engine.get_loc(key)
2135 except KeyError:
-> 2136 return self._engine.get_loc(self._maybe_cast_indexer(key))
2137
2138 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()
KeyError: 'id'