pd.qcut bins error!

data-mining python pandas data
2022-03-06 23:33:11

Hi Data Science community!

I am doing some RFM analysis, but I get an error when setting up the bins. Here is my code, followed by the exact error.

#Import modules
import pandas as pd # for dataframes
import matplotlib.pyplot as plt # for plotting graphs
import seaborn as sns # for plotting graphs
import datetime as dt

#Load the file
data = pd.read_excel('Orders.xlsx')
#EDA (Exploratory Data Analysis)
data.head()
data.info()
data.describe()

#Removing null values
data= data[pd.notnull(data['CustomerID'])]

#Delete duplicate records
filtered_data=data[['Country','CustomerID','InvoiceNo','Description']].drop_duplicates()
filtered_data.info()

#Top ten country's customer
filtered_data.Country.value_counts()[:10].plot(kind='bar')

#Filter US customers only
us_data=data[data.Country=='US']
us_data.info()
us_data.describe()

#Filter out unit price <= 0
us_data = us_data[(us_data['UnitPrice']>0)]
us_data.info()
us_data.describe()

#Filter to only required columns for RFM Analysis
us_data=us_data[['CustomerID','InvoiceDate','InvoiceNo','Quantity','UnitPrice']]
#Create TotalPrice column
us_data['TotalPrice'] = us_data['Quantity'] * us_data['UnitPrice']
#Find the oldest and newest dates
us_data['InvoiceDate'].min(),us_data['InvoiceDate'].max()
#Define the present date
PRESENT = dt.datetime(2019,10,1)
#Convert InvoiceDate to datetime
us_data['InvoiceDate'] = pd.to_datetime(us_data['InvoiceDate'])
us_data.head()

#RFM Analysis
rfm= us_data.groupby('CustomerID').agg({'InvoiceDate': lambda date: (PRESENT - date.max()).days,
                                        'InvoiceNo': lambda num: len(num),
                                        'TotalPrice': lambda price: price.sum()})
rfm.columns
# Change the name of the columns
rfm.columns=['recency','frequency','monetary']
rfm['recency'] = rfm['recency'].astype(int)
rfm.head()

#Computing Quantile of RFM values
rfm['r_quartile'] = pd.qcut(rfm['recency'], 4, ['1','2','3','4'])
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['1','2','3','4'])
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, ['1','2','3','4'])
rfm.head()

Here is the traceback:

Traceback (most recent call last):

  File "<ipython-input-57-e15dc8d2e29f>", line 2, in <module>
    rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['1','2','3','4'])

  File "/Users/omarmartinez/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 313, in qcut
    dtype=dtype, duplicates=duplicates)

  File "/Users/omarmartinez/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 339, in _bins_to_cuts
    "the 'duplicates' kwarg".format(bins=bins))

ValueError: Bin edges must be unique: array([ 1.,  1.,  1.,  1., 37.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

I tried adding the duplicates='drop' argument, as follows:

#Computing Quantile of RFM values
rfm['r_quartile'] = pd.qcut(rfm['recency'], 4, ['1','2','3','4'], duplicates='drop')
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['1','2','3','4'],duplicates='drop')
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, ['1','2','3','4'],duplicates='drop')
rfm.head()

But then I get a different traceback:

Traceback (most recent call last):

  File "<ipython-input-59-bf551522a462>", line 2, in <module>
    rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['1','2','3','4'],duplicates='drop')

  File "/Users/omarmartinez/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 313, in qcut
    dtype=dtype, duplicates=duplicates)

  File "/Users/omarmartinez/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 359, in _bins_to_cuts
    raise ValueError('Bin labels must be one fewer than '

ValueError: Bin labels must be one fewer than the number of bin edges

I am a bit lost here, so any help would be greatly appreciated.

Thanks in advance for your support!

1 Answer

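For anyone with this or a similar problem, I recommend doing some EDA. It is a very important step, and one we may rush through without really understanding how the data is distributed.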
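As a rough illustration of the kind of check meant here, the sketch below looks at how often each value occurs and at the quantile edges that pd.qcut would compute. The series is made-up stand-in data, not the real rfm['frequency'] column; it was chosen only because it reproduces the edges shown in the first traceback.

```python
import pandas as pd

# Hypothetical stand-in for rfm['frequency']: almost every customer has one order.
frequency = pd.Series([1] * 9 + [37], name="frequency")

# Share of observations per distinct value; a single dominant value is a red flag.
share = frequency.value_counts(normalize=True)

# The five edges that pd.qcut(frequency, 4) derives from the data.
edges = frequency.quantile([0, 0.25, 0.5, 0.75, 1.0])

print(share)
print(edges.tolist())  # [1.0, 1.0, 1.0, 1.0, 37.0]
print("distinct edges:", edges.nunique())
```

With only two distinct edges, duplicates='drop' leaves a single bin, which cannot be matched to four labels. That appears to be why the second attempt fails with "Bin labels must be one fewer than the number of bin edges".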

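Essentially, my problem was that the distribution was very skewed: for example, 90% of the observations had the same value. So I used specific percentiles to create the 4 buckets.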
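The exact percentiles are not given above, so the following is only a sketch of the idea: pass pd.qcut an explicit list of quantiles instead of the integer 4, choosing cut points that fall past the dominant value. The data and the 90/95/99 cut points are assumptions made for illustration and would need to be tuned to the real distribution.

```python
import pandas as pd

# Hypothetical skewed data: 90% of customers share the same frequency.
frequency = pd.Series([1] * 90 + [2] * 4 + [5] * 3 + [10] * 2 + [37])

# Assumed cut points; they must yield five distinct edges for four labels.
percentiles = [0, 0.90, 0.95, 0.99, 1.0]
edges = frequency.quantile(percentiles)
assert edges.is_unique, "pick different percentiles: some edges coincide"

# qcut accepts a list of quantiles in place of a bin count.
f_bucket = pd.qcut(frequency, q=percentiles, labels=['1', '2', '3', '4'])

# Bucket sizes are deliberately unequal; the first one holds the dominant value.
print(f_bucket.value_counts().sort_index())
```

The buckets are no longer true quartiles, so the scores should be read as "bulk of customers" versus progressively rarer high-frequency groups.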