How to plot multiple variables with Pandas and Bokeh

Tags: data mining, python, pandas

2021-09-29 16:58:54

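I am new to Pandas and Bokeh; I would like to create a bar chart that shows two different variables side by side for comparison.

For example, with the Pandas data frame below, I would like to see how the amount recalled compares with the amount recovered for each year.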

    year  Recalled  Recovered
0   1994    11.472     10.207
1   1995    11.810     10.326
2   1996    10.632     10.094
3   1997    13.857     12.944
4   1998    13.861     12.588
5   1999    13.375     11.951
6   2000    11.278        nan
7   2001    12.827        nan
8   2002    12.687        nan
9   2003    10.859        nan
10  2004       nan        nan
11  2005    11.782     11.047
12  2006    12.089     10.194
13  2007    14.351     13.401
14  2008    14.921     13.886
15  2009    11.759     10.815
16  2010    12.987     11.482
17  2011    13.262     10.730
18  2012     9.980      9.520
19  2013    10.626      9.591
20  2014    12.199     10.270

4 Answers

This changed in recent versions of Bokeh (as of 0.12.7, I believe). Here is the new way to do it.

from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

output_file("bars.html")

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']

data = {'fruits' : fruits,
        '2015'   : [2, 1, 4, 3, 2, 4],
        '2016'   : [5, 3, 3, 2, 4, 6],
        '2017'   : [3, 2, 4, 4, 5, 3]}

# this creates [ ("Apples", "2015"), ("Apples", "2016"), ("Apples", "2017"), ("Pears", "2015"), ... ]
x = [ (fruit, year) for fruit in fruits for year in years ]
counts = sum(zip(data['2015'], data['2016'], data['2017']), ()) # like an hstack

source = ColumnDataSource(data=dict(x=x, counts=counts))

# Note: Bokeh 3.0 and later use height=250 in place of plot_height=250
p = figure(x_range=FactorRange(*x), plot_height=250, title="Fruit Counts by Year",
           toolbar_location=None, tools="")

p.vbar(x='x', top='counts', width=0.9, source=source)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None

show(p)
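
The same nested-category approach can be applied to the data frame from the question. The following is a minimal sketch rather than part of the original answer: the helper names `nested_bar_data` and `show_grouped_bars` and the output filename are hypothetical, and only a few rows of the question's data are used. Pairs with a missing value are left out of the data source but kept in the axis range, so every year still gets its slot.

```python
import pandas as pd


def nested_bar_data(df, id_col, value_cols):
    """Return (factors, x, tops) for a nested-category bar chart.

    factors: every (id, variable) pair, in axis order.
    x, tops: the pairs that have a value, and those values.
    """
    factors, x, tops = [], [], []
    labels = df[id_col].astype(str).tolist()
    for i, label in enumerate(labels):
        for col in value_cols:
            factor = (label, col)
            factors.append(factor)
            value = df[col].iloc[i]
            if pd.notna(value):  # skip missing measurements
                x.append(factor)
                tops.append(float(value))
    return factors, x, tops


def show_grouped_bars(df, id_col, value_cols, filename="recall_bars.html"):
    """Render the chart (hypothetical helper; needs Bokeh installed)."""
    # Imported here so the reshaping helper above works without Bokeh.
    from bokeh.io import output_file, show
    from bokeh.models import ColumnDataSource, FactorRange
    from bokeh.plotting import figure

    factors, x, tops = nested_bar_data(df, id_col, value_cols)
    source = ColumnDataSource(data=dict(x=x, counts=tops))
    p = figure(x_range=FactorRange(*factors), title="Recalled vs. Recovered by year")
    p.vbar(x="x", top="counts", width=0.9, source=source)
    p.y_range.start = 0
    p.xaxis.major_label_orientation = 1
    output_file(filename)
    show(p)


# A few rows taken from the question's data frame.
df = pd.DataFrame({
    "year": [1999, 2000, 2004, 2005],
    "Recalled": [13.375, 11.278, float("nan"), 11.782],
    "Recovered": [11.951, float("nan"), float("nan"), 11.047],
})
factors, x, tops = nested_bar_data(df, "year", ["Recalled", "Recovered"])
```

Calling `show_grouped_bars(df, "year", ["Recalled", "Recovered"])` would then write the chart to an HTML file and open it.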

Pandas only, no Bokeh (copy the data to the clipboard before running):

import pandas, seaborn  # seaborn is not needed for the plot call itself
DF = pandas.read_clipboard()
DF.plot.bar(x='year')
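
If the clipboard is not convenient, the same plot can be built from text held in a string. This is a sketch under a few assumptions: the variable `TEXT` and the helper `plot_side_by_side` are illustrative names, the index column of the printed frame has been left out, and only four rows from the question are included.

```python
import io

import pandas as pd

# A few rows from the question, whitespace-separated like the clipboard text.
TEXT = """year Recalled Recovered
1999 13.375 11.951
2000 11.278 nan
2004 nan nan
2005 11.782 11.047
"""

# "nan" is one of the strings pandas treats as a missing value by default.
df = pd.read_csv(io.StringIO(TEXT), sep=r"\s+")


def plot_side_by_side(frame):
    """Draw one group of bars per year (needs matplotlib installed)."""
    return frame.plot.bar(x="year", y=["Recalled", "Recovered"])
```

`plot_side_by_side(df)` returns the matplotlib axes, so labels or a title can be set on it afterwards.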

If you first melt the Pandas data frame, you can use grouping in Bokeh's high-level Bar chart.

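import pandas as pd
from bokeh.plotting import figure, show
# Bar comes from the legacy bokeh.charts API, which later Bokeh releases no longer ship
from bokeh.charts import Bar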

# Use output_notebook if you are using an IPython or Jupyter notebook
from bokeh.io import output_notebook
output_notebook()

# Get your data into the dataframe
df = pd.read_csv("data.csv")

# Create a "melted" version of your dataframe
melted_df = pd.melt(df, id_vars=['year'], value_vars=['Recalled', 'Recovered'])

melted_df.head()

This is the format of your melted data frame:

+---+------+----------+--------+
|   | year | variable | value  |
+---+------+----------+--------+
| 0 | 1994 | Recalled | 11.472 |
| 1 | 1995 | Recalled |  11.81 |
| 2 | 1996 | Recalled | 10.632 |
| 3 | 1997 | Recalled | 13.857 |
| 4 | 1998 | Recalled | 13.861 |
+---+------+----------+--------+

Then simply use the melted data frame as the data for the Bokeh bar chart:

p = Bar(melted_df, label="year", values="value", group="variable", legend="top_left", ylabel='Values')
show(p)
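
To see what the melt step does on its own, here is a small self-contained sketch. It uses three rows from the question in place of `data.csv`; the name `complete` and the `dropna` step are additions for illustration, not part of the answer above.

```python
import pandas as pd

# Three rows from the question's data, one with a missing value.
df = pd.DataFrame({
    "year": [1999, 2000, 2005],
    "Recalled": [13.375, 11.278, 11.782],
    "Recovered": [11.951, float("nan"), 11.047],
})

# Wide -> long: one row per (year, variable) pair.
melted_df = pd.melt(df, id_vars=["year"], value_vars=["Recalled", "Recovered"])

# Optionally drop the pairs that have no measurement.
complete = melted_df.dropna(subset=["value"])

print(melted_df)
```

All `Recalled` rows come first, followed by all `Recovered` rows, and the column names `variable` and `value` are the pandas defaults.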

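Here is my answer using only matplotlib and numpy.

My code seems quite long compared with the accepted answer, so it would be great if someone could help me improve it!

The code is here: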

import numpy as np
import matplotlib.pyplot as plt

## Reading the data
year = np.arange(1994,2015,1)
type = ["Recalled", "Recovered"]
value_1 = [11.472, 11.81, 10.632, 13.857, 13.861, 13.375, 11.278, 12.827, 12.687, 10.859, np.nan, 11.782, 12.089,\
       14.351, 14.921, 11.759, 12.987, 13.262, 9.98, 10.626, 12.199]
value_2 = [10.207, 10.326, 10.094, 12.944, 12.588, 11.951, np.nan, np.nan, np.nan, np.nan, np.nan, 11.047, 10.194,\
       13.401, 13.886, 10.815, 11.482, 10.73, 9.52, 9.591, 10.27] 



## Changing data into ndarray format: one (type, year, value) row per bar
dpoints = np.array([[type[0], year[0], value_1[0]]])  # 2-D so that column slicing works

for i in range(1,len(year),1):
    dpoints = np.vstack([dpoints,np.array([type[0],year[i],value_1[i]])])
for i in range(0,len(year),1):
    dpoints = np.vstack([dpoints,np.array([type[1],year[i],value_2[i]])])

# Collect the unique types and years only after all rows have been stacked
conditions =  np.unique(dpoints[:,0])
categories =  np.unique(dpoints[:,1]).tolist()

# Plot it!
fig = plt.figure(figsize=(16,5))
ax = plt.subplot()
#the space between each set of bars
space = 0.2
n = len(conditions)
width = (1 - space) / (len(conditions))

# Create a set of bars at each position
for i,cond in enumerate(conditions):
    indeces = range(1, len(categories)+1)
    vals = dpoints[dpoints[:,0] == cond][:,2].astype(float)  # np.float was removed in NumPy 1.24
    pos = [j - (1 - space) / 2. + i * width for j in indeces]

    ax.bar(pos, vals, width=width, label =cond,lw = 0,color=  ["blue","r"][i],alpha = 0.6)


# Set the x-axis tick labels to be equal to the categories
ax.set_xticks(indeces)
ax.set_xticklabels(categories)
ax.set_xlim(0,22)
plt.setp(plt.xticks()[1], rotation=0,fontsize = 12)
ax.set_ylabel('Variables',fontsize =15)
ax.set_xlabel("Year",fontsize =15)

# Add a legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='upper left',frameon =False,fontsize =12)

(Figure: the resulting grouped bar chart.)
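
In response to the request for a shorter version, here is one possible sketch. It skips the stacked string array and computes the bar positions directly with NumPy broadcasting. The helper names `bar_positions` and `plot_grouped` are hypothetical, and only three years from the question are used to keep it short.

```python
import numpy as np


def bar_positions(n_groups, n_series, space=0.2):
    """Return (positions, width); positions has shape (n_series, n_groups)."""
    width = (1 - space) / n_series
    centers = np.arange(n_groups)
    # Offsets are symmetric around each group's center.
    offsets = (np.arange(n_series) - (n_series - 1) / 2) * width
    return centers[np.newaxis, :] + offsets[:, np.newaxis], width


def plot_grouped(labels, series, space=0.2):
    """Draw the grouped bars (needs matplotlib installed)."""
    import matplotlib.pyplot as plt

    pos, width = bar_positions(len(labels), len(series), space)
    fig, ax = plt.subplots(figsize=(16, 5))
    for row, (name, values) in zip(pos, series.items()):
        ax.bar(row, values, width=width, label=name, alpha=0.6)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_xlabel("Year")
    ax.legend(loc="upper left", frameon=False)
    return fig, ax


# Three years from the question; NaN marks a missing measurement.
years = [1999, 2000, 2005]
series = {
    "Recalled": [13.375, 11.278, 11.782],
    "Recovered": [11.951, np.nan, 11.047],
}
pos, width = bar_positions(len(years), len(series))
```

`plot_grouped(years, series)` returns the figure and axes. Because the positions come from one broadcasted expression, adding a third variable only requires another entry in `series`.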
