
In this post, I am going to create interlinked, interactive scatter plots using the Bokeh library. Below is the description of the library from the homepage.
Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
I quite like its clean look and, more than anything, its interactive visualization capabilities. It also lets you use JavaScript-based browser interactions without learning JavaScript. I have been picking up what it can do from its documentation and the tutorials available on the Bokeh NBViewer Gallery.
First, I am going to load the libraries I am going to use and run the output_notebook function from the bokeh library. The function configures Bokeh plot objects to be displayed in the notebook.
import pandas as pd
from bokeh.io import output_notebook, output_file, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.models import CategoricalColorMapper
from bokeh.models import Plot, Range1d, HoverTool
from bokeh.layouts import gridplot
from bokeh.palettes import Set2
output_notebook()
To enable interlinking between plots, a common ColumnDataSource needs to be used as the data source for all of them. You can create one from a pandas DataFrame or a dictionary. I am going to use the diabetes dataset originally from here to demonstrate this. Below is a brief description of the dataset from the original source.
Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
I am going to plot each of the 9 numeric features against the response variable on individual scatter plots.
In the code block below, the dataset is loaded as a pandas DataFrame and a ColumnDataSource is defined using the DataFrame.
df = pd.read_table('../data/diabetes_tab.txt')
# assuming 1 is female and 2 is male
df['Gender'] = ['FEMALE' if x == 1 else 'MALE'
                for x in df.SEX.values]
df.rename(columns={'AGE': 'Age'}, inplace=True)
one_source = ColumnDataSource(df)
df.head()
| | Age | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y | Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 | MALE |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 | FEMALE |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 | MALE |
| 3 | 24 | 1 | 25.3 | 84.0 | 198 | 131.4 | 40.0 | 5.0 | 4.8903 | 89 | 206 | FEMALE |
| 4 | 50 | 1 | 23.0 | 101.0 | 192 | 125.4 | 52.0 | 4.0 | 4.2905 | 80 | 135 | FEMALE |
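As mentioned earlier, a ColumnDataSource can also be built from a dictionary. The following is a minimal sketch of both routes, assuming a reasonably recent Bokeh release; the toy data reuses the first two Age and Y values from the table above, and the names toy, dict_source and df_source are made up for illustration.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# toy data: first two Age and Y values from the table above (illustration only)
toy = {'Age': [59, 48], 'Y': [151, 75]}

# route 1: a plain dictionary mapping column names to equal-length sequences
dict_source = ColumnDataSource(data=toy)

# route 2: a pandas DataFrame; Bokeh also stores the DataFrame index as a column
df_source = ColumnDataSource(pd.DataFrame(toy))

print(sorted(dict_source.column_names))
print(sorted(df_source.column_names))
```

The DataFrame route carries the index along as an extra column, which is worth keeping in mind if you later refer to columns by name.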
Next, I am going to create a single scatter plot with age and the response variable. I am going to add a few interaction effects including a hover effect showing the x, y values of each point.
# define a color map for SEX variable
cmap = CategoricalColorMapper(
    factors=('FEMALE', 'MALE'),
    palette=Set2[3]
)

# define a function to enable reuse
def plot_diabetes(x, width=480, height=320,
                  legend=None, legend_location=None,
                  legend_orientation='vertical'):
    hover = HoverTool(
        tooltips=[('Index', '$index'),
                  (x, '$x'),
                  ('Progression', '$y'),
                  ('Gender', '@Gender')
                  ])
    tools = [hover, 'box_select', 'tap',
             'wheel_zoom', 'reset', 'help']
    plt = figure(width=width, height=height,
                 title=x + ' vs. diabetes progression',
                 tools=tools)
    plt.circle(x, 'Y', alpha=0.8, source=one_source,
               fill_color={'field': 'Gender', 'transform': cmap},
               line_color={'field': 'Gender', 'transform': cmap},
               # highlight when selected
               selection_alpha=1,
               selection_fill_color={'field': 'Gender', 'transform': cmap},
               selection_line_color={'field': 'Gender', 'transform': cmap},
               # mute when not selected
               nonselection_alpha=0.2,
               nonselection_fill_color={'field': 'Gender', 'transform': cmap},
               nonselection_line_color=None,
               legend=legend)
    plt.xaxis.axis_label = x
    plt.xaxis.axis_label_text_font_style = 'normal'
    plt.yaxis.axis_label = 'Diabetes progression'
    plt.yaxis.axis_label_text_font_style = 'normal'
    if legend:
        plt.legend.location = legend_location
        plt.legend.orientation = legend_orientation
        plt.legend.background_fill_alpha = 0.7
    return plt

p1 = plot_diabetes('Age', legend='Gender', legend_location='top_left',
                   legend_orientation='horizontal')
output_file('../html/01-bokeh-plot-example-plot-01.html')
show(p1)
You can now see an interactive scatter plot. A toolbar is placed beside the plot, where you can switch the tools we included on and off. In particular, in this plot you can see the values for each data point when you hover over it. You can set the list of values you want to show by configuring tooltips with a list of (label, value) pairs in the HoverTool object.
You can refer to columns in the source dataset by prefixing their names with @. Fields starting with $ are used for “special fields” such as the coordinates and the color. Apparently, the color values are pulled from the data source, not from the figure’s fill_color as used above.
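To make the difference between the two prefixes concrete, here is a small sketch of a HoverTool that mixes both kinds of fields. It is an illustration rather than the configuration used in this post: the labels and the {0.0} number format are my own choices, and the column names assume the diabetes DataFrame loaded above.

```python
from bokeh.models import HoverTool

hover = HoverTool(
    tooltips=[
        ('Row', '$index'),       # special field: position of the point in the data source
        ('Cursor', '($x, $y)'),  # special fields: data-space coordinates under the mouse
        ('Age', '@Age'),         # column lookup: value stored in the 'Age' column
        ('BMI', '@BMI{0.0}'),    # column lookup with a number format (one decimal place)
    ]
)
```

Note that $x and $y report the position of the cursor, not the stored values of the point, so an @ lookup is the safer choice when you want the exact value from the dataset.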
Now, I am going to create multiple plots and place them in a single grid using the bokeh library’s gridplot. The plots are linked by a single data source. Selecting data points in one plot will highlight the same data points in all.
plots = [plot_diabetes(x, 240, 180)
         for x in df.columns
         if x not in ['SEX', 'Gender', 'Y']]
# create an empty plot with only the title
gtitle = figure(width=240, height=80, title='Linked scatter plots')
gtitle.circle(0, 0, fill_color=None, line_color=None)
gtitle.title.text_font_size = '18px'
gtitle.border_fill_color = None
gtitle.grid.visible = False
gtitle.axis.visible = False
gtitle.outline_line_color = None
# create an empty plot with only the legend
glegend = figure(width=240, height=80, title=None)
glegend.circle(0,0, fill_color=Set2[3][0], line_color=Set2[3][0], legend='FEMALE')
glegend.circle(0,0, fill_color=Set2[3][1], line_color=Set2[3][1], legend='MALE')
glegend.border_fill_color = None
glegend.grid.visible = False
glegend.axis.visible = False
glegend.outline_line_color = None
glegend.legend.border_line_color = None
glegend.legend.location = 'center'
output_file('../html/01-bokeh-plot-example-plot-02.html')
show(gridplot([gtitle, None, glegend] + plots, ncols=3))
You can now see nine different plots linked with a single data source. When you select any data points in one plot the same data points are highlighted across all while the rest are ‘muted’.
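The linking comes from the shared data source rather than from the grid itself. The stripped-down sketch below, assuming a reasonably recent Bokeh release, shows the idea with two plots; the three-row toy source reuses values from the table above, and the names left, right, r_left and r_right are hypothetical.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# toy data: first three Age, BMI and Y values from the table above
source = ColumnDataSource(data={
    'Age': [59, 48, 72],
    'BMI': [32.1, 21.6, 30.5],
    'Y': [151, 75, 141],
})

left = figure(width=240, height=180, tools='box_select,tap,reset')
right = figure(width=240, height=180, tools='box_select,tap,reset')

# both glyphs read from the same ColumnDataSource object
r_left = left.scatter('Age', 'Y', source=source)
r_right = right.scatter('BMI', 'Y', source=source)

# the renderers hold the same object, so they share one selection state
shared = r_left.data_source is r_right.data_source

# setting the selection on the source marks row 2 as selected for both plots
source.selected.indices = [2]
```

Because the selection lives on the source, any plot drawn from it picks up the same selected rows, which is what produces the highlighting across the grid.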
This could be useful when inspecting data with multiple dimensions. For example, when I clicked on the person with the highest S1 measurement, I could see that this person also had the highest measurements of S2 and S4. Besides, it is just fun playing with these plots. I am looking forward to going through more of the library examples and tutorials.