Creating linked plots using Python's Bokeh library

In this post, I am going to create interlinked, interactive scatter plots using the Bokeh library. Below is the description of the library from the homepage.

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.

I quite like its clean look and, more than anything, its interactive visualization capabilities. It also allows using JavaScript-based web browser interactions without learning JavaScript. I have been picking up what it can do from its documentation and the tutorials available on the Bokeh NBViewer Gallery.

Load libraries

First, I am going to load the libraries I need and run the output_notebook function from the Bokeh library. The function configures Bokeh plot objects to be displayed in the notebook.

import pandas as pd
from bokeh.io import output_notebook, output_file, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.models import CategoricalColorMapper
from bokeh.models import Plot, Range1d, HoverTool
from bokeh.layouts import gridplot
from bokeh.palettes import Set2
output_notebook()

Load data

To enable interlinking between plots, a common ColumnDataSource needs to be used as the data source between plots. You can create one from a pandas DataFrame or a dictionary. I am going to use the diabetes dataset originally from here to demonstrate this. Below is a brief description of the dataset from the original source.

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

I am going to plot each of the 9 numeric features against the response variable on individual scatter plots. In the code block below, the dataset is loaded as a pandas DataFrame and a ColumnDataSource is defined using the DataFrame.

df = pd.read_table('../data/diabetes_tab.txt')
# assuming 1 is female and 2 is male
df['Gender'] = ['FEMALE' if x == 1 else 'MALE'
                for x in df.SEX.values]
df.rename(columns={'AGE': 'Age'}, inplace=True)
one_source = ColumnDataSource(df)
df.head()
   Age  SEX   BMI     BP   S1     S2    S3   S4      S5  S6    Y  Gender
0   59    2  32.1  101.0  157   93.2  38.0  4.0  4.8598  87  151    MALE
1   48    1  21.6   87.0  183  103.2  70.0  3.0  3.8918  69   75  FEMALE
2   72    2  30.5   93.0  156   93.6  41.0  4.0  4.6728  85  141    MALE
3   24    1  25.3   84.0  198  131.4  40.0  5.0  4.8903  89  206  FEMALE
4   50    1  23.0  101.0  192  125.4  52.0  4.0  4.2905  80  135  FEMALE
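
As mentioned above, a ColumnDataSource can be created from either a pandas DataFrame or a dictionary. The sketch below builds both from a tiny made-up table (the values are illustrative, not rows from the diabetes file). It also shows Series.map as an alternative way to derive the Gender column, under the same assumption that 1 is female and 2 is male.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Toy data for illustration only; not taken from the diabetes file
toy = pd.DataFrame({'Age': [30, 45, 60], 'SEX': [1, 2, 1], 'Y': [100, 150, 200]})

# Alternative to the list comprehension: map codes to labels
# (same assumption as in the post: 1 is female, 2 is male)
toy['Gender'] = toy['SEX'].map({1: 'FEMALE', 2: 'MALE'})

# From a DataFrame: each column becomes a field, and the index is added as a field too
source_from_df = ColumnDataSource(toy)

# From a dictionary: keys are field names, values are equal-length sequences
source_from_dict = ColumnDataSource(data={
    'Age': [30, 45, 60],
    'Y': [100, 150, 200],
    'Gender': ['FEMALE', 'MALE', 'FEMALE'],
})

print(sorted(source_from_df.column_names))
print(sorted(source_from_dict.column_names))
```

Either form works as the shared source for linked plots; the DataFrame route is simply more convenient when the data is already loaded with pandas.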

Create an interactive scatter plot

Next, I am going to create a single scatter plot of age against the response variable. I am going to add a few interaction effects, including a hover effect showing the x, y values of each point, along with the following tools:

  • Box select: Highlight data points selected in a rectangular box by dragging the mouse
  • Lasso select: Highlight data points selected in a lasso shape by dragging the mouse
  • Tap: Highlight selected data points by clicking the mouse
  • Wheel zoom: Zoom in and out of the plot using the mouse wheel zoom
  • Reset: Reset the plot to its default state

# define a color map for the Gender variable
cmap = CategoricalColorMapper(
    factors=('FEMALE', 'MALE'),
    palette=Set2[3]
)
# define a function to enable reuse
def plot_diabetes(x, width=480, height=320,
                  legend=None, legend_location=None,
                  legend_orientation='vertical'):
    hover = HoverTool(
        tooltips=[('Index', '$index'),
                  (x, '$x'),
                  ('Progression', '$y'),
                  ('Gender', '@Gender')
                 ])
    tools = [hover, 'box_select', 'lasso_select', 'tap',
             'wheel_zoom', 'reset', 'help']
    plt = figure(width=width, height=height,
                 title=x +' vs. diabetes progression',
                 tools=tools)
    plt.circle(x, 'Y', alpha=0.8, source=one_source,
               fill_color={'field': 'Gender', 'transform': cmap},
               line_color={'field': 'Gender', 'transform': cmap},
               # highlight when selected
               selection_alpha=1,
               selection_fill_color={'field': 'Gender', 'transform': cmap},
               selection_line_color={'field': 'Gender', 'transform': cmap},
               # mute when not selected
               nonselection_alpha=0.2,
               nonselection_fill_color={'field': 'Gender', 'transform': cmap},
               nonselection_line_color=None,
               legend=legend)
    plt.xaxis.axis_label = x
    plt.xaxis.axis_label_text_font_style = 'normal'
    plt.yaxis.axis_label = 'Diabetes progression'
    plt.yaxis.axis_label_text_font_style = 'normal'
    if legend:
        plt.legend.location = legend_location
        plt.legend.orientation = legend_orientation
        plt.legend.background_fill_alpha = 0.7
    return plt

p1 = plot_diabetes('Age', legend='Gender', legend_location='top_left',
                   legend_orientation='horizontal')
output_file('../html/01-bokeh-plot-example-plot-01.html')
show(p1)

You can now see an interactive scatter plot. A toolbar is placed beside the plot, where you can switch the tools we included on and off. In particular, in this plot you can see the values for each data point when you hover over it. You can set the list of values you want to show by configuring tooltips with a list of (label, value) pairs in the HoverTool object.

You can refer to different variables in the source dataset by prefixing their names with @. Fields starting with $ are used for “special fields” such as the coordinates and the color. Apparently, the color values are pulled from the data source, not from the figure’s fill_color as used above.
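
To make the distinction concrete, here is a small sketch with a made-up three-row source. @Age shows the value stored in the Age column for the hovered point, while $x shows the position of the cursor in data coordinates, so the two can differ slightly. The {0.0} suffix is Bokeh's number-format syntax for tooltip fields. The data values and variable names are illustrative only.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up values for illustration only
source = ColumnDataSource(data={
    'Age': [30, 45, 60],
    'BMI': [21.6, 30.5, 25.3],
    'Y': [100, 150, 200],
})

hover = HoverTool(tooltips=[
    ('Index', '$index'),      # special field: row index of the hovered point
    ('Cursor x', '$x'),       # special field: cursor position in data space
    ('Age', '@Age'),          # column value from the data source
    ('BMI', '@BMI{0.0}'),     # column value formatted to one decimal place
])

p = figure(width=300, height=200, tools=[hover])
p.scatter('Age', 'Y', source=source)
```

If you want the tooltip to report the exact data value rather than the cursor position, the @ form is usually the better choice.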

Create multiple linked plots

Now, I am going to create multiple plots and place them in a single grid using the Bokeh library’s gridplot. The plots are linked by a single data source. Selecting data points in one plot will highlight the same data points in all of them.

plots = [plot_diabetes(x, 240, 180)
         for x in df.columns
         if x not in ['SEX', 'Gender', 'Y']]

# create an empty plot with only the title
gtitle = figure(width=240, height=80, title='Linked scatter plots')
gtitle.circle(0, 0, fill_color=None, line_color=None)
gtitle.title.text_font_size = '18px'
gtitle.border_fill_color = None
gtitle.grid.visible = False
gtitle.axis.visible = False
gtitle.outline_line_color = None

# create an empty plot with only the legend
glegend = figure(width=240, height=80, title=None)
glegend.circle(0,0, fill_color=Set2[3][0], line_color=Set2[3][0], legend='FEMALE')
glegend.circle(0,0, fill_color=Set2[3][1], line_color=Set2[3][1], legend='MALE')
glegend.border_fill_color = None
glegend.grid.visible = False
glegend.axis.visible = False
glegend.outline_line_color = None
glegend.legend.border_line_color = None
glegend.legend.location = 'center'

output_file('../html/01-bokeh-plot-example-plot-02.html')
show(gridplot([gtitle, None, glegend] + plots, ncols=3))

You can now see nine different plots linked with a single data source. When you select any data points in one plot the same data points are highlighted across all while the rest are ‘muted’.
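
The linking comes entirely from sharing one source object. The sketch below reduces this to two plots over a made-up source and sets the selection from Python to mimic what a box select or tap does in the browser. It assumes a reasonably recent Bokeh release, where the selection is exposed as source.selected.indices; the data and variable names are illustrative only.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Made-up values for illustration only
source = ColumnDataSource(data={
    'Age': [30, 45, 60],
    'BMI': [21.6, 30.5, 25.3],
    'Y': [100, 150, 200],
})

left = figure(width=240, height=180, tools='box_select,tap,reset')
right = figure(width=240, height=180, tools='box_select,tap,reset')

# Both renderers read from the same source object
r_left = left.scatter('Age', 'Y', source=source)
r_right = right.scatter('BMI', 'Y', source=source)

# Mimic a selection made in either plot: rows 0 and 2
source.selected.indices = [0, 2]

print(r_left.data_source is r_right.data_source)
print(source.selected.indices)
```

Because the selection is stored on the source rather than on a figure, every glyph drawn from that source picks up the same selected and non-selected styling.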

This could be useful when inspecting data with multiple dimensions. For example, when I clicked on the person with the highest S1 measurement, I could see that he also had the highest measurements of S2 and S4. Besides, it is just fun playing with these plots. I am looking forward to going through more of the library examples and tutorials.


  • Written on: Oct 06, 2017
  • Written by: Michael J. Moon
