Visualizing Data in Python: An In-Depth Comparison of Python's Top Libraries


Data visualization is the process of converting data into visual formats such as charts, graphs, maps, and infographics, to effectively communicate insights, patterns, and trends. It is a critical aspect of data analysis as it helps people to understand complex information quickly and easily.

One of the main benefits of data visualization is that it enables users to identify trends and patterns in data that may not be immediately apparent when viewed in tabular or numerical form. For example, a line graph can show changes in a data set over time, making it easy to identify trends and patterns, while a bar chart can be used to compare data between different categories.

Data visualization also enables users to communicate their findings to others more effectively. When data is presented in an attractive and easy-to-understand format, it is more likely to be remembered and acted upon. In addition, visualizations can help to clarify the meaning of data and highlight key insights, making it easier for others to understand and use the information. Per Professor Ben Shneiderman: “The purpose of visualization is insight, not pictures.”

Data Visualization in Python

Python is a popular programming language for data analysis and data visualization. There are several libraries and tools available in Python for creating visualizations, including:

  • Matplotlib: This is the most widely used data visualization library in Python, and is well-suited for creating static, animated, and interactive visualizations.
  • Seaborn: This library is built on top of Matplotlib and provides a high-level interface for creating attractive statistical graphics.
  • Plotly: This library is well-suited for creating interactive and dynamic visualizations. It supports a wide range of visualizations, including bar charts, line graphs, scatter plots, and more.
  • Bokeh: This library is focused on creating interactive visualizations for the web. It provides a high-level interface for creating visualizations that can be easily embedded in web pages.
  • Altair: This library is a declarative visualization library that allows users to specify visualizations in a simple, human-readable format.
  • ggplot: This library is a Python implementation of the popular R library ggplot2, and provides a high-level interface for creating complex and attractive visualizations.
  • Pygal: This library is well-suited for creating static, SVG-based visualizations. It supports a wide range of visualizations, including bar charts, line graphs, and more.

In the following sections, I compare the libraries and draw a boxplot with each of them.

Matplotlib

Pros:

  • Widely used and well-documented
  • Highly customizable, allowing for a wide range of visualizations
  • Works well for creating static, animated, and interactive visualizations

Cons:

  • Has a low-level API, requiring more code to create simple visualizations
  • Some visualizations can be unattractive without additional customization

Installation: pip install matplotlib

To draw a boxplot:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.normal(100, 20, 200)
plt.boxplot(data)
plt.show()

The default output looks fairly plain.
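Because the default styling is plain, a few extra calls can make the same boxplot more readable. The snippet below is a minimal sketch rather than a recommended style: the fill color, labels, and figure size are arbitrary choices for illustration, and the data is seeded only so the figure is reproducible.

```python
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator so the figure looks the same on every run
rng = np.random.default_rng(0)
data = rng.normal(100, 20, 200)

fig, ax = plt.subplots(figsize=(4, 5))

# patch_artist=True draws the box as a filled patch so it can be colored
parts = ax.boxplot(data, patch_artist=True, widths=0.5)
for box in parts["boxes"]:
    box.set_facecolor("#8fbcd4")
for median in parts["medians"]:
    median.set_color("black")

# Titles, labels, and a light grid are all set through the Axes object
ax.set_title("Boxplot Example")
ax.set_ylabel("Value")
ax.set_xticks([1])
ax.set_xticklabels(["sample"])
ax.grid(axis="y", alpha=0.3)

# plt.show()  # uncomment to display the figure
```

This illustrates the trade-off listed above: every visual element is adjustable, but each adjustment is an explicit call.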

Seaborn

Pros:

  • Built on top of Matplotlib and provides a high-level interface for creating attractive statistical graphics
  • Offers built-in support for creating visualizations for a wide range of statistical analyses

Cons:

  • Can still be limited in terms of customization compared to Matplotlib
  • Some visualizations can still be unattractive without additional customization

Installation: pip install seaborn

To draw a boxplot:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(100, 20, 200)
sns.boxplot(y=data)  # pass the sample explicitly as the y variable
plt.show()
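Seaborn's high-level interface is easier to see with grouped data. The following sketch assumes a made-up DataFrame with a "group" column and a "value" column (both names are illustrative) and draws one box per group in a single call.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Long-form data: one row per observation, with a label column for the group
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 100),
    "value": np.concatenate([
        rng.normal(100, 20, 100),
        rng.normal(110, 15, 100),
        rng.normal(90, 25, 100),
    ]),
})

sns.set_theme(style="whitegrid")  # apply one of seaborn's built-in themes

# Seaborn groups the rows by the x column and draws one box per group
ax = sns.boxplot(data=df, x="group", y="value")
ax.set_title("Boxplot by group")

# plt.show()  # uncomment to display the figure
```

The returned object is a regular Matplotlib Axes, so any further customization goes through the Matplotlib API.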

Plotly

Pros:

  • Well-suited for creating interactive and dynamic visualizations
  • Offers a wide range of visualizations, including bar charts, line graphs, scatter plots, and more
  • Easy to embed in web pages

Cons:

  • Can be complex to use for some types of visualizations
  • Some visualizations can be slow to render, especially for large datasets

Installation: pip install plotly

To draw a boxplot:

import plotly.express as px
import pandas as pd
import numpy as np

np.random.seed(10)
data = [np.random.normal(0, 1, 100), np.random.normal(2, 1, 100)]

df = pd.DataFrame(data).transpose()
df.columns = ['A', 'B']

fig = px.box(df, y="A", points="all")
fig.update_layout(title_text="Boxplot Example")

fig.show()

The Plotly output is interactive and includes extra annotations and widgets, such as hover tooltips and a toolbar for zooming and panning.

Note that Plotly Express works from a pandas DataFrame and refers to columns by name. The DataFrame here is in wide format, with one column per sample, and y="A" selects a single column. Plotly Express also accepts long-format data, where one column holds the values and another holds the group labels.

Bokeh

Pros:

  • Focused on creating interactive visualizations for the web
  • Offers a high-level interface for creating visualizations that can be easily embedded in web pages

Cons:

  • Can be limited in terms of customization compared to other libraries
  • Some visualizations can be slow to render, especially for large datasets

Installation: pip install bokeh

Here’s a boxplot example adapted from the Bokeh documentation:

import pandas as pd

from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap

df = autompg2[["class", "hwy"]].rename(columns={"class": "kind"})

kinds = df.kind.unique()

# compute quantiles
qs = df.groupby("kind").hwy.quantile([0.25, 0.5, 0.75])
qs = qs.unstack().reset_index()
qs.columns = ["kind", "q1", "q2", "q3"]
df = pd.merge(df, qs, on="kind", how="left")

# compute IQR outlier bounds
iqr = df.q3 - df.q1
df["upper"] = df.q3 + 1.5*iqr
df["lower"] = df.q1 - 1.5*iqr

source = ColumnDataSource(df)

p = figure(x_range=kinds, tools="", toolbar_location=None,
           title="Highway MPG distribution by vehicle class",
           background_fill_color="#eaefef", y_axis_label="MPG")

# outlier range
whisker = Whisker(base="kind", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
p.add_layout(whisker)

# quantile boxes
cmap = factor_cmap("kind", "TolRainbow7", kinds)
p.vbar("kind", 0.7, "q2", "q3", source=source, color=cmap, line_color="black")
p.vbar("kind", 0.7, "q1", "q2", source=source, color=cmap, line_color="black")

# outliers
outliers = df[~df.hwy.between(df.lower, df.upper)]
p.scatter("kind", "hwy", source=outliers, size=6, color="black", alpha=0.3)

p.xgrid.grid_line_color = None
p.axis.major_label_text_font_size="14px"
p.axis.axis_label_text_font_size="12px"

show(p)

The output is a boxplot of highway MPG for each vehicle class, with outliers drawn as individual points.
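Unlike the other libraries, the Bokeh example computes the box statistics by hand before drawing anything. The sketch below isolates that calculation on a small made-up series, using the same quantile and 1.5 × IQR steps as the code above; the numbers are chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy data with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles (pandas uses linear interpolation by default)
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# Interquartile range and the 1.5 * IQR outlier bounds
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is treated as an outlier
outliers = values[~values.between(lower, upper)]

print(f"q1={q1}, median={q2}, q3={q3}")
print(f"outlier bounds: [{lower}, {upper}]")
print("outliers:", outliers.tolist())
```

For this series the quartiles are 3.25, 5.5, and 7.75, so the bounds are -3.5 and 14.5 and only the value 50 falls outside them.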

Altair

Pros:

  • Declarative visualization library that allows users to specify visualizations in a simple, human-readable format
  • Offers a wide range of visualizations
  • Easy to use for creating simple visualizations

Cons:

  • Can be limited in terms of customization compared to other libraries
  • Some visualizations can be slow to render, especially for large datasets

Installation: pip install altair

To draw a boxplot:

import altair as alt
import pandas as pd
import numpy as np

data = np.random.normal(100, 20, 200)
df = pd.DataFrame(data, columns=['data'])

alt.Chart(df).mark_boxplot().encode(
    y='data:Q'
).properties(
    width=400,
    height=300
).interactive()

ggplot

Pros:

  • A Python implementation of the popular R library ggplot2
  • Provides a high-level interface for creating complex and attractive visualizations

Cons:

  • Can be complex to use for some types of visualizations
  • Some visualizations can be slow to render, especially for large datasets

Installation: pip install ggplot

To draw a boxplot:

from ggplot import ggplot, aes, geom_boxplot  # import the names used below
import pandas as pd
import numpy as np

data = np.random.normal(100, 20, 200)
df = pd.DataFrame(data, columns=['data'])

p = ggplot(df, aes(x='data'))
p = p + geom_boxplot()
p.show()

Note that ggplot has compatibility issues with recent pandas releases. With the latest pandas, it raises AttributeError: module 'pandas' has no attribute 'tslib'.

This error occurs because the “tslib” module was removed from pandas starting from version 0.25.0.

To resolve this issue, you can either:

  1. Downgrade to a version of pandas prior to 0.25.0, such as 0.24.2.
  2. Remove the references to “tslib” from the installed ggplot package (they are in ggplot's own source, not in the example above).
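Before choosing between these options, it can help to confirm whether the installed pandas is affected. This is only a small diagnostic sketch: it checks for the attribute named in the error message and prints the pandas version, and the variable name has_tslib is made up for illustration.

```python
import pandas as pd

# The error message refers to pandas.tslib, so check whether the
# installed pandas still exposes that attribute
has_tslib = hasattr(pd, "tslib")

print(f"pandas version: {pd.__version__}")
print(f"pandas.tslib available: {has_tslib}")

if not has_tslib:
    print("ggplot is likely to hit the AttributeError described above.")
```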

Pygal

Pros:

  • Focuses on creating simple and clean visualizations
  • Offers a wide range of visualizations, including bar charts, line graphs, scatter plots, and more
  • Lightweight and easy to install

Cons:

  • Can be limited in terms of customization compared to other libraries
  • Some visualizations can be unattractive without additional customization
  • May not be suitable for more complex visualizations

Installation: pip install pygal

Pygal supports boxplots through its Box chart type. To draw a boxplot:

import random

import pygal

# 200 draws from a normal distribution (mean 100, standard deviation 20)
data = [random.gauss(100, 20) for _ in range(200)]

box_plot = pygal.Box()
box_plot.title = "Boxplot Example"
box_plot.add("Sample", data)
box_plot.render_to_file("box_plot.svg")

The output is written to an SVG file.

Conclusion

Each library has its own strengths and weaknesses. For example, Matplotlib is a low-level library and requires more code to create visualizations, but is highly customizable. Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive statistical graphics. Plotly is well-suited for creating interactive and dynamic visualizations, while Bokeh is focused on creating interactive visualizations for the web. ggplot is a Python implementation of the popular R library ggplot2, and provides a high-level interface for creating complex and attractive visualizations. Altair is a declarative visualization library that allows users to specify visualizations in a simple, human-readable format. The choice of tool will depend on the specific requirements of the project, such as the type of data being analyzed, the complexity of the visualization, and the need for interactivity.


   Reprint policy


《Visualizing Data in Python: An In-Depth Comparison of Python's Top Libraries》 by Isaac Zhou is licensed under a Creative Commons Attribution 4.0 International License