Box Plots Explained | Alteryx & Tableau

What is a Box Plot?

A box plot, also known as a box-and-whisker plot, is a graphical representation of data distribution. It provides a summary of key statistical measures and helps to visualise the spread of the data. Box plots are particularly useful for comparing distributions between different groups or identifying outliers within a dataset. They allow for quick, at-a-glance comparisons between different variables.

Key Features of a Box Plot

 

Image depicts the anatomy of a box plot

Box

The box represents the interquartile range (IQR), which is the middle 50% of the data. The bottom and top boundaries of the box mark the first quartile (Q1) and the third quartile (Q3), respectively.

Median

The line inside the box represents the median (the middle value when the data is sorted).

Whiskers

The whiskers extend from the edges of the box and represent the range of the data, excluding potential outliers. Each whisker typically ends at the most extreme data point that still lies within 1.5 times the IQR of the box: the lower whisker ends at the smallest value that is no more than 1.5 × IQR below Q1, and the upper whisker ends at the largest value that is no more than 1.5 × IQR above Q3. Data points beyond the whiskers are considered potential outliers and are plotted individually.

Outliers

Data points that fall outside the whiskers are plotted individually and are considered potential outliers.
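To make the anatomy concrete, here is a minimal Python sketch of the numbers a box plot draws. The function name `box_plot_stats`, the sample ages and the choice of the "inclusive" quartile method are assumptions made for illustration; quartile conventions differ between tools, so Alteryx or Tableau may report slightly different values for the same data.

```python
from statistics import quantiles


def box_plot_stats(values, k=1.5):
    """Return the values a box plot with 1.5 x IQR whiskers would draw."""
    data = sorted(values)
    # Quartiles via the "inclusive" method (one of several conventions).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme points that are inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }


# Made-up ages, not taken from the Titanic dataset.
ages = [2, 19, 22, 24, 26, 28, 30, 33, 36, 40, 80]
stats = box_plot_stats(ages)
print(stats)
```

For this sample the box spans 23 to 34.5, the whiskers end at 19 and 40, and the values 2 and 80 fall outside the fences, so they would be drawn as individual points.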


Creating a Box Plot in Alteryx

We will make use of the Titanic dataset in this example. The Titanic dataset consists of details of the passengers onboard the Titanic such as age, gender, passenger class and survival status.

Image depicts Alteryx workflow to create a box plot

1. Load the dataset in Alteryx using the ‘Input Data’ tool.

2. Clean the data using the ‘Data Cleansing’ tool and remove rows with null values in the variables that will be used in the box plot.

Image depicts data cleansing in Alteryx for box plots

3. Use the ‘Interactive Chart’ tool to create the box plots. To visualise the age distribution of the passengers by sex, set the chart type to ‘Box and Whisker’, the X-axis to ‘sex’ and the Y-axis to ‘age’.

4. You can choose to add additional layers to the box plot to compare multiple variables side-by-side. In the Create > Layer page, add another layer of data with X-axis as ‘pclass’ and Y-axis as ‘age’. This will create another set of box plots using the same Y-axis.

Image depicts box plot with X-axis and Y-axis.

5. Then use the ‘Render’ tool or the ‘Output’ tool to export the box plot to a PDF file.

Image depicts pdf of age distribution of passengers by gender box plot
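For readers who want to check the workflow's logic outside Alteryx, the data steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the rows in `RAW` are made up (they are not the real Titanic records), the helper `ages_by` is hypothetical, and only the column names `sex`, `pclass` and `age` come from the example above.

```python
import csv
import io
from collections import defaultdict
from statistics import quantiles

# Made-up rows standing in for the Titanic file; an empty age is a null.
RAW = """sex,pclass,age
female,1,29
male,3,
female,2,14
male,1,40
male,3,22
female,3,
male,2,31
female,1,35
"""


def ages_by(rows, group_col):
    """Group non-null ages by the given column (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        if row["age"] == "":  # step 2: drop rows with a null age
            continue
        groups[row[group_col]].append(float(row["age"]))
    return dict(groups)


rows = list(csv.DictReader(io.StringIO(RAW)))  # step 1: load the data
by_sex = ages_by(rows, "sex")                  # step 3: age by sex
by_class = ages_by(rows, "pclass")             # step 4: second layer, age by class

# Quartiles per group: the numbers each box in the chart is built from.
for name, ages in sorted(by_sex.items()):
    q1, median, q3 = quantiles(ages, n=4, method="inclusive")
    print(name, q1, median, q3)
```

Each group's list of ages is what a single box summarises, which is why rows with null ages have to be removed before charting.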

Creating a Box Plot in Tableau

1. Drag one or more measures, e.g. Age, to the Rows shelf.

2. Drag one or more dimensions, e.g. Sex, to the Columns shelf.

3. Click ‘Show Me’ and select the box plot option. Tableau will build the box plot visualisation automatically.

Image depicts the box plot visualisation tool

4. If the view shows a single aggregated point for each group, open the ‘Analysis’ menu and deselect ‘Aggregate Measures’ to disaggregate the points.

Image depicting disaggregated points in Tableau
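The reason for the last step can be shown with a small Python sketch. The ages below are made up, and the sum is just one example of an aggregation: once a measure is aggregated, each group collapses to a single number, and a box cannot be drawn from one value.

```python
# Made-up ages per group, for illustration only.
ages_by_sex = {
    "female": [14.0, 29.0, 35.0],
    "male": [22.0, 31.0, 40.0],
}

# Aggregated view: one mark per group (here a sum), so no spread is left.
aggregated = {sex: sum(ages) for sex, ages in ages_by_sex.items()}

# Disaggregated view: one mark per row, which is what a box plot needs.
marks = [(sex, age) for sex, ages in ages_by_sex.items() for age in ages]

print(aggregated)   # one value per group
print(len(marks))   # one mark per passenger row
```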

How to Interpret a Box Plot

Below is a compilation of box plots created in Tableau using the Titanic dataset.

We start by analysing the survivor demographic of the Titanic, specifically how the sex, passenger class and age of the passengers affected the survival rate.

From the survivors’ age distribution by sex, we can see that the age range of female survivors is larger than the age range of male survivors. The median age of female survivors (28.5 years old) is also higher than the median age of male survivors (27 years old). This is consistent with the lifeboat policy on the Titanic, which prioritised women and children over men.

From the survivors’ age distribution by passenger class, the median age of survivors from the first class is the highest at 36 years old, followed by the median age of 25 years old from the second class and the median age of 22 years old from the third class. The age range of survivors from the first class is also larger than the age range of survivors from the second class and third class.

Benefits of Box Plots

Outlier Identification

Box plots are extremely effective for identifying outliers. The whiskers of the plot extend to the lowest and highest values within a certain range (often 1.5 times the interquartile range from the lower and upper quartiles) and any data points outside of these whiskers can be easily identified as outliers. This is particularly useful in data cleaning and anomaly detection.

Data Distribution Overview

Box plots provide a clear summary of the distribution of the data. The box itself contains the middle 50% of the data (interquartile range or IQR), with the median marked inside the box. This makes it straightforward to see the central tendency, variability and skewness of the data at a glance.
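The skewness that a box plot shows visually (a median sitting closer to one edge of the box) can also be expressed as a number. One common choice is Bowley's quartile skewness coefficient, (Q3 + Q1 - 2 × median) / (Q3 - Q1). The sketch below is illustrative only: the function name `quartile_skewness` and both samples are made up, and the "inclusive" quartile method is an assumption.

```python
from statistics import quantiles


def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return 0.0  # the box has no height, so no skew can be measured
    return (q3 + q1 - 2 * median) / iqr


# Made-up samples for illustration.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 2, 3, 3, 4, 6, 9, 15]

print(quartile_skewness(symmetric))     # median in the middle of the box
print(quartile_skewness(right_skewed))  # median nearer the bottom of the box
```

A value near zero means the median sits in the middle of the box. A positive value means the median is closer to Q1, which corresponds to a longer upper half of the box and a right-skewed sample.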

Comparative Analysis

Box plots are excellent for comparing distributions across different groups or categories. When multiple box plots are drawn side by side, it becomes easy to compare medians, ranges and spreads between different data sets. This is particularly useful in exploratory data analysis and when making initial observations about the data.

Limitations of Box Plots

Limited Detail on Data Distribution

Box plots provide a summary of the data distribution but do not convey detailed information about the actual data points. They show the median, quartiles and potential outliers, but they don’t show how data points are distributed within these regions, making it difficult to identify specific characteristics like clustering or gaps in the data.
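A short Python sketch illustrates this limitation. The two samples below are made up so that they share the same minimum, quartiles, median and maximum, even though one is evenly spread and the other is clustered around a few values with gaps in between. The helper `five_number_summary` and the "inclusive" quartile method are assumptions for illustration.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


# Made-up samples: same summary, very different shape.
evenly_spread = [0, 5, 10, 15, 20, 25, 30, 35, 40]
clustered = [0, 10, 10, 10, 20, 30, 30, 30, 40]

print(five_number_summary(evenly_spread))
print(five_number_summary(clustered))
print(len(set(evenly_spread)), len(set(clustered)))  # distinct values in each
```

Both samples would produce the same box and whiskers, so the clustering around 10 and 30 in the second sample would be invisible. Overlaying the individual points or pairing the box plot with a histogram is one way to recover that detail.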

Ineffective for Large Data Sets or Detailed Analysis

For very large data sets, box plots can become cluttered and less informative, especially when there are many outliers. Additionally, they do not convey the precise shape of the underlying distribution, such as whether it is approximately normal or has more than one peak, which is crucial for in-depth statistical analysis and interpretation.

Summary

Box plots effectively streamline the comparison of multiple groups or variables by offering a clear and concise visual representation of data distribution. They efficiently highlight crucial statistical measures such as median, quartiles, and outliers, enabling quick and insightful analytical comparisons. This makes them an invaluable asset in the realm of statistical analysis and data interpretation.

