# What is Cluster Analysis?

Cluster analysis, or clustering, is a type of unsupervised machine learning that allows users to group similar data points within a dataset according to some distance measure, without reference to a labelled dataset.

Cluster analysis is used across many domains, from marketing and customer segmentation to fraud detection and beyond. In essence, it helps to uncover patterns and relationships inherent in complex data by organising the data points into clusters or groups based on their similarities.


## Understanding Cluster Analysis

The most common types of cluster analysis methods are Hierarchical Clustering and Partitioning Clustering.

**1.** Hierarchical Clustering: A set of nested clusters organised as a hierarchical tree.

**a)** Agglomerative (bottom-up): This method starts with each data point in an individual cluster by itself and then progressively merges clusters based on similarity until all data points belong to a single cluster. The result is a hierarchical tree structure known as a dendrogram.

**b)** Divisive (top-down): The divisive hierarchical clustering method begins with all data points in one cluster and splits them into smaller clusters recursively based on dissimilarity.
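To make the agglomerative (bottom-up) idea concrete, here is a minimal Python sketch using only the standard library. It is an illustration rather than what Alteryx or Tableau run internally: the single-linkage rule, the function names and the toy points are all assumptions made for this example.

```python
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage (an illustrative choice).

    Returns a sorted list of clusters, each a sorted list of indices into `points`.
    """
    # Start with every point in a cluster by itself.
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy data: two tight groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9)]
result = agglomerative(points, n_clusters=2)
print(result)
```

Stopping at `n_clusters` is equivalent to cutting the dendrogram at a chosen level; letting the loop run down to a single cluster would trace out the full tree.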

**2.** Partitioning Clustering: Clusters that are non-overlapping, with each data object in exactly one cluster.

**a)** K-Means: One of the most widely used methods, K-Means partitions data into a predetermined number of clusters (k) by minimising the variance within each cluster. Each data point is assigned to the cluster with the nearest centroid (the mean of the points in that cluster), with Euclidean distance typically used to measure the proximity of a point to a centroid.

**b)** K-Medoids: Like K-means, but instead of using the mean, K-medoids uses the most centrally located data point (medoid) as the representative of a cluster. It is more robust to outliers.

**c)** Neural Gas: Similar to the K-Means algorithm, but each data point influences several cluster centres rather than just one. The centres are ranked by their distance to the point and updated with weights that decrease with rank, so the closest centre receives the highest weight.
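The robustness claim for K-Medoids can be illustrated with a small Python sketch. It compares the two kinds of cluster representative on one-dimensional data; the income values and function names are hypothetical and chosen only for this example.

```python
def mean_point(values):
    """Arithmetic mean of the values (the K-Means style representative)."""
    return sum(values) / len(values)


def medoid(values):
    """Actual data point with the smallest total distance to all the others."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


# Hypothetical annual incomes (in thousands) with one extreme outlier.
incomes = [30, 32, 35, 38, 40, 41, 400]

centre_mean = mean_point(incomes)  # pulled towards the outlier
centre_medoid = medoid(incomes)    # stays inside the main group

print(centre_mean, centre_medoid)
```

Because the medoid must be one of the observed data points, a single extreme value cannot drag it away from the bulk of the cluster in the way it drags the mean.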

**K-Means Clustering & Elbow Methods Explained**

A popular use case for cluster analysis is Customer Segmentation, which groups customers based on common characteristics, demographics or behaviours. By categorising customers into segments, businesses can tailor their marketing strategies, products and services to meet the unique requirements of each segment more effectively.

This blog goes step-by-step through the process of performing customer segmentation with K-Means clustering on a Customer Segmentation dataset. This dataset consists of customer details from a supermarket, such as age, gender, annual income and spending score. Spending score refers to a score that is assigned to a customer based on parameters like customer behaviour and purchasing data.

**What are the steps of K-Means Clustering?**

**1.** Choose the Number of Clusters (K):

Decide how many clusters need to be created. This is a critical step, as the choice of K influences the resulting clusters.

**2.** Initialise centroids (central point of the cluster):

Randomly select K data points from the dataset as initial cluster centroids.

**3.** Assign Data Points to Clusters:

For each data point, calculate the distance to each cluster centre. Assign the data point to the cluster whose centroid is the closest (based on distance metrics such as Euclidean distance).

**4.** Update Cluster Centres:

Recalculate the centroids of each cluster by taking the mean of all data points assigned to that cluster.

**5.** Repeat Steps 3 and 4:

Repeat steps three and four until the centroids stop moving, i.e., until the cluster assignments no longer change significantly.
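The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used by Alteryx or Tableau: the function names, the fixed random seed and the toy (annual income, spending score) pairs are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means. Returns (centroids, labels); labels[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical (annual income, spending score) pairs forming two clear groups.
customers = [(15, 20), (16, 22), (17, 21), (85, 80), (86, 82), (87, 81)]
centroids, labels = k_means(customers, k=2)  # Step 1: K is chosen up front
print(centroids, labels)
```

Which cluster is numbered 0 and which is numbered 1 depends on the random initialisation, so only the grouping itself should be interpreted, not the label values.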

It’s important to note that K-Means clustering is sensitive to the initial selection of cluster centres. Running the algorithm multiple times with different initialisations and selecting the solution with the lowest within-cluster variance can help mitigate this sensitivity. Additionally, determining the optimal number of clusters (K) can be challenging and may require exploration using techniques like the elbow method.

**What is the elbow method?**

The elbow method plots the number of clusters against the sum of squared distances (SSD) between each data point and its cluster centre. Look for the ‘elbow’, the point where the SSD curve starts to flatten, which signifies that adding more clusters no longer significantly improves clustering quality.
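As a rough sketch of the elbow method, the Python below computes the SSD for several values of K on one-dimensional data. To keep the output reproducible it uses a simple deterministic initialisation, which is an assumption of this example rather than standard K-Means practice; the spending scores and function names are hypothetical.

```python
def k_means_1d(values, k, max_iter=100):
    """Tiny 1-D K-Means with a deterministic start (evenly spaced sorted values)."""
    ordered = sorted(values)
    # Illustrative initialisation: spread the starting centroids across the data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    groups = []
    for _ in range(max_iter):
        # Assign every value to its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            groups[nearest].append(v)
        # Recompute each centroid as the mean of its group.
        updated = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups


def ssd(centroids, groups):
    """Sum of squared distances between each value and its cluster centre."""
    return sum((v - c) ** 2 for c, g in zip(centroids, groups) for v in g)


# Hypothetical spending scores that form three natural groups.
scores = [10, 12, 14, 48, 50, 52, 88, 90, 92]
ssd_by_k = {k: ssd(*k_means_1d(scores, k)) for k in range(1, 7)}

for k, value in ssd_by_k.items():
    print(k, round(value, 2))
```

On this toy data the SSD falls sharply up to K = 3 and barely changes afterwards, so K = 3 is where the curve bends.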


## Performing Cluster Analysis with Alteryx

**1.** Load the dataset in Alteryx using the ‘Input Data’ tool. Make sure that the data has no null values by using the ‘Browse’ tool.

**2.** Use the ‘Select’ tool to convert the variables to be used in the clustering to a numeric data type, as K-Means clustering only works on numeric fields. In this example, Age, Annual Income and Spending Score have been changed from a string data type to an integer data type.

**3.** The ‘Histogram’ tool is then used to generate histogram plots for each of the variables to visualise the skewness of the variable distribution. If the variables used for K-Means clustering are heavily skewed, the results will be poor. Age, Annual Income and Spending Score are chosen as the clustering variables because they are relevant to customer segmentation and have no missing observations. The histogram plots show that the skewness of these distributions is minimal, so they can be used in this K-Means clustering.
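Skewness can also be checked numerically rather than by eye. The sketch below is a Python illustration that sits outside the Alteryx workflow: it computes the population skewness (the mean cubed z-score) for two hypothetical variables, and both the sample values and the function name are assumptions.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Hypothetical samples: ages are fairly balanced, incomes have one extreme value.
ages = [19, 21, 23, 31, 35, 40, 45, 52, 60, 67]
incomes_skewed = [15, 16, 18, 19, 20, 21, 22, 25, 30, 140]

print(round(skewness(ages), 2))
print(round(skewness(incomes_skewed), 2))
```

A value close to zero suggests a roughly symmetric distribution, while a large positive value points to a long right tail that may warrant a transformation before clustering.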

**4.** Next, select Alteryx’s K-Means clustering tools under the ‘Predictive Grouping’ tab.

**5.** To find the optimal number of clusters, use the ‘K-Centroid Diagnostics’ tool. Because similarity between data points is determined by a distance metric, variables with a larger scale will influence the clustering more. Standardising the cluster variables using the z-score adjusts for these differences in scale. There are three clustering methods to choose from: K-Means, K-Medians and Neural Gas. In this case, select K-Means as the clustering method.

The tool outputs a K-Means Cluster Assessment Report, which can then be viewed with the ‘Browse’ tool.
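The z-score standardisation mentioned in step 5 can be sketched in a few lines of Python. This is an illustration of the calculation only, not Alteryx's own code; the ages, incomes and function name are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical variables on very different scales.
ages = [19, 35, 52, 67]
incomes = [15000, 40000, 78000, 137000]

ages_z = z_scores(ages)
incomes_z = z_scores(incomes)

print([round(z, 2) for z in ages_z])
print([round(z, 2) for z in incomes_z])
```

After standardising, a difference of one unit means one standard deviation in either variable, so income no longer dominates the distance calculation simply because its raw numbers are larger.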

**The Adjusted Rand Index**

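This index measures how closely two sets of cluster assignments for the same data agree, corrected for agreement that would occur by chance. In the assessment report it is calculated across repeated runs on resampled data, so it indicates how stable the clusters are for a given number of clusters. The goal is to find the number of clusters with the highest index values and the narrowest quartile spread (smaller box range), as this means the clustering is reproduced consistently from run to run.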
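For readers who want to see how the index is calculated, here is a minimal Python sketch of the Adjusted Rand Index for two sets of cluster labels covering the same data points. It follows the standard pair-counting formula; the function name and example labels are assumptions, and it is not the code Alteryx runs internally.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same points."""
    n = len(labels_a)
    # Count pairs of points that are grouped together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Count pairs grouped together within each clustering on its own.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings put everything together
    return (together_both - expected) / (maximum - expected)


# Same grouping with different label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that cut across each other: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])

print(same, crossed)
```

A value of 1 means the two clusterings are identical up to renaming of the clusters, values near 0 mean agreement is no better than chance, and negative values mean agreement is worse than chance.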

**The Calinski-Harabasz Index**

This index measures the separation between clusters and the compactness within clusters. Higher values of the index indicate better separation and compactness. A good rule of thumb is to look for the number of clusters where the Calinski-Harabasz index is maximised, i.e., the number of clusters with the highest median.
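The Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom. The Python sketch below shows the calculation on a toy example; the points, labels and function name are hypothetical, and the code assumes at least two clusters and some within-cluster spread.

```python
def calinski_harabasz(points, labels):
    """(between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    def centre(group):
        return [sum(dim) / len(group) for dim in zip(*group)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = centre(points)
    # Separation: how far each cluster centre sits from the overall centre.
    between = sum(len(g) * sq_dist(centre(g), overall) for g in clusters.values())
    # Compactness: how far points sit from their own cluster centre.
    within = sum(sq_dist(p, centre(g)) for g in clusters.values() for p in g)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical points: two tight pairs that are far apart horizontally.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # groups the nearby points
poor = calinski_harabasz(points, [0, 1, 0, 1])  # groups the distant points

print(good, poor)
```

The sensible grouping scores far higher than the poor one, which is why the number of clusters with the highest index is preferred.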

Here, at four clusters, the Adjusted Rand Index has the smallest quartile range and the Calinski-Harabasz index is the highest, suggesting that this might be the optimal number of clusters.

**6.** The ‘K-Centroid Cluster Analysis’ tool can be used to calculate the clusters.

The ‘R’ anchor of the ‘K-Centroid Cluster Analysis’ tool outputs a report with a statistical summary and cluster solution plots.

**7.** Use the ‘Append Cluster’ tool to append the cluster assignments to the initial dataset.

A new column with the name set in the configuration window will be added to the dataset.

## Performing Cluster Analysis with Tableau

**1.** Connect and load the dataset in Tableau. Ensure that the fields being used for clustering are appropriately formatted.

**2.** Build a scatter plot using the relevant dimensions and measures. This provides an initial visualisation of the data.

**3.** Navigate to the Analytics pane and double-click the ‘Clusters’ option. This opens the Clusters configuration window, where users choose the measure or measures to cluster on and, if required, specify the number of clusters.

Note that Tableau uses the K-Means algorithm for clustering, min-max normalisation for scaling and the Calinski-Harabasz index to determine the optimal number of clusters.
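Min-max normalisation rescales each variable to the range 0 to 1. The short Python sketch below shows the calculation on hypothetical values; it illustrates the formula only and is not Tableau's internal code.

```python
def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical annual incomes (in thousands).
incomes = [15, 40, 65]
scaled = min_max(incomes)

print(scaled)
```

Unlike z-score standardisation, the result depends only on the smallest and largest values, so a single extreme value can compress the rest of the range.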

Measures are aggregated by the default aggregation for the field, whereas dimensions are aggregated using ATTR. Tableau will automatically create a new field, ‘Clusters’, indicating the cluster assignment for each data point. Users can choose to set the number of clusters or let Tableau decide.

In this case, the number of clusters is set as four and colour is used in the scatter plot to visualise the clusters.

Changing the clustering variables also changes the cluster assignments and the number of clusters, for example when clustering only on Age.

**4.** To generate a summary report of the clusters, right-click the ‘Clusters’ field and select ‘Describe clusters’.

**5.** To edit the cluster variables, right-click the ‘Clusters’ field and select ‘Edit clusters’.

**6.** Further analyse the cluster characteristics by creating other visualisations, such as a box plot for each variable.

From the box plots, it can be seen that cluster one consists of older middle-aged adults with a low annual income and the second-lowest spending score. Cluster two consists of young adults in their mid-30s to 40s with the highest annual income and the highest spending score. Cluster three consists of young adults in their mid-20s to 30s with the lowest annual income and the second-highest spending score. Cluster four consists of middle-aged adults with a high annual income but the lowest spending score.

The supermarket can make use of the four clusters identified to personalise marketing campaigns tailored to each customer cluster and boost sales.

**Comparing Alteryx and Tableau: Which tool to use for cluster analysis?**

It depends on the goal of the cluster analysis. If the goal is a simple cluster analysis with visually appealing, interactive visualisations, Tableau would be a good choice. If the project will be completed by an experienced data scientist or statistical analyst who needs to configure the clustering parameters (e.g., choosing different clustering methods), then Alteryx would be a great choice.

Alternatively, the two pieces of software can be used in conjunction with one another. Alteryx can be used to build and output the clusters and Tableau to take the data and visualise it.

**How to interpret a cluster?**

Interpreting a cluster involves understanding the characteristics of the data points belonging to that cluster and identifying the patterns that distinguish it from the others. This process typically involves examining the following aspects:

**Comparison**

Compare the characteristics to those of other clusters. This can help identify similarities and differences and reveal broader patterns in the data.

**Domain Knowledge**

Incorporate domain knowledge and expertise to interpret the results in the context of the specific problem or domain. This can provide deeper insights into the meaning and implications of the cluster.

**Spread**

Measure the spread of the data points, such as the variance or standard deviation. This indicates how tightly or loosely the data points are grouped around the central tendency.
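The comparison and spread checks described above can be sketched as a per-cluster profile. The Python below computes the mean and standard deviation of each variable within each cluster; the (age, spending score) rows, the labels and the function name are hypothetical and serve only as an illustration.

```python
from statistics import mean, pstdev


def cluster_profile(rows, labels):
    """Per-cluster (mean, standard deviation) for every numeric column."""
    grouped = {}
    for row, lab in zip(rows, labels):
        grouped.setdefault(lab, []).append(row)
    profile = {}
    for lab, members in grouped.items():
        columns = zip(*members)  # transpose rows into one tuple per variable
        profile[lab] = [(mean(col), pstdev(col)) for col in columns]
    return profile


# Hypothetical (age, spending score) rows with their cluster assignments.
rows = [(22, 80), (24, 90), (26, 85), (50, 20), (54, 10), (58, 15)]
labels = [0, 0, 0, 1, 1, 1]

profile = cluster_profile(rows, labels)
for lab, stats in profile.items():
    print(lab, [(round(m, 1), round(s, 1)) for m, s in stats])
```

Comparing the means across clusters shows what distinguishes each group, while the standard deviations show how tightly its members sit around those means.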

**Summary**

Cluster analysis groups similar data points to reveal patterns, aiding in applications like customer segmentation and fraud detection. Businesses embracing Alteryx or Tableau (or both in tandem) can excel in data processing, gain detailed insights and interpret them through intuitive dashboards.
