Fuzzy Matching in Alteryx & Tableau | Beyond the Basics

Fuzzy matching, also known as fuzzy logic, is an innovative technology used in artificial intelligence and machine learning. It’s a process that identifies elements in data sets that are similar but not identical. Unlike typical search logic, which works on a binary pattern, fuzzy matching uses an algorithm that navigates intermediate degrees of certainty to find duplicate entries that may not share the exact same characteristics. In essence, it enables the identification of entries that are potentially misspelled, formatted differently or exist only in partial form.

The Importance of Fuzzy Matching

According to a survey conducted by MIT, 97% of businesses admitted to having problematic data, with duplicates being one of the contributing factors. However, duplicates are challenging to detect, especially as many are non-exact matches. As such, fuzzy matching’s ability to detect non-exact duplicates proves invaluable in ensuring data quality across organisations. Besides deduplication, fuzzy matching can also be used in various other applications, including:

  • Spell Checkers: Improve the accuracy of suggested corrections for misspelled words.
  • Fraud Detection: Identify unusual patterns or similarities in data.
  • Record Linkage: Identify matches between datasets sourced from different systems or databases.
  • Information Retrieval: Identify relevant documents or entries in text-based applications such as search engines and document management systems.

Methods of Fuzzy Matching

While there are many fuzzy matching algorithms to choose from, there is no one-size-fits-all solution, as each algorithm comes with its strengths and weaknesses. Some of the most common fuzzy matching algorithms can be categorised as follows:

Algorithms: Character Overlap MeasureEdit Distance ApproachPhonemic Algorithms
ExplanationCalculates the number of characters that two strings have in common, regardless of their position or order.Measures similarity between two strings by measuring the number of edits (e.g., insertions, deletions, or substitutions) needed to transform one string into another.Utilises phonetic representations of words or phrases to compare the sounds of different words or phrases.
StrengthEffectively detects different lengths or structures.Effectively detects variations in spelling or structure (e.g., abbreviations).Effectively detects spelling or pronunciation variations in similar sounding words or phrases.
WeaknessSpelling errors are not considered.Ineffective to compare strings with different lengths.Can produce false matches if there are multiple valid phonetic representations of a word or phrase.
Algorithms(1) Jaccard Similarity Coefficient (2) Cosine Similarity Measure (3) Dice’s Coefficient(1) Levenshtein distance (2) Damerau-Levenshtein distance (3) Jaro-Winkler distance(1) Soundex (2) Metaphone (3) Double Metaphone
ExamplesThe names "Bella" and "Bela" share a common set of characters {b, e, l, a}, resulting in a Jaccard similarity score of 1.0, indicating a perfect match.The names "Dwayne" and "Duane" have a Jaro similarity score of 0.911, which meets the threshold of 0.85 and proves to be a match.The names "Robert" and "Rupert" both have the same Soundex code, R163, indicating a match.

Fuzzy Matching: A Practical Guide

To illustrate the use of fuzzy matching, consider a dataset with numerous names, yet only a handful of distinct individuals identified by unique IDs. The goal is to aggregate all variations of the same name, which may present themselves in various forms, including but not limited to typographical errors, abbreviations, and rearrangements. The aim is to reduce the dataset to solely the handful of unique names. Fuzzy matching is a technique that can be readily accomplished utilising simple, no-code software platforms like Tableau Prep Builder or Alteryx.

Performing Fuzzy Matching with Tableau Prep Builder

 

Image depicting the Fuzzy Matching workflow in Tableau Prep Builder

Tableau Prep Builder is a user-friendly and low-code platform that allows you to perform fuzzy matching with ease. Below is a step by step guide breaking down how to do it:

  1. Import the data into Tableau Prep Builder and add a ‘Clean Step’ to the workflow.
  • In the Profile Pane view, click on the ‘…’ option to access more options.
  • Hover over the ‘Group Values’ option to see the available fuzzy matching algorithms.
Gif showing how to complete the initial cleaning step in Tableau Prep Builder

2. Pronunciation: This algorithm uses the Metaphone3 algorithm to generate keys based on the English pronunciation of the values. It is useful for fixing misspellings or punctuation errors where words still sound similar despite the inconsistency.

3. Common Characters: This algorithm is useful for fixing capitalisation or formatting issues. It applies the ngram fingerprint algorithm, which arranges words based on their distinctive characters and excludes duplicates, punctuation marks and spaces.

Image depicting the results of the Common Characters functionality in Tableau Prep

4. Spelling: This algorithm uses the Levenshtein distance algorithm to calculate the number of changes needed to transform one value into the other. If the number of changes is below a certain threshold, the values are grouped together.

Image depicting results from Tableau prep Spelling Functionality

5. To improve performance and capture additional similar names, stringency of the algorithm rules can be decreased by adjusting the grouping parameters.

6. It’s worth highlighting that relaxing the grouping rules comes with the potential drawback of reducing the accuracy of the algorithm. However, this risk can be mitigated by utilising the Manual Selection option to make necessary adjustments and fine-tune the algorithm to achieve optimal performance.

Image depicting the path to Manual Selection button in Tableau Prep

Performing Fuzzy Matching with Alteryx

Image showing Alteryx Workflow configuration

Alteryx is a powerful data analytics tool that offers advanced options for performing fuzzy matching on large datasets. Here is a step by step guide to get you started:

  1. Pre-Processing: This is the initial step in the fuzzy matching process where the input datasets are imported and prepared for the matching process. Pre-processing is important because it ensures that the data is consistent and standardised, reducing the likelihood of errors and mismatches during the matching process. This step involves tasks such as cleaning, standardising, and deduplicating data to ensure that it is ready for matching.

2. Fuzzy Match: This is the core step of the fuzzy matching process in Alteryx. It is used to identify records that match each other despite minor differences, such as spelling mistakes or variations in formatting. To use the Fuzzy Match tool in Alteryx, you need to set the following parameters:

2.a. Merge/Purge Mode: Decide whether to purge or merge records that have been identified as duplicates. Purging removes the duplicate records from the dataset entirely, while merging combines them into a single record with the values from both records. The appropriate option depends on your specific use case.

2.b. Record ID Field: Select the column that you want to use for the fuzzy matching process. This could be an ID number or any other unique identifier that is present in both datasets.

2.c. Match Threshold: This parameter defines the level of similarity required for a record to be considered a match. You can set this to a value between 0% and 100%, where 100% represents an exact match and 0% represents no match at all. The appropriate threshold value depends on your specific use case, but in general, a higher threshold will result in more precise matching.

2.d. Match Fields: Specify the fields to compare when identifying matches. This could include columns for names, addresses, phone numbers, or any other relevant data. It’s important to choose the fields that are most likely to contain matching data.

2.e. Match Style: The Fuzzy Match tool in Alteryx offers several match styles that allow you to compare records based on specific criteria. Here are some of the match styles available:

    • Name: This match style is used to compare names. You can also specify whether to include nicknames or phonetic matches in the comparison. In this case, we would select the ‘Name w/ Nicknames’ style.

ANNE KING, ANNE MEIER KING, and ANNEE MEIER KING are matched with ANN MEIER KING, with match scores of 85, 98, and 95 respectively, using the MatchKey of KNK.

Image showing Fuzzy Match results within Alteryx
  • Address: This match style is used to compare addresses and includes options for comparing AddressPart, Address No Suite, and Zip Codes.
  • Phone: This match style is used to compare phone numbers.
  • Company: This match style is used to compare company names and includes options for comparing the entire name or specific parts of the name, such as the legal name or trading name.
  • Custom: This match style allows you to create your own matching criteria. This can be useful if you have specific formatting requirements or unique data structures.

2.f. Edit: You can further customise the matching process based on your specific needs. Here are some available match options:

  • Preprocess: This option allows you to do a preliminary pre-processing of your data to improve the analysis of matches. The options include Strip Punctuations, Tokenize Japanese Text, and more.
  • Generate Keys: This option allows you to generate keys for the strings being matched, which can improve the accuracy of the matching process. Keys are essentially transformed versions of the strings being matched that are easier to compare.
  • Match Function: This option allows you to choose the specific algorithm used for the matching process. Alteryx provides several pre-defined algorithms, such as Jaro, Jaro-Winkler, Levenshtein, and Smith-Waterman. Each algorithm has its own strengths and weaknesses and you can choose the one that best fits your specific matching needs.

2.g. Advanced Options: Additional advanced options are available for customising record matching and output in data processing. These options can help reduce processing time and streamline data output.

By configuring the Fuzzy Match tool appropriately, you can identify potential matches between records that may not be immediately apparent, allowing you to extract valuable insights and data from your datasets.

3. Post-Processing: This is the final step in the fuzzy matching process where the matched records are organised and presented in a meaningful way. Post-processing is crucial because it allows you to analyse and interpret the results of the fuzzy matching process. This step involves tasks such as grouping, sorting, and formatting the matched records for presentation and analysis. By organising the matched records in this way, you can gain valuable insights and make informed decisions based on the results of the fuzzy matching process.

By using Alteryx to perform fuzzy matching, you can efficiently process large datasets and extract valuable insights that might otherwise be hidden.

The example above illustrates a basic form of fuzzy matching, but Alteryx has the ability to perform much more advanced and complicated matching. By adjusting the match threshold, match configuration, and other parameters, you can customise the fuzzy matching process to suit your specific use case and data. With its intuitive interface and powerful capabilities, Alteryx is an ideal tool for anyone looking to perform fuzzy matching and gain deeper insights into their data.

Tableau vs Alteryx: Which Data Preparation Tool is Right for You?

While both tools have their strengths, it is important to choose the right tool based on your specific needs and requirements.

To learn more about the comparison between the two tools, you can refer to our blog comparing Alteryx and Tableau Prep.

Best Practices for Fuzzy Matching

Fuzzy matching is a technique used to compare data sets and identify matching records, even when the data is not an exact match. While it can be a powerful tool, it is important to follow best practices to ensure accurate results.

  • Selecting Matching Criteria and Algorithms: Before performing fuzzy matching, determine the matching criteria you will use to compare datasets, such as the similarity of names, addresses, and phone numbers. Select the appropriate fuzzy matching algorithm, such as Levenshtein Distance, Jaro-Winkler Distance, or Cosine Similarity, based on your specific use case.
  • Clean and Standardise Your Data: Fuzzy matching works best when the data is standardised and clean. Remove any special characters or symbols and ensure that the data is in a consistent format.
  • Test and Validate Your Results: Once you have run the fuzzy matching process, test and validate the results to ensure accuracy. This may involve manually reviewing a sample of the matched records to confirm that they are correct.

Limitations of Fuzzy Matching

While fuzzy matching can be a valuable tool for data matching and record linkage, there are also limitations to be aware of.

  • False Positives: One of the main limitations of fuzzy matching is the risk of false positives. This occurs when two records are incorrectly matched, even though they are not actually the same.
  • Contextual Ambiguity: Fuzzy matching algorithms are not able to take into account the context of the data, which can result in incorrect matches. For example, a name that is common in one geographic location may be rare in another, leading to incorrect matches.
  • Scalability: Fuzzy matching can be a time-consuming process, particularly when working with large datasets. This can make it difficult to scale the process to handle larger volumes of data.

Summary

Fuzzy matching is a powerful tool in artificial intelligence and machine learning that helps identify similar elements in data sets that are not identical. It has proven valuable in deduplication, fraud detection, record linkage and information retrieval, among other applications. While there are several fuzzy matching algorithms available, each has its strengths and weaknesses, making it important to choose the right algorithm for a specific task. With user-friendly and low-code platforms such as Tableau Prep Builder and Alteryx, fuzzy matching is accessible to many, and organisations can benefit from this technology by ensuring data quality, improving accuracy and enhancing efficiency.

Frustrated with duplicate data slowing down your workflow? Our team can help you with a custom fuzzy matching workflow that fits your needs. Contact us today to learn more.

  1.  

Leave a Comment