With the increasing demand for data, massive volumes are being generated daily. These large data sets are widely available; however, they are often unstructured. Companies that get their hands on this type of data must undergo massive data cleaning processes to gather the consumer insights they require. Without prior preparation and cleaning, unstructured data can lead to flawed results and ultimately hinder or negatively impact business decisions. Data can only generate accurate insights when it is properly formatted and thoroughly cleaned. The data cleansing and preparation process is therefore a crucial step in ensuring good data hygiene and accuracy.
How can data cleaning help my organisation?
Data cleansing is the process of reviewing the data within a database, updating incomplete information and removing incorrect, duplicated and irrelevant entries. In marketing, this process ensures that unique IDs are not repeated and that entries with invalid email addresses or phone numbers are removed, avoiding duplicated or invalid contacts. In other fields, such as healthcare and engineering, analysing incorrect, corrupted or incomplete entries can introduce errors whose consequences are dire. Cleaned data greatly minimises such risks and mistakes, and leads to more precise results when machine learning algorithms and predictive modelling are applied.
Traditional Data Cleansing
Traditional data cleansing is tedious and repetitive, requiring a lot of additional effort from analysts, who could otherwise spend their valuable time on tasks with more business value. It is generally a manual process completed in Excel with copy-paste methods and, as such, is frequently prone to error. These manual efforts are notoriously difficult to keep track of, and it is nearly impossible to undo significant mistakes once they are found. An incorrect dataset can be deemed redundant even after all the time and effort spent cleansing it; at times, the analyst has to redo the entire cleansing process from the start.
With the advancement of technology, modern data cleansing tools are available in the market to reduce the challenges of traditional data cleansing. These tools are quick and easy to pick up, requiring only a few clicks, and can track all the changes made by users.
Data Cleansing with Alteryx
One powerful tool for data cleansing is Alteryx, an analytics automation software in a low-code environment. In place of lines of code for automation, Alteryx has an entire suite of built-in tools readily available for creating automated processes. To help you visualise how Alteryx can help your organisation with data cleansing, we have outlined the process in six key steps.
The six key steps as shown in the diagram above are:
- Remove Duplicates
- Remove Irrelevant Information
- Manage Incomplete Data
- Identify Outliers
- Fix Structural Errors
- Validate the Data
We will elaborate on each step below, with recommended tools and examples.
1. Remove Duplicates
Duplicated entries are quite prevalent within messy data, especially when joining large amounts of data from various sources together, or even through manual entries. These duplicates usually result in unreliable analysis, where the inflated cumulative amounts deviate significantly from the true value.
Recommended Tool: Unique Tool
The Unique Tool is a quick and easy way to identify whether an instance in the data is a unique entry or a duplicate. To configure the tool, the user simply selects the column names by which each record should be uniquely distinguished.
In the example above, we connect the Unique Tool to the data we would like to deduplicate and select 'FirstName' and 'LastName' as the columns to check. After running the workflow, the duplicate (D) anchor contains six data entries, the first of which has the 'FirstName' DENISE. Searching for DENISE in the unique (U) anchor, we find another entry with the same first and last name. We can now easily remove the six duplicates from our records.
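Alteryx does this with a drag-and-drop tool, but for readers who think in code, a rough pandas equivalent might look like the sketch below. The sample data and column values are made up for illustration; only the 'FirstName'/'LastName' column names come from the example above.

```python
import pandas as pd

# Hypothetical customer records; DENISE SMITH appears twice
df = pd.DataFrame({
    "FirstName": ["DENISE", "JOHN", "DENISE"],
    "LastName":  ["SMITH", "TAN", "SMITH"],
    "City":      ["London", "Singapore", "Leeds"],
})

# Flag rows whose FirstName/LastName pair has already been seen,
# mirroring the Unique Tool's U (unique) and D (duplicate) anchors
dupes = df.duplicated(subset=["FirstName", "LastName"])
unique_df = df[~dupes]   # U anchor: first occurrence of each pair
dupe_df = df[dupes]      # D anchor: repeated entries

print(len(unique_df), len(dupe_df))  # 2 1
```

As with the Unique Tool, only the chosen key columns decide uniqueness; the differing 'City' values do not stop the third row from being flagged as a duplicate.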
2. Remove Irrelevant Information
Big data usually contains a lot of information, some of which might not be relevant. To focus on the information crucial for further analysis, we have to remove the irrelevant parts. This streamlines the data load, allowing faster processing in subsequent steps.
Recommended Tool: Filter Tool
The Filter Tool splits the data into two output streams (True and False) based on the conditions specified in the tool configuration. Relevant information that fulfils the conditions is output through the True (T) anchor and can then be used downstream for further work.
In the example, we connect the Filter Tool to the previous Unique Tool and use the Basic Filter to remove records whose CustomerSegment is 'Consumer'. After running the workflow, we can observe 10 entries in the False (F) anchor, all with CustomerSegment 'Consumer'. This leaves 80 entries in the T anchor, none of which are 'Consumer' records, and these 80 entries are carried forward to our next step.
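In pandas, the same split into True and False streams can be sketched with a boolean mask (the sample rows are invented; only the CustomerSegment column and the 'Consumer' value come from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "CustomerSegment": ["Consumer", "Corporate", "Consumer", "Home Office"],
})

# The condition plays the role of the Filter Tool's expression:
# rows that satisfy it go to the True anchor, the rest to the False anchor
condition = df["CustomerSegment"] != "Consumer"
true_anchor = df[condition]     # carried downstream for further work
false_anchor = df[~condition]   # the excluded 'Consumer' records
```

Keeping the False stream around, instead of discarding it immediately, makes it easy to audit exactly which records were filtered out.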
3. Managing Incomplete Data
After removing duplicates and streamlining our data, other issues such as missing values or incomplete data may remain. These problems generally arise from incomplete manual entries or customers not providing all of their information. One therefore has to determine whether those missing values can be replaced with plausible values or should be dropped from the analysis entirely. By purposefully managing incomplete data, we strengthen data integrity and ensure a standardised treatment for incomplete fields.
Recommended Tool: Formula Tool
The Formula Tool in Alteryx is very powerful: it can create a new column, or update an existing one, using one or more expressions to perform a variety of calculations and operations. It is usually used to create calculated fields or to perform logical amendments to existing data entries. Other uses include:
- Apply conditional statements
- Convert numbers and strings
- Format dates
- Cleanse string data
- Perform validation tests on data
In our example above, we apply a conditional statement in the Formula Tool to replace all nulls in the column 'Visits' with 1. This could be a business rule agreed upon by management to treat all missing visit data as 1, since the incomplete entries could come from first-time visitors to the store.
After dragging the Formula Tool onto the Alteryx canvas, we select 'Visits' as the column to update. We then use an if-else statement so that the value is changed only when it is null; fields that already have a value are kept as they are. After running the workflow and giving the output table a quick glance, the null values are no longer present. The data is now complete, with no missing values in the required columns.
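A pandas sketch of the same if-else treatment, assuming a 'Visits' column with nulls (the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Customer": ["A", "B", "C"],
                   "Visits": [5, None, 2]})

# Equivalent of a Formula Tool expression like
# IF IsNull([Visits]) THEN 1 ELSE [Visits] ENDIF:
# only null fields are replaced, existing values are kept as they are
df["Visits"] = df["Visits"].fillna(1).astype(int)

print(list(df["Visits"]))  # [5, 1, 2]
```

Encoding the business rule once, as an expression, guarantees every missing value gets the same standardised treatment rather than ad hoc manual fixes.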
4. Identify Outliers
The next thing to consider is whether the data points are consistent with each other. Data points that lie far from the rest, in other words outliers, should be reviewed, as they can significantly distort the reality of the data. There are various statistical methods to identify outliers: visual techniques include box plots, histograms and scatterplots, while numerical methods include the Z-score or standard deviation. Once the outliers are identified, the user can choose to include or omit each data point depending on how extreme it is and whether it skews the analysis results.
Recommended Tool: Scatterplot Tool
The Scatterplot Tool plots a basic scatter plot of two variables specified in the configuration and can fit a least squares regression trend line. It requires only basic configuration: the user simply indicates the fields for the X and Y axes and chooses styling options for the graph.
With the scatterplot generated from the data, one can visually look out for any data points that deviate from the general trend line. Furthermore, with the marginal box plots generated on both the X and Y axes, one can easily identify data points that lie far from the mean and the interquartile range.
There are also other visual tools in Alteryx for analysing data, such as the Distribution Analysis Tool, which shows how the selected field, Avg Claims per Month, is distributed.
With these statistical tools available to evaluate the data, analysts can understand their data better, which in turn helps them select the most appropriate columns for further statistical analysis or data modelling.
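The numerical Z-score method mentioned above can be sketched in a few lines of pandas (the figures are invented for illustration):

```python
import pandas as pd

# Daily sales figures; the last value looks suspicious
sales = pd.Series([12, 15, 14, 13, 16, 15, 120])

# Z-score: how many standard deviations each point lies from the mean
z = (sales - sales.mean()) / sales.std()
outliers = sales[z.abs() > 2]  # a common cut-off; 3 is also widely used

print(list(outliers))  # [120]
```

Flagging rather than silently deleting the point leaves the analyst free to decide, as described above, whether each outlier is a data error to omit or a genuine extreme value to keep.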
5. Fix Structural Errors
Before validating our data, we have to do a final preparation: ensuring that the data is in the correct format and that there are no inconsistencies in typography, capitalisation, abbreviation and formatting. With the data properly formatted, we can manipulate it further, for instance by grouping it by category. Without this cleaning, it is difficult to group categorical data points into their respective groups. As shown in the table, there are duplicated categories within the entries for 'City' due to differences in typography.
Recommended Tool: Data Cleansing Tool
The Data Cleansing Tool can clean up common data quality issues using a range of parameters. It is simple to use and can remove null rows or columns from the data set. Furthermore, it can remove leading and trailing whitespace, numeric characters in strings and unwanted punctuation from the specified columns.
In the example above, the entries in the column 'City' are not consistent, with unwanted punctuation and varying text cases used throughout. With the Data Cleansing Tool, we simply select the 'City' column under Select Fields to Cleanse and check the box for Punctuation. To modify the case, we check the box for Modify Case and set it to Upper Case, ensuring consistency in the format of the string data. With that, the data is properly formatted, without having to amend each field individually.
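The same punctuation, whitespace and case cleanup can be sketched with pandas string methods (the city values are made up; only the 'City' column comes from the example above):

```python
import pandas as pd

# Inconsistent entries that all refer to the same city
df = pd.DataFrame({"City": ["singapore!", "Singapore.", " SINGAPORE "]})

# Mirror the Data Cleansing Tool's Punctuation, whitespace and
# Modify Case (Upper Case) options
df["City"] = (df["City"]
              .str.replace(r"[^\w\s]", "", regex=True)  # strip punctuation
              .str.strip()                               # trim whitespace
              .str.upper())                              # standardise case

print(df["City"].nunique())  # 1
```

After cleansing, the three variants collapse into a single category, so a subsequent group-by produces one group instead of three.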
6. Validate the Data
The final step of data cleansing is to validate the data, ensuring that it is accurate, complete, consistent and uniform. This is a crucial step, especially in an automated data cleansing process, where checks are required to ensure the output is consistent and correct. Examples of validation checks include verifying that the Name field contains no numerical values, or that the account number is entirely numeric.
Recommended Tool: Test Tool
The Test Tool in Alteryx can verify data or processes in the workflow with only simple configuration. All test conditions can be made within one Test Tool, which can be connected to multiple incoming data streams. If the conditions within the tool are not satisfied, the Test Tool shows an error message in red.
In the Test Tool above, we set the Test Name to 'Only Numeric in ZIP', as we should only have numerical values in the column 'ZIP' coming in from input connection #1. The test name can follow any naming convention the user identifies with. There are four different test types in the Test Tool; for this example, we used 'Expression is True for All Records' to test every entry against the condition. The Test Value is the condition for the test: 'IsInteger(ZIP)' ensures that the entire string in the field ZIP is made up of integers. If the condition is not satisfied, an error message appears.
After running the workflow, the error message appears in red wording below the Test Tool, indicating that the test failed on record #3. To verify this, we look at the results window, where record #3 in the column 'ZIP' indeed contains a letter. The Test Tool can be quite useful for text-delimited data, or for ensuring data integrity before the data is used in other workflows.
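An equivalent 'all records are numeric' check can be sketched in pandas (the sample ZIP values are invented; only the 'ZIP' column name and the failing-record idea come from the example above):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob", "Cara"],
                   "ZIP": ["238801", "059893", "12A45"]})

# Analogue of the Test Tool condition IsInteger(ZIP), applied in the
# spirit of 'Expression is True for All Records'
is_numeric = df["ZIP"].str.fullmatch(r"\d+")
failed = df.index[~is_numeric] + 1  # 1-based record numbers, like Alteryx

if len(failed):
    print(f"Test 'Only Numeric in ZIP' failed on record(s): {list(failed)}")
```

Like the Test Tool's red error message, this reports the offending record numbers instead of silently passing bad data downstream.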
As we have seen in the examples above, it can be quite challenging to maintain good data hygiene without any formal processes. Alteryx's drag-and-drop capabilities can maximise a dataset's accuracy within seconds, significantly reducing the man-hours spent working on it. Furthermore, it can be deployed on more complex string cleaning processes such as regex statements and text parsing. Other advanced data cleaning processes with Alteryx include extracting insightful information from responses and feedback and forming analyses through word clouds, topic modelling and sentiment analysis.
If you are interested in finding out more about implementing data cleansing processes with Alteryx, feel free to reach out below.