The Value of Data
I’m pretty sure we’re all familiar with the phrase ‘Data is the new oil’. The phrase is credited to Clive Humby, the British mathematician who coined it in 2006, and it gained widespread attention after The Economist published an edition titled “The world’s most valuable resource.”
Since then we’ve witnessed the power of data grow as it helped businesses make better decisions, improve workplace efficiencies and enhance customer experience and engagement. Government bodies also rely heavily on data to understand their population and make informed decisions, e.g. tracking the progression of COVID-19 to curb the growth of a cluster.
However, actionable insights don’t just happen when working with data; deriving them is an iterative process. Errors in data entry cause dirty data, a term that describes invalid or unusable data, and one of the most common problems for data scientists, engineers and researchers. Dirty data can be hugely costly to companies, so it is critical to understand where it stems from, how it affects the business and how it should be dealt with.
What exactly is dirty data and where does it come from?
Dirty data refers to data that contains inaccurate or incorrect information. Data that is misleading, duplicated, incorrect, inaccurate, misspelled or incorrectly punctuated can all be classified as dirty data.
Data scientists currently spend up to 80% of their time cleaning and organising data, while data knowledge workers spend up to 50% of their time working with hidden or inaccurate data. The effects of dirty data are dire, costing the average business 15% to 25% of its revenue through data inefficiencies. It is even estimated that the US economy loses $2.5 to $3 trillion every year due to dirty data.
Knowing the disadvantages of dirty data, you may be wondering – where exactly does it come from?
In a report published by Experian, it was found that the inability of companies to build a solid data strategy has a significant impact on data quality. This is because if companies are not able to articulate the quantifiable impacts of poor data quality in the organisation, they are more likely to allocate their budget ineffectively, resulting in an ongoing cycle of fewer tools and more manual processes – which causes more human error.
Human error accounts for almost 60% of why dirty data exists and is undoubtedly the biggest challenge in ensuring data cleanliness. Other causes of dirty data could also include poor interdepartmental communication or inadequate data strategies.
How does dirty data affect my company?
Wasted resources and lost productivity are among the most adverse effects of dirty data on a company. In the US alone, an estimated 27% of revenue is wasted on business decisions based on inaccurate or incomplete customer data. Research undertaken by Gartner indicates that the “average financial impact of poor data quality on organisations is USD$9.7 million per year.” Inaccurate insights that stem from dirty data can cause companies to make the wrong decisions and suffer significant losses.
The productivity impact ripples through several important areas of a business. Data specialists must spend more time cleaning and organising the data, while knowledge workers spend extra time analysing inaccurate and misleading data. This causes a dip in productivity, as that time could instead be spent on analysis and deriving insights.
Furthermore, the data collection itself can lose credibility. End-users who rely on the data will want to spend more time confirming its accuracy, further reducing the time available for insights and productivity. This can also affect the business externally if customers grow frustrated with the company’s inefficiencies, straining client-business relationships. Lost credibility would also impede the transformation into a data-driven company.
Left unaddressed, dirty data can come at a high price for businesses.
How do we identify dirty data?
Rather than relying on the ineffective method of spotting data inaccuracies with the naked eye while scrolling through Excel sheets, data-driven businesses are implementing platforms like Alteryx and Tableau. These services offer data cleaning and preparation tools that don’t just clean and prepare, but also reveal how clean or dirty your data is.
1. Alteryx Designer’s Data Investigation Tools
Alteryx is a leading platform that offers self-service data preparation, data blending and advanced data analytics for both coders and non-coders. Its user-friendly product, Alteryx Designer, provides an intuitive drag-and-drop environment that allows users to perform advanced analytics seamlessly.
Alteryx Designer offers various Data Investigation tools and functionalities to explore your data with. Here are a few examples:
| Tool | Description |
| --- | --- |
| Histogram | Produces a histogram to display the empirical distribution of a single numeric field, showing the frequencies of records falling into a set of continuous value ranges. It can also provide a smoothed empirical density plot: frequencies are displayed when the density plot is not selected, and probabilities when it is. |
| Basic Data Profile | Shows an overview, or profile, of the data and outputs the information for further analysis. It analyses the data and provides metadata for each column. |
| Frequency Table | Helps you understand the contents of the data. For each field selected, a frequency table shows a summary of the data with frequency counts and percentages. |
| Scatterplot | Makes enhanced scatterplots, with options to include boxplots in the margins, a linear regression line, a smooth curve, a smoothed conditional spread and outlier identification. |
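For readers who prefer a quick script, the kind of frequency-table profile these tools automate can be sketched in a few lines of plain Python; the column values here are hypothetical, and the lowercase “sydney” hints at exactly the sort of inconsistency such a profile surfaces:

```python
from collections import Counter

# Hypothetical column of categorical values, including one inconsistent entry
city = ["Sydney", "Melbourne", "Sydney", "sydney", "Brisbane", "Sydney"]

counts = Counter(city)
total = len(city)

# Frequency table: (value, count, percentage), sorted by count descending
freq_table = [
    (value, count, round(100 * count / total, 1))
    for value, count in counts.most_common()
]

for value, count, pct in freq_table:
    print(f"{value:<12}{count:>5}{pct:>8}%")
```

A profile like this makes a rogue value such as “sydney” stand out immediately, which is the core idea behind the frequency-table and profiling tools described above.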
2. Tableau Prep Builder Profile Pane
Although Tableau is best known for its visual analytics platform, it also offers Tableau Prep Builder, a self-service data preparation product. Prep Builder provides a modern approach to data preparation, making it easier and faster for users to combine, shape and clean data for analysis with Tableau or other platforms.
This platform is as visual and intuitive as Tableau Desktop, allowing users to get to quality data in just a few clicks. Tableau Prep’s Profile Pane, for example, surfaces inaccuracies and errors in your data through highlighted fields.
The inclusion of visual guides showing how to clean and prepare your data makes data exploration, cleaning and preparation more convenient and efficient.
3. Trifacta’s Active Data Profiling
To go up a notch with data exploration, consider Trifacta, a platform that provides an interactive, visual user experience. Trifacta uses AI/ML-based suggestions to guide you and your data teams through the exploration and transformation of any dataset.
Like Tableau Prep, Trifacta also makes it easy to identify data quality issues through its active and visual data profiling. Using the profile pane, users can easily understand the distribution, type and format of data values, immediately identifying any mismatched or missing values.
Trifacta’s AI/ML capabilities also enable users to zoom into individual columns for more detailed profiling of the data, such as a breakdown of mismatched values sorted by frequency.
Active data profiling can be your catalyst in striving for good data, as it helps your team easily determine how accurate, complete and valid any dataset is.
With the availability of all these tools in the market today, it is clear how valuable data exploration and preparation is. Investing time and effort to explore your data in great detail will be worthwhile as it saves you time and trouble, while also ensuring data integrity and trust for future analysis.
What do we need to keep in mind while cleaning our data?
Despite the continuous advocacy for how data helps you make better decisions, the caveat is that information is only as valuable as its quality. Good-quality data cultivates accurate insights, which in turn drive trusted, well-informed business decisions.
Poor data quality not only results in financial losses; it also incurs higher processing costs, unreliable analysis, poor governance and compliance risk and, in the worst case, loss of your brand’s value and trust.
To prevent this, you need measures to guide you in assessing and preparing your data. Consider the following six data quality dimensions, which can help any data user assess, interpret and improve their data.
1. Completeness

When we address the completeness of our data, it covers a variety of attributes depending on the entity.
If we’re talking about customer data, it should show the minimum information required for a productive engagement with an individual. For example, if the customer address field includes an optional state attribute, data can still be considered complete even when the state information is missing.
For products and services, completeness can suggest vital attributes that help customers compare and choose from a variety of products. When shopping online, if a product description of a television does not include any measurements, the data of this product is incomplete.
Essentially, completeness measures whether the data is sufficient to deliver meaningful inferences and decisions for its user.
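As a rough illustration, a completeness check along these lines can be sketched in Python. The record fields and the list of required fields are hypothetical stand-ins for your own business rules (note that the optional `state` field does not count against completeness, matching the customer-address example above):

```python
# Hypothetical customer records; "state" is optional, so it is not required
records = [
    {"name": "Amy", "email": "amy@example.com", "state": "NSW"},
    {"name": "Ben", "email": None, "state": None},
    {"name": "Cai", "email": "cai@example.com", "state": None},
]

# Minimum fields needed for a productive engagement (assumed business rule)
required = ["name", "email"]

def is_complete(record, required_fields):
    """A record is complete if every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in required_fields)

complete = [r for r in records if is_complete(r, required)]
completeness = len(complete) / len(records)
print(f"Completeness: {completeness:.0%}")  # 2 of 3 records are complete
```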
2. Accuracy

Data accuracy refers to how correctly your data reflects reality. For example, when collecting customer data, if a customer’s gender is female but your data records her as male, then this information is inaccurate.
Questions such as, “does this information represent the true reality of the situation?” help to improve your data’s accuracy.
3. Consistency

Consistency is ensuring that information is kept uniform as it moves across multiple datasets or applications. It is usually expressed as the percentage of matched values across various data records. Consistent data ensures that the analytics performed on it correctly capture and leverage its full value.
Data consistency is often associated with accuracy, and data that scores highly on both consistency and accuracy makes for a high-quality dataset.
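A simple consistency score, measured (as described above) as the percentage of matched values across records of the same entity, might be sketched like this; both records are hypothetical:

```python
# Hypothetical: the same customer as recorded in two different systems
crm_record     = {"name": "Amy Tan", "email": "amy@example.com", "city": "Sydney"}
billing_record = {"name": "Amy Tan", "email": "amy@example.com", "city": "Melbourne"}

# Compare every field the two records share and count exact matches
shared_fields = crm_record.keys() & billing_record.keys()
matches = sum(crm_record[f] == billing_record[f] for f in shared_fields)
consistency = matches / len(shared_fields)
print(f"Consistency: {consistency:.0%}")  # 2 of 3 shared fields match
```

The mismatched `city` field is precisely the kind of inconsistency this dimension is meant to flag before analysis.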
4. Validity

Validity measures whether data attributes are aligned with a specific domain or requirement. For example, date records are valid if they match the standard date format, and ZIP codes are valid if they contain the correct characters.
A rule of thumb is to assess validity against the business rules you have set in place. Invalid data can affect the completeness of your data; to handle this, you can define rules to ignore or resolve invalid values and restore completeness.
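Validity rules like the date and ZIP-code examples above are often expressed as patterns. A minimal sketch in Python, assuming ISO-style dates and 5-digit US ZIP codes as the business rules:

```python
import re

# Assumed business rules: ISO dates (YYYY-MM-DD) and 5-digit US ZIP codes
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
ZIP_RE = re.compile(r"^\d{5}$")

def is_valid_date(value):
    return bool(DATE_RE.match(value))

def is_valid_zip(value):
    return bool(ZIP_RE.match(value))

# Hypothetical date entries in inconsistent formats
dates = ["2023-01-15", "15/01/2023", "2023-1-5"]
valid_dates = [d for d in dates if is_valid_date(d)]
print(valid_dates)  # only the first entry matches the standard format
```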
5. Uniqueness

The uniqueness dimension indicates whether a data record is a single recorded instance in the dataset. Uniqueness is the most critical dimension for ensuring there are no duplicates or overlaps, and it is measured against all records within a dataset or across datasets.
Two actions critical in ensuring uniqueness include identifying overlaps and data cleansing. Identifying the overlaps helps in maintaining uniqueness, whereas data cleansing and deduplication can remediate duplicated records.
Highly unique datasets help build trust in the data, improve data governance and speed up compliance.
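The overlap-identification step described above can be sketched in a few lines; the customer IDs are hypothetical:

```python
from collections import Counter

# Hypothetical customer IDs, with one duplicated record
customer_ids = ["C001", "C002", "C003", "C002", "C004"]

counts = Counter(customer_ids)
duplicates = {cid: n for cid, n in counts.items() if n > 1}

# Uniqueness as the share of distinct IDs among all records
uniqueness = len(counts) / len(customer_ids)

print(f"Duplicated IDs: {duplicates}")  # {'C002': 2}
print(f"Uniqueness: {uniqueness:.0%}")  # 4 distinct IDs out of 5 records
```

Identifying the overlaps this way is the diagnostic half; the deduplication itself belongs to the cleansing step covered later.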
6. Timeliness

Timeliness refers to whether information is accessible and available when it is expected and needed. It can be measured as the time between when information is expected and when it is readily available for use.
This dimension is particularly important when updates to applications must be synchronised with the centralised resource supporting the common, shared representation. If data is timely, business applications will have consistent, up-to-date information readily available whenever required.
To further simplify this evaluation process, our partner Collibra offers a data governance platform that allows you to automate data management activities and gives cross-functional teams a single location to find and establish a common understanding of your data.
Improving Data Quality With Collibra
1. Creating a data dictionary
A data dictionary is a detailed list of all your data assets. Examples of data assets include database schemas, tables, columns and fields. Your data dictionary will contain descriptions of each of these data assets along with its technical lineage (i.e. how it has been transformed from the data source to the consumption layer), allowed values (e.g. data type, number of characters, or other constraints), and any other relevant technical metadata.
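As a rough sketch of what such an entry records, a single data dictionary item might look like the following. All names and values here are hypothetical illustrations, not Collibra’s actual schema:

```python
# A hypothetical data dictionary entry for one column of a customers table
data_dictionary = {
    "customers.email": {
        "description": "Primary contact email address for the customer",
        "data_type": "VARCHAR(255)",          # allowed values / constraints
        "constraints": ["NOT NULL", "UNIQUE"],
        # Technical lineage: source -> staging -> consumption layer
        "lineage": "crm_raw.contacts.email -> staging.customers.email "
                   "-> analytics.customers.email",
        "owner": "Data Engineering",
    }
}

entry = data_dictionary["customers.email"]
print(entry["data_type"])  # VARCHAR(255)
```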
You can find this information by working with the technical users in your organisation, such as data engineers, and documenting how they have designed your databases to meet business requirements.
You can also use data governance platforms, such as Collibra, to ingest data sources and process metadata as you build and maintain your data dictionary. When you connect a data source to Collibra, it will process that data source, identify any data assets, and add them to a data dictionary.
For example, if you register a Google BigQuery data source, Collibra will make that metadata available to you by identifying any schemas, tables, views, columns, primary keys, and foreign keys and add them to a data dictionary. If there is an update to the Google BigQuery data source in the future, and you need to update the data dictionary, you can refresh the schema of the registered data source in Collibra to update the corresponding metadata.
By using Collibra, you can simplify the process of building and maintaining data dictionaries. This will increase visibility of how data flows in your organisation, improve compliance with data quality requirements and increase overall trust in the data.
2. Leveraging machine learning to generate autonomous data quality rules
To ensure data quality, organisations may need to write dozens – or even hundreds – of data quality rules for each dataset, which requires significant time and effort. With Collibra DQ, however, you can easily onboard your datasets into Collibra to autonomously generate data quality rules and continuously monitor the quality of your datasets.
When you onboard a dataset, Collibra will first profile your data and then build a machine-learning model for each dataset. This allows Collibra to auto-discover and generate adaptive data quality rules.
When Collibra builds a model for your data, it will define what is ‘normal’ within the context of each dataset. As the data changes, this definition of ‘normal’ will also change.
Data quality checks can be scheduled to run daily. With each run, Collibra will profile the dataset at the column level and establish the model for the dataset. Within a few runs, the model will be able to begin detecting data quality issues.
By leveraging features such as the unified scoring system, autonomous data quality rules and automatic detection of data quality issues, you can use Collibra to continuously monitor the health of your datasets and mitigate risks in real time.
Once you have assessed your data’s current quality, it’s time to clean it so it can reach its full potential.
How do we clean dirty data?
When analyses are built on dirty data, garbage in produces garbage out. It is best to practise data cleansing habits and create a company culture that thrives on decision-making driven by quality data.
Let’s walk through five steps that could guide you in tackling dirty data:
Step 1: Remove Inapplicable or Duplicate Data
Unwanted data, including duplicated records and irrelevant information, should be removed from your dataset. Removing this noise minimises interference in your analysis and creates a more manageable and efficient dataset to work with.
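This step can be sketched as a single pass that drops exact duplicates and out-of-scope rows. The rows and the “AU-only” scope are hypothetical:

```python
# Hypothetical raw rows: (customer_id, country), with one duplicate and
# one row that is irrelevant to an AU-only analysis
rows = [
    ("C001", "AU"),
    ("C002", "AU"),
    ("C001", "AU"),   # exact duplicate
    ("C003", "NZ"),   # out of scope for this analysis
]

seen = set()
cleaned = []
for row in rows:
    if row in seen:
        continue          # drop exact duplicates
    if row[1] != "AU":
        continue          # drop rows irrelevant to this analysis
    seen.add(row)
    cleaned.append(row)

print(cleaned)  # [('C001', 'AU'), ('C002', 'AU')]
```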
Step 2: Fix Structural Errors
Structural errors refer to abnormal formats in your data, including odd naming conventions, typos and incorrect capitalisation. These inconsistencies can cause mislabelled categories or groups, reducing accuracy. Correcting misspelled fields and inconsistent capitalisation ensures accuracy in the data analysed or presented.
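One way to sketch this fix is to normalise capitalisation and map known typos and variants to a canonical form; the labels and the mapping here are hypothetical business rules:

```python
# Hypothetical category labels with typos and inconsistent capitalisation
labels = ["N/A", "n/a", "na", "Not Applicable", "Sydney", "SYDNEY", "Sydny"]

# Map known variants and typos to a canonical form (assumed business rules)
canonical = {"n/a": "not applicable", "na": "not applicable", "sydny": "sydney"}

def normalise(label):
    key = label.strip().lower()       # fix capitalisation and stray spaces
    return canonical.get(key, key)    # fix known typos/variants

cleaned = [normalise(l) for l in labels]
print(sorted(set(cleaned)))  # ['not applicable', 'sydney']
```

Seven messy labels collapse into two consistent categories, so grouped counts are no longer split across misspelled variants.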
Step 3: Filter Unwanted Outliers
Whilst working with your data, you might chance upon anomalies in your dataset. If an anomaly truly does not fit within the dataset you’re working on, it’s generally recommended that you remove this data point to prevent skew during analysis. However, if an outlier is still meaningful for your analysis, it’s fine to keep it. Outliers require proper consideration, as removing or keeping one can change the outcome and effectiveness of your data analysis.
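One common way to flag candidate outliers (an assumed technique, not one prescribed above) is the 1.5 × IQR rule, sketched here with hypothetical daily order values:

```python
import statistics

# Hypothetical daily order values, with one clear anomaly
orders = [120, 135, 128, 140, 132, 125, 9_999]

# Quartiles via the standard library; IQR = Q3 - Q1
q1, _, q3 = statistics.quantiles(orders, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = [x for x in orders if lower <= x <= upper]
outliers = [x for x in orders if not (lower <= x <= upper)]
print(outliers)  # [9999]
```

Whether 9,999 is a data-entry error to drop or a genuine bulk order to keep is exactly the judgement call the step above describes; the rule only flags the point, it does not decide for you.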
Step 4: Handle Missing Data
Missing data can also threaten your data quality, as some algorithms will not work with missing values. You have two options when dealing with missing values; neither is optimal, but both are worth considering to preserve data quality.
- Drop the rows or columns that contain missing values. Although this method can reduce inaccuracies, it also means potentially losing information.
- Impute missing data by inferring it from other data. However, this method can reduce data integrity, as it operates on assumptions instead of actual observations.
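Both options can be sketched in a few lines of Python with hypothetical values; mean imputation from the same column is used here as a simple stand-in for the more general idea of inferring missing values from other data:

```python
import statistics

# Hypothetical ages with missing values (None)
ages = [25, None, 31, 40, None, 28]

# Option 1: drop missing values (loses information)
dropped = [a for a in ages if a is not None]

# Option 2: impute the mean of the observed values (adds assumptions)
mean_age = statistics.mean(dropped)
imputed = [a if a is not None else mean_age for a in ages]

print(dropped)   # [25, 31, 40, 28]
print(imputed)
```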
Step 5: Validate your Data
Once you think your data is ready, determine whether your cleaned dataset makes sense. Does the data now follow the rules it should follow? Can you build trends from the dataset that help you form insights and analyses? If not, is this due to a remaining data quality issue?
If you can answer yes to the questions above, you’re good to go: kickstart your data analysis with your cleaned and prepared dataset.
But wait – what are the tools available to help us clean our data?
There are so many amazing tools available in the market currently which could help you ensure good quality data. Acquiring data that is of good quality and trustworthy for analysis has never been easier.
If you want to know more about these available tools, head over to our Cleansing Data blog, where we deep dive into data cleansing and preparation processes. We feature tools within Alteryx (including their latest release, Trifacta) – for you to assess what suits your needs most.
Alternatively, if you prefer a customised demo to see how our tools could help you build quality data, we would be glad to speak to you to find out more on how we can assist. Here at Billigence, we help organisations transform the way they work with data. Through a blend of resources and tools, our team of over 250 consultants have helped organisations to become more data driven.
Wrong conclusions derived from dirty data can lead to poor business strategies, loss of confidence and trust in your data analysis, and inaccurate decision-making. Remove the risk of poor decisions based on dirty data and start building a data-driven culture today. As Thomas Redman, president of Data Quality Solutions, once said, “Where there is data smoke, there is business fire.” Don’t leave your data to chance; explore, clean and prepare your data the right way today.