Specialising in self-service data prep, Trifacta is a cloud-native solution that allows users to collaboratively profile, prepare, and pipeline data for analytics and machine learning, in a process known as ‘data wrangling’.
At the start of 2022, Alteryx, best known for its desktop-based analytics solution, Alteryx Designer, announced their acquisition of Trifacta.
This acquisition allows Alteryx to combine its leading low code/no code analytics solution with Trifacta’s cloud native capabilities, to provide a highly cloud-enabled platform. The platform is now known as the Designer Cloud Powered by Trifacta or simply Designer Cloud.
In this blog post, we will explore why data wrangling is so important in today’s data-driven landscape and how the Designer Cloud can help simplify and accelerate this process. Keep reading to learn how you can free up much needed resources for more crucial activities.
What is Data Wrangling?
Data wrangling is the process of transforming raw data into the desired format for decision making. Typically, this process can be broken down into six iterative steps:
1. Discovery: Understanding what is in the data – identifying the distribution of data
2. Structuring: Organizing the data for easier computation and analysis – splitting columns
3. Cleaning: Transforming the data to increase data quality – removing null and/or error values
4. Enriching: Deriving additional information about the data – creating new calculated fields
5. Validating: Ensuring that data adhere to data rules – verifying that columns do not have any null values
6. Publishing: Providing access to wrangled data that is easily understood – providing documentation for data wrangling logic applied
According to Designer Cloud, about 60% of an analyst’s time is spent on cleaning and transforming data. As big data continues to grow, the data required to make key data-driven decisions is often lost in data silos. Moreso, the data is frequently found in inconsistent types and formats, requiring customised scripted solutions by data scientists. These data scientists may not understand the business requirements or context of the data, resulting in numerous back-and-forths between themselves and business users, creating an extremely inefficient and expensive process.
Designer Cloud accelerates this process by empowering the business users to apply their domain expertise to the preparation of the data and leverage insights on their own.
Designer Cloud makes data wrangling intuitive, interactive, and ultimately automated at scale. This renders the time-consuming, complex, and error-prone process of preparing datasets, of any volume, into a point-and-click exercise.
What does a typical Designer Cloud workflow look like?
Through a standards-compliant web browser, the main objects that users will create and work with during the data wrangling process are recipes within a flow.
A flow refers to a container which holds one or more data sets and its associated recipes.
A recipe is a sequence of transformation steps that is created to transform the source datasets.
Within a Designer Cloud flow, the six iterative steps of the data wrangling process (as seen previously), can be summarized into four key processes. They are:
1. Import and sample data
2. Profile data
3. Transform data within recipes
4. Run Job
In the next few sections, we will be exploring the processes in-depth, to showcase how Designer Cloud excels in making the data wrangling process intuitive and interactive. This showcases how users of all skill levels can easily transform data to be used in downstream analytics and machine learning processes.
1. Import and sample data
The first step to any data wrangling process in Designer Cloud is to integrate the data into the flow. Designer Cloud can import data from a huge number of data sources, including flat files and databases.
When the data is imported and transformed within recipes, Designer Cloud does not modify the source data. Instead, a set of metadata is created which allows users to transform a small sample of the imported data. This ensures that the client is not overwhelmed with loading huge datasets which can impact its performance. To ensure that users can get to work quickly once the dataset has been imported, the initial data sample is first generated using the first rows sample method.
Additionally, to better represent the underlying dataset or prepare the sample for further transformation steps, Designer Cloud also offers several other methods for users to re-sample data. This includes:
- First Rows Samples: Select first set of rows from the dataset.
- Random Samples: Randomly select rows from the dataset.
- Filter-Based Samples: Find specific values in a column.
- Anomaly-Based Samples: Find missing or mismatched values.
- Stratified Samples: Find rows for every distinct value in a column.
- Cluster-Based Samples: Find all rows for a subset of values in a column.
For example, if the dataset is sorted by date and the data changes significantly in the later periods, then perhaps a cluster-based sample would be more appropriate to capture the changes. On the other hand, if steps to manage bad data or outliers are to be completed, then the anomaly-based method could be used.
2. Profile data
After importing the dataset into the flow, we are now ready to explore our data. In the transformer page, Designer Cloud provides a dynamic and visual profiling tool for each column, allowing users to understand the data and perform the necessary transformations quickly and easily.
Below is a quick overview of the Designer Cloud visual profiling interface:
A) Column Header: Displays the column name and data type.
B) Data Quality Bar: Identifies the valid, mismatched, and missing values in the column. The data quality bar can be used to quickly fix missing and mismatched values.
C) Column Histogram: Displays the count of each detected value in the column or the count of values within a numeric range. The column histogram can be used to quickly identify and fix unusual or outlier values within each column.
3. Transform data within recipes
With a good understanding of the data, we can now start to wrangle it within one or more recipes.
One of the easiest ways which Designer Cloud allows users to transform data is simply by selecting any element of data in the Transformer page. This includes elements from the visual profiling bar above, as well as one or more values within columns. When an element is selected, Designer Cloud automatically prompts users with a set of suggestions for steps to transform the data.
Alternatively, pre-configured transformations are also available via the transformer tool bar or column context menu. The search panel also allows users to search and select from a huge number of transformation options.
Once transform steps are added to the recipe, it is rendered into the data transformation language and applied in real-time to the sample dataset, allowing users to immediately begin working on the next steps of the process.
In the above example, the discount column is locked as an integer data type and the data profile bar tells us that all available values are mismatched (shown in red). This is because the value ’10%’ is recognized as a string. To fix this, we can simply highlight the % sign. Designer Cloud instantly provides several suggestions to transform the data. By hovering or clicking on each suggestion, there is a preview of the implied change, allowing users to review the effects of the change and make any alterations needed before the suggestion is applied. In this case, we will select the replace transformation to replace the % sign with an empty string, leaving only the numerical values within the field.
Additional Trifacta Transformation Features
3.1 Pattern Matching
Using columnar pattern matching, Designer Cloud automatically identifies patterns within data columns, allowing users to transform the data by filtering out a subset of the records, extracting specific values from the field, or even standardising values.
3.2 Transformation by Example
Build recipe objects by mapping example output values for source values. Designer Cloud then interprets the differences between the input and output to automatically determine what transformations are intended/required by the user.
4. Run Job
Once we are satisfied with our data transformations, we can now execute the job against the entire dataset by clicking the run button – either directly from the transformer page, or from the flow view. This opens the Run Job page which includes four main sections to configure:
A) Running Environment:
The Running Environment allows users to select the environment to execute the job. For small to medium-sized jobs, Trifacta Photon can be used to run the job directly in the browser. Alternatively, the Spark or Dataflow running environment can be selected to leverage Apache Spark or Google Dataflow to process large datasets efficiently in a distributed manner.
Trifacta Pushdown Optimisation
Designed to leverage on the processing capabilities of cloud data warehouses, Designer Cloud’s pushdown optimisation capabilities allow users to convert data transformation logic created in Designer Cloud into SQL code, which is then executed directly in the underlying cloud data source.
During transformation, the data stays within the cloud data source, resulting in a secure, efficient, and scalable solution that allows users to focus on the data, instead of writing complex SQL codes.
This feature is available for Snowflake, Google BigQuery, Amazon Redshift, Oracle DB, PostgreSQL, MS SQL Server, and Azure Synapse integrations.
Allows users to select additional output configurations including:
- Option to generate visual profile of job results.
- Option to check for any schema changes since the last time the dataset was loaded.
- Option to fail job when the above schema detects changes, preventing data corruption for downstream processes.
- Option to ignore errors in recipes and proceed with job execution.
C) Publishing Actions:
Under publishing options, users can add, remove, or edit the outputs that are generated from a job. This includes options to generate outputs in CSV or JSON formats, and even to publish the datasets to new or existing tables in established connections.
D) SQL Scripts:
Lastly, the option to define and run SQL scripts before and/or after the job execution is also available. This is useful to create log entries in a database log table to document the history of a table.
Pipeline Your Data with Plans
In Designer Cloud, plans are a sequence of tasks that can be scheduled for periodic execution. For example, a flow created to extract and clean data from source can then be used as a reference dataset for another flow, to perform further transformations, such as data aggregations for visualisation purposes.
Based on the outcome of each task, users are able to create branching tasks, such as sending success/failure messages to receiving applications, like Slack, so that downstream stakeholders remain informed throughout the process.
Finally, once the job is executed, the Job Details page can be used to view the outcome of the job. This includes an overview of the job, output destinations, data profile of job (if selected), dependency graphs and data source information.
Start wrangling in the cloud!
In this blog post, we have detailed how Designer Cloud can enable analysts, data specialists, and other domain experts to quickly transform data of varying sizes for use in downstream processes, including data analytics and machine learning – all without writing a single line of code!
Here at Billigence, our Alteryx and Designer Cloud certified consultants are here to help you unlock your data wrangling capabilities on the cloud. Please feel free to submit an enquiry form below to find out more about how you can turn data wrangling from a six-week process in the IT lab, into a two-hour exercise that can be done at the analyst’s desk.