Over the years, machine learning algorithms have been widely adopted across companies and industries. They have proven successful in areas such as computer vision, demand forecasting and intelligent search. This has, in turn, created high demand for machine learning professionals. To bridge the gap in expertise, Automated Machine Learning (AutoML) tools, such as Alteryx AutoML, have been developed to reduce the complexity of machine learning so that non-specialist users can build models and generate predictive insights.
AutoML is the process of applying machine learning models to real-world problems using automation. More specifically, it automates the selection and parameterisation of machine learning models, which can be time-consuming and tedious. With AutoML tools on the market, users with limited data science expertise can train complex machine learning models easily and accurately. This empowers less technical users to perform data exploration and generate complex machine learning models for forecasting and prediction.
One such AutoML tool is the Alteryx Machine Learning Platform. It is a cloud-native tool that enables users to prepare, blend and output data without writing a single line of code. Furthermore, Alteryx Machine Learning’s built-in insights and automatically generated visualisations allow users to assess data quality and uncover key relationships within the dataset. It can also suggest further improvements and recommend additional features for building high-performing models.
AutoML Use Case & Data Set: Credit Card Fraud Detection
To walk through the functionality of Alteryx AutoML, we will be using a dataset from Kaggle containing credit card transactions made in September 2013. This dataset contains 492 fraudulent transactions out of 284,807 transactions that occurred over a span of two days.
The objective of this use case is to predict whether a transaction is fraudulent, given the time and amount of each credit card transaction and the 25 metrics generated for it. We will use the column ‘class’ to distinguish fraudulent from non-fraudulent transactions, where 0 indicates a non-fraudulent transaction and 1 indicates a fraudulent one.
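The figures above imply a heavily imbalanced target, which matters when choosing how to judge a model later on. As a quick illustration (plain Python, independent of the Alteryx platform), the fraud rate and the accuracy of a naive "always predict 0" rule can be worked out directly from the two counts quoted above:

```python
# Counts taken from the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

fraud_rate = fraud_transactions / total_transactions

# A rule that labels every transaction as non-fraudulent (class 0)
# is correct whenever the transaction is genuine.
always_zero_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Accuracy of always predicting 0: {always_zero_accuracy:.3%}")
```

Fraud makes up roughly 0.17% of the transactions, so a rule that never flags anything would still be right more than 99.8% of the time while catching no fraud at all. Plain accuracy is therefore a weak yardstick for this dataset.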
Step 1: Prep Data
We will start by importing our dataset onto the platform. There are two ways to do this. One is to select ‘Browse’ if the file is saved in the Alteryx Cloud. The other is to select ‘Import’ for a local file on your desktop. After the input file for modelling is selected, the platform loads a preview of the data as shown. With this preview, users can glance at the input data and check that the entries and the columns have loaded correctly.
At this stage, users can also access the data health tab, which shows the current health of the entire dataset. This tab displays a rating of the overall data health, as well as any null fields and outliers within the dataset.
Since our data is of high quality and the outliers it contains are to be expected, we can move on to the next step.
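To make the idea of a data health check concrete, here is a minimal sketch in plain Python that counts missing values and flags outliers in a single numeric column. It assumes the common 1.5 × IQR rule for outliers; this is an illustrative choice and not necessarily how the Alteryx data health tab computes its rating. The `data_health` function and the sample amounts are hypothetical.

```python
from statistics import quantiles

def data_health(column):
    """Count missing values and IQR-based outliers in one numeric column."""
    present = [v for v in column if v is not None]
    missing = len(column) - len(present)

    # Quartiles of the non-missing values.
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "outliers": outliers}

# Made-up transaction amounts with one null and one unusually large value.
amounts = [12.5, 8.0, None, 15.2, 9.9, 11.3, 2500.0, 10.1]
report = data_health(amounts)
print(report)
```

In a fraud dataset, a flagged value such as the large amount here is often a genuine transaction and not a data error, which is why outliers can be expected and left in place.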
Step 2: Data Insights
In this step, users select the Target, which is the column that we want to predict, and the Machine Learning Method to be used. The Machine Learning Method depends on the target selected. In our example, the Target column is ‘class’, which is a categorical field with the values 1 (fraud) and 0 (non-fraud). Thus, the Machine Learning Method for this analysis is classification.
|Target Data Type|Machine Learning Methods|
|Numeric (with DateTime Col)|Time Series|
After the selection is complete, the correlation matrix and the chord diagram are generated (as shown below) to give users a visual representation of the correlations between the features.
Correlation matrix: This measures how closely the features are related to each other and the strength of each relationship, presenting the correlation coefficients in a tabular format.
Chord diagram: This represents the flows or connections between the features. It shows the correlations between features that exceed the correlation threshold set in the chart controls.
Using these exploratory data visualisations, we can decide whether certain columns should be dropped from further analysis, for example because they are highly correlated with other columns or dominated by outliers.
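The two views above can be sketched in a few lines of plain Python. The example computes the Pearson correlation coefficient for every pair of columns (the content of a correlation matrix) and then keeps only the pairs at or above a threshold (the connections a chord diagram would draw). The column names, values and the 0.8 threshold are made up for illustration, and Pearson is an assumption; the platform may use a different correlation measure.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical columns: two that move together and one that does not.
columns = {
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    "amount_with_fee": [11.0, 21.5, 32.0, 41.0, 52.5],
    "hour": [3.0, 1.0, 4.0, 1.0, 5.0],
}

threshold = 0.8  # plays the role of the threshold in the chart controls

# Keep only the pairs whose absolute correlation reaches the threshold.
strong_pairs = {
    (a, b): round(pearson(columns[a], columns[b]), 3)
    for a, b in combinations(columns, 2)
    if abs(pearson(columns[a], columns[b])) >= threshold
}
print(strong_pairs)
```

Only the `amount` and `amount_with_fee` pair survives the threshold here, which is the kind of near-duplicate column one might consider dropping.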
Step 3: Feature Engineering
Next, we move on to feature engineering in the Alteryx Machine Learning Platform. The platform can recommend features and primitives, such as the addition of all numeric columns and transformations between columns, to generate new features for modelling. Usually, this process is very time-consuming, as you would need to create each new feature from the existing data individually. With Alteryx Machine Learning, it is completed with the click of a button: users simply select the primitives they would like to introduce into the modelling.
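For readers who have not met the term, a primitive is simply a reusable operation that derives a new column from existing ones. The sketch below shows the idea in plain Python with two hypothetical primitives, one that adds up all numeric columns and one that multiplies two columns. The names (`sum_numeric`, `amount_x_v1`) and the sample rows are illustrative only and are not taken from the Alteryx platform.

```python
# Illustrative rows; the column names and values are made up.
rows = [
    {"time": 0.0, "amount": 149.62, "v1": -1.36},
    {"time": 1.0, "amount": 2.69, "v1": 1.19},
]

# Each primitive maps one row to one new value (names are hypothetical).
primitives = {
    "sum_numeric": lambda row: sum(row.values()),
    "amount_x_v1": lambda row: row["amount"] * row["v1"],
}

def add_features(row):
    """Return a copy of the row extended with one column per primitive."""
    new_row = dict(row)
    for name, primitive in primitives.items():
        new_row[name] = primitive(row)
    return new_row

engineered = [add_features(row) for row in rows]
for row in engineered:
    print(row)
```

Doing this by hand means writing and checking one such function per feature, which is the repetitive work the platform automates.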
For our use case, as metrics have already been generated for the modelling, no new features are required. So, we will simply move on to the next step.
Step 4: Auto Modelling
After the feature selection, users simply select ‘run’ in the Auto Model step. Alteryx Machine Learning will automatically generate a leaderboard of all the available models, ranking them according to the selected ranking metric. For this case study, we select Area Under the Curve (AUC) as our ranking metric, as AUC measures the ability to distinguish between the classes, 0 and 1. The higher the AUC, the better the model is at distinguishing between the positive and negative classes.
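One way to read AUC is as the probability that a randomly chosen fraudulent transaction receives a higher score from the model than a randomly chosen genuine one. The sketch below computes AUC directly from that definition in plain Python, counting ties as half a win. The labels and scores are made up, and the pairwise loop is for illustration only; it is not the platform's implementation and would be slow on a dataset of this size.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up example: 1 = fraud, 0 = non-fraud, scores are model outputs.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.40, 0.20, 0.35, 0.90]

print(auc(labels, scores))
```

Because AUC depends only on how the scores rank the two classes against each other, it is far less distorted by the class imbalance in this dataset than plain accuracy.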
The platform will search through the various machine learning techniques and automatically display the combination of variables and parameters that gives the highest score. In the result shown below, XGBoost Classifier is ranked the highest with an AUC of 0.98, making it the model that best separates the entries into 0 and 1. From the results generated, we can also see how the other models performed.
Step 5: Evaluate Model
After selecting the best model, we will proceed to evaluate its performance. Alteryx Machine Learning displays the performance and the pipeline of the workflow in this step, showing the processes involved in generating the model as well as the sample size used for evaluation.
In the performance tab, we will see the various metrics for the holdout set and how they compare with the cross-validation results. On the right, the Confusion Matrix is also generated, showing how the predicted negatives compare with the actual negatives, and likewise for the positive class.
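As a reminder of what the Confusion Matrix contains, the following plain-Python sketch tallies the four cells from a list of actual and predicted classes and derives precision and recall from them. The ten labels are made up for illustration and are not results from the model above.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Tally true/false negatives and positives for binary labels 0 and 1."""
    counts = Counter(zip(actual, predicted))
    return {
        "tn": counts[(0, 0)],  # non-fraud correctly left alone
        "fp": counts[(0, 1)],  # non-fraud wrongly flagged
        "fn": counts[(1, 0)],  # fraud that was missed
        "tp": counts[(1, 1)],  # fraud correctly flagged
    }

# Made-up holdout labels: 1 = fraud, 0 = non-fraud.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
precision = cm["tp"] / (cm["tp"] + cm["fp"])  # flagged cases that were fraud
recall = cm["tp"] / (cm["tp"] + cm["fn"])     # fraud cases that were flagged

print(cm, precision, recall)
```

For fraud detection, the false negatives (missed fraud) and false positives (genuine customers flagged) usually carry very different costs, so it is worth reading both cells and not only the overall score.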
Within the insights tab, we can see the importance of the different features for the modelling, both in the list and the graph.
Step 6: Export and Predict
Finally, we can export the findings through several methods for further analysis or discussion with other teams. The images and visualisations generated can be exported to PowerPoint slides, and the model can be exported to a .yxmd file, which can be used for further predictions in Alteryx Designer.
New data files can also be read in at this step, and the new data points can be assigned a prediction from the above model to determine whether each credit card transaction is fraudulent.
Alteryx AutoML Benefits
As shown throughout this blog, the Alteryx Machine Learning Platform can automate the entire machine learning process, from data insights all the way to the selection of the models. The predictions generated are of great importance to businesses, helping them to minimise errors and improve efficiency. In our use case, credit card companies can more easily identify fraudulent credit card transactions, such as customers being charged for items that they did not purchase.
There are also many other use cases which can benefit from the machine learning capabilities within the Alteryx Machine Learning Cloud platform.
Alteryx AutoML vs Alteryx Designer Predictive Tools
If you have already read our previous predictive analytics blog post, where we took a deep dive into Alteryx Designer and its tools for predictive analysis, you may wonder what the difference between Alteryx Machine Learning Cloud and Alteryx Designer Predictive Tools really is.
In Alteryx AutoML Cloud, users simply select the parameters within the cloud platform, without having to use code or tools, and the platform automatically generates the machine learning models. There are also many built-in validations and checks throughout the process, using the various visualisations to help users identify potential issues and relationships within their datasets. Alteryx AutoML Cloud can automatically build out the process pipeline in the cloud and randomly select a sample of the dataset for evaluation, without users having to provide the model with additional unseen datasets.
In Alteryx Designer, users need a good understanding of the various tools available for prediction and must select the relevant combination of tools, with the relevant configurations, to build the prediction models. Users have to build their own data exploration and insight generation steps before any further analysis of the data, as seen in the previous blog post. However, Alteryx Designer allows for greater flexibility; users can tweak their dataset easily and perform analysis in a variety of ways.
All in all, Alteryx has been constantly innovating and providing new solutions to help users achieve valuable insights without needing advanced analytics or machine learning backgrounds. Just as the predictive tools within Alteryx Designer simplify predictive analysis, Alteryx Machine Learning reduces the execution of machine learning workflows to a few clicks within the Alteryx Cloud.
If you are interested in learning more about Alteryx or Alteryx AutoML and its capabilities or need some help to get started, we are here to help. Reach out using the form below and start your journey to automation.