What is Natural Language Processing?
Natural Language Processing (NLP) is a subset of artificial intelligence that enables computers to interpret human languages. It focuses on using computer programs to process and infer the meaning of ‘natural’ human languages, both spoken and written, allowing us to interact with machines via speech or text. With billions of pieces of text data generated every day, there is a growing need for NLP and its ability to break down unstructured text for analysis such as sentiment analysis, topic modelling and text classification. As massive volumes of feedback and comments come in from social media channels and in-app messaging, many companies are interested in harnessing NLP to break down this widely available data and inform their marketing strategies.
To illustrate the applications of NLP, we will perform text analytics on TripAdvisor reviews obtained from Kaggle (an online community of data scientists and machine learning experts). Because readers will have varying levels of data science and Python experience, we will showcase the NLP process in the Google Natural Language API for those who are comfortable with coding, and in the Alteryx Text Mining tools for beginners without a coding background. The first half of this blog is a simple walkthrough of the Google Natural Language API and its application to the TripAdvisor reviews. The second half showcases the low- to no-code Alteryx environment and the capabilities of its Text Mining tools on the same reviews.
Introduction to Google NLP API and its application
The Google Natural Language Processing API is one of the many powerful programmatic interfaces within the Google Cloud Platform services. It sits within the broader category of Machine Learning APIs, where several other APIs such as Vision API, Cloud Translation API and AutoML can be found. With the availability of these cloud APIs, users can easily add machine learning based data analysis to their data.
The Google Natural Language Processing API can extract tokens and sentences, identify parts of speech and extract entities within the text, labelling people, locations, events, products and dates. It can also perform sentiment analysis, which generates a sentiment score based on the emotions detected in the text. With this analysis, businesses can gather valuable insights from the feedback and comments provided by users.
To begin the walkthrough, we will first start by importing the Google Cloud client library and reading in the TripAdvisor reviews CSV file, as shown below:
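A minimal sketch of this step, assuming the google-cloud-language package is installed and Google Cloud credentials are configured. The column names Review and Rating and the file name in the usage comment are assumptions about the Kaggle export, and the helper names are illustrative only. The standard-library csv module is used here to keep the sketch self-contained.

```python
import csv


def load_reviews(csv_lines):
    """Read review rows from an open CSV file (or any iterable of lines)."""
    reader = csv.DictReader(csv_lines)
    # Column names "Review" and "Rating" are assumed, not confirmed by the source.
    return [
        {"review": row["Review"], "rating": int(row["Rating"])}
        for row in reader
    ]


def make_client():
    """Create a Natural Language API client (requires google-cloud-language)."""
    # Imported lazily so the CSV helper also works without the Google library.
    from google.cloud import language_v1

    return language_v1.LanguageServiceClient()


# Usage (hypothetical file name):
# with open("tripadvisor_hotel_reviews.csv", newline="", encoding="utf-8") as f:
#     reviews = load_reviews(f)
# client = make_client()
```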
After loading the table, we can start our analysis with the Google Cloud Natural Language API that we have successfully imported.
Sentiment Analysis
As mentioned previously, one of the uses of the Google Cloud Natural Language Processing API is sentiment analysis. In Google Cloud Platform this can be done using analyze_sentiment(), a pre-defined function in the Cloud Natural Language Processing API. The function reads in text and produces a sentiment score and magnitude as outputs, as shown below.
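The call itself might look roughly like the following sketch, which assumes a recent version of the google-cloud-language Python client. The helper names build_document and review_sentiment are ours, and the make_document parameter exists only so the function can be exercised without calling the real service.

```python
def build_document(text):
    """Wrap plain text in the Document type expected by the API."""
    # Lazy import: requires the google-cloud-language package.
    from google.cloud import language_v1

    return language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )


def review_sentiment(client, text, make_document=build_document):
    """Return (score, magnitude) for one review using analyze_sentiment()."""
    response = client.analyze_sentiment(
        request={"document": make_document(text)}
    )
    sentiment = response.document_sentiment
    return sentiment.score, sentiment.magnitude


# Usage, given a LanguageServiceClient instance named client:
# score, magnitude = review_sentiment(client, "The room was spotless.")
```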
To make sense of the output values, we will have to understand the definition of the score and magnitude of the sentiment analysis.
- Score: a numeric value ranging from -1.0 to 1.0 that indicates how negative or positive, respectively, the analysed text is. A score of 0 is considered neutral.
- Magnitude: a non-negative numeric value that indicates the overall strength of sentiment in the text, that is, the amount of emotional language it contains, regardless of whether the score is positive or negative. Text with a lot of mixed positive and negative language will return a neutral sentiment score but a high magnitude. Text with truly neutral language will return a low magnitude in addition to a neutral sentiment score.
Hence, looking at the above output table, the sentiment score and magnitude will help us understand the overall strength of the emotions identified in the reviews.
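To make the interplay of the two values concrete, here is a small sketch that turns a score and magnitude pair into a label. The cut-off values (0.25 for the score band and 2.0 for the magnitude) are hypothetical choices for illustration and are not thresholds defined by Google.

```python
def describe_sentiment(score, magnitude, score_band=0.25, mixed_magnitude=2.0):
    """Label a (score, magnitude) pair; both thresholds are illustrative."""
    if score > score_band:
        return "positive"
    if score < -score_band:
        return "negative"
    # Near-zero score: a high magnitude suggests mixed feelings,
    # a low magnitude suggests genuinely neutral language.
    return "mixed" if magnitude >= mixed_magnitude else "neutral"
```

For example, a review scoring 0.0 with a magnitude of 4.0 would be labelled "mixed", while the same score with a magnitude of 0.1 would be labelled "neutral".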
Scatterplot of Sentiment Analysis
Plotting the sentiment score and magnitude together with the rating provided by the users gives the scatterplot shown below. It helps us visualise how the ratings given by the reviewers relate to the sentiments expressed in their reviews. To represent the five rating scores, we have used varying colour intensity: the higher the rating, the deeper the colour.
From the scatterplot above, we can see that as the sentiment score becomes more negative, the ratings decrease, and as the sentiment score becomes more positive, the TripAdvisor ratings increase. So, we can conclude that the sentiment score approximately reflects the rating given by the reviewer: reviews with neutral or positive sentiments generally have a good rating, while reviews with negative sentiments generally have a low rating.
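One simple way to check this relationship numerically is to average the sentiment score for each rating. The sketch below assumes each review has already been reduced to a (rating, score) pair. The values in toy_rows are made up purely to show the shape of the input and are not taken from the TripAdvisor data.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_rating(rows):
    """Average sentiment score per rating, ordered by rating."""
    grouped = defaultdict(list)
    for rating, score in rows:
        grouped[rating].append(score)
    return {rating: mean(scores) for rating, scores in sorted(grouped.items())}


# Illustrative values only, not real results.
toy_rows = [(1, -0.6), (1, -0.4), (3, 0.1), (5, 0.7), (5, 0.9)]
averages = mean_score_by_rating(toy_rows)
```

If the averages rise steadily from rating 1 to rating 5, that supports the pattern seen in the scatterplot.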
Topic Modelling
Moving on to topic modelling, one of the most common modelling methods is Latent Dirichlet Allocation (LDA), a type of statistical modelling for discovering the abstract ‘topics’ that occur in a collection of documents. Parameters such as the number of topics, the maximum number of words and the stop words must be provided before modelling to achieve an accurate topic classification. With these parameters, the model tries to identify clusters of words that frequently occur together and ranks the words according to their importance.
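As a rough sketch of how these parameters map to code, the example below uses scikit-learn, which is our assumption, since the library behind the original analysis is not stated. The number of topics becomes n_components, the maximum number of words becomes max_features, and the stop words are scikit-learn's built-in English list. The function names are illustrative.

```python
def fit_lda(texts, n_topics=3, max_words=1000):
    """Fit LDA on raw texts; returns (topic-word weights, vocabulary)."""
    # Lazy imports: requires scikit-learn (version 1.0 or later).
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english", max_features=max_words)
    counts = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    return lda.components_.tolist(), list(vectorizer.get_feature_names_out())


def top_words(topic_weights, vocabulary, n=5):
    """Rank the vocabulary by weight within one topic and keep the top n."""
    ranked = sorted(
        zip(vocabulary, topic_weights), key=lambda pair: pair[1], reverse=True
    )
    return [word for word, _ in ranked[:n]]


# Usage, given a list of review strings named reviews:
# weights, vocabulary = fit_lda(reviews, n_topics=3)
# for topic in weights:
#     print(top_words(topic, vocabulary))
```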
Using LDA to create a visualisation as shown above, we can easily identify the three distinct topics and the words within each topic. A more detailed explanation on Topic Modelling can be found under the Alteryx Topic Modelling section below.
Introduction to Alteryx and its application
Users who don’t have much coding experience but want to achieve results similar to the above can do so using Alteryx. Alteryx is an automated analytical solution that allows users to simply drag and drop the tools they would like to use for their analysis. Alteryx has an extensive list of tools available on its Alteryx Designer platform for users to prep their data, automate workflows and generate insights. These tools replace the lines of code otherwise needed to generate the analysis.
We will now focus on the text mining tools that are part of the Intelligence Suite within Alteryx. The tools we will be using are:
- Text Pre-processing Tool
- Sentiment Analysis Tool
- Topic Modelling Tool
- Word Cloud Tool
These tools are easy to operate for users who want to produce text mining analysis without in-depth knowledge of the various analytics methodologies and with little to no coding experience. We will use Alteryx’s drag-and-drop capabilities to recreate the above analysis and build the completed workflow shown below.
Text Pre-processing Tool
After reading in the TripAdvisor reviews CSV file using the input tool, it’s important to clean the data and keep only important, relevant words before performing any analysis. We will use the Text Pre-Processing Tool to tidy up the reviews.
With the above configuration selected, we can remove digits and default stop words (common words such as pronouns, which add little value to the analysis) as well as additional stop words, which are words common to this specific topic. After processing the text through the Text Pre-Processing Tool, we get the cleaned-up text in the ‘Review_processed’ column as shown below.
Comparing the columns ‘Review’ and ‘Review_processed’, we can see that some common words such as ‘hotel’, ‘stay’ and ‘rooms’ were removed. The new column can now be used for further analysis.
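For readers following the coding route, roughly the same clean-up can be sketched in Python. This is not how the Alteryx tool is implemented. The default stop word set below is a tiny hypothetical sample, and only the additional stop words (‘hotel’, ‘stay’, ‘rooms’) come from the example above.

```python
import re

# Hypothetical, heavily shortened stand-in for a default stop word list.
DEFAULT_STOP_WORDS = {
    "i", "we", "you", "it", "they", "the", "a", "an",
    "and", "was", "were", "is", "in", "at", "of", "to",
}
# Topic-specific words removed in the walkthrough above.
EXTRA_STOP_WORDS = {"hotel", "stay", "rooms"}


def clean_review(text, extra_stop_words=EXTRA_STOP_WORDS):
    """Lower-case the text, drop digits and punctuation, remove stop words."""
    # Keeping only runs of letters and apostrophes discards digits.
    tokens = re.findall(r"[a-z']+", text.lower())
    stop_words = DEFAULT_STOP_WORDS | set(extra_stop_words)
    return " ".join(token for token in tokens if token not in stop_words)
```

For instance, clean_review("We had 2 rooms at the hotel and the pool was great") keeps only "had pool great".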
Sentiment Analysis Tool
The Sentiment Analysis Tool is then used to break down the pre-processed reviews into three sentiment categories (positive, neutral and negative). The user selects the text field to be analysed and specifies the thresholds used to classify the sentiments.
With the configuration above, a sentiment score of less than -0.1 will be classified as negative and sentiment score more than 0.5 will be classified as positive. Scores between -0.1 and 0.5 will be classified as neutral. The sentiment analysis tool will generate the following result window.
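The thresholding rule described here can be written as a short function. The cut-offs of -0.1 and 0.5 come from the configuration above, while the function name and the treatment of scores exactly on a boundary (counted as neutral) follow our reading of the text.

```python
def classify_sentiment(score, negative_below=-0.1, positive_above=0.5):
    """Map a compound sentiment score to negative, neutral or positive."""
    if score < negative_below:
        return "negative"
    if score > positive_above:
        return "positive"
    # Everything from -0.1 to 0.5 inclusive is treated as neutral.
    return "neutral"
```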
The result window shows the magnitude of the sentiment detected for all three categories (negative, neutral and positive). The scores are then collated into the compound sentiment score value, which will determine the sentiment category for the reviews.
Topic Modelling Tool
Aside from looking at the sentiments, we can also generate the distinct topics using the Topic Modelling Tool. The Topic Modelling Tool has one input anchor to read in the data and three output anchors:
- D anchor shows the data table with the probability of the review falling into each topic in the result window.
- R anchor shows the interactive chart.
- M anchor stores the model.
By going to the configurations of topic modelling to select the field for modelling and inputting three for the number of topics, we will get an interactive report as shown below, in the R output anchor.
The interactive chart above can show the top 30 most salient terms, which are the most relevant words for each topic. These words allow the business users to better understand the classifications of the words among the three different topics, making it easier for them to consume and understand. The bar shows how often the word appears in all the reviews.
On the right-hand side, there is an Intertopic Distance map which shows us how similar the identified topics are. The further apart the topics are on the Intertopic Distance map, the more distinct they are from one another. In the graph above, we have three distinct topics, each a significant distance from the others.
Topic Modelling helps business users discover patterns of words amongst their documents and highlights insights they may never have noticed when analysing manually. It can also help them identify connections between topics that were not previously known. Some common uses of topic modelling include text summarisation, word recommendations, spam filtering and more.
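To give a feel for what "distance between topics" can mean, the sketch below computes the Jensen-Shannon divergence between two topic-word distributions, a measure commonly used for this kind of map. Whether Alteryx uses this exact measure is an assumption on our part, and the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical topics, ln(2) at most."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)
```

Two topics that put their probability on completely different words reach the maximum value, while two identical topics have a divergence of zero.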
Word Cloud Tool
We can use the Word Cloud Tool to generate a word cloud for each of the three sentiment classifications, with a different colour theme for each. Other customisations include specifying the number of words each word cloud should include and the size of the word cloud.
With the word cloud generated, the business users will be able to easily identify words with a higher frequency at a glance.
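Underneath, a word cloud is driven by word frequencies. A minimal sketch of that counting step, assuming the reviews have already been pre-processed into space-separated words (the function name is ours):

```python
from collections import Counter


def top_terms(reviews, n=3):
    """Count words across all reviews and return the n most frequent."""
    counts = Counter(word for review in reviews for word in review.split())
    return counts.most_common(n)
```

The words returned first are the ones a word cloud would draw largest.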
The results generated by the text mining tools above can then be consolidated into a report as shown below.
This report contains the breakdown of percentage of reviews for each sentiment level as well as the word cloud generated for each sentiment. Below the word cloud, we have the topic modelling interactive visualisation which displays the top 30 words on the left as well as the three distinct topics on the right.
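The percentage breakdown in such a report is a simple aggregation over the sentiment labels. A sketch, assuming each review already carries one of the three labels (the function name is illustrative):

```python
from collections import Counter


def sentiment_share(labels):
    """Percentage of reviews falling into each sentiment category."""
    total = len(labels)
    if total == 0:
        return {}
    return {
        label: 100 * count / total for label, count in Counter(labels).items()
    }
```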
This report helps businesses understand the areas of concern or importance raised in the feedback and reviews provided by their customers. It can help them prioritise their strategies in the identified areas of improvement, keep tabs on consumer preferences and maintain positive relationships with their existing customers. With the boom in social media engagement and internet connectivity, there are many platforms available for consumers to document their thoughts and opinions. Hence, being able to collate this widely available information can help companies monitor the existing sentiments on their products and services from various sources.
Comparing Google and Alteryx
We have showcased two different methods of performing NLP, the Google Natural Language Processing API and Alteryx Text Mining, and we were able to achieve similar analysis with both. The main difference lies in how the sentiment analysis results are generated. Sentiment analysis in the Google Natural Language API uses a pre-trained model built by Google, where a large number of documents have been fed into the algorithm so that the model performs accurately whenever the API is called. Sentiment analysis in Alteryx uses the VADER method, which is a lexicon- and rule-based approach that does not require training data. The normalised compound score it returns takes into account several factors such as letter case, punctuation and slang.
Another advantage of the Google Natural Language API is that it covers other applications of NLP, such as entity and syntax analysis. Furthermore, other Python packages can easily be imported into the code to extend the analysis, allowing for more flexibility and variety.
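As an illustration of the normalisation step, the sketch below follows our understanding of the open-source VADER reference implementation, where the summed word valences are squashed into the range -1 to 1 with a constant of 15. Whether Alteryx uses the same constant internally is an assumption.

```python
import math


def normalise_compound(raw_score, alpha=15):
    """Squash a raw summed valence into the range -1 to 1."""
    normalised = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clamp to guard against values just outside the range.
    return max(-1.0, min(1.0, normalised))
```

A raw score of 0 stays at 0, and increasingly large positive or negative sums approach 1 or -1 without reaching them.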
However, for individuals not comfortable coding, Alteryx is a fantastic solution. It has a friendly user interface, with drag-and-drop capabilities to select tools that generate similar analyses without any coding required.
Conclusion
With either the Google Natural Language API or Alteryx Text Mining, a business can produce powerful insights through text analysis of raw data extracted from various sources. Furthermore, development time and effort can be kept to a minimum, allowing individuals without much text mining background to generate analysis.
For more information on the above NLP capabilities in either the Google Natural Language API or Alteryx Text Mining, feel free to submit an enquiry form below. Our certified consultants are here to help you uncover valuable data insights and recommend suitable tools.