What is a Data Lake?
A data lake is a central repository that is capable of holding large amounts of data in relatively unprocessed formats. With the reduction in costs associated with cloud computing, many organisations have started to use data lakes as a central repository and/or for a different type of analytics. Some examples include business intelligence, big data processing, real-time data analytics and machine learning.
According to Aberdeen’s research, companies that have deployed data lakes reported user satisfactions that were two times higher than those that have not in critical metrics such as data quality and timeliness.
How does Data Lakes differ from Data warehouses?
Data lakes were created to overcome the limitations of data warehouses. Although data warehouses allow organisations to perform data analytics on structured and refined data, they are mostly ineffective in storing the big data that is produced today and are not able to meet the modern analytical needs of organisations today.
Overall, there are four main differences between data lakes and data warehouses.
First, because data warehouses take into account business processes and needs, the data model is kept as simple as possible. It only retains data crucial for the analysis in a highly structured and processed format. On the other hand, a data lake retains all data. This includes data that may or may not be used in the future, for all time, in case the data might be used in the future or if time analysis is needed.
Second, most data warehouse schemas are designed prior to the data warehouse implementation and can only support data in a structured format. This process is known as ‘schema-on-Write’. Due to this, non-traditional data sources such as web server logs, sensor data, social network activity, text and images are mostly ignored. On the other hand, data lakes were specifically made to store such data. Regardless of the source or structure of the data, data lakes keep data in its original form, only transforming them at the time of analysis. This is known as the ‘schema-on-Read’ process.
Third, because data warehouses are structured and built for specific purposes or functions, they are usually much easier to use and understand. Data warehouses are most ideal for operational or business users, who require daily reports, monitor KPIs or work with the same set of data regularly. These operational users usually do not have much time or expertise to prepare or transform data on their own.
For technical users (e.g. data scientists), who may want to conduct deeper analysis or answer new questions, the data provided by data warehouses are often insufficient. Thus, data lakes are usually preferred by these users. While data lakes allow users to explore and make use of larger and more varied data sets, more work and technical understanding is required to prepare and transform the data for analysis.
Finally, in today’s fast-paced environment where speed of analysis is crucial, data warehouses are often unable to adapt quickly enough as making changes to structures require both time and developer resources. Additionally, any changes made to data warehouse structures could potentially affect or break downstream analytical processes that rely on data being in a specific structure.
On the other hand, since data is stored in data lakes are relatively unprocessed, it is always both accessible and malleable. If the result of data exploration or analysis in a data lake is proven to be useful, a more formal schema can then be applied, transforming the data set into a data warehouse, enabling automation and reusability to a larger audience.
Challenges of Data Lakes
While the ability to store structured and unstructured data in its raw format is undoubtedly one of the core benefits of a data lake, one of the main challenges of maintaining a data lake is also to manage how these raw data is stored.
Without proper organisation, metadata management and data governance, data may become increasingly hard to find, turning the data lake into a data swamp. Over time, as more data is added into the data swamp, data becomes increasingly irrelevant, outdated or even faulty. Eventually, if no corrective action or stringent controls are implemented, the data swamp becomes unusable, only producing costs for organisations.
Data Integration: ETL vs ELT
A typical data integration process consists of the following three steps:
- Extract (E): Pulling the data from the source.
- Transform (T): Changing the data structure into the right format.
- Load (L): Depositing the data into the storage system.
The two most common data integration methods are the ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) process.
In the past, when there were no data lakes or cloud data warehouses that had the ability to transform data, ETL was the only way. For ETL, data is first extracted from multiple sources, into a temporary staging area where it is cleaned and transformed before being loaded into the repository. As data warehouses had to work with relational SQL-based data structures, any data that was to be loaded had to be transformed into a relational format before the data warehouse could ingest it.
With ETL, data that is stored in the repository are already in the right format for analysis, allowing for a much faster data analytics process. Additionally, since the ETL process transforms the data before loading, no raw data is transferred directly into the repository. This ensures that sensitive data does not find its way into the repository or even in analytic reports further down the line, allowing organisations to better comply to privacy laws such as GDPR, HIPAA or CCPA.
On the other hand, ELT is a relatively recent technology that was made possible by cloud-based data warehouses that offer near-unlimited storage capabilities and scalable processing power. For ELT, data is first ingested into the repository before it is transformed inside the data lake or data warehouse itself.
Because data is stored relatively unprocessed, the ELT process provides organisations with the flexibility to transform and analyse data in any way depending on the analytical needs. When used with a data lake, the ELT process also allows an organisation to ingest almost every type of raw data, be it unstructured or structured, as it becomes available. This data then allows for a more in-depth and/or unstructured analysis, such as machine learning.
ELT offers a much faster data integration process because data does not need to be sent to a secondary server to be processed. The retention of data in its raw form also allows organisation to create a rich historical repository that can be transformed and used even as goals and strategies change over time. Thus, while ETL may still be used for data that require transformations before entering a repository, most data lakes use the ELT process due to its flexibility, efficiency and scalability.
Data Integration: Batch Processing vs Stream Processing
Further, when determining the frequency and volume of data ingested, there are two primary ways data is processed. This includes batch and stream processing.
For the batch processing model, data is collected over time before being fed into the data lake or warehouse. Batch processing is mostly used when dealing with vast amounts of data, or when the data sources do not support delivering data in streams. Batch processing is also ideal when it is more important to process a large volume of data, involving deeper data analysis, than it is to do fast analytics. Some examples of batch processing use cases includes payroll, billing and order processing.
For the streaming model, data is fed into data lake or warehouses piece by piece, as soon as it is available. While more expensive, stream processing is ideal when organisations require real time analytics. By collecting and analysing data in real time, organisations can detect issues or uncover insights almost as they happen, allowing them to act upon it immediately. Some examples of stream processing use cases include fraud detection, customer behaviour analysis and sentiment analysis.
In summary, both batch and stream processing offer unique advantages and disadvantages based on individual use cases. To stay agile and gain insights from big data quickly, organisations may employ stream processing whereas batch processing is used when dealing with huge amounts of data. Thus, many data lake providers often offer both processing services, allowing organisations to employ either a batch, stream or even hybrid approach when integrating data.
Enabling the Data Cloud with Snowflake
To round up our two-part discussion on data modelling and ingestion, one of the best ways to build a cloud data lake and/or data warehouse implementation is the Snowflake Data Cloud.
The Snowflake Data Cloud is a fully managed MPP (Massively Parallel Processing) cloud-agnostic platform that supports multiple data workloads, from Data Warehousing and Data Lake to Data Engineering, Data Science and Data Application development. By separating storage and workloads to be executed against its own compute clusters, Snowflake eliminates concurrency issues faced by traditional on-premise data lakes/warehouses, allowing multiple processes to run at the same time without competing for resources. Instead of running its own cloud, Snowflake is available on AWS, Azure and GCP. With a common and interchangeable code base, Snowflake’s global data replication feature allows users to move data to any cloud, in any region easily. Additionally, Snowflake’s architecture also allows for the storage of not only structured data, but even some semi-structured data, eliminating the need for data extraction pipelines.
In this example, data from an organisation’s internal database and external sources, which includes structured, semi-structured or even unstructured data, can either be:
- Stored in a data lake before being transformed and pushed into the Snowflake Data Cloud via an ELT process or,
- Loaded straight into the Snowflake data cloud via an ETL process.
With ETL, the data is first extracted in a temporary staging area before being transformed and loaded into the Snowflake Data Cloud, ready for analysis. Top ETL/ELT pipeline tools that simplify data processing include Alteryx and Matillion.
With ELT, instead of transforming the data before loading it into the Snowflake Data Cloud, the data is first landed into a cloud data lake via services provided by cloud providers (e.g., AWS Lake Formation, Azure Synapse Analytics and Google’s Data Flow) or external ELT tools.
Using Snowflake terminology, the data lake will then be known as an external stage, which refers to the location used to store data that is to be loaded into a Snowflake database. Snowflake’s data transformation capabilities would then be utilised to transform the data in any way necessary before it is loaded into the Data Cloud.
For both processes, the data can also be loaded into the Snowflake Data Cloud either via batch and/or stream processing. For batch processing, bulk data loading into Snowflake tables can be done using the COPY INTO command. For stream processing, Snowflake’s continuous data ingestion service, SnowPipe, is used to load data as soon as they are available, either by leveraging on event notifications from the cloud storage or calling a public REST endpoint.
Finally, the data is now ready to be integrated into front-end tools and interacted by users. This includes reporting, query, analysis, or data mining tools. In this example, there is no need to create separate data marts. Snowflake’s innovative data architecture ensures that it can support an unlimited amount of data and users, preventing slow or overloaded data warehouse issues that warrant the creation of additional data marts.
We are moving Analytics to the Cloud!
In the fast-moving environment for every industry, businesses need to be agile, flexible, and responsive in an increasingly competitive landscape. Getting faster, more accessible, and more scalable analytics is now a necessity for many organisations.
Here at Billigence, we have the technical knowledge, industry experience and proven deployment record required to take your analytics to the next level. If you are looking for assistance in accelerating your transition to cloud-based data warehouses and data lakes, or more information regarding the various tools and solutions, get in touch to see how we can help!