In this two-part blogpost, we will explore data modelling and ingestion concepts that can enable organisations to harness the power of big data. In part one, we introduce what big data is, how it is stored and some data warehouse concepts.
In part two, we will be looking at data lakes, including how they differ from data warehouses and how data is ingested, before concluding with a sample hybrid cloud data lake / warehouse implementation.
The Emergence of Big Data
Big data refers to huge amounts of data that cannot be stored, processed, or analysed using traditional methods. Overall, big data can be characterised by the three ‘V’s:
- Volume: the huge amount of data produced.
- Velocity: the speed at which data is produced.
- Variety: the different types of data formats produced.
According to Google’s former executive chairman Eric Schmidt, every two days, humankind produces the same amount of data as we did from the dawn of civilisation up until 2003. While the concept of big data has been around since the 1960s, it has evolved tremendously over the years.
Today, big data is driven by mobile technologies and internet-enabled devices that generate huge amounts of data at near-real-time speeds and in various formats. The emergence of cloud computing in the 2000s also significantly lowered the cost of data storage, allowing organisations to store an almost unlimited amount of raw data in their cloud data lakes or warehouses. As such, new data integration techniques such as ELT (extract, load, transform) were born, enabling organisations to unlock the true analytical power of big data. This allows them to make faster and better decisions, respond to market changes quickly, and give customers what they want, when they want it.
How is big data stored?
For most organisations, their data will sit within a data lake, data warehouse or data mart.
Data lakes contain data in largely unprocessed form, directly from the source systems. With the reduction in storage costs brought by cloud computing, the majority of organisations are shifting to data lakes as their central repository. This data usually has to be transformed before it can be used for business intelligence or machine learning purposes.
On the other hand, data warehouses are designed mainly for data analytics. They typically contain data from multiple sources in a structured and refined form, allowing operational or business users to conduct analysis that empowers decision making. As such, data warehouses are usually organised in a format that can be easily understood and that integrates seamlessly with business intelligence tools. Data warehouses can also hold data transformed and passed downstream from data lakes.
Another type of storage is the data mart. Like data warehouses, a data mart stores cleaned data that are processed and ready for analysis. However, the difference is that data marts only hold data that are designed for a specific purpose, such as a specific subject matter or business unit. Data marts allow users to quickly access and process data without the need to sift through huge amounts of irrelevant information. Since they only contain the necessary data, data marts also reduce the risk of data being misused. Data marts may be created in a top-down approach where data is stored in a centralised location in the data warehouse, before being filtered for specific uses. They can also be created independently, drawing and processing data from multiple data sources instead.
While data lakes, data warehouses and data marts are used to store big data, all three are distinct and serve different purposes. Depending on organisational needs, any or a combination of these storage types could be beneficial in streamlining the ever-growing amount of big data a company has, providing an edge in the long run.
How are data warehouses different from traditional systems?
Data warehouses act as a single repository of structured information. Once data is added to a warehouse, it should not be changed or altered, ensuring a secure, reliable, and easy-to-manage single source of truth. When dealing with today’s big data, data warehouses have several key benefits over traditional data storage systems, such as Microsoft Excel files saved on a shared drive. These include improvements in efficiency, data quality, business intelligence, and security.
With data from multiple sources being standardised and stored in a single location, all data is readily available in the correct format. This improves the efficiency of organisations, as business users are able to quickly query data by themselves, with little to no reliance on a technical expert.
Besides that, by ensuring that the data is stored in the right format, data warehouses also improve data quality by allowing organisations to easily identify and resolve duplicated, incomplete, or even incorrect data.
Business intelligence is also enhanced as organisations can easily retrieve and consolidate data from any source, instead of only relying on limited data from a single source or having to cross-reference different data sources. Since the data is standardised, results produced from the data would also be consistent across all departments, providing a single source of truth and a solid basis for decision making.
Finally, by keeping all data sources consolidated in a single repository with enhanced security features, data warehouses significantly enhance data security. This includes features such as data encryption, which protects confidential data, and role-based access controls, which restrict access on a ‘need-to-know’ basis, ensuring that users only receive access to the data they need.
Data Warehouse Modelling and Schema
In data warehousing, a schema is a logical description of the entire database, including how the data tables are structured and how they relate to one another. Unlike databases that use relational models, data warehouses use dimensional modelling, which is optimised for the efficient and versatile retrieval of data. For example, in dimensional modelling, instead of a date being entered as a single row in a date table, every aspect of the date (e.g. day, month, quarter, year) is entered separately, allowing for quick retrieval of date information and a more flexible analysis.
Dimensional modelling involves fact and dimension tables. Fact tables represent the data that we want to analyse while dimension tables decode and give meaning to the attributes in the fact tables. For example, a fact table could be purchases made at a store, including what was bought, who bought it and the employee who made the sale. This would then be supplemented with dimension tables for the item, customer, and employee, providing additional information, such as the name, address, age of the customer and employee.
Two of the most popular schemas in data warehousing are the Star and Snowflake schemas.
A Star schema is made up of a single fact table, which is connected to multiple dimension tables that are purposely de-normalised (i.e. having redundant copies of data or grouping data to improve read performance of a database). The fact table is linked via a foreign key relationship to the primary key of each dimension table. Since the fact table is directly related to each dimension table, querying data stored with a Star schema is simpler and faster as compared to a normalised (non-redundant/non-repetitive) database. However, due to the de-normalisation of the dimension tables, the Star schema faces an issue with redundancy, where multiple values are repeated continuously. This increases the storage space needed and the risk of data integrity issues.
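The Star schema described above can be sketched concretely. The following is a minimal, hypothetical example using Python’s built-in sqlite3 module (all table and column names are illustrative, loosely following the store-purchases example earlier): a single fact table joins directly to each de-normalised dimension table.

```python
import sqlite3

# Minimal star schema: one fact table joined directly to each dimension.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT            -- de-normalised: city text repeated on every row
);
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT        -- de-normalised: category text repeated per item
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    quantity INTEGER,
    amount REAL
);
""")

cur.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                [(1, "Alice", "Singapore"), (2, "Bob", "Kuala Lumpur")])
cur.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                [(10, "Laptop", "Electronics"), (11, "Desk", "Furniture")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                [(100, 1, 10, 1, 1200.0), (101, 2, 11, 2, 300.0)])

# A typical star-schema query: one direct join from the fact table
# to each dimension, then aggregate.
rows = cur.execute("""
    SELECT c.name, i.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_item i     ON f.item_key     = i.item_key
    GROUP BY c.name, i.category
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 'Electronics', 1200.0), ('Bob', 'Furniture', 300.0)]
```

Note how the query never needs more than one join hop per dimension; the trade-off is the repeated `city` and `category` text in the dimension tables.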
Another option for a data warehouse schema is the Snowflake schema. Developed from the Star schema, the Snowflake schema normalises all dimension tables of a Star schema in a process known as ‘snowflaking’. As a result, dimension tables in a Snowflake schema can have their own dimension tables and may or may not be directly connected to the fact table. This removes data redundancy, reducing the storage space requirements and data integrity issues faced by the Star schema. However, it in turn also gives up some of the advantages the Star schema offers, including the efficiency of queries and data retrievals.
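To illustrate ‘snowflaking’, here is a small hypothetical sketch (again using Python’s sqlite3, with illustrative names): the repeated category text is moved out of the item dimension into its own table, so the category name is stored once, at the cost of an extra join hop.

```python
import sqlite3

# 'Snowflaking' the item dimension: the repeated category text becomes
# its own dimension table, reached from the fact table only via dim_item.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY,
    name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)  -- normalised
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    item_key INTEGER REFERENCES dim_item(item_key),
    amount REAL
);
""")
cur.execute("INSERT INTO dim_category VALUES (1, 'Electronics')")  # stored once
cur.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                [(10, "Laptop", 1), (11, "Monitor", 1)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(100, 10, 1200.0), (101, 11, 250.0)])

# The same category breakdown now needs an extra join through dim_item.
rows = cur.execute("""
    SELECT cat.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_item i       ON f.item_key = i.item_key
    JOIN dim_category cat ON i.category_key = cat.category_key
    GROUP BY cat.name
""").fetchall()
print(rows)  # [('Electronics', 1450.0)]
```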
Overall, both the Star schema and Snowflake schema greatly improve the speed and ease of data analysis as compared to traditional relational models. While there are trade-offs between the two schemas, the choice of which schema to use ultimately depends on the individual use case.
Galaxy & Starflake Schemas
A combination of the two schemas can also be used. Some examples include the Galaxy and Starflake schema. The Galaxy schema contains multiple fact tables that share some dimension tables, reducing the overall size of the database. On the other hand, the Starflake schema combines the Star and Snowflake schema by only normalising certain dimension tables, getting the best of both approaches.
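The shared-dimension idea behind the Galaxy schema can be shown with a tiny sketch (hypothetical names, Python sqlite3 again): two fact tables reference the same dimension table, so item attributes are stored only once for both.

```python
import sqlite3

# Galaxy schema sketch: two fact tables share one dimension table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    item_key INTEGER REFERENCES dim_item(item_key),
    amount REAL
);
CREATE TABLE fact_inventory (
    snapshot_id INTEGER PRIMARY KEY,
    item_key INTEGER REFERENCES dim_item(item_key),
    units_on_hand INTEGER
);
""")
cur.execute("INSERT INTO dim_item VALUES (10, 'Laptop')")
cur.execute("INSERT INTO fact_sales VALUES (100, 10, 1200.0)")
cur.execute("INSERT INTO fact_inventory VALUES (200, 10, 35)")

# Both fact tables can be analysed against the same shared dimension.
row = cur.execute("""
    SELECT i.name, f.amount, inv.units_on_hand
    FROM dim_item i
    JOIN fact_sales f     ON f.item_key   = i.item_key
    JOIN fact_inventory inv ON inv.item_key = i.item_key
""").fetchone()
print(row)  # ('Laptop', 1200.0, 35)
```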
Data Vault Model
Alternatively, to account for big data that is increasingly unstructured and ever-changing, organisations can turn to Data Vault modelling, a database modelling method that has been growing in popularity since its introduction in 2000. A Data Vault is a hybrid data modelling methodology that combines the unstructured nature of data lakes with the rigidity of the usual data warehouse schemas. In a Data Vault model, dimension and fact tables are replaced by hubs, links, and satellites. Hubs and links form the structure of the data model, similar to the dimension and fact tables in a traditional data warehouse schema, while satellites contain temporal and descriptive attributes, including metadata, that provide powerful detail-oriented analytics and history-tracking capabilities.
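A toy sketch of the hub/link/satellite split (hypothetical customer/order example, Python sqlite3, illustrative names and keys): hubs hold only business keys, links record relationships between hubs, and satellites hold descriptive attributes with load metadata, so a change simply adds a new satellite row and full history is preserved.

```python
import sqlite3

# Toy Data Vault layout: hubs = business keys, links = relationships,
# satellites = descriptive attributes plus load metadata for history.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY,   -- surrogate/hash key
    customer_id TEXT,               -- business key
    load_date TEXT, record_source TEXT
);
CREATE TABLE hub_order (
    order_hk TEXT PRIMARY KEY,
    order_id TEXT,
    load_date TEXT, record_source TEXT
);
CREATE TABLE link_customer_order (
    link_hk TEXT PRIMARY KEY,
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    order_hk TEXT REFERENCES hub_order(order_hk),
    load_date TEXT, record_source TEXT
);
CREATE TABLE sat_customer_details (
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    load_date TEXT,                 -- each change is a new satellite row
    name TEXT, city TEXT,
    record_source TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")
# The customer moves cities: the hub row is untouched, the satellite
# gains a second row, and the full history remains queryable.
cur.execute("INSERT INTO hub_customer VALUES ('c1', 'CUST-001', '2024-01-01', 'crm')")
cur.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?, ?)",
                [("c1", "2024-01-01", "Alice", "Singapore", "crm"),
                 ("c1", "2024-06-01", "Alice", "Jakarta", "crm")])
history = cur.execute("""
    SELECT load_date, city FROM sat_customer_details
    WHERE customer_hk = 'c1' ORDER BY load_date
""").fetchall()
print(history)  # [('2024-01-01', 'Singapore'), ('2024-06-01', 'Jakarta')]
```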
On-premises Warehouses vs. Data Clouds
Finally, when deploying a data warehouse, another key consideration is whether to deploy the data warehouse on-premises or in the cloud. For on-premises data warehouses, organisations must purchase, deploy and maintain all hardware and software. On the other hand, for data clouds, there is no physical hardware. Instead, organisations can rely on cloud providers for storage space and computing power. There are five main factors to consider when selecting a deployment type for a data warehouse.
1. Scalability
For on-premises data warehouses, scaling up or down requires both time and resources to install or uninstall physical infrastructure. Because on-premises data warehouses cannot accommodate sudden bursts of activity that require more compute or memory, organisations often size their systems for a peak usage that may only be needed for a short period of time. This results in inefficiencies and huge costs if not managed properly.
For data clouds, there is virtually infinite scalability, with scaling up or down being as simple as changing subscription tiers. There is usually also no need to make any configuration changes when scaling up or down.
2. Cost
For on-premises warehouses, since organisations own the physical hardware and software, a huge capital expenditure (CapEx) is necessary. The cost of setting up an on-premises data warehouse becomes an asset on the organisation’s balance sheet and is depreciated over a period of time. Organisations also have to bear the cost of all necessary software licensing, maintenance, and security.
For data clouds, the upfront and ongoing upkeep costs of on-premises data warehouses are replaced with a simple usage-based, operating expense (OpEx) model. With this model, organisations only pay for the storage and computing they need, and these payments are recorded in the period’s income statement instead of the balance sheet. The cloud provider also handles all ongoing maintenance, administration, and updates.
3. Performance
While the performance of a data warehouse is affected by multiple factors, assuming all else is equal, on-premises data warehouses typically perform better than data clouds for an organisation operating from a single location. This is because on-premises data warehouses are not affected by latency issues (i.e. the time it takes to store or retrieve data).
On the other hand, for an organisation spread across multiple sites, a data cloud could improve the overall performance of the data warehouse. This is because data clouds usually exist in multiple locations around the world, allowing queries to be served from the fastest (or closest) location instead of a fixed one.
4. Connectivity
Connectivity refers to how effectively a data warehouse connects to various data sources and tools. Today, big data is coming from more sources than ever before and is being pulled into data warehouses. On the opposite end, data stored in data warehouses is also being used in far more downstream processes.
Traditionally, for on-premises data warehouses, connectivity is much more secure because organisations have full control over how or when the data warehouse is connected. This is crucial for organisations where data security is critical to operations, such as finance and government sectors.
On the other hand, significant investments have also been made by cloud providers to ensure secure connectivity. Increasingly, many data cloud providers, such as Snowflake, have been certified by independent bodies as safe (Snowflake’s Security & Compliance Reports). Additionally, for data clouds, connectivity is more seamless as they typically come with a built-in ecosystem that allows for easy connections to a wide array of cloud services.
5. Reliability & Security
For on-premises data warehouses, the organisation is fully responsible for the reliability and security of the setup. An organisation will have complete control of what hardware and software it uses, where it sits, and who has access to it. In an event of a failure, complete access to physical hardware and software is available for the IT team to resolve the issue, without needing to rely on a third party.
For data clouds, since organisations do not manage the infrastructure, they depend fully on the cloud providers for reliability and security. While this may sound counter-intuitive, it can result in a more reliable and secure data warehouse. This is because most data clouds offer service level agreements that guarantee up to 99.99% availability and back up data to ensure maximum reliability and disaster recovery. Additionally, since cloud providers serve multiple organisations, they can afford the expertise and resources to provide industrial-strength, end-to-end data warehouse security, something that individual organisations may not be able to achieve on their own.
Data Cloud is the Future
While some companies may still prefer an on-premises data warehouse due to the need for total control of its data, data clouds are the way of the future.
According to Foundry’s 2022 Cloud Computing Research, about 69% of organisations have started to accelerate their cloud migration over the past year and the percentage of companies with most or all IT infrastructure in the cloud is expected to rise from 41% in 2022 to 63% in 2024.
Not only are data clouds cost-effective, easy to use and scalable, but the technology has also been evolving rapidly over the past few years, pushing the boundaries of what data warehouses are capable of.
Alternatively, the option for a ‘hybrid cloud’ solution may also be considered to balance the trade-offs of both deployment types. For example, an organisation can store sensitive data on-premises, while relying on a cloud provider to store other day-to-day data. This way, specific security, compliance, or performance requests can be managed, while the flexibility and scalability of the cloud can still be leveraged.