Businesses today need fast, reliable access to their data in order to make informed decisions. With large volumes of data being generated and processed every day, optimising data retrieval is essential for efficient and effective operations. One approach to optimisation is through caching, which allows frequently accessed data to be stored in a temporary location for faster retrieval. This blog post, will explore how caching works in Snowflake and best practices for optimising a caching strategy.
What is Caching?
Caching is the process of storing frequently accessed data in a temporary location, such as in memory or on disk, for faster access. By caching data, applications can avoid the time-consuming process of retrieving the same data repeatedly from its original source. This can help improve application performance, reduce network latency and minimise resource usage.
Snowflake Caching Architecture
Snowflake‘s caching infrastructure comprises three types of cache: the Query Result Cache, Metadata Cache and Virtual Warehouse Cache. These caches are a vital feature of Snowflake’s internal architecture, which is designed to ensure that frequently executed queries are executed faster and with greater efficiency. Both the Query Result Cache and Metadata Cache are fully managed in the cloud services layer, while the Virtual Warehouse Cache is implemented in the compute layer.
Query Result Cache
This cache stores query results in memory for fast access. When a user executes a query, Snowflake automatically checks if the query result is already in the Result Set Cache. If it is, Snowflake returns the cached result instead of executing the query again.
This cache is valid for 24 hours from the last run time of the query and is invalidated if the query syntax changes or the underlying data in the queried tables changes. Typically, query results are reused if all of the following conditions are met:
- The user executing the query has the necessary access privileges for all the tables used in the query.
- The new query syntactically matches the previously-executed query.
- The table data contributing to the query result has not changed.
- The persisted result for the previous query is still available.
- Any configuration options that affect how the result was produced have not changed.
- The query does not include functions that must be evaluated at execution (e.g. CURRENT_TIMESTAMP()).
- The table’s micro-partitions have not changed (e.g. been re-clustered or consolidated) due to changes to other data in the table.
To verify whether a query made use of the Result Cache, check the query profile in the Snowflake UI. It will show a node like the following:
This cache stores information about database, schemas, tables, views and other database objects in memory for fast access. When a user queries a table or view, Snowflake checks if the metadata for the object is already in the Metadata Cache. If it is, Snowflake returns the cached metadata instead of retrieving it again from the database.
This cache has no time dependency and is not invalidated by changes to table data, but is invalidated if the schema or object definitions change. Table row count, table size in bytes, min and max values, number of distinct values and null count are stored in the Metadata Cache.
Virtual Warehouse Cache
This cache is implemented in the compute layer and stores the micro-partitions that were used from the first query and leaves them on your warehouse for future queries. If the next query needs those same micro-partitions to complete its task, it’ll use the Warehouse Cache, rather than fetching the micro-partitions again.
The query will use all of the micro-partitions, which means neither the first or second query are actually pruning any micro-partitions in your table scan. It’s important to understand that the cache is made up of the entire micro-partitions that were fetched and not just the records that were selected in the first query. The Warehouse Cache, in this instance, can optimise the performance of the query
Caching Best Practices
The Result Set Cache and Metadata Cache are used for general query performance optimisation, while the Virtual Warehouse Cache is used for optimising performance within a specific virtual warehouse.
Each cache serves a different purpose and can be configured independently to optimise data retrieval based on your specific needs. By leveraging all three types of cache, Snowflake provides a comprehensive caching solution that can significantly improve query performance and reduce resource usage.
Caching in Snowflake Demo
USER1 runs query for the first time and the query returns results in 1m 27s.
USER1 runs query for the second time and query returns results in 155ms.
USER2 runs query for the first time and query returns results in 89ms.
The Query History below shows the results of the three queries:
Compared to the first query, the second and third queries took less time to execute. Therefore, retrieving results from the cache saves time.
The second and third queries did not use warehouse resources. Therefore, retrieving results from cache incurs no cost.
The first and second queries were executed by USER1 while the third query was executed by USER2. Therefore, results from the cache can be retrieved by any user.
Caching is an important feature of Snowflake that enhances query performance by reducing the time it takes to retrieve query results. With its three types of caches, Snowflake ensures that frequently executed queries are executed faster and with greater efficiency, resulting in a more streamlined and productive data processing environment. The use of caching in Snowflake is an excellent way to optimise the performance of data processing, providing businesses with a faster, more efficient and cost-effective way to access their data.