10 min read
Imagine you are challenged with the following task: Design a cloud service capable of (1) accepting hundreds of billions of records on a daily basis, (2) storing this data reliably for weeks or months, (3) answering complex analytics queries on the data, (4) maintaining a low latency (seconds) of delay from data ingestion to query, and finally (5) completing those queries in seconds even when the data is a combination of structured, semi-structured, and free text?
This is the task we undertook when we started developing the Azure Data Explorer cloud service under the codename “Kusto”. The initial core team consisted of four developers working on the Microsoft Power BI service. For our own troubleshooting needs we wanted to run ad-hoc queries on the massive telemetry data stream produced by our service. Finding no suitable solution, we decided to create one.
As it turned out, we weren’t the only people in Microsoft who needed this kind of technology. Within a few months of work, we had our first internal customers, and adoption of our service started its steady climb.
Nearly five years later, our brainchild is now in public preview. You can watch Scott Guthrie’s keynote, and read more about what we’re unveiling in Azure Data Explorer announcement blog. In this blog post we describe the very basics of the technology behind Azure Data Explorer. More details will be available in an upcoming technology white paper.
What is Azure Data Explorer?
Azure Data Explorer is a cloud service that ingests structured, semi-structured, and unstructured data. The service then stores this data and answers analytic ad-hoc queries on it with seconds of latency. One common use is for ingesting and querying massive telemetry data streams. For example, the Azure SQL Database team uses the service to troubleshoot its service, run monitoring queries, and find service anomalies. This serves as the basis for taking auto-remediation actions. Azure Data Explorer is also used for storing and querying the Microsoft Office Client telemetry data stream, giving Microsoft Office engineers the ability to analyze how users interact with the individual Microsoft Office suite of applications. Another example depicts how Azure Monitor uses Azure Data Explorer to store and query all log data. Therefore, if you have ever written an Azure Monitor query, or browsed through your Activity Logs, then you are already a user of our service.
Users working with Azure Data Explorer see their data organized in a traditional relational data model. Data is organized in tables, and all data records of the table are of a strongly-typed schema. The table schema is an ordered list of columns, each column having a name and a scalar data type. Scalar data types can be structured (e.g. int, real, datetime, or timespan), semi-structured (dynamic), or free text (string). The dynamic type is similar to JSON – it can hold a single value of other scalar types, an array, or a dictionary of such values. Tables are contained in databases, and a single deployment (a cluster of nodes) may host multiple databases.
To illustrate the power of the service, below are some numbers from the database utilized by the team to hold all the telemetry data from the service itself. The largest table of this database accepts approximately 200 billion records per day (about 1.6 PB of raw data in total), and the data for that table is retained for troubleshooting purposes for 14 days.
The query I used to count these 200 billion records took about 1.2 seconds to complete:
KustoLogs | where Timestamp > ago(1d) | count
While executing this query, the service also sent new logs to itself (to the very same
KustoLogs table). Shown below is the query to retrieve all of those logs according to the correlation ID, here forced to use the term index on the
ClientActivityId column through the use of the has operator, simulating a typical troubleshooting point query.
KustoLogs | where Timestamp > ago(1d) | where ClientActivityId has "4c8fcbab-6ad9-491d-8799-9176fabaf93e"
This query took about 1.1 seconds to complete, faster than the previous query, even though much more data is returned. This is due to the fact that two indexes are used in conjunction – one on the Timestamp column and another on the
ClientActivityId (string) column.
The heart of the storage/query engine is a unique combination of three highly successful technologies: column store, text indexing, and data sharding. Storing data in a sharded column store makes it possible to store huge data sets, as data arranged in column order compresses better than data stored in row order. Query performance is also improved, as sharding allows one to utilize all available compute resources, and arranging data in columns allows the system to avoid loading data in columns that are not required by the particular query. The text index, and other index types, make it possible to efficiently skip entire batches of records when queries are predicated on the table’s raw data.
Fundamentally, data is stored in Azure Blob, with each data shard composed of one or more blobs. Once created through the ingestion process, a data shard is immutable. All its storage artifacts are kept the same without change, until the data shard itself is deleted. This has a number of important implications:
- It allows multiple Compute nodes in the cluster to cache the data shard, without complex change management coordination between them.
- It allows multiple Compute clusters to refer to the same data shard.
- It adds robustness to the system, as there’s no complex code to “surgically modify” parts of existing storage artifacts.
- It allows “travel back in time” to a previous snapshot as long as the storage artifacts of the data shard are not hard-deleted.
Azure Data Explorer uses its own proprietary format for the data shards storage artifacts, custom-built for the technology. For example, the format is built so that storage artifacts can be memory-mapped by the process querying them, and allows for data management operations that are unique to our technology, including index-only merge of data shards. There is no need to transform the data prior to querying.
Indexing at line speed
The ability to index free-text columns and dynamic (JSON-like) columns at line speed is one of the things that sets our technology apart from many other databases built on column store principles. Indeed, building up an inverted text index (Bloom filters are used for low-cardinality indexes, but are rarely useful for free-text fields) is a complex task in Compute resources (hash table often exceeds the CPU cache size) and Storage resources (the size of the inverted index itself is considerable).
Azure Data Explorer has a unique inverted index design. In the default case, all string and dynamic (JSON-like) columns are indexed. If the cardinality of the column is high, meaning that the number of unique values of the column approaches the number of records, then the engine defaults to creating an inverted term index with two “twists”. The index is kept at the shard level so multiple data shards can be ingested in parallel by multiple Compute nodes, and is low granularity so instead of holding per-record hit/miss information for each term, we only keep this information per block of about 1,000 records. A low granularity index is still efficient in skipping rarely occurring terms, such as correlation IDs, and is small enough so it’s more efficient to generate and load. Of course, if the index indicates a hit, the block of records must still be scanned to determine which of the individual records matches the predicate, but in most cases this combination results in faster (potentially much faster) performance.
Having low granularity, and therefore small, indexes also makes it possible to continuously optimize how data shards are stored in the background. Data shards that are small are merged together as a background activity, improving compression and indexing. For example, because the data they contain comes in continuously and we want to keep query latency small. Beyond a certain size, the storage artifacts holding the data itself stop getting merged, and the engine just merges the indexes, which are usually small enough so that merging them results in improved query performance.
Data in columns is compressed by standard compression algorithms. By default, the engine uses LZ4 to compress data, as this algorithm has an excellent performance and reasonable compression ratio. In fact, we estimate that this compression is virtually always to be preferred over keeping the data uncompressed, simply because the saving on moving the data into the CPU cache is worth the CPU resources to decompress it! Additional compression algorithms are supported, such as LZMA and Brotli, but most customers just use the default.
The engine always holds the data compressed, including when it is loaded into the RAM cache.
One interesting trade-off is to avoid performing “vertical compression”, used, for example, by Microsoft SQL Server Analysis Server Tabular Models. This column store optimization looks for a few ways to sort the data before finally compressing and storing it, often resulting in better compression ratios and therefore improved data load and query times. This optimization is avoided by Azure Data Explorer as it has a high CPU cost, and we want to make data available for query quickly. The service does enable customers to indicate the preferred sort order of data for cases in which there is a dominant query pattern, and we might make vertical compression a future background activity as an optimization.
Alongside the data, Azure Data Explorer also maintains the metadata that describes the data, such as:
- The schema of each table in the database
- Various policy objects that are used during data ingestion, query, and background grooming activities
- Security policies
Metadata is stored according to same principles as data storage – in immutable Azure Blob storage artifacts. The only blob which is not immutable is the “HEAD” pointer blob, which indicates which storage artifacts are relevant for the latest metadata snapshot. This model has all the advantages noted above due to immutability.
One of the early decisions taken by the designers of Azure was to ensure there’s isolation between the three fundamental core services: Compute, Storage, and Networking. Azure Data Explorer strictly adheres to this principle – all the persistent data is kept in Azure Blob Storage, and the data kept in Compute can be thought of as “merely” a cache of the data in Azure Blob. This has several important advantages:
- Independent scale-out. We can independently scale-out Compute (for example, if a cluster’s CPU load grows due to more queries running concurrently) vs. Storage (for example, if the number of storage transactions per second grows to a point one needs additional Storage resources).
- Resiliency to failures. In cases of failures, we can simply create a new Compute cluster and switch over traffic from the old Compute cluster without a complex data migration process.
- The ability to scale-up Compute. Applying a similar procedure to the above, with the new cluster being of a higher Compute SKU than the older cluster.
- Multiple Compute clusters using the same data. We can even have multiple clusters that use the same data, so that customers can, for example, run different workloads on different clusters with total isolation between them. One cluster acts as the “leader”, and is given permission to write to Storage, while all others act as “followers” and run in read-only mode for that data.
- Better SKU fitness. This is closely related to scale-out. The Compute nodes used by the service can be tailored to the workload requirements precisely because we let Azure Storage handle durable storage with SKUs that are more appropriate for storage.
Last, but not least, is that we’re relying on Azure Storage for doing what it does best – store data reliably through data replication. This means that very little coordination work needs to happen between service nodes, simplifying the service considerably. Essentially, just metadata writes need to be coordinated.
Compute data caching
While Azure Data Explorer is careful to isolate Compute and Storage, it makes full use of the local volatile SSD storage as a cache – in fact, the engine has a sophisticated multi-hierarchy data cache system to make sure that the most relevant data is cached as “closely” as possible to the CPU. This system critically depends on the data shard storage artifacts being immutable, and consists of the following tiers:
- Azure Blob Storage – persistent, durable, and reliable storage
- Azure Compute SSD (or Managed Disks) – volatile storage
- Azure Compute RAM – volatile storage
An interesting aspect of the cache system is that is works completely with compressed data. This means that the data is held compressed even when in RAM, and only decompressed when needed for an actual query. This makes optimal use of the limited/costly cache resources.
Distributed data query
The distributed data query technology behind Azure Data Explorer is strongly impacted by the scenario the service is built to excel in – ad-hoc analytics over massive amounts of unstructured data. For example:
- The service treats all temporary data produced by the query as volatile, held in the cluster’s aggregated RAM. Temporary results are not written to disk. This includes data that is in-transit between nodes in the cluster.
- The service has a rather short default for query timeouts (about four minutes). The user can ask to increase this timeout per query, but the assumption here is that queries should complete fast.
- The service queries provide snapshot isolation by having all relevant data shards “stamped” on the query plan. Since data shards are immutable, all it takes is for the query plan to reference the combination of data shards. Additionally, since queries are subject to timeout (four minutes by default, can be increased up to one hour), it’s sufficient to guarantee that data shards “linger” for one hour following a delete, during which they are no longer available for new queries.
- Perhaps most notable of all: The service implements a new query language, optimized for both ease of use and expressiveness. Our users tell us it is (finally!) a pleasure to author and read queries expressed in this syntax. The language’s computation model is similar to SQL in that it is built primarily for a relational data model, but the syntax itself is modeled after data flow languages, such as Unix pipeline of commands.
In fact, we regard the query language as a major step forward, and the toolset built around it as one of the most important aspects of the service that propelled its adoption. You can find more information about the query language. You can also take an online PluralSight course.
One interesting feature of the engine’s distributed query layer is that it natively supports cross-cluster queries, with optimizer support to re-arrange the query plan so that as much of the query is “remoted” to the other cluster as needed to reduce the amount of data exchanged between the two (or more) clusters.
In this post, we’ve touched on the very basics of the technology behind Azure Data Explorer. We will continue to share out more about the service in the coming weeks.
To find out more about Azure Data Explorer you can: