Oftentimes you want to join an incoming event stream, such as device or sensor measurements, with slowly changing “reference data,” such as device profile or customer profile information, as part of your stream analytics jobs. This enables you to create enhanced reports from the insights generated by the streaming job. This post and the accompanying sample show you how to leverage Azure Data Factory to pull reference data from a variety of data stores, refresh it on a schedule, and provide it as input to your Stream Analytics job.
Working with reference data in Azure Stream Analytics
Stream Analytics supports taking reference data stored in Azure blob storage as one of the “inputs” for the job. To enable refreshing of reference data, you specify a list of blobs in the input configuration by using the {date} and {time} tokens inside the path pattern. The job loads the corresponding blob based on the date and time encoded in the blob names, using the UTC time zone.
For example, if the job has a reference input configured in the portal with a path pattern such as /referencedata/{date}/{time}/customertable.csv, where the date format is “YYYY/MM/DD” and the time format is “HH/mm”, then the job will pick up the file named /referencedata/2015/07/26/08/30/customertable.csv at 8:30 AM UTC on July 26, 2015.
This requires you to address the following two challenges:
- If your reference data is in a data store other than Azure blob storage, you need to move it to Azure blob storage.
- While reference data changes relatively infrequently, it does change. You want a regular refresh schedule so that the updated reference data is picked up and dropped into Azure blob storage with the right path and date/time information.
Refreshing reference data from a variety of data stores with Azure Data Factory
Azure Data Factory is the perfect solution for the above-mentioned challenges. Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. It supports connecting to a large number of cloud-based and on-premises data stores and moving data easily on whatever regular schedule you specify.
Let's take a look at an example …
The Stream Analytics Get Started guide shows a scenario for a telecommunications company where call record data is processed in a streaming fashion at scale and analyzed for SIM card fraud (multiple calls coming from the same identity around the same time but in geographically different locations). The Stream Analytics job for this scenario takes one input: the streaming call record data coming in through Event Hubs. Now, suppose we want to add another input, reference data with information about the customers (a customerInfo table), such as their name and contact information. This allows us to add a join against the customerInfo table in the streaming query that detects fraudulent calls, so we can identify which customers are being affected by the fraud. Also suppose the customerInfo table is maintained in an Azure SQL database and can be updated multiple times during the day as new customers are added, contact information changes, and so on.
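To make the idea concrete, here is a minimal sketch of what such a query could look like. The streaming input alias CallStream and the fields CallRecTime, CallingIMSI, and SwitchNum come from the Get Started scenario; the reference input alias customerInfo and its CustomerName, ContactNumber, and IMSI columns are assumptions for illustration, and the actual query in the sample may differ:

```sql
-- Flag calls from the same identity on different switches within 5 seconds,
-- then join to the customerInfo reference input to surface the affected customer.
SELECT
    System.Timestamp AS WindowEnd,
    CS1.CallingIMSI,
    CI.CustomerName,                    -- assumed reference data column
    CI.ContactNumber,                   -- assumed reference data column
    COUNT(*) AS FraudulentCalls
FROM CallStream CS1 TIMESTAMP BY CallRecTime
JOIN CallStream CS2 TIMESTAMP BY CallRecTime
    ON CS1.CallingIMSI = CS2.CallingIMSI
    AND DATEDIFF(ss, CS1, CS2) BETWEEN 1 AND 5
JOIN customerInfo CI                    -- reference data joins need no time bounds
    ON CS1.CallingIMSI = CI.IMSI
WHERE CS1.SwitchNum != CS2.SwitchNum
GROUP BY TumblingWindow(second, 1), CS1.CallingIMSI, CI.CustomerName, CI.ContactNumber
```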
The diagram below shows the high-level solution architecture, leveraging Azure Data Factory and Stream Analytics together to run the above-mentioned query with reference data and to set up the refresh of that reference data on a schedule.
As illustrated above, you can create a Data Factory pipeline with a copy activity that copies the latest version of the customer table from Azure SQL Database to blob storage in the corresponding path, based on date and time information. The Azure Stream Analytics job is configured to take the customer table as a reference data input and always picks up the latest copy of the reference data as it becomes available.
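For a rough sense of the copy step, the copy activity's Azure SQL source could run a query like the following on each scheduled slice; the table and column names here are assumptions for illustration rather than the sample's actual schema:

```sql
-- Hypothetical source query for the Data Factory copy activity; the result set
-- is written as a CSV blob under the date/time-partitioned path, for example
-- /referencedata/2015/07/26/08/30/customertable.csv.
SELECT CustomerId, CustomerName, ContactNumber, IMSI
FROM dbo.CustomerInfo;
```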
For more details on setting up the above sample and step-by-step instructions on how to set up a data factory to copy reference data, please refer to the reference data refresh for Azure Stream Analytics jobs sample on GitHub.