This blog post was co-authored by Dinesh Chandnani, Principal Group Engineering Manager, Microsoft.
Standing up a data pipeline for the first time can be a challenge, and decisions you make at the start of a project can limit your choices long after the initial deployment has rolled out. Often what is needed is a playground in which to learn about and evaluate the available options and capabilities in the solution space. To that end, we are excited to announce that an internal Microsoft project known as Data Accelerator is now being open sourced.
Data Accelerator started in 2017 as a large-scale data processing project in Microsoft’s Developer Division that eventually settled on streaming with Apache Spark for reasons of scale and speed. The pipeline today operates at Microsoft scale.
Some of the reasons we think it will have value to the wider community:
- Fast dev-test loop: Events can be sampled to support local execution of queries, short-circuiting the wait and delay of submitting your job to the cluster only for it to fail seven minutes later due to a misplaced semicolon.
- One-box deployment for local testing and discovery: Learn before you commit to a prototype.
- Designer-based rules and query building: Stand up an end-to-end ETL pipeline without writing any code, or dive right into the details.
- Time-windowing, reference data, and output capabilities added to SQL-Spark syntax: Keyword extensions to SQL-Spark syntax avoid the complexity and error-prone management of these common tasks.
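For a sense of what those keyword extensions simplify, here is what a tumbling-window aggregation looks like in plain Spark SQL using the built-in `window` function. This is standard Spark SQL, not Data Accelerator's extended syntax, and the table and column names (`events`, `eventTime`, `deviceId`) are hypothetical:

```sql
-- Count events per device in one-minute tumbling windows.
-- Standard Spark SQL; Data Accelerator's extensions aim to reduce
-- the repetition and error-proneness of patterns like this.
SELECT
  window(eventTime, '1 minute') AS win,
  deviceId,
  COUNT(*) AS eventCount
FROM events
GROUP BY window(eventTime, '1 minute'), deviceId
```

Note that the window expression must be repeated in both the `SELECT` and `GROUP BY` clauses, and joining in reference data or fanning results out to multiple sinks adds further boilerplate on top of this.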
The Developer Division of Microsoft uses Data Accelerator in production every day and will continue to improve the toolchain over time, but we recognize the toolset could do many more things given the need. We hope that by open sourcing this project, some of you will find Data Accelerator even more helpful.