The data processing landscape is more diverse than ever: data is processed across geographic locations, on-premises and in the cloud, and it spans a growing variety of types and volumes. Developers are left writing large amounts of custom logic to deliver an information production system that can manage and coordinate all of this data and processing.
We are excited to release the preview of our new Data Factory service, a managed service for composing data storage, processing, and movement services into managed data production pipelines. You can get started using Data Factory today. With a few clicks in the Azure portal, or a few command-line operations, you can create a new data factory and link it to data and processing resources. This preview release includes access to on-premises data in SQL Server and to cloud data in the Azure Blob, Table, and Database services. Additional sources will be added based on your feedback during the preview stage. Access to on-premises data is provided through a data management gateway, which allows for easy configuration and management of connections to your on-premises SQL Servers.
Data processing is enabled initially through Hive, Pig, and custom C# activities. Such activities can be used to clean data, mask data fields, and transform data in a wide variety of complex ways. The Hive and Pig activities can run on an HDInsight cluster you create, or you can allow Data Factory to fully manage the Hadoop cluster lifecycle on your behalf. Author your activities, combine them into a pipeline, set an execution schedule, and you’re done, with no Hadoop cluster setup or management required. Data Factory also provides an up-to-the-moment monitoring dashboard, so you can deploy your data pipelines and immediately begin to view them there.
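To make the idea of combining activities into a pipeline more concrete, here is a minimal conceptual sketch in plain Python. It is not the Data Factory API or its authoring format: the Activity and Pipeline classes, the clean and mask_email functions, and the sample records are all hypothetical names chosen for illustration. It only shows the general pattern of chaining a cleaning step and a masking step so that each activity consumes the output of the previous one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Activity:
    """Hypothetical processing step: takes a list of records, returns a new list."""
    name: str
    run: Callable[[List[Record]], List[Record]]


@dataclass
class Pipeline:
    """Hypothetical pipeline: runs its activities in order, chaining their outputs."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, records: List[Record]) -> List[Record]:
        for activity in self.activities:
            records = activity.run(records)
        return records


def clean(records: List[Record]) -> List[Record]:
    # Trim whitespace from every field and drop records with a blank email.
    return [
        {key: value.strip() for key, value in record.items()}
        for record in records
        if record.get("email", "").strip()
    ]


def mask_email(records: List[Record]) -> List[Record]:
    # Keep only the first character of the local part of each email address.
    masked = []
    for record in records:
        local, _, domain = record["email"].partition("@")
        masked.append({**record, "email": local[:1] + "***@" + domain})
    return masked


# Illustrative data only; not taken from any real source.
sample = [
    {"name": " Ada ", "email": "ada@example.com"},
    {"name": "Bob", "email": "  "},
]

pipeline = Pipeline(
    "customer-prep",
    [Activity("clean", clean), Activity("mask", mask_email)],
)
result = pipeline.execute(sample)
print(result)
```

In the service itself, the equivalent steps would be Hive, Pig, or custom C# activities running on a schedule. The sketch leaves out scheduling and cluster management entirely, since those are the parts Data Factory is described as handling on your behalf.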
Once you have created and deployed pipelines to your Data Factory, you can quickly assess end-to-end data pipeline health, pinpoint issues, and take corrective action as needed. Within the Azure Preview Portal, you get a visual layout of all your pipelines and their data inputs and outputs. You can see the relationships and dependencies of your data pipelines across all your sources, so you always know where data is coming from and where it is going. You also get a historical accounting of job execution, data production status, and system health in a single monitoring dashboard.
Finally, use data pipelines to automatically deliver your transformed data from the cloud to on-premises SQL Server databases, or keep it in your cloud storage sources for consumption by BI and analytics tools and applications.
To get started …
We’re excited to finally have Data Factory out for you to try, and we look forward to hearing your feedback. Give it a try and let us know what you’d like to see added to or changed in Data Factory by submitting your thoughts and ideas.