Co-authored by Nishant Thacker.
Yesterday, we announced that Azure Data Lake Analytics and Azure Data Lake Store are in public preview. To help you use the Azure Data Lake as productively as possible, we will publish a six-part series on different aspects of the Azure Data Lake over the next few days. This is the second post in the series, giving you the details of U-SQL, the new language we introduced for Azure Data Lake Analytics.
U-SQL is a language that unifies the benefits of SQL with the expressive power of your own code. U-SQL’s scalable distributed query capability allows you to efficiently analyze data in the store and across relational stores such as Azure SQL Database. This blog post outlines the basics of U-SQL. The next post will go into more detail on how to develop using U-SQL.
As you approach U-SQL for the first time, you will notice it's a language you’ll be comfortable with from Day One. The syntax is based on T-SQL, while it uses C# types by default. This allows you to easily conceptualize how data will be processed while writing queries, and doesn’t scare you with new frameworks or concepts. Essentially, it abstracts the deeper concepts of parallelism and distributed processing so you don’t need to worry about them while writing your queries. You don’t need special programming skills or months of training to be productive, just a good understanding of SQL and some knowledge of C#.
U-SQL allows you to process any type of data. From analyzing BotNet attack patterns and security logs to extracting features from images or videos for machine learning, the language enables you to work with any data.
U-SQL integrates custom code seamlessly to allow you to express your complex, often proprietary business algorithms. Use cases such as processing different file types or handling encryption may require custom processing that is not easily expressed in standard query languages. That custom code can range from user-defined functions to custom input and output formats. This is something that U-SQL excels at.
Finally, U-SQL was developed to scale efficiently to any size of data without requiring you to focus on scale-out topologies, plumbing code or the limitations of a specific distributed infrastructure. There is no restriction on the total data size or on the individual units of data that can be processed, and U-SQL automatically scales to use the available resources. This lets developers concentrate on the business logic to be implemented rather than on the infrastructure that needs to be set up to process their queries on massive amounts of data.
Let’s take a sneak peek at what U-SQL looks like. A typical U-SQL query would be something like the one below:
@Result =
SELECT country, city, COUNT(*) AS NumberOfDrivers
FROM @Drivers
GROUP BY country, city
ORDER BY NumberOfDrivers DESC, country, city
FETCH FIRST 10 ROWS;
As can be seen, U-SQL uses a syntax any DBA would recognize, with the typical SELECT, FROM and GROUP BY clauses. Since U-SQL is meant to work on massive data, it also provides the FETCH clause, which here limits the ordered result to the first 10 rows so the developer can preview the query results. Here’s how the rowset @Drivers is extracted:
@Drivers =
EXTRACT driver_id int
, name string
, street string
, city string
, region string
, zipcode string
, country string
, phone_numbers string
FROM @INPUT_DRIVERS
USING Extractors.Text(delimiter : '\t', quoting : true, encoding : Encoding.Unicode);
As we see, it extracts a set of fields from a file using a text extractor. The built-in extractor can be configured for different delimiters, quoting and encodings, and you can write your own extractors for other data formats.
Once you have the result set from the first query, you can use Outputters to save it in the format of your choice:
OUTPUT @Result
TO @OUTPUT
USING Outputters.Csv(quoting : true);
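The statements above refer to @INPUT_DRIVERS and @OUTPUT without showing where they come from. A minimal sketch, assuming they are string variables holding file paths in the store, would place declarations like these at the top of the script. Both paths are hypothetical placeholders, not taken from this post:

```usql
// Hypothetical input and output locations; replace with paths in your own store.
DECLARE @INPUT_DRIVERS string = "/Samples/Data/AmbulanceData/Drivers.txt";
DECLARE @OUTPUT string = "/output/TopDriverCities.csv";
```

In a complete script, the EXTRACT statement that defines @Drivers would then come before the SELECT that reads from it, with the OUTPUT statement last.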
This is probably the simplest example of U-SQL. However, U-SQL includes many more capabilities like the following:
- Operating over sets of files with patterns
- Using (Partitioned) Tables
- Federated Queries against Azure SQL DB
- Encapsulating your U-SQL code with Views, Table-Valued Functions and Procedures
- SQL Windowing Functions
- Programming with C# User-defined Operators (custom extractors, processors)
- Complex Types (MAP, ARRAY)
- Using U-SQL in data processing pipelines
- U-SQL in a lambda architecture for IoT analytics
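To give a feel for two of these capabilities, here is a small sketch that reads a set of files matched by a path pattern and uses inline C# expressions in the SELECT clause. The folder layout, the column list, the `;` separator inside phone_numbers and the output path are all assumptions made for illustration, so treat this as a starting point rather than a reference script:

```usql
// File set: {date:...} in the path is filled from the folder names and
// surfaced as the virtual column "date". The folder layout is hypothetical.
@DriversByDay =
    EXTRACT driver_id int,
            name string,
            phone_numbers string,
            date DateTime
    FROM "/input/drivers/{date:yyyy}/{date:MM}/{date:dd}/drivers.tsv"
    USING Extractors.Tsv();

// Inline C#: string methods and the conditional operator are ordinary C# expressions.
// Assumes multiple phone numbers are separated by ';' (an illustrative choice).
@Summary =
    SELECT driver_id,
           name.ToUpperInvariant() AS name_upper,
           (string.IsNullOrEmpty(phone_numbers) ? 0 : phone_numbers.Split(';').Length) AS phone_count
    FROM @DriversByDay
    WHERE date >= new DateTime(2015, 10, 1);

OUTPUT @Summary
TO "/output/DriverSummary.csv"
USING Outputters.Csv();
```

Filtering on the virtual date column should let the system skip files outside the requested range instead of reading every file that matches the pattern.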
To learn more about U-SQL, watch the video below and stay tuned for the next blog post.
Where can I get more information?
- Read the announcement post for more details.
- Check out the Visual Studio U-SQL post to learn more about the new big data language.
- Visit the Azure.com Data Lake solution page.
- Watch a video about U-SQL.
- Watch the Azure Data Lake Video Series.
Documentation and How-To's
- U-SQL
- Azure Data Lake Analytics
- Overview of Azure Data Lake Analytics
- Getting started with Azure Data Lake Analytics in the portal
- Getting started with Azure Data Lake Analytics with PowerShell
- Getting started with Azure Data Lake Analytics and the tooling
- Getting started with Azure Data Lake Analytics with the SDK
- Managing Azure Data Lake Analytics with the portal
- Managing Azure Data Lake Analytics with PowerShell
- Interactive tutorials on Azure Data Lake Analytics
- Analyzing web logs with Azure Data Lake Analytics
- Azure Data Lake Store
- Overview of Azure Data Lake Store
- Getting started with Azure Data Lake Store from the Portal
- Getting started with Azure Data Lake Store from PowerShell
- Getting started with Azure Data Lake with .NET SDK
- Securing data with Azure Data Lake Store
- Connecting Azure HDInsight with Azure Data Lake Store
- Connecting other OSS applications with Azure Data Lake Store
- WebHDFS APIs with Azure Data Lake Store