Azure operates one of the largest public big data cluster services on the planet. Every day thousands of customers build and operate mission-critical big data analytics, business intelligence (BI), and machine learning (ML) solutions using Azure HDInsight. Come explore architectural best practices, recommended patterns, and tips and tricks for building successful production systems using Apache Spark, Hive, Kafka, and HBase. We will also showcase customer examples and reference architectures.
Democratizing data empowers customers by enabling more and more users to gain value from data through self-service analytics. Processing raw data for building apps and gaining deeper insights is one of the critical tasks when building your modern data warehouse architecture. In this session, we will show you how to build data pipelines with Spark and your favorite .NET programming language (C#, F#) using both Azure HDInsight and Azure Databricks, and connect them to Azure SQL Data Warehouse for reporting and consumption.
Kafka is one of the most widely used open source projects in large enterprises and a de facto messaging bus. In this session, we’ll walk through a Kafka use case with the Lambda Architecture for streaming and downstream processing of messages at scale. We’ll share what we’ve learned working with customers and how they use Kafka on Azure to solve their business problems.
Maxim Lukiyanov and Scott Hanselman discuss the intricate ways in which Apache Spark jobs can fail in production, and how new diagnostics tools, now available in Azure HDInsight, visualize these problems intuitively and help you discover and understand them at first glance. For more information: Spark Debugging and Diagnostics Toolset for Azure HDInsight (blog post); Apache Spark for Azure HDInsight overview; Create a free account (Azure).
Dhruv Goel and Scott Hanselman discuss why enterprise customers trust Apache Kafka on Azure HDInsight with their streaming ingestion needs. Get even more control over the security of your data at rest with Bring-Your-Own-Key encryption for Kafka. With Azure HDInsight, you get the best of open source and the security and reliability of a managed platform. For more information: Bring your own key for Apache Kafka on Azure HDInsight (Preview); Azure HDInsight - Hadoop, Spark, and Kafka Service; Azure HDInsight pricing; Create a free account (Azure).
Dhruv Goel and Scott Hanselman discuss why enterprise customers trust Apache Kafka on Azure HDInsight with their streaming ingestion needs. Integrate Kafka with Azure Active Directory for authentication and set up fine-grained access control with Apache Ranger to let multiple users access Kafka easily and securely. With Azure HDInsight, you get the best of open source on a managed platform. For more information, see: Tutorial: Configure Kafka policies in HDInsight with Enterprise Security Package (Preview); Azure HDInsight - Hadoop, Spark, and Kafka Service; Azure HDInsight pricing; Create a free account (Azure).
Kafka on Azure HDInsight is an enterprise-grade streaming ingestion service that allows you to quickly and easily set up, use, scale, and monitor your Kafka clusters in the cloud. Kafka provides a fault-tolerant, distributed publish-subscribe model to enable real-time solutions such as Internet of Things (IoT), fraud detection, clickstream analysis, financial alerts, and social analytics.
Before you can have Big Data, you must collect the data. There are two popular ways to do this: with batches and with live streams. Apache Kafka has changed the way we look at streaming and logging data, and Azure now provides tools and services for streaming data into your Big Data pipeline. This session will outline the different services in the Big Data streaming ecosystem in Azure, including Kafka on HDInsight and Event Hubs, how they work together, and when to use each one. We will also talk briefly about when traditional ETL tools are the better choice.
In this session, you will learn how technologies such as Low Latency Analytical Processing (LLAP) and Hive 2.x make it possible to analyze petabytes of data with sub-second latency in common file formats such as CSV and JSON, without converting to columnar file formats like ORC or Parquet. We will go deep into LLAP’s performance and architecture benefits and how it compares with Spark and Presto. We will also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI to run interactive queries over their data lake without moving data outside it.
H2O’s AI platform provides an open source machine learning framework that works with sparklyr and PySpark. H2O’s Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. With Sparkling Water, users can drive computation from Scala, R, or Python and use the H2O Flow UI, providing an ideal machine learning platform for application developers. H2O's open AutoML also fully automates the process of training ML algorithms, tuning their parameters, and building ensemble models. Setting up an environment to perform advanced analytics on top of big data is hard, but with H2O Sparkling Water for HDInsight, customers can get started with just a few clicks. This solution installs Sparkling Water on an HDInsight Spark cluster so you can take advantage of the benefits of both Spark and H2O. The solution can access data from Azure Blob storage and/or Azure Data Lake Store in addition to all the standard data sources that H2O supports. It also provides Jupyter Notebooks with built-in examples for an easy jumpstart, and a user-friendly H2O Flow UI to monitor and debug applications.