Introduction to Hadoop in HDInsight: Big data processing and analysis in the cloud
Get an introduction to the Hadoop ecosystem in HDInsight - components, common terminology, and scenarios. Also, find out about tutorials and resources for using Hadoop in HDInsight.
Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data. The Hadoop core provides reliable data storage with the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model to process and analyze, in parallel, the data stored in this distributed system.
What is big data?
Big data refers to data being collected in ever-escalating volumes, at increasingly high velocities, and in a widening variety of unstructured formats and variable semantic contexts.
Big data describes any large body of digital information from the text in a Twitter feed, to the sensor information from industrial equipment, to information about customer browsing and purchases on an online catalog. Big data can be historical (meaning stored data) or real-time (meaning streamed directly from the source).
For big data to provide actionable intelligence or insight, the right questions must be asked and data relevant to the issues must be collected. The data must also be accessible, cleaned, analyzed, and then presented in a useful way. That's where Hadoop in HDInsight can help.
In this article
This article provides an overview of Hadoop on HDInsight, including:
Overview of the Hadoop ecosystem on HDInsight: HDInsight is the Hadoop solution on Azure and provides implementations of Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on. HDInsight also integrates with business intelligence (BI) tools such as Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
Advantages of Hadoop in the cloud: Reasons you should consider HDInsight's cloud implementation of Hadoop.
HDInsight solutions for big data analysis: Some practical ways you can use HDInsight to answer questions for your organization, from analyzing Twitter sentiment to analyzing HVAC system effectiveness.
Resources for learning more about big data analysis, Hadoop, and HDInsight: Links to additional information.
Overview of the Hadoop ecosystem on HDInsight
Apache Hadoop is a rapidly expanding technology stack that is widely used for big data analysis. HDInsight is the Microsoft Azure cloud implementation of Hadoop.
Azure HDInsight deploys and provisions Hadoop clusters in the cloud, using either Linux or Windows as the underlying OS.
- HDInsight on Linux (Preview) - A Hadoop cluster on Ubuntu. Use this if you are familiar with Linux or Unix, migrating from an existing Linux-based Hadoop solution, or want easy integration with Hadoop ecosystem components built for Linux.
- HDInsight on Windows - A Hadoop cluster on Windows Server. Use this if you are familiar with Windows, are migrating from an existing Windows-based Hadoop solution, or want to integrate with .NET or other Windows capabilities.
The following table presents a comparison between the two:
|Category |HDInsight on Linux |HDInsight on Windows |
|---|---|---|
|Cluster OS |Ubuntu 12.04 LTS |Windows Server 2012 R2 |
|Cluster Type |Hadoop |Hadoop, HBase, Storm |
|Deployment |Azure Management Portal, Cross-platform command line, PowerShell |Azure Management Portal, Cross-platform command line, PowerShell, HDInsight .NET SDK |
|Cluster UI |Ambari |Cluster Dashboard |
|Remote Access |SSH |RDP |
HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.
Apache Hadoop is a software framework for big data management and analysis. HDInsight provides several configurations for specific workloads, or you can customize clusters using Script Actions.
- Hadoop for HDInsight - Apache Hadoop core provides reliable data storage with the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model to process and analyze data in parallel.
- HBase for HDInsight - HBase is an Apache open source NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data.
- Storm for HDInsight - Storm is a distributed, fault-tolerant, open source computation system that allows you to process data in real time.
In addition to the previous overall configurations, the following individual components are also included on HDInsight clusters.
Ambari: Cluster provisioning, management, and monitoring
Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment
Hive: SQL-like querying
Mahout: Machine learning
MapReduce and YARN: Distributed processing and resource management
Oozie: Workflow management
Pig: Simpler scripting for MapReduce transformations
Sqoop: Data import and export
ZooKeeper: Coordinates processes in distributed systems
Apache Ambari is for provisioning, managing and monitoring Apache Hadoop clusters. It includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters. See Manage HDInsight clusters using Ambari (Linux only), Monitor Hadoop clusters in HDInsight using the Ambari API, and Apache Ambari API reference.
Avro (Microsoft .NET Library for Avro)
The Microsoft .NET Library for Avro implements the Apache Avro compact binary data interchange format for serialization in the Microsoft .NET environment. It uses JSON to define a language-agnostic schema that underpins language interoperability, meaning data serialized in one language can be read in another. Detailed information on the format can be found in the Apache Avro Specification. The format of Avro files supports the distributed MapReduce programming model. Files are "splittable", meaning you can seek to any point in a file and start reading from a particular block. To find out how, see Serialize data with the Microsoft .NET Library for Avro.
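As a rough illustration of a JSON-defined schema, the sketch below declares an Avro-style record schema and checks a record against it. It uses plain Python and the standard `json` module rather than an Avro library. The record name, namespace, field names, and the `conforms` helper are hypothetical and chosen only for this example.

```python
import json

# Hypothetical Avro record schema; the names below are assumptions for illustration.
SENSOR_SCHEMA_JSON = """
{
  "type": "record",
  "name": "SensorReading",
  "namespace": "example.hvac",
  "fields": [
    {"name": "building_id", "type": "int"},
    {"name": "target_temp", "type": "double"},
    {"name": "system_model", "type": "string"}
  ]
}
"""

# Python types accepted for the Avro primitive types used in the schema above.
PRIMITIVES = {"int": int, "double": float, "string": str}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if the record has exactly the schema's fields with matching types.

    This is a simplified check, not the validation an Avro library performs.
    """
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        isinstance(record[field["name"]], PRIMITIVES[field["type"]])
        for field in fields
    )


if __name__ == "__main__":
    schema = json.loads(SENSOR_SCHEMA_JSON)
    reading = {"building_id": 4, "target_temp": 70.0, "system_model": "M5"}
    print(conforms(reading, schema))
```

Because the schema is ordinary JSON, any language with a JSON parser can read the same definition, which is what makes the format language-agnostic.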
Apache HBase is a non-relational database built on Hadoop and designed for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. HBase clusters on HDInsight are configured to store data directly in Azure Blob storage, with low latency and increased elasticity. See Overview of HBase on HDInsight.
Hadoop Distributed File System (HDFS) is a distributed file system that, with MapReduce and YARN, is the core of the Hadoop ecosystem. HDFS is the standard file system for Hadoop clusters on HDInsight.
Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage using a SQL-like language called HiveQL. Hive, like Pig, is an abstraction on top of MapReduce; when run, Hive translates queries into a series of MapReduce jobs. Hive is conceptually closer to a relational database management system than Pig, and is therefore appropriate for use with more structured data. For unstructured data, Pig is a better choice. See Use Hive with Hadoop in HDInsight.
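To give a feel for how a SQL-like aggregation can map onto MapReduce-style phases, here is a conceptual sketch in plain Python. It does not reproduce Hive's actual query plan. The `hvac` table, its columns, and the sample rows are assumptions made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical rows of (building_id, temp_diff); table and columns are assumptions.
ROWS: List[Tuple[int, float]] = [(1, 2.0), (2, 5.0), (1, 4.0), (2, 7.0), (3, 1.0)]

# Conceptual HiveQL being modeled:
#   SELECT building_id, AVG(temp_diff) FROM hvac GROUP BY building_id;


def avg_by_key(rows: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    # Map: emit the GROUP BY column as the key and the aggregated column as the value.
    mapped = ((building_id, temp_diff) for building_id, temp_diff in rows)

    # Shuffle: collect every value that shares a key.
    grouped: Dict[int, List[float]] = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce: apply the aggregate function (AVG) to each key's values.
    return {key: sum(values) / len(values) for key, values in grouped.items()}


if __name__ == "__main__":
    print(avg_by_key(ROWS))
```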
Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior. See Generate movie recommendations using Mahout on Hadoop.
MapReduce and YARN
Hadoop MapReduce is a software framework for writing applications to process big data sets in parallel. A MapReduce job splits large data sets and organizes the data into key-value pairs for processing.
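The key-value flow described above can be sketched with a word count, simulated here in a single Python process. This is not Hadoop's API. The function names are hypothetical, and a real job would distribute the map and reduce steps across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) key-value pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between the two phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Each line stands in for an input split that a separate mapper would handle.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data in the cloud", "big data on Hadoop"]
    print(word_count(sample))
```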
Apache YARN is the next generation of MapReduce (MapReduce 2.0, or MRv2) that splits the two major tasks of JobTracker - resource management and job scheduling/monitoring - into separate entities.
For more information on MapReduce, see MapReduce in the Hadoop Wiki. To learn about YARN, see Apache Hadoop NextGen MapReduce (YARN).
Apache Oozie is a workflow coordination system that manages Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts. See Use a time-based Oozie Coordinator with Hadoop in HDInsight.
Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on very large data sets using a simple scripting language called Pig Latin. Pig translates the Pig Latin scripts so they’ll run within Hadoop. You can create User Defined Functions (UDFs) to extend Pig Latin. See Use Pig with Hadoop to analyze an Apache log file.
Apache Sqoop is a tool that transfers bulk data between Hadoop and relational databases such as SQL, or other structured data stores, as efficiently as possible. See Use Sqoop with Hadoop.
Apache Storm is a distributed, real-time computation system for quickly processing large streams of data. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
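The difference between stream processing and batch processing is that each event is handled as it arrives, without waiting for a complete data set. The sketch below illustrates that idea with a Python generator that keeps a running per-sensor average. It is a single-process illustration, not Storm's API, and the event shape and function name are assumptions.

```python
from typing import Dict, Iterable, Iterator, Tuple


def running_average(events: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Yield an updated per-sensor average as each (sensor_id, value) event arrives."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for sensor_id, value in events:
        totals[sensor_id] = totals.get(sensor_id, 0.0) + value
        counts[sensor_id] = counts.get(sensor_id, 0) + 1
        # Emit a result immediately instead of waiting for the stream to end.
        yield sensor_id, totals[sensor_id] / counts[sensor_id]


if __name__ == "__main__":
    stream = [("a", 10.0), ("b", 4.0), ("a", 20.0)]
    for sensor_id, average in running_average(stream):
        print(sensor_id, average)
```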
Apache ZooKeeper coordinates processes in large distributed systems by means of a shared hierarchical namespace of data registers (znodes). Znodes contain small amounts of meta information needed to coordinate processes: status, location, configuration, and so on.
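A hierarchical namespace of small data registers can be pictured as a tree of paths, each holding a few bytes. The toy in-memory model below is only meant to illustrate that structure. It is not the ZooKeeper client API, and the `ZnodeTree` class, its methods, and the example paths are hypothetical.

```python
from typing import Dict, List


class ZnodeTree:
    """Toy in-memory model of a hierarchical namespace of small data registers."""

    def __init__(self) -> None:
        # The root node always exists and holds no data.
        self._nodes: Dict[str, bytes] = {"/": b""}

    def create(self, path: str, data: bytes = b"") -> None:
        """Create a node; its parent must already exist."""
        if path in self._nodes:
            raise ValueError(f"node already exists: {path}")
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"parent does not exist: {parent}")
        self._nodes[path] = data

    def get(self, path: str) -> bytes:
        """Return the data stored at a node."""
        return self._nodes[path]

    def children(self, path: str) -> List[str]:
        """Return the names of the direct children of a node, sorted."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            node[len(prefix):]
            for node in self._nodes
            if node != "/" and node.startswith(prefix) and "/" not in node[len(prefix):]
        )


if __name__ == "__main__":
    tree = ZnodeTree()
    tree.create("/workers")
    tree.create("/workers/w1", b"status=idle")
    tree.create("/workers/w2", b"status=busy")
    print(tree.children("/workers"), tree.get("/workers/w1"))
```

Each process in a distributed system could read or update its own register under a shared path, such as the hypothetical `/workers` node here, which is how status, location, or configuration values are shared.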
Advantages of Hadoop in the cloud
As part of the Azure cloud ecosystem, Hadoop in HDInsight offers a number of benefits, among them:
To read more about the advantages of Hadoop in HDInsight, see the Azure features page for HDInsight.
Resources for learning more about big data analysis, Hadoop, and HDInsight
HDInsight on Linux (Preview)
HDInsight on Windows
Apache Hadoop: Learn more about the Apache Hadoop software library, a framework that allows for the distributed processing of large data sets across clusters of computers.
HDFS: Learn more about the architecture and design of the Hadoop Distributed File System (HDFS), the primary storage system used by Hadoop applications.
MapReduce: Learn more about the programming framework for writing Hadoop applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
SQL Database on Azure
Microsoft business intelligence (for HDInsight on Windows)
Familiar business intelligence (BI) tools - such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services - retrieve, analyze, and report data integrated with HDInsight using either the Power Query add-in or the Microsoft Hive ODBC Driver.
These BI tools can help in your big data analysis:
Try HDInsight solutions for big data analysis (for HDInsight on Windows)
Analyze data from your organization to gain insights into your business. Here are some examples:
Analyze HVAC sensor data: Learn how to analyze sensor data using Hive with HDInsight (Hadoop), and then visualize the data in Microsoft Excel. In this sample, you'll use Hive to process historical data produced by HVAC systems to see which systems can't reliably maintain a set temperature.
Use Hive with HDInsight to analyze website logs: Learn how to use HiveQL in HDInsight to analyze website logs to get insight into the frequency of visits in a day from external websites, and a summary of website errors that the users experience.
Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop): Learn how to build a solution that uses a Storm cluster in HDInsight to process sensor data from Azure Event Hubs, and then displays the processed sensor data as near-real-time information on a web-based dashboard.
To try Hadoop on HDInsight, see "Get started" articles in the Explore section on the HDInsight documentation page. To try more advanced examples, scroll down to the Analyze section.