Get an introduction to Hadoop, its ecosystem, and big data in Azure HDInsight: What is Hadoop in HDInsight and what are the Hadoop components, common terminology, and scenarios for big data analysis? Also, learn about Hadoop tutorials, documentation, and resources for using Hadoop in HDInsight.
Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution. Hadoop often refers to the entire Hadoop ecosystem of components, which includes Storm and HBase clusters, as well as other technologies under the Hadoop umbrella. See Overview of the Hadoop ecosystem on HDInsight below for details.
Big data refers to data being collected in ever-escalating volumes, at increasingly high velocities, and in a widening variety of unstructured formats and variable semantic contexts.
Big data describes any large body of digital information, from the text in a Twitter feed, to the sensor information from industrial equipment, to information about customer browsing and purchases on an online catalog. Big data can be historical (meaning stored data) or real-time (meaning streamed directly from the source).
For big data to provide actionable intelligence or insight, it is not enough to ask the right questions and collect data relevant to the issues. The data must also be accessible, cleaned, analyzed, and then presented in a useful way. That's where big data analysis on Hadoop in HDInsight can help.
HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack that is the go-to solution for big data analysis. It includes implementations of Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on. HDInsight also integrates with business intelligence (BI) tools such as Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
Azure HDInsight deploys and provisions Hadoop clusters in the cloud, by using either Linux or Windows as the underlying OS.
HDInsight on Linux - A Hadoop cluster on Ubuntu. Use this if you are familiar with Linux or Unix, are migrating from an existing Linux-based Hadoop solution, or want easy integration with Hadoop ecosystem components built for Linux.
HDInsight on Windows - A Hadoop cluster on Windows Server. Use this if you are familiar with Windows, are migrating from an existing Windows-based Hadoop solution, or want to use .NET or other Windows-only technologies on the cluster.
The following table compares the two:
| Category | Hadoop on Linux | Hadoop on Windows |
| --- | --- | --- |
| Cluster OS | Ubuntu 12.04 Long Term Support (LTS) | Windows Server 2012 R2 |
| Cluster Type | Hadoop, HBase, Storm | Hadoop, HBase, Storm |
| Deployment | Azure preview portal, Azure CLI, Azure PowerShell | Azure portal, Azure preview portal, Azure CLI, Azure PowerShell, HDInsight .NET SDK |
| Cluster UI | Ambari | Cluster Dashboard |
| Remote Access | Secure Shell (SSH), REST API, ODBC, JDBC | Remote Desktop Protocol (RDP), REST API, ODBC, JDBC |
HDInsight provides cluster configurations for Hadoop, HBase, or Storm. You can also customize clusters with script actions.
HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. See Overview of HBase on HDInsight.
Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
Script Actions are scripts that run during cluster provisioning and can be used to install additional components on the cluster. For Windows-based HDInsight clusters, these are PowerShell scripts. For Linux-based clusters, these are Bash scripts.
The following are example scripts provided by the HDInsight team:
For information on developing your own Script Actions, see Script Action development with HDInsight.
In addition to the previous overall configurations, the following individual components are also included on HDInsight clusters.
Ambari: Cluster provisioning, management, and monitoring.
Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
Hive & HCatalog: Structured Query Language (SQL)-like querying, and a table and storage management layer.
Mahout: Machine learning.
MapReduce and YARN: Distributed processing and resource management.
Oozie: Workflow management.
Phoenix: Relational database layer over HBase.
Pig: Simpler scripting for MapReduce transformations.
Sqoop: Data import and export.
Tez: Allows data-intensive processes to run efficiently at scale.
ZooKeeper: Coordination of processes in distributed systems.
For information on the specific components and version information, see What's new in the Hadoop cluster versions provided by HDInsight?
Apache Ambari is for provisioning, managing and monitoring Apache Hadoop clusters. It includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters. Linux-based HDInsight clusters provide both the Ambari web UI and the Ambari REST API, while Windows-based clusters provide a subset of the REST API.
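As a rough illustration of calling the Ambari REST API from a client, the following Python sketch builds (but does not send) an authenticated request that lists the clusters Ambari knows about. The endpoint shape `https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters`, the use of HTTP basic authentication, and the function name `build_ambari_request` are assumptions made for this example; check them against your own cluster before relying on them.

```python
import base64
import urllib.request


def build_ambari_request(cluster_name, user, password):
    """Build, without sending, a GET request that lists clusters in Ambari."""
    # Assumption: Linux-based clusters expose Ambari under this URL shape.
    url = f"https://{cluster_name}.azurehdinsight.net/api/v1/clusters"

    # Assumption: the cluster accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Hypothetical cluster name and credentials, for illustration only.
    req = build_ambari_request("mycluster", "admin", "secret")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before any network call is made.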
Hadoop Distributed File System (HDFS) is a distributed file system that, with MapReduce and YARN, is the core of the Hadoop ecosystem. HDFS is the standard file system for Hadoop clusters on HDInsight.
Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage by using a SQL-like language called HiveQL. Hive, like Pig, is an abstraction on top of MapReduce. When run, Hive translates queries into a series of MapReduce jobs. Hive is conceptually closer to a relational database management system than Pig, and is therefore appropriate for use with more structured data. For unstructured data, Pig is the better choice. See Use Hive with Hadoop in HDInsight.
Apache HCatalog is a table and storage management layer for Hadoop that presents users with a relational view of data. In HCatalog, you can read and write files in any format for which a Hive SerDe (serializer-deserializer) can be written.
Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior. See Generate movie recommendations using Mahout on Hadoop.
Hadoop MapReduce is a software framework for writing applications to process big-data sets in parallel. A MapReduce job splits large datasets and organizes the data into key-value pairs for processing.
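To make the key-value model concrete, here is a minimal single-process sketch of a word count in Python. It imitates the map, shuffle, and reduce stages in memory and does not use the Hadoop API; the function names are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line stands in for one input split handled by a separate mapper.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop on HDInsight"]))
```

On a real cluster the mappers and reducers run in parallel on different nodes; the sketch only shows how the data is reshaped at each stage.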
Apache YARN is the next generation of MapReduce (MapReduce 2.0, or MRv2) that splits the two major tasks of JobTracker - resource management and job scheduling/monitoring - into separate entities.
Apache Oozie is a workflow coordination system that manages Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts. See Use a time-based Oozie Coordinator with Hadoop.
Apache Phoenix is a relational database layer over HBase. Phoenix includes a JDBC driver that allows users to query and manage SQL tables directly. Phoenix translates queries and other statements into native NoSQL API calls - instead of using MapReduce - thus enabling faster applications on top of NoSQL stores. See Use Apache Phoenix and SQuirreL with HBase clusters.
Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on very large datasets by using a simple scripting language called Pig Latin. Pig translates the Pig Latin scripts so they’ll run within Hadoop. You can create User Defined Functions (UDFs) to extend Pig Latin. See Use Pig with Hadoop to analyze an Apache log file.
Apache Tez is an application framework built on Hadoop YARN that executes complex, acyclic graphs of general data processing. It's a more flexible and powerful successor to the MapReduce framework that allows data-intensive processes, such as Hive, to run more efficiently at scale. See "Use Apache Tez for improved performance" in Use Hive and HiveQL.
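The idea of running work as an acyclic graph can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not the Tez API: the task names and the `execution_order` helper are hypothetical, and the example only shows how dependencies between processing steps determine a valid run order.

```python
from graphlib import TopologicalSorter


def execution_order(dag):
    """Return one valid run order for a mapping of task -> prerequisite tasks."""
    return list(TopologicalSorter(dag).static_order())


# Hypothetical query plan: two scans feed a join, which feeds an aggregation.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}

if __name__ == "__main__":
    print(execution_order(plan))
```

The two scans have no dependency on each other, so an engine is free to run them in parallel; only the join and the aggregation have to wait for their inputs.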
Apache ZooKeeper coordinates processes in large distributed systems by means of a shared hierarchical namespace of data registers (znodes). Znodes contain small amounts of meta information needed to coordinate processes: status, location, configuration, and so on.
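The shape of a hierarchical namespace of znodes can be pictured with a small in-memory toy in Python. This is not a ZooKeeper client and offers none of its coordination guarantees; the class `ZNodeTree` and its methods are hypothetical names used only to show how paths, parents, and small data payloads relate.

```python
class ZNodeTree:
    """Toy in-memory stand-in for a hierarchical namespace of znodes."""

    def __init__(self):
        # The root znode always exists and holds no data.
        self._data = {"/": b""}

    def create(self, path, data=b""):
        """Create a znode; its parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent znode {parent!r} does not exist")
        if path in self._data:
            raise ValueError(f"znode {path!r} already exists")
        self._data[path] = data

    def get(self, path):
        """Return the small payload stored at a znode."""
        return self._data[path]

    def children(self, path):
        """Return the names of the direct children of a znode."""
        prefix = path.rstrip("/") + "/"
        names = []
        for candidate in self._data:
            if candidate != "/" and candidate.startswith(prefix):
                name = candidate[len(prefix):]
                if "/" not in name:
                    names.append(name)
        return sorted(names)


if __name__ == "__main__":
    tree = ZNodeTree()
    tree.create("/workers")
    tree.create("/workers/node1", b"status=up")
    print(tree.children("/workers"), tree.get("/workers/node1"))
```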
As part of the Azure cloud ecosystem, Hadoop in HDInsight offers a number of benefits, among them:
Automatic provisioning of Hadoop clusters. HDInsight clusters are much easier to create than manually configuring Hadoop clusters. For details, see Provision Hadoop clusters in HDInsight.
State-of-the-art Hadoop components. For details, see What's new in the Hadoop cluster versions provided by HDInsight?.
High availability and reliability of clusters. See Availability and reliability of Hadoop clusters in HDInsight for details.
Efficient and economical data storage with Azure Blob storage, a Hadoop-compatible option. See Use Azure Blob storage with Hadoop in HDInsight for details.
To read more about the advantages of Hadoop in HDInsight, see the Azure features page for HDInsight.
Build on this introduction to Hadoop on HDInsight and big data analysis with the resources below.
HDInsight documentation: The documentation page for Azure HDInsight with links to articles, videos, and more resources.
Get started with HDInsight on Linux: A quick-start tutorial for provisioning HDInsight Hadoop clusters on Linux and running sample Hive queries.
Get started with Linux-based Storm on HDInsight: A quick-start tutorial for provisioning a Storm on HDInsight cluster and running sample Storm topologies.
Provision HDInsight on Linux: Learn how to provision an HDInsight Hadoop cluster on Linux through the Azure Portal, Azure CLI, or Azure PowerShell.
Working with HDInsight on Linux: Get some quick tips on working with Hadoop Linux clusters provisioned on Azure.
Manage HDInsight clusters using Ambari: Learn how to monitor and manage your Linux-based Hadoop on HDInsight cluster by using Ambari Web, or the Ambari REST API.
Learning map for HDInsight: A guided tour of Hadoop documentation for HDInsight.
Get started with Azure HDInsight: A quick-start tutorial for using Hadoop in HDInsight.
Run the HDInsight samples: A tutorial on how to run the samples that ship with HDInsight.
Azure HDInsight SDK: Reference documentation for the HDInsight SDK.
Apache Hadoop: Learn more about the Apache Hadoop software library, a framework that allows for the distributed processing of large datasets across clusters of computers.
HDFS: Learn more about the architecture and design of the Hadoop Distributed File System, the primary storage system used by Hadoop applications.
MapReduce Tutorial: Learn more about the programming framework for writing Hadoop applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
Azure SQL Database: MSDN documentation for SQL Database.
Management Portal for SQL Database: A lightweight and easy-to-use database management tool for managing SQL Database in the cloud.
Adventure Works for SQL Database: Download page for a SQL Database sample database.
Familiar business intelligence (BI) tools - such as Excel, PowerPivot, SQL Server Analysis Services, and SQL Server Reporting Services - retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver.
These BI tools can help in your big-data analysis:
Connect Excel to Hadoop with Power Query: Learn how to connect Excel to the Azure Storage account that stores the data associated with your HDInsight cluster by using Microsoft Power Query for Excel.
Connect Excel to Hadoop with the Microsoft Hive ODBC Driver: Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver.
Microsoft Cloud Platform: Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL Server BI.
Use big data analysis on your organization's data to gain insights into your business. Here are some examples:
Analyze HVAC sensor data: Learn how to analyze sensor data by using Hive with HDInsight (Hadoop), and then visualize the data in Microsoft Excel. In this sample, you'll use Hive to process historical data produced by HVAC systems to see which systems can't reliably maintain a set temperature.
Use Hive with HDInsight to analyze website logs: Learn how to use HiveQL in HDInsight to analyze website logs to get insight into the frequency of visits in a day from external websites, and a summary of website errors that the users experience.
Analyze sensor data in real-time with Storm and HBase in HDInsight (Hadoop): Learn how to build a solution that uses a Storm cluster in HDInsight to process sensor data from Azure Event Hubs, and then displays the processed sensor data as near-real-time information on a web-based dashboard.