An introduction to the Hadoop ecosystem on Azure HDInsight
This article provides an introduction to Hadoop on Azure HDInsight, its ecosystem, and big data. Learn about the Hadoop components, common terminology, and scenarios for big data analysis.
Hadoop refers to an ecosystem of open-source software that is a framework for distributed processing, storing, and analysis of big data sets on clusters of computers. Azure HDInsight makes the Hadoop components from the Hortonworks Data Platform (HDP) distribution available in the cloud, and deploys and provisions managed clusters with high reliability and availability.
Apache Hadoop was the original open-source project for big data processing. Following was the development of related software and utilities considered part of the Hadoop technology stack, including Apache Hive, Apache HBase, Apache Spark, and many others. See Overview of the Hadoop ecosystem in HDInsight for details.
Big data describes any large body of digital information, from the text in a Twitter feed, to the sensor information from industrial equipment, to information about customer browsing and purchases on a website. Big data can be historical (meaning stored data) or real-time (meaning streamed directly from the source). Big data is being collected in ever-escalating volumes, at increasingly higher velocities, and in an expanding variety formats.
For big data to provide actionable intelligence or insight, you must collect relevant data and ask the right questions. You must also make sure the data is accessible, cleaned, analyzed, and then presented in a useful way. That's where big data analysis on Hadoop in HDInsight can help.
HDInsight is a cloud distribution on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack for big data analysis. It includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari, and so on. HDInsight also integrates with business intelligence (BI) tools such as Power BI, Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
HDInsight provides cluster configurations for Apache Hadoop, Spark, HBase, or Storm. Or, you can customize clusters with script actions.
Apache Spark: A parallel processing framework that supports in-memory processing to boost the performance of big-data analysis applications, Spark works for SQL, streaming data, and machine learning. See Overview: What is Apache Spark in HDInsight?
HBase: A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. See Overview of HBase on HDInsight.
Apache Storm: A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
Script Actions are scripts that run during cluster provisioning, and can be used to install additional components on the cluster. For Linux-based clusters, these are Bash scripts.
The following example scripts are provided by the HDInsight team:
Hue: A set of web applications used to interact with a cluster. Linux clusters only.
Giraph: Graph processing to model relationships between things or people.
R: An open-source language and environment for statistical computing used in machine learning.
Solr: An enterprise-scale search platform that allows full-text search on data.
For information on developing your own Script Actions, see Script Action development with HDInsight.
The following components and utilities are included on HDInsight clusters.
Ambari: Cluster provisioning, management, monitoring, and utilities.
Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
Hive & HCatalog: Structured Query Language (SQL)-like querying, and a table and storage management layer.
Mahout: For scalable machine learning applications.
Oozie: Workflow management.
Phoenix: Relational database layer over HBase.
Pig: Simpler scripting for MapReduce transformations.
Sqoop: Data import and export.
Tez: Allows data-intensive processes to run efficiently at scale.
YARN: Part of the Hadoop core library and next generation of the MapReduce software framework.
ZooKeeper: Coordination of processes in distributed systems.
For information on the specific components and version information, see Hadoop components, versioning, and service offerings in HDInsight
Apache Ambari is for provisioning, managing and monitoring Apache Hadoop clusters. It includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters. Linux-based HDInsight clusters provide both the Ambari web UI and the Ambari REST API, while Windows-based clusters provide a subset of the REST API. Ambari Views on HDInsight clusters allow plug-in UI capabilities.
Hadoop Distributed File System (HDFS) is a distributed file system that, with MapReduce and YARN, is the core of the Hadoop ecosystem. HDFS is the standard file system for Hadoop clusters on HDInsight.
Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage by using a SQL-like language called HiveQL. Hive, like Pig, is an abstraction on top of MapReduce. When run, Hive translates queries into a series of MapReduce jobs. Hive is conceptually closer to a relational database management system than Pig, and is therefore appropriate for use with more structured data. For unstructured data, Pig is the better choice. See Use Hive with Hadoop in HDInsight.
Apache HCatalog is a table and storage management layer for Hadoop that presents users with a relational view of data. In HCatalog, you can read and write files in any format for which a Hive SerDe (serializer-deserializer) can be written.
Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior. See Generate movie recommendations using Mahout on Hadoop.
MapReduce is the legacy software framework for Hadoop for writing applications to batch process big data sets in parallel. A MapReduce job splits large datasets and organizes the data into key-value pairs for processing.
YARN is the Hadoop next-generation resource manager and application framework, and is referred to as MapReduce 2.0. MapReduce jobs run on YARN.
For more information on MapReduce, see MapReduce in the Hadoop Wiki.
Apache Oozie is a workflow coordination system that manages Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts. See Use Oozie with Hadoop.
Apache Phoenix is a relational database layer over HBase. Phoenix includes a JDBC driver that allows users to query and manage SQL tables directly. Phoenix translates queries and other statements into native NoSQL API calls - instead of using MapReduce - thus enabling faster applications on top of NoSQL stores. See Use Apache Phoenix and SQuirreL with HBase clusters.
Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on very large datasets by using a simple scripting language called Pig Latin. Pig translates the Pig Latin scripts so they’ll run within Hadoop. You can create User-Defined Functions (UDFs) to extend Pig Latin. See Use Pig with Hadoop.
Apache Tez is an application framework built on Hadoop YARN that executes complex, acyclic graphs of general data processing. It's a more flexible and powerful successor to the MapReduce framework that allows data-intensive processes, such as Hive, to run more efficiently at scale. See "Use Apache Tez for improved performance" in Use Hive and HiveQL.
Apache YARN is the next generation of MapReduce (MapReduce 2.0, or MRv2) and supports data processing scenarios beyond MapReduce batch processing with greater scalability and real-time processing. YARN provides resource management and a distributed application framework. MapReduce jobs run on YARN.
To learn about YARN, see Apache Hadoop NextGen MapReduce (YARN).
Apache ZooKeeper coordinates processes in large distributed systems by means of a shared hierarchical namespace of data registers (znodes). Znodes contain small amounts of meta information needed to coordinate processes: status, location, configuration, and so on.
HDInsight clusters--Hadoop, HBase, Storm, and Spark clusters--support a number of programming languages, but some aren't installed by default. For libraries, modules, or packages not installed by default, use a script action to install the component. See Script action development with HDInsight.
By default, HDInsight clusters support:
Additional languages can be installed using script actions: Script action development with HDInsight.
Many languages other than Java can be run using a Java virtual machine (JVM); however, running some of these languages may require additional components installed on the cluster.
These JVM-based languages are supported on HDInsight clusters:
Jython (Python for Java)
HDInsight clusters support the following languages that are specific to the Hadoop ecosystem:
Pig Latin for Pig jobs
HiveQL for Hive jobs and SparkSQL
As part of the Azure cloud ecosystem, Hadoop in HDInsight offers a number of benefits, among them:
Automatic provisioning of Hadoop clusters. HDInsight clusters are much easier to create than manually configuring Hadoop clusters. For details, see Provision Hadoop clusters in HDInsight.
State-of-the-art Hadoop components. For details, see Hadoop components, versioning, and service offerings in HDInsight.
High availability and reliability of clusters. See Availability and reliability of Hadoop clusters in HDInsight for details.
Efficient and economical data storage with Azure Blob storage, a Hadoop-compatible option. See Use Azure Blob storage with Hadoop in HDInsight for details.
Additional VM sizes and types for running HDInsight clusters. See Hadoop components, versioning, and service offerings in HDInsight for details.
Cluster scaling. Cluster scaling enables you to change the number of nodes of a running HDInsight cluster without having to delete or re-create it.
Virtual Network support. HDInsight clusters can be used with Azure Virtual Network to support isolation of cloud resources or hybrid scenarios that link cloud resources with those in your datacenter.
To read more about the advantages on Hadoop in HDInsight, see the Azure features page for HDInsight.
HDInsight provides big data cloud offerings in two categories, Standard and Premium. HDInsight Standard provides an enterprise-scale cluster that organizations can use to run their big data workloads. HDInsight Premium builds on that and provides advanced analytical and security capabilities for an HDInsight cluster. For more information, see Azure HDInsight Premium
Build on this introduction to Hadoop in the cloud and big data analysis with the resources below.
HDInsight documentation: The documentation page for Hadoop on Azure HDInsight with links to articles, videos, and more resources.
Get started with Hadoop in HDInsight: A quick-start tutorial for provisioning HDInsight Hadoop clusters and running sample Hive queries.
Get started with Spark in HDInsight: A quick-start tutorial for creating a Spark cluster and running interactive Spark SQL queries.
Use R Server on HDInsight: Start using R Server in HDInsight Premium.
Provision HDInsight clusters: Learn how to provision an HDInsight Hadoop cluster through the Azure portal, Azure CLI, or Azure PowerShell.
Apache Hadoop: Learn more about the Apache Hadoop software library, a framework that allows for the distributed processing of large datasets across clusters of computers.
HDFS: Learn more about the architecture and design of the Hadoop Distributed File System, the primary storage system used by Hadoop applications.
MapReduce Tutorial: Learn more about the programming framework for writing Hadoop applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
Familiar business intelligence (BI) tools - such as Excel, PowerPivot, SQL Server Analysis Services, and SQL Server Reporting Services - retrieve, analyze, and report data integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver.
These BI tools can help in your big-data analysis:
Connect Excel to Hadoop with Power Query: Learn how to connect Excel to the Azure Storage account that stores the data associated with your HDInsight cluster by using Microsoft Power Query for Excel. Windows workstation required. Works with Windows- or Linux-based cluster.
Connect Excel to Hadoop with the Microsoft Hive ODBC Driver: Learn how to import data from HDInsight with the Microsoft Hive ODBC Driver. Windows workstation required. Works with Windows- or Linux-based cluster.
Microsoft Cloud Platform: Learn about Power BI for Office 365, download the SQL Server trial, and set up SharePoint Server 2013 and SQL Server BI.