Apache Hadoop is open-source software for storing and analyzing massive amounts of structured and unstructured data: terabytes or more of everything from emails to sensor readings to server logs to Twitter feeds to GPS signals to just about anything else you can think of. Hadoop can process big, messy data sets for insights and answers, which helps explain all the buzz around it.
Created in 2005 by Mike Cafarella and Doug Cutting (who named it after his son's toy elephant), Hadoop was originally intended for web-related search data. Today, it's an open-source, community-built project of the Apache Software Foundation that's used in all kinds of organizations and industries. Microsoft is an active contributor to the community development effort.
Microsoft has logged over 6,000 engineering hours in the last year, committing code and driving innovation in partnership with the open source community across a range of Hadoop projects. In addition, we have committers on Hadoop, and Microsoft employee Chris Douglas is the Apache Working Group Chair for Hadoop.
One reason for Hadoop's popularity is simple economics. Processing big data sets once required supercomputers and other pricey, specialized hardware. Hadoop makes reliable, scalable, distributed computing possible on industry-standard servers, allowing you to tackle petabytes of data and beyond on smaller budgets. Hadoop is also designed to scale from a single server to thousands of machines, and to detect and handle failures at the application layer for better reliability.
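Hadoop's core processing model, MapReduce, is what makes this divide-and-conquer approach work: data is split across nodes, mapped in parallel, shuffled by key, and reduced to a result. Here is a minimal single-process sketch in Python of that idea (the cluster, data splits, and failure handling are all simulated away, and the function names are our own, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

# Each "split" stands in for a block of data stored on a different node.
splits = ["big data big insights", "data beats hardware"]
pairs = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle(pairs)))
# → {'big': 2, 'data': 2, 'insights': 1, 'beats': 1, 'hardware': 1}
```

On a real cluster, the map and reduce steps run on many machines at once, which is how Hadoop turns cheap commodity servers into a single large computer.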
By some estimates, as much as 80 percent of the data organizations deal with today isn't the kind that comes neatly packaged in columns and rows. Instead, it's a messy avalanche of emails, social media feeds, satellite images, GPS signals, server logs, and other unstructured, non-relational files. Hadoop can handle nearly any file or format (its other big advantage), so organizations can pose questions they never thought possible.
You can deploy Hadoop in a traditional on-site datacenter. Some companies, including Microsoft, also offer Hadoop as a cloud-based service. One obvious question is: why use Hadoop in the cloud? Here's why a growing number of organizations are choosing this option.
Open source doesn't mean free. Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs.
See how Virginia Tech is using Microsoft's cloud instead of spending millions of dollars to establish its own supercomputing center.
In the Microsoft Azure cloud, you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter.
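That spin-up, analyze, shut-down cycle can even be scripted. The sketch below uses the Azure CLI; the resource names are placeholders and the exact flag spellings may vary by CLI version, so treat it as illustrative and check `az hdinsight create --help` for the current options:

```shell
# Create a Hadoop cluster (names and flags are illustrative placeholders).
az hdinsight create \
    --name mycluster \
    --resource-group mygroup \
    --type hadoop \
    --workernode-count 4 \
    --storage-account mystorageaccount

# ...run your jobs and analyze the data...

# Shut the cluster down to stop the meter.
az hdinsight delete --name mycluster --resource-group mygroup --yes
```

Because the data lives in the storage account rather than on the cluster itself, deleting the cluster stops the compute charges without losing anything.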
We quickly spun up the Azure HDInsight cluster and processed six years' worth of data in just a few hours, and then we shut it down… processing the data in the cloud made it very affordable.
Create a Hadoop cluster in minutes, and add nodes on demand. The cloud offers organizations immediate time to value.
Microsoft Azure HDInsight is a 100% Apache Hadoop-based service in the Azure cloud. It offers all the advantages of Hadoop, plus the ability to integrate with Excel, your on-premises Hadoop clusters, and the Microsoft ecosystem of business software and services.