What is Hadoop?
Apache Hadoop is open-source software for storing and analysing massive amounts of structured and unstructured data–terabytes or more of everything from emails to sensor readings to server logs to Twitter feeds to GPS signals to just about anything else you can think of. Hadoop can process big, messy data sets for insights and answers–which helps explain all the buzz around it.
A brief history of Hadoop
Created in 2005 by Mike Cafarella and Doug Cutting (who named it after his son's toy elephant), Hadoop was originally intended for web-related search data. Today, it is an open-source, community-built project of the Apache Software Foundation that is used in all kinds of organisations and industries. Microsoft is an active contributor to the community development effort.
Microsoft has logged over 6,000 engineering hours in the last year, committing code and driving innovation in partnership with the open source community across a range of Hadoop projects. In addition, we have committers on Hadoop, and Microsoft employee Chris Douglas is the Apache Working Group Chair for Hadoop.
Built for big data, everyday servers
One reason for Hadoop's popularity is simple economics. Processing big data sets once required supercomputers and other pricey, specialised hardware. Hadoop makes reliable, scalable, distributed computing possible on industry-standard servers–allowing you to tackle petabytes of data and beyond on smaller budgets. Hadoop is also designed to scale from a single server to thousands of machines and detect and handle failures at the application layer for better reliability.
Insights from all kinds of data
By some estimates, as much as 80 percent of the data organisations deal with today is not the kind which comes neatly packaged in columns and rows. Instead, it is a messy avalanche of emails, social media feeds, satellite images, GPS signals, server logs and other unstructured, non-relational files. Hadoop can handle nearly any file or format–its other big advantage–so organisations can pose questions they never thought possible.
Why Hadoop in the cloud?
You can deploy Hadoop in a traditional on-site datacentre. Some companies–including Microsoft–also offer Hadoop as a cloud-based service. One obvious question is: why use Hadoop in the cloud? Here is why a growing number of organisations are choosing this option.
The cloud saves time and money
Open source does not mean free. Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs.
See how Virginia Tech is using Microsoft's cloud instead of spending millions of dollars to establish their own supercomputing center.
The cloud is flexible and scales fast
In the Microsoft Azure cloud, you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyse your data, then shut it down to stop the meter.
We quickly spun up the Azure HDInsight cluster and processed six years worth of data in just a few hours, and then we shut it down&ellipsis; processing the data in the cloud made it very affordable.
The cloud makes you nimble
Create a Hadoop cluster in minutes–and add nodes on-demand. The cloud offers organisations immediate time to value.
Meet HDInsight: Hadoop in the Azure cloud
Microsoft Azure HDInsight is a 100% Apache Hadoop-based service in the Azure cloud. It offers all the advantages of Hadoop, plus the ability to integrate with Excel, your on-premises Hadoop clusters and the Microsoft ecosystem of business software and services.