Announcing new capabilities of HDInsight and DocumentDB at Strata | Azure Blog

As of May 10th 2017,

Azure Cosmos DB is Microsoft’s globally distributed multi-model database. Azure Cosmos DB was built from the ground up with global distribution and horizontal scale at its core. It offers turnkey global distribution across any number of Azure regions by transparently scaling and replicating your data wherever your users are. Elastically scale throughput and storage worldwide, and pay only for the throughput and storage you need. Azure Cosmos DB guarantees single-digit-millisecond latencies at the 99th percentile anywhere in the world, offers multiple well-defined consistency models to fine-tune performance, and guarantees high availability with multi-homing capabilities—all backed by industry leading service level agreements (SLAs).

Azure Cosmos DB is truly schema-agnostic; it automatically indexes all the data without requiring you to deal with schema and index management. It’s also multi-model, natively supporting document, key-value, graph, and column-family data models. With Azure Cosmos DB, you can access your data using APIs of your choice, as DocumentDB SQL (document), MongoDB (document), Azure Table Storage (key-value), and Gremlin (graph) are all natively supported.

This week in San Jose, Microsoft will be at Strata + Hadoop World, where we will be announcing new capabilities of Azure HDInsight, our fully managed OSS analytics platform for running all open-source analytics workloads at scale with enterprise-grade security and SLA, and Azure DocumentDB, our planet-scale, fully managed NoSQL database service. Our vision is to deeply integrate both services and make it seamless for developers to process massive amounts of data with low latency and at global scale.

DocumentDB announcements

DocumentDB is Microsoft’s globally distributed database service designed to enable developers to build planet-scale applications. DocumentDB allows you to elastically scale both throughput and storage across any number of geographical regions. The service offers guaranteed single-digit-millisecond latency at the 99th percentile, 99.99% high availability, predictable throughput, and multiple well-defined consistency models, all backed by comprehensive SLAs for latency, availability, throughput, and consistency. By virtue of its schema-agnostic and write-optimized database engine, DocumentDB, by default, is capable of automatically indexing all the data it ingests and serves across SQL, MongoDB, and JavaScript language-integrated queries in a scale-independent manner. As one of the foundational services of Azure, DocumentDB has been used virtually ubiquitously as a backend for first-party Microsoft services for many years. Since its general availability in 2015, DocumentDB has been one of the fastest growing services on Azure.

Real-time data science with Apache Spark and DocumentDB

At Strata, we are pleased to announce the Spark connector for DocumentDB. It enables real-time data science and exploration over globally distributed data in DocumentDB. Connecting Apache Spark to Azure DocumentDB accelerates our customers’ ability to solve fast-moving data science problems where data can be quickly persisted and retrieved using DocumentDB. The Spark to DocumentDB connector efficiently exploits the native DocumentDB managed indexes and enables updateable columns, push-down predicate filtering, and advanced analytics against fast-changing, globally distributed data in IoT, data science, and analytics scenarios. The Spark to DocumentDB connector uses the Azure DocumentDB Java SDK. Get started today and download the Spark connector from GitHub!
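
To make "push-down predicate filtering" concrete: rather than pulling every document into Spark and filtering there, a connector can translate the filter into a DocumentDB SQL query so that the service's indexes do the work. The following is a minimal Python sketch of that idea only. It is not the connector's implementation, and the helper name `build_pushdown_query` and the sample property names are hypothetical. It produces a parameterized query in the shape DocumentDB SQL accepts (a `query` string plus a `parameters` list).

```python
import re

# Hypothetical helper, for illustration only: turns simple equality
# filters into a parameterized DocumentDB SQL query specification.
_FIELD = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_pushdown_query(filters):
    """Return a query spec that filters on the server side.

    `filters` maps top-level property names to required values.
    """
    clauses = []
    parameters = []
    for field in sorted(filters):  # sorted for a deterministic result
        if not _FIELD.match(field):
            raise ValueError("unsupported field name: %r" % field)
        param = "@" + field
        clauses.append("c.%s = %s" % (field, param))
        parameters.append({"name": param, "value": filters[field]})

    query = "SELECT * FROM c"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return {"query": query, "parameters": parameters}


# Hypothetical property names and values.
spec = build_pushdown_query({"deviceType": "thermostat", "city": "Seattle"})
print(spec["query"])
```

Only the documents that match the predicate would then cross the network, which is the point of pushing the filter down to the database.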

General availability of high-fidelity, SLA backed MongoDB APIs for DocumentDB

DocumentDB is architected to natively support multiple data models, wire protocols, and APIs. Today we are announcing the general availability of DocumentDB’s API for MongoDB. With this, existing applications built on top of MongoDB can seamlessly target DocumentDB and continue to use their MongoDB client drivers and toolchain. This allows customers to easily move to DocumentDB and keep using the MongoDB APIs while gaining comprehensive enterprise-grade SLAs, turnkey global distribution, security, compliance, and a fully managed service.
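
In practice, retargeting an existing MongoDB application mostly comes down to changing its connection string. The sketch below builds one with the Python standard library. The account name, key, host, and port are placeholders (assumptions, not values from this post), so copy the real connection string from the Azure portal. The key is percent-encoded because account keys are base64 strings that can contain `/`, `+`, and `=`, which are not safe in the user-info part of a URI.

```python
from urllib.parse import quote_plus


def mongo_uri(account, key, host, port):
    """Build a MongoDB connection URI with TLS enabled.

    All four arguments are placeholders supplied by the caller.
    """
    return "mongodb://%s:%s@%s:%d/?ssl=true" % (
        quote_plus(account),
        quote_plus(key),  # escape '/', '+', '=' found in base64 keys
        host,
        port,
    )


# Hypothetical values, for illustration only.
uri = mongo_uri("myaccount", "abc/def+ghi==", "myaccount.example.com", 10250)
print(uri)
```

The resulting string can be handed unchanged to an existing driver, for example `pymongo.MongoClient(uri)`, with no other application changes implied by this sketch.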

HDInsight announcements

Cloud-first with Hortonworks Data Platform 2.6

Microsoft’s cloud-first strategy has already shown success with customers and analysts: Microsoft was recently named a leader in the Forrester Big Data Hadoop Cloud Solutions Wave and a Leader in the Gartner Magic Quadrant for Data Management Solutions for Analytics. Operating a fully managed cloud service like HDInsight, which is backed by an enterprise-grade SLA, enables customers to deploy the latest bits of Hadoop and Spark on demand. To that end, we are excited that the latest Hortonworks Data Platform 2.6 will be continuously available to HDInsight even before its on-premises release. Hortonworks’ commitment to being cloud-first is especially significant given the growing importance of cloud with Hadoop and Spark workloads.

“At Hortonworks we have seen more and more Hadoop related work loads and applications move to the cloud. Starting in HDP 2.6, we are adopting a “Cloud First” strategy in which our platform will be available on our cloud platforms – Azure HDInsight at the same time or even before it is available on traditional on-premises settings. With this in mind, we are very excited that Microsoft and Hortonworks will empower Azure HDInsight customers to be the first to benefit from our HDP 2.6 innovation in the near future.”
– Arun Murthy, co-founder, Hortonworks

Most secured Hadoop in a managed cloud offering

Last year at Strata + Hadoop World Conference in New York, we announced the highest levels of security for authentication, authorization, auditing, and encryption natively available in HDInsight for Hadoop workloads. Now, we are expanding our security capabilities across other workloads including Interactive Hive (powered by LLAP) and Apache Spark. This allows customers to use Apache Ranger over these popular workloads to provide a central policy and management portal to author and maintain fine-grained access control. In addition, customers can now analyze detailed audit records in the familiar Apache Ranger user interface.

New fully managed, SLA-backed Apache Spark 2.1 offering

With the latest release of Apache Spark for Azure HDInsight, we are providing the only fully managed, 99.9% SLA-backed Spark 2.1 cluster in the market. Additionally, we are introducing capabilities to support real-time streaming solutions by integrating Spark with Azure Event Hubs and by leveraging the structured streaming connector with Kafka for HDInsight. This will allow customers to use Spark to analyze millions of real-time events ingested into these Azure services, thus enabling IoT and other real-time scenarios. We made this possible through DirectStreaming support, which improves the performance and reliability of Spark streaming jobs as they process data from Event Hubs. The source code and binary distribution of this work are now publicly available on GitHub.
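
As a rough sketch of what reading from Kafka with Spark Structured Streaming can look like in PySpark, the functions below wire a Kafka topic into a streaming DataFrame. The broker addresses and topic name are placeholders, the helper names are hypothetical, and the Kafka source assumes the `spark-sql-kafka-0-10` package is available on the cluster. `read_events` expects an existing `SparkSession`, so nothing here connects to a cluster on its own.

```python
def kafka_source_options(brokers, topic):
    """Options understood by Spark's Kafka structured streaming source."""
    return {
        "kafka.bootstrap.servers": ",".join(brokers),
        "subscribe": topic,
    }


def read_events(spark, brokers, topic):
    """Return a streaming DataFrame of (key, value) strings from Kafka.

    `spark` is an existing SparkSession supplied by the caller.
    """
    reader = spark.readStream.format("kafka")
    for name, value in kafka_source_options(brokers, topic).items():
        reader = reader.option(name, value)
    # Kafka delivers keys and values as bytes; cast them to strings.
    return reader.load().selectExpr(
        "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"
    )


# Hypothetical broker hosts and topic name.
opts = kafka_source_options(["broker1:9092", "broker2:9092"], "telemetry")
print(opts["kafka.bootstrap.servers"])
```

On a cluster, the DataFrame returned by `read_events(spark, ...)` would typically be transformed and then started with `writeStream`; that part is omitted here because the sink depends on the scenario.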

New data science experiences with Zeppelin and ISV partnerships

Our goal is to make big data accessible for everybody. We have designed productivity experiences for different audiences, including the data engineer working on ETL jobs with Visual Studio, Eclipse, and IntelliJ support, the data scientist performing experimentation with Microsoft R Server and Jupyter notebook support, and the business analyst creating dashboards with Power BI, Tableau, SAP Lumira, and Qlik support. As part of HDInsight’s support for the latest Hortonworks Data Platform 2.6, Zeppelin notebooks, a popular workspace for data scientists, will support both Spark 2.1 and interactive Hive (LLAP). Additionally, we have added popular independent software vendors (ISVs) Dataiku and H2O.ai to our existing set of ISV applications that are available on the HDInsight platform. Through the unique design of HDInsight edge nodes, customers can spin up these data science solutions directly on HDInsight clusters, which are integrated and tuned out of the box, making it easier for customers to build intelligent applications.

Enabling Data Warehouse scenarios through Interactive Hive

Microsoft has been involved from the beginning in making Apache Hive run faster, with our contributions to Project Stinger and Tez that sped up Hive query performance by up to 100x. We announced support for Hive using LLAP (Live Long and Process) to speed up query performance by up to an additional 25x. With support for the newest version of Apache Hive, 2.1.1, customers can expect sub-second query performance, thus enabling data warehouse scenarios over all enterprise data without the need for data movement. Interactive Hive clusters also support popular BI tools, which is useful for business analysts who want to run their favorite tools directly on top of Hadoop.

Announcing SQL Server CTP 1.4

Microsoft is excited to announce that a new preview of the next version of SQL Server, Community Technology Preview (CTP) 1.4, will be available on both Windows and Linux in the coming days. This preview offers the ability to schedule jobs using SQL Server Agent in SQL Server v.Next on Linux. Once it is available, you can try the preview in your choice of development and test environments. For additional detail on CTP 1.4, please visit What’s New in SQL Server v.Next, Release Notes, and Linux documentation.

Earlier today, we also announced a new online event that will take place next month – Microsoft Data Amp. During the event, Scott Guthrie and Joseph Sirosh will share some exciting new announcements around investments we are making that put data front and center of application innovation and artificial intelligence. I encourage you to check out Mitra Azizirad’s blog post to learn more about Microsoft Data Amp and save the date for what’s going to be an amazing event.

This week the big data world is focused on Strata + Hadoop World in San Jose, a great event for the industry and community. We are committed to making the innovations in big data and NoSQL natively available, easily accessible, and highly productive as part of our Azure services.