40 TiB of advanced in-memory analytics with Azure and ActivePivot

Published March 21, 2017

Senior Product Manager, Azure Big Compute

In-memory computing has accelerated big compute capabilities, enabling customers to extend their experience beyond Monte Carlo simulation and into analytics. This is notable within financial services, where business users wish to move away from pre-canned reports and instead interact directly with the data. With Azure, banks can analyze data in real time, make the right decisions intraday, and be better equipped to meet regulatory standards. This blog explores the new possibilities of a scale-out architecture for in-memory analytics in Azure through ActivePivot. ActivePivot is part of the ActiveViam platform, which brings big compute and big data closer together.

ActivePivot is an in-memory database that aggregates large amounts of fast-moving data through incremental, transactional, and analytical processing to enable customers to make the right decisions in a short amount of time. ActivePivot computes sophisticated metrics on data that is updated on the fly without the need for any pre-aggregation and allows the customer to explore metrics across hundreds of dimensions, analyze live data at its most granular level, and perform what-if simulations at unparalleled speed.

For customers to enable this on-premises, purchasing servers with enough memory can be expensive, so such hardware is often reserved for mission-critical workloads. The public cloud, however, opens this capability to more workloads, such as research and experimentation, which makes taking this scenario to Azure compelling. Using Azure Blob storage to collect and store the historical datasets generated over time lets the customer pay for compute only when a user requires it. Going from scratch to fully deployed in less than 30 minutes drastically reduces the total cost of ownership and provides enormous business agility.

For our testing, we processed 400 days of historical data to show how 40 TiB can be loaded onto a 128-node cluster in 15 minutes and 200 billion records queried in less than 10 seconds. For this we used G5 instances, each with 32 cores and 448 GiB of RAM, running a Linux image with Java and ActivePivot, plus 40 storage accounts each holding 10 days of data, roughly 1 TiB apiece.
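As a back-of-the-envelope check on these figures, the sketch below uses only numbers quoted in this post; the per-record size and sustained load rate are derived from them, not separately measured:

```python
# Sanity-check the test configuration using the figures quoted in the post.
GiB = 1024**3
TiB = 1024**4

nodes = 128
ram_per_node_gib = 448        # G5 instance RAM
dataset_tib = 40
records = 200e9               # 200 billion records
load_seconds = 15 * 60        # loaded in 15 minutes

total_ram_tib = nodes * ram_per_node_gib * GiB / TiB
bytes_per_record = dataset_tib * TiB / records
rate_gib_s = dataset_tib * TiB / GiB / load_seconds

print(f"cluster RAM:   {total_ram_tib:.0f} TiB")   # 56 TiB, comfortably above 40 TiB
print(f"record size:  ~{bytes_per_record:.0f} bytes")
print(f"load rate:    ~{rate_gib_s:.1f} GiB/s sustained")
```

The derived sustained load rate of roughly 45 GiB/s is consistent with the 50 GiB/s transfer figure discussed below.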


The graph above shows the rate of data transfer over a five-minute period.

Using a dedicated cloud connector, ActivePivot pulls data from several storage accounts in parallel, transferring at 50 GiB per second in aggregate. The ActiveViam cloud connector opens several HTTP connections to fully saturate the bandwidth of the VM and storage, and is tailored toward large files.
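The connector itself is proprietary, but the fan-out pattern it describes can be sketched as follows. This is a minimal illustration, not the ActiveViam implementation: `fetch_range` is a hypothetical stand-in for a ranged HTTP GET against Blob storage.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    # Stand-in for a real HTTP GET with a Range header against Blob storage.
    return blob[start:end]

def parallel_download(blob: bytes, chunk_size: int, workers: int = 8) -> bytes:
    # Split the blob into byte ranges and fetch them concurrently, so no
    # single HTTP connection limits throughput.
    ranges = [(i, min(i + chunk_size, len(blob)))
              for i in range(0, len(blob), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda r: fetch_range(blob, *r), ranges)
    # map() yields results in submission order, so concatenation is safe.
    return b"".join(chunks)

data = bytes(range(256)) * 1000
assert parallel_download(data, chunk_size=4096) == data
```

The same idea, multiplied across 128 VMs and 40 storage accounts, is what lets the aggregate transfer rate approach the combined bandwidth of the cluster.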

In parallel with the data fetching, ActivePivot indexes the data in memory to accelerate the analytical workloads. The ActivePivot query nodes distribute calculations automatically across the data nodes without requiring any cross-node data transfer. We expected performance to scale near-linearly as we doubled the CPU, memory, and data size, and we were pleased to see it track with our expectation.

As you can see in the graph below, when we multiplied the dataset by 64, from 600 GiB up to 37.5 TiB, throughput increased 54 times.
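A quick check on what those two figures imply about parallel efficiency (derived arithmetic only, using the numbers quoted above):

```python
# 600 GiB -> 37.5 TiB is a 64x growth in data (and cluster size);
# throughput grew 54x over the same range.
scale_factor = 37.5 * 1024 / 600   # GiB-based ratio
speedup = 54
efficiency = speedup / scale_factor

print(f"{scale_factor:.0f}x data, {speedup}x throughput, "
      f"{efficiency:.0%} parallel efficiency")
```

Roughly 84% efficiency at a 64x scale-out is what "near linear" means in practice here.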

This test proved very successful: we were able to execute extensive queries over the entire dataset in less than 10 seconds. For more detail and a step-by-step walkthrough, please visit the ActiveViam blog. We will also follow up with further scale-up testing later in the year, so watch this space.


For more information on Big Compute within financial services, please visit the HPC Financial Services page or HPC on Azure.