Affinio is an advanced marketing intelligence platform that enables brands to understand their users at a deeper and richer level. Affinio’s learning engine extracts marketing insights for its clients by mining billions of points of social media data. In order to store and process billions of social network connections without the overhead of database management, partitioning, and indexing, the Affinio engineering team chose Azure DocumentDB.
You can learn more about Affinio’s journey in this newly published case study. In this blog post, we provide an excerpt of the case study and discuss some effective patterns for storing and processing social network data.
Why are NoSQL databases a good fit for social data?
Affinio’s marketing platform extracts data from Twitter and other large social networks to feed its learning engine and derive insights about users and their interests. The biggest dataset consists of approximately one billion social media profiles, growing at 10 million per month. Affinio also needs to store and process a number of other feeds, including Twitter tweets (status messages), geo-location data, and machine learning results that predict which topics are likely to interest which users.
A NoSQL database is a natural choice for these data feeds for a number of reasons:
- The APIs from popular social networks produce data in JSON format.
- The data volume is in the TBs, and needs to be refreshed frequently (with both the volume and frequency expected to increase rapidly over time).
- Data from multiple social media producers is processed downstream, and each social media channel has its own schema that evolves independently.
- And crucially, a small development team needs to be able to iterate rapidly on new features, which means that the database must be easy to set up, manage, and scale.
Why does Affinio use DocumentDB over AWS DynamoDB and Elasticsearch?
The Affinio engineering team initially built their storage solution on top of Elasticsearch on AWS EC2 virtual machines. While Elasticsearch addressed their need for scalable JSON storage, they realized that setting up and managing their own Elasticsearch servers took away precious time from their development team. They then evaluated Amazon’s DynamoDB service which was fully-managed, but it did not have the query capabilities that Affinio needed.
Affinio then tried Microsoft Azure DocumentDB, Microsoft’s planet-scale NoSQL database service. DocumentDB is a fully managed NoSQL database with automatic indexing of JSON documents, elastic scaling of throughput and storage, and rich query capabilities, which met all of Affinio’s requirements for functionality and performance. As a result, Affinio decided to migrate its entire stack off AWS and onto Microsoft Azure.
“Before moving to DocumentDB, my developers would need to come to me to confirm that our Elasticsearch deployment would support their data or if I would need to scale things to handle it. DocumentDB removed me as a bottleneck, which has been great for me and them.”
-Stephen Hankinson, CTO, Affinio
Modeling Twitter Data in DocumentDB – An Example
As an example, we take a look at how Affinio stored data from Twitter status messages in DocumentDB. Here’s a sample JSON status message (truncated for visibility).
{
  "created_at": "Fri Sep 02 06:43:15 +0000 2016",
  "id": 771599352141721600,
  "id_str": "771599352141721600",
  "text": "RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!",
  "user": {
    "id": 2557284469,
    "id_str": "2557284469",
    "name": "Azure DocumentDB",
    "screen_name": "DocumentDB",
    "location": "",
    "description": "A blazing fast, planet scale NoSQL service delivered by Microsoft.",
    "url": "https://t.co/30Tvk3gdN0"
  }
}
Storing this data in DocumentDB is straightforward. As a schema-less NoSQL database, DocumentDB consumes JSON data directly from Twitter APIs without requiring schema or index definitions. As a developer, the primary considerations for storing this data in DocumentDB are the choice of partition key and addressing any unique query patterns (in this case, searching within text messages). We'll look at how Affinio addresses these two.
Picking a good partition key: DocumentDB partitioned collections require that you specify a property within your JSON documents as the partition key. Using this partition key value, DocumentDB automatically distributes data and requests across multiple physical servers. A good partition key has a large number of distinct values, which allows DocumentDB to spread data and requests evenly across many partitions. Let’s take a look at a few candidates for a good partition key for social data like Twitter status messages.
- “created_at” – has a large number of distinct values and is useful for accessing data for a certain time range. However, since new status messages are inserted based on the created time, this could result in hot spots for certain time values, such as the current time.
- “id” – this property corresponds to the ID for a Twitter status message. It is a good candidate for a partition key, because there are a large number of unique status message IDs, and they can be distributed somewhat evenly across any number of partitions/servers.
- “user.id” – this property corresponds to the ID for a Twitter user. This was ultimately the best choice for a partition key because not only does it allow writes to be distributed, it also allows reads for a certain user’s status messages to be efficiently served via queries from a single partition.
With “user.id” as the partition key, Affinio created a single DocumentDB partitioned collection provisioned with 200,000 request units per second of throughput (both for ingestion and for querying via their learning engine).
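To make the trade-off between these candidates concrete, here is a minimal Python sketch. It does not reproduce how DocumentDB assigns partitions internally: the MD5-based hash, the partition count of 4, and the sample messages are all assumptions made up for illustration. It only shows the general effect of hashing different properties.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical; the real service manages physical partitions itself


def partition_for(key_value, num_partitions=NUM_PARTITIONS):
    """Map a partition key value to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(str(key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


# Made-up status messages: 20 users, 5 messages each, all created at the same moment.
messages = [
    {"id": user * 1000 + n, "created_at": "Fri Sep 02 06:43:15 +0000 2016", "user": {"id": user}}
    for user in range(1, 21)
    for n in range(5)
]


def spread(messages, key_fn):
    """Count how many messages land on each partition for a given key choice."""
    return Counter(partition_for(key_fn(m)) for m in messages)


def partitions_for_user(messages, key_fn, user_id):
    """Return the set of partitions that hold one user's messages."""
    return {partition_for(key_fn(m)) for m in messages if m["user"]["id"] == user_id}


by_created_at = spread(messages, lambda m: m["created_at"])
by_user_id = spread(messages, lambda m: m["user"]["id"])

print("created_at:", dict(by_created_at))  # every message shares one key value
print("user.id:   ", dict(by_user_id))
print("partitions holding user 7:", partitions_for_user(messages, lambda m: m["user"]["id"], 7))
```

In this toy setup, a burst of messages with the same “created_at” value all hashes to one partition, while keying on “user.id” spreads writes across partitions and still keeps each user’s messages together.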
Searching within the text message: Affinio needs to be able to search for words within status messages, but does not need to perform advanced text analysis like ranking. Affinio runs a Lucene tokenizer on the relevant fields when it needs to search for terms, and it stores the terms as an array inside a JSON document in DocumentDB. For example, “text” can be tokenized as a “text_terms” array containing the tokens/words in the status message. Here’s an example of what this would look like:
{
  "text": "RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!",
  "text_terms": [
    "rt",
    "documentdb",
    "dotnet",
    "sdk",
    "v1.9.4",
    "just",
    "released"
  ]
}
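As a rough illustration of this write-time step, the following Python sketch adds a “text_terms” array to a status message before it is stored. The regular-expression tokenizer is a simple stand-in and not the Lucene tokenizer Affinio uses, so its output will not necessarily match what a Lucene analyzer produces. The function names `tokenize` and `with_text_terms` are hypothetical.

```python
import re

# Letters and digits, allowing inner dots so that "v1.9.4" stays one token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:\.[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and return its distinct terms in first-seen order."""
    seen = set()
    terms = []
    for term in TOKEN_PATTERN.findall(text.lower()):
        if term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


def with_text_terms(status):
    """Return a copy of a status message with a "text_terms" array added."""
    enriched = dict(status)
    enriched["text_terms"] = tokenize(status["text"])
    return enriched


status = {"text": "RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!"}
doc = with_text_terms(status)
print(doc["text_terms"])
```

Lowercasing and de-duplicating at write time keeps the stored array small and means a lookup only needs to compare against one normalized form of each word.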
Since DocumentDB automatically indexes all paths within JSON including arrays and nested properties, it is now possible to query for status messages with certain words in them like “documentdb” or “dotnet” and have these served from the index. For example, this is expressed in SQL as:
SELECT * FROM status_messages s WHERE ARRAY_CONTAINS(s.text_terms, "documentdb")
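To make the semantics of this query concrete, here is a small in-memory analogue in Python. It does not call DocumentDB or use its index. The helper names and the sample documents are made up for illustration, and the sketch assumes that the term lookup is an exact, case-sensitive element match, which is why storing terms in lowercase matters.

```python
def array_contains(array, value):
    """Mimic the array lookup for scalars: true only if an element equals value exactly."""
    return isinstance(array, list) and value in array


def query_by_term(documents, term):
    """In-memory analogue of filtering documents whose text_terms array contains term."""
    return [d for d in documents if array_contains(d.get("text_terms"), term)]


# Hypothetical stored documents; the third has no text_terms property at all.
status_messages = [
    {"id": "1", "text_terms": ["rt", "documentdb", "dotnet", "sdk"]},
    {"id": "2", "text_terms": ["azure", "storage"]},
    {"id": "3", "text": "no terms stored"},
]

matches = query_by_term(status_messages, "documentdb")
print([d["id"] for d in matches])
```

Under these assumptions, searching for “DocumentDB” (mixed case) or the partial word “document” returns nothing, so the search term should be normalized the same way the stored terms were.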
Next Steps
In this blog post, we looked at why Affinio chose Azure DocumentDB for their marketing intelligence platform, and at some effective patterns for storing large volumes of social data in DocumentDB.
- Read the Affinio case study to learn more about how Affinio harnesses DocumentDB to process terabytes of social network data, and why they chose DocumentDB over Amazon DynamoDB and Elasticsearch.
- Learn more about Affinio from their website.
- If you’re looking for a NoSQL database to handle the demands of modern marketing, ad-technology and real-time analytics applications, try out DocumentDB using your free trial, or schedule a 1:1 chat with the DocumentDB engineering team.
- Stay up-to-date on the latest DocumentDB news and features by following us on Twitter @DocumentDB.