This post was co-authored by Tomer Teller, Senior Security Program Manager, Azure Security.
The arms race between data security professionals and cybercriminals continues at a rapid pace. More than ever, attackers exploit compute resources for malicious purposes by deploying malware, known as “bots”, in virtual machines running in the cloud. Even a conservative estimate reveals that, at least, 1 in every 10,000 machines are part of some known Botnet.
To better protect VMs in the cloud, Azure Security Center (ASC) applies a novel supervised Machine Learning model for high-precision Botnet detection based on analysis of DNS query logs. This model achieves 95% precision and 43% recall and can detect Botnets before they are reported by antimalware companies.
Communication patterns between Botnets and their CnC server
Bots are controlled by the attacker (Botmaster) using a Command and Control (CnC) server or servers. The Bots which are part of this network are called Botnets (or Zombies). A typical Bot network (Botnet) structure is illustrated in the following figure.
Historically, a CnC server was assigned a static IP address making it very easy to take down or blacklist. To avoid detection, Botmasters responded by creating more complex bot/CnC communication patterns.
For example, attackers developed methods, such as Fast-Flux (Mehta, 2014), that use domain names to locate the CnC server which frequently changes its IP address. In addition, domain generation algorithms (DGA) are also used by various families of malware, with Conficker being perhaps the most notorious example. The DGA pattern works by periodically generating a large number of domain names that can be used by bots as connection points to the CnC servers.
More recently, social networks and other user-generated content sites are being exploited by Botmasters to pass information without ever establishing a direct link with a CnC server. Security professionals can therefore no longer rely on simple rule-based approaches to detect these complex communication patterns.
Opportunity to detect Botnets in the cloud
One of the more common applications of machine learning in the cybersecurity domain is anomaly detection. The idea is that a compromised machine exhibits anomalous behavior. While this assumption is usually correct, the opposite seldom holds. Therefore, such techniques achieve low precision and thus produce many false alarms.
Cloud providers such as Microsoft possess a unique opportunity to detect Botnet activity with much greater accuracy by applying large scale machine learning over multiple data sources as a well as a combined view of all the VM logs. Unlike most other systems which analyze data from each machine in isolation, our approach can effectively uncover patterns that are typical to Botnets.
Gathering the data
We collect DNS query and response data from Azure VMs. The logs contain around 50TB of data per day and includes information such as the query name, queried domain name server, the DNS response, and other DNS logging information.
In addition to DNS query and response data, we also use a Microsoft automated machine-readable feed of threat intelligence (TI). The feed includes information about IP addresses of devices which are likely to be part of a Botnet as well as the IP addresses and domains of known CnC servers.
To achieve optimal results, we model Botnet detection as a 2-class supervised learning problem. That is, to classify that a VM (on a specific date) is part of a Botnet based on that VM’s DNS query log. VM instances are labeled as possibly participating in a Botnet based on the following criteria:
- The IP address of the VM appears in the TI Botnet feed of that same day.
- The VM issued a DNS query with a domain known to belong to a CnC.
- The VM received a DNS response to an issued query and the resulting mapped IP is a known IP address of a CnC.
The VM instances represent a VM on a specific day and are labeled as possible participants of a botnet based on the TI feed of that day. In our problem, feature extraction is difficult; the number of domains accessed by a given VM can be very large and the total number of possible domains is massive. Hence, the domain space is huge and relatively dense.
Moreover, since the model is used continuously, it needs to identify Botnets even when they query for domains is unseen during training. Based on communication patterns with CnC servers our features should capture the insights laid out in the following table.
|Rare domains||Domain names of CnC servers are rare since they are seldom requested by legitimate users|
|Young domains||When a domain generation algorithm (DGA) is used the CnC server frequently acquires new domain names hence they tend to be recently registered. We use a massive daily updated data feed to map domain names to their registration date|
|Domains Idiosyncratic to Botnets||Botnets controlled by the same CnC server issue DNS queries which contain similarities to each other yet are different from others|
|Non-existent domain responses||
When DGA is used, Botnets query many non-existing domains before they find the actual domain of their CnC server for that time
To efficiently generate the features for each instance we apply two passes over the dataset. In the first pass, we generate a Reputation Table (RT) which maps domains at a given day to:
- Rareness scores
- Youngness scores
- Botnet idiosyncratic scores
In the second pass we calculate the features for each instance based on the reputation scores of domains it queried for. The RT is calculated as follows:
To generate features for a VM at a given day, we create a set Dnx and Dx which are sets of all the non-existent and existent domains queried by the VM, respectively. We produce the feature vectors by summing up the corresponding values in the reputation table for each domain in Dnx and Dx separately.
Similarly, we do the same for the name-servers being queried (i.e., the DNS server that eventually resolved the DNS query). The latter features help identify legitimate scenarios in which rare subdomains of a non-malicious zone are accessed, e.g., rare subdomains of Yahoo.com. For classification, we used Apache Spark’s Gradient-Boosted Trees with default parameters.
We use the Microsoft TI feed to generate the labels for our daily Azure VM instances. However, these labels are not perfect; Botnets can still remain undetected for quite some time. Hence, the labels we produced based on the feed are not comprehensive. This makes our evaluation setting different from that of a standard classification problem, since our goal is not to perfectly match labels that can be extracted based on the feed. This would simply duplicate information which is already available in the daily feed. Instead, our goal is to find compromised VMs before they appear in the TI feed.
With this in mind, we trained our model on a week of data from early June 2016. We let the model classify instances from late June and produce our “ground truth” for evaluation based on labels generated from the TI feed looking forward in time one week (into July). We report the accuracy of our model in the following confusion matrix.
From the matrix, we learn that the model classified 432 (411+21) instances from the test set as being Botnets. Out of these, 95% (411) eventually appear in the Interflow feed (within a week), hence the model achieves 95% precision and 43% recall. Note that the 5% apparent FPs may still potentially be Botnets that have not yet appeared in the feed, hence they require further investigation.
We present a novel supervised ML model for Botnet detection based on DNS logs. We generate the labels for the supervised model based on a threat intelligence feed provided by anti-malware vendors. We show that the model is able to identify with high accuracy the VMs that are part of a botnet well before they become part of the TI feed. This new Botnet detection feature will reduce the risk of Azure VMs becoming infected with malware.