Since SQL Server 2016, a new function called R Services has been introduced. Microsoft recently announced a preview for the next version of SQL Server, which extends the advanced analytical ability to Python. This new capability of running R or Python in-database at scale enables us to keep the analytics services close to the data and eliminates the burden of data movements. It also simplifies the development and deployment of intelligent applications. To get the most out of SQL server, knowing how to fine tune the intelligence model itself is far from sufficient and sometimes still fail to meet the performance requirement. There are quite a few optimization tips and tricks that could help us boost the performance significantly. In this post, we apply a few optimization techniques to a resume-matching scenario, which mimics the workflow of large volume prediction aiming to showcase how those techniques could make data analytics more efficient and powerful. The three main optimization techniques introduced in our blog are as follows:
- Full durable memory-optimized tables
- CPU affinity and memory allocation
- Resource governance and concurrent execution
This blog post is a short summary of how the above optimization tips and tricks work with R Services on Azure SQL Server. Those optimization techniques not only work for R Services, but for any Machine Learning Services integrated with SQL Server. Please refer to the full tutorial for sample code and step-by-step walkthroughs.
Description of the Sample Use Case
The sample use case for both this blog and its associated tutorial is a resume-matching example. Finding the best candidate for a job position has long been an art that is labor intensive and requires a lot of manual efforts from search agents. How to find candidates with certain technical or specialized qualities from massive amount of information collected from diverse sources has become a new big challenge. We developed a model to search good matches among millions of resumes for a giving position. Being formulated as a binary classification problem, the machine learning model takes both the resume and job description as the inputs and produces the probability of being a good match for each resume-job pair. A user defined probability threshold is then used to further filter out all good matches.
A key challenge in this use case is that for each new job, we will need to match it with millions of resumes within a reasonable time frame. The feature engineering step, which produces thousands of features (2600 in this case), is a significant performance bottleneck during scoring. Hence, achieving a low matching (scoring) latency is the main objective in this use case.
There are many different types of optimization techniques, and we are going to discuss a few of them using the resume-matching scenario. In this blog, we will explain why and how those optimization techniques work from high level. For more detailed explanations and background knowledge, please refer to the included reference links. In the tutorial, the results are expected to be reproducible using similar hardware configuration and the SQL scripts.
Nowadays, memory is no longer a problem for a modern machine in terms of size and speed. People can get ‘value of RAM’ with the advancement of hardware. In the meantime, data has been produced far more quickly than ever before and some tasks need to process those data with low latency. Memory-optimized tables can leverage the advancement of hardware to tackle this problem. Memory-optimized tables mainly reside in memory so that data is read from and written to memory . However, for durability purposes a second copy of the table is maintained on disk and data is only read from disk during database recovery. The performance could be optimized with high scalability and low latency using memory especially when we need to read from and write to tables very frequently . You can find a detailed introduction of memory-optimized tables on this blog . You can also watch this video  to learn more about the performance benefits of using In-Memory OLTP.
In the resume-matching scenario, we will need to read all the resume features from the database and match all of them with a new job opening. By using memory-optimized tables, resume features are stored in main memory and disk IO could be significantly reduced. In addition, since we need to write all the predictions back to the database concurrently from different batches, extra performance gain could be achieved by using memory-optimized table. With the support of memory-optimized table on SQL Server, we achieved low latency on reading from/writing to tables and a seamless experience during development. Full durable memory-optimized tables were created along with creating the database. The rest of the development is exactly the same as before without knowing where the data is stored.
CPU affinity and memory allocation
With SQL Server 2014 SP2 and later version, soft-NUMA is automatically enabled at the database-instance level when starting the SQL Server service [4, 5, 6]. If the database engine server detects more than 8 physical cores per NUMA node or socket, it will automatically create soft-NUMA nodes that ideally contain 8 cores. But it can go down to 5 or up to 9 logical cores per node. You can find the log information when SQL Server detects more than 8 physical cores in each socket.
Figure 1: SQL log of auto Soft-NUMA, 4 soft NUMA nodes were created
As shown in Figure 1, our test consisted of 20 physical cores among which 4 soft-NUMA nodes were created automatically such that each node contained 5 cores. Soft-NUMA enables the ability to partition service threads per node and that generally increases scalability and performance by reducing IO and lazy writer bottlenecks. We then further created 4 SQL resource pools and 4 external resource pools  to specify the CPU affinity of using the same set of CPUs in each node. By doing this, both SQL Server and the R processes can eliminate foreign memory access since the processes will be within the same NUMA node. Hence, memory access latency could be reduced. Subsequently, those resource pools are then assigned to different workload groups to enhance hardware resource consumption.
Soft-NUMA and CPU affinity cannot divide physical memory in each physical NUMA node. All the soft NUMA nodes in the same physical NUMA node receive memory from the same OS memory block and there is no memory-to-processor affinity. However, we should pay attention to the memory allocation between SQL Server and the R processes. By default, only 20% of memory is allocated to R services and that is not enough for most of the data analytical tasks. Please see How To: Create a Resource Pool for R  for more information. We need to fine tune memory allocation between those two and of course the best configuration varies case by case. In the resume-matching use case, we increased the external memory resource allocation to 70% which was the best configuration.
Resource governance and concurrent scoring
To scale up the scoring problem, a good practice is to adopt the map-reduce approach in which we split millions of resumes into multiple batches, and then execute multiple scoring concurrently. The parallel processing framework is illustrated in Figure 2.
Figure 2: Illustration of parallel processing in multiple batches
Those batches will be processed on different CPU sets, and the results will be collected and written back to the database. Resource governance in SQL Server is designed to implement this idea. We can create resource governance for R services on SQL Server  by routing those scoring batches into different workload groups (Figure. 3). More information about resource governor could be found on this blog .
Figure 3: Resource governor (from: https://docs.microsoft.com/en-us/sql/relational-databases/resource-governor/resource-governor)
Resource governor can help divide the available resources (CPU and memory) on a SQL Server to minimize the workload competition using a classifier function [10, 11]. It provides multitenancy and resource isolation on SQL Server for different tasks to potentially improve the execution and provide predictable performance.
One pain point with R is that when we conduct feature engineering it is usually processed on a single CPU. This is a major performance bottleneck for most of the data analysis tasks. In our resume-matching use case, we need to produce 2,500 cross-product features that will be then combined with the original 100 features (Figure 4). This whole process would take significant amount of time if everything was done on a single CPU.
Figure 4: Feature engineering of our resume-matching use case
One trick here is to create a R function for feature engineering and to pass it as rxTransform function during training. The machine learning algorithm is implemented with parallel processing. As part of the training, the feature engineering is also processed on multiple CPUs. In comparison with regular approach in which feature engineering is conducted before training and scoring, we observed a 16% performance improvement in terms of scoring time.
Another trick that can potentially improves the performance is to use SQL compute context within R . Since we have isolated resources for different batch executions, we need to isolate the SQL query for each batch as well. By using SQL compute context, we can parallelize the SQL query to extract data from tables and constrain the data on the same workload group.
Results and Conclusion
To fully illustrate those tips and tricks, we have published a very detailed step-by-step tutorial. A few benchmark tests for scoring 1.1 million rows of data were also conducted. We used both the RevoScaleR and MicrosoftML packages to train a prediction model separately. We then compared the scoring time if using those optimizations versus without optimizations. Figure 5 and 6 summarize the best performance results using RevoScaleR and MicrosoftML packages. The tests were conducted on the same Azure SQL Server VM using the same SQL query and R codes. Eight batches for one matching job were used in all tests.
Figure 5: RevoScaleR scoring results
Figure 6: MicrosoftML scoring results
The results suggested that the number of features had a significant impact on the scoring time. Also, using those optimization tips and tricks could significantly improve the performance in terms of scoring time. The improvement was even more prominent if more features were used in the prediction model.
Lastly, we would like to express our thanks to Umachandar Jayachandran, Amit Banerjee, Ramkumar Chandrasekaran, Wee Hyong Tok, Xinwei Xue, James Ren, Lixin Gong, Ivan Popivanov, Costin Eseanu, Mario Bourgoin, Katherine Lin and Yiyu Chen for the great discussions, proofreading and test-driving the tutorial accompanying this blog post.