
R workloads on Azure Batch

In this post, we will talk about how Azure Batch lets you run simple R scripts in the cloud across numerous VMs.

R is an open source software environment and language that provides a wide variety of statistical and graphical techniques; it is highly extensible and runs on all platforms. Today R is used by a wide variety of organizations in their daily business, including banks, automakers, airline manufacturers and tech companies. They use it for prediction and statistical analysis that runs at massive scale, and that large-scale computation and processing creates a need for large-scale resources.

This is where Azure Batch comes in: a service, in Public Preview since October 2014, that provides resourcing, scheduling and task execution as a managed service on Microsoft Azure. It's a perfect fit for running large-scale R scripts and workloads on a cluster. You provide the model and analysis; Azure Batch takes care of resourcing and scheduling the VMs, executing R to run the scripts, and pushing the outputs to their correct destination.

The service executes "Rscript.exe" (R can be downloaded here and instructions on installation for Windows can be found here) on each VM for every task produced by splitting the job. The R executable is packaged and uploaded to the Batch account using the Batch Apps portal, which can be accessed through your Batch account via the Azure Portal. The code that executes the R script is packaged and uploaded using the same resource.

There are two main parts to the solution:

1) Cloud – This is the server-side code that splits the job into numerous tasks and then executes each task.

public static readonly CloudApplication Application = new ParallelCloudApplication
{
    // The application image zip uploaded in the portal must use this name.
    ApplicationName = "RWorkload",
    // Clients reference this job type when submitting workloads.
    JobType = "RWorkload",
    // Splits a job into tasks.
    JobSplitterType = typeof(RWorkloadJobSplitter),
    // Executes each task.
    TaskProcessorType = typeof(RWorkloadTaskProcessor)
};

This code describes my workload. The application name corresponds to the Application Image zip that needs to be uploaded in the portal; this zip contains the R executables. The job type is what the client references to submit the workload.

protected override IEnumerable<TaskSpecifier> Split(IJob job, JobSplitSettings settings)
{
    var taskList = new List<TaskSpecifier>();

    // Create a single task that carries all of the job's files and parameters.
    var task = new TaskSpecifier
    {
        RequiredFiles = job.Files,
        Parameters = job.Parameters
    };

    taskList.Add(task);

    return taskList;
}

This code creates a single task, but it can be extended to create multiple tasks depending on a host of parameters, such as the number of files, or even by parsing a script and figuring out advanced cases to split the workload; a sketch of a per-file split follows below. Notice I am also passing through all the files and parameters supplied by the client. The files include all the scripts, among them the main R script, which may or may not call into multiple other scripts. The parameters contain the main R script's file name, which identifies the primary script.
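For example, a minimal sketch of a per-file split might create one task per .r file. This is illustrative only: the Name property on the file objects, the assumption that Parameters is a string-to-string dictionary, and the rule that every .r file is an independent unit of work are my assumptions for the sketch, not framework guarantees.

protected override IEnumerable<TaskSpecifier> Split(IJob job, JobSplitSettings settings)
{
    var taskList = new List<TaskSpecifier>();

    // Sketch assumption: each .r file in the job is an independent unit of work.
    foreach (var file in job.Files)
    {
        if (!file.Name.EndsWith(".r", StringComparison.OrdinalIgnoreCase))
            continue;

        // Copy the job parameters, then point this task at its own script.
        var parameters = new Dictionary<string, string>(job.Parameters);
        parameters["inputFile"] = file.Name;

        taskList.Add(new TaskSpecifier
        {
            RequiredFiles = job.Files,   // all scripts travel with each task, since they may call each other
            Parameters = parameters
        });
    }

    return taskList;
}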

protected override TaskProcessResult RunExternalTaskProcess(ITask task, TaskExecutionSettings settings)
{
    // The R input script for this task
    var inputFile = task.Parameters["inputFile"];

    // Output file produced independently by the R script
    const string outputFile = "output.txt";

    // Standard output is redirected to this per-task file
    string standardOutput = string.Format("std_task{0}.out", task.TaskId);

    var process = new ExternalProcess
    {
        CommandPath = ExecutablePath(@"R-3.1.2\bin\Rscript.exe"),
        Arguments = string.Format("{0} > {1}", inputFile, standardOutput),
        WorkingDirectory = LocalStoragePath
    };

    try
    {
        var processOutput = process.Run();
        return TaskProcessResult.FromExternalProcessResult(processOutput, standardOutput, outputFile);
    }
    catch (Exception ex)
    {
        Log.Error("Error in task processor: {0}", ex.ToString());
    }

    // Ask the service to retry the task if anything went wrong
    return new TaskProcessResult { Success = TaskProcessSuccess.RetryableFailure };
}

The task processor executes Rscript.exe with the input script, redirecting standard output to a per-task file; the R script itself independently produces output.txt. Both files are then passed on as the results of the task, which means they are uploaded to blob storage and become available via the task/job download in the portal.

2) Client – This is the client-side code that submits the job along with the required files, monitors the job status and finally downloads the output that is produced.

var parameters = new Dictionary<string, string>
{
    { "Rscript", ".r" },
    { "inputFile", ".r" }
};

return new JobSubmission
{
    Name = "R",                             // friendly name for the job
    Type = "RWorkload",                     // must match the JobType in the cloud code
    RequiredFiles = userInputFilePaths,     // local paths of the scripts to upload
    Parameters = parameters,                // names of the scripts, including the primary one
    InstanceCount = 1,                      // number of VM instances to run the job on
};

The client builds the job specification, which contains the job submission object. In the sample above I've given the job a name; included the job type, which corresponds to the job type in the cloud code; listed the scripts to pass as input files (these can be referenced by local path and are uploaded to Azure Blob Storage behind the scenes, as part of the same submission call); supplied the parameters that contain the names of the scripts; and finally set the number of instances I want to run the job on. A sketch of how the input file list might be assembled follows.
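As a minimal sketch of assembling that file list (the C:\scripts folder, and the assumption that RequiredFiles accepts plain local paths, are illustrative rather than confirmed API):

using System.IO;

// Illustrative only: collect the local .r scripts to submit with the job.
var userInputFilePaths = Directory.GetFiles(@"C:\scripts", "*.r");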

var job = await client.Jobs.SubmitAsync(jobSpec);

Finally, the JobSubmission object is passed to the Submit call in the SDK, and the workload is submitted to the service that was created using the portal. The job can now be monitored via the portal, or from the client by getting the job's details using the Job Id returned at submission.
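A rough sketch of client-side polling might look like the following; the GetAsync method and the Status and Id property names are assumptions made for illustration, so consult the Batch Apps SDK documentation for the actual API:

// Assumed API for illustration: poll until the job reports completion.
var details = await client.Jobs.GetAsync(job.Id);
while (details.Status != "Complete")
{
    await Task.Delay(TimeSpan.FromSeconds(30));   // avoid hammering the service
    details = await client.Jobs.GetAsync(job.Id);
}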

So you can see how simple it is to run R scripts in the cloud without much effort, and how trivially this scales out to multiple virtual machines depending on the workload's requirements. We encourage you to play around with these ideas and get in touch with feedback, questions and comments.

 

Additional Resources

Technical overview

Get started with the Azure Batch library for .NET

Get Batch development libraries and tools