Data Simulator For Machine Learning

Published on March 20, 2017

Data Scientist

Virtually any data science experiment that uses a new machine learning algorithm requires testing across different scenarios. Simulated data allows one to do this in a controlled and systematic way that is usually not possible with real data.

A convenient way to implement and reuse data simulation in Azure Machine Learning (AML) Studio is through a custom R module. A custom R module combines the convenience of an R script packaged inside a drag-and-drop module with the flexibility of custom code: the user is free to add or remove parameters, which appear as module inputs in the AML Studio GUI, as needed. A custom R module behaves identically to native AML Studio modules. Its inputs and outputs can be connected to other modules or set manually, and inside AML experiments it can process data of arbitrary schema, if the underlying R code allows it. An added benefit is that custom modules let one deploy code without revealing the source, which may be useful in IP-sensitive scenarios. By publishing a module in Cortana Intelligence Gallery, one can easily expose any algorithm functionality to the world without going through a classical software deployment process.

Data simulator

We present here an AML Studio custom R module that implements a data simulator for binary classification. The current version is simple enough that its complete code is included in the Cortana Intelligence Gallery item page. It generates datasets of user-chosen feature dimensionality, with both label-relevant and label-irrelevant columns. Relevant features are univariately correlated with the label column. The direction of the correlation (i.e. a positive or negative correlation coefficient) is controlled by the correlationDirectionality parameter(s). All features are generated using separate runif calls. In the future, the module could be extended to let the user choose other distributions by exposing R's ellipsis (three dots) argument. The last module parameter (seedValue) controls the reproducibility of the results. Figure 1 shows all module parameters exposed in AML Studio.


Figure 1. Data Simulator custom R module in an AML experiment. 1,000,000 samples are simulated, with 1000 irrelevant and 10 label-relevant columns. The data is highly imbalanced, since only 20 samples are of the “FALSE” class. The two-value array (.03 and 5) given for the “noiseAmplitude” property is reused cyclically for all relevant columns. Similarly, the signs of the four values (1, -1, 0, 3.5) given for the “label-features correlation” property are reused for all 10 relevant columns to control the direction of the correlation (i.e. positive or negative) with the label column.
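
The parameter behavior described above can be sketched in a few lines of R. This is a minimal illustration, not the code of the published module: the function name simulate_binary_data, the nSamples, nFalse, nRelevant and nIrrelevant arguments, the way a relevant feature is built (signed label plus scaled uniform noise), and the treatment of a 0 direction value as positive are all assumptions made for this example. Only runif, the recycling of short parameter arrays, and the noiseAmplitude, correlationDirectionality and seedValue names come from the description.

```r
# Hypothetical sketch; NOT the code of the published custom R module.
simulate_binary_data <- function(nSamples, nFalse, nRelevant, nIrrelevant,
                                 noiseAmplitude = 1,
                                 correlationDirectionality = 1,
                                 seedValue = 1) {
  stopifnot(nSamples >= 1, nFalse >= 0, nFalse <= nSamples,
            nRelevant >= 0, nIrrelevant >= 0)
  set.seed(seedValue)  # same seed, same simulated data

  # Exactly nFalse labels are FALSE; their row positions are shuffled.
  label <- sample(rep(c(FALSE, TRUE), times = c(nFalse, nSamples - nFalse)))

  # Recycle the (possibly short) parameter arrays over the relevant columns.
  noise <- rep_len(noiseAmplitude, nRelevant)
  # Only the sign is used; 0 is treated as positive (an assumption).
  direction <- ifelse(rep_len(correlationDirectionality, nRelevant) < 0, -1, 1)

  # Assumed mechanism: signed label plus noise from a separate runif call.
  relevant <- lapply(seq_len(nRelevant), function(j) {
    direction[j] * as.numeric(label) + noise[j] * runif(nSamples)
  })
  names(relevant) <- sprintf("relevant%d", seq_len(nRelevant))

  # Irrelevant features are pure uniform noise, independent of the label.
  irrelevant <- lapply(seq_len(nIrrelevant), function(j) runif(nSamples))
  names(irrelevant) <- sprintf("irrelevant%d", seq_len(nIrrelevant))

  as.data.frame(c(relevant, irrelevant, list(label = label)))
}

# Scaled-down version of the Figure 1 settings.
simulated <- simulate_binary_data(nSamples = 1000, nFalse = 20,
                                  nRelevant = 10, nIrrelevant = 5,
                                  noiseAmplitude = c(0.03, 5),
                                  correlationDirectionality = c(1, -1, 0, 3.5),
                                  seedValue = 42)
dim(simulated)          # 1000 rows, 16 columns (10 + 5 features, plus label)
table(simulated$label)  # class counts
```

Because the two noiseAmplitude values are recycled, odd-numbered relevant columns receive the low noise level and even-numbered ones the high level in this sketch.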

By visualizing the module output (right-click it and choose “Visualize”), as shown in Figure 2, we can check basic properties of the data, including the data matrix size and univariate statistics such as range and missing values.


Figure 2. Visualization of simulated data. The data has 1,000,000 rows and 1011 columns (10 relevant and 1000 irrelevant feature columns, plus the label). The histogram of the label column (right graph) shows the large class imbalance chosen for this simulation.

Univariate Feature Importance Analysis of simulated data

Note: Depending on the size chosen for the simulated data, generation may take some time, e.g. 1 hour for a dataset of 1e6 rows x 2000 feature columns (2001 columns in total). However, new modules can be added to the experiment after the data have been generated, and the cached data can be processed as described below without simulating them again.

Univariate Feature Importance Analysis (FIA) measures the similarity between each feature column and the label values using metrics such as Pearsonian Correlation and Mutual Information (MI). MI is more general than Pearsonian Correlation because it does not depend on the directionality of the dependence: a feature whose labels are of one class (say “TRUE”) for all middle values and of the other class (“FALSE”) for all small and large values will still have a large MI value, although its Pearsonian Correlation may be close to zero.
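
The “middle values” case can be checked on a toy feature. The sketch below is illustrative only: the mutual_information helper, its equal-width binning and the choice of 10 bins are assumptions made for this example, and the estimator used by the Filter Based Feature Selection module may differ.

```r
set.seed(1)
n <- 10000
x <- runif(n)
label <- x > 0.25 & x < 0.75   # "TRUE" only for middle values of x

# Pearson correlation between the feature and the 0/1 label.
pearson <- cor(x, as.numeric(label))

# Plug-in MI estimate (in nats) between a binned numeric feature and a label.
# Hypothetical helper written for this example.
mutual_information <- function(feature, label, nBins = 10) {
  bins <- cut(feature, breaks = nBins)                 # equal-width bins
  joint <- unclass(table(bins, label)) / length(feature)
  expected <- outer(rowSums(joint), colSums(joint))    # product of marginals
  nonzero <- joint > 0                                 # skip empty cells
  sum(joint[nonzero] * log(joint[nonzero] / expected[nonzero]))
}

miDependent <- mutual_information(x, label)

# Reference point: a feature that is independent of the label.
miIndependent <- mutual_information(runif(n), label)

print(c(pearson = pearson,
        miDependent = miDependent,
        miIndependent = miIndependent))
```

In this construction the linear correlation is close to zero because the dependence is symmetric around the middle of the range, while the MI of the dependent feature is far larger than that of the independent one.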

Although feature-wise univariate FIA does not capture multivariate dependencies, it provides a simple-to-understand picture of the relationship between features and the classification target (labels). An easy way to perform univariate FIA in AML Studio is to use the existing AML module for Filter Based Feature Selection for similarity computation and Execute R Script module(s) for results concatenation. To do this, we extend the default experiment deployed through the CIS gallery page by adding several AML Studio modules as described below.

We first add a second Filter Based Feature Selection module and choose Mutual Information for its “Feature scoring method” property. The original Filter Based Feature Selection module, with its “Feature scoring method” property set to Pearson Correlation, should be left unchanged. For both Filter Based Feature Selection modules, the setting of the “Number of desired features” property is irrelevant, since we will use the similarity metrics computed for all data columns, available by connecting to the second (right) output of each Filter Based Feature Selection module. The “Target column” property of both modules needs to point to the label column name in the data. Figure 3 shows the settings chosen for the second Filter Based Feature Selection module.


Figure 3. Property settings for the Filter Based Feature Selection AML Studio module added for Mutual Information computation. By connecting to the right side output of the module we get the MI values for all data columns (features and label).

The next two Execute R Script modules added to the experiment are used for results concatenation. Their scripts are listed below.

First module (rbind with different column order):

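  dataset1 <- maml.mapInputPort(1) # class: data.frame
  dataset2 <- maml.mapInputPort(2) # class: data.frame

  # Match the column order of dataset1, then stack the two score rows.
  dataset2 <- dataset2[, colnames(dataset1), drop = FALSE]
  data.set <- rbind(dataset1, dataset2)
  maml.mapOutputPort("data.set") # send the result to the output port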


Second module (add row names):

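  dataset <- maml.mapInputPort(1) # class: data.frame

  # Label each row with its metric; row order follows the rbind in the first module.
  myRowNames <- c("PearsCorrel", "MI")
  data.set <- cbind(myRowNames, dataset)
  names(data.set)[1] <- "Algorithms"
  maml.mapOutputPort("data.set") # send the result to the output port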
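
The two concatenation steps can also be tried outside AML Studio. The sketch below replaces the maml.mapInputPort calls with two small hand-made data frames; the column names, the score values and the output file name are invented purely for illustration and are not results of the experiment.

```r
# Toy stand-ins for the right-hand outputs of the two Filter Based Feature
# Selection modules. All values are made up for illustration.
pearsonScores <- data.frame(label = 1, relevant1 = 0.98, irrelevant1 = 0.001)
miScores <- data.frame(irrelevant1 = 0.0002, label = 1, relevant1 = 0.65)

# First module: align the column order, then stack the two score rows.
miScores <- miScores[, colnames(pearsonScores), drop = FALSE]
scores <- rbind(pearsonScores, miScores)

# Second module: prepend a column naming the metric held in each row.
scores <- cbind(Algorithms = c("PearsCorrel", "MI"), scores)

# Local counterpart of the download step: write the table as a CSV file.
csvPath <- file.path(tempdir(), "fia_scores.csv")
write.csv(scores, file = csvPath, row.names = FALSE)
print(scores)
```

The drop = FALSE argument keeps the result a data frame even if only one column is selected, which avoids a silent conversion to a plain vector.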


The last module added to the experiment, Convert to CSV, allows one to download the results in a convenient format (CSV) if needed. The results file is plain text and can be opened in any text editor or in Excel (Figure 4):


Figure 4. Downloaded results file visualized in Excel.

Simulated data properties

FIA results for the relevant columns are shown in Figure 5. Although MI and Pearsonian correlation are on different scales, the two similarity metrics are well correlated with each other. They are also consistent with the “noiseAmplitude” property of the custom R module described in Figure 1. The 2 noiseAmplitude values (.03 and 5) are reused for all 10 relevant columns, such that relevant features 1, 3, 5, 7, and 9 are much better correlated with the labels due to their lower noise amplitude.


Figure 5. FIA results for the 10 relevant features simulated before. Although MI (left axis) and Pearsonian correlation (right axis) are on different scales, both similarity metrics are well correlated.

As expected, across the 1000 irrelevant feature columns, the min, max and average statistics for both MI and Pearsonian Correlation are below 1e-2 (see Table 1).

Table 1. Statistics of similarity metrics for the 1000 irrelevant columns simulated above.

This result depends heavily on sample size (i.e. the number of simulated rows). For row counts significantly smaller than the 1e6 used here, the max and average MI and Pearsonian Correlation values for irrelevant columns may be larger due to the probabilistic nature of simulated data.
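
The effect of sample size can be illustrated with a small experiment. The sketch below is an assumption-laden toy setup, not a rerun of the experiment above: it uses balanced labels, only 200 irrelevant columns, much smaller row counts, and a hypothetical helper named max_abs_correlation.

```r
set.seed(7)

# Largest absolute Pearson correlation between a random 0/1 label and a set
# of uniform features generated independently of it.
# Hypothetical helper written for this example.
max_abs_correlation <- function(nRows, nIrrelevant = 200) {
  label <- as.numeric(runif(nRows) < 0.5)                 # balanced labels
  features <- matrix(runif(nRows * nIrrelevant), nrow = nRows)
  max(abs(cor(features, label)))
}

rowCounts <- c(100, 1000, 10000)
maxima <- sapply(rowCounts, max_abs_correlation)
print(data.frame(rows = rowCounts, maxAbsPearson = maxima))
```

For independent columns the spurious correlations shrink roughly in proportion to one over the square root of the row count, so the largest value at 100 rows is expected to be well above the largest value at 10,000 rows.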


Data simulation is an important tool for understanding ML algorithms. The custom R module presented here is available in Cortana Intelligence Gallery, and its results can be analyzed using the AML module for Filter Based Feature Selection. Future extensions of the simulator should include regression data and multivariate dependencies.