Optimizing rolling feature engineering for time series data

Posted on February 28, 2017

Data Scientist

In this blog post, I want to talk about how data scientists can efficiently perform certain types of feature engineering at scale. Before we dive into sample code, I will briefly set the context of how telemetry data gets generated and why businesses are interested in using such data.

To get started, we know that these days machines are instrumented with multiple built-in sensors that record various measurements while the machines are in operation. As a result, these machines generate a lot of telemetry data, which becomes usable once it is transferred off the machines and stored in a centralized repository. Businesses hope to use this amassed data to help answer questions like, “When is a machine likely to fail?” or, “When does a spare part for a machine need to be re-ordered?” Eventually, this could help them reduce the time and costs incurred in ad hoc maintenance activities.

After having built many models, I have noticed that typical telemetry data, in the raw format generated by the various sensors, adds very little value. Sensors are designed to generate data at regular time intervals, so the data consists of multiple time series that can be sorted by time for each machine to build meaningful additional features. So data scientists like me end up enhancing the dataset by performing additional feature engineering on this raw sensor data.

The features I most commonly begin with are rolling aggregates, which I build on a sample dataset using my preferred statistical programming language. Here are some code snippets showing how I would generate rolling aggregates for a specific window size using R/Python for machines that record voltage, rotation, pressure, and vibration measurements by date. These code snippets can be run in any local R/Python IDE, within a Jupyter notebook, or within an Azure ML Studio environment.



Sample R code

library(dplyr)  # arrange, group_by, mutate, select, filter
library(zoo)    # rollapply

# Mean of each sensor over non-overlapping blocks of 3 readings per machine
telemetrymean <- telemetry %>%
    arrange(machineID, datetime) %>%
    group_by(machineID) %>%
    mutate(voltmean = rollapply(volt, width = 3, FUN = mean, align = "right", fill = NA, by = 3),
           rotatemean = rollapply(rotate, width = 3, FUN = mean, align = "right", fill = NA, by = 3),
           pressuremean = rollapply(pressure, width = 3, FUN = mean, align = "right", fill = NA, by = 3),
           vibrationmean = rollapply(vibration, width = 3, FUN = mean, align = "right", fill = NA, by = 3)) %>%
    select(datetime, machineID, voltmean, rotatemean, pressuremean, vibrationmean) %>%
    filter(!is.na(voltmean)) %>%  # keep only rows where a full window was computed
    ungroup()

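Sample Python code

import pandas as pd

temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    # Pivot to one column per machine, then average each 3-hour bin.
    # The start of the pivot_table call is reconstructed from context; the source shows only its last argument.
    temp.append(pd.pivot_table(telemetry, index='datetime', columns='machineID',
                               values=col).resample('3H', closed='left', label='right').mean().unstack())
telemetry_mean_3h = pd.concat(temp, axis=1)
telemetry_mean_3h.columns = [i + 'mean_3h' for i in fields]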

For a fuller description of the end-to-end use case, please review the R code and Python code.
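To make the windowing concrete, here is a minimal pure-Python sketch of what the R snippet above computes: for each machine, the mean over consecutive, non-overlapping blocks of three readings, with each result labelled by the block's last timestamp. The helper name `tumbling_means` and the toy readings are hypothetical and only for illustration; this is not the code from the linked notebooks.

```python
from collections import defaultdict
from statistics import mean

def tumbling_means(rows, field, width=3):
    """Mean of `field` over consecutive, non-overlapping blocks of `width`
    readings per machine, labelled with each block's last timestamp."""
    by_machine = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["machineID"], r["datetime"])):
        by_machine[row["machineID"]].append(row)

    out = []
    for machine_id, readings in by_machine.items():
        # Step through complete blocks only; a trailing partial block is
        # dropped, similar to filtering out the NA rows in the R snippet.
        for end in range(width, len(readings) + 1, width):
            block = readings[end - width:end]
            out.append({
                "machineID": machine_id,
                "datetime": block[-1]["datetime"],
                field + "mean": mean(r[field] for r in block),
            })
    return out

# Hypothetical toy data: one machine, six hourly voltage readings.
telemetry = [
    {"machineID": 1, "datetime": f"2015-01-01 0{h}:00", "volt": v}
    for h, v in enumerate([170.0, 172.0, 174.0, 160.0, 165.0, 170.0])
]
result = tumbling_means(telemetry, "volt")
```

With six readings and a width of three, `result` holds two rows per machine: one for the first three hours and one for the next three.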

Once my R/Python code has been tested in the local environment with a small dataset and deemed fit, I need to move it into a production environment. I also need to consider the various options for scaling the same computation to a much larger dataset while keeping it efficient. I have noticed that for such large-scale computations it is often more efficient to work with indexed data using some form of SQL query. Here is how I translated the code originally written in R/Python into SQL.

Sample SQL code

select rt.datetime, rt.machineID, rt.voltmean, rt.rotatemean, rt.pressuremean, rt.vibrationmean
from
(select avg(volt) over(partition by machineID order by machineID, datetime rows 2 preceding) as voltmean,
        avg(rotate) over(partition by machineID order by machineID, datetime rows 2 preceding) as rotatemean,
        avg(pressure) over(partition by machineID order by machineID, datetime rows 2 preceding) as pressuremean,
        avg(vibration) over(partition by machineID order by machineID, datetime rows 2 preceding) as vibrationmean,
        row_number() over (partition by machineID order by machineID, datetime) as rn,
        machineID, datetime
from telemetry) rt
where rt.rn % 3 = 0 and rt.voltmean is not null
order by rt.machineID, rt.datetime

For more details please review the SQL code.
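As a quick way to try the windowing logic of the SQL query above without a SQL Server instance, the following sketch runs a cut-down version of it (voltage only) against an in-memory SQLite database from Python. This is an assumption-laden illustration: SQLite stands in for SQL Server, window functions require SQLite 3.25 or later, the frame is spelled out as `rows between 2 preceding and current row`, and the toy table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table telemetry (datetime text, machineID integer, volt real)")

# Hypothetical toy data: two machines, six hourly voltage readings each.
rows = [
    (f"2015-01-01 0{h}:00", machine_id, base + h)
    for machine_id, base in ((1, 170.0), (2, 150.0))
    for h in range(6)
]
conn.executemany("insert into telemetry values (?, ?, ?)", rows)

# Rolling 3-row average per machine; keeping every third row turns the
# rolling window into non-overlapping blocks of three readings.
query = """
select rt.datetime, rt.machineID, rt.voltmean
from (select avg(volt) over (partition by machineID order by datetime
                             rows between 2 preceding and current row) as voltmean,
             row_number() over (partition by machineID order by datetime) as rn,
             machineID, datetime
      from telemetry) rt
where rt.rn % 3 = 0
order by rt.machineID, rt.datetime
"""
result = conn.execute(query).fetchall()
conn.close()
```

Each machine contributes one row per complete block of three readings, because the `rn % 3 = 0` filter discards the rows whose windows overlap or are only partially filled.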

Based on my experience with predictive maintenance use cases, I have found that rolling feature engineering in SQL is best suited to time-ordered time series data split by machine. For on-prem scenarios, SQL Server R Services now also enables R enthusiasts to run their R code for other data wrangling, model building, and even scoring from right within SQL Server. Overall, this ends up being more efficient because there is no data movement, and the computation ends up being scalable.

However, there are many other ways of operationalizing this type of feature engineering at scale. For example, R Server on HDInsight combines the functionality of R with the power of Hadoop and Spark, and Azure Data Lake Analytics now supports running R on petabytes of data. The power of cloud computing can be put towards transforming raw sensor data into meaningful data that can be leveraged for machine learning applications to provide value back to the business.