What’s next for industrial energy efficiency?
Patterns are everywhere. Some can be easy to recognize. Others take time for us to decipher. However, many patterns in data are too complex for a human brain to identify. In these situations, we can rely on computer algorithms to help us out, most notably machine learning algorithms (MLAs).
For those who are not familiar with MLAs, or do not use them regularly, this may sound like a daunting option. Before I gained experience with MLAs, I thought they sounded like something involving robot training! However, MLAs can be as simple or as complex as the user wants to make them, and in many situations they outperform other types of regression and statistical models. In this article, I will discuss a specific MLA called ‘Random Forest’ and how it can be used in the energy industry to forecast project life cycles, along with the savings and incentives attached to them.
Anticipating the annual savings and incentives an energy efficiency program will deliver through completed projects, relative to the annual goal, can be a challenge for utilities. Each year, even when the types of projects and their respective sizes are similar, nuances differentiate how long one project takes to complete compared to another. Because of these subtle differences, it is difficult to generate accurate forecasts of when a project will be completed and of the size of the savings and incentives attached to it. Yet when these subtle differences add up across projects, they can greatly impact the program’s progress-to-goal over the course of the year. Utilities need new methods to forecast program results more accurately: namely, machine learning.
Random Forests are supervised MLAs, meaning they are trained on data where both the inputs and the outputs are known. A single model, called a decision tree, is built by identifying ‘rules’ within the dataset, so the patterns in each combination of input variables are ‘validated’ against the output variable. After a tree is built on a random sample of the data and variables, it is stored, and another is built. This bootstrap-aggregating (‘bagging’) methodology produces a robust set of models that, together, encompass the entire dataset: many decision trees make up a forest! Finally, the trees’ predictions are averaged to create one aggregate model that is representative of all the individual model builds. Because many models are averaged, individual bias, overfitting, and variance are dampened.
Figure 1: Example of a decision tree used to build a Random Forest model. In this case, savings values and the number of measures for a project are being used to estimate project life cycle characteristics.
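To make the bagging idea concrete, here is a minimal sketch in Python using scikit-learn’s RandomForestRegressor. The savings and measure-count inputs are synthetic stand-ins for illustration, not our production data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Hypothetical training data: estimated savings (kWh) and measure
# count as inputs; project duration in days as the output.
X = rng.uniform([1_000, 1], [500_000, 20], size=(200, 2))
y = 30 + 0.0002 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 10, 200)

# Each tree in the forest is fit to a bootstrap sample of the data;
# for regression, the forest's prediction is the average across trees.
forest = RandomForestRegressor(n_estimators=500, random_state=42)
forest.fit(X, y)

tree_preds = np.array([tree.predict(X[:1]) for tree in forest.estimators_])
print(tree_preds.mean())      # average of the individual trees...
print(forest.predict(X[:1]))  # ...equals the forest's prediction
```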
Furthermore, Random Forests can build models with both classification (binned categories: S, M, L) and regression (continuous: temperature) data. Some MLAs, like artificial neural networks, require these data to be separated or converted into numeric bins; Random Forests can handle both data types as is, without defining variable classes. For these reasons, and for their overall accuracy, Random Forests are widely used algorithms.
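As a quick illustration of this flexibility, the sketch below fits a classifier to a binned target and a regressor to a continuous one using the same forest family (all the data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                       # hypothetical project features
size_class = rng.choice(["S", "M", "L"], size=300)  # binned category target
duration = rng.normal(120, 30, size=300)            # continuous target (days)

# Same algorithm family, two target types: no manual class definitions needed.
clf = RandomForestClassifier(n_estimators=200).fit(X, size_class)
reg = RandomForestRegressor(n_estimators=200).fit(X, duration)

print(clf.predict(X[:3]))  # e.g. ['M' 'S' 'L']
print(reg.predict(X[:3]))  # e.g. [118.2  97.5 131.0]
```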
This brings us back to forecasting completed projects and their respective savings and incentives. At DNV, we are utilizing sophisticated machine learning algorithms, like Random Forest, to help generate completion-date forecasts, and here is how:
Figure 2: Time series plot of predicted project duration values (red) compared with observed values (black). This step allows for model validation and improvement based on statistical error metrics.
We have a robust database with historical data from all of the programs we run, some almost a decade old. With this wealth of data, forecasting models can be generated to predict a project’s life cycle and to track the respective savings and incentives at completion, based on past projects and experience. Using variables from a pre-application, our models create a forecast based on the patterns recognized in previous, similar projects. The more input variables and decision trees, the more accurate the forecast can be. These models can drastically help utilities’ pursuit of an accurate annual savings and incentive forecast, with quick delivery of results.
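A simplified sketch of what such a forecast build can look like is below. The file names and pre-application fields (est_savings_kwh, num_measures, est_incentive_usd) are hypothetical placeholders, not our actual schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical extract of historical completed projects.
history = pd.read_csv("historical_projects.csv")

features = ["est_savings_kwh", "num_measures", "est_incentive_usd"]
X = history[features]
y = history["duration_days"]  # days from pre-application to completion

model = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, y)

# Forecast a life cycle for new pre-applications from the same fields.
new_apps = pd.read_csv("pre_applications.csv")
new_apps["forecast_duration_days"] = model.predict(new_apps[features])
```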
Further, during model building and tuning, the dataset is broken into training and testing subsets. The testing subset is used to validate the accuracy of the model, and allows the model builder to fine-tune the model to decrease error.
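In code, that validation step can be as simple as the following sketch, with synthetic data and mean absolute error as one possible error metric:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical project data.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
y = X @ np.array([20.0, 5.0, 10.0]) + 120 + rng.normal(0, 15, 400)

# Hold out 20% of projects for testing; train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

model = RandomForestRegressor(n_estimators=500, random_state=7)
model.fit(X_train, y_train)

# Error on unseen projects guides fine-tuning of the model.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Test MAE: {mae:.1f} days")
```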
Figure 3: Model residuals analysis used for model validation and improvement. The histograms at the top show a relatively normal distribution of model-vs.-observed residuals, with a mean value near 0. The bottom plots are quantile-quantile plots, and indicate there are some predicted outliers in the data that need improvement.
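Residual checks like those in Figure 3 take only a few lines of Python. The sketch below uses synthetic residuals purely to show the plotting mechanics:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic observed and predicted durations, for illustration only.
rng = np.random.default_rng(3)
observed = rng.normal(120, 30, 500)
predicted = observed + rng.normal(0, 12, 500)
residuals = observed - predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: a roughly normal shape with a mean near 0 is a good sign.
ax1.hist(residuals, bins=30)
ax1.set_title("Model vs. observed residuals")

# Quantile-quantile plot: points far off the line flag outliers.
stats.probplot(residuals, dist="norm", plot=ax2)
plt.show()
```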
DNV uses this methodology for forecasting. The forecasts are displayed in a suite of dashboards with monthly and annual bins, allowing the user to visualize the aggregated savings and incentives of completed projects in each forecasted month. The dashboard also displays a data table that lets users see which projects are ‘Past Due’; this flag is generated when a project is taking longer than other, similar projects, per the forecast model.
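A minimal sketch of how such a flag could be computed is below. The column names and dates are invented for illustration; the real dashboards pull from our program database:

```python
import pandas as pd

# Hypothetical open projects with model-forecast durations.
projects = pd.DataFrame({
    "project_id": [101, 102, 103],
    "start_date": pd.to_datetime(["2023-01-10", "2023-03-02", "2023-04-18"]),
    "forecast_duration_days": [90, 150, 60],
})

# Flag projects whose elapsed open time exceeds the forecast duration.
today = pd.Timestamp("2023-07-01")
elapsed = (today - projects["start_date"]).dt.days
projects["past_due"] = elapsed > projects["forecast_duration_days"]
print(projects[["project_id", "past_due"]])
```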
We’ve learned some lessons from using Random Forest models for forecasting. Primarily, data quality assurance is essential for model building; the saying “garbage in, garbage out” rings loudly here. Erroneous dates, too many category levels in a single variable, and outlier data were the most problematic issues encountered during the model build process. Balancing the number of decision trees against accuracy and computational expense was another item that required deep dives.
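For illustration, the sketch below shows one simple check for each of those three data issues, again with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("historical_projects.csv")  # hypothetical extract

# 1. Erroneous dates: unparseable values, or completion before start.
start = pd.to_datetime(df["start_date"], errors="coerce")
end = pd.to_datetime(df["completion_date"], errors="coerce")
bad_dates = start.isna() | end.isna() | (end < start)

# 2. Too many category levels in one variable (threshold is arbitrary).
high_cardinality = [c for c in df.select_dtypes("object")
                    if df[c].nunique() > 30]

# 3. Outlier durations, flagged via the interquartile range.
duration = (end - start).dt.days
q1, q3 = duration.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (duration < q1 - 1.5 * iqr) | (duration > q3 + 1.5 * iqr)

print(bad_dates.sum(), high_cardinality, outliers.sum())
```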
Utilizing predictive models to assist with forecasting project completion, and the respective savings and incentives values, can be very advantageous for analysts and utilities alike. With volumes of data being stored in our database, building models that find patterns and create forecasts is needed to stay ahead of the curve in the industry. Further, knowing where a program stands with respect to its goals can help a utility and its analysts track their progress throughout the year, and drive projects to completion when needed.
About the author: Jeff Craft has been with DNV for almost three years, and holds a Master of Science degree in Atmospheric Science from North Carolina State University, along with Bachelor’s degrees in Meteorology and Business Communication from Arizona State University. From predictive data analytics to wind power forecasting, he has utilized his unique background to enhance his capabilities within the energy industry over the past seven years.