
ELECTRIC POWER SYSTEMS

Chaos theory applied to input space representation of autonomous neural network-based short-term load forecasting models

Teoria do caos aplicada à definição do conjunto de entradas de modelos neurais autônomos para previsão de carga em curto prazo

Vitor Hugo Ferreira I; Alexandre Pinto Alves da Silva II

I Electrical Engineering Department, Fluminense Federal University (UFF), Rua Passo da Pátria, 156, Sala 509, Bloco D, CEP 24210-240, Niterói, RJ. vitor@vm.uff.br

II Electrical Engineering Program, PEE-COPPE, Federal University of Rio de Janeiro (UFRJ), P.O. Box 68504, CEP 21945-972, Rio de Janeiro, RJ. alex@coep.ufrj.br

ABSTRACT

After 1991, the literature on load forecasting has been dominated by neural network based proposals. However, one major risk in using neural models is the possibility of excessive training, i.e., data overfitting. The extent of nonlinearity provided by neural network based load forecasters, which depends on the input space representation, has been adjusted using heuristic procedures. The empirical nature of these procedures makes their application cumbersome and time consuming. Autonomous modeling, including automatic input selection and model complexity control, has been proposed recently for short-term load forecasting. However, these techniques require the specification of an initial input set to be processed by the model in order to select the most relevant variables. This paper explores chaos theory as a non-linear time series analysis tool to automatically select the lags of the load series data that will be used by the neural models. Bayesian inference applied to multi-layered perceptrons and relevance vector machines are used in the development of the autonomous neural models.

Keywords: Load Forecasting, Artificial Neural Networks, Input Selection, Chaos Theory, Chaotic Synchronization, Bayesian Inference, Multi-layered Perceptron, Relevance Vector Machines.

RESUMO

Após 1991, a literatura sobre previsão de carga passou a ser dominada por propostas baseadas em modelos neurais. Entretanto, um empecilho na aplicação destes modelos reside na possibilidade do ajuste excessivo dos dados, i.e., overfitting. O excesso de não-linearidade disponibilizado pelos modelos neurais de previsão de carga, que depende da representação do espaço de entrada, vem sendo ajustado de maneira heurística. Modelos autônomos incluindo técnicas automáticas e acopladas para seleção de entradas e controle de complexidade dos modelos foram propostos recentemente para previsão de carga em curto prazo. Entretanto, estas técnicas necessitam da especificação do conjunto inicial de entradas que será processado pelo modelo visando determinar aquelas mais relevantes. Este trabalho explora a teoria do caos como ferramenta de análise não-linear de séries temporais na definição automática do conjunto de atrasos de uma dada série de carga a serem utilizados como entradas de modelos neurais autônomos. Neste trabalho, inferência Bayesiana aplicada a perceptrons de múltiplas camadas e máquinas de vetores relevantes são utilizadas no desenvolvimento de modelos neurais autônomos.

Palavras-chave: Previsão de carga, Redes Neurais Artificiais, Seleção de Entrada, Teoria do Caos, Sincronização Caótica, Inferência Bayesiana, Perceptron de Múltiplas Camadas, Máquinas de Vetores Relevantes.

1 INTRODUCTION

The decision-making process in power systems, including economic dispatch, hydrothermal coordination, automatic generation control, energy trading and so on, requires knowledge of the future behavior of the load dynamics. Over the last two decades, many load forecasting models have been proposed, with neural network based models receiving great attention. This is because they have shown superior prediction performance, especially for short-term applications (Hippert et. al., 2001). In fact, neural network based models have presented outstanding results for multivariate problems involving databases with huge cardinality, such as the short-term load forecasting problem (Ferreira and Alves da Silva, 2007), (Ferreira and Alves da Silva, 2009) and (Ferreira and Alves da Silva, 2010). Even though they are more robust than traditional models, critical questions such as input space representation and complexity control of neural networks have not received the necessary attention.

The input selection stage is one of the most important tasks in the development of load forecasting models. Feature extraction via non-linear techniques such as wavelets uses only information about the time-series to be predicted, without direct concern for forecasting accuracy. In this sense, an input selection methodology directly related to the neural network model is required. Methods that use the model itself in the input selection step are called wrapper methods, and those that consider only the dynamics and statistics of the time-series are called filter methods (Guyon and Elisseeff, 2003). For forecasting purposes, wrapper methods are preferable, since they aim to select the inputs that are most suitable to the model in terms of forecasting performance.

The complexity control of neural models aims to adjust the non-linear extent of the neural network to the regularity exhibited by the data. This step is necessary to avoid the harmful modeling of the noisy component of the data, known as overfitting, which can compromise the generalization capacity of the neural model, i.e., good predictions for unseen data.

Autonomous neural forecasting models, including automatic input selection, complexity control and structure selection, are necessary to reduce the need for intervention from experts. Such automatic procedures allow the extension of forecasting to the bus load level. Autonomous neural network load forecasting models have been proposed in the literature (Ferreira and Alves da Silva, 2007), based on Bayesian Inference Applied to Multi-Layered Perceptrons (BIAMLPs) and on Support Vector Machine (SVM) training and specification. These models include automatic and coupled procedures for input selection, complexity control and model specification. However, they still require the definition of an initial set of inputs.

In order to improve the autonomous capability of the models proposed in (Ferreira and Alves da Silva, 2007), techniques for the automatic definition of the initial set of inputs from the available time-series are necessary. This paper investigates the application of Chaos Theory as a tool for the automatic definition of the initial set of inputs to be used with the autonomous neural models proposed in (Ferreira and Alves da Silva, 2007). BIAMLPs are used in this paper and compared with Relevance Vector Machines (RVMs). Being a sparse kernel model, an RVM can be seen as an SVM derived from the application of Bayesian inference. The forecasting performance of the models is compared using three public load and temperature databases. The main contributions of the paper can be summarized as follows:

a) proposal of an automatic method for selecting inputs of neural network load forecasting models, based only on time-series and calendar information; and

b) evaluation of the applicability of RVMs to the load forecasting problem.

This paper is organized as follows. In Section 2, Chaos Theory is presented in the context of input space reconstruction. BIAMLPs are described in Section 3. Section 4 is devoted to the description of RVMs. The proposed autonomous modeling framework is presented in Section 5. The database description and results are shown in Section 6. The discussion, main conclusions and future work are presented in Section 7.

2 CHAOS THEORY

The development of Chaos Theory is motivated by the study of dynamical systems sensitive to initial conditions. After the transient effects, a dynamical system $F(X): \mathbb{R}^D \rightarrow \mathbb{R}^D$ evolving in a state space $X \in \mathbb{R}^D$ can be defined by the following expression:

$$X(k+1) = F\left[X(k)\right] \qquad (1)$$

From the current state X(k), all of the subsequent states of the deterministic system described by equation (1) can be obtained. The sensitivity to initial conditions makes the trajectory of the system dependent on the knowledge of the function F(X) and on the value of the initial state. The set of initial conditions that asymptotically drives the system to a given region of the space is called the basin of attraction, and the region to which the system is driven is named the attractor (Kantz and Schreiber, 1997).

2.1 TAKENS' THEOREM

The above definitions are valid in the multidimensional space where the system F(X) is confined. However, in practice, only scalar measures x(k), k = 1, 2, ..., N, are available through a measurement function $s(X): \mathbb{R}^D \rightarrow \mathbb{R}$, i.e.,

$$x(k) = s\left[X(k)\right] + \eta(k) \qquad (2)$$

where η(k) represents the measurement noise.

The measurement function s(X) compresses the multivariate information contained in X(k) into a scalar measure x(k), projecting non-observable variables of the system onto a real scale. Since s(X) is unknown, in the presence of measurement noise η(k) the perfect reconstruction of X(k) from a set of measures x(k) is impossible. However, the perfect estimation of the original space is unnecessary; it is sufficient to define a new representation space with an equivalent attractor (Takens, 1981). Called the embedded space, this space can be obtained from the equation:

$$\mathbf{x}(k) = \left[x(k),\; x(k-\tau),\; \ldots,\; x\left(k-(d-1)\tau\right)\right]^{t} \qquad (3)$$

where τ and d are parameters named delay and embedding dimension, respectively.

Takens' Theorem (Takens, 1981) defines the conditions under which the attractor in the embedded space $\mathbf{x} \in \mathbb{R}^d$, given by equation (3), is equivalent to the attractor in the original space $X \in \mathbb{R}^D$. For unlimited, noise-free data, and assuming the existence of a mapping $Z(\mathbf{x}): \mathbb{R}^d \rightarrow \mathbb{R}^D$ and the corresponding inverse mapping $Z^{-1}(X): \mathbb{R}^D \rightarrow \mathbb{R}^d$, both smooth, continuous, bi-unique and continuously differentiable, $\mathbf{x} \in \mathbb{R}^d$ will be an immersion of $X \in \mathbb{R}^D$ if d > 2D, for τ arbitrarily chosen. While Takens' Theorem devotes attention only to the embedding dimension d, in practical applications the choice of the embedding delay τ is also vital for the definition of the embedded space (Abarbanel et. al., 1993).
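For concreteness, the reconstruction in equation (3) amounts to stacking lagged copies of the series. The following is a minimal Python/NumPy sketch, assuming 0-based array indexing; the later sketches in this section reuse it.

```python
import numpy as np

def delay_embedding(x, tau, d):
    """Build the reconstructed vectors of equation (3):
    x(k) = [x(k), x(k - tau), ..., x(k - (d - 1) tau)]^t,
    one per row, for every k admitting all d lags."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (d - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for the requested (tau, d)")
    # Column j holds x(k - j*tau); j = 0 is the most recent sample.
    return np.column_stack([x[(d - 1 - j) * tau : (d - 1 - j) * tau + n_vectors]
                            for j in range(d)])
```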

There are many criteria proposed in the literature for the definition of τ, including techniques based on geometrical and statistical foundations, the statistical ones being more widely used and more suitable for time-series applications (Kantz and Schreiber, 1997). Among the statistical criteria, the analysis of the autocorrelation function of x(k), rxx(k), is the simplest technique. In order to pursue a trade-off between attractor compression and reconstruction based on almost uncorrelated directions, the first minimum of the absolute value of rxx(k), |rxx(k)|, can be used as an estimate for τ. Although simple, the definition of τ based on the analysis of rxx(k) generally does not avoid attractor collapse, since non-linear interdependences can fold the attractor along trajectories of this nature.

Information Theory provides indices for the evaluation of general relationships (linear or non-linear) among random variables. The mutual information, Ix(r), measures the degree of information that x(k - r) gives about x(k), i.e., the reduction of uncertainty about x(k) due to the knowledge of x(k - r). Using variable discretization to estimate the required probabilities, Ix(r) is given by:

$$I_{x}(r) = \sum_{i=1}^{p}\sum_{j=1}^{p} P\left[x(k) \in v_i,\, x(k-r) \in v_j\right] \log \frac{P\left[x(k) \in v_i,\, x(k-r) \in v_j\right]}{P\left[x(k) \in v_i\right]\, P\left[x(k-r) \in v_j\right]} \qquad (4)$$

where p represents the number of intervals in the discretization; P[x(k - r) ∈ vj] is the marginal probability of x(k - r) in the vj interval; and P[x(k) ∈ vi, x(k - r) ∈ vj] is the joint probability of the discretized x(k) and x(k - r). Similarly to the analysis of rxx(k), the first minimum of Ix(r) can be used as an estimate of τ (Fraser and Swinney, 1986).
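A histogram-based estimate of equation (4), and the corresponding first-minimum rule for τ, can be sketched as follows; the number of discretization intervals (p = 16) and the maximum delay searched are illustrative choices, not values taken from the paper.

```python
import numpy as np

def mutual_information(x, r, p=16):
    """Histogram estimate of I_x(r) in equation (4), using p equal-width
    intervals to discretize x(k) and x(k - r); result in nats."""
    a, b = x[r:], x[:-r]                     # pairs (x(k), x(k - r))
    joint, _, _ = np.histogram2d(a, b, bins=p)
    pij = joint / joint.sum()                # joint probabilities
    pi = pij.sum(axis=1, keepdims=True)      # marginal of x(k)
    pj = pij.sum(axis=0, keepdims=True)      # marginal of x(k - r)
    mask = pij > 0
    return float(np.sum(pij[mask] * np.log(pij[mask] / (pi @ pj)[mask])))

def first_minimum_delay(x, r_max=100, p=16):
    """Estimate tau as the first local minimum of I_x(r), r = 1, ..., r_max."""
    x = np.asarray(x, dtype=float)
    mi = [mutual_information(x, r, p) for r in range(1, r_max + 1)]
    for i in range(1, len(mi) - 1):
        if mi[i] < mi[i - 1] and mi[i] <= mi[i + 1]:
            return i + 1                     # delay corresponding to mi[i]
    return int(np.argmin(mi)) + 1            # fallback: global minimum
```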

The literature about the estimation of the embedding dimension d shows several techniques based on the calculation of invariant features of the attractor (Kantz and Schreiber, 1997) and (Abarbanel et. al., 1993). Besides being computationally intensive, these techniques are very subjective, requiring constant intervention of experts during the modeling stage. One of the most popular techniques for estimating the embedding dimension d is based on the identification of spurious trajectories, being known as the false nearest neighbors method (Kennel et. al., 1992). This denomination comes from the way the spurious intersections of the attractor are identified: through the observation of changes in the neighborhood of a given point as the dimension increases. Points that are neighbors due to the system dynamics remain neighbors when d increases. Points that leave the neighborhood as the dimension increases are called false nearest neighbors; they were located in the neighborhood of the testing point only because of the incomplete reconstruction of the attractor.

In order to increase the automation level of the false nearest neighbors method, Cao (Cao, 1997) proposed a practical method for estimating d. Let Δ(i, j, d) be the distance between the points x(i) and x(j), both reconstructed in dimension d, given by:

$$\Delta(i, j, d) = \max_{1 \leq l \leq d} \left| x_l(i) - x_l(j) \right| \qquad (6)$$

In equation (6), $x_l(i)$ represents the l-th element of the vector x(i) at instant i, and Δ(i, j, d) is the infinite norm of the difference between x(i) and x(j). The nearest neighbor of x(i) is the point for which Δ(i, j, d) is minimum, i.e.,

$$\Delta\left[i, n(i,d), d\right] = \min_{j \neq i} \Delta(i, j, d) \qquad (7)$$

where n(i, d) is the index of the vector x[n(i, d)] closest to x(i) in the space of dimension d, according to the Δ(i, j, d) metric.

Additionally, let a(i, d) be the relation between nearest neighbors in consecutive dimensions d and (d + 1), given by:

$$a(i, d) = \frac{\Delta\left[i, n(i,d), d+1\right]}{\Delta\left[i, n(i,d), d\right]} \qquad (8)$$

In equation (8), if Δ[i, n(i, d), d] is zero, n(i, d) is replaced by the index of the next (adjacent) nearest neighbor. The mean value of a(i, d) is used to define the J(d) statistic:

$$J(d) = \frac{1}{N_d} \sum_{i=1}^{N_d} a(i, d) \qquad (9)$$

where $N_d$ is the number of points available in both dimensions d and d + 1.

The relative variation δ(d) of this statistic due to the increase of the embedding dimension d is given by:

$$\delta(d) = \frac{J(d+1)}{J(d)} \qquad (10)$$

According to (Cao, 1997), for time-series originated from an attractor, the variation δ(d) stabilizes when the embedding dimension d is greater than a value d0. In other words, in dimensions above d0 the number and location of false nearest neighbors do not change, so that J(d) stops changing. Thus, the embedding dimension is given by d = d0 + 1.
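A sketch of these statistics is given below, reusing the delay_embedding routine above. Two simplifications are assumed: a brute-force O(N²) neighbor search, and skipping coincident points instead of moving to the next nearest neighbor as the text prescribes.

```python
import numpy as np

def cao_delta(x, tau, d_max=30):
    """Sketch of Cao's statistics: J(d) from equation (9) and
    delta(d) = J(d+1)/J(d) from equation (10), for d = 1, ..., d_max."""
    J = []
    for d in range(1, d_max + 2):
        X1 = delay_embedding(x, tau, d + 1)          # dimension d + 1
        X = delay_embedding(x, tau, d)[-len(X1):]    # dimension d, aligned with X1
        n = len(X1)
        a = []
        for i in range(n):
            # Infinite-norm distances of equation (6) in dimension d.
            dist = np.max(np.abs(X - X[i]), axis=1)
            dist[i] = np.inf
            j = int(np.argmin(dist))                 # nearest neighbor, equation (7)
            if dist[j] == 0.0:
                continue                             # simplification: skip coincident points
            # Ratio between consecutive dimensions, equation (8).
            a.append(np.max(np.abs(X1[j] - X1[i])) / dist[j])
        J.append(np.mean(a))
    J = np.array(J)
    return J[1:] / J[:-1]                            # delta(d) for d = 1, ..., d_max
```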

For the automatic detection of the stabilization dimension d0, let dmax be the maximum embedding dimension for which the statistic δ(d) is calculated, supposing that the stabilization of δ(d) occurs for d0 < dmax. Given the pairs [d, δ(d)], d = 1, 2, ..., dmax, a linear regression model of the evolution of δ(d) along d is estimated, i.e.,

$$\delta(d) = u + v\,d + \varepsilon(d) \qquad (11)$$

A hypothesis test on the linear model given by equation (11) is performed, at the α significance level, with the null hypothesis defined as v equal to zero, i.e., the angular coefficient of the model being null (Griffiths et. al., 1993). If the null hypothesis is rejected, the first pair [d, δ(d)] is removed and a new linear regression model like equation (11) is estimated considering only the points d = 2, 3, ..., dmax. This procedure is repeated until the null hypothesis cannot be rejected, i.e., until the hypothesis of constant δ(d) cannot be discarded. The stabilization point of the δ(d) statistic is then found, with the embedding dimension given by the first dimension used in the estimation of the linear regression model for which the null hypothesis is not rejected.

The heuristic defined above depends on two parameters, dmax and α. The choice of the significance level α, although heuristic, is more intuitive than the choice of the parameters that must be specified in other embedding dimension estimation approaches. The definition of dmax is directly related to the computational effort. In this work, dmax = 30 and α = 0.01.
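The stabilization test can be sketched with an off-the-shelf regression routine; scipy's linregress reports the two-sided p-value of the zero-slope null hypothesis, which is the test described above.

```python
import numpy as np
from scipy import stats

def stabilization_dimension(delta, alpha=0.01):
    """Detect d0 by successively dropping the leading pairs [d, delta(d)]
    until the slope of the fitted line in equation (11) is no longer
    significantly different from zero at level alpha."""
    d = np.arange(1, len(delta) + 1, dtype=float)
    for start in range(len(delta) - 2):
        result = stats.linregress(d[start:], delta[start:])
        if result.pvalue > alpha:        # null hypothesis v = 0 not rejected
            return int(d[start])         # first dimension of the accepted fit
    return len(delta)                    # fallback: no stabilization detected

# Per the text, the embedding dimension is then taken as d = d0 + 1.
```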

2.2 CHAOTIC SYNCHRONIZATION

Let us assume two discrete chaotic systems, an autonomous driving system $X \in \mathbb{R}^D$ and a response system $Y \in \mathbb{R}^R$, with dynamics given by the equations:

$$X(k+1) = F\left[X(k)\right], \qquad Y(k+1) = U\left[Y(k), X(k)\right] \qquad (12)$$

In equation (12), $F(X): \mathbb{R}^D \rightarrow \mathbb{R}^D$ and $U(Y, X): \mathbb{R}^R \times \mathbb{R}^D \rightarrow \mathbb{R}^R$ represent the dynamics of the driving and response systems, respectively. These systems will be in generalized synchronism if their trajectories along their state spaces are related, i.e., if a function $\varphi(X): \mathbb{R}^D \rightarrow \mathbb{R}^R$ can be defined such that:

$$Y(k) = \varphi\left[X(k)\right] \qquad (13)$$

Since the equations that define the functions F(X), U(Y, X) and φ(X) are unknown, methods for the detection of synchronism based on data collected from these systems are required.

Rulkov and co-workers (Rulkov et. al., 1995) proposed a method for synchronism detection based on the idea of false nearest neighbors. Called mutual false nearest neighbors, the method assumes that the function φ(X) exists and is smooth and differentiable. In this case, neighboring points in the X space will be associated with neighboring points in the response system Y.

Let X[n(i, D)] be the nearest neighbor of X(i). Assuming that φ(X) exists and that the distance between nearest neighbors in each state space is small, the approximate relation between neighbors can be derived (Rulkov et. al., 1995) as follows:

$$Y\left[n(i,D)\right] - Y(i) \approx D\varphi\left[X(i)\right]\left\{X\left[n(i,D)\right] - X(i)\right\} \qquad (14)$$

In equation (14), $D\varphi(X): \mathbb{R}^D \rightarrow \mathbb{R}^{R \times D}$ is the Jacobian matrix of φ(X). Similarly, observing the nearest neighbor of Y(i) in the state space of the response system, denoted by Y[n(i, R)],

$$Y(i) - Y\left[n(i,R)\right] \approx D\varphi\left[X(i)\right]\left\{X(i) - X\left[n(i,R)\right]\right\} \qquad (15)$$

The ratio between the Euclidean norms of equations (14) and (15) is given by:

$$M\left[X(i), Y(i)\right] = \frac{\left\| Y\left[n(i,D)\right] - Y(i) \right\|}{\left\| X\left[n(i,D)\right] - X(i) \right\|} \cdot \frac{\left\| X(i) - X\left[n(i,R)\right] \right\|}{\left\| Y(i) - Y\left[n(i,R)\right] \right\|} \qquad (16)$$

If the mapping φ(X) exists, then the index M[X(i), Y(i)] will be close to one for all i.

Since the original state spaces X and Y are unknown, let $\mathbf{y}(k) \in \mathbb{R}^r$ be the reconstructed space of the response system Y and $\mathbf{x}(k) \in \mathbb{R}^d$ the reconstruction of the driving system X, both obtained from Takens' Theorem, given by equation (3). Let $\mathbf{x}'(k) \in \mathbb{R}^r$ be an auxiliary reconstruction of the driving system X with embedding dimension equal to the one obtained for the response system Y. The nearest neighbors, in the sense of the infinite norm given by equation (6), of each point in each embedded space are calculated, with $\mathbf{y}[n(k, r)]$ being the nearest neighbor of $\mathbf{y}(k)$ and $\mathbf{x}[n(k, d)]$ the nearest neighbor of $\mathbf{x}(k)$; the auxiliary reconstruction $\mathbf{x}'$ is evaluated at the same neighbor indices. Then, the index $m[\mathbf{x}(k), \mathbf{y}(k), d, r]$, known as mutual false nearest neighbors, can be defined by the following equation (Rulkov et. al., 1995):

$$m\left[\mathbf{x}(k), \mathbf{y}(k), d, r\right] = \frac{\left\| \mathbf{y}(k) - \mathbf{y}\left[n(k,d)\right] \right\|}{\left\| \mathbf{x}(k) - \mathbf{x}\left[n(k,d)\right] \right\|} \cdot \frac{\left\| \mathbf{x}'(k) - \mathbf{x}'\left[n(k,r)\right] \right\|}{\left\| \mathbf{y}(k) - \mathbf{y}\left[n(k,r)\right] \right\|} \qquad (17)$$

Similarly to the index M[X(k), Y(k)], the value of m[x(k), y(k), d, r] is expected to be close to 1 for all k. However, since the embedded spaces y(k), x(k) and x'(k) are constructed from noisy data, the mean value of m[x(k), y(k), d, r] over all available data, denoted $\bar{m}[\mathbf{x}(k), \mathbf{y}(k)]$, is used for synchronism detection. In this case, if the mapping φ(X) exists, the mean value is expected to be close to 1; otherwise, it will be greater than 1.
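A sketch of the synchronism detector follows, reusing delay_embedding; the pairing of neighbors reflects the reconstructed form of equation (17) above, which is our reading of (Rulkov et. al., 1995).

```python
import numpy as np

def nn_index(Z, k):
    """Nearest neighbor of Z[k] under the infinite norm of equation (6)."""
    dist = np.max(np.abs(Z - Z[k]), axis=1)
    dist[k] = np.inf
    return int(np.argmin(dist))

def mean_mfnn(x, y, tau_x, d_x, tau_y, d_y):
    """Mean of the index of equation (17); values close to 1 suggest
    generalized synchronism between driving x and response y."""
    X = delay_embedding(x, tau_x, d_x)     # driving system, dimension d
    Y = delay_embedding(y, tau_y, d_y)     # response system, dimension r
    Xa = delay_embedding(x, tau_x, d_y)    # auxiliary reconstruction x', dimension r
    n = min(len(X), len(Y), len(Xa))
    X, Y, Xa = X[-n:], Y[-n:], Xa[-n:]     # align the most recent n states
    m = []
    for k in range(n):
        i = nn_index(X, k)                 # n(k, d): neighbor of x(k)
        j = nn_index(Y, k)                 # n(k, r): neighbor of y(k)
        num = np.linalg.norm(Y[k] - Y[i]) * np.linalg.norm(Xa[k] - Xa[j])
        den = np.linalg.norm(X[k] - X[i]) * np.linalg.norm(Y[k] - Y[j])
        if den > 0.0:
            m.append(num / den)
    return float(np.mean(m))
```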

2.3 CHAOS INPUT SELECTION ALGORITHM

The application of Takens' Theorem and chaotic synchronization to input selection for neural network forecasting models can be summarized as follows:

1. Given a time-series database, define the series to be predicted, $y(k) \in \mathbb{R}$, k = 1, 2, ..., N, and the exogenous time-series, $x_i(k) \in \mathbb{R}$, k = 1, 2, ..., N, i = 1, 2, ..., S, where N is the number of points and S the number of available exogenous time-series;

2. Define the maximum dimension parameter dmax and the significance level α. In this work, dmax = 30 and α = 0.01;

3. Estimate the embedding parameters $\tau_y$ and $d_y$ using the methods described in section 2.1, obtaining the reconstructed space $\mathbf{y}(k) \in \mathbb{R}^{d_y}$ via Takens' Theorem, given by equation (3), with $k = (d_y - 1)\tau_y + 1,\ (d_y - 1)\tau_y + 2,\ \ldots,\ N$;

4. For each exogenous time-series $x_i(k) \in \mathbb{R}$, do:

(a) Estimate the embedding parameters $\tau_{x_i}$ and $d_{x_i}$ for the reconstructed space $\mathbf{x}_i(k) \in \mathbb{R}^{d_{x_i}}$ given by equation (3), with $k = (d_{x_i} - 1)\tau_{x_i} + 1,\ (d_{x_i} - 1)\tau_{x_i} + 2,\ \ldots,\ N$;

(b) Detect the existence of synchronism between $\mathbf{y}(k)$ and $\mathbf{x}_i(k)$ by calculating the mean of m[x(k), y(k), d, r], i.e., $\bar{m}[\mathbf{x}(k), \mathbf{y}(k)]$, as required by the mutual false nearest neighbors method (section 2.2);

(c) If synchronism does not exist, i.e., if $\bar{m}[\mathbf{x}(k), \mathbf{y}(k)] \gg 1$, discard the reconstruction $\mathbf{x}_i(k)$. Otherwise, include $\mathbf{x}_i(k)$ in the input set.

5. If other information is available, e.g., qualitative information, binary variables, etc., insert it in the input representation.

Once the initial input space representation is defined, a neural network can be applied to model equation (12), i.e., the function that maps Y(k) and X(k) onto y(k + 1), the first element of the vector Y(k + 1). A sketch stringing the previous routines together is given below.
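In the following sketch, the numeric threshold used as a proxy for "mean index much greater than 1" is a hypothetical illustration, since the text leaves it unquantified.

```python
def chaos_input_selection(y, exogenous, mfnn_threshold=2.0):
    """Sketch of the algorithm in section 2.3, combining the previous
    routines. 'exogenous' maps series names to arrays; 'mfnn_threshold'
    is an assumed cutoff for discarding non-synchronized series."""
    tau_y = first_minimum_delay(y)                           # step 3
    d_y = stabilization_dimension(cao_delta(y, tau_y)) + 1
    selected = {"load_lags": (tau_y, d_y)}
    for name, x in exogenous.items():                        # step 4
        tau_x = first_minimum_delay(x)                       # step 4(a)
        d_x = stabilization_dimension(cao_delta(x, tau_x)) + 1
        if mean_mfnn(x, y, tau_x, d_x, tau_y, d_y) < mfnn_threshold:
            selected[name] = (tau_x, d_x)                    # step 4(c): keep
    return selected

# Step 5 (calendar dummies and other qualitative inputs) is appended afterwards.
```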

3 BAYESIAN INFERENCE APPLIED TO MLP TRAINING AND SPECIFICATION

Let $\mathbf{x} \in \mathbb{R}^n$ be the vector containing the input signals and $\mathbf{w} \in \mathbb{R}^M$ the vector with all weights and biases of an MLP with one hidden layer and a single output, where M = mn + 2m + 1, with m equal to the number of neurons in the hidden layer. The biases of the sigmoidal functions in the hidden layer are represented by $b_k$, with b being the bias of the single linear neuron of the output layer. The output of this MLP is given by:

$$f(\mathbf{x}, \mathbf{w}) = \sum_{k=1}^{m} \bar{w}_k\, \phi\!\left( \sum_{j=1}^{n} w_{kj}\, x_j + b_k \right) + b \qquad (18)$$

where $\phi(\cdot)$ is the sigmoidal activation function, $w_{kj}$ the weight connecting input j to hidden neuron k, and $\bar{w}_k$ the weight connecting hidden neuron k to the output.

Given a dataset $U = \{\mathbf{X}, Y\}$ with N input-output pairs, $\mathbf{X} \in \mathbb{R}^{N \times n}$, $Y \in \mathbb{R}^{N}$, $Y = [d_1, d_2, \ldots, d_N]^t$, with $d_j$ being the desired output, the objective of training an MLP from the Bayesian perspective is the estimation of the vector w that maximizes the posterior probability given by:

$$p(\mathbf{w} \mid \mathbf{X}, Y) = \frac{p(\mathbf{w}, Y \mid \mathbf{X})}{p(Y \mid \mathbf{X})} \qquad (19)$$

From the definition of joint probability,

$$p(\mathbf{w}, Y \mid \mathbf{X}) = p(Y \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{X}) \qquad (20)$$

and

$$p(\mathbf{w} \mid \mathbf{X}) = p(\mathbf{w}) \qquad (21)$$

since the input patterns are independent of the value of w. Putting these results in equation (19):

$$p(\mathbf{w} \mid \mathbf{X}, Y) = \frac{p(Y \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(Y \mid \mathbf{X})} \qquad (22)$$

In equation (22), $p(Y \mid \mathbf{X}, \mathbf{w})$ is the likelihood of Y, $p(\mathbf{w})$ the prior probability of w, and $p(Y \mid \mathbf{X}) = \int p(Y \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}$ a normalization factor.

The prior probability p(w) represents the prior knowledge about the behavior of w. Prior insights about specific values of w for general problems are unknown, but models with small weights tend to reproduce smooth mappings (Bishop, 1995). The likelihood $p(Y \mid \mathbf{X}, \mathbf{w})$ represents the knowledge about the distribution of the noise in the desired output. Assuming that w follows a Gaussian distribution with null mean vector and diagonal covariance matrix equal to $\alpha^{-1}\mathbf{I}$, where $\mathbf{I}$ is the identity matrix of dimension M × M, and that the desired output is corrupted by additive Gaussian white noise with variance $\beta^{-1}$, i.e., $d_j = f(\mathbf{x}_j, \mathbf{w}) + \zeta_j$, the application of equation (22) results in:

$$p(\mathbf{w} \mid \mathbf{X}, Y) \propto \exp\left[-S(\mathbf{w})\right] \qquad (23)$$

where

$$S(\mathbf{w}) = \frac{\beta}{2} \sum_{j=1}^{N} \left[d_j - f(\mathbf{x}_j, \mathbf{w})\right]^2 + \frac{\alpha}{2} \left\| \mathbf{w} \right\|^2 \qquad (24)$$

Therefore, maximizing the posterior probability $p(\mathbf{w} \mid \mathbf{X}, Y)$ is equivalent to minimizing S(w).

For multivariate problems, the use of a single prior for all weights and biases is not recommended (Ferreira and Alves da Silva, 2007), since weights that connect different kinds of inputs are not expected to have the same distribution in weight space. In (Ferreira and Alves da Silva, 2007), the weights that connect each input to the neurons in the hidden layer are grouped, with each group having its own prior distribution $p(\mathbf{w}_i)$. All priors are Gaussian with null mean vector and respective diagonal covariance matrix $\alpha_i^{-1}\mathbf{I}_i$, with $\mathbf{I}_i$ being the identity matrix of dimension $M_i \times M_i$, where $M_i$ represents the number of weights or biases included in the i-th group. The same idea is applied to the groups of weights associated with the biases (one $\alpha_i$ for the connections with the hidden neurons and another for the output neuron connection). One last $\alpha_i$ is associated with all connection weights between the hidden and output layers. Therefore, for n-dimensional input vectors x, the total number of $\alpha_i$'s is n + 3. In this case, S(w) is given by:

$$S(\mathbf{w}) = \frac{\beta}{2} \sum_{j=1}^{N} \left[d_j - f(\mathbf{x}_j, \mathbf{w})\right]^2 + \sum_{i=1}^{n+3} \frac{\alpha_i}{2} \left\| \mathbf{w}_i \right\|^2 \qquad (25)$$

Details about the iterative algorithm for the minimization of S(w) and the estimation of the parameters and hyperparameters $\alpha_i$'s and β can be found in (Mackay, 1992) and (Bishop, 1995).

The magnitudes of the $\alpha_i$'s related to the input connections can be used to rank the relevance of each input signal in the calculation of the output. This characteristic makes this specification of the priors known as Automatic Relevance Determination (ARD) (Mackay, 1992). Besides the ranking capacity, irrelevance levels must be specified to determine the irrelevant inputs that should be discarded by the model. Since this irrelevance threshold is problem dependent, (Ferreira and Alves da Silva, 2007) proposed an empirical method for its automatic determination. Artificial probe signals, unrelated to the desired output and generated from uniform distributions, are included in the original input space. After training the MLP with this augmented input space, the $\alpha_i$ related to the probe signals is used as the irrelevance threshold. The relevant inputs are then selected and the final model is trained. For continuous variables, the probe signals are generated from a uniform distribution defined on the same scale as the original normalized inputs. For dummy variables, a discrete uniform distribution is used. Since this technique uses the model along the input selection step, it can be included in the group of wrapper methods (Guyon and Elisseeff, 2003). More details can be found in (Ferreira and Alves da Silva, 2007). A minimal sketch of this probe-based selection is shown below.
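In the sketch, train_ard stands for a hypothetical training routine (not part of the paper's code) that trains the Bayesian MLP on the augmented input matrix and returns one $\alpha_i$ per input column.

```python
import numpy as np

def select_inputs_with_probe(X, train_ard, rng=None):
    """Probe-based irrelevance threshold (a sketch). A larger alpha_i
    shrinks the corresponding input weights toward zero, marking that
    input as less relevant; inputs whose alpha_i exceeds the probe's
    alpha are discarded."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_rows, n_inputs = X.shape
    # Continuous probe: uniform on the scale of the normalized inputs.
    probe = rng.uniform(X.min(), X.max(), size=(n_rows, 1))
    alphas = train_ard(np.hstack([X, probe]))   # n_inputs + 1 hyperparameters
    threshold = alphas[-1]                      # alpha of the probe signal
    return [i for i in range(n_inputs) if alphas[i] < threshold]
```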

Bayesian inference can also be applied to the selection of the most probable MLP structure to represent a given mapping among a set of hypotheses H = {H1, H2, ..., HK}. The set of relevant inputs for each hypothesis is previously defined by ARD with probe signals, so the difference between hypotheses lies in the number of neurons in the hidden layer. Assuming that all hypotheses are equiprobable and using a Gaussian approximation around the previously estimated parameters and hyperparameters, the logarithm of the evidence for the models, ln p(Y | Hh), can be obtained by (Bishop, 1995):

$$\ln p(Y \mid H_h) = -S(\mathbf{w}_{MP}) - \frac{1}{2}\ln\left|\mathbf{A}\right| + \frac{1}{2}\sum_{i} M_i \ln \alpha_i + \frac{N}{2}\ln \beta + \ln(m!) + 2\ln m \qquad (26)$$

where $\mathbf{w}_{MP}$ minimizes S(w), $\mathbf{A}$ is the Hessian matrix of S(w) evaluated at $\mathbf{w}_{MP}$, and the last two terms account for the symmetries of the hidden layer.

4 RELEVANCE VECTOR MACHINES

Relevance Vector Machines (RVMs) (Tipping, 2001) are kernel-based sparse probabilistic models. They are sparse in the sense that only some vectors of the training set contribute to the estimation of the regression surface; these points are called relevant vectors.

Given a dataset $U = \{\mathbf{X}, Y\}$ including the input-output pairs, let us assume the traditional probabilistic formulation considering an additive noise $\zeta_k \in \mathbb{R}$ present in the desired output, i.e., $d_k = F(\mathbf{x}_k) + \zeta_k$. In order to model $F(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}$, let $f(\mathbf{x}, \mathbf{W}): \mathbb{R}^n \rightarrow \mathbb{R}$ be a function formed by the linear combination of functions $\phi(\mathbf{x}, \mathbf{z}): \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$ centered at each point of the dataset U:

$$f(\mathbf{x}, \mathbf{W}) = \sum_{i=1}^{N} w_i\, \phi(\mathbf{x}, \mathbf{x}_i) + b = \boldsymbol{\phi}(\mathbf{x})^t\, \mathbf{W} \qquad (27)$$

In equation (27), $\mathbf{w} \in \mathbb{R}^N$, $b \in \mathbb{R}$, $\mathbf{W} \in \mathbb{R}^{N+1}$, $\mathbf{W} = [\mathbf{w}^t\; b]^t$, and $\boldsymbol{\phi}(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}^{N+1}$ is a vector including the functions $\phi(\mathbf{x}, \mathbf{x}_i) = \phi_i(\mathbf{x})$ and a constant term equal to one representing the bias.

Using Bayes' rule, the posterior probability $p(\mathbf{W} \mid \mathbf{X}, Y)$ is given by:

$$p(\mathbf{W} \mid \mathbf{X}, Y) = \frac{p(Y \mid \mathbf{X}, \mathbf{W})\, p(\mathbf{W})}{p(Y \mid \mathbf{X})} \qquad (28)$$

As in equation (19), $p(Y \mid \mathbf{X}) = \int p(Y \mid \mathbf{X}, \mathbf{W})\, p(\mathbf{W})\, d\mathbf{W}$ is a normalization factor, p(W) the prior probability of W, and $p(Y \mid \mathbf{X}, \mathbf{W})$ the likelihood function, related to the distribution of the additive noise $\zeta_k$ present in the desired output.

Assuming that the samples of $\zeta_k$ are generated independently from the same Gaussian distribution with zero mean and variance $\sigma^2$, the likelihood function $p(Y \mid \mathbf{X}, \mathbf{W})$ is given by:

$$p(Y \mid \mathbf{X}, \mathbf{W}) = \left(2\pi\sigma^2\right)^{-N/2} \exp\left( -\frac{\left\| Y - \boldsymbol{\Phi}\mathbf{W} \right\|^2}{2\sigma^2} \right) \qquad (29)$$

where $\boldsymbol{\Phi} \in \mathbb{R}^{N \times (N+1)}$ is the modeling matrix including all the functions $\phi_i(\mathbf{x})$ evaluated at each point of the training set, i.e., the ij-th element is $\Phi_{ij} = \phi_j(\mathbf{x}_i)$ and $\Phi_{i(N+1)} = 1$.

The prior probability p(W) can be defined as a product of Gaussian distributions given by:

$$p(\mathbf{W} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{N+1} \left(\frac{\alpha_i}{2\pi}\right)^{1/2} \exp\left( -\frac{\alpha_i W_i^2}{2} \right) \qquad (30)$$

In equation (30), distinct Gaussian distributions are considered, all of them with zero mean but different variances $\alpha_i^{-1}$. These hyperparameters are responsible for the magnitude control of each parameter $W_i$. As in ARD, weights with large $\alpha_i$ will tend to be highly concentrated around zero. The estimation of the $\alpha_i$'s and the identification of weights with sufficiently large $\alpha_i$ can be used to select the functions $\phi_i(\mathbf{x})$ that will be included in the final model. This feature enables RVMs to present a sparse representation, as in other kernel methods like Support Vector Machines (SVMs).

The definition of the hyperparameters σ2 and α requires the specification of prior probabilities for them. Non-informative Gamma distributions are used, reflecting the prior absence of knowledge about the hyperparameters' distributions (Tipping, 2001).

Using the prior and likelihood distributions defined by equations (30) and (29), respectively, in equation (28), and performing a convolution of Gaussians to calculate the normalization factor $p(Y \mid \mathbf{X}, \boldsymbol{\alpha}, \sigma^2) = \int p(Y \mid \mathbf{X}, \mathbf{W})\, p(\mathbf{W} \mid \boldsymbol{\alpha})\, d\mathbf{W}$, the posterior probability $p(\mathbf{W} \mid \mathbf{X}, Y, \boldsymbol{\alpha}, \sigma^2)$ can be written as:

$$p(\mathbf{W} \mid \mathbf{X}, Y, \boldsymbol{\alpha}, \sigma^2) = (2\pi)^{-\frac{N+1}{2}} \left|\boldsymbol{\Sigma}\right|^{-1/2} \exp\left[ -\frac{1}{2} (\mathbf{W} - \boldsymbol{\mu})^t\, \boldsymbol{\Sigma}^{-1}\, (\mathbf{W} - \boldsymbol{\mu}) \right] \qquad (31)$$

where $\boldsymbol{\Sigma} \in \mathbb{R}^{(N+1) \times (N+1)}$ and $\boldsymbol{\mu} \in \mathbb{R}^{N+1}$ are given by

$$\boldsymbol{\Sigma} = \left( \sigma^{-2}\, \boldsymbol{\Phi}^t \boldsymbol{\Phi} + \mathbf{A} \right)^{-1}, \qquad \boldsymbol{\mu} = \sigma^{-2}\, \boldsymbol{\Sigma}\, \boldsymbol{\Phi}^t\, Y \qquad (32)$$

with $\mathbf{A} \in \mathbb{R}^{(N+1) \times (N+1)}$ being a diagonal matrix whose ii-th element is $a_{ii} = \alpha_i$. The expected value of the desired output $\hat{d}_{N+1}$ and the estimate of the corresponding variance $\hat{\sigma}^2_{N+1}$ associated with a testing point $\mathbf{x}_{N+1}$ are obtained through the expressions:

$$\hat{d}_{N+1} = \boldsymbol{\mu}_{MP}^t\, \boldsymbol{\phi}(\mathbf{x}_{N+1}), \qquad \hat{\sigma}^2_{N+1} = \sigma^2_{MP} + \boldsymbol{\phi}(\mathbf{x}_{N+1})^t\, \boldsymbol{\Sigma}_{MP}\, \boldsymbol{\phi}(\mathbf{x}_{N+1}) \qquad (33)$$

In equation (33), $\boldsymbol{\mu}_{MP}$ and $\boldsymbol{\Sigma}_{MP}$ are calculated by equations (32) using the estimated $\boldsymbol{\alpha}_{MP}$ and $\sigma_{MP}$. An iterative method for calculating the hyperparameters $\boldsymbol{\alpha}_{MP}$ and $\sigma_{MP}$, based on evidence maximization, analogous to Mackay's evidence maximization for MLPs, can be found in (Tipping, 2001).

Unlike other sparse kernel-based models, whose basis functions must satisfy the conditions of Mercer's Theorem (Vapnik, 1998), (Schölkopf and Smola, 2002), the function φ(x, z) used in RVMs does not need to meet Mercer's conditions. In this work, a Gaussian function is used:

$$\phi(\mathbf{x}, \mathbf{z}) = \exp\left[ -\sum_{k=1}^{n} \eta_k \left( x_k - z_k \right)^2 \right] \qquad (34)$$

In equation (34), $\eta_k \in \mathbb{R}^{+}$ denotes another set of hyperparameters that are iteratively estimated by evidence maximization (Tipping, 2001). This choice of φ(x, z) allows the creation of an input selection method analogous to ARD (presented in section 3). After the estimation of the $\eta_k$'s, the inputs with the smallest $\eta_k$ contribute less to the output calculation. In other words, the magnitude of $\eta_k$ can be used for ranking the input variables. Similarly to the input selection method used for MLPs and presented in section 3, artificial probe signals are included in the original input space to define irrelevance thresholds for the inputs. After training with the augmented input space including the probe signals, the relevant inputs are selected for re-training the model and making predictions.
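The closed-form RVM computations of equations (32)-(34) can be sketched directly; the estimation of α, σ² and the $\eta_k$'s by evidence maximization is omitted here and left as in (Tipping, 2001).

```python
import numpy as np

def gaussian_ard_basis(x, z, eta):
    """Basis function of equation (34), one width eta_k per input."""
    return np.exp(-np.sum(eta * (x - z) ** 2))

def rvm_posterior(Phi, Y, alpha, sigma2):
    """Posterior moments of equation (32)."""
    A = np.diag(alpha)                               # (N+1) x (N+1)
    Sigma = np.linalg.inv(A + Phi.T @ Phi / sigma2)
    mu = Sigma @ Phi.T @ Y / sigma2
    return mu, Sigma

def rvm_predict(phi_new, mu, Sigma, sigma2):
    """Predictive mean and variance of equation (33); phi_new is the basis
    vector of the test point, with the trailing 1 for the bias."""
    mean = phi_new @ mu
    var = sigma2 + phi_new @ Sigma @ phi_new
    return mean, var
```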

5 AUTONOMOUS MODELING

Chaos Theory, BIAMLP and RVM are combined in this paper in order to develop an analytic, coupled and unified framework for autonomous forecasting using neural models. This framework includes input space representation selection, structure definition and complexity control of the forecasting model, all of them disregarding the necessity of a validation set. The use of validation sets brings some practical and theoretical problems, as described in (Amari et. al., 1996) and (Cataltepe et. al., 1999). A practical disadvantage of cross-validation, especially in time-series applications, is related to the definition of the validation set, since serial correlations or recent information can be neglected in the training phase. Using all the available data for model development, the framework proposed here can be summarized as follows:

1. Apply the Chaos Input Selection algorithm (section 2.3) to define the initial input representation space;

2. Use BIAMLP or RVM to model the mapping between input-output pairs;

3. Discard the irrelevant inputs using the wrapper methods described in sections 3 (BIAMLP) and 4 (RVMs);

4. Make predictions by recursion over the whole forecasting horizon.

The autonomous modeling proposed can be summarized by the flowchart in Figure 1.


6 RESULTS

The models presented in the previous sections are evaluated by applying them to three public databases. The first database contains hourly load L(k), temperature T(k) and squared temperature T2(k) for the period of January 1, 1985 to March 31, 1991. This database was used in a forecasting competition (Ramanathan et. al., 1997) and can be found on the web at the address http://www.ee.washington.edu/class/555/elsharkawi/datafiles/forecasting.zip. For this database, the hourly load must be predicted from 16 to 49 steps ahead for weekdays and from 16 to 80 steps ahead for weekends. The forecasts are made daily by 9 a.m., with the testing period starting on November 1, 1990 and finishing on March 31, 1991. For the definition of the lags by Chaos Theory, the data from January 1, 1985 to October 31, 1990 (the database available at the beginning of the forecasting period) are used. After the definition of the lags that will be used as inputs to the model, the input-output pairs used for training correspond to the data from the current month, the two previous months, and the corresponding pairs from the same period in the previous year. This subset of the training data is used in order to reduce the computational effort of training. Some statistics for this database are shown in Table 1 to Table 3, where the mean, the standard deviation and the ratio between them are presented, respectively. These statistics are calculated after detrending the load time-series using a linear regression model on time.

The second database includes daily load L(k) and daily maximum temperature T(k) for the period of January 1, 1997 to January 31, 1999. Like the first database, this one was also used in a forecasting competition (Chen et. al., 2004), where the objective was the daily prediction of the load from January 1, 1999 to January 31, 1999, i.e., forecasts for 1 to 31 steps ahead. The data from January 1, 1997 to December 31, 1998 were used for input space definition and training of the models. This database can be found at the website http://neuron.tuke.sk/competition. As for case 1, some statistics for this database are presented in Table 4. These statistics are estimated after detrending the load time-series using a linear regression model on time.

The third database contains half-hourly load L(k) and temperature T(k) for the period of December 4, 2001 to December 31, 2003. The hourly series are obtained by averaging the two half-hourly registers within each hour (Mandal et. al., 2005). The objective is to forecast the hourly load from one to six hours ahead along the period from September 1, 2003 to September 7, 2003. The data from December 4, 2001 to August 31, 2003 are used for the initial input space definition. The same subset selected for training the models in Case 1, i.e., data from the current month, the two previous months and the respective pairs from the same period in the previous year, is used for the development of the models in this case. This database is related to the State of Victoria and can be found on the web at the address http://www.aemo.com.au/data/aggPD_2000to2005.html. As for case 1, Table 5 to Table 7 present some statistics for this load time-series, all of them calculated after detrending the load time-series using a linear regression model on time.

The hourly and daily load data used in this paper present seasonal patterns widely known in the load forecasting area, namely daily, weekly and yearly seasonal patterns (Ferreira and Alves da Silva, 2007), (Hippert et. al., 2001). The yearly pattern is related to the seasons and is modeled by the temperature information. The other patterns are modeled as qualitative information, being represented as binary variables indicating the hour of the day (24 dummies) and the day of the week (7 dummies) to be forecasted, as sketched below. The daily database (Case 2) uses only the dummies for the day of the week.
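A minimal sketch of this "1 of n" coding, assuming the forecasted instants are available as Python datetime objects:

```python
import numpy as np

def calendar_dummies(timestamps, hourly=True):
    """'1 of n' binary coding of the calendar information: 24 hour-of-day
    dummies (hourly series only) plus 7 day-of-week dummies."""
    days = np.eye(7)[[t.weekday() for t in timestamps]]
    if not hourly:                       # daily series (Case 2)
        return days
    hours = np.eye(24)[[t.hour for t in timestamps]]
    return np.hstack([hours, days])
```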

Table 8 shows the embedding parameters τ and d estimated via the first minimum of the mutual information Ix(r) and the false nearest neighbors method, respectively, and the mean value of the mutual false nearest neighbors statistic $\bar{m}[\mathbf{x}(k), \mathbf{y}(k)]$. As expected, the value of $\bar{m}[\mathbf{x}(k), \mathbf{y}(k)]$ confirms that there is synchronism between the reconstructed load and temperature for all cases. The results presented in Table 8 for case 1 confirm that the calculated embedding parameters are invariant characteristics of the attractor, since the values calculated for T(k) and T2(k) are very close.

Table 9 shows the mean absolute percentage error (MAPE) for the three databases studied in this paper. For case 3, the results are broken down by forecasting step to allow comparison with the benchmark from the literature (Mandal et. al., 2005). The last line of this table presents the best results found in the literature for these databases. The analysis of Table 9 shows that the autonomous models proposed in this paper, especially the BIAMLP, are competitive against the best results in the literature. It is noteworthy that the benchmark results presented in Table 9 were obtained by highly specialized and dedicated groups of modeling experts, while the autonomous models proposed here are mainly automatic, requiring little manual intervention. Further, among the usual parameters that must be specified in neural network training, the analyst must specify only the maximum dimension parameter dmax, the significance level α (Chaos Input Selection Algorithm) and the maximum number of neurons in the hidden layer to be tested in BIAMLP. Compared with the effort of defining the input space representation heuristically, together with the need for criteria for neural network structure definition, the autonomous modeling proposed here shows a considerable level of automation.

Despite the distinct statistical features of the databases under consideration, presented in Tables 1 to 7, which show significant differences both in level (mean) and in variability (standard deviation), the performance of the proposed autonomous models, particularly of BIAMLP, was always robust. Even for Case 2, for which BIAMLP presents the worst result when compared with the benchmark, BIAMLP would be ranked among the top five competitors (Chen et. al., 2004). For the Case 1 data, BIAMLP's results are statistically equivalent to the ones obtained by the winner of the competition. Statistical equivalence between the results from BIAMLP and the corresponding benchmark is also confirmed for Case 3. These findings highlight the robustness of BIAMLP's performance with respect to different time-series.

The importance of qualitative information is demonstrated in Table 10, which shows the results obtained by BIAMLP when the dummy variables are discarded from the initial input set. In this case, the inputs are all selected from the time-series data, discarding the prior knowledge about the dynamics of the data. The consistent increase in MAPE over the forecasting period confirms the importance of qualitative information and shows that Chaos Theory by itself does not deal with seasonality modeling. Since calendar information is always available for time-series data, the use of this qualitative information does not reduce the level of automation of the proposed models. In fact, qualitative information (general electric load seasonal characteristics) is included based on the premise that calendar information (time, day of the week, and corresponding month) is available. Therefore, for electric load time-series, such qualitative information is automatically inserted as dummy variables ("1 of n" binary coding). Besides, as the developed neural networks can disregard irrelevant information, such inputs are automatically excluded from the forecasting model when a general seasonal characteristic is not present in a particular dataset.

7 CONCLUSION

This work investigates the application of Chaos Theory as an input space representation tool in the development of autonomous neural network load forecasting models. Autonomy should be understood here as a set of automatic and coupled procedures for input space definition and selection, structure specification and complexity control (regularization). In this work, two neural network based models are used: Bayesian Inference Applied to MLP training and specification (BIAMLP) and Relevance Vector Machines (RVMs). The obtained results, comparable with the benchmarks available in the literature, especially for the BIAMLPs, show the potential of the proposal. The automation level of the techniques proposed in (Ferreira and Alves da Silva, 2007) has been increased, enabling the application of the new models to problems involving multiple time-series, as, for example, bus load forecasting. Bus load forecasting is needed for feeding important power system control center functions, such as state estimation, generation scheduling, and security assessment.

In terms of computational effort, BIAMLP requires about 10 minutes to estimate the model and to provide the one to eighty hours ahead load forecasts for Case 1 (with Matlab® on a PC with an Intel® Core™ 2 Duo 2.66 GHz processor and 3323 MB of RAM, running 32-bit Windows Vista). Therefore, BIAMLP's computational effort is suitable for short-term load forecasting.

The results for RVMs, although competitive, can be improved by selecting more appropriate basis functions $\phi_i(\mathbf{x})$. One interesting theoretical feature of RVMs is the possibility of using different basis functions, such as periodic functions, in order to model seasonal patterns without the use of dummy variables. The development of BIAMLPs considering non-Gaussian noise in the output is another interesting research direction, by means of Monte Carlo methods for BIAMLP estimation (Neal, 1996). Beyond those issues, local modeling of the attractor, as opposed to the global modeling used here, may still improve the results. In order to automate the identification of the regions to be independently modeled, automatic clustering methods are required.

ACKNOWLEDGEMENTS

The authors thank the National Council for Scientific and Technological Development (CNPq), which funds author Alexandre Pinto Alves da Silva, and FAPERJ (Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro), which funds author Vitor Hugo Ferreira under Grant INST E-26/110.158/2010.

REFERENCES

Abarbanel, H.D.I., Brown, R., Sidorowich, J.J., Tsimring, L.S. (1993). The Analysis of Observed Chaotic Data in Physical Systems, Reviews of Modern Physics, v.65, n.4, pp. 1331-1392.

Amari, S., Murata, N., Müller, K.R., Finke, M., Yang, H. (1996). Statistical Theory of Overtraining - Is Cross-Validation Asymptotically Effective?, Advances in Neural Information Processing Systems 8, MIT Press, pp. 176-182.

Bishop, C.M. (1995). Neural Networks for Pattern Recognition, Oxford University Press.

Cao, L. (1997). Practical Method for Determining the Minimum Embedding Dimension of a Scalar Time Series, Physica D, v.110, n.1-2, pp. 43-50.

Cataltepe, Z., Abu-Mostafa, Y.S., Magdon-Ismail, M. (1999). No Free Lunch for Early Stopping, Neural Computation, v.11, n.4, pp. 995-1009.

Chen, B.-J., Chang, M.-W., Lin, C.-J. (2004). Load Forecasting Using Support Vector Machines: A Study on EUNITE Competition 2001, IEEE Trans. on Power Systems, 19(4), pp. 1821-1830.

Ferreira, V.H., Alves da Silva, A.P. (2007). Toward Estimating Autonomous Neural Network-based Electric Load Forecasters, IEEE Transactions on Power Systems, 22 (4), n.4, pp. 1554-1562.

Ferreira, V.H., Alves da Silva, A.P. (2009). Automatic Kernel Based Models for Short Term Load Forecasting, Proceedings of the 15th International Conference on Intelligent System Application to Power Systems, Curitiba, Paraná, Brazil.

Ferreira, V.H., Alves da Silva, A.P. (2010). Teoria do Caos Aplicada à Definição do Conjunto de Entradas de Modelos Neurais Autônomos para Previsão de Carga em Curto Prazo. Anais do XVIII Congresso Brasileiro de Automática (XVIII CBA), Bonito-MS, pp.4439-4444.

Fraser, A.M., Swinney, H.L. (1986). Independent Coordinates for Strange Attractors from Mutual Information, Physical Review A, v.33, n.2, pp. 1134-1140.

Griffiths, W.E., Hill, R.C., Judge, G.G. (1993). Learning and Practicing Econometrics, John Wiley & Sons.

Guyon, I., Elisseeff, A. (2003). An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, n.3, pp. 1157-1182.

Hippert, H.S., Souza, R.C., and Pedreira, C.E. (2001). Neural Networks for Load Forecasting: A Review and Evaluation, IEEE Transactions on Power Systems, v.16, n.1, pp. 44-55.

Kantz, H., Schreiber, T. (1997). Nonlinear Time Series Analysis, Cambridge Nonlinear Science Series, n.7, Cambridge University Press.

Kennel, M.B., Brown, R., Abarbanel, H.D.I. (1992). Determining Embedding Dimension for Phase-space Reconstruction Using a Geometrical Construction, Physical Review A, v.45, n.6, pp. 3403-3411.

Mackay, D.J.C. (1992). Bayesian Methods for Adaptive Models, Ph.D. dissertation, California Institute of Technology, Pasadena, California, USA.

Mandal, P., Senjyu, T., Uezato, K., Funabashi, T. (2005). Several-Hours-Ahead Electricity Price and Load Forecasting Using Neural Networks, IEEE PES General Meeting, San Francisco, USA.

Neal, R.M. (1996). Bayesian Learning for Neural Networks, Lecture Notes in Statistics, n.118, Springer-Verlag, New York.

Ramanathan, R., Engle, R., Granger, C.W.J., Vahid-Araghi, F., Brace, C. (1997). Short-Run Forecasts of Electricity Loads and Peaks, International Journal of Forecasting, v.13, n.2, pp. 161-174.

Rulkov, N.F., Sushchik, M.M., Tsimring, L.S.; Abarbanel, H.D.I. (1995). Generalized Synchronization of Chaos in Directionally Coupled Chaotic Systems, Physical Review E, v.51, n.2, pp. 980-994.

Schölkopf, B., Smola, A.J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, Cambridge, Massachusetts.

Takens, F. (1981). Detecting Strange Attractors in Turbulence, In: D.A. Rand, L.-S. Young (eds.), Dynamical Systems and Turbulence, Lecture Notes in Mathematics, v.898, pp. 366-381, Springer-Verlag.

Tipping, M.E. (2001). Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research, v.1, pp. 211-244.

Vapnik, V.N. (1998). Statistical Learning Theory, John Wiley & Sons, New York.

Article submitted on 03/03/2011 (Id.: 01285)

Revised on 06/05/2011 and 14/08/2011

Accepted under the recommendation of Associate Editor Prof. Carlos Roberto Minussi

