
When performing time series analysis, **models** are mathematical or statistical representations that describe the relationship between variables over time. They serve as abstractions of real-world processes, enabling scientists and data analysts to better understand, interpret and predict data.

**Def. 2.1 - Model** A **model** in time series analysis is a formal mathematical framework that is used to describe how a variable (or set of variables) evolves over time.

### 2.1 Introduction To Models

The following chapters in this syllabus will cover various types of models in detail. Different types of models make it possible to capture different underlying patterns, trends and relationships within the time series data. These models can be broadly categorized based on their structure, underlying assumptions and the nature of the time series they are suited for. Below is a short overview of common model types, some of which will be introduced in the current and later chapters.

**Naïve & Simple Models**

The easiest models to understand are those that do not require complex mathematical representations or computations. They often serve as a useful baseline or benchmark against which more complex models can be compared.

- The *Naïve* model assumes that the next value will be exactly the same as the last observed value in the series. Mathematically, an estimator (see 2.2) can be represented as:

  $$\hat{y}_{t+1} = y_t$$

- The naïve model can be expanded to the *Seasonal Naïve* model, which is very similar but has one big advantage: it takes seasonality into account. This is achieved by forecasting the next value as the same as the last observed value in the corresponding season. In case of monthly data, the following estimator can be used:

  $$\hat{y}_{t+1} = y_{t+1-12}$$

- Finally, the *Simple Average* (SA) model forecasts the next value in the time series as the average of all past observations, which leads to the following estimator:

  $$\hat{y}_{t+1} = \frac{1}{t}\sum_{i=1}^{t} y_i$$
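As a minimal sketch, the three baseline estimators above could be implemented as follows (the function names and the sample series are illustrative, not part of any library):

```python
# Baseline forecasting models: naive, seasonal naive and simple average.
# All names and data below are illustrative.

def naive_forecast(series):
    """Naive model: the next value equals the last observed value."""
    return series[-1]

def seasonal_naive_forecast(series, m=12):
    """Seasonal naive: the next value equals the observation one season (m steps) ago."""
    return series[-m]

def simple_average_forecast(series):
    """Simple average: the next value is the mean of all past observations."""
    return sum(series) / len(series)

# 13 months of illustrative data
y = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115]
print(naive_forecast(y))           # the last value
print(seasonal_naive_forecast(y))  # the value 12 steps back
print(simple_average_forecast(y))  # the mean of all values
```

Despite their simplicity, these baselines are worth computing first: a complex model that cannot beat them adds little value.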

**Smoothing Models**

Some time series data may contain noise from external sources. In other cases, it can be beneficial to capture certain trends within a time series. This is where smoothing models are useful, since they smooth out the data, removing noise in the process. Due to their cumulative nature, they also possess the capability of capturing trends over time.

- The *Moving Average* (MA) model calculates the average of the past $k$ observations and uses that as a forecast for the next value. This bears similarity to the previously discussed simple average model, the only difference being that only the $k$ most recent observations are taken into account, instead of all observations. The moving average estimator can be written mathematically as:

  $$\hat{y}_{t+1} = \frac{1}{k}\sum_{i=t-k+1}^{t} y_i$$

- The *Exponential Smoothing* (ES) model uses a weighted average of past observations, where more recent observations are given a higher weight (and thus contribute more to the estimate for the next value). There exist several types of exponential smoothing models, such as the *Simple Exponential Smoothing* (SES) model, which is often used when no clear trend or seasonal pattern is visible. *Holt's linear trend* model extends the SES model with the capability of capturing linear trends. Finally, the *Holt-Winters* model (sometimes referred to as *triple exponential smoothing*) further extends this by accounting for seasonality.

**Regression Models**

Regression models aim to explain the relationship between the dependent variable (response, in time series analysis often the data dimension) and one or more independent variables (predictors, in time series analysis often the temporal dimension).

- The *Linear Regression* model assumes a linear relationship between the dependent variable and one or more independent variables. An estimator can be constructed as follows:

  $$\hat{y}_t = \beta_0 + \beta_1 x_t$$

- *Autoregressive Distributed Lag* (ARDL) models combine lagged values of the dependent variable (as in autoregressive models) and independent variables (as in regression models) to model the time series.

**Autoregressive Models**

As briefly mentioned in the first chapter, autoregressive models use past values of the time series to predict or forecast future values. This means that such models assume that the current values are dependent on previous values in the time series.

- *Autoregressive* (AR) models forecast the value of the response as a linear combination of the response at past points in time. This is represented by the following estimator:

  $$\hat{y}_t = c + \sum_{i=1}^{p} \phi_i y_{t-i}$$

- *Autoregressive Moving Average* (ARMA) models take regular AR models one step further by combining them with the MA model. This makes it possible to capture both the autoregressive nature of the time series and remove noise in the process. An ARMA estimator can be expressed as:

  $$\hat{y}_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

- ARMA models can be further extended to *(seasonal) autoregressive integrated moving average* ((S)ARIMA) models. ARIMA models include a differencing step to remove trends. SARIMA models are a further extension that specifically aims at capturing seasonality.
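A hedged sketch of fitting the simplest autoregressive model, AR(1), by regressing each value on its predecessor (the helper name and data are illustrative; in practice a library such as statsmodels would be used):

```python
# Fit an AR(1) model y_t = c + phi * y_{t-1} + e_t by ordinary least
# squares: a simple regression of y_t on its lagged value y_{t-1}.

def fit_ar1(series):
    x = series[:-1]  # lagged values y_{t-1}
    y = series[1:]   # current values y_t
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    phi = sxy / sxx           # slope = AR coefficient
    c = my - phi * mx         # intercept
    return c, phi

# A noiseless series generated exactly by y_t = 1 + 0.5 * y_{t-1},
# so OLS should recover c = 1 and phi = 0.5.
series = [4.0]
for _ in range(10):
    series.append(1 + 0.5 * series[-1])
c, phi = fit_ar1(series)
print(c, phi)
```

On real (noisy) data the recovered coefficients would only approximate the true values.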

**State-Space & Structural Models**

State-space and structural models provide a flexible framework for modeling complex time series data by representing the series with latent (unobserved) state variables.

- The *Kalman Filter* is a state-space model that uses a recursive algorithm to estimate the hidden state variables of a time series. It is often used for filtering and forecasting in dynamic systems.

- The *Unobserved Components* (UC) model decomposes a series into several constituent components, each providing relevant information. In the case of time series, this often includes decomposition into trend, seasonal and irregular components.
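The Kalman filter mentioned above can be sketched in one dimension for a local-level (random walk plus noise) model; all parameter values below are illustrative assumptions:

```python
# One-dimensional Kalman filter for a local-level model.
# q: state (process) noise variance, r: observation noise variance.

def kalman_filter(observations, q=1e-3, r=0.25, x0=0.0, p0=1.0):
    """Return the filtered state estimate after each observation."""
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: the state is a random walk, so only uncertainty grows.
        p = p + q
        # Update: blend prediction and observation via the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

# Noisy observations of a level near 1.0
zs = [1.1, 0.9, 1.05, 0.95, 1.0]
est = kalman_filter(zs)
print([round(e, 3) for e in est])  # estimates settle near the true level
```

The recursive structure is the key point: each estimate is computed from the previous one and the newest observation only, so the filter runs in constant memory.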

**Multivariate Models**

In some cases, time series data consist of more than one data dimension. It is then necessary to deal with more than one time series at once, allowing for interactions between the different series.

- *Vector Autoregressive* (VAR) models extend AR models to multiple time series. Each variable in the system is modeled as a linear function of past values of all variables in the system. An estimator for the VAR model can be expressed as follows:

  $$\hat{\mathbf{y}}_t = \mathbf{c} + \sum_{i=1}^{p} A_i \mathbf{y}_{t-i}$$

- *Vector Error Correction* (VECM) models are a variation of VAR models used when time series are *cointegrated* (meaning they share a long-term equilibrium relation). This is often reserved for econometric analysis.
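A one-step VAR(1) forecast is just a matrix-vector product. The sketch below uses a hand-picked 2×2 coefficient matrix rather than one estimated from data; all values are illustrative:

```python
# One-step forecast from a VAR(1) model: y_t = c + A @ y_{t-1}.
import numpy as np

A = np.array([[0.5, 0.1],
              [0.2, 0.4]])     # cross-effects between the two series
c = np.array([1.0, 0.5])       # intercept vector
y_prev = np.array([2.0, 3.0])  # last observed values of both series

y_next = c + A @ y_prev        # forecast for both series at once
print(y_next)
```

The off-diagonal entries of `A` are what distinguish a VAR from two separate AR models: each series' forecast depends on the other series' past.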

**Machine-Learning Models**

In recent years, machine learning methods have become a popular choice for time series forecasting, in some applications outperforming traditional models.

- *Random forest & decision tree* models are tree-based methods that can be used for time series forecasting. This is particularly the case when there are complex non-linear interactions between variables.

- *Gradient boosting* methods such as *XGBoost* or *LightGBM* have been successfully applied to time series forecasting tasks.

- Neural networks are also quite successful when it comes to time series forecasting. In particular, *Recurrent Neural Networks* (RNNs) such as LSTM and GRU models are specifically designed to handle (originally textual) sequential data, which makes it possible to capture longer-term dependencies. Other types of neural networks, such as *Convolutional Neural Networks* (CNNs), can be applied to time series data to capture both spatial and temporal dependencies.

**Non-linear Models**

Non-linear models are used when a time series exhibits non-linear dynamics, which these models are specifically designed to capture.

- *Threshold Autoregressive* (TAR) models are models where the AR process switches between different regimes, depending on whether the past values are above or below a certain threshold. The *smooth transition autoregressive* (STAR) model is a generalization of a TAR model where the transition between regimes is smooth rather than abrupt.

- *Generalized Autoregressive Conditional Heteroskedasticity* (GARCH) models capture time-varying volatility by modeling the conditional variance of the series, which makes them a common choice for financial time series.
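The regime switch of a TAR model, described above, can be sketched in its simplest two-regime, lag-one form (threshold and coefficients are illustrative assumptions):

```python
# One-step forecast from a two-regime TAR(1) model: the AR coefficient
# switches depending on whether the previous value is above or below
# the threshold.

def tar_forecast(y_prev, threshold=0.0, phi_low=0.8, phi_high=0.3):
    """Pick the regime from y_prev vs threshold, then forecast."""
    phi = phi_low if y_prev <= threshold else phi_high
    return phi * y_prev

print(tar_forecast(-1.0))  # low regime applies phi_low
print(tar_forecast(2.0))   # high regime applies phi_high
```

A STAR model would replace the hard `if` with a smooth weighting function of `y_prev - threshold`, blending the two regimes instead of switching abruptly.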

**Other Model Types**

- *Long-memory* and *fractional* models can be used when a time series exhibits long-term dependencies that are not captured well by traditional short-term models such as ARIMA models. A general ARIMA model can be extended to ARFIMA (where the F stands for *"fractionally"*), which allows for fractional differencing, making it suitable for time series with long-range dependencies.

- *Hybrid* models combine different types of models to leverage the strengths of each approach. For example, combining ARIMA with machine learning models (such as ARIMA-LSTM) makes it possible to capture both linear and non-linear patterns in the data.

**Example 2.1 - Basic Forecasting Models** This example shows the closing price of the Microsoft stock over the period of one month. Different models were then used to make predictions for the closing price of this stock. Noteworthy is the moving average, which seems to be constant. This is due to the fact that no new moving average can be computed with *"predicted"* data, since this would dampen the overall prediction over time to a stable value. It is also interesting to see that, due to the volatility of stock market data, none of these models does a good job of accurately predicting what the stock price will do.

### 2.2 The Linear Regression Model

Linear regression involves modeling a dependent variable as a linear combination of one or more **predictors** (independent variables). In time series data, the goal is often to understand the trend, seasonality and potential cyclical patterns of the time-dependent variable, or to forecast future values. In a linear regression model for time series, the dependent variable at time $t$ is expressed as:

$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$$

Here,

- $y_t$ is the value of the dependent variable at time $t$.

- $x_t$ is the predictor variable at time $t$.

- $\beta_0$ is the **intercept** of the regression line, representing the baseline level of $y_t$ when all the predictors evaluate to zero.

- $\beta_1$ is the **(slope) coefficient** for the predictor variable, which measures the impact of $x_t$ on $y_t$.

- $\varepsilon_t$ is the **error term** at time $t$, which is often assumed to be a random variable. This term can be omitted if unnecessary, but is often assumed to have zero mean and constant variance: $\mathbb{E}[\varepsilon_t] = 0$ and $\mathrm{Var}(\varepsilon_t) = \sigma^2$.

**Types of Linear Regression Models in TSA**

In time series analysis, often two types of linear regression models are considered: the **trend-based** regression model and **explanatory variable** models.

**Trend-based Regression Models** are used to identify long-term directional movements in the time series data. Common trends include linear, quadratic or exponential trends. For example, to model a linear trend, the following model can be used, with $t$ being the time index:

$$y_t = \beta_0 + \beta_1 t + \varepsilon_t$$
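A trend-based model like this can be fitted with the one-predictor OLS closed form, using the time index as the predictor (the helper name and data are illustrative):

```python
# Fit the linear trend model y_t = b0 + b1 * t + e_t by OLS,
# using the time index t = 0, 1, 2, ... as the predictor.

def fit_linear_trend(series):
    t = list(range(len(series)))
    n = len(series)
    mt, my = sum(t) / n, sum(series) / n
    b1 = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, series))
          / sum((ti - mt) ** 2 for ti in t))  # slope of the trend
    b0 = my - b1 * mt                          # intercept
    return b0, b1

# A noiseless trend y_t = 3 + 2t, so OLS recovers b0 = 3, b1 = 2 exactly.
y = [3 + 2 * t for t in range(6)]
b0, b1 = fit_linear_trend(y)
print(b0, b1)
```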

**Explanatory Variable Models** (also known as **Multiple Regression**) are models where the dependent variable is explained by other time-dependent variables. For example, in predicting sales, economic indicators such as consumer confidence or seasonal indicators (e.g. month or quarter) are used as independent variables.

**Assumptions for Linear Regression in TSA**

When using a linear regression model in time series analysis, it is important to review several assumptions to ensure the validity and reliability of the model and its estimated parameters.

- The relationship between the predictors and the dependent variable is assumed to be linear. This is the
**linearity**assumption.

- The error terms should be uncorrelated over time. This can be quite problematic, especially in time series, since consecutive values are often autocorrelated. This is the
**independence**assumption.

- The variance of the errors should also be constant over time, an assumption known as
**homoscedasticity**.

- Finally, the error terms should be normally distributed. This is of particular importance for inference of the regression coefficients. This assumption is the
**normality of errors**assumption.

Violating one of these assumptions can lead to biased estimates, incorrect predictions and unreliable inference, impacting the predictive power of the model.

#### 2.2.1 Estimation of Model Parameters

To obtain a linear regression model that can be used to make predictions or forecast future values, it is necessary to estimate the model parameters. In this particular case, the model parameters are:

- The **intercept** $\beta_0$

- The **slope coefficients** $\beta_1, \dots, \beta_p$

To obtain the intercept and slope coefficients, data samples can be used. The data samples are pairs $(x_t, y_t)$ of both the sample $y_t$ and the predictor values $x_t$ that were used to obtain that sample, for $t = 1, \dots, n$.

Here, we assume that the following linear relation holds, such that the linearity constraint is satisfied:

$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$$

The error term is assumed to be distributed as white noise. This means the error terms are *i.i.d.* with zero mean and are uncorrelated over time. This satisfies both the independence and normality of errors constraints. If the errors are drawn from any distribution where the variance is constant over time, the homoscedasticity constraint is further satisfied, validating all assumptions necessary for a linear regression model.

Typically, the intercept and slope coefficients are estimated using the (ordinary) **least squares** (OLS) estimator described in the previous chapter. Remember that this estimator chooses the values of the intercept and slope coefficients that minimize the sum of squared differences between the observed and fitted values:

$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{t=1}^{n} \left(y_t - \beta_0 - \beta_1 x_t\right)^2$$

This results in fitted values $\hat{y}_t$, such that

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_t$$

It is important to note that in most cases, the resulting values for the intercept and slope coefficients will yield a linear relation where, if plotted in a $(p+1)$-dimensional graph, the resulting line will not intercept *any* of the data samples. The OLS estimator obtains a line that minimizes the distance to each of these points. This results in so-called **residuals** for each data sample. A **residual** $e_t$ is the difference between the actual data sample and the predicted value:

$$e_t = y_t - \hat{y}_t$$

**Example 2.2 - Linear Regression** In this example, simple linear regression between one predictor and one dependent variable is shown. The first image contains the synthetic data, together with two random regression lines.

It is clear that the two random regression lines do not capture the trend observed in the data very well. The trend observable in the upper regression line is too strong, while the trend in the lower regression line seems to follow the actual trend of the data. However, the intercept of the lower regression line is too low compared to the actual data.

After applying the OLS estimator, the result is the actual regression line with its estimated intercept and slope coefficient. This yields a regression line that fits the actual data much better!

Performing a residual analysis can give insight into why the regression line obtained with the OLS estimator is better. To compare, the **sum of squared residuals** (SSR) is used:

$$SSR = \sum_{t=1}^{n} e_t^2 = \sum_{t=1}^{n} \left(y_t - \hat{y}_t\right)^2$$

This results in the following SSR scores:

| Regression Line | SSR |
| --- | --- |
| First random line | 582.7126 |
| Second random line | 179.2635 |
| OLS regression line | 42.7764 |

Note that this is the lowest obtainable score, since the OLS estimator results in the parameters that minimize this sum. This means that a better score cannot be obtained using the OLS estimator and that the resulting regression line is the best one obtainable using a linear regression model.
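The comparison in this example can be reproduced in spirit on synthetic data. The data, candidate lines and random seed below are made up for illustration, so the SSR values will not match the table; the point is only that the OLS line always attains the lowest SSR:

```python
# Compare the SSR of two arbitrary regression lines against the OLS line.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=x.size)  # noisy linear data

def ssr(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

# OLS closed form for one predictor
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

# Two arbitrary candidate lines vs the OLS line
print(ssr(0.0, 1.2), ssr(0.5, 0.7), ssr(b0_hat, b1_hat))
```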

**Matrix Notation**

The regression model can also be written in matrix notation. In this notation, a matrix $X$ is constructed containing all the values of the predictor variables, with a column of 1s for the intercept.

The vector $\mathbf{y}$ contains all the values of the dependent variable, obtained from the sample data.

We can now rewrite the linear regression model in matrix notation:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where

- $\mathbf{y}$ is an $n \times 1$ vector of observed responses (the sample data).

- $X$ is an $n \times (p+1)$ matrix of predictors, with each row representing an observation and each column a predictor variable.

- $\boldsymbol{\beta}$ is the $(p+1) \times 1$ vector of unknown parameters that have to be estimated (the intercept and slope coefficients).

- $\boldsymbol{\varepsilon}$ is an $n \times 1$ vector of random errors, which are (typically) assumed to be *i.i.d.* with mean 0 and variance $\sigma^2$, i.e. to satisfy the assumptions for a linear regression model.

It is then possible to show that the OLS estimate is

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
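As a sketch with illustrative data, the matrix-form OLS solution can be computed by solving the normal equations $X^\top X \hat{\boldsymbol{\beta}} = X^\top \mathbf{y}$ with numpy:

```python
# Matrix-form OLS: solve the normal equations X^T X beta = X^T y.
import numpy as np

n = 8
x1 = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), x1])  # first column of 1s = intercept
beta_true = np.array([2.0, 0.5])
y = X @ beta_true                      # noiseless data, so OLS is exact

# Solving the normal equations is numerically preferable to forming
# the explicit inverse (X^T X)^{-1}.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```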

**Covariance Matrix of** $\hat{\boldsymbol{\beta}}$

The covariance matrix of the estimated parameters is denoted as $\mathrm{Cov}(\hat{\boldsymbol{\beta}})$ and quantifies the uncertainty or variability of the estimated coefficients. It is given by the following expression:

$$\mathrm{Cov}(\hat{\boldsymbol{\beta}}) = \mathbb{E}\left[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top\right]$$

It can further be shown that

$$\mathrm{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^\top X)^{-1}$$

where

- $\sigma^2$ is the variance of the error term $\varepsilon_t$. If the errors have more variability, then the estimates of $\boldsymbol{\beta}$ are less precise, leading to a larger covariance matrix (in terms of the magnitude of the values).

- $(X^\top X)^{-1}$ is the inverse of the matrix formed by taking the product of $X^\top$ with $X$. This component captures information about the predictor variables.

It might be unclear why this covariance matrix matters in the first place. The covariance matrix serves several purposes, especially in the context of statistical inference.

- The diagonal elements represent the variances of each estimated parameter $\hat{\beta}_i$. These values can indicate how precise each parameter estimate is. Smaller variances imply more precise estimates, while larger variances indicate a greater uncertainty.

- The off-diagonal elements represent the covariances between different parameter estimates. They provide an indication of how different parameter estimates are linearly related. If the predictors are highly correlated, the covariances will be high, leading to less reliable estimates (a phenomenon known as **multicollinearity**).
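The covariance matrix $\sigma^2 (X^\top X)^{-1}$ can be sketched numerically as follows. The design matrix and error variance are illustrative, and $\sigma^2$ is assumed known here (in practice it is estimated from the residuals):

```python
# Covariance matrix of the OLS estimates: sigma^2 * (X^T X)^{-1}.
import numpy as np

n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
sigma2 = 4.0                               # assumed error variance

cov_beta = sigma2 * np.linalg.inv(X.T @ X)

# The diagonal holds each coefficient's variance; its square root
# gives the standard errors used in inference.
std_errors = np.sqrt(np.diag(cov_beta))
print(cov_beta)
print(std_errors)
```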