Previous chapter → Chapter 1 - Introduction To Time Series Analysis
In time series analysis, estimators play a crucial role in understanding the underlying patterns and behaviors of data that evolve over time. A key goal in time series analysis is to perform inference on unknown parameters (e.g. means, variances, autocorrelations and trends) based on the observed data. These parameters provide insight into the structure of the time series, making it possible to model it, understand its dynamics and make forecasts, which is the goal of this course. Since the true values of these parameters are usually unknown, estimators are used to approximate them from the available sample data.
Without estimators, it would be impossible to quantify or predict time-dependent phenomena. For instance, in financial markets, estimating volatility is essential for risk management, while in climate studies, estimating trends and seasonal patterns helps to understand long-term changes. The accuracy and reliability of these estimates directly affect the quality of any conclusions that can be drawn, whether they are forecasts of future values or tests of hypotheses regarding the behavior of a time series. Therefore, selecting the right estimator and understanding its properties is essential for time series analysis.
2.1 Basic Definitions
An estimator is a rule or mathematical formula that provides an approximation of an unknown parameter based on observed data. In the context of time series analysis, an estimator helps to infer characteristics of the underlying process generating the data, such as the mean, variance, or correlation, which are typically unknown. Estimating these parameters allows one to make educated guesses about the time series using the information contained in a finite sample of observations. This is helpful, since the entire population or process can generally not be observed over an infinite time span.
Estimator Vs. Estimate
Def. 2.2 - Estimator
An estimator (often denoted $\hat{\theta}$) is the method or function applied to data, whereas an estimate is the value obtained by applying this method or function to a specific dataset.
For instance, to estimate the mean of a time series, the simple average estimator $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$ can be used. In this case, the estimator is the formula used to obtain the sample mean, while the estimate is the numerical result obtained by applying this formula to a particular sample of data points.
Parameters and Population
Def. 2.3 - Parameter
A parameter is a numerical characteristic that describes a feature of the process generating the time series.
Parameters can represent long-term properties like the average value (mean), volatility (variance), or the relationship between time points (autocorrelation). Since the entire “population” is rarely fully accessible and the full realization of the time series (which would span an infinite or unobservable time period) is unknown, an estimator can be used to approximate these parameters based on a sample of finite length $n$.
2.2 Properties of Estimators
Understanding the properties of estimators is essential for evaluating their effectiveness in approximating unknown parameters. An estimator’s performance is not solely based on the value it produces but also on its theoretical qualities, which determine how well it can be expected to behave in different scenarios. Key properties such as unbiasedness, consistency and efficiency help to judge whether an estimator is reliable, whether it converges to the true parameter as more data becomes available and how much uncertainty is associated with its estimates. These properties form the foundation for selecting the most appropriate estimator for time series models.
Def 2.4 - Properties of Estimators
- (Un)Biased - An estimator $\hat{\theta}$ is
    - unbiased if $\mathbb{E}[\hat{\theta}] = \theta$, meaning that the expected value of the estimator is equal to the true parameter.
    - biased if $\mathbb{E}[\hat{\theta}] \neq \theta$.
- Consistency - An estimator $\hat{\theta}$ is consistent if it converges (in probability) to the true value $\theta$ as the sample size increases.
- Efficiency - An efficient estimator has the smallest possible variance among all unbiased estimators.
One important remark to make is that an estimator can be biased and still be consistent. This might not seem completely logical at first. If the estimator is biased, then its value will, on average, not be equal to the true value. However, this is not the same as saying that it will not converge to the true value as the sample size increases!
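To make this concrete, the following sketch (an illustrative simulation, with the Gaussian data and sample sizes chosen arbitrarily) compares the variance estimator that divides by $n$ with the one that divides by $n - 1$: the former is biased downwards for small samples, yet both converge to the true variance as the sample size grows.

```python
# Illustrative simulation: the variance estimator that divides by n is biased
# (it underestimates the true variance on average) but still consistent,
# since the bias vanishes as the sample size grows.
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0                      # variance of the generating process
n_reps = 10_000                     # number of repeated samples per sample size

for n in (5, 50, 500, 5000):
    samples = rng.normal(0.0, np.sqrt(true_var), size=(n_reps, n))
    biased = samples.var(axis=1, ddof=0)      # divides by n   -> biased
    unbiased = samples.var(axis=1, ddof=1)    # divides by n-1 -> unbiased
    print(f"n={n:5d}  E[biased] ≈ {biased.mean():.3f}  "
          f"E[unbiased] ≈ {unbiased.mean():.3f}  (true = {true_var})")
```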
Additionally, it must be noted that as the sample size increases, the estimates produced by an estimator are (often) approximately normally distributed, with mean and variance equal to those of the estimator. This is a result of both the law of large numbers and the central limit theorem (CLT).
If this is the case, the estimator is said to be asymptotically normal. In almost all cases, an estimator does not produce exactly accurate estimates (i.e. often $\hat{\theta} \neq \theta$). It is therefore interesting, and also necessary, to quantify the error of an estimate produced by the estimator. This can be done through the use of the standard error.
Def 2.5 - Standard Error of $\hat{\theta}$
The standard error (SE) of an estimator $\hat{\theta}$ is a measure of the precision of the estimator. It is mathematically defined as

$$\mathrm{SE}(\hat{\theta}) = \sqrt{\mathrm{Var}(\hat{\theta})}.$$
A confidence interval (CI) is a range of values, derived from the data using the estimator $\hat{\theta}$. The interval spans a range that is likely to contain the true parameter $\theta$ at a certain level of confidence. For example, with a 95% CI, you can be 95% sure that the true value of the parameter lies inside the interval. It is intuitive that the higher the confidence, the wider the interval will be: a CI with a confidence of 95% spans a larger range of values than a CI of only 50%. In general, and throughout this course, a CI of 95% is assumed to be the standard.
Confidence intervals can be useful because they provide more information than a single point estimate. They allow one to quantify the uncertainty about the parameter being estimated and can help make informed decisions, showing how precise a certain estimate is and how confident you can be in it.
Def 2.6 - Confidence Interval of $\theta$
If $\hat{\theta}$ is a consistent, asymptotically normal estimator, a confidence interval for a confidence percentage $(1 - \alpha) \cdot 100\%$ is given by

$$\hat{\theta} \pm z_{\alpha/2} \cdot \mathrm{SE}(\hat{\theta}),$$

where $z_{\alpha/2}$ is the critical value for the confidence interval and $\mathrm{SE}(\hat{\theta})$ is the standard error of the estimator. For example, the 95% CI (with $\alpha = 0.05$) for $\theta$ is given by

$$\hat{\theta} \pm 1.96 \cdot \mathrm{SE}(\hat{\theta}).$$
The value of $z_{\alpha/2}$ is obtained by computing the critical value (z-score) of the standard normal distribution for the significance level $\alpha$ (in the case of a 95% confidence interval, $\alpha = 0.05$ and $z_{0.025} \approx 1.96$). Remember, this value is the number of standard deviations away from the mean one needs to go in order to capture the confidence percentage of the data, starting at the center of the distribution and moving towards the tails in a symmetric fashion. In other words, for a distribution $Z \sim \mathcal{N}(0, 1)$,

$$P\!\left(-z_{\alpha/2} \leq Z \leq z_{\alpha/2}\right) = 1 - \alpha.$$
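As an illustration, the sketch below computes a 95% CI for a mean; it assumes independent observations so that $\mathrm{SE}(\bar{x}) = s/\sqrt{n}$, which ignores the autocorrelation present in most time series and is used here only to keep the example simple.

```python
# Minimal sketch: a 95% confidence interval for the mean of a sample.
# Assumes independent observations, so SE(x_bar) = s / sqrt(n); for autocorrelated
# time series the standard error would need to be adjusted.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # synthetic sample

x_bar = x.mean()                               # point estimate of the mean
se = x.std(ddof=1) / np.sqrt(len(x))           # standard error of the mean
z = norm.ppf(0.975)                            # critical value, roughly 1.96 for 95%

ci_low, ci_high = x_bar - z * se, x_bar + z * se
print(f"95% CI for the mean: [{ci_low:.3f}, {ci_high:.3f}]")

# Hypothesis test H0: mu = 0 at alpha = 0.05: reject if 0 lies outside the CI.
print("Reject H0: mu = 0" if not (ci_low <= 0 <= ci_high) else "Do not reject H0")
```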
The standard error can also be useful for statistical testing. For example, in the case of testing a hypothesis $H_0: \theta = \theta_0$, the hypothesis can be rejected ($RH_0$, for “Reject $H_0$”) at a significance level $\alpha$ if $\theta_0$ is not in the confidence interval.
For the remainder of the syllabus, $H_0$ will be rejected (at $\alpha = 0.05$) if the P-value is smaller than $\alpha$, or equivalently, if $\theta_0$ falls outside the 95% confidence interval.
Review - P-value
The P-value is the probability that the test statistic takes values more extreme than the computed one (under $H_0$).
The default choice for the significance level is $\alpha = 0.05$. This level gives the type I error, i.e. the probability of rejecting $H_0$ when it holds.
- If $P \leq \alpha$, then the deviation from $H_0$ is said to be significant.
- The P-value should be considered on a continuous scale.
- The smaller the P-value, the more evidence in the data against the null-hypothesis.
2.3 Estimator Types ❌
In time series analysis, several types of estimators can be used to infer unknown parameters from data, as discussed previously. These estimators can be categorized based on their approach and the type of information they provide. This section briefly explores some of the most common estimators and their applications. Understanding estimators and their properties is often key in selecting the most appropriate one for a given model.
Point Estimator
A point estimator is a rule or formula that provides a single value as an estimate for an unknown population parameter. It gives the best guess or approximation based on the available sample data. Common examples of point estimators can include:
- The sample mean $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$ is a point estimator for the population mean $\mu$.
- The sample variance $s^2 = \frac{1}{n-1}\sum_{t=1}^{n} (x_t - \bar{x})^2$ is a point estimator for the population variance $\sigma^2$.
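For instance, a minimal sketch of both point estimators applied to a small, made-up sample:

```python
# Point estimates of the mean and variance from a finite sample.
import numpy as np

x = np.array([1.2, 0.8, 1.5, 2.1, 1.7, 0.9, 1.3])
mean_hat = x.mean()            # sample mean estimator applied to this sample
var_hat = x.var(ddof=1)        # sample variance estimator (divides by n - 1)
print(mean_hat, var_hat)       # these two numbers are the estimates
```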
The advantage of point estimators is that they are straightforward and easy to compute. However, they provide no information about the uncertainty associated with the estimate. For instance, while the sample mean offers a single value as an estimate of the true mean, it does not convey how confident we can be about that estimate.
Interval Estimators
When a quantifiable measure of the uncertainty in the estimate of a parameter is desired, an interval estimator can be used. Unlike point estimators, interval estimators provide a sense of the precision of the estimate and account for sampling variability. This makes them essential for quantifying the uncertainty surrounding a parameter. The result of an interval estimator is a range of values within which the true parameter is likely to fall. The most common interval estimators are confidence intervals, which were described in the previous section. The general expression for a confidence interval with a significance level $\alpha$ was given by:

$$\hat{\theta} \pm z_{\alpha/2} \cdot \mathrm{SE}(\hat{\theta}),$$
where in this particular example, $z_{\alpha/2} \approx 1.96$ for $\alpha = 0.05$. In most cases, the interval is centered around the sample mean, such that $\hat{\theta} = \bar{x}$.
Maximum Likelihood Estimator (MLE)
The maximum likelihood estimator (abbr. MLE) is one of the most widely used estimation methods, especially in time series models such as ARIMA and GARCH. The MLE seeks to find the parameter values that maximize the likelihood function, which represents the probability of the observed data given a set of parameter values. Mathematically speaking, for a given model with parameters $\theta$ and observed data $x = (x_1, \ldots, x_n)$, the expression for the MLE is given by:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; L(\theta; x).$$
Here, $L(\theta; x)$ is the likelihood function. The MLE has many desirable properties, such as asymptotic efficiency (i.e. as the sample size grows larger, the estimator achieves the lowest possible variance) and consistency. MLE is often used in time series analysis to estimate the parameters of stochastic models such as AR, MA and ARMA processes. In practice, different numerical methods are used to compute the MLE. The following toggle offers more information on how MLE estimates parameters.
MLE Optimization
The likelihood function is the probability (or probability density) of the observed data as a function of the parameters . In other words, it describes how likely the data is to be observed under the given parameters. In practice, often the log-likelihood function is used because it simplifies the mathematical expressions (e.g. by turning products into sums).
The goal of MLE is to maximize the (log-)likelihood function with respect to the parameters . In general, this involves using numerical optimization methods. These methods are often gradient-based or gradient-free methods.
- Gradient-based methods such as Newton-Raphson or gradient descent calculate the derivative (or gradient) of the (log-)likelihood with respect to each parameter. The gradient indicates the direction in which the parameters should be adjusted to increase the likelihood. This process iteratively updates the parameter values in a direction that increases the likelihood until it converges to a local (or, less likely, global) maximum.
- Grid search can be used for smaller problems or when gradients are not available. With grid search, MLE is implemented through an exhaustive search over a grid of parameter values. This approach, however, is computationally expensive and impractical for large or continuous parameter spaces. Especially for continuous parameter spaces, the result may also be relatively inaccurate.
- Monte Carlo methods can be used to estimate the parameters of more complex models, such as Bayesian models, or when the likelihood is intractable. Often Markov chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm, are used to sample from the parameter space.
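As a concrete sketch (not a reference implementation), the code below computes the MLE of a Gaussian AR(1) model by numerically minimizing the negative conditional log-likelihood with `scipy.optimize.minimize`; the simulated data, starting values and choice of optimizer are illustrative assumptions.

```python
# MLE sketch: fit a Gaussian AR(1) model x_t = phi * x_{t-1} + eps_t,
# eps_t ~ N(0, sigma^2), by minimizing the negative conditional log-likelihood
# (conditioning on the first observation).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
phi_true, sigma_true, n = 0.6, 1.0, 500

# Simulate an AR(1) series with known parameters to check the estimator against.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal(0.0, sigma_true)

def neg_log_likelihood(params, x):
    phi, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    resid = x[1:] - phi * x[:-1]       # one-step-ahead prediction errors
    return 0.5 * np.sum(resid**2 / sigma**2 + 2.0 * log_sigma + np.log(2.0 * np.pi))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,), method="L-BFGS-B")
phi_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE estimates: phi ≈ {phi_hat:.3f}, sigma ≈ {sigma_hat:.3f}")
```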
Method Of Moments (MoM)
The method of moments estimator is an alternative to the MLE. It involves equating sample moments (i.e. the sample mean, variance, etc.) to the theoretical moments of the population distribution in order to estimate the parameters. For example, if $\mu$ is the population mean and $\bar{x}$ is the sample mean, the method of moments would estimate $\mu$ by setting $\hat{\mu} = \bar{x}$. More generally, for a distribution with parameters $\theta_1, \ldots, \theta_k$, the MoM estimator solves:

$$\frac{1}{n}\sum_{t=1}^{n} x_t^{\,j} = m_j(\theta_1, \ldots, \theta_k), \qquad j = 1, \ldots, k.$$

Here, $m_j(\theta_1, \ldots, \theta_k)$ are the theoretical moments, and the resulting equalities are the moment equations derived from the distribution's properties. While MoM is simpler and more intuitive in some cases, one downside is that it tends to be less efficient than the MLE.
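As a small illustration (using a gamma-distributed sample, a convenient textbook case rather than something specific to this course), matching the first two sample moments to the theoretical mean $k\theta$ and variance $k\theta^2$ of a gamma distribution gives closed-form MoM estimates:

```python
# Method of moments for a gamma distribution with shape k and scale theta:
# mean = k * theta and variance = k * theta**2, so solving the two moment
# equations gives k = mean**2 / var and theta = var / mean.
import numpy as np

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.5, scale=1.8, size=5_000)

mean, var = x.mean(), x.var(ddof=1)   # sample moments
k_hat = mean**2 / var                 # MoM estimate of the shape parameter
theta_hat = var / mean                # MoM estimate of the scale parameter
print(f"k ≈ {k_hat:.2f} (true 2.5), theta ≈ {theta_hat:.2f} (true 1.8)")
```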
Least Squares Estimator
The least squares estimator is primarily used in regression contexts, including time series models such as AR, ARMA and ARIMA models. These model the relationship between a dependent variable and its lagged values (commonly known as the predictors). The least squares estimator minimizes the sum of squared differences between observed and predicted values:

$$\hat{\beta} = \arg\min_{\beta} \sum_{t=1}^{n} \left( y_t - x_t^{\top} \beta \right)^2.$$
What each of the variables in this equation means will become more clear in the next chapter, which discusses regression models in more detail.
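As a brief sketch (regressing $x_t$ on $x_{t-1}$, i.e. an AR(1) fit chosen purely for illustration), the least squares estimate can be obtained with `numpy.linalg.lstsq`:

```python
# Least squares fit of an AR(1) model: regress x_t on its lagged value x_{t-1}.
import numpy as np

rng = np.random.default_rng(5)
n, phi_true = 400, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal()

# Design matrix with an intercept column and the lagged series.
X = np.column_stack([np.ones(n - 1), x[:-1]])
y = x[1:]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the sum of squared residuals
print(f"intercept ≈ {beta_hat[0]:.3f}, phi ≈ {beta_hat[1]:.3f}")
```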
Bayesian Estimation
It is impossible to explain estimators without at least mentioning Bayesian estimation. Bayesian estimation incorporates prior knowledge about the parameter being estimated, combining it with the observed data to form a posterior distribution. In Bayesian estimation, parameters are treated as random variables with prior distributions (i.e. the parameters are taken from a distribution). The estimation process involves updating the prior distribution based on the likelihood of the observed sample data, resulting in a posterior distribution, which reflects the updated beliefs about the parameter being estimated. This requires the introduction of Bayes' Theorem:

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)},$$
which reads as follows: the probability of the parameters $\theta$, given the sample data $x$, is given by the product of the probability of observing the sample data given the parameters and the probability of the parameters themselves (which is obtained from the prior distribution), divided by the probability of observing the sample data.
One estimator derived with Bayesian estimation is the maximum a posteriori (MAP) estimator. The MAP estimator is a point estimator given by the value of $\theta$ that maximizes the posterior distribution. It can be seen as the Bayesian analog of the MLE. The MAP estimator can be mathematically expressed as:

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; P(\theta \mid x) = \arg\max_{\theta}\; P(x \mid \theta)\, P(\theta).$$
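As a small sketch (assuming a Gaussian likelihood with known variance and a Gaussian prior on the mean, a standard textbook setting), the MAP estimate has a closed form and shrinks the sample mean towards the prior mean:

```python
# MAP estimate of a normal mean mu with known data variance sigma2 and a
# normal prior mu ~ N(mu0, tau2). The posterior is normal, and its mode (the MAP)
# is a precision-weighted average of the prior mean and the sample mean.
import numpy as np

rng = np.random.default_rng(11)
sigma2 = 4.0                               # known observation variance
mu0, tau2 = 0.0, 1.0                       # prior mean and prior variance
x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=25)

n, x_bar = len(x), x.mean()
mu_map = (mu0 / tau2 + n * x_bar / sigma2) / (1 / tau2 + n / sigma2)
print(f"MLE (sample mean): {x_bar:.3f}, MAP: {mu_map:.3f}  (shrunk towards {mu0})")
```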
2.4 Common Estimators in TSA ❌
In the previous sections and chapters, some common estimators for time series analysis have already been explained. The sample mean and variance estimators help in estimating the mean and variance, while estimators such as the autocorrelation function (abbrev. ACF) can be used to estimate the autocorrelation at a lag order $k$. In this brief section, we explore some other common estimators that are used in time series analysis, such as the partial ACF, the periodogram estimator, the Hurst exponent and (co)variance matrices.
Partial Autocorrelation Function (PACF)
In time series analysis, the partial autocorrelation function (PACF) gives the partial autocorrelation of a stationary time series with its own lagged values, controlling for the values of the time series at all shorter lags. This is different from the regular autocorrelation function, which does not control for the other lags. The PACF plays an important role in identifying the extent of the lag in an autoregressive (AR) model. It can thus be used to determine the appropriate lag order $p$ in an AR($p$) model (or, by extension, even an ARMA($p, q$) model). The PACF is further discussed in Chapter 4 - ARMA Models.
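As a quick illustration (assuming the `statsmodels` package is available), the sample PACF of a simulated AR(2) series should only be clearly non-zero at the first two lags:

```python
# Sample PACF of a simulated AR(2) process: only the first two partial
# autocorrelations should be clearly different from zero.
import numpy as np
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(2)
n = 1000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + rng.normal()

print(np.round(pacf(x, nlags=5), 2))   # index 0 is the lag-0 value, which equals 1
```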
Periodogram Estimator
The periodogram estimator is used in spectral analysis to represent the frequency domain characteristics of a time series. It shows how the variance (or power) of the series is distributed over different frequency components. This is particularly useful for identifying cycles or periodic behavior in the data. The periodogram is given by the following expression:

$$I(\omega_j) = \frac{1}{n} \left| \sum_{t=1}^{n} x_t\, e^{-i \omega_j t} \right|^2.$$
Here, $\omega_j$ is the frequency and $x_1, \ldots, x_n$ are the time series observations. The periodogram provides an estimate of the power spectral density (PSD) of the time series at different frequencies. Peaks in the periodogram indicate prominent cycles or periodicities in the data. It is an estimator that is widely used in fields such as signal processing, economics and climatology. In these fields, it is crucial for understanding the underlying processes.
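As a sketch (using `scipy.signal.periodogram`, whose scaling convention may differ from the expression above by a constant factor), a sinusoid buried in noise shows up as a clear peak at its frequency:

```python
# Periodogram of a noisy sinusoid: the power concentrates near the true frequency.
import numpy as np
from scipy.signal import periodogram

rng = np.random.default_rng(4)
n, true_freq = 1024, 0.1                       # frequency in cycles per time step
t = np.arange(n)
x = np.sin(2 * np.pi * true_freq * t) + 0.5 * rng.normal(size=n)

freqs, power = periodogram(x, fs=1.0)
print(f"Peak at frequency ≈ {freqs[np.argmax(power)]:.3f} (true value {true_freq})")
```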
Hurst Exponent
In general, the Hurst exponent is a statistical measure that can be estimated without making strong assumptions about the stationarity of the time series. It measures the long-term memory or persistence of a time series. This makes it possible to determine whether a series exhibits trend-reinforcing (persistent) or mean-reverting (anti-persistent) behavior, or whether it is simply a random walk.
The Hurst exponent is estimated based on the relationship between the rescaled range of a time series and the time interval over which the range is measured. It can be estimated using the following expression:

$$\mathbb{E}\!\left[\frac{R(n)}{S(n)}\right] = C\, n^{H} \quad \text{as } n \to \infty,$$

where $R(n)$ is the range of the first $n$ cumulative deviations from the mean and $S(n)$ is the standard deviation of the first $n$ observations. The number of observations that are taken into account is $n$ and $C$ is a constant. A small estimation sketch is given after the list below. Typically, the value of the Hurst exponent $H$ falls between 0 and 1:
- $H < 0.5$ suggests anti-persistent behavior. This means that movements in the values of the time series tend to switch direction: if the series has been trending upwards, it is likely to move downwards soon, and vice versa. The closer the value is to 0, the stronger this tendency. This behavior is typical of a mean-reverting process (see example 1.3).
- $H = 0.5$ suggests a random walk, commonly known as Brownian motion. This means there is no correlation between the movements in the time series. Each step or value in the series is independent of the previous ones, with no predictable trend or mean-reverting tendencies.
- $H > 0.5$ suggests persistent behavior. This means that movements in the values of the time series tend to continue in the same direction. Values closer to 1 indicate a strong memory or trend-following nature, meaning that if the series is in an upward trend, it is likely to continue upward (and vice versa). This persistence is often observed in trending data.
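The following sketch implements a basic rescaled-range (R/S) estimate of $H$; the block sizes, the use of non-overlapping windows and the omission of small-sample corrections are illustrative simplifications.

```python
# A minimal rescaled-range (R/S) sketch for estimating the Hurst exponent H,
# using the slope of log(R/S) versus log(n) as the estimate of H.
import numpy as np

def hurst_rs(x, window_sizes=(8, 16, 32, 64, 128)):
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in window_sizes:
        rs_values = []
        # Split the series into non-overlapping blocks of length n.
        for start in range(0, len(x) - n + 1, n):
            block = x[start:start + n]
            z = np.cumsum(block - block.mean())    # cumulative deviations from the mean
            r = z.max() - z.min()                  # range of the cumulative deviations
            s = block.std(ddof=1)                  # standard deviation of the block
            if s > 0:
                rs_values.append(r / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(rs_values)))
    # H is the slope of the log-log regression of R/S against n.
    slope, _ = np.polyfit(log_n, log_rs, 1)
    return slope

rng = np.random.default_rng(1)
white_noise = rng.normal(size=2048)
print(f"Estimated H for uncorrelated noise: {hurst_rs(white_noise):.2f}")  # roughly 0.5
```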
2.5 Advanced Topics ❌
Residuals
Residuals represent the differences between the observed data and the values predicted by the estimated model parameters. They can often provide useful insights into how well the model fits the data, indicating areas where the model may under- or overestimate the observations. Residuals are thus defined as the difference between each observed data point and the corresponding predicted (or expected) value given the estimated parameters. In a model where $\hat{\theta}$ are the estimated parameter value(s) obtained by MLE, the residual for an observation $x_i$ can be expressed as

$$e_i = x_i - \hat{x}_i,$$

where $\hat{x}_i$ is the model's prediction for $x_i$ using the estimated parameters $\hat{\theta}$. In regression models, the residual is typically the difference between the observed outcome $y_i$ and the predicted mean $\hat{y}_i = f(x_i; \hat{\theta})$, where $f$ represents the model function (see Chapter 3 - Regression Models). In contrast, in probabilistic models, the residuals might be a function of the difference between $y_i$ and its expected value (e.g. for binary models, the predicted probability).
In MLE, residuals can be calculated after estimating the parameters. In general, the following steps can be followed:
- Fit the model using an estimator as described in section 2.3 to find the parameter estimates for the observed data.
- Calculate the predicted values $\hat{x}_i$ for each observation using the estimated parameters $\hat{\theta}$. The predicted values are the expected values (or mean predictions) under the model with these estimated parameters.
- Compute residuals by subtracting these predicted values from the observed values.
For example, in a regular linear regression model with parameters $\beta_0$ and $\beta_1$, after fitting the model and finding $\hat{\beta}_0$ and $\hat{\beta}_1$ using MLE, the residuals can be expressed as:

$$e_i = y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right),$$

where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ is the predicted mean for $y_i$ based on the estimated regression coefficients.
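As a short sketch (fitting the line by ordinary least squares via `numpy.polyfit`, which for Gaussian errors coincides with the MLE of the coefficients), the three steps above look as follows:

```python
# Fit a simple linear model, compute the predicted values and the residuals.
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 10, 100)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=x.size)   # synthetic data

beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)   # step 1: estimate slope and intercept
y_hat = beta0_hat + beta1_hat * x                # step 2: predicted values
residuals = y - y_hat                            # step 3: residuals

print(f"mean residual ≈ {residuals.mean():.3f} (close to 0 for a well-specified model)")
```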
Residual patterns can reveal model misfit. If the model is correctly specified, residuals should ideally be small, random and not show any clear pattern when plotted. Non-random patterns in residuals (e.g. systematic trends or non-constant variance) can indicate that the model does not capture certain aspects of the data. If this is the case, one might need to reconsider the model used to describe the observed data. Furthermore, after performing MLE, the residuals are expected to follow the assumed error distribution of the model (e.g. a normal distribution for a linear model, a binomial for a logistic model, etc.). Deviations from this distribution in the residuals might further suggest a need for model improvement.
Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in statistical modeling and machine learning, but also plays a crucial role in time series analysis. Specifically it is an important concept to consider when estimating the parameters of models such as ARIMA or GARCH models. Balancing bias and variance is especially important in developing models that should generalize well to unseen data, while still capturing the underlying structure of the time series.
The concept of bias was introduced previously, but not explained in detail. Bias refers to the error that is introduced by approximating a complex real-world process with a simpler model. High-bias models, which are often the result of oversimplification or under-parameterization, fail to capture the true underlying patterns of the data. This is called underfitting. Variance, on the other hand, is the error introduced by the model's sensitivity to fluctuations in the training data. High-variance models are often overly complex, responding to noise or idiosyncrasies in the dataset. This is called overfitting.
Example 2.1 - Bias-Variance Tradeoff
In this example polynomial regression will be used to obtain a model that can predict new data points. Since polynomials can be of any degree, it is easy to either underfit or overfit a model. In this specific example 100 data points are used as training data.
- The underfit polynomial is of degree 1, which means the model reverts to linear regression. This model introduces a lot of bias.
- The overfit polynomial is of degree 30, which introduces too much variability: local maxima and minima adapt too closely to the provided training data. This model introduces a lot of variance.
- To obtain a model that balances bias and variance, a polynomial fit of degree 4 is used. While less biased than the linear fit, it still introduces some variability, albeit less than the overfit polynomial.
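A sketch of this example (with the training data generated from an arbitrary cubic function plus noise, since the exact data of the example is not reproduced here) compares the training and test errors of the three polynomial degrees:

```python
# Polynomial fits of degree 1 (underfit), 4 (balanced) and 30 (overfit) on 100
# noisy training points. The degree-30 fit typically attains the lowest training
# error but a clearly higher error on unseen test data.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
def f(x):                                    # illustrative "true" function
    return 0.1 * x**3 - 0.8 * x**2 + x

x_train = np.sort(rng.uniform(0, 6, 100))
y_train = f(x_train) + 0.3 * rng.normal(size=x_train.size)
x_test = np.linspace(0, 6, 200)
y_test = f(x_test) + 0.3 * rng.normal(size=x_test.size)

for degree in (1, 4, 30):
    p = Polynomial.fit(x_train, y_train, deg=degree)   # least squares polynomial fit
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```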
The total error of a model is often expressed as the sum of the squared bias, the variance and the irreducible error $\sigma^2$:

$$\text{Total error} = \mathrm{Bias}\big(\hat{f}\big)^2 + \mathrm{Var}\big(\hat{f}\big) + \sigma^2.$$
In the context of time series analysis, the bias-variance tradeoff arises when estimating parameters for forecasting or inference. Several factors contribute to this tradeoff:
- Model Complexity - Simple models (e.g. low-order AR or MA models, as will be introduced later) often exhibit high bias because they cannot capture complex temporal dependencies. This leads to systematic errors in forecasts. More complex models (e.g. high-order ARMA or ARIMA models) can have low bias, but risk overfitting to specific noise patterns in the time series. This leads to high variance in parameter estimates.
- Sample Size & Data Properties - Limited or noisy data exacerbates the bias-variance tradeoff. For example, a highly flexible model may overfit when the sample size is small, while a rigid model may ignore significant patterns in larger datasets. It is important to tailor the model complexity to both the sample size and the properties of the data.
- Regularization Techniques - As will be seen in the next subsection, shrinkage methods, commonly known as regularization techniques (e.g. ridge regression or Lasso) are used in time series to constrain parameter estimates. This reduces variance at the cost of introducing some bias. Regularization techniques are particularly useful when dealing with multivariate time series or high-dimensional settings.
- Non-Stationarity - Time series often exhibit non-stationary behavior. This makes the estimation of parameters prone to bias. Variance can also increase if the model tends to overreact to transient changes in the data.
Regularization & Shrinkage Estimators
Bootstrap Methods
Goodness-of-Fit Tests
Model Misspecification
Error Metrics
2.6 Special Cases in Time Series ❌
Estimation in Non-Stationary Time Series
Multivariate Time Series Estimators
Non-parametric Estimation
Continue reading → Chapter 3 - Regression Models
TODO
- Add examples of PACF for AR, MA and ARMA models.