I. Introduction to forecasting
If you missed the first part, I suggest you read it before going through this article. It gives a good introduction as well as an overview of traditional risk management and big data simulation. This article is instead more focused on big data forecasting.
Financial practitioners now use several techniques borrowed from other disciplines to predict future market outcomes.
Eklund and Kapetanios (2008) provide a good review of these predictive methods, and I borrow their classification here. It divides forecasting techniques into four groups: single equation models that use the whole dataset; models that use only a subset of the data, even though the complete set is available; models that use partial datasets to estimate multiple forecasts, which are then averaged; and finally, multivariate models that use the whole dataset.
II. Single equation models
The first group is quite wide. It includes common techniques used in a different way, such as ordinary least squares (OLS) regression and Bayesian regression, as well as newer advancements in the field, such as factor models.
In the OLS model, when the number of predictors exceeds the number of time series observations, the generalized inverse has to be used in order to estimate the parameters.
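To make the point concrete, here is a minimal sketch in Python with NumPy. The data are simulated and the dimensions (20 observations, 50 predictors) are my own assumptions for illustration, not figures from the papers cited.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_predictors = 20, 50            # assumed sizes: more predictors than observations
X = rng.normal(size=(n_obs, n_predictors))
y = rng.normal(size=n_obs)

# X has rank at most n_obs, so the 50x50 matrix X'X is singular
# and the usual formula (X'X)^-1 X'y cannot be applied.
rank_x = np.linalg.matrix_rank(X)

# The Moore-Penrose generalized inverse picks the minimum-norm solution
# among all coefficient vectors that fit the data equally well.
beta_hat = np.linalg.pinv(X) @ y
fitted = X @ beta_hat
```

In this setting the fit is exact in-sample (`fitted` reproduces `y`), which is a reminder that a perfect fit says nothing about forecasting ability.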
Bayesian regression (De Mol, Giannone, and Reichlin, 2008) starts instead from a prior distribution and updates it as observations accumulate to finally obtain a posterior distribution. Of particular interest in this type of regression is the ability to use the whole dataset while shrinking the magnitude of the parameters, which yields a lower variance than standard estimators.
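The shrinkage effect can be sketched with the simplest possible case: a zero-mean Gaussian prior on the coefficients and Gaussian noise, for which the posterior has a closed form. The values of `sigma2` and `tau2` and the simulated data are assumptions of mine, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_predictors = 40, 10
X = rng.normal(size=(n_obs, n_predictors))
true_beta = np.zeros(n_predictors)
true_beta[:3] = [1.5, -2.0, 0.5]
y = X @ true_beta + rng.normal(size=n_obs)

sigma2 = 1.0   # assumed noise variance
tau2 = 0.1     # assumed prior variance: smaller values mean stronger shrinkage

# Posterior of beta under a N(0, tau2 * I) prior and Gaussian noise.
precision = X.T @ X / sigma2 + np.eye(n_predictors) / tau2
posterior_cov = np.linalg.inv(precision)
posterior_mean = posterior_cov @ X.T @ y / sigma2

# Standard OLS estimate and its covariance, for comparison.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
ols_cov = sigma2 * np.linalg.inv(X.T @ X)
```

Comparing `posterior_mean` with `beta_ols` shows the coefficients being pulled towards zero, and the diagonal of `posterior_cov` is smaller than that of `ols_cov`, which is the variance reduction mentioned above.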
Finally, factor models (Stock and Watson, 2002) are also part of this first group. These models identify ex-ante which combinations of the data carry the greatest predictive power and then use only a few variables to form the forecasting equation. Well-known models of this kind are principal component analysis (PCA), its dynamic version, and the subspace methods.
The PCA technique computes the eigenvectors of the covariance matrix of the data, which define linear combinations of all the variables, and then keeps only the first k, those associated with the largest eigenvalues (the loading vectors). If the spectral density matrix at different frequencies is used in place of the covariance matrix, the technique is called dynamic PCA.
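As an illustration of the static version only, the sketch below extracts k factors from a simulated panel through an eigendecomposition of the sample covariance matrix. The panel size, the number of factors and the noise level are hypothetical choices of mine.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_series, k = 200, 12, 2

# Simulated panel: 12 series driven by 2 common factors plus noise.
true_factors = rng.normal(size=(n_obs, k))
loadings = rng.normal(size=(k, n_series))
X = true_factors @ loadings + 0.1 * rng.normal(size=(n_obs, n_series))

Xc = X - X.mean(axis=0)                    # centre each series
cov = Xc.T @ Xc / (n_obs - 1)              # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # re-sort, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = eigvecs[:, :k]                # first k loading vectors
factors = Xc @ components                  # estimated factors
explained = eigvals[:k].sum() / eigvals.sum()
```

The columns of `factors` would then replace the original 12 series as regressors in the forecasting equation, and `explained` reports the share of total variance they capture.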
The subspace method starts instead from a parametric state-space model: a simple multivariate OLS model estimates the coefficients, while the factors are then obtained through a reduced rank approximation (Kapetanios and Marcellino, 2003).
III. Subset models
In the second group, we expect most of the data to be noisy and not really meaningful, so we use feature or variable selection models to identify ex-ante the most significant predictors. This class of methods therefore provides a way to avoid (if desired) the problem of dealing with large datasets.
Dimensionality reduction indeed allows the use of a simple linear forecasting model on the most appropriate subset of variables, which are narrowed down through information criteria. Boosting, LASSO, and Least Angle Regression (LAR) are some of the most common techniques belonging to this class.
Boosting entails running univariate regressions for each predictor and selecting the model that minimizes a chosen loss function. The residuals so obtained are then explained by running a second round of regressions on the remaining predictors, again selecting the one that minimizes the same loss function. The process is repeated until a certain information criterion is met.
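A rough sketch of the idea follows. It is a common simplified variant rather than the exact procedure described above: it uses squared error as the loss, takes small steps of size `step_size`, stops after a fixed number of steps instead of checking an information criterion, and allows the same predictor to be picked more than once. The function name, parameters and simulated data are hypothetical.

```python
import numpy as np

def componentwise_boost(X, y, n_steps=50, step_size=0.1):
    """At each step, regress the current residual on every predictor
    separately and update only the coefficient of the best-fitting one."""
    beta = np.zeros(X.shape[1])
    residual = y.astype(float).copy()
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_steps):
        slopes = X.T @ residual / col_ss            # univariate OLS slopes
        # Residual sum of squares left by each univariate fit.
        sse = residual @ residual - slopes ** 2 * col_ss
        best = int(np.argmin(sse))
        beta[best] += step_size * slopes[best]
        residual -= step_size * slopes[best] * X[:, best]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=100)
beta = componentwise_boost(X, y, n_steps=200)
```

In this simulated example only predictors 0 and 5 matter, and they end up with by far the largest coefficients in `beta`.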
LASSO and LAR, on the other hand, apply penalty functions to the regression. LASSO requires the norm of the estimated coefficient vector to be less than a specified shrinkage threshold. LAR works similarly to boosting, except that a new variable is not fully included at each step: its coefficient is instead increased only up to the point at which that variable is no longer the one that minimizes the loss function.
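For LASSO, the constraint on the norm is usually handled in its equivalent penalized form, where a multiple of the absolute-value norm of the coefficients is added to the squared error. The sketch below solves that form by coordinate descent, which is one standard way to do it and not necessarily the one used in the papers cited. The function names, the `penalty` value and the data are assumptions for illustration.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrink towards zero and set to exactly zero inside the threshold.
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + penalty * ||b||_1
    by updating one coefficient at a time."""
    n_obs, n_predictors = X.shape
    beta = np.zeros(n_predictors)
    col_ss = (X ** 2).sum(axis=0) / n_obs
    for _ in range(n_iter):
        for j in range(n_predictors):
            # Residual that ignores the current contribution of predictor j.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n_obs
            beta[j] = soft_threshold(rho, penalty) / col_ss[j]
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=100)
beta = lasso_coordinate_descent(X, y, penalty=0.3)
selected = np.flatnonzero(beta).tolist()
```

The penalty sets the coefficients of the irrelevant predictors to exactly zero, so `selected` acts as the chosen subset, while the surviving coefficients are shrunk relative to their true values.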
Many other procedures exist, such as stepwise regression, ridge regression, or the more exotic genetic algorithms and simulated annealing (Kapetanios, 2007).
Stepwise regression estimates the model starting either from no variables (forward regression) or from all variables (backward regression). At each step it adds or removes the variable that improves the model the most, and repeats the procedure until no further improvement is possible.
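A minimal sketch of the forward version is shown below. I use the Bayesian information criterion (BIC) as the measure of improvement, which is my own assumption since the choice of criterion is left open here; the function names and the simulated data are hypothetical as well.

```python
import numpy as np

def bic(X, y, columns):
    """BIC of an OLS fit on the chosen columns (no intercept)."""
    n_obs = len(y)
    if columns:
        Xs = X[:, columns]
        coef = np.linalg.lstsq(Xs, y, rcond=None)[0]
        residual = y - Xs @ coef
    else:
        residual = y
    rss = residual @ residual
    return n_obs * np.log(rss / n_obs) + len(columns) * np.log(n_obs)

def forward_stepwise(X, y):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = bic(X, y, selected)
    while remaining:
        # Try adding each remaining variable and keep the best candidate.
        score, candidate = min((bic(X, y, selected + [j]), j) for j in remaining)
        if score >= best_score:        # no candidate improves the criterion
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.5 * rng.normal(size=100)
selected = forward_stepwise(X, y)
```

The backward version would start from all columns and drop one variable at a time using the same criterion.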
Ridge regression instead adds a quadratic penalty term to the objective, which penalizes the size of the regression coefficients.
The genetic algorithm (Dorsey and Mayer, 1995) is an optimization method that iterates towards a solution by selecting, in the same fashion as natural selection, only the best-fitting features.
Finally, simulated annealing uses the objective function to create a non-homogeneous Markov chain that converges to a maximum or minimum of the function itself.
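To show what the Markov chain looks like in practice, here is a small sketch of simulated annealing on a one-dimensional toy problem. The objective function, the proposal range, the starting temperature and the cooling rate are all hypothetical choices of mine; in variable selection the state would be a subset of predictors rather than a number.

```python
import math
import random

def objective(x):
    # Hypothetical test function with several local minima.
    return x ** 2 + 4.0 * math.sin(3.0 * x)

def simulated_annealing(start, n_steps=5000, temp0=5.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    current, current_val = start, objective(start)
    best, best_val = current, current_val
    temp = temp0
    for _ in range(n_steps):
        proposal = current + rng.uniform(-1.0, 1.0)
        proposal_val = objective(proposal)
        delta = proposal_val - current_val
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls (Metropolis rule).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = proposal, proposal_val
            if current_val < best_val:
                best, best_val = current, current_val
        temp *= cooling    # the changing temperature makes the chain non-homogeneous
    return best, best_val

best_x, best_value = simulated_annealing(start=4.0)
```

Early on, the high temperature lets the chain jump out of local minima; as the temperature falls, the chain settles into one of them.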
IV. Partial datasets for multiple forecasts
The third group consists of averaging models, mainly Bayesian model averaging (BMA) and frequentist model averaging.
While the frequentist approach entails the creation of model confidence sets from which model likelihood can be derived, the BMA methodology provides a Bayesian framework for the combination of models.
Different combinations of predictors are used to model the dependent variable, and the resulting models are weighted together in order to obtain a better forecasting model, with weights corresponding to their posterior probabilities.
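The weighting step can be sketched as follows. Computing exact posterior model probabilities requires specifying priors, so this sketch relies on a widely used shortcut instead: it approximates them with weights proportional to exp(-BIC/2). That approximation, the tiny four-predictor dataset and all variable names are assumptions of mine.

```python
import itertools
import numpy as np

rng = np.random.default_rng(8)
n_obs, n_predictors = 80, 4
X = rng.normal(size=(n_obs, n_predictors))
y = 2.0 * X[:, 0] + rng.normal(size=n_obs)
x_new = rng.normal(size=n_predictors)      # predictors for the period to forecast

# Every non-empty subset of predictors defines one candidate model.
models = [list(cols) for r in range(1, n_predictors + 1)
          for cols in itertools.combinations(range(n_predictors), r)]

bics, forecasts = [], []
for cols in models:
    coef = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    rss = np.sum((y - X[:, cols] @ coef) ** 2)
    bics.append(n_obs * np.log(rss / n_obs) + len(cols) * np.log(n_obs))
    forecasts.append(x_new[cols] @ coef)
bics, forecasts = np.array(bics), np.array(forecasts)

# Assumption: posterior model probabilities approximated by exp(-BIC / 2).
weights = np.exp(-(bics - bics.min()) / 2.0)
weights /= weights.sum()
bma_forecast = float(weights @ forecasts)
```

With many predictors the number of subsets explodes, so in practice the model space is sampled rather than enumerated as it is here.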
V. Multivariate models
Finally, the last group uses the whole dataset to estimate a set of variables (Carriero, Kapetanios, and Marcellino, 2011). Reduced rank regression, Bayesian VAR, and Multivariate Boosting, as already proposed in Eklund and Kapetanios (2008), are only a few of the models belonging to this class.
Reduced rank regression starts from a set of classic vector autoregressive (VAR) models. When the underlying dataset is large, however, those models become quite noisy and full of insignificant coefficients. A rank reduction can therefore be imposed, constraining the rank of the coefficient matrix in the VAR model to be much smaller than the number of predictors.
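A minimal sketch of the rank restriction on a simulated VAR(1) is given below. It uses the simplest version of reduced rank regression, with an identity weighting of the errors, in which the restricted estimate is obtained by projecting the OLS fit onto its leading singular vectors. The dimensions, the chosen rank and the simulated process are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs, n_series, rank = 300, 6, 2

# Simulate a VAR(1), y_t = y_{t-1} A + e_t, whose coefficient matrix has rank 2.
A = rng.normal(size=(n_series, rank)) @ rng.normal(size=(rank, n_series))
A *= 0.8 / np.linalg.norm(A, 2)            # rescale so that the process is stable
Y_full = np.zeros((n_obs + 1, n_series))
for t in range(1, n_obs + 1):
    Y_full[t] = Y_full[t - 1] @ A + rng.normal(size=n_series)
X, Y = Y_full[:-1], Y_full[1:]             # lagged values and targets

# Unrestricted OLS estimate: full rank, with many noisy coefficients.
A_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

# Keep only the leading `rank` directions of the fitted values.
_, _, Vt = np.linalg.svd(X @ A_ols, full_matrices=False)
V_r = Vt[:rank].T
A_rrr = A_ols @ V_r @ V_r.T
```

`A_rrr` has rank 2 by construction, so the 36 free coefficients of the unrestricted model are replaced by a much smaller number of effective parameters.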
If the previous model compresses the greatest informative value into a few predictors, the Bayesian counterpart of the VAR focuses instead on constraining the model by imposing the restrictions as priors, although the data still influence how the coefficients are determined.
Multivariate boosting is quite similar to the single-equation version explained above. The main difference lies in estimating a multivariate model at each step (instead of a single equation, as in simple boosting), starting from a zero coefficient matrix and recursively setting to non-zero the individual coefficients that best explain the dependent variables.
VI. Other models
Many models have been provided in the previous sections, but the list is not exhaustive. Indeed, according to Varian (2014), classification and regression trees (CART), bagging, bootstrapping, random forests and spike and slab regression are other models commonly used in machine learning applications for finance.
Decision trees are used when it is necessary to describe a sequence of decisions or outcomes in a discrete way, and they work quite well in nonlinear cases and with missing data.
Bootstrapping estimates the sampling distribution of a statistic through repeated sampling with replacement, while bagging averages different models obtained from multiple bootstrap samples of different sizes.
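Both ideas fit in a few lines. In the sketch below the "returns" are simulated, the statistic of interest (the median) and the linear model are hypothetical choices of mine, and for simplicity every bootstrap sample has the same size as the original data.

```python
import numpy as np

rng = np.random.default_rng(6)
returns = rng.standard_t(df=4, size=250)    # hypothetical daily returns
n_boot = 1000

# Bootstrap: resample with replacement and recompute the statistic each time.
boot_medians = np.array([
    np.median(rng.choice(returns, size=returns.size, replace=True))
    for _ in range(n_boot)
])
median_se = boot_medians.std(ddof=1)
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

# Bagging: fit one model per bootstrap sample and average the predictions.
x = rng.normal(size=250)
y = 0.5 * x + returns
x_new = 1.0
bag_predictions = []
for _ in range(200):
    idx = rng.integers(0, x.size, size=x.size)    # row indices drawn with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    bag_predictions.append(slope * x_new + intercept)
bagged_forecast = float(np.mean(bag_predictions))
```

The spread of `boot_medians` approximates the sampling distribution of the median, while `bagged_forecast` is the average of 200 models, each fitted on a different resample.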
Random forests are instead a machine learning technique that grows a tree starting from a bootstrap sample and selects a random sample of predictors at each node. This process is repeated several times, and a majority vote across the trees is applied for classification purposes.
Finally, spike and slab regression is a Bayesian hierarchical model that specifies a (Bernoulli) prior distribution on the probability that each variable is included in the model. This is called the spike, i.e., the probability of a coefficient being non-zero. A second prior, the slab, is then placed on the regression coefficients of the variables selected in the first step. By combining the priors and repeating the process multiple times, it is possible to obtain posterior distributions for both of them, as well as the prediction of the overall model.
VII. Final considerations
Varian (2014) suggests a few additional points that should be considered as the dataset becomes larger: causal inference (Angrist and Pischke, 2009) and model uncertainty.
Causation and correlation are two very distinct concepts, and the issue with large noisy datasets is that it is much easier to find spurious correlations that have no real meaning or practical validity. Randomized control groups may be necessary in those cases, but sometimes a good predictive model can work even better (Varian, 2014).
The second point is that, when it comes to large datasets, averaging models may be more effective than choosing a single one. Usually a single simple representation is analyzed because of data scarcity, but nothing prevents the decision maker from using multiple models as soon as more data allow it.
- Angrist, J. D., Pischke, J. S. (2009). “Mostly Harmless Econometrics”. Princeton University Press.
- Carriero, A., Kapetanios, G., Marcellino, M. (2011). “Forecasting Large Datasets with Bayesian Reduced Rank Multivariate Models”. Journal of Applied Econometrics 26 (5): 735–761.
- De Mol, C., Giannone, D., Reichlin, L. (2008). “Forecasting Using a Large Number of Predictors: Is Bayesian Regression a Valid Alternative to Principal Components?”. Journal of Econometrics 146: 318–328.
- Dorsey, R. E., Mayer, W. J. (1995). “Genetic Algorithms for Estimation Problems with Multiple Optima, Nondifferentiability and Other Irregular Features”. Journal of Business and Economic Statistics, 13 (1): 53–66.
- Eklund, J., Kapetanios, G. (2008). “A Review of Forecasting Techniques for Large Data Sets”. Queen Mary working paper no. 625: 1–18.
- Kapetanios, G. (2007). “Variable Selection in Regression Models using Non-Standard Optimisation of Information Criteria”. Computational Statistics & Data Analysis 52: 4–15.
- Kapetanios, G., Marcellino, M. (2003). “A Comparison of Estimation Methods for Dynamic Factor Models of Large Dimensions”. Queen Mary, University of London Working Paper 489.
- Stock, J. H., Watson, M. W. (2002). “Macroeconomic Forecasting Using Diffusion Indices”. Journal of Business and Economic Statistics 20: 147–162.
- Varian, H. R. (2014). “Big Data: New Tricks for Econometrics”. Journal of Economic Perspectives 28 (2): 3–28.
Note: I have written an extended technical survey on big data and risk management for the Montreal Institute of Structured Finance and Derivatives (2016). If you want to read the complete work, it is available as a PDF.