Machine learning and causal analyses for modeling financial and economic data

Xu, Lei

doi:10.1186/s40535-018-0058-5

Review
Open access
Published: 29 December 2018

Machine learning and causal analyses for modeling financial and economic data

Lei Xu ORCID: orcid.org/0000-0002-2752-1573^1,2

Applied Informatics volume 5, Article number: 11 (2018) Cite this article

7049 Accesses
5 Citations
Metrics details

Abstract

Instead of aiming at a systematic survey, we consider further developments on several typical linear models and their mixture extensions for prediction modeling, portfolio management and market analyses. The focus is put on outlining the studies by the author’s research group, featured by (a) extensions of AR, ARCH and GARCH models into finite mixture or mixture-of-experts; (b) improvements of Sharpe ratio by maximizing the expected return and the upside volatility while minimizing the downside risk, with the help of a priori aided diversification; (c) developments of arbitrage pricing theory (APT) into temporal factor analysis (TFA)-based temporal APT, macroeconomics-modulated temporal APT and a general formulation for market modeling, together with applications to temporal prediction and dynamic portfolio management; (d) Bayesian Ying–Yang (BYY) harmony learning is adopted to implement these developments, featured with automatic model selection. After a brief introduction on BYY harmony learning, gradient-based algorithms and EM-like algorithms are provided for learning alternative mixture-of-experts-based AR, ARCH and GARCH models; and (e) path analysis for linear causal analyses is briefly reviewed, a recent development on ρ-diagram is refined for cofounder discovery, and a causal potential theory is proposed. Also, further discussions are made on structural equation modeling and its relations to modulated TFA-APT and nGCH-driven M-TFA-O.

Introduction

Financial and economic data are naturally recorded as temporal sequences or time series, and thus one of major tasks on those data is making time series analysis. Typically, a mathematical model is obtained to describe the regression relation of the current observation from its past observations, such that the future observation is predicted. Such a prediction task has been extensively studied in both the literature of time series analysis and the literature of machine learning and neural networks.

One most classic tool for time series analyses is the autoregressive (AR) model or generally autoregressive–moving-average (ARMA) model, which describes a linear dependence of the current observation on past values and noise disturbances. Extended from describing stationary processes to data with some identifiable trend of a polynomial growth (Box and Jenkins 1970), an initial differencing step can be applied to remove such a non-stationarity. See Box 1 in Fig. 1; the autoregressive integrated moving average (ARIMA) model is used to refer a “cascade” of this initialization and ARMA. For simplicity, we still prefer to use AMRA to refer ARIMA by regarding such an initialization as a pre-processing stage.

In the literatures of statistics and econometrics, as outlined in Fig. 1 by Box 2, generalizations of ARMA have also been made toward Autoregressive Conditional Heteroskedasticity (ARCH) and generalized ARCH (GARCH) for considering conditional heteroskedasticity of variables (Engle 1982; Bollerslev 1986), to nonlinear ARMA for modeling nonlinear dependence (Leontaritis and Billings 1985), and Vector AR (VAR) for capturing the linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987).

The field of NN-ML in economics and finance involves each of the three streams of studies. In the early stage, most efforts were put on using multilayer neural networks or recurrent networks for a sophisticated nonlinear dependence of the current observation on past values and noise disturbances, as outlined in Fig. 1 by Box 3. There have been already several books on these studies (e.g., Azoff 1994; Gately 1995; Zhang 2003), and thus this chapter does not cover this type of studies.

Since 1994, the author’s group has made many efforts on extending AR, ARMA, ARCH and GARCH models into finite mixture or mixture-of-experts (Xu 1994, 1995a, b; Cheung et al. 1996, 1997; Leung 1997; Kwok et al. 1998; Wong et al. 1998; Chiu and Xu 2002a, 2003; Tang et al. 2003). Outlined in Fig. 1 by Box 4, studies actually proceed along an alternative road for modeling temporal dependence featured with nonlinearity, heteroskedasticity and non-stationarity. “Financial prediction: time series models and three finite mixture extensions” section is dedicated to the studies summarized in Fig. 1, together with introductions on learning implementations by the maximum likelihood (ML) learning, the rival penalized competitive learning (RPCL) (Xu et al. 1992, 1993), and approaches of learning with model selection.

“Dynamic trading and portfolio management” section is dedicated to the studies summarized in Fig. 2, toward portfolio management directly, instead of making nonlinear modeling for analyses and predictions. Around the second half of the 1990s, efforts in the literature of neural networks and machine learning in economics and finance started to shift to adaptive trading; see Box 1. Subsequently, these efforts converge to the road pioneered by the Markowitz portfolio theory (Markowitz 1952) that maximizes the portfolio expected return for a given amount of portfolio risk by carefully choosing the proportions of assets; see Box 2. Based on Markowitz’s mean–variance paradigm, Sharpe (1966, 1994) further suggests evaluating the goodness of an asset by a ratio of the excess asset return; see Box 3. Later, it is further realized that the return variance is not an appropriate measure of portfolio risk because it counts the positive fluctuation above the expected returns (called upside volatility) also as the part of risk. The downside risk thus becomes a topic to study, as illustrated in Fig. 2 by Box 4; e.g., Markowitz (1959) counts the volatility below the expected returns only.

After a brief introduction on the above-mentioned boxes in Fig. 2, “Dynamic trading and portfolio management” section further reexamines the Markowitz paradigm and Sharpe ratio with extensions that maximizes the expected returns and the upside volatility while minimizing the downside risk, with the help of a priori aided diversification (Hung et al. 2000, 2003), see Box 5 in Fig. 2. Moreover, several extensions have been proposed along this direction in Sect III(C) of Xu (2001), including that nonparametric estimates of the expected return and volatilities are improved by ARCH or GARCH models; see Box 6 in Fig. 2.

Next, “Market modeling: APT theory and temporal factor analysis” section is dedicated to the efforts summarized in Fig. 3. The Markowitz scheme also leads to the Capital Asset Pricing Model (CAPM) (Sharpe 1964). However, the CAPM is criticized to be not enough to describe a market behavior merely via one endogenous factor. Then, a general linear model of multiple factors has been proposed under the name of Arbitrage Pricing Theory (APT) (Ross 1976). Unfortunately, the APT has not been widely accepted in popularity similar to the CAPM. The reason lies largely with its significant drawback: namely, its implementation is difficult due to the lack of specificity regarding the number and nature of the factors that systematically affect asset return (Dhrymes et al. 1984; Abeysekera and Mahajan 1987).

In “Market modeling: APT theory and temporal factor analysis” section, we start from introducing three approaches that are usually applied for the implementation of APT and address their drawbacks as outlined in “Introduction” section of (Xu 2001), which leads to an observation that the lack of specificity regarding the endogenous factors is not just regarding the number and nature of the factors, but even more seriously arising from the so-called rotation indeterminacy implemented by factor analysis. Thus, further efforts should explore how to add certain structure to remove or remedy this indeterminacy. As outlined in Fig. 3 by Box 1 and Box 2, temporal factor analysis (TFA) (Xu 1997, 2000) is suggested as a generalization of the original APT theory (Xu 2001) to tackle such an incompleteness, featured with a first-order autoregressive dependence added to each factor such that the incompleteness caused by a notorious rotation indeterminacy is removed. Such a generalization is thus called temporal APT in a sense that temporal relation is taken into consideration.

This section further considers the influences of macroeconomic indexes such as GDP, inflation, investor confidence and yield curve, via their roles in controlling or modulating the temporal factors, which leads to a macroeconomics-modulated temporal APT shown in Fig. 3 by Box 3. Alternatively, TFA may also be replaced by non-Gaussian factor analyses (NFA) such that the incompleteness caused by rotation indeterminacy can also be removed; see Box 6 and Box 7 in Fig. 3. Actually, both the temporal factors and non-Gaussian factors are two aspects of one market model: one observes a dynamic market process, while the other describes the market with all the time points projected to one reference spot. Even generally, conditional heteroskedasticity may also be added to the factors, which finally leads to Box 8 in Fig. 3, namely, a general formulation for financial market modeling that systematically integrates all the ingredients. As illustrated in Fig. 3 by Box 4, various prediction tasks and investment managements can also be conducted with the help of the temporal APT and the macroeconomics-modulated temporal APT.

Further developments of these linear models introduced are suggested to be implemented by the Bayesian Ying–Yang (BYY) harmony learning. In “Bayesian Ying–Yang harmony learning and two exemplar learning algorithms” section, the fundamentals of BYY harmony learning are briefly introduced. For learning alternative mixture-of-experts-based AR, ARCH and GARCH models, both gradient-based algorithms and EM-like algorithms are provided for implementations, featured with automatic model selection and in reference of the well-known EM algorithm.

Except for the first column in Fig. 1, where only one time series is considered, mostly we consider dependences across more than one channel of time series. Prediction and decision making in portfolio management are based on such dependences that may not necessarily reflect causal structure underlying data, while it will be better to make prediction and decision based on casual structure. In “Linear causal analyses” section, path analyses (Wright 1934) for linear causal analyses is briefly reviewed, a recent development on ρ-diagram (Xu 2018) is refined for cofounder discovery and a causal potential theory is proposed. Further discussions are made on structural equation modeling (SEM) (Ullman 2006; Pearl 2010a; Westland 2015; Kline 2015) and its relations to modulated TFA-APT and nGCH-driven M-TFA-O.

Financial prediction: time series models and three finite mixture extensions

Time series models and neural networks

One most classic tool for time series analyses is the autoregressive (AR) model or generally autoregressive–moving-average (ARMA) model as follows:

$$x_{t} = a_{0} + \varepsilon_{t} + \mathop \sum \limits_{j = 1}^{q} a_{j} x_{t - j} + \mathop \sum \limits_{i = 1}^{p} b_{i} \varepsilon_{t - i} , \varepsilon_{t} \sim^{{{\text{i.i.d.}}}} G(\varepsilon |0, \sigma^{2} ) ,$$

(1)

where $\varepsilon_{t} \sim^{{{\text{i.i.d}} .}} G(\varepsilon |0, \sigma^{2} )$ denotes that $\varepsilon_{1} , \ldots ,\varepsilon_{t} , \ldots$ are i.i.d. samples from $G(\varepsilon |0, \sigma^{2} )$, while $G(u|\mu , \sigma^{2} )$ denotes a Gaussian distribution of u with the mean μ and the variance σ². Particularly, the ARMA model degenerates to the AR model when q = 0.

The ARMA model is appropriate to describe a wide sense stationary sequence. Extension has been made to describe data ξ_t that have some clearly identifiable trend of a polynomial growth (Box and Jenkins 1970); see Box 1 in Fig. 1. It is made simply by an initial differencing to remove the non-stationarity. That is, we get

$$x_{t} = \Delta^{d} \xi_{t} ,\quad {\text{where }}d > 0 ,\;\Delta u_{t} = u_{t} - u_{t - 1} {\text{and }}u_{t} = \Delta^{d} \xi_{t} .$$

(2)

A cascade of this initialization and ARMA is called the autoregressive integrated moving average (ARIMA) model. For simplicity, we prefer to still use AMRA to indicate ARIMA by regarding such an initialization as a pre-processing stage.

In the literature of statistics, econometrics, control and signal processing, generalizations of ARMA have been made toward Autoregressive Conditional Heteroskedasticity (ARCH) and generalized ARCH (GARCH) for considering variables conditional to heteroskedasticity (Engle 1982; Bollerslev 1986); see Box 8 in Fig. 1. Namely, we consider

$$x_{t} = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} x_{t - j} + \varepsilon_{t} , \varepsilon_{t} = \sigma_{t} z_{t} , z_{t} \sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(z|0, 1) ,$$

where σ_t is not a constant, but given by the following regression:

$$\begin{aligned} \sigma_{t}^{2} \left( \vartheta \right) & = \sigma_{0}^{2} + \mathop \sum \limits_{i = 1}^{q} \beta_{i} \varepsilon_{t - i}^{2} + \mathop \sum \limits_{j = 1}^{p} \omega_{j} \sigma_{t - j}^{2} , \\ \vartheta & = \{ \sigma_{0}^{2} > 0, \beta_{i} \ge 0,\, for \ \ i > 0, \omega_{j} \ge 0,\, for \ \ j \ge 0\} , \\ \end{aligned}$$

(3)

which is usually denoted by GARCH(p,q) and degenerates to the ARCH model when p = 0.

Extensions of the ARMA model have also been made under the name of nonlinear ARMA (NARMA) for modeling nonlinear dependence (Leontaritis and Billings 1985) and to Vector AR (VAR) for capturing the linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987). In the literature, many efforts have been made on using multilayer neural networks or recurrent networks for a sophisticated nonlinear dependence of the current observation on past values and noise disturbances, as illustrated by Box 3 in Fig. There are already several books on these studies (e.g., Azoff 1994; Gately 1995; Zhang 2003), and thus this chapter does not cover this type of studies. Instead, the subsequent two subsections will focus on Box 4 in Fig. 1, namely, learning mixture of multiple models.

Learning mixture of AR, ARMA, ARCH and GRACH models

Studies on finite mixture extensions of AR, ARMA, ARCH and GARCH models can be summarized into the following general expression:

$$\begin{aligned} & P(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta ) = \mathop \sum \limits_{\ell = 1}^{k} \alpha_{\ell } G(x_{t} - \mu_{\ell ,t} |0, \sigma_{\ell ,t}^{2} ), \\ & \mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right), {\mathbf{x}}_{t - 1}^{m} = \left[ {x_{\text{t - 1}} , \ldots ,x_{\text{t - m}} ]^{\rm T} , \varvec{a}_{i} = } \right[a_{0,i} ,a_{1,i} , \ldots ,a_{{q_{i} ,i}} ]^{\rm T} , \\ & \varepsilon_{t} = x_{t} - \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{q} ,\theta } \right), q = \hbox{max} \{ q_{1,} , \ldots ,q_{m} \} , \\ \end{aligned}$$

(4)

where we consider k regression models $x_{t} = \mu_{i,t} + \varepsilon_{i,t} , i = 1,, \ldots ,k$ with each $\mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)$ being either of AR, ARMA, ARCH and GARCH models, and with the corresponding residual ɛ_i,t from $G(\varepsilon_{i,t} |0, \sigma_{i,t}^{2} )$. Typically, the studies of the AR, ARCH and GARCH models share the following detailed expression (Xu 1995a, b; Cheung et al. 1997; Kwok et al. 1998; Wong et al. 1998; Chiu and Xu 2003, 2004a; Tang et al. 2003):

$$\begin{aligned} & \mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right) = \varvec{a}_{i}^{\rm T} \left[ {\begin{array}{*{20}c} { 1} \\ {\varvec{x}_{t - 1}^{{q_{i} }} } \\ \end{array} } \right],\sigma_{i,t}^{2} = \left\{ {\begin{array}{*{20}l} {\sigma _{{i,0}}^{2} {\text{ > 0, }} \qquad \qquad \left( {\text{a}} \right) {\text{AR}}} \\ {\sigma _{{i,0}}^{2} + \varvec{b}_{i}^{{\text{T}}} E_{{i,t - 1}}^{{q_{i} }} ,\qquad \left( {\text{b}} \right){\text{ARCH}}} \\ {\sigma _{{i,0}}^{2} + \varvec{b}_{i}^{{\text{T}}} E_{{i,t - 1}}^{{q_{i} }} + \varvec{w}_{i}^{{\text{T}}} \sum _{{i,t - 1}}^{{p_{i} }} , \quad \left( {\text{c}} \right){\text{GARCH}}} \\ \end{array} } \right. \\ & E_{i,t - 1}^{n} = \left[ {\varepsilon_{i,t - 1}^{2} , \ldots ,\varepsilon_{i,t - n}^{2} ]^{\rm T} , {\mathbf{w}}_{i} = } \right[w_{1,i} , \ldots ,w_{{p_{i} ,i}} ]^{\rm T} , \,\omega_{j,i} \, \ge \,0, j\, = \,1, \ldots ,p_{i} \\ & \varSigma_{i,t - 1}^{n} = \left[ {\sigma_{i,t - 1}^{2} , \ldots ,\sigma_{i,t - n}^{2} ]^{\rm T} , {\mathbf{b}}_{i} = } \right[\beta_{1,i} , \ldots ,\beta_{{q_{i} ,i}} ]^{\rm T} , \, \beta_{j,i} \, \ge \,0, j\, = \,1, \ldots ,q_{i} . \\ \end{aligned}$$

(5)

For ARMA (Kwok et al. 1998; Tang et al. 2003), the detailed expression of $\mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)$ is given by Eq. (1). Moreover, $\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)$ can be also a specific nonlinear function, e.g., given by three-layer neural networks (Cheung et al. 1996, 1997) or the normalized radial basis function (NRBF) and extended NRBF (ENRBF) (Xu 1998, Xu 2009).

According to Eq. (4), a sequence x₁, …, x_t, … may come from the ith one of the k models with the probability α_i, and jointly the k models describe the sequence x₁, …, x_t, … with a residual ɛ_t that comes from a Gaussian mixture $P(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta )$. In such a way, a nonlinear dependence of the current observation on past values and noise disturbances is modeled by probabilistically combining a mixture of linear models, which keeps the model structure simple and easy to learn. Moreover, non-stationarity beyond ones handled by ARIMA and GARCH models is able to be modeled via switching among individual linear models.

Also, a sequence x₁, …, x_t, … may be segmented into pieces with different statistical properties, simply by Bayesian posterior as follows (Xu 1994, 1995a, b):

$$P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta ) = \frac{{\alpha_{{j_{t} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}}{{\mathop \sum \nolimits_{{j_{t} = 1}}^{k} \alpha_{{j_{t} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}},$$

(6)

that is, x_t is identified as coming from the j^*th model by

$$j^{*} = {\text{argmax}}_{j} P(j|x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta ) \quad {\text{or }}j^{*} = {\text{argmax}}_{j} [\alpha_{j} G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )].$$

To reduce the number of small fragments, some post-processing or smoothing regularization may be added. Moreover, we may extend a finite mixture into a hidden Markov model (HMM) (Rabiner 1989), in which each hidden state is associated with one $G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )$ and the transition between state is described by

$$\begin{aligned}\varvec{\alpha}_{\varvec{t}} & = \varvec{Q\alpha }_{{\varvec{t} - 1}} , {\varvec{\upalpha}}_{\varvec{t}} = [\varvec{\alpha}_{{1,\varvec{t}}} , \ldots ,\varvec{\alpha}_{{\varvec{k},\varvec{t}}} ]^{\bf{T}} , 0\le\varvec{\alpha}_{{\varvec{j},\varvec{t}}} \le 1 , \mathop \sum \limits_{\varvec{j}}\varvec{\alpha}_{{\varvec{j},\varvec{t}}} = 1, \\ \varvec{Q} & = \left[ {\varvec{q}_{{\varvec{j}|\varvec{i}}} } \right], 0\le \varvec{q}_{{\varvec{j}|\varvec{i}}} \le 1,\varvec{ }\mathop \sum \limits_{{\varvec{i} = 1}}^{\varvec{k}} \varvec{q}_{{\varvec{j}|\varvec{i}}} = 1, \\ \end{aligned}$$

(7)

with α_j,t estimated as time proceeds and then used in Eq. (5) and Eq. (6). Moreover, we can also further modify Eq. (5) and Eq. (6) into

$$\begin{aligned} P(j_{t} |x_{t} ,j_{t - 1} ,\varvec{x}_{t - 1}^{q} ,\theta ) & = \frac{{q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}}{{\mathop \sum \nolimits_{{j_{t} = 1}}^{k} q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}}, \\ j_{t}^{*} & = {\text{argmax}}_{j} P(j|x_{t} ,j_{t - 1} ,\varvec{x}_{t - 1}^{q} ,\theta ) {\text{ or }}j_{t}^{*} = {\text{argmax}}_{j} [q_{{j|j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )]. \\ \end{aligned}$$

(8)

Next, we proceed to estimate x_t from the finite mixture by Eq. (4). It follows that

$$\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{q} , \theta } \right) = \mathop \int \nolimits x_{t} p(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta )\,{\text{d}}x_{t} = \mathop \sum \limits_{i = 1}^{k} \alpha_{i} \mu_{i,t} ,$$

(9)

that is, we improve the prediction of x_t via each individual model by a line combination weighted by each α_i. However, this improvement is limited because α_i is a constant that does not change as the samples vary with time.

Each α_i in Eq. (4) cannot directly be replaced by its corresponding Bayes posterior by Eq. (5). First, $P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta )$ cannot be moved out of the integral $\mathop \smallint \nolimits x_{t} P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta )G(x_{t} |\mu_{j,t} , \sigma_{j,t}^{2} )dx_{t}$, though the integral can be made approximately. Second, the calculation needs to know x_t. Getting $\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t}$ from knowing x_t is applicable to a filtering problem that gets a smoothed or filtered version from x_t, but it is not applicable to a prediction problem that targets at getting $\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t}$ from its past observations.

Instead, we use a predictive $P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )$ based on the immediate past observations $\varvec{x}_{t - 1}^{q}$ to combine the prediction of individual prediction model adaptively; that is, we have

$$\begin{aligned} p(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta ) & = \mathop \sum \limits_{{j_{t} = 1}}^{k} P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} ), \\ \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{q} , \theta } \right) & = \mathop \int \nolimits x_{t} p(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta )\,{\text{d}}x_{t} = \mathop \sum \limits_{{j_{t} = 1}}^{k} P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )\mu_{{j_{t} ,t}} , \\ \end{aligned}$$

(10)

which summarizes extensions of the AR, ARMA, ARCH and GARCH models with the help of the mixture-of-experts (ME). In the implementation of the original ME (Jacobs et al. 1991; Jordan and Xu 1995), $P(j|\varvec{x}_{t - 1}^{q} ,\varphi )$ is called the gating net and given as follows:

$$P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = e^{{g_{j} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right)}} /\mathop \sum \limits_{j = 1}^{k} e^{{g_{j} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right)}} ,$$

with $g_{1} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right), \ldots , g_{k} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right)$ being the output of multilayer networks.

In an implementation of an alternative ME model (Xu et al. 1994, 1995), we consider a predictive Bayesian posteriori

$$P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = \frac{{\alpha_{j} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )}}{{q(\varvec{x}_{t - 1}^{q} |\psi )}}, q({\mathbf{x}}_{t - 1}^{q} |\psi ) = \sum\nolimits_{j = 1}^{k} \alpha_{j} q(\varvec{x}_{t - 1}^{q} |\psi_{j} ).$$

(11)

For the AR, ARCH and GARCH models, we further have

$$q(\varvec{x}_{t - 1}^{q} |\psi_{j} ) = q(x_{t - 1} |x_{t - 2} , \ldots ,x_{t - q} )q(x_{t - 2} |x_{t - 3} , \ldots ,x_{t - q} ) \cdots q\left( {x_{t - q} } \right).$$

To simplify the computation, we may consider the following approximation:

$$q(\varvec{x}_{t - 1}^{q} |\psi_{j} ) \approx q(x_{t - 1} |x_{t - 2} , \ldots ,x_{t - q - 1} ) = G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} ).$$

(12)

A further insight into Eq. (11) can be obtained at a setting that σ ²_{j,
t−1} = σ ²_j and x_t−1 = μ_j,t−1.; in this special case, we have a further simplification:

$$P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = \frac{{\alpha_{j} /\sigma_{j} }}{{\mathop \sum \nolimits_{j = 1}^{k} \alpha_{j} /\sigma_{j} }},$$

(13)

which shares a similar concept to the mixture-using variance (MUV) and actually degenerates to this MUV (Perrone and Cooper 1993, Perrone 1994) when $\alpha_{j} \propto \sigma_{j,t}^{ - 1}$. Another special case is that α_i/σ_i,t is constant, and it follows from Eqs. (11) to (12) that we have

$${{P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} } \mathord{\left/ {\vphantom {{P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} } {\sum \nolimits_{j = 1}^{k} e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} ,}}} \right. \kern-0pt} {\sum \nolimits_{j = 1}^{k} e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} ,}}$$

(14)

by which we get the counterparts of NRBF and ENRBF (Xu 1998, Xu 2009).

The other choices of $P(j|\varvec{x}_{t - 1}^{q} ,\varphi )$ may also be obtained or modified from Table 3 in Xu and Amari (2008). Moreover, similar to Eq. (8), it still follows from $q(\varvec{x}_{t - 1}^{q} |\psi_{j} )$ given by Eqs. (11) and (12) that we may further incorporate the HMM model from Eq. (7) into Eq. (11) and get

$$P(j_{t} |x_{t} ,j_{t - 1} ,\varvec{x}_{t - 1}^{q} ,\varphi ) = {{q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )} \mathord{\left/ {\vphantom {{q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )} {\mathop \sum \nolimits_{j = 1}^{k} q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )}}} \right. \kern-0pt} {\mathop \sum \nolimits_{j = 1}^{k} q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )}}.$$

(15)

Maximum likelihood, RPCL learning and learning with model selection

Typically, unknown parameters in the models in Eqs. (4), (8), (10) and (11) are estimated by the maximum likelihood (ML) learning, that is, the following maximization:

$$\begin{aligned} & \quad \varTheta^{*} = \arg \max_{\varTheta } L(\{ x_{t} \}_{t = 1}^{N} |\varTheta ), \hfill \\ L(\{ x_{t} \}_{t = 1}^{N} |\varTheta ) &= \left\{ \begin{array}{ll} \sum\limits_{t} {\ln \sum\limits_{{j_{t} = 1}}^{k} {\alpha_{{j_{t} }} } G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} ,&\quad {\text{ (a) for finite mixture by Eq}} .\, ( 4 ) ,\\ \sum\limits_{t} {\ln \left\{ {\sum\limits_{{j_{t} = 1}}^{k} {P(j_{t} |{\mathbf{x}}_{t - 1}^{q} ,\phi )} G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} \right\}} ,&\quad ( {\text{b) for ME by Eq}} .\, ( 10 ) ,\\ \sum\limits_{t} {\ln \left\{ {\sum\limits_{{j_{t} = 1}}^{k} {\alpha_{{j_{t} }} q({\mathbf{x}}_{t - 1}^{q} |\psi_{{j_{t} }} )} G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} \right\}} ,&\quad ( {\text{c) for AME by Eq}} .\, ( 1 1 ) ,\\ \ln \left\{ {\sum\limits_{{j_{1} , \ldots ,j_{N} }} {\prod\limits_{t} {q_{{j_{t} |j_{t - 1} }} G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} } } \right\},&\quad ( {\text{d) for HMM mixture }} .\end{array} \right. \end{aligned}$$

(16)

This maximization is implemented by the EM algorithm (Redner and Walker 1984), e.g., see the EM algorithms for finite mixture of AR models in Xu (1994, 1995a, b), finite mixture of GARCH models in Wong et al. (1998), finite mixture of ARMA–GARCH models in Tang et al. (2003) and the original ME in Jordan and Xu (1995), as well as the alternative ME model, NRBF and ENRBF in Xu et al. (1994, 1995) and Xu (1998, 2009).

For an HMM mixture, we may also have the following approximate likelihood:

$$L(x_{t(t = 1)}^{N} |\varTheta ) = \left\{\begin{array}{ll} \sum\nolimits_{t} {\ln \sum\nolimits_{{j_{t} = 1}}^{k} {q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} ),\quad ({\text{i}}),} } \\ {{\sum\nolimits_{t} {\ln \left\{ {\sum\nolimits_{{j_{t} = 1}}^{k} {q_{{j_{t} j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{{j_{t} }} )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )} } \right\}} ,\quad ({\text{ii}})}}.\end{array} \right.$$

(17)

One critical problem for the ML learning is that a good performance on a training set is not necessarily good on a testing set, especially when the training set consists of a small size of samples. The reason is that there may be too many free parameters. As introduced in the third section of Xu (2009), efforts on this problem are mainly featured by learning with model selection. Model selection refers to select a model with an appropriate complexity $\varvec{k}$. For the models considered in the previous subsection, $\varvec{k }$ consists of the number of individual models, the autoregression order and the moving average order for each individual model. Typically, the ML learning is not good for model selection. However, whether the EM algorithm works well depends on whether an appropriate $\varvec{k}$ is selected.

Classically, model selection is made in a two-stage implementation. First, enumerate a candidate set $\varvec{\rm K}$ of $\varvec{k}$ and estimate a solution $\varTheta_{\varvec{k}}^{*}$ for the unknown set Θ_k of parameters by the ML learning at each $\varvec{k} \in \varvec{\rm K}$. Second, use a model selection criterion $J\left( {\varTheta_{\varvec{k}}^{*} } \right)$ to select a best $\varvec{k}^{*}$. Several classical criteria are available for the purpose, such as AIC, CAIC and BIC/MDL, and readers are referred to Xu (2009, 2010) for a recent outline. Unfortunately, any one of these criteria usually provides a rough estimate that may not yield a satisfactory performance. Even with a criterion $J\left( {\varTheta_{\varvec{k}} } \right)$ available, this two-stage approach usually incurs a huge computing cost. Still, the parameter learning performance deteriorates rapidly as $\varvec{k}$ increases, which makes the value of $J\left( {\varTheta_{\varvec{k}} } \right)$ to be evaluated unreliably.

One direction that tackles this challenge is called automatic model selection, which is associated with a learning algorithm or a learning principle with the following two features:

When there is an indicator $\rho \left( {\theta_{\varvec{r}} } \right)$ on a subset $\theta_{\varvec{r}} \in \varTheta_{\varvec{k}}$, we have $\rho \left( {\theta_{\varvec{r}} } \right) = 0$ if $\theta_{\varvec{r}}$ consists of parameters of a redundant structural part.
In implementation of this algorithm or principle, there is a mechanism that automatically drives $\rho \left( {\theta_{\varvec{r}} } \right) \to 0$ as $\theta_{\varvec{r}}$ toward a specific value. Thus, the corresponding redundant structural part is effectively discarded.

An early effort along this direction is rival penalized competitive learning (RPCL) (Xu et al. 1992, 1993) for adaptively learning a model that consists of $k$ substructures as follows:

$$\theta_{j}^{\text{new}} = \theta_{j}^{\text{old}} + p_{j,t} \eta \frac{{\partial \pi_{j,t} \left( {\theta_{j}^{\text{old}} } \right)}}{{\partial \theta_{j} }},\quad p_{j,t} = \left\{ {\begin{array}{ll} 1, &\quad j = c = {\text{argmax}}_{j} \pi_{j,t} \left( {\theta_{j}^{\text{old}} } \right), \\ \gamma ,&\quad j = {\text{argmax}}_{j \ne c} \pi_{j,t} \left( {\theta_{j}^{\text{old}} } \right), \\ 0,&\quad {\text{otherwise}}. \\ \end{array} } \right.$$

(18)

where η > 0 is a learning step size and γ is a small positive number, e.g., γ = 0.005–0.01. With $k$ initially at a value large enough, a current input sample x_t is allocated to one of the $k$ substructures via competition. The winner adapts to this sample by a little bit, while the rival is de-learned a little bit to reduce a duplicated allocation. This rival penalized mechanism will discard extra substructures, making model selection automatically during learning. Readers are referred to Xu (2007) for a recent overview and extensions.

Corresponding to Eq. (16), π_j,t(θ ^old_j ) in Eq. (18) is given as follows:

$$\pi_{{j_{t} ,t}} \left( {\theta_{{j_{t} }} } \right) = \left\{ \begin{array}{ll} \ln [\alpha_{{j_{t} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )], & \quad \left( {\text{a}} \right) \, {\text{for finite mixture by Eq}} .\left( 4 \right), \\ \ln [P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )], & \quad \left( {\text{b}} \right) \, {\text{for ME by Eq}} .\left( { 1 0} \right), \\ \ln [\alpha_{{j_{t} }} q(\varvec{x}_{t - 1}^{q} |\psi_{{j_{t} }} )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )], & \quad \left( {\text{c}} \right) \, {\text{for AME by Eq}} .\left( { 1 1} \right)\\ \end{array} \right..$$

(19)

For an HMM mixture, we may also approximately have

$$\pi_{{j_{t} ,t}} \left( {\theta_{{j_{t} }} } \right) = \left\{ {\begin{array}{*{20}l} {\ln [q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )],} & { \left( {\text{i}} \right)\, {\text{for HMM mixture by Eq}} .\left( 8 \right),} \\ {\ln [q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{{j_{t} }} )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )],} & {\left( {\text{ii}} \right)\, {\text{for HMM AME by Eq}} .\left( { 1 5} \right).} \\ \end{array} } \right.$$

(20)

Another stream of automatic model selection is featured by those appropriate prior-based efforts. By a Laplace prior in a regression task, sparse learning or Lasso shrinkage prunes away extra weights (Williams 1995; Tibshirani 1996). For pruning away Gaussian components on Gaussian mixture, a Jeffreys priori is used in the implementation of the minimum message length (MML) that minimizes a two-part message for a statement of model and a statement of data encoded by that model (Figueiredo and Jain 2002), and also Dirichlet–Normal–Wishart priories is added on Gaussian components in the implementation of the variational Bayes (VB) that computes a lower bound of the marginal likelihood (McGrory and Titterington 2007).

However, these efforts highly depend on choosing an appropriate prior, which is usually a difficult task, while an inappropriate prior may deteriorate the performance of model selection seriously. Without any priors on the parameters, VB and MML all degenerate to the maximum likelihood learning, while the RPCL learning is still capable of automatic model selection. Firstly proposed in Xu (1995a, b) and systematically developed over a decade and half (Xu 2001, 2007, 2010, 2012), the third stream of efforts has been made under the name of Bayesian Ying–Yang (BYY) harmony learning. The BYY harmony learning shares a mechanism similar to the RPCL learning. Also, the performances of BYY harmony learning can be further improved by incorporating appropriate priors. Further details about the BYY harmony learning are referred to “Bayesian Ying–Yang harmony learning and two exemplar learning algorithms” section, where a tutorial is also provided on one BYY harmony learning algorithm for alternative mixture-of-experts-based GARCH models.

Dynamic trading and portfolio management

Dynamic trading by supervised learning and reinforcement learning

Instead of building a mathematical model for understanding and forecasting time series, studies of neural networks and machine learning in economics and finance started to shift from nonlinear forecasting modeling to adaptive trading and dynamic portfolio management (Neuneier 1996; Choey and Weigend 1997; Xu and Cheung 1997; Moody et al. 1998; Hung et al. 2000; Moody and Saffell 2001; Hung et al. 2003; Chiu and Xu 2004b; Jangmin 2006). Efforts on portfolio management will be addressed in the next subsection. In the sequel, we introduce efforts on learning dynamic trading based on one single time series, with the help of supervised learning, reinforcement learning and Sharpe ratio maximization.

Given a sequence x₁, …, x_t, e.g., the sequence of one asset, Gold, FOREX index,…, etc., at any time point t ≤ τ we may infer a sequence $I_{1}^{p} , \ldots I_{t}^{p}$ each $I_{\tau }^{p}$ being the following desired trading signal:

$$I_{\tau }^{p} = \left\{ \begin{array}{ll} + 1, &\quad {\text{to}}\,{\text{buy}}, \hfill \\ - 1, &\quad {\text{to}}\,{\text{sell}}, \hfill \\ 0, &\quad {\text{no}}\,{\text{action}}, \hfill \\ \end{array} \right.$$

(21)

based on a trading strategy (e.g., maximum return) or an external expertise.

The task of learning decision, as illustrated by Box 1 in Fig. 2, can be formulated as a nonlinear regression model:

$$\tilde{I}_{t}^{p} = \frac{{1 - e^{{ - f\left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)}} }}{{1 + e^{{ - f\left(XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta\right) }} }}$$

(22)

where $f\left( {XF_{t}^{q} , \left\{ {I_{t - \tau }^{p} } \right\}_{t = 1}^{q} , \varTheta } \right)$ is implemented by an ENRBF network in Xu & Cheung (1997). Also, it can be implemented by three-layer neural networks. Supervised learning is used to determine the unknown parametric Θ by minimizing

$$E_{2} \left( \varTheta \right) = \mathop \sum \limits_{t}[ I_{t}^{p} - f(XF_{t}^{q} ,\{ I_{t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta )]^{2} ,$$

(23)

where XF ^q_t may be directly a number of past observations {x_t−τ} ^q_t=1 or certain features {F ^{(
i)}_t } extracted from{x_t−τ} ^q_t = 1 , e.g., F ^{(
i)}_t may be MACD, RSI, %K, %D, as well as features from candlestick charts and configurations from waves, etc. Also, we may put both together to consider $XF_{t}^{q} = \left\{ {\left\{ {x_{t - \tau } } \right\}_{t = 1}^{q} , \left\{ {F_{t}^{(i)} } \right\}} \right\}.$

One key problem is how to keep a good generalization ability by training with a small length of sequence x₁, …, x_t. One way is adding some regularization term E₂(Θ) + λΓ(Θ). Without a priori knowledge, however, it is not an easy task to get an appropriate term Γ(Θ) and its strength λ. The other way is to describe the model as follows:

$$\begin{aligned} q\left( {\tilde{I}_{t}^{p}\ | \ {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta }} \right) &= \frac{{exp\left[ {z_{t}^{{(1)}} f^{{(1)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right) + z_{t}^{{(2)}} f^{{(2)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)} \right]}}{{1 + exp\left[ {f^{{(1)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)} \right] + \exp \left[ {f^{{(2)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)} \right]}} \hfill \\ f(XF_{t}^{q} ,\{ I_{{t - \tau }}^{p} \} _{{t = 1}}^{q} ,\Theta ) &= [f^{{\left( 1 \right)}} (XF_{t}^{q} ,\{ I_{{t - \tau }}^{p} \} _{{t = 1}}^{q} ,\Theta ),f^{{\left( 2 \right)}} (XF_{t}^{q} ,\{ I_{{t - \tau }}^{p} \} _{{t = 1}}^{q} ,\Theta )]^{{\text{T}}} , \end{aligned}$$

(24)

with $I_{t}^{p} = [z_{t}^{\left( 1 \right)} ,z_{t}^{\left( 2 \right)} ,z_{t}^{\left( 3 \right)} ]^{\rm T} , z_{t}^{\left( 2 \right)} = 0\, {\text{or}} \,1$ and z ⁽¹⁾_t + z ⁽²⁾_t + z ⁽³⁾_t = 1. Correspondingly, min _ΘE₂(Θ) is replaced by maximizing the likelihood $L\left( \varTheta \right) = \mathop \sum \limits_{t} { \ln }q(I_{t}^{p} |f(XF_{t}^{q} ,\{ I_{t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta ))$. In the formulation, learning regularization may be implemented via Bayesian learning with help of a priori distribution q(Θ), i.e., max _Θ[L(Θ) + lnq(Θ)]. For a better generalization ability, we may also put q(I ^p_t |f(XF ^q_t , {I ^p_{t−
τ} } ^q_t=1 , Θ)) into a Bayesian Ying–Yang system and making BYY harmony learning with automatic model selection; see Sect. 4.4 in Xu (2010).

The other key problem is how to make a pre-processing stage for getting a desired sequence $I_{1}^{p} , \ldots ,I_{t}^{p}$, which can be obtained automatically by a trading strategy, e.g., getting a profit and cutting a loss beyond a pre-specified threshold as follows:

$$I_{t}^{p} = \left\{ \begin{array}{ll} + 1, &\quad {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} \mathord{\left/ {\vphantom {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} {\sigma_{t} \ge g_{0}^{ + } > 0}}} \right. \kern-0pt} {\sigma_{t} \ge g_{0}^{ + } > 0}}, \hfill \\ - 1, &\quad {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} \mathord{\left/ {\vphantom {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} {\sigma_{t} \le g_{0}^{ - } \le 0}}} \right. \kern-0pt} {\sigma_{t} \le g_{0}^{ - } \le 0}}, \hfill \\ 0, &\quad {\text{no}}\,{\text{action}}, \hfill \\ \end{array} \right.$$

where σ_t is an estimation of the volatility about this asset. Also, $I_{1}^{p} , \ldots ,I_{t}^{p}$ may come from an outcome of market technical analysis, which is difficult to get $I_{1}^{p} , \ldots ,I_{t}^{p}$ adaptively in a dynamic trading.

From the studies (Moody et al. 1998; Moody and Saffell 2001; Jangmin 2006), $I_{1}^{p} , \ldots ,I_{t}^{p}$ is a sequence of actions that are dynamically learned by reinforcement learning. Typically, a reinforcement learning model consists of a set S of environment states (e.g., differences in the current price of asset and the volumes in holding) and a set A (e.g., buy, sell, no action) of actions. There is also a policy π that chooses an action a_t ∊ A at an environment state s_t. The action a_t makes the environment move to a new state s_t+1. Associated with the transition (s_t, a_t, s_t+1), there is a scalar immediate reward r_t+1(s_t, a_t, s_t+1) that is estimated according to a utility function, e.g., a maximum profit. The goal is to collect as much reward as possible by determining a sequence of actions a₁, …, a_t.

In the literature of reinforcement learning, one popular approach is called Q-learning, by which a_t is chosen according to a table Q(s_t, a_t) that is learned from r_t+1(s_t, a_t, s_t+1). For a dynamic trading, the S of environment states are featured by differences in the current price of asset and the volumes in holding. Quantizing the differences into the states is not an easy task. Also, there will be a large number states to be considered. As a result, we need to learn a large Q(s_t, a_t) table, which not only increases computing cost rapidly, but also makes the problem of a small sample size become more serious because Q(s_t, a_t) consists of too many free parameters to be determined. Instead of Q-learning, the action a_t in r_t+1(s_t, a_t, s_t+1) can be approximately replaced by the value of I ^p_t given by Eq. (22) such that r_t+1(s_t, a_t, s_t+1) is replaced by an expression r_t+1(s_t, s_t+1, {x_t−τ} ^q_t=1 , {I ^p_{t−
τ} } ^q_t=1 , Θ). As a result, the maximization of ∑ ^∞_t=1 γ^tr_t+1(s_t, a_t, s_t+1) with respect to a sequence of discrete actions a₁, …, a_t is replaced by the maximization of ∑ ^∞_t=1 γ^tr_t+1(s_t, s_t+1, {x_t−τ} ^q_t=1 , {I ^p_{t−
τ} } ^q_t=1 , Θ) with respect to Θ. Similar to learning regularization, the problem of a small sample size may also be handled by adding a a priori term, e.g., $\sum\nolimits_{t = 1}^{\infty } {\gamma^{t} r_{t + 1} \left( {s_{t} , s_{t + 1} , \left\{ {x_{t - \tau } } \right\}_{t = 1}^{q} , \left\{ {I_{t - \tau }^{p} } \right\}_{t = 1}^{q} , \varTheta } \right) + \lambda { \ln } q\left( \varTheta \right)} .$

The last but not the least, the specific expression of r_t+1(s_t, a_t, s_t+1) is an important practical issue, related to the current price of asset, the volume in holding, the transaction cost and the tax, as well as personal preference. There could be a number of choices. See Fig. 2 by Box 3; a widely used one is the Sharpe ratio, which is originally suggested for evaluating the goodness of an asset in market by a ratio of the excess asset return (i.e., after minus the benchmark return) over the standard deviation of the excess asset return (Sharpe 1966, 1994). For dynamic trading, it is not the Sharpe ratio of the asset in market that has to be calculated, but the Sharpe ratio of the dynamic trading system, which depends on a sequence of actions a₁, …, a_t.

Dynamic portfolio management by maximizing Sharpe ratio and extensions

Instead of only considering one single asset, a common and more reliable practice is considering a portfolio of assets, and thus portfolio management is one important topic in the finance literature. For the supervised learning by Eq. (22), its extension can be made simply by considering $I_{j,t}^{p} (XF_{t}^{q} ,\{ I_{j,t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta_{j} ), j = 1, \ldots ,k$ with each in the format of Eq. (22), and learning is made by minimizing the total sum ∑ _jE₂(Θ_j). Simply, we get the training signals $I_{j,1}^{p} , \ldots ,I_{j,t}^{p}$ per asset individually. Still, further studies are needed on how to get the training signals bases on the whole portfolio of assets. Conceptually, extension of reinforcement learning to multiple assets is rather straightforward too. However, both the set S of environment states and the set $A$ of possible actions increase rapidly, which makes learning a large table Q(s_t, a_t) seriously suffer the problem of a small sample size. Thus, it becomes more critical to get a₁, …, a_t to be approximately replaced by {I ^p_{j,
t} (XF ^q_t , {I ^p_{j,
t −
τ} } ^q_t=1 , Θ_j)} ^k_j=1 in evaluating the reward r_t+1 (Moody et al. 1998; Moody and Saffell 2001). Similar to supervised learning, one direction for tackling the problem of a small sample size is incorporating with learning regularization.

Alternatively, another direction to pursuit portfolio management is exploring the road pioneered by the Markowitz portfolio theory (Markowitz 1952), see Box 2 in Fig. 2. By this theory, the return of an investment portfolio is the proportion-weighted combination of the constituent assets’ returns, while the portfolio volatility is a function of the correlations between the component assets. The portfolio expected return is maximized subject to a given amount of portfolio risk, or equivalently risk is minimized for a given level of expected return. Moreover, the Markowitz mean–variance scheme also leads to the suggestion of Sharpe ratio (Sharpe 1966, 1994), which is typically used to evaluate the performance of a portfolio.

In both the standard Markowitz mean–variance scheme and Sharpe ratio approach, a risk is defined as the return variance, which has been subsequently realized that the variance is not an appropriate measure because it counts the positive fluctuation above the expected returns (also called upside volatility) as a part of the risk. See Box 4 in Fig. 2; the downside risk thus becomes a topic to study. Markowitz (1959) counts the volatility below the expected returns only. Fishburn (1977) makes a mean-risk analysis with risk associated with below-target returns and proposes a more sophisticated measure of risk associated with below-target return, which has been further refined by Sortino and Meer (1991). Basically, this downside risk is the volatility of return below the minimal acceptable return (also called target return G).

$${\text{down}}V_{\gamma } (G) = \int\nolimits_{{ - \infty }}^{G} {(G - r)^{\gamma } {\text{d}}F{\text{(r)}}}$$

(25)

Moreover, the downside risk of a single asset has been extended into the following covariance (Hung et al. 2000, 2003):

$$\varvec{D} = \left[ {d_{i,j} } \right],d_{i,j} = \mathop \int \limits_{ - \infty }^{G} \mathop \int \limits_{ - \infty }^{G} (G - r_{i} )^{{\frac{\gamma }{2}}} (G - r_{j} )^{{\frac{\gamma }{2}}} p(r_{i} ,r_{j} )\,{\rm d}r_{i} {\rm d}r_{j} ,$$

(26)

for the returns $r_{j} ,\, j = 1, \ldots ,k$ of multiple assets. Also, we have the following matrix for the upside volatility:

$$\varvec{U} = \left[ {u_{i,j} } \right],\,\,u_{i,j} = \mathop \int \limits_{G}^{ + \infty } \mathop \int \limits_{G}^{ + \infty } (r_{i} - G)^{{\frac{\gamma }{2}}} (r_{j} - G)^{{\frac{\gamma }{2}}} p(r_{i} ,r_{j} )\,{\rm d}r_{i} {\rm d}r_{j} .$$

(27)

The sprit of the Markowitz theory and the Shape ratio, i.e., maximizing the expected returns while minimizing the risk, is reasonably modified into one extended Sharpe ratio featured by maximizing both the expected returns and the upside volatility while minimizing the downside risk; see Box 5 in Fig. 2. In Hung et al. (2000, 2003), this generalization is implemented by the following maximizaon:

$$\begin{aligned} \mathop {\text{Max}}\limits_{\varvec{w}} \left[ {\frac{{\varvec{w}^{\rm T} E\varvec{r} + H\varvec{w}^{\rm T} \varvec{Uw}}}{{\varvec{w}^{\rm T} \varvec{Dw}}} + B\varvec{w}^{\rm T} \left( {1 - \varvec{w}} \right)} \right],1 = \left[ {1, \ldots ,1} \right]^{\text{T}} , \hfill \\ \varvec{r} = \left[ {r_{1} , \ldots ,r_{k} ]^{\text{T}} , {\mathbf{w}} = } \right[w_{1} , \ldots ,w_{k} ]^{\rm T} ,\quad \mathop \sum \limits_{i = 1}^{k} w_{i} = 1,\quad w_{i} \ge 0. \hfill \\ \end{aligned}$$

(28)

As shown in Fig. 4, we use the parameters H, B to adapt the investor’s preference. The parameter H represents a strength of maximizing upside volatility and B represents a strength of diversification or regularization. The term $\varvec{w}^{\text{T}} \left( {1 - \varvec{w}} \right)$ is a diversification term that reaches its minimum when one w_i is 1 and others are 0, and its maximum when all the elements $\varvec{w}$ are equal.

It has been experimentally shown that this generalization of Sharpe ratio can effectively reduce the risk while obtaining great returns, in comparison with the standard Markowitz mean–variance scheme and Sharpe ratio. Moreover, investors expect a constant return with a minimum downward risk, for which we can simply set $\varvec{w}^{\text{T}} E\varvec{r} = r_{\text{spec}}$, while the others expect a maximum return under a constant downward risk, for which we can simply set $\varvec{w}^{\text{T}} \varvec{Dw} = v_{\text{spec}}$.

In Sect III(C) of Xu (2001), several developments have been proposed along this direction. First, a more practical scenario is considered, featured with a portfolio of risk securities with returns $r_{j,t} , \,j = 1, \ldots ,k$, a risk-free bond with return r^f and transaction cost with a rate r_c. That is, $r_{t} = \varvec{w}^{\text{T}} \varvec{r}$ is replaced by

$$\begin{aligned} r_{t} & = \left( {1 - \alpha_{0} } \right)r^{f} + \alpha_{0} \mathop \sum \limits_{j = 1}^{k} \left[ {w_{j,t} r_{j,t} - r_{c} \mathop \sum \limits_{j = 1}^{k} \left| {w_{j,t} - w_{j,t - 1} } \right|\left( {1 + r_{j,t} } \right)} \right] \\ & = \left( {1 - \alpha_{0} } \right)r^{f} + \alpha_{0} \left[ {\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{c} \delta \varvec{w}_{t}^{\rm T} \left( {1 + \varvec{r}_{t} } \right)} \right], \alpha_{0} > 0, {\mathbf{w}}_{t}^{\text{T}} 1 = 1, \\ \delta \varvec{w}_{t} & = \left[ {\left| {w_{1,t} - w_{1,t - 1} \left| {, \ldots ,} \right|w_{k,t} - w_{k,t - 1} } \right|} \right]^{\text{T}} , \\ \end{aligned}$$

(29)

where each w_j,t may be nonnegative as in Eq. (28). In this case, short of a risk security is not permitted but borrowing from the risk-free bond is allowed, i.e., we can have 1 − α₀ < 0. Also, we may allow a negative w_j,t, i.e., short of a risk security is permitted.

Second, instead of considering $E\varvec{w}^{\text{T}} \varvec{r} = \varvec{w}^{\text{T}} E\varvec{r}$ and $E\left[ {\varvec{w}^{\text{T}} \varvec{r} - E\varvec{w}^{\text{T}} \varvec{r}} \right]\left[ {\varvec{w}^{\text{T}} \varvec{r} - E\varvec{w}^{\text{T}} \varvec{r}} \right]^{\text{T}}$ for the expected return and its volatility, we compute their estimations directly from samples R_T = {r_t, t = 1, …, T} within a time window. Accordingly, it follows from Eq. (25) that we get the counterpart of Eq. (28) as follows:

$$\begin{aligned} Sp = \frac{{M\left( {R_{T} } \right)}}{{\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}} + \beta_{V} \frac{{\sqrt[\gamma ]{{V_{G}^{U} \left( {R_{T} } \right)}}}}{{\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}} + \beta_{\varvec{w}} D\left( \varvec{w} \right),\,M\left( {R_{T} } \right) = \frac{1}{T}\mathop \sum \limits_{t = 1}^{\rm T} r_{t} , \hfill \\ V_{G}^{D} \left( {R_{T} } \right) = \frac{1}{{\# \left( {r_{t} \le G} \right)}}\mathop \sum \limits_{{ r_{t} \le G}} (G - r_{t} )^{\gamma } , V_{G}^{U} \left( {R_{T} } \right) = \frac{1}{{\# (r_{t} > G)}}\mathop \sum \limits_{{ r_{t} > G}} (r_{t} - G)^{\gamma } , \hfill \\ \end{aligned}$$

(30)

where #S denotes the cardinality of the set S, and the parameter $\beta_{V} ,\beta_{\varvec{w}}$ are the counterparts of H, B in Eq. (28). Moreover, $D\left( \varvec{w} \right)$ is a diversification term that reaches its minimum when one w_i is 1 and the others are 0, and reaches its maximum when all the elements $\varvec{w}$ are equal. There could be several choices for $D\left( \varvec{w} \right)$. One example is $\varvec{w}^{\text{T}} \left( {1 - \varvec{w}} \right)$ in Eq. (28) or equivalently $- \varvec{w}^{\text{T}} \varvec{w}$. One other example is

$$D\left( \varvec{w} \right) = - \,\mathop \sum \limits_{j = 1}^{k} w_{j,t} \,\ln \,w_{j,t} , \mathop \sum \limits_{j = 1}^{k} w_{j,t} = 1, \quad w_{j,t} \ge 0.$$

(31)

Moreover, ${{M\left( {R_{T} } \right)} \mathord{\left/ {\vphantom {{M\left( {R_{T} } \right)} {\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}}} \right. \kern-0pt} {\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}}$ is a ratio which is also an improvement over $\varvec{w}^{\text{T}} E\varvec{r}/\varvec{w}^{\text{T}} \varvec{Dw}$ in Eq. (28), and actually $\varvec{w}^{\rm T} E\varvec{r}/\varvec{w}^{\rm T} \varvec{Dw}$ is not really a ratio. Third, instead of directly searching the parameters $\alpha_{0} ,\varvec{w}_{t}$, we may let

$$\begin{aligned} \alpha_{0} = \,& e^{{ - g\left( {\varvec{r}_{t} ,\psi } \right)}} , w_{j,t} = \frac{{e^{{f^{\left( j \right)} \left( {\varvec{r}_{t} ,\varphi } \right)}} }}{{\mathop \sum \nolimits_{i = 1}^{k} e^{{f^{\left( i \right)} \left( {\varvec{r}_{t} ,\varphi } \right)}} }}, \\ f\left( {\varvec{r}_{t} ,\varphi } \right) = & \,\left[ {f^{\left( j \right)} \left( {\varvec{r}_{t} ,\varphi } \right), \ldots ,f^{\left( j \right)} \left( {\varvec{r}_{t} ,\varphi } \right)} \right]^{\text{T}} , \\ \end{aligned}$$

(32)

with $g\left( {\varvec{r}_{t} ,\psi } \right), f\left( {\varvec{r}_{t} ,\varphi } \right)$ implemented by neural networks, e.g., an ENRBF network. In the next section, we will show that a portfolio of security returns $\varvec{r}_{t}$ may also be modeled by a temporal extension of arbitrage pricing theory such that $\varvec{r}_{t}$ is mapped into inner factor $\varvec{y}_{t}$ with a much lowered dimension. Instead of depending on the security returns $\varvec{r}_{t}$, we use $\varvec{y}_{t}$ to replace $\varvec{r}_{t}$ in Eq. (28) for a further improvement.

Following the extension proposed in Xu (2001), most of the above addressed extensions have been investigated together with detailed algorithm, experiments on real market data and comparative studies (Chiu and Xu 2002b, 2003, 2004b). Still, at the end of Sect III(C) in Xu (2001), there was one briefly introduced idea that has not been further investigated yet. Here, some further details are addressed.

In Eq. (30) and also in Eq. (28), as well as in the existing studies on the Markowitz portfolio optimization and the Sharpe ratio, the expected return and volatilities are nonparametric estimates directly from samples $R_{T} = \left\{ {\varvec{r}_{t} ,t = 1, \ldots ,T} \right\}.$ To capture a temporal dependence better, one idea is using an ARCH or GARCH model to describe a sequence {r_t, t = 1, …, T} of the portfolio return $r_{t} = \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} ;$ see Box in Fig. 2. It follows from Eq. (3) that we have

$$r_{t + 1} = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} r_{t + 1 - j} + \sigma_{t} \varepsilon_{t} , \varepsilon_{t} \sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(\varepsilon |0, 1) , {\text{and }}\sigma_{t} = \sigma_{t}^{2} \left( \vartheta \right)\, {\text{by Eq}} .\,\left( 3 \right).$$

(33)

Taking the expectation and separating the first term from the rest, as well as approximately considering $E\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} \approx a_{1} \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} ,$ we further get

$$\begin{aligned} Er_{t + 1} & = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} Er_{t + 1 - j} = a_{1} E\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} + E\hat{r}_{t - 1} \approx a_{1} \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} + E\hat{r}_{t - 1} , \\ \sigma_{t + 1}^{2} & = \sigma_{0}^{2} + \mathop \sum \limits_{i = 1}^{q} \beta_{i} \varepsilon_{t + 1 - i}^{2} + \mathop \sum \limits_{j = 1}^{p} \omega_{j} \sigma_{t + 1 - j}^{2} = \beta (\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{t}^{AR} )^{2} + \hat{\sigma }_{t}^{2} , r_{t}^{AR} = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} r_{t - j} , \\ E\hat{r}_{t - 1} & = a_{0} + \mathop \sum \limits_{j = 2}^{q} a_{j} Er_{t + 1 - j} , \hat{\sigma }_{t}^{2} = \sigma_{0}^{2} + \mathop \sum \limits_{i = 2}^{q} \beta_{i} \varepsilon_{t + 1 - i}^{2} + \mathop \sum \limits_{j = 1}^{p} \omega_{j} \sigma_{t + 1 - j}^{2} ,\\ \end{aligned}$$

(34)

from which we get the following GARCH-based Shape ratio

$$J\left( {\varvec{w}_{t} } \right) = \frac{{Er_{t + 1} }}{{\sigma_{t + 1} }} = \frac{{a_{1} \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} + E\hat{r}_{t - 1} }}{{\beta_{1} (\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{t}^{AR} )^{2} + \hat{\sigma }_{t}^{2} }}.$$

(35)

Given the GARCH model and the past $Er_{t - j} , r_{t - j} , \quad j = 1, \ldots ,k,$ we have $E\hat{r}_{t - 1} ,$ $\hat{\sigma }_{t}^{2} ,$ r ^AR_t , a₁, β₁ available. As $\varvec{r}_{t}$ is obtained, we compute the gradient of $J\left( {\varvec{w}_{t} } \right)$ and update

$$\varvec{w}_{t} = \varvec{w}_{t - 1} + \eta \nabla_{{\varvec{w}_{t} }} J\left( {\varvec{w}_{t} } \right), {\text{for }}\,{\text{a}}\, {\text{learning}}\, {\text{step }}\,{\text{size }}\,\eta > 0.$$

(36)

Then, we get $\varepsilon_{t}^{2} = (\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{t}^{AR} )^{2}$ and update $a_{i}^{\text{new}} = e^{{c_{1}^{\text{new}} }} , c_{i}^{\text{new}} = c_{i}^{\text{old}} - \eta \frac{{{\text{d}}\varepsilon_{t}^{2} }}{{{\text{d}}c_{i}^{\text{old}} }},\quad {\text{for }}i = 0,1,$ $a_{j}^{new} = a_{j}^{old} - \eta \frac{{d\varepsilon_{t}^{2} }}{{da_{j}^{old} }}, \quad {\text{for }}j = 2, \ldots ,q.$

Also, we update the parameters ϑ in the same way as one standard GARCH solving approach. Next, we use Eq. (36) for updating $\varvec{w}_{t + 1}$ again.

Market modeling: APT theory and temporal factor analysis

Arbitrage pricing theory and factor analysis’s incapability

Beyond only optimizing the outcome by investing a portfolio of multiple assets, the Markowitz mean–variance scheme also leads to the linear modeling of the market. The most famous one is the well-known capital asset pricing model (CAPM) (Sharpe 1964). However, the CAPM is criticized as being not sufficient to describe market behavior merely via one endogenous factor.

Under the name of arbitrage pricing theory (APT), Ross (1976) proposed the following linear model of multiple hidden or endogenous factors:

$$\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , A = \left[ {a_{ij} } \right], {\mathbf{r}}_{t} = [r_{1,t} , \ldots ,r_{k,t} ]^{\text{T}} , \,\varvec{f}_{t} = \left[ {f_{1,t} , \ldots ,f_{n,t} ]^{\text{T}} , \varvec{e}_{t} = } \right[e_{1,t} , \ldots ,e_{k,t} ]^{\text{T}} .$$

(37)

As illustrated in Fig. 5a, $\varvec{r}_{t}$ consists of the returns of k assets in this market, $\varvec{f}_{t}$ consists of m risky hidden factors that will affect the rate of returns on all assets by different degrees of sensitivity and a_ij is the sensitivity of the ith asset to factor j, also called factor loading, Moreover, each element of $\varvec{e}_{t}$ is the risky asset’s idiosyncratic random shock with mean zero, and each element of $\varvec{a}$ is a constant part of the corresponding risky asset.

Since its inception, the APT has attracted a considerable interest as a tool for interpreting investment results and controlling portfolio risk. However, the APT has been accepted by the investment community, but is not as popular as the CAPM. The reason largely relates to APT’s serious drawback, namely, its implementation is difficult due to the lack of specificity regarding the nature of the factors that systematically affect asset returns. As outlined in Sect. I of (Xu 2001), typically three types of approaches have been applied for the APT implementation.

Most of the studies are featured with $\varvec{f}_{t}$ given by the so-called fundamental factors, i.e., historic time series of a set of macroeconomic or fundamental indexes. With the hidden factors chosen, the problem becomes a typical multivariate linear regression problem: $\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}$. However, choosing these fundamental factors is not an easy task. Chen et al. (1986) chose five macroeconomic factors, including surprises in GDP, inflation, investor confidence, and yield curve. Also, others consider index or spot or future market price, e.g., short-term interest rate, a diversified stock index, oil price, gold or precious metal prices, and currency exchange rate in place of macroeconomic factors. With efforts over decades, little progress has been achieved on identifying the number and nature of these fundamental factors. Many researchers believe that this issue is essentially empirical in nature, because the factors change over time and between economies.

There have been also efforts under the name of the cross-sectional approaches that observes the correlations of all the assets of $\varvec{r}_{t}$ to each of the hidden factor in $\varvec{f}_{t}$ by a certain period, resulting in estimates of elements of A that reflect the assets’ sensitivities to these hidden factors. Then, the task is to estimate $\varvec{f}_{t}$ upon $\varvec{r}_{t}$ and A, which is typically handled as a linear cross-sectional regression and solved by the least square error method in the literature of economics and finance. In Sect. I of Xu (2001), it is formulated as an inverse mapping problem, a topic that has been widely studied in the neural network and machine learning literature.

Observation of an implementation of the least square error method actually shows that the residuals $\varvec{e}_{t}$ are uncorrelated among the elements and also with the factors $\varvec{f}_{t}$ and that each element of $\varvec{e}_{t}$ reflects a collective effect of many random noise, that is, we have $E\varvec{f}_{t} \varvec{e}_{t}^{\rm T} = 0$ and also $q(\varvec{r}_{t} |\varvec{f}_{t} )$ as shown by the top-down pathway on the right part of Fig. 5b. An inverse of the top-down path is a bottom-up path on the left part of Fig. 5b, for which the optimal solution is the following Bayesian inverse:

$$p(\varvec{f}_{t} |\varvec{r}_{t} ) = \frac{{G(\varvec{e}_{t} |\varvec{a} + A\varvec{f}_{t} ,\varSigma )q\left( {\varvec{f}_{t} } \right)}}{{\mathop \smallint \nolimits G(\varvec{e}_{t} |\varvec{a} + A\varvec{f}_{t} ,\varSigma )q\left( {\varvec{f}_{t} } \right)d\varvec{f}_{t} }}.$$

(38)

Here, we encounter a probabilistic structure $q\left( {\varvec{f}_{t} } \right)$ of hidden factors. Approximately, if only considering its statistics up to the second order, $q\left( {\varvec{f}_{t} } \right)$ is approximated by a Gaussian $G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)$ as shown in Fig. 5b. In such a case, we have the following analytical solution:

$$\hat{\varvec{f}}_{t} = \mathop \smallint \nolimits \varvec{f}_{t} p(\varvec{f}_{t} |\varvec{r}_{t} )\,{\text{d}}\varvec{f}_{t} = \left( {A^{\text{T}} \varSigma^{ - 1} A + \varLambda^{ - 1} } \right)\left[ {A^{\text{T}} \varSigma^{ - 1} \left( {\varvec{r}_{t} - \varvec{a}} \right) + \varLambda^{ - 1} \nu } \right],$$

(39)

which returns to a least square error solution when there is no information about $q\left( {\varvec{f}_{t} } \right)$ for which we may simply set Λ = 0, ν = 0.

Similar to the first approach, the second approach is also essentially empirical in nature, which needs not only a manual help to identify the number and nature of hidden factors, but also at least an enough long period of historic data about factors for estimating of elements of A. Moreover, getting elements of A by the correlations between $\varvec{f}_{t}$ upon $\varvec{r}_{t}$ actually imposes additional constraints on the values that A may take. The second approach is supplementary to the first approach, but it still cannot get rid of the nature that the factors are chosen heuristically and even rather arbitrarily. We may regard that the second approach actually consists of two steps. First, estimation of elements of A bases on a period historic data of macroeconomic or fundamental indexes takes the same role of the first approach or even just an implementation of the first approach. Second, we estimate $\varvec{f}_{t}$ upon $\varvec{r}_{t}$ and A, e.g., typically by Eq. (39).

The third type of efforts are called factor-analytic approach, attempting to use a statistical approach called factor analysis (FA) to get both the unknown and the unknown factors estimated from the observed return series $\left\{ {\varvec{r}_{t} } \right\}$. There is no need of external heuristics, and thus it seems more appealing. As shown in Fig. 5b, an FA model comes from modifying Fig. 5a with an additional structure that $\varvec{f}_{t}$ comes from a Gaussian $G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)$ with a diagonal Λ or even $\varLambda = I$. Unfortunately, empirical tests showed that factor analysis does not explain economic variables well. As addressed in Sect. I of Xu (2001), some incapability of factor analysis mainly comes from two kinds of intrinsic indeterminacy. One is the rotation indeterminacy, i.e.,

$${\text{if}}\,A,\varvec{f}_{t} \;{\text{is a solution}},\,A\varphi^{\text{T}} ,\varphi \varvec{f}_{t} \,{\text{is also a solution for any rotation matrix}}\,\varphi ,$$

(40)

while such a rotation may lead to a solution far from the correct one. The other comes from an intrinsic indeterminacy of an appropriate number of factors, while the selection of a correct number of factors is essential to the performance of using the APT model. Usually, it is set by a rule of thumb. Actually, factor analysis also suffers other types of indeterminacy. One is any rescaling $D\varvec{f}_{t}$ of a solution $\varvec{f}_{t}$ is still a solution for a diagonal matrix D, which is not critical because it reserves the waveform of each element in $\varvec{f}_{t}$. The other is additive indeterminacy, i.e., A, Λ, Σ and A^*, Λ^*, Σ^*are both the solutions as long as AΛA^T + Σ = A^*Λ^*A^*T + Σ^*. However, the effect of this indeterminacy can be reduced significantly when Σ = σ²I. Therefore, our attention should be mainly on the first two key challenges, namely, removing the rotation indeterminacy by Eq. (40) and determining an appropriate number of factors.

The first challenge has been seldom considered by the APT studies in the fields of economics and finance, while there are some efforts on the second challenge, i.e., determining an appropriate number of factors with the help of statistical testing. The simplest one is making maximum likelihood factor analysis (MLFA) followed by the likelihood ratio (LR) test, shortly MLFA-LR. Empirical evidences show that the minimum number of factors accepted by the LR test tends to increase with the number of securities. Alternatively, Chamberlain and Rothschild (1983) suggest analyzing eigenvalues of the population covariance matrix, shortly eigenvalue approach. Still, Brown (1989) empirically found that this approach biases toward too few factors and the result consistent with one factor may be equally consistent with multiple equally weighted factors.

On one hand, being essentially empirical in nature, both the fundamental factor-based approaches and the cross-sectional approaches rely on pre-knowledge or external beliefs to choose the factors heuristically, in lack of consensus and consistency over what should be the real factors in APT. On the other hand, the implementation of factor analysis suffers the rotation indeterminacy by Eq. (40) and the difficulty of determining an appropriate number of factors. These problems incur for criticisms on the APT theory, e.g., see Dhrymes et al. (1984); Abeysekera and Mahajan (1987).

Instead of doubting the incorrectness of the APT theory, our understanding is that the APT theory is correct but incomplete. The APT suggests to model a market at no arbitrage equilibrium by a linear model, which is justifiable. However, this theory is incomplete because this linear model cannot be uniquely or even reasonably specified merely from the observed return series $\left\{ {\varvec{r}_{t} } \right\}$. To complete the theory, further specification should be imposed on the components of this model. The fundamental factor-based approaches fix the hidden factors by heuristically and empirically picking a set of macroeconomic or fundamental indexes, which removes the indeterminacy but leaves the difficult questions on how to choose these factors and whether the factors should come directly from macroeconomic or fundamental indexes. The cross-sectional approaches aim at estimating $A$, which leaves the difficult question on how A can be estimated correctly. To get A by the assets’ sensitivities to these hidden factors, we still need to heuristically and empirically pick a set of macroeconomic or fundamental indexes, Finally, the FA model is also unable to remove the incompleteness of the APT, because imposing an additional Gaussian $G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)$ is still not enough to remove the critical indeterminacy by Eq. (40). In a summary, the original APT (Ross 1976) is reasonable but incomplete, and further efforts should explore how to add certain structure to remove or remedy the incompleteness.

Temporal factor analysis and temporal APT

The famous CAPM model is featured by one factor that is not a manually chosen exogenous macroeconomic or fundamental index but an invisible and intrinsic market indicator. The APT was motivated by following the basic sprit of CAPM to answer the criticism that merely one factor is not enough to describe the market behavior. However, implementing APT by manually picking macroeconomic or fundamental indices actually deviates from the original motivation. Encouragingly, the direction of FA implementation is still consistent with the original motivation that seeks intrinsic factors, and thus we further proceed along this direction. Keeping Eq. (37), we extend the Gaussian structure $G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)$ into a better structure such that the indeterminacy by Eq. (40) or the incompleteness of the FA model can be removed or at least remedied.

Temporal factor analysis (TFA) is such a further development of FA; see Box 1 in Fig. 3. The early study was started in 1997, firstly introduced briefly by Xu (1997) and further addressed in Xu (2000) (this manuscript actually reached the editorial office also in 1997). See Box 2 in Fig. 3: the key idea is modifying Eq. (37) as follows:

$$\begin{aligned} \varvec{r}_{t} & = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , A = \left[ {a_{ij} } \right], {\mathbf{r}}_{t} = [r_{1,t} , \ldots ,r_{k,t} ]^{\text{T}} , \\ \varvec{f}_{t} & = \left[ {f_{1,t} , \ldots ,f_{n,t} ]^{\text{T}} , \varvec{e}_{t} = } \right[e_{1,t} , \ldots ,e_{k,t} ]^{\text{T}} , {\text{E }}{\mathbf{f}}_{t} \varvec{e}_{t}^{\text{T}} = 0, \\ \varvec{f}_{t} & = B\varvec{f}_{t - 1} + \varepsilon_{t} , B = {\text{diag}}\left[ {b_{1} , \ldots ,b_{m} } \right] \ne bI\,{\text{with }}\,b \ne 0,\\ & \varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,\varLambda } \right.} \right) \;{\text{with}}\;{\text{a}}\;{\text{diagonal }}\varLambda, {\text{E }}{\mathbf{f}}_{t - 1} \varepsilon_{t}^{\text{T}} = 0. \\ \end{aligned}$$

(41)

That is, the first-order autoregressive dependence is added to each factor in $\varvec{f}_{t}$ via B, and Eq. (41) returns to FA by Eq. (37) when B = 0.

It is this temporal dependence that removes the rotation indeterminacy by Eq. (40); see Sect IV (A) in Xu (2000) and Sect. II in Xu (2002). Roughly, the following points may be understood:

For any diagonal matrix D, we have $A\varvec{f} = \tilde{A}\tilde{\varvec{f}},\tilde{A} = AD,\tilde{\varvec{f}} = D^{ - 1} \varvec{f},$ which keeps the format $\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}$ unchanged and also the elements of $\tilde{\varvec{f}}$ remain mutually independent. i.e., Equation (37) has an indeterminacy of unknown scaling on factors of $\tilde{\varvec{f}}$. Thus, we may simply consider $\varvec{f}_{t} \sim G\left( {\varvec{f}_{t} \left| {0,I} \right.} \right)$. For any rotation matrix φ with $\varphi^{\text{T}} \varphi = I$, we have $A\varvec{f} = \tilde{A}\tilde{\varvec{f}},$ and $\tilde{A} = A\varphi^{\text{T}} ,\tilde{\varvec{f}} = \varphi \varvec{f}$ with $\tilde{f}_{t} \sim G\left( {\tilde{f}_{t} \left| {0,I} \right.} \right)$. That is, Eq. (37) has also an indeterminacy of unknown rotation on factors $\tilde{\varvec{f}}$.
For any diagonal matrix D, we also have $D^{ - 1} \varvec{f}_{t} = D^{ - 1} BDD^{ - 1} \varvec{f}_{t - 1} + D^{ - 1} \varepsilon_{t}$ and $\tilde{\varvec{f}}_{t} = B\tilde{\varvec{f}}_{t - 1} + \tilde{\varepsilon }_{t} ,$, where $\tilde{\varepsilon }_{t} = D^{ - 1} \varepsilon_{t}$ comes from $G\left( {\tilde{\varepsilon }_{t} \left| {0,D^{ - 1} \varLambda D^{ - 1} } \right.} \right)$ and $D^{ - 1} \varLambda D^{ - 1}$ is still diagonal. That is, Eq. (41) still has an indeterminacy of unknown scaling on factors $\tilde{\varvec{f}}$. Again, we may consider $\varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,I } \right.} \right).$ For any rotation matrix φ with φ^Tφ = I, we have $\tilde{\varvec{f}}_{t} = \tilde{B}\tilde{\varvec{f}}_{t - 1} + \tilde{\varepsilon }_{t}$ with $\tilde{\varepsilon }_{t} \sim G\left( {\tilde{\varepsilon }_{t} \left| {0,I } \right.} \right)$, while $\tilde{B} = \varphi B\varphi^{\text{T}}$ is no longer diagonal and even B is diagonal. If $\tilde{B} = \varphi B\varphi^{\text{T}}$ is required to be diagonal, the only rotation matrix is φ = I and thus the rotation indeterminacy is removed.

Still there is an indeterminacy of unknown scaling on factors of $\tilde{\varvec{f}}$, but it will not change the waveform of f_1,t, …, f_n,t. Also, we may normalize each factor to remove such indeterminacy.

In Xu (2001), the TFA by Eq. (41) is thus suggested as a refinement of the original APT theory, by which the original part of APT is kept without modification, while a temporal structure $\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}$ is added such that the incompleteness caused by the rotation indeterminacy has been removed. Such a refinement may be called temporal APT in a sense that temporal relation is taken into consideration of market modeling. That is, a static equation by Eq. (37) is not enough to describe a market equilibrium, but a temporal structure should be an important ingredient of a market equilibrium.

Why is an AR model of merely order one $\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}$ considered as this temporal structure? First, we consider that hidden factors $\varvec{f}_{t}$ are driven by Gaussian noise $\varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,\varLambda } \right.} \right),$ following a general consensus that the noisy component in most econometric and statistical models is Gaussian distributed. The rationale comes from the central limit theorem which implies that the compounding of a large number of unknown distributions will be approximately normal. Second, the first-order AR model can be attributed to the weak form of efficient market hypothesis (EMH), that is, stock price today is conditionally independent of all previous prices given the price of yesterday. Third, though observable economic indices are seldom independent, it cannot rule out that hidden factors that denominate a market equilibrium are mutually independent. Instead, independent factors may help to make market equilibrium simpler.

As addressed in the previous subsection, past efforts on determining an appropriate number of factors have not provided much support on the APT. For one example, the MLFA-LR test shows that the number of factors tends to increase with the number of securities. For another example, the identification via eigenvalue approach (Chamberlain and Rothschild 1983) biases toward a smaller factor number. In one IJCNN 02 paper (Chiu and Xu 2002a), empirical tests on Hong Kong stock market data show not only that these two unfavorable biases are again observed, but also that the TFA-based APT can provide a reasonable answer to the number of factors in the Hong Kong stock market. As shown in Fig. 6, the number of factors identified by MLFA-LR test varies as the numbers of securities, while the number of factors identified by the eigenvalue approach is always 1. In contrast, BYY harmony learning based TFA stably identifies four or five factors regardless of the numbers of securities, which is quite consistent with the number identified via heuristic empirical analysis, e.g., in Chen et al. (1986).

The above introduced nature of TFA and preliminary studies suggest that there may need a renewed interest in the literature of finance and economics to further investigate APT and its further developments. To consider which topics to pursue, it is helpful to observe the differences of TFA from related methods.

First, $\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}$ in Eq. (41) is actually a special type of the first-order vector AR (VAR). Being different from the conventional VAR that are used for capturing linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987), the TFA captures the interdependencies among multiple time series by $\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}$ and temporal dependences by $\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}$. As addressed in Sect. 3.2.1 in Xu (2012), it is more efficient to separately treat these two types of dependences.

Second, if we do not constrain B,Λ to be diagonal, Eq. (41) becomes a general state–space model (SSM) or a linear dynamical system (LDS), which has been widely studied in the literature of control theory and signal processing. As outlined in Sect. 5.2.1 of Xu (2012), in a period that is more or less the same as the studies on TFA (Xu 1997; 2000), there was a renewed interest on a general LDS, featured by using the EM algorithm for parameter estimation under the ML learning (Ghahramani and Hinton 2000). Accordingly, this EM algorithm was originally derived in the early 1980s and re-introduced in the early 1990s (Shumway and Stoffer 1991). Neither these studies suggest using the LDS as a further development of APT, nor the notorious rotation indeterminacy in Eq. (40) has been taken into consideration. On the contrary, more problems of indeterminacy than the FA are actually incurred in this general LDS model due to many extra free parameters, which makes identifiability even worse. For an example, applied to radar automatic target recognition based on high-resolution range profile, it has been shown in Wang et al. (2011) that the recognition performance of the general LDS is actually even inferior to that of the FA, while TFA obtains better performances than the FA.

Third, many efforts have been made on determining the factor number of FA in the literature of statistics and machine learning, typically in a two-stage implementation. The first stage uses the EM algorithm to make the ML learning for unknown parameters in the FA while the second stage selects an appropriate number of factors with help of a model selection criterion. In Tu and Xu (2011), a systematic comparative investigation has been made on a number of typical model selection criteria, including not only Akaike’s AIC, Schwarz’s BIC, Bozdogan’s CAIC, Hannan–Quinn criterion, but also recent Minka’s PCA criterion, Kritchman and Nadler’s tests, and Perry and Wolfe’s rank, as well as the criterion obtained from the BYY harmony learning theory (Xu 2001).

As discussed above, there is not really a need to further consider the relations to VAR and LDS. Instead, further explorations may start from continuing the study in the IJCNN02 paper (Chiu and Xu 2002b) and proceed to clarify the following issues:

Does using one of the above model selection criteria in a two-stage implementation improve the number of FA factors identified by the MLFA-LR test and the eigenvalue approach? If yes, does this improvement help the FA-based implementation of APT, even still suffering the rotation indeterminacy by Eq. (40).
Still using one of the above model selection criteria in a two-stage implementation, how much improvement TFA can be obtained after removing the rotation indeterminacy by $\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}$?

Additionally, studies may be made on data from other major international markets, with those past empirical analyses (e.g., Chen et al. 1986; Azeez and Yonezawa 2006) as references. In addition to a two-stage implementation, one promising feature of implementing the TFA by the BYY harmony learning (Xu 2001) is that the number of temporal factors is determined automatically during learning, which saves computational costs greatly and also improves the learning performance of TFA, for which details are referred to Sect. 5 of Xu (2010) and Sect. 5.2 of Xu (2012).

Macroeconomics-modulated TFA-APT and nGCH-driven M-TFA-O

In those empirical APT studies, the practice that uses macroeconomic indexes as $\varvec{f}_{t}$ leads to an understanding that $\varvec{f}_{t}$ typically consists of a set of macroeconomic or fundamental indexes. In an FA implementation or a TFA implementation by Eq. (41), such an understanding may not be correct. Actually, $\varvec{f}_{t}$ may vary much slower than the return $\varvec{r}_{t}$ and thus be regarded as a macroeconomic type of indices. However, $\varvec{f}_{t}$ may also vary in a timescale similar to the changes of $\varvec{r}_{t}$. Moreover, $\varvec{f}_{t}$ in Eq. (41) is intrinsically determined from real data $\varvec{r}_{t}$ and usually will not coincide with exogenous macroeconomic indexes, such as GDP, inflation, investor confidence, and yield curve. Therefore, we need to further investigate how the market is influenced by these exogenous variables or macroeconomic indexes.

Being quite different from many existing studies that explicitly model the relation between market return $\varvec{r}_{t}$ and macroeconomic indices, the influences of these indices to $\varvec{r}_{t}$ are considered via their roles in modulating the temporal factors in $\varvec{f}_{t} ,$ as shown in Fig. 3 by Box 3. This idea is realized via extending Eq. (41) into the following macroeconomics-modulated TFA–APT:

$$\begin{aligned} \varvec{r}_{t} & = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , {\text{E }}{\mathbf{f}}_{t} \varvec{e}_{t}^{\rm T} = 0, \\ \varvec{f}_{t} & = B\varvec{f}_{t - 1} + H\varvec{m}_{t} + \varepsilon_{t} , {\text{E }}{\mathbf{f}}_{t - 1} \varepsilon_{t}^{\rm T} = 0, {\text{E }}{\mathbf{m}}_{t - 1} \varepsilon_{t}^{\rm T} = 0, \\ \varvec{m}_{t} & = C\varvec{v}_{t} + \eta_{t} , {\text{E }}\varvec{v}_{t}\, \eta_{t}^{\rm T} = 0, \\ \end{aligned}$$

(42)

where $\varvec{e}_{t} ,$ ɛ_t, and η_t are Gaussian white noises and independent of each other. Typically, $\varvec{m}_{t}$ consists of several macroeconomic indices, and $\varvec{\nu}_{t}$ consists of several known non-market factors that affect the macroeconomy. Specifically, $H\varvec{m}_{t}$ describes the effect of the macroeconomic indices to the security market via the hidden factors $\varvec{f}_{t}$. Actually, Eq. (42) comes from a simplification of one proposed in Sect. III(C) of (Xu 2001) and its Eq. (101), in particular, under the name of macroeconomics-modulated independent state–space model.

In one CIFEr2003 conference paper (Chiu and Xu 2003), empirical investigation is made on the model by Eq. (42). First, white noise tests are made on $\varvec{e}_{t} ,$ ɛ_t, and η_t to ensure model specification adequacy. Second, the performances in return prediction and index forecasting are compared with that of the TFA model. Empirical results reveal that the model is not only well specified, but also superior to the TFA model in stock price and index forecasting.

See Box 4 in Fig. 3, there are two ways to perform prediction based on Eq. (41) and Eq. (42). The first way is intrinsically to get $\varvec{r}_{t - 1} \to \varvec{f}_{t - 1}$ and predict $\hat{\varvec{r}}_{t} = \varvec{a} + AB\varvec{f}_{t - 1}$ for Eq. (41) and $\hat{\varvec{r}}_{t} = \varvec{a} + A\left( {B\varvec{f}_{t - 1} + H\varvec{m}_{t} } \right)$ for Eq. (42), while the second way is considering a given prediction $\varvec{r}_{t - 1} \to \varvec{y}_{t}$ via $\varvec{r}_{t - 1} \to \varvec{f}_{t - 1}$, $B\varvec{f}_{t - 1} \to \varvec{f}_{t}$ and then $\varvec{f}_{t} \to \varvec{y}_{t}$ by learning either linear or nonlinear regression, where y_t could be either $\varvec{r}_{t}$ or any type of market indices. In one paper (Chiu and Xu 2002), $\varvec{f}_{t} \to \varvec{y}_{t}$ is implemented by the normalized radial basis function (NRBF) and extended NRBF (ENRBF) (Xu 1998, 2009) and predicts the stock price or return $\varvec{r}_{t}$. Empirical studies on Hong Kong market data have shown the superiority of this prediction over not only a conventional prediction $\varvec{f}_{t} \to \varvec{y}_{t}$, but also the prediction $\hat{\varvec{r}}_{t} = \varvec{a} + AB\varvec{f}_{t - 1}$.

Based on Eqs. (41) and (42), in addition to making a prediction featured with learning a regression $\varvec{f}_{t} \to \varvec{y}_{t}$, we may also use $\varvec{f}_{t}$ to replace $\varvec{r}_{t}$ in the previous Eq. (29) for adaptive portfolio management; see Box 5 in Fig. 3. This APT based portfolio management was firstly suggested in Sect. III(c) and especially by Eqs. (96) and (97) in Xu (2001). Extensive simulation results reveal that this $\varvec{f}_{t}$-based portfolio management generally excels the return $\varvec{r}_{t}$ based portfolio management by Eq. (29) (Chiu and Xu 2004b).

In general, a parametric $\varvec{y}_{t} = g\left( {\varvec{f}_{t} ,\theta } \right)$ can be added to Eq. (41) to provide the outputs of this model for application purposes for such prediction and portfolio management. Moreover, beyond the consideration of Gaussian white noises as the driven noise ɛ_t, we may consider a non-Gaussian driven noise ɛ_t or a driven noise ɛ_t with a conditional heteroskedasticity. In summary, we further generalize Eq. (42) into the following model

(a)
$${\mathbf{r}}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , {\text{E }}{\mathbf{f}}_{t} \varvec{e}_{t}^{\text{T}} = 0,$$
${\mathbf{e}}_{t} \sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(\varvec{e}_{t} |0, \varSigma_{e} )$ with a diagonal covariance $\varSigma _{e}$
(b)
${\mathbf{y}}_{t} = g\left( {\varvec{f}_{t} ,\theta } \right);$
(c)
$$\begin{aligned} {\mathbf{f}}_{t} &= B\varvec{f}_{t - 1} + H\varvec{m}_{t} + {\text{diag}}\left[ {\sigma_{t}^{\left( 1 \right)} , \ldots ,\sigma_{t}^{\left( m \right)} } \right]\varepsilon_{t}, \ q\left( {\varepsilon_{t} } \right) = \mathop \prod \limits_{j} q\left( {\varepsilon_{t}^{\left( j \right)} } \right), \\ \varepsilon_{t} &= [\varepsilon_{t}^{\left( 1 \right)} , \ldots ,\varepsilon_{t}^{\left( m \right)} ]^{\text{T}}_{,}\ \ {\text{E }}{\mathbf{f}}_{t - 1} \varepsilon_{t}^{\text{T}} = 0,\ {\text{E}}{\mathbf{m}}_{t} \varepsilon_{t}^{\text{T}} = 0,\ {\text{E}}\varepsilon_{t}^{\left( j \right)} = 0, {\text{E}}\varepsilon_{t}^{\left( j \right) 2} = 1, \\ q\left( {\varepsilon_{t}^{\left( j \right)} } \right) &= \left\{ {\begin{array}{ll} G(\varepsilon_{t}^{\left( j \right)} |0, 1), \qquad \qquad \qquad \qquad \quad \left( {\text{i}} \right)\, {\text{one}}\;{\text{Gaussian,}} \\ \mathop \sum \limits_{i} \alpha_{i}^{\left( j \right)} G(\varepsilon_{t}^{\left( j \right)} |\mu_{i}^{\left( j \right)} , \lambda_{i}^{\left( j \right)} ),\qquad \quad \qquad \;\left( {\text{ii}} \right) \,{\text{Gaussian }}\,{\text{mixture}}; \\ \end{array} } \right. \\ \sigma_{t}^{\left( j \right)} &= \left\{ {\begin{array}{ll} {\rm a} \ {\text{constant}}\, \sigma_{{}}^{\left( j \right)} , &\quad \left( {\text{a}} \right) \ {\text{nonheteroskedasticity}}, \\ \sigma_{t}^{\left( j \right)} \left( {\vartheta_{{}}^{\left( j \right)} } \right) {\text{given }}\;{\text{by }}\;{\text{Eq}}.\,\left( 3 \right), &\quad \left( {\text{b}} \right) \ {\text{heteroskedasticity}}; \\ \end{array} } \right. \end{aligned}$$
(d)
$$\begin{aligned}{\mathbf{m}}_{t} &= C\varvec{\nu}_{t} + \eta_{t} , {\text{E }}{\varvec{\upnu}}_{t} \eta_{t}^{\text{T}} = 0, \\ \eta_{t} &\sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(\eta_{t} |0, \varSigma_{\eta } ) {\text{with }}\;{\text{a}}\; {\text{digognal}}\; {\text{covariance }} \varSigma_{\eta } . \\ \end{aligned}$$
(43)

Its basic part consists of ingredients (a)(b)(c). In the special case H = 0, its function is TFA with two extensions. One is outputting y_t, thus shortly denoted by TFA-O. The other is that ingredient (c) drives f_t by its last term that is either or both of non-Gaussian (nG) and conditional heteroscedasticity (CH), for which we use nGCH-driven TFA-O to refer this formulation. When H ≠ 0, f_t is also modulated by the macroeconomic market force $\varvec{m}_{t}$, it leads to the general formulation shortly named nGCH-driven M-TFA-O.

The central role is taken by the statistical nature of ingredient (c), with several scenarios as follows:

For the case that $B = 0, H = 0$ and q(ɛ ^{(
j)}_t ) in Choice (i) as well as σ ^{(
j)}_t in Choice (a), ingredient (a) and ingredient (c) jointly degenerate back to the FA-based implementation of the original APT by Eq. (37).
For the case that B = 0, $\varepsilon_{t} = 0$, it follows from $\tilde{A} = AH$ that ingredient (a) and ingredient (c) jointly degenerate back to the fundamental factors based implementation of the original APT by Eq. (37).
For the case that B = 0, q(ɛ ^{(
j)}_t ) in Choice (i), and σ ^{(
j)}_t in Choice (a), ingredient (a) and ingredient (c) jointly act as a combination of the above two implementations.
For the case that H = 0, q(ɛ ^{(
j)}_t ) in Choice (i), and σ ^{(
j)}_t in Choice (a), as well as B = diag[b₁, …, b_m]^T, ingredient (a) and ingredient (c) jointly become the TFA-based implementation by Eq. (41). It further becomes Eq. (42) when H ≠ 0. Moreover, conditional heteroskedasticity is further considered in $\varepsilon_{t}$ via Choice (i) of σ ^{(
j)}_t to be replaced by Choice (b). As shown by empirical investigation in the CIEF’2003 conference paper (Chiu and Xu 2003), we consider that the conditional heteroskedasticity in the TFA-based implementation is considerably better than the TFA-based implementation without such a consideration.

Another alternative is that Choice (i) of a Gaussian q(ɛ ^{(
j)}_t ) is replaced by Choice (ii) of a non-Gaussian q(ɛ ^{(
j)}_t ). In the simplest case, B = 0, H = 0, and σ ^{(
j)}_t in Choice (a), ingredient (a) and ingredient (c) jointly degenerate back to the non-Gaussian FA (NFA) as outlined in Fig. 3 by Box 6, for which details are referred to Sect. III(A) in Xu (2001), Sect. IV in Xu (2004), and Sect. 3.2 in Xu (2010). Accordingly, we get a Non-Gaussian APT as shown in Fig. 3 by Box 7. Interestingly, NFA can also remove the FA’s rotation indeterminacy by Eq. (40), though there is no temporal structure $\varvec{f}_{t}$ in consideration because B = 0, H = 0. Similar to Fig. 6, shown in Fig. 7 are the results of empirical investigation made on determining the appropriate factor number of APT by NFA (Chiu and Xu 2004a), still in comparison with the results of the MLFA-LR test and the eigenvalue approach as listed in Fig. 7a. Again, the BYY harmony learning-based NFA stably identified four or five factors regardless of the numbers of securities.

This alternative provides a different perspective on how to remove the indeterminacy by Eq. (40) or the incompleteness of APT. Without the additional equation about $\varvec{f}_{t}$, the formulation of NFA implementation seems closer than the TFA implementation to the original APT formulation by Eq. (37). Naturally, there rises a question on which one is right, TFA or NFA? Actually, they are two aspects of one market model. TFA observes a dynamic market process while NFA describes the market with all the time points projected to one observation spot such that a Gaussian process is projected to be observed as a mixture of Gaussian distributions. Generally, we may have two natures to be considered in the same market, that is, considering both B = diag[b₁, …, b_m]^T and the choice (ii) of a non-Gaussian q(ɛ ^{(
j)}_t ). Even generally, the conditional heteroskedasticity may also be added in via letting $\sigma_{t}^{\left( j \right)}$ in the choice (b). Systematically integrating all the parts and all the ingredients together, Eq. (43) may serve as a general formulation for financial market modeling.

Bayesian Ying–Yang harmony learning and two exemplar learning algorithms

Bayesian Ying–Yang (BYY) harmony learning

The Bayesian Ying–Yang (BYY) harmony learning was proposed in Xu (1995a, b) and subsequently developed systematically (Xu 2001, 2007, 2010, 2012), which provides not only a framework that accommodates typical learning approaches from a unified perspective, but also a new road that leads to improved model selection criteria, Ying–Yang alternative learning with automatic model selection, as well as coordinated implementation of Ying-based model selection and Yang-based learning regularization.

From a modern science perspective that regards the famous ancient Yin–Yang philosophy as a meta theory of system sciences and intelligent systems, a system that survives and interacts with its world can be regarded as a Ying–Yang system that functionally composes of two complement parts. One is called Ying, from its inside into its external world, by which a set $\varvec{X}_{N} = \{ x_{t} \}_{t = 1}^{N}$ of samples are regarded as generated from its representation $\varvec{R}$, while the other is called Yang, from an external world into its inside. A two directional view is considered via the joint distribution of $\varvec{X},\varvec{R}$ in two types of Bayesian decomposition. The decomposition of $p\left( {\varvec{X},\varvec{R}} \right)$ coincides the Yang concept with a visible domain $p\left( \varvec{X} \right)$ for a Yang space and a $\varvec{X} \to \varvec{R}$ pathway by $p(\varvec{R}|\varvec{X})$ as a Yang pathway. Thus, $p\left( {\varvec{X},\varvec{R}} \right)$ is called Yang machine. Also, $q\left( {\varvec{X},\varvec{R}} \right)$ is called Ying machine with an invisible domain $q\left( \varvec{R} \right)$ for a Ying space and a $\varvec{R} \to \varvec{X}$ pathway by $q(\varvec{X}|\varvec{R})$ as a Ying pathway. Such a Ying–Yang pair is called Bayesian Ying–Yang (BYY) system. Ying–Yang pair interact with each other under the principle of best harmony, which is mathematically implemented by maximizing

$$H(p||q) = \mathop \int \limits^{{}} p(\varvec{R}|\varvec{X})p\left( \varvec{X} \right){ \ln }\left[ {q\left( {\varvec{X} |\varvec{R}} \right)q\left( \varvec{R} \right)} \right]\,{\text{d}}\varvec{X}{\text{d}}\varvec{R}\varvec{.}$$

(44)

For a machine learning or modeling purpose, we first need to consider a mathematical representation for $\varvec{R}$. The first column of Table lists several typical examples. Usually, $\varvec{R}$ consists of two parts. One is a long-term memory θ that consists of all unknown parameters in the system for collectively representing the underlying structure of $\varvec{X}_{N}$, while the other is a short-term memory YL with each element being either or both of a categorical label ℓ ∊ L and a vector y ∊ Y as the corresponding inner representation of one element x ∊ X. For examples, we have a vector y for describing $\varvec{f}_{t}$ in the APT model by Eq. (37), while we simply have a label ℓ in the time series model by Eq. (4).

The probabilistic structure q(Y, L) is considered jointly with $q(\varvec{X}|\varvec{R}) = q(\varvec{X}|Y,L,\theta )$, depending on both the tasks in consideration and a trade-off between the complexity of q(Y, L) and the complexity of $q(\varvec{X}|Y,L,\theta )$. For the task of TFA modeling by Eq. (41), we have $q(\varvec{X}|Y,L,\theta )$ by $q(\varvec{r}_{t} |\varvec{f}_{t} )$ and q(Y, L) by $q\left( {\varvec{f}_{t} \left| {\varvec{f}_{t - 1} } \right.} \right)$ as follows:

$$\begin{aligned} q(\varvec{r}_{t} |\varvec{f}_{t} ) & = G(\varvec{r}_{t} |\varvec{a} + A\varvec{f}_{t} , \varSigma ) \quad {\text{with}}\,{\text{a }}\,{\text{diagonal }}\,\varSigma , \\ q\left( {\varvec{f}_{t} \left| {\varvec{f}_{t - 1} } \right.} \right) & = G\left( {\varvec{f}_{t} \left| {B\varvec{f}_{t - 1} ,\varLambda } \right.} \right) \quad {\text{with }}\,{\text{a }}\,{\text{diagonal}}\,\varLambda . \\ \end{aligned}$$

(45)

Moreover, the remaining part in q(R) = q(Y, L|θ)q(θ) is usually called a priori q(θ) that is chosen depending on the types of parameters and their positions in the Ying machine. In general, a Ying machine q(X, R) = q(X|R)q(R) is designed according to a least complexity principle, featured with designing q(R) = q(Y, L|θ)q(θ) in a least redundancy principle and designing $q(\varvec{X}|\varvec{R}) = q(\varvec{X}|Y,L,\theta )$ in a divide–conquer principle.

For the Yang machine p(X, R) = p(R|X)p(X), p(X) directly comes from samples $\varvec{X}_{N}$, while p(R|X) is designed based on the Ying machine q(X, R) = q(X|R)q(R) according to the variety preservation principle, that is

$$\begin{aligned} p\left( {R |X} \right) & = q\left( {R |X} \right)\ \ {\text{in}}\, {\text{a}}\, {\text{strong }}\,{\text{sense }} \\ \quad \quad \quad \quad \quad {\text{or}} \\ {\text{Cov}}_{R|X} \ of \ p(R|X) & = {\text{Cov}}_{R|X} \ of \ q(R|X) \ \ {\text{in }}\,{\text{a }}\,{\text{week}}\, {\text{sense}} .\\ q\left( {R|X} \right) & = {{q\left( {X|R} \right)q\left( R \right)} \mathord{\left/ {\vphantom {{q\left( {X|R} \right)q\left( R \right)} {\int {q\left( {X|R} \right)q\left( R \right)\,{\rm d}R} }}} \right. \kern-0pt} {\int {q\left( {X|R} \right)q\left( R \right)\, {\rm d}X {\rm d}R} }}, \\ \end{aligned}$$

(46)

where Cov_R|X indicates a covariance matrix of R conditioning on X. Readers are referred to Xu (2010, 2012) for recent systematic outlines on major issues for designing Ying–Yang machines. To be specific, reading is suggested to start with Sect. 3.2 in Xu (2012) and refer to Sect. 4.2 in Xu (2010) for supplementary materials. Also, readers are referred to Xu (2011) for another perspective that a co-dimensional matrix pair forms a building unit and a hierarchy of such building units sets up the BYY system.

With a BYY system designed, all the remaining unknowns in the system are determined via maximizing the harmony functional by Eq. (44). Typically, there are two types of unknowns. Given the structure of a BYY system or a parametric model in general, it actually means a family of infinite many candidate structures with everyone in a same configuration but in different scales. That is, each candidate is featured by a scale parameter $\varvec{k}$ in terms of one integer or a set of integers. For examples, $\varvec{k }$ consists of the model number k and the orders {q_i} for the model in Eq. (3), while merely of the dimension k in the APT model by Eq. (37).

The second type of unknown is featured by a set $\theta_{\varvec{k}}$ of unknown parameters within the candidate structure featured by a specific $\varvec{k}$. Accordingly, maximizing the harmony functional H(p||q) by Eq. (44) makes both parameter learning on determining $\theta_{\varvec{k}}$ and model selection on determining $\varvec{k}$. This BYY best harmony learning provides a favorable mechanism for model selection. Readers are referred to Xu (2010, 2012) for recent systematic overviews on the fundamentals, the novelties and favorable natures of the BYY best harmony learning. To be specific, reading is suggested to start with Sect. 4.1 in Xu (2012) on two different aspects of measuring bi-entity proximity and Sect. 4.2 on the BYY harmony learning from the perspectives of Ying–Yang best matching versus Ying–Yang best harmony, and then proceed to Sect. 7 for a systematic outline on the thirteen topics about the BYY best harmony learning. Also, readers are referred to Xu ( 2010) for supplementary materials in Sect. 4.1 and the roadmap shown in Fig. A2 for the relations to other typical learning approaches.

The implementation of maximizing H(p||q) consists of different specific cases for different learning problems and application tasks. Inputting the samples $\varvec{X}_{N}$ by $p\left( \varvec{X} \right) = \delta \left( {\varvec{X} - \varvec{X}_{N} } \right)$, H(p||q) in Eq. (44) is simplified into the one on the top of Table 1. As $\varvec{R}$ takes different specific forms given in the first column of Table 1, we have four types of H(p||q) as listed in the second column of the table, plus their corresponding special cases of i.i.d. samples $\left\{ {x_{t} } \right\}_{t = 1}^{N}$.

Table 1 $H(p||q)$ in four specific types of implementations

Full size table

Moreover, the collective operations $\int {[ \bullet ]} \,{\text{d}}Y_{N}$ and $\sum_{L} \left[ { \bullet } \right]$ may be simplified by removing the integral or the summation to merely consider their optimal values, from which those of H(p||q) in the second column of Table 1 result in the corresponding counterparts of $H(\varTheta_{\varvec{k}} |X_{N} )$ in the third column of the table. Each type in the second column may have more than one counterparts by removing either or both of the two collective operations. Such a removal makes learning implementation of $H(\varXi_{\varvec{k}} |X_{N} )$ easier but the learned system become more prone to an overfitting of a small size of samples.

As addressed at the end of “Learning mixture of AR, ARMA, ARCH and GRACH models” section, the BYY harmony learning has an automatic model selection mechanism similar to the RPCL learning. Additionally, $H(\varTheta_{\varvec{k}} |X_{N} )$ in the third column of Table 1 provides another angle to view such a mechanism. For example, observing the choice (a) in the last-bottom box of the table, maximizing $H(\varTheta_{\varvec{k}} |X_{N} )$ consists of maximizing not only $p\left( {\theta |X_{N} , \varXi } \right)$ that is same as the Bayesian learning, but also $\mathop \sum \nolimits_{t = 1}^{N} p(y_{t} ,\ell_{t} | x_{t} ,\theta )\pi (x_{t} ,y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )$ that includes maximizing a term $\omega_{{y_{t} ,\ell_{t} }} \ln \omega_{{y_{t} ,\ell_{t} }}$ with $\omega_{{\ell_{t} }} = q(x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} )q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )$. Noticing that $\omega_{{y_{t} ,\ell_{t} }} \ln \omega_{{y_{t} ,\ell_{t} }}$. monotonically increasing for $\omega_{{\ell_{t} }} > e^{ - 1}$ but decreasing for $\omega_{{\ell_{t} }} < e^{ - 1}$, a value $\omega_{{\ell_{t} }} = q(x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} )q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} ) > e^{ - 1}$ indicates the current fit to x_t is bigger than this threshold and increasing $\omega_{{\ell_{t} }} \ln \omega_{{\ell_{t} }}$ enhances learning by $q\left( {x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} } \right)q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )$ to fit x_t; while a value $\omega_{{\ell_{t} }} < e^{ - 1}$ indicates that this fit is below a threshold and increasing $\omega_{{\ell_{t} }} \ln \omega_{{\ell_{t} }}$ actually reduces this fit, i.e., a de-learning occurs. This is similar to the RPCL learning.

For the existing Bayes approaches, it is crucial to choosing an appropriate prior, which is usually a difficult task, while an inappropriate prior may deteriorate the performance of model selection seriously. Without any priors on the parameters, Bayes approaches degenerate to the maximum likelihood learning, while the BYY harny learning is still capable of automatic model selection. Also in Table 1, if a priori distribution q(θ|Ξ_q) is also considered, the performances of BYY harmony learning will be further improved. A simple choice of q(θ|Ξ_q) is a Jeffreys prior, for which there is no parameter Ξ_q. Alternatively, we may also consider a parametric distribution. Typically, a priori q(θ|Ξ_q) and a posteriori $p(\theta | X_{N} ,\varXi_{p} )$ are either jointly a conjugate parametric pair or approximately two parametric distributions with each having a set of hyper-parameters, namely, Ξ_p,Ξ_q. Actually, a hyper-priori q(Ξ) is further considered for $\varXi = \left\{ {\varXi_{p} , \varXi_{q} } \right\}$, for which q(Ξ) is a distribution usually with no more prior, e.g., by a Jeffreys prior.

The implementation of maximizing H(p||q) is featured by jointly determining $\varTheta_{{\varvec{k} }}$ and $\varvec{k}$, namely

$$\max_{{\varvec{k},\varTheta_{\varvec{k}} }} H\left( {\varTheta_{{\varvec{k} }} |X_{N} } \right).$$

(47)

Moreover, determining $\varTheta_{{\varvec{k} }}$ further consists of determining $\theta_{\varvec{k}}$ and $\varXi_{{\varvec{k} }}$ (if any), as well as updating y_t, ℓ_t per sample x_t. Generally, the implementation of Eq. (47) is an alternative iterative process that consists of Step yℓ for updating y_t, ℓ_t, Step θ for parameter learning, Step Ξ for learning hyper-parameters (if any), and Step $\varvec{k}$ for model selection. This process is featured by apex approximation, manifold shrinking, and balanced operation. Readers are referred to Sect. 4.3 in Xu (2012) for a recent systematic overview on major issues about the BYY harmony learning implementation and to Sect. 4.3 in Xu (2010) for further supplementary materials. Considering two typical learning tasks, readers are referred to Sect. 2 in Xu (2012) and Sect. 3 in Xu (2010) for the BYY harmony learning algorithms on Gaussian mixture and factor analysis as well as their extensions.

Learning implementation: gradient algorithms versus EM-like algorithms

The maximization by Eq. (47) can be implemented by different types of learning algorithms. The simplest and widely applicable type is featured by the following gradient based updating:

$$\varTheta_{\varvec{k}}^{\rm new} \leftarrow \varTheta_{\varvec{k}}^{\rm old} + \Delta \varTheta_{\varvec{k}} \in D_{{\varTheta_{\varvec{k}} }} , {{\Delta \varTheta }}_{\varvec{k}} \propto \nabla_{{\varTheta_{\varvec{k}} \in D_{{\varTheta_{\varvec{k}} }} }} H\left( {\varTheta_{{\varvec{k} }} |_{ } X_{N} } \right),$$

(48)

where ${{\Delta }}u \propto g_{u}$ means ${{\Delta }}u = {{\gamma }}g_{u}$ with a small γ > 0, $\nabla_{{u \in D_{u} }} f\left( u \right)$ is the gradient of f(u) with respect to u within the domain D_u of u, and $u + {{\Delta }}u \in D_{u}$ means updating within the domain D_u of u. In the sequel, the use of ${{\Delta }}u \propto g_{u}$ includes the updating $u_{{}}^{\text{new}} = u_{{}}^{\text{old}} + {{\Delta u }} \in D_{u}$ even without writing it explicitly. For those choices of $H\left( {\varTheta_{{\varvec{k} }} |_{ } X_{N} } \right)$ in Table 1, if integrals are involved, we need to first handle the integrals and then take gradient on a mathematical expression without integrals, for which we approximately use a Taylor expansion around a maximal point up to the second order. Readers are referred to Sect. 4.3 in Xu (2012) for further details.

To show how a BYY harmony learning algorithm is obtained via the gradient based updating by Eq. (48). Further details are provided on learning the following alternative mixture-of-experts:

$$\begin{aligned} p(x_{t} |\varvec{x}_{t-1}^{q} ,\theta ) & = \mathop \sum \limits_{j = 1}^{k} P(j|x_{\text{t-1}} - \mu_{j,t-1} ,\theta )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} ), \\ P(j|x_{\text{t-1}} - \mu_{j,t-1} ,\theta ) & = \frac{{\alpha_{j} G(x_{\text{t-1}} - \mu_{j,t-1} |0, \sigma_{j,t-1}^{2} )}}{{\mathop \sum \nolimits_{j = 1}^{k} \alpha_{j} G(x_{\text{t-1}} - \mu_{j,t-1} |0, \sigma_{j,t-1}^{2} )}}, \\ \end{aligned}$$

(49)

which comes from Eqs. (10), (11) and (12), while μ_j,t comes from the GARCH model given by Eq. (5). To develop algorithms for the ML learning by Eq. (16)(c) and the RPCL learning by Eq. (18), we consider the following likelihood:

$$\begin{aligned} L(\{ x_{t} \}_{t = 1}^{N} |\varTheta ) & = \mathop \sum \nolimits_{t} \ln \left\{ {\mathop \sum \limits_{j = 1}^{k} \alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )} \right\}, \\ {\text{and }}\pi_{j,t} \left( {\theta_{j} } \right) & = \ln \{ \alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )\} . \\ \end{aligned}$$

(50)

Instead of maximizing the likelihood, learning algorithm is derived for maximizing

$$\begin{aligned} H(p||q) & = \mathop \smallint \nolimits p(\theta |X_{N} ,\varXi_{p} )H(\varTheta_{\varvec{k}} |X_{N} )\,{\text{d}}\theta \\ H(\varTheta_{\varvec{k}} |X_{N} ) & = \ln [q(\theta |\varXi_{q} )q\left( \varXi \right)] + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} p_{t,t - 1} (j|\theta )\pi_{j,t} \left( {\theta_{j} } \right), \\ p_{t,t - 1} (j|\theta ) & = \frac{{\alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )}}{{\mathop \sum \nolimits_{j = 1}^{k} \alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )}}, \\ \end{aligned}$$

(51)

where q(θ|Ξ_q) is a priori distribution typically in a least redundant factorization as follows:

$$\begin{aligned} &q\left( {\theta |\varXi_{q} } \right) = q\left( {\left\{ {\alpha_{j} } \right\}_{j = 1}^{k} } \right) \prod _{j,i} q\left( {a_{j,i} } \right) \prod _{j,i} q\left( {\beta_{j,i} } \right) \prod _{j,i} q\left( {\omega_{j,i} } \right), \hfill \\& {\text{Usually, we }}\,{\text{have}} \hfill \\ &q(\{ \alpha_{j} \}_{j = 1}^{k} ):{\text{Dirichlet}}, \hfill \\ &q\left( {\beta_{j,i} } \right) , q\left( {\omega_{j,i} } \right) {:}\, {\text{nonnegative densities, e }} . {\text{g}} . , {\text{exponential}}\, {\text{or}}\, {\text{gamma,}} \hfill \\ &q\left( {a_{j,i} } \right) {:} {\text{ Gaussian }}\,{\text{or}}\, {\text{Laplacian, e}} . {\text{g}} . , {\text{a }}\,{\text{Gaussian }}\,G(a_{j,i} |0,\rho_{j,i}^{2} ) \hfill \\ &{\text{with }}q\left( {\rho_{j,i}^{2} } \right) {\text{being }}\,{\text{a}}\,{\text{Jeffreys }}\,{\text{prior}}\, {\text{or }}\,{\text{an }}\,{\text{inverse}}\, {\text{gamma}}. \hfill \\ \end{aligned}$$

(52)

Alternatively, each factor may be simply a Jeffreys prior. The posterior p(θ|X_N, Ξ_p) also have choices. First, p(θ|X_N, Ξ_p) and q(θ|Ξ_q) are a conjugate pair such that the integral over θ can be handled analytically; see Sect. 4.3 of Xu (2012). Second, we may simply consider that p(θ|X_N, Ξ_p) is free of structure and maximizing H(p||q) with respect to p(θ|X_N, Ξ_p) is simplified into the maximization of $H(\varTheta_{\varvec{k}} |X_{N} )$ with respect to $\varTheta_{\varvec{k}} .$ It follows from Eq. (48) that we consider the following gradient updating

$$\begin{aligned} \Delta \phi & \propto \nabla_{\phi } H\left( {\varTheta_{\varvec{k}}^{old} |X_{N} } \right), \phi \subset {{\varTheta }}_{\varvec{k}} = \left\{ {\theta ,\varXi_{\varvec{k}} } \right\}, \\ \nabla_{\phi } H\left( {\varTheta_{\varvec{k}} |X_{N} } \right) & = g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \rho_{j,t} \left( \theta \right)\nabla_{\phi } \pi_{j,t} \left( \theta \right), \\ g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) & = \nabla_{\phi } \ln [q(\theta |\varXi_{q} )q\left( \varXi \right)], \\ \rho_{j,t} \left( \theta \right) & = p_{t,t - 1} \left( {j |\theta } \right)\left[ {1 + \Delta \pi_{j,t} \left( \theta \right)} \right], \\ {{\Delta }}\pi_{j,t} \left( \theta \right) & = \pi_{j,t} \left( {\theta_{j} } \right) - \mathop \sum \limits_{j = 1}^{k} p_{t,t - 1} (j|\theta )\pi_{j,t} \left( {\theta_{j} } \right), \\ \end{aligned}$$

(53)

where ϕ is a subset of $\varTheta_{\varvec{k}} = \left\{ {\theta ,\varXi_{\varvec{k}} } \right\}$, e.g., either of $\left\{ {\varvec{a}_{j} } \right\},\left\{ {\mu_{j} } \right\},\left\{ {\varvec{b}_{j} } \right\},\left\{ {\varvec{w}_{j} } \right\}, \ldots {\text{etc}}.$ One particular example of ϕ is $\varvec{\alpha}= [\alpha_{1} , \ldots ,\alpha_{k} ]^{\text{T}}$ subject to each α_j ≥ 0 and $\varvec{\alpha}^{\text{T}} 1 = 1$ with 1 = [1, …, 1]^T, for which we get $\varvec{\alpha}$ via updating $\varvec{c} = [c_{1} , \ldots ,c_{k} ]^{\text{T}}$ as follows:

$$\begin{aligned} &\alpha_{j} = e^{{c_{j} }} /\sum\nolimits_{\ell } {e^{{c_{\ell } }} } , {{\Delta }}{\mathbf{c}} \propto \nabla_{\varvec{c}} H(\theta ,\varXi_{\varvec{k}} |X_{N} ) = \left( {I -\varvec{\alpha}^{\text{old}} 1^{\text{T}} } \right){\text{diag}}\left[ {\mathop \sum \limits_{t} p_{1,t} , \ldots ,\mathop \sum \limits_{t} p_{k,t} } \right], \hfill \\ & {\mathbf{If}}\, {\mathbf{a}}\ \alpha_{j} \to 0, \,{\mathbf{discard }}\,{\mathbf{the}}\,{\mathbf{ corresponding }}\,{\mathbf{structure}}\,{\mathbf{ and}}\,{\mathbf{ its}} \,\theta_{j} .\hfill \\ \end{aligned}$$

(54)

As addressed in Eq. (5) in Xu (2010) and in Sect. 4.3.2 of Xu (2012), the maximization of Eq. (47) has a mechanism that pushes α_j → 0 if the corresponding expert is extra, i.e., automatic model selection occurs. Each of nonnegative parameters in $\left\{ {\varvec{b}_{j} } \right\},\left\{ {\varvec{w}_{j} } \right\}$ may also be updated in a similar way, e.g., considering ξ = v² or ξ = exp (v) such that ξ is updated via $\Delta v \propto \nabla_{v} H(\varTheta_{\varvec{k}}^{\text{old}} |X_{N} ).$ With the help of the priories $q\left( {\beta_{j,i} } \right)$ and q(ω_j,i) in Eq. (52), the maximization of Eq. (47) also pushes β_j,i → 0 and ω_j,i → 0 if some order of the GARCH part in Eq. (4) and Eq. (5) is extra. Moreover, with help of the priori q(a_j,i) in Eq. (52), the maximization of Eq. (47) also pushes $\rho_{j,i}^{ 2} \to 0$ if some order of the AR part in Eq. (4) and Eq. (5) is extra.

The learning implementation by Eq. (53) covers not only the gradient based ML learning by simply setting Δπ_j,t(θ ^old_j ) = 0 in the Yang step, but also the RPCL learning algorithm simply with p_j,t given by Eq. (18). Moreover, setting $\varvec{w}_{i} = 0$ leads to learning a mixture of ARCH models, while setting $\varvec{w}_{i} = 0$ and $\varvec{b}_{i} = 0$ degenerates to learning a mixture of AR models.

For implementing the ML learning, it also been widely regarded that the EM algorithm is preferred over the gradient-based algorithm (Redner and Walker 1984; Xu and Jordan 1996). In addition to the gradient-based implementation by Eq. (53), the BYY harmony learning may also be implemented by the following EM-like procedure:

$$\begin{aligned}& {\text{Yang }}\,{\text{Step: }}p_{j,t} = \rho_{j,t} \left( {\theta^{\text{old}} } \right), {\text{see Eq}}.\,\left( { 5 3} \right), \hfill \\ &{\text{Ying }}\, {\text{Step: Let}}\ \tilde{\varTheta }_{\varvec{k}} = \varTheta_{k} - \phi \ \ {\text{and}}\ \ \tilde{\theta } = \theta - \phi , \hfill \\ &{\text{Solve}}\, {\text{the }}\,{\text{root }}\phi^{*} {\text{of }}\chi \left( \phi \right) = 0\, {\text{or }}\,{\text{approxaimtely }}\left( {{\text{if }}\,{\text{difficult}}} \right), \hfill \\ &\chi \left( \phi \right) = g_{\phi } \left( {{\tilde{{\varTheta }}}_{k}^{{ {\text{old}}}} \mathop \cup \nolimits \phi } \right) + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} p_{j,t} \nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }_{{}}^{\text{old}} \mathop \cup \nolimits \phi } \right), \hfill \\ &{\text{Then, update }}\phi^{\text{new}} = \phi^{*} , \hfill \\ \end{aligned}$$

(55)

where A–B denotes the complement of A with respect to B, i.e., $\varvec{A} {-} \varvec{B} = \left\{ {x \in \varvec{A}\left| {x \notin \varvec{B}} \right.} \right\}$. When the root ϕ^* of χ(ϕ) = 0 is solved analytically, setting Δπ_j,t(θ) = 0 makes Eq. (53) degenerate to the EM algorithm for the ML learning if $g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) = 0$ or the Bayes learning if $g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) \ne 0$. Generally, the algorithm by Eq. (55) is different from the EM algorithm by the factor 1 + Δπ_j,t(θ), which takes an important role in making model selection. However, the EM algorithm is guaranteed to converge (Redner and Walker 1984), while the factor 1 + Δπ_j,t(θ) makes the Ying–Yang iteration lose such a guarantee.

Efforts are made on remedying this weakness. One simple way is replacing ϕ^new = ϕ^* in Eq. (55) by the following linear combination

$$\phi^{\text{new}} = \phi^{\text{old}} + \eta \left( {\phi^{*} - \phi^{\text{old}} } \right), \quad 0\le \eta \le 1.$$

(56)

E.g., see Box 3 and Remark (c) in Fig. 7 and Box 7 in Fig. 8 of Xu (2010). However, how to choose an appropriate 0 ≤ η ≤ 1 remains a problem, which can be handled in one of the following two ways:

Initialize η ≤ 1, get ϕ^new by Eq. (56) and check whether $H(\tilde{\varTheta }_{k}^{{o{\text{ld}}}} \mathop \cup \phi^{new} |X_{N} ) > H(\tilde{\varTheta }_{k}^{{o{\text{ld}}}} \mathop \cup \nolimits \phi^{old} |X_{N} )$

If yes, we move to the next Ying step in Eq. (55), otherwise reduce η in some way to get ϕ^new and make such a check again.
Seek an optimal η^* that maximizes $H\left( \eta \right) = H(\tilde{\varTheta }_{k}^{\text{old}} \mathop \cup \left[ {\phi^{\text{old}} + \eta \left( {\phi^{*} - \phi^{\text{old}} } \right)} \right]|X_{N} )$, which can be handled by one of many techniques for one variable optimization. One example is solving the root of dH(η)/dη = 0.

Alternatively, another way to get ϕ^new from ϕ^* and ϕ^old is a reconsideration of $\nabla_{\phi } H(\varTheta_{\varvec{k}} |X_{N} )$ in Eq. (53). Making a first order Taylor expansion of ρ_j,t(θ) around θ^old and of ∇_ϕπ_j,t(θ) around ϕ^*, we consider

$$\begin{aligned} & \rho_{j,t} \left( \theta \right)\nabla_{\phi } \pi_{j,t} \left( \theta \right) \approx \\ & \quad \left[ {\rho_{j,t} \left( {\theta^{\text{old}} } \right) + \nabla_{\phi } \rho_{j,t} \left( {\theta^{\text{old}} )^{\rm T} \left( {\phi - \varphi^{old} } \right)} \right]} \right[\nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right) + \nabla_{{\phi \phi^{\rm T} }} \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right)\left( {\phi - \varphi^{*} } \right)] \\ & \approx \rho_{j,t} \left( {\theta^{\text{old}} } \right)\nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right) + U_{j,t} \left( {\phi - \varphi^{\text{old}} } \right) + V_{j,t} \left( {\phi - \varphi^{*} } \right) \\ & U_{j,t} = \nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right)\nabla_{\phi } \rho_{j,t} (\theta^{\text{old}} )^{\rm T},\ \ V_{j,t} = \rho_{j,t} \left( {\theta^{\text{old}} } \right)\nabla_{{\phi \phi^{\rm T} }} \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right), \\ \end{aligned}$$

where the second ≈ comes from dropping the second order term $\left( {\phi - \varphi^{\text{old}} } \right)^{\rm T} \nabla_{\phi } \rho_{j,t} \left( {\theta^{\text{old}} } \right)\;\nabla_{{\phi \phi^{\rm T} }} \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right)\left( {\phi - \varphi^{*} } \right)$. Taking the sum over j, t, the counterpart of the first term becomes $\chi \left( {\phi^{*} } \right) = 0$ and thus disappears, from which we are led to

$$\psi \left( \phi \right) = \nabla_{\phi } H(\varTheta_{\varvec{k}} |X_{N} ) \approx g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \left[ {U_{j,t} \left( {\phi - \varphi^{\text{old}} } \right) + V_{j,t} \left( {\phi - \varphi^{*} } \right)} \right].$$

(57)

Then, we solve ψ(ϕ^new) = 0 to get $\phi^{\text{new}}$ from ϕ^* and ϕ^old. Particularly, when $g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) = 0$ we simply have

$$\phi^{\text{new}} = \left[ {\mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \left( {U_{j,t} + V_{j,t} } \right)} \right]^{ - 1} \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \left( {U_{j,t} \varphi^{\text{old}} + V_{j,t} \varphi^{*} } \right).$$

(58)

It is still a linear function of ϕ^* and ϕ^old, but becomes much advanced than the one by Eq. (56).

Linear causal analyses

Path analyses and a recent development on ρ-diagram

Path analyses is one earliest causal analysis approach, proposed around 1918 by Sewall Wright who made its developments more extensively in the 1920s (Wright 1921, 1934). It has been not only further investigated in the formulation of structural equation modeling (SEM) (Ullman 2006; Hooper et al. 2008; Pearl 2010a; Kline 2015) with wide applications, but also found its uses in many complex modeling areas, including biology, psychology, sociology, and econometrics. Details are left to a vast volume of publications in literature. Here, we introduce a recent development on a modified formulation named ρ-diagram (Xu 2018).

The formulation considers a directed acyclic graph (DAG) or Bayesian networks, with visible nodes x₁, x₂,…, x_n and hidden nodes w₁,…,w_m. Each x_i is normalized to be zero mean and unit variance and each w_j is assumed to be zero mean and unit variance too; while each edge is associated with the correlation coefficient between its two nodes. In other words, such a diagram is completely defined by pairwise correlation coefficients, and thus called ρ-diagram in that each correlation coefficient is denoted by ρ shortly. Being different from the classical procedure for path analyses, namely getting topology by prior, estimating unknown parameters and causal effects, and making model-fit assessment on alternative models, a TPC procedure is suggested for ρ-diagram (Xu 2018), which begins at Topology discovery from data based on ρ-diagram, and then makes Parameter estimation and Causality embedded model-fit assessment.

Topology discovery is based on equations that are obtained from path tracing in a way similar to Wright’s system of tracing rules. The difference is that unknowns in equations involve only the within-diagram ρ-variables, while knowns are pairwise correlation r-coefficients obtained from visible nodes x₁, x₂,…, x_n, subject to the constraints that all the ρ-variables vary between [− 1,+ 1]. We discover a topology underlying data by checking whether a set of constrained equations is deterministically solved, that is, having (1) no solution, (2) a unique solution (or few solutions), and (3) infinite many of solutions.

For details refer to Xu (2018). Here, an illustration is made on topologies of 3-node diagrams, as illustrated in Fig. 8. Given a diagram with nodes x, y, z, the simplest case is illustrated in Fig. 8a, featured by that every pairwise correlation is zero or there is only one pair that gets r_ij ≠ 0, which can be directly identified by observing r_ij, ∀i,j ∈{x,y,z}. Shown in Fig. 8b are topologies that have two edges. The first one gets two edges in a fork, which can be identified by observing r_ij = 0 for only one pair while r_ij ≠ 0 for other two pairs. The other topologies describes the causality from conditional independence analysis, which can be identified by observing r_ikr_kj = r_ij ≠ 0 ∀i,j ∈{x,y,z} on all the permutations of x, y, z.

Shown in Fig. 8c are two typical topologies of widely encountered causal structure called cofounder. Via path tracing, the following equations are obtained:

$$\rho_{ki} + \rho_{kj} \rho_{ji} = r_{ki} ,\ \ \rho_{ji} + \rho_{kj} \rho_{ki} = r_{ji} ,\ \ \rho_{kj} = r_{kj} ; \quad - 1 \le \rho_{ji} , \rho_{kj} ,\rho_{ki} \le 1$$

(59)

As shown in Fig. 8c, we may check whether two lines get cross within the dashed box. If yes, a cofounder is identified in either of two topologies on the bottom of Fig. 8c. However, the direction between j and k cannot be identified. Even so, the direct causal direction and effect

$$\rho_{ji} = {{\left( {r_{ji} - r_{kj} r_{ki} } \right)} \mathord{\left/ {\vphantom {{\left( {r_{ji} - r_{kj} r_{ki} } \right)} {\left( {1 - r_{kj}^{2} } \right)}}} \right. \kern-0pt} {\left( {1 - r_{kj}^{2} } \right)}}$$

(60)

is uniquely determined, i.e., the cofounder effect can be remedied.

If two lines do not intersect within the box, one may further check one other permutation of labels i, j, k. It is unlikely that two different permutations are both identified because it merely happens when not only ρ = r holds on two edges but also four linear equations have consistent solution for unknowns. If no permutation can be identified, it means that there is not such a cofounder causality underlying data. However, there may be still other causality. On one hand, we may check whether there is some causality in types of Fig. 8a, b. On the other hand, we may continue to diagrams with four nodes or more.

Causal potential theory

As already mentioned above, the direction between j and k in Fig. 8c cannot be identified. Also, edge directions in Fig. 8b cannot be identified too. There have been extensive studies on detecting causal direction and evaluating causal strength (Peters et al. 2009; Zhang and Hyvärinen 2009; Hoyer et al. 2009; Rubin and John 2011), via analyzing certain types of asymmetry between two variables X and Y. One most authoritative definition of causality is p(Y|do X = x) with ‘do X = x’ indicating the action that imposes X = x (Pearl 2010b). In these studies, causality is actually examined from a descriptive perspective.

As illustrated in Fig. 8d, possible movements that apple falls and balance loses are actually caused by physics mechanism, i.e., the law of universal gravitation and the lever principle, where causality is actually an issue of dynamics, about how movements are caused by forces that come from potential difference. It follows from the viewpoint of grand unification that we are thus motivated to believe that causality in terms of probability, information, and intelligence should be also governed by similar dynamics.

Consider the relationship described by density distribution $p\left( {x,y} \right), \varvec{ }$ as illustrated in Fig. 8d, the quantity ${\text{E}}\left( {x,y} \right) \propto - { \ln } p\left( {x,y} \right)$ actually describes a sort of potential energy density on an infinitesimal piece dxdy, and represents a difference of potential energy density in reference of a uniform distribution on the space x, y, while we can get

$$\left[ {I_{x} ,I_{y} } \right] = \left[ { - \frac{{\partial {\text{E}}\left( {x,y} \right)}}{\partial x}, - \frac{{\partial {\text{E}}\left( {x,y} \right)}}{\partial y}} \right]$$

(61)

to represent a force field that drives information flow toward the area with the lowest energy, or equivalently driving that information flows from rare occurring locations toward high occurring locations.

Changes of x, y and the rates of changes are described by I_x, I_y, respectively, and both are actually driven by the difference of potential energy density of E(x, y). The problems about whether one of X, Y causes the other or whether two are mutually caused each other may be examined through I_x, I_y. Typically, we may encounter the following cases:

$$\begin{aligned} {\text{Case}} \,O: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( x \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( y \right); \\ {\text{Case }}\,A: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( x \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( {x,y} \right); \\ {\text{Case }}\,B: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( {x,y} \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( y \right); \\ {\text{Case }}\,C: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( {x,y} \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( {x,y} \right). \\ \end{aligned}$$

(62)

For Case O, changes of x merely relates to itself, while changes of y merely relates to itself, that is, changing x is independent of change of y. For Case A, changes of x merely relates to itself, while changes of y relate to both of $x,y,$ where we may regard that changing x causes change of y. For Case B, changes of y merely relates to itself, while changes of x relate to both of $x,y,$ where we may regard that changing y causes change of x. For Case C, changes of x, y are mutually related.

From a set of samples of x, y, we may develop certain statistics to identify which case is actually encountered. Due to noise and a finite sample size, the first three cases are rarely found. What are often encountered is Case C. In such cases, we may further check whether one of x, y takes a dominant role, while the other maybe ignored, that is, whether we have either or both of

$$f\left( {x,y} \right) \approx f\left( x \right), g\left( {x,y} \right) \approx g\left( y \right).$$

(63)

Further insights on causality may be obtained from this perspective, not only a pair X, Y may be identified in one of the four cases on the entire domain that x, y vary, but also a pair may be identified in one case on some subdomain but in a different case on some different subdomain. That is, causal direction may reverse, disappear, and emerge as x, y vary on different subdomains.

To be more specific, we observe two typical examples. The first considers binary x, y from

$$p\left( {x,y} \right) = p\left( {y |x} \right)p\left( x \right), \quad {\text{for}}\ x,y = 0, 1$$

(64)

$$p\left( {y |x} \right) = s^{y} \left( {bx + c} \right)\left[ {1 - s\left( {bx + c} \right)} \right]^{1 - y} ,\,q\left( x \right) = q^{x} \left( { 1 - q} \right)^{ 1- x} ,$$

where s(r) is a sigmoid function and p(y|x) describes a logistic regression, for which we get

$$\begin{aligned} I_{x} & = { \ln }\frac{q}{1 - q} + bs^{{\prime \left( {bx + c} \right)}} \left[ {\frac{y}{{s\left( {bx + c} \right)}} - \frac{1 - y}{{1 - s\left( {bx + c} \right)}}} \right] = { \ln }\frac{q}{1 - q} + b\delta , \\ \delta & = s\left( {bx + c} \right) - y,\ \ I_{y} = { \ln }\frac{{s\left( {bx + c} \right)}}{{1 - s\left( {bx + c} \right)}}. \\ \end{aligned}$$

(65)

We usually have $\delta \approx 0$ if the logistic regression fits well, thus it leads to Case A above, i.e., the causal direction is x → y, which is consistent to our existing understanding on this model.

The second example considers p(x,y) from a joint density of Gaussian variables x, y with zero mean and unit variance as well as their correlation coefficient ρ. It follows that

$$-I_{x} = x + \rho y,\ \ -I_{y} = y + \rho x,$$

(66)

which leads to Case 0 when ρ = 0, Case A when ρy ≈ 0, Case B when ρx ≈ 0, and Case C in general. That is, we are unable to identify causal direction on the entire domain, which is also consistent to our existing understanding. Interestingly, we get new insight that it is possible to detect causal direction in some particular subdomains.It also may deserve to extend these studies to consider a density $p\left( {\varvec{x},\varvec{y}} \right) \varvec{ }$ with $\varvec{x},\varvec{y}$ being vectors such that we examine causality between two groups of variables.

SEM and its relations to modulated TFA-APT and nGCH-driven M-TFA-O

In its early stages of developments, modeling by equations in path analyses and structural equation modeling (SEM) were used without a particular clarification. In recent decades, SEM is gradually developed into the following formulation (Ullman 2006; Kline 2016):

$$\varvec{x} = {\varvec{\Lambda}}_{\varvec{x}}\varvec{\xi}+\varvec{\delta},\ \varvec{ y} = {\varvec{\Lambda}}_{\varvec{y}}\varvec{\eta}+\varvec{\varepsilon},\ \varvec{ \eta } = \varvec{B\eta } + {\varvec{\Gamma}}\varvec{\xi}+ \varvec{\varsigma }$$

(67)

To compare modulated TFA-APT and nGCH-driven M-TFA-O, we observe the following equations from Eq. (42) and in Eq. (43):

$$\varvec{r}_{\varvec{t}} = \varvec{a} + \varvec{Af}_{\varvec{t}} + \varvec{e}_{\varvec{t}},\ \varvec{ f}_{\varvec{t}} = \varvec{Bf}_{{\varvec{t} - 1}} + \varvec{Hm}_{\varvec{t}} +\varvec{\varepsilon}_{\varvec{t}},\ \varvec{ m}_{\varvec{t}} = \varvec{Cv}_{\varvec{t}} +\varvec{\eta}_{\varvec{t}} ,$$

Putting the last one into the second one, we may rewrite

$$\begin{aligned} \varvec{ f}_{\varvec{t}} & = \varvec{Bf}_{{\varvec{t} - 1}} + \varvec{HCv}_{\varvec{t}} + \varvec{H\eta }_{\varvec{t}} +\varvec{\varepsilon}_{\varvec{t}} ,\varvec{ } \\ \varvec{r}_{\varvec{t}} & = \varvec{a} + \varvec{Af}_{\varvec{t}} + \varvec{e}_{\varvec{t}},\ \varvec{ m}_{\varvec{t}} = \varvec{Cv}_{\varvec{t}} +\varvec{\eta}_{\varvec{t}} . \\ \end{aligned}$$

(68)

Table 2 compares the notations in Eqs. (62) and (63).

Table 2 In comparison with modulated TFA-APT and GMCH-driven M-TFA

Full size table

The two are actually the same at the special case H = 0. Generally, we observe that modulated TFA-APT may be regarded as a variant or extension of SEM.

Coming from different perspectives, SEM and the modulated TFA–APT aim at causal analysis in a closely related way. Both consist of FA as basic ingredient that suffers the intrinsic rotation indeterminacy by Eq. (40). In path analysis and SEM study, the problem is avoided by making hidden factors f_t and/or the elements of A partly known with human-aide. While in the modulated TFA-APT, the problem is solved by considering both independence cross hidden factors and temporal dependence Bf_t−1 among each factor. We may combine the ideas to improve each other. On one hand, SEM motivates us to prune away extra edges that correspond to elements of A, which may be implemented by sparse learning. On the other hand, we may improve SEM by considering temporal dependence among endogenous factors.

Moreover, rotation indeterminacy may also be removed by changing the driving noise of hidden factors from Gaussian q(ɛ ^{(
j)}_t ) into non-Gaussian q(ɛ ^{(
j)}_t ) (Xu 2001, 2004). Furthermore, conditional heteroskedasticity (Chiu and Xu 2003) has also been included in the driving noise to encode non-stationarity. The two points are actually included in Item (c) in Eq. (43), which extends the modulated TFA-APT into nGCH-driven M-TFA-O, which may also be used to improve SEM. Furthermore, a non-diagonal matrix B may be considered to replace a diagnal matrix B in TFA, such that Granger causality like problem (Granger 1969) may be taken in consideration together with the previous cofounder problem further examined.

Abbreviations

AIC:: Akaike information criterion
APT:: arbitrage pricing theory
AR:: autoregressive
ARCH:: Autoregressive Conditional Heteroskedasticity
ARIMA:: autoregressive integrated moving average
ARMA:: autoregressive–moving average
BYY:: Bayesian Ying Yang
BIC:: Bayesian information criterion
CAIC:: consistent AIC
CAPM:: capital asset pricing model
EMH:: efficient market hypothesis
HMM:: hidden Markov model
GARCH:: generalized ARCH
LDS:: linear dynamical system
LR:: likelihood ratio
MDL:: minimum description length
ME:: mixture-of-experts
ML:: maximum likelihood
MLFA:: maximum likelihood factor analysis
MML:: minimum message length
MUV:: mixture using variance
NFA:: non-Gaussian factor analyses
NRBF:: normalized radial basis function
ρ-diagram:: a diagram defined by a set of pairwise correlation coefficients
RPCL:: rival penalized competitive learning
SEM:: structural equation modeling
SSM:: state space model
TFA:: temporal factor analysis
VAR:: vector autoregressive
VB:: variational Bayes

References

Abeysekera SP, Mahajan A (1987) A test of the APT in pricing UK stocks. J Account Finance 17(3):377–391
Google Scholar
Azeez AA, Yonezawa Y (2006) Macroeconomic factors and the empirical content of the Arbitrage Pricing Theory in the Japanese stock market. Jpn World Econ 18(4):568–591
Article Google Scholar
Azoff ME (1994) Neural network time series forecasting of financial markets. Wiley, New York
Google Scholar
Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econom 31:307–327
Article MathSciNet MATH Google Scholar
Box G, Jenkins G (1970) Time series analysis: forecasting and control. Holden-Day, San Francisco
MATH Google Scholar
Brown SJ (1989) The number of factors in security returns. J Finance 44(5):1247–1262
Article Google Scholar
Chamberlain G, Rothschild M (1983) Arbitrage, factor structure, and mean–variance analysis on large asset markets. Econometrica 51(5):1281–1304
Article MathSciNet MATH Google Scholar
Chen NF, Roll R, Ross S (1986) Economic forces and the stock market. J Bus 59(3):383–403
Article Google Scholar
Cheung YM, Leung WM, Xu L (1996) Combination of buffered back-propagation and RPCL-CLP by mixture-of-experts model for foreign exchange rate forecasting. In: Proceedings of 3rd international conference on neural networks in the capital markets, London, UK, Oct 11–13, 1996. World Scientific Pub, Singapore, pp 554–563
Cheung Y, Leung WM, Xu L (1997) Adaptive rival penalized competitive learning and combined linear predictor model for financial forecast and investment. Int J Neural Syst 8:517–534
Article Google Scholar
Chiu KC, Xu L (2002) Stock price and index forecasting by arbitrage pricing theory-based Gaussian TFA learning. In: Yin HJ (ed) Lecture notes in computer sciences (LNCS), vol 2412. Springer, Berlin, pp 366–371
Chiu KC, Xu L (2002) A comparative study of Gaussian TFA learning and statistical tests on the factor number in APT. In: Proceedings of international joint conference on neural networks 2002 (IJCNN ‘02), Honolulu, Hawaii, USA, May 12–17, 2002. pp 2243–2248
Chiu KC, Xu L (2003) Stock forecasting by ARCH driven Gaussian TFA and alternative mixture experts models. In: Proceedings of 3rd international workshop on computational intelligence in economics and finance, North Carolina, USA, Sept 26–30. pp 1096–1099
Chiu KC, Xu L (2003) On generalized arbitrage pricing theory analysis: empirical investigation of the macroeconomics modulated independent state–space model. In: Proceedings of 2003 international conference on computational intelligence for financial engineering, Hong Kong, March 20–23. pp 139–144
Chiu KC, Xu L (2004a) Arbitrage pricing theory based Gaussian temporal factor analysis for adaptive portfolio management. J Decis Support Syst 37:485–500
Article Google Scholar
Chiu KC, Xu L (2004b) NFA for factor number determination in APT. Int J Theor Appl Finance 7:253–267
Article MATH Google Scholar
Choey M, Weigend AS (1997) Nonlinear trading models through Sharpe ratio optimization. Int J Neural Syst 8(3):417–431
Article Google Scholar
Dhrymes PJ, Friend I, Gultekin B (1984) A critical reexamination of the empirical evidence on the arbitrage pricing theory. J Finance 39(2):323–346
Article Google Scholar
Engle RF (1982) Autoregressive conditional heteroscedasticity with estimates of variance of United Kingdom Inflation. Econometrica 50:987–1008
Article MathSciNet MATH Google Scholar
Engle RF, Granger CWJ (1987) Co-integration and error–correction: representation, estimation and testing. Econometrica 55(2):251–276
Article MathSciNet MATH Google Scholar
Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396
Article Google Scholar
Fishburn PC (1977) Mean-risk analysis with risk associated with below-target returns. Am Econ Rev 67(2):116–126
Google Scholar
Gately E (1995) Neural networks for financial forecasting. John Wiley & Sons, New York
Google Scholar
Ghahramani Z, Hinton GE (2000) Variational learning for switching state–space models. Neural Comput 12(4):831–864
Article Google Scholar
Granger CWJ (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37(3):424–438
Article MATH Google Scholar
Hooper D, Coughlan J, Mullen MR (2008) Structural equation modelling: guidelines for determining model fit. Electron J Bus Res Methods 6(1):53–60
Google Scholar
Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. In: Advances in neural information processing systems, pp 689–696
Hung KK, Cheung CC, Xu L (2000) New Sharpe-ratio-related methods for portfolio selection. In: IEEE/IAFE/INFORMS 2000 conference on computational intelligence for financial engineering, New York City, USA, March 26–28, pp 34–37
Hung KK, Cheung Y, Xu L (2003) An extended ASLD trading system to enhance portfolio management. IEEE Trans Neural Networks 14:413–425
Article Google Scholar
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3:79–87
Article Google Scholar
Jangmin O, Jongwoo L, Lee JW, Zhang BT (2006) Adaptive stock trading with dynamic asset allocation using reinforcement learning Inform Sci 176(15):2121–2147
Google Scholar
Jordan MI, Xu L (1995) Convergence results for the EM approach to mixtures of experts architectures. Neural Netw 8:1409–1431
Article Google Scholar
Kline RB (2015) Principles and practice of structural equation modeling, 4th edn. Guilford Publications, New York
MATH Google Scholar
Kwok HY, Chen CM, Xu L (1998) Comparison between mixture of ARMA and mixture of AR model with application to time series forecasting. In: Proceedings of international conference on neural information processing, Kitakyushu, Japan, October 21–23, vol 2. pp 1049–1052
Leontaritis IJ, Billings SA (1985) Input-output parametric models for non-linear systems Part I: deterministic non-linear systems and Part II: stochastic non-linear systems. Int J Control 41:303–344
Article MATH Google Scholar
Leung WM, Cheung Y, Xu L (1997) Application of mixture of experts models to nonlinear financial forecasting. In: Caldwell RB (ed) Nonlinear financial forecasting: proceedings of the first INFFC, (Finance & Technology Publishing, 1997), pp 153–168
Markowitz HM (1952) Portfolio selection. J Finance 7(1):77–91
Google Scholar
Markowitz HM (1959) Portfolio selection: efficient diversification of investments. John Wiley & Sons, New York
Google Scholar
McGrory CA, Titterington DM (2007) Variational approximations in Bayesian model selection for finite mixture distributions. Comput Stat Data Anal 51(11):5352–5367
Article MathSciNet MATH Google Scholar
Moody J, Saffell M (2001) Q learning to trade via direct reinforcement. IEEE Trans Neural Networks 12(4):875–889
Article Google Scholar
Moody J, Lizhong W, Liao Y, Saffell M (1998) Performance functions and reinforcement learning for trading systems and portfolios. J Forecasting 17:441–470
Article Google Scholar
Neuneier R (1996) Optimal asset allocation using adaptive dynamic programming. In: Touretzky DS (ed) Advances in neural information processing systems, 8th edn. MIT Press, Cambridge, pp 952–958
Google Scholar
Pearl J (2010) An introduction to causal inference. Int J Biostat 6(2):1–62
Article MathSciNet Google Scholar
Perrone MP (1994) Putting it all together: methods for combining neural networks. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems. Morgan Kaufmann, San Francisco, pp 1188–1189
Google Scholar
Perrone MP, Cooper LN (1993) When networks disagree: ensemble methods for neural networks. In: Mammone RJ (ed) Neural networks for speech and image processing. Chapman & Hall, New York, pp 126–142
Google Scholar
Peters J, Janzing D, Gretton A, Schölkopf B (2009) Detecting the direction of causal time series. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, pp 801–808
Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77(2):257–286
Article Google Scholar
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood, and the EM algorithm. SIAM Rev 26:195–239
Article MathSciNet MATH Google Scholar
Ross S (1976) The arbitrage theory of capital asset pricing. J Econ Theory 13(3):341–360
Article MathSciNet Google Scholar
Rubin DB, John L (2011) Rubin causal model. International encyclopedia of statistical science. Springer, Berlin, pp 1263–1265
Chapter Google Scholar
Sharpe WF (1964) Capital asset prices: a theory of market equilibrium under conditions of risk. J Finance XIX(3):425–442
Google Scholar
Sharpe FW (1966) Mutual fund performance. J Bus 39(S1):119–138
Article Google Scholar
Sharpe WF (1994) The Sharpe ratio-properly used, it can improve investment. J Portfolio Manag Fall 21:49–58
Article Google Scholar
Shumway RH, Stoffer DS (1991) Dynamic linear models with switching. J Am Stat Assoc 86(415):763–769
Article MathSciNet Google Scholar
Sims C (1980) Macroeconomics and reality. Econometrica 48(1):1–48
Article Google Scholar
Sortino FA, van der Meer R (1991) Downside risk: capturing what’s at stake in investment situations. J Portfolio Manag 17(4):27–31
Article Google Scholar
Tang H, Chiu K-C, Xu L (2003) Finite mixture of ARMA-GARCH model for stock price prediction. In: Proceedings of 3rd international workshop on computational intelligence in economics and finance, North Carolina, USA, Sep 26–30, pp 1112–1119
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B 58(1):267–288
MathSciNet MATH Google Scholar
Tu S, Xu L (2011) An investigation of several typical model selection criteria for detecting the number of signals. Front Electr Electron Eng China 6(2):245–255
Article Google Scholar
Ullman JB (2006) Structural equation modeling reviewing the basics and moving forward. J Pers Assess 87(1):35–50
Article Google Scholar
Wang P et al (2011) Radar HRRP statistical recognition with temporal factor analysis by automatic Bayesian Ying–Yang harmony learning. Front Electr Electron Eng China 6(2):300–317
Article MATH Google Scholar
Westland JC (2015) Structural equation modeling: from paths to networks. Springer, New York
Google Scholar
Williams PM (1995) Bayesian regularization and pruning using a Laplace prior. Neural Comput 7(1):117–143
Article Google Scholar
Wong WC, Yip F, Xu L (1998) Financial prediction by finite mixture GARCH model. In: Proceedings of international conference on neural information processing, Kitakyushu, Japan, Oct 21–23, 3(1998), pp 1351–1354
Wright S (1921) Correlation and causation. J Agric Res 20(7):557–585
MathSciNet Google Scholar
Wright S (1934) The method of path coefficients. Ann Math Stat 5(3):161–215
Article MATH Google Scholar
Xu L (1994) Signal segmentation by finite mixture model and EM algorithm. In: Proceedings of international symposium on artificial neural networks, Tainan, Dec 15–17, pp 453–458
Xu L (1995) Channel equalization by finite mixtures and the EM algorithm. In: Proceedings of IEEE neural networks and signal processing workshop. Cambridge, MA, Aug 31–Sep 2, vol 5, pp 603–612
Xu L (1995) Ying–Yang machines: a Bayesian–Kullback scheme for unified learning and new results on vector quantization. In: Proceedings of the international conference on neural information processing, Beijing, China, Oct 30–Nov 3, pp 977–988 (A further version Advances in NIPS8, Touretzky DS et al (ed), MIT Press, Cambridge MA, 1996: 444–450)
Xu L (1997) Bayesian Ying Yang system and theory as a unified statistical learning approach: (II) from unsupervised learning to supervised learning, and temporal modeling. In: Wong KM et al (eds) Proceedings of theoretical aspects of neural computation: a multidisciplinary perspective. Springer, Berlin, pp 29–42
Xu L (1998) RBF nets, mixture experts, and Bayesian Ying–Yang learning. Neurocomputing 19:223–257
Article MATH Google Scholar
Xu L (2000) Temporal BYY learning for state space approach, hidden Markov model, and blind source separation. IEEE Trans Signal Process 48(7):2132–2144
Article MathSciNet MATH Google Scholar
Xu L (2001) BYY harmony learning, independent state space and generalized APT financial analyses. IEEE Trans Neural Netw 12:822–849
Article Google Scholar
Xu L (2002) Temporal factor analysis: stable-identifiable family, orthogonal flow learning, and automated model selection. In: Proceedings of international joint conference on neural networks. Honolulu, HI, USA, 12–17 May, pp 472–476
Xu L (2004) Advances on BYY harmony learning: information theoretic perspective, generalized projection geometry, and independent factor autodetermination. IEEE Trans Neural Netw 15(4):885–902
Article MathSciNet Google Scholar
Xu L (2007) A unified perspective and new results on RHT computing, mixture based learning, and multi-learner based problem solving. Pattern Recogn 40:2129–2153
Article MATH Google Scholar
Xu L (2009) Learning algorithms for RBF functions and subspace based functions. In: Olivas ES et al (eds) Handbook of research on machine learning applications and trends: algorithms, methods and techniques. IGI Global, Hershey, pp 60–94
Google Scholar
Xu L (2010) Bayesian Ying–Yang system, best harmony learning, and five action circling. J Front Electr Electron Eng China 5(3):281–328 (A special issue on Emerging Themes on Information Theory and Bayesian Approach)
Article MathSciNet Google Scholar
Xu L (2012) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. J Front Electr Electron Eng 7(1):147–196 (A special issue on Machine learning and intelligence science: IScIDE (C))
Google Scholar
Xu L (2018) Deep bidirectional intelligence: AlphaZero, deep IA-search, deep IA-infer, and TPC causal learning. Appl Inform 5(5):38
Google Scholar
Xu L, Amari S (2008) Combining classifiers and learning mixture of experts. In: Rabuñal Dopico JR (ed) Encyclopedia of artificial intelligence. IGI Global, Hershey, pp 318–326
Google Scholar
Xu L, Cheung Y (1997) Adaptive supervised learning decision networks for traders and portfolios. J Comput Intell Finance 5(6):11–16 (A short version also in Proceedings of IEEE-IAFE 1997 International Conference on Computational Intelligence for Financial Engineering (CIFEr), New York City, March 23-25, 1997, 206–212)
Google Scholar
Xu L, Jordan MI (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput 8(1):129–151
Article Google Scholar
Xu L, Krzyzak A, Oja E (1992) Unsupervised and supervised classifications by rival Penalized competitive learning. In: Proceedings of 11th international conference on pattern recognition. Hague, Netherlands, Aug 30–Sep 3, pp 672–675
Xu L, Krzyzak A, Oja E (1993) Rival penalized competitive learning for clustering analysis, RBF net and curve detection. IEEE Trans Neural Netw 4:636–649
Article Google Scholar
Xu L, Jordan MI, Hinton GE (1994) A modified gating network for the mixtures of experts architecture. Proceedings of 1994 world congress on neural networks, vol 2. San Diego, CA, June 4–9, pp 405–410
Xu L, Jordan MI, Hinton GE (1995) An alternative model for mixtures of experts. In: Tesauro G et al (eds) Advances in neural information processing systems 7. MIT Press, Cambridge, pp 633–640
Google Scholar
Zhang PG (ed) (2003) Neural networks in business forecasting, forecasting and control. IRM Press, London
Google Scholar
Zhang K, Hyvärinen A (2009) On the identifiability of the post-nonlinear causal model. Proceedings of the 25th conference on uncertainty in artificial intelligence (UAI 2009). Montreal, Canada, 2009, pp 647–655

Download references

Authors’ contributions

All from the sole author LX. The author read and approved the final manuscript.

Acknowledgements

This work was supported by the Zhi-Yuan chair professorship start-up Grant (WF220103010) from Shanghai Jiao Tong University.

Competing interests

The author declares that there is no competing interests.

Availability of data and materials

Not applicable.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Funding

WF220103010, Shanghai Jiao Tong University.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

CMaCH Centre, School of Electronic Information and Electrical Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Minhang District, Shanghai, 200240, China
Lei Xu
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong, China
Lei Xu

Authors

Lei Xu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lei Xu.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article

Xu, L. Machine learning and causal analyses for modeling financial and economic data. Appl Inform 5, 11 (2018). https://doi.org/10.1186/s40535-018-0058-5

Download citation

Received: 05 December 2018
Accepted: 20 December 2018
Published: 29 December 2018
DOI: https://doi.org/10.1186/s40535-018-0058-5

Machine learning and causal analyses for modeling financial and economic data

Abstract

Introduction

Financial prediction: time series models and three finite mixture extensions

Time series models and neural networks

Learning mixture of AR, ARMA, ARCH and GRACH models

Maximum likelihood, RPCL learning and learning with model selection

Dynamic trading and portfolio management

Dynamic trading by supervised learning and reinforcement learning

Dynamic portfolio management by maximizing Sharpe ratio and extensions

Market modeling: APT theory and temporal factor analysis

Arbitrage pricing theory and factor analysis’s incapability

Temporal factor analysis and temporal APT

Macroeconomics-modulated TFA-APT and nGCH-driven M-TFA-O

Bayesian Ying–Yang harmony learning and two exemplar learning algorithms

Bayesian Ying–Yang (BYY) harmony learning

Learning implementation: gradient algorithms versus EM-like algorithms

Linear causal analyses

Path analyses and a recent development on ρ-diagram

Causal potential theory

SEM and its relations to modulated TFA-APT and nGCH-driven M-TFA-O

Abbreviations

References

Authors’ contributions

Acknowledgements

Competing interests

Availability of data and materials

Consent for publication

Ethics approval and consent to participate

Funding

Publisher’s Note

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords