Skip to main content

Machine learning and causal analyses for modeling financial and economic data

Abstract

Instead of aiming at a systematic survey, we consider further developments on several typical linear models and their mixture extensions for prediction modeling, portfolio management and market analyses. The focus is put on outlining the studies by the author’s research group, featured by (a) extensions of AR, ARCH and GARCH models into finite mixture or mixture-of-experts; (b) improvements of Sharpe ratio by maximizing the expected return and the upside volatility while minimizing the downside risk, with the help of a priori aided diversification; (c) developments of arbitrage pricing theory (APT) into temporal factor analysis (TFA)-based temporal APT, macroeconomics-modulated temporal APT and a general formulation for market modeling, together with applications to temporal prediction and dynamic portfolio management; (d) Bayesian Ying–Yang (BYY) harmony learning is adopted to implement these developments, featured with automatic model selection. After a brief introduction on BYY harmony learning, gradient-based algorithms and EM-like algorithms are provided for learning alternative mixture-of-experts-based AR, ARCH and GARCH models; and (e) path analysis for linear causal analyses is briefly reviewed, a recent development on ρ-diagram is refined for cofounder discovery, and a causal potential theory is proposed. Also, further discussions are made on structural equation modeling and its relations to modulated TFA-APT and nGCH-driven M-TFA-O.

Introduction

Financial and economic data are naturally recorded as temporal sequences or time series, and thus one of major tasks on those data is making time series analysis. Typically, a mathematical model is obtained to describe the regression relation of the current observation from its past observations, such that the future observation is predicted. Such a prediction task has been extensively studied in both the literature of time series analysis and the literature of machine learning and neural networks.

One most classic tool for time series analyses is the autoregressive (AR) model or generally autoregressive–moving-average (ARMA) model, which describes a linear dependence of the current observation on past values and noise disturbances. Extended from describing stationary processes to data with some identifiable trend of a polynomial growth (Box and Jenkins 1970), an initial differencing step can be applied to remove such a non-stationarity. See Box 1 in Fig. 1; the autoregressive integrated moving average (ARIMA) model is used to refer a “cascade” of this initialization and ARMA. For simplicity, we still prefer to use AMRA to refer ARIMA by regarding such an initialization as a pre-processing stage.

Fig. 1
figure 1

A road map on studies of time series prediction

In the literatures of statistics and econometrics, as outlined in Fig. 1 by Box 2, generalizations of ARMA have also been made toward Autoregressive Conditional Heteroskedasticity (ARCH) and generalized ARCH (GARCH) for considering conditional heteroskedasticity of variables (Engle 1982; Bollerslev 1986), to nonlinear ARMA for modeling nonlinear dependence (Leontaritis and Billings 1985), and Vector AR (VAR) for capturing the linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987).

The field of NN-ML in economics and finance involves each of the three streams of studies. In the early stage, most efforts were put on using multilayer neural networks or recurrent networks for a sophisticated nonlinear dependence of the current observation on past values and noise disturbances, as outlined in Fig. 1 by Box 3. There have been already several books on these studies (e.g., Azoff 1994; Gately 1995; Zhang 2003), and thus this chapter does not cover this type of studies.

Since 1994, the author’s group has made many efforts on extending AR, ARMA, ARCH and GARCH models into finite mixture or mixture-of-experts (Xu 1994, 1995a, b; Cheung et al. 1996, 1997; Leung 1997; Kwok et al. 1998; Wong et al. 1998; Chiu and Xu 2002a, 2003; Tang et al. 2003). Outlined in Fig. 1 by Box 4, studies actually proceed along an alternative road for modeling temporal dependence featured with nonlinearity, heteroskedasticity and non-stationarity. “Financial prediction: time series models and three finite mixture extensions” section is dedicated to the studies summarized in Fig. 1, together with introductions on learning implementations by the maximum likelihood (ML) learning, the rival penalized competitive learning (RPCL) (Xu et al. 1992, 1993), and approaches of learning with model selection.

Dynamic trading and portfolio management” section is dedicated to the studies summarized in Fig. 2, toward portfolio management directly, instead of making nonlinear modeling for analyses and predictions. Around the second half of the 1990s, efforts in the literature of neural networks and machine learning in economics and finance started to shift to adaptive trading; see Box 1. Subsequently, these efforts converge to the road pioneered by the Markowitz portfolio theory (Markowitz 1952) that maximizes the portfolio expected return for a given amount of portfolio risk by carefully choosing the proportions of assets; see Box 2. Based on Markowitz’s mean–variance paradigm, Sharpe (1966, 1994) further suggests evaluating the goodness of an asset by a ratio of the excess asset return; see Box 3. Later, it is further realized that the return variance is not an appropriate measure of portfolio risk because it counts the positive fluctuation above the expected returns (called upside volatility) also as the part of risk. The downside risk thus becomes a topic to study, as illustrated in Fig. 2 by Box 4; e.g., Markowitz (1959) counts the volatility below the expected returns only.

Fig. 2
figure 2

A road map on studies of portfolio management

After a brief introduction on the above-mentioned boxes in Fig. 2, “Dynamic trading and portfolio management” section further reexamines the Markowitz paradigm and Sharpe ratio with extensions that maximizes the expected returns and the upside volatility while minimizing the downside risk, with the help of a priori aided diversification (Hung et al. 2000, 2003), see Box 5 in Fig. 2. Moreover, several extensions have been proposed along this direction in Sect III(C) of Xu (2001), including that nonparametric estimates of the expected return and volatilities are improved by ARCH or GARCH models; see Box 6 in Fig. 2.

Next, “Market modeling: APT theory and temporal factor analysis” section is dedicated to the efforts summarized in Fig. 3. The Markowitz scheme also leads to the Capital Asset Pricing Model (CAPM) (Sharpe 1964). However, the CAPM is criticized to be not enough to describe a market behavior merely via one endogenous factor. Then, a general linear model of multiple factors has been proposed under the name of Arbitrage Pricing Theory (APT) (Ross 1976). Unfortunately, the APT has not been widely accepted in popularity similar to the CAPM. The reason lies largely with its significant drawback: namely, its implementation is difficult due to the lack of specificity regarding the number and nature of the factors that systematically affect asset return (Dhrymes et al. 1984; Abeysekera and Mahajan 1987).

Fig. 3
figure 3

A road map on studies of generalized apt theories and applications

In “Market modeling: APT theory and temporal factor analysis” section, we start from introducing three approaches that are usually applied for the implementation of APT and address their drawbacks as outlined in “Introduction” section of (Xu 2001), which leads to an observation that the lack of specificity regarding the endogenous factors is not just regarding the number and nature of the factors, but even more seriously arising from the so-called rotation indeterminacy implemented by factor analysis. Thus, further efforts should explore how to add certain structure to remove or remedy this indeterminacy. As outlined in Fig. 3 by Box 1 and Box 2, temporal factor analysis (TFA) (Xu 1997, 2000) is suggested as a generalization of the original APT theory (Xu 2001) to tackle such an incompleteness, featured with a first-order autoregressive dependence added to each factor such that the incompleteness caused by a notorious rotation indeterminacy is removed. Such a generalization is thus called temporal APT in a sense that temporal relation is taken into consideration.

This section further considers the influences of macroeconomic indexes such as GDP, inflation, investor confidence and yield curve, via their roles in controlling or modulating the temporal factors, which leads to a macroeconomics-modulated temporal APT shown in Fig. 3 by Box 3. Alternatively, TFA may also be replaced by non-Gaussian factor analyses (NFA) such that the incompleteness caused by rotation indeterminacy can also be removed; see Box 6 and Box 7 in Fig. 3. Actually, both the temporal factors and non-Gaussian factors are two aspects of one market model: one observes a dynamic market process, while the other describes the market with all the time points projected to one reference spot. Even generally, conditional heteroskedasticity may also be added to the factors, which finally leads to Box 8 in Fig. 3, namely, a general formulation for financial market modeling that systematically integrates all the ingredients. As illustrated in Fig. 3 by Box 4, various prediction tasks and investment managements can also be conducted with the help of the temporal APT and the macroeconomics-modulated temporal APT.

Further developments of these linear models introduced are suggested to be implemented by the Bayesian Ying–Yang (BYY) harmony learning. In “Bayesian Ying–Yang harmony learning and two exemplar learning algorithms” section, the fundamentals of BYY harmony learning are briefly introduced. For learning alternative mixture-of-experts-based AR, ARCH and GARCH models, both gradient-based algorithms and EM-like algorithms are provided for implementations, featured with automatic model selection and in reference of the well-known EM algorithm.

Except for the first column in Fig. 1, where only one time series is considered, mostly we consider dependences across more than one channel of time series. Prediction and decision making in portfolio management are based on such dependences that may not necessarily reflect causal structure underlying data, while it will be better to make prediction and decision based on casual structure. In “Linear causal analyses” section, path analyses (Wright 1934) for linear causal analyses is briefly reviewed, a recent development on ρ-diagram (Xu 2018) is refined for cofounder discovery and a causal potential theory is proposed. Further discussions are made on structural equation modeling (SEM) (Ullman 2006; Pearl 2010a; Westland 2015; Kline 2015) and its relations to modulated TFA-APT and nGCH-driven M-TFA-O.

Financial prediction: time series models and three finite mixture extensions

Time series models and neural networks

One most classic tool for time series analyses is the autoregressive (AR) model or generally autoregressive–moving-average (ARMA) model as follows:

$$x_{t} = a_{0} + \varepsilon_{t} + \mathop \sum \limits_{j = 1}^{q} a_{j} x_{t - j} + \mathop \sum \limits_{i = 1}^{p} b_{i} \varepsilon_{t - i} , \varepsilon_{t} \sim^{{{\text{i.i.d.}}}} G(\varepsilon |0, \sigma^{2} ) ,$$
(1)

where \(\varepsilon_{t} \sim^{{{\text{i.i.d}} .}} G(\varepsilon |0, \sigma^{2} )\) denotes that \(\varepsilon_{1} , \ldots ,\varepsilon_{t} , \ldots\) are i.i.d. samples from \(G(\varepsilon |0, \sigma^{2} )\), while \(G(u|\mu , \sigma^{2} )\) denotes a Gaussian distribution of u with the mean μ and the variance σ2. Particularly, the ARMA model degenerates to the AR model when q = 0.

The ARMA model is appropriate to describe a wide sense stationary sequence. Extension has been made to describe data ξt that have some clearly identifiable trend of a polynomial growth (Box and Jenkins 1970); see Box 1 in Fig. 1. It is made simply by an initial differencing to remove the non-stationarity. That is, we get

$$x_{t} = \Delta^{d} \xi_{t} ,\quad {\text{where }}d > 0 ,\;\Delta u_{t} = u_{t} - u_{t - 1} {\text{and }}u_{t} = \Delta^{d} \xi_{t} .$$
(2)

A cascade of this initialization and ARMA is called the autoregressive integrated moving average (ARIMA) model. For simplicity, we prefer to still use AMRA to indicate ARIMA by regarding such an initialization as a pre-processing stage.

In the literature of statistics, econometrics, control and signal processing, generalizations of ARMA have been made toward Autoregressive Conditional Heteroskedasticity (ARCH) and generalized ARCH (GARCH) for considering variables conditional to heteroskedasticity (Engle 1982; Bollerslev 1986); see Box 8 in Fig. 1. Namely, we consider

$$x_{t} = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} x_{t - j} + \varepsilon_{t} , \varepsilon_{t} = \sigma_{t} z_{t} , z_{t} \sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(z|0, 1) ,$$

where σt is not a constant, but given by the following regression:

$$\begin{aligned} \sigma_{t}^{2} \left( \vartheta \right) & = \sigma_{0}^{2} + \mathop \sum \limits_{i = 1}^{q} \beta_{i} \varepsilon_{t - i}^{2} + \mathop \sum \limits_{j = 1}^{p} \omega_{j} \sigma_{t - j}^{2} , \\ \vartheta & = \{ \sigma_{0}^{2} > 0, \beta_{i} \ge 0,\, for \ \ i > 0, \omega_{j} \ge 0,\, for \ \ j \ge 0\} , \\ \end{aligned}$$
(3)

which is usually denoted by GARCH(p,q) and degenerates to the ARCH model when p = 0.

Extensions of the ARMA model have also been made under the name of nonlinear ARMA (NARMA) for modeling nonlinear dependence (Leontaritis and Billings 1985) and to Vector AR (VAR) for capturing the linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987). In the literature, many efforts have been made on using multilayer neural networks or recurrent networks for a sophisticated nonlinear dependence of the current observation on past values and noise disturbances, as illustrated by Box 3 in Fig. There are already several books on these studies (e.g., Azoff 1994; Gately 1995; Zhang 2003), and thus this chapter does not cover this type of studies. Instead, the subsequent two subsections will focus on Box 4 in Fig. 1, namely, learning mixture of multiple models.

Learning mixture of AR, ARMA, ARCH and GRACH models

Studies on finite mixture extensions of AR, ARMA, ARCH and GARCH models can be summarized into the following general expression:

$$\begin{aligned} & P(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta ) = \mathop \sum \limits_{\ell = 1}^{k} \alpha_{\ell } G(x_{t} - \mu_{\ell ,t} |0, \sigma_{\ell ,t}^{2} ), \\ & \mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right), {\mathbf{x}}_{t - 1}^{m} = \left[ {x_{\text{t - 1}} , \ldots ,x_{\text{t - m}} ]^{\rm T} , \varvec{a}_{i} = } \right[a_{0,i} ,a_{1,i} , \ldots ,a_{{q_{i} ,i}} ]^{\rm T} , \\ & \varepsilon_{t} = x_{t} - \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{q} ,\theta } \right), q = \hbox{max} \{ q_{1,} , \ldots ,q_{m} \} , \\ \end{aligned}$$
(4)

where we consider k regression models \(x_{t} = \mu_{i,t} + \varepsilon_{i,t} , i = 1,, \ldots ,k\) with each \(\mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)\) being either of AR, ARMA, ARCH and GARCH models, and with the corresponding residual ɛi,t from \(G(\varepsilon_{i,t} |0, \sigma_{i,t}^{2} )\). Typically, the studies of the AR, ARCH and GARCH models share the following detailed expression (Xu 1995a, b; Cheung et al. 1997; Kwok et al. 1998; Wong et al. 1998; Chiu and Xu 2003, 2004a; Tang et al. 2003):

$$\begin{aligned} & \mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right) = \varvec{a}_{i}^{\rm T} \left[ {\begin{array}{*{20}c} { 1} \\ {\varvec{x}_{t - 1}^{{q_{i} }} } \\ \end{array} } \right],\sigma_{i,t}^{2} = \left\{ {\begin{array}{*{20}l} {\sigma _{{i,0}}^{2} {\text{ > 0, }} \qquad \qquad \left( {\text{a}} \right) {\text{AR}}} \\ {\sigma _{{i,0}}^{2} + \varvec{b}_{i}^{{\text{T}}} E_{{i,t - 1}}^{{q_{i} }} ,\qquad \left( {\text{b}} \right){\text{ARCH}}} \\ {\sigma _{{i,0}}^{2} + \varvec{b}_{i}^{{\text{T}}} E_{{i,t - 1}}^{{q_{i} }} + \varvec{w}_{i}^{{\text{T}}} \sum _{{i,t - 1}}^{{p_{i} }} , \quad \left( {\text{c}} \right){\text{GARCH}}} \\ \end{array} } \right. \\ & E_{i,t - 1}^{n} = \left[ {\varepsilon_{i,t - 1}^{2} , \ldots ,\varepsilon_{i,t - n}^{2} ]^{\rm T} , {\mathbf{w}}_{i} = } \right[w_{1,i} , \ldots ,w_{{p_{i} ,i}} ]^{\rm T} , \,\omega_{j,i} \, \ge \,0, j\, = \,1, \ldots ,p_{i} \\ & \varSigma_{i,t - 1}^{n} = \left[ {\sigma_{i,t - 1}^{2} , \ldots ,\sigma_{i,t - n}^{2} ]^{\rm T} , {\mathbf{b}}_{i} = } \right[\beta_{1,i} , \ldots ,\beta_{{q_{i} ,i}} ]^{\rm T} , \, \beta_{j,i} \, \ge \,0, j\, = \,1, \ldots ,q_{i} . \\ \end{aligned}$$
(5)

For ARMA (Kwok et al. 1998; Tang et al. 2003), the detailed expression of \(\mu_{i,t} = \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)\) is given by Eq. (1). Moreover, \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)\) can be also a specific nonlinear function, e.g., given by three-layer neural networks (Cheung et al. 1996, 1997) or the normalized radial basis function (NRBF) and extended NRBF (ENRBF) (Xu 1998, Xu 2009).

According to Eq. (4), a sequence x1, …, xt, … may come from the ith one of the k models with the probability αi, and jointly the k models describe the sequence x1, …, xt, … with a residual ɛt that comes from a Gaussian mixture \(P(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta )\). In such a way, a nonlinear dependence of the current observation on past values and noise disturbances is modeled by probabilistically combining a mixture of linear models, which keeps the model structure simple and easy to learn. Moreover, non-stationarity beyond ones handled by ARIMA and GARCH models is able to be modeled via switching among individual linear models.

Also, a sequence x1, …, xt, … may be segmented into pieces with different statistical properties, simply by Bayesian posterior as follows (Xu 1994, 1995a, b):

$$P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta ) = \frac{{\alpha_{{j_{t} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}}{{\mathop \sum \nolimits_{{j_{t} = 1}}^{k} \alpha_{{j_{t} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}},$$
(6)

that is, xt is identified as coming from the j*th model by

$$j^{*} = {\text{argmax}}_{j} P(j|x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta ) \quad {\text{or }}j^{*} = {\text{argmax}}_{j} [\alpha_{j} G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )].$$

To reduce the number of small fragments, some post-processing or smoothing regularization may be added. Moreover, we may extend a finite mixture into a hidden Markov model (HMM) (Rabiner 1989), in which each hidden state is associated with one \(G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )\) and the transition between state is described by

$$\begin{aligned}\varvec{\alpha}_{\varvec{t}} & = \varvec{Q\alpha }_{{\varvec{t} - 1}} , {\varvec{\upalpha}}_{\varvec{t}} = [\varvec{\alpha}_{{1,\varvec{t}}} , \ldots ,\varvec{\alpha}_{{\varvec{k},\varvec{t}}} ]^{\bf{T}} , 0\le\varvec{\alpha}_{{\varvec{j},\varvec{t}}} \le 1 , \mathop \sum \limits_{\varvec{j}}\varvec{\alpha}_{{\varvec{j},\varvec{t}}} = 1, \\ \varvec{Q} & = \left[ {\varvec{q}_{{\varvec{j}|\varvec{i}}} } \right], 0\le \varvec{q}_{{\varvec{j}|\varvec{i}}} \le 1,\varvec{ }\mathop \sum \limits_{{\varvec{i} = 1}}^{\varvec{k}} \varvec{q}_{{\varvec{j}|\varvec{i}}} = 1, \\ \end{aligned}$$
(7)

with αj,t estimated as time proceeds and then used in Eq. (5) and Eq. (6). Moreover, we can also further modify Eq. (5) and Eq. (6) into

$$\begin{aligned} P(j_{t} |x_{t} ,j_{t - 1} ,\varvec{x}_{t - 1}^{q} ,\theta ) & = \frac{{q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}}{{\mathop \sum \nolimits_{{j_{t} = 1}}^{k} q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )}}, \\ j_{t}^{*} & = {\text{argmax}}_{j} P(j|x_{t} ,j_{t - 1} ,\varvec{x}_{t - 1}^{q} ,\theta ) {\text{ or }}j_{t}^{*} = {\text{argmax}}_{j} [q_{{j|j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )]. \\ \end{aligned}$$
(8)

Next, we proceed to estimate xt from the finite mixture by Eq. (4). It follows that

$$\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{q} , \theta } \right) = \mathop \int \nolimits x_{t} p(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta )\,{\text{d}}x_{t} = \mathop \sum \limits_{i = 1}^{k} \alpha_{i} \mu_{i,t} ,$$
(9)

that is, we improve the prediction of xt via each individual model by a line combination weighted by each αi. However, this improvement is limited because αi is a constant that does not change as the samples vary with time.

Each αi in Eq. (4) cannot directly be replaced by its corresponding Bayes posterior by Eq. (5). First, \(P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta )\) cannot be moved out of the integral \(\mathop \smallint \nolimits x_{t} P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta )G(x_{t} |\mu_{j,t} , \sigma_{j,t}^{2} )dx_{t}\), though the integral can be made approximately. Second, the calculation needs to know xt. Getting \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t}\) from knowing xt is applicable to a filtering problem that gets a smoothed or filtered version from xt, but it is not applicable to a prediction problem that targets at getting \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t}\) from its past observations.

Instead, we use a predictive \(P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )\) based on the immediate past observations \(\varvec{x}_{t - 1}^{q}\) to combine the prediction of individual prediction model adaptively; that is, we have

$$\begin{aligned} p(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta ) & = \mathop \sum \limits_{{j_{t} = 1}}^{k} P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} ), \\ \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{x}_{t} \left( {\varvec{x}_{t - 1}^{q} , \theta } \right) & = \mathop \int \nolimits x_{t} p(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta )\,{\text{d}}x_{t} = \mathop \sum \limits_{{j_{t} = 1}}^{k} P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )\mu_{{j_{t} ,t}} , \\ \end{aligned}$$
(10)

which summarizes extensions of the AR, ARMA, ARCH and GARCH models with the help of the mixture-of-experts (ME). In the implementation of the original ME (Jacobs et al. 1991; Jordan and Xu 1995), \(P(j|\varvec{x}_{t - 1}^{q} ,\varphi )\) is called the gating net and given as follows:

$$P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = e^{{g_{j} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right)}} /\mathop \sum \limits_{j = 1}^{k} e^{{g_{j} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right)}} ,$$

with \(g_{1} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right), \ldots , g_{k} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right)\) being the output of multilayer networks.

In an implementation of an alternative ME model (Xu et al. 1994, 1995), we consider a predictive Bayesian posteriori

$$P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = \frac{{\alpha_{j} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )}}{{q(\varvec{x}_{t - 1}^{q} |\psi )}}, q({\mathbf{x}}_{t - 1}^{q} |\psi ) = \sum\nolimits_{j = 1}^{k} \alpha_{j} q(\varvec{x}_{t - 1}^{q} |\psi_{j} ).$$
(11)

For the AR, ARCH and GARCH models, we further have

$$q(\varvec{x}_{t - 1}^{q} |\psi_{j} ) = q(x_{t - 1} |x_{t - 2} , \ldots ,x_{t - q} )q(x_{t - 2} |x_{t - 3} , \ldots ,x_{t - q} ) \cdots q\left( {x_{t - q} } \right).$$

To simplify the computation, we may consider the following approximation:

$$q(\varvec{x}_{t - 1}^{q} |\psi_{j} ) \approx q(x_{t - 1} |x_{t - 2} , \ldots ,x_{t - q - 1} ) = G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} ).$$
(12)

A further insight into Eq. (11) can be obtained at a setting that σ 2 j, t−1  = σ 2 j and xt−1 = μj,t−1.; in this special case, we have a further simplification:

$$P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = \frac{{\alpha_{j} /\sigma_{j} }}{{\mathop \sum \nolimits_{j = 1}^{k} \alpha_{j} /\sigma_{j} }},$$
(13)

which shares a similar concept to the mixture-using variance (MUV) and actually degenerates to this MUV (Perrone and Cooper 1993, Perrone 1994) when \(\alpha_{j} \propto \sigma_{j,t}^{ - 1}\). Another special case is that αi/σi,t is constant, and it follows from Eqs. (11) to (12) that we have

$${{P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} } \mathord{\left/ {\vphantom {{P(j|\varvec{x}_{t - 1}^{q} ,\varphi ) = e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} } {\sum \nolimits_{j = 1}^{k} e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} ,}}} \right. \kern-0pt} {\sum \nolimits_{j = 1}^{k} e^{{ - \frac{1}{{2\sigma_{j,t - 1}^{2} }}(x_{t - 1} - \mu_{j,t - 1} )^{2} }} ,}}$$
(14)

by which we get the counterparts of NRBF and ENRBF (Xu 1998, Xu 2009).

The other choices of \(P(j|\varvec{x}_{t - 1}^{q} ,\varphi )\) may also be obtained or modified from Table 3 in Xu and Amari (2008). Moreover, similar to Eq. (8), it still follows from \(q(\varvec{x}_{t - 1}^{q} |\psi_{j} )\) given by Eqs. (11) and (12) that we may further incorporate the HMM model from Eq. (7) into Eq. (11) and get

$$P(j_{t} |x_{t} ,j_{t - 1} ,\varvec{x}_{t - 1}^{q} ,\varphi ) = {{q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )} \mathord{\left/ {\vphantom {{q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )} {\mathop \sum \nolimits_{j = 1}^{k} q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )}}} \right. \kern-0pt} {\mathop \sum \nolimits_{j = 1}^{k} q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{j} )}}.$$
(15)

Maximum likelihood, RPCL learning and learning with model selection

Typically, unknown parameters in the models in Eqs. (4), (8), (10) and (11) are estimated by the maximum likelihood (ML) learning, that is, the following maximization:

$$\begin{aligned} & \quad \varTheta^{*} = \arg \max_{\varTheta } L(\{ x_{t} \}_{t = 1}^{N} |\varTheta ), \hfill \\ L(\{ x_{t} \}_{t = 1}^{N} |\varTheta ) &= \left\{ \begin{array}{ll} \sum\limits_{t} {\ln \sum\limits_{{j_{t} = 1}}^{k} {\alpha_{{j_{t} }} } G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} ,&\quad {\text{ (a) for finite mixture by Eq}} .\, ( 4 ) ,\\ \sum\limits_{t} {\ln \left\{ {\sum\limits_{{j_{t} = 1}}^{k} {P(j_{t} |{\mathbf{x}}_{t - 1}^{q} ,\phi )} G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} \right\}} ,&\quad ( {\text{b) for ME by Eq}} .\, ( 10 ) ,\\ \sum\limits_{t} {\ln \left\{ {\sum\limits_{{j_{t} = 1}}^{k} {\alpha_{{j_{t} }} q({\mathbf{x}}_{t - 1}^{q} |\psi_{{j_{t} }} )} G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} \right\}} ,&\quad ( {\text{c) for AME by Eq}} .\, ( 1 1 ) ,\\ \ln \left\{ {\sum\limits_{{j_{1} , \ldots ,j_{N} }} {\prod\limits_{t} {q_{{j_{t} |j_{t - 1} }} G(x_{\text{t}} - \mu_{{j_{t} ,t}} \, |0, \sigma_{{j_{t} ,t}}^{2} )} } } \right\},&\quad ( {\text{d) for HMM mixture }} .\end{array} \right. \end{aligned}$$
(16)

This maximization is implemented by the EM algorithm (Redner and Walker 1984), e.g., see the EM algorithms for finite mixture of AR models in Xu (1994, 1995a, b), finite mixture of GARCH models in Wong et al. (1998), finite mixture of ARMA–GARCH models in Tang et al. (2003) and the original ME in Jordan and Xu (1995), as well as the alternative ME model, NRBF and ENRBF in Xu et al. (1994, 1995) and Xu (1998, 2009).

For an HMM mixture, we may also have the following approximate likelihood:

$$L(x_{t(t = 1)}^{N} |\varTheta ) = \left\{\begin{array}{ll} \sum\nolimits_{t} {\ln \sum\nolimits_{{j_{t} = 1}}^{k} {q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} ),\quad ({\text{i}}),} } \\ {{\sum\nolimits_{t} {\ln \left\{ {\sum\nolimits_{{j_{t} = 1}}^{k} {q_{{j_{t} j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{{j_{t} }} )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )} } \right\}} ,\quad ({\text{ii}})}}.\end{array} \right.$$
(17)

One critical problem for the ML learning is that a good performance on a training set is not necessarily good on a testing set, especially when the training set consists of a small size of samples. The reason is that there may be too many free parameters. As introduced in the third section of Xu (2009), efforts on this problem are mainly featured by learning with model selection. Model selection refers to select a model with an appropriate complexity \(\varvec{k}\). For the models considered in the previous subsection, \(\varvec{k }\) consists of the number of individual models, the autoregression order and the moving average order for each individual model. Typically, the ML learning is not good for model selection. However, whether the EM algorithm works well depends on whether an appropriate \(\varvec{k}\) is selected.

Classically, model selection is made in a two-stage implementation. First, enumerate a candidate set \(\varvec{\rm K}\) of \(\varvec{k}\) and estimate a solution \(\varTheta_{\varvec{k}}^{*}\) for the unknown set Θk of parameters by the ML learning at each \(\varvec{k} \in \varvec{\rm K}\). Second, use a model selection criterion \(J\left( {\varTheta_{\varvec{k}}^{*} } \right)\) to select a best \(\varvec{k}^{*}\). Several classical criteria are available for the purpose, such as AIC, CAIC and BIC/MDL, and readers are referred to Xu (2009, 2010) for a recent outline. Unfortunately, any one of these criteria usually provides a rough estimate that may not yield a satisfactory performance. Even with a criterion \(J\left( {\varTheta_{\varvec{k}} } \right)\) available, this two-stage approach usually incurs a huge computing cost. Still, the parameter learning performance deteriorates rapidly as \(\varvec{k}\) increases, which makes the value of \(J\left( {\varTheta_{\varvec{k}} } \right)\) to be evaluated unreliably.

One direction that tackles this challenge is called automatic model selection, which is associated with a learning algorithm or a learning principle with the following two features:

  • When there is an indicator \(\rho \left( {\theta_{\varvec{r}} } \right)\) on a subset \(\theta_{\varvec{r}} \in \varTheta_{\varvec{k}}\), we have \(\rho \left( {\theta_{\varvec{r}} } \right) = 0\) if \(\theta_{\varvec{r}}\) consists of parameters of a redundant structural part.

  • In implementation of this algorithm or principle, there is a mechanism that automatically drives \(\rho \left( {\theta_{\varvec{r}} } \right) \to 0\) as \(\theta_{\varvec{r}}\) toward a specific value. Thus, the corresponding redundant structural part is effectively discarded.

An early effort along this direction is rival penalized competitive learning (RPCL) (Xu et al. 1992, 1993) for adaptively learning a model that consists of \(k\) substructures as follows:

$$\theta_{j}^{\text{new}} = \theta_{j}^{\text{old}} + p_{j,t} \eta \frac{{\partial \pi_{j,t} \left( {\theta_{j}^{\text{old}} } \right)}}{{\partial \theta_{j} }},\quad p_{j,t} = \left\{ {\begin{array}{ll} 1, &\quad j = c = {\text{argmax}}_{j} \pi_{j,t} \left( {\theta_{j}^{\text{old}} } \right), \\ \gamma ,&\quad j = {\text{argmax}}_{j \ne c} \pi_{j,t} \left( {\theta_{j}^{\text{old}} } \right), \\ 0,&\quad {\text{otherwise}}. \\ \end{array} } \right.$$
(18)

where η > 0 is a learning step size and γ is a small positive number, e.g., γ = 0.005–0.01. With \(k\) initially at a value large enough, a current input sample xt is allocated to one of the \(k\) substructures via competition. The winner adapts to this sample by a little bit, while the rival is de-learned a little bit to reduce a duplicated allocation. This rival penalized mechanism will discard extra substructures, making model selection automatically during learning. Readers are referred to Xu (2007) for a recent overview and extensions.

Corresponding to Eq. (16), πj,t(θ old j ) in Eq. (18) is given as follows:

$$\pi_{{j_{t} ,t}} \left( {\theta_{{j_{t} }} } \right) = \left\{ \begin{array}{ll} \ln [\alpha_{{j_{t} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )], & \quad \left( {\text{a}} \right) \, {\text{for finite mixture by Eq}} .\left( 4 \right), \\ \ln [P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )], & \quad \left( {\text{b}} \right) \, {\text{for ME by Eq}} .\left( { 1 0} \right), \\ \ln [\alpha_{{j_{t} }} q(\varvec{x}_{t - 1}^{q} |\psi_{{j_{t} }} )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )], & \quad \left( {\text{c}} \right) \, {\text{for AME by Eq}} .\left( { 1 1} \right)\\ \end{array} \right..$$
(19)

For an HMM mixture, we may also approximately have

$$\pi_{{j_{t} ,t}} \left( {\theta_{{j_{t} }} } \right) = \left\{ {\begin{array}{*{20}l} {\ln [q_{{j_{t} |j_{t - 1} }} G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )],} & { \left( {\text{i}} \right)\, {\text{for HMM mixture by Eq}} .\left( 8 \right),} \\ {\ln [q_{{j_{t} |j_{t - 1} }} q(\varvec{x}_{t - 1}^{q} |\psi_{{j_{t} }} )G(x_{t} - \mu_{{j_{t} ,t}} |0, \sigma_{{j_{t} ,t}}^{2} )],} & {\left( {\text{ii}} \right)\, {\text{for HMM AME by Eq}} .\left( { 1 5} \right).} \\ \end{array} } \right.$$
(20)

Another stream of automatic model selection is featured by those appropriate prior-based efforts. By a Laplace prior in a regression task, sparse learning or Lasso shrinkage prunes away extra weights (Williams 1995; Tibshirani 1996). For pruning away Gaussian components on Gaussian mixture, a Jeffreys priori is used in the implementation of the minimum message length (MML) that minimizes a two-part message for a statement of model and a statement of data encoded by that model (Figueiredo and Jain 2002), and also Dirichlet–Normal–Wishart priories is added on Gaussian components in the implementation of the variational Bayes (VB) that computes a lower bound of the marginal likelihood (McGrory and Titterington 2007).

However, these efforts highly depend on choosing an appropriate prior, which is usually a difficult task, while an inappropriate prior may deteriorate the performance of model selection seriously. Without any priors on the parameters, VB and MML all degenerate to the maximum likelihood learning, while the RPCL learning is still capable of automatic model selection. Firstly proposed in Xu (1995a, b) and systematically developed over a decade and half (Xu 2001, 2007, 2010, 2012), the third stream of efforts has been made under the name of Bayesian Ying–Yang (BYY) harmony learning. The BYY harmony learning shares a mechanism similar to the RPCL learning. Also, the performances of BYY harmony learning can be further improved by incorporating appropriate priors. Further details about the BYY harmony learning are referred to “Bayesian Ying–Yang harmony learning and two exemplar learning algorithms” section, where a tutorial is also provided on one BYY harmony learning algorithm for alternative mixture-of-experts-based GARCH models.

Dynamic trading and portfolio management

Dynamic trading by supervised learning and reinforcement learning

Instead of building a mathematical model for understanding and forecasting time series, studies of neural networks and machine learning in economics and finance started to shift from nonlinear forecasting modeling to adaptive trading and dynamic portfolio management (Neuneier 1996; Choey and Weigend 1997; Xu and Cheung 1997; Moody et al. 1998; Hung et al. 2000; Moody and Saffell 2001; Hung et al. 2003; Chiu and Xu 2004b; Jangmin 2006). Efforts on portfolio management will be addressed in the next subsection. In the sequel, we introduce efforts on learning dynamic trading based on one single time series, with the help of supervised learning, reinforcement learning and Sharpe ratio maximization.

Given a sequence x1, …, xt, e.g., the sequence of one asset, Gold, FOREX index,…, etc., at any time point t ≤ τ we may infer a sequence \(I_{1}^{p} , \ldots I_{t}^{p}\) each \(I_{\tau }^{p}\) being the following desired trading signal:

$$I_{\tau }^{p} = \left\{ \begin{array}{ll} + 1, &\quad {\text{to}}\,{\text{buy}}, \hfill \\ - 1, &\quad {\text{to}}\,{\text{sell}}, \hfill \\ 0, &\quad {\text{no}}\,{\text{action}}, \hfill \\ \end{array} \right.$$
(21)

based on a trading strategy (e.g., maximum return) or an external expertise.

The task of learning decision, as illustrated by Box 1 in Fig. 2, can be formulated as a nonlinear regression model:

$$\tilde{I}_{t}^{p} = \frac{{1 - e^{{ - f\left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)}} }}{{1 + e^{{ - f\left(XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta\right) }} }}$$
(22)

where \(f\left( {XF_{t}^{q} , \left\{ {I_{t - \tau }^{p} } \right\}_{t = 1}^{q} , \varTheta } \right)\) is implemented by an ENRBF network in Xu & Cheung (1997). Also, it can be implemented by three-layer neural networks. Supervised learning is used to determine the unknown parametric Θ by minimizing

$$E_{2} \left( \varTheta \right) = \mathop \sum \limits_{t}[ I_{t}^{p} - f(XF_{t}^{q} ,\{ I_{t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta )]^{2} ,$$
(23)

where XF q t may be directly a number of past observations {xtτ} q t=1 or certain features {F ( i) t } extracted from{xtτ} q t = 1 , e.g., F ( i) t may be MACD, RSI, %K, %D, as well as features from candlestick charts and configurations from waves, etc. Also, we may put both together to consider \(XF_{t}^{q} = \left\{ {\left\{ {x_{t - \tau } } \right\}_{t = 1}^{q} , \left\{ {F_{t}^{(i)} } \right\}} \right\}.\)

One key problem is how to keep a good generalization ability by training with a small length of sequence x1, …, xt. One way is adding some regularization term E2(Θ) + λΓ(Θ). Without a priori knowledge, however, it is not an easy task to get an appropriate term Γ(Θ) and its strength λ. The other way is to describe the model as follows:

$$\begin{aligned} q\left( {\tilde{I}_{t}^{p}\ | \ {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta }} \right) &= \frac{{exp\left[ {z_{t}^{{(1)}} f^{{(1)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right) + z_{t}^{{(2)}} f^{{(2)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)} \right]}}{{1 + exp\left[ {f^{{(1)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)} \right] + \exp \left[ {f^{{(2)}} \left( {XF_{t}^{q} ,\left\{ {I_{{t - \tau }}^{p} } \right\}_{{t = 1}}^{q} ,\Theta } \right)} \right]}} \hfill \\ f(XF_{t}^{q} ,\{ I_{{t - \tau }}^{p} \} _{{t = 1}}^{q} ,\Theta ) &= [f^{{\left( 1 \right)}} (XF_{t}^{q} ,\{ I_{{t - \tau }}^{p} \} _{{t = 1}}^{q} ,\Theta ),f^{{\left( 2 \right)}} (XF_{t}^{q} ,\{ I_{{t - \tau }}^{p} \} _{{t = 1}}^{q} ,\Theta )]^{{\text{T}}} , \end{aligned}$$
(24)

with \(I_{t}^{p} = [z_{t}^{\left( 1 \right)} ,z_{t}^{\left( 2 \right)} ,z_{t}^{\left( 3 \right)} ]^{\rm T} , z_{t}^{\left( 2 \right)} = 0\, {\text{or}} \,1\) and z (1) t  + z (2) t  + z (3) t  = 1. Correspondingly, min ΘE2(Θ) is replaced by maximizing the likelihood \(L\left( \varTheta \right) = \mathop \sum \limits_{t} { \ln }q(I_{t}^{p} |f(XF_{t}^{q} ,\{ I_{t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta ))\). In the formulation, learning regularization may be implemented via Bayesian learning with help of a priori distribution q(Θ), i.e., max Θ[L(Θ) + lnq(Θ)]. For a better generalization ability, we may also put q(I p t |f(XF q t , {I p tτ } q t=1 Θ)) into a Bayesian Ying–Yang system and making BYY harmony learning with automatic model selection; see Sect. 4.4 in Xu (2010).

The other key problem is how to make a pre-processing stage for getting a desired sequence \(I_{1}^{p} , \ldots ,I_{t}^{p}\), which can be obtained automatically by a trading strategy, e.g., getting a profit and cutting a loss beyond a pre-specified threshold as follows:

$$I_{t}^{p} = \left\{ \begin{array}{ll} + 1, &\quad {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} \mathord{\left/ {\vphantom {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} {\sigma_{t} \ge g_{0}^{ + } > 0}}} \right. \kern-0pt} {\sigma_{t} \ge g_{0}^{ + } > 0}}, \hfill \\ - 1, &\quad {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} \mathord{\left/ {\vphantom {{{\text{if}}\, \left( {x_{t} - x_{t - 1} } \right)} {\sigma_{t} \le g_{0}^{ - } \le 0}}} \right. \kern-0pt} {\sigma_{t} \le g_{0}^{ - } \le 0}}, \hfill \\ 0, &\quad {\text{no}}\,{\text{action}}, \hfill \\ \end{array} \right.$$

where σt is an estimation of the volatility about this asset. Also, \(I_{1}^{p} , \ldots ,I_{t}^{p}\) may come from an outcome of market technical analysis, which is difficult to get \(I_{1}^{p} , \ldots ,I_{t}^{p}\) adaptively in a dynamic trading.

From the studies (Moody et al. 1998; Moody and Saffell 2001; Jangmin 2006), \(I_{1}^{p} , \ldots ,I_{t}^{p}\) is a sequence of actions that are dynamically learned by reinforcement learning. Typically, a reinforcement learning model consists of a set S of environment states (e.g., differences in the current price of asset and the volumes in holding) and a set A (e.g., buy, sell, no action) of actions. There is also a policy π that chooses an action at A at an environment state st. The action at makes the environment move to a new state st+1. Associated with the transition (statst+1), there is a scalar immediate reward rt+1(statst+1) that is estimated according to a utility function, e.g., a maximum profit. The goal is to collect as much reward as possible by determining a sequence of actions a1, …, at.

In the literature of reinforcement learning, one popular approach is called Q-learning, by which at is chosen according to a table Q(stat) that is learned from rt+1(statst+1). For a dynamic trading, the S of environment states are featured by differences in the current price of asset and the volumes in holding. Quantizing the differences into the states is not an easy task. Also, there will be a large number states to be considered. As a result, we need to learn a large Q(stat) table, which not only increases computing cost rapidly, but also makes the problem of a small sample size become more serious because Q(stat) consists of too many free parameters to be determined. Instead of Q-learning, the action at in rt+1(statst+1) can be approximately replaced by the value of I p t given by Eq. (22) such that rt+1(statst+1) is replaced by an expression rt+1(stst+1, {xtτ} q t=1 , {I p tτ } q t=1 Θ). As a result, the maximization of ∑  t=1 γtrt+1(statst+1) with respect to a sequence of discrete actions a1, …, at is replaced by the maximization of ∑  t=1 γtrt+1(stst+1, {xtτ} q t=1 , {I p tτ } q t=1 Θ) with respect to Θ. Similar to learning regularization, the problem of a small sample size may also be handled by adding a a priori term, e.g., \(\sum\nolimits_{t = 1}^{\infty } {\gamma^{t} r_{t + 1} \left( {s_{t} , s_{t + 1} , \left\{ {x_{t - \tau } } \right\}_{t = 1}^{q} , \left\{ {I_{t - \tau }^{p} } \right\}_{t = 1}^{q} , \varTheta } \right) + \lambda { \ln } q\left( \varTheta \right)} .\)

The last but not the least, the specific expression of rt+1(statst+1) is an important practical issue, related to the current price of asset, the volume in holding, the transaction cost and the tax, as well as personal preference. There could be a number of choices. See Fig. 2 by Box 3; a widely used one is the Sharpe ratio, which is originally suggested for evaluating the goodness of an asset in market by a ratio of the excess asset return (i.e., after minus the benchmark return) over the standard deviation of the excess asset return (Sharpe 1966, 1994). For dynamic trading, it is not the Sharpe ratio of the asset in market that has to be calculated, but the Sharpe ratio of the dynamic trading system, which depends on a sequence of actions a1, …, at.

Dynamic portfolio management by maximizing Sharpe ratio and extensions

Instead of only considering one single asset, a common and more reliable practice is considering a portfolio of assets, and thus portfolio management is one important topic in the finance literature. For the supervised learning by Eq. (22), its extension can be made simply by considering \(I_{j,t}^{p} (XF_{t}^{q} ,\{ I_{j,t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta_{j} ), j = 1, \ldots ,k\) with each in the format of Eq. (22), and learning is made by minimizing the total sum ∑ jE2(Θj). Simply, we get the training signals \(I_{j,1}^{p} , \ldots ,I_{j,t}^{p}\) per asset individually. Still, further studies are needed on how to get the training signals bases on the whole portfolio of assets. Conceptually, extension of reinforcement learning to multiple assets is rather straightforward too. However, both the set S of environment states and the set \(A\) of possible actions increase rapidly, which makes learning a large table Q(stat) seriously suffer the problem of a small sample size. Thus, it becomes more critical to get a1, …, at to be approximately replaced by {I p j, t (XF q t , {I p j, t −  τ } q t=1 Θj)} k j=1 in evaluating the reward rt+1 (Moody et al. 1998; Moody and Saffell 2001). Similar to supervised learning, one direction for tackling the problem of a small sample size is incorporating with learning regularization.

Alternatively, another direction to pursuit portfolio management is exploring the road pioneered by the Markowitz portfolio theory (Markowitz 1952), see Box 2 in Fig. 2. By this theory, the return of an investment portfolio is the proportion-weighted combination of the constituent assets’ returns, while the portfolio volatility is a function of the correlations between the component assets. The portfolio expected return is maximized subject to a given amount of portfolio risk, or equivalently risk is minimized for a given level of expected return. Moreover, the Markowitz mean–variance scheme also leads to the suggestion of Sharpe ratio (Sharpe 1966, 1994), which is typically used to evaluate the performance of a portfolio.

In both the standard Markowitz mean–variance scheme and Sharpe ratio approach, a risk is defined as the return variance, which has been subsequently realized that the variance is not an appropriate measure because it counts the positive fluctuation above the expected returns (also called upside volatility) as a part of the risk. See Box 4 in Fig. 2; the downside risk thus becomes a topic to study. Markowitz (1959) counts the volatility below the expected returns only. Fishburn (1977) makes a mean-risk analysis with risk associated with below-target returns and proposes a more sophisticated measure of risk associated with below-target return, which has been further refined by Sortino and Meer (1991). Basically, this downside risk is the volatility of return below the minimal acceptable return (also called target return G).

$${\text{down}}V_{\gamma } (G) = \int\nolimits_{{ - \infty }}^{G} {(G - r)^{\gamma } {\text{d}}F{\text{(r)}}}$$
(25)

Moreover, the downside risk of a single asset has been extended into the following covariance (Hung et al. 2000, 2003):

$$\varvec{D} = \left[ {d_{i,j} } \right],d_{i,j} = \mathop \int \limits_{ - \infty }^{G} \mathop \int \limits_{ - \infty }^{G} (G - r_{i} )^{{\frac{\gamma }{2}}} (G - r_{j} )^{{\frac{\gamma }{2}}} p(r_{i} ,r_{j} )\,{\rm d}r_{i} {\rm d}r_{j} ,$$
(26)

for the returns \(r_{j} ,\, j = 1, \ldots ,k\) of multiple assets. Also, we have the following matrix for the upside volatility:

$$\varvec{U} = \left[ {u_{i,j} } \right],\,\,u_{i,j} = \mathop \int \limits_{G}^{ + \infty } \mathop \int \limits_{G}^{ + \infty } (r_{i} - G)^{{\frac{\gamma }{2}}} (r_{j} - G)^{{\frac{\gamma }{2}}} p(r_{i} ,r_{j} )\,{\rm d}r_{i} {\rm d}r_{j} .$$
(27)

The sprit of the Markowitz theory and the Shape ratio, i.e., maximizing the expected returns while minimizing the risk, is reasonably modified into one extended Sharpe ratio featured by maximizing both the expected returns and the upside volatility while minimizing the downside risk; see Box 5 in Fig. 2. In Hung et al. (2000, 2003), this generalization is implemented by the following maximizaon:

$$\begin{aligned} \mathop {\text{Max}}\limits_{\varvec{w}} \left[ {\frac{{\varvec{w}^{\rm T} E\varvec{r} + H\varvec{w}^{\rm T} \varvec{Uw}}}{{\varvec{w}^{\rm T} \varvec{Dw}}} + B\varvec{w}^{\rm T} \left( {1 - \varvec{w}} \right)} \right],1 = \left[ {1, \ldots ,1} \right]^{\text{T}} , \hfill \\ \varvec{r} = \left[ {r_{1} , \ldots ,r_{k} ]^{\text{T}} , {\mathbf{w}} = } \right[w_{1} , \ldots ,w_{k} ]^{\rm T} ,\quad \mathop \sum \limits_{i = 1}^{k} w_{i} = 1,\quad w_{i} \ge 0. \hfill \\ \end{aligned}$$
(28)

As shown in Fig. 4, we use the parameters H, B to adapt the investor’s preference. The parameter H represents a strength of maximizing upside volatility and B represents a strength of diversification or regularization. The term \(\varvec{w}^{\text{T}} \left( {1 - \varvec{w}} \right)\) is a diversification term that reaches its minimum when one wi is 1 and others are 0, and its maximum when all the elements \(\varvec{w}\) are equal.

Fig. 4
figure 4

Use of the parameters H, B

It has been experimentally shown that this generalization of Sharpe ratio can effectively reduce the risk while obtaining great returns, in comparison with the standard Markowitz mean–variance scheme and Sharpe ratio. Moreover, investors expect a constant return with a minimum downward risk, for which we can simply set \(\varvec{w}^{\text{T}} E\varvec{r} = r_{\text{spec}}\), while the others expect a maximum return under a constant downward risk, for which we can simply set \(\varvec{w}^{\text{T}} \varvec{Dw} = v_{\text{spec}}\).

In Sect III(C) of Xu (2001), several developments have been proposed along this direction. First, a more practical scenario is considered, featured with a portfolio of risk securities with returns \(r_{j,t} , \,j = 1, \ldots ,k\), a risk-free bond with return rf and transaction cost with a rate rc. That is, \(r_{t} = \varvec{w}^{\text{T}} \varvec{r}\) is replaced by

$$\begin{aligned} r_{t} & = \left( {1 - \alpha_{0} } \right)r^{f} + \alpha_{0} \mathop \sum \limits_{j = 1}^{k} \left[ {w_{j,t} r_{j,t} - r_{c} \mathop \sum \limits_{j = 1}^{k} \left| {w_{j,t} - w_{j,t - 1} } \right|\left( {1 + r_{j,t} } \right)} \right] \\ & = \left( {1 - \alpha_{0} } \right)r^{f} + \alpha_{0} \left[ {\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{c} \delta \varvec{w}_{t}^{\rm T} \left( {1 + \varvec{r}_{t} } \right)} \right], \alpha_{0} > 0, {\mathbf{w}}_{t}^{\text{T}} 1 = 1, \\ \delta \varvec{w}_{t} & = \left[ {\left| {w_{1,t} - w_{1,t - 1} \left| {, \ldots ,} \right|w_{k,t} - w_{k,t - 1} } \right|} \right]^{\text{T}} , \\ \end{aligned}$$
(29)

where each wj,t may be nonnegative as in Eq. (28). In this case, short of a risk security is not permitted but borrowing from the risk-free bond is allowed, i.e., we can have 1 − α0 < 0. Also, we may allow a negative wj,t, i.e., short of a risk security is permitted.

Second, instead of considering \(E\varvec{w}^{\text{T}} \varvec{r} = \varvec{w}^{\text{T}} E\varvec{r}\) and \(E\left[ {\varvec{w}^{\text{T}} \varvec{r} - E\varvec{w}^{\text{T}} \varvec{r}} \right]\left[ {\varvec{w}^{\text{T}} \varvec{r} - E\varvec{w}^{\text{T}} \varvec{r}} \right]^{\text{T}}\) for the expected return and its volatility, we compute their estimations directly from samples RT = {rtt = 1, …, T} within a time window. Accordingly, it follows from Eq. (25) that we get the counterpart of Eq. (28) as follows:

$$\begin{aligned} Sp = \frac{{M\left( {R_{T} } \right)}}{{\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}} + \beta_{V} \frac{{\sqrt[\gamma ]{{V_{G}^{U} \left( {R_{T} } \right)}}}}{{\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}} + \beta_{\varvec{w}} D\left( \varvec{w} \right),\,M\left( {R_{T} } \right) = \frac{1}{T}\mathop \sum \limits_{t = 1}^{\rm T} r_{t} , \hfill \\ V_{G}^{D} \left( {R_{T} } \right) = \frac{1}{{\# \left( {r_{t} \le G} \right)}}\mathop \sum \limits_{{ r_{t} \le G}} (G - r_{t} )^{\gamma } , V_{G}^{U} \left( {R_{T} } \right) = \frac{1}{{\# (r_{t} > G)}}\mathop \sum \limits_{{ r_{t} > G}} (r_{t} - G)^{\gamma } , \hfill \\ \end{aligned}$$
(30)

where #S denotes the cardinality of the set S, and the parameter \(\beta_{V} ,\beta_{\varvec{w}}\) are the counterparts of H, B in Eq. (28). Moreover, \(D\left( \varvec{w} \right)\) is a diversification term that reaches its minimum when one wi is 1 and the others are 0, and reaches its maximum when all the elements \(\varvec{w}\) are equal. There could be several choices for \(D\left( \varvec{w} \right)\). One example is \(\varvec{w}^{\text{T}} \left( {1 - \varvec{w}} \right)\) in Eq. (28) or equivalently \(- \varvec{w}^{\text{T}} \varvec{w}\). One other example is

$$D\left( \varvec{w} \right) = - \,\mathop \sum \limits_{j = 1}^{k} w_{j,t} \,\ln \,w_{j,t} , \mathop \sum \limits_{j = 1}^{k} w_{j,t} = 1, \quad w_{j,t} \ge 0.$$
(31)

Moreover, \({{M\left( {R_{T} } \right)} \mathord{\left/ {\vphantom {{M\left( {R_{T} } \right)} {\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}}} \right. \kern-0pt} {\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}}}\) is a ratio which is also an improvement over \(\varvec{w}^{\text{T}} E\varvec{r}/\varvec{w}^{\text{T}} \varvec{Dw}\) in Eq. (28), and actually \(\varvec{w}^{\rm T} E\varvec{r}/\varvec{w}^{\rm T} \varvec{Dw}\) is not really a ratio. Third, instead of directly searching the parameters \(\alpha_{0} ,\varvec{w}_{t}\), we may let

$$\begin{aligned} \alpha_{0} = \,& e^{{ - g\left( {\varvec{r}_{t} ,\psi } \right)}} , w_{j,t} = \frac{{e^{{f^{\left( j \right)} \left( {\varvec{r}_{t} ,\varphi } \right)}} }}{{\mathop \sum \nolimits_{i = 1}^{k} e^{{f^{\left( i \right)} \left( {\varvec{r}_{t} ,\varphi } \right)}} }}, \\ f\left( {\varvec{r}_{t} ,\varphi } \right) = & \,\left[ {f^{\left( j \right)} \left( {\varvec{r}_{t} ,\varphi } \right), \ldots ,f^{\left( j \right)} \left( {\varvec{r}_{t} ,\varphi } \right)} \right]^{\text{T}} , \\ \end{aligned}$$
(32)

with \(g\left( {\varvec{r}_{t} ,\psi } \right), f\left( {\varvec{r}_{t} ,\varphi } \right)\) implemented by neural networks, e.g., an ENRBF network. In the next section, we will show that a portfolio of security returns \(\varvec{r}_{t}\) may also be modeled by a temporal extension of arbitrage pricing theory such that \(\varvec{r}_{t}\) is mapped into inner factor \(\varvec{y}_{t}\) with a much lowered dimension. Instead of depending on the security returns \(\varvec{r}_{t}\), we use \(\varvec{y}_{t}\) to replace \(\varvec{r}_{t}\) in Eq. (28) for a further improvement.

Following the extension proposed in Xu (2001), most of the above addressed extensions have been investigated together with detailed algorithm, experiments on real market data and comparative studies (Chiu and Xu 2002b, 2003, 2004b). Still, at the end of Sect III(C) in Xu (2001), there was one briefly introduced idea that has not been further investigated yet. Here, some further details are addressed.

In Eq. (30) and also in Eq. (28), as well as in the existing studies on the Markowitz portfolio optimization and the Sharpe ratio, the expected return and volatilities are nonparametric estimates directly from samples \(R_{T} = \left\{ {\varvec{r}_{t} ,t = 1, \ldots ,T} \right\}.\) To capture a temporal dependence better, one idea is using an ARCH or GARCH model to describe a sequence {rtt = 1, …, T} of the portfolio return \(r_{t} = \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} ;\) see Box in Fig. 2. It follows from Eq. (3) that we have

$$r_{t + 1} = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} r_{t + 1 - j} + \sigma_{t} \varepsilon_{t} , \varepsilon_{t} \sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(\varepsilon |0, 1) , {\text{and }}\sigma_{t} = \sigma_{t}^{2} \left( \vartheta \right)\, {\text{by Eq}} .\,\left( 3 \right).$$
(33)

Taking the expectation and separating the first term from the rest, as well as approximately considering \(E\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} \approx a_{1} \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} ,\) we further get

$$\begin{aligned} Er_{t + 1} & = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} Er_{t + 1 - j} = a_{1} E\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} + E\hat{r}_{t - 1} \approx a_{1} \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} + E\hat{r}_{t - 1} , \\ \sigma_{t + 1}^{2} & = \sigma_{0}^{2} + \mathop \sum \limits_{i = 1}^{q} \beta_{i} \varepsilon_{t + 1 - i}^{2} + \mathop \sum \limits_{j = 1}^{p} \omega_{j} \sigma_{t + 1 - j}^{2} = \beta (\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{t}^{AR} )^{2} + \hat{\sigma }_{t}^{2} , r_{t}^{AR} = a_{0} + \mathop \sum \limits_{j = 1}^{q} a_{j} r_{t - j} , \\ E\hat{r}_{t - 1} & = a_{0} + \mathop \sum \limits_{j = 2}^{q} a_{j} Er_{t + 1 - j} , \hat{\sigma }_{t}^{2} = \sigma_{0}^{2} + \mathop \sum \limits_{i = 2}^{q} \beta_{i} \varepsilon_{t + 1 - i}^{2} + \mathop \sum \limits_{j = 1}^{p} \omega_{j} \sigma_{t + 1 - j}^{2} ,\\ \end{aligned}$$
(34)

from which we get the following GARCH-based Shape ratio

$$J\left( {\varvec{w}_{t} } \right) = \frac{{Er_{t + 1} }}{{\sigma_{t + 1} }} = \frac{{a_{1} \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} + E\hat{r}_{t - 1} }}{{\beta_{1} (\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{t}^{AR} )^{2} + \hat{\sigma }_{t}^{2} }}.$$
(35)

Given the GARCH model and the past \(Er_{t - j} , r_{t - j} , \quad j = 1, \ldots ,k,\) we have \(E\hat{r}_{t - 1} ,\) \(\hat{\sigma }_{t}^{2} ,\) r AR t , a1, β1 available. As \(\varvec{r}_{t}\) is obtained, we compute the gradient of \(J\left( {\varvec{w}_{t} } \right)\) and update

$$\varvec{w}_{t} = \varvec{w}_{t - 1} + \eta \nabla_{{\varvec{w}_{t} }} J\left( {\varvec{w}_{t} } \right), {\text{for }}\,{\text{a}}\, {\text{learning}}\, {\text{step }}\,{\text{size }}\,\eta > 0.$$
(36)

Then, we get \(\varepsilon_{t}^{2} = (\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{t}^{AR} )^{2}\) and update \(a_{i}^{\text{new}} = e^{{c_{1}^{\text{new}} }} , c_{i}^{\text{new}} = c_{i}^{\text{old}} - \eta \frac{{{\text{d}}\varepsilon_{t}^{2} }}{{{\text{d}}c_{i}^{\text{old}} }},\quad {\text{for }}i = 0,1,\) \(a_{j}^{new} = a_{j}^{old} - \eta \frac{{d\varepsilon_{t}^{2} }}{{da_{j}^{old} }}, \quad {\text{for }}j = 2, \ldots ,q.\)

Also, we update the parameters ϑ in the same way as one standard GARCH solving approach. Next, we use Eq. (36) for updating \(\varvec{w}_{t + 1}\) again.

Market modeling: APT theory and temporal factor analysis

Arbitrage pricing theory and factor analysis’s incapability

Beyond only optimizing the outcome by investing a portfolio of multiple assets, the Markowitz mean–variance scheme also leads to the linear modeling of the market. The most famous one is the well-known capital asset pricing model (CAPM) (Sharpe 1964). However, the CAPM is criticized as being not sufficient to describe market behavior merely via one endogenous factor.

Under the name of arbitrage pricing theory (APT), Ross (1976) proposed the following linear model of multiple hidden or endogenous factors:

$$\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , A = \left[ {a_{ij} } \right], {\mathbf{r}}_{t} = [r_{1,t} , \ldots ,r_{k,t} ]^{\text{T}} , \,\varvec{f}_{t} = \left[ {f_{1,t} , \ldots ,f_{n,t} ]^{\text{T}} , \varvec{e}_{t} = } \right[e_{1,t} , \ldots ,e_{k,t} ]^{\text{T}} .$$
(37)

As illustrated in Fig. 5a, \(\varvec{r}_{t}\) consists of the returns of k assets in this market, \(\varvec{f}_{t}\) consists of m risky hidden factors that will affect the rate of returns on all assets by different degrees of sensitivity and aij is the sensitivity of the ith asset to factor j, also called factor loading, Moreover, each element of \(\varvec{e}_{t}\) is the risky asset’s idiosyncratic random shock with mean zero, and each element of \(\varvec{a}\) is a constant part of the corresponding risky asset.

Fig. 5
figure 5

Arbitrage pricing theory and factor analysis

Since its inception, the APT has attracted a considerable interest as a tool for interpreting investment results and controlling portfolio risk. However, the APT has been accepted by the investment community, but is not as popular as the CAPM. The reason largely relates to APT’s serious drawback, namely, its implementation is difficult due to the lack of specificity regarding the nature of the factors that systematically affect asset returns. As outlined in Sect. I of (Xu 2001), typically three types of approaches have been applied for the APT implementation.

Most of the studies are featured with \(\varvec{f}_{t}\) given by the so-called fundamental factors, i.e., historic time series of a set of macroeconomic or fundamental indexes. With the hidden factors chosen, the problem becomes a typical multivariate linear regression problem: \(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}\). However, choosing these fundamental factors is not an easy task. Chen et al. (1986) chose five macroeconomic factors, including surprises in GDP, inflation, investor confidence, and yield curve. Also, others consider index or spot or future market price, e.g., short-term interest rate, a diversified stock index, oil price, gold or precious metal prices, and currency exchange rate in place of macroeconomic factors. With efforts over decades, little progress has been achieved on identifying the number and nature of these fundamental factors. Many researchers believe that this issue is essentially empirical in nature, because the factors change over time and between economies.

There have been also efforts under the name of the cross-sectional approaches that observes the correlations of all the assets of \(\varvec{r}_{t}\) to each of the hidden factor in \(\varvec{f}_{t}\) by a certain period, resulting in estimates of elements of A that reflect the assets’ sensitivities to these hidden factors. Then, the task is to estimate \(\varvec{f}_{t}\) upon \(\varvec{r}_{t}\) and A, which is typically handled as a linear cross-sectional regression and solved by the least square error method in the literature of economics and finance. In Sect. I of Xu (2001), it is formulated as an inverse mapping problem, a topic that has been widely studied in the neural network and machine learning literature.

Observation of an implementation of the least square error method actually shows that the residuals \(\varvec{e}_{t}\) are uncorrelated among the elements and also with the factors \(\varvec{f}_{t}\) and that each element of \(\varvec{e}_{t}\) reflects a collective effect of many random noise, that is, we have \(E\varvec{f}_{t} \varvec{e}_{t}^{\rm T} = 0\) and also \(q(\varvec{r}_{t} |\varvec{f}_{t} )\) as shown by the top-down pathway on the right part of Fig. 5b. An inverse of the top-down path is a bottom-up path on the left part of Fig. 5b, for which the optimal solution is the following Bayesian inverse:

$$p(\varvec{f}_{t} |\varvec{r}_{t} ) = \frac{{G(\varvec{e}_{t} |\varvec{a} + A\varvec{f}_{t} ,\varSigma )q\left( {\varvec{f}_{t} } \right)}}{{\mathop \smallint \nolimits G(\varvec{e}_{t} |\varvec{a} + A\varvec{f}_{t} ,\varSigma )q\left( {\varvec{f}_{t} } \right)d\varvec{f}_{t} }}.$$
(38)

Here, we encounter a probabilistic structure \(q\left( {\varvec{f}_{t} } \right)\) of hidden factors. Approximately, if only considering its statistics up to the second order, \(q\left( {\varvec{f}_{t} } \right)\) is approximated by a Gaussian \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) as shown in Fig. 5b. In such a case, we have the following analytical solution:

$$\hat{\varvec{f}}_{t} = \mathop \smallint \nolimits \varvec{f}_{t} p(\varvec{f}_{t} |\varvec{r}_{t} )\,{\text{d}}\varvec{f}_{t} = \left( {A^{\text{T}} \varSigma^{ - 1} A + \varLambda^{ - 1} } \right)\left[ {A^{\text{T}} \varSigma^{ - 1} \left( {\varvec{r}_{t} - \varvec{a}} \right) + \varLambda^{ - 1} \nu } \right],$$
(39)

which returns to a least square error solution when there is no information about \(q\left( {\varvec{f}_{t} } \right)\) for which we may simply set Λ = 0, ν = 0.

Similar to the first approach, the second approach is also essentially empirical in nature, which needs not only a manual help to identify the number and nature of hidden factors, but also at least an enough long period of historic data about factors for estimating of elements of A. Moreover, getting elements of A by the correlations between \(\varvec{f}_{t}\) upon \(\varvec{r}_{t}\) actually imposes additional constraints on the values that A may take. The second approach is supplementary to the first approach, but it still cannot get rid of the nature that the factors are chosen heuristically and even rather arbitrarily. We may regard that the second approach actually consists of two steps. First, estimation of elements of A bases on a period historic data of macroeconomic or fundamental indexes takes the same role of the first approach or even just an implementation of the first approach. Second, we estimate \(\varvec{f}_{t}\) upon \(\varvec{r}_{t}\) and A, e.g., typically by Eq. (39).

The third type of efforts are called factor-analytic approach, attempting to use a statistical approach called factor analysis (FA) to get both the unknown and the unknown factors estimated from the observed return series \(\left\{ {\varvec{r}_{t} } \right\}\). There is no need of external heuristics, and thus it seems more appealing. As shown in Fig. 5b, an FA model comes from modifying Fig. 5a with an additional structure that \(\varvec{f}_{t}\) comes from a Gaussian \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) with a diagonal Λ or even \(\varLambda = I\). Unfortunately, empirical tests showed that factor analysis does not explain economic variables well. As addressed in Sect. I of Xu (2001), some incapability of factor analysis mainly comes from two kinds of intrinsic indeterminacy. One is the rotation indeterminacy, i.e.,

$${\text{if}}\,A,\varvec{f}_{t} \;{\text{is a solution}},\,A\varphi^{\text{T}} ,\varphi \varvec{f}_{t} \,{\text{is also a solution for any rotation matrix}}\,\varphi ,$$
(40)

while such a rotation may lead to a solution far from the correct one. The other comes from an intrinsic indeterminacy of an appropriate number of factors, while the selection of a correct number of factors is essential to the performance of using the APT model. Usually, it is set by a rule of thumb. Actually, factor analysis also suffers other types of indeterminacy. One is any rescaling \(D\varvec{f}_{t}\) of a solution \(\varvec{f}_{t}\) is still a solution for a diagonal matrix D, which is not critical because it reserves the waveform of each element in \(\varvec{f}_{t}\). The other is additive indeterminacy, i.e., AΛΣ and A*Λ*Σ*are both the solutions as long as AT + Σ = A*Λ*A*T + Σ*. However, the effect of this indeterminacy can be reduced significantly when Σ = σ2I. Therefore, our attention should be mainly on the first two key challenges, namely, removing the rotation indeterminacy by Eq. (40) and determining an appropriate number of factors.

The first challenge has been seldom considered by the APT studies in the fields of economics and finance, while there are some efforts on the second challenge, i.e., determining an appropriate number of factors with the help of statistical testing. The simplest one is making maximum likelihood factor analysis (MLFA) followed by the likelihood ratio (LR) test, shortly MLFA-LR. Empirical evidences show that the minimum number of factors accepted by the LR test tends to increase with the number of securities. Alternatively, Chamberlain and Rothschild (1983) suggest analyzing eigenvalues of the population covariance matrix, shortly eigenvalue approach. Still, Brown (1989) empirically found that this approach biases toward too few factors and the result consistent with one factor may be equally consistent with multiple equally weighted factors.

On one hand, being essentially empirical in nature, both the fundamental factor-based approaches and the cross-sectional approaches rely on pre-knowledge or external beliefs to choose the factors heuristically, in lack of consensus and consistency over what should be the real factors in APT. On the other hand, the implementation of factor analysis suffers the rotation indeterminacy by Eq. (40) and the difficulty of determining an appropriate number of factors. These problems incur for criticisms on the APT theory, e.g., see Dhrymes et al. (1984); Abeysekera and Mahajan (1987).

Instead of doubting the incorrectness of the APT theory, our understanding is that the APT theory is correct but incomplete. The APT suggests to model a market at no arbitrage equilibrium by a linear model, which is justifiable. However, this theory is incomplete because this linear model cannot be uniquely or even reasonably specified merely from the observed return series \(\left\{ {\varvec{r}_{t} } \right\}\). To complete the theory, further specification should be imposed on the components of this model. The fundamental factor-based approaches fix the hidden factors by heuristically and empirically picking a set of macroeconomic or fundamental indexes, which removes the indeterminacy but leaves the difficult questions on how to choose these factors and whether the factors should come directly from macroeconomic or fundamental indexes. The cross-sectional approaches aim at estimating \(A\), which leaves the difficult question on how A can be estimated correctly. To get A by the assets’ sensitivities to these hidden factors, we still need to heuristically and empirically pick a set of macroeconomic or fundamental indexes, Finally, the FA model is also unable to remove the incompleteness of the APT, because imposing an additional Gaussian \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) is still not enough to remove the critical indeterminacy by Eq. (40). In a summary, the original APT (Ross 1976) is reasonable but incomplete, and further efforts should explore how to add certain structure to remove or remedy the incompleteness.

Temporal factor analysis and temporal APT

The famous CAPM model is featured by one factor that is not a manually chosen exogenous macroeconomic or fundamental index but an invisible and intrinsic market indicator. The APT was motivated by following the basic sprit of CAPM to answer the criticism that merely one factor is not enough to describe the market behavior. However, implementing APT by manually picking macroeconomic or fundamental indices actually deviates from the original motivation. Encouragingly, the direction of FA implementation is still consistent with the original motivation that seeks intrinsic factors, and thus we further proceed along this direction. Keeping Eq. (37), we extend the Gaussian structure \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) into a better structure such that the indeterminacy by Eq. (40) or the incompleteness of the FA model can be removed or at least remedied.

Temporal factor analysis (TFA) is such a further development of FA; see Box 1 in Fig. 3. The early study was started in 1997, firstly introduced briefly by Xu (1997) and further addressed in Xu (2000) (this manuscript actually reached the editorial office also in 1997). See Box 2 in Fig. 3: the key idea is modifying Eq. (37) as follows:

$$\begin{aligned} \varvec{r}_{t} & = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , A = \left[ {a_{ij} } \right], {\mathbf{r}}_{t} = [r_{1,t} , \ldots ,r_{k,t} ]^{\text{T}} , \\ \varvec{f}_{t} & = \left[ {f_{1,t} , \ldots ,f_{n,t} ]^{\text{T}} , \varvec{e}_{t} = } \right[e_{1,t} , \ldots ,e_{k,t} ]^{\text{T}} , {\text{E }}{\mathbf{f}}_{t} \varvec{e}_{t}^{\text{T}} = 0, \\ \varvec{f}_{t} & = B\varvec{f}_{t - 1} + \varepsilon_{t} , B = {\text{diag}}\left[ {b_{1} , \ldots ,b_{m} } \right] \ne bI\,{\text{with }}\,b \ne 0,\\ & \varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,\varLambda } \right.} \right) \;{\text{with}}\;{\text{a}}\;{\text{diagonal }}\varLambda, {\text{E }}{\mathbf{f}}_{t - 1} \varepsilon_{t}^{\text{T}} = 0. \\ \end{aligned}$$
(41)

That is, the first-order autoregressive dependence is added to each factor in \(\varvec{f}_{t}\) via B, and Eq. (41) returns to FA by Eq. (37) when B = 0.

It is this temporal dependence that removes the rotation indeterminacy by Eq. (40); see Sect IV (A) in Xu (2000) and Sect. II in Xu (2002). Roughly, the following points may be understood:

  • For any diagonal matrix D, we have \(A\varvec{f} = \tilde{A}\tilde{\varvec{f}},\tilde{A} = AD,\tilde{\varvec{f}} = D^{ - 1} \varvec{f},\) which keeps the format \(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}\) unchanged and also the elements of \(\tilde{\varvec{f}}\) remain mutually independent. i.e., Equation (37) has an indeterminacy of unknown scaling on factors of \(\tilde{\varvec{f}}\). Thus, we may simply consider \(\varvec{f}_{t} \sim G\left( {\varvec{f}_{t} \left| {0,I} \right.} \right)\). For any rotation matrix φ with \(\varphi^{\text{T}} \varphi = I\), we have \(A\varvec{f} = \tilde{A}\tilde{\varvec{f}},\) and \(\tilde{A} = A\varphi^{\text{T}} ,\tilde{\varvec{f}} = \varphi \varvec{f}\) with \(\tilde{f}_{t} \sim G\left( {\tilde{f}_{t} \left| {0,I} \right.} \right)\). That is, Eq. (37) has also an indeterminacy of unknown rotation on factors \(\tilde{\varvec{f}}\).

  • For any diagonal matrix D, we also have \(D^{ - 1} \varvec{f}_{t} = D^{ - 1} BDD^{ - 1} \varvec{f}_{t - 1} + D^{ - 1} \varepsilon_{t}\) and \(\tilde{\varvec{f}}_{t} = B\tilde{\varvec{f}}_{t - 1} + \tilde{\varepsilon }_{t} ,\), where \(\tilde{\varepsilon }_{t} = D^{ - 1} \varepsilon_{t}\) comes from \(G\left( {\tilde{\varepsilon }_{t} \left| {0,D^{ - 1} \varLambda D^{ - 1} } \right.} \right)\) and \(D^{ - 1} \varLambda D^{ - 1}\) is still diagonal. That is, Eq. (41) still has an indeterminacy of unknown scaling on factors \(\tilde{\varvec{f}}\). Again, we may consider \(\varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,I } \right.} \right).\) For any rotation matrix φ with φTφ = I, we have \(\tilde{\varvec{f}}_{t} = \tilde{B}\tilde{\varvec{f}}_{t - 1} + \tilde{\varepsilon }_{t}\) with \(\tilde{\varepsilon }_{t} \sim G\left( {\tilde{\varepsilon }_{t} \left| {0,I } \right.} \right)\), while \(\tilde{B} = \varphi B\varphi^{\text{T}}\) is no longer diagonal and even B is diagonal. If \(\tilde{B} = \varphi B\varphi^{\text{T}}\) is required to be diagonal, the only rotation matrix is φ = I and thus the rotation indeterminacy is removed.

Still there is an indeterminacy of unknown scaling on factors of \(\tilde{\varvec{f}}\), but it will not change the waveform of f1,t, …, fn,t. Also, we may normalize each factor to remove such indeterminacy.

In Xu (2001), the TFA by Eq. (41) is thus suggested as a refinement of the original APT theory, by which the original part of APT is kept without modification, while a temporal structure \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\) is added such that the incompleteness caused by the rotation indeterminacy has been removed. Such a refinement may be called temporal APT in a sense that temporal relation is taken into consideration of market modeling. That is, a static equation by Eq. (37) is not enough to describe a market equilibrium, but a temporal structure should be an important ingredient of a market equilibrium.

Why is an AR model of merely order one \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\) considered as this temporal structure? First, we consider that hidden factors \(\varvec{f}_{t}\) are driven by Gaussian noise \(\varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,\varLambda } \right.} \right),\) following a general consensus that the noisy component in most econometric and statistical models is Gaussian distributed. The rationale comes from the central limit theorem which implies that the compounding of a large number of unknown distributions will be approximately normal. Second, the first-order AR model can be attributed to the weak form of efficient market hypothesis (EMH), that is, stock price today is conditionally independent of all previous prices given the price of yesterday. Third, though observable economic indices are seldom independent, it cannot rule out that hidden factors that denominate a market equilibrium are mutually independent. Instead, independent factors may help to make market equilibrium simpler.

As addressed in the previous subsection, past efforts on determining an appropriate number of factors have not provided much support on the APT. For one example, the MLFA-LR test shows that the number of factors tends to increase with the number of securities. For another example, the identification via eigenvalue approach (Chamberlain and Rothschild 1983) biases toward a smaller factor number. In one IJCNN 02 paper (Chiu and Xu 2002a), empirical tests on Hong Kong stock market data show not only that these two unfavorable biases are again observed, but also that the TFA-based APT can provide a reasonable answer to the number of factors in the Hong Kong stock market. As shown in Fig. 6, the number of factors identified by MLFA-LR test varies as the numbers of securities, while the number of factors identified by the eigenvalue approach is always 1. In contrast, BYY harmony learning based TFA stably identifies four or five factors regardless of the numbers of securities, which is quite consistent with the number identified via heuristic empirical analysis, e.g., in Chen et al. (1986).

Fig. 6
figure 6

Comparison on finding the number of factors identified by MLFA-LR test, eigenvalue approach and BYY harmony learning-based TFA

The above introduced nature of TFA and preliminary studies suggest that there may need a renewed interest in the literature of finance and economics to further investigate APT and its further developments. To consider which topics to pursue, it is helpful to observe the differences of TFA from related methods.

First, \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\) in Eq. (41) is actually a special type of the first-order vector AR (VAR). Being different from the conventional VAR that are used for capturing linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987), the TFA captures the interdependencies among multiple time series by \(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}\) and temporal dependences by \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\). As addressed in Sect. 3.2.1 in Xu (2012), it is more efficient to separately treat these two types of dependences.

Second, if we do not constrain B,Λ to be diagonal, Eq. (41) becomes a general state–space model (SSM) or a linear dynamical system (LDS), which has been widely studied in the literature of control theory and signal processing. As outlined in Sect. 5.2.1 of Xu (2012), in a period that is more or less the same as the studies on TFA (Xu 1997; 2000), there was a renewed interest on a general LDS, featured by using the EM algorithm for parameter estimation under the ML learning (Ghahramani and Hinton 2000). Accordingly, this EM algorithm was originally derived in the early 1980s and re-introduced in the early 1990s (Shumway and Stoffer 1991). Neither these studies suggest using the LDS as a further development of APT, nor the notorious rotation indeterminacy in Eq. (40) has been taken into consideration. On the contrary, more problems of indeterminacy than the FA are actually incurred in this general LDS model due to many extra free parameters, which makes identifiability even worse. For an example, applied to radar automatic target recognition based on high-resolution range profile, it has been shown in Wang et al. (2011) that the recognition performance of the general LDS is actually even inferior to that of the FA, while TFA obtains better performances than the FA.

Third, many efforts have been made on determining the factor number of FA in the literature of statistics and machine learning, typically in a two-stage implementation. The first stage uses the EM algorithm to make the ML learning for unknown parameters in the FA while the second stage selects an appropriate number of factors with help of a model selection criterion. In Tu and Xu (2011), a systematic comparative investigation has been made on a number of typical model selection criteria, including not only Akaike’s AIC, Schwarz’s BIC, Bozdogan’s CAIC, Hannan–Quinn criterion, but also recent Minka’s PCA criterion, Kritchman and Nadler’s tests, and Perry and Wolfe’s rank, as well as the criterion obtained from the BYY harmony learning theory (Xu 2001).

As discussed above, there is not really a need to further consider the relations to VAR and LDS. Instead, further explorations may start from continuing the study in the IJCNN02 paper (Chiu and Xu 2002b) and proceed to clarify the following issues:

  • Does using one of the above model selection criteria in a two-stage implementation improve the number of FA factors identified by the MLFA-LR test and the eigenvalue approach? If yes, does this improvement help the FA-based implementation of APT, even still suffering the rotation indeterminacy by Eq. (40).

  • Still using one of the above model selection criteria in a two-stage implementation, how much improvement TFA can be obtained after removing the rotation indeterminacy by \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\)?

Additionally, studies may be made on data from other major international markets, with those past empirical analyses (e.g., Chen et al. 1986; Azeez and Yonezawa 2006) as references. In addition to a two-stage implementation, one promising feature of implementing the TFA by the BYY harmony learning (Xu 2001) is that the number of temporal factors is determined automatically during learning, which saves computational costs greatly and also improves the learning performance of TFA, for which details are referred to Sect. 5 of Xu (2010) and Sect. 5.2 of Xu (2012).

Macroeconomics-modulated TFA-APT and nGCH-driven M-TFA-O

In those empirical APT studies, the practice that uses macroeconomic indexes as \(\varvec{f}_{t}\) leads to an understanding that \(\varvec{f}_{t}\) typically consists of a set of macroeconomic or fundamental indexes. In an FA implementation or a TFA implementation by Eq. (41), such an understanding may not be correct. Actually, \(\varvec{f}_{t}\) may vary much slower than the return \(\varvec{r}_{t}\) and thus be regarded as a macroeconomic type of indices. However, \(\varvec{f}_{t}\) may also vary in a timescale similar to the changes of \(\varvec{r}_{t}\). Moreover, \(\varvec{f}_{t}\) in Eq. (41) is intrinsically determined from real data \(\varvec{r}_{t}\) and usually will not coincide with exogenous macroeconomic indexes, such as GDP, inflation, investor confidence, and yield curve. Therefore, we need to further investigate how the market is influenced by these exogenous variables or macroeconomic indexes.

Being quite different from many existing studies that explicitly model the relation between market return \(\varvec{r}_{t}\) and macroeconomic indices, the influences of these indices to \(\varvec{r}_{t}\) are considered via their roles in modulating the temporal factors in \(\varvec{f}_{t} ,\) as shown in Fig. 3 by Box 3. This idea is realized via extending Eq. (41) into the following macroeconomics-modulated TFA–APT:

$$\begin{aligned} \varvec{r}_{t} & = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , {\text{E }}{\mathbf{f}}_{t} \varvec{e}_{t}^{\rm T} = 0, \\ \varvec{f}_{t} & = B\varvec{f}_{t - 1} + H\varvec{m}_{t} + \varepsilon_{t} , {\text{E }}{\mathbf{f}}_{t - 1} \varepsilon_{t}^{\rm T} = 0, {\text{E }}{\mathbf{m}}_{t - 1} \varepsilon_{t}^{\rm T} = 0, \\ \varvec{m}_{t} & = C\varvec{v}_{t} + \eta_{t} , {\text{E }}\varvec{v}_{t}\, \eta_{t}^{\rm T} = 0, \\ \end{aligned}$$
(42)

where \(\varvec{e}_{t} ,\) ɛt, and ηt are Gaussian white noises and independent of each other. Typically, \(\varvec{m}_{t}\) consists of several macroeconomic indices, and \(\varvec{\nu}_{t}\) consists of several known non-market factors that affect the macroeconomy. Specifically, \(H\varvec{m}_{t}\) describes the effect of the macroeconomic indices to the security market via the hidden factors \(\varvec{f}_{t}\). Actually, Eq. (42) comes from a simplification of one proposed in Sect. III(C) of (Xu 2001) and its Eq. (101), in particular, under the name of macroeconomics-modulated independent state–space model.

In one CIFEr2003 conference paper (Chiu and Xu 2003), empirical investigation is made on the model by Eq. (42). First, white noise tests are made on \(\varvec{e}_{t} ,\) ɛt, and ηt to ensure model specification adequacy. Second, the performances in return prediction and index forecasting are compared with that of the TFA model. Empirical results reveal that the model is not only well specified, but also superior to the TFA model in stock price and index forecasting.

See Box 4 in Fig. 3, there are two ways to perform prediction based on Eq. (41) and Eq. (42). The first way is intrinsically to get \(\varvec{r}_{t - 1} \to \varvec{f}_{t - 1}\) and predict \(\hat{\varvec{r}}_{t} = \varvec{a} + AB\varvec{f}_{t - 1}\) for Eq. (41) and \(\hat{\varvec{r}}_{t} = \varvec{a} + A\left( {B\varvec{f}_{t - 1} + H\varvec{m}_{t} } \right)\) for Eq. (42), while the second way is considering a given prediction \(\varvec{r}_{t - 1} \to \varvec{y}_{t}\) via \(\varvec{r}_{t - 1} \to \varvec{f}_{t - 1}\), \(B\varvec{f}_{t - 1} \to \varvec{f}_{t}\) and then \(\varvec{f}_{t} \to \varvec{y}_{t}\) by learning either linear or nonlinear regression, where yt could be either \(\varvec{r}_{t}\) or any type of market indices. In one paper (Chiu and Xu 2002), \(\varvec{f}_{t} \to \varvec{y}_{t}\) is implemented by the normalized radial basis function (NRBF) and extended NRBF (ENRBF) (Xu 1998, 2009) and predicts the stock price or return \(\varvec{r}_{t}\). Empirical studies on Hong Kong market data have shown the superiority of this prediction over not only a conventional prediction \(\varvec{f}_{t} \to \varvec{y}_{t}\), but also the prediction \(\hat{\varvec{r}}_{t} = \varvec{a} + AB\varvec{f}_{t - 1}\).

Based on Eqs. (41) and (42), in addition to making a prediction featured with learning a regression \(\varvec{f}_{t} \to \varvec{y}_{t}\), we may also use \(\varvec{f}_{t}\) to replace \(\varvec{r}_{t}\) in the previous Eq. (29) for adaptive portfolio management; see Box 5 in Fig. 3. This APT based portfolio management was firstly suggested in Sect. III(c) and especially by Eqs. (96) and (97) in Xu (2001). Extensive simulation results reveal that this \(\varvec{f}_{t}\)-based portfolio management generally excels the return \(\varvec{r}_{t}\) based portfolio management by Eq. (29) (Chiu and Xu 2004b).

In general, a parametric \(\varvec{y}_{t} = g\left( {\varvec{f}_{t} ,\theta } \right)\) can be added to Eq. (41) to provide the outputs of this model for application purposes for such prediction and portfolio management. Moreover, beyond the consideration of Gaussian white noises as the driven noise ɛt, we may consider a non-Gaussian driven noise ɛt or a driven noise ɛt with a conditional heteroskedasticity. In summary, we further generalize Eq. (42) into the following model

  1. (a)
    $${\mathbf{r}}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , {\text{E }}{\mathbf{f}}_{t} \varvec{e}_{t}^{\text{T}} = 0,$$

    \({\mathbf{e}}_{t} \sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(\varvec{e}_{t} |0, \varSigma_{e} )\) with a diagonal covariance \(\varSigma _{e}\)

  2. (b)

    \({\mathbf{y}}_{t} = g\left( {\varvec{f}_{t} ,\theta } \right);\)

  3. (c)
    $$\begin{aligned} {\mathbf{f}}_{t} &= B\varvec{f}_{t - 1} + H\varvec{m}_{t} + {\text{diag}}\left[ {\sigma_{t}^{\left( 1 \right)} , \ldots ,\sigma_{t}^{\left( m \right)} } \right]\varepsilon_{t}, \ q\left( {\varepsilon_{t} } \right) = \mathop \prod \limits_{j} q\left( {\varepsilon_{t}^{\left( j \right)} } \right), \\ \varepsilon_{t} &= [\varepsilon_{t}^{\left( 1 \right)} , \ldots ,\varepsilon_{t}^{\left( m \right)} ]^{\text{T}}_{,}\ \ {\text{E }}{\mathbf{f}}_{t - 1} \varepsilon_{t}^{\text{T}} = 0,\ {\text{E}}{\mathbf{m}}_{t} \varepsilon_{t}^{\text{T}} = 0,\ {\text{E}}\varepsilon_{t}^{\left( j \right)} = 0, {\text{E}}\varepsilon_{t}^{\left( j \right) 2} = 1, \\ q\left( {\varepsilon_{t}^{\left( j \right)} } \right) &= \left\{ {\begin{array}{ll} G(\varepsilon_{t}^{\left( j \right)} |0, 1), \qquad \qquad \qquad \qquad \quad \left( {\text{i}} \right)\, {\text{one}}\;{\text{Gaussian,}} \\ \mathop \sum \limits_{i} \alpha_{i}^{\left( j \right)} G(\varepsilon_{t}^{\left( j \right)} |\mu_{i}^{\left( j \right)} , \lambda_{i}^{\left( j \right)} ),\qquad \quad \qquad \;\left( {\text{ii}} \right) \,{\text{Gaussian }}\,{\text{mixture}}; \\ \end{array} } \right. \\ \sigma_{t}^{\left( j \right)} &= \left\{ {\begin{array}{ll} {\rm a} \ {\text{constant}}\, \sigma_{{}}^{\left( j \right)} , &\quad \left( {\text{a}} \right) \ {\text{nonheteroskedasticity}}, \\ \sigma_{t}^{\left( j \right)} \left( {\vartheta_{{}}^{\left( j \right)} } \right) {\text{given }}\;{\text{by }}\;{\text{Eq}}.\,\left( 3 \right), &\quad \left( {\text{b}} \right) \ {\text{heteroskedasticity}}; \\ \end{array} } \right. \end{aligned}$$
  4. (d)
    $$\begin{aligned}{\mathbf{m}}_{t} &= C\varvec{\nu}_{t} + \eta_{t} , {\text{E }}{\varvec{\upnu}}_{t} \eta_{t}^{\text{T}} = 0, \\ \eta_{t} &\sim^{{{\text{i}} . {\text{i}} . {\text{d}} .}} G(\eta_{t} |0, \varSigma_{\eta } ) {\text{with }}\;{\text{a}}\; {\text{digognal}}\; {\text{covariance }} \varSigma_{\eta } . \\ \end{aligned}$$
    (43)

Its basic part consists of ingredients (a)(b)(c). In the special case H=0, its function is TFA with two extensions. One is outputting yt, thus shortly denoted by TFA-O. The other is that ingredient (c) drives ft by its last term that is either or both of non-Gaussian (nG) and conditional heteroscedasticity (CH), for which we use nGCH-driven TFA-O to refer this formulation. When H ≠ 0, ft is also modulated by the macroeconomic market force \(\varvec{m}_{t}\), it leads to the general formulation shortly named nGCH-driven M-TFA-O.

The central role is taken by the statistical nature of ingredient (c), with several scenarios as follows:

  • For the case that \(B = 0, H = 0\) and q(ɛ ( j) t ) in Choice (i) as well as σ ( j) t in Choice (a), ingredient (a) and ingredient (c) jointly degenerate back to the FA-based implementation of the original APT by Eq. (37).

  • For the case that B = 0, \(\varepsilon_{t} = 0\), it follows from \(\tilde{A} = AH\) that ingredient (a) and ingredient (c) jointly degenerate back to the fundamental factors based implementation of the original APT by Eq. (37).

  • For the case that B = 0, q(ɛ ( j) t ) in Choice (i), and σ ( j) t in Choice (a), ingredient (a) and ingredient (c) jointly act as a combination of the above two implementations.

  • For the case that H = 0, q(ɛ ( j) t ) in Choice (i), and σ ( j) t in Choice (a), as well as B = diag[b1, …, bm]T, ingredient (a) and ingredient (c) jointly become the TFA-based implementation by Eq. (41). It further becomes Eq. (42) when H ≠ 0. Moreover, conditional heteroskedasticity is further considered in \(\varepsilon_{t}\) via Choice (i) of σ ( j) t to be replaced by Choice (b). As shown by empirical investigation in the CIEF’2003 conference paper (Chiu and Xu 2003), we consider that the conditional heteroskedasticity in the TFA-based implementation is considerably better than the TFA-based implementation without such a consideration.

Another alternative is that Choice (i) of a Gaussian q(ɛ ( j) t ) is replaced by Choice (ii) of a non-Gaussian q(ɛ ( j) t ). In the simplest case, B = 0, H = 0, and σ ( j) t in Choice (a), ingredient (a) and ingredient (c) jointly degenerate back to the non-Gaussian FA (NFA) as outlined in Fig. 3 by Box 6, for which details are referred to Sect. III(A) in Xu (2001), Sect. IV in Xu (2004), and Sect. 3.2 in Xu (2010). Accordingly, we get a Non-Gaussian APT as shown in Fig. 3 by Box 7. Interestingly, NFA can also remove the FA’s rotation indeterminacy by Eq. (40), though there is no temporal structure \(\varvec{f}_{t}\) in consideration because B = 0, H = 0. Similar to Fig. 6, shown in Fig. 7 are the results of empirical investigation made on determining the appropriate factor number of APT by NFA (Chiu and Xu 2004a), still in comparison with the results of the MLFA-LR test and the eigenvalue approach as listed in Fig. 7a. Again, the BYY harmony learning-based NFA stably identified four or five factors regardless of the numbers of securities.

Fig. 7
figure 7

Comparison on finding the number of factors identified by MLFA-LR test, eigenvalue approach and the BYY harmony learning-based NFA

This alternative provides a different perspective on how to remove the indeterminacy by Eq. (40) or the incompleteness of APT. Without the additional equation about \(\varvec{f}_{t}\), the formulation of NFA implementation seems closer than the TFA implementation to the original APT formulation by Eq. (37). Naturally, there rises a question on which one is right, TFA or NFA? Actually, they are two aspects of one market model. TFA observes a dynamic market process while NFA describes the market with all the time points projected to one observation spot such that a Gaussian process is projected to be observed as a mixture of Gaussian distributions. Generally, we may have two natures to be considered in the same market, that is, considering both B = diag[b1, …, bm]T and the choice (ii) of a non-Gaussian q(ɛ ( j) t ). Even generally, the conditional heteroskedasticity may also be added in via letting \(\sigma_{t}^{\left( j \right)}\) in the choice (b). Systematically integrating all the parts and all the ingredients together, Eq. (43) may serve as a general formulation for financial market modeling.

Bayesian Ying–Yang harmony learning and two exemplar learning algorithms

Bayesian Ying–Yang (BYY) harmony learning

The Bayesian Ying–Yang (BYY) harmony learning was proposed in Xu (1995a, b) and subsequently developed systematically (Xu 2001, 2007, 2010, 2012), which provides not only a framework that accommodates typical learning approaches from a unified perspective, but also a new road that leads to improved model selection criteria, Ying–Yang alternative learning with automatic model selection, as well as coordinated implementation of Ying-based model selection and Yang-based learning regularization.

From a modern science perspective that regards the famous ancient Yin–Yang philosophy as a meta theory of system sciences and intelligent systems, a system that survives and interacts with its world can be regarded as a Ying–Yang system that functionally composes of two complement parts. One is called Ying, from its inside into its external world, by which a set \(\varvec{X}_{N} = \{ x_{t} \}_{t = 1}^{N}\) of samples are regarded as generated from its representation \(\varvec{R}\), while the other is called Yang, from an external world into its inside. A two directional view is considered via the joint distribution of \(\varvec{X},\varvec{R}\) in two types of Bayesian decomposition. The decomposition of \(p\left( {\varvec{X},\varvec{R}} \right)\) coincides the Yang concept with a visible domain \(p\left( \varvec{X} \right)\) for a Yang space and a \(\varvec{X} \to \varvec{R}\) pathway by \(p(\varvec{R}|\varvec{X})\) as a Yang pathway. Thus, \(p\left( {\varvec{X},\varvec{R}} \right)\) is called Yang machine. Also, \(q\left( {\varvec{X},\varvec{R}} \right)\) is called Ying machine with an invisible domain \(q\left( \varvec{R} \right)\) for a Ying space and a \(\varvec{R} \to \varvec{X}\) pathway by \(q(\varvec{X}|\varvec{R})\) as a Ying pathway. Such a Ying–Yang pair is called Bayesian Ying–Yang (BYY) system. Ying–Yang pair interact with each other under the principle of best harmony, which is mathematically implemented by maximizing

$$H(p||q) = \mathop \int \limits^{{}} p(\varvec{R}|\varvec{X})p\left( \varvec{X} \right){ \ln }\left[ {q\left( {\varvec{X} |\varvec{R}} \right)q\left( \varvec{R} \right)} \right]\,{\text{d}}\varvec{X}{\text{d}}\varvec{R}\varvec{.}$$
(44)

For a machine learning or modeling purpose, we first need to consider a mathematical representation for \(\varvec{R}\). The first column of Table lists several typical examples. Usually, \(\varvec{R}\) consists of two parts. One is a long-term memory θ that consists of all unknown parameters in the system for collectively representing the underlying structure of \(\varvec{X}_{N}\), while the other is a short-term memory YL with each element being either or both of a categorical label ℓ L and a vector y Y as the corresponding inner representation of one element x X. For examples, we have a vector y for describing \(\varvec{f}_{t}\) in the APT model by Eq. (37), while we simply have a label ℓ in the time series model by Eq. (4).

The probabilistic structure q(YL) is considered jointly with \(q(\varvec{X}|\varvec{R}) = q(\varvec{X}|Y,L,\theta )\), depending on both the tasks in consideration and a trade-off between the complexity of q(YL) and the complexity of \(q(\varvec{X}|Y,L,\theta )\). For the task of TFA modeling by Eq. (41), we have \(q(\varvec{X}|Y,L,\theta )\) by \(q(\varvec{r}_{t} |\varvec{f}_{t} )\) and q(YL) by \(q\left( {\varvec{f}_{t} \left| {\varvec{f}_{t - 1} } \right.} \right)\) as follows:

$$\begin{aligned} q(\varvec{r}_{t} |\varvec{f}_{t} ) & = G(\varvec{r}_{t} |\varvec{a} + A\varvec{f}_{t} , \varSigma ) \quad {\text{with}}\,{\text{a }}\,{\text{diagonal }}\,\varSigma , \\ q\left( {\varvec{f}_{t} \left| {\varvec{f}_{t - 1} } \right.} \right) & = G\left( {\varvec{f}_{t} \left| {B\varvec{f}_{t - 1} ,\varLambda } \right.} \right) \quad {\text{with }}\,{\text{a }}\,{\text{diagonal}}\,\varLambda . \\ \end{aligned}$$
(45)

Moreover, the remaining part in q(R) = q(YL|θ)q(θ) is usually called a priori q(θ) that is chosen depending on the types of parameters and their positions in the Ying machine. In general, a Ying machine q(XR) = q(X|R)q(R) is designed according to a least complexity principle, featured with designing q(R) = q(YL|θ)q(θ) in a least redundancy principle and designing \(q(\varvec{X}|\varvec{R}) = q(\varvec{X}|Y,L,\theta )\) in a divide–conquer principle.

For the Yang machine p(XR) = p(R|X)p(X), p(X) directly comes from samples \(\varvec{X}_{N}\), while p(R|X) is designed based on the Ying machine q(XR) = q(X|R)q(R) according to the variety preservation principle, that is

$$\begin{aligned} p\left( {R |X} \right) & = q\left( {R |X} \right)\ \ {\text{in}}\, {\text{a}}\, {\text{strong }}\,{\text{sense }} \\ \quad \quad \quad \quad \quad {\text{or}} \\ {\text{Cov}}_{R|X} \ of \ p(R|X) & = {\text{Cov}}_{R|X} \ of \ q(R|X) \ \ {\text{in }}\,{\text{a }}\,{\text{week}}\, {\text{sense}} .\\ q\left( {R|X} \right) & = {{q\left( {X|R} \right)q\left( R \right)} \mathord{\left/ {\vphantom {{q\left( {X|R} \right)q\left( R \right)} {\int {q\left( {X|R} \right)q\left( R \right)\,{\rm d}R} }}} \right. \kern-0pt} {\int {q\left( {X|R} \right)q\left( R \right)\, {\rm d}X {\rm d}R} }}, \\ \end{aligned}$$
(46)

where CovR|X indicates a covariance matrix of R conditioning on X. Readers are referred to Xu (2010, 2012) for recent systematic outlines on major issues for designing Ying–Yang machines. To be specific, reading is suggested to start with Sect. 3.2 in Xu (2012) and refer to Sect. 4.2 in Xu (2010) for supplementary materials. Also, readers are referred to Xu (2011) for another perspective that a co-dimensional matrix pair forms a building unit and a hierarchy of such building units sets up the BYY system.

With a BYY system designed, all the remaining unknowns in the system are determined via maximizing the harmony functional by Eq. (44). Typically, there are two types of unknowns. Given the structure of a BYY system or a parametric model in general, it actually means a family of infinite many candidate structures with everyone in a same configuration but in different scales. That is, each candidate is featured by a scale parameter \(\varvec{k}\) in terms of one integer or a set of integers. For examples, \(\varvec{k }\) consists of the model number k and the orders {qi} for the model in Eq. (3), while merely of the dimension k in the APT model by Eq. (37).

The second type of unknown is featured by a set \(\theta_{\varvec{k}}\) of unknown parameters within the candidate structure featured by a specific \(\varvec{k}\). Accordingly, maximizing the harmony functional H(p||q) by Eq. (44) makes both parameter learning on determining \(\theta_{\varvec{k}}\) and model selection on determining \(\varvec{k}\). This BYY best harmony learning provides a favorable mechanism for model selection. Readers are referred to Xu (2010, 2012) for recent systematic overviews on the fundamentals, the novelties and favorable natures of the BYY best harmony learning. To be specific, reading is suggested to start with Sect. 4.1 in Xu (2012) on two different aspects of measuring bi-entity proximity and Sect. 4.2 on the BYY harmony learning from the perspectives of Ying–Yang best matching versus Ying–Yang best harmony, and then proceed to Sect. 7 for a systematic outline on the thirteen topics about the BYY best harmony learning. Also, readers are referred to Xu ( 2010) for supplementary materials in Sect. 4.1 and the roadmap shown in Fig. A2 for the relations to other typical learning approaches.

The implementation of maximizing H(p||q) consists of different specific cases for different learning problems and application tasks. Inputting the samples \(\varvec{X}_{N}\) by \(p\left( \varvec{X} \right) = \delta \left( {\varvec{X} - \varvec{X}_{N} } \right)\), H(p||q) in Eq. (44) is simplified into the one on the top of Table 1. As \(\varvec{R}\) takes different specific forms given in the first column of Table 1, we have four types of H(p||q) as listed in the second column of the table, plus their corresponding special cases of i.i.d. samples \(\left\{ {x_{t} } \right\}_{t = 1}^{N}\).

Table 1 \(H(p||q)\) in four specific types of implementations

Moreover, the collective operations \(\int {[ \bullet ]} \,{\text{d}}Y_{N}\) and \(\sum_{L} \left[ { \bullet } \right]\) may be simplified by removing the integral or the summation to merely consider their optimal values, from which those of H(p||q) in the second column of Table 1 result in the corresponding counterparts of \(H(\varTheta_{\varvec{k}} |X_{N} )\) in the third column of the table. Each type in the second column may have more than one counterparts by removing either or both of the two collective operations. Such a removal makes learning implementation of \(H(\varXi_{\varvec{k}} |X_{N} )\) easier but the learned system become more prone to an overfitting of a small size of samples.

As addressed at the end of “Learning mixture of AR, ARMA, ARCH and GRACH models” section, the BYY harmony learning has an automatic model selection mechanism similar to the RPCL learning. Additionally, \(H(\varTheta_{\varvec{k}} |X_{N} )\) in the third column of Table 1 provides another angle to view such a mechanism. For example, observing the choice (a) in the last-bottom box of the table, maximizing \(H(\varTheta_{\varvec{k}} |X_{N} )\) consists of maximizing not only \(p\left( {\theta |X_{N} , \varXi } \right)\) that is same as the Bayesian learning, but also \(\mathop \sum \nolimits_{t = 1}^{N} p(y_{t} ,\ell_{t} | x_{t} ,\theta )\pi (x_{t} ,y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )\) that includes maximizing a term \(\omega_{{y_{t} ,\ell_{t} }} \ln \omega_{{y_{t} ,\ell_{t} }}\) with \(\omega_{{\ell_{t} }} = q(x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} )q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )\). Noticing that \(\omega_{{y_{t} ,\ell_{t} }} \ln \omega_{{y_{t} ,\ell_{t} }}\). monotonically increasing for \(\omega_{{\ell_{t} }} > e^{ - 1}\) but decreasing for \(\omega_{{\ell_{t} }} < e^{ - 1}\), a value \(\omega_{{\ell_{t} }} = q(x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} )q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} ) > e^{ - 1}\) indicates the current fit to xt is bigger than this threshold and increasing \(\omega_{{\ell_{t} }} \ln \omega_{{\ell_{t} }}\) enhances learning by \(q\left( {x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} } \right)q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )\) to fit xt; while a value \(\omega_{{\ell_{t} }} < e^{ - 1}\) indicates that this fit is below a threshold and increasing \(\omega_{{\ell_{t} }} \ln \omega_{{\ell_{t} }}\) actually reduces this fit, i.e., a de-learning occurs. This is similar to the RPCL learning.

For the existing Bayes approaches, it is crucial to choosing an appropriate prior, which is usually a difficult task, while an inappropriate prior may deteriorate the performance of model selection seriously. Without any priors on the parameters, Bayes approaches degenerate to the maximum likelihood learning, while the BYY harny learning is still capable of automatic model selection. Also in Table 1, if a priori distribution q(θ|Ξq) is also considered, the performances of BYY harmony learning will be further improved. A simple choice of q(θ|Ξq) is a Jeffreys prior, for which there is no parameter Ξq. Alternatively, we may also consider a parametric distribution. Typically, a priori q(θ|Ξq) and a posteriori \(p(\theta | X_{N} ,\varXi_{p} )\) are either jointly a conjugate parametric pair or approximately two parametric distributions with each having a set of hyper-parameters, namely, Ξp,Ξq. Actually, a hyper-priori q(Ξ) is further considered for \(\varXi = \left\{ {\varXi_{p} , \varXi_{q} } \right\}\), for which q(Ξ) is a distribution usually with no more prior, e.g., by a Jeffreys prior.

The implementation of maximizing H(p||q) is featured by jointly determining \(\varTheta_{{\varvec{k} }}\) and \(\varvec{k}\), namely

$$\max_{{\varvec{k},\varTheta_{\varvec{k}} }} H\left( {\varTheta_{{\varvec{k} }} |X_{N} } \right).$$
(47)

Moreover, determining \(\varTheta_{{\varvec{k} }}\) further consists of determining \(\theta_{\varvec{k}}\) and \(\varXi_{{\varvec{k} }}\) (if any), as well as updating yt, ℓt per sample xt. Generally, the implementation of Eq. (47) is an alternative iterative process that consists of Step yℓ for updating yt, ℓt, Step θ for parameter learning, Step Ξ for learning hyper-parameters (if any), and Step \(\varvec{k}\) for model selection. This process is featured by apex approximation, manifold shrinking, and balanced operation. Readers are referred to Sect. 4.3 in Xu (2012) for a recent systematic overview on major issues about the BYY harmony learning implementation and to Sect. 4.3 in Xu (2010) for further supplementary materials. Considering two typical learning tasks, readers are referred to Sect. 2 in Xu (2012) and Sect. 3 in Xu (2010) for the BYY harmony learning algorithms on Gaussian mixture and factor analysis as well as their extensions.

Learning implementation: gradient algorithms versus EM-like algorithms

The maximization by Eq. (47) can be implemented by different types of learning algorithms. The simplest and widely applicable type is featured by the following gradient based updating:

$$\varTheta_{\varvec{k}}^{\rm new} \leftarrow \varTheta_{\varvec{k}}^{\rm old} + \Delta \varTheta_{\varvec{k}} \in D_{{\varTheta_{\varvec{k}} }} , {{\Delta \varTheta }}_{\varvec{k}} \propto \nabla_{{\varTheta_{\varvec{k}} \in D_{{\varTheta_{\varvec{k}} }} }} H\left( {\varTheta_{{\varvec{k} }} |_{ } X_{N} } \right),$$
(48)

where \({{\Delta }}u \propto g_{u}\) means \({{\Delta }}u = {{\gamma }}g_{u}\) with a small γ > 0, \(\nabla_{{u \in D_{u} }} f\left( u \right)\) is the gradient of f(u) with respect to u within the domain Du of u, and \(u + {{\Delta }}u \in D_{u}\) means updating within the domain Du of u. In the sequel, the use of \({{\Delta }}u \propto g_{u}\) includes the updating \(u_{{}}^{\text{new}} = u_{{}}^{\text{old}} + {{\Delta u }} \in D_{u}\) even without writing it explicitly. For those choices of \(H\left( {\varTheta_{{\varvec{k} }} |_{ } X_{N} } \right)\) in Table 1, if integrals are involved, we need to first handle the integrals and then take gradient on a mathematical expression without integrals, for which we approximately use a Taylor expansion around a maximal point up to the second order. Readers are referred to Sect. 4.3 in Xu (2012) for further details.

To show how a BYY harmony learning algorithm is obtained via the gradient based updating by Eq. (48). Further details are provided on learning the following alternative mixture-of-experts:

$$\begin{aligned} p(x_{t} |\varvec{x}_{t-1}^{q} ,\theta ) & = \mathop \sum \limits_{j = 1}^{k} P(j|x_{\text{t-1}} - \mu_{j,t-1} ,\theta )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} ), \\ P(j|x_{\text{t-1}} - \mu_{j,t-1} ,\theta ) & = \frac{{\alpha_{j} G(x_{\text{t-1}} - \mu_{j,t-1} |0, \sigma_{j,t-1}^{2} )}}{{\mathop \sum \nolimits_{j = 1}^{k} \alpha_{j} G(x_{\text{t-1}} - \mu_{j,t-1} |0, \sigma_{j,t-1}^{2} )}}, \\ \end{aligned}$$
(49)

which comes from Eqs. (10), (11) and (12), while μj,t comes from the GARCH model given by Eq. (5). To develop algorithms for the ML learning by Eq. (16)(c) and the RPCL learning by Eq. (18), we consider the following likelihood:

$$\begin{aligned} L(\{ x_{t} \}_{t = 1}^{N} |\varTheta ) & = \mathop \sum \nolimits_{t} \ln \left\{ {\mathop \sum \limits_{j = 1}^{k} \alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )} \right\}, \\ {\text{and }}\pi_{j,t} \left( {\theta_{j} } \right) & = \ln \{ \alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )\} . \\ \end{aligned}$$
(50)

Instead of maximizing the likelihood, learning algorithm is derived for maximizing

$$\begin{aligned} H(p||q) & = \mathop \smallint \nolimits p(\theta |X_{N} ,\varXi_{p} )H(\varTheta_{\varvec{k}} |X_{N} )\,{\text{d}}\theta \\ H(\varTheta_{\varvec{k}} |X_{N} ) & = \ln [q(\theta |\varXi_{q} )q\left( \varXi \right)] + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} p_{t,t - 1} (j|\theta )\pi_{j,t} \left( {\theta_{j} } \right), \\ p_{t,t - 1} (j|\theta ) & = \frac{{\alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )}}{{\mathop \sum \nolimits_{j = 1}^{k} \alpha_{j} G(x_{t - 1} - \mu_{j,t - 1} |0, \sigma_{j,t - 1}^{2} )G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )}}, \\ \end{aligned}$$
(51)

where q(θ|Ξq) is a priori distribution typically in a least redundant factorization as follows:

$$\begin{aligned} &q\left( {\theta |\varXi_{q} } \right) = q\left( {\left\{ {\alpha_{j} } \right\}_{j = 1}^{k} } \right) \prod _{j,i} q\left( {a_{j,i} } \right) \prod _{j,i} q\left( {\beta_{j,i} } \right) \prod _{j,i} q\left( {\omega_{j,i} } \right), \hfill \\& {\text{Usually, we }}\,{\text{have}} \hfill \\ &q(\{ \alpha_{j} \}_{j = 1}^{k} ):{\text{Dirichlet}}, \hfill \\ &q\left( {\beta_{j,i} } \right) , q\left( {\omega_{j,i} } \right) {:}\, {\text{nonnegative densities, e }} . {\text{g}} . , {\text{exponential}}\, {\text{or}}\, {\text{gamma,}} \hfill \\ &q\left( {a_{j,i} } \right) {:} {\text{ Gaussian }}\,{\text{or}}\, {\text{Laplacian, e}} . {\text{g}} . , {\text{a }}\,{\text{Gaussian }}\,G(a_{j,i} |0,\rho_{j,i}^{2} ) \hfill \\ &{\text{with }}q\left( {\rho_{j,i}^{2} } \right) {\text{being }}\,{\text{a}}\,{\text{Jeffreys }}\,{\text{prior}}\, {\text{or }}\,{\text{an }}\,{\text{inverse}}\, {\text{gamma}}. \hfill \\ \end{aligned}$$
(52)

Alternatively, each factor may be simply a Jeffreys prior. The posterior p(θ|XNΞp) also have choices. First, p(θ|XNΞp) and q(θ|Ξq) are a conjugate pair such that the integral over θ can be handled analytically; see Sect. 4.3 of Xu (2012). Second, we may simply consider that p(θ|XNΞp) is free of structure and maximizing H(p||q) with respect to p(θ|XNΞp) is simplified into the maximization of \(H(\varTheta_{\varvec{k}} |X_{N} )\) with respect to \(\varTheta_{\varvec{k}} .\) It follows from Eq. (48) that we consider the following gradient updating

$$\begin{aligned} \Delta \phi & \propto \nabla_{\phi } H\left( {\varTheta_{\varvec{k}}^{old} |X_{N} } \right), \phi \subset {{\varTheta }}_{\varvec{k}} = \left\{ {\theta ,\varXi_{\varvec{k}} } \right\}, \\ \nabla_{\phi } H\left( {\varTheta_{\varvec{k}} |X_{N} } \right) & = g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \rho_{j,t} \left( \theta \right)\nabla_{\phi } \pi_{j,t} \left( \theta \right), \\ g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) & = \nabla_{\phi } \ln [q(\theta |\varXi_{q} )q\left( \varXi \right)], \\ \rho_{j,t} \left( \theta \right) & = p_{t,t - 1} \left( {j |\theta } \right)\left[ {1 + \Delta \pi_{j,t} \left( \theta \right)} \right], \\ {{\Delta }}\pi_{j,t} \left( \theta \right) & = \pi_{j,t} \left( {\theta_{j} } \right) - \mathop \sum \limits_{j = 1}^{k} p_{t,t - 1} (j|\theta )\pi_{j,t} \left( {\theta_{j} } \right), \\ \end{aligned}$$
(53)

where ϕ is a subset of \(\varTheta_{\varvec{k}} = \left\{ {\theta ,\varXi_{\varvec{k}} } \right\}\), e.g., either of \(\left\{ {\varvec{a}_{j} } \right\},\left\{ {\mu_{j} } \right\},\left\{ {\varvec{b}_{j} } \right\},\left\{ {\varvec{w}_{j} } \right\}, \ldots {\text{etc}}.\) One particular example of ϕ is \(\varvec{\alpha}= [\alpha_{1} , \ldots ,\alpha_{k} ]^{\text{T}}\) subject to each αj ≥ 0 and \(\varvec{\alpha}^{\text{T}} 1 = 1\) with 1 = [1, …, 1]T, for which we get \(\varvec{\alpha}\) via updating \(\varvec{c} = [c_{1} , \ldots ,c_{k} ]^{\text{T}}\) as follows:

$$\begin{aligned} &\alpha_{j} = e^{{c_{j} }} /\sum\nolimits_{\ell } {e^{{c_{\ell } }} } , {{\Delta }}{\mathbf{c}} \propto \nabla_{\varvec{c}} H(\theta ,\varXi_{\varvec{k}} |X_{N} ) = \left( {I -\varvec{\alpha}^{\text{old}} 1^{\text{T}} } \right){\text{diag}}\left[ {\mathop \sum \limits_{t} p_{1,t} , \ldots ,\mathop \sum \limits_{t} p_{k,t} } \right], \hfill \\ & {\mathbf{If}}\, {\mathbf{a}}\ \alpha_{j} \to 0, \,{\mathbf{discard }}\,{\mathbf{the}}\,{\mathbf{ corresponding }}\,{\mathbf{structure}}\,{\mathbf{ and}}\,{\mathbf{ its}} \,\theta_{j} .\hfill \\ \end{aligned}$$
(54)

As addressed in Eq. (5) in Xu (2010) and in Sect. 4.3.2 of Xu (2012), the maximization of Eq. (47) has a mechanism that pushes αj → 0 if the corresponding expert is extra, i.e., automatic model selection occurs. Each of nonnegative parameters in \(\left\{ {\varvec{b}_{j} } \right\},\left\{ {\varvec{w}_{j} } \right\}\) may also be updated in a similar way, e.g., considering ξ = v2 or ξ = exp (v) such that ξ is updated via \(\Delta v \propto \nabla_{v} H(\varTheta_{\varvec{k}}^{\text{old}} |X_{N} ).\) With the help of the priories \(q\left( {\beta_{j,i} } \right)\) and q(ωj,i) in Eq. (52), the maximization of Eq. (47) also pushes βj,i → 0 and ωj,i → 0 if some order of the GARCH part in Eq. (4) and Eq. (5) is extra. Moreover, with help of the priori q(aj,i) in Eq. (52), the maximization of Eq. (47) also pushes \(\rho_{j,i}^{ 2} \to 0\) if some order of the AR part in Eq. (4) and Eq. (5) is extra.

The learning implementation by Eq. (53) covers not only the gradient based ML learning by simply setting Δπj,t(θ old j ) = 0 in the Yang step, but also the RPCL learning algorithm simply with pj,t given by Eq. (18). Moreover, setting \(\varvec{w}_{i} = 0\) leads to learning a mixture of ARCH models, while setting \(\varvec{w}_{i} = 0\) and \(\varvec{b}_{i} = 0\) degenerates to learning a mixture of AR models.

For implementing the ML learning, it also been widely regarded that the EM algorithm is preferred over the gradient-based algorithm (Redner and Walker 1984; Xu and Jordan 1996). In addition to the gradient-based implementation by Eq. (53), the BYY harmony learning may also be implemented by the following EM-like procedure:

$$\begin{aligned}& {\text{Yang }}\,{\text{Step: }}p_{j,t} = \rho_{j,t} \left( {\theta^{\text{old}} } \right), {\text{see Eq}}.\,\left( { 5 3} \right), \hfill \\ &{\text{Ying }}\, {\text{Step: Let}}\ \tilde{\varTheta }_{\varvec{k}} = \varTheta_{k} - \phi \ \ {\text{and}}\ \ \tilde{\theta } = \theta - \phi , \hfill \\ &{\text{Solve}}\, {\text{the }}\,{\text{root }}\phi^{*} {\text{of }}\chi \left( \phi \right) = 0\, {\text{or }}\,{\text{approxaimtely }}\left( {{\text{if }}\,{\text{difficult}}} \right), \hfill \\ &\chi \left( \phi \right) = g_{\phi } \left( {{\tilde{{\varTheta }}}_{k}^{{ {\text{old}}}} \mathop \cup \nolimits \phi } \right) + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} p_{j,t} \nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }_{{}}^{\text{old}} \mathop \cup \nolimits \phi } \right), \hfill \\ &{\text{Then, update }}\phi^{\text{new}} = \phi^{*} , \hfill \\ \end{aligned}$$
(55)

where AB denotes the complement of A with respect to B, i.e., \(\varvec{A} {-} \varvec{B} = \left\{ {x \in \varvec{A}\left| {x \notin \varvec{B}} \right.} \right\}\). When the root ϕ* of χ(ϕ) = 0 is solved analytically, setting Δπj,t(θ) = 0 makes Eq. (53) degenerate to the EM algorithm for the ML learning if \(g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) = 0\) or the Bayes learning if \(g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) \ne 0\). Generally, the algorithm by Eq. (55) is different from the EM algorithm by the factor 1 + Δπj,t(θ), which takes an important role in making model selection. However, the EM algorithm is guaranteed to converge (Redner and Walker 1984), while the factor 1 + Δπj,t(θ) makes the Ying–Yang iteration lose such a guarantee.

Efforts are made on remedying this weakness. One simple way is replacing ϕnew = ϕ* in Eq. (55) by the following linear combination

$$\phi^{\text{new}} = \phi^{\text{old}} + \eta \left( {\phi^{*} - \phi^{\text{old}} } \right), \quad 0\le \eta \le 1.$$
(56)

E.g., see Box 3 and Remark (c) in Fig. 7 and Box 7 in Fig. 8 of Xu (2010). However, how to choose an appropriate 0 ≤ η ≤ 1 remains a problem, which can be handled in one of the following two ways:

  • Initialize η ≤ 1, get ϕnew by Eq. (56) and check whether \(H(\tilde{\varTheta }_{k}^{{o{\text{ld}}}} \mathop \cup \phi^{new} |X_{N} ) > H(\tilde{\varTheta }_{k}^{{o{\text{ld}}}} \mathop \cup \nolimits \phi^{old} |X_{N} )\)

    If yes, we move to the next Ying step in Eq. (55), otherwise reduce η in some way to get ϕnew and make such a check again.

  • Seek an optimal η* that maximizes \(H\left( \eta \right) = H(\tilde{\varTheta }_{k}^{\text{old}} \mathop \cup \left[ {\phi^{\text{old}} + \eta \left( {\phi^{*} - \phi^{\text{old}} } \right)} \right]|X_{N} )\), which can be handled by one of many techniques for one variable optimization. One example is solving the root of dH(η)/dη = 0.

Alternatively, another way to get ϕnew from ϕ* and ϕold is a reconsideration of \(\nabla_{\phi } H(\varTheta_{\varvec{k}} |X_{N} )\) in Eq. (53). Making a first order Taylor expansion of ρj,t(θ) around θold and of ϕπj,t(θ) around ϕ*, we consider

$$\begin{aligned} & \rho_{j,t} \left( \theta \right)\nabla_{\phi } \pi_{j,t} \left( \theta \right) \approx \\ & \quad \left[ {\rho_{j,t} \left( {\theta^{\text{old}} } \right) + \nabla_{\phi } \rho_{j,t} \left( {\theta^{\text{old}} )^{\rm T} \left( {\phi - \varphi^{old} } \right)} \right]} \right[\nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right) + \nabla_{{\phi \phi^{\rm T} }} \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right)\left( {\phi - \varphi^{*} } \right)] \\ & \approx \rho_{j,t} \left( {\theta^{\text{old}} } \right)\nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right) + U_{j,t} \left( {\phi - \varphi^{\text{old}} } \right) + V_{j,t} \left( {\phi - \varphi^{*} } \right) \\ & U_{j,t} = \nabla_{\phi } \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right)\nabla_{\phi } \rho_{j,t} (\theta^{\text{old}} )^{\rm T},\ \ V_{j,t} = \rho_{j,t} \left( {\theta^{\text{old}} } \right)\nabla_{{\phi \phi^{\rm T} }} \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right), \\ \end{aligned}$$

where the second ≈ comes from dropping the second order term \(\left( {\phi - \varphi^{\text{old}} } \right)^{\rm T} \nabla_{\phi } \rho_{j,t} \left( {\theta^{\text{old}} } \right)\;\nabla_{{\phi \phi^{\rm T} }} \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right)\left( {\phi - \varphi^{*} } \right)\). Taking the sum over jt, the counterpart of the first term becomes \(\chi \left( {\phi^{*} } \right) = 0\) and thus disappears, from which we are led to

$$\psi \left( \phi \right) = \nabla_{\phi } H(\varTheta_{\varvec{k}} |X_{N} ) \approx g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) + \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \left[ {U_{j,t} \left( {\phi - \varphi^{\text{old}} } \right) + V_{j,t} \left( {\phi - \varphi^{*} } \right)} \right].$$
(57)

Then, we solve ψ(ϕnew) = 0 to get \(\phi^{\text{new}}\) from ϕ* and ϕold. Particularly, when \(g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) = 0\) we simply have

$$\phi^{\text{new}} = \left[ {\mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \left( {U_{j,t} + V_{j,t} } \right)} \right]^{ - 1} \mathop \sum \limits_{t} \mathop \sum \limits_{j = 1}^{k} \left( {U_{j,t} \varphi^{\text{old}} + V_{j,t} \varphi^{*} } \right).$$
(58)

It is still a linear function of ϕ* and ϕold, but becomes much advanced than the one by Eq. (56).

Linear causal analyses

Path analyses and a recent development on ρ-diagram

Path analyses is one earliest causal analysis approach, proposed around 1918 by Sewall Wright who made its developments more extensively in the 1920s (Wright 1921, 1934). It has been not only further investigated in the formulation of structural equation modeling (SEM) (Ullman 2006; Hooper et al. 2008; Pearl 2010a; Kline 2015) with wide applications, but also found its uses in many complex modeling areas, including biology, psychology, sociology, and econometrics. Details are left to a vast volume of publications in literature. Here, we introduce a recent development on a modified formulation named ρ-diagram (Xu 2018).

The formulation considers a directed acyclic graph (DAG) or Bayesian networks, with visible nodes x1, x2,, xn and hidden nodes w1,,wm. Each xi is normalized to be zero mean and unit variance and each wj is assumed to be zero mean and unit variance too; while each edge is associated with the correlation coefficient between its two nodes. In other words, such a diagram is completely defined by pairwise correlation coefficients, and thus called ρ-diagram in that each correlation coefficient is denoted by ρ shortly. Being different from the classical procedure for path analyses, namely getting topology by prior, estimating unknown parameters and causal effects, and making model-fit assessment on alternative models, a TPC procedure is suggested for ρ-diagram (Xu 2018), which begins at Topology discovery from data based on ρ-diagram, and then makes Parameter estimation and Causality embedded model-fit assessment.

Topology discovery is based on equations that are obtained from path tracing in a way similar to Wright’s system of tracing rules. The difference is that unknowns in equations involve only the within-diagram ρ-variables, while knowns are pairwise correlation r-coefficients obtained from visible nodes x1, x2,…, xn, subject to the constraints that all the ρ-variables vary between [− 1,+ 1]. We discover a topology underlying data by checking whether a set of constrained equations is deterministically solved, that is, having (1) no solution, (2) a unique solution (or few solutions), and (3) infinite many of solutions.

For details refer to Xu (2018). Here, an illustration is made on topologies of 3-node diagrams, as illustrated in Fig. 8. Given a diagram with nodes x, y, z, the simplest case is illustrated in Fig. 8a, featured by that every pairwise correlation is zero or there is only one pair that gets rij ≠ 0, which can be directly identified by observing rij, i,j {x,y,z}. Shown in Fig. 8b are topologies that have two edges. The first one gets two edges in a fork, which can be identified by observing rij = 0 for only one pair while rij ≠ 0 for other two pairs. The other topologies describes the causality from conditional independence analysis, which can be identified by observing rikrkj =  rij ≠ 0 i,j {x,y,z} on all the permutations of x, y, z.

Fig. 8
figure 8

Causal analyses: path analysis on ρ-diagram and causal potential theory

Shown in Fig. 8c are two typical topologies of widely encountered causal structure called cofounder. Via path tracing, the following equations are obtained:

$$\rho_{ki} + \rho_{kj} \rho_{ji} = r_{ki} ,\ \ \rho_{ji} + \rho_{kj} \rho_{ki} = r_{ji} ,\ \ \rho_{kj} = r_{kj} ; \quad - 1 \le \rho_{ji} , \rho_{kj} ,\rho_{ki} \le 1$$
(59)

As shown in Fig. 8c, we may check whether two lines get cross within the dashed box. If yes, a cofounder is identified in either of two topologies on the bottom of Fig. 8c. However, the direction between j and k cannot be identified. Even so, the direct causal direction and effect

$$\rho_{ji} = {{\left( {r_{ji} - r_{kj} r_{ki} } \right)} \mathord{\left/ {\vphantom {{\left( {r_{ji} - r_{kj} r_{ki} } \right)} {\left( {1 - r_{kj}^{2} } \right)}}} \right. \kern-0pt} {\left( {1 - r_{kj}^{2} } \right)}}$$
(60)

is uniquely determined, i.e., the cofounder effect can be remedied.

If two lines do not intersect within the box, one may further check one other permutation of labels i, j, k. It is unlikely that two different permutations are both identified because it merely happens when not only ρ = r holds on two edges but also four linear equations have consistent solution for unknowns. If no permutation can be identified, it means that there is not such a cofounder causality underlying data. However, there may be still other causality. On one hand, we may check whether there is some causality in types of Fig. 8a, b. On the other hand, we may continue to diagrams with four nodes or more.

Causal potential theory

As already mentioned above, the direction between j and k in Fig. 8c cannot be identified. Also, edge directions in Fig. 8b cannot be identified too. There have been extensive studies on detecting causal direction and evaluating causal strength (Peters et al. 2009; Zhang and Hyvärinen 2009; Hoyer et al. 2009; Rubin and John 2011), via analyzing certain types of asymmetry between two variables X and Y. One most authoritative definition of causality is p(Y|do X = x) with ‘do X = x’ indicating the action that imposes X = x (Pearl 2010b). In these studies, causality is actually examined from a descriptive perspective.

As illustrated in Fig. 8d, possible movements that apple falls and balance loses are actually caused by physics mechanism, i.e., the law of universal gravitation and the lever principle, where causality is actually an issue of dynamics, about how movements are caused by forces that come from potential difference. It follows from the viewpoint of grand unification that we are thus motivated to believe that causality in terms of probability, information, and intelligence should be also governed by similar dynamics.

Consider the relationship described by density distribution \(p\left( {x,y} \right), \varvec{ }\) as illustrated in Fig. 8d, the quantity \({\text{E}}\left( {x,y} \right) \propto - { \ln } p\left( {x,y} \right)\) actually describes a sort of potential energy density on an infinitesimal piece dxdy, and represents a difference of potential energy density in reference of a uniform distribution on the space xy, while we can get

$$\left[ {I_{x} ,I_{y} } \right] = \left[ { - \frac{{\partial {\text{E}}\left( {x,y} \right)}}{\partial x}, - \frac{{\partial {\text{E}}\left( {x,y} \right)}}{\partial y}} \right]$$
(61)

to represent a force field that drives information flow toward the area with the lowest energy, or equivalently driving that information flows from rare occurring locations toward high occurring locations.

Changes of xy and the rates of changes are described by IxIy, respectively, and both are actually driven by the difference of potential energy density of E(xy). The problems about whether one of X, Y causes the other or whether two are mutually caused each other may be examined through IxIy. Typically, we may encounter the following cases:

$$\begin{aligned} {\text{Case}} \,O: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( x \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( y \right); \\ {\text{Case }}\,A: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( x \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( {x,y} \right); \\ {\text{Case }}\,B: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( {x,y} \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( y \right); \\ {\text{Case }}\,C: I_{x} & = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial x} = f\left( {x,y} \right),\ I_{y} = \frac{{\partial { \ln }p\left( {x,y} \right)}}{\partial y} = g\left( {x,y} \right). \\ \end{aligned}$$
(62)

For Case O, changes of x merely relates to itself, while changes of y merely relates to itself, that is, changing x is independent of change of y. For Case A, changes of x merely relates to itself, while changes of y relate to both of \(x,y,\) where we may regard that changing x causes change of y. For Case B, changes of y merely relates to itself, while changes of x relate to both of \(x,y,\) where we may regard that changing y causes change of x. For Case C, changes of xy are mutually related.

From a set of samples of xy, we may develop certain statistics to identify which case is actually encountered. Due to noise and a finite sample size, the first three cases are rarely found. What are often encountered is Case C. In such cases, we may further check whether one of xy takes a dominant role, while the other maybe ignored, that is, whether we have either or both of

$$f\left( {x,y} \right) \approx f\left( x \right), g\left( {x,y} \right) \approx g\left( y \right).$$
(63)

Further insights on causality may be obtained from this perspective, not only a pair X, Y may be identified in one of the four cases on the entire domain that x, y vary, but also a pair may be identified in one case on some subdomain but in a different case on some different subdomain. That is, causal direction may reverse, disappear, and emerge as x, y vary on different subdomains.

To be more specific, we observe two typical examples. The first considers binary x, y from

$$p\left( {x,y} \right) = p\left( {y |x} \right)p\left( x \right), \quad {\text{for}}\ x,y = 0, 1$$
(64)
$$p\left( {y |x} \right) = s^{y} \left( {bx + c} \right)\left[ {1 - s\left( {bx + c} \right)} \right]^{1 - y} ,\,q\left( x \right) = q^{x} \left( { 1 - q} \right)^{ 1- x} ,$$

where s(r) is a sigmoid function and p(y|x) describes a logistic regression, for which we get

$$\begin{aligned} I_{x} & = { \ln }\frac{q}{1 - q} + bs^{{\prime \left( {bx + c} \right)}} \left[ {\frac{y}{{s\left( {bx + c} \right)}} - \frac{1 - y}{{1 - s\left( {bx + c} \right)}}} \right] = { \ln }\frac{q}{1 - q} + b\delta , \\ \delta & = s\left( {bx + c} \right) - y,\ \ I_{y} = { \ln }\frac{{s\left( {bx + c} \right)}}{{1 - s\left( {bx + c} \right)}}. \\ \end{aligned}$$
(65)

We usually have \(\delta \approx 0\) if the logistic regression fits well, thus it leads to Case A above, i.e., the causal direction is x → y, which is consistent to our existing understanding on this model.

The second example considers p(x,y) from a joint density of Gaussian variables x, y with zero mean and unit variance as well as their correlation coefficient ρ. It follows that

$$-I_{x} = x + \rho y,\ \ -I_{y} = y + \rho x,$$
(66)

which leads to Case 0 when ρ = 0, Case A when ρy ≈ 0, Case B when ρx ≈ 0, and Case C in general. That is, we are unable to identify causal direction on the entire domain, which is also consistent to our existing understanding. Interestingly, we get new insight that it is possible to detect causal direction in some particular subdomains.It also may deserve to extend these studies to consider a density \(p\left( {\varvec{x},\varvec{y}} \right) \varvec{ }\) with \(\varvec{x},\varvec{y}\) being vectors such that we examine causality between two groups of variables.

SEM and its relations to modulated TFA-APT and nGCH-driven M-TFA-O

In its early stages of developments, modeling by equations in path analyses and structural equation modeling (SEM) were used without a particular clarification. In recent decades, SEM is gradually developed into the following formulation (Ullman 2006; Kline 2016):

$$\varvec{x} = {\varvec{\Lambda}}_{\varvec{x}}\varvec{\xi}+\varvec{\delta},\ \varvec{ y} = {\varvec{\Lambda}}_{\varvec{y}}\varvec{\eta}+\varvec{\varepsilon},\ \varvec{ \eta } = \varvec{B\eta } + {\varvec{\Gamma}}\varvec{\xi}+ \varvec{\varsigma }$$
(67)

To compare modulated TFA-APT and nGCH-driven M-TFA-O, we observe the following equations from Eq. (42) and in Eq. (43):

$$\varvec{r}_{\varvec{t}} = \varvec{a} + \varvec{Af}_{\varvec{t}} + \varvec{e}_{\varvec{t}},\ \varvec{ f}_{\varvec{t}} = \varvec{Bf}_{{\varvec{t} - 1}} + \varvec{Hm}_{\varvec{t}} +\varvec{\varepsilon}_{\varvec{t}},\ \varvec{ m}_{\varvec{t}} = \varvec{Cv}_{\varvec{t}} +\varvec{\eta}_{\varvec{t}} ,$$

Putting the last one into the second one, we may rewrite

$$\begin{aligned} \varvec{ f}_{\varvec{t}} & = \varvec{Bf}_{{\varvec{t} - 1}} + \varvec{HCv}_{\varvec{t}} + \varvec{H\eta }_{\varvec{t}} +\varvec{\varepsilon}_{\varvec{t}} ,\varvec{ } \\ \varvec{r}_{\varvec{t}} & = \varvec{a} + \varvec{Af}_{\varvec{t}} + \varvec{e}_{\varvec{t}},\ \varvec{ m}_{\varvec{t}} = \varvec{Cv}_{\varvec{t}} +\varvec{\eta}_{\varvec{t}} . \\ \end{aligned}$$
(68)

Table 2 compares the notations in Eqs. (62) and (63).

Table 2 In comparison with modulated TFA-APT and GMCH-driven M-TFA

The two are actually the same at the special case H = 0. Generally, we observe that modulated TFA-APT may be regarded as a variant or extension of SEM.

Coming from different perspectives, SEM and the modulated TFA–APT aim at causal analysis in a closely related way. Both consist of FA as basic ingredient that suffers the intrinsic rotation indeterminacy by Eq. (40). In path analysis and SEM study, the problem is avoided by making hidden factors ft and/or the elements of A partly known with human-aide. While in the modulated TFA-APT, the problem is solved by considering both independence cross hidden factors and temporal dependence Bft1 among each factor. We may combine the ideas to improve each other. On one hand, SEM motivates us to prune away extra edges that correspond to elements of A, which may be implemented by sparse learning. On the other hand, we may improve SEM by considering temporal dependence among endogenous factors.

Moreover, rotation indeterminacy may also be removed by changing the driving noise of hidden factors from Gaussian q(ɛ ( j) t ) into non-Gaussian q(ɛ ( j) t ) (Xu 2001, 2004). Furthermore, conditional heteroskedasticity (Chiu and Xu 2003) has also been included in the driving noise to encode non-stationarity. The two points are actually included in Item (c) in Eq. (43), which extends the modulated TFA-APT into nGCH-driven M-TFA-O, which may also be used to improve SEM.  Furthermore,  a non-diagonal matrix B may be considered to replace a diagnal matrix B in TFA, such that Granger causality like problem (Granger 1969) may be taken in consideration together with the previous cofounder problem further examined.

Abbreviations

AIC:

Akaike information criterion

APT:

arbitrage pricing theory

AR:

autoregressive

ARCH:

Autoregressive Conditional Heteroskedasticity

ARIMA:

autoregressive integrated moving average

ARMA:

autoregressive–moving average

BYY:

Bayesian Ying Yang

BIC:

Bayesian information criterion

CAIC:

consistent AIC

CAPM:

capital asset pricing model

EMH:

efficient market hypothesis

HMM:

hidden Markov model

GARCH:

generalized ARCH

LDS:

linear dynamical system

LR:

likelihood ratio

MDL:

minimum description length

ME:

mixture-of-experts

ML:

maximum likelihood

MLFA:

maximum likelihood factor analysis

MML:

minimum message length

MUV:

mixture using variance

NFA:

non-Gaussian factor analyses

NRBF:

normalized radial basis function

ρ-diagram:

a diagram defined by a set of pairwise correlation coefficients

RPCL:

rival penalized competitive learning

SEM:

structural equation modeling

SSM:

state space model

TFA:

temporal factor analysis

VAR:

vector autoregressive

VB:

variational Bayes

References

  • Abeysekera SP, Mahajan A (1987) A test of the APT in pricing UK stocks. J Account Finance 17(3):377–391

    Google Scholar 

  • Azeez AA, Yonezawa Y (2006) Macroeconomic factors and the empirical content of the Arbitrage Pricing Theory in the Japanese stock market. Jpn World Econ 18(4):568–591

    Article  Google Scholar 

  • Azoff ME (1994) Neural network time series forecasting of financial markets. Wiley, New York

    Google Scholar 

  • Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econom 31:307–327

    Article  MathSciNet  MATH  Google Scholar 

  • Box G, Jenkins G (1970) Time series analysis: forecasting and control. Holden-Day, San Francisco

    MATH  Google Scholar 

  • Brown SJ (1989) The number of factors in security returns. J Finance 44(5):1247–1262

    Article  Google Scholar 

  • Chamberlain G, Rothschild M (1983) Arbitrage, factor structure, and mean–variance analysis on large asset markets. Econometrica 51(5):1281–1304

    Article  MathSciNet  MATH  Google Scholar 

  • Chen NF, Roll R, Ross S (1986) Economic forces and the stock market. J Bus 59(3):383–403

    Article  Google Scholar 

  • Cheung YM, Leung WM, Xu L (1996) Combination of buffered back-propagation and RPCL-CLP by mixture-of-experts model for foreign exchange rate forecasting. In: Proceedings of 3rd international conference on neural networks in the capital markets, London, UK, Oct 11–13, 1996. World Scientific Pub, Singapore, pp 554–563

  • Cheung Y, Leung WM, Xu L (1997) Adaptive rival penalized competitive learning and combined linear predictor model for financial forecast and investment. Int J Neural Syst 8:517–534

    Article  Google Scholar 

  • Chiu KC, Xu L (2002) Stock price and index forecasting by arbitrage pricing theory-based Gaussian TFA learning. In: Yin HJ (ed) Lecture notes in computer sciences (LNCS), vol 2412. Springer, Berlin, pp 366–371

  • Chiu KC, Xu L (2002) A comparative study of Gaussian TFA learning and statistical tests on the factor number in APT. In: Proceedings of international joint conference on neural networks 2002 (IJCNN ‘02), Honolulu, Hawaii, USA, May 12–17, 2002. pp 2243–2248

  • Chiu KC, Xu L (2003) Stock forecasting by ARCH driven Gaussian TFA and alternative mixture experts models. In: Proceedings of 3rd international workshop on computational intelligence in economics and finance, North Carolina, USA, Sept 26–30. pp 1096–1099

  • Chiu KC, Xu L (2003) On generalized arbitrage pricing theory analysis: empirical investigation of the macroeconomics modulated independent state–space model. In: Proceedings of 2003 international conference on computational intelligence for financial engineering, Hong Kong, March 20–23. pp 139–144

  • Chiu KC, Xu L (2004a) Arbitrage pricing theory based Gaussian temporal factor analysis for adaptive portfolio management. J Decis Support Syst 37:485–500

    Article  Google Scholar 

  • Chiu KC, Xu L (2004b) NFA for factor number determination in APT. Int J Theor Appl Finance 7:253–267

    Article  MATH  Google Scholar 

  • Choey M, Weigend AS (1997) Nonlinear trading models through Sharpe ratio optimization. Int J Neural Syst 8(3):417–431

    Article  Google Scholar 

  • Dhrymes PJ, Friend I, Gultekin B (1984) A critical reexamination of the empirical evidence on the arbitrage pricing theory. J Finance 39(2):323–346

    Article  Google Scholar 

  • Engle RF (1982) Autoregressive conditional heteroscedasticity with estimates of variance of United Kingdom Inflation. Econometrica 50:987–1008

    Article  MathSciNet  MATH  Google Scholar 

  • Engle RF, Granger CWJ (1987) Co-integration and error–correction: representation, estimation and testing. Econometrica 55(2):251–276

    Article  MathSciNet  MATH  Google Scholar 

  • Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396

    Article  Google Scholar 

  • Fishburn PC (1977) Mean-risk analysis with risk associated with below-target returns. Am Econ Rev 67(2):116–126

    Google Scholar 

  • Gately E (1995) Neural networks for financial forecasting. John Wiley & Sons, New York

    Google Scholar 

  • Ghahramani Z, Hinton GE (2000) Variational learning for switching state–space models. Neural Comput 12(4):831–864

    Article  Google Scholar 

  • Granger CWJ (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37(3):424–438

    Article  MATH  Google Scholar 

  • Hooper D, Coughlan J, Mullen MR (2008) Structural equation modelling: guidelines for determining model fit. Electron J Bus Res Methods 6(1):53–60

    Google Scholar 

  • Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. In: Advances in neural information processing systems, pp 689–696

  • Hung KK, Cheung CC, Xu L (2000) New Sharpe-ratio-related methods for portfolio selection. In: IEEE/IAFE/INFORMS 2000 conference on computational intelligence for financial engineering, New York City, USA, March 26–28, pp 34–37

  • Hung KK, Cheung Y, Xu L (2003) An extended ASLD trading system to enhance portfolio management. IEEE Trans Neural Networks 14:413–425

    Article  Google Scholar 

  • Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3:79–87

    Article  Google Scholar 

  • Jangmin O, Jongwoo L, Lee JW, Zhang BT (2006) Adaptive stock trading with dynamic asset allocation using reinforcement learning Inform Sci 176(15):2121–2147

    Google Scholar 

  • Jordan MI, Xu L (1995) Convergence results for the EM approach to mixtures of experts architectures. Neural Netw 8:1409–1431

    Article  Google Scholar 

  • Kline RB (2015) Principles and practice of structural equation modeling, 4th edn. Guilford Publications, New York

    MATH  Google Scholar 

  • Kwok HY, Chen CM, Xu L (1998) Comparison between mixture of ARMA and mixture of AR model with application to time series forecasting. In: Proceedings of international conference on neural information processing, Kitakyushu, Japan, October 21–23, vol 2. pp 1049–1052

  • Leontaritis IJ, Billings SA (1985) Input-output parametric models for non-linear systems Part I: deterministic non-linear systems and Part II: stochastic non-linear systems. Int J Control 41:303–344

    Article  MATH  Google Scholar 

  • Leung WM, Cheung Y, Xu L (1997) Application of mixture of experts models to nonlinear financial forecasting. In: Caldwell RB (ed) Nonlinear financial forecasting: proceedings of the first INFFC, (Finance & Technology Publishing, 1997), pp 153–168

  • Markowitz HM (1952) Portfolio selection. J Finance 7(1):77–91

    Google Scholar 

  • Markowitz HM (1959) Portfolio selection: efficient diversification of investments. John Wiley & Sons, New York

    Google Scholar 

  • McGrory CA, Titterington DM (2007) Variational approximations in Bayesian model selection for finite mixture distributions. Comput Stat Data Anal 51(11):5352–5367

    Article  MathSciNet  MATH  Google Scholar 

  • Moody J, Saffell M (2001) Q learning to trade via direct reinforcement. IEEE Trans Neural Networks 12(4):875–889

    Article  Google Scholar 

  • Moody J, Lizhong W, Liao Y, Saffell M (1998) Performance functions and reinforcement learning for trading systems and portfolios. J Forecasting 17:441–470

    Article  Google Scholar 

  • Neuneier R (1996) Optimal asset allocation using adaptive dynamic programming. In: Touretzky DS (ed) Advances in neural information processing systems, 8th edn. MIT Press, Cambridge, pp 952–958

    Google Scholar 

  • Pearl J (2010) An introduction to causal inference. Int J Biostat 6(2):1–62

    Article  MathSciNet  Google Scholar 

  • Perrone MP (1994) Putting it all together: methods for combining neural networks. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems. Morgan Kaufmann, San Francisco, pp 1188–1189

    Google Scholar 

  • Perrone MP, Cooper LN (1993) When networks disagree: ensemble methods for neural networks. In: Mammone RJ (ed) Neural networks for speech and image processing. Chapman & Hall, New York, pp 126–142

    Google Scholar 

  • Peters J, Janzing D, Gretton A, Schölkopf B (2009) Detecting the direction of causal time series. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, pp 801–808

  • Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77(2):257–286

    Article  Google Scholar 

  • Redner RA, Walker HF (1984) Mixture densities, maximum likelihood, and the EM algorithm. SIAM Rev 26:195–239

    Article  MathSciNet  MATH  Google Scholar 

  • Ross S (1976) The arbitrage theory of capital asset pricing. J Econ Theory 13(3):341–360

    Article  MathSciNet  Google Scholar 

  • Rubin DB, John L (2011) Rubin causal model. International encyclopedia of statistical science. Springer, Berlin, pp 1263–1265

    Chapter  Google Scholar 

  • Sharpe WF (1964) Capital asset prices: a theory of market equilibrium under conditions of risk. J Finance XIX(3):425–442

    Google Scholar 

  • Sharpe FW (1966) Mutual fund performance. J Bus 39(S1):119–138

    Article  Google Scholar 

  • Sharpe WF (1994) The Sharpe ratio-properly used, it can improve investment. J Portfolio Manag Fall 21:49–58

    Article  Google Scholar 

  • Shumway RH, Stoffer DS (1991) Dynamic linear models with switching. J Am Stat Assoc 86(415):763–769

    Article  MathSciNet  Google Scholar 

  • Sims C (1980) Macroeconomics and reality. Econometrica 48(1):1–48

    Article  Google Scholar 

  • Sortino FA, van der Meer R (1991) Downside risk: capturing what’s at stake in investment situations. J Portfolio Manag 17(4):27–31

    Article  Google Scholar 

  • Tang H, Chiu K-C, Xu L (2003) Finite mixture of ARMA-GARCH model for stock price prediction. In: Proceedings of 3rd international workshop on computational intelligence in economics and finance, North Carolina, USA, Sep 26–30, pp 1112–1119

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B 58(1):267–288

    MathSciNet  MATH  Google Scholar 

  • Tu S, Xu L (2011) An investigation of several typical model selection criteria for detecting the number of signals. Front Electr Electron Eng China 6(2):245–255

    Article  Google Scholar 

  • Ullman JB (2006) Structural equation modeling reviewing the basics and moving forward. J Pers Assess 87(1):35–50

    Article  Google Scholar 

  • Wang P et al (2011) Radar HRRP statistical recognition with temporal factor analysis by automatic Bayesian Ying–Yang harmony learning. Front Electr Electron Eng China 6(2):300–317

    Article  MATH  Google Scholar 

  • Westland JC (2015) Structural equation modeling: from paths to networks. Springer, New York

    Google Scholar 

  • Williams PM (1995) Bayesian regularization and pruning using a Laplace prior. Neural Comput 7(1):117–143

    Article  Google Scholar 

  • Wong WC, Yip F, Xu L (1998) Financial prediction by finite mixture GARCH model. In: Proceedings of international conference on neural information processing, Kitakyushu, Japan, Oct 21–23, 3(1998), pp 1351–1354

  • Wright S (1921) Correlation and causation. J Agric Res 20(7):557–585

    MathSciNet  Google Scholar 

  • Wright S (1934) The method of path coefficients. Ann Math Stat 5(3):161–215

    Article  MATH  Google Scholar 

  • Xu L (1994) Signal segmentation by finite mixture model and EM algorithm. In: Proceedings of international symposium on artificial neural networks, Tainan, Dec 15–17, pp 453–458

  • Xu L (1995) Channel equalization by finite mixtures and the EM algorithm. In: Proceedings of IEEE neural networks and signal processing workshop. Cambridge, MA, Aug 31–Sep 2, vol 5, pp 603–612

  • Xu L (1995) Ying–Yang machines: a Bayesian–Kullback scheme for unified learning and new results on vector quantization. In: Proceedings of the international conference on neural information processing, Beijing, China, Oct 30–Nov 3, pp 977–988 (A further version Advances in NIPS8, Touretzky DS et al (ed), MIT Press, Cambridge MA, 1996: 444–450)

  • Xu L (1997) Bayesian Ying Yang system and theory as a unified statistical learning approach: (II) from unsupervised learning to supervised learning, and temporal modeling. In: Wong KM et al (eds) Proceedings of theoretical aspects of neural computation: a multidisciplinary perspective. Springer, Berlin, pp 29–42

  • Xu L (1998) RBF nets, mixture experts, and Bayesian Ying–Yang learning. Neurocomputing 19:223–257

    Article  MATH  Google Scholar 

  • Xu L (2000) Temporal BYY learning for state space approach, hidden Markov model, and blind source separation. IEEE Trans Signal Process 48(7):2132–2144

    Article  MathSciNet  MATH  Google Scholar 

  • Xu L (2001) BYY harmony learning, independent state space and generalized APT financial analyses. IEEE Trans Neural Netw 12:822–849

    Article  Google Scholar 

  • Xu L (2002) Temporal factor analysis: stable-identifiable family, orthogonal flow learning, and automated model selection. In: Proceedings of international joint conference on neural networks. Honolulu, HI, USA, 12–17 May, pp 472–476

  • Xu L (2004) Advances on BYY harmony learning: information theoretic perspective, generalized projection geometry, and independent factor autodetermination. IEEE Trans Neural Netw 15(4):885–902

    Article  MathSciNet  Google Scholar 

  • Xu L (2007) A unified perspective and new results on RHT computing, mixture based learning, and multi-learner based problem solving. Pattern Recogn 40:2129–2153

    Article  MATH  Google Scholar 

  • Xu L (2009) Learning algorithms for RBF functions and subspace based functions. In: Olivas ES et al (eds) Handbook of research on machine learning applications and trends: algorithms, methods and techniques. IGI Global, Hershey, pp 60–94

    Google Scholar 

  • Xu L (2010) Bayesian Ying–Yang system, best harmony learning, and five action circling. J Front Electr Electron Eng China 5(3):281–328 (A special issue on Emerging Themes on Information Theory and Bayesian Approach)

    Article  MathSciNet  Google Scholar 

  • Xu L (2012) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. J Front Electr Electron Eng 7(1):147–196 (A special issue on Machine learning and intelligence science: IScIDE (C))

    Google Scholar 

  • Xu L (2018) Deep bidirectional intelligence: AlphaZero, deep IA-search, deep IA-infer, and TPC causal learning. Appl Inform 5(5):38

    Google Scholar 

  • Xu L, Amari S (2008) Combining classifiers and learning mixture of experts. In: Rabuñal Dopico JR (ed) Encyclopedia of artificial intelligence. IGI Global, Hershey, pp 318–326

    Google Scholar 

  • Xu L, Cheung Y (1997) Adaptive supervised learning decision networks for traders and portfolios. J Comput Intell Finance 5(6):11–16 (A short version also in Proceedings of IEEE-IAFE 1997 International Conference on Computational Intelligence for Financial Engineering (CIFEr), New York City, March 23-25, 1997, 206–212)

    Google Scholar 

  • Xu L, Jordan MI (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput 8(1):129–151

    Article  Google Scholar 

  • Xu L, Krzyzak A, Oja E (1992) Unsupervised and supervised classifications by rival Penalized competitive learning. In: Proceedings of 11th international conference on pattern recognition. Hague, Netherlands, Aug 30–Sep 3, pp 672–675

  • Xu L, Krzyzak A, Oja E (1993) Rival penalized competitive learning for clustering analysis, RBF net and curve detection. IEEE Trans Neural Netw 4:636–649

    Article  Google Scholar 

  • Xu L, Jordan MI, Hinton GE (1994) A modified gating network for the mixtures of experts architecture. Proceedings of 1994 world congress on neural networks, vol 2. San Diego, CA, June 4–9, pp 405–410

  • Xu L, Jordan MI, Hinton GE (1995) An alternative model for mixtures of experts. In: Tesauro G et al (eds) Advances in neural information processing systems 7. MIT Press, Cambridge, pp 633–640

    Google Scholar 

  • Zhang PG (ed) (2003) Neural networks in business forecasting, forecasting and control. IRM Press, London

    Google Scholar 

  • Zhang K, Hyvärinen A (2009) On the identifiability of the post-nonlinear causal model. Proceedings of the 25th conference on uncertainty in artificial intelligence (UAI 2009). Montreal, Canada, 2009, pp 647–655

Download references

Authors’ contributions

All from the sole author LX. The author read and approved the final manuscript.

Acknowledgements

This work was supported by the Zhi-Yuan chair professorship start-up Grant (WF220103010) from Shanghai Jiao Tong University.

Competing interests

The author declares that there is no competing interests.

Availability of data and materials

Not applicable.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Funding

WF220103010, Shanghai Jiao Tong University.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lei Xu.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Xu, L. Machine learning and causal analyses for modeling financial and economic data. Appl Inform 5, 11 (2018). https://doi.org/10.1186/s40535-018-0058-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40535-018-0058-5

Keywords