 Review
 Open Access
Machine learning and causal analyses for modeling financial and economic data
Applied Informatics volume 5, Article number: 11 (2018)
Abstract
Instead of aiming at a systematic survey, we consider further developments on several typical linear models and their mixture extensions for prediction modeling, portfolio management and market analyses. The focus is put on outlining the studies by the author’s research group, featured by (a) extensions of AR, ARCH and GARCH models into finite mixture or mixture-of-experts; (b) improvements of Sharpe ratio by maximizing the expected return and the upside volatility while minimizing the downside risk, with the help of a priori aided diversification; (c) developments of arbitrage pricing theory (APT) into temporal factor analysis (TFA)-based temporal APT, macroeconomics-modulated temporal APT and a general formulation for market modeling, together with applications to temporal prediction and dynamic portfolio management; (d) Bayesian Ying–Yang (BYY) harmony learning is adopted to implement these developments, featured with automatic model selection. After a brief introduction on BYY harmony learning, gradient-based algorithms and EM-like algorithms are provided for learning alternative mixture-of-experts-based AR, ARCH and GARCH models; and (e) path analysis for linear causal analyses is briefly reviewed, a recent development on ρ-diagram is refined for confounder discovery, and a causal potential theory is proposed. Also, further discussions are made on structural equation modeling and its relations to modulated TFA-APT and nGCH-driven MTFAO.
Introduction
Financial and economic data are naturally recorded as temporal sequences or time series, and thus one of major tasks on those data is making time series analysis. Typically, a mathematical model is obtained to describe the regression relation of the current observation from its past observations, such that the future observation is predicted. Such a prediction task has been extensively studied in both the literature of time series analysis and the literature of machine learning and neural networks.
One of the most classic tools for time series analysis is the autoregressive (AR) model or, more generally, the autoregressive–moving-average (ARMA) model, which describes a linear dependence of the current observation on past values and noise disturbances. To extend the model from stationary processes to data with an identifiable trend of polynomial growth (Box and Jenkins 1970), an initial differencing step can be applied to remove such nonstationarity. See Box 1 in Fig. 1; the autoregressive integrated moving average (ARIMA) model is used to refer to a “cascade” of this initialization and ARMA. For simplicity, we still prefer to use ARMA to refer to ARIMA by regarding such an initialization as a preprocessing stage.
In the literatures of statistics and econometrics, as outlined in Fig. 1 by Box 2, generalizations of ARMA have also been made toward Autoregressive Conditional Heteroskedasticity (ARCH) and generalized ARCH (GARCH) for considering conditional heteroskedasticity of variables (Engle 1982; Bollerslev 1986), to nonlinear ARMA for modeling nonlinear dependence (Leontaritis and Billings 1985), and Vector AR (VAR) for capturing the linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987).
The field of neural networks and machine learning (NN-ML) in economics and finance involves each of the three streams of studies. In the early stage, most efforts were put on using multilayer neural networks or recurrent networks for a sophisticated nonlinear dependence of the current observation on past values and noise disturbances, as outlined in Fig. 1 by Box 3. There are already several books on these studies (e.g., Azoff 1994; Gately 1995; Zhang 2003), and thus this chapter does not cover this type of studies.
Since 1994, the author’s group has made many efforts on extending AR, ARMA, ARCH and GARCH models into finite mixture or mixture-of-experts (Xu 1994, 1995a, b; Cheung et al. 1996, 1997; Leung 1997; Kwok et al. 1998; Wong et al. 1998; Chiu and Xu 2002a, 2003; Tang et al. 2003). Outlined in Fig. 1 by Box 4, studies actually proceed along an alternative road for modeling temporal dependence featured with nonlinearity, heteroskedasticity and nonstationarity. “Financial prediction: time series models and three finite mixture extensions” section is dedicated to the studies summarized in Fig. 1, together with introductions on learning implementations by the maximum likelihood (ML) learning, the rival penalized competitive learning (RPCL) (Xu et al. 1992, 1993), and approaches of learning with model selection.
“Dynamic trading and portfolio management” section is dedicated to the studies summarized in Fig. 2, toward portfolio management directly, instead of making nonlinear modeling for analyses and predictions. Around the second half of the 1990s, efforts in the literature of neural networks and machine learning in economics and finance started to shift to adaptive trading; see Box 1. Subsequently, these efforts converge to the road pioneered by the Markowitz portfolio theory (Markowitz 1952) that maximizes the portfolio expected return for a given amount of portfolio risk by carefully choosing the proportions of assets; see Box 2. Based on Markowitz’s mean–variance paradigm, Sharpe (1966, 1994) further suggests evaluating the goodness of an asset by a ratio of the excess asset return; see Box 3. Later, it is further realized that the return variance is not an appropriate measure of portfolio risk because it counts the positive fluctuation above the expected returns (called upside volatility) also as the part of risk. The downside risk thus becomes a topic to study, as illustrated in Fig. 2 by Box 4; e.g., Markowitz (1959) counts the volatility below the expected returns only.
After a brief introduction on the above-mentioned boxes in Fig. 2, “Dynamic trading and portfolio management” section further re-examines the Markowitz paradigm and Sharpe ratio with extensions that maximize the expected returns and the upside volatility while minimizing the downside risk, with the help of a priori aided diversification (Hung et al. 2000, 2003); see Box 5 in Fig. 2. Moreover, several extensions have been proposed along this direction in Sect. III(C) of Xu (2001), including that nonparametric estimates of the expected return and volatilities are improved by ARCH or GARCH models; see Box 6 in Fig. 2.
Next, “Market modeling: APT theory and temporal factor analysis” section is dedicated to the efforts summarized in Fig. 3. The Markowitz scheme also leads to the Capital Asset Pricing Model (CAPM) (Sharpe 1964). However, the CAPM has been criticized as insufficient for describing market behavior merely via one endogenous factor. Then, a general linear model of multiple factors was proposed under the name of Arbitrage Pricing Theory (APT) (Ross 1976). Unfortunately, the APT has not gained popularity comparable to that of the CAPM. The reason lies largely in a significant drawback: its implementation is difficult due to the lack of specificity regarding the number and nature of the factors that systematically affect asset return (Dhrymes et al. 1984; Abeysekera and Mahajan 1987).
In “Market modeling: APT theory and temporal factor analysis” section, we start by introducing three approaches that are usually applied for the implementation of APT and address their drawbacks as outlined in “Introduction” section of Xu (2001), which leads to an observation that the lack of specificity regarding the endogenous factors concerns not just the number and nature of the factors, but arises, even more seriously, from the so-called rotation indeterminacy of factor analysis. Thus, further efforts should explore how to add certain structure to remove or remedy this indeterminacy. As outlined in Fig. 3 by Box 1 and Box 2, temporal factor analysis (TFA) (Xu 1997, 2000) is suggested as a generalization of the original APT theory (Xu 2001) to tackle such an incompleteness, featured with a first-order autoregressive dependence added to each factor such that the incompleteness caused by the notorious rotation indeterminacy is removed. Such a generalization is thus called temporal APT in the sense that temporal relation is taken into consideration.
This section further considers the influences of macroeconomic indexes such as GDP, inflation, investor confidence and yield curve, via their roles in controlling or modulating the temporal factors, which leads to a macroeconomics-modulated temporal APT shown in Fig. 3 by Box 3. Alternatively, TFA may also be replaced by non-Gaussian factor analyses (NFA) such that the incompleteness caused by rotation indeterminacy can also be removed; see Box 6 and Box 7 in Fig. 3. Actually, the temporal factors and the non-Gaussian factors are two aspects of one market model: one observes a dynamic market process, while the other describes the market with all the time points projected to one reference spot. Even more generally, conditional heteroskedasticity may also be added to the factors, which finally leads to Box 8 in Fig. 3, namely, a general formulation for financial market modeling that systematically integrates all the ingredients. As illustrated in Fig. 3 by Box 4, various prediction tasks and investment managements can also be conducted with the help of the temporal APT and the macroeconomics-modulated temporal APT.
Further developments of these linear models are suggested to be implemented by the Bayesian Ying–Yang (BYY) harmony learning. In “Bayesian Ying–Yang harmony learning and two exemplar learning algorithms” section, the fundamentals of BYY harmony learning are briefly introduced. For learning alternative mixture-of-experts-based AR, ARCH and GARCH models, both gradient-based algorithms and EM-like algorithms are provided for implementations, featured with automatic model selection and in reference to the well-known EM algorithm.
Except for the first column in Fig. 1, where only one time series is considered, we mostly consider dependences across more than one channel of time series. Prediction and decision making in portfolio management are based on such dependences, which may not necessarily reflect the causal structure underlying the data, while it would be better to make predictions and decisions based on causal structure. In “Linear causal analyses” section, path analysis (Wright 1934) for linear causal analyses is briefly reviewed, a recent development on ρ-diagram (Xu 2018) is refined for confounder discovery and a causal potential theory is proposed. Further discussions are made on structural equation modeling (SEM) (Ullman 2006; Pearl 2010a; Westland 2015; Kline 2015) and its relations to modulated TFA-APT and nGCH-driven MTFAO.
Financial prediction: time series models and three finite mixture extensions
Time series models and neural networks
One of the most classic tools for time series analysis is the autoregressive (AR) model or, more generally, the autoregressive–moving-average (ARMA) model as follows:
where \(\varepsilon_{t} \overset{\text{i.i.d.}}{\sim} G(\varepsilon |0, \sigma^{2} )\) denotes that \(\varepsilon_{1} , \ldots ,\varepsilon_{t} , \ldots\) are i.i.d. samples from \(G(\varepsilon |0, \sigma^{2} )\), while \(G(u|\mu , \sigma^{2} )\) denotes a Gaussian distribution of u with the mean μ and the variance σ^{2}. Particularly, the ARMA model degenerates to the AR model when q = 0.
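For reference, a common textbook way of writing an ARMA model with autoregressive order p and moving-average order q is sketched below. The order symbol p and the coefficient symbols a_{i} and b_{j} are assumptions introduced here for illustration, and the notation of Eq. (1) may differ.

```latex
\[
x_{t} = \sum_{i = 1}^{p} a_{i} x_{t - i} + \varepsilon_{t} + \sum_{j = 1}^{q} b_{j} \varepsilon_{t - j} ,
\qquad \varepsilon_{t} \overset{\text{i.i.d.}}{\sim} G(\varepsilon |0, \sigma^{2} )
\]
```

In this form, setting q = 0 removes the moving-average sum and leaves the AR model.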
The ARMA model is appropriate to describe a wide sense stationary sequence. Extension has been made to describe data ξ_{t} that have some clearly identifiable trend of a polynomial growth (Box and Jenkins 1970); see Box 1 in Fig. 1. It is made simply by an initial differencing to remove the nonstationarity. That is, we get
A cascade of this initialization and ARMA is called the autoregressive integrated moving average (ARIMA) model. For simplicity, we prefer to still use ARMA to indicate ARIMA by regarding such an initialization as a preprocessing stage.
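The effect of the initial differencing can be illustrated by a minimal Python sketch. The function name `difference` and the toy quadratic trend are assumptions introduced here for illustration only; the sketch applies y_{t} = x_{t} − x_{t−1} repeatedly, which removes a polynomial trend of the corresponding degree.

```python
def difference(series, order=1):
    """Apply first differencing y_t = x_t - x_{t-1} `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Toy series with a purely quadratic trend x_t = t^2 (hypothetical data).
trend = [t * t for t in range(8)]
once = difference(trend, order=1)   # a linear trend is still present
twice = difference(trend, order=2)  # the trend has become a constant
print(once)
print(twice)
```

Each differencing step shortens the series by one sample, so an ARMA model is then fitted to the shorter, differenced sequence.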
In the literature of statistics, econometrics, control and signal processing, generalizations of ARMA have been made toward Autoregressive Conditional Heteroskedasticity (ARCH) and generalized ARCH (GARCH) for considering conditional heteroskedasticity of variables (Engle 1982; Bollerslev 1986); see Box 8 in Fig. 1. Namely, we consider
where σ_{t} is not a constant, but given by the following regression:
which is usually denoted by GARCH(p,q) and degenerates to the ARCH model when p = 0.
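A minimal sketch of such a variance regression is given below, assuming the commonly used form in which σ_{t}^{2} is a constant plus q lagged squared residuals plus p lagged variances. The names `omega`, `arch_coefs`, `garch_coefs` and `init_var` are hypothetical and are not taken from the equations of this paper.

```python
def garch_variances(residuals, omega, arch_coefs, garch_coefs, init_var):
    """Conditional variances sigma_t^2 of an assumed GARCH(p, q) recursion.

    sigma_t^2 = omega + sum_i arch_coefs[i]  * eps_{t-1-i}^2
                      + sum_j garch_coefs[j] * sigma_{t-1-j}^2
    Lags that reach before the start of the sample use `init_var`.
    """
    variances = []
    for t in range(len(residuals)):
        var = omega
        for i, c in enumerate(arch_coefs):      # q squared-residual lags
            lag = t - 1 - i
            var += c * (residuals[lag] ** 2 if lag >= 0 else init_var)
        for j, d in enumerate(garch_coefs):     # p variance lags
            lag = t - 1 - j
            var += d * (variances[lag] if lag >= 0 else init_var)
        variances.append(var)
    return variances


eps = [1.0, -2.0, 0.5]  # hypothetical residuals
garch = garch_variances(eps, 0.1, [0.2], [0.7], init_var=1.0)
arch = garch_variances(eps, 0.1, [0.2], [], init_var=1.0)  # p = 0
print(garch, arch)
```

Passing an empty `garch_coefs` corresponds to p = 0, in which case the recursion depends on past squared residuals only, that is, the ARCH case.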
Extensions of the ARMA model have also been made under the name of nonlinear ARMA (NARMA) for modeling nonlinear dependence (Leontaritis and Billings 1985) and to Vector AR (VAR) for capturing the linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987). In the literature, many efforts have been made on using multilayer neural networks or recurrent networks for a sophisticated nonlinear dependence of the current observation on past values and noise disturbances, as illustrated by Box 3 in Fig. 1. There are already several books on these studies (e.g., Azoff 1994; Gately 1995; Zhang 2003), and thus this chapter does not cover this type of studies. Instead, the subsequent two subsections will focus on Box 4 in Fig. 1, namely, learning mixture of multiple models.
Learning mixture of AR, ARMA, ARCH and GARCH models
Studies on finite mixture extensions of AR, ARMA, ARCH and GARCH models can be summarized into the following general expression:
where we consider k regression models \(x_{t} = \mu_{i,t} + \varepsilon_{i,t} , i = 1, \ldots ,k\) with each \(\mu_{i,t} = \hat{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)\) being one of the AR, ARMA, ARCH and GARCH models, and with the corresponding residual ɛ_{i,t} from \(G(\varepsilon_{i,t} |0, \sigma_{i,t}^{2} )\). Typically, the studies of the AR, ARCH and GARCH models share the following detailed expression (Xu 1995a, b; Cheung et al. 1997; Kwok et al. 1998; Wong et al. 1998; Chiu and Xu 2003, 2004a; Tang et al. 2003):
For ARMA (Kwok et al. 1998; Tang et al. 2003), the detailed expression of \(\mu_{i,t} = \hat{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)\) is given by Eq. (1). Moreover, \(\hat{x}_{t} \left( {\varvec{x}_{t - 1}^{{q_{i} }} , \varvec{a}_{i} } \right)\) can also be a specific nonlinear function, e.g., given by three-layer neural networks (Cheung et al. 1996, 1997) or the normalized radial basis function (NRBF) and extended NRBF (ENRBF) (Xu 1998, 2009).
According to Eq. (4), a sequence x_{1}, …, x_{t}, … may come from the ith one of the k models with the probability α_{i}, and jointly the k models describe the sequence x_{1}, …, x_{t}, … with a residual ɛ_{t} that comes from a Gaussian mixture \(P(\varepsilon_{t} |\varvec{x}_{t - 1}^{q} ,\theta )\). In such a way, a nonlinear dependence of the current observation on past values and noise disturbances is modeled by probabilistically combining a mixture of linear models, which keeps the model structure simple and easy to learn. Moreover, nonstationarity beyond that handled by ARIMA and GARCH models can be modeled via switching among individual linear models.
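As a minimal sketch of this idea, the following Python fragment evaluates the density of x_{t} under a finite mixture of AR(1) components with Gaussian residuals. Restricting each component to order one, and the names `gaussian_pdf`, `mixture_density` and `experts`, are simplifying assumptions made here for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Gaussian density G(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x_t, x_prev, experts):
    """Density of x_t given x_{t-1} under a mixture of AR(1) components.

    `experts` is a list of (alpha, a, var) tuples: mixing proportion,
    AR(1) coefficient and residual variance of each component.
    """
    return sum(alpha * gaussian_pdf(x_t, a * x_prev, var)
               for alpha, a, var in experts)


# Two hypothetical regimes: a persistent one and a mean-reverting one.
experts = [(0.6, 0.9, 0.25), (0.4, -0.5, 1.0)]
print(mixture_density(1.0, 1.2, experts))
```

Although every component is linear in x_{t−1}, the resulting conditional density is multimodal, which is how the mixture captures dependence that a single linear model cannot.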
Also, a sequence x_{1}, …, x_{t}, … may be segmented into pieces with different statistical properties, simply by Bayesian posterior as follows (Xu 1994, 1995a, b):
that is, x_{t} is identified as coming from the j^{*}th model by
To reduce the number of small fragments, some post-processing or smoothing regularization may be added. Moreover, we may extend a finite mixture into a hidden Markov model (HMM) (Rabiner 1989), in which each hidden state is associated with one \(G(x_{t} - \mu_{j,t} |0, \sigma_{j,t}^{2} )\) and the transition between states is described by
with α_{j,t} estimated as time proceeds and then used in Eq. (5) and Eq. (6). Moreover, we can also further modify Eq. (5) and Eq. (6) into
Next, we proceed to estimate x_{t} from the finite mixture by Eq. (4). It follows that
that is, we improve the prediction of x_{t} via each individual model by a linear combination weighted by each α_{i}. However, this improvement is limited because α_{i} is a constant that does not change as the samples vary with time.
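The posterior-based segmentation and the α-weighted prediction discussed above can be sketched together as follows. The sketch assumes the usual Bayes form, in which the posterior of component j is proportional to α_{j} times its Gaussian density; all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Gaussian density G(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(x_t, means, variances, alphas):
    """Bayes posterior of each component given the observed x_t."""
    joint = [a * gaussian_pdf(x_t, m, v)
             for a, m, v in zip(alphas, means, variances)]
    total = sum(joint)
    return [p / total for p in joint]


def segment_label(x_t, means, variances, alphas):
    """Index of the component with the largest posterior."""
    post = posteriors(x_t, means, variances, alphas)
    return max(range(len(post)), key=lambda j: post[j])


def mixture_prediction(means, alphas):
    """Prediction of x_t as the alpha-weighted sum of component means."""
    return sum(a * m for a, m in zip(alphas, means))


# Two components whose one-step predictions for x_t are 0.0 and 5.0.
means, variances, alphas = [0.0, 5.0], [1.0, 1.0], [0.5, 0.5]
print(segment_label(4.8, means, variances, alphas))
print(mixture_prediction(means, alphas))
```

The segmentation uses the observed x_{t}, whereas the prediction cannot; with fixed α_{i} the combined prediction stays the same whichever regime is currently active, which is the limitation noted above.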
Each α_{i} in Eq. (4) cannot directly be replaced by its corresponding Bayes posterior by Eq. (5). First, \(P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta )\) cannot be moved out of the integral \(\int x_{t} P(j_{t} |x_{t} ,\varvec{x}_{t - 1}^{q} ,\theta )G(x_{t} |\mu_{j,t} , \sigma_{j,t}^{2} )dx_{t}\), though the integral can be made approximately. Second, the calculation needs to know x_{t}. Getting \(\hat{x}_{t}\) from knowing x_{t} is applicable to a filtering problem that gets a smoothed or filtered version from x_{t}, but it is not applicable to a prediction problem that aims at getting \(\hat{x}_{t}\) from its past observations.
Instead, we use a predictive \(P(j_{t} |\varvec{x}_{t - 1}^{q} ,\varphi )\) based on the immediate past observations \(\varvec{x}_{t - 1}^{q}\) to combine the predictions of the individual prediction models adaptively; that is, we have
which summarizes extensions of the AR, ARMA, ARCH and GARCH models with the help of the mixture-of-experts (ME). In the implementation of the original ME (Jacobs et al. 1991; Jordan and Xu 1995), \(P(j|\varvec{x}_{t - 1}^{q} ,\varphi )\) is called the gating net and given as follows:
with \(g_{1} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right), \ldots , g_{k} \left( {\varvec{x}_{t - 1}^{q} ,\varphi } \right)\) being the output of multilayer networks.
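A minimal sketch of such a gated combination is given below, assuming the softmax form of the gating net used in the original ME of Jacobs et al. (1991). The gate scores stand in for the network outputs g_{1}, …, g_{k}; here they are plain numbers, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize gate scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_prediction(gate_scores, expert_means):
    """Combine expert predictions with input-dependent gate weights."""
    weights = softmax(gate_scores)
    return sum(w * m for w, m in zip(weights, expert_means))


# Hypothetical gate scores computed from past observations.
print(gated_prediction([2.0, -1.0], [1.0, 3.0]))
```

Because the scores depend on the past observations, the weights change with time, unlike the constant α_{i} of the finite mixture.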
In an implementation of an alternative ME model (Xu et al. 1994, 1995), we consider a predictive Bayesian posteriori
For the AR, ARCH and GARCH models, we further have
To simplify the computation, we may consider the following approximation:
A further insight into Eq. (11) can be obtained at a setting that \(\sigma_{j,t - 1}^{2} = \sigma_{j}^{2}\) and \(x_{t - 1} = \mu_{j,t - 1}\); in this special case, we have a further simplification:
which shares a similar concept to the mixture-using variance (MUV) and actually degenerates to this MUV (Perrone and Cooper 1993; Perrone 1994) when \(\alpha_{j} \propto \sigma_{j,t}^{ - 1}\). Another special case is that α_{i}/σ_{i,t} is constant, and it follows from Eqs. (11) to (12) that we have
by which we get the counterparts of NRBF and ENRBF (Xu 1998, Xu 2009).
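The first special case above can be made concrete with a small sketch. It assumes that, when each Gaussian is evaluated at its own mean, the combining weight of component j reduces to α_{j}/σ_{j} after normalization, as the Gaussian normalizing constant suggests; the function name is hypothetical.

```python
def variance_aware_weights(alphas, sigmas):
    """Combining weights proportional to alpha_j / sigma_j."""
    raw = [a / s for a, s in zip(alphas, sigmas)]
    total = sum(raw)
    return [r / total for r in raw]


sigmas = [1.0, 2.0]
# Choosing alpha_j proportional to 1 / sigma_j gives weights that are
# proportional to 1 / sigma_j^2, i.e., inverse-variance weighting.
alphas = [1.0 / s for s in sigmas]
print(variance_aware_weights(alphas, sigmas))
```

Under this assumption, a component with twice the standard deviation receives a quarter of the weight, so less reliable models contribute less to the combined prediction.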
The other choices of \(P(j|\varvec{x}_{t - 1}^{q} ,\varphi )\) may also be obtained or modified from Table 3 in Xu and Amari (2008). Moreover, similar to Eq. (8), it still follows from \(q(\varvec{x}_{t - 1}^{q} |\psi_{j} )\) given by Eqs. (11) and (12) that we may further incorporate the HMM model from Eq. (7) into Eq. (11) and get
Maximum likelihood, RPCL learning and learning with model selection
Typically, unknown parameters in the models in Eqs. (4), (8), (10) and (11) are estimated by the maximum likelihood (ML) learning, that is, the following maximization:
This maximization is implemented by the EM algorithm (Redner and Walker 1984), e.g., see the EM algorithms for finite mixture of AR models in Xu (1994, 1995a, b), finite mixture of GARCH models in Wong et al. (1998), finite mixture of ARMA–GARCH models in Tang et al. (2003) and the original ME in Jordan and Xu (1995), as well as the alternative ME model, NRBF and ENRBF in Xu et al. (1994, 1995) and Xu (1998, 2009).
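To make the EM iteration concrete, a minimal sketch for a finite mixture of AR(1) models is given below. It is a generic textbook EM for a mixture of linear regressions, not the specific algorithms of the papers cited above; the restriction to order one, the numerical floors and the simulated regime-switching series are all assumptions made here for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Gaussian density G(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_mixture_ar1(series, coefs, variances, alphas, n_iter=20):
    """EM for a mixture of AR(1) models x_t = a_j * x_{t-1} + eps_j."""
    coefs, variances, alphas = list(coefs), list(variances), list(alphas)
    pairs = list(zip(series[:-1], series[1:]))  # (x_{t-1}, x_t)
    k, log_liks = len(coefs), []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per sample.
        resp, log_lik = [], 0.0
        for prev, cur in pairs:
            joint = [alphas[j] * gaussian_pdf(cur, coefs[j] * prev, variances[j])
                     for j in range(k)]
            total = max(sum(joint), 1e-300)  # guard against underflow
            resp.append([p / total for p in joint])
            log_lik += math.log(total)
        log_liks.append(log_lik)
        # M-step: responsibility-weighted least squares per component.
        for j in range(k):
            h = [r[j] for r in resp]
            mass = max(sum(h), 1e-12)
            sxx = sum(w * p * p for w, (p, _) in zip(h, pairs))
            sxy = sum(w * p * c for w, (p, c) in zip(h, pairs))
            coefs[j] = sxy / max(sxx, 1e-12)
            sse = sum(w * (c - coefs[j] * p) ** 2 for w, (p, c) in zip(h, pairs))
            variances[j] = max(sse / mass, 1e-6)  # floor avoids collapse
            alphas[j] = mass / len(pairs)
    return coefs, variances, alphas, log_liks


# Simulated series that switches between two AR(1) regimes (toy data).
rng = random.Random(0)
series = [1.0]
for t in range(300):
    a = 0.9 if (t // 50) % 2 == 0 else -0.5
    series.append(a * series[-1] + rng.gauss(0.0, 0.5))

coefs, variances, alphas, log_liks = em_mixture_ar1(
    series, [0.5, -0.1], [1.0, 1.0], [0.5, 0.5])
print(coefs, alphas)
```

The recorded log-likelihood should not decrease from one iteration to the next, which is the defining property of EM; the number of components k is fixed in advance here and is not selected by the algorithm.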
For an HMM mixture, we may also have the following approximate likelihood:
One critical problem for the ML learning is that a good performance on a training set does not necessarily carry over to a testing set, especially when the training set consists of a small number of samples. The reason is that there may be too many free parameters. As introduced in the third section of Xu (2009), efforts on this problem are mainly featured by learning with model selection. Model selection refers to selecting a model with an appropriate complexity \(\varvec{k}\). For the models considered in the previous subsection, \(\varvec{k}\) consists of the number of individual models, the autoregression order and the moving average order for each individual model. Typically, the ML learning is not good for model selection. However, whether the EM algorithm works well depends on whether an appropriate \(\varvec{k}\) is selected.
Classically, model selection is made in a two-stage implementation. First, enumerate a candidate set \(\varvec{\rm K}\) of \(\varvec{k}\) and estimate a solution \(\varTheta_{\varvec{k}}^{*}\) for the unknown set Θ_{k} of parameters by the ML learning at each \(\varvec{k} \in \varvec{\rm K}\). Second, use a model selection criterion \(J\left( {\varTheta_{\varvec{k}}^{*} } \right)\) to select a best \(\varvec{k}^{*}\). Several classical criteria are available for the purpose, such as AIC, CAIC and BIC/MDL, and readers are referred to Xu (2009, 2010) for a recent outline. Unfortunately, any one of these criteria usually provides a rough estimate that may not yield a satisfactory performance. Even with a criterion \(J\left( {\varTheta_{\varvec{k}} } \right)\) available, this two-stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as \(\varvec{k}\) increases, which makes the evaluation of \(J\left( {\varTheta_{\varvec{k}} } \right)\) unreliable.
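The second stage of this procedure can be sketched as follows, using the standard definitions of AIC and BIC. The maximized log-likelihoods and parameter counts in `candidates` are made-up illustrative numbers, not results from any experiment, and the function names are hypothetical.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion (smaller is better)."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_samples):
    """Bayesian information criterion (smaller is better)."""
    return -2.0 * log_lik + n_params * math.log(n_samples)


def select_model(candidates, n_samples, criterion="bic"):
    """Return the candidate k with the smallest criterion value.

    `candidates` maps k to (maximized log-likelihood, free parameters),
    i.e., the outcome of the first stage run once for every k.
    """
    def score(item):
        _, (log_lik, n_params) = item
        if criterion == "aic":
            return aic(log_lik, n_params)
        return bic(log_lik, n_params, n_samples)
    return min(candidates.items(), key=score)[0]


# Hypothetical first-stage results for k = 1, 2, 3 mixture components.
candidates = {1: (-520.0, 3), 2: (-480.0, 6), 3: (-478.0, 9)}
print(select_model(candidates, n_samples=200))
```

The cost noted above is visible in the structure: every entry of `candidates` requires a complete ML fit before any comparison can be made.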
One direction that tackles this challenge is called automatic model selection, which is associated with a learning algorithm or a learning principle with the following two features:

When there is an indicator \(\rho \left( {\theta_{\varvec{r}} } \right)\) on a subset \(\theta_{\varvec{r}} \in \varTheta_{\varvec{k}}\), we have \(\rho \left( {\theta_{\varvec{r}} } \right) = 0\) if \(\theta_{\varvec{r}}\) consists of parameters of a redundant structural part.

In implementation of this algorithm or principle, there is a mechanism that automatically drives \(\rho \left( {\theta_{\varvec{r}} } \right) \to 0\) as \(\theta_{\varvec{r}}\) tends toward a specific value. Thus, the corresponding redundant structural part is effectively discarded.
An early effort along this direction is rival penalized competitive learning (RPCL) (Xu et al. 1992, 1993) for adaptively learning a model that consists of \(k\) substructures as follows:
where η > 0 is a learning step size and γ is a small positive number, e.g., γ = 0.005–0.01. With \(k\) initially at a value large enough, a current input sample x_{t} is allocated to one of the \(k\) substructures via competition. The winner adapts to this sample a little, while the rival is de-learned a little to reduce a duplicated allocation. This rival penalized mechanism will discard extra substructures, so that model selection is made automatically during learning. Readers are referred to Xu (2007) for a recent overview and extensions.
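The rival penalized mechanism can be illustrated with a minimal sketch for one-dimensional cluster centers. It assumes that the winner moves toward the sample with step η and the rival (the second-best unit) moves away with the much smaller step ηγ; the frequency-sensitive competition of the original RPCL is omitted, and the function name is hypothetical.

```python
def rpcl_step(centers, x, eta=0.05, gamma=0.01):
    """One rival penalized update; returns the (winner, rival) indices.

    Requires at least two centers. `centers` is modified in place.
    """
    order = sorted(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
    winner, rival = order[0], order[1]
    centers[winner] += eta * (x - centers[winner])        # winner learns
    centers[rival] -= eta * gamma * (x - centers[rival])  # rival is de-learned
    return winner, rival


centers = [0.0, 1.0, 10.0]
print(rpcl_step(centers, 0.2, eta=0.1, gamma=0.01), centers)
```

Repeated over many samples, a redundant center that keeps losing as the rival is gradually pushed away from the data, which is how extra substructures end up discarded.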
Corresponding to Eq. (16), \(\pi_{j,t} (\theta_{j}^{old} )\) in Eq. (18) is given as follows:
For an HMM mixture, we may also approximately have
Another stream of automatic model selection is featured by efforts based on appropriate priors. By a Laplace prior in a regression task, sparse learning or Lasso shrinkage prunes away extra weights (Williams 1995; Tibshirani 1996). For pruning away Gaussian components of a Gaussian mixture, a Jeffreys prior is used in the implementation of the minimum message length (MML) that minimizes a two-part message for a statement of model and a statement of data encoded by that model (Figueiredo and Jain 2002), and Dirichlet–Normal–Wishart priors are also added on Gaussian components in the implementation of the variational Bayes (VB) that computes a lower bound of the marginal likelihood (McGrory and Titterington 2007).
However, these efforts highly depend on choosing an appropriate prior, which is usually a difficult task, while an inappropriate prior may deteriorate the performance of model selection seriously. Without any priors on the parameters, VB and MML both degenerate to the maximum likelihood learning, while the RPCL learning is still capable of automatic model selection. Firstly proposed in Xu (1995a, b) and systematically developed over a decade and a half (Xu 2001, 2007, 2010, 2012), the third stream of efforts has been made under the name of Bayesian Ying–Yang (BYY) harmony learning. The BYY harmony learning shares a mechanism similar to the RPCL learning. Also, the performances of BYY harmony learning can be further improved by incorporating appropriate priors. For further details about the BYY harmony learning, readers are referred to “Bayesian Ying–Yang harmony learning and two exemplar learning algorithms” section, where a tutorial is also provided on one BYY harmony learning algorithm for alternative mixture-of-experts-based GARCH models.
Dynamic trading and portfolio management
Dynamic trading by supervised learning and reinforcement learning
Instead of building a mathematical model for understanding and forecasting time series, studies of neural networks and machine learning in economics and finance started to shift from nonlinear forecasting modeling to adaptive trading and dynamic portfolio management (Neuneier 1996; Choey and Weigend 1997; Xu and Cheung 1997; Moody et al. 1998; Hung et al. 2000; Moody and Saffell 2001; Hung et al. 2003; Chiu and Xu 2004b; Jangmin 2006). Efforts on portfolio management will be addressed in the next subsection. In the sequel, we introduce efforts on learning dynamic trading based on one single time series, with the help of supervised learning, reinforcement learning and Sharpe ratio maximization.
Given a sequence x_{1}, …, x_{t}, e.g., the sequence of one asset, Gold, FOREX index, etc., at any time point τ ≤ t we may infer a sequence \(I_{1}^{p} , \ldots ,I_{t}^{p}\) with each \(I_{\tau }^{p}\) being the following desired trading signal:
based on a trading strategy (e.g., maximum return) or an external expertise.
The task of learning decision, as illustrated by Box 1 in Fig. 2, can be formulated as a nonlinear regression model:
where \(f\left( {XF_{t}^{q} , \left\{ {I_{t - \tau }^{p} } \right\}_{t = 1}^{q} , \varTheta } \right)\) is implemented by an ENRBF network in Xu and Cheung (1997). Also, it can be implemented by three-layer neural networks. Supervised learning is used to determine the unknown parameters Θ by minimizing
where \(XF_{t}^{q}\) may be directly a number of past observations \(\{ x_{t - \tau } \}_{t = 1}^{q}\) or certain features \(\{ F_{t}^{(i)} \}\) extracted from \(\{ x_{t - \tau } \}_{t = 1}^{q}\), e.g., \(F_{t}^{(i)}\) may be MACD, RSI, %K, %D, as well as features from candlestick charts and configurations from waves, etc. Also, we may put both together to consider \(XF_{t}^{q} = \left\{ {\left\{ {x_{t - \tau } } \right\}_{t = 1}^{q} , \left\{ {F_{t}^{(i)} } \right\}} \right\}.\)
One key problem is how to keep a good generalization ability by training with a small length of sequence x_{1}, …, x_{t}. One way is adding some regularization term E_{2}(Θ) + λΓ(Θ). Without a priori knowledge, however, it is not an easy task to get an appropriate term Γ(Θ) and its strength λ. The other way is to describe the model as follows:
with \(I_{t}^{p} = [z_{t}^{\left( 1 \right)} ,z_{t}^{\left( 2 \right)} ,z_{t}^{\left( 3 \right)} ]^{\rm T} , z_{t}^{\left( 2 \right)} = 0\, {\text{or}} \,1\) and \(z_{t}^{(1)} + z_{t}^{(2)} + z_{t}^{(3)} = 1\). Correspondingly, \(\min_{\varTheta } E_{2} (\varTheta )\) is replaced by maximizing the likelihood \(L\left( \varTheta \right) = \sum\limits_{t} { \ln }q(I_{t}^{p} |f(XF_{t}^{q} ,\{ I_{t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta ))\). In the formulation, learning regularization may be implemented via Bayesian learning with the help of a priori distribution q(Θ), i.e., \(\max_{\varTheta } [L(\varTheta ) + \ln q(\varTheta )]\). For a better generalization ability, we may also put \(q(I_{t}^{p} |f(XF_{t}^{q} ,\{ I_{t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta ))\) into a Bayesian Ying–Yang system and make BYY harmony learning with automatic model selection; see Sect. 4.4 in Xu (2010).
The other key problem is how to make a preprocessing stage for getting a desired sequence \(I_{1}^{p} , \ldots ,I_{t}^{p}\), which can be obtained automatically by a trading strategy, e.g., getting a profit and cutting a loss beyond a prespecified threshold as follows:
where σ_{t} is an estimate of the volatility of this asset. Also, \(I_{1}^{p} , \ldots ,I_{t}^{p}\) may come from an outcome of market technical analysis, although it is then difficult to obtain \(I_{1}^{p} , \ldots ,I_{t}^{p}\) adaptively in a dynamic trading.
From the studies (Moody et al. 1998; Moody and Saffell 2001; Jangmin 2006), \(I_{1}^{p} , \ldots ,I_{t}^{p}\) is a sequence of actions that are dynamically learned by reinforcement learning. Typically, a reinforcement learning model consists of a set S of environment states (e.g., differences in the current price of asset and the volumes in holding) and a set A (e.g., buy, sell, no action) of actions. There is also a policy π that chooses an action a_{t} ∊ A at an environment state s_{t}. The action a_{t} makes the environment move to a new state s_{t+1}. Associated with the transition (s_{t}, a_{t}, s_{t+1}), there is a scalar immediate reward r_{t+1}(s_{t}, a_{t}, s_{t+1}) that is estimated according to a utility function, e.g., a maximum profit. The goal is to collect as much reward as possible by determining a sequence of actions a_{1}, …, a_{t}.
In the literature of reinforcement learning, one popular approach is called Q-learning, by which a_{t} is chosen according to a table Q(s_{t}, a_{t}) that is learned from r_{t+1}(s_{t}, a_{t}, s_{t+1}). For a dynamic trading, the set S of environment states is featured by differences in the current price of asset and the volumes in holding. Quantizing the differences into the states is not an easy task. Also, there will be a large number of states to be considered. As a result, we need to learn a large Q(s_{t}, a_{t}) table, which not only increases computing cost rapidly, but also makes the problem of a small sample size become more serious because Q(s_{t}, a_{t}) consists of too many free parameters to be determined. Instead of Q-learning, the action a_{t} in r_{t+1}(s_{t}, a_{t}, s_{t+1}) can be approximately replaced by the value of \(I_{t}^{p}\) given by Eq. (22) such that r_{t+1}(s_{t}, a_{t}, s_{t+1}) is replaced by an expression \(r_{t + 1} (s_{t} , s_{t + 1} , \{ x_{t - \tau } \}_{t = 1}^{q} , \{ I_{t - \tau }^{p} \}_{t = 1}^{q} , \varTheta )\). As a result, the maximization of \(\sum\nolimits_{t = 1}^{\infty } \gamma^{t} r_{t + 1} (s_{t} , a_{t} , s_{t + 1} )\) with respect to a sequence of discrete actions a_{1}, …, a_{t} is replaced by the maximization of \(\sum\nolimits_{t = 1}^{\infty } \gamma^{t} r_{t + 1} (s_{t} , s_{t + 1} , \{ x_{t - \tau } \}_{t = 1}^{q} , \{ I_{t - \tau }^{p} \}_{t = 1}^{q} , \varTheta )\) with respect to Θ. Similar to learning regularization, the problem of a small sample size may also be handled by adding an a priori term, e.g., \(\sum\nolimits_{t = 1}^{\infty } {\gamma^{t} r_{t + 1} \left( {s_{t} , s_{t + 1} , \left\{ {x_{t - \tau } } \right\}_{t = 1}^{q} , \left\{ {I_{t - \tau }^{p} } \right\}_{t = 1}^{q} , \varTheta } \right) + \lambda { \ln } q\left( \varTheta \right)} .\)
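For comparison, the table-based alternative can be sketched with the standard one-step Q-learning update. The action names, the string state labels and the parameter names `lr` and `discount` are hypothetical; a real trading system would need the quantized price differences and holdings described above as states.

```python
from collections import defaultdict

ACTIONS = ("buy", "sell", "hold")


def q_update(q, state, action, reward, next_state, lr=0.5, discount=0.9):
    """Standard Q-learning update of one table entry; returns the new value."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + discount * best_next
    q[(state, action)] += lr * (target - q[(state, action)])
    return q[(state, action)]


# The table holds one free parameter per (state, action) pair.
q_table = defaultdict(float)
print(q_update(q_table, "s0", "buy", 1.0, "s1"))
print(q_update(q_table, "s1", "hold", 2.0, "s0"))
```

Since the table has one entry per state and action pair, its size grows with the product of the two sets, which is the source of the small-sample problem discussed above.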
Last but not least, the specific expression of r_{t+1}(s_{t}, a_{t}, s_{t+1}) is an important practical issue, related to the current price of asset, the volume in holding, the transaction cost and the tax, as well as personal preference. There could be a number of choices. See Fig. 2 by Box 3; a widely used one is the Sharpe ratio, which was originally suggested for evaluating the goodness of an asset in market by a ratio of the excess asset return (i.e., after subtracting the benchmark return) over the standard deviation of the excess asset return (Sharpe 1966, 1994). For dynamic trading, it is not the Sharpe ratio of the asset in market that has to be calculated, but the Sharpe ratio of the dynamic trading system, which depends on a sequence of actions a_{1}, …, a_{t}.
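The distinction between the Sharpe ratio of an asset and that of a trading system can be sketched as follows. The return model in `trading_returns` (position times asset return, minus a proportional cost whenever the position changes) is an assumption made here for illustration, as are all names and numbers.

```python
import statistics


def trading_returns(positions, asset_returns, cost=0.0):
    """Per-period returns of a trading system.

    positions[t] in {-1, 0, 1} is the position held over period t, and a
    proportional `cost` is charged whenever the position changes.
    """
    out, prev = [], 0
    for pos, r in zip(positions, asset_returns):
        out.append(pos * r - cost * abs(pos - prev))
        prev = pos
    return out


def sharpe_ratio(returns, benchmark=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - benchmark for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


asset = [0.01, -0.02, -0.03, 0.02]  # hypothetical asset returns
actions = [1, 1, -1, 1]             # hypothetical long/short decisions
system = trading_returns(actions, asset, cost=0.001)
print(sharpe_ratio(asset), sharpe_ratio(system))
```

The same asset therefore yields different Sharpe ratios under different action sequences, which is why the ratio of the trading system, rather than that of the asset, serves as the quantity to maximize.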
Dynamic portfolio management by maximizing Sharpe ratio and extensions
Instead of only considering one single asset, a common and more reliable practice is considering a portfolio of assets, and thus portfolio management is one important topic in the finance literature. For the supervised learning by Eq. (22), its extension can be made simply by considering \(I_{j,t}^{p} (XF_{t}^{q} ,\{ I_{j,t - \tau }^{p} \}_{t = 1}^{q} ,\varTheta_{j} ), j = 1, \ldots ,k\) with each in the format of Eq. (22), and learning is made by minimizing the total sum ∑ _{j}E_{2}(Θ_{j}). Simply, we get the training signals \(I_{j,1}^{p} , \ldots ,I_{j,t}^{p}\) per asset individually. Still, further studies are needed on how to get the training signals based on the whole portfolio of assets. Conceptually, the extension of reinforcement learning to multiple assets is rather straightforward too. However, both the set S of environment states and the set \(A\) of possible actions grow rapidly, which makes learning a large table Q(s_{t}, a_{t}) suffer seriously from the problem of a small sample size. Thus, it becomes more critical to approximately replace a_{1}, …, a_{t} by {I ^{p}_{ j, t} (XF ^{q}_{ t} , {I ^{p}_{ j, t − τ} } ^{q}_{ t=1} , Θ_{j})} ^{k}_{ j=1} in evaluating the reward r_{t+1} (Moody et al. 1998; Moody and Saffell 2001). Similar to supervised learning, one direction for tackling the problem of a small sample size is incorporating learning regularization.
Alternatively, another direction to pursue portfolio management is exploring the road pioneered by the Markowitz portfolio theory (Markowitz 1952); see Box 2 in Fig. 2. By this theory, the return of an investment portfolio is the proportion-weighted combination of the constituent assets’ returns, while the portfolio volatility is a function of the correlations between the component assets. The portfolio expected return is maximized subject to a given amount of portfolio risk, or equivalently the risk is minimized for a given level of expected return. Moreover, the Markowitz mean–variance scheme also leads to the suggestion of the Sharpe ratio (Sharpe 1966, 1994), which is typically used to evaluate the performance of a portfolio.
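As a small numerical illustration of the mean–variance quantities, the sketch below computes the expected return w^T μ and the variance w^T Σ w of a two-asset portfolio. The numbers and the function name are made up for illustration only.

```python
import numpy as np

def portfolio_mean_variance(w, mu, cov):
    """Expected return w^T mu and variance w^T cov w of a portfolio with weights w."""
    w = np.asarray(w, dtype=float)
    return float(w @ mu), float(w @ cov @ w)

# Illustrative numbers only (not taken from any data set).
mu = np.array([0.08, 0.12])                  # expected returns of the two assets
vol = np.array([0.10, 0.20])                 # standard deviations of the two assets
corr = np.array([[1.0, 0.2], [0.2, 1.0]])    # correlation between the assets
cov = np.outer(vol, vol) * corr

m, v = portfolio_mean_variance([0.5, 0.5], mu, cov)
std = v ** 0.5
print(m, std)
```

With a correlation below one, the standard deviation of the equally weighted portfolio is lower than the average of the two individual standard deviations, which is the diversification effect that the scheme exploits.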
In both the standard Markowitz mean–variance scheme and the Sharpe ratio approach, risk is defined as the return variance. It has been subsequently realized that the variance is not an appropriate measure because it counts the positive fluctuation above the expected returns (also called upside volatility) as a part of the risk. See Box 4 in Fig. 2; the downside risk thus becomes a topic to study. Markowitz (1959) counts the volatility below the expected returns only. Fishburn (1977) makes a mean-risk analysis with risk associated with below-target returns and proposes a more sophisticated measure of this risk, which has been further refined by Sortino and Meer (1991). Basically, this downside risk is the volatility of return below the minimal acceptable return (also called target return G).
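A minimal sketch of the single-asset case follows. It assumes one common definition, the root mean square of the shortfalls below the target return G, and a mirrored definition for the upside volatility; the cited works may use different normalizations, and the function names are hypothetical.

```python
import math

def downside_risk(returns, target):
    """Root mean square of the shortfalls below the target return G."""
    shortfalls = [min(r - target, 0.0) for r in returns]
    return math.sqrt(sum(s * s for s in shortfalls) / len(returns))

def upside_volatility(returns, target):
    """Root mean square of the excesses above the target return G."""
    excesses = [max(r - target, 0.0) for r in returns]
    return math.sqrt(sum(e * e for e in excesses) / len(returns))

returns = [0.03, -0.01, 0.02, -0.04, 0.05]
G = 0.0  # target return (illustrative value)
d = downside_risk(returns, G)
u = upside_volatility(returns, G)
print(d, u)
```

Under these definitions the squared downside risk and the squared upside volatility add up to the mean squared deviation from G, so a variance-type measure mixes the two parts while the downside risk isolates the unfavorable one.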
Moreover, the downside risk of a single asset has been extended into the following covariance (Hung et al. 2000, 2003):
for the returns \(r_{j} ,\, j = 1, \ldots ,k\) of multiple assets. Also, we have the following matrix for the upside volatility:
The spirit of the Markowitz theory and the Sharpe ratio, i.e., maximizing the expected returns while minimizing the risk, is reasonably modified into one extended Sharpe ratio featured by maximizing both the expected returns and the upside volatility while minimizing the downside risk; see Box 5 in Fig. 2. In Hung et al. (2000, 2003), this generalization is implemented by the following maximization:
As shown in Fig. 4, we use the parameters H, B to adapt to the investor’s preference. The parameter H represents the strength of maximizing the upside volatility and B represents the strength of diversification or regularization. The term \(\varvec{w}^{\text{T}} \left( {1 - \varvec{w}} \right)\) is a diversification term that reaches its minimum when one w_{i} is 1 and the others are 0, and its maximum when all the elements of \(\varvec{w}\) are equal.
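The behavior of the diversification term can be checked numerically. The sketch below assumes nonnegative weights that sum to one, in which case w^T(1 − w) equals 1 − w^T w; the function name is chosen here for illustration.

```python
def diversification(w):
    """Diversification term w^T (1 - w); with sum(w) == 1 it equals 1 - w^T w."""
    return sum(wi * (1.0 - wi) for wi in w)

k = 4
concentrated = [1.0, 0.0, 0.0, 0.0]   # everything invested in one asset
uneven = [0.7, 0.1, 0.1, 0.1]         # partially diversified
equal = [1.0 / k] * k                 # equally weighted portfolio

values = [diversification(w) for w in (concentrated, uneven, equal)]
print(values)
```

The term is zero for the fully concentrated portfolio and reaches 1 − 1/k for the equally weighted one, so a larger B pushes the solution toward more evenly spread weights.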
It has been experimentally shown that this generalization of the Sharpe ratio can effectively reduce the risk while still obtaining high returns, in comparison with the standard Markowitz mean–variance scheme and the Sharpe ratio. Moreover, some investors expect a constant return with a minimum downside risk, for which we can simply set \(\varvec{w}^{\text{T}} E\varvec{r} = r_{\text{spec}}\), while others expect a maximum return under a constant downside risk, for which we can simply set \(\varvec{w}^{\text{T}} \varvec{Dw} = v_{\text{spec}}\).
In Sect. III(C) of Xu (2001), several developments have been proposed along this direction. First, a more practical scenario is considered, featured with a portfolio of risky securities with returns \(r_{j,t} , \,j = 1, \ldots ,k\), a risk-free bond with return r^{f} and transaction cost with a rate r_{c}. That is, \(r_{t} = \varvec{w}^{\text{T}} \varvec{r}\) is replaced by
where each w_{j,t} may be nonnegative as in Eq. (28). In this case, short selling of a risky security is not permitted but borrowing from the risk-free bond is allowed, i.e., we can have 1 − α_{0} < 0. Also, we may allow a negative w_{j,t}, i.e., short selling of a risky security is permitted.
Second, instead of considering \(E\varvec{w}^{\text{T}} \varvec{r} = \varvec{w}^{\text{T}} E\varvec{r}\) and \(E\left[ {\varvec{w}^{\text{T}} \varvec{r} - E\varvec{w}^{\text{T}} \varvec{r}} \right]\left[ {\varvec{w}^{\text{T}} \varvec{r} - E\varvec{w}^{\text{T}} \varvec{r}} \right]^{\text{T}}\) for the expected return and its volatility, we compute their estimates directly from samples R_{T} = {r_{t}, t = 1, …, T} within a time window. Accordingly, it follows from Eq. (25) that we get the counterpart of Eq. (28) as follows:
where #S denotes the cardinality of the set S, and the parameters \(\beta_{V} ,\beta_{\varvec{w}}\) are the counterparts of H, B in Eq. (28). Moreover, \(D\left( \varvec{w} \right)\) is a diversification term that reaches its minimum when one w_{i} is 1 and the others are 0, and reaches its maximum when all the elements of \(\varvec{w}\) are equal. There could be several choices for \(D\left( \varvec{w} \right)\). One example is \(\varvec{w}^{\text{T}} \left( {1 - \varvec{w}} \right)\) in Eq. (28) or equivalently, up to a constant when the weights sum to one, \( - \varvec{w}^{\text{T}} \varvec{w}\). One other example is
Moreover, \(M\left( {R_{T} } \right)/\sqrt[\gamma ]{{V_{G}^{D} \left( {R_{T} } \right)}}\) is a ratio which is also an improvement over \(\varvec{w}^{\text{T}} E\varvec{r}/\varvec{w}^{\text{T}} \varvec{Dw}\) in Eq. (28), and actually \(\varvec{w}^{\text{T}} E\varvec{r}/\varvec{w}^{\text{T}} \varvec{Dw}\) is not really a ratio. Third, instead of directly searching the parameters \(\alpha_{0} ,\varvec{w}_{t}\), we may let
with \(g\left( {\varvec{r}_{t} ,\psi } \right), f\left( {\varvec{r}_{t} ,\varphi } \right)\) implemented by neural networks, e.g., an ENRBF network. In the next section, we will show that a portfolio of security returns \(\varvec{r}_{t}\) may also be modeled by a temporal extension of arbitrage pricing theory such that \(\varvec{r}_{t}\) is mapped into inner factors \(\varvec{y}_{t}\) of a much lower dimension. Instead of depending on the security returns \(\varvec{r}_{t}\), we may use \(\varvec{y}_{t}\) to replace \(\varvec{r}_{t}\) in Eq. (28) for a further improvement.
Following the extension proposed in Xu (2001), most of the above-addressed extensions have been investigated together with detailed algorithms, experiments on real market data and comparative studies (Chiu and Xu 2002b, 2003, 2004b). Still, at the end of Sect. III(C) in Xu (2001), there was one briefly introduced idea that has not been further investigated yet. Here, some further details are addressed.
In Eq. (30) and also in Eq. (28), as well as in the existing studies on the Markowitz portfolio optimization and the Sharpe ratio, the expected return and volatilities are nonparametric estimates directly from samples \(R_{T} = \left\{ {\varvec{r}_{t} ,t = 1, \ldots ,T} \right\}.\) To capture a temporal dependence better, one idea is using an ARCH or GARCH model to describe a sequence {r_{t}, t = 1, …, T} of the portfolio return \(r_{t} = \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} ;\) see Box in Fig. 2. It follows from Eq. (3) that we have
Taking the expectation and separating the first term from the rest, as well as approximately considering \(E\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} \approx a_{1} \varvec{w}_{t}^{\text{T}} \varvec{r}_{t} ,\) we further get
from which we get the following GARCH-based Sharpe ratio
Given the GARCH model and the past \(Er_{t - j} , r_{t - j} , \quad j = 1, \ldots ,k,\) we have \(E\hat{r}_{t - 1} ,\) \(\hat{\sigma }_{t}^{2} ,\) r ^{AR}_{ t} , a_{1}, β_{1} available. As \(\varvec{r}_{t}\) is obtained, we compute the gradient of \(J\left( {\varvec{w}_{t} } \right)\) and update
Then, we get \(\varepsilon_{t}^{2} = (\varvec{w}_{t}^{\text{T}} \varvec{r}_{t} - r_{t}^{AR} )^{2}\) and update \(a_{i}^{\text{new}} = e^{{c_{i}^{\text{new}} }} , c_{i}^{\text{new}} = c_{i}^{\text{old}} - \eta \frac{{{\text{d}}\varepsilon_{t}^{2} }}{{{\text{d}}c_{i}^{\text{old}} }},\quad {\text{for }}i = 0,1,\) \(a_{j}^{\text{new}} = a_{j}^{\text{old}} - \eta \frac{{{\text{d}}\varepsilon_{t}^{2} }}{{{\text{d}}a_{j}^{\text{old}} }}, \quad {\text{for }}j = 2, \ldots ,q.\)
Also, we update the parameters ϑ in the same way as one standard GARCH solving approach. Next, we use Eq. (36) for updating \(\varvec{w}_{t + 1}\) again.
Market modeling: APT theory and temporal factor analysis
Arbitrage pricing theory and factor analysis’s incapability
Beyond only optimizing the outcome of investing in a portfolio of multiple assets, the Markowitz mean–variance scheme also leads to the linear modeling of the market. The most famous one is the capital asset pricing model (CAPM) (Sharpe 1964). However, the CAPM is criticized as being insufficient to describe market behavior merely via one endogenous factor.
Under the name of arbitrage pricing theory (APT), Ross (1976) proposed the following linear model of multiple hidden or endogenous factors:
\(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} . \quad (37)\)
As illustrated in Fig. 5a, \(\varvec{r}_{t}\) consists of the returns of k assets in this market, \(\varvec{f}_{t}\) consists of m risky hidden factors that affect the rates of return on all assets by different degrees of sensitivity, and a_{ij} is the sensitivity of the ith asset to factor j, also called factor loading. Moreover, each element of \(\varvec{e}_{t}\) is the risky asset’s idiosyncratic random shock with mean zero, and each element of \(\varvec{a}\) is the constant part of the return of the corresponding risky asset.
Since its inception, the APT has attracted considerable interest as a tool for interpreting investment results and controlling portfolio risk. Although the APT has been accepted by the investment community, it is not as popular as the CAPM. The reason largely relates to the APT’s serious drawback, namely, its implementation is difficult due to the lack of specificity regarding the nature of the factors that systematically affect asset returns. As outlined in Sect. I of Xu (2001), typically three types of approaches have been applied for the APT implementation.
Most of the studies are featured with \(\varvec{f}_{t}\) given by the so-called fundamental factors, i.e., historic time series of a set of macroeconomic or fundamental indexes. With the hidden factors chosen, the problem becomes a typical multivariate linear regression problem: \(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}\). However, choosing these fundamental factors is not an easy task. Chen et al. (1986) chose five macroeconomic factors, including surprises in GDP, inflation, investor confidence, and yield curve. Others consider index, spot or futures market prices, e.g., short-term interest rate, a diversified stock index, oil price, gold or precious metal prices, and currency exchange rate, in place of macroeconomic factors. Despite efforts over decades, little progress has been achieved on identifying the number and nature of these fundamental factors. Many researchers believe that this issue is essentially empirical in nature, because the factors change over time and between economies.
There have also been efforts under the name of cross-sectional approaches that observe the correlations of all the assets of \(\varvec{r}_{t}\) to each of the hidden factors in \(\varvec{f}_{t}\) over a certain period, resulting in estimates of the elements of A that reflect the assets’ sensitivities to these hidden factors. Then, the task is to estimate \(\varvec{f}_{t}\) upon \(\varvec{r}_{t}\) and A, which is typically handled as a linear cross-sectional regression and solved by the least square error method in the literature of economics and finance. In Sect. I of Xu (2001), it is formulated as an inverse mapping problem, a topic that has been widely studied in the neural network and machine learning literature.
Observation of an implementation of the least square error method actually shows that the residuals \(\varvec{e}_{t}\) are uncorrelated among the elements and also with the factors \(\varvec{f}_{t}\), and that each element of \(\varvec{e}_{t}\) reflects a collective effect of many random noises; that is, we have \(E\varvec{f}_{t} \varvec{e}_{t}^{\text{T}} = 0\) and also \(q(\varvec{r}_{t} |\varvec{f}_{t} )\) as shown by the top-down pathway on the right part of Fig. 5b. An inverse of the top-down path is a bottom-up path on the left part of Fig. 5b, for which the optimal solution is the following Bayesian inverse:
Here, we encounter a probabilistic structure \(q\left( {\varvec{f}_{t} } \right)\) of hidden factors. Approximately, if only considering its statistics up to the second order, \(q\left( {\varvec{f}_{t} } \right)\) is approximated by a Gaussian \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) as shown in Fig. 5b. In such a case, we have the following analytical solution:
which returns to a least square error solution when there is no information about \(q\left( {\varvec{f}_{t} } \right)\) for which we may simply set Λ = 0, ν = 0.
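The relation between the Gaussian Bayesian inverse and the least square error solution can be illustrated by a sketch. It assumes Gaussian noise with a known covariance and, as a choice made only for this sketch, parametrizes the prior on the factors by a precision matrix, so that "no information" corresponds to zero precision; the source's Eq. (39) may be written in a different form, and all names below are hypothetical.

```python
import numpy as np

def bayesian_inverse(r, a, A, noise_cov, prior_mean, prior_precision):
    """Posterior mean of f given r for r = a + A f + e with Gaussian f and e.

    prior_precision is the inverse covariance of the Gaussian prior on f
    (an assumption of this sketch); zero precision means a flat prior.
    """
    Si = np.linalg.inv(noise_cov)
    lhs = A.T @ Si @ A + prior_precision
    rhs = A.T @ Si @ (r - a) + prior_precision @ prior_mean
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
k, m = 6, 2                                   # number of assets and of factors
A = rng.normal(size=(k, m))                   # illustrative loadings
a = rng.normal(size=k)
f_true = np.array([0.5, -1.0])
r = a + A @ f_true + 0.01 * rng.normal(size=k)
noise_cov = 0.01 ** 2 * np.eye(k)

# Flat prior: the Bayesian inverse coincides with ordinary least squares.
flat = bayesian_inverse(r, a, A, noise_cov, np.zeros(m), np.zeros((m, m)))
lse, *_ = np.linalg.lstsq(A, r - a, rcond=None)

# Very confident prior: the estimate is pulled to the prior mean.
tight = bayesian_inverse(r, a, A, noise_cov, np.ones(m), 1e12 * np.eye(m))
print(flat, lse, tight)
```

Between these two extremes, the prior acts as a regularizer on the cross-sectional regression.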
Similar to the first approach, the second approach is also essentially empirical in nature. It needs not only manual help to identify the number and nature of the hidden factors, but also a sufficiently long period of historic data about the factors for estimating the elements of A. Moreover, getting the elements of A by the correlations between \(\varvec{f}_{t}\) and \(\varvec{r}_{t}\) actually imposes additional constraints on the values that A may take. The second approach is supplementary to the first approach, but it still cannot get rid of the fact that the factors are chosen heuristically and even rather arbitrarily. We may regard the second approach as actually consisting of two steps. First, the estimation of the elements of A based on a period of historic data of macroeconomic or fundamental indexes takes the same role as the first approach or is even just an implementation of the first approach. Second, we estimate \(\varvec{f}_{t}\) upon \(\varvec{r}_{t}\) and A, e.g., typically by Eq. (39).
The third type of efforts is called the factor-analytic approach, which attempts to use a statistical approach called factor analysis (FA) to get both the unknown \(A\) and the unknown factors \(\varvec{f}_{t}\) estimated from the observed return series \(\left\{ {\varvec{r}_{t} } \right\}\). There is no need for external heuristics, and thus it seems more appealing. As shown in Fig. 5b, an FA model comes from modifying Fig. 5a with an additional structure that \(\varvec{f}_{t}\) comes from a Gaussian \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) with a diagonal Λ or even \(\varLambda = I\). Unfortunately, empirical tests showed that factor analysis does not explain economic variables well. As addressed in Sect. I of Xu (2001), the incapability of factor analysis mainly comes from two kinds of intrinsic indeterminacy. One is the rotation indeterminacy, i.e.,
\(A\varvec{f}_{t} = \tilde{A}\tilde{\varvec{f}}_{t} ,\quad \tilde{A} = A\varphi^{\text{T}} ,\quad \tilde{\varvec{f}}_{t} = \varphi \varvec{f}_{t} \quad {\text{for any }}\varphi {\text{ with }}\varphi^{\text{T}} \varphi = I, \quad (40)\)
while such a rotation may lead to a solution far from the correct one. The other comes from an intrinsic indeterminacy of an appropriate number of factors, while the selection of a correct number of factors is essential to the performance of using the APT model. Usually, it is set by a rule of thumb. Actually, factor analysis also suffers from other types of indeterminacy. One is that any rescaling \(D\varvec{f}_{t}\) of a solution \(\varvec{f}_{t}\) by a diagonal matrix D is still a solution, which is not critical because it preserves the waveform of each element in \(\varvec{f}_{t}\). The other is the additive indeterminacy, i.e., A, Λ, Σ and A^{*}, Λ^{*}, Σ^{*} are both solutions as long as AΛA^{T} + Σ = A^{*}Λ^{*}A^{*T} + Σ^{*}. However, the effect of this indeterminacy can be reduced significantly when Σ = σ^{2}I. Therefore, our attention should be mainly on the first two key challenges, namely, removing the rotation indeterminacy by Eq. (40) and determining an appropriate number of factors.
The first challenge has seldom been considered by the APT studies in the fields of economics and finance, while there are some efforts on the second challenge, i.e., determining an appropriate number of factors with the help of statistical testing. The simplest one is making maximum likelihood factor analysis (MLFA) followed by the likelihood ratio (LR) test, shortly MLFA-LR. Empirical evidence shows that the minimum number of factors accepted by the LR test tends to increase with the number of securities. Alternatively, Chamberlain and Rothschild (1983) suggest analyzing the eigenvalues of the population covariance matrix, shortly the eigenvalue approach. Still, Brown (1989) empirically found that this approach is biased toward too few factors and that a result consistent with one factor may be equally consistent with multiple equally weighted factors.
On one hand, being essentially empirical in nature, both the fundamental factor-based approaches and the cross-sectional approaches rely on prior knowledge or external beliefs to choose the factors heuristically, lacking consensus and consistency over what the real factors in APT should be. On the other hand, the implementation of factor analysis suffers from the rotation indeterminacy by Eq. (40) and the difficulty of determining an appropriate number of factors. These problems have drawn criticisms of the APT theory, e.g., see Dhrymes et al. (1984); Abeysekera and Mahajan (1987).
Instead of doubting the correctness of the APT theory, our understanding is that the APT theory is correct but incomplete. The APT suggests modeling a market at no-arbitrage equilibrium by a linear model, which is justifiable. However, this theory is incomplete because this linear model cannot be uniquely or even reasonably specified merely from the observed return series \(\left\{ {\varvec{r}_{t} } \right\}\). To complete the theory, further specification should be imposed on the components of this model. The fundamental factor-based approaches fix the hidden factors by heuristically and empirically picking a set of macroeconomic or fundamental indexes, which removes the indeterminacy but leaves the difficult questions of how to choose these factors and whether the factors should come directly from macroeconomic or fundamental indexes. The cross-sectional approaches aim at estimating \(A\), which leaves the difficult question of how A can be estimated correctly. To get A by the assets’ sensitivities to these hidden factors, we still need to heuristically and empirically pick a set of macroeconomic or fundamental indexes. Finally, the FA model is also unable to remove the incompleteness of the APT, because imposing an additional Gaussian \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) is still not enough to remove the critical indeterminacy by Eq. (40). In summary, the original APT (Ross 1976) is reasonable but incomplete, and further efforts should explore how to add certain structure to remove or remedy the incompleteness.
Temporal factor analysis and temporal APT
The famous CAPM model is featured by one factor that is not a manually chosen exogenous macroeconomic or fundamental index but an invisible and intrinsic market indicator. The APT was motivated by following the basic spirit of CAPM to answer the criticism that merely one factor is not enough to describe the market behavior. However, implementing APT by manually picking macroeconomic or fundamental indices actually deviates from the original motivation. Encouragingly, the direction of FA implementation is still consistent with the original motivation that seeks intrinsic factors, and thus we further proceed along this direction. Keeping Eq. (37), we extend the Gaussian structure \(G\left( {\varvec{f}_{t} \left| {\nu ,\varLambda } \right.} \right)\) into a better structure such that the indeterminacy by Eq. (40) or the incompleteness of the FA model can be removed or at least remedied.
Temporal factor analysis (TFA) is such a further development of FA; see Box 1 in Fig. 3. The study started in 1997; it was first introduced briefly by Xu (1997) and further addressed in Xu (2000) (this manuscript actually reached the editorial office also in 1997). See Box 2 in Fig. 3: the key idea is modifying Eq. (37) as follows:
\(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} ,\quad \varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t} . \quad (41)\)
That is, the first-order autoregressive dependence is added to each factor in \(\varvec{f}_{t}\) via B, and Eq. (41) returns to FA by Eq. (37) when B = 0.
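To make the role of B concrete, the following sketch simulates a return series from the TFA structure, with r_t = a + A f_t + e_t and f_t = B f_{t−1} + ε_t, and then recovers the diagonal of B from the lag-one dependence of the factors. The dimensions, parameter values and the function name are illustrative assumptions; the factors are observed here only because the data are simulated.

```python
import numpy as np

def simulate_tfa(a, A, B, noise_std, T, rng):
    """Draw T samples from r_t = a + A f_t + e_t with f_t = B f_{t-1} + eps_t."""
    k, m = A.shape
    f = np.zeros(m)
    factors, returns = np.empty((T, m)), np.empty((T, k))
    for t in range(T):
        f = B @ f + rng.normal(size=m)                  # eps_t ~ G(0, I)
        factors[t] = f
        returns[t] = a + A @ f + noise_std * rng.normal(size=k)
    return factors, returns

rng = np.random.default_rng(1)
k, m = 5, 2
A = rng.normal(size=(k, m))          # illustrative loadings
a = np.zeros(k)
B = np.diag([0.9, 0.2])              # one AR(1) coefficient per factor
factors, returns = simulate_tfa(a, A, B, noise_std=0.1, T=5000, rng=rng)

# Least squares estimate of each AR(1) coefficient from consecutive factor values.
f0, f1 = factors[:-1], factors[1:]
lag1 = (f0 * f1).sum(axis=0) / (f0 * f0).sum(axis=0)
print(lag1)
```

Setting `B` to a zero matrix in this sketch makes the factors temporally independent, which corresponds to the FA case.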
It is this temporal dependence that removes the rotation indeterminacy by Eq. (40); see Sect. IV(A) in Xu (2000) and Sect. II in Xu (2002). Roughly, the following points may be understood:

For any nonsingular diagonal matrix D, we have \(A\varvec{f} = \tilde{A}\tilde{\varvec{f}},\tilde{A} = AD,\tilde{\varvec{f}} = D^{ - 1} \varvec{f},\) which keeps the format \(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}\) unchanged, and the elements of \(\tilde{\varvec{f}}\) also remain mutually independent, i.e., Eq. (37) has an indeterminacy of unknown scaling on the factors \(\tilde{\varvec{f}}\). Thus, we may simply consider \(\varvec{f}_{t} \sim G\left( {\varvec{f}_{t} \left| {0,I} \right.} \right)\). For any rotation matrix φ with \(\varphi^{\text{T}} \varphi = I\), we have \(A\varvec{f} = \tilde{A}\tilde{\varvec{f}},\) and \(\tilde{A} = A\varphi^{\text{T}} ,\tilde{\varvec{f}} = \varphi \varvec{f}\) with \(\tilde{\varvec{f}}_{t} \sim G\left( {\tilde{\varvec{f}}_{t} \left| {0,I} \right.} \right)\). That is, Eq. (37) also has an indeterminacy of unknown rotation on the factors \(\tilde{\varvec{f}}\).

For any nonsingular diagonal matrix D, we also have \(D^{ - 1} \varvec{f}_{t} = D^{ - 1} BDD^{ - 1} \varvec{f}_{t - 1} + D^{ - 1} \varepsilon_{t}\) and \(\tilde{\varvec{f}}_{t} = B\tilde{\varvec{f}}_{t - 1} + \tilde{\varepsilon }_{t} ,\) where \(\tilde{\varepsilon }_{t} = D^{ - 1} \varepsilon_{t}\) comes from \(G\left( {\tilde{\varepsilon }_{t} \left| {0,D^{ - 1} \varLambda D^{ - 1} } \right.} \right)\) and \(D^{ - 1} \varLambda D^{ - 1}\) is still diagonal. That is, Eq. (41) still has an indeterminacy of unknown scaling on the factors \(\tilde{\varvec{f}}\). Again, we may consider \(\varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,I } \right.} \right).\) For any rotation matrix φ with φ^{T}φ = I, we have \(\tilde{\varvec{f}}_{t} = \tilde{B}\tilde{\varvec{f}}_{t - 1} + \tilde{\varepsilon }_{t}\) with \(\tilde{\varepsilon }_{t} \sim G\left( {\tilde{\varepsilon }_{t} \left| {0,I } \right.} \right)\), while \(\tilde{B} = \varphi B\varphi^{\text{T}}\) is no longer diagonal even if B is diagonal. If \(\tilde{B} = \varphi B\varphi^{\text{T}}\) is required to be diagonal, the only rotation matrix is φ = I and thus the rotation indeterminacy is removed.
Still there is an indeterminacy of unknown scaling on factors of \(\tilde{\varvec{f}}\), but it will not change the waveform of f_{1,t}, …, f_{n,t}. Also, we may normalize each factor to remove such indeterminacy.
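The two points above can be checked numerically with a two-factor sketch. The loading matrix, the rotation angle and the diagonal entries of B are arbitrary illustrative values. The last check adds an observation of this sketch rather than a statement from the cited works: when the diagonal entries of B coincide, the rotated matrix stays diagonal, so the argument relies on the factors having different AR coefficients.

```python
import numpy as np

theta = 0.6
phi = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])   # rotation with phi^T phi = I

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 2))       # illustrative loadings
B = np.diag([0.9, 0.2])           # diagonal AR(1) matrix with distinct entries

# FA: with f ~ G(0, I), A and A phi^T imply the same covariance A A^T + Sigma
# of the returns, so the two loadings cannot be told apart from the data.
A_rot = A @ phi.T
same_covariance = np.allclose(A @ A.T, A_rot @ A_rot.T)

# TFA: the rotated factors follow an AR(1) model with matrix phi B phi^T,
# which is no longer diagonal when the diagonal entries of B differ.
B_rot = phi @ B @ phi.T
off_diagonal = B_rot[0, 1]

# Caveat: equal diagonal entries leave the rotated matrix diagonal.
B_equal_rot = phi @ (0.5 * np.eye(2)) @ phi.T
print(same_covariance, off_diagonal, B_equal_rot)
```

Requiring the AR matrix to remain diagonal therefore rules out the rotation in the first case, while the static FA model gives no way to detect it.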
In Xu (2001), the TFA by Eq. (41) is thus suggested as a refinement of the original APT theory, by which the original part of APT is kept without modification, while a temporal structure \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\) is added such that the incompleteness caused by the rotation indeterminacy is removed. Such a refinement may be called temporal APT in the sense that the temporal relation is taken into consideration in market modeling. That is, a static equation by Eq. (37) is not enough to describe a market equilibrium, but a temporal structure should be an important ingredient of a market equilibrium.
Why is an AR model of merely order one \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\) considered as this temporal structure? First, we consider that the hidden factors \(\varvec{f}_{t}\) are driven by Gaussian noise \(\varepsilon_{t} \sim G\left( {\varepsilon_{t} \left| {0,\varLambda } \right.} \right),\) following a general consensus that the noisy component in most econometric and statistical models is Gaussian distributed. The rationale comes from the central limit theorem, which implies that the compounding of a large number of unknown distributions will be approximately normal. Second, the first-order AR model can be attributed to the weak form of the efficient market hypothesis (EMH), that is, the stock price today is conditionally independent of all previous prices given the price of yesterday. Third, though observable economic indices are seldom independent, it cannot be ruled out that the hidden factors that dominate a market equilibrium are mutually independent. Instead, independent factors may help to make the market equilibrium simpler.
As addressed in the previous subsection, past efforts on determining an appropriate number of factors have not provided much support for the APT. For one example, the MLFA-LR test shows that the number of factors tends to increase with the number of securities. For another example, the identification via the eigenvalue approach (Chamberlain and Rothschild 1983) is biased toward a smaller factor number. In one IJCNN 02 paper (Chiu and Xu 2002a), empirical tests on Hong Kong stock market data show not only that these two unfavorable biases are again observed, but also that the TFA-based APT can provide a reasonable answer to the number of factors in the Hong Kong stock market. As shown in Fig. 6, the number of factors identified by the MLFA-LR test varies with the number of securities, while the number of factors identified by the eigenvalue approach is always 1. In contrast, the TFA based on BYY harmony learning stably identifies four or five factors regardless of the number of securities, which is quite consistent with the number identified via heuristic empirical analysis, e.g., in Chen et al. (1986).
The above-introduced nature of TFA and the preliminary studies suggest that a renewed interest may be needed in the literature of finance and economics to further investigate APT and its further developments. To consider which topics to pursue, it is helpful to observe the differences of TFA from related methods.
First, \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\) in Eq. (41) is actually a special type of the first-order vector AR (VAR). Being different from the conventional VAR that is used for capturing linear interdependencies among multiple time series (Sims 1980; Engle and Granger 1987), the TFA captures the interdependencies among multiple time series by \(\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}\) and the temporal dependences by \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\). As addressed in Sect. 3.2.1 in Xu (2012), it is more efficient to treat these two types of dependences separately.
Second, if we do not constrain B, Λ to be diagonal, Eq. (41) becomes a general state–space model (SSM) or a linear dynamical system (LDS), which has been widely studied in the literature of control theory and signal processing. As outlined in Sect. 5.2.1 of Xu (2012), in a period that is more or less the same as the studies on TFA (Xu 1997; 2000), there was a renewed interest in a general LDS, featured by using the EM algorithm for parameter estimation under the ML learning (Ghahramani and Hinton 2000). This EM algorithm was originally derived in the early 1980s and reintroduced in the early 1990s (Shumway and Stoffer 1991). These studies neither suggest using the LDS as a further development of APT nor take the notorious rotation indeterminacy by Eq. (40) into consideration. On the contrary, more problems of indeterminacy than in the FA are actually incurred in this general LDS model due to many extra free parameters, which makes identifiability even worse. For example, in an application to radar automatic target recognition based on high-resolution range profiles, it has been shown in Wang et al. (2011) that the recognition performance of the general LDS is actually even inferior to that of the FA, while TFA obtains better performances than the FA.
Third, many efforts have been made on determining the factor number of FA in the literature of statistics and machine learning, typically in a two-stage implementation. The first stage uses the EM algorithm to make the ML learning for the unknown parameters in the FA, while the second stage selects an appropriate number of factors with the help of a model selection criterion. In Tu and Xu (2011), a systematic comparative investigation has been made on a number of typical model selection criteria, including not only Akaike’s AIC, Schwarz’s BIC, Bozdogan’s CAIC and the Hannan–Quinn criterion, but also the more recent Minka’s PCA criterion, Kritchman and Nadler’s tests, and Perry and Wolfe’s rank, as well as the criterion obtained from the BYY harmony learning theory (Xu 2001).
As discussed above, there is not really a need to further consider the relations to VAR and LDS. Instead, further explorations may start from continuing the study in the IJCNN02 paper (Chiu and Xu 2002b) and proceed to clarify the following issues:

Does using one of the above model selection criteria in a two-stage implementation improve the number of FA factors identified by the MLFA-LR test and the eigenvalue approach? If yes, does this improvement help the FA-based implementation of APT, even though it still suffers from the rotation indeterminacy by Eq. (40)?

Still using one of the above model selection criteria in a two-stage implementation, how much improvement can TFA obtain after removing the rotation indeterminacy by \(\varvec{f}_{t} = B\varvec{f}_{t - 1} + \varepsilon_{t}\)?
Additionally, studies may be made on data from other major international markets, with past empirical analyses (e.g., Chen et al. 1986; Azeez and Yonezawa 2006) as references. In addition to a two-stage implementation, one promising feature of implementing the TFA by the BYY harmony learning (Xu 2001) is that the number of temporal factors is determined automatically during learning, which saves computational costs greatly and also improves the learning performance of TFA; for details, see Sect. 5 of Xu (2010) and Sect. 5.2 of Xu (2012).
Macroeconomics-modulated TFA-APT and nGCH-driven MTFAO
In those empirical APT studies, the practice that uses macroeconomic indexes as \(\varvec{f}_{t}\) leads to an understanding that \(\varvec{f}_{t}\) typically consists of a set of macroeconomic or fundamental indexes. In an FA implementation or a TFA implementation by Eq. (41), such an understanding may not be correct. Actually, \(\varvec{f}_{t}\) may vary much slower than the return \(\varvec{r}_{t}\) and thus be regarded as a macroeconomic type of indices. However, \(\varvec{f}_{t}\) may also vary in a timescale similar to the changes of \(\varvec{r}_{t}\). Moreover, \(\varvec{f}_{t}\) in Eq. (41) is intrinsically determined from real data \(\varvec{r}_{t}\) and usually will not coincide with exogenous macroeconomic indexes, such as GDP, inflation, investor confidence, and yield curve. Therefore, we need to further investigate how the market is influenced by these exogenous variables or macroeconomic indexes.
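Unlike many existing studies that explicitly model the relation between the market return \(\varvec{r}_{t}\) and macroeconomic indices, the influences of these indices on \(\varvec{r}_{t}\) are considered via their roles in modulating the temporal factors in \(\varvec{f}_{t}\), as shown in Fig. 3 by Box 3. This idea is realized by extending Eq. (41) into the following macroeconomics-modulated TFA–APT:
$$\begin{aligned} \varvec{r}_{t} &= \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t}, \\ \varvec{f}_{t} &= B\varvec{f}_{t - 1} + H\varvec{m}_{t} + \varepsilon_{t}, \\ \varvec{m}_{t} &= C\varvec{\nu}_{t} + \eta_{t}, \end{aligned}$$(42)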
where \(\varvec{e}_{t}\), \(\varepsilon_{t}\), and \(\eta_{t}\) are Gaussian white noises that are independent of each other. Typically, \(\varvec{m}_{t}\) consists of several macroeconomic indices, and \(\varvec{\nu}_{t}\) consists of several known nonmarket factors that affect the macroeconomy. Specifically, \(H\varvec{m}_{t}\) describes the effect of the macroeconomic indices on the security market via the hidden factors \(\varvec{f}_{t}\). Actually, Eq. (42) comes from a simplification of the model proposed in Sect. III(C) of Xu (2001), in particular its Eq. (101), under the name of macroeconomics-modulated independent state–space model.
In one CIFEr2003 conference paper (Chiu and Xu 2003), an empirical investigation is made on the model by Eq. (42). First, white noise tests are made on \(\varvec{e}_{t}\), \(\varepsilon_{t}\), and \(\eta_{t}\) to ensure model specification adequacy. Second, the performances in return prediction and index forecasting are compared with those of the TFA model. Empirical results reveal that the model is not only well specified, but also superior to the TFA model in stock price and index forecasting.
As shown by Box 4 in Fig. 3, there are two ways to perform prediction based on Eq. (41) and Eq. (42). The first way is to get \(\varvec{r}_{t - 1} \to \varvec{f}_{t - 1}\) and predict \(\hat{\varvec{r}}_{t} = \varvec{a} + AB\varvec{f}_{t - 1}\) for Eq. (41) and \(\hat{\varvec{r}}_{t} = \varvec{a} + A\left( {B\varvec{f}_{t - 1} + H\varvec{m}_{t} } \right)\) for Eq. (42). The second way is to make a prediction \(\varvec{r}_{t - 1} \to \varvec{y}_{t}\) via \(\varvec{r}_{t - 1} \to \varvec{f}_{t - 1}\), \(B\varvec{f}_{t - 1} \to \varvec{f}_{t}\) and then \(\varvec{f}_{t} \to \varvec{y}_{t}\) by learning either a linear or a nonlinear regression, where \(\varvec{y}_{t}\) could be either \(\varvec{r}_{t}\) or any type of market index. In one paper (Chiu and Xu 2002), \(\varvec{f}_{t} \to \varvec{y}_{t}\) is implemented by the normalized radial basis function (NRBF) and extended NRBF (ENRBF) (Xu 1998, 2009) to predict the stock price or return \(\varvec{r}_{t}\). Empirical studies on Hong Kong market data have shown the superiority of this prediction over not only a conventional prediction \(\varvec{f}_{t} \to \varvec{y}_{t}\), but also the prediction \(\hat{\varvec{r}}_{t} = \varvec{a} + AB\varvec{f}_{t - 1}\).
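The first prediction route can be sketched numerically as follows. This is only a minimal illustration, assuming that the parameters \(\varvec{a}, A, B, H\) have already been learned and that \(\varvec{f}_{t - 1}\) has already been inferred from \(\varvec{r}_{t - 1}\); the function name `predict_return` and all numerical values are hypothetical and are not taken from the cited studies.

```python
import numpy as np

def predict_return(a, A, B, f_prev, H=None, m_t=None):
    """One-step prediction r_hat_t = a + A (B f_{t-1} [+ H m_t]).

    Without H and m_t this is the TFA prediction for Eq. (41);
    with them it is the macroeconomics-modulated prediction for Eq. (42).
    """
    f_pred = B @ f_prev                  # propagate the hidden factors one step
    if H is not None and m_t is not None:
        f_pred = f_pred + H @ m_t        # add the macroeconomic modulation
    return a + A @ f_pred

# Hypothetical toy parameters: 3 securities, 2 hidden factors, 1 macro index.
a = np.array([0.01, 0.02, 0.0])
A = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 2.0]])
B = np.diag([0.9, 0.5])
H = np.array([[0.1],
              [0.2]])
f_prev = np.array([1.0, -2.0])           # assumed to be inferred from r_{t-1}
m_t = np.array([0.5])

r_hat_tfa = predict_return(a, A, B, f_prev)
r_hat_mod = predict_return(a, A, B, f_prev, H, m_t)
print(r_hat_tfa, r_hat_mod)
```

The inference step \(\varvec{r}_{t - 1} \to \varvec{f}_{t - 1}\) is deliberately left out of this sketch, since it depends on the learning method used.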
Based on Eqs. (41) and (42), in addition to making a prediction featured with learning a regression \(\varvec{f}_{t} \to \varvec{y}_{t}\), we may also use \(\varvec{f}_{t}\) to replace \(\varvec{r}_{t}\) in the previous Eq. (29) for adaptive portfolio management; see Box 5 in Fig. 3. This APT-based portfolio management was first suggested in Sect. III(C), and especially by Eqs. (96) and (97), in Xu (2001). Extensive simulation results reveal that this \(\varvec{f}_{t}\)-based portfolio management generally outperforms the return \(\varvec{r}_{t}\)-based portfolio management by Eq. (29) (Chiu and Xu 2004b).
In general, a parametric \(\varvec{y}_{t} = g\left( {\varvec{f}_{t} ,\theta } \right)\) can be added to Eq. (41) to provide the outputs of this model for application purposes such as prediction and portfolio management. Moreover, beyond Gaussian white noise as the driving noise \(\varepsilon_{t}\), we may consider a non-Gaussian driving noise \(\varepsilon_{t}\) or a driving noise \(\varepsilon_{t}\) with conditional heteroskedasticity. In summary, we further generalize Eq. (42) into the following model

(a)
$$\varvec{r}_{t} = \varvec{a} + A\varvec{f}_{t} + \varvec{e}_{t} , \quad {\text{E}}\,\varvec{f}_{t} \varvec{e}_{t}^{\text{T}} = 0,$$
\(\varvec{e}_{t} \overset{\text{i.i.d.}}{\sim} G(\varvec{e}_{t} | 0, \varSigma_{e} )\) with a diagonal covariance \(\varSigma_{e}\);

(b)
\(\varvec{y}_{t} = g\left( {\varvec{f}_{t} ,\theta } \right);\)

(c)
$$\begin{aligned} \varvec{f}_{t} &= B\varvec{f}_{t - 1} + H\varvec{m}_{t} + {\text{diag}}\left[ {\sigma_{t}^{(1)} , \ldots ,\sigma_{t}^{(m)} } \right]\varepsilon_{t} , \quad q\left( {\varepsilon_{t} } \right) = \prod_{j} q\left( {\varepsilon_{t}^{(j)} } \right), \\ \varepsilon_{t} &= [\varepsilon_{t}^{(1)} , \ldots ,\varepsilon_{t}^{(m)} ]^{\text{T}} , \quad {\text{E}}\,\varvec{f}_{t - 1} \varepsilon_{t}^{\text{T}} = 0, \quad {\text{E}}\,\varvec{m}_{t} \varepsilon_{t}^{\text{T}} = 0, \quad {\text{E}}\,\varepsilon_{t}^{(j)} = 0, \quad {\text{E}}\,(\varepsilon_{t}^{(j)})^{2} = 1, \\ q\left( {\varepsilon_{t}^{(j)} } \right) &= \begin{cases} G(\varepsilon_{t}^{(j)} | 0, 1), & \text{(i) one Gaussian}, \\ \sum_{i} \alpha_{i}^{(j)} G(\varepsilon_{t}^{(j)} | \mu_{i}^{(j)} , \lambda_{i}^{(j)} ), & \text{(ii) Gaussian mixture}; \end{cases} \\ \sigma_{t}^{(j)} &= \begin{cases} \text{a constant } \sigma^{(j)} , & \text{(a) non-heteroskedasticity}, \\ \sigma_{t}^{(j)} \left( {\vartheta^{(j)} } \right) \text{ given by Eq. (3)}, & \text{(b) heteroskedasticity}; \end{cases} \end{aligned}$$

(d)
$$\begin{aligned} \varvec{m}_{t} &= C\varvec{\nu}_{t} + \eta_{t} , \quad {\text{E}}\,\varvec{\nu}_{t} \eta_{t}^{\text{T}} = 0, \\ \eta_{t} &\overset{\text{i.i.d.}}{\sim} G(\eta_{t} | 0, \varSigma_{\eta } ) \ \text{with a diagonal covariance } \varSigma_{\eta } . \end{aligned}$$(43)
Its basic part consists of ingredients (a), (b) and (c). In the special case \(H = 0\), it functions as TFA with two extensions. One is outputting \(\varvec{y}_{t}\), thus shortly denoted by TFAO. The other is that ingredient (c) drives \(\varvec{f}_{t}\) by its last term, which is either or both of non-Gaussian (nG) and conditionally heteroskedastic (CH), for which we use nGCH-driven TFAO to refer to this formulation. When \(H \ne 0\), \(\varvec{f}_{t}\) is also modulated by the macroeconomic market force \(\varvec{m}_{t}\), which leads to the general formulation shortly named nGCH-driven MTFAO.
The central role is taken by the statistical nature of ingredient (c), with several scenarios as follows:

For the case that \(B = 0, H = 0\) and \(q(\varepsilon_{t}^{(j)})\) in Choice (i) as well as \(\sigma_{t}^{(j)}\) in Choice (a), ingredient (a) and ingredient (c) jointly degenerate back to the FA-based implementation of the original APT by Eq. (37).

For the case that \(B = 0\), \(\varepsilon_{t} = 0\), it follows from \(\tilde{A} = AH\) that ingredient (a) and ingredient (c) jointly degenerate back to the fundamental-factor-based implementation of the original APT by Eq. (37).

For the case that \(B = 0\), \(q(\varepsilon_{t}^{(j)})\) in Choice (i), and \(\sigma_{t}^{(j)}\) in Choice (a), ingredient (a) and ingredient (c) jointly act as a combination of the above two implementations.

For the case that \(H = 0\), \(q(\varepsilon_{t}^{(j)})\) in Choice (i), and \(\sigma_{t}^{(j)}\) in Choice (a), as well as B = diag[b_{1}, …, b_{m}]^{T}, ingredient (a) and ingredient (c) jointly become the TFA-based implementation by Eq. (41). It further becomes Eq. (42) when \(H \ne 0\). Moreover, conditional heteroskedasticity is further considered in \(\varepsilon_{t}\) by replacing Choice (a) of \(\sigma_{t}^{(j)}\) with Choice (b). As shown by the empirical investigation in the CIEF’2003 conference paper (Chiu and Xu 2003), the TFA-based implementation that considers conditional heteroskedasticity performs considerably better than the TFA-based implementation without such a consideration.
Another alternative is that Choice (i) of a Gaussian \(q(\varepsilon_{t}^{(j)})\) is replaced by Choice (ii) of a non-Gaussian \(q(\varepsilon_{t}^{(j)})\). In the simplest case, \(B = 0\), \(H = 0\), and \(\sigma_{t}^{(j)}\) in Choice (a), ingredient (a) and ingredient (c) jointly degenerate back to the non-Gaussian FA (NFA) as outlined in Fig. 3 by Box 6; for details, see Sect. III(A) in Xu (2001), Sect. IV in Xu (2004), and Sect. 3.2 in Xu (2010). Accordingly, we get a non-Gaussian APT as shown in Fig. 3 by Box 7. Interestingly, NFA can also remove the FA’s rotation indeterminacy by Eq. (40), though there is no temporal structure of \(\varvec{f}_{t}\) in consideration because \(B = 0\), \(H = 0\). Similar to Fig. 6, Fig. 7 shows the results of an empirical investigation on determining the appropriate factor number of APT by NFA (Chiu and Xu 2004a), still in comparison with the results of the MLFALR test and the eigenvalue approach as listed in Fig. 7a. Again, the BYY harmony learning-based NFA stably identified four or five factors regardless of the number of securities.
This alternative provides a different perspective on how to remove the indeterminacy by Eq. (40) or the incompleteness of APT. Without the additional equation about \(\varvec{f}_{t}\), the formulation of the NFA implementation seems closer than the TFA implementation to the original APT formulation by Eq. (37). Naturally, a question arises on which one is right, TFA or NFA. Actually, they are two aspects of one market model. TFA observes a dynamic market process, while NFA describes the market with all the time points projected to one observation spot such that a Gaussian process is projected to be observed as a mixture of Gaussian distributions. Generally, we may have two natures to be considered in the same market, that is, considering both B = diag[b_{1}, …, b_{m}]^{T} and Choice (ii) of a non-Gaussian \(q(\varepsilon_{t}^{(j)})\). Even more generally, conditional heteroskedasticity may also be added by letting \(\sigma_{t}^{(j)}\) be in Choice (b). Systematically integrating all the parts and all the ingredients together, Eq. (43) may serve as a general formulation for financial market modeling.
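To make the generative structure of Eq. (43) concrete, the following sketch simulates ingredients (a), (c) and (d) under Choice (ii) for \(q(\varepsilon_{t}^{(j)})\) and Choice (a) for \(\sigma_{t}^{(j)}\). It is an illustration under stated assumptions rather than the author's implementation: the function names `sample_mixture_noise` and `simulate_mtfao`, the symmetric two-component mixture, and all parameter values are hypothetical, and ingredient (b) as well as Choice (b) by Eq. (3) are omitted.

```python
import numpy as np

def sample_mixture_noise(rng, size, mu=0.8):
    """Two-component Gaussian mixture with zero mean and unit variance.

    Components sit at +mu and -mu with equal weight; the component
    variance 1 - mu**2 makes the total variance equal to 1.
    """
    signs = rng.choice([-1.0, 1.0], size=size)
    lam = 1.0 - mu ** 2
    return signs * mu + np.sqrt(lam) * rng.standard_normal(size)

def simulate_mtfao(T, a, A, B, H, C, sigma, sig_e, sig_eta, nu, seed=0):
    """Simulate r_t, f_t, m_t from ingredients (a), (c), (d) of Eq. (43)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    d = C.shape[0]
    f = np.zeros(m)
    R, F, M = np.empty((T, n)), np.empty((T, m)), np.empty((T, d))
    for t in range(T):
        m_t = C @ nu[t] + sig_eta * rng.standard_normal(d)   # ingredient (d)
        eps = sample_mixture_noise(rng, m)                    # Choice (ii)
        f = B @ f + H @ m_t + sigma * eps                     # ingredient (c), Choice (a)
        r_t = a + A @ f + sig_e * rng.standard_normal(n)      # ingredient (a)
        R[t], F[t], M[t] = r_t, f, m_t
    return R, F, M

# Hypothetical sizes: 4 securities, 2 factors, 1 macro index, 1 nonmarket factor.
T, n, m, d, p = 200, 4, 2, 1, 1
rng0 = np.random.default_rng(1)
A = rng0.standard_normal((n, m))
nu = rng0.standard_normal((T, p))
R, F, M = simulate_mtfao(
    T, a=np.zeros(n), A=A, B=np.diag([0.9, 0.5]),
    H=np.array([[0.3], [0.0]]), C=np.array([[1.0]]),
    sigma=np.array([0.5, 0.5]), sig_e=0.1 * np.ones(n),
    sig_eta=0.1 * np.ones(d), nu=nu,
)
print(R.shape, F.shape, M.shape)
```

Setting `H` to zero gives the nGCH-driven TFAO special case without macroeconomic modulation, and additionally setting `B` to zero gives data of the NFA type.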
Bayesian Ying–Yang harmony learning and two exemplar learning algorithms
Bayesian Ying–Yang (BYY) harmony learning
The Bayesian Ying–Yang (BYY) harmony learning was proposed in Xu (1995a, b) and subsequently developed systematically (Xu 2001, 2007, 2010, 2012). It provides not only a framework that accommodates typical learning approaches from a unified perspective, but also a new road that leads to improved model selection criteria, Ying–Yang alternative learning with automatic model selection, as well as coordinated implementation of Ying-based model selection and Yang-based learning regularization.
From a modern science perspective that regards the famous ancient Yin–Yang philosophy as a meta theory of system sciences and intelligent systems, a system that survives and interacts with its world can be regarded as a Ying–Yang system that is functionally composed of two complementary parts. One is called Ying, from its inside into its external world, by which a set \(\varvec{X}_{N} = \{ x_{t} \}_{t = 1}^{N}\) of samples is regarded as generated from its representation \(\varvec{R}\), while the other is called Yang, from an external world into its inside. A two-directional view is considered via the joint distribution of \(\varvec{X},\varvec{R}\) in two types of Bayesian decomposition. The decomposition of \(p\left( {\varvec{X},\varvec{R}} \right)\) coincides with the Yang concept, with a visible domain \(p\left( \varvec{X} \right)\) for a Yang space and an \(\varvec{X} \to \varvec{R}\) pathway by \(p(\varvec{R}|\varvec{X})\) as a Yang pathway. Thus, \(p\left( {\varvec{X},\varvec{R}} \right)\) is called the Yang machine. Also, \(q\left( {\varvec{X},\varvec{R}} \right)\) is called the Ying machine, with an invisible domain \(q\left( \varvec{R} \right)\) for a Ying space and an \(\varvec{R} \to \varvec{X}\) pathway by \(q(\varvec{X}|\varvec{R})\) as a Ying pathway. Such a Ying–Yang pair is called a Bayesian Ying–Yang (BYY) system. The Ying–Yang pair interacts under the principle of best harmony, which is mathematically implemented by maximizing
For a machine learning or modeling purpose, we first need to consider a mathematical representation for \(\varvec{R}\). The first column of Table 1 lists several typical examples. Usually, \(\varvec{R}\) consists of two parts. One is a long-term memory θ that consists of all unknown parameters in the system for collectively representing the underlying structure of \(\varvec{X}_{N}\), while the other is a short-term memory \(\{Y, L\}\) with each element being either or both of a categorical label ℓ ∊ L and a vector y ∊ Y as the corresponding inner representation of one element x ∊ X. For example, we have a vector y for describing \(\varvec{f}_{t}\) in the APT model by Eq. (37), while we simply have a label ℓ in the time series model by Eq. (4).
The probabilistic structure q(Y, L) is considered jointly with \(q(\varvec{X}|\varvec{R}) = q(\varvec{X}|Y,L,\theta )\), depending on both the tasks in consideration and a trade-off between the complexity of q(Y, L) and the complexity of \(q(\varvec{X}|Y,L,\theta )\). For the task of TFA modeling by Eq. (41), we have \(q(\varvec{X}|Y,L,\theta )\) by \(q(\varvec{r}_{t} |\varvec{f}_{t} )\) and q(Y, L) by \(q\left( {\varvec{f}_{t} |\varvec{f}_{t - 1} } \right)\) as follows:
Moreover, the remaining part in \(q(\varvec{R}) = q(Y, L|\theta )q(\theta )\) is usually called a priori q(θ), which is chosen depending on the types of parameters and their positions in the Ying machine. In general, a Ying machine \(q(\varvec{X}, \varvec{R}) = q(\varvec{X}|\varvec{R})q(\varvec{R})\) is designed according to a least complexity principle, featured with designing \(q(\varvec{R}) = q(Y, L|\theta )q(\theta )\) in a least redundancy principle and designing \(q(\varvec{X}|\varvec{R}) = q(\varvec{X}|Y,L,\theta )\) in a divide–conquer principle.
For the Yang machine \(p(\varvec{X}, \varvec{R}) = p(\varvec{R}|\varvec{X})p(\varvec{X})\), \(p(\varvec{X})\) directly comes from samples \(\varvec{X}_{N}\), while \(p(\varvec{R}|\varvec{X})\) is designed based on the Ying machine \(q(\varvec{X}, \varvec{R}) = q(\varvec{X}|\varvec{R})q(\varvec{R})\) according to the variety preservation principle, that is
where \({\text{Cov}}_{R|X}\) indicates a covariance matrix of R conditioning on X. Readers are referred to Xu (2010, 2012) for recent systematic outlines on major issues for designing Ying–Yang machines. To be specific, reading is suggested to start with Sect. 3.2 in Xu (2012) and refer to Sect. 4.2 in Xu (2010) for supplementary materials. Also, readers are referred to Xu (2011) for another perspective that a co-dimensional matrix pair forms a building unit and a hierarchy of such building units sets up the BYY system.
With a BYY system designed, all the remaining unknowns in the system are determined via maximizing the harmony functional by Eq. (44). Typically, there are two types of unknowns. Given the structure of a BYY system or a parametric model in general, it actually means a family of infinitely many candidate structures, each in the same configuration but in a different scale. That is, each candidate is featured by a scale parameter \(\varvec{k}\) in terms of one integer or a set of integers. For example, \(\varvec{k}\) consists of the model number k and the orders {q_{i}} for the model in Eq. (3), while it consists merely of the dimension k in the APT model by Eq. (37).
The second type of unknown is featured by a set \(\theta_{\varvec{k}}\) of unknown parameters within the candidate structure featured by a specific \(\varvec{k}\). Accordingly, maximizing the harmony functional \(H(p\Vert q)\) by Eq. (44) makes both parameter learning on determining \(\theta_{\varvec{k}}\) and model selection on determining \(\varvec{k}\). This BYY best harmony learning provides a favorable mechanism for model selection. Readers are referred to Xu (2010, 2012) for recent systematic overviews on the fundamentals, the novelties and favorable natures of the BYY best harmony learning. To be specific, reading is suggested to start with Sect. 4.1 in Xu (2012) on two different aspects of measuring bi-entity proximity and Sect. 4.2 on the BYY harmony learning from the perspectives of Ying–Yang best matching versus Ying–Yang best harmony, and then proceed to Sect. 7 for a systematic outline on the thirteen topics about the BYY best harmony learning. Also, readers are referred to Xu (2010) for supplementary materials in Sect. 4.1 and the roadmap shown in Fig. A2 for the relations to other typical learning approaches.
The implementation of maximizing \(H(p\Vert q)\) consists of different specific cases for different learning problems and application tasks. Inputting the samples \(\varvec{X}_{N}\) by \(p\left( \varvec{X} \right) = \delta \left( {\varvec{X} - \varvec{X}_{N} } \right)\), \(H(p\Vert q)\) in Eq. (44) is simplified into the one on the top of Table 1. As \(\varvec{R}\) takes different specific forms given in the first column of Table 1, we have four types of \(H(p\Vert q)\) as listed in the second column of the table, plus their corresponding special cases of i.i.d. samples \(\left\{ {x_{t} } \right\}_{t = 1}^{N}\).
Moreover, the collective operations \(\int {[ \bullet ]} \,{\text{d}}Y_{N}\) and \(\sum_{L} \left[ { \bullet } \right]\) may be simplified by removing the integral or the summation to merely consider their optimal values, from which those of \(H(p\Vert q)\) in the second column of Table 1 result in the corresponding counterparts of \(H(\varTheta_{\varvec{k}} |X_{N} )\) in the third column of the table. Each type in the second column may have more than one counterpart by removing either or both of the two collective operations. Such a removal makes the learning implementation of \(H(\varXi_{\varvec{k}} |X_{N} )\) easier, but the learned system becomes more prone to overfitting on a small sample.
As addressed at the end of the “Learning mixture of AR, ARMA, ARCH and GARCH models” section, the BYY harmony learning has an automatic model selection mechanism similar to the RPCL learning. Additionally, \(H(\varTheta_{\varvec{k}} |X_{N} )\) in the third column of Table 1 provides another angle to view such a mechanism. For example, observing the choice (a) in the bottom box of the table, maximizing \(H(\varTheta_{\varvec{k}} |X_{N} )\) consists of maximizing not only \(p\left( {\theta |X_{N} , \varXi } \right)\), which is the same as in Bayesian learning, but also \(\sum\nolimits_{t = 1}^{N} p(y_{t} ,\ell_{t} | x_{t} ,\theta )\pi (x_{t} ,y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )\), which includes maximizing a term \(\omega_{{y_{t} ,\ell_{t} }} \ln \omega_{{y_{t} ,\ell_{t} }}\) with \(\omega_{{\ell_{t} }} = q(x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} )q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )\). Noticing that \(\omega_{{y_{t} ,\ell_{t} }} \ln \omega_{{y_{t} ,\ell_{t} }}\) is monotonically increasing for \(\omega_{{\ell_{t} }} > e^{ - 1}\) but decreasing for \(\omega_{{\ell_{t} }} < e^{ - 1}\), a value \(\omega_{{\ell_{t} }} = q(x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} )q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} ) > e^{ - 1}\) indicates that the current fit to x_{t} is above this threshold, and increasing \(\omega_{{\ell_{t} }} \ln \omega_{{\ell_{t} }}\) enhances learning by \(q\left( {x_{t} |y_{t} ,\ell_{t} ,\theta_{{\ell_{t} }} } \right)q(y_{t} ,\ell_{t} |\theta_{{\ell_{t} }} )\) to fit x_{t}; while a value \(\omega_{{\ell_{t} }} < e^{ - 1}\) indicates that this fit is below the threshold, and increasing \(\omega_{{\ell_{t} }} \ln \omega_{{\ell_{t} }}\) actually reduces this fit, i.e., a de-learning occurs. This is similar to the RPCL learning.
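The threshold \(e^{ - 1}\) follows from the derivative \({\text{d}}(\omega \ln \omega )/{\text{d}}\omega = \ln \omega + 1\), which changes sign exactly at \(\omega = e^{ - 1}\). The short check below illustrates this numerically; the function names are introduced here only for illustration.

```python
import numpy as np

def omega_ln_omega(w):
    """The term w * ln(w) that appears in the harmony measure."""
    return w * np.log(w)

def d_omega_ln_omega(w):
    """Derivative of w * ln(w) with respect to w, i.e. ln(w) + 1."""
    return np.log(w) + 1.0

threshold = np.exp(-1.0)

# Below the threshold the derivative is negative: raising w*ln(w) means lowering w
# (de-learning). Above it the derivative is positive: raising w*ln(w) means raising w.
for w in (0.1, 0.2, threshold, 0.5, 0.9):
    print(f"w = {w:.4f}, w ln w = {omega_ln_omega(w):+.4f}, slope = {d_omega_ln_omega(w):+.4f}")
```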
For the existing Bayes approaches, it is crucial to choose an appropriate prior, which is usually a difficult task, while an inappropriate prior may seriously deteriorate the performance of model selection. Without any priors on the parameters, Bayes approaches degenerate to the maximum likelihood learning, while the BYY harmony learning is still capable of automatic model selection. Also in Table 1, if a priori distribution \(q(\theta |\varXi_{q} )\) is also considered, the performance of BYY harmony learning will be further improved. A simple choice of \(q(\theta |\varXi_{q} )\) is a Jeffreys prior, for which there is no parameter Ξ_{q}. Alternatively, we may also consider a parametric distribution. Typically, a priori \(q(\theta |\varXi_{q} )\) and a posteriori \(p(\theta | X_{N} ,\varXi_{p} )\) are either jointly a conjugate parametric pair or approximately two parametric distributions with each having a set of hyperparameters, namely, Ξ_{p}, Ξ_{q}. Actually, a hyperpriori q(Ξ) is further considered for \(\varXi = \left\{ {\varXi_{p} , \varXi_{q} } \right\}\), for which q(Ξ) is a distribution usually with no further prior, e.g., a Jeffreys prior.
The implementation of maximizing H(pq) is featured by jointly determining \(\varTheta_{{\varvec{k} }}\) and \(\varvec{k}\), namely
Moreover, determining \(\varTheta_{{\varvec{k} }}\) further consists of determining \(\theta_{\varvec{k}}\) and \(\varXi_{{\varvec{k} }}\) (if any), as well as updating y_{t}, ℓ_{t} per sample x_{t}. Generally, the implementation of Eq. (47) is an alternative iterative process that consists of Step yℓ for updating y_{t}, ℓ_{t}, Step θ for parameter learning, Step Ξ for learning hyperparameters (if any), and Step \(\varvec{k}\) for model selection. This process is featured by apex approximation, manifold shrinking, and balanced operation. Readers are referred to Sect. 4.3 in Xu (2012) for a recent systematic overview on major issues about the BYY harmony learning implementation and to Sect. 4.3 in Xu (2010) for further supplementary materials. Considering two typical learning tasks, readers are referred to Sect. 2 in Xu (2012) and Sect. 3 in Xu (2010) for the BYY harmony learning algorithms on Gaussian mixture and factor analysis as well as their extensions.
Learning implementation: gradient algorithms versus EM-like algorithms
The maximization by Eq. (47) can be implemented by different types of learning algorithms. The simplest and most widely applicable type is featured by the following gradient-based updating:
where \({{\Delta }}u \propto g_{u}\) means \({{\Delta }}u = {{\gamma }}g_{u}\) with a small γ > 0, \(\nabla_{{u \in D_{u} }} f\left( u \right)\) is the gradient of f(u) with respect to u within the domain D_{u} of u, and \(u + {{\Delta }}u \in D_{u}\) means updating within the domain D_{u} of u. In the sequel, the use of \({{\Delta }}u \propto g_{u}\) includes the updating \(u^{\text{new}} = u^{\text{old}} + {{\Delta }}u \in D_{u}\) even without writing it explicitly. For those choices of \(H\left( {\varTheta_{{\varvec{k} }} | X_{N} } \right)\) in Table 1, if integrals are involved, we need to first handle the integrals and then take the gradient of a mathematical expression without integrals, for which we approximately use a Taylor expansion around a maximal point up to the second order. Readers are referred to Sect. 4.3 in Xu (2012) for further details.
To show how a BYY harmony learning algorithm is obtained via the gradient-based updating by Eq. (48), further details are provided on learning the following alternative mixture-of-experts:
which comes from Eqs. (10), (11) and (12), while μ_{j,t} comes from the GARCH model given by Eq. (5). To develop algorithms for the ML learning by Eq. (16)(c) and the RPCL learning by Eq. (18), we consider the following likelihood:
Instead of maximizing the likelihood, a learning algorithm is derived for maximizing
where \(q(\theta |\varXi_{q} )\) is a priori distribution typically in a least redundant factorization as follows:
Alternatively, each factor may be simply a Jeffreys prior. The posterior \(p(\theta |X_{N} , \varXi_{p} )\) also has choices. First, \(p(\theta |X_{N} , \varXi_{p} )\) and \(q(\theta |\varXi_{q} )\) are a conjugate pair such that the integral over θ can be handled analytically; see Sect. 4.3 of Xu (2012). Second, we may simply consider that \(p(\theta |X_{N} , \varXi_{p} )\) is free of structure, and maximizing \(H(p\Vert q)\) with respect to \(p(\theta |X_{N} , \varXi_{p} )\) is simplified into the maximization of \(H(\varTheta_{\varvec{k}} |X_{N} )\) with respect to \(\varTheta_{\varvec{k}} .\) It follows from Eq. (48) that we consider the following gradient updating
where ϕ is a subset of \(\varTheta_{\varvec{k}} = \left\{ {\theta ,\varXi_{\varvec{k}} } \right\}\), e.g., any of \(\left\{ {\varvec{a}_{j} } \right\},\left\{ {\mu_{j} } \right\},\left\{ {\varvec{b}_{j} } \right\},\left\{ {\varvec{w}_{j} } \right\}\), etc. One particular example of ϕ is \(\varvec{\alpha}= [\alpha_{1} , \ldots ,\alpha_{k} ]^{\text{T}}\) subject to each α_{j} ≥ 0 and \(\varvec{\alpha}^{\text{T}} \varvec{1} = 1\) with \(\varvec{1} = [1, \ldots ,1]^{\text{T}}\), for which we get \(\varvec{\alpha}\) via updating \(\varvec{c} = [c_{1} , \ldots ,c_{k} ]^{\text{T}}\) as follows:
As addressed in Eq. (5) in Xu (2010) and in Sect. 4.3.2 of Xu (2012), the maximization of Eq. (47) has a mechanism that pushes α_{j} → 0 if the corresponding expert is extra, i.e., automatic model selection occurs. Each of the nonnegative parameters in \(\left\{ {\varvec{b}_{j} } \right\},\left\{ {\varvec{w}_{j} } \right\}\) may also be updated in a similar way, e.g., considering ξ = v^{2} or ξ = exp (v) such that ξ is updated via \(\Delta v \propto \nabla_{v} H(\varTheta_{\varvec{k}}^{\text{old}} |X_{N} ).\) With the help of the priors \(q\left( {\beta_{j,i} } \right)\) and q(ω_{j,i}) in Eq. (52), the maximization of Eq. (47) also pushes β_{j,i} → 0 and ω_{j,i} → 0 if some order of the GARCH part in Eq. (4) and Eq. (5) is extra. Moreover, with the help of the prior q(a_{j,i}) in Eq. (52), the maximization of Eq. (47) also pushes \(\rho_{j,i}^{ 2} \to 0\) if some order of the AR part in Eq. (4) and Eq. (5) is extra.
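The reparametrization idea can be sketched as follows. The sketch assumes a softmax link \(\alpha_{j} = \exp (c_{j} )/\sum_{i} \exp (c_{i} )\) for the mixing weights, which is one common choice and is an assumption here since the exact update is not reproduced above, and it uses ξ = exp (v) for a nonnegative parameter. The toy objectives are stand-ins chosen so that the maximizer is known; they are not the harmony functional, and the function names are hypothetical.

```python
import numpy as np

def softmax(c):
    """Map an unconstrained vector c to weights alpha on the simplex."""
    z = np.exp(c - np.max(c))
    return z / z.sum()

def step_simplex(c, grad_alpha, gamma):
    """Gradient ascent on c, given the gradient of the objective w.r.t. alpha.

    Chain rule for the softmax link: dF/dc_j = alpha_j * (g_j - alpha^T g).
    """
    alpha = softmax(c)
    grad_c = alpha * (grad_alpha - alpha @ grad_alpha)
    return c + gamma * grad_c

def step_positive(v, grad_xi, gamma):
    """Gradient ascent on v with xi = exp(v), using d(xi)/dv = xi."""
    return v + gamma * grad_xi * np.exp(v)

# Toy objective 1: F(alpha) = sum_j n_j ln(alpha_j), maximized on the simplex at alpha = n.
n = np.array([0.6, 0.3, 0.1])
c = np.zeros(3)
for _ in range(500):
    alpha = softmax(c)
    c = step_simplex(c, n / alpha, gamma=0.5)
alpha = softmax(c)

# Toy objective 2: F(xi) = ln(xi) - xi, maximized over xi > 0 at xi = 1.
v = 1.0
for _ in range(500):
    xi = np.exp(v)
    v = step_positive(v, 1.0 / xi - 1.0, gamma=0.1)
xi = np.exp(v)

print(alpha, xi)
```

Because the updates act on the unconstrained variables `c` and `v`, the constraints α_{j} ≥ 0, \(\varvec{\alpha}^{\text{T}} \varvec{1} = 1\) and ξ ≥ 0 hold automatically at every step.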
The learning implementation by Eq. (53) covers not only the gradient-based ML learning by simply setting \(\Delta \pi_{j,t} (\theta_{j}^{\text{old}} ) = 0\) in the Yang step, but also the RPCL learning algorithm simply with p_{j,t} given by Eq. (18). Moreover, setting \(\varvec{w}_{i} = 0\) leads to learning a mixture of ARCH models, while setting \(\varvec{w}_{i} = 0\) and \(\varvec{b}_{i} = 0\) degenerates to learning a mixture of AR models.
For implementing the ML learning, it has also been widely regarded that the EM algorithm is preferred over the gradient-based algorithm (Redner and Walker 1984; Xu and Jordan 1996). In addition to the gradient-based implementation by Eq. (53), the BYY harmony learning may also be implemented by the following EM-like procedure:
where \(\varvec{A} - \varvec{B}\) denotes the complement of \(\varvec{B}\) with respect to \(\varvec{A}\), i.e., \(\varvec{A} - \varvec{B} = \left\{ {x \in \varvec{A} | x \notin \varvec{B}} \right\}\). When the root ϕ^{*} of χ(ϕ) = 0 is solved analytically, setting Δπ_{j,t}(θ) = 0 makes Eq. (53) degenerate to the EM algorithm for the ML learning if \(g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) = 0\) or for the Bayes learning if \(g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) \ne 0\). Generally, the algorithm by Eq. (55) differs from the EM algorithm by the factor 1 + Δπ_{j,t}(θ), which plays an important role in making model selection. However, the EM algorithm is guaranteed to converge (Redner and Walker 1984), while the factor 1 + Δπ_{j,t}(θ) makes the Ying–Yang iteration lose such a guarantee.
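Efforts have been made to remedy this weakness. One simple way is replacing ϕ^{new} = ϕ^{*} in Eq. (55) by the following linear combination
$$\phi^{\text{new}} = \phi^{\text{old}} + \eta \left( {\phi^{*} - \phi^{\text{old}} } \right) = (1 - \eta )\phi^{\text{old}} + \eta \phi^{*} .$$(56)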
E.g., see Box 3 and Remark (c) in Fig. 7 and Box 7 in Fig. 8 of Xu (2010). However, how to choose an appropriate 0 ≤ η ≤ 1 remains a problem, which can be handled in one of the following two ways:

Initialize η ≤ 1, get ϕ^{new} by Eq. (56) and check whether \(H(\tilde{\varTheta }_{k}^{\text{old}} \cup \phi^{\text{new}} |X_{N} ) > H(\tilde{\varTheta }_{k}^{\text{old}} \cup \phi^{\text{old}} |X_{N} )\). If yes, we move to the next Ying step in Eq. (55); otherwise, reduce η in some way to get a new ϕ^{new} and make such a check again.

Seek an optimal η^{*} that maximizes \(H\left( \eta \right) = H(\tilde{\varTheta }_{k}^{\text{old}} \cup \left[ {\phi^{\text{old}} + \eta \left( {\phi^{*} - \phi^{\text{old}} } \right)} \right]|X_{N} )\), which can be handled by one of many techniques for one-variable optimization. One example is solving the root of dH(η)/dη = 0.
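The first of the two ways above can be sketched as a simple backtracking loop. This is only an illustration: the text leaves open how η is reduced, so halving it is an assumption made here, and the function name `damped_update` and the toy objective `H_toy` are hypothetical stand-ins for the harmony measure.

```python
import numpy as np

def damped_update(H, phi_old, phi_star, eta0=1.0, shrink=0.5, max_tries=30):
    """Return phi_new = phi_old + eta * (phi_star - phi_old) with H(phi_new) > H(phi_old).

    eta starts at eta0 <= 1 and is multiplied by `shrink` (an assumed rule)
    until the objective improves; if no improvement is found, phi_old is kept.
    """
    h_old = H(phi_old)
    eta = eta0
    for _ in range(max_tries):
        phi_new = phi_old + eta * (phi_star - phi_old)
        if H(phi_new) > h_old:
            return phi_new, eta
        eta *= shrink
    return phi_old, 0.0

# Toy objective with maximum at phi = 1; the proposed phi_star = 3 overshoots it.
def H_toy(phi):
    return -float(np.sum((phi - 1.0) ** 2))

phi_new, eta = damped_update(H_toy, np.array([0.0]), np.array([3.0]))
print(phi_new, eta)
```

In this toy case the full step η = 1 lowers the objective and is rejected, while the halved step is accepted.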
Alternatively, another way to get ϕ^{new} from ϕ^{*} and ϕ^{old} is a reconsideration of \(\nabla_{\phi } H(\varTheta_{\varvec{k}} |X_{N} )\) in Eq. (53). Making a first-order Taylor expansion of ρ_{j,t}(θ) around θ^{old} and of ∇_{ϕ}π_{j,t}(θ) around ϕ^{*}, we consider
where the second ≈ comes from dropping the second-order term \(\left( {\phi - \varphi^{\text{old}} } \right)^{\rm T} \nabla_{\phi } \rho_{j,t} \left( {\theta^{\text{old}} } \right)\;\nabla_{{\phi \phi^{\rm T} }} \pi_{j,t} \left( {\tilde{\theta }^{\text{old}} ,\varphi^{*} } \right)\left( {\phi - \varphi^{*} } \right)\). Taking the sum over j, t, the counterpart of the first term becomes \(\chi \left( {\phi^{*} } \right) = 0\) and thus disappears, from which we are led to
Then, we solve ψ(ϕ^{new}) = 0 to get \(\phi^{\text{new}}\) from ϕ^{*} and ϕ^{old}. Particularly, when \(g_{\phi } \left( {\varTheta_{\varvec{k}} } \right) = 0\) we simply have
It is still a linear function of ϕ^{*} and ϕ^{old}, but is more advanced than the one by Eq. (56).
Linear causal analyses
Path analyses and a recent development on ρ-diagram
Path analysis is one of the earliest causal analysis approaches, proposed around 1918 by Sewall Wright, who developed it more extensively in the 1920s (Wright 1921, 1934). It has not only been further investigated in the formulation of structural equation modeling (SEM) (Ullman 2006; Hooper et al. 2008; Pearl 2010a; Kline 2015) with wide applications, but has also found uses in many complex modeling areas, including biology, psychology, sociology, and econometrics. Details are left to the vast volume of publications in the literature. Here, we introduce a recent development on a modified formulation named ρ-diagram (Xu 2018).
The formulation considers a directed acyclic graph (DAG) or Bayesian network, with visible nodes x_{1}, x_{2},…, x_{n} and hidden nodes w_{1},…,w_{m}. Each x_{i} is normalized to have zero mean and unit variance, and each w_{j} is also assumed to have zero mean and unit variance, while each edge is associated with the correlation coefficient between its two nodes. In other words, such a diagram is completely defined by pairwise correlation coefficients, and is thus called a ρ-diagram in that each correlation coefficient is shortly denoted by ρ. Being different from the classical procedure for path analyses, namely getting the topology by prior knowledge, estimating unknown parameters and causal effects, and making model-fit assessment on alternative models, a TPC procedure is suggested for the ρ-diagram (Xu 2018), which begins with Topology discovery from data based on the ρ-diagram, and then makes Parameter estimation and Causality embedded model-fit assessment.
Topology discovery is based on equations that are obtained from path tracing in a way similar to Wright’s system of tracing rules. The difference is that the unknowns in the equations involve only the within-diagram ρ-variables, while the knowns are pairwise correlation r-coefficients obtained from visible nodes x_{1}, x_{2},…, x_{n}, subject to the constraint that all the ρ-variables lie within [− 1,+ 1]. We discover a topology underlying data by checking how a set of constrained equations is solved, that is, whether it has (1) no solution, (2) a unique solution (or a few solutions), or (3) infinitely many solutions.
For details, refer to Xu (2018). Here, an illustration is made on topologies of 3-node diagrams, as shown in Fig. 8. Given a diagram with nodes x, y, z, the simplest case is illustrated in Fig. 8a, in which every pairwise correlation is zero or only one pair gets r_{ij} ≠ 0, which can be directly identified by observing r_{ij}, ∀i,j ∈{x,y,z}. Shown in Fig. 8b are topologies that have two edges. The first one gets two edges in a fork, which can be identified by observing r_{ij} = 0 for only one pair while r_{ij} ≠ 0 for the other two pairs. The other topologies describe the causality from conditional independence analysis, which can be identified by observing r_{ik}r_{kj} = r_{ij} ≠ 0 ∀i,j ∈{x,y,z} on all the permutations of x, y, z.
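The checks described for Fig. 8a, b can be sketched as a small classifier on a 3 × 3 correlation matrix. This is a rough illustration and not the TPC procedure of Xu (2018): the function name, the returned labels, and the fixed tolerance `tol` used to decide whether a correlation counts as zero are all assumptions, and with real data a statistical test would be needed in place of a fixed tolerance.

```python
import numpy as np

def classify_three_nodes(r, tol=0.05):
    """Classify a 3-node correlation matrix by the rules described for Fig. 8a, b.

    Returns (label, node), where node is the index of the middle node when one
    is identified and None otherwise. Labels are hypothetical names.
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    nonzero = [p for p in pairs if abs(r[p]) > tol]
    if len(nonzero) <= 1:
        return "at_most_one_edge", None            # Fig. 8a
    if len(nonzero) == 2:
        shared = (set(nonzero[0]) & set(nonzero[1])).pop()
        return "two_edges_one_zero_pair", shared   # first topology of Fig. 8b
    for k in range(3):
        i, j = [node for node in range(3) if node != k]
        if abs(r[i, k] * r[k, j] - r[i, j]) < tol:
            return "product_rule", k               # r_ik * r_kj = r_ij != 0
    return "other", None

# Node order x, y, z. Here r_xz * r_zy = 0.8 * 0.5 = 0.4 = r_xy, so z is the middle node.
r_chain = np.array([[1.0, 0.4, 0.8],
                    [0.4, 1.0, 0.5],
                    [0.8, 0.5, 1.0]])
# Only r_xy is zero; both nonzero pairs share node z.
r_one_zero = np.array([[1.0, 0.0, 0.6],
                       [0.0, 1.0, 0.6],
                       [0.6, 0.6, 1.0]])
r_full = np.array([[1.0, 0.5, 0.5],
                   [0.5, 1.0, 0.5],
                   [0.5, 0.5, 1.0]])

print(classify_three_nodes(r_chain))
print(classify_three_nodes(r_one_zero))
print(classify_three_nodes(np.eye(3)))
print(classify_three_nodes(r_full))
```

A matrix that falls into the last category, such as `r_full`, matches neither rule and would need the further analysis of three-edge topologies.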
Shown in Fig. 8c are two typical topologies of a widely encountered causal structure called a cofounder. Via path tracing, the following equations are obtained:
As shown in Fig. 8c, we may check whether the two lines cross within the dashed box. If so, a cofounder is identified in either of the two topologies at the bottom of Fig. 8c. However, the direction between j and k cannot be identified. Even so, the direct causal direction and effect
is uniquely determined, i.e., the cofounder effect can be remedied.
If the two lines do not intersect within the box, one may further check another permutation of the labels i, j, k. It is unlikely that two different permutations are both identified, because this happens only when ρ = r holds on two edges and, in addition, four linear equations have a consistent solution for the unknowns. If no permutation can be identified, there is no such cofounder causality underlying the data. However, there may still be other causality. On the one hand, we may check whether there is some causality of the types in Fig. 8a, b. On the other hand, we may continue to diagrams with four or more nodes.
Causal potential theory
As already mentioned above, the direction between j and k in Fig. 8c cannot be identified. The edge directions in Fig. 8b cannot be identified either. There have been extensive studies on detecting causal direction and evaluating causal strength (Peters et al. 2009; Zhang and Hyvärinen 2009; Hoyer et al. 2009; Rubin and John 2011), via analyzing certain types of asymmetry between two variables X and Y. One of the most authoritative definitions of causality is p(Y | do X = x), with ‘do X = x’ indicating the action that imposes X = x (Pearl 2010b). In these studies, causality is actually examined from a descriptive perspective.
As illustrated in Fig. 8d, movements such as an apple falling or a balance tipping are actually caused by physical mechanisms, i.e., the law of universal gravitation and the lever principle. Here causality is an issue of dynamics, concerning how movements are caused by forces that come from potential differences. From the viewpoint of grand unification, we are thus motivated to believe that causality in terms of probability, information, and intelligence should also be governed by similar dynamics.
Consider the relationship described by a density \(p(x,y)\). As illustrated in Fig. 8d, the quantity \(E(x,y) \propto -\ln p(x,y)\) describes a sort of potential energy density on an infinitesimal piece dxdy, and represents a difference of potential energy density with reference to a uniform distribution on the space of x, y, while we can get
to represent a force field that drives information flow toward the area with the lowest energy, or equivalently, that drives information to flow from rarely occurring locations toward frequently occurring locations.
Changes of x, y and the rates of these changes are described by I_{x}, I_{y}, respectively, and both are driven by the difference of the potential energy density E(x, y). Whether one of X, Y causes the other, or whether the two mutually cause each other, may be examined through I_{x}, I_{y}. Typically, we may encounter the following cases:
For Case O, changes of x relate merely to x itself, while changes of y relate merely to y itself; that is, changing x is independent of changing y. For Case A, changes of x relate merely to x itself, while changes of y relate to both x and y, so we may regard changing x as causing changes of y. For Case B, changes of y relate merely to y itself, while changes of x relate to both x and y, so we may regard changing y as causing changes of x. For Case C, changes of x and y are mutually related.
From a set of samples of x, y, we may develop certain statistics to identify which case is actually encountered. Due to noise and a finite sample size, the first three cases are rarely found, and what is often encountered is Case C. In such cases, we may further check whether one of x, y takes a dominant role while the other may be ignored, that is, whether we have either or both of
Further insights on causality may be obtained from this perspective. Not only may a pair X, Y be identified in one of the four cases on the entire domain over which x, y vary, but a pair may also be identified in one case on some subdomain and in a different case on another subdomain. That is, causal direction may reverse, disappear, and emerge as x, y vary over different subdomains.
To be more specific, we observe two typical examples. The first considers binary x, y from
where s(r) is a sigmoid function and p(y | x) describes a logistic regression, for which we get
We usually have \(\delta \approx 0\) if the logistic regression fits well, which leads to Case A above, i.e., the causal direction is x → y, consistent with our existing understanding of this model.
The second example considers p(x,y) from a joint density of Gaussian variables x, y with zero mean and unit variance as well as their correlation coefficient ρ. It follows that
which leads to Case O when ρ = 0, Case A when ρy ≈ 0, Case B when ρx ≈ 0, and Case C in general. That is, we are unable to identify causal direction on the entire domain, which is also consistent with our existing understanding. Interestingly, we get the new insight that it is possible to detect causal direction in some particular subdomains. It may also be worthwhile to extend these studies to consider a density \(p(\boldsymbol{x},\boldsymbol{y})\) with \(\boldsymbol{x},\boldsymbol{y}\) being vectors, such that we examine causality between two groups of variables.
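The Gaussian example can be checked with a short worked calculation. For the standardized bivariate Gaussian with correlation ρ, the negative log-density and the gradient of the log-density are standard results:

\[ -\ln p(x,y) = \frac{x^{2} - 2\rho x y + y^{2}}{2(1-\rho^{2})} + \ln\left(2\pi\sqrt{1-\rho^{2}}\right), \qquad \frac{\partial \ln p}{\partial x} = -\frac{x-\rho y}{1-\rho^{2}}, \qquad \frac{\partial \ln p}{\partial y} = -\frac{y-\rho x}{1-\rho^{2}}. \]

Reading I_{x}, I_{y} as these gradient components (the negative gradient of E, up to scale) is an assumption made here for illustration, but it agrees with the cases stated above: the x-component loses its dependence on y when ρy ≈ 0, and the y-component loses its dependence on x when ρx ≈ 0. The sketch below, with hypothetical function names, compares the analytic gradient with a finite-difference estimate.

```python
import math


def log_density(x, y, rho):
    """Log-density of a standardized bivariate Gaussian with correlation rho."""
    s = 1.0 - rho * rho
    quad = (x * x - 2.0 * rho * x * y + y * y) / s
    return -0.5 * quad - math.log(2.0 * math.pi * math.sqrt(s))


def grad_log_density(x, y, rho):
    """Analytic gradient of the log-density (assumed reading of I_x, I_y)."""
    s = 1.0 - rho * rho
    return (-(x - rho * y) / s, -(y - rho * x) / s)


def numeric_grad(x, y, rho, h=1e-5):
    """Central finite-difference gradient, used only as a cross-check."""
    gx = (log_density(x + h, y, rho) - log_density(x - h, y, rho)) / (2.0 * h)
    gy = (log_density(x, y + h, rho) - log_density(x, y - h, rho)) / (2.0 * h)
    return gx, gy


rho = 0.5
analytic = grad_log_density(0.7, -0.3, rho)
numeric = numeric_grad(0.7, -0.3, rho)
print(analytic, numeric)

# On the subdomain y = 0 the x-component depends on x alone (Case A pattern),
# while the y-component still depends on x through the rho * x term.
gx_on_axis, gy_on_axis = grad_log_density(1.0, 0.0, rho)
print(gx_on_axis, gy_on_axis)
```

With ρ = 0 both components decouple everywhere, which corresponds to Case O.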
SEM and its relations to modulated TFA-APT and nGCH-driven MTFAO
In its early stages of development, modeling by equations in path analysis and structural equation modeling (SEM) were used without a clear distinction. In recent decades, SEM has gradually developed into the following formulation (Ullman 2006; Kline 2016):
To compare modulated TFA-APT and nGCH-driven MTFAO, we observe the following equations from Eqs. (42) and (43):
Putting the last one into the second one, we may rewrite
Table 2 compares the notations in Eqs. (62) and (63).
The two are actually the same in the special case H = 0. Generally, we observe that the modulated TFA-APT may be regarded as a variant or extension of SEM.
Coming from different perspectives, SEM and the modulated TFA-APT aim at causal analysis in a closely related way. Both use FA as a basic ingredient, which suffers from the intrinsic rotation indeterminacy given by Eq. (40). In path analysis and SEM studies, the problem is avoided by making the hidden factors f_{t} and/or the elements of A partly known with human aid. In the modulated TFA-APT, by contrast, the problem is solved by considering both independence across hidden factors and the temporal dependence Bf_{t−1} within each factor. We may combine the ideas to improve each other. On the one hand, SEM motivates us to prune away extra edges that correspond to elements of A, which may be implemented by sparse learning. On the other hand, we may improve SEM by considering temporal dependence among endogenous factors.
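As a small illustration of how sparse learning can prune elements of A, the sketch below applies soft-thresholding, the shrinkage step associated with an L1 (lasso-type) penalty (Tibshirani 1996). This is one common device rather than the method used in the studies reviewed here, and the function names, the example matrix, and the threshold `lam` are hypothetical.

```python
def soft_threshold(a, lam):
    """Shrink a loading toward zero by lam; loadings within [-lam, lam] become 0."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


def prune_loadings(A, lam):
    """Apply soft-thresholding to every element of a loading matrix A."""
    return [[soft_threshold(a, lam) for a in row] for row in A]


# Hypothetical 2 x 2 loading matrix with two weak entries.
A = [[0.80, 0.03],
     [-0.02, -0.60]]
A_pruned = prune_loadings(A, lam=0.05)
print(A_pruned)
```

Entries that become exactly zero correspond to edges removed from the diagram; in practice the shrinkage would be interleaved with the parameter updates rather than applied once to a fitted A.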
Moreover, rotation indeterminacy may also be removed by changing the driving noise of the hidden factors from a Gaussian q(ɛ_{t}^{(j)}) into a non-Gaussian q(ɛ_{t}^{(j)}) (Xu 2001, 2004). Furthermore, conditional heteroskedasticity (Chiu and Xu 2003) has also been included in the driving noise to encode nonstationarity. These two points are included in Item (c) of Eq. (43), which extends the modulated TFA-APT into the nGCH-driven MTFAO, which may also be used to improve SEM. In addition, a nondiagonal matrix B may be considered in place of the diagonal matrix B in TFA, such that Granger-causality-like problems (Granger 1969) may be taken into consideration together with a further examination of the previous cofounder problem.
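The role of a nondiagonal B can be illustrated with a minimal sketch. It makes a strong simplifying assumption that does not hold in TFA itself: the two factors are treated as directly observed, so that B in f_{t} = Bf_{t−1} + ɛ_{t} can be estimated by ordinary least squares. The function names and the numerical values of B are hypothetical. A nonzero off-diagonal element B[1][0] means that the past of factor 1 helps to predict factor 2, which is the Granger-type dependence mentioned above.

```python
import random


def simulate_var1(B, T, seed=0):
    """Simulate f_t = B f_{t-1} + eps_t with unit-variance Gaussian noise."""
    rng = random.Random(seed)
    f = [0.0, 0.0]
    path = [f]
    for _ in range(T):
        f = [B[0][0] * f[0] + B[0][1] * f[1] + rng.gauss(0.0, 1.0),
             B[1][0] * f[0] + B[1][1] * f[1] + rng.gauss(0.0, 1.0)]
        path.append(f)
    return path


def estimate_B(path):
    """Least-squares estimate B_hat = S10 * inverse(S00) for the 2 x 2 case."""
    S10 = [[0.0, 0.0], [0.0, 0.0]]  # sum of f_t f_{t-1}^T
    S00 = [[0.0, 0.0], [0.0, 0.0]]  # sum of f_{t-1} f_{t-1}^T
    for prev, cur in zip(path[:-1], path[1:]):
        for a in range(2):
            for b in range(2):
                S10[a][b] += cur[a] * prev[b]
                S00[a][b] += prev[a] * prev[b]
    det = S00[0][0] * S00[1][1] - S00[0][1] * S00[1][0]
    inv = [[S00[1][1] / det, -S00[0][1] / det],
           [-S00[1][0] / det, S00[0][0] / det]]
    return [[sum(S10[a][k] * inv[k][b] for k in range(2)) for b in range(2)]
            for a in range(2)]


# Factor 1 drives factor 2 (B[1][0] = 0.4) but not the reverse (B[0][1] = 0).
B_true = [[0.5, 0.0],
          [0.4, 0.5]]
B_hat = estimate_B(simulate_var1(B_true, T=5000))
print(B_hat)
```

With a diagonal B both off-diagonal estimates would be close to zero, so no such directional dependence between factors could be represented.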
Abbreviations
AIC: Akaike information criterion
APT: arbitrage pricing theory
AR: autoregressive
ARCH: autoregressive conditional heteroskedasticity
ARIMA: autoregressive integrated moving average
ARMA: autoregressive–moving average
BYY: Bayesian Ying–Yang
BIC: Bayesian information criterion
CAIC: consistent AIC
CAPM: capital asset pricing model
EMH: efficient market hypothesis
HMM: hidden Markov model
GARCH: generalized ARCH
LDS: linear dynamical system
LR: likelihood ratio
MDL: minimum description length
ME: mixture-of-experts
ML: maximum likelihood
MLFA: maximum likelihood factor analysis
MML: minimum message length
MUV: mixture using variance
NFA: non-Gaussian factor analyses
NRBF: normalized radial basis function
ρ-diagram: a diagram defined by a set of pairwise correlation coefficients
RPCL: rival penalized competitive learning
SEM: structural equation modeling
SSM: state space model
TFA: temporal factor analysis
VAR: vector autoregressive
VB: variational Bayes
References
Abeysekera SP, Mahajan A (1987) A test of the APT in pricing UK stocks. J Account Finance 17(3):377–391
Azeez AA, Yonezawa Y (2006) Macroeconomic factors and the empirical content of the Arbitrage Pricing Theory in the Japanese stock market. Jpn World Econ 18(4):568–591
Azoff ME (1994) Neural network time series forecasting of financial markets. Wiley, New York
Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econom 31:307–327
Box G, Jenkins G (1970) Time series analysis: forecasting and control. Holden-Day, San Francisco
Brown SJ (1989) The number of factors in security returns. J Finance 44(5):1247–1262
Chamberlain G, Rothschild M (1983) Arbitrage, factor structure, and mean–variance analysis on large asset markets. Econometrica 51(5):1281–1304
Chen NF, Roll R, Ross S (1986) Economic forces and the stock market. J Bus 59(3):383–403
Cheung YM, Leung WM, Xu L (1996) Combination of buffered backpropagation and RPCL-CLP by mixture-of-experts model for foreign exchange rate forecasting. In: Proceedings of 3rd international conference on neural networks in the capital markets, London, UK, Oct 11–13, 1996. World Scientific Pub, Singapore, pp 554–563
Cheung Y, Leung WM, Xu L (1997) Adaptive rival penalized competitive learning and combined linear predictor model for financial forecast and investment. Int J Neural Syst 8:517–534
Chiu KC, Xu L (2002) Stock price and index forecasting by arbitrage pricing theory-based Gaussian TFA learning. In: Yin HJ (ed) Lecture notes in computer sciences (LNCS), vol 2412. Springer, Berlin, pp 366–371
Chiu KC, Xu L (2002) A comparative study of Gaussian TFA learning and statistical tests on the factor number in APT. In: Proceedings of international joint conference on neural networks 2002 (IJCNN ‘02), Honolulu, Hawaii, USA, May 12–17, 2002. pp 2243–2248
Chiu KC, Xu L (2003) Stock forecasting by ARCH driven Gaussian TFA and alternative mixture experts models. In: Proceedings of 3rd international workshop on computational intelligence in economics and finance, North Carolina, USA, Sept 26–30. pp 1096–1099
Chiu KC, Xu L (2003) On generalized arbitrage pricing theory analysis: empirical investigation of the macroeconomics modulated independent state–space model. In: Proceedings of 2003 international conference on computational intelligence for financial engineering, Hong Kong, March 20–23. pp 139–144
Chiu KC, Xu L (2004a) Arbitrage pricing theory based Gaussian temporal factor analysis for adaptive portfolio management. J Decis Support Syst 37:485–500
Chiu KC, Xu L (2004b) NFA for factor number determination in APT. Int J Theor Appl Finance 7:253–267
Choey M, Weigend AS (1997) Nonlinear trading models through Sharpe ratio optimization. Int J Neural Syst 8(3):417–431
Dhrymes PJ, Friend I, Gultekin B (1984) A critical reexamination of the empirical evidence on the arbitrage pricing theory. J Finance 39(2):323–346
Engle RF (1982) Autoregressive conditional heteroscedasticity with estimates of variance of United Kingdom Inflation. Econometrica 50:987–1008
Engle RF, Granger CWJ (1987) Cointegration and error–correction: representation, estimation and testing. Econometrica 55(2):251–276
Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396
Fishburn PC (1977) Mean-risk analysis with risk associated with below-target returns. Am Econ Rev 67(2):116–126
Gately E (1995) Neural networks for financial forecasting. John Wiley & Sons, New York
Ghahramani Z, Hinton GE (2000) Variational learning for switching state–space models. Neural Comput 12(4):831–864
Granger CWJ (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37(3):424–438
Hooper D, Coughlan J, Mullen MR (2008) Structural equation modelling: guidelines for determining model fit. Electron J Bus Res Methods 6(1):53–60
Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. In: Advances in neural information processing systems, pp 689–696
Hung KK, Cheung CC, Xu L (2000) New Sharpe-ratio-related methods for portfolio selection. In: IEEE/IAFE/INFORMS 2000 conference on computational intelligence for financial engineering, New York City, USA, March 26–28, pp 34–37
Hung KK, Cheung Y, Xu L (2003) An extended ASLD trading system to enhance portfolio management. IEEE Trans Neural Networks 14:413–425
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3:79–87
Jangmin O, Jongwoo L, Lee JW, Zhang BT (2006) Adaptive stock trading with dynamic asset allocation using reinforcement learning. Inform Sci 176(15):2121–2147
Jordan MI, Xu L (1995) Convergence results for the EM approach to mixtures of experts architectures. Neural Netw 8:1409–1431
Kline RB (2015) Principles and practice of structural equation modeling, 4th edn. Guilford Publications, New York
Kwok HY, Chen CM, Xu L (1998) Comparison between mixture of ARMA and mixture of AR model with application to time series forecasting. In: Proceedings of international conference on neural information processing, Kitakyushu, Japan, October 21–23, vol 2. pp 1049–1052
Leontaritis IJ, Billings SA (1985) Input-output parametric models for nonlinear systems Part I: deterministic nonlinear systems and Part II: stochastic nonlinear systems. Int J Control 41:303–344
Leung WM, Cheung Y, Xu L (1997) Application of mixture of experts models to nonlinear financial forecasting. In: Caldwell RB (ed) Nonlinear financial forecasting: proceedings of the first INFFC, (Finance & Technology Publishing, 1997), pp 153–168
Markowitz HM (1952) Portfolio selection. J Finance 7(1):77–91
Markowitz HM (1959) Portfolio selection: efficient diversification of investments. John Wiley & Sons, New York
McGrory CA, Titterington DM (2007) Variational approximations in Bayesian model selection for finite mixture distributions. Comput Stat Data Anal 51(11):5352–5367
Moody J, Saffell M (2001) Q learning to trade via direct reinforcement. IEEE Trans Neural Networks 12(4):875–889
Moody J, Lizhong W, Liao Y, Saffell M (1998) Performance functions and reinforcement learning for trading systems and portfolios. J Forecasting 17:441–470
Neuneier R (1996) Optimal asset allocation using adaptive dynamic programming. In: Touretzky DS (ed) Advances in neural information processing systems, 8th edn. MIT Press, Cambridge, pp 952–958
Pearl J (2010) An introduction to causal inference. Int J Biostat 6(2):1–62
Perrone MP (1994) Putting it all together: methods for combining neural networks. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems. Morgan Kaufmann, San Francisco, pp 1188–1189
Perrone MP, Cooper LN (1993) When networks disagree: ensemble methods for neural networks. In: Mammone RJ (ed) Neural networks for speech and image processing. Chapman & Hall, New York, pp 126–142
Peters J, Janzing D, Gretton A, Schölkopf B (2009) Detecting the direction of causal time series. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, pp 801–808
Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77(2):257–286
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood, and the EM algorithm. SIAM Rev 26:195–239
Ross S (1976) The arbitrage theory of capital asset pricing. J Econ Theory 13(3):341–360
Rubin DB, John L (2011) Rubin causal model. International encyclopedia of statistical science. Springer, Berlin, pp 1263–1265
Sharpe WF (1964) Capital asset prices: a theory of market equilibrium under conditions of risk. J Finance XIX(3):425–442
Sharpe WF (1966) Mutual fund performance. J Bus 39(S1):119–138
Sharpe WF (1994) The Sharpe ratio: properly used, it can improve investment. J Portfolio Manag Fall 21:49–58
Shumway RH, Stoffer DS (1991) Dynamic linear models with switching. J Am Stat Assoc 86(415):763–769
Sims C (1980) Macroeconomics and reality. Econometrica 48(1):1–48
Sortino FA, van der Meer R (1991) Downside risk: capturing what’s at stake in investment situations. J Portfolio Manag 17(4):27–31
Tang H, Chiu KC, Xu L (2003) Finite mixture of ARMA-GARCH model for stock price prediction. In: Proceedings of 3rd international workshop on computational intelligence in economics and finance, North Carolina, USA, Sep 26–30, pp 1112–1119
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B 58(1):267–288
Tu S, Xu L (2011) An investigation of several typical model selection criteria for detecting the number of signals. Front Electr Electron Eng China 6(2):245–255
Ullman JB (2006) Structural equation modeling reviewing the basics and moving forward. J Pers Assess 87(1):35–50
Wang P et al (2011) Radar HRRP statistical recognition with temporal factor analysis by automatic Bayesian Ying–Yang harmony learning. Front Electr Electron Eng China 6(2):300–317
Westland JC (2015) Structural equation modeling: from paths to networks. Springer, New York
Williams PM (1995) Bayesian regularization and pruning using a Laplace prior. Neural Comput 7(1):117–143
Wong WC, Yip F, Xu L (1998) Financial prediction by finite mixture GARCH model. In: Proceedings of international conference on neural information processing, Kitakyushu, Japan, Oct 21–23, 3(1998), pp 1351–1354
Wright S (1921) Correlation and causation. J Agric Res 20(7):557–585
Wright S (1934) The method of path coefficients. Ann Math Stat 5(3):161–215
Xu L (1994) Signal segmentation by finite mixture model and EM algorithm. In: Proceedings of international symposium on artificial neural networks, Tainan, Dec 15–17, pp 453–458
Xu L (1995) Channel equalization by finite mixtures and the EM algorithm. In: Proceedings of IEEE neural networks and signal processing workshop. Cambridge, MA, Aug 31–Sep 2, vol 5, pp 603–612
Xu L (1995) Ying–Yang machines: a Bayesian–Kullback scheme for unified learning and new results on vector quantization. In: Proceedings of the international conference on neural information processing, Beijing, China, Oct 30–Nov 3, pp 977–988 (A further version Advances in NIPS8, Touretzky DS et al (ed), MIT Press, Cambridge MA, 1996: 444–450)
Xu L (1997) Bayesian Ying Yang system and theory as a unified statistical learning approach: (II) from unsupervised learning to supervised learning, and temporal modeling. In: Wong KM et al (eds) Proceedings of theoretical aspects of neural computation: a multidisciplinary perspective. Springer, Berlin, pp 29–42
Xu L (1998) RBF nets, mixture experts, and Bayesian Ying–Yang learning. Neurocomputing 19:223–257
Xu L (2000) Temporal BYY learning for state space approach, hidden Markov model, and blind source separation. IEEE Trans Signal Process 48(7):2132–2144
Xu L (2001) BYY harmony learning, independent state space and generalized APT financial analyses. IEEE Trans Neural Netw 12:822–849
Xu L (2002) Temporal factor analysis: stable-identifiable family, orthogonal flow learning, and automated model selection. In: Proceedings of international joint conference on neural networks. Honolulu, HI, USA, 12–17 May, pp 472–476
Xu L (2004) Advances on BYY harmony learning: information theoretic perspective, generalized projection geometry, and independent factor autodetermination. IEEE Trans Neural Netw 15(4):885–902
Xu L (2007) A unified perspective and new results on RHT computing, mixture based learning, and multi-learner based problem solving. Pattern Recogn 40:2129–2153
Xu L (2009) Learning algorithms for RBF functions and subspace based functions. In: Olivas ES et al (eds) Handbook of research on machine learning applications and trends: algorithms, methods and techniques. IGI Global, Hershey, pp 60–94
Xu L (2010) Bayesian Ying–Yang system, best harmony learning, and five action circling. J Front Electr Electron Eng China 5(3):281–328 (A special issue on Emerging Themes on Information Theory and Bayesian Approach)
Xu L (2012) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. J Front Electr Electron Eng 7(1):147–196 (A special issue on Machine learning and intelligence science: IScIDE (C))
Xu L (2018) Deep bidirectional intelligence: AlphaZero, deep IA-search, deep IA-infer, and TPC causal learning. Appl Inform 5(5):38
Xu L, Amari S (2008) Combining classifiers and learning mixture of experts. In: Rabuñal Dopico JR (ed) Encyclopedia of artificial intelligence. IGI Global, Hershey, pp 318–326
Xu L, Cheung Y (1997) Adaptive supervised learning decision networks for traders and portfolios. J Comput Intell Finance 5(6):11–16 (A short version also in Proceedings of IEEE/IAFE 1997 International Conference on Computational Intelligence for Financial Engineering (CIFEr), New York City, March 23–25, 1997, 206–212)
Xu L, Jordan MI (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput 8(1):129–151
Xu L, Krzyzak A, Oja E (1992) Unsupervised and supervised classifications by rival penalized competitive learning. In: Proceedings of 11th international conference on pattern recognition. Hague, Netherlands, Aug 30–Sep 3, pp 672–675
Xu L, Krzyzak A, Oja E (1993) Rival penalized competitive learning for clustering analysis, RBF net and curve detection. IEEE Trans Neural Netw 4:636–649
Xu L, Jordan MI, Hinton GE (1994) A modified gating network for the mixtures of experts architecture. Proceedings of 1994 world congress on neural networks, vol 2. San Diego, CA, June 4–9, pp 405–410
Xu L, Jordan MI, Hinton GE (1995) An alternative model for mixtures of experts. In: Tesauro G et al (eds) Advances in neural information processing systems 7. MIT Press, Cambridge, pp 633–640
Zhang PG (ed) (2003) Neural networks in business forecasting, forecasting and control. IRM Press, London
Zhang K, Hyvärinen A (2009) On the identifiability of the post-nonlinear causal model. Proceedings of the 25th conference on uncertainty in artificial intelligence (UAI 2009). Montreal, Canada, 2009, pp 647–655
Authors’ contributions
All from the sole author LX. The author read and approved the final manuscript.
Acknowledgements
This work was supported by the ZhiYuan chair professorship startup Grant (WF220103010) from Shanghai Jiao Tong University.
Competing interests
The author declares that there are no competing interests.
Availability of data and materials
Not applicable.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
WF220103010, Shanghai Jiao Tong University.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Xu, L. Machine learning and causal analyses for modeling financial and economic data. Appl Inform 5, 11 (2018). https://doi.org/10.1186/s40535-018-0058-5
Keywords
 Prediction modeling
 Portfolio management
 Mixture-of-experts
 Conditional heteroskedasticity
 Arbitrage pricing theory
 Temporal factor analysis
 Macroeconomics modulated
 Path analyses
 Structural equation modeling
 Cofounder discovery
 Causal potential theory