
Further advances on Bayesian Ying-Yang harmony learning

Abstract

After a short tutorial on the fundamentals of Bayes approaches and Bayesian Ying-Yang (BYY) harmony learning, this paper introduces recent progress. A generic information harmonising dynamics of BYY harmony learning is proposed with the help of a Lagrange variety preservation principle, which provides Lagrange-like implementations of Ying-Yang alternative nonlocal search for various learning tasks and unifies attention, detection, problem-solving, adaptation, learning and model selection from an information harmonising perspective. In this framework, new algorithms are developed to implement Ying-Yang alternative nonlocal search for learning Gaussian mixture and several typical exemplars of the linear matrix system, including factor analysis (FA), mixture of local FA, binary FA, nonGaussian FA, de-noised Gaussian mixture, sparse multivariate regression, temporal FA and temporal binary FA, as well as a generalised bilinear matrix system that covers not only these linear models but also manifold learning, gene regulatory networks and the generalised linear mixed model. These algorithms feature a favourable nature of automatic model selection and a unified formulation for performing unsupervised and semi-supervised learning. Also, we propose a principle of preserving multiple convex combinations, which leads to alternative search algorithms. Finally, we provide a chronological outline of the history of BYY learning studies.

Background

Bayes approach and automatic model selection

Learning in an intelligent system is featured by three levels of inverse problems; for details, readers are referred to Sect. 1 of Xu (2010a, 2010b). To be self-contained, we give a brief overview of typical learning tasks and approaches from this viewpoint, with the help of the illustration in Figure 1.

Figure 1

Learning studies from a viewpoint of three levels of inverse problems.

Learning tasks associated with the front level can be viewed from the perspective of learning a mapping x→y, called a representative model, by which an observed sample x in a visible domain X is mapped into its corresponding encoding y as a signal or inner code to perform a task of problem solving, such as abstraction, classification, inference and control. Existing learning methods for a representative model can be roughly divided into two groups as follows: (1) One is featured by learning a mapping x→y according to whether a principle is satisfied by the resulting inner encodings y, while not explicitly taking the other directional mapping y→x into consideration. One exemplar family is featured by a linear mapping y=Wx that transforms x into y of independent components, such as principal component analysis (PCA) and independent component analysis (ICA) (Xu 2003a). The other widely studied family is supervised learning by a linear or nonlinear mapping that makes samples of y approach the desired target samples. (2) The other group is featured by learning a mapping x→y as an inverse of a given mapping y→x that describes how observed samples are generated. Some efforts aim at making the cascade of x→y and y→x implement a unitary transform x→x, as often encountered in adaptive control. Most studies consider y→x in a probabilistic sense by q(x|y) together with y described by q(y). Accordingly, x→y is either directly the Bayesian inverse of q(x|y)q(y) or a certain approximation of it.

Typically, the mapping y→x in the front level is unknown and should be learned from a given set \(X_{N}=\{x_{t}\}_{t=1}^{N}\) of samples, which is also called generative learning. Usually, the corresponding distribution structure (also called a generative model) is designed according to the type of application. One widely studied structure is the linear system shown in Figure 1. As further addressed in the subsequent sections, this structure not only covers subspace methods, Gaussian mixture, factor analysis and its extensions to binary or nonGaussian factors but can also be further generalised to many others.

The generative learning task is to estimate θ={ψ,ϕ} in the pre-designed distributions q(x|y,ψ) and q(y|ϕ). The most widely used principle is maximum likelihood, that is,

$$\begin{array}{@{}rcl@{}} \theta^{*}=arg\max_{\theta}L(\theta),\ L(\theta)= \sum_{t=1}^{N} \ln{q(x_{t}\vert \theta)}, \ q(x\vert \theta)=\int q(x|y, \psi)q(y|\phi)dy. \end{array} $$
((1))

Though it can be implemented directly by a gradient-based algorithm, an effective alternative is the expectation-maximisation (EM) algorithm (Dempster et al. 1977), which alternately implements its E step for x→y by the following Bayes inverse:

$$\begin{array}{@{}rcl@{}} p(y|x)=p(y|x,\theta^{old}), \ p(y|x,\theta)=\frac{q(x|y, \psi)q(y|\phi)}{q(x\vert \theta)}, \end{array} $$
((2))

and its M step that updates θ by

$$\begin{array}{@{}rcl@{}} &\theta^{new}=arg\max_{\theta}L(\theta, \theta^{old}),& \\ &L(\theta, \theta^{old})=\sum_{t=1}^{N}\int p(y|x_{t}, \theta^{old}) \ln{[q(x_{t}|y, \psi)q(y|\phi)]}dy.& \end{array} $$
((3))

This EM iteration is guaranteed to converge to a local maximum of L(θ) without requiring any learning stepsize, whereas a gradient-based algorithm needs an appropriate learning stepsize, which results in learning instability if the stepsize is too big or very slow convergence if it is too small. Moreover, the EM algorithm keeps the constraints of the Gaussian mixture satisfied and demonstrates a super-linear convergence rate; further details are referred to Xu and Jordan (1996).
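To make the E and M steps concrete, here is a minimal numpy sketch of the EM iteration of Equations 1-3 for a Gaussian mixture; the helper names em_gmm and log_gauss are our own inventions, and the small ridge 1e-6·I is a numerical safeguard added here rather than part of the algorithm above.

import numpy as np


def log_gauss(X, mu, Sigma):
    # log of G(x|mu, Sigma) evaluated at every row of X
    d = X.shape[1]
    Xc = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(Sigma), Xc)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)


def em_gmm(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(k, 1.0 / k)                      # mixing weights
    mu = X[rng.choice(N, k, replace=False)].copy()   # means initialised from data
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    for _ in range(n_iter):
        # E step: Bayes inverse p(y|x, theta_old) of Equation 2
        logp = np.stack([np.log(alpha[l]) + log_gauss(X, mu[l], Sigma[l])
                         for l in range(k)], axis=1)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # M step: maximise L(theta, theta_old) of Equation 3
        Nl = p.sum(axis=0)
        alpha = Nl / N
        mu = (p.T @ X) / Nl[:, None]
        for l in range(k):
            Xc = X - mu[l]
            Sigma[l] = (p[:, l, None] * Xc).T @ Xc / Nl[l] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma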

In many applications, the computation of p(y|x,θ) is intractable. The variational method is proposed to maximise a lower bound of L(θ) (Dayan et al. 1995; Jordan et al. 1999). Precisely, estimating θ by Equation 1 or Equation 2 implements another inverse problem X_N→θ at the second level shown in Figure 1, on which all the levels share the same q(x|y,ψ) that maps y, ψ (a part of θ), and inclusively k, all together to describe how observed samples of x are generated. Similarly, we may consider X_N→θ by a Bayes inverse p(θ|X_N) of q(x|θ)q(θ|k). However, its computation is intractable. Instead, we get X_N→θ by the following maximum posterior (MAP) estimation (also called classic or naive Bayes learning):

$$\begin{array}{@{}rcl@{}} \theta^{*}=arg\max_{\theta} [L(\theta)+ \ln{q(\theta |k)}]. \end{array} $$
((4))

How to use a prior q(θ|k) is a topic with a long history that has been considered from several aspects. The classic Bayes school uses different parametric distributions on different parts of θ according to the nature of the learning task and empirical experience. Typical examples are conjugate priors (Diaconis and Ylvisaker 1979; Ntzoufras and Tarantola 2013). Extensive studies along this line have been made in the machine learning literature, especially on the Dirichlet-multinomial for Gaussian mixture. Related studies also include those on multivariate linear regression and its extensions. When a Gaussian prior is used on each regression coefficient, learning by Equation 4 implements ridge regression (Hoerl 1985) and Tikhonov regularisation (Tikhonov et al. 1995). When a Laplace prior is used on each regression coefficient, learning by Equation 4 implements LASSO regression (Tibshirani 1996), also called sparse learning.

Another Bayes school prefers a non-informative prior. For a parameter that varies on a compact support, such a prior is simply a uniform distribution. However, there is no such uniform distribution on an infinitely large support. Typically, a non-informative improper distribution q(θ|k) is used under the name of the Jeffreys prior (Jeffreys 1946), which has been widely used in the machine learning literature too. Also, there are some efforts that attempt to blend the two schools, e.g. the Jeffreys prior is jointly used with a proper prior by the minimum message length (MML) method (Figueiredo and Jain 2002; Wallace and Dowe 1999). Moreover, there is also one effort called induced bias cancellation (IBC), by which a prior is used to cancel an implicit prior induced from using a learning model on a finite size of samples, e.g. see Eqs (20) and (21) in Xu (2000a) and also Sect. 3.4.3 in Xu (2007a). Interestingly, as addressed on page 304 of Xu (2010a), this IBC may be regarded as a degenerated but easily computed approximation of the normalised maximum likelihood (NML) that is obtained from a mini-max principle (Barron et al. 1998), which plays a key role in the recent developments of MDL encoding.

One critical weak point of learning by Equation 4 is that it is prone to a bad prior, because q(θ|k) takes a position of importance equal to that of the empirical evidence via L(θ). To mitigate such a bad effect, recent Bayes studies prefer to consider the following:

$$\begin{array}{@{}rcl@{}} &k*={argmax}_{k} L(X_{N}, k), \ L(X_{N}, k)=\sum_{t=1}^{N}\ln{q(x_{t}|k)}, & \\ & q(x|k)=\int q(x|\theta) q(\theta |k)d \theta. & \end{array} $$
((5))

It actually implements the third-level inverse X_N→k (in Figure 1 there are merely two levels because the second and third levels are merged in a consideration of automatic model selection, to be addressed after Equation 7). This task is usually called model selection. However, the integral over θ is computationally intractable, which is typically handled with the help of some approximating technique. A classical one is the Bayesian information criterion (BIC) (Schwarz 1978), which approximately turns L(X_N,k) into

$$\begin{array}{@{}rcl@{}} L(X_{N}, k)\approx L(X_{N}, \theta^{*})-0.5 k \ln{N}, \end{array} $$
((6))

by which learning proceeds in a two-stage implementation. The first stage enumerates all possible values of k to obtain a set of candidate models featured by different values of k, and estimates θ by Equation 1 for each candidate. At the second stage, we select the best candidate by Equation 5 with L(X_N,k) given by Equation 6. In implementation, the minimum description length (MDL) (Rissanen 1978) is actually equivalent to this BIC. There are also a number of other variants of L(X_N,k) available in the literature; another classic one is Akaike's information criterion (AIC) (Akaike 1974, 1987).
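Putting Equations 5 and 6 together, the two-stage implementation may be sketched as follows, reusing the hypothetical em_gmm and log_gauss helpers above; the free-parameter count n_free follows one common convention for full-covariance Gaussian mixtures, whereas Equation 6 writes the count simply as k.

import numpy as np


def bic_select(X, k_max):
    N, d = X.shape
    best_k, best_score = None, -np.inf
    for k in range(1, k_max + 1):
        alpha, mu, Sigma = em_gmm(X, k)        # stage 1: fit each candidate
        logp = np.stack([np.log(alpha[l]) + log_gauss(X, mu[l], Sigma[l])
                         for l in range(k)], axis=1)
        m = logp.max(axis=1, keepdims=True)    # log-sum-exp for ln q(x_t|theta*)
        L = (m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).sum()
        n_free = (k - 1) + k * d + k * d * (d + 1) // 2
        score = L - 0.5 * n_free * np.log(N)   # stage 2: Equation 6
        if score > best_score:
            best_k, best_score = k, score
    return best_k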

However, a two-stage implementation suffers from heavy computation because it requires parameter learning for each candidate. Also, estimating θ by Equation 1 becomes less reliable when the component number k is large and thus incurs more free parameters.

This problem is tackled by considering a learning process or principle with a nature of automatic model selection, e.g. discarding extra hidden dimensions of y in Figure 1. With k initialised large enough, a learning principle demonstrates such a nature through the following two features:

  • there is an indicator Ψ_π(θ) on θ or its subset, based on which a particular subset π can be effectively discarded if we have

    $$\begin{array}{@{}rcl@{}} \Psi_{\pi}(\theta) \to 0, \end{array} $$
    ((7))

    e.g. Ψ_π(θ) is the variance of y^{(i)} in Figure 1.

  • in learning implementation there is an intrinsic mechanism that leads to Equation 7 when the corresponding structure is redundant and thus can be effectively discarded.

Such automatic model selection is actually made while implementing the inverse problem X_N→θ. Thus, we merge the corresponding two levels in Figure 1 because this combines both the inverse problem X_N→θ and the inverse problem X_N→k.

For the existing studies, there are three roads towards automatic model selection. One is a heuristic road, featured by an early effort called Rival Penalised Competitive Learning (RPCL) made in the early 1990s (Xu et al. 1992, 1993), which gets an appropriate number k of clusters automatically determined during learning.

The second road gets an aid from appropriate priors. For example, learning by Equation 4 demonstrates such a nature by using either a Laplace prior in sparse learning (Tibshirani 1996) or jointly the Jeffreys prior and a proper prior by the minimum message length (MML) (Figueiredo and Jain 2002). Another example is Variational Bayes (VB) (Corduneanu and Bishop 2001; McGrory and Titterington 2007), which approximately maximises a lower bound of L(X_N,k) in Equation 5 via learning the hyperparameters in both a prior q(θ|k) and an approximate posterior p(θ|X_N).

The third road is the BYY harmony learning that was first proposed in 1995 (Xu 1995) and subsequently developed systematically, which provides a general framework for learning X_N→θ and X_N→k under the BYY best harmony principle.

Bayesian Ying-Yang harmony learning

We reformulate Figure 1 into a general probabilistic formulation, resulting in Figure 2. We use R={Y,{θ}} to summarise the three levels of inner representation and p(R|X) for the mapping X→R that consists of the mappings X→Y and X→θ. On the one hand, we have p(X,R)=p(R|X)p(X) to describe the joint distribution of X, R, which is featured by a visible domain X (called Yang according to ancient Chinese philosophy) and a transformation from samples of observations into inner codes (functioning like a male animal and thus also called Yang according to the same philosophy). Jointly, we call p(X,R)=p(R|X)p(X) a Yang structure or machine. On the other hand, we have a Ying structure or machine q(X,R)=q(X|R)q(R), also describing the joint distribution of X, R, which is featured by q(R) describing the invisible (thus called Ying) domain R for inner representation and a transformation from inner codes to observations (functioning like a female animal and thus called Ying). The paired Ying-Yang structures formulate a system called Bayesian Ying-Yang (BYY), as a tribute to ancient Chinese philosophy.

Figure 2

Bayesian Ying-Yang system.

The task of learning a BYY system starts from its structure design. That is, we need to give each of the four component distributions a specific mathematical structure. Usually, p(X) comes from a given set X_N of samples as follows:

$$\begin{array}{@{}rcl@{}} {p_{h}^{N}}(X)=\prod_{t=1}^{N} G(x|x_{t},h^{2}I), \ \text{especially} \ {p_{0}^{N}}(X)=\delta(X-X_{N}), \end{array} $$
((8))

where G(x|μ,Σ) denotes a Gaussian density with the mean vector μ and the covariance matrix Σ.

For the remaining three components, we start by designing the structures of q(X|R) and q(R), based on which we further design the structure of p(R|X), which is typically a sort of inverse of the Ying machine q(X|R)q(R). This is consistent with the Ying-Yang philosophy, according to which the Ying is primary and comes first, while the Yang is secondary and is based on the Ying.

The design of each component is guided by the corresponding one of the following three principles (Xu 2009):

  • A principle of least redundant representation for q(R).

  • A principle of divide-conquer for q(X|R).

  • A principle of Ying-Yang uncertainty conservation or variety preservation for p(R|X).

Further details are referred to Sect. 4.2 of Xu (2010a) and Sect. 3.2 of Xu (2012a). The first two principles are adopted from the existing studies, while the third is specific to the BYY system. In accordance with the Ying-Yang philosophy, it requires that the Yang machine preserves a dynamic range to appropriately accommodate the uncertainty or information contained in the Ying machine. That is, we have U(p(X,R))=U(q(X,R)) under an uncertainty measure U(p), as shown within the table of Figure 4(a) in Xu (2009).

Given a designed BYY system, the unknown values of all variables in R={Y,{θ}} are learnt according to the Ying-Yang best harmony principle. Mathematically, this is equivalent to making p(R|X)p(X) and q(X|R)q(R) become a best matching pair in a most compact form with a least complexity, which is achieved via maximising the following harmony functional:

$$\begin{array}{@{}rcl@{}} & H(p||q)=\int p({R}| {X})p({X})\ln[q({X}| {R})q({R})]d{X}d{R},&\\ & \text{subject to } U (p(X, R)) = U (q(X, R)).& \end{array} $$
((9))

On the one hand, maximising H(p||q) forces the Ying q(X|R)q(R) to match the Yang p(R|X)p(X). There are always certain structural constraints imposed on the Ying-Yang structures, and also a constraint coming from \(p(X)={p_{h}^{N}}(X)\) by Equation 8 on a finite size of samples, because of which a perfect equality q(X|R)q(R)=p(R|X)p(X) may not really be reached but can still be approached as closely as possible. At this equality, H(p||q) becomes the negative entropy that describes the complexity of the BYY system. Further maximising it decreases the system complexity and thus provides an ability to determine an appropriate k.

As addressed in Sect.4.1 of Xu (2010a), this principle is spelled as Ying-Yang best harmony from a perspective that Ying and Yang both adapt each other to reach the best agreement in a most tacit way (consuming a least amount of effort made in information communication), which can be better understood by rewriting Equation 9 into

$$\begin{array}{@{}rcl@{}} H(p||q)=H_{R|X}- KL(p({R}| {X})p({X})\Vert q({X}| {R})q({R})), \\ H_{R|X}=\int p({R}| {X})p({X})\ln[p({R}| {X})p({X})]d{X}d{R}, \\ \text{where} \ KL(p \Vert q)=\int p(u) \ln{\frac{p(u)}{q(u)}}du. \end{array} $$
((10))

Maximising H(p||q) consists of minimising the second term for a best matching or agreement between the Ying-Yang pair and of minimising the first term for a least amount of information to be communicated from the Yang to the Ying towards an agreement.

The novelty and salient features of Equation 9 may also be observed from other aspects. Further details are referred to Sect. 4.1 in Xu (2010a) and Sect. 4.2.3 in Xu (2012a). Shown in Table 1 are recent applications and empirical studies of the BYY harmony learning.

Table 1 Recent BYY applications and empirical studies

Currently, the implementation of BYY harmony learning may suffer from a dilemma of suboptimal solutions versus learning instability. It is this dilemma that motivates the progress introduced in this paper, outlined as follows:

  • A Lagrange implementation of the principle of variety preservation is proposed for learning the Yang structure, with a new Ying-Yang alternation nonlocal search obtained and the abovementioned dilemma removed.

  • An information harmonising perspective on BYY harmony learning, by which the tasks of attention, detection, problem-solving, adaptation, learning and model selection are integrated in a concise formulation.

  • Learning algorithms that implement Ying-Yang alternative nonlocal search for learning GMM, FA, local FA, binary FA, nonGaussian FA, de-noised GMM, temporal FA, temporal binary FA and sparse multivariate regression, as well as a generalised bilinear matrix system that covers not only these linear models but also manifold learning, gene regulatory networks and the generalised linear mixed model, with a favourable nature of automatic model selection and a unified formulation in performing unsupervised and semi-supervised learning.

  • A principle of preserving multiple convex combinations for implementing BYY harmony learning, which leads to another type of Ying-Yang alternative nonlocal search algorithms.

Finally, at the end of this paper, a chronological outline is given on the innovative time points in the history of BYY harmony learning studies.

Methods

BYY harmony learning: Lagrange Ying-Yang alternation

Ignoring the prior q(θ), we simplify the best harmony of H(p||q) by Equation 9 into

$$\begin{array}{@{}rcl@{}} & \max_{\theta} H(\theta) \ \text{subject to} \ U (p(X, Y)) = U (q(X, Y)), &\\ & H(\theta) = \int p(Y| X){p_{h}^{N}}(X) \ln[q(X|Y, \theta)q(Y| \theta)]dY\;dX,& \end{array} $$
((11))

where the above constraint is a simplification of the counterpart in Equation 9. One example is considered in Sect. 4.1 in Xu (2010a) and Sect. 4.2.3 in Xu (2012a), featured with the following counterpart without considering the component p(X):

$$\begin{array}{@{}rcl@{}} p(Y| X)=q(Y| \theta, X), \ q(Y| \theta, X)=\frac{q(X|Y, \theta)q(Y| \theta)}{ q(X|\theta) }, \\ q(X|\theta) = \int q(X|Y, \theta)q(Y| \theta)dY. \end{array} $$
((12))

Even earlier, in 2007, another example was given by Eq. (72) in Xu (2007a) under the name of equal covariance, with U(p(X,Y))=U(q(X,Y)) denoting that the Yang preserves the covariance of q(X,Y).

The existing algorithms for max_θ H(θ) directly impose the constraint U(p(X,Y))=U(q(X,Y)), which makes learning suffer from a dilemma of either a locally optimal solution or some learning instability; see the remarks in Table 1.

In this paper, we indirectly consider a relaxation of U(p(X,Y))=U(q(X,Y)) via treating KL(p(X,Y)∥q(X,Y))=0 as a Lagrange constraint (since KL(p∥q)≥0 becomes zero at the target p=q), resulting in the following augmented maximisation:

$$\begin{array}{@{}rcl@{}} \max_{\theta} H_{L}(\theta), \ \ H_{L}(\theta)=H(\theta)- \eta KL(p(Y| X) p(X)\Vert q(X|Y, \theta)q(Y| \theta)) \le H(\theta), \end{array} $$
((13))

where η>0 is a Lagrange coefficient. A finite value of η relaxes the target KL(p(X,Y)∥q(X,Y))=0; the smaller η is, the more relaxed the target becomes, and vice versa.

Moreover, Equation 13 can be rewritten into

$$\begin{array}{@{}rcl@{}} & \max_{\theta} H_{L}(\theta),\ H_{L}(\theta)=(1+\eta) H(\theta) +\eta [E_{Y|X}+E_{X}(h)],& \\ &E_{Y|X}=-\int p(Y| X){p_{h}^{N}}(X)\ln{p(Y| X)}dY\;dX,\ E_{X}(h)=- \int {p_{h}^{N}}(X) \ln{{p_{h}^{N}}(X)}dX. & \end{array} $$
((14))

Given \(p(Y| X)=p^{old}_{Y| X}\) fixed, \(\max_{\theta} H_{L}(\theta)\) becomes

$$\begin{array}{@{}rcl@{}} & \theta^{new}=arg \max_{\theta} H(\theta)_{p_{Y| X}=p^{old}_{Y| X}}, \ h^{new}=arg \max_{h} H_{L}(\theta)_{p_{Y| X}=p^{old}_{Y| X}}, & \end{array} $$
((15))

with H_L(θ^new)≥H_L(θ^old).

Given θ=θ^new, h=h^new, maximising H_L(θ) subject to \( \int p(Y| X)dY=1\) with respect to a free p(Y|X) results in

$$\begin{array}{@{}rcl@{}} & p^{new}_{Y| X}= \frac{[q(X|Y, \theta^{new})q(Y| \theta^{new})]^{(1+1/\eta)}}{\int [q(X|Y, \theta^{new})q(Y| \theta^{new})]^{(1+1/\eta)}dY},& \end{array} $$
((16))

which keeps H_L(θ) nondecreasing as well.

Therefore, alternately updating by Equations 15 and 16 makes H_L(θ) monotonically nondecreasing and finally convergent. That is, learning stability is guaranteed.
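As a quick numeric illustration of the Yang update of Equation 16 for a discrete inner code (toy numbers of our own choosing), the exponent 1+1/η interpolates between the Bayes posterior of Equation 2 as η→∞ and a winner-take-all assignment as η→0:

import numpy as np

joint = np.array([0.5, 0.3, 0.2])      # q(X|Y)q(Y) over three candidate values of Y
for eta in [1e-2, 1.0, 1e2]:
    p = joint ** (1.0 + 1.0 / eta)     # Equation 16 before normalisation
    p /= p.sum()
    print(eta, np.round(p, 3))
# small eta: mass concentrates on the argmax (winner-take-all behaviour)
# large eta: p tends to the Bayes posterior joint / joint.sum()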

Given h fixed, the term E_X(h) can be ignored because it is irrelevant to updating θ and p(Y|X). With the help of E_X(h), an appropriate h can be estimated in a way similar to those summarised in Sect. 2 of Xu (2003b).

Without loss of generality, we consider Equation 14 in the special case h=0 and get

$$\begin{array}{@{}rcl@{}} & \max_{\theta} H_{L}(\theta), \ H_{L}(\theta)=(1+\eta) H(\theta) +\eta E_{Y|X},& \\ & H(\theta) = \int p(Y| X_{N}) \ln[q(X_{N}|Y, \theta)q(Y| \theta)]dY,& \\ &E_{Y|X}=-\int p(Y| X_{N})\ln{p(Y| X_{N})}dY,& \end{array} $$
((17))

from which we get two types of detailed implementation according to the types of variables in Y.

When the variables in Y are discrete valued, the integral over Y becomes summation. It follows from Equations 15 and 16 that we are led to the general procedure for Ying-Yang alternative implementation given in Algorithm 1.

When the variables in Y are real valued, the integral over Y becomes intractable, for which we seek the help of the following Taylor expansion around u^* up to the second order:

$$\begin{array}{@{}rcl@{}} &\max_{\eta_{u}}\int {p (u)Q (u) d u}\approx Q(u^{\ast})-\frac{1}{2}Tr[\Gamma_{u} \Pi_{u^{*}}],& \\ &u^{\ast} =\arg \mathop {\max}\limits_{u} Q (u), \ \Pi_{u}=-\frac{\partial^{2}Q(u)}{\partial u\partial u^{\mathrm{T}}},& \end{array} $$
((18))

where η u ,Γ u are the mean and the covariance of p(u).
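For a quadratic Q, the approximation in Equation 18 is exact whenever the mean of p(u) sits at u^*; the following Monte Carlo check (toy numbers of our own choosing) illustrates this:

import numpy as np

rng = np.random.default_rng(0)
u_star = np.array([1.0, -1.0, 0.5])
Pi = np.diag([2.0, 1.0, 0.5])        # Pi_{u*} = negative Hessian of Q at its maximum
Gamma = np.diag([0.3, 0.2, 0.1])     # covariance of p(u), mean placed at u*
U = rng.multivariate_normal(u_star, Gamma, size=100000)
Uc = U - u_star
Q_mc = 4.0 - 0.5 * np.einsum('ij,jk,ik->i', Uc, Pi, Uc)   # samples of Q(u)
print(Q_mc.mean(), 4.0 - 0.5 * np.trace(Gamma @ Pi))      # both approximately 3.575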

From Equations 17 and 18, we approximately have

$$\begin{array}{@{}rcl@{}} & H(\theta)\approx \pi(X_{N},Y_{*}, \theta) -\frac{1}{2}Tr[\Gamma^{Y}_{X_{N}} \Pi^{Y}_{X_{N}}], & \\ &\pi(X_{N},Y, \theta)= \ln[q(X_{N}|Y, \theta)q(Y| \theta)], &\\ & Y_{*}=arg\max_{Y} \pi(X_{N},Y, \theta),\ \Pi^{Y}_{X_{N}}=-\frac{\partial^{2} \pi(X_{N},Y, \theta)}{\partial vec(Y) \partial vec(Y)^{T}},& \\ & {\Gamma^{Y}_{X}}={Cov}_{p(vec(Y)| X)}vec(Y),& \end{array} $$
((19))

where Cov_{p(u)}u denotes the covariance matrix of p(u), and vec(A) denotes the vector obtained by stacking the column vectors of A one by one.

Maximising the above H(θ), we get another type of Ying-Yang alternative implementation, as summarised in Algorithm 2.

Given \(Y_{*}=Y^{old}_{*}, \ \Gamma ^{Y}_{X_{N} }= \Gamma ^{Y\, old}_{X_{N} }\), the counterpart of Equation 15 becomes simply

$$\begin{array}{@{}rcl@{}} & \theta^{new}=arg \max_{\theta} H(\theta), & \end{array} $$
((20))

which acts as the Ying step of Algorithm 2.

Given θ=θ^new and \(Y_{*}^{new}\), the counterpart of Equation 16 becomes simply

$$\begin{array}{@{}rcl@{}} &\Gamma^{Y \, new}_{X } =arg \max_{\Gamma^{Y}_{X}}[(1+\eta) H(\theta) +\eta E_{Y|X}] =\frac{\eta}{1+\eta} \Pi_{X}^{Y \, new\, -1}, & \end{array} $$
((21))

where \(E_{Y|X}\approx 0.5d_{Y}\ln {(2\pi e)} +0.5\ln {|\Gamma ^{Y}_{X } |}\) is obtained by approximately regarding it as the entropy of a Gaussian density with a covariance matrix \(\Gamma ^{Y}_{X_{N} }.\)

Another insight on Equation 13 comes from observing q(Y|θ,X)q(X|θ)=q(X|Y,θ)q(Y|θ) from Equation 12, by which KL(p(Y|X)p(X)∥q(X|Y,θ)q(Y|θ)) becomes

$$\begin{array}{@{}rcl@{}} & KL(p(Y| X) p(X)\Vert q(Y| \theta, X)q(X|\theta))& \\ &= \int p(X) KL(p(Y| X) \Vert q(Y| \theta, X)) dX+ KL(p(X)\Vert q(X|\theta)).& \end{array} $$
((22))

With \(p(X)={p_{0}^{N}}(X)\) by Equation 8 and with q(Y|θ,X), q(X|θ) by Equation 12, we can rewrite Equation 13 into

$$\begin{array}{@{}rcl@{}} & \max_{\theta} H_{L}(\theta), \ \ H_{L}(\theta)=H(\theta)- \eta KL(p(Y| X_{N}) \Vert q(Y| \theta, X_{N}))+\eta \ln{q(X_{N}|\theta)},& \end{array} $$
((23))

from which we observe that the maximisation of H_L(θ) consists of not only a best Ying-Yang harmony but also, to a degree η, jointly a top-down maximum likelihood learning and a bottom-up best matching between the posteriors p(Y|X_N) and q(Y|θ,X_N).

The maximisation of the above second and third terms is exactly what has been widely called variational learning (Corduneanu and Bishop 2001; Jordan et al. 1999; McGrory and Titterington 2007), which is equivalent to the Ying-Yang best matching, as previously pointed out in Xu (2010a) (especially see the roadmap in its Figure A2). The sum of these two terms may be simply observed from

$$\begin{array}{@{}rcl@{}} & -[H(\theta) + E_{Y|X}]=- KL(p(Y| X)\Vert q(X_{N}|Y, \theta)q(Y| \theta))& \\ &= \ln{q(X_{N}|\theta) }-KL(p(Y| X)\Vert q(Y| \theta, X)) \le \ln{q(X_{N}|\theta) }, & \end{array} $$
((24))

which is a degenerate case that does not have the harmonising information flow H(θ) in the centre of Figure 3.

Figure 3

Information harmonising formulation.

Next, we consider dropping the last term in Equation 23, resulting in

$$\begin{array}{@{}rcl@{}} & \max_{\theta} H_{G}(\theta), & \\ & H_{G}(\theta)=H(\theta) - \eta KL(p(Y| X_{N}) \Vert q(Y| \theta, X_{N})) & \\ & =-\eta \ln{q(X_{N}|\theta) }+(1+\eta) H(\theta) + \eta E_{Y|X} \le H(\theta), & \end{array} $$

which may also be obtained from considering the constraint by Equation 12 in a Lagrange form. In the special case η=1, we may regard it as a counterpart of Equation 24, with the difference that H(θ) replaces ln q(X_N|θ).

On the other hand, we may also generalise Equation 24 in a Lagrange form as follows:

$$\begin{array}{@{}rcl@{}} & \ln{q(X_{N}|\theta) } -\eta KL(p(Y| X)\Vert q(Y| \theta, X))& \\ &=(1-\eta) \ln{q(X_{N}|\theta) }+\eta H(\theta) + \eta E_{Y|X}\ge H(\theta) + E_{Y|X}, & \end{array} $$

which becomes the counterpart of Equation 24 in general, instead of only at η=1. Alternatively, we may reach a tighter lower bound with an appropriate value of η.

Last but not least, maximising H_L(θ) by Equations 14 and 17 relates closely to some previous efforts summarised in Table 2.

Table 2 Related studies: KL- η -HL spectrum

Information harmonising dynamics

According to the Ying-Yang philosophy placed at the upper right corner of Figure 3, the Ying and Yang constitute a harmony system surviving in an environment, in which the Ying is primary while the Yang has not only a nature of variety but also a good adaptability to both the Ying and its environment. We may not only understand Equations 13, 14 and 17 from a classic perspective but also gain new insight into how the Ying and Yang interact dynamically.

The status of Ying-Yang harmony is jointly featured by H(p||q) and the Lagrange quantity η, where H(p||q) is given in Equation 9, or simply H(p||q)=H(θ) in Equation 11, while η is given in Equations 13 and 14, reflecting an agreement of balance between Ying and Yang in one of the following aspects:

  • Balance within the Yang domain, i.e. seeking a match between \({p_{h}^{N}}(X)\) by Equation 8 and \( q(X_{N}|\theta)=\int q(X|Y, \theta)q(Y| \theta) dY\), measured by a divergence \(-KL({p_{h}^{N}}(X)\Vert q(X_{N}|\theta))\) or equivalently a likelihood L(θ)= lnq(X N |θ).

  • Balance along the Yang pathway, i.e. to satisfy the constraint by Equation 12, e.g. measured by −K L(p(Y|X N )q(Y|θ,X N )).

  • Balance between Ying and Yang, i.e. both of the above, measured by KL(p(Y|X)p(X)∥q(X|Y,θ)q(Y|θ)), as in Equation 13.

Here, we focus on the standard cases, i.e. Ying-dominated models in which the Ying is primary. For some exceptional cases in which the Yang is primary, e.g. a forward architecture (see Sect. II(C) in Xu (2001b)), we may consider a balance within the Yang domain and a balance via the Yang pathway.

Typically, η could be a monotonically increasing function of a goodness that measures such a balance, while a best Ying-Yang harmony is reached at a balance that the Ying-Yang system has a least complexity.

Quantitatively, the harmonising dynamics remains an open topic that demands further investigation. Qualitatively, this dynamics may be roughly depicted via the dynamics of η as follows.

We start by considering two extreme cases. One happens at a bad Ying-Yang balance, featured by

$$\begin{array}{@{}rcl@{}} \eta \ \text{takes a very small value around} \ 0. \end{array} $$
((25))

The dynamics of maximising H_L(θ) then focuses on maximising H(θ), which makes p(Y|θ,X_N)=δ(Y−Y_*) with Y_*=arg max_Y π(X_N,Y,θ) mostly focused and least flexible, in order to rapidly satisfy the most urgent need of the Ying; that is, the BYY harmony learning degenerates to a special case that is an extension of competitive learning. Though it still works when the resulting H(θ) is used as a model selection criterion, e.g. see Eq. (10a) in Xu (1996), it becomes prone to initialisation and poor in automatic model selection because of the winner-take-all (WTA) competition among the inner representations of Y. Therefore, we should not let η always stay at too small a value.

The other extreme happens when the Ying-Yang balances well, featured by

$$\begin{array}{@{}rcl@{}} \eta \ \text{takes a very large value} \end{array} $$
((26))

such that η≈1+η. In such cases, maximising H_L(θ) by Equation 17 actually focuses on maximising η[H(θ)+E_{Y|X}], or equivalently minimising the Kullback divergence ηKL(p(Y|X)∥q(X_N|Y,θ)q(Y|θ)) for a Ying-Yang best matching, which makes p(Y|θ,X_N) tend to Equation 12 and thus enjoy a larger varying range or a big flexibility to cope with new samples. However, the harmonising information H(θ) in the centre of Figure 3 becomes negligible, i.e. weak in reducing the system complexity. In such a case, Algorithm 1 and Algorithm 2 become equivalent to the EM algorithm for maximum likelihood, which is poor in model selection too. This means that the dynamics approaches an equilibrium as η tends to a big value, during which model selection or structure changing is gradually shut off while parameters may still be refined.

In the beginning, a BYY system is given a pre-designed Ying-Yang structure, usually with all the unknown parameters initialised either randomly or according to a priori knowledge. Thus, the BYY system fits a given set X_N of samples badly, resulting in a poor Ying-Yang balance with a small η value, in a way similar to the first extreme case. The dynamics focuses on not only adjusting the structure but also updating the parameters towards a balance, with η quickly growing, and gradually tends to an equilibrium with X_N well described by a Ying-Yang structure of an appropriate complexity.

Surviving in an environment, the BYY system typically stays at one equilibrium of its harmonising dynamics. As the environment changes, the dynamics is featured by performing the following actions:

(A) Equilibrium and attention. When the system feels familiar with its observations, the dynamics stays at one equilibrium with a big value of η. An unexpected environmental change will make η drop. A large drop will trigger the system's attention to detect environmental novelty. In other words, there is an attention mechanism associated with η.

(B) Detection and problem-solving. A small drop of η is associated with a deviation from one equilibrium, which causes an increment of KL. This increment is associated with actions of detecting objects, recognising patterns and solving problems (e.g. inference or control) by the mapping X→Y via p(Y|θ,X_N).

(C) Adaptation and learning. When the two opposed changes of η and of KL are not big enough, such that the value of ηKL may not change considerably, learning will not be triggered and H_L(θ) by Equation 17 stays approximately unchanged. However, maximising H_L(θ) will start to minimise KL when the increment of KL becomes large while η remains at a high value, i.e. close to the second extreme case by Equation 26. In this case, the learning made by Algorithm 1 or Algorithm 2 becomes closer to maximum likelihood learning, which merely updates the parameters in the system without a big structural change; that is, no model selection occurs.

(D) Model selection and structure pruning. A big drop of η will happen when the BYY system faces a largely different environment, i.e. approaching the extreme case η=0; the dynamics then has to not only adjust the structure but also update the parameters towards a new equilibrium with η quickly brought up.

In summary, the above actions are featured by a feedback signal η as follows:

$$\begin{array}{@{}rcl@{}} \eta= g(v), \ v=f(d_{M}, d_{D}, d_{U}), \ \frac{d g(v)}{dv} <0, \ \frac{\partial f }{\partial d_{u}} >0, \ u=M, D, U. \end{array} $$
((27))

Conceptually, η monotonically decreases with a vigilance signal v, and this v monotonically increases with d_M, d_D and d_U, where d_M reflects the discrepancy between the data X and its counterpart \(\hat X\) reconstructed by the model, e.g. measured by the negative log-likelihood − ln q(X_N|θ) or \(KL({p_{h}^{N}}(X)\Vert q(X_{N}|\theta))\), while d_D reflects the deviation of an inner representation Y from the desired Y_d, e.g. measured by the squared error between Y and its corresponding \(\hat Y\). Moreover, d_U is a measure that reflects salient occurrences that attract attention. Further investigation is needed on the detailed forms of d_M, d_D and d_U, as well as on the specific form of g(f(·,·,·)), which may be considered by nonlinear regression.
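Equation 27 only constrains monotonicity; purely for illustration (the specific forms below are our own assumptions, not prescribed by the theory), one may take f as a weighted sum and g as a decaying exponential:

import numpy as np

def f(d_M, d_D, d_U, w=(1.0, 1.0, 1.0)):
    # vigilance signal v, increasing in each discrepancy measure
    return w[0] * d_M + w[1] * d_D + w[2] * d_U

def g(v):
    # eta drops as vigilance rises: dg/dv < 0 on v >= 0
    return np.exp(-v)

eta = g(f(d_M=0.2, d_D=0.1, d_U=0.0))
print(eta)   # a familiar environment (small discrepancies) keeps eta large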

As illustrated in Figure 3, the strength η controls the flexibility and adaptability that the Yang enjoys, described by an entropy gain \(-\eta \int p(Y| X){p_{h}^{N}}(X)\ln {[p(Y| X){p_{h}^{N}}(X)]}dYdX =\eta [E_{Y|X} +E_{X}(h)]\). Transferring this information from the Yang to the Ying, the Ying attempts to harmonise the information by updating parameters and modifying its structure to increase the amount of negative entropy ηH(θ). Therefore, a net amount of harmonising information (1+η)H(θ)+η[E_{Y|X}+E_X(h)] is maximised, by which we are led to Equations 14 and 17.

For a large η, the Yang enjoys a large flexibility to avoid overfitting the samples and to prepare adaptability for possible environmental changes. The more flexibility (i.e. ηE_{Y|X}) the Yang currently enjoys, the larger the amount of negative entropy (i.e. ηH(θ)) the Ying needs to manage. When it becomes difficult to manage, the Ying-Yang balance will deteriorate and thus incur a drop of η to reduce the flexibility of the Yang. In other words, there is a negative feedback mechanism that stabilises the dynamics of information harmonising, as illustrated in Figure 4.

Figure 4

Negative feedback stabilises dynamics.

Learning Gaussian mixture and learning factor analysis

We start by considering the Gaussian mixture as follows:

$$\begin{array}{@{}rcl@{}} & q(x,y|\theta)= \prod_{\ell=1}^{k}q(x,\theta_{\ell})^{y^{(\ell)}}, q(x,\theta_{\ell})=\alpha_{\ell}G(x|\mu_{\ell},\Sigma_{\ell}), & \end{array} $$
((28))

with x∈R^d and θ_ℓ={α_ℓ,μ_ℓ,Σ_ℓ}, where y=[y^{(1)},…,y^{(k)}]^T satisfies

$$\begin{array}{@{}rcl@{}} &\sum_{\ell=1}^{k} y^{(\ell)}=1, \ y^{(\ell)} \ \text{takes either 0 or 1}.& \end{array} $$

Given \(X_{N}=\{x_{t}\}_{t=1}^{N}\) of i.i.d. samples, its corresponding samples \(Y=\{y_{t}\}_{t=1}^{N}\) are also i.i.d. Accordingly, \(p^{new}_{Y| X_{N}}\) in Equation 16 becomes simplified into

$$\begin{array}{@{}rcl@{}} & p^{new}_{Y| X_{N}}= \prod_{t=1}^{N}p(y_{t}| x_{t}, \theta^{new}, \eta^{new}), \ p(y| x, \theta, \eta)= \frac{q(x,y|\theta)^{\frac{1+\eta}{\eta}}}{\sum_{y}q(x,y|\theta)^{\frac{1+\eta}{\eta} }}, & \\ &p(\ell|x,\theta)=p(y^{(\ell)}=1, y^{(j)}=0, \forall j\ne \ell | x, \theta, \eta)= \frac{[\alpha_{\ell}G(x|\mu_{\ell},\Sigma_{\ell})]^{\frac{1+\eta}{\eta}}}{\sum_{j=1}^{k}[\alpha_{j}G(x|\mu_{j},\Sigma_{j})]^{\frac{1+\eta}{\eta}} }, & \end{array} $$
((29))

from which the Yang step of Algorithm 1 is turned into the Yang step of a new Ying-Yang alternating algorithm for learning Gaussian mixture, summarised in Algorithm 3. Its Ying step is obtained by maximising H_L(θ) in Equation 17 with

$$\begin{array}{@{}rcl@{}} H(\theta) = \sum_{t=1}^{N}\sum_{y_{t}} p(y_{t}| x_{t}, \theta, \eta)\ln{q(x_{t}, y_{t}| \theta)}. \end{array} $$
((30))
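Algorithm 3 itself is not listed in this excerpt; the sketch below (hypothetical name byy_gmm, with our own pruning threshold, reusing the log_gauss helper sketched earlier) follows Equations 29 and 30: the Yang step computes the tempered posterior of Equation 29, the Ying step maximises H(θ) of Equation 30 with the posterior fixed (the same weighted updates as the M step of EM), and components whose mixing weight vanishes are discarded, realising automatic model selection in the sense of Equation 7.

import numpy as np


def byy_gmm(X, k_init, eta=1.0, n_iter=200, prune_tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    k = k_init
    alpha = np.full(k, 1.0 / k)
    mu = X[rng.choice(N, k, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    for _ in range(n_iter):
        # Yang step (Equation 29): posterior tempered by the exponent (1+eta)/eta
        logp = ((1.0 + eta) / eta) * np.stack(
            [np.log(alpha[l]) + log_gauss(X, mu[l], Sigma[l]) for l in range(k)],
            axis=1)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # Ying step: maximise H(theta) of Equation 30 with p fixed
        Nl = p.sum(axis=0)
        alpha = Nl / N
        keep = alpha > prune_tol
        if not keep.all():                            # automatic model selection:
            alpha = alpha[keep] / alpha[keep].sum()   # discard components with
            mu, Sigma = mu[keep], Sigma[keep]         # vanishing posterior mass
            k = int(keep.sum())
            continue
        mu = (p.T @ X) / Nl[:, None]
        for l in range(k):
            Xc = X - mu[l]
            Sigma[l] = (p[:, l, None] * Xc).T @ Xc / Nl[l] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma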

Next, we consider one popular linear system as follows:

$$\begin{array}{@{}rcl@{}} & x=Ay+ e, \ \ q(y|\phi)=G(y|\nu, \Lambda), \ \Lambda=diag[\lambda_{1}, \cdots, \lambda_{k}], & \\ & E[ey^{T}]=0 \ \text{or} \ q(e|y, \psi)=q(e)=G(e|0, \Sigma),& \end{array} $$
((31))

which leads to what is typically called factor analysis (FA), where Σ is a nonnegative diagonal matrix.

Classically, the name FA refers to the model of Equation 31 with Λ=I. In this paper, we use FA-a to denote this classical FA, and FA-b to refer to the one by Equation 31 with a diagonal matrix Λ≠I together with the following orthogonal constraint

$$\begin{array}{@{}rcl@{}} &A^{T}A=I.& \end{array} $$
((32))

For maximum likelihood learning, FA-a and FA-b are equivalent. However, FA-b becomes much more favourable when using a learning algorithm with a nature of automatic model selection. Readers are referred to Sect. 2.2 in Xu (2011) and Tu and Xu (2011a) for further studies on FA-b versus FA-a.

Given \(X_{N}=\{x_{t}\}_{t=1}^{N}\) of i.i.d. samples and its corresponding \(Y=\{y_{t}\}_{t=1}^{N},\) H L (θ) in Equation 19 and H(θ) in Equation 17 become simplified into

$$\begin{array}{@{}rcl@{}} &H(\theta) \approx \sum_{t=1}^{N}H(\theta|x_{t}), \ H_{L}(\theta) \approx \sum_{t=1}^{N}H_{L}(\theta|x_{t}),& \\ & H(\theta|x_{t})= \pi(x_{t},y_{t}, \theta) -\frac{1}{2}Tr[\Gamma_{y\vert x} \Pi_{y\vert x}], & \\ & \pi(x,y, \theta)=\ln{[G(x|Ay+\mu,\Sigma)G(y|\nu, \Lambda)]}, \ \Pi_{y\vert x}=A^{\mathrm{T}}\Sigma^{ -1}A+\Lambda^{-1}.&\\ & H_{L}(\theta|x_{t})=(1+\eta)H(\theta|x_{t}) +\eta\frac{\ln|\Gamma_{y\vert x}| +m\ln{(2\pi e)}}{2} & \\ &y_{t}=arg\max_{y} \ \pi(x_{t},y, \theta)={Wx}_{t}+w, \ W= \Gamma_{y|x}A^{T}{\Sigma}^{-1}, \ w=\Lambda^{-1}\nu-W\mu, & \end{array} $$
((33))

Usually, ν is set to be 0. Here we use ν to denote a constant vector for convenience of a further extension in Algorithm 14.

We update Σ new,Λ new by Equation 20 via solving them analytically as follows:

$$\begin{array}{@{}rcl@{}} &y_{t}=W^{old}x_{t}+w^{old}, \ e_{t}=x_{t}-\mu-A^{{\ old}}(y_{t}-\nu), & \\ & \Sigma^{{new}}=A^{{\ old}}\Gamma^{old}_{y\vert x} A^{{\ old}\;\mathrm{T}} +\frac{1}{N}\sum\limits_{t}e_{t}{e_{t}^{T}},\ \Lambda^{new}=\Gamma^{old}_{y\vert x}+\frac{1}{N}\sum\limits_{t}(y_{t}-\nu) (y_{t}-\nu)^{T}. & \end{array} $$
((34))

Moreover, for updating A new we can get

$$\begin{array}{@{}rcl@{}} &A=R_{xy}\Lambda^{new \, -1}, \ R_{xy} =\frac{1}{N}\sum\limits_{t}e_{t} (y_{t}-\nu)^{T}.& \end{array} $$
((35))
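Before turning to the constraint handling on A, one Ying step assembled from Equations 33-35 may be sketched as follows (the function name is hypothetical; keeping Λ diagonal by retaining only the diagonal part is our reading of Equation 31, and μ, ν are treated as fixed inputs since their updates are not displayed in this excerpt):

import numpy as np


def fa_ying_step(X, A, Sigma, Lam, mu, nu, eta):
    N = X.shape[0]
    # Equation 33: Pi_{y|x}, then Gamma_{y|x} by Equation 21, and W, w
    Pi = A.T @ np.linalg.inv(Sigma) @ A + np.linalg.inv(Lam)
    Gamma = (eta / (1.0 + eta)) * np.linalg.inv(Pi)
    W = Gamma @ A.T @ np.linalg.inv(Sigma)
    w = np.linalg.inv(Lam) @ nu - W @ mu
    Y = X @ W.T + w                              # y_t = W x_t + w
    E = X - mu - (Y - nu) @ A.T                  # e_t of Equation 34
    Sigma_new = A @ Gamma @ A.T + E.T @ E / N    # Equation 34
    Lam_new = np.diag(np.diag(Gamma + (Y - nu).T @ (Y - nu) / N))
    Rxy = E.T @ (Y - nu) / N                     # Equation 35
    A_new = Rxy @ np.linalg.inv(Lam_new)         # direct use is the FA-a update;
    return A_new, Sigma_new, Lam_new             # FA-b re-orthogonalises, Eq. 36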

For updating FA-a, the above obtained A can be directly used as A^new. However, it cannot be directly used as A^new for updating FA-b because the orthogonal constraint by Equation 32 must also be satisfied, for which we let

$$\begin{array}{@{}rcl@{}} &A^{new}=G_{S}[R_{xy}\Lambda^{new \, -1}],& \end{array} $$
((36))

where G_S[A] denotes a Gram-Schmidt operator that orthogonalises A. Even more simply, we may make a gradient-based local search

$$\begin{array}{@{}rcl@{}} & A^{{new}}=A^{old}+\gamma_{A} \Delta A, & \end{array} $$
((37))

where γ_A>0 is a small learning stepsize, and ΔA is a projection of ∇_A H(θ) onto the constraint of Equation 32; e.g. for A given by Equation 35 we simply get

$$\begin{array}{@{}rcl@{}} & \Delta A=(I-AA^{T})\nabla_{A} H(\theta^{old}), \ \nabla_{A} H(\theta)=\Sigma^{new\, -1} [R_{xy}- \Lambda^{new}A].& \end{array} $$
((38))

The orthogonal constraint by Equation 32 also plays the role of removing a scale indeterminacy of the linear system by Equation 31: for an arbitrary diagonal matrix D≠I we may have Ay=(AD)(D^{-1}y)=A′y′ with y′ still Gaussian, while Equation 32 breaks under A′=AD. Further details are referred to Sect. 2.2 in Xu (2011).

Also, there are alternative constraints in place of Equation 32, e.g. see Eqs. (33) and (34) in Xu (2011).

One weak point of the above Equations 37 and 38 is that an appropriate γ_A is needed; otherwise, learning instability may be caused. Alternatively, we may replace Equation 32 by the following easily computed one:

$$\begin{array}{@{}rcl@{}} & Tr[A^{T}A]=1.& \end{array} $$
((39))

For short, the notation FA-c is used to refer to such a type of FA, namely, the one by Equation 31 not only with a diagonal Λ≠I but also with Equation 39. Then, we consider ∇_A H_γ(θ) via the Lagrangian H_γ(θ)=H(θ)−γ(Tr[A^T A]−1), resulting in

$$\begin{array}{@{}rcl@{}} \Sigma^{new}\nabla_{A} H_{\gamma}(\theta)=R_{xy}- \Lambda^{new}A-\gamma \Sigma^{{new}} A, \end{array} $$
((40))

which is solved as follows

$$\begin{array}{@{}rcl@{}} &A^{new}=A_{\gamma^{*}},\ A_{\gamma}=R_{xy}(\Lambda^{new}+\gamma\Sigma^{new})^{-1}, \ \gamma^{*} \ \text{is the root of} \ Tr[A_{\gamma}^{T}A_{\gamma}]=1,& \end{array} $$
((41))

where γ is obtainable by any one-variate iterative algorithm, e.g. Newton.

In summary, we can turn Algorithm 2 into Algorithm 4 for learning factor analyses, via modifying the Ying step, that is, we update Σ new,Λ new based on Equation 34 and then update A new according to a choice of possible constraints on A.

When Σ=σ 2 I, we also get an alternative algorithm for learning Principal Component Analysis (PCA) with automatic model selection on the number of principal components. Further details about PCA versus FA are referred to Sect.3.2 of (Xu 2010a).

Learning local factor analysis

We can combine factor analysis by Equation 31 and Gaussian mixture by Equation 28 into the following general one:

$$\begin{array}{@{}rcl@{}} &q(y,\ell|\phi)=G(y|\nu_{\ell},\Lambda_{\ell})q(\ell| \alpha), \ q(\ell| \alpha)=\sum_{j=1}^{k}\alpha_{\ell} \delta_{\ell,j}, \ \sum_{j=1}^{k}\alpha_{j}=1, \ 1\ge \alpha_{j}\ge 0, & \\ & \pi(x,y, \ell,\theta)= \ln{[G(x|A_{\ell}y+\mu_{\ell},\Sigma_{\ell})q(y,\ell|\phi)q(A_{\ell})]},& \end{array} $$
((42))

where δ i,j is the Kronecker delta with δ i,j =1 if i=j and δ i,j =0 otherwise, which actually describes i.i.d. samples \(X_{N}=\{x_{t}\}_{t=1}^{N}\) by a mixture of local factor analysis or local subspaces at a special case \(\Sigma _{\ell }=\sigma ^{2}_{\ell }I\).

Accordingly, H_L(θ) in Equation 23 is rewritten into

$$\begin{array}{@{}rcl@{}} & H_{L}(\theta)=\sum_{t=1}^{N} \sum_{\ell=1}^{k} p(\ell|x_{t},\theta)[(1+\eta)H_{L}(\theta|\ell, x_{t})-\eta \ln{p(\ell|x_{t},\theta)}], &\\ & H_{L}(\theta|\ell, x_{t}) = H(\theta|\ell, x_{t})+\frac{\eta}{1+\eta} E_{y|\ell,x_{t}},&\\ &E_{y|\ell,x_{t}} = -\int p(y|\ell, x_{t},\theta) \ln{p(y|\ell, x_{t},\theta)}dy,&\\ & H(\theta|\ell, x_{t})=\int p(y|\ell, x_{t},\theta) \pi(x_{t},y,\ell, \theta)dy.& \end{array} $$
((43))

Similar to H_L(θ|x_t) in Equation 33, we further get

$$\begin{array}{@{}rcl@{}} & H_{L}(\theta|\ell, x_{t}) = \pi(x_{t},y_{t,\ell},\ell, \theta) + \frac{0.5\eta}{1+\eta}[\ln|\Gamma_{\ell, y\vert x}| +\ln{(2\pi)^{m_{\ell}}}],&\\ & E_{y|\ell,x_{t}}=0.5[\ln|\Gamma_{\ell, y\vert x}| +m_{\ell}\ln{(2\pi e)}],& \\ & H(\theta|\ell, x_{t})= \pi(x_{t},y_{t,\ell},\ell, \theta) -\frac{1}{2}Tr[\Gamma_{\ell,y\vert x} \Pi_{\ell,y\vert x}], & \\ & \Pi_{\ell,y\vert x}=A_{\ell}^{T}{\Sigma}_{\ell}^{-1}A_{\ell}+\Lambda_{\ell}^{-1}, \ \Gamma_{\ell,y\vert x}^{new}= \frac{\eta }{\eta+1}\Pi_{\ell,y\vert x}^{old\, -1},&\\ &y_{t,\ell}=arg\max_{y} \ \pi(x_{t},y,\ell, \theta)=W_{\ell}x_{t}+w_{\ell}, & \\ & W_{\ell}= \Gamma_{\ell,y|x}A_{\ell}^{T}{\Sigma_{\ell}}^{-1}, \ w_{\ell}=\Lambda_{\ell}^{-1}\nu_{\ell}-W_{\ell}\mu_{\ell}, & \end{array} $$
((44))

from which we further get θ new via maximising \(\sum _{t=1}^{N} \sum _{\ell =1}^{k} p(\ell |x_{t},\theta)H(\theta |\ell, x_{t}),\) resulting in the Ying step of a new Ying-Yang alternating algorithm for learning a mixture of local factor analysis, as in Algorithm 5. Actually, this Ying step combines the Ying of Algorithm 3 and the Ying of Algorithm 4.

Maximising H_L(θ) with respect to p(ℓ|x_t,θ) yields

$$\begin{array}{@{}rcl@{}} p(\ell| x_{t}, \theta)= \frac{e^{\frac{\eta+1}{ \eta}\pi(x_{t},y,\ell, \theta) +\frac{1}{2}\ln{[|\Gamma_{\ell, y\vert x}| (2\pi)^{m_{\ell}}}]}}{ \sum_{\ell=1}^{k} e^{\frac{\eta+1}{ \eta}\pi(x_{t},y,\ell, \theta) +\frac{1}{2}\ln{[|\Gamma_{\ell, y\vert x}| (2\pi)^{m_{\ell}}}]} }\\ =\frac{[\alpha_{\ell} G(x_{t}|\mu_{\ell},A_{\ell}\Lambda_{\ell} A_{\ell}^{T}+ \Sigma_{\ell}) ]^{\frac{\eta+1}{ \eta}} }{ \sum_{\ell=1}^{k} [\alpha_{\ell} G(x_{t}|\mu_{\ell},A_{\ell}\Lambda_{\ell} A_{\ell}^{T}+ \Sigma_{\ell})]^{\frac{\eta+1}{ \eta}}}, \end{array} $$
((45))

from which and together with Equation 44, we see that the Yang step of Algorithm 5 actually combines the Yang of Algorithm 3 and the Yang of Algorithm 4.

This algorithm degenerates back to not only Algorithm 4 with k=1 but also Algorithm 3 with y=0 and A_ℓ=0 for each ℓ.

Learning binary factor analysis

We consider another setting of the linear system, with each y^{(i)} taking either 0 or 1 and q(y|ϕ) in Equation 31 being a multivariate Bernoulli distribution as follows:

$$\begin{array}{@{}rcl@{}} &q(y|\phi)=\prod_{i} \alpha_{i}^{y^{(i)}}(1-\alpha_{i})^{1-y^{(i)}}, \ q(x|y, \psi)=G(x|Ay+\mu,\Sigma),& \end{array} $$
((46))

which is called binary factor analysis (BFA).

Together with adding the constraint on y under Equation 28, we are led to an equivalent form of Equation 28. In other words, learning BFA may be regarded as a relaxation or extension of learning Gaussian mixture.

Putting this setting into Equation 17, we get its simplified version as follows:

$$\begin{array}{@{}rcl@{}} & H_{L}(\theta)=(1+\eta) H(\theta) +\eta E_{Y|X},\ H(\theta) = \sum_{t=1}^{N}\sum_{y\in {\cal C}_{tf}} p(y| x_{t}, \theta) \pi(x_{t},y, \theta), &\\ & \pi(x,y, \theta)=\ln{[G(x|Ay+\mu,\Sigma)\prod_{i}\frac{\alpha_{i}^{y^{(i)}}}{(1-\alpha_{i})^{y^{(i)}-1}}]},&\\ &E_{Y|X}=-\sum_{t=1}^{N} \sum_{y\in {\cal C}_{tf}}p(y| x_{t}, \theta)\ln{p(y| x_{t}, \theta)}. & \end{array} $$
((47))

For a small k, \({\cal C}_{tf}\) can be the entire set consisting of all the possible values of y. For a large k, such a set could be huge; instead, we consider a \({\cal C}_{tf}\) that merely consists of a subset of values that we focus on. One choice is given by

$$\begin{array}{@{}rcl@{}} & {\cal C}_{tf}=\{ y : \text{ differing from } y_{t}^{*} \text{ by less than } \kappa \text{ bits} \},& \\ & \text{where } \ y_{t}^{*} =arg\max_{y}\pi(x_{t},y, \theta^{old}) \text{ and } \kappa \text{ is a small number, e.g. } \kappa=1 \text{ or } 2. & \end{array} $$
((48))

One example was given by Eq. (20) in Xu (2010a) for binary FA, and the other example may also be found in Sect. 2.1.5 of Xu (2012a) on learning Gaussian mixture.
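A small sketch of building the focused set of Equation 48, enumerating the binary vectors within κ flipped bits of y_t^* (the helper name is our own, and we read the bit budget inclusively, i.e. at most κ flips):

from itertools import combinations


def focused_set(y_star, kappa):
    k = len(y_star)
    out = [tuple(y_star)]
    for r in range(1, kappa + 1):
        for flips in combinations(range(k), r):
            y = list(y_star)
            for i in flips:
                y[i] = 1 - y[i]          # flip r chosen bits of y_star
            out.append(tuple(y))
    return out


print(focused_set([1, 0, 1], kappa=1))
# [(1, 0, 1), (0, 0, 1), (1, 1, 1), (1, 0, 0)]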

Given \(y_{t}^{*}\) and \({\cal C}_{tf}\), t=1,…,N by Equation 48, we maximise H_L(θ) with respect to p(y|x,θ), resulting in

$$\begin{array}{@{}rcl@{}} &p(y| x, \theta)= \frac{exp[\frac{1+\eta}{\eta} \pi(x,y, \theta)]}{ \sum_{y\in {\cal C}_{tf}} exp[\frac{1+\eta}{ \eta} \pi(x,y, \theta)] },& \end{array} $$
((49))

from which we get the Yang step of Algorithm 6 for binary factor analyses, similar to getting the Yang step of Algorithm 3 from the Yang step of Algorithm 1.

With p(y|x,θ) fixed, we get θ^new by maximising H(θ), resulting in the Ying step of Algorithm 6.

Imposing the constraint on y in Equation 29 and letting \({\cal C}_{tf}\) cover the entire domain of y, this algorithm degenerates to Algorithm 3 for Gaussian mixture when Σ_ℓ=Σ.

The summation over \({\cal C}_{tf}\) incurs a high computing cost when \({\cal C}_{tf}\) consists of many elements. Alternatively, we may assume that \(p(y| x_{t}, \theta) =\prod _{i} p(y^{(i)}| x_{t}, \theta)\) with \( 0 \le \xi ^{(i)}_{y| x_{t}}= \int y^{(i)} p(y^{(i)}| x_{t}, \theta)dy^{(i)} \le 1\), and simplify H(θ) and E_{Y|X} into

$$\begin{array}{@{}rcl@{}} & H(\theta) = \sum_{t=1}^{N} \pi(x_{t},y, \theta)_{y= [\xi^{(1)}_{y| x_{t}}, \dots, \xi^{(k)}_{y| x_{t}}]^{T}}, & \\ &E_{Y|X}=- \sum_{t=1}^{N}\sum_{i=1}^{k}\xi^{(i)}_{y| x_{t}}\ln{\xi^{(i)}_{y| x_{t}}}, & \end{array} $$
((50))

from which we get Algorithm 7 with a simplified Ying step; its Yang step, however, needs to obtain \(\xi _{y| x_{t}}\) by solving a constrained quadratic optimisation via one of the typical existing techniques (Fang et al. 1997; Floudas and Visweswaran 1995).

Learning nonGaussian factor analysis

We proceed to consider an even more general case, called nonGaussian factor analysis (NFA), with each independent component of y being nonGaussian, e.g. coming from a mixture of univariate Gaussians. Here, we consider the following setting:

$$\begin{array}{@{}rcl@{}} &q(y,z|\phi)=\prod_{i} \{G(y^{(i)}|\nu_{z^{(i)}}^{(i)},\lambda_{z^{(i)}}^{(i)}) q(z^{(i)}| \alpha)\}, &\\ & z= [z^{(1)}, \dots, z^{(k)}]^{T}, \ z^{(i)}=1, \dots, m_{i} \ \text{with} \ m_{i}\ge 1, & \\ & q(z^{(i)}| \alpha)=\sum_{j=1}^{m_{i}}\alpha_{j}^{(i)} \delta_{j,z^{(i)}},\ \sum_{j=1}^{m_{i}}\alpha_{j}^{(i)}=1, & \\ & \pi(x,y, z,\theta)= \ln{[G(x|Ay+\mu,\Sigma)q(y,z|\phi)q(A)]}.& \end{array} $$
((51))

where \(1\ge \alpha _{j}^{(i)}\ge 0\). Putting it into Equation 23, similar to Equation 43 we get

$$\begin{array}{@{}rcl@{}} & H_{L}(\theta)=\sum_{t=1}^{N} \sum_{z\in {\cal C}_{tf}} p(z|x_{t},\theta)[(1+\eta)H_{L}(\theta|z, x_{t})-\eta \ln{p(z|x_{t},\theta)}], & \\ & H_{L}(\theta|z, x_{t}) = H(\theta|z, x_{t})+\frac{\eta}{1+\eta} E_{y|z,x_{t}}(\theta),&\\ &E_{y|z,x_{t}}(\theta) = -\int p(y|z, x_{t},\theta) \ln{p(y|z, x_{t},\theta)}dy, &\\ & H(\theta|z, x_{t})=\int p(y|z, x_{t},\theta) \pi(x_{t},y,z, \theta)dy.& \end{array} $$

Similar to Equation 49, maximising H_L(θ) gives p(z|x_t,θ) in the Yang step. Similar to Equation 44, we also get \( y_{z,t}=[y_{z,t}^{(1)}, \dots, y_{z,t}^{(k)}]^{T}=arg\max _{y} \pi (x_{t},y, z,\theta)\) and \( \Gamma _{z,y\vert x}=arg\max _{\ \Gamma _{z,y\vert x}} H_{L}(\theta |z, x_{t}).\)

We maximise H_L(θ) to update θ, resulting in the Ying step of Algorithm 8 for learning NFA. The Ying step consists of a first part for updating each component \(\alpha _{j}^{(i) }G(y^{(i)}|\nu _{z^{(i)}}^{(i)},\lambda _{z^{(i)}}^{(i)})\) and a second part for updating G(x|Ay+μ,Σ). Also, the role of \(\phantom {\dot {i}\!}\delta _{j,z^{(i)}}\) is to pick those components that contribute to the corresponding \(\alpha _{j}^{(i) }, \nu _{j}^{(i) }, \lambda _{j}^{(i) }\) according to whether z^{(i)}=j. The number m_i of the components is determined via trimming off \(G(y^{(i)}|\nu _{z^{(i)}}^{(i)},\lambda _{z^{(i)}}^{(i)})\) if \((\alpha _{j}^{(i) }\lambda _{j}^{(i)})^{new}\to 0\).

Unsupervised vs semi-supervised

Instead of merely knowing i.i.d. samples \(X_{N}=\{x_{t}\}_{t=1}^{N}\), there may be a subset X_s⊂X_N in which each x_t∈X_s is associated with a supervision sample \(y_{t}^{*}\). The problem is called unsupervised learning when X_s is an empty set, and supervised learning when X_s=X_N. Generally, the problem is called semi-supervised learning when X_s lies between the two extreme cases.

For BYY harmony learning, unsupervised, semi-supervised and supervised learning are all expressed in the same formulation. There are two types of implementation according to whether y is discrete or real valued.

When y is discrete, we modify H_L(θ) by Equation 17 into

$$\begin{array}{@{}rcl@{}} & \max_{\theta} H_{L,S}(\theta)=H_{L}(\theta)+\gamma H_{S} (\theta), & \\ &H_{S} (\theta) = \sum_{x_{t} \in X_{s}} \sum_{y_{t}} \delta_{y_{t}, y_{t}^{*}} p(y_{t}| x_{t}, \theta, \eta)\ln{q(x_{t}, y_{t}| \theta)},& \end{array} $$

where \(y_{t}^{*}\) is the teaching label associated with X_s, and γ>0 is a confidence factor. The bigger γ is, the higher our confidence in the supervision sample.

Accordingly, maximising H L,S (θ) results in

$$\begin{array}{@{}rcl@{}} & p(y| x, \theta)= \frac{q(x,y|\theta)^{[\gamma\delta_{y, y_{t}^{*}}+1+\eta]/{\eta}} }{\sum_{y}q(x,y|\theta)^{[\gamma\delta_{y, y_{t}^{*}}+1+\eta]/{\eta}}}. & \end{array} $$
((52))

Fixing p(y|x,θ), we further update θ via maximising H L,S (θ). From Equations 30 and 52, we have

$$\begin{array}{@{}rcl@{}} & H_{L,S}(\theta)=\sum_{t=1}^{N}\sum_{y_{t}} p(y_{t}| x_{t}) \ln{q(x_{t}, y_{t}| \theta)},& \\ &p(y_{t}| x_{t}) =p(y_{t}| x_{t}, \theta)(\eta+1+\gamma\delta_{y_{t}, y_{t}^{*}}),& \end{array} $$
((53))

from which we modify Algorithm 3 into Algorithm 9, with the Ying step kept unchanged while the Yang step is modified accordingly.

We can always assign a teaching label \(\ell _{t}^{*}\) to each sample x_t. If there is no teaching label, we assign \(\ell _{t}^{*}\) a number larger than k and thus always have \(\delta _{\ell, \ell _{t}^{*}}=0\). Otherwise, we let \(\ell _{t}^{*}\) be its teaching label and have \(\delta _{\ell, \ell _{t}^{*}}=1\) when \(\ell =\ell _{t}^{*}\).
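Per sample, the Yang step with Equation 52's label-biased exponent may be sketched as follows (a hypothetical helper: log_joint holds the values of ln q(x,y|θ) over the candidate values of y, and label indexes the teaching value, with None for an unsupervised sample):

import numpy as np


def semisup_posterior(log_joint, eta, gamma=1.0, label=None):
    # exponent (gamma*delta(y, y*) + 1 + eta) / eta from Equation 52
    expo = np.full(len(log_joint), (1.0 + eta) / eta)
    if label is not None:
        expo[label] += gamma / eta     # extra weight on the teaching value
    logp = expo * np.asarray(log_joint)
    p = np.exp(logp - logp.max())
    return p / p.sum()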

Similarly, we modify Algorithm 6 for learning binary FA into a semi-supervised version, i.e. Algorithm 10. Whether or not there is a teaching sample \(y_{t}^{*}\) for x_t, we may always assign one \(y_{t}^{*}\) to each sample x_t. If there is no teaching sample, we assign \(y_{t}^{*}\) a value outside \({\cal C}_{tf}\) and thus have \(\delta _{y, y_{t}^{*}}=0\). Otherwise, we let \(y_{t}^{*}\) be its teaching sample and have \(\delta _{y, y_{t}^{*}}=1\) when \(y=y_{t}^{*}\).

When y is real valued, teaching samples will not affect the Yang step, while updating θ by the Ying step becomes maximising

$$\begin{array}{@{}rcl@{}} H_{S}(\theta)=\sum_{t=1}^{N}[(1+\eta) \pi(x_{t},y_{t}, \theta) + \gamma I_{t} \pi(x_{t},y_{t}^{*}, \theta) ], \ y_{t}=arg\max_{y} \pi(x_{t},y, \theta), \end{array} $$
((54))

where I t is an indicator explained by the remark given in Algorithm 11. It follows from Equation 54 that Algorithm 4 for learning FA can be modified into Algorithm 11 with some changes in the Ying step.

Moreover, we may combine Equations 53 and 54 to modify Algorithm 8 for learning NFA into Algorithm 12. Similar to Algorithm 10, we may always assign one discrete vector \(z_{t}^{*}=[z_{t}^{(1)*},\cdots,z_{t}^{(m)*}]\) to each sample x_t. If there is no teaching information, we assign \(z_{t}^{*}\) a value that is out of our consideration, e.g. letting every \( z_{t}^{(i)*}\) be a big number, so we always have \(\delta _{z, z_{t}^{*}}=0\) for z∈\({\cal C}_{tf}\). Otherwise, we let \(z_{t}^{*}\) be its teaching label about z_t and use \(\delta _{z, z_{t}^{*}}=1\) to indicate \(z=z_{t}^{*}\).

Similar to the Yang steps of Algorithm 9 and Algorithm 10, we get \(p_{z| x_{t}}(\theta)\), with the difference that \(p_{z| x_{t}} =p_{z| x_{t}}(\theta ^{new})\) is not globally rescaled by a factor; instead, a rescaling is distributed among the updates in the Ying step. Another difference from Algorithm 10 lies in that each z_t is also associated with a real-valued vector \(y_{t}=[y_{t}^{(1)},\cdots,y_{t}^{(m)}]\). For each teaching label \(z_{t}^{*}\), there are two situations. One is that the corresponding teaching vector \(y_{z,t}^{*}=[y_{z,t}^{(1)*},\cdots,y_{z,t}^{(m)*}]^{T}\) is given together with \(z_{t}^{*}\). The other is that we have \(z_{t}^{*}\) only and need to estimate \(y_{z,t}^{*}\).

Also, the situation is different from getting y_{z,t}=y(z_t,x_t,θ^new) in Algorithm 8, where we only have x_t without knowing either \(z_{t}^{*}\) or \(y_{z,t}^{*}\). Here, we estimate \(y_{z,t}^{*}= y(z_{t}^{*}, x_{t}, \theta ^{new})\) based on the given teaching signal \(z_{t}^{*}\).

Still, it relates to updating \(A,\Sigma,R_{xy}\) in Algorithm 11 in that \(\delta _{z, z_{t}^{*}}\) takes the role of \(I_{t}\), though the situation becomes more complicated due to the role of \(z_{t}^{*}\) and a scalar Gaussian mixture for each component \(y_{t}^{(i)}\).

BYY harmony sparse learning: a dual view

In all the previous sections, the BYY harmony learning implements the maximisation of H(θ) in Equation 11 without considering a prior q(θ). In this section, we show that learning performance can be further improved by prior-aided learning from a dual perspective.

We still consider the linear system in Figure 1, where the pair A,Y is observed from a dual view, also called a co-dimensional (shortly co-dim) perspective (see Sect.2 in Xu (2011)). Considering a prior q(A|ρ) while ignoring priors on other parameters, we rewrite H(p||q) by Equation 9 into

$$\begin{array}{@{}rcl@{}} & H(p||q)=H(\theta, \phi, \rho) = \int p(A |X) [H(\theta)+\ln{q(A|\rho)} ] dA & \\ & = \int p(A |X) p(Y| X){p_{h}^{N}}(X) \ln[q(X|AY, \psi)q(Y| \phi)q(A|\rho)]dA\;dY\;dX, &\\ & = \int p(Y |X) [H_{d}(\theta)+\ln{q(Y| \phi)} ] dY, & \end{array} $$
((55))

from which we observe that p(A|X) and p(Y|X) take the same position in the first line and the last line, respectively, and that \(H_{d}(\theta)\) is actually a dual counterpart of H(θ) in Equation 11 as follows:

\(H(\theta)= \int p(Y |X) {p_{h}^{N}}(X)\ln [q(X|AY, \psi)q(Y| \phi)]dY\;dX,\)

\(H_{d}(\theta)= \int p(A |X) {p_{h}^{N}}(X)\ln [q(X|AY, \psi)q(A|\rho)]dAdX.\)

This dual view motivates improving the learning via not only updating A aided with the prior q(A|ρ) but also maximising \(H_{d}(\theta)\).

First, it follows from the second line in Equation 55, with the help of Equation 18, that maximising H(p||q) is approximately turned into

$$\begin{array}{@{}rcl@{}} & \{A_{*}, \rho_{*}\}=\operatorname{argmax}_{A, \rho} H(A, \rho, \theta^{-}), & \\ & H(p||q)= H(A_{*}, \rho_{*}, \theta^{-})-\frac{1}{2}Tr[\Gamma^{A}_{X_{N}} {\Pi^{A}_{X}}], & \\ & {\Gamma^{A}_{X}}={Cov}_{p(vec[A]| X)}vec[A], \ {\Pi^{A}_{X}}=-\frac{\partial^{2} \pi(X_{N},AY, \theta)}{\partial vec[A] \partial vec[A]^{T}},& \\ &H(A, \rho, \theta^{-})=H(\theta)+\ln{q(A|\rho)},\ \theta_{*}^{-}=\operatorname{argmax}_{\theta^{-}} H(\theta), & \end{array} $$

where \(\theta^{-}\) results from removing A,ρ from θ. Any of the algorithms introduced in the previous sections can implement the maximisation of the last line above.

Here, we consider the maximisation of the first line, for which we start at

$$\begin{array}{@{}rcl@{}} &H(A, \rho, \theta^{-})=\int p(Y |X_{N}) \pi(X_{N},AY, \theta)dY, & \\ &\pi(X_{N},AY, \theta)= \ln[q(X_{N}|AY, \theta)q(A|\rho)q(Y| \theta)].& \end{array} $$

Given \(X_{N}=\{x_{t}\}_{t=1}^{N}\) that consists of i.i.d. column vectors, we consider the settings

$$\begin{array}{@{}rcl@{}} &q(A|\rho) = \prod_{j} G(a_{j}|0,{\Sigma_{j}^{a}})& \\ & \ln{q(A|\rho)} =-0.5md\ln{(2\pi)}-0.5\sum_{j}\ln{|{\Sigma_{j}^{a}}|} -0.5\sum_{j} {a_{j}^{T}}\Sigma_{j}^{a\, -1}a_{j}, & \\ & q(X_{N}|AY, \theta)=\prod_{t} G(x_{t}|{Ay}_{t}, \Sigma),\ {Ay}_{t}=\sum_{i} a_{i} y_{t}^{(i)}, &\\ &\ln{q(X_{N}|AY, \theta)}=-0.5N[d\ln{(2\pi)}+\ln{|\Sigma|}] -0.5\sum_{t}(x_{t}-{Ay}_{t})^{T}\Sigma^{-1}(x_{t}-{Ay}_{t}), & \end{array} $$
((56))

from which we can get

$$\begin{array}{@{}rcl@{}} & \nabla_{a_{j}} H(A, \rho, \theta^{-})= - \Sigma_{j}^{a\, -1}a_{j}+N\Sigma^{-1}[\mathbf{r}_{xy}^{(j)}- \sum_{i}a_{i} \lambda_{ij}],&\\ & \Lambda=[ \lambda_{ij} ], \ R_{xy}=[\mathbf{r}_{xy}^{(1)}, \cdots, \mathbf{r}_{xy}^{(m)}], & \end{array} $$

where \(\theta^{-}\) is obtained from implementing the maximisation of the last line in Equation 56, which is available via the algorithms introduced in the previous sections.

From \(\nabla _{a_{j}} H(A, \rho, \theta ^{-})=0, j=1, \cdots, m\), A is solved by the following equation:

$$\begin{array}{@{}rcl@{}} & {\cal B}\ vec(A)=vec(R_{xy}), \ {\cal B}= I_{d\times d} \otimes \Lambda+\frac{diag[ \Sigma \Sigma_{1}^{a\, -1}, \cdots,\Sigma \Sigma_{m}^{a\, -1} ]}{N}, & \end{array} $$
((57))

where ⊗ denotes the Kronecker product. This equation is equivalent to Eq. (51) in Xu (2011), i.e. the problem of solving a Sylvester matrix equation (Bartels and Stewart 1972; Miyajima 2013).

From \(\nabla _{a_{j}} H(A, \rho, \theta ^{-})\), we further get the second order derivative as follows

$$\begin{array}{@{}rcl@{}} & -\nabla^{2}_{a_{j}a_{\ell}^{T}} H(A, \rho, \theta^{-}) =\Sigma_{j}^{a\, -1}\delta_{j\ell}+ N\Sigma^{-1}\lambda_{j\ell}, &\\ & {\Pi^{A}_{X}}= N\Sigma^{-1} \otimes \Lambda + diag[ \Sigma_{1}^{a\, -1}, \cdots, \Sigma_{m}^{a\, -1} ]. \quad & \end{array} $$
((58))

Putting it into \(H(A,\rho,\theta^{-})\) and fixing \(\Gamma ^{A}_{X_{N}}\), we get \(\nabla_{\rho} H(A,\rho,\theta^{-})\) and its root as follows

$$\begin{array}{@{}rcl@{}} \rho^{new}= diag[ \Sigma_{1}^{a }, \cdots, \Sigma_{m}^{a } ]^{new}= \Gamma^{A\ new}_{X_{N}}+vec(A^{new})vec(A^{new})^{T}. \end{array} $$
((59))

Similar to Equation 33, it follows from \(H_{L}(p||q)=(1+\eta)H(p||q)+0.5\eta \ln|\Gamma_{y|x}|\) that we get

$$\begin{array}{@{}rcl@{}} & \Gamma^{A\ new}_{X_{N}} =\frac{\eta}{1+\eta}\Pi^{A\ new\, -1}_{X}, & \end{array} $$
((60))

which is put into the above Equation 59 for updating \(\rho^{new}\).

Computations of \({\cal B}, \Gamma ^{A}_{X_{N}}, {\Pi ^{A}_{X}}\) are rather simple since \(\Sigma _{j}^{a }, \Lambda, \Sigma \) are typically diagonal matrices, or even \(\Sigma=\sigma^{2}I\). Such uncorrelated structures facilitate learning with the nature of automatic model selection (see Sect.2.2 of Xu (2012a) and Sect.2.2 of Xu (2010a)), which pushes redundant elements of A towards zeros via pushing their corresponding variances towards zeros. As a result, learning leads to a sparse matrix A.
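
Under the simplifying assumptions that Λ is diagonal, \(\Sigma=\sigma^{2}I\), and each prior covariance is isotropic, \(\Sigma_{j}^{a}=s_{j}I\), Equation 57 decouples column-wise and the whole alternation of Equations 57 to 60 reduces to a few array operations. The following Python sketch, with illustrative names, shows one such alternation together with the pruning check:

```python
import numpy as np

def dual_sparse_update(R_xy, lam, sigma2, s_a, N, eta, prune_tol=1e-6):
    """One alternation of Equations 57, 59 and 60 under the simplifying
    assumptions Sigma = sigma^2 I, Lambda = diag(lam), and isotropic
    priors Sigma_j^a = s_a[j] * I (all illustrative choices).

    R_xy : d x m cross statistic; lam : length-m diagonal of Lambda;
    s_a  : length-m prior variances of the columns a_j of A.
    """
    d, m = R_xy.shape
    # Eq. 57 collapses column-wise to A (Lambda + diag(c)) = R_xy,
    # with c_j = sigma^2 / (N * s_a[j]):
    c = sigma2 / (N * s_a)
    A = R_xy / (lam + c)                      # column-wise division
    # Eq. 58 per column: Pi_j = (N * lam_j / sigma^2 + 1 / s_a[j]) I
    Pi_diag = N * lam / sigma2 + 1.0 / s_a
    # Eq. 60 (Yang step): Gamma = eta/(1+eta) * Pi^{-1}
    Gamma_diag = (eta / (1.0 + eta)) / Pi_diag
    # Eq. 59 (Ying step): Sigma_j^a <- Gamma block + a_j a_j^T; we keep
    # the isotropic part tr[.]/d to stay within the scalar assumption
    s_a_new = Gamma_diag + (A ** 2).sum(axis=0) / d
    # automatic model selection: variances pushed to ~0 flag redundant a_j
    keep = s_a_new > prune_tol
    return A, s_a_new, keep
```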

Such a BYY harmony sparse learning comes from q(A|ρ), which takes a dual role of q(Y|ϕ). Different from the existing sparse learning studies (Shi et al. 2011a, 2014; Tu and Xu 2011a; Xu 2012b), which consider either q(A|ρ) in a long-tailed distribution at an extensive computing cost or q(A|ρ) in Equation 56 with the help of one additional prior q(ρ) (see Sect.III of Xu (2012b)), here the updating by Equation 59 is made by q(A|ρ) in Equation 56 without considering such a prior q(ρ).

Of course, we may proceed to consider a prior q(ρ) and also some priors on Λ,Σ, which will lead to another layer of integrals over ρ,Λ,Σ. Readers are referred to Sect.2.3 in Xu (2011) for the details of implementation.

We may improve all the algorithms introduced in the previous sections, simply with their counterpart of solving A replaced by

$$\begin{array}{@{}rcl@{}} \text{updating } A \text{ by Equation 57 together with Equations 59 and 60,} \end{array} $$
((61))

which has been already listed in Algorithm 4, Algorithm 5, Algorithm 6, Algorithm 7, Algorithm 8 and Algorithm 12 as one alternative of \(A^{new}=R_{xy}\Lambda^{new\,-1}\) in the Ying step.

The implementation of maximising the first line in Equation 55 is featured by the order of integrals \(\int [\cdot ] dY dA\). In a dual view, we may also swap the order to consider maximising the last line in Equation 55. The detailed implementation is quite similar. Moreover, we may alternately conduct the two implementations.

De-noise Gaussian mixture

The Gaussian mixture by Equation 28 may also be viewed from a perspective of one specific linear system in Figure 1, with \(x_{t}\in R^{d}\) generated as follows

$$\begin{array}{@{}rcl@{}} & x=Ay+ e, \ y=[ y^{(1)}, \cdots, y^{(k)}]^{T}, y^{(j)} =0\ or \ 1, \ \sum_{j=1}^{k}y^{(j)} =1,&\\ &q(y|\phi)=\prod_{\ell=1}^{k} \alpha_{\ell}^{y^{(\ell)}}, \ \phi=\{\alpha_{\ell}\}, \ \sum_{\ell=1}^{k}\alpha_{\ell}=1, \ \text{with } \alpha_{\ell} \ge 0.& \\ &q(e| \psi)= G(e|0, {\sigma_{e}^{2}}I), \ q(A|\rho) = \prod_{j} G(a_{j}|\mu_{j},\Sigma_{j}),& \end{array} $$
((62))

as proposed in Sect.3.1 of Xu (2011). We have

$$\begin{array}{@{}rcl@{}} & q(x|\theta)=\sum_{y} \int q(x|Ay, \psi) q(y|\phi) q(A|\rho)dA & \\ &= \sum_{j} \alpha_{j} \int G(x|a_{j}, {\sigma_{e}^{2}}I)G(a_{j}|\mu_{j},\Sigma_{j}){da}_{j}=\sum_{j} \alpha_{j} G(x|\mu_{j}, {\sigma_{e}^{2}}I+\Sigma_{j}). & \end{array} $$

That is, we get a Gaussian mixture with each covariance matrix added with the variance of a common noise e. Given \(y^{(j)}=1\), we see that \(\hat x=x-e=a_{j}\) comes from \(G(a_{j}|\mu_{j},\Sigma_{j})\) and provides a de-noised version of the observed sample x. Since \(y^{(j)}\) takes 1 with probability \(\alpha_{j}\), the de-noised \(\hat X\) actually comes from the mixture \(\sum _{j} \alpha _{j} G(x|\mu _{j}, \Sigma _{j})\). Thus, this study is called, in Sect.3.1 of Xu (2011), learning de-noised Gaussian mixture or shortly de-noised GM.

Somewhat similar to Equation 43, we can rewrite \(H_{L}(\theta)\) in Equation 23 into

$$\begin{array}{@{}rcl@{}} & H_{L}(\theta)=\sum_{t=1}^{N} \sum_{\ell=1}^{k} p(\ell|x_{t},\theta)[(1+\eta)H_{L}(\theta|\ell, x_{t})-\eta \ln{p(\ell|x_{t},\theta)}], &\\ & H_{L}(\theta|\ell, x_{t}, \eta) = H(\theta|\ell, x_{t})+\frac{\eta}{1+\eta}E_{a_{\ell}|x_{t}}, E_{a_{\ell}|x_{t}} = -\int p(a_{\ell}| x_{t},\theta) \ln{p(a_{\ell}| x_{t},\theta)}{da}_{\ell},&\\ & H(\theta|\ell, x_{t})=\int p(a_{\ell}| x_{t},\theta) \pi(x_{t},a_{\ell}, \theta){da}_{\ell},& \\ &\pi(x,a_{\ell},\theta)= \ln{[G(x|a_{\ell},{\sigma_{e}^{2}}I)G(a_{\ell}|\mu_{\ell},\Sigma_{\ell})\alpha_{\ell}]}.& \end{array} $$
((63))

Similar to Equation 44, we further get

$$\begin{array}{@{}rcl@{}} & H(\theta|\ell, x_{t})= \pi(x_{t},a_{\ell}, \theta) -\frac{1}{2}Tr[\Gamma_{a_{\ell}\vert x} \Pi_{a_{\ell}\vert x}],\ H_{a_{\ell}|x_{t}}=0.5[\ln|\Gamma_{a_{\ell}\vert x}| +d\ln{(2\pi e)}],& \\ &a_{t,\ell}=arg\max_{a_{\ell}} \ \pi(x_{t},a_{\ell}, \theta)=W_{\ell}x_{t}+w_{\ell},\ a_{t,\ell}= [{\sigma^{2}_{e}}I+\Sigma_{\ell}]^{-1} (\Sigma_{\ell}x_{t}+ {\sigma^{2}_{e}} \mu_{\ell}), & \\ & W_{\ell}=[{\sigma^{2}_{e}}I+\Sigma_{\ell}]^{-1} \Sigma_{\ell}, \ w_{\ell}=[{\sigma^{2}_{e}}I+\Sigma_{\ell}]^{-1} {\sigma^{2}_{e}} \mu_{\ell}, & \\ & \Pi_{a_{\ell}\vert x}=({\sigma^{2}_{e}})^{-1}I+\Sigma_{\ell}^{-1}, \ \Gamma_{a_{\ell}\vert x}^{new}= \frac{\eta }{\eta+1}\Pi_{a_{\ell}\vert x}^{old\, -1}=\frac{\eta }{\eta+1}[{\sigma^{2}_{e}}I+\Sigma_{\ell}]^{-1} {\sigma^{2}_{e}} \Sigma_{\ell},&\\ & H_{L}(\theta|\ell, x_{t}, \eta) = \pi(x_{t},a_{\ell}, \theta) +\frac{0.5\eta}{1+\eta} \ln{\frac{|{\sigma^{2}_{e}} \Sigma_{\ell}|}{|{\sigma^{2}_{e}}I+\Sigma_{\ell}|}}+c_{\eta},& \end{array} $$
((64))

where \(c_{\eta}\) is a constant that does not relate to \(\theta, \ell\).

Maximising \(H_{L}(\theta)\) with respect to \(p(\ell|x_{t},\theta)\) yields

$$\begin{array}{@{}rcl@{}} p(\ell| x_{t}, \theta)= \frac{e^{[\frac{\eta+1}{ \eta}\pi(x_{t},a_{\ell}, \theta) +0.5\ln{\frac{|{\sigma^{2}_{e}} \Sigma_{\ell}|}{|{\sigma^{2}_{e}}I+\Sigma_{\ell}|}} ]}}{ \sum_{j=1}^{k} e^{[\frac{\eta+1}{ \eta}\pi(x_{t},a_{j}, \theta) +0.5\ln{\frac{|{\sigma^{2}_{e}} \Sigma_{j}|}{|{\sigma^{2}_{e}}I+\Sigma_{j}|}} ]} }. \end{array} $$
((65))

Then, we maximise \(\sum _{t=1}^{N} \sum _{\ell =1}^{k} p(\ell |x_{t},\theta)H(\theta |\ell, x_{t})\) to get θ new, resulting in

$$\begin{array}{@{}rcl@{}} & \sigma_{e}^{2 {new}}=\frac{Tr[\Gamma_{a_{\ell}\vert x}^{old}] }{d}+ \frac{\sum_{\ell=1}^{k} \sum_{t}p(\ell|x_{t},\theta)(x_{t}-a_{t,\ell})^{T}(x_{t}-a_{t,\ell})}{Nd}, & \\ &\alpha_{\ell}^{new}=\frac{\sum_{t}p(\ell|x_{t},\theta)}{N}, \ \mu_{\ell}^{new}=\frac{\sum_{t}p(\ell|x_{t},\theta)a_{t,\ell}}{ \sum_{t}p(\ell|x_{t},\theta)}. &\\ & \Sigma_{\ell}^{new}=\Gamma^{old}_{a_{\ell}\vert x}+\frac{\sum_{t}p(\ell|x_{t},\theta)(a_{t,\ell}-\mu_{\ell}^{new})(a_{t,\ell}-\mu_{\ell}^{new})^{T}}{ \sum_{t}p(\ell|x_{t},\theta)}. & \end{array} $$
((66))

Putting the above Equations 66, 65 and 64 into Algorithm 13, we get a new Ying-Yang alternating algorithm for learning de-noise GM, which improves its counterpart in Sect.3.1 of Xu (2011) in that the Lagrange technique of Algorithm 3 helps the Ying-Yang alternative implementation. Also, \(p(\ell|x_{t},\theta)\) in Equation 65 has been extended to cover semi-supervised learning in the same way as in Algorithm 9.
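
A minimal Python sketch of one Ying step by Equations 64 and 66, given the responsibilities of Equation 65, may look as follows; all names are illustrative, and the Γ-trace term of the \(\sigma_{e}^{2}\) update is averaged over components since the text leaves its component index implicit:

```python
import numpy as np

def denoise_gm_ying_step(X, p, mu, Sigma, sigma2_e, eta):
    """X: N x d samples; p: N x k responsibilities from Eq. 65;
    mu: k x d means; Sigma: k x d x d covariances (a sketch)."""
    N, d = X.shape
    k = mu.shape[0]
    alpha_new = p.sum(axis=0) / N
    mu_new, Sigma_new = np.zeros_like(mu), np.zeros_like(Sigma)
    sse, tr_gamma = 0.0, 0.0
    for l in range(k):
        M_inv = np.linalg.inv(sigma2_e * np.eye(d) + Sigma[l])
        W = M_inv @ Sigma[l]                 # Eq. 64: W_l
        w = M_inv @ (sigma2_e * mu[l])       # Eq. 64: w_l
        a = X @ W.T + w                      # de-noised samples a_{t,l}
        Gamma = (eta / (eta + 1.0)) * M_inv @ (sigma2_e * Sigma[l])
        tr_gamma += np.trace(Gamma)
        r = p[:, l]
        mu_new[l] = (r @ a) / r.sum()        # Eq. 66
        diff = a - mu_new[l]
        Sigma_new[l] = Gamma + (r[:, None] * diff).T @ diff / r.sum()
        sse += (r * ((X - a) ** 2).sum(axis=1)).sum()
    # Eq. 66; averaging the Gamma-trace over the k components (our choice)
    sigma2_e_new = tr_gamma / (k * d) + sse / (N * d)
    return alpha_new, mu_new, Sigma_new, sigma2_e_new
```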

Conventionally, noises are filtered by a preprocessing step (if needed) with the help of a standard noise filtering method. In many applications, however, filtering noises and making clustering or density estimation are actually two coupled tasks. The de-noise GM provides a model that considers both in the same learning process, while Algorithm 13 provides a useful tool that implements both tasks. Moreover, we can include some knowledge (e.g. teaching labels) in an easy way. One example is its potential application to image segmentation. Applied to a noisy image, \(a_{t,\ell}\) yields de-noised pixels for each segmented region, while pixel classification can be made by \(\ell^{*}=\operatorname{argmax}_{\ell}\, p_{\ell,t}\). For a sharpened image, we may merely use \(a_{t,\ell ^{*}}\) as the de-noised pixels of each segmented region.

Sparse linear and logistic regression

When we are given a set of paired samples \(\{x_{t}, y_{t}^{*}\}\), the FA model by Equation 31 actually performs a multiple linear regression for the following mapping yx:

$$\begin{array}{@{}rcl@{}} & x=Ay+\mu+ e, \ y=[ y^{(1)}, \cdots, y^{(k)}]^{T}, q(y|\phi)=G(y|0, \Lambda), \ q(e)=G(e|0, \sigma^{2}I).& \end{array} $$
((67))

Though we may directly use Algorithm 11 for learning, it is difficult to trim off the redundant elements of y via checking whether \(\lambda_{i}\to 0\) in the Ying step of Algorithm 4; in this case, the contribution of \(\{ y_{t}^{*}\}\) will prevent any \(\lambda_{i}\) in Λ in Equation 34 from tending to zero. In contrast, learning by Algorithm 4 and Algorithm 11 is still able to push redundant elements of A towards zero when updating A is made by Equation 57 together with Equations 59 and 60.

For clarity, we simplify Algorithm 4 and Algorithm 11 into Algorithm 14. Its Yang step is directly Equation 60. The Ying step is a simplification of Equation 57 together with Equation 59 at \(\Sigma=\sigma^{2}I\), plus a new equation for updating \(\sigma^{2}\). All the updates aim to maximise H(p||q) of Equation 56 in the following simplification:

$$\begin{array}{@{}rcl@{}} H(p||q)= \ln[q(X_{N}|AY, \theta)q(A)]-\frac{1}{2}Tr[\Gamma^{A}_{X_{N}} {\Pi^{A}_{X}}], \end{array} $$

with q(A)=q(A|ρ) given by Equation 56. Also, from Equation 67 we have \(q(X_{N}|AY, \theta)=\prod _{t} G(x_{t}|{Ay}_{t}+\mu, \Sigma)\).

To get a further insight, we observe the special case that \(x_{t}\) is simply univariate, i.e. d=1, at which A becomes a vector \(a^{T}\) and Equation 67 actually becomes the widely studied linear regression problem, for which Algorithm 14 is simplified into Algorithm 15. It differs from the ordinary linear regression in that Λ is corrected by a term \( \frac {\sigma ^{2}}{N} \Sigma ^{a\, -1}\) for solving a.
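
The essence of this univariate case can be sketched in Python as follows; here Λ and \(r_{xy}\) are taken as sample statistics of the given regressors (an illustrative stand-in for the algorithm's own statistics), and only the diagonal curvature is kept for simplicity:

```python
import numpy as np

def sparse_linreg_step(Yr, x, s_a, sigma2, eta):
    """One alternation of the d = 1 case behind Algorithm 15 (a sketch).
    Yr: N x k regressors, x: length-N response, s_a: length-k
    per-coefficient prior variances; names are illustrative."""
    N, k = Yr.shape
    Lam = Yr.T @ Yr / N            # stand-in for Lambda (sample statistic)
    r_xy = Yr.T @ x / N            # cross statistic r_xy
    # corrected normal equation: (Lambda + (sigma^2/N) Sigma^{a -1}) a = r_xy
    a = np.linalg.solve(Lam + (sigma2 / N) * np.diag(1.0 / s_a), r_xy)
    sigma2_new = ((x - Yr @ a) ** 2).mean()
    # Eqs. 58-60 at d = 1, keeping only the diagonal curvature
    Pi = N * np.diag(Lam) / sigma2 + 1.0 / s_a
    Gamma = (eta / (1.0 + eta)) / Pi           # Yang step, Eq. 60
    s_a_new = Gamma + a ** 2                   # Ying step, Eq. 59
    return a, s_a_new, sigma2_new              # s_a -> 0 prunes coefficients
```

Note how each coefficient receives its own ridge-like penalty \(\sigma^{2}/(N s_{j})\), which grows without bound as \(s_{j}\to 0\) and thereby trims redundant coefficients.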

Also, we may maximise H(p||q) to make sparse learning for a multiple logistic regression by

$$\begin{array}{@{}rcl@{}} & \ln{q(X_{N}|AY, \theta)}=\sum_{t} \ln{q(x_{t}|{Ay}_{t}+\mu)}, &\\ & \ln{q(x_{t}|{Ay}_{t}+\mu)} = \sum_{i=1}^{d} [x_{t}^{(i)} \ln{s(\hat x_{t}^{(i)})}+(1-x_{t}^{(i)}) \ln{(1-s(\hat x_{t}^{(i)}))}], \ \hat x_{t}= {Ay}_{t}+\mu, &\\ &q(A|\rho) = \prod_{j} G(a_{j}|0,{\Sigma_{j}^{a}}), \ q(\mu)=G(\mu|0,\Sigma^{\mu}), & \end{array} $$
((68))

where 0≤s(r)≤1 is a sigmoid function, e.g. simply

$$\begin{array}{@{}rcl@{}} s(r)=1/(1+e^{-r}). \end{array} $$
((69))

We further get

$$\begin{array}{@{}rcl@{}} &\nabla_{a_{j}} \ln q(A)=-\Sigma_{j}^{a \, -1}a_{j}, \ \nabla_{\mu} \ln q(\mu)=-\Sigma^{\mu \, -1}\mu,& \\ &\delta a_{j} =\nabla_{a_{j}} \ln{q(X_{N}|AY, \theta)}=\sum_{t} \xi_{t}^{(j)} s'(\hat x_{t}^{(j)})y_{t}, &\\ &\text{where} \ s'(r)=\frac{ds(r)}{dr}, \ \xi_{t}^{(i)} =\frac{ x_{t}^{(i)}}{ s(\hat x_{t}^{(i)})}-\frac{1-x_{t}^{(i)}}{ 1-s(\hat x_{t}^{(i)}) }, &\\ &\delta \mu =\nabla_{\mu} \ln{q(X_{N}|AY, \theta)}= [\sum_{t} \xi_{t}^{(1)} s'(\hat x_{t}^{(1)}), \cdots, \sum_{t} \xi_{t}^{(d)} s'(\hat x_{t}^{(d)})]^{T}, &\\ &\Pi_{a_{j}}=-\nabla_{a_{j}{a_{j}^{T}}} \ln{q(X_{N}|AY, \theta)}=\sum_{t} w_{t}^{(j)}y_{t}{y_{t}^{T}}, &\\ & \Pi_{\mu}=-\nabla_{\mu\mu^{T}} \ln{q(X_{N}|AY, \theta)}= diag[\sum_{t} w_{t}^{(1)}, \cdots, \sum_{t} w_{t}^{(d)}], &\\ &w_{t}^{(i)}= -\xi_{t}^{(i)} s^{\prime\prime}(\hat x_{t}^{(i)})+ \frac{ x_{t}^{(i)} s^{\prime 2}(\hat x_{t}^{(i)}) }{ s^{2}(\hat x_{t}^{(i)}) }+\frac{(1-x_{t}^{(i)})s^{\prime 2}(\hat x_{t}^{(i)})}{ (1-s(\hat x_{t}^{(i)}))^{2} }, \ \text{with} \ s^{\prime\prime}(r)=\frac{d^{2}s(r)}{dr^{2}}, & \end{array} $$
((70))

from which we get Algorithm 16 to make the BYY sparse learning for logistic regression. Similar to Equation 60, we get its Yang step. Similar to Equations 58 and 59, we have

$$\begin{array}{@{}rcl@{}} & \Pi^{a_{j}}_{X}= \Pi_{a_{j}}+ \Sigma_{j}^{a\, -1}, \ \Pi^{\mu}_{X}= \Pi_{\mu}+ \Sigma^{\mu\, -1}, &\\ & \Sigma_{j}^{a }= \Gamma^{a_{j}\ new}_{X_{N}}+a_{j}{a_{j}^{T}}, \ \Sigma^{\mu }= \Gamma^{\mu\ new}_{X_{N}}+\mu\mu^{T}, & \end{array} $$
((71))

which is put into the Ying step of Algorithm 16. Different from Algorithm 14, there is no need to consider \(\Sigma=\sigma^{2}I\), while updating A,μ is made by gradient ascent instead of solving a nonlinear equation.
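
A Python sketch of one gradient-ascent step by Equation 70 with the priors of Equation 68 follows. One can check that \(\xi_{t}^{(i)} s'(\hat x_{t}^{(i)})\) simplifies to the standard logistic residual \(x_{t}^{(i)}-s(\hat x_{t}^{(i)})\); the per-column scalar precisions and the fixed stepsize are our illustrative assumptions:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def logistic_sparse_step(X, Y, A, mu, prec_a, prec_mu, lr=0.1):
    """One gradient-ascent step from Equation 70 with the priors of
    Equation 68. X: N x d binary data, Y: N x m factors, A: d x m.
    prec_a (length m) and prec_mu (length d) are assumed scalar
    precisions Sigma_j^{a -1} and the diagonal of Sigma^{mu -1}."""
    H = Y @ A.T + mu                 # rows are hat{x}_t = A y_t + mu
    R = X - sigmoid(H)               # xi_t * s'(hat{x}_t) = x_t - s(hat{x}_t)
    grad_A = R.T @ Y - A * prec_a    # data term plus -Sigma_j^{a-1} a_j
    grad_mu = R.sum(axis=0) - prec_mu * mu
    return A + lr * grad_A, mu + lr * grad_mu
```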

Temporal FA and temporal binary FA

The FA model by Equation 31 has been extended to modelling temporal dependence in (Xu 1999a, 2001b, 2004a) by adding the following vector-based autoregression

$$\begin{array}{@{}rcl@{}} & y_{t}={By}_{t-1}+\varepsilon_{t}, \ q(\varepsilon_{t}|\phi)=G(\varepsilon_{t}|0, \Lambda).& \end{array} $$
((72))

The joint modelling by Equations 31 and 72 is called temporal factor analysis, shortly temporal FA or TFA.

Learning TFA can be implemented by maximising H(p||q) as follows:

$$\begin{array}{@{}rcl@{}} & H(p||q)= H_{1}(p||q) + H_{2}(p||q) -\sum_{t} \ln{G(y_{t}|{By}_{t-1}, \Lambda)} & \\ & H_{1}(p||q)=\sum_{t} \pi_{1}(x_{t},{Ay}_{t}, \theta)- \frac{1}{2}Tr[\Gamma^{A}_{X_{N}} {\Pi^{A}_{X}}] -\frac{1}{2}Tr[\Gamma_{y\vert x} \Pi_{Y\vert X}], & \\ &H_{2}(p||q)=\sum_{t} \pi_{2}(y_{t},{By}_{t-1}, \psi) -\frac{1}{2}Tr[\Gamma^{B}_{X_{N}} \Pi^{B_{X}}],& \\ & \pi_{1}(x_{t},{Ay}_{t}, \theta)=\ln[q(x_{t}|{Ay}_{t}, \Sigma) G(y_{t}|{By}_{t-1}, \Lambda)]+\frac{1}{N}\ln{q(A|\rho_{A})}, &\\ &\pi_{2}(y_{t},{By}_{t-1}, \psi) = \ln[ G(y_{t}|{By}_{t-1}, \Lambda)G(y_{t-1}|0, \Omega_{t-1})]+\frac{1}{N}\ln{q(B|\rho_{B})}. & \end{array} $$
((73))

Given \(\nu=By_{t-1}\) fixed, maximising \(H_{1}(p||q)\) is decoupled from \(H_{2}(p||q)-\sum _{t} \ln {G(y_{t}|{By}_{t-1}, \Lambda)}\) and thus is handled exactly by learning FA, as summarised in Part-A of Algorithm 17. With Λ fixed, maximising \(H_{2}(p||q)\) is decoupled from \(H_{1}(p||q)-\sum _{t} \ln {G(y_{t}|{By}_{t-1}, \Lambda)}\) too. Also, samples of \(\{y_{t}, y_{t-1}\}\) are available from implementing Part-A. The problem of maximising \(H_{2}(p||q)\) is equivalent to the special case of a multiple linear regression at μ=0, ν=0 and d=m, and thus B can be learned by Algorithm 14, as summarised in Part-B of Algorithm 17.

One additional issue needs to be handled. The implementation of Part-A needs to know the covariance matrix \(\Omega_{t-1}\) of \(y_{t-1}\), which takes the position of Λ in Algorithm 14. It follows from \(y_{t}=By_{t-1}+\varepsilon_{t}\) that we have the following equation as a constraint:

$$\begin{array}{@{}rcl@{}} & \Omega_{t}= B \Omega_{t-1}B^{T}+ \Lambda, & \end{array} $$
((74))

which may be recursively updated from \(\Omega_{0}\) after Λ is updated in Part-A and B is updated in Part-B.

When \(\{y_{t}\}\) is a stationary process with \(\Omega_{t-1}\to\Omega\) as \(t\to\infty\), Equation 74 becomes \(\Omega=B\Omega B^{T}+\Lambda\) or

$$\begin{array}{@{}rcl@{}} & [I-(B\otimes B)]vec(\Omega)=vec(\Lambda),& \end{array} $$
((75))

from which we may get Ω by solving this equation.
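
Equation 75 can be solved directly by a dense linear solve over the vectorised unknowns, e.g. as in the following sketch (column-major vec convention; B must have spectral radius below 1 for stationarity):

```python
import numpy as np

def stationary_covariance(B, Lam):
    """Solve [I - (B kron B)] vec(Omega) = vec(Lambda), i.e. Equation 75,
    for the stationary covariance Omega = B Omega B^T + Lambda."""
    m = B.shape[0]
    lhs = np.eye(m * m) - np.kron(B, B)
    vec_omega = np.linalg.solve(lhs, Lam.flatten(order='F'))
    return vec_omega.reshape((m, m), order='F')

# scipy.linalg.solve_discrete_lyapunov(B, Lam) solves the same equation
# and scales better for large m.
```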

In a similar way, we may also extend the binary FA by Equation 46 to modelling temporal dependence by the following modification

$$\begin{array}{@{}rcl@{}} & q(x_{t}|y_{t}, \psi)=G(x_{t}|{Ay}_{t}+\mu,\Sigma),\ q(y_{t}|y_{t-1}, \phi)=\prod_{i} \alpha_{i}^{y_{t}^{(i)}}(1-\alpha_{i})^{1-y_{t}^{(i)}}, & \\ & \alpha_{i}=s(\hat y_{t}^{(i)}), \ [\hat y_{t}^{(1)}, \cdots, \hat y_{t}^{(m)}]^{T}= \hat y_{t}={By}_{t-1}+\nu, & \end{array} $$

where \(y_{t}^{(i)}\) takes either 0 or 1, and s(r) is a sigmoid function, e.g. by Equation 69. This model is called temporal binary factor analysis, shortly temporal BFA.

Similar to Equation 73, learning temporal BFA can be implemented by maximising H(p||q), with the help of maximising \(H_{1}(p||q)\) by Algorithm 6 for learning BFA with \(\alpha_{i}\) fixed, as summarised in Part-A of Algorithm 18, and with the help of maximising \(H_{2}(p||q)\) by Algorithm 16 to learn B,ν for the logistic regression \(y_{t-1}\to y_{t}\), as summarised in Part-B of Algorithm 18.

Bi-linear matrix system and manifold learning

Putting samples \(x_{t}, e_{t}, y_{t}\), \(t=1,\cdots,N\), into their corresponding matrix formats \(X, E\in R^{d\times N}\), \(Y\in R^{k\times N}\), respectively, we extend the FA model x=Ay+e in Equation 31 into the following generalised bi-linear matrix system (BMS)

$$\begin{array}{@{}rcl@{}} X=\mu(AY)+ E,\ E=\left[e_{t}^{(i)}\right], \ q(X|Y,\theta)=q(X-\mu(AY))=q(E|Y), \\ q(E|Y)= \prod_{t=1}^{N} \prod_{i=1}^{d}q(e_{t}^{(i)}|Y), \end{array} $$
((76))

where μ(AY) is an inverse link function of AY that is linear in either one of A,Y with the other fixed, and \(\mu(\Omega)=[\mu(\omega_{i,j})]\) for a matrix \(\Omega=[\omega_{i,j}]\).

It can be used as a general formulation for existing typical linear models, classified by whether one or more of the following three natures are possessed. Additive noise E. The BMS is called additive or non-additive based on whether or not q(E|Y)=q(E). One typical additive family is that elements of E are independent Gaussian noises, i.e.

$$\begin{array}{@{}rcl@{}} q\left(e_{t}^{(i)}|Y\right)=G\left(e_{t}^{(i)}|0, \sigma_{t}^{(i)\ 2}\right). \end{array} $$
((77))

Independent factors Y. The BMS is featured by independent factors according to whether we have

$$\begin{array}{@{}rcl@{}} q(Y|\theta)=\prod_{t=1}^{N} \prod_{j=1}^{k} q\left(y_{t}^{(j)}\right). \end{array} $$
((78))

Link function μ. The BMS is called bi-linear according to whether

$$\begin{array}{@{}rcl@{}} \mu(\xi)=\xi. \end{array} $$
((79))

The special cases of the BMS with Equations 77, 78 and 79 all satisfied include FA, BFA, NFA and others, and their corresponding implementations of BYY harmony learning have been introduced previously by Algorithms 4 to 16. The special cases with only Equations 78 and 79 satisfied were addressed in Sect. 2 of Xu (2011).

Beyond Equation 78, the generalised BMS models with Equations 77 and 79 holding were also previously addressed in Sect.II of Xu (2012b) and Sect.5 of Xu (2012a). One type is temporal learning featured by autoregression across columns of Y, e.g. by Equation 72, which has been rather extensively studied since 2000 (Xu 2000b, 2001b, 2004a). A recent summary of TFA studies can be found in Sect.5.2 of Xu (2012a).

The other type is manifold learning, featured by Y coming from the following matrix normal distribution (MND) (Dutilleul 1999; Gupta and Nagar 1999; Xu 2012b):

$$\begin{array}{@{}rcl@{}} N (U | C, \Omega, \Sigma) = \frac{e^{-0.5Tr[\Omega^{-1}(U-C)^{T} \Sigma^{-1} (U-C)]}} {(2\pi)^{0.5kN} |\Sigma|^{0.5k}|\Omega |^{0.5N}}, \end{array} $$

where a matrix Ω describes the cross-column dependence of the matrix variate U, and a matrix Σ describes the cross-row dependence of U. This matrix distribution is equivalent to a multivariate Gaussian distribution \(G(vec(U)|vec(C),\Sigma\otimes\Omega)\).

One example is \(q(Y|\theta)=N(Y|0,L^{-1},I)\) with L given by the graph Laplacian, which was firstly considered by Eq.(27) in Xu (2012b) and led to a BYY harmony based manifold learning. Such an insight may be observed from Equation 19. Maximisation of H(θ) subject to Equation 21 consists of maximising \(\pi(X,Y,\theta)=\ln q(X|Y,\theta)-0.5(kN\ln(2\pi)-\ln|L|)-0.5T_{n}\), which includes minimising

$$\begin{array}{@{}rcl@{}} T_{n}=Tr[YLY^{T}], \end{array} $$
((80))

which is a key term in the Laplacian eigenmaps for topologically preserving the neighbourhood relation in manifold learning (Belkin and Niyogi 2003). Differently, the BYY harmony learning obtains \(Y_{*}=\operatorname{argmax}_{Y}\, \pi(X_{N},Y,\theta)\) in place of learning an approximate linear mapping \(Y\approx WX\).

The other example is \(q(Y|\theta)=N(Y|0,L^{-1},\Lambda)\), firstly given by Eq.(107) in Xu (2012a), which is featured by a diagonal matrix Λ added in as free parameters to be adjusted. Accordingly, Equation 80 is modified into

$$\begin{array}{@{}rcl@{}} T_{n}=\ln{|\Lambda|}+Tr[\Lambda^{-1}YLY^{T}]. \end{array} $$
((81))

This Λ takes a role similar to the one in Algorithm 4. Actually, this situation can be regarded as an extended counterpart of FA-b, while the situation with \(T_{n}\) by Equation 80 can be regarded as an extended counterpart of FA-a. The BYY harmony learning helps to learn Λ for determining an appropriate manifold dimension k, i.e. the row dimension of Y. Following the schematic Algorithm 2, we can develop one detailed BYY harmony learning algorithm for implementing the BMS by Equation 76.
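
A small sketch of the two penalty terms, Equations 80 and 81, given a factor matrix Y and a graph Laplacian L built in any standard way (the adjacency construction below is only a toy choice):

```python
import numpy as np

def laplacian_penalty(Y, L, Lam_diag=None):
    """Eq. 80: T_n = Tr[Y L Y^T]; Eq. 81: ln|Lambda| + Tr[Lambda^{-1} Y L Y^T]
    with a diagonal Lambda whose entries, pushed towards zero, prune rows
    of Y and thereby determine the manifold dimension k.
    Y: k x N factor matrix, L: N x N graph Laplacian (a sketch)."""
    if Lam_diag is None:
        return np.trace(Y @ L @ Y.T)                           # Eq. 80
    return (np.log(Lam_diag).sum()
            + np.trace(np.diag(1.0 / Lam_diag) @ Y @ L @ Y.T))  # Eq. 81

def graph_laplacian(W):
    """Toy unnormalised Laplacian L = D - W from a symmetric 0/1 adjacency."""
    return np.diag(W.sum(axis=1)) - W
```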

Conceptually, there are also other choices beyond Equation 78, which can be very diversified. The next subsection further examines a family of choices featured with certain decoupled parts of Y.

Decoupled BMS, regulatory networks and LMM model

We may narrow our consideration of the dependence among the parts of Y to linear dependence within Y, expressed by the product of a matrix B that describes the dependence and a matrix with mutually independent elements; that is, we consider

$$\begin{array}{@{}rcl@{}} X=\mu(AYB^{T})+ E, \ \text{with Equation 78 satisfied.} \end{array} $$
((82))

This formulation extends those previous models for independent factor analyses into their counterparts in a BMS formulation.

From the perspective of maximum likelihood learning on the parametric distribution of X, the formulation by Equation 82 includes the ones by both Equations 80 and 81 as its special cases with

$$\begin{array}{@{}rcl@{}} \mu(r)=r \ \text{and} \ q(Y|\theta)=N(Y|0,\Lambda_{c}, \Lambda_{r}), \end{array} $$
((83))

where both \(\Lambda_{c}, \Lambda_{r}\) are diagonal matrices.

It follows from Theorem 2.3.10 in (Gupta and Nagar 1999) that we have \(N(Y_{B}|0,B\Lambda_{c}B^{T},\Lambda_{r})\) with \(Y_{B}=YB^{T}\). Letting \(L^{-1}=BB^{T}\) and E be given by Equation 76, we are led to Equation 80 when \(\Lambda_{r}=I\) and to Equation 81 when \(\Lambda_{r}\neq I\). Generally, we may also consider Equation 78 with elements from other Gaussian and nonGaussian distributions.

The above relations do not hold for the BYY harmony learning even when Equation 79 holds. Instead, the formulation by Equation 82 is preferable to its counterpart in Equation 76. During the implementation of the BYY harmony learning, either \(q(Y|\theta)=N(Y|0,L^{-1},I)\) or \(q(Y|\theta)=N(Y|0,L^{-1},\Lambda)\) takes a role of controlling the complexity of Y, i.e. the row dimension and also the matrix sparsity.

Usually, B is not learned but provided from or designed based on a sample set X, e.g. \(L^{-1}=BB^{T}\). Sometimes, B is learned subject to the following constraint

$$\begin{array}{@{}rcl@{}} B=B_{o}D_{B}, \ D_{B} \ \text{diagonal}, \ {B_{o}^{T}}B_{o}=I, \ \text{with elements of} \ B_{o} \ \text{being either 0 or 1}. \end{array} $$
((84))

Gene transcriptional regulatory networks (TRN) take an important role in biological networks, and modelling TRN based on gene expression data is one of the major topics in the studies of computational genomics (Bar-Joseph et al. 2012; Karlebach and Shamir 2008; Morris and Mattick 2014). In previous efforts (Tu et al. 2011, 2012a, 2012b), the BFA and NFA have been applied to model gene transcriptional regulation, which leads to improvements of network component analysis (NCA) (Liao et al. 2003). Still, Equations 82 and 83 with μ(r)=r jointly provide a new TRN model. Instead of pre-specifying the topology of A according to some prior knowledge (Liao et al. 2003), we get the topological information underlying the samples of X by the graph Laplacian L and then get B by \(L^{-1}=BB^{T}\), while A is obtained via learning with or without pre-specifying its topology. Also, we may consider a prior on A with the help of Equation 56. During the implementation of the BYY harmony learning, an appropriate number of transcription factors may be determined via learning the diagonal matrix \(\Lambda_{r}\).

We may further partition \(Y=[Y_{s},F]\) and correspondingly \(B=[B_{s},Z]\), with elements of \(Y_{s}\) being stochastic variables and elements of F being unknown constants, where F and Z could be empty when all the elements of Y are stochastic variables. By this partition, we get \(AYB^{T}={AY}_{s}{B_{s}^{T}}+AFZ^{T}\). For simplicity, we drop the subscript s and still use F to denote the unknown constant matrix product AF instead of further decomposing it into two parts. Similarly, we also partition Z and get a constant offset term C. As a result, Equation 82 is rewritten into

$$\begin{array}{@{}rcl@{}} X=\mu(AY B^{T}+FZ^{T}+C) +E, \ \text{together with Equation 78,} \end{array} $$
((85))

which returns to Equation 82 simply with F=0.

Letting \(AY=Y_{A}\), we have \(E(Y_{A}{Y_{A}^{T}})=AE(YY^{T})A^{T}\). When the columns of Y are i.i.d. from a Gaussian with a zero mean and a diagonal covariance matrix \(\Lambda_{r}\), Equation 85 becomes \(X=Y_{A}B^{T}+FZ^{T}+E\) with \(Y_{A}\) denoting random effects and F denoting fixed effects; that is, we are led to the linear mixed model (LMM) when μ(r)=r and the generalised LMM (GLMM) when μ(r)≠r.

Unknowns in LMM or GLMM may be estimated by one of the algorithms developed in the statistics literature under the principle of least squares or maximum likelihood (Demidenko 2013). Both LMM and GLMM have been applied for modelling various associations in the studies of biology and recently in the studies of computational genomics (Yang et al. 2014; Zhou and Stephens 2014; Zou et al. 2014). The BYY harmony learning provides one alternative method for estimating the unknowns in LMM or GLMM, with one advantage of determining an appropriate row dimension of Y and a sparse matrix A. Conventionally, B and Z are design matrices that are usually pre-specified based on given samples and prior knowledge. Also, either or both of B and Z may consist of partially given elements and partially unknowns to be estimated via learning. One example is shown in Figure four in Xu (2011).

Following the schematic Algorithm 2, we can further develop the detailed BYY harmony learning algorithm for learning Equation 85. Here, we consider the special case μ(r)=r to learn Ψ that consists of all the unknowns (i.e. \(A, C, \Sigma_{c}, \Sigma_{r}\) and the remaining unknowns). Noticing that \(vec(AYB^{T}+FZ^{T}+C)=vec(AYB^{T})+vec(FZ^{T})+vec(C)\), \(vec(AYB^{T})=(B\otimes A)vec(Y)=(BY^{T}\otimes I)vec(A)\) and \(vec(FZ^{T})=(Z\otimes I)vec(F)\), it follows from Equations 19 and 21 that we learn Ψ by

$$\begin{array}{@{}rcl@{}} & \max_{\Psi, Y, F} \ \pi(\Psi, Y, F), \ \text{where} & \\ & \pi(\Psi, Y, F)= \ln{[N(E|0, \Sigma_{c}, \Sigma_{r})N(Y|0, \Lambda_{c}, \Lambda_{r})N(A|0, D_{c}, D_{r}) N(F|0, K_{c}, K_{r})] }, & \\ & E=X-\mu(AY B^{T}+FZ^{T}+C), \ \Sigma_{E}= \Sigma_{c}\otimes \Sigma_{r}, & \\ & \text{subject to} & \\ & vec(Y_{*})=\operatorname{argmax}_{Y} \pi(\Psi, Y, F)= {\Gamma^{Y}_{X}}(B\otimes A)^{T} \Sigma_{E}^{-1}vec(X-FZ^{T}-C), & \\ & vec(A_{*})=\operatorname{argmax}_{A} \pi(\Psi, Y, F)= {\Gamma^{A}_{X}}(YB^{T}\otimes I)^{T} \Sigma_{E}^{-1}vec(X-FZ^{T}-C), & \\ & vec(F_{*})=\operatorname{argmax}_{F} \pi(\Psi, Y, F)= {\Gamma^{F}_{X}}(Z\otimes I)^{T} \Sigma_{E}^{-1}vec(X-AY B^{T}-C), & \\ & {\Gamma^{Y}_{X}}=\frac{\eta}{\eta +1}\Pi^{Y\, -1}_{X}, \ {\Gamma^{A}_{X}}=\frac{\eta}{\eta +1}\Pi^{A\, -1}_{X}, \ {\Gamma^{F}_{X}}=\frac{\eta}{\eta +1}\Pi^{F\, -1}_{X}, & \\ & {\Pi^{Y}_{X}}=-\frac{\partial^{2} \pi(\Psi, Y, F)}{\partial vec(Y) \partial vec(Y)^{T}}= (B\otimes A)^{T} \Sigma_{E}^{-1}(B\otimes A)+(\Lambda_{c}\otimes \Lambda_{r})^{-1}, & \\ & {\Pi^{A}_{X}}=-\frac{\partial^{2} \pi(\Psi, Y, F)}{\partial vec(A) \partial vec(A)^{T}}= (YB^{T}\otimes I)^{T} \Sigma_{E}^{-1} (YB^{T}\otimes I)+(D_{c}\otimes D_{r})^{-1}, & \\ & {\Pi^{F}_{X}}=-\frac{\partial^{2} \pi(\Psi, Y, F)}{\partial vec(F) \partial vec(F)^{T}}= (Z\otimes I)^{T} \Sigma_{E}^{-1} (Z\otimes I)+(K_{c}\otimes K_{r})^{-1}. & \end{array} $$
((86))
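
The vectorisation identities quoted above can be checked numerically, e.g. by the following sketch with illustrative sizes (column-major vec convention):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N, q = 3, 2, 5, 4
A, Y, B, F, Z = (rng.standard_normal(s) for s in
                 [(d, k), (k, N), (N, N), (d, q), (N, q)])

vec = lambda M: M.flatten(order='F')        # column-major vectorisation
# vec(A Y B^T) = (B otimes A) vec(Y)
assert np.allclose(vec(A @ Y @ B.T), np.kron(B, A) @ vec(Y))
# vec(A Y B^T) = (B Y^T otimes I) vec(A)
assert np.allclose(vec(A @ Y @ B.T), np.kron(B @ Y.T, np.eye(d)) @ vec(A))
# vec(F Z^T) = (Z otimes I) vec(F)
assert np.allclose(vec(F @ Z.T), np.kron(Z, np.eye(d)) @ vec(F))
# all three identities used to derive Equation 86 hold
```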

A preservation principle of multiple convex combinations

We observe the following estimators for the sample mean and sample covariance:

$$\mu = \frac{1}{N }\sum_{t=1}^{N} x_{t},\ \Sigma = \frac{1}{N}\sum_{t=1}^{N} (x_{t} - \mu)(x_{t} - \mu)^{T}, $$

each of which is featured by a convex combination of a number of individual statistics \(x_{t}\) or \((x_{t}-\mu)(x_{t}-\mu)^{T}\). Also, we observe the Ying step of Algorithm 3 and find that \(\mu_{\ell},\Sigma_{\ell}\) are such convex combinations too. Actually, such convex combinations can also be found in the algorithms introduced in the previous sections.

Moreover, the harmony functional H(p||q) by Equation 9 is an estimation function that comes from a convex combination of infinitely many individual estimation functions featured by the Ying machine q(X|R)q(R) at infinitely many individual values of R, weighted by the Yang machine p(R|X)p(X).

The above examples are all explicit combinations of explicit individual statistics or estimation functions. More generally, such a convex combination applies to many implicit functions. For example, we examine the following convex combination

$$\begin{array}{@{}rcl@{}} f(\mu)=\sum_{t} a_{t} f_{t}(\mu), \ f_{t}(\mu)=\Vert x_{t}-\mu\Vert^{2}, \ \sum_{t} a_{t} =1, a_{t} \ge 0, \end{array} $$
((87))

from which we observe the following natures: (a) The gradient field \(\nabla_{\mu} f(\mu)\) is a convex combination of the gradient fields \(\{\nabla _{\mu }\,f_{t}(\mu)\}_{t=1}^{N}\). (b) The root of \(\nabla_{\mu} f(\mu)=0\) is also a convex combination of the roots of \(\{\nabla _{\mu }\,f_{t}(\mu)=0\}_{t=1}^{N}\). (c) The minimiser of f(μ) is a convex combination of the minimisers of \(\{f_{t}(\mu)\}_{t=1}^{N}\) too.

These natures are closely related to the first-order derivative or the gradient field of estimation functions. Nature (a) describes a global feature of the gradient fields of estimation functions, and nature (b) describes features within some important local areas (e.g. around the sinks) of these gradient fields, while nature (c) is equivalent to nature (b) if \(\{f_{t}(\mu)\}_{t=1}^{N}\) have gradient fields. Generally, nature (c) may even apply to those individual estimation functions that do not have gradient fields.

In Equation 87, a convex combination of individual convex functions implies or induces all the above three natures. Given a convex combination of individual convex functions, if it also preserves at least one of the three natures above, we say that it preserves a nature of multiple convex combination (MCC). The classic maximum likelihood learning preserves such an MCC nature too, because both \((1/N)\sum _{t=1}^{N}\ln {q(x_{t}|\theta)}\) and \((1/N)\sum _{t=1}^{N}\nabla _{\theta }\ln {q(x_{t}|\theta)}\) are convex combinations.
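
A tiny numeric illustration of natures (a) and (b) for Equation 87: the root of the combined gradient is exactly the convex combination of the individual roots \(x_{t}\) (the weights below are illustrative):

```python
import numpy as np

# Equation 87 with f_t(mu) = ||x_t - mu||^2: each individual root is x_t,
# and the root of grad f(mu) = 0 is the convex combination sum_t a_t x_t.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 2))            # four sample points in R^2
a = np.array([0.1, 0.2, 0.3, 0.4])         # convex weights, summing to 1

root_combined = a @ x                      # argmin of sum_t a_t ||x_t - mu||^2
# verify: the combined gradient sum_t a_t * 2*(mu - x_t) vanishes there
grad = 2 * ((root_combined - x) * a[:, None]).sum(axis=0)
assert np.allclose(grad, 0.0)
```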

Such a nature is not implied everywhere. One example is the BYY harmony learning subject to Equation 12 as follows

$$\begin{array}{@{}rcl@{}} & H(\theta)= \int p(Y| \theta, X_{N})\pi(X_{N},Y, \theta)dY,\ s.t. p(Y| \theta, X_{N}) =q(Y| \theta, X_{N}), & \end{array} $$

which is a special case of Equation 9 and thus is still a convex combination of infinitely many individual estimators \(\pi(X_{N},Y,\theta)\) at infinitely many individual values of Y, weighted by the Yang machine \(p(Y|\theta,X_{N})\). But considering the gradient field directly may not preserve the MCC nature.

As shown in Eq.(25) of Xu (2010a), we get such a gradient field as follows:

$$\begin{array}{@{}rcl@{}} &\nabla_{\varphi} H(\theta)= \int p_{\delta}(Y| \theta, X_{N}) \nabla_{\varphi} \pi(X_{N},Y, \theta) dY,& \\ &p_{\delta}(Y| \theta, X_{N})= p(Y| \theta, X_{N})[1+\Delta \pi(X_{N},Y, \theta)], \\ &\Delta \pi(X_{N},Y, \theta)=\pi(X_{N},Y, \theta)- \int p(Y| \theta, X_{N})\pi(X_{N},Y, \theta)dY, & \end{array} $$
((88))

based on which we may develop a gradient based local search algorithm.

However, it suffers from the problem of pre-specifying an appropriate learning stepsize. One alternative considers combining the roots of \(\nabla_{\varphi}\pi(X_{N},Y,\theta)=0\) at individual values of Y to approximate the root of \(\nabla_{\varphi} H(\theta)=0\). One example is given by Eq. (11) in Xu (2010a) for learning Gaussian mixture, that is, letting \(p^{new}_{\ell | x_{t}}\) in Algorithm 3 be replaced by

$$\begin{array}{@{}rcl@{}} p^{new}_{\ell| x_{t}} =p_{\ell t}(\theta^{new})[1+\delta_{t}^{(i)}(\theta^{new})]. \end{array} $$
((89))

Similarly, another example can be found in Algorithm 2 and Eq. (10a) in Xu (2009) for learning radial basis functions (RBF) and extensions.

Still, this type of implementation may cause learning instability because the resulting \(p^{new}_{\ell | x_{t}}\) may break the constraint \(0 \le p^{new}_{\ell | x_{t}}\le 1\).

The above observation motivates another preservation principle of multiple convex combinations. We consider an estimator via making \(\max _{\theta } f(\theta), \ f(\theta)=\sum _{t} a_{t} f_{t}(\theta), \ s.t. \ \sum _{t} a_{t} =1, a_{t} \ge 0\), where each individual \(f_{t}(\theta)\) possesses one or more of the natures \(\xi^{(j)}(f_{t}), j=1,\ldots,c\), with \(c\ge 1\). The problem can be further modified into the following one:

$$\begin{array}{@{}rcl@{}} & \max_{\theta} f(\theta), \ f(\theta)=\sum_{t} a_{t}\, f_{t}(\theta), &\\ & \text{subject to not only} \ \sum_{t} a_{t} =1, \ a_{t} \ge 0, \ \text{but also that} &\\ & \text{each corresponding nature} \ \xi^{(j)}(f) \ \text{is a convex combination} \ \sum_{t} b^{(j)}_{t} \xi^{(j)}(f_{t}), & \end{array} $$
((90))

where the weights \( \sum _{t} b^{(j)}_{t} =1, b^{(j)}_{t} \ge 0\) may differ for different j and need not be the same as the weights \(\sum _{t} a_{t} =1, a_{t} \ge 0\).

As an example, we modify Equation 9 to explicitly satisfy the principle of preserving one MCC nature as follows:

$$\begin{array}{@{}rcl@{}} & \max_{\theta} H(\theta), \ H(\theta)= \int p(Y| \theta, X_{N})\pi(X_{N},Y, \theta)dY,& \\ & \text{subject to} \ \ p(Y| \theta, X_{N})=q(Y| \theta, X_{N}),\ \nabla_{\varphi} H(\theta)=\int p_{Y} \nabla_{\varphi} \pi(X_{N},Y, \theta)dY,& \\ & p_{Y}\in {\cal C}_{p}=\{ p_{Y}: 0 \le p_{Y}\le 1, \int p_{Y}dY=1\}, & \end{array} $$
((91))

where \(\varphi\subseteq\theta\) is a subset of parameters to be estimated in our consideration; it can be the entire set θ or a part of θ. Under this setting, we get the root of \(\nabla_{\varphi} H(\theta)=0\) by a convex combination of the roots of \(\nabla_{\varphi}\pi(X_{N},Y,\theta)=0\).

Actually, Algorithm 1, Algorithm 3, Algorithm 5, Algorithm 6 and Algorithm 8, as well as their corresponding EM algorithms, are all examples that proceed along this direction. The Yang step or the E step actually gets such a \(p_{Y}\in{\cal C}_{p}\), while the Ying step or the M step estimates the root of \(\nabla_{\varphi} H(\theta)=0\) by a convex combination of the roots of \(\nabla_{\varphi}\pi(X_{N},Y,\theta)=0\).

Comparing φ H(θ) in Equation 88 and φ H(θ) in Equation 91, we get an alternative implementation that consists of two steps as follows: (1) Get δ that consists of p δ (Y|θ,X N ) by Equation 88 at all the possible values of Y. (2) Project the set δ to the convex set p under a nearest principle.

There are two key issues to be handled as follows:

  • One is in what sense to be the nearest: in a squared or \(L_{1}\) distance? (A standard squared-distance choice is sketched below.)

  • The other is an effective algorithm to find such a projection.

Another important issue is a theoretical guarantee on whether H(θ) keeps increasing or at least nondecreasing such that learning convergence is guaranteed.
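
For the squared-distance choice raised in the first issue, the Euclidean projection of a finite-dimensional weight vector onto the probability simplex has a standard sort-based solution, sketched below; this is one common option under that assumption, not necessarily the projection intended here:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean (L2) projection of v onto {p : p >= 0, sum(p) = 1},
    by the standard O(n log n) sort-based algorithm."""
    u = np.sort(v)[::-1]                         # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

# e.g. repairing weights that break 0 <= p <= 1 after a step like Eq. 89:
p_delta = np.array([0.7, 0.5, -0.2])
p = project_to_simplex(p_delta)                  # -> [0.6, 0.4, 0.0]
```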

Results and discussion

The results of BYY harmony learning implementations are summarised in Tables 3 and 4 for those made before 2010 and in Table 1 for those made after 2010. Most of the fundamentals and major implementing techniques of the BYY harmony learning were developed in the period of 1995 to 2001, for which we provide a chronological outline in Table 3, featured by the time points at which the major innovative studies started. In particular, the threads of 1995(c), 1997(a), 1999(a) and 2000(a) reach the present formulation H(p||q) in Equation 9, though these were considered merely on R={Y} in \( \int [\cdot ] d{R}\). Subsequent developments in the next decade are then further outlined in Table 4. Also, further details are referred to the following recent overviews:

  • Theoretical aspects and relations to other methods: see Sect.4.1, Appendix A and B in Xu (2010a), and Sects.4.1 and 4.2 in Xu (2012a).

    Table 3 A foundation period of BYY studies (1995 to 2001)
    Table 4 Further advances of BYY studies (2002 to 2013)
  • Algorithms and applications: see the roadmaps in Figure three and Figure eleven of Xu (2010a), also in Figure one of Xu (2011) and Sect.5 of Xu (2012a), plus recent applications in (Pang et al. 2013; Shi et al. 2011a, 2011b, 2011c, 2014; Tu and Xu 2011a; Tu et al. 2011, 2012a, 2012b; Tu and Xu 2014; Wang et al. 2011).

  • Outlines on major topics in Xu (2012a): see Sect.7 for 3 topics on statistical learning in general, 8 topics on BYY system, 13 topics on best harmony learning and 4 topics on implementation, as well as 15 topics on exemplar learning tasks and algorithms. Readers are also referred to Sect.3.2 and Sect.3.4 on topics and demanding issues about BYY system design, and to Sect.4.2.3 on novelty and features of the best harmony theory.

Before closing this paper, we continue the previous discussion made on Figure 4. As illustrated in Figure 5 and also referred to Appendix B(2) of Xu (2010a), learning is featured by a dynamic process of implementing a learning theory to learn from what the system observes and to adapt to its environment, which may also be understood from the famous ancient Chinese TCM WuXing theory. A learning process is featured by repeated circling of five actions or states. For each circling, the first action A-1 gathers samples and information as the system's input; A-2 transfers the input into inner candidate assumptions or suggestions; A-3 integrates or regulates evidence about candidates that comes from A-2; A-4 selects good candidates or trims off bad ones; and A-5 interprets or manages the environment.

Figure 5

Negative feedback stabilises dynamics.

The harmonising dynamics discussed previously in Figure 3 and the corresponding subsection may also be observed from this perspective. At the centre of Figure 5, the bottom of the Yin-Yang logo has a black centre, which is usually called a fish eye. This indicates the output of A-5, while its surrounding white ring indicates the Yang domain. The starting part of the Yang arrow indicates A-1 for picking samples in the Yang domain to get \({p_{h}^{N}}(X)\), and the arrow ends at the white fish eye on the top, implementing A-2 by p(Y,θ|X). On the other hand, the surrounding black ring of the white fish eye indicates the Ying domain that collects all the candidates as well as the associated evidence. The starting part of the Ying arrow indicates A-4 for choosing good candidates probabilistically via q(X|Y,θ), and the arrow ends at the bottom black fish eye, completing one circling.

As addressed by Equation 27 and the discussions thereafter, the signal η is measured at the two fish eyes and also modulated by the inner attention of the system. A small η reflects either a bad Ying-Yang mutual agreement (a big mismatch to the desire) at the top fish eye or a bad fitting at the bottom fish eye.

A poor performance is incurred from a poor selection of Y at A-4, resulting in a small value of η that is fed back to A-2 to harmonise the attempts of updating θ. By such a negative feedback mechanism, the dynamics of information harmonising is stabilised. Interestingly, such a mechanism is executed in a pattern ‘A-2/Huo modulates A-4/Jin’, which complies with the classic ‘XiangKe’ principle of the Chinese TCM WuXing theory. In other words, the ‘XiangKe’ principle can be regarded as an ancient negative feedback principle.

Conclusions

Based on Lagrange variety preservation of Yang structure, this paper proposes a generic framework of dynamic BYY harmony learning, which not only unifies attention, detection, problem-solving, adaptation, learning and model selection from an information harmonising perspective but also provides a new type of Ying-Yang alternative nonlocal search to overcome a dilemma of suboptimal solution versus learning instability typically suffered by the existing Ying-Yang alternative nonlocal search. Algorithms are developed for learning Gaussian mixture, factor analysis (FA), mixture of local FA, binary FA, nonGaussian FA, de-noised Gaussian mixture, sparse multivariate regression, temporal FA and temporal binary FA, as well as a generalised bilinear matrix system that covers not only these linear models but also manifold learning, gene regulatory networks and the generalised linear mixed model. These algorithms are featured with not only a favourable nature of automatic model selection but also a unified formulation in performing unsupervised learning and semi-supervised learning. Moreover, a principle of preserving multiple convex combinations is also proposed to improve the BYY harmony learning, which leads to another type of Ying-Yang alternative nonlocal search.

References

  • Akaike, H (1974) A new look at the statistical model identification. Automatic Control IEEE Trans 19(6): 716–723.

  • Akaike H (1987) Factor analysis and aic. Psychometrika 52(3): 317–332.

  • Barron, A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. Inf Theory IEEE Trans 44(6): 2743–2760.

  • Bartels, RH, Stewart G (1972) Solution of the matrix equation ax+ xb= c. Commun ACM 15(9): 820–826.

  • Bar-Joseph, Z, Gitter A, Simon I (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nature Rev Genet 13(8): 552–564.

  • Belkin, M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15(6): 1373–1396.

  • Chen, G, Heng P-A, Xu L (2014) Projection-embedded byy learning algorithm for gaussian mixture-based clustering. Appl Inf 1(2): 1–20.

  • Corduneanu A, Bishop CM (2001) Variational bayesian model selection for mixture distributions In: Artificial Intelligence and Statistics, 27–34.. Morgan Kaufmann Waltham, MA.

  • Dayan, P, Hinton GE, Neal RM, Zemel RS (1995) The helmholtz machine. Neural Comput 7(5): 889–904.

  • Dempster, AP, Laird NM, Rubin DB, et al. (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc 39(1): 1–38.

  • Demidenko, E (2013) Mixed Models: Theory and Applications with R. John Wiley & Sons, Hoboken, New Jersey.

  • Diaconis, P, Ylvisaker D, et al. (1979) Conjugate priors for exponential families. Ann Stat 7(2): 269–281.

  • Dutilleul, P (1999) The mle algorithm for the matrix normal distribution. J Stat Comput Simul 64(2): 105–123.

  • Fang, S-C, Rajasekera JR, Tsao H-SJ (1997) Entropy Optimization and Mathematical Programming, Vol. 8. Springer, New York.

  • Figueiredo, MAF, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24: 381–396.

  • Floudas, CA, Visweswaran V (1995) Quadratic optimization In: Handbook of Global Optimization, 217–269.. Springer, New York.

  • Gupta, AK, Nagar DK (1999) Matrix Variate Distributions, Vol. 104. CRC Press, Chapman & Hall, Boca Raton, Florida.

  • Hoerl, RW (1985) Ridge analysis 25 years later. Am Stat 39(3): 186–192.

  • Jeffreys, H (1946) An invariant form for the prior probability in estimation problems. Proc R Soc Lond. Series A. Math Phys Sci 186(1007): 453–461.

  • Jordan, MI, Ghahramani Z, Jaakkola TS, Saul LK (1999) An introduction to variational methods for graphical models. Mach Learn 37(2): 183–233.

  • Karlebach, G, Shamir R (2008) Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol 9(10): 770–780.

  • Liao, JC, Boscolo R, Yang Y-L, Tran LM, Sabatti C, Roychowdhury VP (2003) Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci 100(26): 15522–15527.

  • McGrory, CA, Titterington DM (2007) Variational approximations in bayesian model selection for finite mixture distributions. Comput Stat Data Anal 51: 5352–5367.

  • Miyajima, S (2013) Fast enclosure for solutions of sylvester equations. Linear Algebra Appl 439(4): 856–878.

  • Morris, KV, Mattick JS (2014) The rise of regulatory rna. Nature Rev Genet 15(6): 423–437.

  • Ntzoufras, I, Tarantola C (2013) Conjugate and conditional conjugate bayesian analysis of discrete graphical models of marginal independence. Comput Stat Data Anal 66: 161–177.

  • Pang, Z, Tu S, Wu X, Xu L (2013) Discriminative gmm-hmm acoustic model selection using two-level bayesian ying yang harmony learning In: Intelligent Science and Intelligent Data Engineering, 719–726.. Springer, Berlin Heidelberg.

  • Redner, RA, Walker HF (1984) Mixture densities, maximum likelihood and the em algorithm. SIAM Rev 26(2): 195–239.

  • Rissanen, J (1978) Modeling by shortest data description. Automatica 14(5): 465–471.

  • Rubin, DB, Thayer DT (1982) Em algorithms for ml factor analysis. Psychometrika 47(1): 69–76.

  • Schwarz, G (1978) Estimating the dimension of a model. Ann Stat 6(2): 461–464.

  • Shi, L, Tu S, Xu L (2011a) Learning gaussian mixture with automatic model selection: A comparative study on three bayesian related approaches. Front Electrical Electronic Eng China 6(2): 215–244.

  • Shi, L, Tu SK, Xu L (2011b) Learning gaussian mixture with automatic model selection: a comparative study on three bayesian related approaches. Front Electr Electron Eng China 6: 215–244. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (B).

  • Shi, L, Wang P, Liu H, Xu L, Bao Z (2011c) Radar hrrp statistical recognition with local factor analysis by automatic bayesian ying-yang harmony learning. Signal Process IEEE Trans 59(2): 610–617.

  • Shi, L, Liu Z-Y, Tu S, Xu L (2014) Learning local factor analysis versus mixture of factor analyzers with automatic model selection. Neurocomputing 139: 3–14.

  • Tibshirani, R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58: 267–288.

  • Tikhonov, A, Goncharsky A, Stepanov V, Yagola A (1995) Numerical methods for the solution of ill-posed problems. Kluwer Academic, Netherlands.

  • Tipping, ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc: Series B (Statistical Methodology) 61(3): 611–622.

  • Tu, SK, Xu L (2011a) Parameterizations make different model selections : empirical findings from factor analysis. Front Electr Electron Eng China 6: 256–274. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (B).

  • Tu, S, Xu L (2011b) An investigation of several typical model selection criteria for detecting the number of signals. Front Electr Electron Eng China 6(2): 245–255.

  • Tu, SK, Chen RS, Xu L (2011) A binary matrix factorization algorithm for protein complex prediction. Proteome Sci 9(Suppl 1): 18.

  • Tu, S, Chen R, Xu L (2012a) Transcription network analysis by a sparse binary factor analysis algorithm. J Integrative Bioinformatics 9(2): 198.

  • Tu, S, Luo D, Chen R, Xu L (2012b) A non-gaussian factor analysis approach to transcription network component analysis In: Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2012 IEEE Symposium On, 404–411.. IEEE.

  • Tu, S, Xu L (2014) Learning binary factor analysis with automatic model selection. Neurocomputing 134: 149–158.

  • Wallace, CS, Dowe DL (1999) Minimum message length and kolmogorov complexity. Comput J 42(4): 270–283.

  • Wang, P, Shi L, Du L, Liu H, Xu L, Bao Z (2011) Radar hrrp statistical recognition with temporal factor analysis by automatic bayesian ying-yang harmony learning. Front Electr Electron Eng China 6(2): 300–317.

  • Xu, L, Krzyzak A, Oja E (1992) Unsupervised and supervised classifications by rival penalized competitive learning In: Pattern Recognit, 1992. Vol. II. Conference B: Pattern Recognition Methodology and Systems, Proceedings., 11th IAPR International Conference On, 496–499.. IEEE, New Jersey.

  • Xu, L, Krzyzak A, Oja E (1993) Rival penalized competitive learning for clustering analysis, rbf net, and curve detection. Neural Netw IEEE Trans 4(4): 636–649.

  • Xu, L (1995) Bayesian-kullback coupled ying-yang machines: Unified learnings and new results on vector quantization In: Proc. Int. Conf. Neural Information Process (ICONIP ‘95), 977–988.. Publishing House of Electronics Industry, Beijing.

  • Xu L (1996) How many clusters?: A ying-yang machine based theory for a classical open problem in pattern recognition In: Neural Netw, 1996., IEEE International Conference On, 1546–1551.. IEEE, New Jersey.

  • Xu, L, Jordan MI (1996) On convergence properties of the em algorithm for gaussian mixtures. Neural Comput 8(1): 129–151.

  • Xu, L (1997a) Bayesian ying–yang machine, clustering and number of clusters. Pattern Recognit Lett 18(11): 1167–1178.

  • Xu L (1997b) Bayesian ying yang system and theory as a unified statistical learning approach:(i) unsupervised and semi-unsupervised learning In: Brain-like Computing and Intelligent Information Systems, 241–274.. Springer-Verlag, Berlin Heidelberg.

  • Xu, L (1997c) Bayesian ying yang system and theory as a unified statistical learning approach (ii): from unsupervised learning to supervised learning and temporal modeling In: Proceedings of Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, 25–42.. Springer, Berlin Heidelberg.

  • Xu L (1998a) Rbf nets, mixture experts, and bayesian ying–yang learning. Neurocomputing 19(1-3): 223–257.

  • Xu, L (1998b) Bayesian kullback ying–yang dependence reduction theory. Neurocomputing 22(1): 81–111.

  • Xu L (1998c) Bayesian ying-yang dimension reduction and determination. J Comput Intell Finance 6(5): 11–16.

  • Xu, L (1998d) Bkyy dimension reduction and determination In: Neural Netw Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference On, 1822–1827.. IEEE, New Jersey.

  • Xu L (1999a) Temporal byy learning and its applications to extended kalman filtering, hidden markov model, and sensor-motor integration In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 949–954.. IEEE, New Jersey.

  • Xu, L (1999b) Bayesian ying yang theory for empirical learning, regularisation and model selection: general formulation In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 552–557.. IEEE, New Jersey.

  • Xu L (1999c) Bayesian ying yang supervised learning, modular models, and three layer nets In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 540–545.. IEEE, New Jersey.

  • Xu, L (1999d) Byy data smoothing based learning on a small size of samples In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 546–551.. IEEE, New Jersey.

  • Xu L (1999e) Byy ying yang unsupervised and supervised learning: theory and applications In: Neural Netw and Signal Processing, Proceedings of 1999 Chinese Conference On, 112–29.. Publishing house of Electronic industry, Beijing.

  • Xu, L (2000a) Byy prod-sum factor systems and harmony learning. invited talk In: Proceedings of International Conference on Neural Information Processing (ICONIP’2000), 548–558, KAIST, Taejon.

  • Xu L (2000b) Temporal byy learning for state space approach, hidden markov model, and blind source separation. Signal Process IEEE Trans 48(7): 2132–2144.

  • Xu, L (2000c) Byy learning system and theory for parameter estimation, data smoothing based regularisation and model selection. Neural Parallel Sci Comput 8(1): 55–83.

  • Xu L (2000d) Best harmony learning In: Intelligent Data Engineering and Automated Learning (IDEAL 2000). Data Mining, Financial Engineering, and Intelligent Agents, 116–125.. Springer, Berlin Heidelberg.

  • Xu, L (2001a) Best harmony, unified rpcl and automated model selection for unsupervised and supervised learning on gaussian mixtures, three-layer nets and me-rbf-svm models. Int J Neural Syst 11(01): 43–69.

  • Xu L (2001b) Byy harmony learning, independent state space, and generalised apt financial analyses. Neural Netw IEEE Trans 12(4): 822–849.

  • Xu L (2001c) BYY harmony learning, model selection, and information approach: further results In: Neural Information Processing (ICONIP’2001), 2001. Proceedings International Joint Conference On, 30–37. APPNA, Shanghai.

  • Xu L (2001d) BYY harmony learning, local independent analyses, and APT financial applications In: Neural Networks, 2001. Proceedings. IJCNN’01. International Joint Conference On, 1817–1822. IEEE, New Jersey.

  • Xu L (2001e) An overview on unsupervised learning from data mining perspective In: Advances in Self-Organising Maps, 181–209. Springer, Berlin Heidelberg.

  • Xu L (2002) BYY harmony neural networks, structural RPCL, and topological self-organizing on mixture models. Neural Netw 15: 1125–1151.

  • Xu L (2003a) Independent component analysis and extensions with noise and time: a Bayesian Ying-Yang learning perspective. Neural Inf Process Lett Rev 1: 1–52.

  • Xu L (2003b) Data smoothing regularization, multi-sets-learning, and problem solving strategies. Neural Netw 16: 817–825.

  • Xu L (2004a) Temporal BYY encoding, Markovian state spaces, and space dimension determination. IEEE Trans Neural Netw 15(5): 1276–1295.

  • Xu L (2004b) Advances on BYY harmony learning: information theoretic perspective, generalized projection geometry, and independent factor autodetermination. IEEE Trans Neural Netw 15(4): 885–902.

  • Xu L (2004c) Bi-directional BYY learning for mining structures with projected polyhedra and topological map In: Proceedings of IEEE ICDM 2004 Workshop on Foundations of Data Mining, 2–14. ICDM, Brighton.

  • Xu L (2007a) A unified perspective and new results on RHT computing, mixture based learning, and multi-learner based problem solving. Pattern Recognit 40: 2129–2153.

  • Xu L (2007b) A trend on regularization and model selection in statistical learning: a Bayesian Ying-Yang learning perspective In: Challenges for Computational Intelligence, 365–406. Springer, Berlin Heidelberg.

  • Xu L (2008) Bayesian Ying-Yang system, best harmony learning, and Gaussian manifold based family In: Computational Intelligence: Research Frontiers, 48–78. Springer, Berlin Heidelberg.

  • Xu L (2009) Learning algorithms for RBF functions and subspace based functions In: Olivas ES, et al. (eds) Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques, 60–94. IGI Global, Hershey, PA.

  • Xu L (2010a) Bayesian Ying-Yang system, best harmony learning, and five action circling. Front Electr Electron Eng China 5: 281–328. A special issue on Emerging Themes on Information Theory and Bayesian Approach.

  • Xu L (2010b) Machine learning problems from optimization perspective. J Global Optimization 47(3): 369–401.

  • Xu L (2011) Codimensional matrix pairing perspective of BYY harmony learning: hierarchy of bilinear systems, joint decomposition of data-covariance, and applications of network biology. Front Electr Electron Eng China 6: 86–119. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (A).

  • Xu L (2012a) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. Front Electr Electron Eng China 7: 147–196.

  • Xu L (2012b) Semi-blind bilinear matrix system, BYY harmony learning, and gene analysis applications In: Proceedings of the 6th International Conference on New Trends in Information Science, Service Science and Data Mining, 661–666. AICIT, Taipei.

  • Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL (2014) Advantages and pitfalls in the application of mixed-model association methods. Nat Genet 46(2): 100–106.

  • Zhou X, Stephens M (2014) Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods 11(4): 407–409.

  • Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J (2014) Epigenome-wide association studies without the need for cell-type composition. Nat Methods 11(3): 309–311.

Acknowledgements

This work was supported by a CUHK Direct Grant (project 4055025) and by a start-up fund for the Zhi-Yuan Chair Professorship at Shanghai Jiao Tong University.

Author information

Corresponding author

Correspondence to Lei Xu.

Additional information

Competing interests

The author declares no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Xu, L. Further advances on Bayesian Ying-Yang harmony learning. Appl Inform 2, 5 (2015). https://doi.org/10.1186/s40535-015-0008-4

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40535-015-0008-4

Keywords