Further advances on Bayesian Ying-Yang harmony learning
Lei Xu^{1, 2}
Received: 18 July 2014
Accepted: 21 April 2015
Published: 13 June 2015
Abstract
After a short tutorial on the fundamentals of Bayes approaches and Bayesian Ying-Yang (BYY) harmony learning, this paper introduces recent progress. A generic information harmonising dynamics of BYY harmony learning is proposed with the help of a Lagrange variety preservation principle, which provides Lagrange-like implementations of Ying-Yang alternative nonlocal search for various learning tasks and unifies attention, detection, problem-solving, adaptation, learning and model selection from an information harmonising perspective. In this framework, new algorithms are developed to implement Ying-Yang alternative nonlocal search for learning Gaussian mixture and several typical exemplars of the linear matrix system, including factor analysis (FA), mixture of local FA, binary FA, non-Gaussian FA, denoised Gaussian mixture, sparse multivariate regression, temporal FA and temporal binary FA, as well as a generalised bilinear matrix system that covers not only these linear models but also manifold learning, gene regulatory networks and the generalised linear mixed model. These algorithms feature automatic model selection and a unified formulation for unsupervised and semi-supervised learning. We also propose a principle of preserving multiple convex combinations, which leads to alternative search algorithms. Finally, we provide a chronological outline of the history of BYY learning studies.
Keywords
Automatic model selection; Lagrange; Variety preservation; Ying-Yang alternation; Denoised Gaussian mixture; Factor analysis; Local factors; Binary factors; Non-Gaussian factors; Temporal factors; Multivariate regression; Bilinear matrix system; Linear mixed model
Background
Bayes approach and automatic model selection
Learning tasks associated with the front level can be viewed as learning a mapping x→y, called a representative model, by which an observed sample x in a visible domain X is mapped into its corresponding encoding y as a signal or inner code to perform a problem-solving task, such as abstraction, classification, inference and control. Existing learning methods for a representative model can be roughly divided into two groups: (1) The first group learns a mapping x→y according to whether a principle is satisfied by the resulting inner encodings y, without explicitly taking the opposite mapping y→x into consideration. One exemplar family uses a linear mapping y=Wx that transforms x into a y of independent components, such as principal component analysis (PCA) and independent component analysis (ICA) (Xu 2003a). The other widely studied family is supervised learning by a linear or nonlinear mapping that makes samples of y approach the desired target samples. (2) The second group learns a mapping x→y as an inverse of a given mapping y→x that describes how observed samples are generated. Some efforts aim to make the cascade of x→y and y→x implement a unitary transform x→x, as often encountered in adaptive control. Most studies consider y→x in a probabilistic sense by q(x|y) together with y described by q(y). Accordingly, x→y is either directly the Bayesian inverse of q(x|y)q(y) or a certain approximation of it.
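As an illustration of the first group, the linear mapping y=Wx can be sketched with PCA. This is a minimal sketch, not the paper's algorithm: the function name `pca_mapping`, the centring of x before projection and the synthetic data are all assumptions made for the example. Note that PCA only decorrelates the components of y; independence in the ICA sense requires further conditions.

```python
import numpy as np

def pca_mapping(X, m):
    """Return (W, mean) so that y = W @ (x - mean) has m decorrelated components.

    Hypothetical helper for illustration; rows of W are principal directions.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]        # indices of the m largest
    W = eigvecs[:, order].T
    return W, mean

# Synthetic correlated data (assumed, for demonstration only)
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

W, mean = pca_mapping(X, m=2)
Y = (X - mean) @ W.T                             # inner codes y_t = W (x_t - mean)
cov_Y = np.cov(Y, rowvar=False)                  # diagonal up to rounding error
```

The off-diagonal entries of `cov_Y` vanish because W diagonalises the sample covariance of x.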
Typically, the mapping y→x in the front level is unknown and should be learned from a given set \(X_{N}=\{x_{t}\}_{t=1}^{N}\) of samples, which is also called generative learning. Usually, the corresponding distribution structure (also called a generative model) is designed according to the type of application. One widely studied structure is the linear system shown in Figure 1. As will be further addressed in the subsequent sections, this structure not only covers subspace methods, Gaussian mixture, factor analysis and its extensions to binary or non-Gaussian factors but can also be further generalised to many others.
This EM iteration is guaranteed to converge to a local maximum of L(θ) without requiring any learning step size, whereas a gradient-based algorithm needs an appropriate learning step size: learning becomes unstable if the step size is too big and converges very slowly if it is too small. Moreover, the EM algorithm keeps the constraints of Gaussian mixture satisfied and demonstrates a superlinear convergence rate; for further details, see Xu and Jordan (1996).
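The properties stated above (no step size, constraints kept, a nondecreasing likelihood) can be observed in a small sketch of EM for a one-dimensional Gaussian mixture. This is the textbook EM update under assumed names (`em_gmm_1d`, `trace`) and synthetic data, not a reproduction of the equations omitted from this section.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=50, seed=0):
    """Textbook EM for a 1-D Gaussian mixture (illustrative sketch).

    Returns mixing weights, means, variances and the log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(k, 1.0 / k)                  # mixing weights, sum to 1
    mu = rng.choice(x, size=k, replace=False)    # initial means drawn from data
    var = np.full(k, x.var())                    # broad initial variances
    trace = []
    for _ in range(n_iter):
        # E-step: log of alpha_l * G(x_t | mu_l, var_l), shape (N, k)
        log_joint = (np.log(alpha) - 0.5 * np.log(2.0 * np.pi * var)
                     - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_mix = np.logaddexp.reduce(log_joint, axis=1)   # ln q(x_t | theta)
        trace.append(log_mix.sum())                        # L(theta) before update
        post = np.exp(log_joint - log_mix[:, None])        # p(l | x_t, theta)
        # M-step: closed-form updates, no learning step size involved
        n_l = post.sum(axis=0)
        alpha = n_l / x.size
        mu = post.T @ x / n_l
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / n_l
    return alpha, mu, var, np.array(trace)

# Two synthetic clusters (assumed data for demonstration)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
alpha, mu, var, trace = em_gmm_1d(x, k=2)
```

The mixing weights remain a valid probability vector after every update without any projection, and `trace` is nondecreasing up to rounding error.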
How to use a prior \(q(\theta|k)\) is a topic with a long history that has been considered from several aspects. The classic Bayes school uses different parametric distributions on different parts of θ according to the nature of the learning task and empirical experience. Typical examples are conjugate priors (Diaconis and Ylvisaker 1979; Ntzoufras and Tarantola 2013). Extensive studies along this line have been made in the machine learning literature, especially on Dirichlet-multinomial for Gaussian mixture. Related studies also include those on multivariate linear regression and its extensions. When a Gaussian prior is used on each regression coefficient, learning by Equation 4 implements ridge regression (Hoerl 1985) and Tikhonov regularisation (Tikhonov et al. 1995). When a Laplace prior is used on each regression coefficient, learning by Equation 4 implements LASSO regression (Tibshirani 1996), also called sparse learning.
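The link between a Gaussian prior and ridge regression can be made concrete with a short sketch. It assumes the standard setting of a linear model with Gaussian noise of variance σ² and an independent zero-mean Gaussian prior of variance τ² on each coefficient, for which the maximum a posteriori estimate has the closed form below with λ=σ²/τ². The function name `ridge_map` and the data are hypothetical.

```python
import numpy as np

def ridge_map(X, y, lam):
    """MAP estimate of regression coefficients under a Gaussian prior.

    Solves (X^T X + lam * I) w = X^T y, where lam = sigma^2 / tau^2 under the
    assumptions stated in the lead-in; lam = 0 recovers least squares.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic regression data (assumed for demonstration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_ml = ridge_map(X, y, lam=0.0)     # maximum likelihood (no prior)
w_map = ridge_map(X, y, lam=10.0)   # shrunk towards zero by the prior
```

A Laplace prior has no such closed form; the corresponding LASSO estimate is usually obtained iteratively, which is not shown here.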
Another Bayes school prefers to use a noninformative prior. For a parameter that varies on a compact support, such a prior is simply a uniform distribution. However, no such uniform distribution exists on an infinitely large support. Typically, a noninformative improper distribution \(q(\theta|k)\) is used under the name of the Jeffreys prior (Jeffreys 1946), which has also been widely used in the machine learning literature. There are also some efforts that attempt to blend the two schools, e.g. the Jeffreys prior is used jointly with a proper prior by the minimum message length (MML) method (Figueiredo and Jain 2002; Wallace and Dowe 1999). Moreover, there is an effort called induced bias cancellation (IBC), in which the prior is used to cancel an implicit prior induced by using a learning model on a finite number of samples, e.g. see Eqs. (20) and (21) in Xu (2000a) and also Sect. 3.4.3 in Xu (2007a). Interestingly, as addressed on page 304 of Xu (2010a), this IBC may be regarded as a degenerate but easily computed approximation of the normalised maximum likelihood (NML) that is obtained from a minimax principle (Barron et al. 1998), which plays a key role in recent developments of MDL encoding.
by which learning is made via a two-stage implementation. The first stage enumerates all possible values of k to obtain a set of candidate models featured by different values of k, and estimates \(\theta^{*}\) by Equation 1 for each candidate. At the second stage, we select the best candidate by Equation 5 with \(L(X_{N},k)\) given by Equation 6. In implementation, the minimum description length (MDL) (Rissanen 1978) is actually equivalent to this BIC. There are also a number of other variants of \(L(X_{N},k)\) available in the literature, e.g. another classic one is Akaike's information criterion (AIC) (Akaike 1974, 1987).
However, a two-stage implementation incurs a huge computational cost because it requires parameter learning for each candidate. Also, estimating \(\theta^{*}\) by Equation 1 becomes less reliable when the component number k is large and thus brings more free parameters.
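The second stage of the two-stage procedure can be sketched as follows. Equation 6 is not reproduced in this section, so the sketch assumes the common BIC form, the maximised log-likelihood minus half the number of free parameters times ln N, written so that larger is better. The log-likelihood values below are hypothetical numbers standing in for the outcome of the first stage, and the parameter count 3k−1 assumes a one-dimensional Gaussian mixture.

```python
import math

def bic_score(loglik, n_free_params, n_samples):
    """BIC in maximisation form: ln q(X_N | theta*, k) - 0.5 * d_k * ln N."""
    return loglik - 0.5 * n_free_params * math.log(n_samples)

def select_k(candidates, n_samples):
    """Pick the best k from {k: (max log-likelihood, number of free parameters)}."""
    scores = {k: bic_score(ll, d, n_samples) for k, (ll, d) in candidates.items()}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical first-stage results for N = 200 samples (not from the paper)
candidates = {
    1: (-520.0, 2),
    2: (-410.0, 5),
    3: (-407.5, 8),
    4: (-406.0, 11),
}
best_k, scores = select_k(candidates, n_samples=200)
```

The likelihood keeps growing with k, but the penalty grows faster beyond k=2, so the criterion selects k=2 for these numbers. Each entry of `candidates` would require a full parameter learning run, which is the cost noted above.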

there is an indicator \(\Psi_{\pi}(\theta)\) on θ or its subset, based on which a particular subset π can be effectively discarded if we have $$\Psi_{\pi}(\theta) \to 0, \qquad (7)$$
e.g. \(\Psi_{\pi}(\theta)\) is the variance of \(y^{(i)}\) in Figure 1.

in learning implementation there is an intrinsic mechanism that leads to Equation 7 when the corresponding structure is redundant and thus can be effectively discarded.
Such automatic model selection is actually made while implementing the inverse problem \(X_{N}\to\theta\). Thus, we merge the corresponding two levels in Figure 1, because automatic model selection combines both the inverse problem \(X_{N}\to\theta\) and the inverse problem \(X_{N}\to k\).
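The discarding step implied by Equation 7 can be sketched as a pruning routine applied during learning. This is an illustrative sketch only: the function `prune_by_indicator`, the tolerance and the choice of mixing weights as the indicator are assumptions, and the learning mechanism that drives the indicator towards zero is not shown.

```python
import numpy as np

def prune_by_indicator(psi, params, tol=1e-3):
    """Discard substructures whose indicator psi has effectively reached zero.

    psi    : array of k indicator values (e.g. mixing weights or factor variances)
    params : array of k per-substructure parameters (first axis indexes them)
    """
    keep = psi > tol
    return psi[keep], params[keep], keep

# Hypothetical state after some learning iterations: the third mixing weight
# has been pushed towards zero, signalling a redundant component.
alpha = np.array([0.6, 0.3999, 0.0001])
mu = np.array([0.0, 5.0, 9.0])

alpha_kept, mu_kept, keep = prune_by_indicator(alpha, mu)
alpha_kept = alpha_kept / alpha_kept.sum()   # renormalise the surviving weights
```

In contrast with the two-stage procedure, k is reduced inside a single learning run rather than by comparing separately trained candidates.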
In the existing studies, there are three roads towards automatic model selection. The first is a heuristic road, featured by an early effort called Rival Penalised Competitive Learning (RPCL) made in the early 1990s (Xu et al. 1992, 1993), in which an appropriate number k of clusters is automatically determined during learning.
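The idea behind RPCL can be conveyed by a simplified sketch of one update: the winning centre moves towards the sample, while its closest rival is pushed slightly away, so that redundant centres are gradually driven out of the data. The learning rates, the win-count weighting and the function name `rpcl_step` are assumptions for illustration and may differ from the details in the cited papers.

```python
import numpy as np

def rpcl_step(centres, wins, x, lr_win=0.05, lr_rival=0.002):
    """One simplified RPCL update (illustrative; modifies arrays in place).

    centres : (k, d) float array of cluster centres
    wins    : (k,) float array of cumulative win counts (conscience weighting)
    """
    gamma = wins / wins.sum()                            # relative winning frequency
    dist = gamma * np.sum((centres - x) ** 2, axis=1)    # frequency-weighted distance
    order = np.argsort(dist)
    c, r = order[0], order[1]                            # winner and its rival
    centres[c] += lr_win * (x - centres[c])              # winner learns
    centres[r] -= lr_rival * (x - centres[r])            # rival is penalised
    wins[c] += 1
    return c, r

centres = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
wins = np.ones(3)
c, r = rpcl_step(centres, wins, x=np.array([0.2, 0.0]))
```

After this step the winner (index 0) has moved towards the sample and the rival (index 1) has moved away from it; the penalising rate is kept much smaller than the learning rate.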
The second road gets aid from appropriate priors. For example, learning by Equation 4 demonstrates such a nature by using either a Laplace prior in sparse learning (Tibshirani 1996) or the Jeffreys prior jointly with a proper prior in the minimum message length (MML) method (Figueiredo and Jain 2002). Another example is Variational Bayes (VB) (Corduneanu and Bishop 2001; McGrory and Titterington 2007), which approximately maximises a lower bound of \(L(X_{N},k)\) in Equation 5 via learning the hyperparameters in both a prior \(q(\theta|k)\) and an approximate posterior \(p(\theta|X_{N})\).
The third road is BYY harmony learning, which was first proposed in 1995 (Xu 1995) and subsequently developed systematically; it provides a general framework for learning \(X_{N}\to\theta\) and \(X_{N}\to k\) under the BYY best harmony principle.
Bayesian Ying-Yang harmony learning
where G(x|μ,Σ) denotes a Gaussian density with the mean vector μ and the covariance matrix Σ.
For the remaining three components, we start by designing the structures of q(X|R) and q(R), based on which we further design the structure of p(R|X), which is typically a sort of inverse of the Ying machine q(X|R)q(R). This is consistent with the Ying-Yang philosophy, according to which the Ying is primary and comes first, while the Yang is secondary and is based on the Ying.

A principle of least redundant representation for q(R).

A principle of divide-conquer for q(X|R).

A principle of Ying-Yang uncertainty conservation or variety preservation for p(R|X).
For further details, see Sect. 4.2 of Xu (2010a) and Sect. 3.2 of Xu (2012a). The first two principles are adopted from existing studies, while the third is specific to the BYY system. In keeping with the Ying-Yang philosophy, it requires that the Yang machine preserves a dynamic range to appropriately accommodate the uncertainty or information contained in the Ying machine. That is, we have U(p(X,R))=U(q(X,R)) under an uncertainty measure U(p), as shown in the table of Figure 4(a) in Xu (2009).
On the one hand, maximising H(p‖q) forces the Ying q(X|R)q(R) to match the Yang p(R|X)p(X). There are always certain structural constraints imposed on the Ying-Yang structures, and a further constraint comes from \(p(X)={p_{h}^{N}}(X)\) by Equation 8 on a finite size of samples; because of these, a perfect equality q(X|R)q(R)=p(R|X)p(X) may not actually be reached but is still approached as closely as possible. At this equality, H(p‖q) becomes the negative entropy that describes the complexity of the BYY system. Further maximising it decreases the system complexity and thus provides an ability to determine an appropriate k.
Maximising H(p‖q) consists of minimising the second term for a best matching or agreement between the Ying-Yang pair and of minimising the first term for a least amount of information to be communicated from the Yang to the Ying towards an agreement.
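The two-term reading of the harmony measure can be checked numerically on discrete distributions. The sketch below assumes the harmony measure has the form H(p‖q)=Σ p ln q, which is consistent with the statement that it reduces to the negative entropy at p=q; the decomposing equation itself is not reproduced in this section, and the toy distributions are hypothetical.

```python
import numpy as np

def harmony(p, q):
    """Harmony measure H(p||q) = sum p ln q (assumed discrete form)."""
    return float(np.sum(p * np.log(q)))

def entropy(p):
    """Entropy of p, i.e. the information carried by the Yang."""
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    """Kullback divergence KL(p||q), the Ying-Yang mismatch."""
    return float(np.sum(p * np.log(p / q)))

# Toy Yang (p) and Ying (q) distributions over three joint states
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# H(p||q) = -entropy(p) - KL(p||q): maximising H minimises both terms
residual = harmony(p, q) + entropy(p) + kl(p, q)
```

When q equals p the divergence vanishes and `harmony(p, p)` equals `-entropy(p)`, the negative entropy mentioned above.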
Recent BYY applications and empirical studies
Papers  Outcomes 

Shi et al. (2011a)  A comparative investigation has been made on three Bayesian related approaches, namely, variational Bayesian (VB), minimum message length (MML) and BYY harmony learning, through the task of learning a Gaussian mixture model (GMM) with an appropriate number of components automatically determined. On not only simulated GMM data sets but also the Berkeley segmentation database of real-world images, extensive experiments have shown that BYY harmony learning considerably outperforms both MML and VB regardless of whether a Jeffreys prior or a conjugate Dirichlet-Normal-Wishart (DNW) prior is used and whether the hyperparameters of the DNW prior are further optimised. 
Tu and Xu (2011a)  A further comparison has been made on factor analysis (FA) with an appropriate number of factors determined, and extensive experiments have shown that not only do BYY and VB outperform AIC, BIC and DNLL but also BYY outperforms VB considerably. Moreover, using VB to optimise the hyperparameters of priors degrades performance, while using BYY for this purpose can improve it. 
Tu and Xu (2011b)  Empirical comparisons have also been made on factor selection performances of AIC, BIC, Bozdogan's AIC, the Hannan-Quinn criterion, Minka's (MK) criterion, Kritchman & Nadler's hypothesis tests (KN), Perry & Wolfe's MiniMax rank (MM) and BYY harmony learning, by varying the signal-to-noise ratio (SNR) and the training sample size N. It has been shown that AIC and BYY harmony learning, as well as MK, KN and MM, are relatively more robust than the others against decreasing N and SNR, and BYY is superior for a small size N. 
Extension of FA has been made to binary FA with automatic factor selection. Again, it is empirically shown that BYY outperforms VB and BIC. Also, the efforts of Shi et al. (2014) extend the studies of Shi et al. (2011a) and two FA parameterizations in Tu and Xu (2011a) into Mixture of Factor Analyzers (MFA) and Local Factor Analysis (LFA) for the problem of automatically determining the component number and the number of factors of each FA. On not only a wide range of synthetic experiments but also real applications of face recognition, handwritten digit image clustering and unsupervised image segmentation, it has also been shown that BYY outperforms VB reliably on both MFA and LFA.  
Chen et al. (2014)  Further developments of Shi et al. (2011a) have also been made to avoid some learning instability (see Remarks at the bottom of this table). An implementation of BYY harmony learning by either a projection-embedded algorithm or the algorithm by Table ?? in this paper needs no prior but outperforms not only MML with the Jeffreys prior and VB with the Dirichlet-Normal-Wishart prior but also BYY with these priors given in Shi et al. (2011a). On the Berkeley segmentation data set, the semantic image segmentation performances have shown that BYY outperforms not only MML, VB, BYY-Jef and BYY-DNW but also three leading image segmentation algorithms, namely gPb-owt-ucm, MNCut and Mean Shift. 

A Lagrange implementation of the principle of variety preservation is proposed for learning the Yang structure, with a new Ying-Yang alternation nonlocal search obtained and the above-mentioned dilemma removed.

An information harmonising perspective for BYY harmony learning, such that the tasks of attention, detection, problem-solving, adaptation, learning and model selection are integrated in a concise formulation.

Learning algorithms that implement Ying-Yang alternative nonlocal search for learning GMM, FA, local FA, binary FA, non-Gaussian FA, denoised GMM, temporal FA, temporal binary FA and sparse multivariate regression, as well as a generalised bilinear matrix system that covers not only these linear models but also manifold learning, gene regulatory networks and the generalised linear mixed model, with a favourable nature of automatic model selection and a unified formulation in performing unsupervised and semi-supervised learning.

A principle of preserving multiple convex combinations for implementing BYY harmony learning, which leads to another type of Ying-Yang alternative nonlocal search algorithm.
Finally, at the end of this paper, a chronological outline is given on the innovative time points in the history of BYY harmony learning studies.
Methods
BYY harmony learning: Lagrange Ying-Yang alternation
Even earlier, in 2007, another example was given by Eq. (72) in Xu (2007a) under the name of equal covariance, with U(p(X,Y))=U(q(X,Y)) denoting that the Yang preserves the covariance of q(X,Y).
The existing algorithms for \(\max_{\theta} H(\theta)\) directly impose the constraint U(p(X,Y))=U(q(X,Y)), which makes learning suffer from a dilemma of either a locally optimal solution or some learning instability; see the remarks in Table 1.
where η>0 is a Lagrange coefficient. A nonzero value of η relaxes the target KL(p(X,Y)‖q(X,Y))=0. The smaller the value of η is, the more relaxed the target becomes, and vice versa.
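The Lagrangian that this sentence refers to is not reproduced above. A form consistent with the expression \(H_{L}(\theta)=(1+\eta)H(\theta)+\eta E_{Y|X}\) quoted later for Equation 17 can be reconstructed as follows; this is a sketch under the assumption that \(H(\theta)=\int p\ln q\,dXdY\) with \(p=p(Y|X)p(X)\) and \(q=q(X|Y,\theta)q(Y|\theta)\), and that \(E_{p}\) denotes the entropy of the Yang machine.

```latex
\begin{aligned}
H_{L}(\theta) &= H(\theta) - \eta\, KL\big(p(X,Y)\,\Vert\, q(X,Y)\big), \qquad \eta > 0,\\
KL\big(p\,\Vert\, q\big) &= \int p \ln p \; dX\,dY - \int p \ln q \; dX\,dY \;=\; -E_{p} - H(\theta),\\
H_{L}(\theta) &= (1+\eta)\,H(\theta) + \eta\, E_{p}, \qquad E_{p} = -\int p \ln p \; dX\,dY .
\end{aligned}
```

Under this reading, letting η tend to zero recovers plain harmony maximisation with the matching target fully relaxed, while a large η makes the divergence term dominate.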
with \(H_{L}(\theta^{new})\geq H_{L}(\theta^{old})\).
which keeps \(H_{L}(\theta)\) nondecreasing too.
Therefore, alternately updating by Equations 15 and 16 makes \(H_{L}(\theta)\) monotonically nondecreasing and finally convergent. That is, learning stability is guaranteed.
Given h fixed, the term \(E_{X}(h)\) can be ignored because it is irrelevant to updating θ and p(Y|X). With the help of \(E_{X}(h)\), an appropriate h can be estimated in a way similar to those summarised in Sect. 2 of Xu (2003b).
from which we get two types of detailed implementation according to the types of variables in Y.
When the variables in Y are discrete valued, the integral over Y becomes a summation. It follows from Equations 15 and 16 that we are led to the general procedure for Ying-Yang alternative implementation given in Algorithm 1.
where \(\eta_{u}\), \(\Gamma_{u}\) are the mean and the covariance of p(u).
where \(Cov_{p(u)}\,u\) denotes the covariance matrix of p(u) and vec(A) denotes the vector obtained by stacking the column vectors of A one by one.
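For readers who want to check the vec(·) convention numerically, a minimal sketch follows. The helper name `vec` mirrors the notation in the text; column stacking corresponds to column-major (Fortran) ordering in NumPy.

```python
import numpy as np

def vec(A):
    """Stack the columns of A one by one into a single vector."""
    return np.asarray(A).reshape(-1, order="F")

A = np.array([[1, 2],
              [3, 4]])
v = vec(A)   # columns [1, 3] and [2, 4] stacked: [1, 3, 2, 4]
```

Using the default row-major order instead would stack rows, which is a different convention from the one defined above.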
Maximising the above H(θ), we get another type of Ying-Yang alternative implementation, as summarised in Algorithm 2.
which acts as the Ying step of Algorithm 2.
where \(E_{Y|X}\approx 0.5d_{Y}\ln {(2\pi e)} +0.5\ln {\left|\Gamma ^{Y}_{X_{N}}\right|}\) is obtained by approximately regarding it as the entropy of a Gaussian density with a covariance matrix \(\Gamma ^{Y}_{X_{N}}.\)
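The Gaussian entropy used in this approximation can be evaluated with a few lines of code. The sketch assumes the standard closed form for the differential entropy of a d-dimensional Gaussian, 0.5·d·ln(2πe) + 0.5·ln|Γ|; the function name `gaussian_entropy` is hypothetical.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a Gaussian with covariance matrix cov (in nats)."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)    # numerically stable ln|cov|
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * d * np.log(2.0 * np.pi * np.e) + 0.5 * logdet

e_unit = gaussian_entropy(np.eye(3))          # equals 1.5 * ln(2*pi*e)
e_wide = gaussian_entropy(4.0 * np.eye(2))    # broader posterior, larger entropy
e_narrow = gaussian_entropy(np.eye(2))
```

A broader covariance gives a larger entropy, which is how this term rewards a Yang machine that keeps more flexibility.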
from which we observe that the maximisation of \(H_{L}(\theta)\) consists of not only a best Ying-Yang harmony but also a degree η of jointly a top-down maximum likelihood learning and a bottom-up best matching between the posteriors \(p(Y|X_{N})\) and \(q(Y|\theta,X_{N})\).
which may also be obtained by considering the constraint by Equation 12 in a Lagrangian. In the special case η=1, we may regard it as a counterpart of Equation 24, with the difference that H(θ) replaces \(\ln q(X_{N}|\theta)\).
which becomes the counterpart of Equation 24 generally instead of only at η=1. Alternatively, we may reach a tighter lower bound by an appropriate value of η.
Related studies: KL-η-HL spectrum
Year  Outcomes 

1998  The following convex combination with 0≤η≤1 was heuristically proposed as a criterion for model selection: \((1-\eta)KL(p(Y|X)p(X)\Vert q(Y|R)q(Y))-\eta H(\theta)\), (A); e.g. see Eq. (49) in Xu (1998a) and Eq. (22) in Xu (1998b). The above equation (A) can be rewritten into a format that is exactly equivalent to \(H_{L}(\theta)=(1+\eta)H(\theta)+\eta E_{Y|X}\) in Equation 17. 
2000  It was further proposed to perform \(\max_{\theta} H_{L}(\theta)\) with η>0 monotonically decreased from a big value (i.e. removing the constraint η≤1), see Eq. (23) in Xu (2000a), which is further addressed for learning Gaussian mixture in Xu (2001a), e.g. see the paragraphs around its Eq. (42) and Eq. (43). 
2003  The above equation (A) has also been re-examined from the perspective of the KL-η-HL spectrum; for details, see Eqs. (62-64) in Xu (2003a). 
Information harmonising dynamics
According to the Ying-Yang philosophy placed at the upper right corner of Figure 3, the Ying and the Yang constitute a harmony system surviving in an environment, in which the Ying is primary while the Yang has not only a nature of variety but also a good adaptability to both the Ying and its environment. We may not only understand Equations 13, 14 and 17 from a classic perspective but also gain new insight into how the Ying and the Yang interact dynamically.

(a) Balance within the Yang domain, i.e. seeking a match between \({p_{h}^{N}}(X)\) by Equation 8 and \(q(X_{N}|\theta)=\int q(X_{N}|Y, \theta)q(Y|\theta) dY\), measured by a divergence \(KL({p_{h}^{N}}(X)\Vert q(X_{N}|\theta))\) or equivalently a likelihood \(L(\theta)=\ln q(X_{N}|\theta)\).

(b) Balance along the Yang pathway, i.e. to satisfy the constraint by Equation 12, e.g. measured by \(-KL(p(Y|X_{N})\Vert q(Y|\theta,X_{N}))\).

(c) Balance between Ying-Yang, i.e. both (a) and (b), measured by \(KL(p(Y|X)p(X)\Vert q(X|Y,\theta)q(Y|\theta))\), as in Equation 13.
Here, we focus on the standard cases, i.e. Ying-dominated models in which the Ying is primary. For some exceptional cases in which the Yang is primary, e.g. the forward architecture (see Sect. II(C) in Xu (2001b)), we may consider a balance within the Yang domain and a balance via the Yang pathway.
Typically, η could be a monotonically increasing function of a goodness that measures such a balance, while a best Ying-Yang harmony is reached at a balance at which the Ying-Yang system has the least complexity.
Quantitatively, the harmonising dynamics remains an open topic that demands further investigation. Qualitatively, this dynamics may be roughly depicted via the dynamics of η as follows.
The dynamics of maximising \(H_{L}(\theta)\) focuses on maximising H(θ), which makes \(p(Y|\theta,X_{N})=\delta(Y-Y^{*})\) with \(Y^{*}=\arg\max_{Y} \pi(X_{N},Y,\theta)\) become most focused and least flexible in order to rapidly satisfy the most urgent need of the Ying; that is, BYY harmony learning degenerates to a special case that is an extension of competitive learning. Though it still works when the resulting H(θ) is used as a model selection criterion, e.g. see Eq. (10a) in Xu (1996), it becomes sensitive to initialisation and poor in automatic model selection because of the winner-take-all (WTA) competition among the inner representations of Y. Therefore, we should not let \(\eta_{t}\) always stay at too small a value.
such that η≈1+η. In such cases, maximising \(H_{L}(\theta)\) by Equation 17 actually focuses on maximising \(\eta[H(\theta)+E_{Y|X}]\), or equivalently minimising the Kullback divergence \(\eta KL(p(Y|X)\Vert q(X_{N}|Y,\theta)q(Y|\theta))\) for a Ying-Yang best matching, which makes \(p(Y|\theta,X_{N})\) tend to Equation 12 and thus enjoy a larger varying range or greater flexibility to cope with new samples. However, the harmonising information H(θ) in the centre of Figure 3 becomes negligible, i.e. it becomes weak in reducing the system complexity. In such a case, Algorithm 1 and Algorithm 2 become equivalent to the EM algorithm for maximum likelihood, which is poor in model selection too. This means that the dynamics approaches an equilibrium as η tends to a big value, during which model selection or structure changing is gradually shut off while parameters may still be refined.
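The two extreme forms of the Yang posterior described above can be contrasted on a discrete toy example. The sketch assumes that Y takes a few discrete values and that `joint` holds the values of π(X_N,Y,θ)=q(X_N|Y,θ)q(Y|θ) at those values; the numbers and function names are hypothetical, and the behaviour for intermediate η is not modelled.

```python
import numpy as np

def wta_posterior(joint):
    """Small-eta extreme: all mass on Y* = argmax_Y pi(X_N, Y, theta)."""
    p = np.zeros_like(joint, dtype=float)
    p[np.argmax(joint)] = 1.0
    return p

def bayes_posterior(joint):
    """Large-eta extreme: the Bayesian inverse, pi normalised over Y."""
    return joint / joint.sum()

def entropy(p):
    """Entropy with the convention 0 * ln 0 = 0."""
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

joint = np.array([0.02, 0.05, 0.03])     # hypothetical pi values for three Y states
p_wta = wta_posterior(joint)             # [0, 1, 0]: most focused, zero entropy
p_soft = bayes_posterior(joint)          # [0.2, 0.5, 0.3]: keeps variety
```

The winner-take-all posterior has zero entropy and discards the evidence for the other states, whereas the Bayesian posterior preserves it, which illustrates the flexibility gained in the second extreme case.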
In the beginning, a BYY system is given with a predesigned Ying-Yang structure and usually with all the unknown parameters initialised either randomly or according to a priori knowledge. Thus, the BYY system fits a given set \(X_{N}\) of samples badly, resulting in a poor Ying-Yang balance with a small η value in a way similar to the first extreme case. The dynamics focuses on not only adjusting the structure but also updating the parameters towards a balance with η growing quickly, and gradually tends to an equilibrium with \(X_{N}\) well described by a Ying-Yang structure of an appropriate complexity.
Surviving in an environment, the BYY system typically stays at one equilibrium of its harmonising dynamics. As the environment changes, the dynamics is featured by performing the following actions:
(A) Equilibrium and attention When the system feels familiar with its observations, the dynamics stays at one equilibrium with a big value of η. An unexpected environmental change will make η drop. A large drop will trigger the system’s attention to detect environmental novelty. In other words, there is an attention mechanism associated with η.
(B) Detection and problem-solving A small drop of η is associated with a deviation from one equilibrium, which causes an increment of KL. This increment is associated with actions of detecting objects, recognising patterns and solving problems (e.g. inference or control) by the mapping X→Y via \(p(Y|\theta,X_{N})\).
(C) Adaptation and learning When the two opposed changes of η and of KL are not big enough for the value of ηKL to change considerably, learning will not be triggered and \(H_{L}(\theta)\) by Equation 17 stays approximately unchanged. However, maximising \(H_{L}(\theta)\) will start to minimise KL when the increment of KL becomes large while η remains at a high value, i.e. becoming close to the second extreme case by Equation 26. In this case, the learning made by Algorithm 1 or Algorithm 2 becomes closer to maximum likelihood learning, which merely updates the parameters in the system without a big structural change; that is, no model selection occurs.
(D) Model selection and structure pruning A big drop of η will happen when the BYY system faces a largely different environment, i.e. approaching the extreme case η=0. The dynamics then has to not only adjust the structure but also update the parameters towards a new equilibrium, with η brought up quickly.
Conceptually, η monotonically decreases with a vigilance signal v, and this v monotonically increases with \(d_{M}\), \(d_{D}\) and \(d_{U}\), where \(d_{M}\) reflects the discrepancy between data X and its counterpart \(\hat X\) reconstructed by the model, e.g. measured by the negative log-likelihood \(-\ln q(X_{N}|\theta)\) or \(KL({p_{h}^{N}}(X)\Vert q(X_{N}|\theta))\), while \(d_{D}\) reflects the deviation of an inner representation Y from the desired \(Y_{d}\), e.g. measured by the square error between Y and its corresponding \(\hat Y\). Moreover, \(d_{U}\) is a measure that reflects salient occurrences that attract attention. Further investigation is needed on the detailed forms of \(d_{M}\), \(d_{D}\) and \(d_{U}\), as well as the specific form of g(f(·,·,·)), which may be considered by nonlinear regression.
As illustrated in Figure 3, the strength η controls the flexibility and adaptability that the Yang enjoys, described by an entropy gain \(-\eta \int p(Y|X){p_{h}^{N}}(X)\ln [p(Y|X){p_{h}^{N}}(X)]dYdX =\eta [E_{Y|X} +E_{X}(h)]\). Transferring this information from the Yang to the Ying, the Ying attempts to harmonise the information by updating parameters and modifying its structure to increase an amount of negative entropy ηH(θ). Therefore, a net amount of harmonising information \((1+\eta)H(\theta)+\eta[E_{Y|X}+E_{X}(h)]\) is maximised, by which we are led to Equations 14 and 17.
Learning Gaussian mixture and learning factor analysis
which leads to what is typically called factor analysis (FA), where Σ is a nonnegative diagonal matrix.
For maximum likelihood learning, FA-a and FA-b are equivalent. However, FA-b becomes much more favourable when a learning algorithm with a nature of automatic model selection is used. Readers are referred to Sect. 2.2 in Xu (2011) and Tu and Xu (2011a) for further studies on FA-b versus FA-a.
Usually, ν is set to be 0. Here we use ν to denote a constant vector for convenience of a further extension in Algorithm 14.
The orthogonal constraint by Equation 32 also plays a role in removing a scale indeterminacy of the linear system by Equation 31, because an arbitrary diagonal matrix D≠I will break Equation 32, even though we may have \(Ay=(AD)(D^{-1}y)=A^{*}y^{*}\) with \(y^{*}\) still from G(y|ν,Λ). For further details, see Sect. 2.2 in Xu (2011).
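The scale indeterminacy and its removal can be verified numerically. Equation 32 is not reproduced in this section, so the sketch assumes that the orthogonal constraint takes the form AᵀA=I; the matrix sizes, the diagonal D and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A loading matrix with orthonormal columns (assumed form of the constraint)
A, _ = np.linalg.qr(rng.normal(size=(5, 2)))
D = np.diag([2.0, 0.5])                  # an arbitrary diagonal rescaling, D != I
y = rng.normal(size=2)

A_star = A @ D                           # rescaled loading matrix
y_star = np.linalg.inv(D) @ y            # inversely rescaled factors

same_output = np.allclose(A @ y, A_star @ y_star)
original_ok = np.allclose(A.T @ A, np.eye(2))
rescaled_ok = np.allclose(A_star.T @ A_star, np.eye(2))
```

The observation Ay is unchanged by the rescaling, so the data alone cannot distinguish the two parameterisations, but only the original A satisfies the assumed constraint, which is what removes the indeterminacy.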
Also, there are alternative constraints in place of Equation 32, e.g. see Eqs. (33) and (34) in Xu (2011).
where \(\gamma^{*}\) is obtainable by any univariate iterative algorithm, e.g. Newton's method.
In summary, we can turn Algorithm 2 into Algorithm 4 for learning factor analysis by modifying the Ying step; that is, we update \(\Sigma^{new}\), \(\Lambda^{new}\) based on Equation 34 and then update \(A^{new}\) according to a choice of possible constraints on A.
When \(\Sigma=\sigma^{2}I\), we also get an alternative algorithm for learning Principal Component Analysis (PCA) with automatic model selection on the number of principal components. For further details about PCA versus FA, see Sect. 3.2 of Xu (2010a).
Learning local factor analysis
where \(\delta_{i,j}\) is the Kronecker delta with \(\delta_{i,j}=1\) if i=j and \(\delta_{i,j}=0\) otherwise, which actually describes i.i.d. samples \(X_{N}=\{x_{t}\}_{t=1}^{N}\) by a mixture of local factor analysis, or of local subspaces in the special case \(\Sigma _{\ell }=\sigma ^{2}_{\ell }I\).
from which we further get \(\theta^{new}\) via maximising \(\sum _{t=1}^{N} \sum _{\ell =1}^{k} p(\ell |x_{t},\theta)H(\theta |\ell, x_{t}),\) resulting in the Ying step of a new Ying-Yang alternating algorithm for learning a mixture of local factor analysis, as in Algorithm 5. Actually, this Ying step combines the Ying of Algorithm 3 and the Ying of Algorithm 4.
from which and together with Equation 44, we see that the Yang step of Algorithm 5 actually combines the Yang of Algorithm 3 and the Yang of Algorithm 4.
This algorithm degenerates back to not only Algorithm 4 with k=1 but also Algorithm 3 with y=0 and \(A_{\ell}=0\) for each ℓ.
Learning binary factor analysis
which is called binary factor analysis (BFA).
Together with adding the constraint on y in Equation 28, we are led to an equivalent form of Equation 28. In other words, learning BFA may be regarded as a relaxation or extension of learning Gaussian mixture.
One example was given by Eq. (20) in Xu (2010a) for binary FA, and the other example may also be found in Sect. 2.1.5 of Xu (2012a) on learning Gaussian mixture.
from which we get the Yang step of Algorithm 6 for binary factor analyses, similar to getting the Yang step of Algorithm 3 from the Yang step of Algorithm 1.
With p(y|x,θ) fixed, we get \(\theta^{new}\) by maximising H(θ), resulting in the Ying step of Algorithm 6.
Imposing the constraint on y in Equation 29 and letting _{ tf } cover the entire domain of y, this algorithm degenerates to Algorithm 3 for Gaussian mixture when \(\Sigma_{\ell}=\Sigma\).
from which we get Algorithm 7 with a simplified Ying step, but its Yang step needs to get \(\xi _{y |x_{t}}\) by solving a constrained quadratic optimisation via one of the typical existing techniques (Fang et al. 1997; Floudas and Visweswaran 1995).
Learning non-Gaussian factor analysis
Similar to Equation 49, maximising \(H_{L}(\theta)\) gets \(p(z|x_{t},\theta)\) in the Yang step. Similar to Equation 44, we also get \( y_{z,t}=[y_{z,t}^{(1)}, \dots, y_{z,t}^{(k)}]^{T}=\arg\max _{y} \pi (x_{t},y, z,\theta)\) and \( \Gamma _{z,y\vert x}=\arg\max _{\Gamma _{z,y\vert x}} H_{L}(\theta |z, x_{t}).\)
We maximise \(H_{L}(\theta)\) to update θ, resulting in the Ying step of Algorithm 8 for learning NFA. The Ying step consists of a first part for updating each component \(\alpha _{j}^{(i) }G(y^{(i)}|\nu _{z^{(i)}}^{(i)},\lambda _{z^{(i)}}^{(i)})\) and a second part for updating G(x|Ay+μ,Σ). Also, the role of \(\delta _{j,z^{(i)}}\) is to pick those components that contribute to the corresponding \(\alpha _{j}^{(i) }, \nu _{j}^{(i) }, \lambda _{j}^{(i) }\) according to whether \(z^{(i)}=j\). The number \(m_{i}\) of the components is determined via trimming off \(G(y^{(i)}|\nu _{z^{(i)}}^{(i)},\lambda _{z^{(i)}}^{(i)})\) if \((\alpha _{j}^{(i) }\lambda _{j}^{(i)})^{new}\to 0\).
Unsupervised vs semi-supervised
Instead of knowing only i.i.d. samples \(X_{N}=\{x_{t}\}_{t=1}^{N}\), there may be a subset \(X_{s}\subset X_{N}\) in which each \(x_{t}\in X_{s}\) is associated with a supervision sample \(y_{t}^{*}\). The problem is called unsupervised learning when \(X_{s}\) is an empty set, and supervised learning when \(X_{s}=X_{N}\). Generally, the problem is called semi-supervised learning when \(X_{s}\) lies between the two extreme cases.
For BYY harmony learning, unsupervised, semi-supervised and supervised learning are all expressed in the same formulation. There are two types of implementation according to whether y is discrete or real.
where \(y_{t}^{*}\) is the teaching label associated with \(X_{s}\), and γ>0 is a confidence factor. The bigger γ is, the higher our confidence in the supervision sample.
from which we modify Algorithm 3 into Algorithm 9, with the Ying step kept unchanged and the Yang step modified as given in Algorithm 9.
We can always assign a teaching label \(\ell _{t}^{*}\) to each sample \(x_{t}\). If there is no teaching label, we assign \(\ell _{t}^{*}\) a number larger than k and thus always have \(\delta _{\ell, \ell _{t}^{*}}=0\). Otherwise, we let \(\ell _{t}^{*}\) be its teaching label and have \(\delta _{\ell, \ell _{t}^{*}}=1\) when \(\ell =\ell _{t}^{*}\).
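The labelling convention above can be sketched in a few lines. This is only an illustration of how the indicator \(\delta _{\ell, \ell _{t}^{*}}\) may be tabulated for all samples at once, not the Yang step of Algorithm 9 itself; the function name `teaching_indicator`, the use of NumPy and the choice of k+1 as the "no label" value are assumptions made for the example.

```python
import numpy as np

def teaching_indicator(labels, k):
    """Return an N x k matrix whose entry [t, l-1] equals delta_{l, l_t^*}.

    Labels 1..k denote teaching labels; any value larger than k marks a
    sample without a teaching label, so its whole row is zero.
    """
    labels = np.asarray(labels)
    components = np.arange(1, k + 1)
    return (labels[:, None] == components[None, :]).astype(float)

k = 3
labels = [2, k + 1, 1, k + 1]  # the 2nd and 4th samples carry no teaching label
delta = teaching_indicator(labels, k)
```

Rows of `delta` that are entirely zero correspond to unsupervised samples, so one array can serve the unsupervised, semisupervised and supervised cases alike.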
Similarly, we modify Algorithm 6 for learning binary FA into a semisupervised version, i.e. Algorithm 10. Whether or not there is a teaching sample \(y_{t}^{*}\) for \(x_{t}\), we may always assign one \(y_{t}^{*}\) to each sample \(x_{t}\). If there is no teaching sample, we assign \(y_{t}^{*}\) to be out of _{ tf } and thus have \(\delta _{y, y_{t}^{*}}=0\). Otherwise, we let \(y_{t}^{*}\) be its teaching sample and have \(\delta _{y, y_{t}^{*}}=1\) when \(y=y_{t}^{*}\).
where I _{ t } is an indicator explained by the remark given in Algorithm 11. It follows from Equation 54 that Algorithm 4 for learning FA can be modified into Algorithm 11 with some changes in the Ying step.
Moreover, we may combine Equations 53 and 54 to modify Algorithm 8 for learning NFA into Algorithm 12. Similar to Algorithm 10, we may always assign one discrete vector \(z_{t}^{*}=[z_{t}^{(1)*},\cdots,z_{t}^{(m)*}]\) to each sample \(x_{t}\). If there is no teaching information, we assign \(z_{t}^{*}\) a value that is out of our consideration, e.g. by letting every \( z_{t}^{(i)*}\) be a big number, so that we always have \(\delta _{z, z_{t}^{*}}=0\) for z∈_{ tf }. Otherwise, we let \(z_{t}^{*}\) be its teaching label about \(z_{t}\), and use \(\delta _{z, z_{t}^{*}}=1\) to indicate \(z=z_{t}^{*}\).
Similar to the Yang step of Algorithm 9 and of Algorithm 10, we get \(p_{z\vert x_{t}}(\theta)\), with the difference that \(p_{z\vert x_{t}} =p_{z\vert x_{t}}(\theta ^{new})\) is not globally rescaled by a factor. Instead, the rescaling is distributed among the individual updates in the Ying step. Another difference from Algorithm 10 is that each \(z_{t}\) is also associated with a real-valued vector \(y_{t}=[y_{t}^{(1)},\cdots,y_{t}^{(m)}]\). For each teaching label \(z_{t}^{*}\), there are two situations. One is that the corresponding teaching vector \(y_{z,t}^{*}=[y_{z,t}^{(1)*},\cdots,y_{z,t}^{(m)*}]^{T}\) is given together with \(z_{t}^{*}\). The other is that we have \(z_{t}^{*}\) only and need to estimate \(y_{z,t}^{*}\).
Also, the situation is different from getting \(y_{z,t}=y(z_{t},x_{t},\theta ^{new})\) in Algorithm 8, where we only have \(x_{t}\) and know neither \(z_{t}^{*}\) nor \(y_{z,t}^{*}\). Here, we estimate \(y_{z,t}^{*}= y(z_{t}^{*}, x_{t}, \theta ^{new})\) based on the given teaching signal \(z_{t}^{*}\).
Still, it relates to updating \(A,\Sigma,R^{xy}\) in Algorithm 11 in that \(\delta _{z, z_{t}^{*}}\) takes the role of \(I_{t}\), though the situation becomes more complicated due to the role of \(z_{t}^{*}\) and a scalar Gaussian mixture of each component \(y_{t}^{(i)}\).
BYY harmony sparse learning: a dual view
In all the previous sections, the BYY harmony learning implements the maximisation of H(θ) in Equation 11 without considering a prior q(θ). In this section, we show that learning performance can be further improved by prior-aided learning from a dual perspective.
from which we observe that \(p(A\vert X)\) and \(p(Y\vert X)\) take the same position in the first line and the last line, respectively, and that \(H_{d}(\theta)\) is actually a dual counterpart of H(θ) in Equation 11 as follows
\(H(\theta)= \int p(Y\vert X)\, {p_{h}^{N}}(X)\ln [q(X\vert AY, \psi)\,q(Y\vert \phi)]\,dY\,dX,\)
\(H_{d}(\theta)= \int p(A\vert X)\, {p_{h}^{N}}(X)\ln [q(X\vert AY, \psi)\,q(A\vert \rho)]\,dA\,dX.\)
This dual view motivates improving the learning not only by updating A with the aid of a prior \(q(A\vert \rho)\) but also by maximising \(H_{d}(\theta)\).
where \(\theta^{-}\) results from removing A, ρ from θ. Any one of the algorithms introduced in the previous sections can implement the maximisation of the last line above.
where \(\theta^{-}\) is obtained from implementing the maximisation of the last line in Equation 56, which is available via the algorithms introduced in the previous sections.
where ⊗ is the Kronecker product. This equation is equivalent to Eq. (51) in Xu (2011), i.e. the problem of solving a Sylvester matrix equation (Bartels and Stewart 1972; Miyajima 2013).
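The connection between a Kronecker-product linear system and a Sylvester matrix equation can be illustrated with a small sketch. It solves a generic equation AX + XB = C through \(vec(AX+XB)=(I\otimes A+B^{T}\otimes I)\,vec(X)\) with column stacking. The matrices and the helper name `solve_sylvester_kron` are hypothetical and are not the quantities of the equation above; for larger problems the method of Bartels and Stewart (1972) avoids forming the Kronecker matrix explicitly.

```python
import numpy as np

def solve_sylvester_kron(A, B, C):
    """Solve A X + X B = C by vectorisation (column-stacking vec)."""
    m, n = C.shape
    # vec(A X) = (I_n kron A) vec(X),  vec(X B) = (B^T kron I_m) vec(X)
    K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))
    x = np.linalg.solve(K, C.flatten(order="F"))
    return x.reshape((m, n), order="F")

# Illustrative matrices; A and -B share no eigenvalue, so the solution is unique.
A = np.array([[4.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 5.0]])
B = np.array([[2.0, 1.0],
              [0.0, 3.0]])
C = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

X = solve_sylvester_kron(A, B, C)
residual = np.abs(A @ X + X @ B - C).max()
```

The Kronecker system has size mn × mn, so this direct route is only practical for small matrices.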
which is put in the above Equation 59 for updating \(\rho^{new}\).
Computations of \({\cal B}, \Gamma ^{A}_{X_{N}}, {\Pi ^{A}_{X}}\) are rather simple since \(\Sigma _{j}^{a }, \Lambda, \Sigma \) are typically diagonal matrices, and even \(\Sigma=\sigma^{2}I\). Such uncorrelated structures facilitate learning with automatic model selection (see Sect. 2.2 of Xu (2012a) and Sect. 2.2 of Xu (2010a)), which pushes redundant elements of A towards zero by pushing their corresponding variances towards zero. As a result, learning leads to a sparse matrix A.
Such a BYY harmony sparse learning comes from \(q(A\vert \rho)\), which takes a role dual to that of \(q(Y\vert \phi)\). Unlike the existing sparse learning studies (Shi et al. 2011a, 2014; Tu and Xu 2011a; Xu 2012b), which consider either \(q(A\vert \rho)\) in a long-tail distribution at extensive computing cost or \(q(A\vert \rho)\) in Equation 56 with the help of one additional \(q(\rho)\) (see Sect. III of Xu (2012b)), here the updating by Equation 59 is made by \(q(A\vert \rho)\) in Equation 56 without considering such a prior \(q(\rho)\).
Of course, we may proceed to consider a prior \(q(\rho)\) and also some priors about Λ, Σ, which will lead to another layer of integral about \(q(\rho)\), Λ, Σ. Readers are referred to Sect. 2.3 in Xu (2011) for the details of implementation.
which has already been listed in Algorithm 4, Algorithm 5, Algorithm 6, Algorithm 7, Algorithm 8 and Algorithm 12 as one alternative to \(A^{new}=R^{xy}\Lambda^{new\,-1}\) in the Ying step.
The implementation of maximising the first line in Equation 55 is featured by the order of integrals \(\int [\cdot ] dY dA\). In a dual view, we may also swap the order to consider maximising the last line in Equation 55. The detailed implementation will be quite similar. Moreover, we may alternate between the two implementations.
Denoised Gaussian mixture
That is, we get a Gaussian mixture in which each covariance matrix is added with the variance of a common noise e. Given \(y^{(j)}=1\), we see that \(\hat x=x-e=a_{j}\) comes from \(G(a_{j}\vert \mu_{j},\Sigma_{j})\) and provides a denoised version of the observed sample x. Since \(y^{(j)}\) takes 1 with probability \(\alpha_{j}\), the denoised \(\hat x\) actually comes from a mixture \(\sum _{j} \alpha _{j} G(x\vert \mu _{j}, \Sigma _{j})\). Thus, this study is called, in Sect. 3.1 of Xu (2011), learning denoised Gaussian mixture or shortly denoised GM.
where c _{ η } is a constant that does not relate to θ,ℓ.
Putting the above Equations 66, 65 and 64 into Algorithm 13, we get a new Ying-Yang alternating algorithm for learning denoised GM, which improves its counterpart in Sect. 3.1 of Xu (2011) in that the Lagrange technique used in Algorithm 3 is used to help the Ying-Yang alternative implementation. Also, \(p(\ell\vert x_{t},\theta)\) in Equation 65 has been extended to cover semisupervised learning in the same way as in Algorithm 9.
Conventionally, noise is filtered in a preprocessing step (if needed) with the help of a standard noise filtering method. In many applications, however, filtering noise and performing clustering or density estimation are actually two coupled tasks. The denoised GM provides a model that considers both in the same learning process, while Algorithm 13 provides a useful tool that implements both tasks. Moreover, we can include some knowledge (e.g. teaching labels) in an easy way. One example is its potential application to image segmentation. Applied to a noisy image, \(a_{t,\ell}\) yields denoised pixels for each segmented region, while pixel classification can be made by \(\ell^{*}=\arg\max_{\ell} p_{\ell,t}\). For a sharpened image, we may merely use \(a_{t,\ell ^{*}}\) as the denoised pixels of each segmented region.
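How classification and denoising are coupled in such a model can be sketched as follows. The sketch assumes x = a + e with a drawn from \(G(a\vert \mu_{\ell},\Sigma_{\ell})\) and e from a zero-mean Gaussian with a common covariance, so that each component of x has covariance \(\Sigma_{\ell}\) plus the noise covariance. It uses the plain Bayesian posterior and the standard Gaussian conditional mean rather than the Lagrange-modulated Yang step of Algorithm 13; all function and variable names are illustrative assumptions.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian G(x | mean, cov)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + quad)

def denoise(x, alphas, mus, sigmas, noise_cov):
    """Return posteriors p_l, per-component denoised estimates a_l and argmax l*."""
    logp = np.array([np.log(al) + gauss_logpdf(x, mu, S + noise_cov)
                     for al, mu, S in zip(alphas, mus, sigmas)])
    post = np.exp(logp - logp.max())
    post /= post.sum()
    # Conditional mean of a given x under component l
    a_hat = [mu + S @ np.linalg.solve(S + noise_cov, x - mu)
             for mu, S in zip(mus, sigmas)]
    return post, a_hat, int(np.argmax(post))

alphas = [0.5, 0.5]
mus = [np.zeros(2), np.array([5.0, 5.0])]
sigmas = [np.eye(2), np.eye(2)]
noise_cov = 0.5 * np.eye(2)
x = np.array([4.5, 5.5])

post, a_hat, best = denoise(x, alphas, mus, sigmas, noise_cov)
denoised = a_hat[best]  # plays the role of a_{t, l*} for a sharpened output
```

The denoised estimate shrinks x towards the mean of its assigned component, with the amount of shrinkage set by the ratio of the component covariance to the total covariance.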
Sparse linear and logistic regression
Though we may directly use Algorithm 11 for learning, it is difficult to trim off the redundant elements of y by checking whether \(\lambda_{i}\to 0\) in the Ying step of Algorithm 4. In this case, the contribution of \(\{ y_{t}^{*}\}\) prevents any \(\lambda_{i}\) in Λ in Equation 34 from tending to zero. In contrast, learning by Algorithm 4 and Algorithm 11 is still able to push redundant elements of A towards zero when A is updated by Equation 57 together with Equations 59 and 60.
with \(q(A)=q(A\vert \rho)\) given by Equation 56. Also, from Equation 67 we have \(q(X_{N}\vert AY, \theta)=\prod _{t} G(x_{t}\vert Ay_{t}+\mu, \Sigma)\).
To get further insight, we consider the special case in which \(x_{t}\) is simply univariate, i.e. d=1, so that A becomes a vector \(a^{T}\) and Equation 67 becomes the widely studied linear regression problem, for which Algorithm 14 is simplified into Algorithm 15. It differs from ordinary linear regression in that Λ is corrected by a term \( \frac {\sigma ^{2}}{N} \Sigma ^{a\,-1}\) when solving for a.
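The effect of such a correction term can be sketched for the univariate case. The sketch assumes that a is obtained by solving \((\Lambda + \frac{\sigma^{2}}{N}\Sigma^{a\,-1})a = r\), with Λ the sample second-order moment of the centred regressors and r their sample cross-moment with x; this form, the diagonal \(\Sigma^{a}\) and all names are assumptions for illustration and may differ in detail from Algorithm 15.

```python
import numpy as np

def corrected_regression(Y, x, sigma2, prior_var):
    """Regress scalar x on the rows of Y with a prior-corrected moment matrix.

    prior_var holds the diagonal of Sigma^a; a tiny entry pushes the
    corresponding coefficient towards zero.
    """
    N = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    xc = x - x.mean()
    Lam = Yc.T @ Yc / N
    r = Yc.T @ xc / N
    a = np.linalg.solve(Lam + (sigma2 / N) * np.diag(1.0 / prior_var), r)
    mu = x.mean() - a @ Y.mean(axis=0)
    return a, mu

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))
x = Y @ np.array([2.0, 0.0, -1.0]) + 0.5 + 0.1 * rng.normal(size=200)

# A very broad prior leaves the ordinary least squares solution essentially intact.
a_ols, mu_ols = corrected_regression(Y, x, 0.01, np.full(3, 1e12))
# A near-zero prior variance on the second coefficient trims it off.
a_sparse, _ = corrected_regression(Y, x, 0.01, np.array([1.0, 1e-8, 1.0]))
```

As the prior variance of an element tends to zero, the correction dominates the corresponding diagonal entry and the element is driven to zero, which is the mechanism behind the sparse A discussed earlier.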
which is put into the Ying step of Algorithm 16. Unlike in Algorithm 14, there is no need to consider \(\Sigma=\sigma^{2}I\), and A, μ are updated by gradient ascent instead of solving a nonlinear equation.
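A gradient ascent update of A, μ for binary outputs can be sketched as below. The sketch assumes the sigmoid \(s(r)=1/(1+e^{-r})\), independent Bernoulli outputs and plain maximum likelihood without the prior on A, so it illustrates the kind of update involved rather than reproducing Algorithm 16; the step size, iteration count and names are arbitrary choices.

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def log_likelihood(Y, X, A, mu):
    """Bernoulli log-likelihood of binary X given regressors Y."""
    P = sigmoid(Y @ A.T + mu)
    return float(np.sum(X * np.log(P) + (1.0 - X) * np.log(1.0 - P)))

def logistic_gradient_ascent(Y, X, steps=500, lr=0.5):
    """Fit P(x^(i) = 1 | y) = s(a_i^T y + mu_i) by batch gradient ascent."""
    N, m = Y.shape
    d = X.shape[1]
    A = np.zeros((d, m))
    mu = np.zeros(d)
    for _ in range(steps):
        err = X - sigmoid(Y @ A.T + mu)   # gradient of the log-likelihood w.r.t. the logits
        A += lr * err.T @ Y / N
        mu += lr * err.mean(axis=0)
    return A, mu

rng = np.random.default_rng(1)
Y = rng.normal(size=(300, 2))
A_true = np.array([[2.0, -1.0]])
X = (rng.random((300, 1)) < sigmoid(Y @ A_true.T + 0.5)).astype(float)

ll_init = log_likelihood(Y, X, np.zeros((1, 2)), np.zeros(1))
A_fit, mu_fit = logistic_gradient_ascent(Y, X)
ll_fit = log_likelihood(Y, X, A_fit, mu_fit)
```

Because the log-likelihood is concave in A and μ, a sufficiently small fixed step size increases it monotonically, which is why no equation solving is needed here.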
Temporal FA and temporal binary FA
The joint modelling by Equations 31 and 72 is called temporal factor analysis, shortly temporal FA or TFA.
Given \(\nu=By_{t-1}\) fixed, maximising \(H_{1}(p\Vert q)\) is decoupled from \(H_{2}(p\Vert q)-\sum _{t} \ln {G(y_{t}\vert By_{t-1}, \Lambda)}\) and thus is handled exactly by learning FA, as summarised in Part A of Algorithm 17. With Λ fixed, maximising \(H_{2}(p\Vert q)\) is also decoupled from \(H_{1}(p\Vert q)-\sum _{t} \ln {G(y_{t}\vert By_{t-1}, \Lambda)}\). Also, samples of \(\{y_{t},y_{t-1}\}\) are available from implementing Part A. The problem of maximising \(H_{2}(p\Vert q)\) is equivalent to the special case of a multiple linear regression at μ=0, ν=0 and d=m, and thus B can be learned by Algorithm 14, as summarised in Part B of Algorithm 17.
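The regression view of learning B can be sketched with ordinary least squares on consecutive factor vectors. This is a simplified stand-in for Part B: it assumes the samples \(\{y_{t}\}\) are already available and ignores the prior on the regression matrix that Algorithm 14 takes into account; the function name and the toy sequence are illustrative.

```python
import numpy as np

def estimate_transition(Ys):
    """Least squares estimate of B in y_t ~ B y_{t-1}.

    Ys is a (T, m) array whose rows are y_1, ..., y_T in temporal order.
    """
    prev, curr = Ys[:-1], Ys[1:]
    # Normal equations: B (sum y_{t-1} y_{t-1}^T) = sum y_t y_{t-1}^T
    return np.linalg.solve(prev.T @ prev, prev.T @ curr).T

B_true = np.array([[0.9, 0.2],
                   [-0.1, 0.8]])
seq = [np.array([1.0, 0.0])]
for _ in range(5):
    seq.append(B_true @ seq[-1])   # noise-free sequence for a clean check
Ys = np.array(seq)

B_hat = estimate_transition(Ys)
```

With noisy factors the same call returns the least squares fit, and the residual covariance then gives an estimate of Λ.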
which may be recursively updated from \(\Omega_{0}\) after Λ is updated in Part A and B is updated in Part B.
from which we may get Ω by solving this equation.
where \(y_{t}^{(i)}\) takes either 0 or 1, and s(r) is a sigmoid function, e.g. by Equation 69. This model is called temporal binary factor analysis, shortly temporal BFA.
Similar to Equation 73, learning temporal BFA can be implemented by maximising \(H(p\Vert q)\) in two parts: maximising \(H_{1}(p\Vert q)\) by Algorithm 6 for learning BFA with \(\alpha_{i}\) fixed, as summarised in Part A of Algorithm 18, and maximising \(H_{2}(p\Vert q)\) by Algorithm 16 to learn B, ν for the logistic regression \(y_{t-1}\to y_{t}\), as summarised in Part B of Algorithm 18.
Bilinear matrix system and manifold learning
where \(\mu(AY)\) is an inverse link function of AY, which is linear in either one of A, Y with the other fixed, and \(\mu(\Omega)=[\mu(\omega_{i,j})]\) for a matrix \(\Omega=[\omega_{i,j}]\).
The special cases of the BMS with Equations 77, 78 and 79 all satisfied include FA, BFA, NFA and others. Their corresponding implementations of BYY harmony learning were introduced previously by Algorithms 4 to 16. The special cases with only Equations 78 and 79 satisfied were addressed in Sect. 2 of Xu (2011).
Beyond Equation 78, the generalised BMS models with Equations 77 and 79 satisfied were also previously addressed in Sect. II of Xu (2012b) and Sect. 5 of Xu (2012a). One type is temporal learning featured by autoregression across columns of Y, e.g. by Equation 72, which has been rather extensively studied since 2000 (Xu 2000b, 2001b, 2004a). For a recent summary of TFA studies, see Sect. 5.2 of Xu (2012a).
where a matrix Ω describes the cross-column dependence of the matrix variate U, and a matrix Σ describes the cross-row dependence of U. This matrix distribution is equivalent to a multivariate Gaussian distribution \(G(vec(U)\vert vec(C),\Sigma\otimes\Omega)\).
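The equivalence with a multivariate Gaussian can be checked numerically. The sketch below assumes that vec(·) stacks the rows of U (NumPy's default `ravel`), under which a matrix variate with cross-row covariance Σ and cross-column covariance Ω has vec-covariance Σ⊗Ω; with column stacking the order of the Kronecker factors would be swapped. The matrices are arbitrary illustrative values.

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])        # cross-row dependence, d x d
Omega = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.5, 0.2],
                  [0.0, 0.2, 0.8]])   # cross-column dependence, n x n

A = np.linalg.cholesky(Sigma)         # Sigma = A A^T
B = np.linalg.cholesky(Omega)         # Omega = B B^T

# A zero-mean matrix variate is U = A Z B^T with Z having i.i.d. unit-variance entries.
Z = np.arange(6.0).reshape(2, 3)      # any fixed Z suffices to check the linear map
U = A @ Z @ B.T

vec_U = U.ravel()                     # row-stacking vec
vec_via_kron = np.kron(A, B) @ Z.ravel()

# Covariance of vec(U) when cov(vec(Z)) = I
cov_vec = np.kron(A, B) @ np.kron(A, B).T
```

Since `vec_U` is a fixed linear map of vec(Z), its covariance is the product `cov_vec`, which coincides with `np.kron(Sigma, Omega)`.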
which is a key term in the Laplacian eigenmaps for topologically preserving the neighbourhood relation in manifold learning (Belkin and Niyogi 2003). Differently, the BYY harmony learning obtains \(Y_{*}=\arg\max_{Y}\pi(X_{N},Y,\theta)\) in place of learning an approximate linear mapping \(Y\approx WX\).
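For readers less familiar with Laplacian eigenmaps, the neighbourhood-preserving term can be sketched numerically. The sketch assumes that the term has the usual form \(Tr[YLY^{T}]\) with the graph Laplacian L = D − W built from symmetric nonnegative weights W, and it checks the standard identity \(Tr[YLY^{T}]=\frac{1}{2}\sum_{i,j}w_{i,j}\Vert y_{i}-y_{j}\Vert^{2}\), where \(y_{i}\) is the i-th column of Y. The weights and Y are made-up values.

```python
import numpy as np

# Symmetric neighbourhood weights between N = 4 samples (illustrative values)
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 2.0, 0.0],
              [0.5, 2.0, 0.0, 1.5],
              [0.0, 0.0, 1.5, 0.0]])
L = np.diag(W.sum(axis=1)) - W        # graph Laplacian

# Y has k = 2 rows (manifold dimension) and one column per sample
Y = np.array([[0.1, 0.4, -0.3, 1.0],
              [2.0, -1.0, 0.5, 0.0]])

trace_term = np.trace(Y @ L @ Y.T)

# Weighted sum of squared distances between embedded samples
diffs = Y[:, :, None] - Y[:, None, :]           # shape k x N x N
pair_term = 0.5 * np.sum(W * np.sum(diffs ** 2, axis=0))
```

A small value of this term means that samples with large weights, i.e. neighbours in the observation space, remain close in the embedding.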
This Λ takes a role similar to the one in Algorithm 4. Actually, this situation can be regarded as an extended counterpart of FAb while the situation with T _{ n } by Equation 80 can be regarded as an extended counterpart of FAa. The BYY harmony learning helps to learn Λ for determining an appropriate manifold dimension k, i.e. the row dimension of Y. Following the schematic Algorithm 2, we can develop one detailed BYY harmony learning algorithm for implementing the BMS by Equation 76.
Conceptually, there are also other choices beyond Equation 78, which can be very diversified. The next subsection further examines a family of choices featured with certain decoupled parts of Y.
Decoupled BMS, regulatory networks and LMM model
This formulation extends those previous models for independent factor analyses into their counterparts in a BMS formulation.
where both Λ _{ c },Λ _{ r } are diagonal matrices.
It follows from Theorem 2.3.10 in Gupta and Nagar (1999) that we have \(N(Y_{B}\vert 0,B\Lambda_{c}B^{T},\Lambda_{r})\) with \(YB^{T}=Y_{B}\). Letting \(L^{-1}=BB^{T}\) and E be given by Equation 76, we are led to Equation 80 when \(\Lambda_{r}=I\) and to Equation 81 when \(\Lambda_{r}\neq I\). Generally, we may also consider Equation 78 with elements from other Gaussian and non-Gaussian distributions.
The above relations do not hold for the BYY harmony learning even when Equation 79 holds. Instead, the formulation by Equation 82 is preferred to its counterpart in Equation 76. During the implementation of the BYY harmony learning, either \(q(Y\vert \theta)=N(Y\vert 0,L^{-1},I)\) or \(q(Y\vert \theta)=N(Y\vert 0,L^{-1},\Lambda)\) takes the role of controlling the complexity of Y, i.e. the row dimension and also the matrix sparsity.
Gene regulatory networks (TRN) play an important role in biological networks, and modelling TRN based on gene expression data is one of the major topics in the studies of computational genomics (Bar-Joseph et al. 2012; Karlebach and Shamir 2008; Morris and Mattick 2014). In the previous efforts (Tu et al. 2011, 2012a, 2012b), the BFA and NFA have been applied to model gene transcriptional regulation, which leads to improvements of network component analysis (NCA) (Liao et al. 2003). Still, Equations 82 and 83 with μ(r)=r jointly also provide a new TRN model. Instead of prespecifying the topology of A according to some prior knowledge (Liao et al. 2003), we get the topological information underlying the samples of X by the graph Laplacian L and then get B by \(L^{-1}=BB^{T}\), while A is obtained via learning with or without prespecifying its topology. Also, we may consider a prior of A with the help of Equation 56. During the implementation of the BYY harmony learning, an appropriate number of transcription factors may be determined via learning the diagonal matrix \(\Lambda_{r}\).
which returns back to Equation 82 simply with F=0.
Letting \(AY=Y_{A}\), we have \(E(Y_{A}{Y_{A}^{T}})=AE(YY^{T})A^{T}\). When the columns of Y are i.i.d. from a Gaussian with zero mean and a diagonal covariance matrix \(\Lambda_{r}\), Equation 85 becomes \(X=Y_{A}B^{T}+FZ^{T}+E\) with \(Y_{A}\) denoting random effects and F denoting fixed effects; that is, we are led to the linear mixed model (LMM) when μ(r)=r and the generalised LMM (GLMM) when μ(r)≠r.
Unknowns in LMM or GLMM may be estimated by one of the algorithms developed in the statistics literature under the principle of least squares error or maximum likelihood (Demidenko 2013). Both LMM and GLMM have been applied for modelling various associations in the studies of biology and recently in the studies of computational genomics (Yang et al. 2014; Zhou and Stephens 2014; Zou et al. 2014). The BYY harmony learning provides one alternative method for estimating the unknowns in LMM or GLMM, with the advantage of determining an appropriate row dimension of Y and a sparse matrix A. Conventionally, B and Z are design matrices that are usually prespecified based on given samples and prior knowledge. Also, either or both of B and Z may consist of partially given elements and partially unknown elements to be estimated via learning. One example is shown in Figure four in Xu (2011).
A preservation principle of multiple convex combination
Moreover, the harmony functional \(H(p\Vert q)\) by Equation 9 is an estimation function that comes from a convex combination of infinitely many individual estimation functions featured by the Ying machine \(q(X\vert R)q(R)\) at infinitely many individual values of R, weighted by the Yang machine \(p(R\vert X)p(X)\).
from which we observe the following natures: (a) The gradient field \(\nabla_{\mu} f(\mu)\) is a convex combination of the gradient fields \(\{\nabla _{\mu }\,f_{t}(\mu)\}_{t=1}^{N}\). (b) The root of \(\nabla_{\mu} f(\mu)=0\) is also a convex combination of the roots of \(\{\nabla _{\mu }\,f_{t}(\mu)=0\}_{t=1}^{N}\). (c) The minimum of f(μ) is also a convex combination of the minima of \(\{f_{t}(\mu)\}_{t=1}^{N}\).
These natures are closely related to the first order derivative or the gradient field of estimation functions. The nature (a) describes a global feature of the gradient fields of estimation functions, and the nature (b) describes features within some important local areas (e.g. around the sinks) of these gradient fields. The nature (c) is equivalent to the nature (b) if \(\{f_{t}(\mu)\}_{t=1}^{N}\) have gradient fields; more generally, the nature (c) may even apply to individual estimation functions that do not have gradient fields.
In Equation 87, a convex combination of individual convex functions implies or induces all the above three natures. Given a convex combination of individual convex functions, if it also preserves at least one of the three natures above, we say that it preserves a nature of multiple convex combination (MCC). The classic maximum likelihood learning preserves such an MCC nature too, because both \((1/N)\sum _{t=1}^{N}\ln {q(x_{t}\vert \theta)}\) and \((1/N)\sum _{t=1}^{N}\nabla _{\theta }\ln {q(x_{t}\vert \theta)}\) are convex combinations.
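The three natures can be checked on a toy case. The sketch assumes quadratic individual functions \(f_{t}(\mu)=\Vert x_{t}-\mu\Vert^{2}\) combined with convex weights \(a_{t}\); this specific form and the sample values are assumptions for illustration and need not coincide with Equation 87.

```python
import numpy as np

x = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, -1.0]])           # samples x_t
a = np.array([0.2, 0.5, 0.3])         # convex weights: nonnegative, summing to one

def f(mu):
    """Convex combination f(mu) = sum_t a_t ||x_t - mu||^2."""
    return float(a @ np.sum((x - mu) ** 2, axis=1))

def grad_ft(mu, xt):
    """Gradient of the individual function f_t(mu) = ||x_t - mu||^2."""
    return 2.0 * (mu - xt)

def numerical_grad(fun, mu, eps=1e-6):
    """Central-difference gradient, used as an independent check."""
    g = np.zeros_like(mu)
    for i in range(mu.size):
        step = np.zeros_like(mu)
        step[i] = eps
        g[i] = (fun(mu + step) - fun(mu - step)) / (2.0 * eps)
    return g

mu0 = np.array([1.0, 1.0])
# (a) gradient of f equals the convex combination of individual gradients
grad_combined = a @ np.array([grad_ft(mu0, xt) for xt in x])
grad_direct = numerical_grad(f, mu0)
# (b), (c) each f_t has its root and minimum at x_t; their convex combination
# is the root and minimum of f
root = a @ x
grad_at_root = numerical_grad(f, root)
```

For non-quadratic individual functions, nature (a) still follows from linearity of differentiation, whereas (b) and (c) need not hold with the same weights, which is what motivates treating them as properties to be preserved.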
which is a special case of Equation 9 and thus is still a convex combination of infinitely many individual estimators \(\pi(X_{N},Y,\theta)\) at infinitely many individual values of Y, weighted by the Yang machine \(p(Y\vert \theta,X_{N})\). However, considering the gradient field directly may not preserve the MCC nature.
based on which we may develop a gradient based local search algorithm.
Similarly, one other example can be found in Algorithm 2 and Eq. (10a) in Xu (2009) for learning radial basis functions (RBF) and extensions.
Still, this type of implementation may cause learning instability because the resulting \(p^{new}_{\ell \vert x_{t}}\) may break the constraint \(0 \le p^{new}_{\ell \vert x_{t}}\le 1\).
where the weights \( \sum _{t} b^{(j)}_{t} =1, b^{(j)}_{t} \ge 0\) may differ across j and need not be the same as the weights \(\sum _{t} a_{t} =1, a_{t} \ge 0\).
where φ⊆θ is a subset of parameters to be estimated in our consideration. It can be the entire set of θ or a part of θ. Under this setting, we get the root of ∇_{ φ } H(θ)=0 by a convex combination of the roots of ∇_{ φ } π(X _{ N },Y,θ)=0.
Actually, Algorithm 1, Algorithm 3, Algorithm 5, Algorithm 6 and Algorithm 8, as well as their corresponding EM algorithms, are all examples that pursue this direction. The Yang step or the E step actually gets such a p _{ Y }∈_{ p } while the Ying step or the M step estimates the root of ∇_{ φ } H(θ)=0 by a convex combination of the roots of ∇_{ φ } π(X _{ N },Y,θ)=0.
Comparing ∇_{ φ } H(θ) in Equation 88 and ∇_{ φ } H(θ) in Equation 91, we get an alternative implementation that consists of two steps as follows: (1) Get _{ δ } that consists of p _{ δ }(Yθ,X _{ N }) by Equation 88 at all the possible values of Y. (2) Project the set _{ δ } to the convex set _{ p } under a nearest principle.

Two issues arise in making such a projection. One is the sense in which the projection is nearest, e.g. in a squared distance or an \(L_{1}\) distance. The other is an effective algorithm to find such a projection.
Another important issue is a theoretical guarantee on whether H(θ) keeps increasing or nondecreasing such that learning convergence is guaranteed.
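One possible answer to the projection question can be sketched under strong assumptions: the convex set is taken to be the probability simplex over a finite number of values, and nearest is understood in the squared (Euclidean) distance. The sort-based routine below is a standard way to compute that projection; it is offered as an illustration, not as the projection intended by the two-step implementation above, and the function name and input vector are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1] # last index with a positive gap
    tau = css[rho] / (rho + 1.0)               # common shift
    return np.maximum(v - tau, 0.0)

# An update that has left the feasible region, e.g. after a gradient step
p_raw = np.array([1.2, -0.1, 0.3])
p_new = project_to_simplex(p_raw)

# A vector that is already feasible is left unchanged
p_ok = project_to_simplex(np.array([0.2, 0.3, 0.5]))
```

Such a projection restores the constraint \(0\le p\le 1\) after each update, but by itself it gives no guarantee that H(θ) is nondecreasing, which is the convergence issue raised above.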
Results and discussion

Theoretical aspects and relations to other methods: see Sect. 4.1, Appendix A and B in Xu (2010a), and Sects. 4.1 and 4.2 in Xu (2012a).

Table 3 A foundation period of BYY studies (1995 to 2001)
Year | Outcomes
1995 | The following fundamental points of BYY harmony learning were firstly proposed in Xu (1995):
(a) The BYY system is proposed as a unified perspective for statistical learning.
(b) Under the name of BKYY learning, the Ying-Yang best matching by the minimisation of KL(p(Y|X)p(X)∥q(Y|R)q(Y)) has been proposed for learning parameters θ.
(c) One simplified version of H(θ) is proposed to get a hard-cut version of EM algorithm, see its Eqs. (19) and (20) and a criterion for selecting the number of components in Gaussian mixture (i.e. the cluster number), see Eqs. (22) and (24) in Xu (1995).
(d) One preliminary version of the BYY harmony learning based automatic model selection was presented, see its Sect. 5.2.
(e) The relationship H(p∥q)=H_{RX}−KL(p∥q) by Equation 10 was also firstly identified, see Eqs. (8), (11) and (12) in Xu (1995).
1996 | Points (c)-(d) were verified experimentally in Xu (1996).
1997 | Four progresses are made as follows:
(a) Beyond 1995(d), suggested H(θ) in a general expression as model selection criterion, see Eq. (12) in Xu (1997a) and Eq. (3.8) in Xu (1997b). Also, addressed its special cases on Gaussian mixture.
(b) Proposed to use \({p_{h}^{N}}(X)\) by Equation 8 and learn h for regularisation, see Eq. (3.10) in Xu (1997b). A smoothed EM is proposed for Gaussian mixture, see Eq. (18) in Xu (1997c).
(c) Proposed semisupervised EM algorithm for Gaussian mixture, see Eq. (7.14) in Xu (1997b).
(d) Extended BKYY to BCYY by replacing Kullback divergence with its convex counterpart, see Sect. 5 in Xu (1997a) and Eqs. (19)-(23) in Xu (1997c).
1998 | The following progresses are made:
(a) Proposed equation (A) in Table 2 as a criterion for model complexity, e.g. see Eq. (49) in Xu (1998a) and Eq. (22) in Xu (1998b).
(b) As an exemplar of 1997(a), derived model selection criteria for three-layer net and RBF net (see Eq. (56) and Eqs. (61)-(64) in Xu (1998a)) and also for FA (see Eqs. (37)-(43) in Xu (1998b)).
(c) Beyond 1995(c), developed adaptive EM algorithms for learning RBF net (see Sect. 3.2) and FA (see Sect. 4.2.4) in Xu (1998b) and Sect. 3.2 in Xu (1998c).
1999 | Further efforts are made, among which major ones are as follows:
(a) Beyond 1997(a), proposed a general form for parameter learning and model selection, see Sect. 2 in Xu (1999b), Sect. 2.2 in Xu (1999a), and Sect. 2.2 in Xu (1999c).
(b) Beyond 1997(b), systematically studied data smoothing regularisation in Xu (1999d), with an approximation technique in Equation 18 and estimating techniques for h.
(c) Proposed Taylor expansion approximation by Equation 18 to remove the integral in BYY implementation, see Eq. (90) and Eq. (91) in Xu (1999e), later in the journal papers (Xu 2000c, 2001b).
2000 | In Xu (2000d, 2000a), H(θ) based harmony learning has been elaborated into its present formulation, supported by mathematical analysis on Ying-Yang best harmony versus Ying-Yang best matching, and featured with three innovative points:
(a) Beyond 1999(a), proposed a general form of \(\max_{\theta} H(\theta)\) with automatic model selection, see Eq. (29) in Xu (2000d) and Sect. 4 in Xu (2000a).
(b) Proposed Eq. (23) in Xu (2000a) to implement equation (A) in Table 2 by learning θ with automatic model selection.
(c) Also proposed normalisation regularisation in parallel to data smoothing regularisation in the above 1998(b), see Sect. 2 and Sect. 3 in Xu (2000a) and Eq. (21) in Xu (2000d).
2001 | Further progresses are made as follows:
(a) Used \(p(Y\vert \theta,X_{N})=q(Y\vert \theta,X_{N})\) in Equation 12 to get Yang structure for \(\max_{\theta} H(\theta)\), see Eq. (40) in Xu (2001a), Eqs. (24)-(27) in Xu (2001c).
(b) Developed a BYY harmony learning algorithm for Kernel regression and support vectors, see Sect. 4.5 and Table seven in Xu (2001a).
(c) Understood H(θ) in its general form from an information transferring aspect via three layer encoding, see Sect. 4.3 in Xu (2001c).
(d) Beyond 1998(b), derived model selection criteria for local PCA, see Eq. (23) in Xu (2001c), and local ICA, see Eq. (33) in Xu (2001d).
Table 4 Further advances of BYY studies (2002 to 2013)
Year | BYY harmony learning formulation
2004 |
(a) H(p∥q) in Equation 9 with R={Y,θ} is proposed in Sect. II(B) of Xu (2004b), not only integrating the thread of data smoothing regularisation via 1997(b) and 1999(b) and of normalisation regularisation via 2000(c) into a specific formulation of a priori, but also covering a usual priori q(θ) as a component.
(b) Subsequent elaborations are referred to Sect. 3.4 in Xu (2007a), Sect. 3.4 in Xu (2007b), Sect. 2 in Xu (2008), Eq. (8) in Xu (2009), especially to Sect. 4 in Xu (2010a) and Sect. 4 in Xu (2012a) for recent surveys.
2007 | Beyond 2001(a), efforts on designing the structure of p(R|X) based on q(X|R) and q(R) progress from the early concept of bidirectional architecture further towards:
(a) Either a preservation principle \(p(Y\vert \theta,X_{N})=q(Y\vert \theta,X_{N})\), e.g. by Eq. (40) in Xu (2001a), Eq. (24), and Eq. (27) in Xu (2001c);
(b) Or that \(p(Y\vert \theta,X_{N})\) preserves certain statistics of \(q(Y\vert \theta,X_{N})\), e.g. equal covariance by Eqs. (72)-(73) in Xu (2007a), which are elaborated under the name of uncertainty conversation or variety preservation between Ying and Yang, see pp. 69-72 in Xu (2009), with details referred to Sect. 4.2 in Xu (2010a) and Sect. 3.2.2 in Xu (2012a).
2008 | Learning tasks are summarised into three levels of inverse problems and integrated into a unified representation of BYY system, see Xu (2008, 2009), and an introduction in Sect. 1 of Xu (2010a).
(a) Radon-Nikodym derivative based formulation of Ying-Yang harmony information was proposed, with degenerated cases covering Shannon information and Kullback-Leibler information. Details are referred to Sect. 4.1 in Xu (2010a) and an overview in Figure five of Xu (2012a).
(b) Hierarchical temporal BYY harmony learning was developed in Sect. 5 of Xu (2010a), see Figures twelve and fourteen in Xu (2010a) and Figure eleven in Xu (2012a).
(c) BYY system provides an all-in-one formulation for unsupervised, supervised and semisupervised learning, see Sect. 4.4 in Xu (2010a) and Table two in Xu (2012a).
2011 | Co-dim matrix pair formulation and a hierarchy of co-dim matrix pairs for BYY harmony learning have been proposed, with details referred to Sect. 2.2, Sect. 4 and Figure three in Xu (2011). Its special cases cover not only several typical learning models but also denoised Gaussian mixture (see Algorithm 13), manifold learning as previously discussed about Equation 80, and the dual formulation as previously introduced in Equations 55 and 56.
Type | BYY system design
3A | Starting from the very beginning in Xu (1995), BYY system was classified into three architectures (3A), i.e. forward architecture with q(X|R) in a free structure, backward architecture with p(R|X) in a free structure, and bidirectional architecture with both q(X|R) and p(R|X) in parametric structures, rather thoroughly examined before the mid-2000s (Xu 2000c, 2001a, 2001e, 2002, 2003a, 2004a, 2004c).
3P | Focuses are turned to three principles (3P) for designing the structures of each component in a BYY system, i.e. the principle of least redundancy for q(Y), the principle of divide-and-conquer for q(X|R), and the principle of uncertainty conversation or variety preservation for p(Y|X), as stated above by the items 2007(a) and (b). An overview is referred to Figure three in Xu (2012a).

Algorithms and applications: see the roadmaps in Figure three and Figure eleven of Xu (2010a), also in Figure one of Xu (2011) and Sect. 5 of Xu (2012a), plus recent applications in (Pang et al. 2013; Shi et al. 2011a, 2011b, 2011c, 2014; Tu and Xu 2011a; Tu et al. 2011, 2012a, 2012b; Tu and Xu 2014; Wang et al. 2011).

Outlines on major topics in Xu (2012a): see Sect. 7 for 3 topics on statistical learning in general, 8 topics on BYY system, 13 topics on best harmony learning and 4 topics on implementation, as well as 15 topics on exemplar learning tasks and algorithms. Readers are also referred to Sect. 3.2 and Sect. 3.4 on topics and demanding issues about BYY system design, and to Sect. 4.2.3 on novelty and features of best harmony theory.
The harmonising dynamics discussed previously in Figure 3 and the corresponding subsection may also be observed from this perspective. At the centre of Figure 5, the bottom of the Yin-Yang logo has a black centre, which is usually called a fish eye. This indicates the output of A5, while its surrounding white ring indicates the Yang domain. The starting part of the Yang arrow indicates A1 for picking samples in the Yang domain to get \({p_{h}^{N}}(X)\), and the arrow ends at the white fish eye on the top, implementing A2 by \(p(Y,\theta\vert X)\). On the other hand, the black ring surrounding the white fish eye indicates the Ying domain that collects all the candidates as well as the associated evidences. The starting part of the Ying arrow indicates A4 for choosing good candidates probabilistically via \(q(X\vert Y,\theta)\), and the arrow ends at the bottom black fish eye and completes one cycle.
As addressed by Equation 27 and the discussions thereafter, the signal η is measured at the two fish eyes and also modulated by the inner attention of the system. A small η reflects either a bad Ying-Yang mutual agreement (a big mismatch to the desire) in the top fish eye or a bad fitting in the bottom fish eye.
A poor performance is incurred by a poor selection of Y at A4, resulting in a small value of η that is fed back to A2 to harmonise the attempts of updating θ. With such a negative feedback mechanism, the dynamics of information harmonising is stabilised. Interestingly, such a mechanism is executed in a pattern ‘A2/Huo modulates A4/Jin’, which complies with the classic ‘XiangKe’ principle of the Chinese TCM WuXing theory. In other words, the ‘XiangKe’ principle can be regarded as an ancient negative feedback principle.
Conclusions
Based on Lagrange variety preservation of Yang structure, this paper proposes a generic framework of dynamic BYY harmony learning, which not only unifies attention, detection, problem-solving, adaptation, learning and model selection from an information harmonising perspective but also provides a new type of Ying-Yang alternative nonlocal search to overcome a dilemma of suboptimal solution versus learning instability typically suffered by the existing Ying-Yang alternative nonlocal search. Algorithms are developed for learning Gaussian mixture, factor analysis (FA), mixture of local FA, binary FA, non-Gaussian FA, denoised Gaussian mixture, sparse multivariate regression, temporal FA and temporal binary FA, as well as a generalised bilinear matrix system that covers not only these linear models but also manifold learning, gene regulatory networks and the generalised linear mixed model. These algorithms feature not only a favourable nature of automatic model selection but also a unified formulation for performing unsupervised learning and semisupervised learning. Moreover, a principle of preserving multiple convex combinations is also proposed to improve the BYY harmony learning, which leads to another type of Ying-Yang alternative nonlocal search.
Declarations
Acknowledgements
This work was supported by a CUHK Direct grant project 4055025 and a startingup for the ZhiYuan chair professorship by Shanghai Jiao Tong University.
References
 Akaike, H (1974) A new look at the statistical model identification. Automatic Control IEEE Trans 19(6): 716–723.
 Akaike, H (1987) Factor analysis and aic. Psychometrika 52(3): 317–332.
 Barron, A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. Inf Theory IEEE Trans 44(6): 2743–2760.
 Bartels, RH, Stewart G (1972) Solution of the matrix equation ax + xb = c. Commun ACM 15(9): 820–826.
 Bar-Joseph, Z, Gitter A, Simon I (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nature Rev Genet 13(8): 552–564.
 Belkin, M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15(6): 1373–1396.
 Chen, G, Heng PA, Xu L (2014) Projection-embedded byy learning algorithm for gaussian mixture-based clustering. Appl Inf 1(2): 1–20.
 Corduneanu, A, Bishop CM (2001) Variational bayesian model selection for mixture distributions. In: Artificial Intelligence and Statistics, 27–34. Morgan Kaufmann, Waltham, MA.
 Dayan, P, Hinton GE, Neal RM, Zemel RS (1995) The helmholtz machine. Neural Comput 7(5): 889–904.
 Dempster, AP, Laird NM, Rubin DB, et al. (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc 39(1): 1–38.
 Demidenko, E (2013) Mixed Models: Theory and Applications with R. John Wiley & Sons, Hoboken, New Jersey.
 Diaconis, P, Ylvisaker D, et al. (1979) Conjugate priors for exponential families. Ann Stat 7(2): 269–281.
 Dutilleul, P (1999) The mle algorithm for the matrix normal distribution. J Stat Comput Simul 64(2): 105–123.
 Fang, SC, Rajasekera JR, Tsao HSJ (1997) Entropy Optimization and Mathematical Programming, Vol. 8. Springer, New York.
 Figueiredo, MAF, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24: 381–396.
 Floudas, CA, Visweswaran V (1995) Quadratic optimization. In: Handbook of Global Optimization, 217–269. Springer, New York.
 Gupta, AK, Nagar DK (1999) Matrix Variate Distributions, Vol. 104. CRC Press, Chapman & Hall, Boca Raton, Florida.
 Hoerl, RW (1985) Ridge analysis 25 years later. Am Stat 39(3): 186–192.
 Jeffreys, H (1946) An invariant form for the prior probability in estimation problems. Proc R Soc Lond. Series A. Math Phys Sci 186(1007): 453–461.
 Jordan, MI, Ghahramani Z, Jaakkola TS, Saul LK (1999) An introduction to variational methods for graphical models. Mach Learn 37(2): 183–233.
 Karlebach, G, Shamir R (2008) Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol 9(10): 770–780.
 Liao, JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP (2003) Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci 100(26): 15522–15527.
 McGrory, CA, Titterington DM (2007) Variational approximations in bayesian model selection for finite mixture distributions. Comput Stat Data Anal 51: 5352–5367.
 Miyajima, S (2013) Fast enclosure for solutions of sylvester equations. Linear Algebra Appl 439(4): 856–878.
 Morris, KV, Mattick JS (2014) The rise of regulatory rna. Nature Rev Genet 15(6): 423–437.
 Ntzoufras, I, Tarantola C (2013) Conjugate and conditional conjugate bayesian analysis of discrete graphical models of marginal independence. Comput Stat Data Anal 66: 161–177.
 Pang, Z, Tu S, Wu X, Xu L (2013) Discriminative gmm-hmm acoustic model selection using two-level bayesian ying yang harmony learning. In: Intelligent Science and Intelligent Data Engineering, 719–726. Springer, Berlin Heidelberg.
 Redner, RA, Walker HF (1984) Mixture densities, maximum likelihood and the em algorithm. SIAM Rev 26(2): 195–239.
 Rissanen, J (1978) Modeling by shortest data description. Automatica 14(5): 465–471.
 Rubin, DB, Thayer DT (1982) Em algorithms for ml factor analysis. Psychometrika 47(1): 69–76.
 Schwarz, G (1978) Estimating the dimension of a model. Ann Stat 6(2): 461–464.
 Shi, L, Tu S, Xu L (2011a) Learning gaussian mixture with automatic model selection: A comparative study on three bayesian related approaches. Front Electrical Electronic Eng China 6(2): 215–244.
 Shi, L, Tu SK, Xu L (2011b) Learning gaussian mixture with automatic model selection: a comparative study on three bayesian related approaches. Front Electr Electron Eng China 6: 215–244. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (B).
 Shi, L, Wang P, Liu H, Xu L, Bao Z (2011c) Radar hrrp statistical recognition with local factor analysis by automatic bayesian ying-yang harmony learning. Signal Process IEEE Trans 59(2): 610–617.
 Shi, L, Liu ZY, Tu S, Xu L (2014) Learning local factor analysis versus mixture of factor analyzers with automatic model selection. Neurocomputing 139: 3–14.
 Tibshirani, R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58: 267–288.
 Tikhonov, A, Goncharsky A, Stepanov V, Yagola A (1995) Numerical methods for the solution of ill-posed problems. Kluwer Academic, Netherlands.
 Tipping, ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc: Series B (Statistical Methodology) 61(3): 611–622.
 Tu, SK, Xu L (2011a) Parameterizations make different model selections: empirical findings from factor analysis. Front Electr Electron Eng China 6: 256–274. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (B).
 Tu, S, Xu L (2011b) An investigation of several typical model selection criteria for detecting the number of signals. Front Electr Electron Eng China 6(2): 245–255.
 Tu, SK, Chen RS, Xu L (2011) A binary matrix factorization algorithm for protein complex prediction. Proteome Sci 9(Suppl 1): 18.
 Tu, S, Chen R, Xu L (2012a) Transcription network analysis by a sparse binary factor analysis algorithm. J Integrative Bioinformatics 9(2): 198.
 Tu, S, Luo D, Chen R, Xu L (2012b) A non-gaussian factor analysis approach to transcription network component analysis. In: Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2012 IEEE Symposium On, 404–411. IEEE.
 Tu, S, Xu L (2014) Learning binary factor analysis with automatic model selection. Neurocomputing 134: 149–158.
 Wallace, CS, Dowe DL (1999) Minimum message length and kolmogorov complexity. Comput J 42(4): 270–283.
 Wang, P, Shi L, Du L, Liu H, Xu L, Bao Z (2011) Radar hrrp statistical recognition with temporal factor analysis by automatic bayesian ying-yang harmony learning. Front Electr Electron Eng China 6(2): 300–317.
 Xu, L, Krzyzak A, Oja E (1992) Unsupervised and supervised classifications by rival penalized competitive learning. In: Pattern Recognit, 1992. Vol. II. Conference B: Pattern Recognition Methodology and Systems, Proceedings., 11th IAPR International Conference On, 496–499. IEEE, New Jersey.
 Xu, L, Krzyzak A, Oja E (1993) Rival penalized competitive learning for clustering analysis, rbf net, and curve detection. Neural Netw IEEE Trans 4(4): 636–649.
 Xu, L (1995) Bayesian-kullback coupled ying-yang machines: Unified learnings and new results on vector quantization. In: Proc. Int. Conf. Neural Information Process (ICONIP ’95), 977–988. Publishing House of Electronics Industry, Beijing.
 Xu, L (1996) How many clusters?: A ying-yang machine based theory for a classical open problem in pattern recognition. In: Neural Netw, 1996., IEEE International Conference On, 1546–1551. IEEE, New Jersey.
 Xu, L, Jordan MI (1996) On convergence properties of the em algorithm for gaussian mixtures. Neural Comput 8(1): 129–151.
 Xu, L (1997a) Bayesian ying–yang machine, clustering and number of clusters. Pattern Recognit Lett 18(11): 1167–1178.
 Xu, L (1997b) Bayesian ying yang system and theory as a unified statistical learning approach: (i) unsupervised and semi-unsupervised learning. In: Brain-like Computing and Intelligent Information Systems, 241–274. Springer-Verlag, Berlin Heidelberg.
 Xu, L (1997c) Bayesian ying yang system and theory as a unified statistical learning approach (ii): from unsupervised learning to supervised learning and temporal modeling. In: Proceedings of Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, 25–42. Springer, Berlin Heidelberg.
 Xu, L (1998a) Rbf nets, mixture experts, and bayesian ying–yang learning. Neurocomputing 19(1–3): 223–257.
 Xu, L (1998b) Bayesian kullback ying–yang dependence reduction theory. Neurocomputing 22(1): 81–111.
 Xu, L (1998c) Bayesian ying-yang dimension reduction and determination. J Comput Intell Finance 6(5): 11–16.
 Xu, L (1998d) Bkyy dimension reduction and determination. In: Neural Netw Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference On, 1822–1827. IEEE, New Jersey.
 Xu, L (1999a) Temporal byy learning and its applications to extended kalman filtering, hidden markov model, and sensor-motor integration. In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 949–954. IEEE, New Jersey.
 Xu, L (1999b) Bayesian ying yang theory for empirical learning, regularisation and model selection: general formulation. In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 552–557. IEEE, New Jersey.
 Xu, L (1999c) Bayesian ying yang supervised learning, modular models, and three layer nets. In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 540–545. IEEE, New Jersey.
 Xu, L (1999d) Byy data smoothing based learning on a small size of samples. In: Neural Netw, 1999. IJCNN’99. International Joint Conference On, 546–551. IEEE, New Jersey.
 Xu, L (1999e) Byy ying yang unsupervised and supervised learning: theory and applications. In: Neural Netw and Signal Processing, Proceedings of 1999 Chinese Conference On, 112–29. Publishing house of Electronic industry, Beijing.
 Xu, L (2000a) Byy prodsum factor systems and harmony learning. invited talk. In: Proceedings of International Conference on Neural Information Processing (ICONIP’2000), 548–558, KAIST, Taejon.
 Xu, L (2000b) Temporal byy learning for state space approach, hidden markov model, and blind source separation. Signal Process IEEE Trans 48(7): 2132–2144.
 Xu, L (2000c) Byy learning system and theory for parameter estimation, data smoothing based regularisation and model selection. Neural Parallel Sci Comput 8(1): 55–83.
 Xu, L (2000d) Best harmony learning. In: Intelligent Data Engineering and Automated Learning (IDEAL 2000). Data Mining, Financial Engineering, and Intelligent Agents, 116–125. Springer, Berlin Heidelberg.
 Xu, L (2001a) Best harmony, unified rpcl and automated model selection for unsupervised and supervised learning on gaussian mixtures, three-layer nets and me-rbf-svm models. Int J Neural Syst 11(01): 43–69.
 Xu, L (2001b) Byy harmony learning, independent state space, and generalised apt financial analyses. Neural Netw IEEE Trans 12(4): 822–849.
 Xu, L (2001c) Byy harmony learning, model selection, and information approach: Further results. In: Neural Information Processing (ICONIP’2001), 2001. Proceedings International Joint Conference On, 30–37. APPNA, Shanghai.
 Xu, L (2001d) Byy harmony learning, local independent analyses, and apt financial applications. In: Neural Netw, 2001. Proceedings. IJCNN’01. International Joint Conference On, 1817–1822. IEEE, New Jersey.
 Xu, L (2001e) An overview on unsupervised learning from data mining perspective. In: Advances in Self-Organising Maps, 181–209. Springer, Berlin Heidelberg.
 Xu, L (2002) Byy harmony neural networks, structural rpcl, and topological self-organizing on mixture models. Neural Netw 15: 1125–1151.
 Xu, L (2003a) Independent component analysis and extensions with noise and time: a bayesian ying-yang learning perspective. Neural Inf Process Lett Rev 1: 1–52.
 Xu, L (2003b) Data smoothing regularization, multi-sets-learning, and problem solving strategies. Neural Netw 16: 817–825.
 Xu, L (2004a) Temporal byy encoding, markovian state spaces, and space dimension determination. Neural Netw IEEE Trans 15(5): 1276–1295.
 Xu, L (2004b) Advances on byy harmony learning: information theoretic perspective, generalized projection geometry, and independent factor autodetermination. Neural Netw IEEE Trans 15(4): 885–902.
 Xu, L (2004c) Bidirectional byy learning for mining structures with projected polyhedra and topological map. In: Proceedings of IEEE ICDM2004 Workshop on Foundations of Data Mining, 2–14. ICDM, Brighton.
 Xu, L (2007a) A unified perspective and new results on rht computing, mixture based learning, and multi-learner based problem solving. Pattern Recognit 40: 2129–2153.
 Xu, L (2007b) A trend on regularization and model selection in statistical learning: A bayesian ying yang learning perspective. In: Challenges for Computational Intelligence, 365–406. Springer, Berlin Heidelberg.
 Xu, L (2008) Bayesian ying yang system, best harmony learning, and gaussian manifold based family. In: Computational Intelligence: Research Frontiers, 48–78. Springer, Berlin Heidelberg.
 Xu, L (2009) Learning algorithms for rbf functions and subspace based functions. In: E S Olivas e.a. (ed) Handbook of Research on Machine Learning, Applications and Trends: Algorithms, Methods and Techniques, 60–94. IGI Global, Hershey, PA.
 Xu, L (2010a) Bayesian ying-yang system, best harmony learning, and five action circling. Front Electr Electron Eng China 5: 281–328. A special issue on Emerging Themes on Information Theory and Bayesian Approach.
 Xu, L (2010b) Machine learning problems from optimization perspective. J Global Optimization 47(3): 369–401.
 Xu, L (2011) Co-dimensional matrix pairing perspective of byy harmony learning: hierarchy of bilinear systems, joint decomposition of data-covariance, and applications of network biology. Front Electr Electron Eng China 6: 86–119. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (A).
 Xu, L (2012a) On essential topics of byy harmony learning: current status, challenging issues, and gene analysis applications. Front Electr Electron Eng China 7: 147–196.
 Xu, L (2012b) Semi-blind bilinear matrix system, byy harmony learning, and gene analysis applications. In: Proceedings of The 6th International Conference on New Trends in Information Science, Service Science and Data Mining, 661–666. AICIT, Taipei.
 Yang, J, Zaitlen NA, Goddard ME, Visscher PM, Price AL (2014) Advantages and pitfalls in the application of mixed-model association methods. Nat Genet 46(2): 100–106.
 Zhou, X, Stephens M (2014) Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods 11(4): 407–409.
 Zou, J, Lippert C, Heckerman D, Aryee M, Listgarten J (2014) Epigenome-wide association studies without the need for cell-type composition. Nat Methods 11(3): 309–311.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.