
Projection-embedded BYY learning algorithm for Gaussian mixture-based clustering


On learning the Gaussian mixture model, existing BYY learning algorithms are featured by a gradient-based line search with an appropriate stepsize. Learning becomes either unstable if the stepsize is too large or slow and gets stuck in a local optimal solution if the stepsize is too small. An algorithm without a learning stepsize has been proposed with expectation-maximization (EM)-like two alternating steps. However, its learning process may still be unstable. This paper tackles this problem of unreliability with a modified algorithm called the projection-embedded Bayesian Ying-Yang learning algorithm (pBYY). Experiments have shown that pBYY outperforms learning algorithms developed from not only minimum message length with the Jeffreys prior (MML-Jef) and variational Bayesian with the Dirichlet-normal-Wishart (VB-DNW) prior but also BYY with these priors (BYY-Jef and BYY-DNW). pBYY obtains this superiority with an easy implementation, while the DNW prior-based learning algorithms suffer from a complicated and tedious computational load. The performance of pBYY has also been demonstrated on the Berkeley Segmentation Dataset for the task of unsupervised image segmentation. The resulting performance on semantic image segmentation shows that pBYY outperforms not only MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW but also three leading image segmentation algorithms, namely gPb-owt-ucm, MN-Cut, and mean shift.



The Gaussian mixture model (GMM) has been widely used in different areas, e.g., clustering, image segmentation (Zhang et al. [2001]), speaker identification (Reynolds [1995]), document classification (Nigam et al. [2000]), and market analysis (Chiu and Xu [2001]). Learning a GMM consists of parameter learning for estimating all unknown parameters and model selection for determining the number of Gaussian components k. Parameter learning is usually implemented under the maximum likelihood principle by an expectation-maximization (EM) algorithm (Redner and Walker [1984]). A conventional model selection approach is featured by a two-stage implementation, which suffers from a huge computational cost because it requires parameter learning for each candidate GMM. Moreover, parameter learning becomes less reliable as k becomes larger, which implies more free parameters.

One road to tackling these problems is referred to as automatic model selection, which automatically determines k during parameter learning. An early effort is rival penalized competitive learning (RPCL) (Xu et al. [1992]; Xu [1998]), with the number k automatically determined during learning. Automatic model selection may also be approached via appropriate priors on unknown parameters by Bayesian approaches. Two examples are minimum message length (MML) (Figueiredo and Jain [2002]) and variational Bayesian (VB) (Corduneanu and Bishop [2001]). First proposed in (Xu [1995]) and systematically developed in the past two decades, Bayesian Ying-Yang (BYY) learning provides not only new model selection criteria but also a family of learning algorithms that are capable of automatic model selection during parameter learning, with details referred to the recent tutorial and survey by (Xu [2010], [2012]).

A systematic comparison has recently been made by (Shi et al. [2011]) among MML, VB, and BYY with two types of priors. One is the Jeffreys prior, and the other is a parametric conjugate prior that imposes a Dirichlet prior on mixing weights and a joint normal-Wishart prior on mean vectors and covariance matrices, shortly denoted as DNW. The automatic model selection performances of these approaches are evaluated through extensive experiments, with several interesting empirical findings. Among them, it has been shown that BYY considerably outperforms both VB and MML. Different from VB and MML, which rely on appropriate priors to perform model selection, BYY is capable of selecting the model automatically even without imposing any priors on parameters, while its performance can be further improved with appropriate priors incorporated. Similar findings have also been obtained in (Zhu et al. [2013]), where a simplified BYY learning algorithm with DNW priors is shown to outperform, or at least be competitive with, existing state-of-the-art image segmentation methods.

The algorithms in (Shi et al. [2011]) for implementing BYY are featured by a gradient-based line search with an appropriate stepsize. Learning becomes either unstable if this stepsize is too large or slow and gets stuck in a local optimal solution if the stepsize is too small. Given in Algorithm two of (Xu [2009]) and Equation (11) of (Xu [2010]), there is a Ying-Yang two-step alternation algorithm that, similar to the EM algorithm, requires no learning stepsize for the learning procedure. However, the Ying step (Xu [2010]) ignores the constraint that the covariance matrix of each Gaussian component must be a positive definite matrix, so the learning procedure may become unstable.

To constrain each covariance matrix to be a positive definite matrix, this paper introduces a projection operation into the Yang step, which results in a modified algorithm called the projection-embedded BYY learning algorithm, shortly denoted as pBYY. To facilitate its implementation, we also add a Kullback–Leibler divergence-based indicator into the algorithm to improve the detection of redundant Gaussian components. Experiments have shown that pBYY significantly outperforms not only the Jeffreys-based MML (Figueiredo and Jain [2002]) and the DNW-based VB but also the BYY learning algorithms with these two types of priors (Shi et al. [2011]), and it further avoids the complicated and tedious computation brought by the DNW prior.

Gaussian mixture model and four learning principles

GMM assumes that an observation $x\in\mathbb{R}^d$ is drawn from the following mixture of $k$ Gaussian distributions:

$$q(x|\theta)=\sum_{i=1}^{k}\alpha_i G(x|\mu_i,\Sigma_i),\qquad \theta=\{\alpha_i,\mu_i,\Sigma_i\}_{i=1}^{k},\quad \alpha_i\ge 0,\ \sum_{i=1}^{k}\alpha_i=1, \tag{1}$$

where G(x|μ,Σ) denotes a Gaussian density with a mean μ and a covariance matrix Σ.
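As a concrete illustration, the mixture density of Equation (1) can be evaluated numerically. The following Python/NumPy sketch is ours and not part of the original paper; the function names `gaussian_pdf` and `gmm_density` are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density G(x | mu, Sigma) of a d-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def gmm_density(x, alphas, mus, sigmas):
    """Mixture density q(x | theta) = sum_i alpha_i G(x | mu_i, Sigma_i), Eq. (1)."""
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))
```

For instance, a two-component mixture in one dimension is evaluated by calling `gmm_density` with lists of weights, means, and covariance matrices.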

GMM can also be regarded as a latent variable model by introducing a binary latent vector $y=[y_1,y_2,\dots,y_k]^T$, subject to $y_i\in\{0,1\},\ \forall i$, and $\sum_{i=1}^{k}y_i=1$; the latent variable $y_i=1$ means that the random variable $x$ is drawn from the $i$th Gaussian component. The generative process of an observation $x$ is interpreted as follows: $y$ is sampled from a multinomial distribution with probabilities $\alpha$, and then $x$ is randomly generated by the $i$th Gaussian component with $y_i=1$. Let $X\in\mathbb{R}^{d\times n}$ denote the set of $n$ i.i.d. $d$-dimensional observation samples and $Y\in\mathbb{R}^{k\times n}$ denote the set of latent vectors for the observable set $X$; we have the following:

$$q(X,Y|\theta)=q(X|Y,\theta)\,q(Y|\theta),\quad q(X|Y,\theta)=\prod_{t=1}^{n}\prod_{i=1}^{k}G(x_t|\mu_i,\Sigma_i)^{y_{it}},\quad q(Y|\theta)=\prod_{t=1}^{n}\prod_{i=1}^{k}\alpha_i^{y_{it}}. \tag{2}$$

Learning a GMM consists of parameter learning for estimating all the unknown parameters in θ and model selection for determining the number of Gaussian components k, which can be implemented differently under different learning principles.

The most widely used principle is maximum likelihood (ML); that is, we estimate $\theta$ by

$$\max_{\theta}\ q(X|\theta),\quad q(X|\theta)=\sum_{Y}q(X|Y,\theta)\,q(Y|\theta)=\prod_{t=1}^{n}q(x_t|\theta). \tag{3}$$

The ML learning with a known $k$ is typically made by the well-known EM algorithm (Redner and Walker [1984]). However, an unknown $k$ is poorly estimated by Equation (3) when the sample number $n$ is not large enough. The task of determining an appropriate $k$ is called model selection, which is usually made in a two-stage implementation with the help of a model selection criterion. However, such a two-stage implementation suffers from a huge computational cost and an unreliable estimation. These problems are tackled by automatic model selection, which automatically determines $k$ during the learning of $\theta$ without such a two-stage implementation.
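To make the discussion concrete, here is a minimal sketch of one EM iteration for a GMM under the ML principle of Equation (3). This is a standard textbook implementation in Python/NumPy, not the paper's Algorithm 1, and the function name `em_step` is ours:

```python
import numpy as np

def em_step(X, alphas, mus, sigmas):
    """One EM iteration for a GMM: the E step computes the posteriors
    p(i | x_t, theta); the M step re-estimates alpha_i, mu_i, Sigma_i."""
    n, d = X.shape
    k = len(alphas)
    # E step: responsibilities p[t, i] proportional to alpha_i G(x_t | mu_i, Sigma_i)
    dens = np.empty((n, k))
    for i in range(k):
        diff = X - mus[i]
        inv = np.linalg.inv(sigmas[i])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigmas[i]))
        dens[:, i] = alphas[i] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
    p = dens / dens.sum(axis=1, keepdims=True)
    # M step: closed-form updates of the mixture parameters
    new_alphas = p.mean(axis=0)
    new_mus = [(p[:, i:i + 1] * X).sum(axis=0) / p[:, i].sum() for i in range(k)]
    new_sigmas = []
    for i in range(k):
        diff = X - new_mus[i]
        new_sigmas.append((p[:, i:i + 1] * diff).T @ diff / p[:, i].sum())
    return new_alphas, new_mus, new_sigmas
```

Iterating `em_step` from a reasonable initialization converges to a local maximum of the likelihood for a fixed $k$.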

There are three Bayesian-related learning principles that can be implemented with such a property of automatic model selection.

One is minimum message length (MML) (Wallace and Dowe [1999]), which is actually an information-theoretic restatement of Occam's razor. MML was introduced to learn GMM with the property of automatic model selection (Figueiredo and Jain [2002]). Learning is made by the following maximization:

$$\max_{\theta}\ J_{\mathrm{MML}}(X|\theta),\quad J_{\mathrm{MML}}(X|\theta)=\ln q(X|\theta)+\ln q(\theta)-\frac{1}{2}\ln|I(\theta)|, \tag{4}$$

where $|I(\theta)|$ represents the determinant of the Fisher information matrix with respect to (w.r.t.) $\theta$. Equation (4) is mathematically equivalent to a maximum a posteriori (MAP) approach that modifies a proper prior $q(\theta)$ to be proportional to $q(\theta)/|I(\theta)|^{1/2}$.

Using the Jeffreys prior $q(\theta)\propto|I(\theta)|^{1/2}$ directly, Equation (4) degenerates to the ML learning principle. To avoid this situation, Figueiredo and Jain ([2002]) considered the following:

$$\ln\!\left[\frac{q(\theta)}{|I(\theta)|^{1/2}}\right]=-\frac{\rho}{2}\sum_{i=1}^{k}\ln\alpha_i-\frac{k(\rho+1)}{2}\ln N, \tag{5}$$

where $\rho=d+0.5d(d+1)$ is the number of free parameters in each Gaussian component. In (Shi et al. [2011]), it has been shown that some improvement can be obtained by an algorithm that implements the MML principle with the help of a Dirichlet prior and a joint normal-Wishart prior (shortly, the DNW prior).

Another Bayesian-related learning principle is variational Bayesian (VB) (Corduneanu and Bishop [2001]). The naive Bayes approach considers $q(X|\theta)q(\theta)$ with a prior $q(\theta)$ that takes a strong role. Unfortunately, a poor $q(\theta)$ may seriously affect the learning performance. Such a bad influence can be smoothed out by considering the following marginal distribution:

$$q(X)=\int q(X|\theta)\,q(\theta)\,d\theta. \tag{6}$$

However, the integral makes this difficult to compute. VB tackles this difficulty by constructing a lower bound $J_{\mathrm{VB}}$ with the help of Jensen's inequality as follows:

$$\max\ J_{\mathrm{VB}},\quad J_{\mathrm{VB}}=\int p(\theta,Y|X)\ln\frac{q(X,Y|\theta)\,q(\theta)}{p(\theta,Y|X)}\,dY\,d\theta,\quad \ln q(X)\ge J_{\mathrm{VB}}. \tag{7}$$

The goal is to choose a suitable posterior distribution $p(\theta,Y|X)$ from a distribution family, so that the lower bound $J_{\mathrm{VB}}$ can be readily evaluated and yet is sufficiently flexible. One challenge is to provide a suitable distribution family. In (Corduneanu and Bishop [2001]), the posterior distribution is approximately factorized as follows:

$$p(\theta,Y|X)=p(Y|X)\prod_{i}p(\theta_i|X).$$

With $q(X,Y|\theta)$ by Equation (2) and a DNW prior $q(\theta)$, the above $p(\theta_i|X)$ can be obtained with $p(Y|X)$ and $p(\theta_j|X),\ j\ne i$, given by the following equation (Bishop and Nasrabadi [2006]):

$$p(\theta_i|X)=\frac{\exp\!\left\{\int\prod_{j\ne i}p(Y|X)\,p(\theta_j|X)\ln q(X,Y,\theta)\,d\theta\,dY\right\}}{\int\exp\!\left\{\int\prod_{j\ne i}p(Y|X)\,p(\theta_j|X)\ln q(X,Y,\theta)\,d\theta\,dY\right\}d\theta_i}. \tag{8}$$

A tight bound cannot be obtained by Equation (8), which affects the learning performance. Also, DNW is quite tedious, with hyperparameters $\lambda,\xi,m_i,\beta,\Phi,\gamma$ to be updated, which is time-consuming and may fall into a local optimum. To avoid the tedious computation of the DNW prior-based VB, an algorithm for implementing the VB principle was developed in (Shi et al. [2011]) with the help of the Jeffreys prior via approximately using a block-diagonal complete-data Fisher information matrix (Figueiredo and Jain [2002]).

The last Bayesian-related principle is BYY harmony learning. First proposed in (Xu [1995]) and systematically developed in the past two decades, BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing learning regularization, and a class of algorithms that approach automatic model selection during parameter learning. Readers are referred to (Xu [2010], [2012], [2014]) for the latest systematic introductions to BYY harmony learning.

Briefly, a BYY system consists of a Yang machine and a Ying machine, corresponding to two types of decomposition, namely Yang $p(R|X)p(X)$ and Ying $q(X|R)q(R)$, respectively, where the data $X$ is regarded as generated from its inner representation $R=\{Y,\theta\}$ that consists of latent variables $Y$ and parameters $\theta$, supported by a hyperparameter set $\Xi$. The harmony measure is mathematically expressed as follows:

$$H(p\|q)=\int p(R|X)\,p(X)\ln\left[q(X|R)\,q(R)\right]dX\,dR. \tag{9}$$

Maximizing this $H(p\|q)$ leads to not only a best matching between the Ying-Yang pair but also a compact model with the least complexity. Such an ability can be observed from several perspectives (see Section 4 in (Xu [2010])).

Applied to GMM by Equation (2), we have $R=\{Y,\theta\}$ and $q(R)=q(Y|\theta)q(\theta|\Xi)$. Comparing Equation (9) with Equation (7), the key difference is that there is only $q(X,Y|\theta)q(\theta)$ inside the bracket $\ln[\cdot]$ for BYY harmony learning, while there is also a denominator $p(\theta,Y|X)$ for VB learning. Maximizing $J_{\mathrm{VB}}$ leads to a best match between $q(X,Y|\theta)q(\theta)$ and $p(\theta,Y|X)$, while maximizing $H(p\|q)$ leads to not only such a best match but also a modeling of $q(X,Y|\theta)q(\theta)$ with the least complexity. Readers are referred to Section 4 and figure five in (Xu [2012]) for various aspects of this key difference, as well as how they relate to and differ from MML and minimum description length (MDL) (Barron et al. [1998]; Rissanen [1978]).

Maximizing $H(p\|q)$ leads to specific algorithms according to not only what type of $q(\theta|\Xi)$ is chosen for the Ying machine but also how the structure of $p(\theta,Y|X)$ is designed for the Yang machine. Details are referred to Section 4.2 in (Xu [2010]) and Section 3.2 in (Xu [2012]). For the GMM by Equation (2), we introduce two typical examples here.

One example is $p(\theta,Y|X)$ given by Equation (8) together with a DNW prior. Putting them into Equation (9), the DNW prior-based BYY harmony learning algorithm has been developed for maximizing $H(p\|q)$ in (Shi et al. [2011]). Extensive experiments have shown that the DNW prior-based BYY considerably outperforms both VB and MML for any type of prior, whether or not hyperparameters are optimized. When the hyperparameters of the DNW prior are optimized by its corresponding learning principle, BYY further improves its performance and outperforms the others significantly, because learning hyperparameters is a part of the entire BYY harmony learning. However, both VB and MML deteriorate when there are too many free hyperparameters; the performance of VB, especially, drops drastically. The reason is that VB and MML maximize the marginal likelihood via variational approximation and Laplace approximation, respectively, where maximizing the marginal likelihood with respect to a free prior $q(\theta|\Xi)$ makes it tend to the maximum likelihood.

Another example is the following structure:

$$p(\theta,Y|X)=p(Y|X,\theta)\,p(\theta|X),\quad p(Y|X,\theta)=\frac{q(X,Y|\theta)}{\int q(X,Y|\theta)\,dY},\quad p(\theta|X)\ \text{is free of structure}. \tag{10}$$

Maximizing $H(p\|q,\Xi)$ with respect to $p(\theta|X)$ simplifies Equation (9) into

$$\max_{\theta}\ H(\theta),\quad H(\theta)=H_0(\theta)+\ln q(\theta),\quad H_0(\theta)=\sum_{t=1}^{n}\sum_{i=1}^{k}p(i|x_t,\theta)\ln\left[\alpha_i G(x_t|\mu_i,\Sigma_i)\right],\quad p(i|x_t,\theta)=\frac{\alpha_i G(x_t|\mu_i,\Sigma_i)}{\sum_{j=1}^{k}\alpha_j G(x_t|\mu_j,\Sigma_j)}. \tag{11}$$

Automatic model selection and two-step alternation

Given a known $k$, learning the unknown parameters $\theta$ of a GMM is usually implemented under the maximum likelihood principle by an EM algorithm (Redner and Walker [1984]), which is one typical instance of Algorithm 1 featured by a two-step alternation. As remarked at the bottom of the table, we get the EM algorithm after simply removing the lines of trimming, with

$$p_{it}=p(i|x_t,\theta^{\mathrm{new}}),\quad \eta_i=0,\ \rho_i=0,\quad i=1,\dots,k, \tag{12}$$

where $p(i|x_t,\theta)$ is the Bayesian posterior probability given as follows:

$$p(i|x_t,\theta)=\frac{\alpha_i G(x_t|\mu_i,\Sigma_i)}{\sum_{j=1}^{k}\alpha_j G(x_t|\mu_j,\Sigma_j)},\quad \theta=\{\theta_i\}_{i=1}^{k},\ \theta_i=\{\alpha_i,\mu_i,\Sigma_i\}. \tag{13}$$

Generally, $\eta_i,\rho_i$ come from a prior distribution that takes a regularization role. This role is shut off by simply setting them to zero. When $\eta_i=0,\rho_i>0$, the EM algorithm is extended to the smoothed EM algorithm that was first proposed in 1997 (Xu [2010]). Also, we get the EM algorithm for naive Bayes with the Jeffreys prior on $\alpha_i,\Sigma_i$ with

$$\eta_i=\frac{d+0.5d(d+1)}{2n},\quad \rho_i=\frac{d}{2n}. \tag{14}$$

An unknown $k$ is poorly estimated via ML learning by Equation (3), especially when the sample number $n$ is not large enough. The task of determining an appropriate $k$ is handled by model selection, which is usually made in a two-stage implementation. The first stage enumerates $k$ to get a set of candidate models, with the unknown parameters of each candidate estimated by the EM algorithm. In the second stage, the best candidate is selected by a model selection criterion. Examples of such criteria include Akaike's information criterion (AIC) (Akaike [1974]), the Bayesian inference criterion (BIC), and the minimum description length (MDL) criterion (which stems from another viewpoint but coincides with BIC when simplified to an analytically computable criterion) (Barron et al. [1998]; Rissanen [1978]). However, this two-stage implementation suffers from a huge computational cost because it requires parameter learning for each $k$. Moreover, a larger $k$ often implies more unknown parameters; thus, parameter estimation becomes less reliable and the criterion evaluation loses accuracy (see Section 2.1 in (Xu [2010]) for a detailed discussion).

One road to tackling these problems is referred to as automatic model selection, which means automatically determining an appropriate $k$ during parameter learning. An early effort is RPCL (Xu et al. [1992]; Xu [1998]). The key idea is that not only does the winning Gaussian component move a little bit to adapt to the current sample, but the rival (i.e., the second winner) Gaussian component is also repelled a little bit from this sample to reduce a duplicated information allocation. As a result, an extra Gaussian component is driven far away from the data.

A batch learning version of RPCL learning may also be obtained as one instance of Algorithm 1, simply with

$$p_{\ell t}=\begin{cases}1, & \ell=\ell^{*}=\arg\max_{j}p(j|x_t,\theta^{\mathrm{new}}),\\ -\gamma, & \ell=\arg\max_{j\ne\ell^{*}}p(j|x_t,\theta^{\mathrm{new}}),\\ 0, & \text{otherwise,}\end{cases} \tag{15}$$

by which learning is made on a cluster when $p_{\ell t}=1$ and penalizing or de-learning is made on a cluster when $p_{\ell t}=-\gamma$. Usually, the penalizing strength is set to $\gamma\approx 0.005\sim 0.05$. When $\gamma=0$, it degenerates to the so-called hard-cut EM algorithm; see Equations (19) and (20) in (Xu [1995]).

According to its general formulation (e.g., see the last part of Section 2.1 in (Xu [2010])), automatic model selection is a nature of learning a mixture of k individual substructures with the following two features:

There is an indicator $\Psi_j(\theta)$ on $\theta$ or its subset, based on which a particular structural component $j$ can be effectively discarded if its corresponding $\Psi_j(\theta)\to 0$. Taking the GMM as an example, we may consider

$$\Psi_j(\theta)=\alpha_j,\quad \text{or}\quad \Psi_j(\theta)=\alpha_j\,\mathrm{Tr}\!\left[\Sigma_j\right]. \tag{16}$$

With the initial $k$ large enough, there is an intrinsic mechanism that drives such an indicator $\Psi_j(\theta)$ towards zero if the corresponding structure is redundant, so that it can be effectively discarded.

The three Bayesian-related approaches introduced in the previous subsection can all be implemented with such a nature of automatic model selection. For both MML and VB, this nature comes from an appropriate prior $q(\theta|\Xi)$. Favorably, BYY is capable of automatic model selection even without imposing any priors on the parameters, and its performance can be further improved as appropriate priors are incorporated. Actually, the BYY harmony learning by maximizing $H(p\|q)$ bases model selection on $q(R)=q(Y|\theta)q(\theta|\Xi)$, with $q(Y|\theta)$ in a role that is not only equally important to $q(\theta|\Xi)$ but also easy to compute, while $q(\theta|\Xi)$ is still handled in a way similar to MML and VB.

The BYY harmony learning by Equation (11) can be implemented by Algorithm 1, with the Yang step given as follows:

$$p_{it}=p_{it}(\theta^{\mathrm{new}}),\quad p_{it}(\theta)=p(i|x_t,\theta)\left[1+\delta_{i,t}(\theta)\right],\quad \delta_{i,t}(\theta)=\pi_t(\theta_i)-\sum_{i'}p(i'|x_t,\theta)\,\pi_t(\theta_{i'}),\quad \pi_t(\theta_i)=\ln\left[\alpha_i G(x_t|\mu_i,\Sigma_i)\right]. \tag{17}$$

This algorithm implements BYY harmony learning without a prior $\ln q(\theta)$ in Algorithm 1 by simply setting $\eta_i=0,\rho_i=0$, or a data-smoothing-based BYY harmony learning when $\eta_i=0,\rho_i>0$. Readers are referred to Section 3.1 of (Xu [2010]) for further details. Also, we may implement the Jeffreys prior-based BYY harmony learning by using Equation (14); see table one in (Shi et al. [2011]).
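To show how the Yang step of Equation (17) differs from the E step of EM, the following Python/NumPy fragment (ours, not from the paper) computes $p_{it}(\theta)$ from the log joint $\pi_t(\theta_i)=\ln[\alpha_i G(x_t|\mu_i,\Sigma_i)]$; the name `byy_yang_step` is illustrative:

```python
import numpy as np

def byy_yang_step(log_joint):
    """Modified BYY posteriors p_it = p(i|x_t) * (1 + delta_it), Eq. (17),
    given log_joint[t, i] = pi_t(theta_i) = ln[alpha_i G(x_t | mu_i, Sigma_i)]."""
    # Bayes posterior p(i | x_t, theta), computed stably in log space
    m = log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint - m)
    post /= post.sum(axis=1, keepdims=True)
    # delta_it = pi_t(theta_i) - sum_i' p(i'|x_t) pi_t(theta_i')
    delta = log_joint - (post * log_joint).sum(axis=1, keepdims=True)
    return post * (1.0 + delta)
```

Note that each row of the output still sums to one (since the posterior-weighted average of $\delta_{i,t}$ vanishes), yet individual entries may be negative, which is exactly the behavior discussed in the next section.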


Learning unreliability and convex combination

As introduced in Section 3.1 of (Xu [2010]), the existing algorithms for implementing the BYY principle come from taking the gradient of $H(\theta)$ by Equation (11) w.r.t. a subset $\phi$ of parameters. That is, we consider

$$\nabla_{\phi}H(\theta)=\nabla_{\phi}H_0(\theta)+\nabla_{\phi}\ln q(\theta),\quad \nabla_{\phi}H_0(\theta)=\sum_{t=1}^{n}\sum_{i=1}^{k}p_{i,t}(\theta)\,\nabla_{\phi}\pi_t(\theta_i), \tag{18}$$

with $p_{i,t}(\theta)$ and $\pi_t(\theta_i)$ given in Equation (17).

Based on this gradient, one attempt to update the parameters is a gradient-based local search. The parameter $\phi$ can be updated iteratively as below:

$$\phi^{\mathrm{new}}=\phi^{\mathrm{old}}+\eta\,\nabla_{\phi}H(\theta), \tag{19}$$

where $\eta>0$ is a small learning stepsize. Both the BYY learning algorithm given in figure seven of (Xu [2010]) and the BYY-Jef algorithm given in table one of (Shi et al. [2011]) are derived from Equation (19) with the help of some computing tricks and simplifications. However, the performance of such algorithms all depends on an appropriate stepsize. Learning becomes either unstable if $\eta$ is too large or slow and gets stuck in a local optimum if $\eta$ is too small. No such learning stepsize is required by EM algorithms.

Another typical implementation attempts to perform the BYY harmony learning by Equation (11) also in a Ying-Yang two-step alternation, as previously suggested in Section 2.1 and table one of (Xu [2012]). This two-step alternation algorithm is actually derived by approximately letting $p_{i,t}(\theta)$ in Equation (18) be fixed at its value $p_{it}=p_{it}(\theta^{\mathrm{new}})$, such that we can solve the root of $\nabla_{\phi}H(\theta)=0$ subject to this fixation to get the Ying step in Algorithm 1.

Still, there is a lack of theoretical analysis that either guarantees the learning convergence or provides the convergence conditions. On the contrary, we find empirically that the learning process of this BYY two-step alternation may become unstable.

Actually, the root of $\nabla_{\phi}H(\theta)=0$ subject to $p_{it}=p_{it}(\theta^{\mathrm{new}})$ can deviate considerably from the true root of $\nabla_{\phi}H(\theta)=0$, since this true root is coupled with $p_{it}(\theta)$ that varies with $\theta$. Not only is correctly solving the root of $\nabla_{\phi}H(\theta)=0$ a challenging task, but it is also unclear whether fixing $p_{it}=p_{it}(\theta^{\mathrm{new}})$ makes the learning procedure become unstable.

From the likelihood by Equation (3) and Equation (1), it can be observed that

$$\nabla_{\phi}\ln q(X|\theta)=\sum_{t=1}^{n}\sum_{i=1}^{k}p(i|x_t,\theta)\,\nabla_{\phi}\pi_t(\theta_i), \tag{20}$$

with $p(i|x_t,\theta)$ given in Equation (13). Fixing $p(i|x_t,\theta)=p_{it}=p(i|x_t,\theta^{\mathrm{new}})$, solving the root of $\nabla_{\phi}\ln q(X|\theta)=0$ leads to the Ying step in Algorithm 1, or precisely the M step of the EM algorithm, while letting $p_{it}=p(i|x_t,\theta^{\mathrm{new}})$ is just the E step of the EM algorithm. As is well known, the convergence of the EM algorithm has been theoretically proved. That is, though the root of $\nabla_{\phi}\ln q(X|\theta)=0$ is also coupled with $p(i|x_t,\theta)$ that varies with $\theta$, this deviation actually does not affect the convergence.

The difference between $p_{it}=p(i|x_t,\theta^{\mathrm{new}})$ and $p_{it}=p_{it}(\theta^{\mathrm{new}})$ is that $p(i|x_t,\theta),\ i=1,\dots,k$, remain probabilities for any $\theta$, while $p_{it}(\theta),\ i=1,\dots,k$, given in Equation (17) are no longer probabilities and may even take negative values sometimes. Thus $p_{it}(\theta^{\mathrm{new}})$ is sparser than $p(i|x_t,\theta^{\mathrm{new}})$, and the Yang step in the BYY theory introduces a nature of automatic model selection into the iteration procedure.

To further investigate the influence of replacing $p(i|x_t,\theta^{\mathrm{new}})$ by $p_{it}(\theta^{\mathrm{new}})$, we now focus on the Ying step, which can be reformulated as below:

$$\alpha_i=\frac{\sum_{t=1}^{n}p_{it}}{n},\quad \mu_i=\frac{\sum_{t=1}^{n}p_{it}\,x_t}{\sum_{t=1}^{n}p_{it}},\quad \Sigma_i=\frac{\sum_{t=1}^{n}p_{it}\,(x_t-\mu_i)(x_t-\mu_i)^{T}}{\sum_{t=1}^{n}p_{it}}. \tag{21}$$
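The updates of Equation (21) admit a direct implementation. The Python/NumPy sketch below is our illustration (with `ying_step` a hypothetical name); it accepts any weight matrix `p`, whether its entries are probabilities or not:

```python
import numpy as np

def ying_step(X, p):
    """Ying step, Eq. (21): re-estimate alpha_i, mu_i, Sigma_i from the
    (possibly negative) weights p[t, i]; identical to the EM M step when
    p holds Bayes posteriors."""
    n, d = X.shape
    k = p.shape[1]
    alphas = p.sum(axis=0) / n
    mus, sigmas = [], []
    for i in range(k):
        w = p[:, i]
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        sigmas.append((w[:, None] * diff).T @ diff / w.sum())
        mus.append(mu)
    return alphas, mus, sigmas
```

With one-hot (hard assignment) weights this reduces to per-cluster sample means and covariances.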

For the EM algorithm, both $\mu_i$ and $\Sigma_i$ are constrained in the convex hulls spanned by $x_t$ and $(x_t-\mu_i)(x_t-\mu_i)^{T}$, respectively, because its $p_{it}$ still remain in the probability space. However, in the BYY algorithm, the $p_{it}$ are no longer probabilities and may even take negative values sometimes. Thus, $\mu_i$ and $\Sigma_i$ may break out of their corresponding convex hulls. For GMM, the model parameters $\theta$ must satisfy the following constraints:

$$\sum_{i=1}^{k}\alpha_i=1,\ \alpha_i\ge 0,\ \forall i\in\{1,2,\dots,k\};\qquad \Sigma_i\in\mathbb{S}_{+}^{d\times d},\ \forall i\in\{1,2,\dots,k\}, \tag{22}$$

where $\mathbb{S}_{+}^{d\times d}$ denotes the set of positive semidefinite matrices of size $d\times d$. Thus the updated $\alpha_i$ and $\Sigma_i$ in BYY may sometimes fall outside their feasible regions. Instead of projecting $\alpha_i$ and $\Sigma_i$ onto the set of positive semidefinite matrices directly, we are motivated to project $\nabla_{\phi}H_0(\theta)$ back to the convex hull of the local gradients $\nabla_{\phi}\pi_t(\theta_i),\ t=1,\dots,n$, via projecting $p_{it}(\theta^{\mathrm{new}})$ onto the following set of probabilities, so as to preserve more information about $\alpha_i$ and $\Sigma_i$:

$$\mathbb{P}=\left\{(p_1,\dots,p_k):\ p_i\ge 0,\ \sum_{i=1}^{k}p_i=1\right\}. \tag{23}$$

For updating each mean vector $\mu_i$, we are encouraged to use $p_{it}(\theta^{\mathrm{new}})$ directly, because the updating equation of $\mu_i$ is then no longer a convex combination of all observable samples, and the redundant components can be pushed outside the convex hull; thus, this operation accelerates the speed of model selection.

The relative structure among the original $\{p_{it}(\theta^{\mathrm{new}})\}$ is encoded by the position of the vector $p_t^{H}=\left[p_{1t}(\theta^{\mathrm{new}}),\dots,p_{kt}(\theta^{\mathrm{new}})\right]^{T}$ in $\mathbb{R}^k$. Projecting $p_t^{H}$ from $\mathbb{R}^k$ to $\mathbb{P}$ in Equation (23) means finding a vector $p_t=[p_{1t},\dots,p_{kt}]^{T}\in\mathbb{P}$ that is the nearest one to $p_t^{H}$ and thus best keeps the relative structure within the elements of $p_t^{H}$. To be specific, we choose the nearest one in the sense of the least square distance; that is, we consider the following optimization problem:

$$p_t=\arg\min_{p\in\mathbb{P}}\ \left\|p-p_t^{H}\right\|^{2}. \tag{24}$$

The above implementation may be regarded as a two-step approach to the BYY harmony learning by Equation (17) under a principle of multiple convex combination preservation (Xu [2014]).
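For reference, the exact solution of Equation (24), i.e., the Euclidean projection onto the probability simplex, can be computed by a well-known sort-based routine in $O(k\log k)$. The Python/NumPy sketch below is ours and is a standard algorithm from the optimization literature, not the fast approximation that the paper develops next:

```python
import numpy as np

def project_to_simplex(v):
    """Exact Euclidean projection of v onto the probability simplex
    P = {p : p_i >= 0, sum_i p_i = 1}, solving Eq. (24) by the standard
    sort-and-threshold algorithm."""
    k = len(v)
    u = np.sort(v)[::-1]                      # sort coordinates descending
    css = np.cumsum(u)
    # largest index rho with u[rho] + (1 - css[rho]) / (rho + 1) > 0
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, k + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)        # shift that enforces sum = 1
    return np.maximum(v + tau, 0.0)
```

Points already in the simplex are returned unchanged, while infeasible points are shifted and clipped.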

Fast approximation and pBYY-Jef algorithm

The problem in Equation (24) is often encountered in the literature of applied mathematics and scientific computing and is tackled by several algorithms, such as variants of the method of alternating projections (Bauschke and Borwein [1993]) and variants of Dykstra's algorithm (Bauschke and Borwein [1994]). However, these algorithms suffer from a huge computing cost, especially on a large-size data set.

Alternatively, we propose a fast approximation algorithm with two steps, motivated by Kolmogorov's criterion (see Chapter 1 of (Escalante and Raydan [2011])). Let $\Pi_S(x)$ denote the projection point of an arbitrary point $x\in\mathbb{R}^n$ onto a non-empty closed convex set $S\subseteq\mathbb{R}^n$; Kolmogorov's criterion states that $z^{*}=\Pi_S(x)$ if and only if $z^{*}\in S$ and $(z-z^{*})^{T}(x-z^{*})\le 0$ for all $z\in S$, from which we can get the following:

Theorem 1.

Let $F_p=\left\{(p_1,\dots,p_k):\sum_{i=1}^{k}p_i=1\right\}$ with $\mathbb{P}\subset F_p$. Then $\Pi_{\mathbb{P}}(x)=\Pi_{\mathbb{P}}\!\left(\Pi_{F_p}(x)\right)$ for an arbitrary point $x\in\mathbb{R}^k$.


Proof. Let $z'=\Pi_{F_p}(x)$, $z^{*}=\Pi_{\mathbb{P}}(x)$, and $z''=\Pi_{\mathbb{P}}(z')$. From $(z-z'')^{T}(z'-z'')\le 0$ for all $z\in\mathbb{P}$, we have $(z-z'')^{T}(z'-x+x-z'')\le 0$, or $(z-z'')^{T}(z'-x)+(z-z'')^{T}(x-z'')\le 0$. It follows that $(z-z'')^{T}(z'-x)=0$, since $z'=\Pi_{F_p}(x)$ is the projection point of $x$ onto the hyperplane $F_p$, and thus $z'-x$ is orthogonal to the vector $z-z''$ that lies in this hyperplane $F_p$. Therefore, we get the inequality $(z-z'')^{T}(x-z'')\le 0$, which holds for all $z\in\mathbb{P}$; thus $z''=z^{*}$ according to Kolmogorov's criterion. □

Based on this theorem, we split the projection into two steps. First, we consider the following orthogonal projection of p t H onto the hyperplane F p :

$$f_t=\left(I-\mathbf{n}\mathbf{n}^{T}\right)\left(p_t^{H}-f_0\right)+f_0,\qquad f_0=\frac{1}{k}\mathbf{1}, \tag{25}$$

where $\mathbf{n}=\frac{1}{\sqrt{k}}\mathbf{1}$ is the unit normal vector of the hyperplane $\sum_{i=1}^{k}p_i=1$, $f_0$ is the center point of the closed convex set $\mathbb{P}$, and all elements of $\mathbf{1}\in\mathbb{R}^{k\times 1}$ are equal to 1.

Second, we further project $f_t$ onto $\mathbb{P}$. However, accurately calculating the projection point is still very time-consuming. Instead, we consider a fast approximation along the line between $f_t$ and $f_0$ as follows:

$$p_t=\lambda f_0+(1-\lambda)f_t, \tag{26}$$

with the minimum $\lambda$ that makes $p_t$ lie within $\mathbb{P}$.
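A minimal sketch of this two-step approximation (Equations (25) and (26)) in Python/NumPy follows; the closed form for the minimum $\lambda$ from the negative coordinates is our derivation of the feasibility condition, and the function name is illustrative:

```python
import numpy as np

def fast_simplex_projection(p_h):
    """Two-step approximate projection onto the simplex, Eqs. (25)-(26):
    first an orthogonal projection onto the hyperplane sum_i p_i = 1,
    then a shrink toward the simplex center f0 with the smallest lambda
    that makes all coordinates non-negative."""
    k = len(p_h)
    f0 = np.full(k, 1.0 / k)             # center of the simplex
    n = np.full(k, 1.0 / np.sqrt(k))     # unit normal of the hyperplane
    # Eq. (25): f_t = (I - n n^T)(p_h - f0) + f0
    diff = p_h - f0
    f = diff - n * (n @ diff) + f0
    # Eq. (26): p = lambda * f0 + (1 - lambda) * f, with the smallest
    # feasible lambda; each negative coordinate f_i requires
    # lambda / k + (1 - lambda) * f_i >= 0, i.e. lambda >= -f_i / (1/k - f_i)
    neg = f < 0.0
    if not neg.any():
        return f
    lam = np.max(-f[neg] / (1.0 / k - f[neg]))
    return lam * f0 + (1.0 - lam) * f
```

The output always sums to one, and the binding negative coordinate is mapped exactly to zero.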

In summary, we get a modified algorithm as one new instance of Algorithm 1. Its Ying step remains unchanged, but its Yang step gets $\{p_{it}(\theta^{\mathrm{new}})\}$ by Equation (17) and then makes the nearest projection onto $\mathbb{P}$ by Equation (25) and Equation (26). For clarity, we rewrite Algorithm 1 in a detailed form in Algorithm-2, which is dedicated to implementing this projection-embedded BYY learning (shortly named pBYY).

The pBYY implementation repeats the Ying step and the Yang step alternately. It exits the loop in two cases. One is that learning is finally completed as the loop converges with an unchanged $k$. The other is after trimming one Gaussian component, with $k$ reduced by 1, after which it goes to the line of initialization and starts a new loop. This re-initialization is helpful to avoid the accumulation of estimation bias, though it requires extra computing cost. Whether we need it depends on a trade-off between computing cost and estimation accuracy. We may remove this re-initialization by simply deleting the line 'go to Initialization'.

Trimming a Gaussian component is based on an indicator $\Psi_j(\theta)$ as given in Equation (16). Empirically, we find that there are scenarios where such an indicator is insufficient, and we add the following new indicator for detection:

$$\mathrm{KL}_{ij}=\int G(x|\mu_i,\Sigma_i)\ln\frac{G(x|\mu_i,\Sigma_i)}{G(x|\mu_j,\Sigma_j)}\,dx=\frac{1}{2}\left[\ln\frac{|\Sigma_j|}{|\Sigma_i|}-d+\mathrm{Tr}\!\left(\Sigma_j^{-1}\Sigma_i\right)+d_{i,j}^{M}\right],\quad d_{i,j}^{M}=(\mu_i-\mu_j)^{T}\Sigma_j^{-1}(\mu_i-\mu_j). \tag{27}$$

That is, we use the Kullback–Leibler (KL) divergence to measure the similarity between two Gaussian components. When $\mathrm{KL}_{ij}$ becomes close to 0 for some $j\ne i$, we may regard the $i$th Gaussian component as redundant and thus discard it.
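Equation (27) is the familiar closed form for the KL divergence between two Gaussians; a direct Python/NumPy transcription (ours, with an illustrative function name) is:

```python
import numpy as np

def gauss_kl(mu_i, sig_i, mu_j, sig_j):
    """KL divergence KL(G_i || G_j) between two Gaussians, Eq. (27)."""
    d = len(mu_i)
    inv_j = np.linalg.inv(sig_j)
    diff = mu_i - mu_j
    mahal = diff @ inv_j @ diff                               # d^M_{i,j}
    logdet = np.log(np.linalg.det(sig_j) / np.linalg.det(sig_i))
    return 0.5 * (logdet - d + np.trace(inv_j @ sig_i) + mahal)
```

The divergence vanishes exactly when the two components coincide, which is the redundancy condition the indicator detects.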

Results and discussion

Performance measures and algorithms

When samples lie in a space of dimension less than 3, we can visualize them and judge the clustering performance manually. However, samples are usually located in a high-dimensional space in practical problems. Also, human evaluation is too subjective. In this paper, we consider four typical measures for clustering performance and model selection on the number of clusters.

First, a traditional criterion to measure the performance of model selection is the correct selection rate (CSR), namely how many times the algorithm gets the accurate number of clusters among a large number of trials. Sometimes, this criterion is argued to be too strict. For example, suppose there exist four clusters in the set of observation samples. If an algorithm splits one cluster into two but gets the other three clusters correctly, this trial gets a zero count in computing CSR, though the clustering result still has some reasonable interpretation.

Second, one popular measure in the current literature is the variation of information (VI), which evaluates the distance between one clustering result $C$ and the ground truth $C'$ as follows:

$$\mathrm{VI}(C,C')=H(C)+H(C')-2\,\mathrm{MI}(C,C'),\quad H(C)=-\sum_{i=1}^{k}P(i)\log_2 P(i),\quad \mathrm{MI}(C,C')=\sum_{i=1}^{k}\sum_{j=1}^{m}P(i,j)\log_2\frac{P(i,j)}{P(i)P'(j)},\quad P(i)=\frac{|C_i|}{N},\ P(i,j)=\frac{|C_i\cap C'_j|}{N}, \tag{28}$$

with $|C_i|$ denoting the size of cluster $C_i$, where we get $k$ clusters $\{C_i\}$ in clustering $C$ and $m$ clusters $\{C'_j\}$ in clustering $C'$. Here MI denotes the mutual information, which describes how much we can reduce the uncertainty about the cluster of a random sample when knowing its cluster in another clustering of the same set of observation samples (Wagner and Wagner [2007]). The smaller the VI value is, the better the performance is.
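A minimal Python/NumPy sketch of VI by Equation (28), using the identity $\mathrm{VI}=H(C|C')+H(C'|C)$, follows; the helper name `variation_of_information` is ours:

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """VI(C, C'), Eq. (28): H(C) + H(C') - 2 MI(C, C'), in bits,
    computed from two label vectors over the same n samples."""
    n = len(labels_a)
    vi = 0.0
    for a in np.unique(labels_a):
        for b in np.unique(labels_b):
            p_ab = np.sum((labels_a == a) & (labels_b == b)) / n
            if p_ab == 0.0:
                continue
            p_a = np.sum(labels_a == a) / n
            p_b = np.sum(labels_b == b) / n
            # per-cell VI contribution: -p_ab * log2(p_ab^2 / (p_a p_b))
            vi -= p_ab * np.log2(p_ab * p_ab / (p_a * p_b))
    return vi
```

VI is zero exactly when the two clusterings agree up to a relabeling, which makes it a proper distance between partitions.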

The last popular measure is the probabilistic Rand index (PRI). It further considers partitioning the set of all (unordered) pairs of observation samples into the disjoint union of the following sets: $R_{11}$ = {pairs that are in the same cluster under both $C$ and $C'$}; $R_{00}$ = {pairs that are in different clusters under both $C$ and $C'$}; $R_{10}$ = {pairs that are in the same cluster under $C$ but in different ones under $C'$}; $R_{01}$ = {pairs that are in different clusters under $C$ but in the same one under $C'$}.

Assume that each sample is randomly assigned to one cluster. The probability that two samples are in the same cluster in both partitions is $p_{11}=\frac{1}{k}\cdot\frac{1}{m}$. Corresponding to $R_{10}$, $R_{01}$, and $R_{00}$, we get $p_{10}=\frac{1}{k}\left(1-\frac{1}{m}\right)$, $p_{01}=\left(1-\frac{1}{k}\right)\frac{1}{m}$, and $p_{00}=\left(1-\frac{1}{k}\right)\left(1-\frac{1}{m}\right)$. Then, PRI can be expressed as follows (Carpineto and Romano [2012]):

$$\mathrm{PRI}(C, C') = \frac{w_{11} n_{11} + w_{00} n_{00}}{w_{11} n_{11} + w_{10} n_{10} + w_{01} n_{01} + w_{00} n_{00}},$$

where $n_{ab} = |R_{ab}|$ and $w_{ab} = -\log_2(p_{ab})$ for $a, b \in \{0, 1\}$. A simple analysis shows that PRI varies between 0 (no agreement on any pair of samples between the clusterings $C$ and $C'$) and 1 (the two clusterings are identical).
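Putting the pair counts and the weights together, a direct (quadratic-time) Python sketch of PRI; k and m must both be at least 2 so that no $p_{ab}$ vanishes (names are ours):

```python
from itertools import combinations
from math import log2

def probabilistic_rand_index(labels_a, labels_b, k, m):
    """PRI of two clusterings, with weights from random assignment
    to k and m clusters; requires k >= 2 and m >= 2."""
    n11 = n00 = n10 = n01 = 0
    for p, q in combinations(range(len(labels_a)), 2):
        same_a = labels_a[p] == labels_a[q]
        same_b = labels_b[p] == labels_b[q]
        if same_a and same_b:
            n11 += 1
        elif not same_a and not same_b:
            n00 += 1
        elif same_a:
            n10 += 1
        else:
            n01 += 1
    # w_ab = -log2(p_ab) under random assignment
    w11 = -log2((1 / k) * (1 / m))
    w10 = -log2((1 / k) * (1 - 1 / m))
    w01 = -log2((1 - 1 / k) * (1 / m))
    w00 = -log2((1 - 1 / k) * (1 - 1 / m))
    num = w11 * n11 + w00 * n00
    den = w11 * n11 + w10 * n10 + w01 * n01 + w00 * n00
    return num / den
```

Identical clusterings give $n_{10} = n_{01} = 0$ and hence PRI = 1, matching the analysis above.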

Moreover, one popular application of clustering algorithms is image segmentation. To evaluate the performance of semantic image segmentation, one widely used measure is the covering rate (CR) (Richardson and Green [1997]), by which a larger CR value indicates a better performance.
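The text does not spell out how CR is computed. Assuming it follows the region-covering measure of the BSDS benchmark (each ground-truth region weighted by its size and matched to its best-overlapping machine region under intersection-over-union — an assumption on our part), a sketch could look like:

```python
import numpy as np

def covering_rate(seg, gt):
    """Covering of ground-truth segmentation `gt` by machine
    segmentation `seg` (both integer label arrays of equal shape).
    ASSUMED definition: size-weighted best Jaccard overlap per
    ground-truth region; the paper does not define CR explicitly."""
    seg = np.asarray(seg).ravel()
    gt = np.asarray(gt).ravel()
    n = seg.size
    total = 0.0
    for g in np.unique(gt):
        mask_g = gt == g
        best = 0.0
        # only machine regions that intersect this gt region can win
        for s in np.unique(seg[mask_g]):
            mask_s = seg == s
            inter = np.logical_and(mask_g, mask_s).sum()
            union = np.logical_or(mask_g, mask_s).sum()
            best = max(best, inter / union)
        total += mask_g.sum() * best
    return total / n
```

Under this definition a segmentation identical to the ground truth scores CR = 1.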

We aim at comparing the proposed Algorithm 2 with the typical algorithms investigated in (Shi et al. [2011]). For clarity, we summarize as follows: BYY-Jef and BYY-DNW both come from Tables 1 and 6 in (Shi et al. [2011]); MML-Jef was taken from Table 2 in (Shi et al. [2011]), the same as the one given in (Figueiredo and Jain [2002]); VB-DNW was taken from Table 6 in (Shi et al. [2011]), the same as the one given in (Bishop and Nasrabadi [2006]; Corduneanu and Bishop [2001]).

All algorithms are programmed in MATLAB R2010b on a 32-bit PC with 3.1 GHz Intel Core i5-2400 CPU and 4 GB memory.

All data sets and source codes used in this paper can be downloaded from the website

Empirical comparison

We start with three types of synthetic data sets illustrated in Figure 1. Each type of data set is processed in 500 independent trials with random initializations. In the algorithm implementations, the mean vector of each Gaussian component is initialized randomly, and the initial mixing weight and initial covariance matrix of each Gaussian component are computed with the help of the k-means algorithm.
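A minimal Python sketch of such an initialization (the paper's code is in MATLAB; here a single nearest-mean assignment stands in for a full k-means run, and all names and the regularization constant are ours):

```python
import numpy as np

def init_gmm(X, k, seed=0):
    """Initialize a k-component GMM: random means drawn from the data,
    then mixing weights and covariances from a hard nearest-mean
    assignment (one k-means-style E-step)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]  # distinct samples
    # assign each sample to its nearest mean
    dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    weights = np.array([(labels == j).mean() for j in range(k)])
    covs = []
    for j in range(k):
        Xj = X[labels == j]
        if len(Xj) > d:
            covs.append(np.cov(Xj.T) + 1e-6 * np.eye(d))
        else:  # too few samples: fall back to the global covariance
            covs.append(np.cov(X.T) + 1e-6 * np.eye(d))
    return weights, means, np.array(covs)
```

Starting the number of components well above the true one (k = 20 in Table 1) lets the automatic model selection prune the extra components during learning.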

Figure 1

Three synthetic data sets and their ground-truth clusters. For both (a) and (c), data sets of 60 samples are generated from a 2-dimensional 4-component GMM with equal mixing weights of 1/4; for (b), 75 samples are generated from a 2-dimensional 5-component GMM with equal mixing weights of 1/5. (Each red curve indicates a contour of equal probability density per component, and the red diamonds indicate the Gaussian means.)

The performance comparisons are shown in Table 1. We observe that pBYY significantly outperforms all the other algorithms in almost all the cases, without using any prior. The only exception occurs on the data set GMM-b, where BYY-DNW scored the best VI value, though pBYY got a very close value. We also observe how the choice of an appropriate learning stepsize affects the performance of BYY-Jef and BYY-DNW. Closely tied to the configurations of the data sets, this choice is a difficult task. On the configuration type of GMM-b, similar to the data sets studied in (Shi et al. [2011]), experiments reconfirm the statement that BYY outperforms its counterparts of VB and MML (Shi et al. [2011]). However, the statement seemingly no longer holds for the configuration types of GMM-a and GMM-c, probably due to inappropriate learning stepsizes. Favorably, this statement has been reconfirmed by pBYY on the data sets of GMM-a and GMM-c with the re-initialization period T_b set to 5; namely, pBYY still significantly outperforms not only VB-DNW and MML-Jef but also BYY-Jef and BYY-DNW.

Table 1 Performance of each algorithm on three synthetic data sets after 500 trials, with the initial number of Gaussian components set to k = 20, where 'a' indicates the best within its column

Table 2 presents a set of real-world data, where the acidity, enzyme, and galaxy data sets come from (Richardson and Green [1997]). On these data sets, it is difficult to use CSR, PRI, and VI because the correct clustering result is unavailable. Following (Bishop and Nasrabadi [2006]), we compare the performances of these algorithms in modeling the distributions of the acidity, enzyme, and galaxy data sets visually. As demonstrated in Figure 2, BYY-Jef, BYY-DNW, and pBYY all obviously outperform VB-DNW and MML-Jef on the acidity and enzyme data sets, with pBYY performing best and MML-Jef outperforming VB-DNW. On the galaxy data set, pBYY and BYY-DNW perform similarly, and both outperform BYY-Jef, VB-DNW, and MML-Jef. In summary, these experiments confirm the previous findings obtained on the synthetic data sets; that is, pBYY outperforms not only VB and MML but also BYY-Jef and BYY-DNW.

Figure 2

Density fitting on 1D real-world data sets. From left to right: the acidity, enzyme, and galaxy data sets; from top to bottom: the BYY-DNW, BYY-Jef, MML-Jef, VB-DNW, and pBYY algorithms. (The red curve indicates the overall density function and the green curves the density functions per component.)

Table 2 Details of 1D real data sets

To further evaluate the performance of the pBYY algorithm, we apply the proposed algorithm to unsupervised image segmentation on 100 test images from the Berkeley Segmentation Data Set (BSDS), where each image has five ground-truth segmentations hand-drawn by human subjects, as illustrated in Figure 3. For a clustering-based image segmentation algorithm, an important issue is how to obtain features as input vectors. In this paper, we use the features proposed by Varma and Zisserman ([2003]), which have been used with promising image segmentation results (Nikou et al. [2010]; Shi et al. [2011]; Zhu et al. [2013]). To concentrate on the performance of the clustering algorithms, we do not conduct post-processing operations such as region merging and graph cut, although they may further improve the segmentation results.

Figure 3

Ground-truth segmentation results hand-drawn by five different human subjects on image 296058.

We compare the performance of the pBYY algorithm with several leading segmentation algorithms, including gPb-owt-ucm (Arbelaez et al. [2011]), multiscale graph decomposition (MN-Cut) (Cour et al. [2005]), and mean shift (Comaniciu and Meer [2002]). To make a fair comparison, these algorithms are implemented under the same prespecified configuration. For MML-Jef, VB-DNW, MN-Cut, and pBYY, the initial cluster number is set to 20. For mean shift, the minimum region area is set to 5,000 pixels. For gPb-owt-ucm, we use the segmentation results posted by Arbelaez et al. ([2011]) and set the threshold to 0.5. These settings are fixed throughout all the evaluations. To simplify the computation, we also omit the re-initialization step in Algorithm 2 to accelerate the pBYY algorithm.

Following the existing convention (Arbelaez et al. [2011]), we use PRI, VI, and CR to measure comparative performance. The PRI, VI, and CR scores are shown in Table 3. Moreover, pairwise comparisons of pBYY with each competing algorithm are illustrated in Figure 4. By the PRI and CR measures, pBYY outperforms almost all the algorithms.

Figure 4

Pairwise comparison of segmentation algorithms on the BSDS500. The coordinates of the blue dots are the PRI, VI, and CR scores obtained per image by pBYY and its competing algorithms. The red line represents the boundary of equal performance by the two algorithms, and the boxed digits indicate the number of images on which one algorithm is better. For example, the last panel shows that MN-Cut outscores pBYY on merely 4 of the 100 images.

Table 3 Performance scores on the BSDS

There is one exception at the center of the first row: mean shift performs better than pBYY on 53 images according to PRI. Figure 5 shows the comparisons on four images randomly picked from the BSDS. Human judgement may clearly identify that the segmentations by pBYY look much better than their counterparts by mean shift. By the VI criterion, pBYY outperforms MML-Jef and VB-DNW but fails to beat gPb-owt-ucm, MN-Cut, and mean shift. Observed from Figure 5, human judgement may again identify that the segmentations by pBYY are much better than their counterparts by gPb-owt-ucm, MN-Cut, and mean shift. Seemingly, VI is more suitable for measuring clustering-based segmentations aimed at obtaining superpixels; that is, pBYY outperforms all the algorithms for semantic image segmentation but not necessarily for segmentations towards superpixels.

Figure 5

Comparisons on four images from the BSDS500.


On learning the Gaussian mixture model, the existing BYY learning algorithms are featured by either a gradient-based local search that needs an appropriate stepsize to be prespecified or an EM-like two-step alternation that does not require a learning stepsize but may lead to unstable learning. The proposed pBYY still implements such a two-step alternation but removes the learning unreliability by an embedded projection, outperforming the existing BYY learning algorithms significantly. In the machine learning literature, the Bayesian approach with appropriate priors provides a standard direction for developing learning algorithms for model selection, with VB and MML being two typical instances. In (Shi et al. [2011]), BYY outperforms MML and VB with the help of the same types of priors, but still fails to prevail without a prior. It has been shown in this paper that pBYY without any prior has outperformed MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW, which confirms that BYY best harmony learning provides a new perspective for automatic model selection even without a prior. Especially, pBYY uses an easy computation to prevail over the tedious computation required for using the DNW prior. More interestingly, the semantic image segmentation performance on the 100 test images of the Berkeley Segmentation Data Set has shown that pBYY outperforms not only MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW but also gPb-owt-ucm, MN-Cut, and mean shift.


  1. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automatic Control 19(6):716–723. doi:10.1109/TAC.1974.1100705

  2. Arbelaez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image segmentation. IEEE Trans Pattern Anal Mach Intell 33(5):898–916. doi:10.1109/TPAMI.2010.161

  3. Bauschke H, Borwein JM (1993) On the convergence of von Neumann's alternating projection algorithm for two sets. Set-Valued Anal 1(2):185–212. doi:10.1007/BF01027691

  4. Bauschke H, Borwein JM (1994) Dykstra's alternating projection algorithm for two sets. J Approximation Theory 79(3):418–443. doi:10.1006/jath.1994.1136

  5. Barron A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. IEEE Trans Inf Theory 44(6):2743–2760. doi:10.1109/18.720554

  6. Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol 1. Springer, New York

  7. Carpineto C, Romano G (2012) Consensus clustering based on a new probabilistic rand index with application to subtopic retrieval. IEEE Trans Pattern Anal Mach Intell 34(12):2315–2326. doi:10.1109/TPAMI.2012.80

  8. Chiu KC, Xu L (2001) Tests of Gaussian temporal factor loadings in financial APT. In: Proc of 3rd International Conference on Independent Component Analysis and Blind Signal Separation, December 9–12, San Diego, California, USA, 313–318

  9. Corduneanu A, Bishop CM (2001) Variational Bayesian model selection for mixture distributions. In: Artificial Intelligence and Statistics 2001. Morgan Kaufmann, Waltham, MA, 27–34

  10. Cour T, Benezit F, Shi J (2005) Spectral segmentation with multiscale graph decomposition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol 2, 1124–1131. IEEE

  11. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619. doi:10.1109/34.1000236

  12. Escalante R, Raydan M (2011) Alternating projection methods, vol 8. SIAM

  13. Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396. doi:10.1109/34.990138

  14. Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2–3):103–134. doi:10.1023/A:1007692713085

  15. Nikou C, Likas C, Galatsanos NP (2010) A Bayesian framework for image segmentation with spatially varying mixtures. IEEE Trans Image Process 19(9):2278–2289. doi:10.1109/TIP.2010.2047903

  16. Reynolds DA (1995) Speaker identification and verification using Gaussian mixture speaker models. Speech Commun 17(1):91–108. doi:10.1016/0167-6393(95)00009-D

  17. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26(2):195–239. doi:10.1137/1026034

  18. Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J R Stat Soc Series B 59(4):731–792. doi:10.1111/1467-9868.00095

  19. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471. doi:10.1016/0005-1098(78)90005-5

  20. Shi L, Tu S, Xu L (2011) Learning Gaussian mixture with automatic model selection: a comparative study on three Bayesian related approaches. Frontiers Electrical Electron Eng China 6(2):215–244. doi:10.1007/s11460-011-0153-z

  21. Varma M, Zisserman A (2003) Texture classification: are filter banks necessary? In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), vol 2, 691–698. IEEE

  22. Wallace CS, Dowe DL (1999) Minimum message length and Kolmogorov complexity. Comput J 42(4):270–283. doi:10.1093/comjnl/42.4.270

  23. Wagner S, Wagner D (2007) Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik

  24. Xu L, Krzyzak A, Oja E (1992) Unsupervised and supervised classifications by rival penalized competitive learning. In: Proc 11th IAPR International Conference on Pattern Recognition, vol II, 496–499. IEEE

  25. Xu L (1995) Bayesian-Kullback coupled Ying-Yang machines: unified learnings and new results on vector quantization. In: Proceedings of International Conference on Neural Information Processing, Oct 30–Nov 3, Beijing, China, 977–988

  26. Xu L (1998) Rival penalized competitive learning, finite mixture, and multisets clustering. In: Proc 1998 IEEE International Joint Conference on Neural Networks, vol 3, 2525–2530. IEEE

  27. Xu L (2009) Learning algorithms for RBF functions and subspace based functions. In: Olivas E (ed) Handbook of Research on Machine Learning, Applications and Trends: Algorithms, Methods and Techniques. IGI Global, Hershey, PA, 60–94. doi:10.4018/978-1-60566-766-9.ch003

  28. Xu L (2010) Bayesian Ying-Yang system, best harmony learning, and five action circling. Frontiers Electrical Electron Eng China 5(3):281–328. doi:10.1007/s11460-010-0108-9

  29. Xu L (2012) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. Frontiers Electrical Electron Eng 7(1):147–196

  30. Xu L (2014) Further advances on Bayesian Ying-Yang harmony learning. Appl Inform, to appear

  31. Zhang Y, Brady M, Smith S (2001) Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans Med Imaging 20(1):45–57. doi:10.1109/42.906424

  32. Zhu S, Zhao J, Guo L, Zhang Y (2013) Unsupervised natural image segmentation via Bayesian Ying-Yang harmony learning theory. Neurocomputing 121:532–539. doi:10.1016/j.neucom.2013.05.017



Lei Xu was supported by a starting-up grant for the Zhi-Yuan chair professorship by Shanghai Jiao Tong University. Pheng-Ann Heng was partly supported by Hong Kong Research Grants Council General Research Fund (Project No. 412513).

Author information



Corresponding author

Correspondence to Lei Xu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

GYC proposed the idea of pBYY algorithm and designed the experiment part with PAH, and LX improved the original idea of pBYY algorithm and refined the presentation of this method. All authors read and approved the final manuscript.


Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.



About this article


Cite this article

Chen, G., Heng, PA. & Xu, L. Projection-embedded BYY learning algorithm for Gaussian mixture-based clustering. Appl Inform 1, 2 (2014).
