# Projection-embedded BYY learning algorithm for Gaussian mixture-based clustering

Guangyong Chen^{1}, Pheng-Ann Heng^{1}, and Lei Xu^{1, 2}

**1**:2

https://doi.org/10.1186/s40535-014-0002-2

© Chen et al.; licensee Springer. 2014

**Received: **23 July 2014

**Accepted: **15 October 2014

**Published: **27 November 2014

## Abstract

On learning the Gaussian mixture model, existing BYY learning algorithms are featured by a gradient-based line search with an appropriate stepsize. Learning becomes either unstable if the stepsize is too large or slow and stuck in a local optimal solution if the stepsize is too small. An algorithm without a learning stepsize has been proposed with expectation-maximization (EM)-like two alternating steps. However, its learning process may still be unstable. This paper tackles this problem of unreliability by a modified algorithm called the projection-embedded Bayesian Ying-Yang learning algorithm (pBYY). Experiments have shown that pBYY outperforms learning algorithms developed from not only minimum message length with the Jeffreys prior (MML-Jef) and Variational Bayesian with the Dirichlet-Normal-Wishart (VB-DNW) prior but also BYY with these priors (BYY-Jef and BYY-DNW). pBYY obtains this superiority with an easy implementation, while the DNW prior-based learning algorithms suffer a complicated and tedious computation load. The performance of pBYY has also been demonstrated on the Berkeley Segmentation Dataset for the topic of unsupervised image segmentation. The resulting performances of semantic image segmentation show that pBYY outperforms not only MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW but also three leading image segmentation algorithms, namely gPb-owt-ucm, MN-Cut, and mean shift.

## Keywords

## Background

### Introduction

The Gaussian mixture model (GMM) has been widely used in different areas, e.g., clustering, image segmentation (Zhang et al. [2001]), speaker identification (Reynolds [1995]), document classification (Nigam et al. [2000]), market analysis (Chiu and Xu [2001]), etc. Learning a GMM consists of parameter learning for estimating all unknown parameters and model selection for determining the number of Gaussian components *k*. Parameter learning is usually implemented under the maximum likelihood principle by an expectation-maximization (EM) algorithm (Redner and Walker [1984]). A conventional model selection approach is featured by a two-stage implementation, which suffers from a huge computational cost because it requires parameter learning for each candidate GMM. Moreover, parameter learning becomes less reliable as *k* grows larger, which implies more free parameters.

One road to tackle these problems is referred to as automatic model selection, which automatically determines *k* during parameter learning. An early effort is rival penalized competitive learning (RPCL) (Xu et al. [1992]; Xu [1998]), with the number *k* automatically determined during learning. Automatic model selection may also be approached via appropriate priors on unknown parameters by Bayesian approaches. Two examples are minimum message length (MML) (Figueiredo and Jain [2002]) and variational Bayesian (VB) (Corduneanu and Bishop [2001]). First proposed in (Xu [1995]) and systematically developed in the past two decades, Bayesian Ying-Yang (BYY) learning provides not only new model selection criteria but also a family of learning algorithms capable of automatic model selection during parameter learning, with details referred to the recent tutorial and survey by (Xu [2010], [2012]).

A systematic comparison has recently been made by (Shi et al. [2011]) among MML, VB, and BYY with two types of priors. One is the Jeffreys prior, and the other is a parametric conjugate prior that imposes a Dirichlet prior on the mixing weights and a joint normal-Wishart prior on the mean vectors and covariance matrices, shortly denoted as DNW. The automatic model selection performances of these approaches are evaluated through extensive experiments, with several interesting empirical findings. Among them, it has been shown that BYY considerably outperforms both VB and MML. Different from VB and MML, which rely on appropriate priors to perform model selection, BYY is capable of selecting the model automatically even without imposing any priors on the parameters, while its performance can be further improved with appropriate priors incorporated. Similar findings have also been obtained in (Zhu et al. [2013]), where a simplified BYY learning algorithm with DNW priors is shown to outperform or at least be competitive with the existing state-of-the-art image segmentation methods.

The algorithms by (Shi et al. [2011]) for implementing BYY are featured by a gradient-based line search with an appropriate stepsize. Learning becomes either unstable if this stepsize is too large or slow and stuck in a local optimal solution if the stepsize is too small. Given in Algorithm two of (Xu [2009]) and Equation (11) of (Xu [2010]), there is a Ying-Yang two-step alternation algorithm that is similar to the EM algorithm and requires no learning stepsize for the learning procedure. However, the Ying step (Xu [2010]) ignores the constraint that the covariance matrix of each Gaussian component must be a positive definite matrix, so the learning procedure may become unstable.

To constrain the covariance matrix to be a positive definite matrix, this paper introduces a projection operation into the Yang step, which results in a modified algorithm called the projection-embedded BYY learning algorithm, shortly denoted as pBYY. To facilitate its implementation, we also add a Kullback–Leibler divergence-based indicator into the algorithm to improve the detection of redundant Gaussian components. Experiments have shown that pBYY significantly outperforms not only the Jeffreys-based MML (Figueiredo and Jain [2002]) and the DNW-based VB but also the BYY learning algorithms with these two types of priors (Shi et al. [2011]), and it further avoids the cost of the complicated and tedious computation brought by the DNW prior.

### Gaussian mixture model and four learning principles

An observation *x*∈*R*^{ d } is drawn from the following mixture of *k* Gaussian distributions:

$$q(x|\theta )=\sum _{i=1}^{k}{\alpha}_{i}G(x|{\mu}_{i},{\Sigma}_{i}),$$

where *G*(*x*|*μ*,*Σ*) denotes a Gaussian density with a mean *μ* and a covariance matrix *Σ*.

With a latent vector *y*=[*y*_{1},*y*_{2},…,*y*_{ k }]^{ T }, subject to *y*_{ i }∈{0,1}, ∀*i*, and $\sum _{i=1}^{k}{y}_{i}=1$, the latent variable *y*_{ i }=1 means that the random variable *x* is drawn from the *i*th Gaussian component. The generative process of an observation *x* is interpreted as follows: *y* is sampled from a multinomial distribution with probabilities *α*, and then *x* is randomly generated by the *i*th Gaussian component with *y*_{ i }=1. Let *X*∈*R*^{ d×n } denote the set of *n* i.i.d. *d*-dimensional observation samples and *Y*∈*R*^{ k×n } denote the set of latent vectors for the observable set *X*; we have the following:
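The two-stage generative process just described can be sketched as follows. This is a minimal NumPy illustration of our own (the function name and array layouts are not from the paper); *X* is stored as *d*×*n* to match the text:

```python
import numpy as np

def sample_gmm(n, alpha, mus, sigmas, rng=None):
    """Draw n samples from a k-component Gaussian mixture.

    alpha:  (k,)      mixing weights summing to 1
    mus:    (k, d)    component means
    sigmas: (k, d, d) component covariance matrices
    Returns X of shape (d, n) and one-hot latent indicators Y of shape (k, n).
    """
    rng = np.random.default_rng() if rng is None else rng
    k, d = mus.shape
    X = np.empty((d, n))
    Y = np.zeros((k, n))
    for t in range(n):
        i = rng.choice(k, p=alpha)                            # y sampled with probabilities alpha
        X[:, t] = rng.multivariate_normal(mus[i], sigmas[i])  # x drawn from the i-th Gaussian
        Y[i, t] = 1.0                                         # y_i = 1, all other entries 0
    return X, Y
```

Each column of *Y* is exactly one of the one-hot vectors *y* described above.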

Learning a GMM consists of parameter learning for estimating all the unknown parameters in *θ* and model selection for determining the number of Gaussian components *k*, which can be implemented differently under different learning principles.

The first learning principle is maximum likelihood (ML), which estimates *θ* by

The ML learning with a known *k* is typically made by the well-known EM algorithm (Redner and Walker [1984]). However, an unknown *k* is poorly estimated by Equation (3) when the sample number *n* is not large enough. The task of determining an appropriate *k* is called model selection, which is usually made in a two-stage implementation with the help of a model selection criterion. However, such a two-stage implementation suffers from a huge computational cost and an unreliable estimation. These problems are tackled by automatic model selection, which automatically determines *k* during learning *θ* without such a two-stage implementation.

There are three Bayesian-related learning principles that can be implemented with such a property of automatic model selection.

where |**I**(*θ*)| represents the determinant of the Fisher information matrix with respect to (w.r.t.) *θ*. Equation (4) is mathematically equivalent to a maximum a posteriori (MAP) approach that modifies a proper prior *q*(*θ*) into one proportional to *q*(*θ*)/|**I**(*θ*)|^{1/2}.

If one takes the Jeffreys prior *q*(*θ*)∝|**I**(*θ*)|^{1/2} directly, Equation (4) degenerates to the ML learning principle. To avoid this situation, Figueiredo and Jain ([2002]) considered the following:

where *ρ*=*d*+0.5*d*(*d*+1) is the number of free parameters in each Gaussian component. In (Shi et al. [2011]), it has been shown that some improvement can be obtained by an algorithm that implements the MML principle with the help of a Dirichlet prior and a *joint* normal-Wishart prior (shortly, DNW prior).

Maximizing *q*(*X*|*θ*)*q*(*θ*) relies on a prior *q*(*θ*) that takes a strong role. Unfortunately, a poor *q*(*θ*) may seriously affect the learning performance. Such a bad influence can be smoothed out by considering the following marginal distribution:

Variational Bayesian (VB) learning maximizes a lower bound *J*_{ VB } obtained with the help of Jensen's inequality as follows:

The idea is to approximate *p*(*θ*,*Y*|*X*) by a member of a distribution family, so that the lower bound *J*_{ VB } can readily be evaluated and yet remains sufficiently flexible. One challenge is to provide a suitable distribution family. In (Corduneanu and Bishop [2001]), the family of prior distributions is approximately factorized as follows:

Given *q*(*X*,*Y*|*θ*) by Equation (2) and a DNW prior *q*(*θ*), the above *p*(*θ*_{ i }|*X*) can be obtained with *p*(*Y*|*X*) and *p*(*θ*_{ j }|*X*), ∀*j*≠*i*, given by the following equation (Bishop and Nasrabadi [2006]):

A tight bound cannot be obtained by Equation (8), which affects the learning performance. Also, DNW is quite tedious and has hyperparameters $\left\{\lambda ,\xi ,{m}_{i},\frac{{\Sigma}_{i}^{-1}}{\beta},\Phi ,\gamma \right\}$ to be updated, which is time-consuming and may fall into a local optimum. To avoid the tedious computation of the DNW prior-based VB, an algorithm for implementing the VB principle is developed in (Shi et al. [2011]) with the help of the Jeffreys prior, via approximately using a block-diagonal complete-data Fisher information (Figueiredo and Jain [2002]).

The last Bayesian-related principle is BYY harmony learning. First proposed by (Xu [1995]) and systematically developed in the past two decades, BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing learning regularization, and a class of algorithms that approach automatic model selection during parameter learning. Readers are referred to (Xu [2010], [2012], [2014]) for the latest systematic introductions to BYY harmony learning.

A BYY system consists of a Yang machine *p*(*R*|*X*)*p*(*X*) and a Ying machine *q*(*X*|*R*)*q*(*R*), where the data *X* is regarded as generated from its inner representation *R*={*Y*,*θ*} that consists of latent variables *Y* and parameters *θ*, supported by a hyperparameter set *Ξ*. The harmony measure is mathematically expressed as follows:

Maximizing this *H*(*p*||*q*) leads to not only a best matching between the Ying-Yang pair but also a compact model with a least complexity. Such an ability can be observed from several perspectives (see Section 4 in (Xu [2010])).

Applied to the GMM by Equation (2), we have *R*={*Y*,*θ*} and *q*(*R*)=*q*(*Y*|*θ*)*q*(*θ*|*Ξ*). Comparing Equation (9) and Equation (7), the key difference is that there is only *q*(*X*,*Y*|*θ*)*q*(*θ*) inside the basket ln[∗] for the BYY harmony learning, while there is also a denominator *p*(*θ*,*Y*|*X*) for the VB learning. Maximizing *J*_{ VB } leads to a best match between *q*(*X*,*Y*|*θ*)*q*(*θ*) and *p*(*θ*,*Y*|*X*), while maximizing *H*(*p*||*q*) leads to not only such a best match but also a modeling of *q*(*X*,*Y*|*θ*)*q*(*θ*) with the least complexity. Readers are referred to Section 4 and its figure five in (Xu [2012]) for various aspects of this key difference, as well as how they relate to and differ from MML and minimum description length (MDL) (Barron et al. [1998]; Rissanen [1978]).

Maximizing *H*(*p*||*q*) leads to specific algorithms according to not only what types of *q*(*θ*|*Ξ*) are chosen for the Ying machine but also how the structure of *p*(*θ*,*Y*|*X*) is designed for the Yang machine. Details are referred to Section 4.2 in (Xu [2010]) and Section 3.2 in (Xu [2012]). For the GMM by Equation (2), we introduce two typical examples here.

One example is *p*(*θ*,*Y*|*X*) given by Equation (8) together with a DNW prior. Putting them into Equation (9), the DNW prior-based BYY harmony learning algorithm has been developed for maximizing *H*(*p*||*q*) in (Shi et al. [2011]). Extensive experiments have shown that the DNW prior-based BYY considerably outperforms both VB and MML for any type of prior, whether or not the hyper-parameters are optimized. As the hyper-parameters of the DNW prior are optimized by its corresponding learning principle, BYY further improves its performance and outperforms the others significantly, because learning hyper-parameters is a part of the entire BYY harmony learning. However, both VB and MML deteriorate when there are too many free hyper-parameters; the performance of VB especially drops drastically. The reason is that VB and MML maximize the marginal likelihood via variational approximation and Laplace approximation, respectively, where maximizing the marginal likelihood with respect to a free prior *q*(*θ*|*Ξ*) makes it tend to the maximum likelihood.

In the other example, maximizing *H*(*p*||*q*,*Ξ*) with respect to *p*(*θ*|*X*) simplifies Equation (9) into

### Automatic model selection and two-step alternation

Given a known *k*, learning the unknown parameters *θ* of a GMM is usually implemented under the maximum likelihood principle by an EM algorithm (Redner and Walker [1984]), which is one typical instance of Algorithm 1 featured by a two-step alternation. As remarked at the bottom of the table, we get the EM algorithm after simply removing the lines of **trimming**, where *p*(*i*|*x*_{ t },*θ*) is the Bayes posterior probability as follows:
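As a concrete sketch, this Bayes posterior (the standard mixture responsibility $p(i|{x}_{t},\theta )={\alpha}_{i}G({x}_{t}|{\mu}_{i},{\Sigma}_{i})/\sum _{j}{\alpha}_{j}G({x}_{t}|{\mu}_{j},{\Sigma}_{j})$) can be computed as follows. The function names are our own, and the implementation is numerically naive (no log-space tricks), kept simple for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density G(x | mu, sigma) of a d-dimensional Gaussian."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def bayes_posterior(x, alpha, mus, sigmas):
    """p(i | x, theta): posterior responsibility of each component for sample x."""
    dens = np.array([a * gaussian_pdf(x, m, s)
                     for a, m, s in zip(alpha, mus, sigmas)])
    return dens / dens.sum()   # normalize over the k components
```

By construction, the returned vector is non-negative and sums to one, which is exactly the property that distinguishes it from the BYY quantity *p*_{ it }(*θ*) discussed later.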

The terms *η*_{ i } and *ρ*_{ i } come from a prior distribution that takes a regularization role. This role is shut off by simply setting them to zero. When *η*_{ i }=0, *ρ*_{ i }>0, the EM algorithm is extended to the smoothed EM algorithm that was first proposed in 1997 (Xu [2010]). Also, we get the EM algorithm for naive Bayes with a Jeffreys prior on *α*_{ i }, *Σ*_{ i } with:

An unknown *k* is poorly estimated via the ML learning by Equation (3), especially when the sample number *n* is not large enough. The task of determining an appropriate *k* is made by model selection, which is usually made in a two-stage implementation. The first stage enumerates *k* to get a set of candidate models, with the unknown parameters of each candidate estimated by the EM algorithm. In the second stage, the best candidate is selected by a model selection criterion. Examples of such criteria include Akaike's information criterion (AIC) (Akaike [1974]), the Bayesian inference criterion (BIC), and the minimum description length (MDL) criterion (which stems from another viewpoint but coincides with BIC when it is simplified to an analytically computable criterion) (Barron et al. [1998]; Rissanen [1978]). However, this two-stage implementation suffers from a huge computation because it requires parameter learning for each $k\in \mathcal{\mathcal{M}}$. Moreover, a larger *k* often implies more unknown parameters, thus parameter estimation becomes less reliable and the criterion evaluation loses accuracy (see Section 2.1 in (Xu [2010]) for a detailed discussion).

One road to tackle these problems is referred to as automatic model selection, which automatically determines an appropriate *k* during parameter learning. An early effort is RPCL (Xu et al. [1992]; Xu [1998]). The key idea is that not only does the winning Gaussian component move a little bit to adapt to the current sample, but the rival (i.e., the second winner) Gaussian component is also repelled a little bit from this sample to reduce duplicated information allocation. As a result, an extra Gaussian component is driven far away from the data.

by which learning is made on a cluster when *p*_{ ℓt }=1 and penalizing or de-learning is made on a cluster when *p*_{ ℓt }=−*γ*. Usually, the penalizing strength is set *γ*≈0.005∼0.05. When *γ*=0, it degenerates to the so-called hard-cut EM algorithm; see Equations (19) and (20) in (Xu [1995]).
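The winner/rival rule can be illustrated by a distance-based sketch. This is a simplification of our own (actual RPCL allocates *p*_{ ℓt } by a competition among density contributions, not raw Euclidean distance):

```python
import numpy as np

def rpcl_step(x, centers, eta=0.05, gamma=0.01):
    """One RPCL update on cluster centers for a single sample x.

    The winner (nearest center) learns toward x with rate eta (p = 1);
    the rival (second nearest) is de-learned away from x with rate
    eta * gamma (p = -gamma); all other centers stay unchanged.
    """
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances to x
    win, rival = np.argsort(d2)[:2]
    centers[win] += eta * (x - centers[win])              # learning
    centers[rival] -= eta * gamma * (x - centers[rival])  # penalizing / de-learning
    return centers
```

Repeated over samples, a redundant center keeps losing the competition and is pushed far from the data, which is the automatic model selection effect described above.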

According to its general formulation (e.g., see the last part of Section 2.1 in (Xu [2010])), automatic model selection is an intrinsic nature of learning a mixture of *k* individual substructures with the following two features:

There is an indicator *Ψ*_{ j }(*θ*) on *θ* or its subset, based on which a particular structural component *j* can be effectively discarded if its corresponding *Ψ*_{ j }(*θ*)→0. Taking the GMM as an example, we may consider

With an initial *k* large enough, there is an intrinsic mechanism that drives such an indicator *Ψ*_{ j }(*θ*) towards zero if the corresponding structure is redundant, so that it can be effectively discarded.

The three Bayesian-related approaches introduced in the previous subsection can all be implemented with such a nature of automatic model selection. For both MML and VB, this nature comes from an appropriate prior *q*(*θ*|*Ξ*). Favorably, BYY is capable of automatic model selection even without imposing any priors on the parameters, and its performance can be further improved as appropriate priors are incorporated. Actually, the BYY harmony learning by maximizing *H*(*p*||*q*) relies on *q*(*R*)=*q*(*Y*|*θ*)*q*(*θ*|*Ξ*) to make model selection, with *q*(*Y*|*θ*) in a role that is not only equally important to *q*(*θ*|*Ξ*) but also easy to compute, while *q*(*θ*|*Ξ*) is still handled in a way similar to MML and VB.

The algorithm implements a BYY harmony learning without a prior ln*q*(*θ*) in Algorithm 1 by simply setting *η*_{ i }=0, *ρ*_{ i }=0, or a data-smoothing-based BYY harmony learning when *η*_{ i }=0, *ρ*_{ i }>0. Readers are referred to Section 3.1 of (Xu [2010]) for further details. Also, we may implement the Jeffreys prior-based BYY harmony learning by using Equation (14); see table one in (Shi et al. [2011]).

## Methods

### Learning unreliability and convex combination

One typical implementation maximizes *H*(*θ*) by Equation (11) w.r.t. a subset *ϕ* of parameters. That is, we consider the gradient ∇_{ ϕ }*H*(*θ*), with *p*_{ i,t }(*θ*) and *π*_{ t }(*θ*_{ i }) given in Equation (17).

Then, *ϕ* can be updated iteratively as below:

where *η*>0 is a small learning stepsize. Both the BYY learning algorithm given in figure seven of (Xu [2010]) and the BYY-Jef algorithm given in table one of (Shi et al. [2011]) are derived from Equation (19) with the help of some computing tricks and simplifications. However, the performances of such algorithms all depend on an appropriate stepsize. Learning becomes either unstable if *η* is too large or slow and stuck in a local optimum if *η* is too small. No such learning stepsize is required by EM algorithms.

Another typical implementation attempts to make the BYY harmony learning by Equation (11) also a Ying-Yang two-step alternation, as previously suggested in Section 2.1 and table one of (Xu [2012]). This two-step alternation algorithm is actually derived by approximately letting *p*_{ i,t }(*θ*) in Equation (18) be fixed at its value *p*_{ it }=*p*_{ it }(*θ*^{ new }), such that we can solve the root of ∇_{ ϕ }*H*(*θ*)=0 subject to this fixation to get the Ying step in Algorithm 1.

Still, theoretical analyses are lacking that either guarantee the learning convergence or provide convergence conditions. On the contrary, we find empirically that the learning process of this BYY two-step alternation may become unstable.

Actually, the root of ∇_{ ϕ }*H*(*θ*)=0 subject to *p*_{ it }=*p*_{ it }(*θ*^{ new }) can deviate considerably from the true root of ∇_{ ϕ }*H*(*θ*)=0, since the true root is coupled with *p*_{ it }(*θ*) that varies with *θ*. Not only is correctly solving the root of ∇_{ ϕ }*H*(*θ*)=0 a challenging task, but it is also unclear whether fixing *p*_{ it }=*p*_{ it }(*θ*^{ new }) makes the learning procedure unstable.

with *p*(*i*|*x*_{ t },*θ*) given in Equation (13). Fixing *p*(*i*|*x*_{ t },*θ*)=*p*_{ it }=*p*(*i*|*x*_{ t },*θ*^{ new }), solving the root of ∇_{ ϕ } ln*q*(*X*|*θ*)=0 leads to the Ying step in Algorithm 1, or precisely the M step of the EM algorithm, while letting *p*_{ it }=*p*(*i*|*x*_{ t },*θ*^{ new }) is just the E step of the EM algorithm. As is well known, the convergence of the EM algorithm has been theoretically proved. That is, though the root of ∇_{ ϕ } ln*q*(*X*|*θ*)=0 is also coupled with *p*(*i*|*x*_{ t },*θ*) that varies with *θ*, this deviation does not actually affect the convergence.
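For contrast, the standard EM alternation just described can be sketched end to end. This is a minimal NumPy implementation under our own naming (a small ridge term keeps the covariances positive definite, and the deterministic initialization is our own choice, not the paper's):

```python
import numpy as np

def em_gmm(X, k, iters=100):
    """Plain EM for a GMM with samples X of shape (n, d).

    E step: fix responsibilities p_it = p(i | x_t, theta_new).
    M step: solve the fixed-responsibility root in closed form.
    """
    n, d = X.shape
    alpha = np.full(k, 1.0 / k)
    # deterministic init: means at spread-out samples along the first coordinate
    order = np.argsort(X[:, 0])
    mus = X[order[np.linspace(0, n - 1, k).astype(int)]].astype(float).copy()
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(iters):
        # E step: log densities, normalized into responsibilities
        logp = np.empty((n, k))
        for i in range(k):
            diff = X - mus[i]
            inv = np.linalg.inv(sigmas[i])
            _, logdet = np.linalg.slogdet(sigmas[i])
            logp[:, i] = (np.log(alpha[i])
                          - 0.5 * (d * np.log(2 * np.pi) + logdet)
                          - 0.5 * np.sum(diff @ inv * diff, axis=1))
        logp -= logp.max(axis=1, keepdims=True)
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        # M step: closed-form updates with responsibilities held fixed
        nk = p.sum(axis=0)
        alpha = nk / n
        for i in range(k):
            mus[i] = p[:, i] @ X / nk[i]
            diff = X - mus[i]
            sigmas[i] = (p[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
    return alpha, mus, sigmas
```

Because the responsibilities here remain true probabilities, every M-step update is a convex combination of samples, which is the property the next paragraphs contrast with the BYY quantity *p*_{ it }(*θ*^{ new }).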

The difference between *p*_{ it }=*p*(*i*|*x*_{ t },*θ*)=*p*(*i*|*x*_{ t },*θ*^{ new }) and *p*_{ it }=*p*_{ it }(*θ*^{ new }) is that *p*(*i*|*x*_{ t },*θ*), *i*=1,…,*k*, remain probabilities over the components, while *p*_{ it }(*θ*_{ i }), *i*=1,…,*k*, given in Equation (17) are no longer probabilities and may even take negative values sometimes. Thus, *p*_{ it }(*θ*^{ new }) is sparser than *p*(*i*|*x*_{ t },*θ*^{ new }), and the Yang step in the BYY theory introduces a nature of automatic model selection into the iteration procedure.

Replacing *p*(*i*|*x*_{ t },*θ*^{ new }) by *p*_{ it }(*θ*^{ new }), we now focus on the Ying step, which can be reformulated as below:

In the EM algorithm, *μ*_{ i } and *Σ*_{ i } are constrained in the convex hulls spanned by *x*_{ t } and (*x*_{ t }−*μ*_{ i })(*x*_{ t }−*μ*_{ i })^{ T }, respectively, because its *p*_{ it } still remains in the probability space. However, in the BYY algorithm, *p*_{ it } is no longer a probability and may even take negative values sometimes. Thus, *μ*_{ i } and *Σ*_{ i } may break through their corresponding convex hulls. For a GMM, the model parameters *θ* must satisfy the following constraints:

where each *Σ*_{ i } is a positive definite matrix of size *d*×*d*. Thus, the updated *α*_{ i } and *Σ*_{ i } in BYY may sometimes fall outside their feasible regions. Instead of directly projecting *α*_{ i } and *Σ*_{ i } onto the set of positive semidefinite matrices, we are motivated to project ∇_{ ϕ }*H*_{0}(*θ*) back to the convex hull of the local gradients ∇_{ ϕ }*π*_{ t }(*θ*_{ i }), *t*=1,…,*n*, via projecting *p*_{ it }(*θ*^{ new }) onto the following set of probabilities, so as to preserve more information of *α*_{ i } and *Σ*_{ i }:

For updating each mean vector *μ*_{ i }, we are encouraged to use *p*_{ it }(*θ*^{ new }) directly, because the updating equation of *μ*_{ i } is then no longer a convex combination of all observable samples, and the redundant components can be pushed outside the convex hull; this operation thus accelerates model selection.

The relative structure within {*p*_{ it }(*θ*^{ new })} is encoded by the position of the vector ${p}_{t}^{H}={\left[{p}_{1t}\left({\theta}^{\mathit{\text{new}}}\right),\dots ,{p}_{\mathit{\text{kt}}}\left({\theta}^{\mathit{\text{new}}}\right)\right]}^{T}$ in *R*^{ k }. Projecting ${p}_{t}^{H}$ from *R*^{ k } to $\mathcal{P}$ in Equation (23) means finding a vector ${p}_{t}={\left[{p}_{1t},\dots ,{p}_{\mathit{\text{kt}}}\right]}^{T}\in \mathcal{P}$ that is the nearest one to ${p}_{t}^{H}$ and thus best keeps the relative structure within the elements of ${p}_{t}^{H}$. To be specific, we choose the nearest one in the sense of the least square distance, that is, we consider the following optimization problem:

The above implementation may be regarded as a two-step approach of making the BYY harmony learning by Equation (17) under a principle of *multiple convex combination* preservation (Xu [2014]).

### Fast approximation and pBYY-Jef algorithm

The problem Equation (24) is often encountered in the literature of applied mathematics and scientific computing and tackled by several algorithms such as variants of the method of alternating projections (Bauschke and Borwein [1993]) and variants of Dykstra’s algorithm (Bauschke and Borwein [1994]). However, these algorithms suffer from a huge computing cost, especially on a large-size data set.

Alternatively, we propose a fast approximation algorithm with two steps, motivated by Kolmogorov's criterion (see Chapter 1 of (Escalante and Raydan [2011])). Let $\prod _{\mathcal{S}}\left(x\right)$ denote the projection point of an arbitrary point $x\in {\mathbb{R}}^{k}$ onto a non-empty closed convex set $\mathcal{S}\subset {\mathbb{R}}^{k}$; Kolmogorov's criterion states that ${z}^{\ast}=\prod _{\mathcal{S}}\left(x\right)$ if and only if ${z}^{\ast}\in \mathcal{S}$ and $(z-{z}^{\ast})^{T}(x-{z}^{\ast})\le 0$ for all $z\in \mathcal{S}$, from which we can get the following:

####
**Theorem** **1**.

Let ${\mathcal{F}}_{p}=\{{p}_{1},\dots ,{p}_{k}:\sum _{i=1}^{k}{p}_{i}=1\}$ with $\mathcal{P}\subset {\mathcal{F}}_{p}$; then $\prod _{\mathcal{P}}\left(x\right)=\prod _{\mathcal{P}}\prod _{{\mathcal{F}}_{p}}\left(x\right)$ for an arbitrary point $x\in {\mathbb{R}}^{k}$.

####
*Proof*.

Let ${z}^{\prime}=\prod _{{\mathcal{F}}_{p}}\left(x\right)$, ${z}^{\ast}=\prod _{\mathcal{P}}\left(x\right)$, and ${{z}^{\prime}}^{\ast}=\prod _{\mathcal{P}}\left({z}^{\prime}\right)$. From $(z-{{z}^{\prime}}^{\ast})^{T}({z}^{\prime}-{{z}^{\prime}}^{\ast})\le 0$ for all $z\in \mathcal{P}$, we have $(z-{{z}^{\prime}}^{\ast})^{T}({z}^{\prime}-x+x-{{z}^{\prime}}^{\ast})\le 0$, or $(z-{{z}^{\prime}}^{\ast})^{T}({z}^{\prime}-x)+(z-{{z}^{\prime}}^{\ast})^{T}(x-{{z}^{\prime}}^{\ast})\le 0$. It follows that $(z-{{z}^{\prime}}^{\ast})^{T}({z}^{\prime}-x)=0$, since ${z}^{\prime}=\prod _{{\mathcal{F}}_{p}}\left(x\right)$ is the projection point of *x* onto the hyperplane ${\mathcal{F}}_{p}$, and thus ${z}^{\prime}-x$ is orthogonal to the vector $z-{{z}^{\prime}}^{\ast}$ that lies in this hyperplane ${\mathcal{F}}_{p}$. Therefore, we get the inequality $(z-{{z}^{\prime}}^{\ast})^{T}(x-{{z}^{\prime}}^{\ast})\le 0$, which holds for all $z\in \mathcal{P}$, and thus ${z}^{\ast}={{z}^{\prime}}^{\ast}$ according to Kolmogorov's criterion. **End**. □

where $\mathbf{\text{n}}=\frac{1}{\sqrt{k}}\mathbf{\text{1}}$ is the normal vector of the hyperplane $\sum _{i=1}^{k}{p}_{i}=1$, *f*_{0} is the center point of the closed convex set $\mathcal{P}$, and all elements of **1**∈*R*^{ k×1 } are equal to 1.

It remains to project *f*_{ t } onto $\mathcal{P}$. However, accurately calculating the projection point is still very time-consuming. Instead, we consider a fast approximation along the line between *f*_{ t } and *f*_{0} as follows:

with the minimum *λ* that makes *p*_{ t } locate within $\mathcal{P}$.
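The two-step approximation can be sketched as follows, taking $\mathcal{P}$ to be the probability simplex (non-negative entries summing to one) for concreteness. Step 1 projects onto the hyperplane $\sum _{i}{p}_{i}=1$, and Step 2 shrinks along the line toward the center ${f}_{0}=\frac{1}{k}\mathbf{\text{1}}$ with the smallest *λ* that restores feasibility (the function name is ours):

```python
import numpy as np

def project_fast(p_h):
    """Fast approximate projection of p_h onto the probability simplex.

    Step 1: orthogonal projection onto the hyperplane sum(p) = 1.
    Step 2: p = lam * f0 + (1 - lam) * f_t with the smallest lam in
    [0, 1] that makes every entry non-negative.
    """
    k = p_h.shape[0]
    f_t = p_h - (p_h.sum() - 1.0) / k       # hyperplane projection
    if np.all(f_t >= 0.0):
        return f_t                          # already feasible
    f0 = np.full(k, 1.0 / k)                # center of the simplex
    mask = f_t < 0.0
    lam = np.max(-f_t[mask] / (f0[mask] - f_t[mask]))
    return lam * f0 + (1.0 - lam) * f_t
```

Since *f*_{ t } and *f*_{0} both sum to one, any point on the line between them also sums to one, so only non-negativity has to be restored in Step 2.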

In summary, we get a modified algorithm as one new instance of Algorithm 1. Its Ying step remains unchanged, but its Yang step gets {*p*_{ it }(*θ*^{ new })} by Equation (17) and then makes the nearest projection onto $\mathcal{P}$ by Equation (25) and Equation (26). For clarity, we rewrite Algorithm 1 into a detailed form in **Algorithm**-2 that is dedicated to implementing this projection-embedded BYY learning (shortly named pBYY).

The pBYY implementation repeats the Ying step and the Yang step alternately. It exits this loop in two cases. One is that learning is finally completed as the loop converges with an unchanged *k*. The other occurs after trimming one Gaussian component, with *k* reduced by 1, after which it goes to the line of **initialization** and starts a new loop. This re-initialization is helpful to avoid accumulating estimation bias, though it requires extra computing cost. Whether we need it depends on a trade-off of computing cost versus estimation accuracy. We may remove this re-initialization by simply deleting the line 'go to **Initialization**'.

A redundant Gaussian component is detected by the indicator *Ψ*_{ j }(*θ*) as given in Equation (16). Empirically, we find that there are scenarios where this indicator alone is insufficient, and we add the following new indicator for detection:

That is, we use the Kullback–Leibler (KL) divergence to measure the similarity between two Gaussian components. When *KL*_{ ij } becomes closer to 0 for some *j*≠*i*, we may regard the *i*th Gaussian component as redundant and thus discard it.
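The KL divergence between two Gaussian components has a well-known closed form, sketched here (the helper name is ours; this computes the standard divergence between two multivariate normals, not any pBYY-specific variant):

```python
import numpy as np

def kl_gaussians(mu1, sig1, mu2, sig2):
    """Closed-form KL( N(mu1, sig1) || N(mu2, sig2) ) for d-dimensional Gaussians."""
    d = mu1.shape[0]
    inv2 = np.linalg.inv(sig2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(inv2 @ sig1)               # trace term
                  + diff @ inv2 @ diff                # Mahalanobis term
                  - d                                 # dimension offset
                  + np.log(np.linalg.det(sig2) / np.linalg.det(sig1)))  # log-det ratio
```

The divergence is zero exactly when the two components coincide, which is why a near-zero value flags a duplicate component.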

## Results and discussion

### Performance measures and algorithms

When samples lie in a space with dimension less than 3, we can visualize and judge the clustering performance manually. However, for practical problems, samples usually lie in a high-dimensional space. Also, human evaluation is too subjective. In this paper, we consider four typical measures for clustering performance and model selection on the number of clusters.

First, a traditional criterion to measure the performance of model selection is the correct selection rate (CSR), namely, how many times the algorithm gets the accurate number of clusters among a large number of trials. Sometimes, this criterion is argued to be too strict. For example, suppose there exist four clusters in the set of observation samples. If an algorithm splits one cluster into two but gets the other three clusters correctly, this trial gets a zero count in computing CSR, though the clustering result still has some reasonable interpretation.

with |*C*_{ i }| denoting the size of cluster *C*_{ i }, where we get *k* clusters {*C*_{ i }} in clustering $\mathcal{C}$ and *m* clusters {*C*′_{ j }} in clustering ${\mathcal{C}}^{\prime}$. This MI denotes the mutual information, which describes how much we can reduce the uncertainty about the cluster of a random sample when knowing its cluster in another clustering of the same set of observation samples (Wagner and Wagner [2007]). The smaller the VI value is, the better the performance is.
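Given two label vectors over the same samples, VI can be computed as VI = H($\mathcal{C}$) + H(${\mathcal{C}}^{\prime}$) − 2·MI. A minimal sketch (our own function names; cluster probabilities are estimated as empirical frequencies):

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """VI between two clusterings given as label vectors; 0 iff they agree."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    mi = 0.0
    for a in np.unique(labels_a):
        for b in np.unique(labels_b):
            pab = np.mean((labels_a == a) & (labels_b == b))
            if pab > 0.0:
                pa = np.mean(labels_a == a)
                pb = np.mean(labels_b == b)
                mi += pab * np.log(pab / (pa * pb))
    return entropy(labels_a) + entropy(labels_b) - 2.0 * mi
```

Identical clusterings give VI = 0, and statistically independent ones give the sum of the two entropies.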

The last popular measure is called the probabilistic Rand index (PRI). It further partitions the set of all (unordered) pairs of observation samples into the disjoint union of the following sets:

- ${\mathcal{R}}_{11}=$ {pairs that are in the same cluster under both $\mathcal{C}$ and ${\mathcal{C}}^{\prime}$}
- ${\mathcal{R}}_{00}=$ {pairs that are in different clusters under both $\mathcal{C}$ and ${\mathcal{C}}^{\prime}$}
- ${\mathcal{R}}_{10}=$ {pairs that are in the same cluster under $\mathcal{C}$ but in different ones under ${\mathcal{C}}^{\prime}$}
- ${\mathcal{R}}_{01}=$ {pairs that are in different clusters under $\mathcal{C}$ but in the same one under ${\mathcal{C}}^{\prime}$}.

where ${n}_{\mathit{\text{ab}}}=|{\mathcal{R}}_{\mathit{\text{ab}}}|$ and ${w}_{\mathit{\text{ab}}}=-{\log}_{2}({p}_{\mathit{\text{ab}}})$ for *a*,*b*∈{0,1}. Simple analysis shows that PRI varies between 0 (no agreement on any pair of samples in clusterings $\mathcal{C}$ and ${\mathcal{C}}^{\prime}$) and 1 (when the two clusterings are equal).
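The four pair counts above can be computed directly; as a sketch, here is the unweighted Rand index (the special case of PRI with equal pair weights; the function names are ours):

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Count the pair sets n11, n00, n10, n01 as defined above."""
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif not same_a and not same_b:
            n00 += 1
        elif same_a:
            n10 += 1
        else:
            n01 += 1
    return n11, n00, n10, n01

def rand_index(labels_a, labels_b):
    """Fraction of agreeing pairs; equals 1 when the two clusterings are equal."""
    n11, n00, n10, n01 = pair_counts(labels_a, labels_b)
    return (n11 + n00) / (n11 + n00 + n10 + n01)
```

PRI replaces the equal weighting with the weights *w*_{ ab } defined above.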

Moreover, one popular application of clustering algorithms is image segmentation. To evaluate the performance of semantic image segmentation, one widely used measure is the covering rate (CR) (Richardson and Green [1997]), by which a larger CR value indicates a better performance.

We aim at comparisons of the proposed **Algorithm**-2 with those typical algorithms investigated in (Shi et al. [2011]). For clarification, we summarize them as follows:

- **BYY-Jef and BYY-DNW**: both come from table one and table six in (Shi et al. [2011]).
- **MML-Jef**: taken from table two in (Shi et al. [2011]), same as the one given in (Figueiredo and Jain [2002]).
- **VB-DNW**: taken from table six in (Shi et al. [2011]), same as the one given in (Bishop and Nasrabadi [2006]; Corduneanu and Bishop [2001]).

All algorithms are programmed in MATLAB R2010b on a 32-bit PC with 3.1 GHz Intel Core i5-2400 CPU and 4 GB memory.

All data sets and source codes used in this paper can be downloaded from the website http://www.cse.cuhk.edu.hk/~gychen/pBYY.

### Empirical comparison

All algorithms are initialized with the *k*-means algorithm.

**pBYY** significantly outperforms all the other algorithms in almost all cases, without using any prior. The only exception occurs on the data set **GMM-b**, where BYY-DNW scored the best VI value, though pBYY also got a value very close to that VI score. We also observe how the choice of an appropriate learning stepsize affects the performances of BYY-Jef and BYY-DNW. Closely related to the configurations of the data sets, this choice is a difficult task. On the configuration type of **GMM-b**, similar to the datasets studied in (Shi et al. [2011]), the experiments reconfirm the statement that BYY outperforms its counterparts of VB and MML (Shi et al. [2011]). However, the statement seemingly no longer holds for the configuration types of **GMM-a** and **GMM-c**, probably due to inappropriate learning stepsizes. Favorably, this statement has been reconfirmed by pBYY on the data sets of **GMM-a** and **GMM-c** with the re-initialization period *T*_{ b } set as 5; namely, pBYY still significantly outperforms not only VB-DNW and MML-Jef but also BYY-Jef and BYY-DNW.

**Performance of each algorithm on three synthetic data sets after 500 trials, with the initial number of Gaussian components set as *k* = 20**

| Algorithm | CSR (GMM-a) | VI (GMM-a) | PRI (GMM-a) | CSR (GMM-b) | VI (GMM-b) | PRI (GMM-b) | CSR (GMM-c) | VI (GMM-c) | PRI (GMM-c) |
|---|---|---|---|---|---|---|---|---|---|
| VB-DNW | 0.4660 | 1.0243 | 0.7730 | 0.5160 | 0.6264 | 0.8599 | 0.1060 | 1.3337 | 0.6469 |
| MML-Jef | 0.1700 | 3.2637 | 0.7345 | 0.1600 | 4.8235 | 0.7573 | 0.4140 | 58.0039 | 0.6388 |
| BYY-Jef | 0.2167 | 1.1135 | 0.7006 | 0.5533 | 0.6650 | 0.8257 | 0.0100 | 1.6889 | 0.4732 |
| BYY-DNW | 0.1433 | 1.1947 | 0.7039 | 0.0700 | 0.5373 | 0.8760 | 0 | 1.7948 | 0.4622 |
| pBYY | 0.7260 | 0.5852 | 0.8692 | 0.8840 | 0.5482 | 0.8779 | 0.6100 | 1.1328 | 0.7451 |
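For reference, the VI scores reported in the tables measure the distance between two label assignments, VI = H(C) + H(C′) − 2I(C; C′) = H(C|C′) + H(C′|C); the following minimal Python sketch (our own illustration, not the evaluation code used in the experiments) computes it from two flat label lists:

```python
from math import log
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """Variation of information between two clusterings, in nats.
    VI = H(A|B) + H(B|A); 0 iff the clusterings are identical."""
    n = len(labels_a)
    pa = Counter(labels_a)
    pb = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (x, y), nxy in joint.items():
        p = nxy / n          # joint probability p(x, y)
        px = pa[x] / n       # marginal p(x)
        py = pb[y] / n       # marginal p(y)
        # each joint cell contributes -p * [log p/p(x) + log p/p(y)]
        vi -= p * (log(p / px) + log(p / py))
    return vi
```

Smaller VI is better, which is why pBYY's lower scores in the table above indicate closer agreement with the true clustering.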

**Details of 1D real data sets**

| Data set | Instances | Input feature |
|---|---|---|
| Acidity | 155 | 1 |
| Enzyme | 245 | 1 |
| Galaxy | 82 | 1 |

We compare the performance of the pBYY algorithm with several leading segmentation algorithms, including gPb-owt-ucm (Arbelaez et al. [2011]), multiscale graph decomposition (MN-Cut) (Cour et al. [2005]), and mean shift (Comaniciu and Meer [2002]). To make a fair comparison, these algorithms are implemented under the same prespecified configuration. For MML-Jef, VB-DNW, MN-Cut, and pBYY, the initial cluster number is set to 20. For mean shift, the minimum region area is set to 5,000 pixels. For gPb-owt-ucm, we use the segmentation results posted by (Arbelaez et al. [2011]) and set the threshold to 0.5. These settings are fixed throughout all the evaluations. To reduce the computation, we also skip the re-initialization step in Algorithm 2 to accelerate the pBYY algorithm.

**Performance scores on the BSDS500**

| Measure | Human | Mean shift | MN-Cut | gPb-owt-ucm | MML-Jef | VB-DNW | pBYY |
|---|---|---|---|---|---|---|---|
| PRI | 0.88 | 0.8157 | 0.8066 | 0.7489 | 0.7851 | 0.7866 | 0.8196 |
| VI | 1.17 | 2.2912 | 2.5163 | 1.7539 | 3.4966 | 3.5589 | 2.8140 |
| CR | 0.72 | 0.439 | 0.393 | 0.439 | 0.325 | 0.325 | 0.487 |

## Conclusions

On learning the Gaussian mixture model, the existing BYY learning algorithms are featured by either a gradient-based local search that needs an appropriate stepsize to be prespecified or an EM-like two-step alternation that does not require a learning stepsize but may lead to unstable learning. The proposed pBYY still implements such a two-step alternation but removes the learning unreliability by an embedded projection, significantly outperforming the existing BYY learning algorithms. In the machine learning literature, the Bayesian approach with an appropriate prior provides a standard direction for developing learning algorithms for model selection, with VB and MML being two typical instances. In (Shi et al. [2011]), BYY outperforms MML and VB with the help of the same types of priors, but still fails to prevail with no prior. It has been shown in this paper that pBYY without any prior has outperformed MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW, which confirms that BYY best harmony learning provides a new perspective for automatic model selection even without a prior. In particular, pBYY achieves this superiority with an easy computation, in contrast to the tedious computation required for using the DNW prior. More interestingly, the semantic image segmentation performance on the Berkeley Segmentation Data Set of 100 test images has shown that pBYY outperforms not only MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW but also gPb-owt-ucm, MN-Cut, and mean shift.

## Declarations

### Acknowledgements

Lei Xu was supported by a starting-up grant for the Zhi-Yuan chair professorship by Shanghai Jiao Tong University. Pheng-Ann Heng was partly supported by Hong Kong Research Grants Council General Research Fund (Project No. 412513).

## References

- Akaike H: **A new look at the statistical model identification.** *IEEE Trans Automatic Control* 1974, **19**(6):716–723. 10.1109/TAC.1974.1100705
- Arbelaez P, Maire M, Fowlkes C, Malik J: **Contour detection and hierarchical image segmentation.** *IEEE Trans Pattern Anal Mach Intell* 2011, **33**(5):898–916. 10.1109/TPAMI.2010.161
- Bauschke H, Borwein JM: **On the convergence of von Neumann’s alternating projection algorithm for two sets.** *Set-Valued Anal* 1993, **1**(2):185–212. 10.1007/BF01027691
- Bauschke H, Borwein JM: **Dykstra’s alternating projection algorithm for two sets.** *J Approximation Theory* 1994, **79**(3):418–443. 10.1006/jath.1994.1136
- Barron A, Rissanen J, Yu B: **The minimum description length principle in coding and modeling.** *IEEE Trans Inf Theory* 1998, **44**(6):2743–2760. 10.1109/18.720554
- Bishop CM, Nasrabadi NM: *Pattern Recognition and Machine Learning, vol 1*. Springer, New York; 2006.
- Carpineto C, Romano G: **Consensus clustering based on a new probabilistic rand index with application to subtopic retrieval.** *IEEE Trans Pattern Anal Mach Intell* 2012, **34**(12):2315–2326. 10.1109/TPAMI.2012.80
- Chiu KC, Xu L: **Tests of Gaussian temporal factor loadings in financial APT.** In: *Proc. 3rd International Conference on Independent Component Analysis and Blind Signal Separation*, December 9–12, 2001, San Diego, CA, USA; 313–318.
- Corduneanu A, Bishop CM: **Variational Bayesian model selection for mixture distributions.** In: *Artificial Intelligence and Statistics 2001*. Morgan Kaufmann, Waltham, MA; 2001:27–34.
- Cour T, Benezit F, Shi J: **Spectral segmentation with multiscale graph decomposition.** In: *IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005)*, vol 2, 1124–1131. IEEE; 2005.
- Comaniciu D, Meer P: **Mean shift: a robust approach toward feature space analysis.** *IEEE Trans Pattern Anal Mach Intell* 2002, **24**(5):603–619. 10.1109/34.1000236
- Escalante R, Raydan M: *Alternating Projection Methods, vol 8*. SIAM; 2011.
- Figueiredo MAT, Jain AK: **Unsupervised learning of finite mixture models.** *IEEE Trans Pattern Anal Mach Intell* 2002, **24**(3):381–396. 10.1109/34.990138
- Nigam K, McCallum AK, Thrun S, Mitchell T: **Text classification from labeled and unlabeled documents using EM.** *Mach Learn* 2000, **39**(2–3):103–134. 10.1023/A:1007692713085
- Nikou C, Likas C, Galatsanos NP: **A Bayesian framework for image segmentation with spatially varying mixtures.** *IEEE Trans Image Process* 2010, **19**(9):2278–2289. 10.1109/TIP.2010.2047903
- Reynolds DA: **Speaker identification and verification using Gaussian mixture speaker models.** *Speech Commun* 1995, **17**(1):91–108. 10.1016/0167-6393(95)00009-D
- Redner RA, Walker HF: **Mixture densities, maximum likelihood and the EM algorithm.** *SIAM Rev* 1984, **26**(2):195–239. 10.1137/1026034
- Richardson S, Green PJ: **On Bayesian analysis of mixtures with an unknown number of components (with discussion).** *J R Stat Soc Series B (Statistical Methodology)* 1997, **59**(4):731–792. 10.1111/1467-9868.00095
- Rissanen J: **Modeling by shortest data description.** *Automatica* 1978, **14**(5):465–471. 10.1016/0005-1098(78)90005-5
- Shi L, Tu S, Xu L: **Learning Gaussian mixture with automatic model selection: a comparative study on three Bayesian related approaches.** *Frontiers Electrical Electron Eng China* 2011, **6**(2):215–244. 10.1007/s11460-011-0153-z
- Varma M, Zisserman A: **Texture classification: are filter banks necessary?** In: *IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003)*, vol 2, 691–698. IEEE; 2003.
- Wallace CS, Dowe DL: **Minimum message length and Kolmogorov complexity.** *Comput J* 1999, **42**(4):270–283. 10.1093/comjnl/42.4.270
- Wagner S, Wagner D: **Comparing clusterings: an overview.** Universität Karlsruhe, Fakultät für Informatik; 2007.
- Xu L, Krzyzak A, Oja E: **Unsupervised and supervised classifications by rival penalized competitive learning.** In: *Proc. 11th IAPR International Conference on Pattern Recognition*, vol II, 496–499. IEEE; 1992.
- Xu L: **Bayesian-Kullback coupled Ying-Yang machines: unified learnings and new results on vector quantization.** In: *Proc. International Conference on Neural Information Processing*, Oct 30–Nov 3, 1995, Beijing, China; 977–988.
- Xu L: **Rival penalized competitive learning, finite mixture, and multisets clustering.** In: *Proc. 1998 IEEE International Joint Conference on Neural Networks*, vol 3, 2525–2530. IEEE; 1998.
- Xu L: **Learning algorithms for RBF functions and subspace based functions.** In: *Handbook of Research on Machine Learning, Applications and Trends: Algorithms, Methods and Techniques*. Edited by: Olivas E. IGI Global, Hershey, PA; 2009:60–94. 10.4018/978-1-60566-766-9.ch003
- Xu L: **Bayesian Ying-Yang system, best harmony learning, and five action circling.** *Frontiers Electrical Electron Eng China* 2010, **5**(3):281–328. 10.1007/s11460-010-0108-9
- Xu L: **On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications.** *Frontiers Electrical Electron Eng* 2012, **7**(1):147–196.
- Xu L: **Further advances on Bayesian Ying-Yang harmony learning.** *Appl Inform* 2014, to appear.
- Zhang Y, Brady M, Smith S: **Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm.** *IEEE Trans Med Imaging* 2001, **20**(1):45–57. 10.1109/42.906424
- Zhu S, Zhao J, Guo L, Zhang Y: **Unsupervised natural image segmentation via Bayesian Ying–Yang harmony learning theory.** *Neurocomputing* 2013, **121**:532–539. 10.1016/j.neucom.2013.05.017

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.