
Projection-embedded BYY learning algorithm for Gaussian mixture-based clustering


On learning the Gaussian mixture model, existing BYY learning algorithms are featured by a gradient-based line search with an appropriate stepsize. Learning becomes either unstable if the stepsize is too large or slow and gets stuck in a local optimal solution if the stepsize is too small. An algorithm without a learning stepsize has been proposed with expectation-maximization (EM)-like two alternating steps. However, its learning process may still be unstable. This paper tackles this problem of unreliability with a modified algorithm called the projection-embedded Bayesian Ying-Yang learning algorithm (pBYY). Experiments have shown that pBYY outperforms learning algorithms developed from not only minimum message length with the Jeffreys prior (MML-Jef) and variational Bayesian with the Dirichlet-normal-Wishart (VB-DNW) prior but also BYY with these priors (BYY-Jef and BYY-DNW). pBYY obtains this superiority with an easy implementation, while the DNW prior-based learning algorithms suffer from a complicated and tedious computational load. The performance of pBYY has also been demonstrated on the Berkeley Segmentation Dataset for the task of unsupervised image segmentation. The resulting performance on semantic image segmentation shows that pBYY outperforms not only MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW but also three leading image segmentation algorithms, namely gPb-owt-ucm, MN-Cut, and mean shift.



The Gaussian mixture model (GMM) has been widely used in different areas, e.g., clustering, image segmentation (Zhang et al. [2001]), speaker identification (Reynolds [1995]), document classification (Nigam et al. [2000]), and market analysis (Chiu and Xu [2001]). Learning a GMM consists of parameter learning for estimating all unknown parameters and model selection for determining the number of Gaussian components k. Parameter learning is usually implemented under the maximum likelihood principle by an expectation-maximization (EM) algorithm (Redner and Walker [1984]). A conventional model selection approach is featured by a two-stage implementation, which suffers from a huge computational cost because it requires parameter learning for each candidate GMM. Moreover, parameter learning becomes less reliable as k becomes larger, which implies more free parameters.

One road to tackling these problems is referred to as automatic model selection, which automatically determines k during parameter learning. An early effort is rival penalized competitive learning (RPCL) (Xu et al. [1992]; Xu [1998]), with the number k automatically determined during learning. Automatic model selection may also be approached via appropriate priors on unknown parameters by Bayesian approaches. Two examples are minimum message length (MML) (Figueiredo and Jain [2002]) and variational Bayesian (VB) (Corduneanu and Bishop [2001]). First proposed in (Xu [1995]) and systematically developed in the past two decades, Bayesian Ying-Yang (BYY) learning provides not only new model selection criteria but also a family of learning algorithms that are capable of automatic model selection during parameter learning, with details referred to the recent tutorial and survey by (Xu [2010], [2012]).

A systematic comparison has recently been made by (Shi et al. [2011]) among MML, VB, and BYY with two types of priors. One is the Jeffreys prior, and the other is a parametric conjugate prior that imposes a Dirichlet prior on mixing weights and a joint normal-Wishart prior on mean vectors and covariance matrices, shortly denoted as DNW. The automatic model selection performances of these approaches are evaluated through extensive experiments, with several interesting empirical findings. Among them, it has been shown that BYY considerably outperforms both VB and MML. Different from VB and MML, which rely on appropriate priors to perform model selection, BYY is capable of selecting the model automatically even without imposing any priors on parameters, while its performance can be further improved with appropriate priors incorporated. Similar findings have also been obtained in (Zhu et al. [2013]), where a simplified BYY learning algorithm with DNW priors is shown to outperform, or at least be competitive with, existing state-of-the-art image segmentation methods.

The algorithms in (Shi et al. [2011]) for implementing BYY are featured by a gradient-based line search with an appropriate stepsize. Learning becomes either unstable if this stepsize is too large or slow and gets stuck in a local optimal solution if the stepsize is too small. Given in Algorithm two of (Xu [2009]) and Equation (11) of (Xu [2010]), there is a Ying-Yang two-step alternation algorithm that, similar to the EM algorithm, requires no learning stepsize for the learning procedure. However, the Ying step (Xu [2010]) ignores the constraint that the covariance matrix of each Gaussian component must be a positive definite matrix, so the learning procedure may become unstable.

To constrain each covariance matrix to be a positive definite matrix, this paper introduces a projection operation into the Yang step, which results in a modified algorithm called the projection-embedded BYY learning algorithm, shortly denoted as pBYY. To facilitate its implementation, we also add a Kullback–Leibler divergence-based indicator into the algorithm to improve the detection of redundant Gaussian components. Experiments have shown that pBYY significantly outperforms not only the Jeffreys-based MML (Figueiredo and Jain [2002]) and the DNW-based VB but also the BYY learning algorithms with these two types of priors (Shi et al. [2011]), and it further avoids the complicated and tedious computation brought by the DNW prior.

Gaussian mixture model and four learning principles

GMM assumes that an observation $x\in\mathbb{R}^d$ is drawn from the following mixture of $k$ Gaussian distributions:

$$q(x|\theta)=\sum_{i=1}^{k}\alpha_i G(x|\mu_i,\Sigma_i),\qquad \theta=\{\alpha_i,\mu_i,\Sigma_i\}_{i=1}^{k},\quad \alpha_i\ge 0,\ \sum_{i=1}^{k}\alpha_i=1, \tag{1}$$

where G(x|μ,Σ) denotes a Gaussian density with a mean μ and a covariance matrix Σ.
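As a concrete illustration, the mixture density of Equation (1) can be evaluated numerically. The following Python/NumPy sketch is ours and not part of the original paper; the function names `gaussian_pdf` and `gmm_density` are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density G(x | mu, Sigma) of a d-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def gmm_density(x, alphas, mus, sigmas):
    """Mixture density q(x | theta) = sum_i alpha_i G(x | mu_i, Sigma_i), Eq. (1)."""
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))
```

For instance, a two-component mixture in one dimension is evaluated by calling `gmm_density` with lists of weights, means, and covariance matrices.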

GMM can also be regarded as a latent variable model by introducing a binary latent vector $y=[y_1,y_2,\dots,y_k]^T$, subject to $y_i\in\{0,1\},\ \forall i$, and $\sum_{i=1}^{k}y_i=1$; the latent variable $y_i=1$ means that the random variable $x$ is drawn from the $i$th Gaussian component. The generative process of an observation $x$ is interpreted as follows: $y$ is sampled from a multinomial distribution with probabilities $\alpha$, and then $x$ is randomly generated by the $i$th Gaussian component with $y_i=1$. Let $X\in\mathbb{R}^{d\times n}$ denote the set of $n$ i.i.d. $d$-dimensional observation samples and $Y\in\mathbb{R}^{k\times n}$ denote the set of latent vectors for the observable set $X$; we have the following:

$$q(X,Y|\theta)=q(X|Y,\theta)\,q(Y|\theta),\quad q(X|Y,\theta)=\prod_{t=1}^{n}\prod_{i=1}^{k}G(x_t|\mu_i,\Sigma_i)^{y_{it}},\quad q(Y|\theta)=\prod_{t=1}^{n}\prod_{i=1}^{k}\alpha_i^{y_{it}}. \tag{2}$$

Learning a GMM consists of parameter learning for estimating all the unknown parameters in θ and model selection for determining the number of Gaussian components k, which can be implemented differently under different learning principles.

The most widely used principle is maximum likelihood (ML); that is, we estimate $\theta$ by

$$\max_{\theta}\ q(X|\theta),\quad q(X|\theta)=\sum_{Y}q(X|Y,\theta)\,q(Y|\theta)=\prod_{t=1}^{n}q(x_t|\theta). \tag{3}$$

The ML learning with a known $k$ is typically made by the well-known EM algorithm (Redner and Walker [1984]). However, an unknown $k$ is poorly estimated by Equation (3) when the sample number $n$ is not large enough. The task of determining an appropriate $k$ is called model selection, which is usually made in a two-stage implementation with the help of a model selection criterion. However, such a two-stage implementation suffers from a huge computational cost and an unreliable estimation. These problems are tackled by automatic model selection, which automatically determines $k$ during the learning of $\theta$ without such a two-stage implementation.
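To make the discussion concrete, here is a minimal sketch of one EM iteration for a GMM under the ML principle of Equation (3). This is a standard textbook implementation in Python/NumPy, not the paper's Algorithm 1, and the function name `em_step` is ours:

```python
import numpy as np

def em_step(X, alphas, mus, sigmas):
    """One EM iteration for a GMM: the E step computes the posteriors
    p(i | x_t, theta); the M step re-estimates alpha_i, mu_i, Sigma_i."""
    n, d = X.shape
    k = len(alphas)
    # E step: responsibilities p[t, i] proportional to alpha_i G(x_t | mu_i, Sigma_i)
    dens = np.empty((n, k))
    for i in range(k):
        diff = X - mus[i]
        inv = np.linalg.inv(sigmas[i])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigmas[i]))
        dens[:, i] = alphas[i] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
    p = dens / dens.sum(axis=1, keepdims=True)
    # M step: closed-form updates of the mixture parameters
    new_alphas = p.mean(axis=0)
    new_mus = [(p[:, i:i + 1] * X).sum(axis=0) / p[:, i].sum() for i in range(k)]
    new_sigmas = []
    for i in range(k):
        diff = X - new_mus[i]
        new_sigmas.append((p[:, i:i + 1] * diff).T @ diff / p[:, i].sum())
    return new_alphas, new_mus, new_sigmas
```

Iterating `em_step` from a reasonable initialization converges to a local maximum of the likelihood for a fixed $k$.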

There are three Bayesian-related learning principles that can be implemented with such a property of automatic model selection.

One is minimum message length (MML) (Wallace and Dowe [1999]), which is actually an information-theoretic restatement of Occam's razor. MML was introduced to learn GMM with the property of automatic model selection (Figueiredo and Jain [2002]). Learning is made by the following maximization:

$$\max_{\theta}\ J_{\mathrm{MML}}(X|\theta),\quad J_{\mathrm{MML}}(X|\theta)=\ln q(X|\theta)+\ln q(\theta)-\frac{1}{2}\ln|I(\theta)|, \tag{4}$$

where $|I(\theta)|$ represents the determinant of the Fisher information matrix with respect to (w.r.t.) $\theta$. Equation (4) is mathematically equivalent to a maximum a posteriori (MAP) approach that modifies a proper prior $q(\theta)$ to be proportional to $q(\theta)/|I(\theta)|^{1/2}$.

Using the Jeffreys prior $q(\theta)\propto|I(\theta)|^{1/2}$ directly, Equation (4) degenerates to the ML learning principle. To avoid this situation, Figueiredo and Jain ([2002]) considered the following:

$$\ln\!\left[\frac{q(\theta)}{|I(\theta)|^{1/2}}\right]=-\frac{\rho}{2}\sum_{i=1}^{k}\ln\alpha_i-\frac{k(\rho+1)}{2}\ln N, \tag{5}$$

where $\rho=d+0.5d(d+1)$ is the number of free parameters in each Gaussian component. In (Shi et al. [2011]), it has been shown that some improvement can be obtained by an algorithm that implements the MML principle with the help of a Dirichlet prior and a joint normal-Wishart prior (shortly, the DNW prior).

Another Bayesian-related learning principle is variational Bayesian (VB) (Corduneanu and Bishop [2001]). The naive Bayes approach considers $q(X|\theta)q(\theta)$ with a prior $q(\theta)$ that takes a strong role. Unfortunately, a poor $q(\theta)$ may seriously affect the learning performance. Such a bad influence can be smoothed out by considering the following marginal distribution:

$$q(X)=\int q(X|\theta)\,q(\theta)\,d\theta. \tag{6}$$

However, the integral makes this difficult to compute. VB tackles this difficulty by constructing a lower bound $J_{\mathrm{VB}}$ with the help of Jensen's inequality as follows:

$$\max\ J_{\mathrm{VB}},\quad J_{\mathrm{VB}}=\int p(\theta,Y|X)\ln\frac{q(X,Y|\theta)\,q(\theta)}{p(\theta,Y|X)}\,dY\,d\theta,\quad \ln q(X)\ge J_{\mathrm{VB}}. \tag{7}$$

The goal is to choose a suitable posterior distribution $p(\theta,Y|X)$ from a distribution family, so that the lower bound $J_{\mathrm{VB}}$ can be readily evaluated and yet is sufficiently flexible. One challenge is to provide a suitable distribution family. In (Corduneanu and Bishop [2001]), the posterior distribution is approximately factorized as follows:

$$p(\theta,Y|X)=p(Y|X)\prod_{i}p(\theta_i|X).$$

With $q(X,Y|\theta)$ by Equation (2) and a DNW prior $q(\theta)$, the above $p(\theta_i|X)$ can be obtained with $p(Y|X)$ and $p(\theta_j|X),\ j\ne i$, given by the following equation (Bishop and Nasrabadi [2006]):

$$p(\theta_i|X)=\frac{\exp\!\left\{\int\prod_{j\ne i}p(Y|X)\,p(\theta_j|X)\ln q(X,Y,\theta)\,d\theta\,dY\right\}}{\int\exp\!\left\{\int\prod_{j\ne i}p(Y|X)\,p(\theta_j|X)\ln q(X,Y,\theta)\,d\theta\,dY\right\}d\theta_i}. \tag{8}$$

A tight bound cannot be obtained by Equation (8), which affects the learning performance. Also, DNW is quite tedious, with hyperparameters $\lambda,\xi,m_i,\beta,\Phi,\gamma$ to be updated, which is time-consuming and may fall into a local optimum. To avoid the tedious computation of the DNW prior-based VB, an algorithm for implementing the VB principle was developed in (Shi et al. [2011]) with the help of the Jeffreys prior via approximately using a block-diagonal complete-data Fisher information matrix (Figueiredo and Jain [2002]).

The last Bayesian-related principle is BYY harmony learning. First proposed in (Xu [1995]) and systematically developed in the past two decades, BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing learning regularization, and a class of algorithms that approach automatic model selection during parameter learning. Readers are referred to (Xu [2010], [2012], [2014]) for the latest systematic introductions to BYY harmony learning.

Briefly, a BYY system consists of a Yang machine and a Ying machine, corresponding to two types of decomposition, namely Yang $p(R|X)p(X)$ and Ying $q(X|R)q(R)$, respectively, where the data $X$ is regarded as generated from its inner representation $R=\{Y,\theta\}$ that consists of latent variables $Y$ and parameters $\theta$, supported by a hyperparameter set $\Xi$. The harmony measure is mathematically expressed as follows:

$$H(p\|q)=\int p(R|X)\,p(X)\ln\left[q(X|R)\,q(R)\right]dX\,dR. \tag{9}$$

Maximizing this $H(p\|q)$ leads to not only a best matching between the Ying-Yang pair but also a compact model with the least complexity. Such an ability can be observed from several perspectives (see Section 4 in (Xu [2010])).

Applied to GMM by Equation (2), we have $R=\{Y,\theta\}$ and $q(R)=q(Y|\theta)q(\theta|\Xi)$. Comparing Equation (9) with Equation (7), the key difference is that there is only $q(X,Y|\theta)q(\theta)$ inside the bracket $\ln[\cdot]$ for BYY harmony learning, while there is also a denominator $p(\theta,Y|X)$ for VB learning. Maximizing $J_{\mathrm{VB}}$ leads to a best match between $q(X,Y|\theta)q(\theta)$ and $p(\theta,Y|X)$, while maximizing $H(p\|q)$ leads to not only such a best match but also a modeling of $q(X,Y|\theta)q(\theta)$ with the least complexity. Readers are referred to Section 4 and figure five in (Xu [2012]) for various aspects of this key difference, as well as how they relate to and differ from MML and minimum description length (MDL) (Barron et al. [1998]; Rissanen [1978]).

Maximizing $H(p\|q)$ leads to specific algorithms according to not only what type of $q(\theta|\Xi)$ is chosen for the Ying machine but also how the structure of $p(\theta,Y|X)$ is designed for the Yang machine. Details are referred to Section 4.2 in (Xu [2010]) and Section 3.2 in (Xu [2012]). For the GMM by Equation (2), we introduce two typical examples here.

One example is $p(\theta,Y|X)$ given by Equation (8) together with a DNW prior. Putting them into Equation (9), the DNW prior-based BYY harmony learning algorithm has been developed for maximizing $H(p\|q)$ in (Shi et al. [2011]). Extensive experiments have shown that the DNW prior-based BYY considerably outperforms both VB and MML for any type of prior, whether or not hyperparameters are optimized. When the hyperparameters of the DNW prior are optimized by its corresponding learning principle, BYY further improves its performance and outperforms the others significantly, because learning hyperparameters is a part of the entire BYY harmony learning. However, both VB and MML deteriorate when there are too many free hyperparameters; the performance of VB, especially, drops drastically. The reason is that VB and MML maximize the marginal likelihood via variational approximation and Laplace approximation, respectively, where maximizing the marginal likelihood with respect to a free prior $q(\theta|\Xi)$ makes it tend to the maximum likelihood.

Another example is the following structure:

$$p(\theta,Y|X)=p(Y|X,\theta)\,p(\theta|X),\quad p(Y|X,\theta)=\frac{q(X,Y|\theta)}{\int q(X,Y|\theta)\,dY},\quad p(\theta|X)\ \text{is free of structure}. \tag{10}$$

Maximizing $H(p\|q,\Xi)$ with respect to $p(\theta|X)$ simplifies Equation (9) into

$$\max_{\theta}\ H(\theta),\quad H(\theta)=H_0(\theta)+\ln q(\theta),\quad H_0(\theta)=\sum_{t=1}^{n}\sum_{i=1}^{k}p(i|x_t,\theta)\ln\left[\alpha_i G(x_t|\mu_i,\Sigma_i)\right],\quad p(i|x_t,\theta)=\frac{\alpha_i G(x_t|\mu_i,\Sigma_i)}{\sum_{j=1}^{k}\alpha_j G(x_t|\mu_j,\Sigma_j)}. \tag{11}$$

Automatic model selection and two-step alternation

Given a known $k$, learning the unknown parameters $\theta$ of a GMM is usually implemented under the maximum likelihood principle by an EM algorithm (Redner and Walker [1984]), which is one typical instance of Algorithm 1 featured by a two-step alternation. As remarked at the bottom of the table, we get the EM algorithm after simply removing the lines of trimming, with

$$p_{it}=p(i|x_t,\theta^{\mathrm{new}}),\quad \eta_i=0,\ \rho_i=0,\quad i=1,\dots,k, \tag{12}$$

where $p(i|x_t,\theta)$ is the Bayesian posterior probability given as follows:

$$p(i|x_t,\theta)=\frac{\alpha_i G(x_t|\mu_i,\Sigma_i)}{\sum_{j=1}^{k}\alpha_j G(x_t|\mu_j,\Sigma_j)},\quad \theta=\{\theta_i\}_{i=1}^{k},\ \theta_i=\{\alpha_i,\mu_i,\Sigma_i\}. \tag{13}$$

Generally, $\eta_i,\rho_i$ come from a prior distribution that takes a regularization role. This role is shut off by simply setting them to zero. When $\eta_i=0,\rho_i>0$, the EM algorithm is extended to the smoothed EM algorithm that was first proposed in 1997 (Xu [2010]). Also, we get the EM algorithm for naive Bayes with the Jeffreys prior on $\alpha_i,\Sigma_i$ with

$$\eta_i=\frac{d+0.5d(d+1)}{2n},\quad \rho_i=\frac{d}{2n}. \tag{14}$$

An unknown $k$ is poorly estimated via ML learning by Equation (3), especially when the sample number $n$ is not large enough. The task of determining an appropriate $k$ is handled by model selection, which is usually made in a two-stage implementation. The first stage enumerates $k$ to get a set of candidate models, with the unknown parameters of each candidate estimated by the EM algorithm. In the second stage, the best candidate is selected by a model selection criterion. Examples of such criteria include Akaike's information criterion (AIC) (Akaike [1974]), the Bayesian inference criterion (BIC), and the minimum description length (MDL) criterion (which stems from another viewpoint but coincides with BIC when simplified to an analytically computable criterion) (Barron et al. [1998]; Rissanen [1978]). However, this two-stage implementation suffers from a huge computational cost because it requires parameter learning for each $k$. Moreover, a larger $k$ often implies more unknown parameters; thus, parameter estimation becomes less reliable and the criterion evaluation loses accuracy (see Section 2.1 in (Xu [2010]) for a detailed discussion).

One road to tackling these problems is referred to as automatic model selection, which means automatically determining an appropriate $k$ during parameter learning. An early effort is RPCL (Xu et al. [1992]; Xu [1998]). The key idea is that not only does the winning Gaussian component move a little bit to adapt to the current sample, but the rival (i.e., the second winner) Gaussian component is also repelled a little bit from this sample to reduce a duplicated information allocation. As a result, an extra Gaussian component is driven far away from the data.

A batch learning version of RPCL learning may also be obtained as one instance of Algorithm 1, simply with

$$p_{\ell t}=\begin{cases}1, & \ell=\ell^{*}=\arg\max_{j}p(j|x_t,\theta^{\mathrm{new}}),\\ -\gamma, & \ell=\arg\max_{j\ne\ell^{*}}p(j|x_t,\theta^{\mathrm{new}}),\\ 0, & \text{otherwise,}\end{cases} \tag{15}$$

by which learning is made on a cluster when $p_{\ell t}=1$ and penalizing or de-learning is made on a cluster when $p_{\ell t}=-\gamma$. Usually, the penalizing strength is set to $\gamma\approx 0.005\sim 0.05$. When $\gamma=0$, it degenerates to the so-called hard-cut EM algorithm; see Equations (19) and (20) in (Xu [1995]).

According to its general formulation (e.g., see the last part of Section 2.1 in (Xu [2010])), automatic model selection is a nature of learning a mixture of k individual substructures with the following two features:

There is an indicator $\Psi_j(\theta)$ on $\theta$ or its subset, based on which a particular structural component $j$ can be effectively discarded if its corresponding $\Psi_j(\theta)\to 0$. Taking the GMM as an example, we may consider

$$\Psi_j(\theta)=\alpha_j,\quad \text{or}\quad \Psi_j(\theta)=\alpha_j\,\mathrm{Tr}\!\left[\Sigma_j\right]. \tag{16}$$

With the initial $k$ large enough, there is an intrinsic mechanism that drives such an indicator $\Psi_j(\theta)$ towards zero if the corresponding structure is redundant, so that it can be effectively discarded.

The three Bayesian-related approaches introduced in the previous subsection can all be implemented with such a nature of automatic model selection. For both MML and VB, this nature comes from an appropriate prior $q(\theta|\Xi)$. Favorably, BYY is capable of automatic model selection even without imposing any priors on the parameters, and its performance can be further improved as appropriate priors are incorporated. Actually, the BYY harmony learning by maximizing $H(p\|q)$ bases model selection on $q(R)=q(Y|\theta)q(\theta|\Xi)$, with $q(Y|\theta)$ in a role that is not only equally important to $q(\theta|\Xi)$ but also easy to compute, while $q(\theta|\Xi)$ is still handled in a way similar to MML and VB.

The BYY harmony learning by Equation (11) can be implemented by Algorithm 1, with the Yang step given as follows:

$$p_{it}=p_{it}(\theta^{\mathrm{new}}),\quad p_{it}(\theta)=p(i|x_t,\theta)\left[1+\delta_{i,t}(\theta)\right],\quad \delta_{i,t}(\theta)=\pi_t(\theta_i)-\sum_{i'}p(i'|x_t,\theta)\,\pi_t(\theta_{i'}),\quad \pi_t(\theta_i)=\ln\left[\alpha_i G(x_t|\mu_i,\Sigma_i)\right]. \tag{17}$$

This algorithm implements BYY harmony learning without a prior $\ln q(\theta)$ in Algorithm 1 by simply setting $\eta_i=0,\rho_i=0$, or a data-smoothing-based BYY harmony learning when $\eta_i=0,\rho_i>0$. Readers are referred to Section 3.1 of (Xu [2010]) for further details. Also, we may implement the Jeffreys prior-based BYY harmony learning by using Equation (14); see table one in (Shi et al. [2011]).
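To show how the Yang step of Equation (17) differs from the E step of EM, the following Python/NumPy fragment (ours, not from the paper) computes $p_{it}(\theta)$ from the log joint $\pi_t(\theta_i)=\ln[\alpha_i G(x_t|\mu_i,\Sigma_i)]$; the name `byy_yang_step` is illustrative:

```python
import numpy as np

def byy_yang_step(log_joint):
    """Modified BYY posteriors p_it = p(i|x_t) * (1 + delta_it), Eq. (17),
    given log_joint[t, i] = pi_t(theta_i) = ln[alpha_i G(x_t | mu_i, Sigma_i)]."""
    # Bayes posterior p(i | x_t, theta), computed stably in log space
    m = log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint - m)
    post /= post.sum(axis=1, keepdims=True)
    # delta_it = pi_t(theta_i) - sum_i' p(i'|x_t) pi_t(theta_i')
    delta = log_joint - (post * log_joint).sum(axis=1, keepdims=True)
    return post * (1.0 + delta)
```

Note that each row of the output still sums to one (since the posterior-weighted average of $\delta_{i,t}$ vanishes), yet individual entries may be negative, which is exactly the behavior discussed in the next section.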


Learning unreliability and convex combination

As introduced in Section 3.1 of (Xu [2010]), the existing algorithms for implementing the BYY principle come from taking the gradient of $H(\theta)$ by Equation (11) w.r.t. a subset $\phi$ of parameters. That is, we consider

$$\nabla_{\phi}H(\theta)=\nabla_{\phi}H_0(\theta)+\nabla_{\phi}\ln q(\theta),\quad \nabla_{\phi}H_0(\theta)=\sum_{t=1}^{n}\sum_{i=1}^{k}p_{i,t}(\theta)\,\nabla_{\phi}\pi_t(\theta_i), \tag{18}$$

with $p_{i,t}(\theta)$ and $\pi_t(\theta_i)$ given in Equation (17).

Based on this gradient, one attempt to update the parameters is a gradient-based local search. The parameter $\phi$ can be updated iteratively as below:

$$\phi^{\mathrm{new}}=\phi^{\mathrm{old}}+\eta\,\nabla_{\phi}H(\theta), \tag{19}$$

where $\eta>0$ is a small learning stepsize. Both the BYY learning algorithm given in figure seven of (Xu [2010]) and the BYY-Jef algorithm given in table one of (Shi et al. [2011]) are derived from Equation (19) with the help of some computing tricks and simplifications. However, the performance of such algorithms all depends on an appropriate stepsize. Learning becomes either unstable if $\eta$ is too large or slow and gets stuck in a local optimum if $\eta$ is too small. No such learning stepsize is required by EM algorithms.

Another typical implementation attempts to perform the BYY harmony learning by Equation (11) also in a Ying-Yang two-step alternation, as previously suggested in Section 2.1 and table one of (Xu [2012]). This two-step alternation algorithm is actually derived by approximately letting $p_{i,t}(\theta)$ in Equation (18) be fixed at its value $p_{it}=p_{it}(\theta^{\mathrm{new}})$, such that we can solve the root of $\nabla_{\phi}H(\theta)=0$ subject to this fixation to get the Ying step in Algorithm 1.

Still, there is a lack of theoretical analysis that either guarantees the learning convergence or provides the convergence conditions. On the contrary, we find empirically that the learning process of this BYY two-step alternation may become unstable.

Actually, the root of $\nabla_{\phi}H(\theta)=0$ subject to $p_{it}=p_{it}(\theta^{\mathrm{new}})$ can deviate considerably from the true root of $\nabla_{\phi}H(\theta)=0$, since this true root is coupled with $p_{it}(\theta)$ that varies with $\theta$. Not only is correctly solving the root of $\nabla_{\phi}H(\theta)=0$ a challenging task, but it is also unclear whether fixing $p_{it}=p_{it}(\theta^{\mathrm{new}})$ makes the learning procedure become unstable.

From the likelihood by Equation (3) and Equation (1), it can be observed that

$$\nabla_{\phi}\ln q(X|\theta)=\sum_{t=1}^{n}\sum_{i=1}^{k}p(i|x_t,\theta)\,\nabla_{\phi}\pi_t(\theta_i), \tag{20}$$

with $p(i|x_t,\theta)$ given in Equation (13). Fixing $p(i|x_t,\theta)=p_{it}=p(i|x_t,\theta^{\mathrm{new}})$, solving the root of $\nabla_{\phi}\ln q(X|\theta)=0$ leads to the Ying step in Algorithm 1, or precisely the M step of the EM algorithm, while letting $p_{it}=p(i|x_t,\theta^{\mathrm{new}})$ is just the E step of the EM algorithm. As is well known, the convergence of the EM algorithm has been theoretically proved. That is, though the root of $\nabla_{\phi}\ln q(X|\theta)=0$ is also coupled with $p(i|x_t,\theta)$ that varies with $\theta$, this deviation actually does not affect the convergence.

The difference between $p_{it}=p(i|x_t,\theta^{\mathrm{new}})$ and $p_{it}=p_{it}(\theta^{\mathrm{new}})$ is that $p(i|x_t,\theta),\ i=1,\dots,k$, remain probabilities for any $\theta$, while $p_{it}(\theta),\ i=1,\dots,k$, given in Equation (17) are no longer probabilities and may even take negative values sometimes. Thus $p_{it}(\theta^{\mathrm{new}})$ is sparser than $p(i|x_t,\theta^{\mathrm{new}})$, and the Yang step in the BYY theory introduces a nature of automatic model selection into the iteration procedure.

To further investigate the influence of replacing $p(i|x_t,\theta^{\mathrm{new}})$ by $p_{it}(\theta^{\mathrm{new}})$, we now focus on the Ying step, which can be reformulated as below:

$$\alpha_i=\frac{\sum_{t=1}^{n}p_{it}}{n},\quad \mu_i=\frac{\sum_{t=1}^{n}p_{it}\,x_t}{\sum_{t=1}^{n}p_{it}},\quad \Sigma_i=\frac{\sum_{t=1}^{n}p_{it}\,(x_t-\mu_i)(x_t-\mu_i)^{T}}{\sum_{t=1}^{n}p_{it}}. \tag{21}$$
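The updates of Equation (21) admit a direct implementation. The Python/NumPy sketch below is our illustration (with `ying_step` a hypothetical name); it accepts any weight matrix `p`, whether its entries are probabilities or not:

```python
import numpy as np

def ying_step(X, p):
    """Ying step, Eq. (21): re-estimate alpha_i, mu_i, Sigma_i from the
    (possibly negative) weights p[t, i]; identical to the EM M step when
    p holds Bayes posteriors."""
    n, d = X.shape
    k = p.shape[1]
    alphas = p.sum(axis=0) / n
    mus, sigmas = [], []
    for i in range(k):
        w = p[:, i]
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        sigmas.append((w[:, None] * diff).T @ diff / w.sum())
        mus.append(mu)
    return alphas, mus, sigmas
```

With one-hot (hard assignment) weights this reduces to per-cluster sample means and covariances.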

For the EM algorithm, both $\mu_i$ and $\Sigma_i$ are constrained in the convex hulls spanned by $x_t$ and $(x_t-\mu_i)(x_t-\mu_i)^{T}$, respectively, because its $p_{it}$ still remain in the probability space. However, in the BYY algorithm, the $p_{it}$ are no longer probabilities and may even take negative values sometimes. Thus, $\mu_i$ and $\Sigma_i$ may break out of their corresponding convex hulls. For GMM, the model parameters $\theta$ must satisfy the following constraints:

$$\sum_{i=1}^{k}\alpha_i=1,\ \alpha_i\ge 0,\ \forall i\in\{1,2,\dots,k\};\qquad \Sigma_i\in\mathbb{S}_{+}^{d\times d},\ \forall i\in\{1,2,\dots,k\}, \tag{22}$$

where $\mathbb{S}_{+}^{d\times d}$ denotes the set of positive semidefinite matrices of size $d\times d$. Thus the updated $\alpha_i$ and $\Sigma_i$ in BYY may sometimes fall outside their feasible regions. Instead of projecting $\alpha_i$ and $\Sigma_i$ onto the set of positive semidefinite matrices directly, we are motivated to project $\nabla_{\phi}H_0(\theta)$ back to the convex hull of the local gradients $\nabla_{\phi}\pi_t(\theta_i),\ t=1,\dots,n$, via projecting $p_{it}(\theta^{\mathrm{new}})$ onto the following set of probabilities, so as to preserve more information about $\alpha_i$ and $\Sigma_i$:

$$\mathbb{P}=\left\{(p_1,\dots,p_k):\ p_i\ge 0,\ \sum_{i=1}^{k}p_i=1\right\}. \tag{23}$$

For updating each mean vector $\mu_i$, we are encouraged to use $p_{it}(\theta^{\mathrm{new}})$ directly, because the updating equation of $\mu_i$ is then no longer a convex combination of all observable samples, and the redundant components can be pushed outside the convex hull; thus, this operation accelerates the speed of model selection.

The relative structure among the original $\{p_{it}(\theta^{\mathrm{new}})\}$ is encoded by the position of the vector $p_t^{H}=\left[p_{1t}(\theta^{\mathrm{new}}),\dots,p_{kt}(\theta^{\mathrm{new}})\right]^{T}$ in $\mathbb{R}^k$. Projecting $p_t^{H}$ from $\mathbb{R}^k$ to $\mathbb{P}$ in Equation (23) means finding a vector $p_t=[p_{1t},\dots,p_{kt}]^{T}\in\mathbb{P}$ that is the nearest one to $p_t^{H}$ and thus best keeps the relative structure within the elements of $p_t^{H}$. To be specific, we choose the nearest one in the sense of the least square distance; that is, we consider the following optimization problem:

$$p_t=\arg\min_{p\in\mathbb{P}}\ \left\|p-p_t^{H}\right\|^{2}. \tag{24}$$

The above implementation may be regarded as a two-step approach to the BYY harmony learning by Equation (17) under a principle of multiple convex combination preservation (Xu [2014]).
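For reference, the exact solution of Equation (24), i.e., the Euclidean projection onto the probability simplex, can be computed by a well-known sort-based routine in $O(k\log k)$. The Python/NumPy sketch below is ours and is a standard algorithm from the optimization literature, not the fast approximation that the paper develops next:

```python
import numpy as np

def project_to_simplex(v):
    """Exact Euclidean projection of v onto the probability simplex
    P = {p : p_i >= 0, sum_i p_i = 1}, solving Eq. (24) by the standard
    sort-and-threshold algorithm."""
    k = len(v)
    u = np.sort(v)[::-1]                      # sort coordinates descending
    css = np.cumsum(u)
    # largest index rho with u[rho] + (1 - css[rho]) / (rho + 1) > 0
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, k + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)        # shift that enforces sum = 1
    return np.maximum(v + tau, 0.0)
```

Points already in the simplex are returned unchanged, while infeasible points are shifted and clipped.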

Fast approximation and pBYY-Jef algorithm

The problem in Equation (24) is often encountered in the literature of applied mathematics and scientific computing and is tackled by several algorithms, such as variants of the method of alternating projections (Bauschke and Borwein [1993]) and variants of Dykstra's algorithm (Bauschke and Borwein [1994]). However, these algorithms suffer from a huge computing cost, especially on a large-size data set.

Alternatively, we propose a fast approximation algorithm with two steps, motivated by Kolmogorov's criterion (see Chapter 1 of (Escalante and Raydan [2011])). Let $\Pi_S(x)$ denote the projection point of an arbitrary point $x\in\mathbb{R}^n$ onto a non-empty closed convex set $S\subseteq\mathbb{R}^n$; Kolmogorov's criterion states that $z^{*}=\Pi_S(x)$ if and only if $z^{*}\in S$ and $(z-z^{*})^{T}(x-z^{*})\le 0$ for all $z\in S$, from which we can get the following:

Theorem 1.

Let $F_p=\left\{(p_1,\dots,p_k):\sum_{i=1}^{k}p_i=1\right\}$ with $\mathbb{P}\subset F_p$. Then $\Pi_{\mathbb{P}}(x)=\Pi_{\mathbb{P}}\!\left(\Pi_{F_p}(x)\right)$ for an arbitrary point $x\in\mathbb{R}^k$.


Proof. Let $z'=\Pi_{F_p}(x)$, $z^{*}=\Pi_{\mathbb{P}}(x)$, and $z''=\Pi_{\mathbb{P}}(z')$. From $(z-z'')^{T}(z'-z'')\le 0$ for all $z\in\mathbb{P}$, we have $(z-z'')^{T}(z'-x+x-z'')\le 0$, or $(z-z'')^{T}(z'-x)+(z-z'')^{T}(x-z'')\le 0$. It follows that $(z-z'')^{T}(z'-x)=0$, since $z'=\Pi_{F_p}(x)$ is the projection point of $x$ onto the hyperplane $F_p$, and thus $z'-x$ is orthogonal to the vector $z-z''$ that lies in this hyperplane $F_p$. Therefore, we get the inequality $(z-z'')^{T}(x-z'')\le 0$, which holds for all $z\in\mathbb{P}$; thus $z''=z^{*}$ according to Kolmogorov's criterion. □

Based on this theorem, we split the projection into two steps. First, we consider the following orthogonal projection of p t H onto the hyperplane F p :

$$f_t=\left(I-\mathbf{n}\mathbf{n}^{T}\right)\left(p_t^{H}-f_0\right)+f_0,\qquad f_0=\frac{1}{k}\mathbf{1}, \tag{25}$$

where $\mathbf{n}=\frac{1}{\sqrt{k}}\mathbf{1}$ is the unit normal vector of the hyperplane $\sum_{i=1}^{k}p_i=1$, $f_0$ is the center point of the closed convex set $\mathbb{P}$, and all elements of $\mathbf{1}\in\mathbb{R}^{k\times 1}$ are equal to 1.

Second, we further project $f_t$ onto $\mathbb{P}$. However, accurately calculating the projection point is still very time-consuming. Instead, we consider a fast approximation along the line between $f_t$ and $f_0$ as follows:

$$p_t=\lambda f_0+(1-\lambda)f_t, \tag{26}$$

with the minimum $\lambda$ that makes $p_t$ lie within $\mathbb{P}$.
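A minimal sketch of this two-step approximation (Equations (25) and (26)) in Python/NumPy follows; the closed form for the minimum $\lambda$ from the negative coordinates is our derivation of the feasibility condition, and the function name is illustrative:

```python
import numpy as np

def fast_simplex_projection(p_h):
    """Two-step approximate projection onto the simplex, Eqs. (25)-(26):
    first an orthogonal projection onto the hyperplane sum_i p_i = 1,
    then a shrink toward the simplex center f0 with the smallest lambda
    that makes all coordinates non-negative."""
    k = len(p_h)
    f0 = np.full(k, 1.0 / k)             # center of the simplex
    n = np.full(k, 1.0 / np.sqrt(k))     # unit normal of the hyperplane
    # Eq. (25): f_t = (I - n n^T)(p_h - f0) + f0
    diff = p_h - f0
    f = diff - n * (n @ diff) + f0
    # Eq. (26): p = lambda * f0 + (1 - lambda) * f, with the smallest
    # feasible lambda; each negative coordinate f_i requires
    # lambda / k + (1 - lambda) * f_i >= 0, i.e. lambda >= -f_i / (1/k - f_i)
    neg = f < 0.0
    if not neg.any():
        return f
    lam = np.max(-f[neg] / (1.0 / k - f[neg]))
    return lam * f0 + (1.0 - lam) * f
```

The output always sums to one, and the binding negative coordinate is mapped exactly to zero.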

In summary, we get a modified algorithm as one new instance of Algorithm 1. Its Ying step remains unchanged, but its Yang step gets $\{p_{it}(\theta^{\mathrm{new}})\}$ by Equation (17) and then makes the nearest projection onto $\mathbb{P}$ by Equation (25) and Equation (26). For clarity, we rewrite Algorithm 1 in a detailed form in Algorithm-2, which is dedicated to implementing this projection-embedded BYY learning (shortly named pBYY).

The pBYY implementation repeats the Ying step and the Yang step alternately. It exits the loop in two cases. One is that learning is finally completed as the loop converges with an unchanged $k$. The other is after trimming one Gaussian component, with $k$ reduced by 1, after which it goes to the line of initialization and starts a new loop. This re-initialization is helpful to avoid the accumulation of estimation bias, though it requires extra computing cost. Whether we need it depends on a trade-off between computing cost and estimation accuracy. We may remove this re-initialization by simply deleting the line 'go to Initialization'.

Trimming a Gaussian component is based on an indicator $\Psi_j(\theta)$ as given in Equation (16). Empirically, we find that there are scenarios where such an indicator is insufficient, and we add the following new indicator for detection:

$$\mathrm{KL}_{ij}=\int G(x|\mu_i,\Sigma_i)\ln\frac{G(x|\mu_i,\Sigma_i)}{G(x|\mu_j,\Sigma_j)}\,dx=\frac{1}{2}\left[\ln\frac{|\Sigma_j|}{|\Sigma_i|}-d+\mathrm{Tr}\!\left(\Sigma_j^{-1}\Sigma_i\right)+d_{i,j}^{M}\right],\quad d_{i,j}^{M}=(\mu_i-\mu_j)^{T}\Sigma_j^{-1}(\mu_i-\mu_j). \tag{27}$$

That is, we use the Kullback–Leibler (KL) divergence to measure the similarity between two Gaussian components. When $\mathrm{KL}_{ij}$ becomes close to 0 for some $j\ne i$, we may regard the $i$th Gaussian component as redundant and thus discard it.
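Equation (27) is the familiar closed form for the KL divergence between two Gaussians; a direct Python/NumPy transcription (ours, with an illustrative function name) is:

```python
import numpy as np

def gauss_kl(mu_i, sig_i, mu_j, sig_j):
    """KL divergence KL(G_i || G_j) between two Gaussians, Eq. (27)."""
    d = len(mu_i)
    inv_j = np.linalg.inv(sig_j)
    diff = mu_i - mu_j
    mahal = diff @ inv_j @ diff                               # d^M_{i,j}
    logdet = np.log(np.linalg.det(sig_j) / np.linalg.det(sig_i))
    return 0.5 * (logdet - d + np.trace(inv_j @ sig_i) + mahal)
```

The divergence vanishes exactly when the two components coincide, which is the redundancy condition the indicator detects.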

Results and discussion

Performance measures and algorithms

When samples lie in a space of dimension less than 3, we can visualize them and judge the clustering performance manually. However, samples are usually located in a high-dimensional space in practical problems. Also, human evaluation is too subjective. In this paper, we consider four typical measures for clustering performance and model selection on the number of clusters.

First, a traditional criterion to measure the performance of model selection is the correct selection rate (CSR), namely how many times the algorithm gets the accurate number of clusters among a large number of trials. Sometimes, this criterion is argued to be too strict. For example, suppose there exist four clusters in the set of observation samples. If an algorithm splits one cluster into two but gets the other three clusters correctly, this trial gets a zero count in computing CSR, though the clustering result still has some reasonable interpretation.

Second, one popular measure in the current literature is the variation of information (VI), which evaluates the distance between one clustering result $C$ and the ground truth $C'$ as follows:

$$\mathrm{VI}(C,C')=H(C)+H(C')-2\,\mathrm{MI}(C,C'),\quad H(C)=-\sum_{i=1}^{k}P(i)\log_2 P(i),\quad \mathrm{MI}(C,C')=\sum_{i=1}^{k}\sum_{j=1}^{m}P(i,j)\log_2\frac{P(i,j)}{P(i)P'(j)},\quad P(i)=\frac{|C_i|}{N},\ P(i,j)=\frac{|C_i\cap C'_j|}{N}, \tag{28}$$

with $|C_i|$ denoting the size of cluster $C_i$, where we get $k$ clusters $\{C_i\}$ in clustering $C$ and $m$ clusters $\{C'_j\}$ in clustering $C'$. Here MI denotes the mutual information, which describes how much we can reduce the uncertainty about the cluster of a random sample when knowing its cluster in another clustering of the same set of observation samples (Wagner and Wagner [2007]). The smaller the VI value is, the better the performance is.
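A minimal Python/NumPy sketch of VI by Equation (28), using the identity $\mathrm{VI}=H(C|C')+H(C'|C)$, follows; the helper name `variation_of_information` is ours:

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """VI(C, C'), Eq. (28): H(C) + H(C') - 2 MI(C, C'), in bits,
    computed from two label vectors over the same n samples."""
    n = len(labels_a)
    vi = 0.0
    for a in np.unique(labels_a):
        for b in np.unique(labels_b):
            p_ab = np.sum((labels_a == a) & (labels_b == b)) / n
            if p_ab == 0.0:
                continue
            p_a = np.sum(labels_a == a) / n
            p_b = np.sum(labels_b == b) / n
            # per-cell VI contribution: -p_ab * log2(p_ab^2 / (p_a p_b))
            vi -= p_ab * np.log2(p_ab * p_ab / (p_a * p_b))
    return vi
```

VI is zero exactly when the two clusterings agree up to a relabeling, which makes it a proper distance between partitions.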

The last popular measure is the probabilistic Rand index (PRI). It further considers partitioning the set of all (unordered) pairs of observation samples into the disjoint union of the following sets: $R_{11}$ = {pairs that are in the same cluster under both $C$ and $C'$}; $R_{00}$ = {pairs that are in different clusters under both $C$ and $C'$}; $R_{10}$ = {pairs that are in the same cluster under $C$ but in different ones under $C'$}; $R_{01}$ = {pairs that are in different clusters under $C$ but in the same one under $C'$}.

Assume that each sample is randomly assigned to one cluster. The probability that two samples are in the same cluster in both partitions is $p_{11}=\frac{1}{k}\cdot\frac{1}{m}$. Corresponding to $R_{10}$, $R_{01}$, and $R_{00}$, we get $p_{10}=\frac{1}{k}\left(1-\frac{1}{m}\right)$, $p_{01}=\left(1-\frac{1}{k}\right)\frac{1}{m}$, and $p_{00}=\left(1-\frac{1}{k}\right)\left(1-\frac{1}{m}\right)$. Then, PRI can be expressed as follows (Carpineto and Romano [2012]):

$$\mathrm{PRI}(C, C') = \frac{w_{11} n_{11} + w_{00} n_{00}}{w_{11} n_{11} + w_{10} n_{10} + w_{01} n_{01} + w_{00} n_{00}},$$

where $n_{ab} = |R_{ab}|$ and $w_{ab} = -\log_2(p_{ab})$ for $a, b \in \{0, 1\}$. A simple analysis shows that PRI varies between 0 (no agreement on any pair of samples between the clusterings $C$ and $C'$) and 1 (the two clusterings are identical).
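Putting the pair counts and the weights together, a direct (quadratic-time) Python sketch of PRI; k and m must both be at least 2 so that no $p_{ab}$ vanishes (names are ours):

```python
from itertools import combinations
from math import log2

def probabilistic_rand_index(labels_a, labels_b, k, m):
    """PRI of two clusterings, with weights from random assignment
    to k and m clusters; requires k >= 2 and m >= 2."""
    n11 = n00 = n10 = n01 = 0
    for p, q in combinations(range(len(labels_a)), 2):
        same_a = labels_a[p] == labels_a[q]
        same_b = labels_b[p] == labels_b[q]
        if same_a and same_b:
            n11 += 1
        elif not same_a and not same_b:
            n00 += 1
        elif same_a:
            n10 += 1
        else:
            n01 += 1
    # w_ab = -log2(p_ab) under random assignment
    w11 = -log2((1 / k) * (1 / m))
    w10 = -log2((1 / k) * (1 - 1 / m))
    w01 = -log2((1 - 1 / k) * (1 / m))
    w00 = -log2((1 - 1 / k) * (1 - 1 / m))
    num = w11 * n11 + w00 * n00
    den = w11 * n11 + w10 * n10 + w01 * n01 + w00 * n00
    return num / den
```

Identical clusterings give $n_{10} = n_{01} = 0$ and hence PRI = 1, matching the analysis above.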

Moreover, one popular application of clustering algorithms is image segmentation. To evaluate the performance of semantic image segmentation, one widely used measure is the covering rate (CR) (Richardson and Green [1997]), by which a larger CR value indicates a better performance.
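The text does not spell out how CR is computed. Assuming it follows the region-covering measure of the BSDS benchmark (each ground-truth region weighted by its size and matched to its best-overlapping machine region under intersection-over-union — an assumption on our part), a sketch could look like:

```python
import numpy as np

def covering_rate(seg, gt):
    """Covering of ground-truth segmentation `gt` by machine
    segmentation `seg` (both integer label arrays of equal shape).
    ASSUMED definition: size-weighted best Jaccard overlap per
    ground-truth region; the paper does not define CR explicitly."""
    seg = np.asarray(seg).ravel()
    gt = np.asarray(gt).ravel()
    n = seg.size
    total = 0.0
    for g in np.unique(gt):
        mask_g = gt == g
        best = 0.0
        # only machine regions that intersect this gt region can win
        for s in np.unique(seg[mask_g]):
            mask_s = seg == s
            inter = np.logical_and(mask_g, mask_s).sum()
            union = np.logical_or(mask_g, mask_s).sum()
            best = max(best, inter / union)
        total += mask_g.sum() * best
    return total / n
```

Under this definition a segmentation identical to the ground truth scores CR = 1.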

We aim at comparing the proposed Algorithm 2 with the typical algorithms investigated in (Shi et al. [2011]). For clarity, we summarize as follows: BYY-Jef and BYY-DNW both come from Tables 1 and 6 in (Shi et al. [2011]); MML-Jef was taken from Table 2 in (Shi et al. [2011]), the same as the one given in (Figueiredo and Jain [2002]); VB-DNW was taken from Table 6 in (Shi et al. [2011]), the same as the one given in (Bishop and Nasrabadi [2006]; Corduneanu and Bishop [2001]).

All algorithms are programmed in MATLAB R2010b on a 32-bit PC with 3.1 GHz Intel Core i5-2400 CPU and 4 GB memory.

All data sets and source codes used in this paper can be downloaded from the website

Empirical comparison

We start with three types of synthetic data sets illustrated in Figure 1. Each type of data set is processed in 500 independent trials with random initializations. In the algorithm implementations, the mean vector of each Gaussian component is initialized randomly, and the initial mixing weight and initial covariance matrix of each Gaussian component are computed with the help of the k-means algorithm.
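A minimal Python sketch of such an initialization (the paper's code is in MATLAB; here a single nearest-mean assignment stands in for a full k-means run, and all names and the regularization constant are ours):

```python
import numpy as np

def init_gmm(X, k, seed=0):
    """Initialize a k-component GMM: random means drawn from the data,
    then mixing weights and covariances from a hard nearest-mean
    assignment (one k-means-style E-step)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]  # distinct samples
    # assign each sample to its nearest mean
    dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    weights = np.array([(labels == j).mean() for j in range(k)])
    covs = []
    for j in range(k):
        Xj = X[labels == j]
        if len(Xj) > d:
            covs.append(np.cov(Xj.T) + 1e-6 * np.eye(d))
        else:  # too few samples: fall back to the global covariance
            covs.append(np.cov(X.T) + 1e-6 * np.eye(d))
    return weights, means, np.array(covs)
```

Starting the number of components well above the true one (k = 20 in Table 1) lets the automatic model selection prune the extra components during learning.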

Figure 1

Three synthetic data sets and their ground-truth clusters. For both (a) and (c), data sets of 60 samples are generated from a 2-dimensional 4-component GMM with equal mixing weights of 1/4; for (b), 75 samples are generated from a 2-dimensional 5-component GMM with equal mixing weights of 1/5. (Each red curve indicates a contour of equal probability density per component, and the red diamonds indicate the Gaussian means.)

The performance comparisons are shown in Table 1. We observe that pBYY significantly outperforms all the other algorithms in almost all the cases, without using any prior. The only exception occurs on the data set GMM-b, where BYY-DNW scored the best VI value, though pBYY got a very close value. We also observe how the choice of an appropriate learning stepsize affects the performance of BYY-Jef and BYY-DNW. Closely tied to the configurations of the data sets, this choice is a difficult task. On the configuration type of GMM-b, similar to the data sets studied in (Shi et al. [2011]), experiments reconfirm the statement that BYY outperforms its counterparts of VB and MML (Shi et al. [2011]). However, the statement seemingly no longer holds for the configuration types of GMM-a and GMM-c, probably due to inappropriate learning stepsizes. Favorably, this statement has been reconfirmed by pBYY on the data sets of GMM-a and GMM-c with the re-initialization period T_b set to 5; namely, pBYY still significantly outperforms not only VB-DNW and MML-Jef but also BYY-Jef and BYY-DNW.

Table 1 Performance of each algorithm on three synthetic data sets after 500 trials, with the initial number of Gaussian components set to k = 20, where 'a' indicates the best within its column

Table 2 presents a set of real-world data, where the acidity, enzyme, and galaxy data sets come from (Richardson and Green [1997]). On these data sets, it is difficult to use CSR, PRI, and VI because the correct clustering result is unavailable. Following (Bishop and Nasrabadi [2006]), we compare the performances of these algorithms in modeling the distributions of the acidity, enzyme, and galaxy data sets visually. As demonstrated in Figure 2, BYY-Jef, BYY-DNW, and pBYY all obviously outperform VB-DNW and MML-Jef on the acidity and enzyme data sets, with pBYY performing best and MML-Jef outperforming VB-DNW. On the galaxy data set, pBYY and BYY-DNW perform similarly, and both outperform BYY-Jef, VB-DNW, and MML-Jef. In summary, these experiments confirm the previous findings obtained on the synthetic data sets; that is, pBYY outperforms not only VB and MML but also BYY-Jef and BYY-DNW.

Figure 2

Density fitting on 1D real-world data sets. From left to right: the acidity, enzyme, and galaxy data sets; from top to bottom: the BYY-DNW, BYY-Jef, MML-Jef, VB-DNW, and pBYY algorithms. (The red curve indicates the overall density function and the green curves the density functions per component.)

Table 2 Details of 1D real data sets

To further evaluate the performance of the pBYY algorithm, we apply the proposed algorithm to unsupervised image segmentation on 100 test images from the Berkeley Segmentation Data Set (BSDS), where each image has five ground-truth segmentations hand-drawn by human subjects, as illustrated in Figure 3. For a clustering-based image segmentation algorithm, an important issue is how to obtain features as input vectors. In this paper, we use the features proposed by Varma and Zisserman ([2003]), which have been used with promising image segmentation results (Nikou et al. [2010]; Shi et al. [2011]; Zhu et al. [2013]). To concentrate on the performance of the clustering algorithms, we do not conduct post-processing operations such as region merging and graph cut, although they may further improve the segmentation results.

Figure 3

Ground-truth segmentation results hand-drawn by five different human subjects on image 296058.

We compare the performance of the pBYY algorithm with several leading segmentation algorithms, including gPb-owt-ucm (Arbelaez et al. [2011]), multiscale graph decomposition (MN-Cut) (Cour et al. [2005]), and mean shift (Comaniciu and Meer [2002]). To make a fair comparison, these algorithms are implemented under the same prespecified configuration. For MML-Jef, VB-DNW, MN-Cut, and pBYY, the initial cluster number is set to 20. For mean shift, the minimum region area is set to 5,000 pixels. For gPb-owt-ucm, we use the segmentation results posted by Arbelaez et al. ([2011]) and set the threshold to 0.5. These settings are fixed throughout all the evaluations. To simplify the computation, we also omit the re-initialization step in Algorithm 2 to accelerate the pBYY algorithm.

Following the existing convention (Arbelaez et al. [2011]), we use PRI, VI, and CR to measure comparative performance. The PRI, VI, and CR scores are shown in Table 3. Moreover, pairwise comparisons of pBYY with each competing algorithm are illustrated in Figure 4. By the PRI and CR measures, pBYY outperforms almost all the algorithms.

Figure 4

Pairwise comparison of segmentation algorithms on the BSDS500. The coordinates of the blue dots are the PRI, VI, and CR scores obtained per image by pBYY and its competing algorithms. The red line represents the boundary of equal performance by the two algorithms, and the boxed digits indicate the number of images on which one algorithm is better. For example, the last panel shows that MN-Cut outscores pBYY on merely 4 of the 100 images.

Table 3 Performance scores on the BSDS

There is one exception at the center of the first row: mean shift performs better than pBYY on 53 images according to PRI. Figure 5 shows the comparisons on four images randomly picked from the BSDS. Human judgement may clearly identify that the segmentations by pBYY look much better than their counterparts by mean shift. By the VI criterion, pBYY outperforms MML-Jef and VB-DNW but fails to beat gPb-owt-ucm, MN-Cut, and mean shift. Observed from Figure 5, human judgement may again identify that the segmentations by pBYY are much better than their counterparts by gPb-owt-ucm, MN-Cut, and mean shift. Seemingly, VI is more suitable for measuring clustering-based segmentations aimed at obtaining superpixels; that is, pBYY outperforms all the algorithms for semantic image segmentation but not necessarily for segmentations towards superpixels.

Figure 5

Comparisons on four images from the BSDS500.


On learning the Gaussian mixture model, the existing BYY learning algorithms are featured by either a gradient-based local search that needs an appropriate stepsize to be prespecified or an EM-like two-step alternation that does not require a learning stepsize but may lead to unstable learning. The proposed pBYY still implements such a two-step alternation but removes the learning unreliability by an embedded projection, outperforming the existing BYY learning algorithms significantly. In the machine learning literature, the Bayesian approach with appropriate priors provides a standard direction for developing learning algorithms for model selection, with VB and MML being two typical instances. In (Shi et al. [2011]), BYY outperforms MML and VB with the help of the same types of priors, but still fails to prevail without a prior. It has been shown in this paper that pBYY without any prior has outperformed MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW, which confirms that BYY best harmony learning provides a new perspective for automatic model selection even without a prior. Especially, pBYY uses an easy computation to prevail over the tedious computation required for using the DNW prior. More interestingly, the semantic image segmentation performance on the 100 test images of the Berkeley Segmentation Data Set has shown that pBYY outperforms not only MML-Jef, VB-DNW, BYY-Jef, and BYY-DNW but also gPb-owt-ucm, MN-Cut, and mean shift.


  1. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automatic Control 19(6):716–723. doi:10.1109/TAC.1974.1100705

  2. Arbelaez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image segmentation. IEEE Trans Pattern Anal Mach Intell 33(5):898–916. doi:10.1109/TPAMI.2010.161

  3. Bauschke H, Borwein JM (1993) On the convergence of von Neumann's alternating projection algorithm for two sets. Set-Valued Anal 1(2):185–212. doi:10.1007/BF01027691

  4. Bauschke H, Borwein JM (1994) Dykstra's alternating projection algorithm for two sets. J Approximation Theory 79(3):418–443. doi:10.1006/jath.1994.1136

  5. Barron A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. IEEE Trans Inf Theory 44(6):2743–2760. doi:10.1109/18.720554

  6. Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol 1. Springer, New York

  7. Carpineto C, Romano G (2012) Consensus clustering based on a new probabilistic rand index with application to subtopic retrieval. IEEE Trans Pattern Anal Mach Intell 34(12):2315–2326. doi:10.1109/TPAMI.2012.80

  8. Chiu KC, Xu L (2001) Tests of Gaussian temporal factor loadings in financial APT. In: Proc of 3rd International Conference on Independent Component Analysis and Blind Signal Separation, December 9–12, San Diego, California, USA, 313–318

  9. Corduneanu A, Bishop CM (2001) Variational Bayesian model selection for mixture distributions. In: Artificial Intelligence and Statistics 2001. Morgan Kaufmann, Waltham, MA, 27–34

  10. Cour T, Benezit F, Shi J (2005) Spectral segmentation with multiscale graph decomposition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol 2, 1124–1131. IEEE

  11. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619. doi:10.1109/34.1000236

  12. Escalante R, Raydan M (2011) Alternating projection methods, vol 8. SIAM

  13. Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396. doi:10.1109/34.990138

  14. Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2–3):103–134. doi:10.1023/A:1007692713085

  15. Nikou C, Likas C, Galatsanos NP (2010) A Bayesian framework for image segmentation with spatially varying mixtures. IEEE Trans Image Process 19(9):2278–2289. doi:10.1109/TIP.2010.2047903

  16. Reynolds DA (1995) Speaker identification and verification using Gaussian mixture speaker models. Speech Commun 17(1):91–108. doi:10.1016/0167-6393(95)00009-D

  17. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26(2):195–239. doi:10.1137/1026034

  18. Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J R Stat Soc Series B 59(4):731–792. doi:10.1111/1467-9868.00095

  19. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471. doi:10.1016/0005-1098(78)90005-5

  20. Shi L, Tu S, Xu L (2011) Learning Gaussian mixture with automatic model selection: a comparative study on three Bayesian related approaches. Frontiers Electrical Electron Eng China 6(2):215–244. doi:10.1007/s11460-011-0153-z

  21. Varma M, Zisserman A (2003) Texture classification: are filter banks necessary? In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), vol 2, 691–698. IEEE

  22. Wallace CS, Dowe DL (1999) Minimum message length and Kolmogorov complexity. Comput J 42(4):270–283. doi:10.1093/comjnl/42.4.270

  23. Wagner S, Wagner D (2007) Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik

  24. Xu L, Krzyzak A, Oja E (1992) Unsupervised and supervised classifications by rival penalized competitive learning. In: Proc 11th IAPR International Conference on Pattern Recognition, vol II, 496–499. IEEE

  25. Xu L (1995) Bayesian-Kullback coupled Ying-Yang machines: unified learnings and new results on vector quantization. In: Proceedings of International Conference on Neural Information Processing, Oct 30–Nov 3, Beijing, China, 977–988

  26. Xu L (1998) Rival penalized competitive learning, finite mixture, and multisets clustering. In: Proc 1998 IEEE International Joint Conference on Neural Networks, vol 3, 2525–2530. IEEE

  27. Xu L (2009) Learning algorithms for RBF functions and subspace based functions. In: Olivas E (ed) Handbook of Research on Machine Learning, Applications and Trends: Algorithms, Methods and Techniques. IGI Global, Hershey, PA, 60–94. doi:10.4018/978-1-60566-766-9.ch003

  28. Xu L (2010) Bayesian Ying-Yang system, best harmony learning, and five action circling. Frontiers Electrical Electron Eng China 5(3):281–328. doi:10.1007/s11460-010-0108-9

  29. Xu L (2012) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. Frontiers Electrical Electron Eng 7(1):147–196

  30. Xu L (2014) Further advances on Bayesian Ying-Yang harmony learning. Appl Inform, to appear

  31. Zhang Y, Brady M, Smith S (2001) Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans Med Imaging 20(1):45–57. doi:10.1109/42.906424

  32. Zhu S, Zhao J, Guo L, Zhang Y (2013) Unsupervised natural image segmentation via Bayesian Ying-Yang harmony learning theory. Neurocomputing 121:532–539. doi:10.1016/j.neucom.2013.05.017



Lei Xu was supported by a starting-up grant for the Zhi-Yuan chair professorship by Shanghai Jiao Tong University. Pheng-Ann Heng was partly supported by Hong Kong Research Grants Council General Research Fund (Project No. 412513).

Author information



Corresponding author

Correspondence to Lei Xu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

GYC proposed the idea of pBYY algorithm and designed the experiment part with PAH, and LX improved the original idea of pBYY algorithm and refined the presentation of this method. All authors read and approved the final manuscript.


Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.



About this article


Cite this article

Chen, G., Heng, PA. & Xu, L. Projection-embedded BYY learning algorithm for Gaussian mixture-based clustering. Appl Inform 1, 2 (2014).
