Bilinear matrix-variate analyses, integrative hypothesis tests, and case-control studies
 Lei Xu^{1, 2}
 Received: 26 September 2014
 Accepted: 10 February 2015
 Published: 7 May 2015
Abstract
We pursue a threefold purpose in this paper. First, we suggest a Kullback-Leibler formulation for developing statistics and making discriminative projections for case-control studies, based on which existing typical methods are revisited and then extended to their matrix-variate counterparts. Second, we propose a bilinear matrix form, based on which multivariate discriminative analysis and logistic, Cox, and linear mixed regression are extended into their matrix-variate counterparts. Third, we systematically address the necessity, feasibility, and methodology of integrative hypothesis tests (IHT) from the complementarity of the model-based test and the boundary-based test (BBT) in the data (D)-space, statistics (S)-space, and probability (P)-space. We elaborate four IHT components (modelling, comparison, classification, and assurance) and summarise four IHT types in the D-space. Then, we extend the existing efforts on multivariate tests to BBTs in the S-space. In particular, we extend the classic univariate one-tail z-test to multivariate ones, which is then applied to a multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance. We also propose an SPD discriminative analysis that extends this SPD test. Moreover, we propose a multivariate bi-test that tests the classic null and also a null about the inference reliability due to test space complexity, including a further development of the Fisher combination. Finally, we suggest possible applications for gene expression biomarkers and exome-sequencing-based joint single-nucleotide variant (SNV) detection.
Keywords
 Kullback divergence
 Discriminative projection
 Logistic, Cox, and linear mixed regressions
 Bilinear form
 Boundary-based test
 Integrative hypothesis test
 Bayesian Ying Yang
 Statistics integration
 Dependence decoupling
 Bi-test
 Test reliability
 Controlling testing complexity
 Inclining dominance
 Gene expression
 Joint SNVs detection
Background
Typically, multivariate statistical analysis and related machine-learning studies consider a basic sampling unit in a vector x _{ t }. Though an entire data set may be regarded as a matrix that consists of x _{1},⋯,x _{ N } as its columns, each statistic is computed from an assembly of vector samples and featured by the vector inner product as a basic modelling unit.
Another field that demands matrix-variate-based analyses is computational biology, particularly computational genomics. Typically, expression profiles of basic units (e.g. gene, miRNA, lncRNA) are analysed via vector samples (e.g. via rows or columns of an expression matrix) (Simon et al. 2003). Advanced studies also examine expression profiles under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (Bar-Joseph et al. 2012) and thus demand that sampling units in matrix format or even a high-dimensional array are considered. In a genome-wide association study or exome-sequencing analysis (DePristo et al. 2011; Gibson 2012; Purcell et al. 2007), though the majority of methods are still featured by vector-variate analysis, some efforts have already been made on matrix-variate-based data analysis.

Two-sample test and Hotelling statistics.

Logistic regression, Wald test, and Rao’s score.

Discriminative analyses and integrative hypothesis tests (IHT).

Cox model and linear mixed model.
Then, we pursue a threefold purpose as follows: (1) A Kullback-Leibler-divergence-based formulation for developing statistics and discriminative criteria for case-control studies, based on which existing typical methods are revisited and extended to their matrix-variate counterparts. (2) A bilinear matrix form, based on which discriminative analysis, logistic regression, the Cox model, and the linear mixed model are extended into their matrix-variate counterparts. (3) A systematic investigation of the necessity, feasibility, and implementing methods of IHT from the perspective of model-based test (MBT) versus boundary-based test (BBT) in three levels of space, namely the data sample space (D-space), the statistics space (S-space), and the probability space (P-space).

The complementarity of MBT versus BBT in the D-space, the basic IHT components (modelling, comparison, classification, and assurance), and four types of IHT.

Bayesian Ying Yang (BYY) harmony-learning-based IHT formulation for coordinately optimising the performances of task A, task B, and task C in the D-space.

The MBT vs BBT perspective in the S-space, especially extensions of the existing efforts on the integration of multiple statistics to the S-space BBT, with the help of dependence decoupling.

An S-space BBT-based extension of the univariate one-tail z-test for testing the null of multivariate zero mean, which is then applied to a multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance.

An SPD discriminative analysis that not only improves the multivariate SPD test but also further extends it to matrix-variate ones.

A multivariate bi-test on both the classic null and a null about test reliability by controlling the testing complexity, including a further development of the Fisher combination.
Finally, we discuss several possible IHT applications for expression-profile-based biomarker finding and exome-sequencing-based joint single-nucleotide variant (SNV) detection.
Hypothesis tests for case-control studies
for which a statistic is computed from the samples to test the opposite assumption H _{1} that there is a significant difference between the two populations.
where N=N _{0}+N _{1}, and c _{1},c _{0} are the mean vectors of the case and control populations, respectively. Also, the covariance matrix is assumed to be Σ=Σ _{0}=Σ _{1}.
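As a minimal sketch of the quantities just named, the following computes a two-sample Hotelling statistic from case and control samples. It assumes the standard textbook form \(T^{2}=\frac{N_{0}N_{1}}{N}(c_{1}-c_{0})^{T}\Sigma^{-1}(c_{1}-c_{0})\) with a pooled covariance estimate under Σ=Σ _{0}=Σ _{1}; the function name `hotelling_t2` and the unbiased pooling divisor are illustrative choices, not taken from the paper.

```python
import numpy as np

def hotelling_t2(X1, X0):
    """Two-sample Hotelling T^2 with a pooled covariance estimate.

    X1: (N1, d) case samples; X0: (N0, d) control samples.
    Assumes a common covariance for both populations.
    """
    X1 = np.asarray(X1, dtype=float)
    X0 = np.asarray(X0, dtype=float)
    N1, N0 = len(X1), len(X0)
    c1, c0 = X1.mean(axis=0), X0.mean(axis=0)   # mean vectors of case and control
    # Pooled covariance (assumed divisor N - 2; other conventions exist).
    S = ((X1 - c1).T @ (X1 - c1) + (X0 - c0).T @ (X0 - c0)) / (N1 + N0 - 2)
    diff = c1 - c0
    return N0 * N1 / (N0 + N1) * (diff @ np.linalg.solve(S, diff))
```

For example, `hotelling_t2([[1], [3]], [[0], [2]])` uses means 2 and 1 and a pooled variance of 2.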
Generally, we evaluate the difference between two populations based on population modelling by a parametric model q(x|θ), that is, by first modelling each population of samples and then evaluating the overall difference between the two resulting models. The performance is measured by the p value that describes the false alarm probability of judging that H _{0} by Equation (1) is significantly violated. Such efforts are usually referred to as model-based tests, or sometimes as model comparison or class comparison (Simon et al. 2003).
where Δ(w) is called the score vector, and I(w) is called the Fisher information matrix.
as a test statistic that has an asymptotic normal distribution under the null assumption.
as a test statistic that has an asymptotic distribution of \({\chi ^{2}_{k}}\), where k is the number of constraints imposed by the null hypothesis. It degenerates to \({\chi ^{2}_{1}}\) when w consists of only one parameter.
This logistic regression examines the difference between two populations by first building a hyperplane boundary and then testing Equation (5), which directly addresses whether the boundary depends on the variables under consideration.
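The score vector, Fisher information matrix, and the two tests mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the null is taken to be w=0, the model has no intercept, and the Wald and Rao score statistics are written in their standard quadratic forms, which may differ in detail from Equations (6) to (8). All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_and_information(w, X, omega):
    """Score vector and Fisher information of the logistic log-likelihood."""
    p = sigmoid(X @ w)
    score = X.T @ (omega - p)                       # gradient of the log-likelihood
    info = (X * (p * (1.0 - p))[:, None]).T @ X     # expected negative Hessian
    return score, info

def fit_logistic(X, omega, n_iter=25):
    """Maximum likelihood estimate of w by Newton-Raphson (no safeguards)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        score, info = score_and_information(w, X, omega)
        w = w + np.linalg.solve(info, score)
    return w

def wald_statistic(X, omega):
    """Quadratic-form Wald statistic for the assumed null w = 0."""
    w_hat = fit_logistic(X, omega)
    _, info = score_and_information(w_hat, X, omega)
    return w_hat @ info @ w_hat

def rao_score_statistic(X, omega):
    """Rao score statistic, evaluated at the assumed null w = 0."""
    w0 = np.zeros(X.shape[1])
    score, info = score_and_information(w0, X, omega)
    return score @ np.linalg.solve(info, score)
```

The Wald statistic needs the fitted w, whereas the score statistic only needs quantities evaluated under the null, which is why the latter avoids the iterative fit. The Newton iteration here will not converge on perfectly separable data.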
Discriminative analyses and integrative tests
That is, the case set X _{1} is separated into a subset \(X^{(1)}_{1}\) with unchanged labels and a subset \(X^{(0)}_{1}\) of samples that are relabelled as control samples, and similarly, the control set X _{0} into \(X^{(0)}_{0}\) with unchanged labels and \(X^{(1)}_{0}\) relabelled as case samples.
On the one-dimensional y _{ t }, it follows from Equation (2) that \(T^{2}=\frac {N_{0}N_{1}}{N} J_{y}\) and that FDA is equivalent to seeking a direction w along which the two populations differ most.
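A small sketch may help to see why the FDA direction maximises the separation of the projected samples. It assumes that J _{ y } is the ratio of the squared projected mean difference to the projected pooled variance, and that the maximiser is proportional to \(\Sigma^{-1}(c_{1}-c_{0})\); both are the standard Fisher forms and are assumptions here, as are the function names.

```python
import numpy as np

def pooled_stats(X1, X0):
    """Class means and pooled covariance (assumed divisor N - 2)."""
    c1, c0 = X1.mean(axis=0), X0.mean(axis=0)
    S = ((X1 - c1).T @ (X1 - c1) + (X0 - c0).T @ (X0 - c0)) / (len(X1) + len(X0) - 2)
    return c1, c0, S

def fisher_direction(X1, X0):
    """Direction proportional to S^{-1}(c1 - c0)."""
    c1, c0, S = pooled_stats(X1, X0)
    return np.linalg.solve(S, c1 - c0)

def fisher_criterion(w, X1, X0):
    """J_y for the projection y = w^T x (assumed form)."""
    c1, c0, S = pooled_stats(X1, X0)
    return (w @ (c1 - c0)) ** 2 / (w @ S @ w)
```

Multiplying `fisher_criterion` by N _{0}N _{1}/N gives the one-dimensional T ^{2} on the projected samples; at the Fisher direction this coincides with the multivariate quadratic form \((c_{1}-c_{0})^{T}S^{-1}(c_{1}-c_{0})\).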
With a small sample size, the w obtained by FDA may suffer from the well-known overfitting problem, for which efforts have been made on learning a linear boundary in the machine learning literature. One classical method is the support vector machine (SVM) (Suykens and Vandewalle 1999; Suykens et al. 2002).
Widely adopted in the studies of pattern classification and machine learning, the performance of discriminative analyses is typically measured by the misclassification rate of Equation (10), featuring the separation or overlap of two populations around the boundary and reflecting the confusing chance incurred by a decision or prediction (sometimes called class prediction (Simon et al. 2003)).
The performance of discriminative analyses may also be measured by T ^{2}, which considers the separation of two populations of y _{ t }=w ^{ T } x _{ t }. Monotonically varying with T ^{2}, the p value may be obtained by a univariate t-test. Here, the performance is measured by only considering the salient difference between two populations along the normal direction of the boundary, instead of considering the overall difference in the entire space as addressed after Equation (2).
Alternatively, see Equation (31) in (Xu 2013a), the performance of discriminative analyses may also be measured by a statistic that jointly considers the separating boundary and its outcome by Equation (10).
Since there are different choices for evaluating the difference between two populations, we are motivated to examine whether they can be integrated for a better evaluation. The name of IHT was previously advocated in (Xu 2013a, 2013b) for a joint consideration of the misclassification rate and the p value about the overall difference. This paper will further proceed along this direction.
Cox regression and linear mixed model
Again, we can test H _{0} by Equation (5) with the Wald test by Equation (7) or Rao’s score test by Equation (8), with Δ(w) and I(w) still obtained by Equation (6) but with L given by the above partial likelihood L(w).
Actually, the core part y _{ t }=w ^{ T } x _{ t } of Equations (3) and (13) is also the core part of the classic multivariate linear regression y _{ t }=w ^{ T } x _{ t }+e _{ t } with w estimated by minimising \(\sum _{t} \textbf {e}_{t}^{2}\).
where F=[f _{1},⋯,f _{ m }], and E=[e _{1},⋯,e _{ m }]. One typical case is that f _{1},⋯,f _{ m } are i.i.d. with each f _{ i }∼G(f _{ i }|0,K). Also, e _{1},⋯,e _{ m } are i.i.d. with each e _{ i }∼G(e _{ i }|0,R).
From inner product to bilinear form
In many studies of multivariate statistical analysis and machine learning, a basic sampling unit is a vector \(\boldsymbol {x}_{t}=\left [\!{x}_{t}^{(1)},\cdots, {x}_{t}^{(d)}\right ]^{T}\), and the basic computing operation is the inner product w ^{ T } x _{ t } that is linear with respect to the elements of x _{ t } and also of w. Though w ^{ T } x _{ t } becomes XW in Equation (17), it actually consists of a set of vector inner products in parallel.
which is quadratic with respect to w ^{(i)} and v ^{(j)} but still linear with respect to the elements of X _{ t } and is featured by two consecutive layers of inner products. Similarly, we may also have \(\boldsymbol {w}^{T}X_{t}\textbf {v}=\textbf {v}^{T}\textbf {x}_{t}^{w}\) and \(\boldsymbol {x}_{t}^{w}={X_{t}^{T}}\textbf {w}\). We call such a matrix-variate-based basic computing operation a bilinear form. This bilinear form leads us to matrix-variate LDA and factor analyses in (Xu 2013a, 2013b). Also, using the matrix normal distribution, the implementations are made by the Bayesian Ying Yang harmony learning (Xu 1995, 2015).
which is still linear with respect to the elements of X _{ t } but cannot be decomposed into two inner products, where vec[O] denotes the vectorisation of a matrix O.
That is, the weighting along the rows of X _{ t } is unrelated to the one along the columns of X _{ t }. It significantly reduces the number of free parameters from md for o ^{(i,j)} to m+d for w ^{(i)} and v ^{(j)}, which is favourable because we usually have a small sample size N for a given sample set. However, it also suffers the limitation of being applicable only to the cases where the dependence across rows of X _{ t } is not related to the one along the columns of X _{ t }. To relax this limitation, further generalisations of bilinear matrix forms will be proposed in Equation (40).
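The equivalences among the bilinear form, its two inner-product layers, and the single inner product with vec[O] can be checked numerically. The sketch below assumes a rank-one weight matrix O=w v ^{ T } and uses arbitrary illustrative sizes; variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
r, c = 4, 3                      # hypothetical numbers of rows and columns of X_t
X = rng.normal(size=(r, c))
w = rng.normal(size=r)           # weights applied along one side of X_t
v = rng.normal(size=c)           # weights applied along the other side

y_bilinear = w @ X @ v           # two consecutive layers of inner products
x_v = X @ v                      # x_t^v = X_t v, then y = w^T x_t^v
x_w = X.T @ w                    # x_t^w = X_t^T w, then y = v^T x_t^w

O = np.outer(w, v)               # rank-one weight matrix (assumed O = w v^T)
y_vec = O.ravel() @ X.ravel()    # single inner product vec[O]^T vec[X_t]

n_params_full = O.size               # free parameters of a general O
n_params_bilinear = w.size + v.size  # free parameters of the bilinear form
```

Both vectorisations use the same element ordering, so the inner product of the two flattened arrays is the sum of the elementwise products, which is exactly the bilinear form.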
Methods
KL statistics and matrix-variate tests
from which the Hotelling T ^{2} statistic (Hotelling 1931) and FDA are obtained as its special cases.
which directly measures the discrepancy between the case population and control population and provides a general formulation for model-based tests. In contrast, s _{ KL } by Equation (21) indirectly considers the difference of the case population from the pool of both populations under H _{0}.
It relates to the Hotelling statistics by Equation (2) via \( T^{2} =2\frac {N_{0}N_{1}}{N_{0}+N_{1}}s_{\textit {KL}}\), i.e. the Hotelling statistics is covered as a special case of the general formulation by Equation (21).
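The factor 2 in this relation can be traced through a short worked calculation. It assumes that both populations are Gaussian with a common covariance Σ and that s _{ KL } then reduces to the Kullback-Leibler divergence between the two Gaussians; this reading of s _{ KL } is an assumption made for illustration.

```latex
\[
\begin{aligned}
KL\big(G(x|c_1,\Sigma)\,\|\,G(x|c_0,\Sigma)\big)
 &= \int G(x|c_1,\Sigma)\,\ln\frac{G(x|c_1,\Sigma)}{G(x|c_0,\Sigma)}\,dx \\
 &= \tfrac{1}{2}\,E_{x\sim G(x|c_1,\Sigma)}\!\left[(x-c_0)^{T}\Sigma^{-1}(x-c_0)-(x-c_1)^{T}\Sigma^{-1}(x-c_1)\right] \\
 &= \tfrac{1}{2}\,(c_1-c_0)^{T}\Sigma^{-1}(c_1-c_0), \\[4pt]
2\,\frac{N_0N_1}{N_0+N_1}\cdot\tfrac{1}{2}\,(c_1-c_0)^{T}\Sigma^{-1}(c_1-c_0)
 &= \frac{N_0N_1}{N}\,(c_1-c_0)^{T}\Sigma^{-1}(c_1-c_0).
\end{aligned}
\]
```

The last line is the usual two-sample Hotelling form with N=N _{0}+N _{1}, so under these assumptions the stated relation holds term by term.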
The equivalence no longer exists when we consider other examples of q(x|θ _{1}) and q(x|θ _{0}). Because the case population reflects an abnormal situation and thus has a distribution that is quite different from the control population, q(x|θ _{1}) may come from a parametric family that is different from that of q(x|θ _{0}). For example, we may consider a Gaussian for the control samples and a mixture of two Gaussians for the case samples.
In addition to testing c _{0}=c _{1} as considered by the Hotelling statistics, we may use s _{ KL } by Equation (23) to develop statistics for other null hypotheses of the type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\). For example, \({\theta ^{s}_{i}}\) could be a covariance Σ _{ i }.
Generally, we may use s _{ KL } by Equation (21) to develop a statistic for testing a general relation given by a vector equation h(θ)=0 that consists of one or several joint equations, for which we estimate θ _{0} from samples of X _{0}∪X _{1} subject to the constraint h(θ)=0 and estimate θ _{1} from only the case samples X _{1} without the constraint. The above type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\) is the special case \(h(\theta)={\theta ^{s}_{0}} - {\theta ^{s}_{1}}=0\). Also, the equality may be extended to several subsets \(\{{\theta ^{s}_{i}}\}\) that are equal to each other, with each \({\theta ^{s}_{i}}\) being either a mean vector c _{ i } or a covariance Σ _{ i }. Even the simplest case θ ^{ s }=0, θ ^{ s }⊆θ has been widely studied. For example, θ ^{ s } could be the variances in variance analyses or w=0 in Equation (5) for logistic regression and Cox regression.
Equation (21) provides not only a general formulation for developing a statistic for a composite test but also a bird's-eye view of the existing statistics for further understanding, improvements, and extensions.
where a matrix Ω describes the cross-column dependence of the matrix variate X, and a matrix Σ describes the cross-row dependence of X. This matrix distribution is equivalent to a multivariate Gaussian distribution G(vec(X)|vec(C),Σ⊗Ω), where ⊗ denotes the Kronecker product.
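The Kronecker structure of the equivalent multivariate Gaussian can be verified numerically. The sketch below assumes that vec stacks X row by row, a convention under which the covariance of vec(X) is Σ⊗Ω; with column stacking the two factors swap. The construction X=C+A Z B ^{ T } with Σ=A A ^{ T } and Ω=B B ^{ T } is a standard way to draw from a matrix normal and is used here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 3, 2                         # hypothetical numbers of rows and columns of X
A = rng.normal(size=(m, m))         # factor of the cross-row covariance
B = rng.normal(size=(d, d))         # factor of the cross-column covariance
Sigma = A @ A.T                     # cross-row dependence (m x m)
Omega = B @ B.T                     # cross-column dependence (d x d)

Z = rng.normal(size=(m, d))         # i.i.d. standard normal entries
C = np.zeros((m, d))                # mean matrix
X = C + A @ Z @ B.T                 # one draw from the matrix normal

K = np.kron(A, B)                   # linear map acting on the row-stacked vec(Z)
vec_X = X.ravel()                   # row-by-row vectorisation (assumed convention)

# The same draw, written as a multivariate Gaussian sample.
same_sample = np.allclose(vec_X, C.ravel() + K @ Z.ravel())
# Its covariance K K^T equals the Kronecker product of the two covariances.
same_cov = np.allclose(K @ K.T, np.kron(Sigma, Omega))
```

The practical benefit is in the parameter count: the two factors need far fewer free parameters than an unstructured covariance of size md×md.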
as the matrix-variate counterpart of Equation (24), where parameters are typically estimated by the maximum likelihood principle (Xu 2015).
Generally, with the help of Equation (25), we may also develop statistics for distributions other than matrix normal distributions.
Model-based two-sample tests
The tests for H _{0} by Equation (22) are featured by comparing the difference between two parametric models q(x|θ _{1}) and q(x|θ _{0}) on the entire domain of x. Their basis is modelling the case population by q(x|θ _{1}) with its parameter θ _{1} estimated from X _{1} and modelling the control population by q(x|θ _{0}) with its parameter θ _{0} estimated from X _{0}. Thus, these tests are called model-based two-sample tests, or model-based tests for short where no confusion arises.
Typically, a statistic s is considered to measure the difference between two models. The bigger the value of s, the larger the difference. We reject H _{0} when s takes a large enough value s ^{∗}, and the false positive probability of this rejection is called the p value.
Usually, how to get a statistic s from samples is task-dependent. It is typically a function of the first- and second-order statistics that are random variables directly obtained from samples of the populations, e.g. see the Hotelling statistics by Equation (2). Equation (23) provides a general perspective of getting such a statistic s _{ KL }, covering not only the first- and second-order statistics but also ones beyond.
from which we observe how an overall difference is structured from the statistics on individual differences. For KL _{sum}, the role of the anti-dispersion difference δ _{ α,Σ } is cancelled while the position difference δ c is averaged. For KL _{dif}, the role of δ _{ α,Σ } is summed up while the position difference δ c is cancelled. In other words, the roles of KL _{sum} and KL _{dif} are complementary. According to the nature of the task, we may use either of them separately or both of them jointly.
The modelling error depends not only on what type of model is used but also on an appropriate model complexity. Using a model with too large a complexity can lead to an over-optimistic result, i.e. an overfitting problem. To remedy it, we need to consider either an average of modelling errors on training and testing samples (e.g. by cross validation (Stone 1974)) or an approximate generalisation error given by a model selection criterion (e.g. BIC (Schwarz 1978)).
Table 1 Four tasks of integrative hypothesis tests
Tasks  Description

Task A (modelling)  Estimate θ _{ ω } such that q(x|θ _{ ω }) models the corresponding population of samples, with the performance evaluated by its corresponding ε _{ A }, e.g. the average error or generalisation error.
Task B (comparison)  Develop a statistic s based on the resulting models to test H _{0} by Equation (22), with the performance evaluated by its corresponding ε _{ B } that measures the difference between two populations, e.g. the p value.
Task C (classification)  Classify each sample to either ω=1 or 0, with the performance evaluated by its corresponding ε _{ C }, e.g. either the rate of incorrect classification by Equation (44) or alternatively the corresponding p value obtained by a test based on a statistic by Equation (47).
Task D (assurance)  Test whether a reliable separating boundary exists between the two populations of samples, with the performance evaluated by its corresponding ε _{ D }.
Matrix-variate discriminative analysis
with an m×m _{ s } matrix V and a d×d _{ s } matrix W. It degenerates back to Equation (18) when m _{ s }=1,d _{ s }=1. Mapping into one variable y _{ t } may lose too much discriminative information. Instead, Equation (35) maps X _{ t } into a size-reduced matrix, a column vector, or a row vector according to the practical problem, e.g. not only genomics data in genetic biology but also image or table data in various big data analysis tasks.
Each \(y_{t}^{(k,\ell)}\) above and the bilinear form by Equation (18) suffer the limitation discussed after Equation (20), which is relaxed with v ^{(j)} replaced by \(v^{(j)}_{i}\) or v ^{(j,ℓ)} replaced by \(v^{(j,\ell)}_{i}\), i.e. adding another dimension by a subscript i.
where Tr[A] denotes the trace of the matrix A.
This solution does not relate to w, and thus, the job is done after getting w ^{∗} by Equation (34).
Also, we may update V by a gradient-based approach via ∇_{ V } J(w,V). Practically, a regularisation may be added on J(w,v) and J(w,V) via Gaussian priors on w,v, and V. Alternatively, we may perform sparse learning via Laplace priors on w,v, and V.
where ξ could be either of x _{ t } and X _{ t } or the corresponding projections y _{ t } and Y _{ t }. Mapping samples into the projections helps to reduce the dimension of x _{ t } and X _{ t } for tackling the overfitting difficulty of task A in Table 1, especially when the sample size is not large enough. Also, it facilitates visualisation of the two populations in a low dimension (especially three dimensions or fewer) such that classification can be made with human interaction.
Boundary-based tests
That is, it performs task C to get the decomposition by Equation (10) on which we may directly get the measure ε _{ C } by Equation (44).
Table 2 Two boundary-based tests for Task B
Type  Description

(1)  On the projected samples of y _{ t }=w ^{ T } x _{ t }, we use the one-dimensional case of Equation (24) or Welch’s t-test to test Equation (1) merely along the normal direction of the boundary.
(2)  Measuring the distances of samples from a separating boundary, we consider \( s_{B}=\frac {\sum _{{\mathbf {x}}\in X^{(1)}_{1}\cup X^{(0)}_{0} } \left| \frac {{\mathbf {w}}^{T}({\mathbf {x}}-{\mathbf {c}}_0)}{\Vert {\mathbf {w}} \Vert }\right|^{q}}{ \sum _{{\mathbf {x}}\in X^{(1)}_{0}\cup X^{(0)}_{1} } \left| \frac {{\mathbf {w}}^{T}({\mathbf {x}}-{\mathbf {c}}_0) }{\Vert {\mathbf {w}} \Vert }\right|^{q}+\gamma _{B}}, \ q \ge 0, \) with q=2 for the square distance and q=1 for the Euclidean one.
Table 3 Four types of integrative hypothesis tests
Types  Description

Type-1 (model-based IHT)  For Task A, each of two populations is modelled by a parametric model, with ε _{ A } measured by the negative log-likelihood by Equation (32) or its extension to generalisation error. For Task B, a model-based test is made to compare the difference between two parametric models, with ε _{ B } by the corresponding p value. For Task C, we get the classification by Equation (45), with ε _{ C } by Equation (44) or the p value by a BBT via a statistic obtained from Equation (10).
Type-2 (boundary-based IHT)  A separating boundary is modelled by a hyperplane with its normal w, based on which Task D is handled by a boundary existence test by Equation (5) with ε _{ D } measured by the corresponding p value. For Task C, we get the classification by Equation (46) with ε _{ C } by Equation (44) or alternatively the corresponding p value obtained by Equation (47), and for Task B, we get the p value by one of the two BBT choices in Table 2.
Type-3 (mixing IHT)  Mix the above two types with two populations and their separating boundary all in parametric models. A basic one uses ε _{ A },ε _{ B } from Type-1 and ε _{ C },ε _{ D } from Type-2. The other uses ε _{ C },ε _{ D } from Type-2 while ε _{ A },ε _{ B } are modified by Equation (58).
Type-4 (Ying-Yang IHT)  Instead of mixing, the parametric models are jointly learned for two populations of samples and their separating boundary. One example is the BYY-harmony-learning-based formulation to be introduced after Equation (60).
Choice (2) in Table 2 provides a statistic for task B on samples without dimension reduction. The statistic s _{ B } comes from considering that samples of \(X^{(1)}_{1}, \ X^{(0)}_{0} \) should be distant from the boundary (as illustrated by two blue arrows in Figure 3) while samples of \( X^{(1)}_{0}, \ X^{(0)}_{1}\) should not be far from this boundary (see two red arrows). Actually, s _{ B } is a special case of the ones given by Equations (26) and (30) in (Xu 2013a). The only difference is that γ _{ B }>0 is added here to trade off the contribution from \(X^{(1)}_{0}\cup X^{(0)}_{1}\).
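A minimal sketch of how s _{ B } might be computed is given below. It assumes that the distance of a sample x from the boundary is \(|w^{T}(x-c_{0})|/\Vert w\Vert\), with c _{0} passed in as the reference point of the boundary; the absolute value and the minus sign are a reading of the formula in Table 2, and the function and argument names are hypothetical.

```python
import numpy as np

def boundary_statistic(X_correct, X_wrong, w, c0, q=2.0, gamma_b=1.0):
    """Ratio of boundary distances for correctly and wrongly placed samples.

    X_correct: samples whose labels are unchanged by the boundary.
    X_wrong:   samples that the boundary relabels.
    gamma_b:   positive constant trading off the denominator (assumed default 1).
    """
    w = np.asarray(w, dtype=float)
    c0 = np.asarray(c0, dtype=float)

    def distance_sum(X):
        X = np.asarray(X, dtype=float).reshape(-1, w.size)
        signed = (X - c0) @ w / np.linalg.norm(w)   # signed distance to the boundary
        return np.sum(np.abs(signed) ** q)

    return distance_sum(X_correct) / (distance_sum(X_wrong) + gamma_b)
```

A large value indicates that correctly placed samples lie far from the boundary while relabelled samples stay close to it; γ _{ B } also keeps the ratio finite when no sample is relabelled.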
Some choices for obtaining w
Choice  Description

(a)  Get w via FDA by Equation (11), as addressed in the previous subsection.
(b)  Estimate w by maximising L by Equation (4), as to be addressed in the next subsection.
(c)  Get w as the normal direction of a separating hyperplane by one of the machine learning approaches, e.g. the support vector machine (SVM) (Cortes and Vapnik 1995; Suykens et al. 2002).
Replacing Equation (11) with the matrix-variate FDA by Equation (33), we get the projection y _{ t }=w ^{ T } X _{ t } v column by column along the direction w and row by row along the direction v. With every appearance of x replaced by \(\boldsymbol {x}_{t}^{v}=X_{t}\mathbf {v}\), all the above studies directly apply. Similarly, we may also consider the dual representation \(y_{t}= \mathbf {v}^{T}\mathbf {x}_{t}^{w} \) with \(\boldsymbol {x}_{t}^{w} ={X}_{t}^{T}\mathbf {w}\) to get a linear separating boundary featured by v. It follows from Equations (19) and (20) that w and v jointly form a linear boundary by vec[O] to separate samples of vec[X _{ t }].
Furthermore, extension can be made on the generalised bilinear form via Equation (40) and Equation (41), with each x replaced by \(\boldsymbol {x}_{t}^{v} \) given in Equation (40).
Extensions can also be made on the generalised bilinear form by Equation (35). Samples of two populations are projected into a dimension-reduced matrix Y _{ t }=V ^{ T } X _{ t } W, and then, a matrix-variate Hotelling test can be made by Equation (28) with X _{ t } replaced by Y _{ t } and the subscript x replaced by y, where the matrices W,V actually take the roles of the boundary.
Matrix-variate logistic regression
where ξ _{ t } consists of dummy variables. Moreover, random effects may also be added, in a way similar to that of the linear mixed model by Equation (15).
where \( \bar \omega \) denotes the mean of ω _{ t }.
Unlike the BBT addressed in the previous subsection, testing H _{0} by Equation (5) directly addresses whether a boundary w exists. Such a test is thus named the boundary existence test. It is widely known as a test for regression analyses. Also, we may regard it as a two-sample test that is complementary to the BBT choice (1) in Table 2. The two tests jointly cover the entire space of samples.
The boundary existence test actually tackles another essential problem of discriminative analysis, namely, task D in Table 1. Given two populations with a finite sample size, it is not difficult to draw a boundary to separate them if there is no restriction on the complexity of the boundary. However, a boundary with a high complexity will be unreliable in separating new samples that come randomly from the same populations. To be reliable, the boundary should have an appropriate complexity too. It follows from Equation (45) that an optimal separating boundary is related to the models q(x|θ _{1}) and q(x|θ _{0}). In other words, an appropriate boundary complexity is related to an appropriate model complexity. Thus, task D and task A in Table 1 are coupled.
Typically, we consider a linear boundary because of its low complexity. In the pattern recognition literature (Cortes and Vapnik 1995; Cover 1965), efforts on whether samples of two populations are linearly separable by a hyperplane or a maximum-margin hyperplane can be regarded as examples related to task D in Table 1.
Next, we proceed to consider matrix-variate logistic regression. Putting the case and control samples into a paired set {X _{ t },ω _{ t }},t=1,⋯,N, we extend Equation (3) with the inner product y _{ t }=w ^{ T } x _{ t } replaced by the bilinear form by Equation (18) or its extension by Equation (40).
Given V, the above studies directly apply when \(\boldsymbol {x}_{t}^{\textbf {v}}\) in Equation (40) replaces x _{ t } in Equations (3), (4), (7), and (8). The task of learning w,V can be made via the matrix-variate FDA by Equations (34) or (42).
where η _{ w }>0,η _{ V }>0 are small learning step sizes.
for the bilinear form by Equation (18) simply with v replacing w in Equations (6), (7), (8), and (49). Similarly, extension may also be made to test H _{0}: v _{ i }=0,∀i.
with p(ω _{ t }|x _{ t },θ) given by Equation (3), where θ ^{∗} is estimated via maximising L by Equation (4) under H _{0} by Equation (5) and \(\hat \theta \) is estimated via maximising L by Equation (4) without H _{0}.
Similarly, we may get a matrix-variate Cox regression with the inner product w ^{ T } x _{ t } in Equation (13) replaced by the bilinear form by Equation (18) or its extension Equation (40). Accordingly, we test the H _{0} by Equation (5) and the H _{0} by Equation (51), using the Wald test with Equation (7) or Rao’s score by Equation (8) with Δ(w),I(w) computed from Equation (6) but L given by the partial likelihood L(w).
where E _{ t } is independent of X _{ t } and comes from N(Y _{ t }−V ^{ T } X _{ t } W|0,Λ,D) by Equation (26), while both Λ,D are diagonal matrices.
which may be again handled by Equation (37) with w replaced by W.
as a bilinear mixed model extended from Equation (15).
Integrative hypothesis test
Discriminative analysis and testing of H _{0} by Equation (1) are made from either a model-based perspective (e.g. performing task A and task B in Table 1) or a boundary-based perspective (e.g. performing task C and task D in Table 1). Moreover, all four tasks are associated with another problem called feature selection, that is, selecting a number of elements in x to form a subset x _{ f } such that one or more of the four tasks achieves a good enough performance.
In the existing efforts, each of the four tasks has been studied individually, with each having its strength and limited coverage. However, the performances of these tasks are coupled, and thus, the best set of features for one task may not necessarily be the best for the others.
The complementary nature of task B and task C was preliminarily discussed in Section VI in (Xu 2012a), where a model-based test for task B is named the A-test (a test in the observed data domain) and a boundary-based test for task C is named the I-test (a test in the inner representation domain). Under the name of IHT, good performances of task B and task C are demanded jointly (Xu 2013a, 2013b). This paper further extends IHT to include task A and task D.
As indicated by the blue vertical dashed line in Figure 4, there are many miRNAs that share the same small p value ε _{ B } but take values of misclassification ε _{ C } over a wide range. Also, as indicated by the blue horizontal dashed line in Figure 4, there can be multiple miRNAs that have the same misclassification but different p values. In other words, though the performance of one task is optimised, the performance of the other can still be poor. Thus, we need to jointly seek good performances of both tasks, i.e. IHT is necessary. On the other hand, it is observable from the red dots within the blue circle in Figure 4 that there are indeed a few scattered points with each taking both a small p value ε _{ B } and a small misclassification ε _{ C }, i.e. it is also feasible to achieve the goal of IHT.
Such a 2D plot provides a tool for seeking better joint performances of task B and task C, by which we may interactively observe the configuration of scattered points and locate the candidate points that are nearest to the origin of the coordinate space.
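The step of locating the candidates nearest to the origin can be sketched as a simple ranking. The sketch assumes that both measures are used on their raw scales and that nearness is Euclidean; in practice a log scale for p values or a rescaling of the axes may be preferable, and the function name is hypothetical.

```python
import numpy as np

def nearest_to_origin(eps_b, eps_c, k=3):
    """Indices of the k candidates with the smallest joint (eps_B, eps_C).

    eps_b: p values of task B, one per candidate (e.g. per miRNA).
    eps_c: misclassification rates of task C, one per candidate.
    """
    points = np.column_stack([eps_b, eps_c])
    dist = np.linalg.norm(points, axis=1)        # Euclidean distance to the origin
    return np.argsort(dist, kind="stable")[:k]   # closest candidates first
```

A candidate with a tiny p value but a large misclassification rate, or the reverse, is ranked low, which matches the observation that optimising one measure alone is not sufficient.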
Extensions can be further made to a joint evaluation of the IHT performance with task A and task D also included, such that the strengths of different tests and methods are integrated in a rather systematic way, for which we address four types of IHT in Table 3.
From the model-based perspective, the first type is an extension of the one addressed in Figure 2, with ε _{ C } added to get a 3D plot for a joint evaluation of ε _{ A }, ε _{ B }, and ε _{ C }. Instead of Equation (45), we may get ε _{ C } by some nonparametric classifiers, e.g. the classic k-NN classifier and the kernel classifiers (Williams 2003). Moreover, we are unable to handle task D because the boundary involved here does not have an explicit expression to be tested.
From the boundary-based perspective, the second type considers samples jointly by a separating boundary and projected samples, evaluated by ε _{ D } for the existence of the boundary, ε _{ C } for the misclassification by the boundary, and ε _{ B } for measuring the difference of two populations either along the normal direction of the boundary or according to the sample deviations from the boundary. Again, we may use a 3D plot for a joint evaluation of ε _{ B }, ε _{ C }, and ε _{ D }. However, it is difficult to handle task A merely based on the boundary.
Even better, we may estimate each θ _{ ω } by the maximum likelihood on the entire set X of samples but with the likelihood of each sample weighted by its corresponding posterior p(ω|sample) by Equation (3).
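A minimal sketch of this posterior-weighted maximum likelihood step follows. It assumes, for illustration only, that each population is modelled by a Gaussian, so that θ _{ ω } consists of a mean and a covariance, and that the posterior weights have already been computed; the function name is hypothetical.

```python
import numpy as np

def weighted_gaussian_ml(X, weights):
    """Weighted ML estimate of a Gaussian mean and covariance.

    X:       (N, d) array holding the entire sample set.
    weights: (N,) posteriors p(omega | x_t) of the population being fitted.
    """
    X = np.asarray(X, dtype=float)
    r = np.asarray(weights, dtype=float)
    total = r.sum()
    mean = (r[:, None] * X).sum(axis=0) / total
    centred = X - mean
    cov = (r[:, None] * centred).T @ centred / total   # weighted ML (biased) covariance
    return mean, cov
```

Calling the function once with the weights p(ω=1|x _{ t }) and once with 1−p(ω=1|x _{ t }) gives the two sets of parameters, so every sample contributes to both models in proportion to its posterior.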
BYY-harmony-learning-based formulation
The 2D and 3D plots provide only a preliminary tool for IHT; we need further studies on not only appropriate combinations of multiple p values and misclassification rates but also simultaneous optimisation of multiple measures. For the latter purpose, the mix-modelled IHT in Table 3 is further extended via iteratively learning θ _{0} and θ _{1} by Equation (58) to update the models \(q\left (X^{(0)}_{0} | \theta _{0}\right), q\left (X^{(1)}_{1} | \theta _{1}\right)\) and also re-estimating the boundary w, e.g. by an FDA method based on the updated models.
Leaving task D for a future study, in the sequel, we further understand the task of learning the models from a perspective of learning a Ying machine and the task of learning the boundary from a perspective of learning a Yang machine, which leads to a BYY-harmony-learning-based formulation for IHT.
from which we observe that a large KL _{10} comes from a large L _{1} that reflects a good modelling of α _{1} q(x|θ _{1}) (i.e. a good performance of task A) and a small confusion error \(e^{c}_{0,1}+e^{c}_{1,0}\) that is closely related to a small misclassification (i.e. a good performance of task C). In other words, the three tasks are coordinately optimised.
To use this KL _{sum}, we still need to get θ _{ ω },ω=0,1 by the ML learning. In other words, KL _{sum} merely takes a role of evaluating the performances of task B and task C, but does not have a port to accommodate samples for estimating θ _{ ω },ω=0,1. Favourably, such a port is provided in the BYY harmony learning such that task A, task B, and task C are all jointly implemented.
First proposed in (Xu 1995) and systematically developed over the past two decades, the BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing learning regularisation, and a class of algorithms that implement automatic model selection during parameter learning. Readers are referred to (Xu 2010, 2012b, 2015) for recent introductions to the BYY harmony learning.
Maximising this H(p‖q) makes this Ying Yang pair not only best matched but also of the least complexity. Such an ability can also be further observed from several perspectives (see Section 4.1 in (Xu 2010)).
where p(x) provides a port to accommodate samples \(\{\mathbf {x}_{t}\}^{N}_{t=1}\) via an empirical \(p(\mathbf {x})=\frac {1}{N}\sum _{t} \delta (\mathbf {x}-\mathbf {x}_{t})\) with δ(x) being the Dirac delta, which thus makes it possible to estimate θ _{ ω },ω=0,1 via maximising H(p‖q).
Approximately considering p(x)≈q(x|θ), \(e^{H}_{0,1}+e^{H}_{1,0}\approx e^{c}_{0,1}+e^{c}_{1,0}\), and \({L_{1}^{H}}+{L_{0}^{H}}\approx L_{1} +L_{0}\), we observe that H(p‖q) shares a nature similar to KL _{sum} in Equation (59), while a difference is that the modelling part \({L_{1}^{H}}+{L_{0}^{H}}\) is provided with a port p(x) to accommodate samples such that task A can be performed via maximising H(p‖q) without a need of separately estimating θ _{ ω } by the ML learning.
For q(x|θ)=G(x|c,Σ), we implement the maximisation of H(p‖q) to estimate θ _{ ω } by directly adopting the semi-supervised BYY harmony learning for Gaussian mixture given in (Xu 2015), i.e. its algorithm 9, by which the performances of task A, task B, and task C are coordinated. Moreover, H(p‖q) can be extended into its matrix-variate counterpart. Particularly, algorithm 9 in (Xu 2015) can be extended into the algorithm ?? given below for learning \(\alpha _{\omega }N\left (X \mid C^{x}_{\omega },\Omega ^{x}_{\omega },\Sigma ^{x}_{\omega }\right)\).
During implementation of the above algorithm, not only is task A performed, but task C can also be simply handled in the Yang step by checking whether \(p_{1 \mid \textbf {x}_{t}} \ge p_{0 \mid \textbf {x}_{t}}\) to classify each sample into the case or control. Also, task B can be made after learning by putting the resulting parameters into s _{ KL }=KL _{10} or s _{ KL }=KL _{sum} to get the corresponding p value.
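The classification check in the Yang step can be illustrated with a minimal sketch, assuming two univariate Gaussian components with known parameters. The names `posterior_case` and `classify`, and the parameter layout `theta = (mean, variance)`, are assumptions made for illustration; the sketch does not reproduce algorithm 9 of (Xu 2015).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian G(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_case(x, alpha1, theta1, theta0):
    """Posterior p(1 | x) under a two-component Gaussian mixture.

    alpha1 is the proportion of cases; theta1 and theta0 are (mean, variance).
    """
    joint1 = alpha1 * gaussian_pdf(x, *theta1)
    joint0 = (1.0 - alpha1) * gaussian_pdf(x, *theta0)
    return joint1 / (joint1 + joint0)


def classify(x, alpha1, theta1, theta0):
    """Return 1 (case) if p(1 | x) >= p(0 | x), else 0 (control)."""
    return 1 if posterior_case(x, alpha1, theta1, theta0) >= 0.5 else 0
```

Because the two posteriors sum to one, checking whether the case posterior is at least 0.5 is equivalent to comparing the two posteriors directly.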
Semi-supervised testing and validating
| Issues | Description |
| --- | --- |
| Issue 1 | Estimate the parameters by semi-supervised learning on the training set, from which we get the corresponding p value p and a classifier. Using this classifier on the training set and the testing set, it follows from Equation (44) that we get \(\varepsilon _{C}^{tr}\) and \(\varepsilon _{C}^{te}\). This is what we traditionally get. |
| Issue 2 | Lump the training samples and testing samples together and estimate the parameters by semi-supervised learning on the lumped set, from which we also get the corresponding \(\tilde {p}\), \(\tilde {\varepsilon }_{C}^{tr}\), and \(\tilde {\varepsilon }_{C}^{te}\). |
| Issue 3 | \(\tilde {p}\) is actually more reliable than p because testing samples are used for regularising parameter estimation. This \(\tilde {p}\) is also different from the traditional compounded p value because the label information of testing samples has not been compounded. |
| Issue 4 | Without using the label information of testing samples, \(\tilde {\varepsilon }_{C}^{te}\) shares the same concept as \(\varepsilon _{C}^{te}\), but is actually more reliable because of regularisation. |
| Issue 5 | Merge the training set and testing set into a big training set and treat the validating set as a new testing set, which extends this procedure to improve the validation. |
Integrating p values, inferring rejection domain, and S-space boundary-based tests
Each IHT type in Table 3 involves more than one measure, which raises the problem of how different measures are jointly evaluated. Though 2D or 3D plots provide a possible joint evaluation, how to appropriately scale each measure is still a challenging issue. In general, we need to integrate multiple measures into a scalar index based on which the joint performance can be evaluated, which relates closely to efforts made on combining multiple classifiers (Xu and Amari 2008; Xu et al. 1992b) and evidence combination (Barnett 2008).
For an IHT task, the final scalar index is typically the p value. When multiple measures are all in the p values, what we encounter becomes the task of p value combination, e.g. by the Fisher combination (Fisher 1948).
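As a reminder of how the Fisher combination works, the following sketch combines k independent p values. Under the null, the statistic −2Σ ln p_i follows a chi-square distribution with 2k degrees of freedom, whose survival function has a closed form for even degrees of freedom. The function name `fisher_combination` is an illustrative choice.

```python
import math


def fisher_combination(p_values):
    """Combined p value of independent p values by Fisher's method.

    The statistic X = -2 * sum(ln p_i) is chi-square with 2k degrees of
    freedom under the null; for even degrees of freedom the survival
    function is exp(-X/2) * sum_{j<k} (X/2)^j / j!.
    """
    if not p_values or any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


combined = fisher_combination([0.05, 0.05])
```

For two p values with product P, the combined value reduces to P(1 − ln P), which is larger than P itself because the product of two uniform variables is not uniform.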
In Table 3, ε _{ B } and ε _{ D } are already given in p values. But ε _{ A } is usually measured by a square error or negative log-likelihood, and ε _{ C } is measured by a misclassification rate. Alternatively, ε _{ C } may be given in a p value via the statistics in Equation (47). Letting s=−ε _{ A } or generally s=−ε for a monotonic measure ε≥0 that prefers values close to zero, we may get the corresponding p value with the help of the permutation method.
However, p value combination has a weak point. Each p value is merely a positive number that indicates the false alarm probability, which already loses certain useful information. Under the term meta-analysis (Evangelou and Ioannidis 2013), efforts have been made to transform p values into multiple Z statistics such that the missing information is added in, without or with the help of information directly from data (Zaykin 2011).
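One common way of turning p values into Z statistics is the unweighted Stouffer combination, sketched below. This is offered only as a familiar instance of the idea; it is an assumption that it resembles the cited approaches, and weighted variants are not shown.

```python
import math
from statistics import NormalDist


def stouffer_combination(p_values):
    """Map one-sided p values to Z statistics and combine them.

    Each p_i in (0, 1) is mapped to z_i = Phi^{-1}(1 - p_i); the combined
    statistic sum(z_i) / sqrt(k) is standard normal under the null when
    the tests are independent.
    """
    std_normal = NormalDist()
    z = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    z_combined = sum(z) / math.sqrt(len(z))
    return z, 1.0 - std_normal.cdf(z_combined)


z_scores, p_combined = stouffer_combination([0.04, 0.10, 0.30])
```

Unlike a bare p value, each Z statistic keeps a sign and a scale, which is the kind of information that can later be reweighted or examined jointly in the space of statistics.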
Actually, the Hotelling T ^{2} statistics by Equation (24) and getting a statistics by Equation (21) may also be regarded as examples that get an integrated statistics s _{ f }. Generally, a multivariate hypothesis test may also be regarded as an integration of multiple univariate hypothesis tests.
Typically, an integrated statistics s _{ f }=g(s,Ψ)≥0 comes from s=[s ^{(1)},⋯,s ^{(d)}] such that s _{ f }≥0 monotonically increases as the situation differs far from H _{0}, where each s ^{(i)} comes from one univariate hypothesis test (e.g. s=c _{1}−c _{0} in the Hotelling T ^{2} statistics) with a set Ψ of parameters shaping the integration (e.g. the covariance Σ in the Hotelling T ^{2} statistics). The set Ψ is specified without or with the help of information obtained directly from input data. A critical value \({\tilde s}_{f}\) is computed from the original pair of the sample set X _{0},X _{1}. Then, the false alarm probability \(p(s_{f}>{\tilde s}_{f} \mid H_{0})\) is obtained as the p value, where and hereafter p(·|H _{0}) denotes a probability under the condition that H _{0} is satisfied.
However, choices for such a s _{ f }=g(s,Ψ) are very limited in the existing studies, mostly in a quadratic form such as Hotelling statistics, Rao’s score by Equation (8), and the Wald test by Equation (7). This is equivalent to approximately regarding s ^{(1)},⋯,s ^{(d)} from a multivariate Gaussian distribution, while other distributions are seldom studied yet.
where \(\boldsymbol {\tilde s}_{{X}_{10}}=I_{\textit {nf}}(X_{0},X_{1})\) means that \(\boldsymbol {\tilde s}\) is inferred from the given sample set X _{0},X _{1} by an inferring method I _{ nf }, and the subscript X _{10} is used as the abbreviation of X _{1},X _{0}, which will be omitted whenever the omission causes no confusion.
where # S denotes the cardinality of a set S, the subscript \(X^{\pi }_{10}\) is used as the abbreviation of \(X_{0}^{\pi },X_{1}^{\pi }\), and Π consists of a large enough set of permutations made by either enumeration or random shuffling, including that π=empty denotes the sample pair X _{0},X _{1}.
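The counting-based p value over a permutation set Π can be sketched as below for a scalar statistic, assuming random shuffling of the pooled case-control samples and a one-sided comparison. The statistic `mean_difference`, the function name, and the default number of permutations are illustrative assumptions; the original pair is counted as the identity permutation, in line with π=empty being included in Π.

```python
import random


def mean_difference(x0, x1):
    """A simple scalar statistic: mean of cases minus mean of controls."""
    return sum(x1) / len(x1) - sum(x0) / len(x0)


def permutation_p_value(x0, x1, statistic=mean_difference, n_perm=2000, seed=0):
    """Fraction of permuted pairs whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = statistic(x0, x1)
    pooled = list(x0) + list(x1)
    n0 = len(x0)
    count = 1  # the identity permutation, i.e. the original pair
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:n0], pooled[n0:]) >= observed:
            count += 1
    return count / (n_perm + 1)


controls = [0.1 * i for i in range(10)]
cases = [5.0 + 0.1 * i for i in range(10)]
p_value = permutation_p_value(controls, cases)
```

With random shuffling, the smallest attainable value is 1/(n_perm + 1), so the number of permutations bounds the resolution of the estimated p value.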
Recalling the classic studies of getting an integrated statistics s _{ f }, we observe that \({\tilde s}_{f}=g(\boldsymbol {s},\Psi)\) actually defines a closed shell or boundary that divides the space of multivariate statistics s (shortly S-space) into two parts, with its inside as the acceptance domain and its outside as the rejection domain \(\Gamma (\boldsymbol {\tilde s})\). For example, the acceptance domain obtained by both the Hotelling statistics and Rao’s score by Equation (8) is a hyperelliptic volume. We may further extend a hyperelliptic volume to a bounded volume in another shape. Actually, a bounded acceptance domain corresponds to a probabilistic modelling by a single-mode distribution. Thus, the corresponding tests are called S-space model-based tests.
S-space boundary-based test (BBT)
| Step | Description |
| --- | --- |
| (1) | Infer \(\tilde {\mathbf {s}}=I_{\textit {nf}}(X_{0},X_{1})\) in the multidimensional space of statistics s, where \(\tilde {\mathbf {s}}_{{X}_{10}}=I_{\textit {nf}}(X_{0},X_{1})\) means that \(\tilde {\mathbf {s}}\) is inferred from the given sample set X _{0},X _{1} by an inferring method I _{ nf }, and the subscript X _{10} is used as the abbreviation of X _{1},X _{0}, which will be omitted whenever the omission causes no confusion. |
| (2) | Use \(\tilde {\mathbf {s}}\) to design an unbounded boundary that divides the space of statistics s into two separated and unbounded half-spaces. |
| (3) | Take the half-space that does not contain the origin 0 as the rejection domain \(\Gamma (\tilde {\mathbf {s}})\), with the corresponding boundary side named as the R-side. The other one is the acceptance domain. |
| (4) | Tend to reject H _{0} as s deviates from the R-side of the boundary with a nonzero distance. The larger the distance is, the more seriously H _{0} breaks. |
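The four steps can be illustrated with a minimal sketch in which the boundary is assumed to be a hyperplane passing through the inferred \(\tilde{\mathbf{s}}\) with a normal direction w pointing away from the origin. This particular boundary, and the names `signed_distance` and `in_rejection_domain`, are assumptions for illustration and are not claimed to coincide with the choices in Equation (70).

```python
import math


def signed_distance(s, s_tilde, w):
    """Signed distance of s from the hyperplane through s_tilde with normal w.

    Positive values lie on the R-side, i.e. on the side away from the origin
    when w points away from the origin.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("normal vector w must be nonzero")
    return sum(wi * (si - ti) for wi, si, ti in zip(w, s, s_tilde)) / norm


def in_rejection_domain(s, s_tilde, w):
    """True if s falls in the half-space that does not contain the origin."""
    return signed_distance(s, s_tilde, w) >= 0.0


s_tilde = [1.0, 1.0]   # hypothetical inferred statistics
w = s_tilde            # normal chosen along s_tilde, away from the origin
origin_rejected = in_rejection_domain([0.0, 0.0], s_tilde, w)
far_point_rejected = in_rejection_domain([2.0, 2.0], s_tilde, w)
```

The signed distance plays the role described in step (4): the further s lies beyond the R-side, the stronger the evidence against H _{0}.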
Also, integration can be made by considering the complementarity of S-space BBTs and S-space model-based tests, via combining \(\Gamma (\boldsymbol {\tilde s})\) and the acceptance domains, obtained from not only the above complementary aspects, but also different sources, e.g. a bottom-up source from univariate tests on input data and a top-down source inversely transformed from the p values via a meta-analysis (Evangelou and Ioannidis 2013). Also, based on the resulting \(\Gamma (\boldsymbol {\tilde s})\), an easy computing expression \(s_{f}=g(\boldsymbol {s},\Gamma (\boldsymbol {\tilde s}))\) may be obtained to get an asymptotic distribution \(p(s_{f} \mid \Gamma (\boldsymbol {\tilde s}))\) for a fast estimation of the p value, see examples given after Equation (70).
S-space BBT for the multivariate zero mean
by the Hotelling T ^{2} statistics. The second example is the Wald testing statistics by Equation (7), and another example will be given in the next subsection.
In the existing studies, such a test is typically made via either the \({\chi ^{2}_{k}}\) statistics or Hotelling’s T ^{2} statistics. Also, Rao’s score by Equation (8) is such a type of statistics. As addressed in the previous subsection, they are all featured by an integrated statistics s _{ f }≥0 that monotonically increases as s deviates away from the origin and belong to the S-space model-based tests. Also, all these tests may be regarded as extensions of one typical univariate two-tail test (e.g. by t ^{2} test), that is, a univariate statistics s deviates away from the origin s=0 via the value s.
The counterpart of a univariate two-tail test is a univariate one-tail test that examines how far s deviates from (−∞,0], i.e. testing the statement s≤0. When either s≤0 or s≥0 is rejected, we reject H _{0}:s=0. Even when the statement s≤0 is not rejected, there is still a chance that H _{0}:s=0 will be rejected.
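The relation between the two one-tail tests and the two-tail test can be made concrete with a small z-test sketch, assuming s is normally distributed with a known standard deviation under the null. The function name and the dictionary keys are illustrative.

```python
from statistics import NormalDist


def z_test_p_values(s, mu=0.0, sigma=1.0):
    """One-tail and two-tail p values of a z-test on a scalar statistic s.

    'upper' tests the statement s <= 0 (rejected when s is large),
    'lower' tests the statement s >= 0 (rejected when s is small),
    'two_tail' tests s = 0.
    """
    z = (s - mu) / sigma
    cdf = NormalDist().cdf(z)
    return {
        "upper": 1.0 - cdf,
        "lower": cdf,
        "two_tail": 2.0 * min(cdf, 1.0 - cdf),
    }


p = z_test_p_values(1.96)
```

At s=1.96 the statement s≤0 is rejected at the 0.025 level while s≥0 is clearly not, and the two-tail p value is twice the smaller of the two one-tail values.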
Typical studies of univariate one-tail tests include the one-tailed t-test and one-tailed z-test. However, it is not clear what their counterparts are in multivariate tests. As addressed above, Hotelling’s T ^{2} test can be regarded as a multivariate counterpart of a two-tailed test.
Given \({ \tilde s}\) and thus \(\Gamma ({ \tilde s})\), any s obtained from the case-control samples under H _{0} may cause a false alarm if s falls in \(\Gamma ({ \tilde s})\), which happens with a probability \(p(s \in \Gamma ({ \tilde s}) \mid H_{0})\), i.e. the p value by the inference \({ \tilde s}\). If it is small enough, the statement \(s \notin \Gamma ({ \tilde s}) \) will be rejected, which implies that s=0 or H _{0} by Equation (1) is rejected.
where each \(\Gamma ({ \tilde s}^{(i)})\) is given by Equation (67) for computing \(p\left (\textbf {s}^{(i)}\in \Gamma \left ({ \tilde s}^{(i)}\right) \mid H_{0}\right)\). This actually provides an example that extends a one-tail univariate hypothesis test to a vector-variate one.
and U is a d×m matrix with its columns consisting of the eigenvectors of Σ _{ π } such that Λ _{ u }=U ^{ T } Σ _{ π } U.
Another issue is that only those major components in Equation (68) are useful while some components are not only useless but also disturbing, especially when we consider a limited size of samples. To handle this, one may consider that the columns of the matrix U consist of the eigenvectors of Σ _{ π } corresponding to the m largest diagonal elements of Λ _{ u }. Such an implementation of Equation (69) is typically called principal component analysis (PCA). How to decide an appropriate number of components is a model selection task (Tu and Xu 2011, 2012; Xu 2011). Moreover, one novel direction for this task will be addressed later in this paper between Equation (91) and Equation (99). Actually, Equation (69) only applies to remove the second-order dependence. One may further consider non-Gaussian factor analysis (NFA) and binary factor analysis (BFA) to remove dependencies among non-Gaussian components (Tu and Xu 2014; Xu 2003, 2009; see also Section 5 in Xu 2012b).
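The PCA step can be sketched for the simplest case m=1, assuming the covariance is estimated from a list of statistics vectors (e.g. one vector per permutation) and the leading eigenvector is found by power iteration. The helper names are illustrative, and power iteration is a swapped-in numerical routine chosen to keep the sketch free of external libraries; it may fail when the starting vector is orthogonal to the leading eigenvector.

```python
import math


def covariance(rows):
    """Covariance matrix (divisor n) of a list of equally long vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
             for j in range(d)] for i in range(d)]


def leading_eigenvector(matrix, n_iter=500):
    """Unit eigenvector of the largest eigenvalue of a symmetric
    positive semi-definite matrix, obtained by power iteration."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]  # non-uniform starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        u = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            break
        v = [x / norm for x in u]
    return v


def project(s, u):
    """Coordinate of the statistics vector s along the direction u."""
    return sum(si * ui for si, ui in zip(s, u))


# Hypothetical statistics vectors, one per permutation.
perm_stats = [[2.0, 1.9], [-1.0, -1.1], [0.5, 0.4], [-1.5, -1.2]]
u1 = leading_eigenvector(covariance(perm_stats))
s_major = project([1.2, 1.0], u1)
```

Keeping further components would require deflation or a full eigendecomposition, and the number of components kept is exactly the model selection question raised above.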
Choice (a), illustrated in Figure 5A, is the same as the one in Equation (68) with each \(\Gamma ({ \tilde s}^{(i)})\) given by Equation (67). As illustrated in Figure 5B, each of the two other choices is a half space bounded by a plane and on the side away from the origin. Choice (b) is more suitable to the case after using Equation (69) in choice (b). Except for the degenerate cases in which the normal direction of the hyperplane becomes parallel to one of the coordinate axes, choice (b) and choice (c) will approximately describe a certain dependence across the components of s.
After using Equation (69) to make the statistics s become an m-dimensional vector with the second-order dependence removed, we may observe that the scope of \(\Gamma (\boldsymbol {\tilde s})\) becomes narrowed as m reduces. When m=1, the scope of \(\Gamma (\boldsymbol {\tilde s})\) is narrowed to a one-tail test along the axis of only one component.
Approximately, s _{ w } comes from a normal distribution with the mean μ _{ w } and the variance s _{ w }, based on which we can make a one-tail univariate test.
SPD test and SPD discriminative analysis
where \(c_{\omega }, \sigma ^{2}_{\omega }, \alpha _{\omega }\) are the sample mean, variance, and proportion of the samples in X _{ ω }, respectively, and r _{ xy } is the mutual correlation between x and y.
with D(X _{10})>0 indicating that there is a collective inclining dominance (i.e. the representations of cases are bigger than the ones of controls), D(X _{10})<0 indicating a reversed dominance, and D(X _{10})=0 indicating no dominance.
Recalling Equation (66), it follows from \({\tilde s}=D(X_{10})= c_{1}-c_{0}\) that D(X _{10}) approximately follows a normal distribution. Thus, the above collective inclining dominance can be tested by the one-tailed t-test and one-tailed z-test addressed in the previous subsections. We may get the mean \( \mu \left (X_{10}^{\pi }\right)\) and the variance \(\sigma ^{2}\left (X_{10}^{\pi }\right)\) from \(\left \{ D\left (X_{10}^{\pi }\right), \pi \in \Pi \right \}\) and then approximately compute the p value by a univariate one-tail z-test.
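The univariate SPD test described here can be sketched as follows, assuming that D(X _{10}) reduces to the difference of sample means c _{1}−c _{0} as stated above, and that Π is generated by random shuffling. The function names and the default number of permutations are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, pstdev


def spd(x0, x1):
    """D(X_10), taken here as the difference of sample means c_1 - c_0."""
    return mean(x1) - mean(x0)


def spd_one_tail_z_test(x0, x1, n_perm=2000, seed=0):
    """One-tail z-test of D(X_10) against its permutation mean and variance."""
    rng = random.Random(seed)
    observed = spd(x0, x1)
    pooled, n0 = list(x0) + list(x1), len(x0)
    perm_values = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_values.append(spd(pooled[:n0], pooled[n0:]))
    mu, sigma = mean(perm_values), pstdev(perm_values)
    if sigma == 0.0:
        return observed, 1.0  # no variation at all, nothing to reject
    z = (observed - mu) / sigma
    return observed, 1.0 - NormalDist().cdf(z)


controls = [0.1 * i for i in range(8)]
cases = [3.0 + 0.1 * i for i in range(8)]
d_value, p_value = spd_one_tail_z_test(controls, cases)
```

Compared with counting permutations directly, the normal approximation can report p values far below 1/#Π, at the price of relying on the approximate normality of D(X _{10}).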
with each D ^{(i)}(X _{10}) by Equation (76). The task is detecting whether there is a collective inclining dominance, i.e. whether s deviates far away from the origin such that H _{0} by Equation (1) breaks. The task can be handled by the S-space BBT in Table 6 as a multivariate extension of a one-tail univariate hypothesis test, following the method introduced from Equations (68) to (71) given previously.
Without losing generality, we consider that the components of s are mutually independent, e.g. obtaining a second-order independence by Equation (69). Then, we seek how to choose an appropriate w.
which may tend to ∞ if it is unbounded. To avoid it, some bound will be imposed on w.
by which the solution of w=[w ^{(1)},…,w ^{(d)}]^{ T } is reached at one vertex, i.e. w ^{(i)} takes either a ^{(i)} or b ^{(i)}. Particularly, when Ω consists of only one pair X _{1},X _{0}, the above maximisation leads to choice (b) in Equation (70) if we let −a ^{(i)}=b ^{(i)}=1 and to choice (c) if we let −a ^{(i)}=b ^{(i)}=D ^{(i)}(X _{10}).
with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma ^{\phi } =\sum _{{\omega } \in \Omega } \mathbf {s}^{\phi }\mathbf {s}^{\phi \ T} \).
Integrating Equations (80) and (81), we consider to maximise ρ _{ γ }(w) with \(\sigma _{\pi }^{\gamma }(\mathbf {w})\) minimised simultaneously or subject to a constraint \( \sigma _{\pi }^{\gamma }(\mathbf {w})\le \text {constant}\).
with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma _{\pi }^{0.5}\Sigma ^{\phi } \Sigma _{\pi }^{0.5}\).
Given v as fixed, the study from Equations (79) and (84) applies directly for us to get w.
Given w as fixed, \( \mathbf {w}^{T}D_{M}\left (X_{10}^{\pi }\right)={D_{c}^{T}}\left (X_{10}\right)\) becomes a two-dimensional row vector, and it follows from Equation (89) that we have \(\boldsymbol {s}_{\mathbf {w}} =\mathbf {v}^{T}{D_{c}^{T}}\left (X_{10}\right)\) in the same form as Equation (79). With v in the place of w and D _{ c }(X _{10}) in the place of s, similarly, the study from Equations (79) and (84) applies directly for us to get v. Generally, we iteratively update v with a fixed w and update w with a fixed v, for a number of cycles until convergence. Still, whether such an alternating iteration procedure can converge is an open issue that demands further investigation.
The p values and testing complexity control
as the p value. This concept is the same as the one used in the conventional literature where X _{10} and I _{ nf } are usually implied but not spelled out.
Being different from those studies considering a univariate statistics, the p value by a multidimensional statistics vector s highly depends on the dimension m of this vector or the complexity of the testing space. Given a limited sample size, the p value by Equation (90) will reduce as the value of m increases, causing a phenomenon similar to the overfitting problem in the studies of machine learning and statistical modelling. In other words, we encounter a ‘dimension curse’ in hypothesis testing too. Therefore, we need to appropriately control the complexity of testing space, i.e. selecting one appropriate m.
Given a criterion J(m), the problem of selecting a best subset is a typical problem of feature selection. Generally, it involves an exhaustive evaluation of all the combinations of m features (i.e. m components of s) and all the possible values of m, which is an NP-hard problem. Usually, the branch and bound policy (Narendra and Fukunaga 1977; Somol et al. 2004) and the best first strategy are used to save computing cost (Xu et al. 1988). In this paper, we only consider one simple selection strategy that evaluates the components of s incrementally one by one.
To facilitate it, we perform Equation (69) to make the components of s become decorrelated and start by picking the one component that corresponds to the smallest value of a given criterion J(m). Then, we successively add in the one component that makes J(m) drop the most, and so on until no further reduction is caused. Finally, the selected components form the resulting feature set with a size m ^{∗}.
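The incremental strategy amounts to a greedy forward selection, sketched below for a generic criterion to be minimised. The function name `incremental_selection` and the toy criterion (a product of hypothetical per-component p values multiplied by a complexity penalty) are assumptions for illustration and do not reproduce Equation (93).

```python
def incremental_selection(n_components, criterion):
    """Greedy forward selection of components that minimises criterion(subset).

    At each round the component giving the smallest criterion value is added;
    the procedure stops as soon as no candidate reduces the criterion.
    """
    selected, remaining = [], list(range(n_components))
    best = float("inf")
    while remaining:
        score, pick = min((criterion(selected + [i]), i) for i in remaining)
        if score >= best:
            break
        selected.append(pick)
        remaining.remove(pick)
        best = score
    return selected, best


# Hypothetical per-component p values and a toy complexity penalty 4**m.
p_i = [0.01, 0.5, 0.03, 0.9]


def toy_criterion(subset):
    value = 4.0 ** len(subset)
    for i in subset:
        value *= p_i[i]
    return value


chosen, j_star = incremental_selection(len(p_i), toy_criterion)
```

In this toy run the two components with small p values are kept and the two uninformative ones are left out, giving a selected size of 2; a greedy search of this kind is cheap but is not guaranteed to find the best subset.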
which is obtained on all the possible sets of \(X^{\pi }_{10}\) that come under H _{0} instead of merely on a given pair X _{10}.
Though this probability is useless to judge whether X _{10} contains enough information to reject H _{0}, it reflects how the complexity of testing space affects a background portion of the false alarm probability. Actually, it reflects an inverse of the effective volume of the support on which the statistics s is located. As m increases, the volume increases exponentially, and thus, p(s∈Γ|I _{ nf },H _{0}) will reduce exponentially. Such an exponentially decreasing tendency is also contained in p(s∈Γ|I _{ nf },X _{10},H _{0}) for the same reason, which affects the accuracy of the estimated p value.
where and hereafter ¬H _{0} denotes rejecting H _{0}. The denominator aims at cancelling out the disturbing portion in the numerator, such that \({pp}_{{X}_{10}}\) provides not only a better estimation of false alarm probability of rejecting H _{0} but also a better criterion J(m) for selecting a best subset of the components of s and thus inferring one appropriate m ^{∗}.
We observe that the pp value has two factors. One is \(pp^{o}_{{X}_{10}}\) that describes the proportion of the pairs of \(X_{1}^{\pi }, X_{0}^{\pi }\) with the corresponding \(p_{X^{\pi }_{10}}\le p_{X_{10}}\), that is, on each of these pairs we should also reject H _{0} if we reject H _{0} on X _{10}. In other words, \(pp^{o}_{{X}_{10}}\) reflects the information of relative difference contained in P _{ Π }. The other factor \({rp}_{{X}_{10}}\) is the ratio of the average false alarm probability per pair over the disturbing background per pair, reflecting the strength of discriminative information contained in P _{ Π }.
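The first factor can be sketched directly from its verbal description, assuming that the set P _{ Π } of p values has already been computed for the permuted pairs and that the original pair is counted as the identity permutation. The function name `pp_o` and the example values are illustrative assumptions; the second factor is not sketched because its exact expression depends on Equation (93).

```python
def pp_o(p_observed, p_permuted):
    """Proportion of pairs in Pi whose p value is at most the observed one.

    p_permuted holds the p values of the permuted pairs; the original pair
    is added as the identity permutation, so the result is never zero.
    """
    values = [p_observed] + list(p_permuted)
    return sum(1 for p in values if p <= p_observed) / len(values)


# Hypothetical p values on three permuted pairs.
ratio = pp_o(0.01, [0.5, 0.2, 0.005])
```

Only the ordering between each permuted p value and the observed one enters this factor, which is why it is unaffected by any rescaling that preserves those inequalities.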
In implementation, we may use \({rp}_{{X}_{10}}\) to make an initial screening. When \({rp}_{{X}_{10}}>1\), the inference is meaningless and no further computing should be made. Generally, \({rp}_{{X}_{10}}\) will be much smaller than 1, and thus, \({pp}_{{X}_{10}}\) will be much smaller, while \(pp^{o}_{{X}_{10}}\) provides a worst case upper bound of \({pp}_{{X}_{10}}\).
We should observe \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\) at not only one same value of m but also an appropriate m ^{∗}. In addition to using \({pp}_{{X}_{10}}\) by Equation (93) as J(m) for making an incremental selection, we may also consider \(pp^{o}_{{X}_{10}}\) or \({rp}_{{X}_{10}}\) as J(m), resulting in \( m^{\ast }_{o} \) or \(m^{\ast }_{\textit {rp}}\). Also, it follows from some mathematical derivation that we have \( m^{\ast } \ge m^{\ast }_{\textit {rp}} \ge m^{\ast }_{o}\) with \(m^{\ast }_{o}\) being a most conservative lower bound. We will be more confident when all these values are identical or do not differ too much. Moreover, further insights can be obtained from the following considerations.
On one side, we desire that the exponentially decreasing tendency contained in p(s∈Γ|I _{ nf },X _{10},H _{0}) is removed via the normalisation by p(s∈Γ|I _{ nf },H _{0}) such that \({pp}_{{X}_{10}}\) in Equation (93) will no longer have such a decreasing tendency. With \(p_{X^{\pi }_{10}}=p(\textbf {s}\in \Gamma \mid I_{\textit {nf}}, X_{10}^{\pi }, H_{0})\) in Equation (92) replaced by \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\), we may turn P _{ Π } into its counterparts P _{ pp }, \(P_{pp^{o}}\), and P _{ rp }. We compute not only the varying curve for each of \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\) as m increases, but also the varying curve of the mean of the elements in each of P _{ pp }, \(P_{pp^{o}}\), and P _{ rp } as m increases. Then, we compare each curve with its corresponding mean curve and desire that the mean curve is as flat as possible or at least flat around m ^{∗}.
On the other side, desiring a flat mean curve is not the sole principle. We also desire that the discriminative information should be kept in each of \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\) as much as possible. Observing the factorisation \({pp}_{{X}_{10}}= pp^{o}_{{X}_{10}} {rp}_{{X}_{10}}\) in Equation (93), the strength of discriminative information is contained in \({rp}_{{X}_{10}}\) with an exponentially decreasing tendency that is supposed to be mutually cancelled out by the denominator and the numerator but perhaps not completely, while the discriminative information of relative difference is contained in \(pp^{o}_{{X}_{10}}\) and kept unchanged as long as every inequality between \(p_{X^{\pi }_{10}}\) and \(p_{X_{10}}\) remains unchanged.
Bi-test, twin p values, and P-space BBT
We examine a decision that both H _{0} and I _{0} are rejected, featured with two p values.
As addressed after Equation (91), the multivariate statistics s inferred by I _{ nf } suffers a systematic bias that will make I _{ nf } unreliable. This unreliability varies with the dimension m that takes an important role in I _{ nf }. Though corrected by the denominator in Equation (93), there are still some residuals that will not be completely cancelled out, the effect of which still varies with m and reduces the reliability of I _{ nf }. The test I _{0} is formulated for this reliability via controlling an appropriate m ^{∗} and a level of false alarm probability of rejecting I _{0}.
One should notice the difference between testing H _{0} and testing I _{0}. Testing H _{0} examines only the input, while testing I _{0} examines both the input and the performance of testing H _{0}. The inference I _{ nf } gets X _{10} as the input and the outcomes \(p_{{X}_{10}}, {pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\). Using \(o_{{X}_{10}}\) to denote any one of these indices, regarding I _{ nf } as reliable on X _{10} actually implies that it should also be regarded as reliable on any pair \(X_{1}^{\pi }, X_{0}^{\pi }\) with the corresponding \(o_{{X}^{\pi }_{10}}\) being smaller than \(o_{{X}_{10}}\). Thus, the false alarm probability of rejecting I _{0} is computed by \(p\left (o_{{X}^{\pi }_{10}}\le o_{{X}_{10}} \mid \neg H_{0}, H_{0}\right)\).
where and hereafter ¬I _{0} denotes rejecting I _{0}. Reflecting the discriminative information of relative difference, this p value of rejecting I _{0} will not be affected as long as the exponentially decreasing tendency does not change any inequality between \(p_{X^{\pi }_{10}}\) and \(p_{X_{10}}\).
Multivariate bi-test and implementations
| Type | Description |
| --- | --- |
| **Test bi-hypotheses and twin p values** | |
| Test H _{0} | Whether the case-control populations are different, by an inference I _{ nf } in the space of multivariate statistics s based on samples from the two populations. H _{0} is rejected if \({pp}_{{X}_{10}} \le \alpha \), where the false alarm probability \({pp}_{{X}_{10}}=pp^{o}_{{X}_{10}} {rp}_{{X}_{10}}\) is given by Equation (93) and α is a prespecified level. |
| Test I _{0} | Whether the dimension m of s is appropriate such that I _{ nf } is reliable, with the p value given by \( p(\neg I_{0} \mid \neg H_{0}, H_{0}) = p({pp}_{X^{\pi }_{10}}\le \alpha \mid {pp}_{{X}_{10}}< \alpha, H_0), \) which is not smaller than \(pp^{o}_{{X}_{10}}\) that reflects the relative discriminative information among \({pp}_{{X}_{10}}\) while ignoring \({rp}_{{X}_{10}}\) that reflects the strength of discriminative information. |
| **Bi-test implementations** | |
| Stochastic way | (a) Make the components of s decorrelated by Equation (69). (b) Get \(p({\mathbf {s}}\in \Gamma \mid I_{\textit {nf}}, X_{10}^{\pi }, H_0)=p({\mathbf {s}}\in \Gamma (\tilde {\bf {s}}) \mid H_0)\) by Equation (68) with \(\Gamma (\tilde {\bf s})\) taking one of three choices in Equation (70), and then get P _{ Π } by Equation (92). (c) Get \({pp}_{{X}_{10}}, pp^{o}_{{X}_{10}}, {rp}_{{X}_{10}}\) by Equation (93) and then get \(p(\neg I_{0} \mid \neg H_{0}, H_{0})\) as above. (d) Use \(pp^{o}_{{X}_{10}}\) or \(p(\neg I_{0} \mid \neg H_{0}, H_{0})\) as J(m) to infer an appropriate \(m^{*}_{o}\) and select the \(m^{*}_{o}\) best components of s. |
| Nonstochastic way | (a) Make the components of s decorrelated by Equation (69). (b) Get {p _{ i }} with each p value p _{ i } obtained by a univariate test. (c) Get \(pp^{o}_{{X}_{10}}\) by Equation (99) and \({rp}_{{X}_{10}}\) by Equation (97) with \(p_{X_{10}}= \prod _{i} p_{i}\), as well as \(p(\neg I_{0} \mid \neg H_{0}, H_{0})\) as above. (d) The same as step (d) of the stochastic way. |
where the extra components of s will contribute a constant factor \(\prod _{i > m^{\ast }}\delta _{i}\) that will be cancelled out via the denominator and the numerator in Equation (93).
Together with Equation (97), we get \({pp}_{{X}_{10}}=pp^{o}_{{X}_{10}} {rp}_{{X}_{10}}\) for testing both H _{0} and I _{0} without stochastic simulation via permutation.
From another perspective, we observe that the traditional p value p _{ F } of the Fisher combination is actually the false alarm probability by Equation (95), only reflecting the discriminative information of relative difference between \(\prod _{i} p_{i}^{\pi }\) and \( \prod _{i} p_{i} \) but ignoring the strength of discriminative information contained in \( \prod _{i} p_{i}\). In other words, the Fisher combination tells only half of the story for combining {p _{ i }}, and we can use the formulation \({pp}_{{X}_{10}}=pp^{o}_{{X}_{10}} {rp}_{{X}_{10}}\) to complete the whole story, using \(pp^{o}_{{X}_{10}}\) by Equation (99) and \({rp}_{{X}_{10}}\) by Equation (97) with \(p_{X_{10}}= \prod _{i} p_{i}\).
Last but not least, one should notice that the p value of testing H _{0} measures the chances in the S-space (i.e. the space of multivariate statistics), and the p value of testing I _{0} measures an event in the P-space (i.e. the space of false alarm probabilities). In other words, testing H _{0} involves an S-space BBT while testing I _{0} involves a P-space BBT.
Discussions
Gene expression analyses
Gene expression analyses take important roles in bioinformatics and computational genetics. Expression profiles are featured by data matrix with its row indicating expressions of different samples t=1,⋯,N while its column consisting of expressions i=1,⋯,m from different genes, miRNAs, and lncRNAs.
In recent years, developments of data acquisition techniques lead us to consider expression profiles in a cubic or even a highdimension array. As illustrated in Figure 1, one additional dimension j=1,⋯,d is added for examining expressions under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (BarJoseph et al. 2012). For examples, current cancer studies consider each basic unit (i.e. a gene, a miRNA, a lncRNA) in paired expressions of normal and tumour tissue from the same individual, that is, each individual is featured at least by a 2×d matrix X _{ t }. Generally, each example X _{ t } is a m×d matrix. In Table 7, we suggest a list of topics for such matrixvariatebased applications.
Typically, the number m of rows (i.e. genes, miRNAs, and lncRNAs) is huge, while the sample size N is small. It is difficult and also unreliable to consider the entire m×d matrix as a sample \(X_{t}\). Instead, we pick a k-tuple out of the m rows to form a d×k matrix as a sample \(X_{t}\). Without loss of generality, we focus on the case in which each sample \(X_{t}\) is a 2×k matrix formed from paired expressions of normal tissue and tumour tissue.
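To make this construction concrete, the following minimal sketch forms the 2×k samples from paired normal and tumour profiles. The function name `make_paired_samples`, the list-of-lists layout, and the toy expression values are all assumptions introduced here for illustration; they are not taken from any data set used in this paper.

```python
from itertools import combinations

def make_paired_samples(normal, tumour, gene_idx):
    """Build one 2 x k matrix per individual from a chosen k-tuple of genes.

    normal, tumour: lists of N expression vectors of length m, where entry t
    of each list belongs to the same individual (hypothetical layout).
    gene_idx: the k row indices picked out of the m genes.
    """
    samples = []
    for x_norm, x_tum in zip(normal, tumour):
        samples.append([[x_norm[i] for i in gene_idx],   # row 1: normal tissue
                        [x_tum[i] for i in gene_idx]])   # row 2: tumour tissue
    return samples

# Toy data with N = 3 individuals and m = 4 genes (made-up numbers).
normal = [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5], [0.5, 1.0, 1.5, 2.0]]
tumour = [[2.0, 2.0, 5.0, 4.0], [2.5, 2.4, 5.5, 4.6], [1.5, 1.1, 3.5, 2.1]]

samples = make_paired_samples(normal, tumour, gene_idx=(0, 2))

# All candidate k-tuples for k = 2; their number grows quickly with m and k.
candidate_tuples = list(combinations(range(4), 2))
```

Each element of `samples` is then one matrix-variate sample \(X_{t}\); which k-tuples are worth examining is a separate selection problem that this sketch does not address.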
Table 7 Several IHT applications
| IHT types | Applications |
| --- | --- |
| Model based and Mix-modelled | (a) Starting from the case in which \(X_{t}\) degenerates into a 1×2 matrix, we conduct the Hotelling test by Equation (2) and its extension \(KL_{sum}\) in Equation (31), in comparison with both the univariate t-test and the paired t-test. (b) For the general case with k≥2, we conduct a matrix-variate test by Equation (28), as well as by the matrix-variate counterparts of \(KL_{1,0}\), \(KL_{sum}\), and \(KL_{sum*}\), in comparison with not only Hotelling's T-square test on the k-dimensional vector \(x_{t}\) obtained from Equation (100) but also the paired Hotelling's T-square test on the 2×k matrix-variate samples \(X_{t}\). (c) Considering each sample \(X_{t}\) as a 2×k matrix, we investigate the bilinear discriminant analysis by Equations (18), (33), and (34), in comparison with the classic FDA by Equation (11) on the k-dimensional vector \(x_{t}\) obtained from Equation (100). (d) We investigate the generalised bilinear discriminant analysis by Equations (40), (41), and (34). For simplicity, we get \(v_{i}, i=1,\cdots,d\) by Equation (43) and then solve for w by Equation (34). When k becomes too big, we further regularise the learning of \(v_{i}\) by minimising \( J_y=\frac {\alpha _{0} \sigma _{0}^{y\ 2} +\alpha _{1}\sigma _{1}^{y\ 2}} {(c^{y}_{0} - c^{y}_{1})^{2}}+ \sum _{i=1}^{m} \gamma _{i} \sum _{j=1}^{d} \lvert u_{i}^{(j)} \rvert ^{q}, \) with q=2 for Tikhonov regularisation and q=1 for sparse learning. |
| Boundary based and Mix-modelled | (a) Considering a logistic regression by Equation (3) with w obtained in one of the ways given in Table 4, we test Equation (5) by Rao's score in Equation (8), and get \(\varepsilon_{C}\) by Equation (44) and \(\varepsilon_{B}\) by the p-value with one of the choices in Table 2. (b) Extend all the above studies on Equation (3) with \(y_{t}=w^{T}x_{t}\) replaced by the bilinear form in Equation (18). (c) Make a survival analysis via the Cox regression by Equation (13), in comparison with its bilinear extension by Equation (18) or (40). Again, IHT is made by \(\varepsilon_{D}\), \(\varepsilon_{C}\), and \(\varepsilon_{B}\) in a way similar to the above. |
| BYY harmony | (a) Use either Algorithm 9 in Ref. (Xu 2015) to get \(\alpha^{(i)}, c^{(i)}, \Sigma^{(i)}, i=0,1\) or Algorithm ?? to get \(\alpha^{(i)}, C^{(i)}, \Sigma^{(i)}, \Omega^{(i)}, i=0,1\) for model-based IHT. (b) Perform the procedure given in Table 5 for training, testing, and validating with a small sample size. |
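For the simplest entry of the table, in which each sample reduces to a two-dimensional vector, a Hotelling-type comparison of case and control groups can be sketched as follows. This is the textbook two-sample Hotelling \(T^{2}\) with a pooled covariance; Equation (2) is not reproduced in this section, so the normalisation used there may differ. The function names and the toy data are assumptions made for illustration, and the closed-form p-value holds only for two-dimensional samples.

```python
def mean_vec(xs):
    """Mean of a list of 2-dimensional points."""
    n = len(xs)
    return [sum(x[0] for x in xs) / n, sum(x[1] for x in xs) / n]

def scatter(xs, mu):
    """2 x 2 scatter matrix (sum of outer products of centred points)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d0, d1 = x[0] - mu[0], x[1] - mu[1]
        s[0][0] += d0 * d0
        s[0][1] += d0 * d1
        s[1][1] += d1 * d1
    s[1][0] = s[0][1]
    return s

def hotelling_t2_2d(control, case):
    """Two-sample Hotelling T^2 for 2-dimensional samples (textbook form)."""
    n0, n1 = len(control), len(case)
    if n0 + n1 < 4:
        raise ValueError("need at least four samples in total")
    m0, m1 = mean_vec(control), mean_vec(case)
    a, b = scatter(control, m0), scatter(case, m1)
    dof = n0 + n1 - 2
    # Pooled covariance estimate shared by the two groups
    s = [[(a[i][j] + b[i][j]) / dof for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det <= 0.0:
        raise ValueError("pooled covariance is singular")
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    # Quadratic form d^T S^{-1} d using the explicit 2 x 2 inverse
    quad = (s[1][1] * d0 * d0 - 2.0 * s[0][1] * d0 * d1 + s[0][0] * d1 * d1) / det
    t2 = n0 * n1 / (n0 + n1) * quad
    nu = n0 + n1 - 3                      # n0 + n1 - p - 1 with p = 2
    f_stat = nu / (2.0 * dof) * t2        # F(2, nu) under the null
    # Survival function of F(2, nu) has the closed form (1 + 2f/nu)^(-nu/2)
    p_value = (1.0 + 2.0 * f_stat / nu) ** (-nu / 2.0)
    return t2, f_stat, p_value

# Toy groups (made-up numbers): the case group is the control group shifted by (1, 1).
control = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
case = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
t2, f_stat, p_value = hotelling_t2_2d(control, case)
```

A univariate t-test on each coordinate could be run on the same toy groups for comparison, which is the kind of side-by-side study that item (a) of the table suggests.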
Exome sequencing analyses
The case-control study is also a major problem in genome-wide association studies (GWAS) and exome-sequencing analyses (DePristo et al. 2011; Purcell et al. 2007). Typically, a numeric score (i.e. 0, 1, 2) is assigned to a single-nucleotide polymorphism (SNP) allele per site and per individual. In such a representation, each sample is univariate when the sites are considered one by one. The univariate two-sample test plays a fundamental role in detecting a single SNP in GWAS; e.g. PLINK provides a widely used toolset (Purcell et al. 2007).
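As a hedged illustration of such a single-site two-sample test, the sketch below applies a textbook allelic chi-square test (one degree of freedom) to 0/1/2 genotype scores. It assumes that each score counts the copies of one allele carried by a diploid individual; the function name `allelic_chi2` and the toy genotypes are assumptions made here, and the sketch is not claimed to reproduce the behaviour of any particular toolset.

```python
import math

def allelic_chi2(case_genotypes, control_genotypes):
    """Allelic chi-square test at one site from 0/1/2 genotype scores."""
    for g in list(case_genotypes) + list(control_genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype scores must be 0, 1 or 2")
    # 2 x 2 allele-count table: rows = case/control, columns = scored/other allele
    a = sum(case_genotypes)
    b = 2 * len(case_genotypes) - a
    c = sum(control_genotypes)
    d = 2 * len(control_genotypes) - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a margin of the allele table is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Toy genotypes for four cases and four controls (made-up numbers).
chi2, p_value = allelic_chi2([2, 2, 1, 1], [0, 0, 1, 1])
```

When several sites are considered jointly, each individual contributes a vector of such scores rather than a single number.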
Moreover, each sample becomes a vector when multiple sites are considered jointly. Recently, there have been ever-increasing efforts on finding multiple SNVs jointly (DePristo et al. 2011; Derkach et al. 2013; Evangelou and Ioannidis 2013; Lin et al. 2014; Liu et al. 2014; Pan et al. 2014). Also, we may test whether there is a collective inclining dominance of the representations of case samples over those of control samples, or vice versa, with the help of the method proposed in Equations (79) and (84), as well as the extension introduced around Equations (87) and (89).
It follows from Equation (72) that \(D(X_{10})\) is also a 3×3 matrix that serves as a collective measure, which may be further examined to test whether the two populations differ significantly. We may visualise the matrix by plotting it in two 2D histograms and observing their configurations.
Conclusions
Statistical analyses for case-control studies have been addressed rather comprehensively. First, a Kullback-Leibler divergence-based formulation is suggested for developing test statistics and discriminative criteria for case-control studies. Based on this formulation, typical existing methods are revisited, and their matrix-variate counterparts are developed. Second, a bilinear matrix form is proposed to obtain the matrix-variate counterparts of existing multivariate statistical analyses, such as discriminative analysis, logistic regression, the Cox model, and the linear mixed model. Third, the necessity and feasibility of integrative hypothesis tests (IHT) are addressed from the complementarity of BMTs and BBTs in the D-space, together with an empirical illustration. Moreover, four basic components of IHT are elaborated, and four IHT types are summarised according to how the components are integrated. Then, in the space of multiple statistics (S-space for short), the S-space BBT is proposed to perform BBT based on an unbounded boundary, with the help of information-preserved decoupling. Moreover, an S-space BBT-based extension of the univariate one-tail z-test is developed to test the null of a multivariate zero mean and is then applied to a multivariate SPD test for detecting a collective inclining dominance in case-control studies. Also, an SPD discriminative analysis is proposed, with this multivariate SPD test improved and extended to matrix-variate ones. Furthermore, a multivariate bi-test is proposed to test not only the classic null but also a null about inference reliability due to the complexity of the testing space, including a new insight on and a further development of the Fisher combination. Finally, possible applications have been suggested for expression-profile-based biomarker finding and exome-sequencing-based joint SNV detection.
Declarations
Acknowledgements
This work was supported by a CUHK Direct grant project 4055025 and by the ZhiYuan chair professorship by Shanghai Jiao Tong University.
References
 Bar-Joseph, Z, Gitter A, Simon I (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet 13(8): 552–564.
 Barnett, JA (2008) Computational methods for a mathematical theory of evidence. In: Yager L, Liu L (eds) Classic Works of the Dempster-Shafer Theory of Belief Functions. Studies in Fuzziness and Soft Computing, 197–216. Springer, Berlin Heidelberg.
 Cortes, C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3): 273–297.
 Cox, DR, Oakes D (1984) Analysis of survival data. CRC Press, Chapman & Hall, Boca Raton, Florida.
 Cover, TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. Electronic Computers, IEEE Transactions on 14(3): 326–334.
 Demidenko, E (2013) Mixed models: theory and applications with R. Probability and Statistics. John Wiley & Sons, Hoboken, New Jersey.
 DePristo, MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5): 491–498.
 Derkach, A, Lawless JF, Sun L (2013) Robust and powerful tests for rare variants using Fisher’s method to combine evidence of association from two or more complementary tests. Genet Epidemiol 37(1): 110–121.
 Dutilleul, P (1999) The mle algorithm for the matrix normal distribution. J Stat Comput Simul 64(2): 105–123.
 Engle, RF (1984) Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Handb Econometrics 2: 775–826.
 Evangelou, E, Ioannidis JP (2013) Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet 14(6): 379–389.
 Fisher, RA (1948) Questions and answers #14. Am Stat 2(5): 30–31.
 Gibson, G (2012) Rare and common variants: twenty arguments. Nat Rev Genet 13(2): 135–145.
 Hosmer Jr, DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression. John Wiley & Sons, Hoboken, New Jersey.
 Hotelling, H (1931) The generalization of Student’s ratio. Ann Math Stat 2(3): 360–378.
 Ji, J, Shi J, Budhu A, Yu Z, Forgues M, Roessler S, Ambs S, Chen Y, Meltzer PS, Croce CM, Qin LX, Man K, Lo CM, Lee J, Ng IOL, Fan J, Tang ZY, Sun HC, Wang XW (2009) MicroRNA expression, survival, and response to interferon in liver cancer. New Engl J Med 361(15): 1437–1447.
 Koboldt, DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER (2013) The next-generation sequencing revolution and its impact on genomics. Cell 155(1): 27–38.
 Lin, WY, Lou XY, Gao G, Liu N (2014) Rare variant association testing by adaptive combination of p-values. PLoS One 9(1): 85728.
 Liu, DJ, Peloso GM, Zhan X, Holmen OL, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem K, Melander O, Kathiresan S, Abecasis GR (2014) Meta-analysis of gene-level tests for rare variant association. Nat Genet 46(2): 200–204.
 Narendra, PM, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. Comput IEEE Trans 100(9): 917–922.
 Pan, W, Kim J, Zhang Y, Shen X, Wei P (2014) A powerful and adaptive association test for rare variants. Genetics 197(4): 1081–1095.
 Persson, H, Kvist A, Rego N, Staaf J, Vallon-Christersson J, Luts L, Loman N, Jonsson G, Naya H, Hoglund M, Borg A, Rovira C (2011) Identification of new microRNAs in paired normal and tumor breast tissue suggests a dual role for the erbb2/her2 gene. Cancer Res 71(1): 78–86.
 Purcell, S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ, Sham PC (2007) Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3): 559–575.
 Schwarz, G (1978) Estimating the dimension of a model. Ann Stat 6(2): 461–464.
 Simon, RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y (2003) Design and analysis of DNA microarray investigations. Springer-Verlag, New York.
 Somol, P, Pudil P, Kittler J (2004) Fast branch & bound algorithms for optimal feature selection. Pattern Anal Mach Intell IEEE Trans 26(7): 900–912.
 Stone, M (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological) 36(2): 111–147.
 Suykens, JA, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3): 293–300.
 Suykens, JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific Publishing, Singapore.
 Tu, S, Xu L (2011) An investigation of several typical model selection criteria for detecting the number of signals. Front Electrical Electronic Eng China 6(2): 245–255.
 Tu, S, Xu L (2012) A theoretical investigation of several model selection criteria for dimensionality reduction. Pattern Recognit Lett 33(9): 1117–1126.
 Tu, S, Xu L (2014) Learning binary factor analysis with automatic model selection. Neurocomputing 134: 149–158.
 Williams, CKI (2003) Learning kernel classifiers. J Am Stat Assoc 98(462): 489–490.
 Xu, L, Yan P, Chang T (1988) Best first strategy for feature selection. In: 9th International Conference on Pattern Recognition, 706–708. IEEE Computer Society Press, Piscataway, New Jersey.
 Xu, L (1995) Bayesian-Kullback coupled ying-yang machines: unified learnings and new results on vector quantization. In: Proc. Int. Conf. Neural Information Process (ICONIP ’95), 977–988. Publishing House of Electronics Industry, Beijing.
 Xu, L (2003) Independent component analysis and extensions with noise and time: a Bayesian ying-yang learning perspective. Neural Inform Process Lett Rev 1: 1–52.
 Xu, L (2009) Independent Subspaces. In: Encyclopedia of Artificial Intelligence, 892–901. IGI Global, Hershey, Pennsylvania.
 Xu, L (2010) Bayesian ying-yang system, best harmony learning, and five action circling. Front Electrical Electronic Eng China 5(3): 281–328.
 Xu, L (2011) Co-dimensional matrix pairing perspective of BYY harmony learning: hierarchy of bilinear systems, joint decomposition of data-covariance, and applications of network biology. Front Electr Electron Eng China 6: 86–119. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (A).
 Xu, L (2012a) Semi-blind bilinear matrix system, BYY harmony learning, and gene analysis applications. In: Proceedings of The 6th International Conference on New Trends in Information Science, Service Science and Data Mining: 23-25 October 2012, 661–666. IEEE, Taipei.
 Xu, L (2012b) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. Front Electrical Electronic Eng 7(1): 147–196.
 Xu, L (2013a) Integrative hypothesis test and A5 formulation: sample pairing delta, case control study, and boundary based statistics. In: Intelligence Science and Big Data Engineering. LNCS, 887–902. Springer, Berlin Heidelberg.
 Xu, L (2013b) Matrix-variate discriminative analysis, integrative hypothesis testing, and geno-pheno A5 analyzer. In: Intelligent Science and Intelligent Data Engineering. LNCS, 866–875. Springer, Berlin Heidelberg.
 Xu, L (2015) Further advances on Bayesian ying yang harmony learning. Applied Informatics 2(5).
 Xu, L, Amari SI (2008) Combining classifiers and learning mixture-of-experts. In: J Ramon e.a. (ed) Encyclopedia of Artificial Intelligence, 318–326. IGI Global, Hershey, PA.
 Xu, L, Krzyzak A, Suen CY (1992b) Several methods for combining multiple classifiers and their applications in handwritten character recognition. IEEE Trans Syst Man Cybernet 22: 418–435.
 Zaykin, DV (2011) Optimally weighted z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 24(8): 1836–1841.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.