Bilinear matrix-variate analyses, integrative hypothesis tests, and case-control studies
Applied Informatics volume 2, Article number: 4 (2015)
Abstract
We pursue a threefold purpose in this paper. First, we suggest a Kullback-Leibler formulation for developing statistics and making discriminative projections for case-control studies, based on which existing typical methods are revisited and then extended to matrix-variate counterparts. Second, we propose a bilinear matrix form, based on which multivariate discriminative analysis and logistic, Cox, and linear mixed regression are extended into their matrix-variate counterparts. Third, we systematically address the necessity, feasibility, and methodology of integrative hypothesis tests (IHT) from the complementarity of the model-based test and the boundary-based test (BBT) in the data (D)-space, statistics (S)-space, and probability (P)-space. We elaborate four IHT components (modelling, comparison, classification, and assurance) and summarise four IHT types in the D-space. Then, we extend the existing efforts on multivariate tests to BBTs in the S-space. In particular, we extend the classic univariate one-tail z-test to a multivariate one, which is then applied to a multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance. We also propose an SPD discriminative analysis that extends this SPD test. Moreover, we propose a multivariate bi-test that tests the classic null and also a null about the inference reliability due to test space complexity, including a further development of the Fisher combination. Finally, we suggest possible applications for gene expression biomarkers and exome-sequencing-based joint single-nucleotide variant (SNV) detection.
Background
Typically, multivariate statistical analysis and related machine-learning studies consider a vector x _{ t } as the basic sampling unit. Though an entire data set may be regarded as a matrix with x _{1},⋯,x _{ N } as its columns, each statistic is computed from an assembly of vector samples and is featured by the vector inner product as a basic modelling unit.
Nowadays, not only do rapid developments in data acquisition techniques (DePristo et al. 2011; Koboldt et al. 2013) demand that data with a matrix X _{ t }, as shown in Figure 1, as the basic sampling unit be considered, but ever-increasing computing ability also makes it feasible to meet this demand. One typical field with such demands is image-based tasks, in which a basic sampling unit is naturally a matrix, though traditional studies consider sample vectors to simplify computation. However, this simplification misses some useful structural information, e.g. treating the rows of X _{ t } as independent and identically distributed (i.i.d.) samples misses the dependence across rows. Also, recent efforts on big-data analyses call for statistical approaches to matrix-variate data analysis.
Another field that demands matrix-variate analyses is computational biology, particularly computational genomics. Typically, expression profiles of basic units (e.g. gene, miRNA, lncRNA) are analysed via vector samples (e.g. via rows or columns of an expression matrix) (Simon et al. 2003). Advanced studies also examine expression profiles under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (Bar-Joseph et al. 2012) and thus demand that sampling units in matrix format, or even as a high-dimensional array, be considered. In genome-wide association studies and exome-sequencing analyses (DePristo et al. 2011; Gibson 2012; Purcell et al. 2007), though the majority of methods are still featured by vector-variate analysis, some efforts have already been made on matrix-variate data analysis.
In the rest of this paper, we start by providing a background and review of the related topics and methods, including the following:

Two-sample test and Hotelling statistics.

Logistic regression, Wald test, and Rao’s score.

Discriminative analyses and integrative hypothesis tests (IHT).

Cox model and linear mixed model.
Then, we pursue a threefold purpose as follows: (1) A Kullback-Leibler-divergence-based formulation for developing statistics and discriminative criteria for case-control studies, based on which existing typical methods are revisited and extended to their matrix-variate counterparts. (2) A bilinear matrix form, based on which discriminative analysis, logistic regression, the Cox model, and the linear mixed model are extended into their matrix-variate counterparts. (3) A systematic investigation of the necessity, feasibility, and implementing methods of IHT from the perspective of the model-based test (MBT) versus the boundary-based test (BBT) in three levels of space, namely the data sample space (D-space), the statistics space (S-space), and the probability space (P-space).
More specifically, the third purpose consists of the following:

The complementarity of MBT versus BBT in the D-space, the basic IHT components (modelling, comparison, classification, and assurance), and four types of IHT.

A Bayesian Ying-Yang (BYY) harmony-learning-based IHT formulation for coordinately optimising the performances of task A, task B, and task C in the D-space.

The MBT vs BBT perspective in the S-space, especially extensions of the existing efforts on the integration of multiple statistics to the S-space BBT, with the help of dependence decoupling.

An S-space BBT-based extension of the univariate one-tail z-test for testing the null of a multivariate zero mean, which is then applied to a multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance.

An SPD discriminative analysis that not only improves the multivariate SPD test but also further extends it to matrix-variate ones.

A multivariate bi-test on both the classic null and a null about test reliability by controlling the testing complexity, including a further development of the Fisher combination.
Finally, we discuss several possible IHT applications for expression-profile-based biomarker finding and exome-sequencing-based joint single-nucleotide variant (SNV) detection.
Hypothesis tests for case-control studies
Most efforts in computational genomics, and in computational biology generally, involve case-control studies. For a case-control study, we are given two populations of vector-variate samples X _{ ω }={x _{ t,ω },t=1,⋯,N _{ ω }},ω=0,1, where the one with ω=1 is called the case population while the one with ω=0 is called the control population. The task of a hypothesis test is to examine whether to reject the following null assumption:
for which a statistic is computed from the samples to test the opposite assumption H _{1} that there is a significant difference between the two populations.
A typical example is testing whether H _{0} breaks on two populations of samples from a multivariate Gaussian distribution G(x|c,Σ) with the mean vector c and the covariance matrix Σ, with help from the following Hotelling statistic (Hotelling 1931):
where N=N _{0}+N _{1}, and c _{1},c _{0} are the mean vectors of the case and control populations, respectively. Also, the covariance matrix is assumed to be Σ=Σ _{0}=Σ _{1}.
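As a minimal sketch of this computation, the following code evaluates the two-sample Hotelling statistic in its textbook form, T^2 = (N0 N1 / N)(c1 − c0)^T Σ^{-1}(c1 − c0), for two-dimensional samples. The function names, the restriction to 2D (so that the inverse can be written by hand), and the N − 2 divisor in the pooled covariance are assumptions of this illustration rather than details taken from Equation (2).

```python
def mean2(samples):
    """Mean vector of a list of 2D points."""
    n = len(samples)
    return [sum(x[0] for x in samples) / n, sum(x[1] for x in samples) / n]


def scatter2(samples, c):
    """Sum of outer products (x - c)(x - c)^T as a 2x2 nested list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in samples:
        d = (x[0] - c[0], x[1] - c[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def hotelling_t2(case, control):
    """T^2 = (N0*N1/N) * (c1-c0)^T Sigma^{-1} (c1-c0) with a pooled Sigma."""
    n1, n0 = len(case), len(control)
    n = n0 + n1
    c1, c0 = mean2(case), mean2(control)
    s1, s0 = scatter2(case, c1), scatter2(control, c0)
    # Pooled covariance under Sigma_0 = Sigma_1; the N-2 divisor is an assumption.
    cov = [[(s1[i][j] + s0[i][j]) / (n - 2) for j in range(2)] for i in range(2)]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = (c1[0] - c0[0], c1[1] - c0[1])
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return n0 * n1 / n * quad
```

A larger value indicates a larger separation of the two mean vectors relative to the shared covariance.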
Generally, we evaluate the difference between two populations based on population modelling by a parametric model q(x|θ), that is, by first modelling each population of samples and then evaluating the overall difference between the two resulting models. The performance is measured by the p value, which describes the false alarm probability of judging that H _{0} by Equation (1) significantly breaks. Such efforts are usually referred to as model-based tests, or sometimes called model comparison or class comparison (Simon et al. 2003).
Another typical example is logistic regression. We rewrite the above two populations of samples into a set of paired samples {x _{ t },ω _{ t }},t=1,⋯,N, with ω _{ t }=1 and ω _{ t }=0 indicating that the sample x _{ t } comes from the case and control population, respectively, and let ω _{ t } be regressed on x _{ t } via the following conditional probability:
All the unknowns, collectively denoted by θ, are estimated by maximising the following likelihood:
which cannot be solved analytically due to the nonlinearity of s(r) and is usually handled by a gradient-based iterative algorithm (Hosmer et al. 2013). The test of the null assumption by Equation (1) becomes testing the null assumption:
where w is a subset of θ. This test is typically made by either the Wald test or the score test (Engle 1984), both of which are computed from one or both of the following statistics:
where Δ(w) is called the score vector, and I(w) is called the Fisher information matrix.
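The gradient-based estimation mentioned above can be sketched as follows for a single covariate. The sketch assumes that s(r) is the usual sigmoid and that ζ _{ t }=w x _{ t }+c; the function names, the learning rate, and the number of steps are illustrative choices, not part of the original formulation.

```python
import math


def sigmoid(r):
    """Logistic function s(r) = 1 / (1 + exp(-r))."""
    return 1.0 / (1.0 + math.exp(-r))


def fit_logistic(xs, omegas, lr=0.1, steps=2000):
    """Maximise sum_t [omega_t ln s_t + (1 - omega_t) ln(1 - s_t)] by gradient ascent.

    xs are scalar covariates, omegas are 0/1 labels; returns (w, c).
    """
    w, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = 0.0
        grad_c = 0.0
        for x, om in zip(xs, omegas):
            err = om - sigmoid(w * x + c)  # derivative of the log-likelihood w.r.t. zeta_t
            grad_w += err * x
            grad_c += err
        w += lr * grad_w / n
        c += lr * grad_c / n
    return w, c
```

Plain gradient ascent is used here only for transparency; Newton-type iterations are the more common choice in practice.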
The Wald test considers the following:
as a test statistic that has an asymptotic normal distribution under the null assumption.
In contrast, Rao’s score test (or simply the score test, often known as the Lagrange multiplier test) considers:
as a test statistic that has an asymptotic distribution of \({\chi ^{2}_{k}}\), where k is the number of constraints imposed by the null hypothesis. It degenerates to \({\chi ^{2}_{1}}\) when w consists of only one parameter.
This logistic regression examines the difference between two populations by first building a hyperplane boundary and then testing Equation (5), which directly asks whether the boundary depends on the variables under consideration.
Discriminative analyses and integrative tests
Other than directly aiming at the boundary, a different aspect of logistic regression is that we can use p(ω _{ t }|x _{ t },θ) by Equation (3) to classify each sample by:
Equivalently, the same result comes from the hyperplane boundary ζ _{ t }=0 with ζ _{ t } given in Equation (3) such that samples are classified into its two sides. The outcome is the following decomposition:
That is, the case set X _{1} is separated into a subset \(X^{(1)}_{1}\) with unchanged labels and a subset \(X^{(0)}_{1}\) of samples that are relabelled as control samples, and similarly, the control set X _{0} into \(X^{(0)}_{0}\) with unchanged labels and \(X^{(1)}_{0}\) relabelled as case samples.
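The decomposition described above can be sketched in a few lines. The code assumes that a sample is assigned to the case side when ζ _{ t }=w ^{T} x _{ t }+c is non-negative (the handling of ties is an assumption), and it keys each subset by the pair (assigned label, original label), mirroring the superscript and subscript of \(X^{(1)}_{1}\), \(X^{(0)}_{1}\), \(X^{(0)}_{0}\), and \(X^{(1)}_{0}\). The function name is hypothetical.

```python
def decompose(samples, labels, w, c):
    """Split samples into four subsets keyed by (assigned label, original label)."""
    subsets = {(1, 1): [], (0, 1): [], (0, 0): [], (1, 0): []}
    for x, om in zip(samples, labels):
        zeta = sum(wi * xi for wi, xi in zip(w, x)) + c
        assigned = 1 if zeta >= 0 else 0  # side of the hyperplane zeta = 0
        subsets[(assigned, om)].append(x)
    return subsets
```

The sizes of the two off-diagonal subsets, keyed (0, 1) and (1, 0), count the relabelled samples.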
Actually, seeking a hyperplane boundary is the goal of linear discriminative analyses (LDA). One classic example is the Fisher discriminative analysis (FDA). For separating samples of two populations, the FDA seeks a projection y _{ t }=w ^{T} x _{ t } to map each vector x _{ t } into a univariate y _{ t } such that:
where for ω=0,1 we have
On the one-dimensional y _{ t }, it follows from Equation (2) that \(T^{2}=\frac {N_{0}N_{1}}{N} J_{y}\) and that FDA is equivalent to seeking a direction w along which the two populations differ most.
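A small sketch may help to make this equivalence concrete. It assumes that J _{ y } is the squared difference of the projected means divided by the pooled variance of the projections, and it uses the classical closed-form FDA direction, proportional to the inverse within-class scatter applied to c _{1}−c _{0}. The restriction to 2D inputs and all function names are illustrative assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mean2(samples):
    n = len(samples)
    return [sum(x[0] for x in samples) / n, sum(x[1] for x in samples) / n]


def fda_direction(case, control):
    """Closed-form 2D Fisher direction: w = S_w^{-1} (c1 - c0)."""
    c1, c0 = mean2(case), mean2(control)
    sw = [[0.0, 0.0], [0.0, 0.0]]  # within-class scatter matrix
    for group, c in ((case, c1), (control, c0)):
        for x in group:
            d = (x[0] - c[0], x[1] - c[1])
            for i in range(2):
                for j in range(2):
                    sw[i][j] += d[i] * d[j]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = (c1[0] - c0[0], c1[1] - c0[1])
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]


def j_y(case, control, w):
    """Assumed criterion: (projected mean difference)^2 / pooled projected variance."""
    y1 = [dot(w, x) for x in case]
    y0 = [dot(w, x) for x in control]
    m1, m0 = sum(y1) / len(y1), sum(y0) / len(y0)
    ss = sum((y - m1) ** 2 for y in y1) + sum((y - m0) ** 2 for y in y0)
    var = ss / (len(y1) + len(y0) - 2)
    return (m1 - m0) ** 2 / var
```

Under these assumptions, multiplying `j_y` along the FDA direction by N _{0} N _{1}/N gives the Hotelling statistic of the projected samples, and no other direction yields a larger value.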
With a small sample size, the resulting w by FDA may suffer from the well-known overfitting problem, for which efforts have been made on learning a linear boundary in the machine-learning literature. One classical method is the support vector machine (SVM) (Suykens and Vandewalle 1999; Suykens et al. 2002).
In studies of pattern classification and machine learning, the performance of discriminative analyses is typically measured by the misclassification rate of Equation (10), which features the separation or overlap of two populations around the boundary and reflects the chance of confusion incurred by a decision or prediction (sometimes called class prediction (Simon et al. 2003)).
The performance of discriminative analyses may also be measured by T ^{2}, which considers the separation of two populations of y _{ t }=w ^{T} x _{ t }. The p value, which varies monotonically with T ^{2}, may be obtained by a univariate t-test. Here, the performance is measured by only considering the salient difference between two populations along the normal direction of the boundary, instead of considering the overall difference in the entire space as addressed after Equation (2).
Alternatively, the performance of discriminative analyses may also be measured by a statistic that jointly considers the separating boundary and its outcome by Equation (10); see Equation (31) in (Xu 2013a).
Since there are different choices for evaluating the difference between two populations, we are motivated to examine whether they can be integrated for a better evaluation. The name IHT was previously advocated in (Xu 2013a, 2013b) for a joint consideration of the misclassification rate and the p value about the overall difference. This paper proceeds further along this direction.
Cox regression and linear mixed model
Survival analyses consider the relation of the observed time y _{ t } that a subject t passes before some event occurs to one or more covariates in x _{ t } that may be associated with y _{ t }. The Cox model for survival analysis (Cox and Oakes 1984) describes the hazard ratio as follows:
which shares the common part y _{ t }=w ^{T} x _{ t } with Equation (3). The difference is that w is estimated via maximising the following partial likelihood L(w):
Again, we can test H _{0} by Equation (5) with the Wald test by Equation (7) or Rao’s score test by Equation (8), where Δ(w) and I(w) are still obtained by Equation (6) but with L given by the above partial likelihood L(w).
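For illustration, the logarithm of a partial likelihood of this kind can be evaluated as sketched below. The sketch assumes the standard Cox form without tied event times, in which each observed event contributes exp(w ^{T} x _{ i }) divided by the sum of exp(w ^{T} x _{ j }) over the subjects still at risk. The `events` censoring indicator and the function names are introduced here and do not appear in the original text.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cox_log_partial_likelihood(w, xs, times, events):
    """Log partial likelihood, assuming no tied event times.

    xs: covariate vectors; times: observed times;
    events: 1 if the event was observed, 0 if the subject was censored.
    """
    ll = 0.0
    for i in range(len(xs)):
        if not events[i]:
            continue  # censored subjects enter only through the risk sets
        risk = sum(math.exp(dot(w, xs[j]))
                   for j in range(len(xs)) if times[j] >= times[i])
        ll += dot(w, xs[i]) - math.log(risk)
    return ll
```

At w=0 every subject in a risk set is weighted equally, so the value reduces to minus the sum of the log risk-set sizes over the observed events.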
Actually, the core part y _{ t }=w ^{T} x _{ t } of Equations (3) and (13) is also the core part of the classic multivariate linear regression y _{ t }=w ^{T} x _{ t }+e _{ t } with w estimated by minimising \(\sum _{t} e_{t}^{2}\).
Denoting y=[ y _{1},⋯,y _{ N }]^{T}, e=[ e _{1},⋯,e _{ N }]^{T}, and X=[x _{1},⋯,x _{ N }]^{T}, we may rewrite y _{ t }=w ^{T} x _{ t }+e _{ t } into y=X w+e as a degenerate case of the following linear mixed model (Demidenko 2013):
where Z is a design matrix and f is a random effect vector. We may use the existing methods to estimate w,K,R (Demidenko 2013) and then test w=0 via the Wald test by Equation (7) or Rao’s score test by Equation (8) but with the likelihood L replaced by:
Moreover, an N×1 vector y may be further extended to an N×m matrix, with one dependent variable extended to m dependent variables. Accordingly, w,f,e are extended to matrices W,F,E with m columns each. As a result, we have:
where F=[f _{1},⋯,f _{ m }], and E=[e _{1},⋯,e _{ m }]. One typical case is that f _{1},⋯,f _{ m } are i.i.d. with each f _{ i }∼G(f _{ i }|0,K). Also, e _{1},⋯,e _{ m } are i.i.d. with each e _{ i }∼G(e _{ i }|0,R).
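As a worked step for the vector model y=X w+Z f+e, and assuming that f and e are independent zero-mean Gaussians with covariances K and R as in the typical case above, the random effect can be integrated out. The symbol \(\Sigma_{y}\) is introduced here only for illustration; the result is the standard marginal form for Gaussian linear mixed models and is given as a sketch rather than as a restatement of Equation (16):

\[
\boldsymbol{y} \sim G\left(\boldsymbol{y}\,|\,X\boldsymbol{w},\ \Sigma_{y}\right), \qquad \Sigma_{y} = Z K Z^{T} + R,
\]
\[
\ln L = -\frac{1}{2}\left[(\boldsymbol{y}-X\boldsymbol{w})^{T}\Sigma_{y}^{-1}(\boldsymbol{y}-X\boldsymbol{w}) + \ln |\Sigma_{y}| + N\ln (2\pi)\right].
\]

Setting Z=0 recovers ordinary linear regression with noise covariance R.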
From inner product to bilinear form
In many studies of multivariate statistical analysis and machine learning, a basic sampling unit is a vector \(\boldsymbol {x}_{t}=\left [\!{x}_{t}^{(1)},\cdots, {x}_{t}^{(d)}\right ]^{T}\), and the basic computing operation is the inner product w ^{T} x _{ t } that is linear with respect to the elements of x _{ t } and also of w. Though w ^{T} x _{ t } becomes XW in Equation (17), it actually consists of a set of vector inner products in parallel.
Efforts have been made in (Xu 2013a, 2013b) to extend this inner product to obtain a matrix-variate discriminative analysis. Considering that a basic sampling unit is a matrix X _{ t } as shown in Figure 1, the inner product is extended into a bilinear form:
which is jointly quadratic with respect to w ^{(i)} and v ^{(j)} but still linear with respect to the elements of X _{ t } and is featured by two consecutive layers of inner products. Similarly, we may also have \(\boldsymbol {w}^{T}X_{t}\textbf {v}=\textbf {v}^{T}\textbf {x}_{t}^{w}\) and \(\boldsymbol {x}_{t}^{w}={X_{t}^{T}}\textbf {w}\). We call such a matrix-variate-based basic computing operation a bilinear form. This bilinear form leads us to matrix-variate LDA and factor analyses in (Xu 2013a, 2013b). Also, using the matrix normal distribution, the implementations are made by the Bayesian Ying-Yang harmony learning (Xu 1995, 2015).
To get further insight, we directly extend the vector inner product into the following matrix format:
which is still linear with respect to the elements of X _{ t } but cannot be decomposed into two inner products, where vec[ O] denotes the vectorisation of a matrix O.
Comparing Equations (18) and (19), we observe that the bilinear form can be regarded as constrained in the following structure:
That is, the weighting along the rows of X _{ t } is unrelated to the one along the columns of X _{ t }. It significantly reduces the number of free parameters of o ^{(i,j)} from md to m+d for w ^{(i)} and v ^{(j)}, which is favourable because we usually have a small sample size N. However, it also suffers the limitation of being applicable only to cases where the dependence across the rows of X _{ t } is not related to the one along the columns of X _{ t }. To relax this limitation, further generalisations of bilinear matrix forms will be proposed in Equation (40).
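The relation between the two forms can be checked numerically. The sketch below assumes that the constraint is o ^{(i,j)}=w ^{(i)} v ^{(j)}, as the reduction in the number of free parameters suggests, and that vec[·] stacks columns; the helper names are hypothetical.

```python
def bilinear(w, X, v):
    """w^T X v written as two nested layers of inner products."""
    return sum(w[i] * X[i][j] * v[j]
               for i in range(len(w)) for j in range(len(v)))


def outer(w, v):
    """Rank-one weight matrix O with entries o(i, j) = w(i) * v(j)."""
    return [[wi * vj for vj in v] for wi in w]


def vec(M):
    """Stack the columns of M into one flat list."""
    rows, cols = len(M), len(M[0])
    return [M[i][j] for j in range(cols) for i in range(rows)]


def vec_inner(O, X):
    """Unconstrained matrix format: vec[O]^T vec[X]."""
    return sum(o * x for o, x in zip(vec(O), vec(X)))
```

For a matrix with 2 rows and 3 columns, the rank-one structure needs 2+3 weights, whereas a free O needs 2×3; both forms return the same value whenever O is built by `outer`.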
Methods
KL statistics and matrix-variate tests
Given the case and control samples X _{ ω }={x _{ t,ω },t=1,⋯,N _{ ω }},ω=0,1 from a parametric family q(x|θ), all the unknown parts of the true value θ ^{∗} are estimated under H _{0} by Equation (1), e.g. by the maximum likelihood from X _{0}∪X _{1}. Also, we estimate \(\hat \theta \) from X _{1} and test whether H _{0} breaks by the following formulation (see Equation (36) in (Xu 2012a)):
from which the Hotelling T ^{2} statistic (Hotelling 1931) and FDA are obtained as its special cases.
Alternatively, we may also rewrite H _{0} into
with X _{1} from q(x|θ _{1}) and X _{0} from q(x|θ _{0}). We estimate θ _{1} from the case samples X _{1} and θ _{0} from the control samples X _{0} by either the maximum likelihood or other learning principles, and test H _{0} by the following case-control formula:
which directly measures the discrepancy between the case population and control population and provides a general formulation for model-based tests. In contrast, s _{ KL } by Equation (21) indirectly considers the difference of the case population from the pool of both populations under H _{0}.
For the special case that q(x|θ)=G(x|c,Σ), s _{ KL } by Equation (21) and s _{ KL } by Equation (23) are equivalent up to a constant scale, resulting in:
It relates to the Hotelling statistics by Equation (2) via \( T^{2} =2\frac {N_{0}N_{1}}{N_{0}+N_{1}}s_{\textit {KL}}\), i.e. the Hotelling statistics is covered as a special case of the general formulation by Equation (21).
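The factor of 2 in this relation can be checked with a short derivation. The sketch assumes that s _{ KL } here is the Kullback-Leibler divergence between two Gaussians sharing the covariance Σ, and writes δc=c _{1}−c _{0} for the position difference:

\[
s_{KL} = \int G(x|c_{1},\Sigma)\,\ln \frac{G(x|c_{1},\Sigma)}{G(x|c_{0},\Sigma)}\,dx
= \frac{1}{2}\,E_{G(x|c_{1},\Sigma)}\left[(x-c_{0})^{T}\Sigma^{-1}(x-c_{0}) - (x-c_{1})^{T}\Sigma^{-1}(x-c_{1})\right]
= \frac{1}{2}\,\delta c^{T}\Sigma^{-1}\delta c,
\]
\[
2\,\frac{N_{0}N_{1}}{N_{0}+N_{1}}\,s_{KL} = \frac{N_{0}N_{1}}{N}\,\delta c^{T}\Sigma^{-1}\delta c = T^{2}.
\]

The second equality in the first line uses the fact that the normalising constants cancel because both Gaussians share Σ, and that the two expectations differ exactly by \(\delta c^{T}\Sigma^{-1}\delta c\).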
The equivalence no longer exists when we consider other examples of q(x|θ _{1}) and q(x|θ _{0}). Because the case population reflects an abnormal situation and thus has a distribution that is quite different from the control population, q(x|θ _{1}) may come from a parametric family that is different from the one of q(x|θ _{0}). For example, we may consider a Gaussian for the control samples and a mixture of two Gaussians for the case samples.
In addition to testing c _{0}=c _{1} as considered by the Hotelling statistic, we may use s _{ KL } by Equation (23) to develop statistics for other null hypotheses of the type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\). For example, \({\theta ^{s}_{i}}\) could be a covariance Σ _{ i }.
Generally, we may use s _{ KL } by Equation (21) to develop a statistic for testing a general relation given by a vector equation h(θ)=0 that consists of one or several joint equations, for which we estimate θ _{0} from samples of X _{0}∪X _{1} subject to the constraint h(θ)=0 and estimate θ _{1} from only the case samples X _{1} without the constraint. The above type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\) is a special case \(h(\theta)={\theta ^{s}_{0}} - {\theta ^{s}_{1}}=0\). Also, the equality may be extended to several subsets \(\{{\theta ^{s}_{i}}\}\) that are equal to each other, with each \({\theta ^{s}_{i}}\) being either a mean vector c _{ i } or a covariance Σ _{ i }. Even the simplest case θ ^{s}=0, θ ^{s}⊆θ has been widely studied. For example, θ ^{s} could be the variances for the variance analyses or w in Equation (5) for logistic regression and Cox regression.
Equation (21) provides not only a general formulation for developing a statistic for a composite test but also a bird’s-eye view of the existing statistics for further understanding, improvements, and extensions.
Simply with each vector x replaced by a matrix X, we can extend Equations (21) and (23) to consider matrix-variate samples. Without loss of generality, we focus on Equation (23) and get:
We consider q(X|θ) given by the following matrix normal distribution (MND) (Dutilleul 1999; Xu 2012a):
where a matrix Ω describes the cross-column dependence of the matrix variate X, and a matrix Σ describes the cross-row dependence of X. This matrix distribution is equivalent to a multivariate Gaussian distribution G(vec(X)|vec(C),Σ⊗Ω), where ⊗ denotes the Kronecker product.
With each sample X _{ t,ω } from \(N\left (X|C^{x}_{\omega },\Omega ^{x}_{\omega },\Sigma ^{x}_{\omega }\right)\) under the assumption:
it follows from Equation (25) that we obtain:
as the matrix-variate counterpart of Equation (24), where parameters are typically estimated by the maximum likelihood principle (Xu 2015).
Generally, with the help of Equation (25), we may also develop statistics for distributions other than matrix normal distributions.
Model-based two-sample tests
The tests for H _{0} by Equation (22) are featured by comparing the difference between two parametric models q(x|θ _{1}) and q(x|θ _{0}) on the entire domain of x. Their basis is modelling the case population by q(x|θ _{1}) with its parameter θ _{1} estimated from X _{1} and modelling the control population by q(x|θ _{0}) with its parameter θ _{0} estimated from X _{0}. Thus, these tests are called model-based two-sample tests, or model-based tests for short where no confusion arises.
Typically, a statistic s is considered to measure the difference between two models. The bigger the value of s, the larger the difference. We reject H _{0} when s takes a large enough value s ^{∗}, while the false positive probability of this rejection is called the p value.
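How the p value is obtained depends on the statistic and its null distribution. One generic option, swapped in here purely for illustration and not prescribed by the text, is a permutation test: labels are reshuffled many times and the p value is the fraction of reshuffles whose statistic is at least as large as the observed one. The statistic (absolute mean difference on scalar samples), the number of permutations, and the seed are all assumptions of this sketch.

```python
import random


def mean_diff(case, control):
    """Illustrative statistic s: absolute difference of the sample means."""
    return abs(sum(case) / len(case) - sum(control) / len(control))


def permutation_p_value(case, control, statistic=mean_diff, n_perm=999, seed=0):
    """Estimate P(s >= observed s) under H0 by random relabelling."""
    rng = random.Random(seed)
    observed = statistic(case, control)
    pooled = list(case) + list(control)
    n1 = len(case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, keeping p strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Any statistic that grows with the difference between the two populations could be passed in place of `mean_diff`.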
Usually, how to get a statistic s from samples is task-dependent. It is typically a function of the first- and second-order statistics that are random variables directly obtained from samples of populations, e.g. see the Hotelling statistic by Equation (2). Equation (23) provides a general perspective for getting such a statistic s _{ KL }, covering not only the first- and second-order statistics but also ones beyond.
Actually, Equation (23) can be further generalised. Adding in the priors α _{1},α _{0} for q(x|θ _{1}) and q(x|θ _{0}), we have:
which describes the difference observed from the case side. From the control side, we have also:
We further get their average and difference as follows:
For q(x|θ)=G(x|c,Σ), we have:
from which we observe how an overall difference is structured from the statistics on individual differences. For KL _{sum}, the role of the anti-dispersion difference δ _{ α,Σ } is cancelled while the position difference δ c is averaged. For KL _{dif}, the role of δ _{ α,Σ } is summed up while the position difference δ c is cancelled. In other words, the roles of KL _{sum} and KL _{dif} are complementary. According to the nature of tasks, we may use either of them separately or both of them jointly.
The performance of examining H _{0} by Equation (22) is typically evaluated via the p value, which depends not only on how p is approximately estimated but also on how well q(x|θ _{0}) models X _{0} and q(x|θ _{1}) models X _{1}. Poor modelling makes the resulting p unreliable. Thus, the performance evaluation should also consider the corresponding modelling error or generally the likelihood:
The modelling error depends not only on what type of model is used but also on an appropriate model complexity. Using a model with a high model complexity can lead to an over-optimistic result, i.e. an overfitting problem. To remedy it, we need to consider either an average of modelling errors on training and testing samples (e.g. by cross-validation (Stone 1974)) or an approximate generalisation error given by one of the model-selection criteria (e.g. BIC (Schwarz 1978)).
Jointly, model-based two-sample tests involve two tasks, that is, the first two tasks summarised in Table 1. Task A is a typical topic of machine learning, from which existing studies can be adopted, while task B is a typical topic of statistical testing, with its corresponding ε _{ B } being a non-negative measure that monotonically decreases towards zero as s tends towards a large value.
It is an open challenge to integrate ε _{ A } and ε _{ B } into one objective to optimise, because how to combine them has rarely been investigated. A preliminary study has been made empirically with the help of 2D scatter plots of ε _{ A } versus ε _{ B }, as illustrated in Figure 2. Each point denotes a performance pair (ε _{ A },ε _{ B }) associated with one miRNA on the samples for gene expression. The points located near the origin (e.g. those in orange) are the candidate points of interest.
Matrix-variate discriminative analysis
As addressed around Equation (11), the classic FDA seeks a projection y _{ t }=w ^{T} x _{ t } to maximise J _{ y }. Moreover, it follows from the bilinear form by Equation (18) that a matrix-variate discriminative analysis is obtained by:
which may be solved by iterating:
Generally, the bilinear form by Equation (18) may also be rewritten into the following matrix format:
with an m×m _{ s } matrix V and a d×d _{ s } matrix W. It degenerates back to Equation (18) when m _{ s }=1,d _{ s }=1. Mapping into one variable y _{ t } may lose too much discriminative information. Instead, Equation (35) maps X _{ t } into a size-reduced matrix, a column vector, or a row vector according to the practical problem, e.g. for not only genomics data in genetic biology but also image or table data in various tasks of big-data analyses.
With X _{ t } replaced by Y _{ t }, Equations (25) to (29) are directly applicable. If X _{ t } comes from an MND, Y _{ t } comes from an MND too. Accordingly, Equation (33) becomes:
where the parameters are given in a way similar to Equation (28). Also, its solution may be obtained by iterating:
Actually, Equation (35) computes a set of the bilinear matrix forms in parallel as follows:
Each \(y_{t}^{(k,\ell)}\) above and the bilinear form by Equation (18) suffer the limitation discussed after Equation (20), which is relaxed with v ^{(j)} replaced by \(v^{(j)}_{i}\) or v ^{(j,ℓ)} replaced by \(v^{(j,\ell)}_{i}\), i.e. adding another dimension by a subscript i.
Focusing on the former, we extend Equation (20) into:
Accordingly, we extend Equation (18) into:
where T r[ A] denotes the trace of the matrix A.
Putting it into Equation (11) and considering choice (a) in Equation (39), we get Equation (33) modified into:
which may be solved by iterating:
For simplicity, we may approximately ignore the coupling across different subscript i and get:
This solution does not relate to w, and thus, the job is done after getting w ^{∗} by Equation (34).
Also, we may update V by a gradient-based approach via ∇_{ V } J(w,V). Practically, a regularisation may be added on J(w,v) and J(w,V) via Gaussian priors on w,v, and V. Alternatively, we may make sparse learning via Laplace priors on w,v, and V.
As a complement to model-based two-sample tests, which consider H _{0} by Equation (22) from an overall perspective of populations, we may also perform the classification task in Table 1 to evaluate the goodness of the decomposition by Equation (10), measured by another quantity ε _{ C }, e.g. the following rate of incorrect classification:
Classically, an optimal classification is given by:
where ξ could be either x _{ t } or X _{ t }, or the corresponding projections y _{ t } and Y _{ t }. Mapping samples into the projections helps to reduce the dimension of x _{ t } and X _{ t } for tackling the overfitting difficulty of task A in Table 1, especially when the sample size is not large enough. Also, it facilitates visualisation of two populations in a low dimension (especially in three dimensions or fewer) such that classification can be made with human interaction.
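The classification rule and the error rate ε _{ C } can be sketched together. The code assumes that the optimal rule is the Bayes rule, which assigns ξ to the label ω with the larger α _{ ω } q(ξ|θ _{ ω }), and uses univariate Gaussians for q; the function names and the convention that ties go to the case side are illustrative assumptions.

```python
import math


def gauss_pdf(x, c, var):
    """Univariate Gaussian density G(x|c, var)."""
    return math.exp(-(x - c) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def classify(x, params, priors):
    """Return the label omega in {0, 1} with the larger prior-weighted density.

    params[omega] = (mean, variance); priors[omega] = alpha_omega.
    """
    score1 = priors[1] * gauss_pdf(x, *params[1])
    score0 = priors[0] * gauss_pdf(x, *params[0])
    return 1 if score1 >= score0 else 0


def misclassification_rate(xs, labels, params, priors):
    """Fraction of samples whose assigned label differs from the original one."""
    wrong = sum(1 for x, om in zip(xs, labels)
                if classify(x, params, priors) != om)
    return wrong / len(xs)
```

With equal priors and equal variances, the rule reduces to a threshold halfway between the two means.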
Boundary-based tests
Actually, the FDA by Equation (11) finds w that defines the normal direction of the best discriminative hyperplane, as shown in Figure 3. In addition to Equation (45), the hyperplane often acts as a separating boundary as follows:
That is, it performs task C to get the decomposition by Equation (10) on which we may directly get the measure ε _{ C } by Equation (44).
Alternatively, testing Equation (1) may be made by the following statistics from Equation (10):
There are also two other choices in Table 2. Choice (1) is a model-based test for task B from the perspective of one-dimensional samples of y _{ t }=w ^{T} x _{ t }. Focusing on the most discriminative direction, this test puts attention only on salient differences. As to be addressed later in Table 3, the test can be made together with testing H _{0} by Equation (5) such that the rest of the entire sample space is taken into consideration.
Choice (2) in Table 2 provides a statistics for task B on samples without dimension reduction. The statistics s _{ B } comes from considering that samples of \(X^{(1)}_{1}, \ X^{(0)}_{0} \) should be distant from the boundary (as illustrated by two blue arrows in Figure 3) while samples of \( X^{(1)}_{0}, \ X^{(0)}_{1}\) should not be far from this boundary (see two red arrows). Actually, s _{ B } is a special case of the ones given by Equations (26) and (30) in (Xu 2013a). The only difference is that γ _{ B }>0 is added here to trade off the contribution from \(X^{(1)}_{0}\cup X^{(0)}_{1}\).
Both choices in Table 2 are based on the boundary (i.e. either Equation (10) or y _{ t }=w ^{T} x _{ t }) and thus are called boundary-based two-sample tests, or BBT for short. Different choices of BBT are also coupled with how w is obtained; see some examples outlined in Table 4.
Replacing Equation (11) with the matrix-variate FDA by Equation (33), we get the projection y _{ t }=w ^{T} X _{ t } v column by column along the direction w and row by row along the direction v. With every appearance of x replaced by \(\boldsymbol {x}_{t}^{v}=X_{t}\mathbf {v}\), all the above studies directly apply. Similarly, we may also consider the dual representation \(y_{t}= \mathbf {v}^{T}\mathbf {x}_{t}^{w} \) with \(\boldsymbol {x}_{t}^{w} ={X}_{t}^{T}\mathbf {w}\) to get a linear separating boundary featured by v. It follows from Equations (19) and (20) that w and v jointly form a linear boundary by vec[ O] to separate samples of vec[ X _{ t }].
Furthermore, extension can be made on the generalised bilinear form via Equation (40) and Equation (41), with each x replaced by \(\boldsymbol {x}_{t}^{v} \) given in Equation (40).
Extensions can also be made on the generalised bilinear form by Equation (35). Samples of two populations are projected into a dimension-reduced matrix Y _{ t }=V ^{T} X _{ t } W, and then a matrix-variate Hotelling test can be made by Equation (28) with X _{ t } replaced by Y _{ t } and the subscript x replaced by y, where the matrices W,V actually take the roles of the boundary.
Matrix-variate logistic regression
Testing H _{0} by Equation (5) has been widely studied in the literature of logistic regression. Actually, the role of this w is the same as the one in Equation (46), i.e. a discriminative boundary that separates every sample into either ω=1 or ω=0. Thus, the choices in Table 4 can be cross-utilised for mutual benefit, e.g. getting w via FDA by Equation (11) is relatively easy to compute and thus provides an initialisation for estimating w by Equation (4), while the advantage of Equation (3) over FDA is that dummy or design variables may be taken into consideration for learning w, e.g. we extend ζ _{ t }=y _{ t }+c in Equation (3) into:
where ξ _{ t } consists of dummy variables. Moreover, random effects may also be added, in a way similar to that of the linear mixed model by Equation (15).
Testing H _{0} by Equation (5) is typically handled with the Wald test by Equation (7) or Rao’s score test by Equation (8), for which the score vector and the information matrix are given as follows (Pan et al. 2014):
where \( \bar \omega \) denotes the mean of ω _{ t }.
Being different from the BBT addressed in the previous subsection, testing H _{0} by Equation (5) directly aims at whether a boundary w exists. Such a test is thus named boundary existence test. It is widely known as a test for regression analyses. Also, we may regard it as a twosample test that is complementary to the BBT choice (1) in Table 2. The two tests jointly cover the entire space of samples.
The boundary existence test actually tackles another essential problem of discriminative analysis, namely, task D in Table 1. Given two populations with a finite sample size, it is not difficult to draw a boundary to separate them if there is no restriction on the complexity of the boundary. However, a boundary with a high complexity will be unreliable to separate new samples that come randomly from the same populations. To be reliable, the boundary should have an appropriate complexity too. It follows from Equation (45) that an optimal separating boundary is related to the models q(xθ _{1}) and q(xθ _{0}). In other words, appropriate boundary complexity is related to an appropriate model boundary complexity. Thus, task D and task A in Table 1 are coupled.
Typically, we consider a linear boundary because of its simple complexity. In the literature of pattern recognition (Cortes and Vapnik 1995; Cover 1965) efforts on whether samples of two populations are linearly separable by a hyperplane or a maximummargin hyperplane can be regarded as examples related to task D in Table 1.
Next, we proceed to consider matrixvariate logistic regression. Putting the case and control samples into a paired set {X _{ t },ω _{ t }},t=1,⋯,N, we extend Equation (3) with the inner product y _{ t }=w ^{T} x _{ t } to be replaced by the bilinear form by Equation (18) or its extension by Equation (40).
Given V, the above studies directly apply when \(\boldsymbol {x}_{t}^{\textbf {v}}\) in Equation (40) replaces x _{ t } in Equations 3, 4, 7, and 8. The task of learning w,V can be made via the matrixvariate FDA by Equations (34) or (42).
Alternatively, we may estimate w,V via the maximum likelihood L by Equation (4) with the advantage of taking the effect of covariates into consideration. With −L written as J(w,V), we get it solved by Equation (37) with w replaced by W, e.g. implemented by the following gradientbased updating (Hosmer et al. 2013):
where η _{ w }>0,η _{ V }>0 are small learning step sizes.
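As an illustration of such a gradient-based updating, the following is a minimal sketch rather than the exact update rule of this paper. It assumes the simple bilinear form y _{ t }=w ^{T} X _{ t } v+c with a vector v (instead of the matrix V), no covariates, and plain gradient ascent on the log-likelihood; the function names, the step sizes, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def fit_bilinear_logistic(X, omega, eta_w=0.05, eta_v=0.05, n_iter=500, seed=0):
    """Gradient ascent for p(omega=1|X) = sigmoid(w^T X v + c).

    X: array of shape (N, d, m) holding N matrix samples.
    omega: array of shape (N,) with 0/1 labels (control/case).
    """
    X = np.asarray(X, dtype=float)
    omega = np.asarray(omega, dtype=float)
    N, d, m = X.shape
    rng = np.random.default_rng(seed)
    # Small random start: w = v = 0 is a stationary point of the bilinear form.
    w = rng.normal(scale=0.1, size=d)
    v = rng.normal(scale=0.1, size=m)
    c = 0.0
    for _ in range(n_iter):
        y = np.einsum("i,nij,j->n", w, X, v) + c
        p = 1.0 / (1.0 + np.exp(-y))
        r = omega - p  # residuals omega_t - p_t
        # Averaged gradients of the log-likelihood, both taken at the old (w, v).
        grad_w = np.einsum("n,nij,j->i", r, X, v) / N
        grad_v = np.einsum("n,nij,i->j", r, X, w) / N
        w = w + eta_w * grad_w
        v = v + eta_v * grad_v
        c = c + eta_w * r.mean()  # the intercept reuses eta_w for simplicity
    return w, v, c

def log_likelihood(X, omega, w, v, c):
    """Bernoulli log-likelihood, with log-sigmoid computed via logaddexp."""
    X = np.asarray(X, dtype=float)
    omega = np.asarray(omega, dtype=float)
    y = np.einsum("i,nij,j->n", w, X, v) + c
    return float(np.sum(-omega * np.logaddexp(0.0, -y)
                        - (1.0 - omega) * np.logaddexp(0.0, y)))
```

Since only the product of w and v enters the model, their individual scales and signs are not identifiable; a normalisation of one of them may be added in practice.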
Also, we may test the dual problem of Equation (5) as follows:
for the bilinear form by Equation (18) simply with v replacing w in Equations 6, 7, 8, and 49. Similarly, extension may also be made to test H _{0}: v _{ i }=0,∀i.
Moreover, we may also apply Equation (21) to develop a statistics as follows:
with p(ω _{ t }x _{ t },θ) given by Equation (3), where θ ^{∗} is estimated via maximising L by Equation (4) under H _{0} by Equation (5) and \(\hat \theta \) is estimated via maximising L by Equation (4) without H _{0}.
Similarly, we may get a matrixvariate Cox regression with the inner product w ^{T} x _{ t } in Equation (13) replaced by the bilinear form by Equation (18) or its extension Equation (40). Accordingly, we test the H _{0} by Equation (5) and the H _{0} by Equation (51), using the Wald test with Equation (7) or Rao’s score by Equation (8) with Δ(w),I(w) computed from Equation (6) but L given by the partial likelihood L(w).
Furthermore, the univariate y _{ t } can be extended into a vector or matrix Y _{ t }. One typical example is a bilinear regression of Y _{ t } by Equation (35), that is we consider:
where E _{ t } is independent of X _{ t } and comes from N(Y _{ t }−V ^{T} X _{ t } W|0,Λ,D) by Equation (26), while both Λ and D are diagonal matrices.
Again, there are two choices to estimate W,V. One is the matrixvariate FDA by Equation (36). The other is maximising the following likelihood:
Particularly, when Λ=λ I,D=d I, we are led to the following least-squares error approach:
which may be again handled by Equation (37) with w replaced by W.
It can be observed that Equation (53) is an extension of Equation (17) with F=0. On the other hand, we may extend Equation (17) into a bilinear extension as follows:
which degenerates to:
as a bilinear mixed model extended from Equation (15).
Integrative hypothesis test
Discriminative analysis and testing of H _{0} by Equation (1) are made from either a modelbased perspective (e.g. performing task A and task B in Table 1) or a boundarybased perspective (e.g. performing task C and task D in Table 1). Moreover, all the four tasks are associated with another problem called feature selection, that is, selecting a number of elements in x to form a subset x _{ f } such that one or more of the four tasks achieves a good enough performance.
In the existing efforts, each of the four tasks has been studied individually, with each having its own strength and limited coverage. However, the performances of these tasks are coupled, and thus the best set of features for one task is not necessarily the best for the others.
The complementary nature of task B and task C was preliminarily discussed in Section VI in (Xu 2012a), where a modelbased test for task B is named as Atest (a test in the observed data domain) and a boundarybased test for task C is named as Itest (a test in the inner representation domain). Under the name of IHT, good performances of task B and task C are demanded jointly (Xu 2013a, 2013b). This paper further extends IHT to include task A and task D.
We start by jointly optimising the performances of task B and task C. Its necessity and feasibility are empirically justified with the help of 2D scatter plots of ε _{ B }, the p value that measures the performance of task B, against ε _{ C }, the misclassification rate that measures the performance of task C. A small ε _{ B } indicates a big difference between q(x|θ _{0}) and q(x|θ _{1}) from an overall perspective, and a small ε _{ C } indicates a good classification of samples from a separating boundary perspective. Illustrated in Figure 4 are two examples obtained from one empirical study.
As indicated by the blue vertical dashed line in Figure 4, many miRNAs share the same small p value ε _{ B } but take values of the misclassification rate ε _{ C } over a wide range. Also, as indicated by the blue horizontal dashed line in Figure 4, multiple miRNAs may take the same misclassification rate but different p values. In other words, though the performance of one task is optimised, the performance of the other can still be poor. Thus, we need to jointly seek good performances of both tasks, i.e. IHT is necessary. On the other hand, the red dots within the blue circle in Figure 4 show that there are indeed a few scattering points that take both a small p value ε _{ B } and a small misclassification rate ε _{ C }, i.e. the goal of IHT is also feasible.
Such a 2D plot’s evaluation provides a tool for better joint performances of task B and task C, by which we may interactively observe the configuration of scattering points and locate the candidate points that are nearest to the origin of the coordinate space.
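The step of locating the candidate points nearest to the origin can be sketched as follows. This is only an illustration: the Euclidean distance, the optional per-axis scales, and the candidate names are assumptions introduced here, and the choice of scaling is itself an open issue.

```python
import math

def rank_by_distance_to_origin(candidates, scale_b=1.0, scale_c=1.0):
    """Sort candidates by their distance to the origin of the (eps_B, eps_C) plane.

    candidates: dict mapping a feature name to a pair (eps_b, eps_c), where
    eps_b is a p value and eps_c is a misclassification rate.
    scale_b, scale_c: hypothetical scales that put both axes on a common footing.
    """
    def distance(item):
        eps_b, eps_c = item[1]
        return math.hypot(eps_b / scale_b, eps_c / scale_c)

    return [name for name, _ in sorted(candidates.items(), key=distance)]

# Hypothetical candidates: a small p value alone does not imply a small error rate.
example = {
    "feature_1": (0.001, 0.40),
    "feature_2": (0.002, 0.10),
    "feature_3": (0.200, 0.08),
}
ranking = rank_by_distance_to_origin(example)
```

In this toy example, the candidate with the smallest p value is not ranked first, because its misclassification rate is large.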
Extensions can be further made to a joint evaluation of the IHT performance with task A and task D also included, such that the strengths of different tests and methods are integrated in a rather systemic way, for which we address four types of IHT in Table 3.
From the model-based perspective, the first type is an extension of the one addressed in Figure 2, with ε _{ C } added to get a 3D plot for a joint evaluation of ε _{ A }, ε _{ B }, and ε _{ C }. Instead of Equation (45), we may get ε _{ C } by some nonparametric classifiers, e.g. the classic kNN classifier and the kernel classifiers (Williams 2003). However, we are unable to handle task D because the boundary involved here does not have an explicit expression to be tested.
From the boundary-based perspective, the second type considers samples jointly by a separating boundary and projected samples, evaluated by ε _{ D } for the existence of the boundary, ε _{ C } for the misclassification by the boundary, and ε _{ B } for measuring the difference of two populations either along the normal direction of the boundary or according to the sample deviations from the boundary. Again, we may use a 3D plot for a joint evaluation of ε _{ B }, ε _{ C }, and ε _{ D }. However, it is difficult to handle task A merely based on the boundary.
The type of mixmodelled IHT combines the above two types to avoid the weak points of each type. Two typical examples are listed in Table 3. One picks ε _{ A },ε _{ B } from type (1) and ε _{ C },ε _{ D } from type (2) for a joint evaluation. The other modifies ε _{ A },ε _{ B } by taking the outcome by Equation (10) of the boundary in consideration, with the original estimated θ _{0} and θ _{1} replaced by the following maximum likelihood estimation:
Even better, we may estimate each θ _{ ω } by the maximum likelihood on the entire set X of samples but with the likelihood of each sample weighted by its corresponding posteriori p(ωsample) by Equation (3).
BYYharmonylearningbased formulation
The 2D and 3D plots provide only a preliminary tool for IHT; we need further studies on not only appropriate combinations of multiple p values and misclassification rates but also simultaneous optimisation of multiple measures. For the latter purpose, the mix-modelled IHT in Table 3 is further extended via iteratively learning θ _{0} and θ _{1} by Equation (58) to update the models \(q\left (X^{(0)}_{0} | \theta _{0}\right), q\left (X^{(1)}_{1} | \theta _{1}\right)\) and also re-estimating the boundary w, e.g. by an FDA method based on the updated models.
Leaving the task D for a future study, in the sequel, we further understand the task of learning the models from a perspective of learning a Ying machine and the task of learning the boundary from a perspective of learning a Yang machine, which leads to a BYYharmonylearningbased formulation for IHT.
We start from revisiting Equation (29) from an IHT perspective. From α _{1} q(xθ _{1})=q(xθ)−α _{0} q(xθ _{0}), we consider the task B by the following measure:
from which we observe that a large K L _{10} comes from a large L _{1} that reflects a good modelling of α _{1} q(xθ _{1}) (i.e. a good performance of task A) and a small confusion error \(e^{c}_{0,1}+e^{c}_{1,0}\) that is closely related to a small misclassification (i.e. a good performance of task C). In other words, three tasks are coordinately optimised.
However, a good modelling on the control samples has not been taken in the consideration of K L _{10}, which may be further improved by considering:
From this K L _{sum}, we still need to get θ _{ ω },ω=0,1 by the ML learning. In other words, K L _{sum} merely takes the role of evaluating the performances of task B and task C, but does not have a port to accommodate samples for estimating θ _{ ω },ω=0,1. Favourably, such a port is provided in the BYY harmony learning such that task A, task B, and task C are all jointly implemented.
First proposed in (Xu 1995) and systematically developed in the past two decades, the BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing learning regularisation, and a class of algorithms that implement automatic model selection during parameter learning. Readers are referred to (Xu 2010, 2012b, 2015) for the latest introduction about the BYY harmony learning.
Briefly, a BYY system consists of a Yang machine and a Ying machine corresponding to two types of decomposition, namely, Yang p(R|X)p(X) and Ying q(X|R)q(R), respectively. The data X is regarded as generated from its inner representation R that consists of latent variables Y and parameters θ. The harmony measure is mathematically expressed as follows:
Maximising this H(pq) makes this Ying Yang pair not only best matched but also have the least complexity. Such an ability can also be further observed from several perspectives (see Section 4.1 in (Xu 2010)).
Applied to α _{1} q(xθ _{1}) and α _{0} q(xθ _{0}), we have:
where p(x) provides a port to accommodate samples \(\{\mathbf {x}_{t}\}^{N}_{t=1}\) via an empirical \(p(\mathbf {x})=\frac {1}{N}\sum _{t} \delta (\mathbf {x}-\mathbf {x}_{t})\) with δ(x) being the Dirac delta, which thus makes it possible to estimate θ _{ ω },ω=0,1 via maximising H(pq).
It follows from p(0|x _{ t })+p(1|x _{ t })=1 that we get:
Approximately considering p(x)≈q(xθ), \(e^{H}_{0,1}+e^{H}_{1,0}\approx e^{c}_{0,1}+e^{c}_{1,0}\), and \({L_{1}^{H}}+{L_{0}^{H}}\approx L_{1} +L_{0}\), we observe that H(pq) shares a nature similar to K L _{sum} in Equation (59), while a difference is that the modelling part \({L_{1}^{H}}+{L_{0}^{H}}\) is provided with a port p(x) to accommodate samples such that task A can be performed via maximising H(pq) without a need of separately estimating θ _{ ω } by the ML learning.
For q(xθ)=G(xc,Σ), we implement the maximisation of H(pq) to estimate θ _{ ω } by directly adopting the semisupervised BYY harmony learning for Gaussian mixture given in (Xu 2015), i.e. its algorithm 9, by which the performances of task A, task B, and task C are coordinated. Moreover, H(pq) can be extended into its matrixvariate counterpart. Particularly, algorithm 9 in (Xu 2015) can be extended into the algorithm ?? given below for learning \(\alpha _{\omega }N\left (XC^{x}_{\omega },\Omega ^{x}_{\omega },\Sigma ^{x}_{\omega }\right)\).
During implementation of the above algorithm, not only task A is performed but also task C can be simply handled in the Yang step by checking whether \(\phantom {\dot {i}\!}p_{1 \textbf {x}_{t}} \ge p_{0 \textbf {x}_{t}} \) to classify each sample into the case or control. Also, task B can be made after learning by putting the resulted parameters into s _{ KL }=K L _{10} or s _{ KL }=K L _{sum} to get the corresponding p value.
Last but not least, considering semisupervised learning, we also propose an improved procedure in Table 5 for training, testing, and validating on a small size of samples.
Integrating p values, inferring rejection domain, and Sspace boundarybased tests
Each IHT type in Table 3 involves more than one measure, which raises the problem of how different measures are jointly evaluated. Though 2D or 3D plots provide a possible joint evaluation, how to appropriately scale each measure is still a challenging issue. In general, we need to integrate multiple measures into a scalar index based on which the joint performance can be evaluated, which relates closely to efforts made on combining multiple classifiers (Xu and Amari 2008; Xu et al. 1992b) and evidence combination (Barnett 2008).
For an IHT task, the final scalar index is typically the p value. When multiple measures are all in the p values, what we encounter becomes the task of p value combination, e.g. by the Fisher combination (Fisher 1948).
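For reference, the classic Fisher combination can be sketched as follows, under the usual assumption that the k p values are independent and uniformly distributed under H _{0}, so that −2∑ln p _{ i } follows a chi-square distribution with 2k degrees of freedom. The function name is illustrative, and the closed-form tail probability for an even number of degrees of freedom is used to avoid external dependencies.

```python
import math

def fisher_combination(p_values):
    """Combine independent p values by Fisher's method.

    Returns the statistic -2 * sum(log p_i) and the combined p value, i.e. the
    upper tail of a chi-square distribution with 2k degrees of freedom.
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p values must be a non-empty sequence in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    # For even degrees of freedom 2k:
    # P(chi2 > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total
```

With a single p value, the combined p value reduces to that p value, which gives a quick sanity check.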
In Table 3, ε _{ B } and ε _{ D } are already given in p values. But ε _{ A } is usually measured by a square error or negative log-likelihood, and ε _{ C } is measured by a misclassification rate. Alternatively, ε _{ C } may be given in a p value via the statistics in Equation (47). Letting s=−ε _{ A }, or generally s=−ε for a monotonic measure ε≥0 that prefers values close to zero, we may get the corresponding p value with the help of the permutation method.
However, p value combination has a weak point. Each p value is merely a positive number that indicates the false alarm probability, losing certain useful information already. Under the term metaanalysis (Evangelou and Ioannidis 2013), efforts have been made by transforming p values into multiple Z statistics such that the missing information is added in without or with help of information directly from data (Zaykin 2011).
Actually, the Hotelling T ^{2} statistics by Equation (24) and getting a statistics by Equation (21) may also be regarded as examples that get an integrated statistics s _{ f }. Generally, a multivariate hypothesis test may also be regarded as an integration of multiple univariate hypothesis tests.
Typically, an integrated statistics s _{ f }=g(s,Ψ)≥0 comes from s=[s ^{(1)},⋯,s ^{(d)}] such that s _{ f }≥0 monotonically increases as the situation deviates further from H _{0}, where each s ^{(i)} comes from one univariate hypothesis test (e.g. s=c _{1}−c _{0} in the Hotelling T ^{2} statistics) with a set Ψ of parameters shaping the integration (e.g. the covariance Σ in the Hotelling T ^{2} statistics). The set Ψ is specified with or without the help of information obtained directly from input data. A critical value \({\tilde s}_{f}\) is computed from the original pair of sample sets X _{0},X _{1}. Then, the false alarm probability \(p(s_{f}>{\tilde s}_{f}|H_{0})\) is obtained as the p value, where, here and hereafter, p(·|H _{0}) denotes a probability under the condition that H _{0} holds.
However, choices for such a s _{ f }=g(s,Ψ) are very limited in the existing studies, mostly in a quadratic form such as Hotelling statistics, Rao’s score by Equation (8), and the Wald test by Equation (7). This is equivalent to approximately regarding s ^{(1)},⋯,s ^{(d)} from a multivariate Gaussian distribution, while other distributions are seldom studied yet.
Instead of seeking an integrated statistics s _{ f }, we directly seek the domain \(\Gamma (\boldsymbol {\tilde s})\) of rejecting H _{0} in the space of s based on a critical vector \(\boldsymbol {\tilde s}\) as follows:
where \(\boldsymbol {\tilde s}_{{X}_{10}}=I_{\textit {nf}}(X_{0}X_{1})\) means that \(\boldsymbol {\tilde s}\) is inferred from the given sample set X _{0},X _{1} by an inferring method I _{ nf }, and the subscript X _{10} is used as the abbreviation of X _{1}X _{0}, which will be used whenever its omission will not cause confusion.
Then, test is made by checking the probability that s falls in \(\Gamma (\boldsymbol {\tilde s})\) under H _{0}, that is:
We estimate the p value by a permutation test. That is, we get a new pair of sample sets \(X_{0}^{\pi }, X_{1}^{\pi }\) from X _{0},X _{1} by a permutation π that shuffles each label ω of x _{ t,ω } and then we obtain:
where # S denotes the cardinality of a set S, the subscript \(X^{\pi }_{10}\) is used as the abbreviation of \(X_{0}^{\pi }X_{1}^{\pi }\), and Π consists of a large enough set of permutations made by either enumeration or random shuffling, including that π=empty denotes the sample pair X _{0},X _{1}.
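The permutation estimate described above can be sketched as follows. This is a minimal sketch under stated assumptions: the statistic and the membership rule of the rejection domain are passed in as user-supplied functions (hypothetical names `statistic` and `in_rejection`), random shuffling is used instead of enumeration, and the unpermuted pair is counted as one member of Π.

```python
import random

def permutation_p_value(x0, x1, statistic, in_rejection, n_perm=999, seed=0):
    """Estimate p(s in Gamma(s_tilde) | H0) by shuffling the case/control labels.

    statistic(x0, x1) returns the statistic s (scalar or vector);
    in_rejection(s, s_tilde) returns True if s falls in the rejection domain
    inferred from the critical value s_tilde of the original pair.
    """
    rng = random.Random(seed)
    s_tilde = statistic(x0, x1)
    pooled = list(x0) + list(x1)
    n0 = len(x0)
    # The empty permutation (the original pair) is included in the count.
    hits = int(bool(in_rejection(s_tilde, s_tilde)))
    for _ in range(n_perm):
        rng.shuffle(pooled)
        s_pi = statistic(pooled[:n0], pooled[n0:])
        if in_rejection(s_pi, s_tilde):
            hits += 1
    return hits / (n_perm + 1)

def mean_difference(x0, x1):
    """Example statistic: case mean minus control mean."""
    return sum(x1) / len(x1) - sum(x0) / len(x0)

def one_tail_domain(s, s_tilde):
    """Example rejection domain: the half line at or beyond the critical value."""
    return s >= s_tilde
```

For instance, `permutation_p_value(controls, cases, mean_difference, one_tail_domain)` gives a one-tail permutation p value for the difference of two sample means; a vector statistic only requires a different pair of functions.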
Recalling the classic studies of getting an integrated statistics s _{ f }, we observe that \({\tilde s}_{f}=g(\boldsymbol {s},\Psi)\) actually defines a closed shell or boundary that divides the space of multivariate statistics s (S-space for short) into two parts, with its inside as the acceptance domain and its outside as the rejection domain \(\Gamma (\boldsymbol {\tilde s})\). For example, the acceptance domain obtained by both the Hotelling statistics and Rao’s score by Equation (8) is a hyperelliptic volume. We may further extend a hyperelliptic volume to a bounded volume in another shape. Actually, a bounded acceptance domain corresponds to a probabilistic modelling by a single-mode distribution. Thus, the corresponding tests are called S-space model-based tests.
On the other hand, we also have an S-space boundary-based test (BBT), as summarised in Table 6. It should not be confused with the BBTs in the space of input data (D-space for short), such as those previously addressed in Tables 2 and 3 and in Figure 3. Those are two-sample tests with a boundary separating two populations in the D-space, while the S-space BBTs may correspond to any tests in the D-space.
Also, integration can be made by considering the complementarity of Sspace BBTs and Sspace modelbased tests, via combining \(\Gamma (\boldsymbol {\tilde s})\) and the acceptance domains, obtained from not only the above complementary aspects, but also different sources, e.g. a bottomup source from univariate tests on input data and a topdown source inversely transformed from the p values via a metaanalysis (Evangelou and Ioannidis 2013). Also, based on the resulted \(\Gamma (\boldsymbol {\tilde s})\), an easy computing expression \(s_{f}=g(\boldsymbol {s},\Gamma (\boldsymbol {\tilde s}))\) may be obtained to get an asymptotic distribution \(p(s_{f}\Gamma (\boldsymbol {\tilde s}))\) for a fast estimation of the p value, see examples given after Equation (70).
Sspace BBT for the multivariate zero mean
Testing H _{0} by Equation (1) for the casecontrol studies can be formulated into testing whether a multivariate statistics s=[s ^{(1)},⋯,s ^{(d)}] takes a point far away from the origin of the multidimensional space. One example is a twosample test that examines the following null:
by the Hotelling T ^{2} statistics. The second example is the Wald testing statistics by Equation (7), and another example will be given in the next subsection.
In the existing studies, such a test is typically made via either the \({\chi ^{2}_{k}}\) statistics or Hotelling’s T ^{2} statistics. Also, Rao’s score by Equation (8) is such a type of statistics. As addressed in the previous subsection, they are all featured by an integrated statistics s _{ f }≥0 that monotonically increases as s deviates away from the origin, and they belong to the S-space model-based tests. Also, all these tests may be regarded as extensions of one typical univariate two-tail test (e.g. the t ^{2} test), that is, testing how far a univariate statistics s deviates from the origin s=0 via the value |s|.
The counterpart of a univariate twotail test is a univariate onetail test that examines how far s deviates from (−∞,0], i.e. testing the statement s≤0. When either rejecting s≤0 or rejecting s≥0 happens, we reject H _{0}:s=0. Even when the statement s≤0 is not rejected, there are still chances that H _{0}:s=0 will be rejected.
Typical univariate one-tail tests include the one-tailed t-test and the one-tailed z-test. However, it is not clear what their counterparts are in multivariate tests. As addressed above, Hotelling’s T ^{2} test can be regarded as a multivariate counterpart of a two-tailed test.
The Sspace BBT given in Table 6 actually provides a road to extend univariate onetail tests to multivariate ones. Observing univariate onetail tests from the perspective of Sspace BBT, we see that \({ \tilde s}=I_{\textit {nf}}(X_{0} X_{1})\) is actually a boundary point that results in:
Given \({ \tilde s}\) and thus \(\Gamma ({ \tilde s})\), any s obtained from the casecontrol samples under H _{0} may cause a false alarm if s falls in \(\Gamma ({ \tilde s})\), which happens in a probability \(p(s \in \Gamma ({ \tilde s})  H_{0})\), i.e. the p value by the inference \({ \tilde s}\). If it is small enough, the statement \(s \notin \Gamma ({ \tilde s}) \) will be rejected, which implies that s=0 or H _{0} by Equation (1) is rejected.
We further consider a statistics s in the multidimensional space from the perspective of Sspace BBT given in Table 6 (2). We start by observing an orthant of the R ^{d} space featured by \(\text {sign}(\boldsymbol {\tilde s}) =\left [\text {sign}\left ({ \tilde s^{(1)}}\right), \dots, \text {sign}\left ({ \tilde s}^{(d)}\right)\right ]^{T}\) and consider one separating boundary, as illustrated in Figure 5A. Such a boundary is equivalent to the following decomposition:
where each \(\Gamma ({ \tilde s}^{(i)})\) is given by Equation (67) for computing \(p\left (\textbf {s}^{(i)}\in \Gamma \left ({ \tilde s}^{(i)}\right) H_{0}\right)\). This actually provides an example that extends a onetail univariate hypothesis test to a vectorvariate one.
In implementation, it is not easy to get the factorization of \(p(\boldsymbol {s}\in \Gamma (\boldsymbol {\tilde s}) H_{0})\) by Equation (68). Instead, we approximately consider to remove the secondorder dependence by the following decorrelation:
where Λ _{ u } is a diagonal matrix consisting of the nonzero eigenvalues of the following covariance matrix:
and U is a d×m matrix with its columns consisting of the eigenvectors of Σ _{ π } such that Λ _{ u }=U ^{T} Σ _{ π } U.
Another issue is that only the major components in Equation (68) are useful, while some components are not only useless but also disturbing, especially when the sample size is limited. To address this, one may let the columns of the matrix U consist of the eigenvectors of Σ _{ π } that correspond to the m largest diagonal elements of Λ _{ u }. Such an implementation of Equation (69) is typically called principal component analysis (PCA). How to decide an appropriate number of components is a model selection task (Tu and Xu 2011, 2012; Xu 2011). Moreover, one novel direction for this task will be addressed later in this paper between Equation (91) and Equation (99). Actually, Equation (69) only removes the second-order dependence. One may further consider non-Gaussian factor analysis (NFA) and binary factor analysis (BFA) to remove dependencies among non-Gaussian components (Tu and Xu 2014; Xu 2003, 2009; see also Section 5 in Xu 2012b).
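The decorrelation with m retained principal components can be sketched as follows. This is an illustration only: it assumes that Σ _{ π } is estimated as the sample covariance of the permuted statistics and that the transform takes the whitening form \(\Lambda _{u}^{-1/2}U^{T}\boldsymbol {s}\); the function name, the tolerance for discarding near-zero eigenvalues, and the use of NumPy are assumptions made here.

```python
import numpy as np

def decorrelate(s_perm, m=None, tol=1e-10):
    """Build a decorrelating transform from permuted statistics.

    s_perm: array of shape (n_perm, d), one statistics vector per permutation.
    m: optional number of leading components to keep (all nonzero ones if None).
    Returns (transform, U, lam); transform maps s of shape (d,) or (n, d)
    to its m decorrelated, unit-variance components.
    """
    s_perm = np.asarray(s_perm, dtype=float)
    cov = np.atleast_2d(np.cov(s_perm, rowvar=False))
    lam, U = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]         # reorder to descending
    lam, U = lam[order], U[:, order]
    keep = lam > tol * lam[0]             # drop numerically zero eigenvalues
    if m is not None:
        keep[m:] = False                  # keep only the m largest ones
    lam, U = lam[keep], U[:, keep]

    def transform(s):
        return np.asarray(s, dtype=float) @ U / np.sqrt(lam)

    return transform, U, lam
```

Applying `transform` to the permuted statistics themselves yields components whose sample covariance is the identity matrix, which can serve as a check of the implementation.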
Simply, we use the notation \(\boldsymbol {\tilde s}=I_{\textit {nf}}(X_{0} X_{1})\) to denote a procedure to obtain such major components and then use this \(\boldsymbol {\tilde s}\) to get a separating boundary and its corresponding \(\Gamma (\boldsymbol {\tilde s})\). Illustrated in Figure 5 are three examples as follows:
Choice (a), illustrated in Figure 5A, is the same as the one in Equation (68) with each \(\Gamma ({ \tilde s}^{(i)})\) given by Equation (67). As illustrated in Figure 5B, each of the two other choices is a half space bounded by a hyperplane, on the side away from the origin. Choice (b) is more suitable to the case after using Equation (69) in choice (b). Except for the degenerate cases in which the normal direction of the hyperplane becomes parallel to one of the coordinate axes, choice (b) and choice (c) approximately describe a certain dependence across the components of s.
After using Equation (69) to make the statistics s become an mdimensional vector with the secondorder dependence removed, we may observe that the scope of \(\Gamma (\boldsymbol {\tilde s})\) becomes narrowed as m reduces. When m=1, the scope of \(\Gamma (\boldsymbol {\tilde s})\) is narrowed to a onetail test along the axis of only one component.
In implementation, we obtain \(p(\boldsymbol {s}\in \Gamma (\boldsymbol {\tilde s}) H_{0})\) by Equation (64) via the permutation by Equation (65). Also, choice (b) and choice (c) may be understood from getting an integrated statistics as follows:
Approximately, s _{ w } comes from a normal distribution with the mean μ _{ w } and the variance \(\sigma _{w}^{2}\), based on which we can make a univariate one-tail test.
SPD test and SPD discriminative analysis
Proposed in (Xu 2013a), the SPD method firstly examines the delta δ(x,y) by pairing every case sample x∈X _{1} and every control sample y∈X _{0} and then summarises such deltas as follows:
Generally, δ(x,y) could be either symmetric or antisymmetric. One simple symmetric example is:
where \(c_{\omega }, \sigma ^{2}_{\omega }, \alpha _{\omega }\) are the sample mean, variance, and proportion of the samples in X _{ ω }, respectively, and r _{ xy } is the mutual correlation between x and y.
The above example can be extended to the case that both x,y are vectors with:
Also, we may consider an antisymmetric delta:
where ρ(u) is a monotonic function. One simplest example is ρ(u)=u as follows:
which is equivalent to testing the difference of two sample means. To find the collective inclining structure, we classify δ(x,y) into three groups by x>y,x=y,x<y and get the following decomposition:
with D(X _{10})>0 indicating that there is a collective inclining dominance (i.e. the representations of cases are bigger than the ones of controls), D(X _{10})<0 indicating a reversed dominance, and D(X _{10})=0 indicating no dominance.
Recalling Equation (66), it follows from \({\tilde s}=D(X_{10})= c_{1}-c_{0}\) that D(X _{10}) approximately follows a normal distribution. Thus, the above collective inclining dominance can be tested by the one-tailed t-test and one-tailed z-test addressed in the previous subsections. We may get the mean \( \mu \left (X_{10}^{\pi }\right)\) and the variance \(\sigma ^{2}\left (X_{10}^{\pi }\right)\) from \(\left \{ D\left (X_{10}^{\pi }\right), \pi \in \Pi \right \}\) and then approximately compute the p value by a univariate one-tail z-test.
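For the simplest case ρ(u)=u, the univariate SPD test can be sketched as follows. The sketch assumes that D(X _{10}) is the average of the deltas x−y over all case-control pairs (which equals the difference of the two sample means), that random label shuffling provides the permutation set, and that the upper tail corresponds to the dominance of cases; the function names are illustrative.

```python
import math
import random

def spd_delta(x1, x0):
    """Average of the pairwise deltas x - y over all case/control pairs."""
    return sum(x - y for x in x1 for y in x0) / (len(x1) * len(x0))

def spd_one_tail_z_test(x1, x0, n_perm=2000, seed=0):
    """One-tail z-test of D(X10), with mean and variance taken from permutations.

    Returns (d_obs, z, p), where p is the upper-tail probability of a standard
    normal variable, i.e. a small p suggests that cases dominate controls.
    """
    rng = random.Random(seed)
    d_obs = spd_delta(x1, x0)
    pooled = list(x1) + list(x0)
    n1 = len(x1)
    d_perm = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d_perm.append(spd_delta(pooled[:n1], pooled[n1:]))
    mu = sum(d_perm) / n_perm
    var = sum((d - mu) ** 2 for d in d_perm) / (n_perm - 1)
    z = (d_obs - mu) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return d_obs, z, p
```

Testing the reversed dominance amounts to swapping the roles of the two sample sets or, equivalently, taking the lower tail.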
When x,y are vectors, we consider:
with each D ^{(i)}(X _{10}) by Equation (76). The task is detecting whether there is a collective inclining dominance, i.e. whether s deviates far away from the origin such that H _{0} by Equation (1) breaks. The task can be handled by the Sspace BBT in Table 6 as a multivariate extension of a onetail univariate hypothesis test, following the method introduced from Equations (68) to (71) given previously.
Also, we may consider this multivariate SPD study from a perspective similar to the FDA by Equation (11). When x,y are the ddimensional vectors, we extend Equation (74) into:
where ρ(u)=[ρ(u ^{(1)}),⋯,ρ(u ^{(d)})]^{T} and ρ(u) is the same as the one in Equation (74). That is, the difference x−y is projected onto a most reasonable direction w. In the simplest case ρ(u)=u, we get δ(x,y)=(x−y)^{T} w given in Equation (72), which leads to s _{ w }=w ^{T} s in Equation (71) as follows:
Without losing generality, we consider that the components of s are mutually independent, e.g. obtaining a secondorder independence by Equation (69). Then, we seek how to choose an appropriate w.
Under H _{0}, we expect that \( \mathbf {s}_{\mathbf {w}}^{\pi }=D_{\mathbf {w}}\left (X_{10}^{\pi }\right), \pi \in \Pi \) varies around its mean that is typically zero according to Equation (75), that is, we expect that the following standard deviation of \(\boldsymbol {s}_{\mathbf {w}}^{\pi }\) is minimised:
Also, we expect that s _{ w } best preserves discriminative information underlying X _{1},X _{0}, for which we maximise s _{ w }. We apply a bootstrapping method to enhance the reliability by maximising:
which may tend to ∞ if it is unbounded. To avoid it, some bound will be imposed on w.
For γ=1, we usually consider:
by which the solution of w=[w ^{(1)},…,w ^{(d)}]^{T} is reached at one vertex, i.e. w ^{(i)} takes either a ^{(i)} or b ^{(i)}. Particularly, when Ω consists of only one pair X _{1},X _{0}, the above maximisation leads to choice (b) in Equation (70) if we let −a ^{(i)}=b ^{(i)}=1 and to choice (c) if we let −a ^{(i)}=b ^{(i)}=D ^{(i)}(X _{10}).
For γ=2, we consider:
with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma ^{\phi } =\sum _{{\omega } \in \Omega } \mathbf {s}^{\phi }\mathbf {s}^{\phi \ T} \).
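The γ=2 solution can be sketched as follows, assuming that each bootstrapped statistics vector s ^{ϕ} is the difference of the case and control sample means computed on resampled sets, and that w is constrained to unit length. The resampling scheme, the function names, and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def bootstrap_statistics(x1, x0, n_boot=200, seed=0):
    """Resample cases and controls with replacement; return mean differences.

    x1, x0: arrays of shape (n1, d) and (n0, d). Output shape: (n_boot, d).
    """
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    out = []
    for _ in range(n_boot):
        b1 = x1[rng.integers(0, len(x1), len(x1))]
        b0 = x0[rng.integers(0, len(x0), len(x0))]
        out.append(b1.mean(axis=0) - b0.mean(axis=0))
    return np.array(out)

def leading_direction(s_boot):
    """Unit vector w maximising the sum of (w^T s)^2 over the bootstrapped s."""
    S = np.asarray(s_boot, dtype=float)
    sigma_phi = S.T @ S                   # sum of outer products s s^T
    lam, vecs = np.linalg.eigh(sigma_phi)  # ascending eigenvalues
    return vecs[:, -1], lam[-1]
```

The sign of the returned eigenvector is arbitrary; it may be flipped so that w ^{T} s is positive for the original sample pair.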
Integrating Equations (80) and (81), we consider to maximise ρ _{ γ }(w) with \(\sigma _{\pi }^{\gamma }(\mathbf {w})\) minimised simultaneously or subject to a constraint \( \sigma _{\pi }^{\gamma }(\mathbf {w})\le \text {constant}\).
Alternatively, we may consider:
which shares a spirit similar to the FDA by Equation (11). At the typical case γ=2, it becomes
with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma _{\pi }^{-0.5}\Sigma ^{\phi } \Sigma _{\pi }^{-0.5}\).
Furthermore, we proceed to consider that each D ^{(i)}(X _{10}) in Equation (79) is not a simple difference by Equation (76) but the following 1×2 row vector:
Also, we may extend x−y with each element x ^{(i)}−y ^{(i)} becoming a row vector [x ^{(i)},−y ^{(i)}]. Accordingly, we get:
where v=[v ^{(1)},v ^{(2)}]^{T} and Δ _{ x−y } is a d×2 matrix with the ith row being [x ^{(i)},−y ^{(i)}]. It follows from Equation (72) that the above Equation (87) leads D ^{(i)}(X _{10}) to:
where D _{ M }(X _{10}) is a d×2 matrix with D ^{(i)}(X _{10}) as its ith row. Accordingly, the inner product by Equation (79) becomes:
Given v as fixed, the study from Equations (79) and (84) applies directly for us to get w.
Given w as fixed, \( \mathbf {w}^{T}D_{M}\left (X_{10}^{\pi }\right)={D_{c}^{T}}\left (X_{10}\right)\) becomes a two-dimensional row vector, and it follows from Equation (89) that we have \(\boldsymbol {s}_{\mathbf {w}} =\mathbf {v}^{T}{D_{c}^{T}}\left (X_{10}\right)\) in the same form as Equation (79). With v in the place of w and D _{ c }(X _{10}) in the place of s, the study from Equations (79) and (84) similarly applies directly for us to get v. Generally, we iteratively update v with a fixed w and update w with a fixed v, for a number of cycles until convergence. Still, whether such an alternating iterative procedure converges is an open issue that demands further investigation.
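The alternating update of w and v can be sketched in code. This is only an illustration under stated assumptions: the criterion is taken to be the γ=2 case, i.e. the sum over sample pairs of the squared bilinear statistic (w^T D v)^2 with both w and v kept at unit length, each d×2 matrix D is a list of rows, and the helper names `_top_direction`, `objective`, and `alternate` are hypothetical. It is not claimed to be the exact procedure intended by the equations above.

```python
import math

def _top_direction(vectors, iters=200):
    """Dominant eigenvector of sum_k u_k u_k^T, found by power iteration."""
    dim = len(vectors[0])
    w = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # (sum_k u_k u_k^T) w equals sum_k (u_k . w) u_k
        nxt = [0.0] * dim
        for u in vectors:
            dot = sum(a * b for a, b in zip(u, w))
            for i in range(dim):
                nxt[i] += dot * u[i]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        w = [x / norm for x in nxt]
    return w

def objective(mats, w, v):
    """Sum over pairs of (w^T D v)^2 for d x 2 matrices D."""
    total = 0.0
    for D in mats:
        s = sum(w[i] * (D[i][0] * v[0] + D[i][1] * v[1]) for i in range(len(w)))
        total += s * s
    return total

def alternate(mats, rounds=50, tol=1e-10):
    """Alternate between updating w (v fixed) and v (w fixed)."""
    d = len(mats[0])
    v = [1.0 / math.sqrt(2.0)] * 2
    w = [1.0 / math.sqrt(d)] * d
    prev = cur = objective(mats, w, v)
    for _ in range(rounds):
        # With v fixed, each pair contributes the d-vector D v.
        w = _top_direction([[r[0] * v[0] + r[1] * v[1] for r in D] for D in mats])
        # With w fixed, each pair contributes the 2-vector D^T w.
        v = _top_direction(
            [[sum(w[i] * D[i][j] for i in range(d)) for j in range(2)] for D in mats]
        )
        cur = objective(mats, w, v)
        if abs(cur - prev) <= tol * max(1.0, abs(cur)):
            break
        prev = cur
    return w, v, cur

# Two toy 3 x 2 matrices standing in for D_M over two sample pairs.
mats = [
    [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.1], [0.0, 0.0]],
]
w, v, value = alternate(mats)
```

The loop stops when the criterion value stops changing, which says nothing about whether the iterates themselves converge in general; that remains the open issue noted above.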
The p values and testing complexity control
Recalling Equation (64) and Table 6, based on a given sample pair X _{10}=X _{0}∥X _{1}, we get a statistics vector \(\boldsymbol {\tilde s}_{{X}_{10}}=I_{\textit {nf}}(X_{0} \| X_{1})\) and a rejection domain \(\Gamma =\Gamma \left (\boldsymbol {\tilde s}_{{X}_{10}}\right)\) by the inferring method I _{ nf }. Then, we compute the following false alarm probability:
as the p value. This concept is the same as the one used in the conventional literature where X _{10} and I _{ nf } are usually implied but not spelled out.
Unlike studies that consider a univariate statistics, the p value by a multidimensional statistics vector s depends strongly on the dimension m of this vector, that is, on the complexity of the testing space. Given a limited sample size, the p value by Equation (90) decreases as m increases, causing a phenomenon similar to the overfitting problem in machine learning and statistical modelling. In other words, we encounter a ‘dimension curse’ in hypothesis testing too. Therefore, we need to control the complexity of the testing space appropriately, i.e. to select one appropriate m.
Given a criterion J(m), the problem of selecting a best subset is a typical problem of feature selection. Generally, it involves an exhaustive evaluation of all the combinations of m features (i.e. m components of s) and all the possible values of m, which is an NP-hard problem. Usually, the branch and bound policy (Narendra and Fukunaga 1977; Somol et al. 2004) and the best first strategy (Xu et al. 1988) are used to save computing cost. In this paper, we only consider one simple selection strategy that evaluates the components of s incrementally one by one.
To facilitate it, we perform Equation (69) to decorrelate the components of s and start by picking the one component that corresponds to the smallest value of a given criterion J(m). Then, we successively add the component that makes J(m) drop the most, and so on until no further reduction is obtained. Finally, the selected components form the resulting feature set with a size m ^{∗}.
For this purpose, using the p value by Equation (90) as J(m) does not work well because of its tendency to decrease as m increases, resulting in one m ^{∗} that is usually much bigger than the appropriate one. Instead, we consider another false alarm probability as follows:
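The incremental selection just described amounts to a greedy forward search. A minimal sketch follows, assuming the criterion is supplied as a function of the list of selected component indices; the name `incremental_select` and the toy criterion are hypothetical and serve only to show the stopping rule.

```python
def incremental_select(num_components, criterion):
    """Greedy forward selection.

    Start from the single component with the smallest criterion value, then
    keep adding the component that lowers the criterion most, and stop as
    soon as no candidate gives a further reduction.
    """
    selected = []
    remaining = list(range(num_components))
    best_score = float("inf")
    while remaining:
        score, pick = min((criterion(selected + [i]), i) for i in remaining)
        if score >= best_score:
            break  # no further reduction is possible
        selected.append(pick)
        remaining.remove(pick)
        best_score = score
    return selected, best_score

# Toy criterion: each component brings a gain but also a fixed complexity cost.
gains = [0.5, -0.1, 0.3]

def toy_criterion(subset):
    return 1.0 - sum(gains[i] for i in subset) + 0.2 * len(subset)

chosen, score = incremental_select(len(gains), toy_criterion)
```

In this toy run the search keeps components 0 and 2 and rejects component 1, so the selected size is 2. A greedy search of this kind evaluates far fewer subsets than an exhaustive one, at the price of possibly missing the best subset.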
which is obtained on all the possible sets of \(X^{\pi }_{10}\) that come under H _{0} instead of merely on a given pair X _{10}.
Though this probability is useless for judging whether X _{10} contains enough information to reject H _{0}, it reflects how the complexity of the testing space affects a background portion of the false alarm probability. Actually, it reflects an inverse of the effective volume of the support on which the statistics s is located. As m increases, the volume increases exponentially, and thus, p(s∈Γ | I _{ nf },H _{0}) decreases exponentially. Such an exponentially decreasing tendency is also contained in p(s∈Γ | I _{ nf },X _{10},H _{0}) for the same reason, which affects the accuracy of the estimated p value.
To reduce this background disturbance, we consider Equations (90) and (91) jointly by the following a posteriori version of the p value:
where and hereafter ¬H _{0} denotes rejecting H _{0}. The denominator aims at cancelling out the disturbing portion in the numerator, such that \({pp}_{{X}_{10}}\) provides not only a better estimation of the false alarm probability of rejecting H _{0} but also a better criterion J(m) for selecting a best subset of the components of s and thus inferring one appropriate m ^{∗}.
Instead of directly handling the above integral, we get a large set Π of sample pairs \(X_{1}^{\pi }, X_{0}^{\pi }\), with each pair resulting from a permutation of X _{0} and X _{1}. Using every pair \(X_{1}^{\pi }, X_{0}^{\pi } \) to infer \(I_{\textit {nf}}\left (X^{\pi }_{0} \| X^{\pi }_{1}\right)= \boldsymbol {\tilde s}_{X^{\pi }_{10}}\), we get a set of p values as follows:
based on which we compute:
We observe that the pp value has two factors. One is \(pp^{o}_{{X}_{10}}\), which describes the proportion of the pairs \(X_{1}^{\pi }, X_{0}^{\pi }\) with the corresponding \(p_{X^{\pi }_{10}}\le p_{X_{10}}\), that is, the pairs on which we should also reject H _{0} if we reject H _{0} on X _{10}. In other words, \(pp^{o}_{{X}_{10}}\) reflects the information of relative difference contained in P _{ Π }. The other factor \({rp}_{{X}_{10}}\) is the ratio of the average false alarm probability per pair over the disturbing background per pair, reflecting the strength of discriminative information contained in P _{ Π }.
In implementation, we may use \({rp}_{{X}_{10}}\) to make an initial screening. When \({rp}_{{X}_{10}}>1\), the inference is meaningless and no further computing should be made. Generally, \({rp}_{{X}_{10}}\) will be much smaller than 1, and thus, \({pp}_{{X}_{10}}\) will be much smaller than \(pp^{o}_{{X}_{10}}\), which provides a worst-case upper bound of \({pp}_{{X}_{10}}\).
We should observe \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\) not only at one common value of m but also at an appropriate m ^{∗}. In addition to using \({pp}_{{X}_{10}}\) by Equation (93) as J(m) for making an incremental selection, we may also consider \(pp^{o}_{{X}_{10}}\) or \({rp}_{{X}_{10}}\) as J(m), resulting in \( m^{\ast }_{o} \) or \(m^{\ast }_{\textit {rp}}\). Also, it follows from some mathematical derivation that we have \( m^{\ast } \ge m^{\ast }_{\textit {rp}} \ge m^{\ast }_{o}\) with \(m^{\ast }_{o}\) being a most conservative lower bound. We will be more confident when all these values are identical or do not differ too much. Moreover, further insights can be obtained from the following considerations.
On one side, we desire that the exponentially decreasing tendency contained in p(s∈Γ | I _{ nf },X _{10},H _{0}) is removed via the normalisation by p(s∈Γ | I _{ nf },H _{0}) such that \({pp}_{{X}_{10}}\) in Equation (93) will no longer have such a decreasing tendency. With \(p_{X^{\pi }_{10}}=p(\textbf {s}\in \Gamma \mid I_{\textit {nf}}, X_{10}^{\pi }, H_{0})\) in Equation (92) replaced by \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\), we may turn P _{ Π } into its counterparts P _{ pp }, \(P_{pp^{o}}\), and P _{ rp }. We compute not only the varying curve for each of \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\) as m increases, but also the varying curve of the mean of the elements in each of P _{ pp }, \(P_{pp^{o}}\), and P _{ rp } as m increases. Then, we compare each curve with its corresponding mean curve and desire that the mean curve is as flat as possible or at least flat around m ^{∗}.
On the other side, desiring a flat mean curve is not the sole principle. We also desire that the discriminative information should be kept in each of \({pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\) as much as possible. Observing the factorization \({pp}_{{X}_{10}}= pp^{o}_{{X}_{10}} {rp}_{{X}_{10}}\) in Equation (93), the strength of discriminative information is contained in \({rp}_{{X}_{10}}\) with an exponentially decreasing tendency that is supposed to be mutually cancelled out by the denominator and the numerator but perhaps not completely, while the discriminative information of relative difference is contained in \(pp^{o}_{{X}_{10}}\) and kept unchanged as long as every inequality between \(p_{X^{\pi }_{10}}\) and \(p_{X_{10}}\) remains unchanged.
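The two factors and the screening rule can be sketched as follows. This is an illustration under explicit assumptions, since the displayed equations are not reproduced here: the permutation p values are taken as already computed, the background is taken to be the plain average of all permutation p values, and the numerator of the ratio is taken to be the average over those permuted pairs whose p value does not exceed the observed one. The function name `pp_factors` is hypothetical.

```python
def pp_factors(p_observed, p_permuted):
    """Return (pp_o, rp, pp) from an observed p value and permutation p values.

    Assumptions made for this sketch:
      pp_o = fraction of permuted pairs with p <= p_observed
      rp   = (mean p over those pairs) / (mean p over all permuted pairs)
      pp   = pp_o * rp
    """
    n = len(p_permuted)
    if n == 0:
        raise ValueError("at least one permutation p value is required")
    inside = [p for p in p_permuted if p <= p_observed]
    pp_o = len(inside) / n
    mu = sum(p_permuted) / n
    mu_gamma = sum(inside) / len(inside) if inside else 0.0
    rp = mu_gamma / mu if mu > 0.0 else float("inf")
    return pp_o, rp, pp_o * rp

def passes_screening(rp):
    """Initial screening: a ratio above 1 means inference is not worthwhile."""
    return rp <= 1.0

pp_o, rp, pp = pp_factors(0.01, [0.005, 0.01, 0.2, 0.5, 0.285])
```

In this toy call two of the five permuted pairs fall at or below the observed p value, so the proportion factor is 0.4, while the ratio factor is far below 1 and the screening is passed.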
Bi-test, twin p values, and P-space BBT
Putting the above two sides together, we observe that an S-space multivariate test is actually a bi-test that tests H _{0} together with the following hypothesis:
We examine a decision that both H _{0} and I _{0} are rejected, featured with two p values.
As addressed after Equation (91), the multivariate statistics s inferred by I _{ nf } suffers a systematic bias that will make I _{ nf } unreliable. This unreliability varies with the dimension m that plays an important role in I _{ nf }. Though corrected by the denominator in Equation (93), there are still some residuals that will not be completely cancelled out, the effect of which still varies with m and reduces the reliability of I _{ nf }. The test I _{0} is formulated for this reliability via controlling an appropriate m ^{∗} and a level of false alarm probability of rejecting I _{0}.
One should notice the difference between testing H _{0} and testing I _{0}. Testing H _{0} examines only the input, while testing I _{0} examines both the input and the performance of testing H _{0}. The inference I _{ nf } takes X _{10} as the input and gives the outcomes \(p_{{X}_{10}}, {pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\). Using \(o_{{X}_{10}}\) to denote any one of these indices, regarding I _{ nf } as reliable on X _{10} actually implies that it should also be regarded as reliable on any pair \(X_{1}^{\pi }, X_{0}^{\pi }\) with the corresponding \(o_{{X}^{\pi }_{10}}\) being no larger than \(o_{{X}_{10}}\). Thus, the false alarm probability of rejecting I _{0} is computed by \(p\left (o_{{X}^{\pi }_{10}}\le o_{{X}_{10}}\mid \neg H_{0}, H_{0}\right)\).
Interestingly, some mathematical derivation shows that letting \(o_{{X}_{10}}\) be any one of \(p_{{X}_{10}}, {pp}_{{X}_{10}}\), \(pp^{o}_{{X}_{10}}\), and \({rp}_{{X}_{10}}\) will always result in the same false alarm probability as follows:
where and hereafter ¬I _{0} denotes rejecting I _{0}. Reflecting the discriminative information of relative difference, this p value of rejecting I _{0} will not be affected as long as the exponentially decreasing tendency does not change any inequality between \(p_{X^{\pi }_{10}}\) and \(p_{X_{10}}\).
As summarised in Table 7, a multivariate test is actually a bi-test that tests not only the classic null but also a null about the ‘dimension curse’. The rejection of H _{0} is controlled by a given level α. If \({pp}_{{X}_{10}}\ge \alpha \), H _{0} will not be rejected, and thus, there is no need to test I _{0}. Accordingly, Equation (93) for the p value of rejecting I _{0} is also modified in Table 7. The bi-test is implemented with or without using stochastic simulation. Table 7 (2) outlines those previously addressed points for implementation via stochastic simulation, while Table 7 (3) outlines an alternative implementation that does not need stochastic simulation.
This alternative comes from considering Γ in the choice (a) of Equation (70) by which we have:
where the extra components of s will contribute a constant factor \(\prod _{i > m^{\ast }}\delta _{i}\) that will be cancelled out via the denominator and the numerator in Equation (93).
In such a case, we may get \({rp}_{{X}_{10}}={\mu _{\Gamma }}/{\mu }\) without stochastic simulation. First, we have \(\mu =\prod _{i} \mu ^{(i)}\). Each \({ \tilde s}^{(i)}\) under H _{0} is a random variable with a zero mean, and its corresponding false alarm probability p _{ i } is uniformly distributed over [0, 0.5]. Thus, we get μ ^{(i)}=1/4. Second, we also get \(\mu _{\Gamma }\le p_{X_{10}}\) by letting \( p(\textbf {s}\in \Gamma \mid I_{\textit {nf}}, X_{10}^{\pi }, H_{0}) \le p_{X_{10}}\) for each π∈Π _{ Γ } be approximated by its upper bound \(p_{X_{10}}\). Putting the two together, we have:
Next, \(pp^{o}_{{X}_{10}}\) is also considered without stochastic simulation. From Equation (95), we have \( p\left (\neg I_{0} \mid \neg H_{0}, H_{0}\right) =pp^{o}_{{X}_{10}} =p\left (\prod _{i}p_{i}^{\pi }\le \prod _{i}p_{i} \mid H_{0}\right) =p\left (\prod _{i}\left (p_{i}^{\pi }\right)^{2}\le \prod _{i}{p_{i}^{2}} \mid H_{0}\right)\), which leads us to the well-known Fisher combination (Fisher 1948) that makes a test on the false alarm probabilities {p _{ i }} by the following combination:
This link provides new insights from two perspectives. From one perspective, we may adopt the Fisher combination approach to estimate \(pp^{o}_{{X}_{10}}\) as follows:
Together with Equation (97), we get \({pp}_{{X}_{10}}=pp^{o}_{{X}_{10}} {rp}_{{X}_{10}}\) for testing both H _{0} and I _{0} without stochastic simulation via permutation.
From the other perspective, we observe that the traditional p value p _{ F } of the Fisher combination is actually the false alarm probability by Equation (95), only reflecting the discriminative information of relative difference between \(\prod _{i} p_{i}^{\pi }\) and \( \prod _{i} p_{i} \) but ignoring the strength of discriminative information contained in \( \prod _{i} p_{i}\). In other words, the Fisher combination tells only half of the story for combining {p _{ i }}, and we can use the formulation \( {pp}_{{X}_{10}}=pp^{o}_{{X}_{10}} {rp}_{{X}_{10}}\) to complete the whole story, using \(pp^{o}_{{X}_{10}}\) by Equation (99) and \({rp}_{{X}_{10}}\) by Equation (97) with \(p_{X_{10}}= \prod _{i} p_{i}\).
Last but not least, one should notice that the p value of testing H _{0} measures the chances in the S-space (i.e. the space of multivariate statistics), and the p value of testing I _{0} measures an event in the P-space (i.e. the space of false alarm probabilities). In other words, testing H _{0} involves an S-space BBT while testing I _{0} involves a P-space BBT.
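The two facts above can be combined in one line. The following is a reconstruction from the statements in this paragraph (the uniform distribution of each p _{ i } over [0, 0.5], the product form of μ over m components, and the bound on μ _{ Γ }), offered as a sketch of the reasoning rather than as a quotation of the numbered equation:

```latex
\mu^{(i)} = E\left[p_{i}\right] = \int_{0}^{1/2} 2p \, dp = \frac{1}{4},
\qquad
\mu = \prod_{i=1}^{m} \mu^{(i)} = 4^{-m},
\qquad
{rp}_{X_{10}} = \frac{\mu_{\Gamma}}{\mu} \le 4^{m} \, p_{X_{10}} .
```

For instance, with m=3 the bound reads \({rp}_{X_{10}} \le 64\, p_{X_{10}}\), so the screening condition \({rp}_{X_{10}} \le 1\) is guaranteed whenever \(p_{X_{10}} \le 1/64\).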
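For readers who want to see the classical half of this story in code, the standard Fisher combination can be sketched with the Python standard library alone. The sketch assumes independent p values that are uniform on (0, 1] under the null, in which case minus twice the sum of their logarithms follows a chi-square distribution with 2m degrees of freedom; whether and how the one-tail p values used above are rescaled before combination is not addressed here. The function name `fisher_combination` is hypothetical.

```python
import math

def fisher_combination(p_values):
    """Standard Fisher combination of independent p values.

    Returns the statistic -2 * sum(log p_i) and its upper-tail probability
    under a chi-square distribution with 2m degrees of freedom. For even
    degrees of freedom the tail has the closed form
        exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    so no external library is needed.
    """
    if not p_values or any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    m = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term, series = 1.0, 0.0
    for k in range(m):
        series += term
        term *= half / (k + 1)
    return statistic, math.exp(-half) * series

statistic, combined_p = fisher_combination([0.05, 0.05])
```

With a single p value the combined value equals that p value, which is a quick sanity check of the closed form.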
Discussions
Gene expression analyses
Gene expression analyses play important roles in bioinformatics and computational genetics. Expression profiles are represented by a data matrix whose rows correspond to different samples t=1,⋯,N and whose columns consist of expressions i=1,⋯,m from different genes, miRNAs, and lncRNAs.
In recent years, developments in data acquisition techniques have led us to consider expression profiles in a cubic or even higher-dimensional array. As illustrated in Figure 1, one additional dimension j=1,⋯,d is added for examining expressions under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (Bar-Joseph et al. 2012). For example, current cancer studies consider each basic unit (i.e. a gene, a miRNA, a lncRNA) in paired expressions of normal and tumour tissue from the same individual, that is, each individual is featured at least by a 2×d matrix X _{ t }. Generally, each example X _{ t } is an m×d matrix. In Table 7, we suggest a list of topics for such matrix-variate-based applications.
Typically, the number d of rows (i.e. gene, miRNA, and lncRNA) is huge, while the sample size n is small. It is difficult and also unreliable to consider the entire m×d matrix as a sample X _{ t }. Instead, we pick a k-tuple out of m rows to form an m×k matrix as a sample X _{ t }. Without loss of generality, we focus on the case that each sample X _{ t } is a 2×k matrix from paired expressions of normal tissue and tumour tissue.
In the existing studies, there are two types of efforts for dealing with samples of this format. The first one reduces each sample \(X_{t}=\left [x_{t}^{(i,j)}\right ], i=1,2; j=1,\cdots, k\) into a 1×k matrix \(x_{t}=\left [x_{t}^{(1)}, \cdots, x_{t}^{(k)}\right ]\) for a multivariate hypothesis test. A typical reduction is given by:
The second type of efforts is a paired difference test, e.g. a paired t-test when k=1 and a paired Hotelling’s T-square test when k≥2. In Table 8, comparative empirical IHT studies are suggested on the samples of X _{ t } in a 2×k matrix versus in a 1×k vector.
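As a concrete instance of the paired difference test for k=1, the paired t statistic can be computed from the per-individual differences between tumour and normal expressions. The sketch below uses only the standard library; the function name `paired_t_statistic` and the toy numbers are hypothetical, and converting the statistic into a p value (via Student's t distribution with the returned degrees of freedom) is left to a statistics library.

```python
import math

def paired_t_statistic(normal, tumour):
    """Paired t statistic for one unit measured in normal and tumour tissue.

    Returns (t, degrees_of_freedom), where t is the mean paired difference
    divided by its standard error.
    """
    if len(normal) != len(tumour) or len(normal) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [t - n for n, t in zip(normal, tumour)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0.0:
        raise ValueError("paired differences have zero variance")
    return mean / math.sqrt(variance / n), n - 1

# Toy expressions of one gene for four individuals.
t_value, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 5.0])
```

Pairing removes the between-individual variation, which is why the test operates on the differences rather than on the two groups separately.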
Exome sequencing analyses
The case-control study is also a major problem in a genome-wide association study (GWAS) or exome-sequencing analysis (DePristo et al. 2011; Purcell et al. 2007). Typically, a digit score (i.e. 0,1,2) is assigned to a single-nucleotide polymorphism (SNP) allele per site and per individual. In such a representation, each sample is univariate when each site is considered one by one. A univariate two-sample test plays a fundamental role in detecting a single SNP in the GWAS, e.g. PLINK provides one widely used toolbox (Purcell et al. 2007).
Moreover, each sample can be a vector when multiple sites are considered jointly. Recently, there have been ever-increasing efforts on finding multiple SNVs jointly (DePristo et al. 2011; Derkach et al. 2013; Evangelou and Ioannidis 2013; Lin et al. 2014; Liu et al. 2014; Pan et al. 2014). Also, we may test whether there is a collective inclining dominance of the representations of case samples over the ones of control samples, or vice versa, with the help of the method proposed from Equations (79) and (84), as well as the extension introduced around Equations (87) and (89).
Alternatively, we may also consider a SNP allele per site and per individual with δ(x,y) in Equation (75) replaced by one 3×3 matrix \(\Delta (x,y)=\left [ \delta _{xy}^{(i,j)}\right ]\) with:
It follows from Equation (72) that we get D(X _{10}) to be also a 3×3 matrix as a collective measure, which may be further examined to test whether two populations differ significantly. We may visualise the matrix by plotting them in two 2D histograms and observe their configurations.
Conclusions
Statistical analyses for case-control studies have been addressed rather comprehensively. First, a Kullback-Leibler divergence-based formulation is suggested to develop testing statistics and discriminative criterion for the case-control studies. Based on this formulation, typical existing methods are revisited, and their matrix-variate counterparts are developed. Second, a bilinear matrix form is proposed to obtain the matrix-variate counterparts from existing multivariate statistical analyses, such as discriminative analysis, logistic regression, Cox model, and linear mixed model. Third, the necessity and feasibility of integrative hypothesis tests (IHT) are addressed from the complementarity of BMTs and BBTs in the D-space, together with empirical illustration. Moreover, four basic components of IHT are elaborated, and four IHT types are summarised according to how the components are integrated. Then, in the space of multiple statistics (shortly S-space), the S-space BBT is proposed to perform BBT based on an unbounded boundary, with the help of information-preserved decoupling. Moreover, an S-space BBT-based extension of the univariate one-tail z-test is developed to test the null of multivariate zero mean and then applied to a multivariate SPD test for detecting a collective inclining dominance for the case-control studies. Also, a SPD discriminative analysis is proposed with this multivariate SPD test improved and extended to matrix-variate ones. Furthermore, a multivariate bi-test is proposed to test not only the classic null but also a null about inference reliability due to the complexity of testing space, including a new insight on and a further development of the Fisher combination. Finally, possible applications have been suggested for expression-profile-based biomarker finding and exome-sequencing-based joint SNV detection.
References
Bar-Joseph, Z, Gitter A, Simon I (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet 13(8): 552–564.
Barnett, JA (2008) Computational methods for a mathematical theory of evidence. In: Yager L Liu L (eds) Classic Works of the Dempster-Shafer Theory of Belief Functions. Studies in Fuzziness and Soft Computing, 197–216. Springer, Berlin Heidelberg.
Cortes, C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3): 273–297.
Cox, DR, Oakes D (1984) Analysis of survival data. CRC Press, Chapman & Hall, Boca Raton, Florida.
Cover, TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. Electronic Computers, IEEE Transactions on 14(3): 326–334.
Demidenko, E (2013) Mixed models: theory and applications with R. Probability and Statistics. John Wiley & Sons, Hoboken, New Jersey.
DePristo, MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5): 491–498.
Derkach, A, Lawless JF, Sun L (2013) Robust and powerful tests for rare variants using Fisher’s method to combine evidence of association from two or more complementary tests. Genet Epidemiol 37(1): 110–121.
Dutilleul, P (1999) The mle algorithm for the matrix normal distribution. J Stat Comput Simul 64(2): 105–123.
Engle, RF (1984) Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Handb Econometrics 2: 775–826.
Evangelou, E, Ioannidis JP (2013) Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet 14(6): 379–389.
Fisher, RA (1948) Questions and answers #14. Am Stat 2(5): 30–31.
Gibson, G (2012) Rare and common variants: twenty arguments. Nat Rev Genet 13(2): 135–145.
Hosmer Jr, DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression. John Wiley & Sons, Hoboken, New Jersey.
Hotelling, H (1931) The generalization of Student’s ratio. Ann Math Stat 2(3): 360–378.
Ji, J, Shi J, Budhu A, Yu Z, Forgues M, Roessler S, Ambs S, Chen Y, Meltzer PS, Croce CM, Qin LX, Man K, Lo CM, Lee J, Ng IOL, Fan J, Tang ZY, Sun HC, Wang XW (2009) Microrna expression, survival, and response to interferon in liver cancer. New Engl J Med 361(15): 1437–1447.
Koboldt, DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER (2013) The next-generation sequencing revolution and its impact on genomics. Cell 155(1): 27–38.
Lin, WY, Lou XY, Gao G, Liu N (2014) Rare variant association testing by adaptive combination of p-values. PloS one 9(1): 85728.
Liu, DJ, Peloso GM, Zhan X, Holmen OL, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem K, Melander O, Kathiresan S, Abecasis GR (2014) Meta-analysis of gene-level tests for rare variant association. Nat Genet 46(2): 200–204.
Narendra, PM, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. Comput IEEE Trans 100(9): 917–922.
Pan, W, Kim J, Zhang Y, Shen X, Wei P (2014) A powerful and adaptive association test for rare variants. Genetics 197(4): 1081–1095.
Persson, H, Kvist A, Rego N, Staaf J, Vallon-Christersson J, Luts L, Loman N, Jonsson G, Naya H, Hoglund M, Borg A, Rovira C (2011) Identification of new microRNAs in paired normal and tumor breast tissue suggests a dual role for the erbb2/her2 gene. Cancer Res 71(1): 78–86.
Purcell, S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ, Sham PC (2007) Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3): 559–575.
Schwarz, G (1978) Estimating the dimension of a model. Ann Stat 6(2): 461–464.
Simon, RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y (2003) Design and analysis of DNA microarray investigations. Springer-Verlag, New York.
Somol, P, Pudil P, Kittler J (2004) Fast branch & bound algorithms for optimal feature selection. Pattern Anal Mach Intell IEEE Trans 26(7): 900–912.
Stone, M (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological) 36(2): 111–147.
Suykens, JA, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3): 293–300.
Suykens, JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific Publishing, Singapore.
Tu, S, Xu L (2011) An investigation of several typical model selection criteria for detecting the number of signals. Front Electrical Electronic Eng China 6(2): 245–255.
Tu, S, Xu L (2012) A theoretical investigation of several model selection criteria for dimensionality reduction. Pattern Recognit Lett 33(9): 1117–1126.
Tu, S, Xu L (2014) Learning binary factor analysis with automatic model selection. Neurocomputing 134: 149–158.
Williams, CKI (2003) Learning kernel classifiers. J Am Stat Assoc 98(462): 489–490.
Xu, L, Yan P, Chang T (1988) Best first strategy for feature selection. In: 9th International Conference on Pattern Recognition, 706–708. IEEE Computer Society Press, Piscataway, New Jersey.
Xu, L (1995) Bayesian-Kullback coupled ying-yang machines: unified learnings and new results on vector quantization. In: Proc. Int. Conf. Neural Information Process (ICONIP ’95), 977–988. Publishing House of Electronics Industry, Beijing.
Xu, L (2003) Independent component analysis and extensions with noise and time: a Bayesian ying-yang learning perspective. Neural Inform Process Lett Rev 1: 1–52.
Xu, L (2009) Independent Subspaces. In: Encyclopedia of Artificial Intelligence, 892–901. IGI Global, Hershey, Pennsylvania.
Xu, L (2010) Bayesian ying-yang system, best harmony learning, and five action circling. Front Electrical Electronic Eng China 5(3): 281–328.
Xu, L (2011) Codimensional matrix pairing perspective of BYY harmony learning: hierarchy of bilinear systems, joint decomposition of data-covariance, and applications of network biology. Front Electr Electron Eng China 6: 86–119. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (A).
Xu, L (2012a) Semi-blind bilinear matrix system, BYY harmony learning, and gene analysis applications. In: Proceedings of The 6th International Conference on New Trends in Information Science, Service Science and Data Mining: 23-25 October 2012, 661–666. IEEE, Taipei.
Xu, L (2012b) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. Front Electrical Electronic Eng 7(1): 147–196.
Xu, L (2013a) Integrative hypothesis test and A5 formulation: sample pairing delta, case control study, and boundary based statistics. In: Intelligence Science and Big Data Engineering. LNCS, 887–902. Springer, Berlin Heidelberg.
Xu, L (2013b) Matrix-variate discriminative analysis, integrative hypothesis testing, and geno-pheno A5 analyzer. In: Intelligent Science and Intelligent Data Engineering. LNCS, 866–875. Springer, Berlin Heidelberg.
Xu, L (2015) Further advances on Bayesian ying yang harmony learning. Applied Informatics 2(5).
Xu, L, Amari SI (2008) Combining classifiers and learning mixture-of-experts. In: J Ramon e.a. (ed) Encyclopedia of Artificial Intelligence, 318–326. IGI Global, Hershey: PA.
Xu, L, Krzyzak A, Suen CY (1992b) Several methods for combining multiple classifiers and their applications in handwritten character recognition. IEEE Trans Syst Man Cybernet 22: 418–435.
Zaykin, DV (2011) Optimally weighted z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 24(8): 1836–1841.
Acknowledgements
This work was supported by a CUHK Direct grant project 4055025 and by the ZhiYuan chair professorship by Shanghai Jiao Tong University.
Additional information
Competing interests
The author declares that he has no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Kullback divergence
 Discriminative projection
 Logistic, Cox, and linear mixed regressions
 Bilinear form
 Boundary-based test
 Integrative hypothesis test
 Bayesian Ying Yang
 Statistics integration
 Dependence decoupling
 Bi-test
 Test reliability
 Controlling testing complexity
 Inclining dominance
 Gene expression
 Joint SNVs detection