Open Access

Bi-linear matrix-variate analyses, integrative hypothesis tests, and case-control studies

Applied Informatics 2015, 2:4

DOI: 10.1186/s40535-015-0007-5

Received: 26 September 2014

Accepted: 10 February 2015

Published: 7 May 2015

Abstract

We pursue a threefold purpose in this paper. First, we suggest a Kullback-Leibler formulation for developing statistics and making discriminative projections for case-control studies, based on which existing typical methods are revisited and then further extended to matrix-variate counterparts. Second, we propose a bi-linear matrix form, based on which multivariate discriminative analysis and logistic, Cox, and linear mixed regression are extended into their matrix-variate counterparts. Third, we systematically address the necessity, feasibility, and methodology of integrative hypothesis tests (IHT) from the complementarity of model-based test (MBT) and boundary-based test (BBT) in the data (D)-space, statistics (S)-space, and probability (P)-space. We elaborate four IHT components (modelling, comparison, classification, and assurance) and summarise four IHT types in the D-space. Then, we extend the existing efforts on multivariate tests to BBTs in the S-space. Particularly, we extend the classic univariate one-tail z-test to multivariate ones, which are then applied to a multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance. Also, we propose an SPD discriminative analysis that extends this SPD test. Moreover, we propose a multivariate bi-test that tests the classic null and also a null about the inference reliability due to test space complexity, including a further development of the Fisher combination. Finally, we suggest possible applications for gene expression biomarkers and exome-sequencing-based joint single-nucleotide variant (SNV) detection.

Keywords

Kullback divergence; Discriminative projection; Logistic, Cox, and linear mixed regressions; Bi-linear form; Boundary-based test; Integrative hypothesis test; Bayesian Ying Yang; Statistics integration; Dependence decoupling; Bi-test; Test reliability; Controlling testing complexity; Inclining dominance; Gene expression; Joint SNVs detection

Background

Typically, multivariate statistical analysis and related machine-learning studies take a vector \(\textbf{x}_{t}\) as the basic sampling unit. Though an entire data set may be regarded as a matrix with \(\textbf{x}_{1},\cdots,\textbf{x}_{N}\) as its columns, each statistic is computed from an assembly of vector samples and features the vector inner product as the basic modelling unit.

Nowadays, not only do rapid developments of data acquisition techniques (DePristo et al. 2011; Koboldt et al. 2013) demand that a matrix \(X_{t}\), as shown in Figure 1, be considered as a basic sampling unit, but ever-increasing computing power also makes it possible to meet this demand. One typical field with such demands is image-based tasks, in which the basic sampling unit is naturally a matrix, though traditional studies consider sample vectors to simplify computation. However, this simplification misses some useful structural information, e.g. treating the rows of \(X_{t}\) as independent and identically distributed (i.i.d.) samples misses the dependence across rows. Also, recent efforts on big-data analyses call for statistical approaches to matrix-variate data analysis.
https://static-content.springer.com/image/art%3A10.1186%2Fs40535-015-0007-5/MediaObjects/40535_2015_7_Fig1_HTML.gif
Figure 1

A set of matrix-variate samples.

Another field that demands matrix-variate-based analyses is computational biology or particularly computational genomics. Typically, expression profiles of basic units (e.g. gene, miRNA, lncRNA) are analysed via vector samples (e.g. via rows or columns of expression matrix) (Simon et al. 2003). Advanced studies also examine expression profiles under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (Bar-Joseph et al. 2012) and thus demand that sampling units in matrix format or even a high-dimensional array are considered. In a genome-wide association study or exome-sequencing analysis (DePristo et al. 2011; Gibson 2012; Purcell et al. 2007), though a majority of methods is still featured by vector-variate analysis, there are already some efforts made on matrix-variate-based data analysis.

In the rest of this paper, we start by providing a background and review of the related topics and methods, including the following:
  • Two-sample test and Hotelling statistics.

  • Logistic regression, Wald test, and Rao’s score.

  • Discriminative analyses and integrative hypothesis tests (IHT).

  • Cox model and linear mixed model.

Then, we pursue a threefold purpose as follows: (1) A Kullback-Leibler-divergence-based formulation for developing statistics and discriminative criteria for case-control studies, based on which existing typical methods are revisited and extended to their matrix-variate counterparts. (2) A bi-linear matrix form, based on which discriminative analysis, logistic regression, Cox model, and linear mixed model are extended into their matrix-variate counterparts. (3) A systematic investigation of the necessity, feasibility, and implementing methods of IHT from the perspective of model-based test (MBT) versus boundary-based test (BBT) in three levels of space, namely the data sample space (D-space), the statistics space (S-space), and the probability space (P-space).

More specifically, the above third one consists of the following:
  • The complementarity of MBT versus BBT in the D-space, the basic IHT components (modelling, comparison, classification, and assurance), and four types of IHT.

  • Bayesian Ying Yang (BYY)-harmony-learning-based IHT formulation for coordinately optimising the performances of task A, task B, and task C in the D-space.

  • The MBT vs BBT perspective in the S-space, especially extensions of the existing efforts on the integration of multiple statistics to the S-space BBT, with the help of dependence decoupling.

  • An S-space BBT-based extension of the univariate one-tail z-test for testing the null of multivariate zero mean, which is then applied to a multivariate sample-pairing delta (SPD) test for detecting a collective inclining dominance.

  • An SPD discriminative analysis that not only improves the multivariate SPD test but also further extends it to matrix-variate ones.

  • A multivariate bi-test on both the classic null and also a null about test reliability by controlling the testing complexity, including a further development of the Fisher combination.

Finally, we discuss several possible IHT applications for expression-profile-based biomarker finding and exome-sequencing-based joint single-nucleotide variant (SNV) detection.

Hypothesis tests for case-control studies

Most efforts in computational genomics, and in computational biology in general, involve case-control studies. For a case-control study, we are given two populations of vector-variate samples \(X_{\omega}=\{\textbf{x}_{t,\omega}, t=1,\cdots,N_{\omega}\}, \omega=0,1\), where the one with \(\omega=1\) is called the case population while the one with \(\omega=0\) is called the control population. The task of a hypothesis test is to examine whether the following null assumption can be rejected:
$$ \begin{aligned} H_{0} : \mathrm{there\; is\; statistically\; no\; difference\; between\; two\; populations\; of\; samples}, \end{aligned} $$
(1)

for which a statistic is computed from the samples to test the opposite assumption \(H_{1}\) that there is a significant difference between the two populations.

A typical example is testing whether \(H_{0}\) breaks on two populations of samples from a multivariate Gaussian distribution \(G(\textbf{x}|\textbf{c},\Sigma)\) with the mean vector \(\textbf{c}\) and the covariance matrix \(\Sigma\), with help from the following Hotelling statistic (Hotelling 1931):
$$\begin{array}{@{}rcl@{}} T^{2} =\frac{N_{0}N_{1}}{N}(\textbf{c}_{1}-\textbf{c}_{0})^{T}\Sigma^{-1}(\textbf{c}_{1}-\textbf{c}_{0}), \end{array} $$
(2)

where \(N=N_{0}+N_{1}\), and \(\textbf{c}_{1},\textbf{c}_{0}\) are the mean vectors of the case and control populations, respectively. Also, the covariance matrix is assumed to be \(\Sigma=\Sigma_{0}=\Sigma_{1}\).
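As a minimal sketch of Equation (2), the following NumPy snippet computes the Hotelling statistic from two sample matrices. The function name `hotelling_t2`, the synthetic data, and the choice of the pooled unbiased estimate for the shared covariance are assumptions made for illustration; the text itself does not prescribe a particular covariance estimator.

```python
import numpy as np

def hotelling_t2(x0, x1):
    """Hotelling statistic of Equation (2); rows of x0 (control) and x1 (case) are samples."""
    n0, n1 = len(x0), len(x1)
    n = n0 + n1
    c0, c1 = x0.mean(axis=0), x1.mean(axis=0)
    # Assumption: the shared covariance is the pooled unbiased estimate.
    s0 = (x0 - c0).T @ (x0 - c0)
    s1 = (x1 - c1).T @ (x1 - c1)
    sigma = (s0 + s1) / (n - 2)
    diff = c1 - c0
    return n0 * n1 / n * diff @ np.linalg.solve(sigma, diff)

# Synthetic example: the case population is shifted by 0.8 in every coordinate.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(40, 3))
x1 = rng.normal(size=(30, 3)) + 0.8
t2 = hotelling_t2(x0, x1)
print(f"T^2 = {t2:.3f}")
```

With one-dimensional samples and this covariance estimate, the value coincides with the square of the pooled two-sample t statistic.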

Generally, we evaluate the difference between two populations based on population modelling by a parametric model \(q(\textbf{x}|\theta)\), that is, by first modelling each population of samples and then evaluating the overall difference between the two resulting models. The performance is measured by the p value, which describes the false alarm probability of judging that \(H_{0}\) by Equation (1) significantly breaks. Such efforts are usually referred to as model-based tests, or sometimes called model comparison or class comparison (Simon et al. 2003).

Another typical example is logistic regression. We rewrite the above two populations of samples into a set of paired samples \(\{\textbf{x}_{t},\omega_{t}\}, t=1,\cdots,N\), with \(\omega_{t}=1\) and \(\omega_{t}=0\) indicating that the sample \(\textbf{x}_{t}\) comes from the case and control population, respectively. We let \(\omega_{t}\) be regressed on \(\textbf{x}_{t}\) via the following conditional probability:
$$ \begin{aligned} p(\omega_{t} | \boldsymbol{x_{t}}, \theta) = s(\zeta_{t})^{\omega_{t}} [\!1- s(\zeta_{t})]^{1-\omega_{t} }, \\ {\zeta}_{t} = y_{t}+c, \ {y}_{t} =\textbf{w}^{T}\textbf{x}_{t}, \ s(r)=\frac{1}{1+e^{-r}}. \end{aligned} $$
(3)
All the unknowns, collectively denoted by \(\theta\), are estimated by maximising the following likelihood:
$$\begin{array}{@{}rcl@{}} L = \prod\limits_{t =1}^{N} p(\omega_{t} | \boldsymbol{x_{t}}, \theta), \end{array} $$
(4)
which cannot be solved analytically due to the nonlinearity of \(s(r)\) and is usually handled by a gradient-based iterative algorithm (Hosmer et al. 2013). The test of the null assumption by Equation (1) becomes testing the null assumption:
$$\begin{array}{@{}rcl@{}} H_{0} : \ \textbf{w}=\textbf{0}, \end{array} $$
(5)
where \(\textbf{w}\) is a subset of \(\theta\). The test is typically made by either the Wald test or the score test (Engle 1984), both of which are computed from one or both of the following quantities:
$$\begin{array}{@{}rcl@{}} \Delta(\textbf{w})=\frac{\partial \ln{L}}{\partial \textbf{w}}, \ I(\textbf{w})=-\frac{\partial^{2} \ln{L}}{\partial \textbf{w} \partial \textbf{w}^{T}}, \end{array} $$
(6)

where Δ(w) is called the score vector, and I(w) is called the Fisher information matrix.

The Wald test considers the following:
$$\begin{array}{@{}rcl@{}} s= I^{0.5}(\hat{\boldsymbol{w}})\hat{\boldsymbol{w}}, \ \hat{\boldsymbol{w}}=\arg\max_{\boldsymbol{w}} L, \end{array} $$
(7)

as a test statistic that has an asymptotic normal distribution under the null assumption.

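Rao's score test (or simply the score test, often known as the Lagrange multiplier test) instead considers:
$$\begin{array}{@{}rcl@{}} s= \Delta^{T}(\textbf{w}_{0}) I^{-1}(\textbf{w}_{0}) \Delta(\textbf{w}_{0}), \ \textbf{w}_{0}=\textbf{0}, \end{array} $$
(8)

as a test statistic that has an asymptotic distribution of \({\chi ^{2}_{k}}\), where k is the number of constraints imposed by the null hypothesis. Here the score and the information are evaluated at the null value \(\textbf{w}_{0}\), with the remaining parameters estimated under the null, because the score vanishes at the unconstrained maximiser \(\hat{\boldsymbol{w}}\). The distribution degenerates to \({\chi ^{2}_{1}}\) when \(\textbf{w}\) consists of only one parameter.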
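The two tests can be sketched numerically as follows. This is an illustrative sketch, not the implementation used in the cited works: the likelihood of Equation (4) is maximised by plain Newton iterations, the Wald statistic is reported in its quadratic form (the squared norm of the vector in Equation (7), using the information about w after accounting for the offset c), and the score statistic is evaluated under the null w = 0 with c fitted under the null. The function names and the synthetic data are assumptions.

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def information(a, p):
    """Fisher information a^T diag(p(1-p)) a of the logistic model."""
    return a.T @ (a * (p * (1 - p))[:, None])

def fit_logistic(a, omega, n_iter=50):
    """Maximise Equation (4) by Newton's method; a = [x, 1], theta = [w, c]."""
    theta = np.zeros(a.shape[1])
    for _ in range(n_iter):
        p = sigmoid(a @ theta)
        theta = theta + np.linalg.solve(information(a, p), a.T @ (omega - p))
    return theta

def wald_and_score(x, omega):
    d = x.shape[1]
    a = np.hstack([x, np.ones((len(x), 1))])  # last column carries the offset c
    # Wald: quadratic form of w_hat with the inverse of its asymptotic covariance.
    theta = fit_logistic(a, omega)
    w_hat = theta[:d]
    cov_w = np.linalg.inv(information(a, sigmoid(a @ theta)))[:d, :d]
    wald = w_hat @ np.linalg.solve(cov_w, w_hat)
    # Score: under w = 0 the fitted probability is the case fraction.
    p0 = np.full(len(x), omega.mean())
    score0 = a.T @ (omega - p0)
    rao = score0 @ np.linalg.solve(information(a, p0), score0)
    return wald, rao

# Synthetic example with two covariates (hypothetical effect sizes).
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
omega = (rng.random(200) < sigmoid(x @ np.array([1.0, -0.5]))).astype(float)
wald, rao = wald_and_score(x, omega)
print(f"Wald = {wald:.2f}, score = {rao:.2f}")
```

Both statistics are compared against a chi-square distribution with as many degrees of freedom as there are components in w. Plain Newton steps without step-size control may fail on perfectly separable data, which this sketch does not guard against.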

Logistic regression thus examines the difference between two populations by first building a hyperplane boundary and then testing Equation (5), which directly asks whether the boundary depends on the variables under consideration.

Discriminative analyses and integrative tests

Other than directly aiming at the boundary, a different aspect of logistic regression is that we can use \(p(\omega_{t}|\textbf{x}_{t},\theta)\) by Equation (3) to classify each sample by:
$$\begin{array}{@{}rcl@{}} \omega_{t} =\arg\max\!{_{\omega}} p(\omega| \boldsymbol{x_{t}}, \theta). \end{array} $$
(9)
Equivalently, the same result comes from the hyperplane boundary \(\zeta_{t}=0\), with \(\zeta_{t}\) given in Equation (3), such that samples are classified according to which side of the boundary they fall on. The outcome is the following decomposition:
$$\begin{array}{@{}rcl@{}} X_{1}=X^{(1)}_{1}\cup X^{(0)}_{1}, \ X_{0}=X^{(1)}_{0}\cup X^{(0)}_{0}. \end{array} $$
(10)

That is, the case set \(X_{1}\) is separated into a subset \(X^{(1)}_{1}\) with unchanged labels and a subset \(X^{(0)}_{1}\) of samples that are relabelled as control samples, and similarly, the control set \(X_{0}\) into \(X^{(0)}_{0}\) with unchanged labels and \(X^{(1)}_{0}\) relabelled as case samples.

Actually, seeking a hyperplane boundary is the goal of linear discriminative analyses (LDA). One classic example is the Fisher discriminative analysis (FDA). For separating samples of two populations, the FDA seeks a projection \(y_{t}=\textbf{w}^{T}\textbf{x}_{t}\) that maps each vector \(\textbf{x}_{t}\) into a univariate \(y_{t}\) such that:
$$\begin{array}{@{}rcl@{}} \max_{\textbf{w}}\ J_{y}(\textbf{w}), \ J_{y}(\textbf{w})=\frac{\left({c^{y}_{0}} -{c^{y}_{1}}\right)^{2}}{\alpha_{0} \sigma_{0}^{y\ 2} +\alpha_{1}\sigma_{1}^{y\ 2}}, \end{array} $$
(11)
where for ω=0,1 we have
$$ \begin{aligned} \alpha_{\omega}=\frac{N_{\omega}}{N},\ c_{\omega}^{y} =\frac{\sum_{t=1}^{N_{\omega}}y_{t, \omega}}{N_{\omega}}, \\ y_{t, \omega}=\textbf{w}^{T}\textbf{x}_{t, \omega},\\ \sigma_{\omega}^{y\ 2}=\frac{\sum_{t=1}^{N_{\omega}}\left(y_{t, \omega}-c_{\omega}^{y}\right)^{2}}{N_{\omega}}. \end{aligned} $$
(12)

On the one-dimensional \(y_{t}\), it follows from Equation (2) that \(T^{2}=\frac {N_{0}N_{1}}{N} J_{y}\) and that FDA is equivalent to seeking a direction \(\textbf{w}\) along which the two populations differ most.
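The relation between Equations (11), (12), and (2) can be checked numerically with the sketch below. It assumes the well-known closed-form maximiser of the Fisher criterion, w proportional to the inverse within-class covariance times the mean difference, and uses the 1/N_ω normalisation of Equation (12) for the covariance; the function names and the synthetic data are illustrative.

```python
import numpy as np

def fda_direction(x0, x1):
    """Closed-form maximiser of J_y: w = Sigma^{-1} (c1 - c0), up to scale."""
    n = len(x0) + len(x1)
    c0, c1 = x0.mean(axis=0), x1.mean(axis=0)
    # alpha_0 * Sigma_0 + alpha_1 * Sigma_1 with the normalisation of Equation (12)
    sigma = ((x0 - c0).T @ (x0 - c0) + (x1 - c1).T @ (x1 - c1)) / n
    return np.linalg.solve(sigma, c1 - c0)

def fisher_criterion(w, x0, x1):
    """J_y(w) of Equation (11) on the projections y = w^T x."""
    y0, y1 = x0 @ w, x1 @ w
    a0 = len(y0) / (len(y0) + len(y1))
    a1 = 1.0 - a0
    # np.var divides by the sample count, matching Equation (12)
    return (y0.mean() - y1.mean()) ** 2 / (a0 * y0.var() + a1 * y1.var())

rng = np.random.default_rng(2)
x0 = rng.normal(size=(50, 3))
x1 = rng.normal(size=(40, 3)) + np.array([1.0, 0.0, -0.5])
n0, n1 = len(x0), len(x1)
n = n0 + n1

w = fda_direction(x0, x1)
j_best = fisher_criterion(w, x0, x1)
# Hotelling statistic of Equation (2) with the same covariance estimate
t2 = n0 * n1 / n * (x1.mean(axis=0) - x0.mean(axis=0)) @ w
print(f"J_y = {j_best:.4f}, T^2 = {t2:.4f}, N0*N1/N * J_y = {n0 * n1 / n * j_best:.4f}")
```

The criterion is invariant to the scale of w, so only the direction matters.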

With a small sample size, the \(\textbf{w}\) obtained by FDA may suffer from the well-known overfitting problem, for which efforts have been made in the machine-learning literature on learning a linear boundary. One classical method is the support vector machine (SVM) (Suykens and Vandewalle 1999; Suykens et al. 2002).

Widely adopted in the studies of pattern classification and machine learning, the performance of discriminative analyses is typically measured by the misclassification rate of Equation (10), featuring the separation or overlap of two populations around the boundary and reflecting the confusing chance incurred by a decision or prediction (sometimes called class prediction (Simon et al. 2003)).

The performance of discriminative analyses may also be measured by \(T^{2}\), which considers the separation of the two populations of \(y_{t}=\textbf{w}^{T}\textbf{x}_{t}\). Because the p value varies monotonically with \(T^{2}\), it may be obtained by a univariate t-test. Here, the performance is measured by considering only the salient difference between the two populations along the normal direction of the boundary, instead of the overall difference in the entire space as addressed after Equation (2).

Alternatively, the performance of discriminative analyses may also be measured by a statistic that jointly considers the separating boundary and its outcome by Equation (10) (see Equation (31) in (Xu 2013a)).

Since there are different choices for evaluating the difference between two populations, we are motivated to examine whether they can be integrated for a better evaluation. The name of IHT was previously advocated in (Xu 2013a, 2013b) for a joint consideration of the misclassification rate and the p value about the overall difference. This paper will further proceed along this direction.

Cox regression and linear mixed model

Survival analyses consider the relation of the observed time \(y_{t}\) that a subject t passes before some event occurs to one or more covariates in \(\textbf{x}_{t}\) that may be associated with \(y_{t}\). The Cox model for survival analysis (Cox and Oakes 1984) describes the hazard ratio as follows:
$$\begin{array}{@{}rcl@{}} h_{r}(t)=e^{y_{t}},\ y_{t} =\textbf{w}^{T}\textbf{x}_{t}, \end{array} $$
(13)
which shares the common part \(y_{t}=\textbf{w}^{T}\textbf{x}_{t}\) with Equation (3). The difference is that \(\textbf{w}\) is estimated via maximising the following partial likelihood \(L(\textbf{w})\):
$$\begin{array}{@{}rcl@{}} \max_{\textbf{w}} L, \ L(\textbf{w})=\prod_{t:\omega_{t}=1}\frac{e^{\textbf{w}^{T}\textbf{x}_{t}}} {\sum_{\tau:y_{\tau}\geq y_{t}}e^{\textbf{w}^{T}\textbf{x}_{\tau}}}. \end{array} $$
(14)

Again, we can test \(H_{0}\) by Equation (5) with the Wald test by Equation (7) or Rao's score test by Equation (8), where \(\Delta(\textbf{w})\) and \(I(\textbf{w})\) are still obtained by Equation (6) but with \(L\) given by the above partial likelihood \(L(\textbf{w})\).
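A minimal sketch of the logarithm of the partial likelihood in Equation (14) is given below. It assumes that \(\omega_{t}=1\) marks subjects whose event was observed and that the risk set of subject t contains every subject whose observed time is at least \(y_{t}\), which is the usual convention; ties are not given any special treatment. The function name and the toy data are hypothetical.

```python
import numpy as np

def cox_log_partial_likelihood(w, x, y, omega):
    """Log partial likelihood; x is N x d, y holds observed times, omega the event indicators."""
    eta = x @ w                       # linear predictors w^T x_t
    total = 0.0
    for t in np.flatnonzero(omega):   # only subjects with an observed event contribute
        at_risk = y >= y[t]           # assumption: the risk set includes subject t itself
        total += eta[t] - np.log(np.sum(np.exp(eta[at_risk])))
    return total

# Toy data: five subjects, two covariates, one censored observation.
x = np.array([[0.5, 1.0], [1.5, -0.2], [-0.3, 0.4], [0.0, 0.0], [2.0, 1.0]])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
omega = np.array([1, 1, 0, 1, 1])
print(cox_log_partial_likelihood(np.array([0.1, -0.2]), x, y, omega))
```

The score vector and information matrix of Equation (6) can then be obtained from this function by analytic or numerical differentiation.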

Actually, the core part \(y_{t}=\textbf{w}^{T}\textbf{x}_{t}\) of Equations (3) and (13) is also the core part of the classic multivariate linear regression \(y_{t}=\textbf{w}^{T}\textbf{x}_{t}+e_{t}\), with \(\textbf{w}\) estimated by minimising \(\sum _{t} e_{t}^{2}\).

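Denoting \(\textbf{y}=[y_{1},\cdots,y_{N}]^{T}\), \(\textbf{e}=[e_{1},\cdots,e_{N}]^{T}\), and \(X=[\textbf{x}_{1},\cdots,\textbf{x}_{N}]^{T}\), we may rewrite \(y_{t}=\textbf{w}^{T}\textbf{x}_{t}+e_{t}\) into \(\textbf{y}=X\textbf{w}+\textbf{e}\) as a degenerate case of the following linear mixed model (Demidenko 2013):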
$$ \begin{aligned} \textbf{y}=X\textbf{w}+Z \textbf{f} + \textbf{e},\\ \textbf{f} \sim G(\textbf{f}|0, K), \ \textbf{e} \sim G(\textbf{e}|0, R), \end{aligned} $$
(15)
where Z is a design matrix and f is a random effect vector. We may use the existing methods to estimate w,K,R (Demidenko 2013) and then test w=0 via the Wald test by Equation (7) or Rao’s score test by Equation (8) but with the likelihood L replaced by:
$$\begin{array}{@{}rcl@{}} L=G(\textbf{y}-X\textbf{w}|\textbf{0}, Z KZ^{T}+ R). \end{array} $$
(16)
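Moreover, the N×1 vector \(\textbf{y}\) may be further extended to an N×m matrix \(Y\), with one dependent variable extended to m dependent variables. Accordingly, the vectors \(\textbf{w},\textbf{f},\textbf{e}\) are extended to matrices with m columns each (e.g. \(W\) is a d×m matrix). As a result, we have: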
$$\begin{array}{@{}rcl@{}} Y=XW+Z \textbf{F} + \textbf{E}, \end{array} $$
(17)

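where \(\textbf{F}=[\textbf{f}_{1},\cdots,\textbf{f}_{m}]\), and \(\textbf{E}=[\textbf{e}_{1},\cdots,\textbf{e}_{m}]\). One typical case is that \(\textbf{f}_{1},\cdots,\textbf{f}_{m}\) are i.i.d. with each \(\textbf{f}_{i}\sim G(\textbf{f}_{i}|0,K)\). Also, \(\textbf{e}_{1},\cdots,\textbf{e}_{m}\) are i.i.d. with each \(\textbf{e}_{i}\sim G(\textbf{e}_{i}|0,R)\).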
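For the vector case of Equation (15), the marginal likelihood of Equation (16) and the corresponding estimate of w can be sketched as follows. The sketch assumes that K and R are known, in which case maximising Equation (16) over w gives the generalised least-squares solution; in practice K and R are estimated as well (Demidenko 2013). All names and the synthetic design are illustrative.

```python
import numpy as np

def lmm_loglik(y, x, z, w, k, r):
    """ln G(y - X w | 0, Z K Z^T + R), the logarithm of Equation (16)."""
    v = z @ k @ z.T + r
    resid = y - x @ w
    _, logdet = np.linalg.slogdet(v)
    quad = resid @ np.linalg.solve(v, resid)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + quad)

def gls_estimate(y, x, z, k, r):
    """Maximiser of Equation (16) over w for fixed K, R: (X^T V^-1 X)^-1 X^T V^-1 y."""
    v = z @ k @ z.T + r
    vi_x = np.linalg.solve(v, x)          # V^-1 X (V is symmetric)
    return np.linalg.solve(x.T @ vi_x, vi_x.T @ y)

# Synthetic example: 30 subjects in 3 groups, each group with its own random effect.
rng = np.random.default_rng(3)
n, d, q = 30, 2, 3
x = rng.normal(size=(n, d))
z = np.eye(q)[rng.integers(0, q, size=n)]  # one-hot design matrix Z
k, r = 0.5 * np.eye(q), np.eye(n)
f = rng.multivariate_normal(np.zeros(q), k)
y = x @ np.array([1.0, -2.0]) + z @ f + rng.normal(size=n)

w_hat = gls_estimate(y, x, z, k, r)
print(w_hat, lmm_loglik(y, x, z, w_hat, k, r))
```

With K set to zero and R to the identity, the estimate reduces to ordinary least squares, which corresponds to the degenerate case y = Xw + e mentioned above.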

From inner product to bi-linear form

In many studies of multivariate statistical analysis and machine learning, a basic sampling unit is a vector \(\boldsymbol {x}_{t}=\left [\!{x}_{t}^{(1)},\cdots, {x}_{t}^{(d)}\right ]^{T}\), and the basic computing operation is the inner product \(\textbf{w}^{T}\textbf{x}_{t}\), which is linear with respect to the elements of \(\textbf{x}_{t}\) and also of \(\textbf{w}\). Though \(\textbf{w}^{T}\textbf{x}_{t}\) becomes \(XW\) in Equation (17), it actually consists of a set of vector inner products in parallel.

Efforts have been made in (Xu 2013a, 2013b) to extend this inner product to get a matrix-variate discriminative analysis. Considering that a basic sampling unit is a matrix \(X_{t}\) as shown in Figure 1, the inner product is extended into a bi-linear form:
$$ \begin{aligned} y_{t}&=\textbf{w}^{T}X_{t}\textbf{v}=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{d} w^{(i)}v^{(j)}{ x}_{t}^{(i,j)}\\ &=\sum_{i=1}^{m} w^{(i)}\left(\textbf{v}^{T}\textbf{x}_{t}^{(i)}\right)=\textbf{w}^{T}\textbf{x}_{t}^{v}, \ \textbf{x}_{t}^{v}=X_{t}\textbf{v}, \end{aligned} $$
(18)

which is quadratic with respect to \(w^{(i)}\) and \(v^{(j)}\) but still linear with respect to the elements of \(X_{t}\), and is featured by two consecutive layers of inner products. Similarly, we may also have \(\boldsymbol {w}^{T}X_{t}\textbf {v}=\textbf {v}^{T}\textbf {x}_{t}^{w}\) and \(\boldsymbol {x}_{t}^{w}={X_{t}^{T}}\textbf {w}\). We call such a matrix-variate basic computing operation a bi-linear form. This bi-linear form leads us to the matrix-variate LDA and factor analyses in (Xu 2013a, 2013b). Also, using the matrix normal distribution, the implementations are made by the Bayesian Ying Yang harmony learning (Xu 1995, 2015).

To get further insight, we directly extend the vector inner product into the following matrix format:
$$\begin{array}{@{}rcl@{}} y_{t}= \text{vec}^{T}[\!O]\text{vec}[\!X_{t}] =\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{d} o^{(i,j)}{ x}_{t}^{(i,j)}, \end{array} $$
(19)

which is still linear with respect to the elements of \(X_{t}\) but cannot be decomposed into two inner products, where \(\text{vec}[O]\) denotes the vectorisation of a matrix \(O\).

Comparing Equations (18) and (19), we observe that the bi-linear form can be regarded as constrained in the following structure :
$$\begin{array}{@{}rcl@{}} o^{(i,j)} = w^{(i)}v^{(j)}, \ or \ O=\textbf{w}\textbf{v}^{T}. \end{array} $$
(20)

That is, the weighting along the rows of \(X_{t}\) is unrelated to the one along the columns of \(X_{t}\). This reduces the number of free parameters from \(md\) for \(o^{(i,j)}\) to \(m+d\) for \(w^{(i)}\) and \(v^{(j)}\), which is favourable because the sample size \(N\) of a given sample set is usually small. However, the form is applicable only to cases where the dependence across the rows of \(X_{t}\) is unrelated to the one along the columns of \(X_{t}\). To overcome this limitation, further generalisations of bi-linear matrix forms will be proposed in Equation (40).
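The equivalence between the bi-linear form of Equation (18) and the constrained matrix form of Equations (19) and (20) can be verified with a few lines of NumPy. The dimensions and random values below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 4, 6
x_t = rng.normal(size=(m, d))   # one matrix-variate sample X_t
w = rng.normal(size=m)          # weights along the rows
v = rng.normal(size=d)          # weights along the columns

# Equation (18): bi-linear form and its two-layer inner-product readings
y_bilinear = w @ x_t @ v
y_via_xv = w @ (x_t @ v)        # w^T x_t^v with x_t^v = X_t v
y_via_xw = v @ (x_t.T @ w)      # v^T x_t^w with x_t^w = X_t^T w

# Equations (19) and (20): general matrix form with the rank-one constraint O = w v^T
o = np.outer(w, v)
y_vec = o.ravel() @ x_t.ravel()

print(y_bilinear, y_via_xv, y_via_xw, y_vec)
print("free parameters:", m * d, "for O versus", m + d, "for w and v")
```

The rank-one structure of O is what ties the two forms together; a general O with md free entries would not factor into w and v.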

Methods

KL statistics and matrix-variate tests

Given the case and control samples \(X_{\omega}=\{\textbf{x}_{t,\omega}, t=1,\cdots,N_{\omega}\}, \omega=0,1\), from a parametric family \(q(\textbf{x}|\theta)\), all the unknown parts of the true value \(\theta^{\ast}\) are estimated under \(H_{0}\) by Equation (1), e.g. by the maximum likelihood from \(X_{0}\cup X_{1}\). Also, we estimate \(\hat \theta \) from \(X_{1}\) and test whether \(H_{0}\) breaks by the following formulation (see Equation (36) in (Xu 2012a)):
$$ \begin{aligned} s_{KL}=KL (q (\textbf{x} |\theta^{\ast}) || q (\textbf{x} |\hat \theta)), \ \text{with} \\ KL(p||q) = \int p(u) \ln{\frac{p(u)}{q (u)}} d u, \end{aligned} $$
(21)

from which the Hotelling \(T^{2}\) statistic (Hotelling 1931) and FDA are obtained as its special cases.

Alternatively, we may also rewrite H 0 into
$$\begin{array}{@{}rcl@{}} H_{0} : \text{no difference between}~~q(\textbf{x} | \theta_{1})~~\text{and} ~~q(\textbf{x} |\theta_{0}), \end{array} $$
(22)
with \(X_{1}\) from \(q(\textbf{x}|\theta_{1})\) and \(X_{0}\) from \(q(\textbf{x}|\theta_{0})\). We estimate \(\theta_{1}\) from the case samples \(X_{1}\) and \(\theta_{0}\) from the control samples \(X_{0}\) by either the maximum likelihood or other learning principles, and test \(H_{0}\) by the following case-control formula:
$$\begin{array}{@{}rcl@{}} s_{KL}=KL (q (\textbf{x} | \theta_{0}) || q (\textbf{x} |\theta_{1})), \end{array} $$
(23)

which directly measures the discrepancy between the case population and the control population and provides a general formulation for model-based tests. In contrast, \(s_{KL}\) by Equation (21) indirectly considers the difference of the case population from the pool of both populations under \(H_{0}\).

For the special case that \(q(\textbf{x}|\theta)=G(\textbf{x}|\textbf{c},\Sigma)\), \(s_{KL}\) by Equation (21) and \(s_{KL}\) by Equation (23) are equivalent up to a constant scale, resulting in:
$$ \begin{aligned} s_{KL} &= KL(G(x|\textbf{c}_{0},\Sigma)||G(x|\textbf{c}_{1},\Sigma))\\ &=0.5Tr\left[\!\left(\textbf{c}_{0}-\textbf{c}_{1}\right)\left(\textbf{c}_{0}-\textbf{c}_{1}\right)^{T}\Sigma^{-1}\right]. \end{aligned} $$
(24)

It relates to the Hotelling statistics by Equation (2) via \( T^{2} =2\frac {N_{0}N_{1}}{N_{0}+N_{1}}s_{\textit {KL}}\), i.e. the Hotelling statistics is covered as a special case of the general formulation by Equation (21).
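The reduction to Equation (24) can be checked against the standard closed form of the KL divergence between two Gaussians, as in the sketch below. The function `gaussian_kl` and the numerical values (means, covariance, sample sizes) are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kl(c0, s0, c1, s1):
    """KL(G(x|c0, s0) || G(x|c1, s1)) for multivariate Gaussians (standard closed form)."""
    d = len(c0)
    diff = c0 - c1
    _, logdet0 = np.linalg.slogdet(s0)
    _, logdet1 = np.linalg.slogdet(s1)
    trace_term = np.trace(np.linalg.solve(s1, s0))
    quad = diff @ np.linalg.solve(s1, diff)
    return 0.5 * (logdet1 - logdet0 + trace_term - d + quad)

c0 = np.array([0.0, 0.0, 0.0])
c1 = np.array([1.0, -0.5, 0.3])
sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])

# With a shared covariance only the quadratic term survives, as in Equation (24)
s_kl = gaussian_kl(c0, sigma, c1, sigma)
s_kl_eq24 = 0.5 * np.trace(np.outer(c0 - c1, c0 - c1) @ np.linalg.inv(sigma))

# Relation to Equation (2) for hypothetical sample sizes N0 and N1
n0, n1 = 40, 30
t2_from_kl = 2 * n0 * n1 / (n0 + n1) * s_kl
t2_direct = n0 * n1 / (n0 + n1) * (c1 - c0) @ np.linalg.solve(sigma, c1 - c0)
print(s_kl, s_kl_eq24, t2_from_kl, t2_direct)
```

When the two covariances differ, the log-determinant and trace terms no longer cancel, which is one way to see why the divergence can also serve for hypotheses about covariances.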

The equivalence no longer holds when we consider other examples of \(q(\textbf{x}|\theta_{1})\) and \(q(\textbf{x}|\theta_{0})\). Because the case population reflects an abnormal situation and thus has a distribution that is quite different from that of the control population, \(q(\textbf{x}|\theta_{1})\) may come from a parametric family that is different from the one of \(q(\textbf{x}|\theta_{0})\). For example, we may consider a Gaussian for the control samples and a mixture of two Gaussians for the case samples.

In addition to testing \(\textbf{c}_{0}=\textbf{c}_{1}\) as considered by the Hotelling statistic, we may use \(s_{KL}\) by Equation (23) to develop statistics for other null hypotheses of the type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\). For example, \({\theta ^{s}_{i}}\) could be a covariance \(\Sigma_{i}\).

Generally, we may use \(s_{KL}\) by Equation (21) to develop a statistic for testing a general relation given by a vector equation \(h(\theta)=0\) that consists of one or several joint equations, for which we estimate \(\theta_{0}\) from samples of \(X_{0}\cup X_{1}\) subject to the constraint \(h(\theta)=0\) and estimate \(\theta_{1}\) from only the case samples \(X_{1}\) without the constraint. The above type \({\theta ^{s}_{0}}= {\theta ^{s}_{1}}\) is a special case \(h(\theta)={\theta ^{s}_{0}}- {\theta ^{s}_{1}}=0\). Also, the equality may be extended to several subsets \(\{{\theta ^{s}_{i}}\}\) that are equal to each other, with each \({\theta ^{s}_{i}}\) being either a mean vector \(\textbf{c}_{i}\) or a covariance \(\Sigma_{i}\). Even the simplest case \(\theta^{s}=0, \theta^{s}\subset\theta\) has been widely studied. For example, \(\theta^{s}\) could be the variances for the variance analyses or \(\textbf{w}=\textbf{0}\) in Equation (5) for logistic regression and Cox regression.

Equation (21) not only provides a general formulation for developing a statistic for a composite test but also offers a bird's-eye view of the existing statistics for further understanding, improvements, and extensions.

Simply with each vector x replaced by a matrix X, we can extend Equations (21) and (23) to consider matrix-variate samples. Without losing generality, we focus on Equation (23) and get:
$$\begin{array}{@{}rcl@{}} s_{KL}= KL (q (X| \theta_{0}) || q (X |\theta_{1})). \end{array} $$
(25)
We consider \(q(X|\theta)\) given by the following matrix normal distribution (MND) (Dutilleul 1999; Xu 2012a):
$$\begin{array}{@{}rcl@{}} N (X | C, \Omega, \Sigma) = \frac{e^{-0.5Tr\left[\Omega^{-1}(X-C)^{T} \Sigma^{-1} (X-C)\right]}} {(2\pi)^{0.5md} |\Sigma|^{0.5d}|\Omega |^{0.5m}}, \end{array} $$
(26)

where a matrix \(\Omega\) describes the cross-column dependence of the matrix variate \(X\), and a matrix \(\Sigma\) describes the cross-row dependence of \(X\). This matrix distribution is equivalent to a multivariate Gaussian distribution \(G(\text{vec}(X)|\text{vec}(C),\Sigma\otimes\Omega)\), where \(\otimes\) denotes the Kronecker product.

With each sample \(X_{t,\omega}\) from \(N\left (X|C^{x}_{\omega },\Omega ^{x}_{\omega },\Sigma ^{x}_{\omega }\right)\) under the assumption:
$$\begin{array}{@{}rcl@{}} \Sigma^{x}={\Sigma^{x}_{0}}={\Sigma^{x}_{1}}, \ \Omega^{x}={\Omega^{x}_{0}}={\Omega^{x}_{1}}, \end{array} $$
(27)
it follows from Equation (25) that we obtain:
$$ \begin{aligned} s_{KL}&=KL\left(N\left(X |{C^{x}_{1}},\Omega^{x},\Sigma^{x}\right)||N\left(X |{C^{x}_{0}},\Omega^{x},\Sigma^{x}\right)\right)\\ &=0.5Tr\left[\Omega^{x\ -1}\left({C_{1}^{x}}-{C_{0}^{x}}\right)^{T} \Sigma^{x\ -1} \left({C_{1}^{x}}-{C_{0}^{x}}\right)\right], \end{aligned} $$
(28)

as the matrix-variate counterpart of Equation (24), where parameters are typically estimated by the maximum likelihood principle (Xu 2015).
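The link between the matrix normal distribution of Equation (26), its vectorised Gaussian form, and the trace expression for the divergence can be checked numerically. In the sketch below, vec stacks the rows of X (NumPy's default `ravel` order), which is the convention under which the covariance of vec(X) is the Kronecker product of Σ with Ω; the factor 0.5 of the Gaussian KL divergence is kept explicit. Function names and the random test matrices are assumptions.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric positive definite matrix (illustrative helper)."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)

def matrix_normal_logpdf(x, c, omega, sigma):
    """Logarithm of Equation (26); sigma is m x m (rows), omega is d x d (columns)."""
    m, d = x.shape
    r = x - c
    quad = np.trace(np.linalg.solve(omega, r.T) @ np.linalg.solve(sigma, r))
    _, logdet_sigma = np.linalg.slogdet(sigma)
    _, logdet_omega = np.linalg.slogdet(omega)
    return -0.5 * (quad + m * d * np.log(2 * np.pi) + d * logdet_sigma + m * logdet_omega)

def gaussian_logpdf(u, mean, cov):
    r = u - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (r @ np.linalg.solve(cov, r) + len(u) * np.log(2 * np.pi) + logdet)

def matrix_normal_kl(c1, c0, omega, sigma):
    """KL between two matrix normals that share omega and sigma."""
    diff = c1 - c0
    return 0.5 * np.trace(np.linalg.solve(omega, diff.T) @ np.linalg.solve(sigma, diff))

rng = np.random.default_rng(5)
m, d = 3, 4
sigma, omega = random_spd(m, rng), random_spd(d, rng)
c0, c1, x = rng.normal(size=(m, d)), rng.normal(size=(m, d)), rng.normal(size=(m, d))

cov_vec = np.kron(sigma, omega)   # covariance of the row-stacked vec(X)
lp_matrix = matrix_normal_logpdf(x, c0, omega, sigma)
lp_vector = gaussian_logpdf(x.ravel(), c0.ravel(), cov_vec)
dv = (c1 - c0).ravel()
kl_matrix = matrix_normal_kl(c1, c0, omega, sigma)
kl_vector = 0.5 * dv @ np.linalg.solve(cov_vec, dv)
print(lp_matrix, lp_vector, kl_matrix, kl_vector)
```

The matrix form needs only the m×m and d×d factors, whereas the vectorised form works with an md×md covariance, which is the computational reason for preferring the former.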

Generally, with help of Equation (25), we may also develop statistics for distributions other than matrix normal distributions.

Model-based two-sample tests

The tests for \(H_{0}\) by Equation (22) are featured by comparing the difference between two parametric models \(q(\textbf{x}|\theta_{1})\) and \(q(\textbf{x}|\theta_{0})\) on the entire domain of \(\textbf{x}\). Their basis is modelling the case population by \(q(\textbf{x}|\theta_{1})\) with its parameter \(\theta_{1}\) estimated from \(X_{1}\) and modelling the control population by \(q(\textbf{x}|\theta_{0})\) with its parameter \(\theta_{0}\) estimated from \(X_{0}\). Thus, these tests are called model-based two-sample tests, or model-based tests in short wherever no confusion arises.

Typically, a statistic s is considered to measure the difference between two models. The bigger the value of s, the larger the difference. We reject \(H_{0}\) when s takes a large enough value, and the false positive probability of this rejection is called the p value.

Usually, how to get a statistic s from samples is task-dependent. It is typically a function of the first- and second-order statistics that are random variables directly obtained from samples of populations, e.g. see the Hotelling statistic by Equation (2). Equation (23) provides a general perspective for getting such a statistic \(s_{KL}\), covering not only the first- and second-order statistics but also ones beyond.

Actually, Equation (23) can be further generalised. Adding in the prior proportions \(\alpha_{1},\alpha_{0}\) for \(q(\textbf{x}|\theta_{1})\) and \(q(\textbf{x}|\theta_{0})\), we have:
$$ \begin{aligned} {KL}_{10}&=KL (\alpha_{1} q(\textbf{x} |\theta_{1}) || \alpha_{0} q(\textbf{x} |\theta_{0}))\\ &=\alpha_{1}KL (q(\textbf{x} |\theta_{1}) || q(\textbf{x} |\theta_{0}))+ \alpha_{1}\delta_{\Gamma},\\ \delta_{\Gamma} &=\ln{\frac{\alpha_{1}}{\alpha_{0}}}=\ln {\alpha_{1}}-\ln{\alpha_{0}}, \end{aligned} $$
(29)
which describes the difference observed from the case side. From the control side, we have also:
$$ \begin{aligned} {KL}_{01}=KL (\alpha_{0} q(\textbf{x} |\theta_{0}) || \alpha_{1} q(\textbf{x} |\theta_{1})) =\alpha_{0} KL (q(\textbf{x} |\theta_{0}) || q(\textbf{x} |\theta_{1})) -\alpha_{0}\delta_{\Gamma}. \end{aligned} $$
We further get their average and difference as follows:
$$ \begin{aligned} {KL}_{\text{sum}}&= \frac{{KL}_{10}+{KL}_{01}}{2}= \int \frac{\alpha_{1} q(\textbf{x} |\theta_{1}) - \alpha_{0} q(\textbf{x} |\theta_{0})}{2}\ln{\frac{\alpha_{1} q(\textbf{x} |\theta_{1})}{\alpha_{0} q(\textbf{x} |\theta_{0})}}d\textbf{x},\\ {KL}_{\text{dif}}&= {{KL}_{10}-{KL}_{01}}=\int q(\textbf{x} |\theta)\ln{\frac{\alpha_{1} q(\textbf{x} |\theta_{1})}{\alpha_{0} q(\textbf{x} |\theta_{0}) }}d\textbf{x}, q(\textbf{x} |\theta) =\alpha_{1} q(\textbf{x} |\theta_{1}) + \alpha_{0} q(\textbf{x} |\theta_{0}). \end{aligned} $$
(30)
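The identities in Equations (29) and (30) hold for any pair of distributions and can be checked on a small discrete example, where the integrals become sums. The probability vectors and weights below are arbitrary values chosen for illustration, and the helper `weighted_kl` is a hypothetical name.

```python
import numpy as np

def weighted_kl(a, p, b, q):
    """KL(a p || b q) = sum_u a p(u) ln[a p(u) / (b q(u))] for discrete p, q."""
    return np.sum(a * p * np.log(a * p / (b * q)))

q1 = np.array([0.5, 0.3, 0.2])    # stand-in for q(x|theta_1)
q0 = np.array([0.2, 0.3, 0.5])    # stand-in for q(x|theta_0)
a1, a0 = 0.4, 0.6                 # weights alpha_1, alpha_0
delta = np.log(a1 / a0)           # delta_Gamma of Equation (29)
log_ratio = np.log(a1 * q1 / (a0 * q0))

kl_10 = weighted_kl(a1, q1, a0, q0)
kl_01 = weighted_kl(a0, q0, a1, q1)

# Right-hand sides of Equations (29) and (30)
kl_10_rhs = a1 * weighted_kl(1.0, q1, 1.0, q0) + a1 * delta
kl_01_rhs = a0 * weighted_kl(1.0, q0, 1.0, q1) - a0 * delta
kl_sum_rhs = np.sum(0.5 * (a1 * q1 - a0 * q0) * log_ratio)
kl_dif_rhs = np.sum((a1 * q1 + a0 * q0) * log_ratio)

print(kl_10, kl_10_rhs)
print(kl_01, kl_01_rhs)
print(0.5 * (kl_10 + kl_01), kl_sum_rhs)
print(kl_10 - kl_01, kl_dif_rhs)
```

Each printed pair agrees, which confirms that the weighted divergences split into a scaled divergence between the normalised models plus a term that depends only on the weights.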
For q(x|θ)=G(x|c,Σ), we have:
$$ \begin{aligned} {KL}_{1,0}&= \alpha_{1}\left(\delta_{\alpha, \Sigma} + \delta \textbf{c}^{T}\Sigma_{1}^{-1} \delta \textbf{c}\right), {KL}_{0,1}=\alpha_{0}\left(-\delta_{\alpha, \Sigma}+ \delta \textbf{c}^{T}\Sigma_{0}^{-1} \delta \textbf{c}\right), \\ {KL}_{\text{sum}}&=\frac{(\alpha_{1}-\alpha_{0}) \delta_{\alpha, \Sigma}+\delta \textbf{c}^{T}\Sigma_{\Gamma}^{-1} \delta \textbf{c}}{2}, {KL}_{\text{dif}}=\delta_{\alpha, \Sigma}+\delta \textbf{c}^{T}\left[\alpha_{1}\Sigma_{1}^{-1} -\alpha_{0}\Sigma_{0}^{-1}\right] \delta \textbf{c},\\ \Sigma_{\Gamma}^{-1}&=\alpha_{0}\Sigma_{0}^{-1}+\alpha_{1}\Sigma_{1}^{-1}, \delta \textbf{c}=(\textbf{c}_{1}-\textbf{c}_{0})/\sqrt{2}, \delta_{\alpha, \Sigma}=\ln{\frac{\alpha_{1}}{|\Sigma_{1}|^{0.5}}}- \ln{\frac{\alpha_{0}}{|\Sigma_{0}|^{0.5}}}, \end{aligned} $$
(31)

from which we observe how an overall difference is structured from the statistics on individual differences. For \({KL}_{\text{sum}}\), the role of the anti-dispersion difference \(\delta_{\alpha, \Sigma}\) is cancelled while the position difference \(\delta \textbf{c}\) is averaged. For \({KL}_{\text{dif}}\), the role of \(\delta_{\alpha, \Sigma}\) is summed up while the position difference \(\delta \textbf{c}\) is cancelled. In other words, the roles of \({KL}_{\text{sum}}\) and \({KL}_{\text{dif}}\) are complementary. According to the nature of the task, we may use either of them separately or both of them jointly.

The performance of examining \(H_{0}\) by Equation (22) is typically evaluated via the p value, which depends not only on how p is approximately estimated but also on how well \(q(\textbf{x}|\theta_{0})\) models \(X_{0}\) and \(q(\textbf{x}|\theta_{1})\) models \(X_{1}\). Poor modelling makes the resulting p unreliable. Thus, the performance evaluation should also consider the corresponding modelling error or generally the likelihood:
$$\begin{array}{@{}rcl@{}} L=\ln{[\!\alpha_{0} q (X_{0} | \theta_{0})+\alpha_{1}q (X_{1} | \theta_{1})]}. \end{array} $$
(32)

The modelling error depends not only on what type of model is used but also on an appropriate model complexity. Using a model with too large a complexity can lead to an over-optimistic result, i.e. an over-fitting problem. To remedy it, we need to consider either an average of modelling errors on training and testing samples (e.g. by cross validation (Stone 1974)) or an approximate generalisation error given by one of the model-selection criteria (e.g. BIC (Schwarz 1978)).

Jointly, model-based two-sample tests involve two tasks, that is, the first two tasks summarised in Table 1. Task A is a typical topic of machine learning, from which existing studies can be adopted, while task B is a typical topic of statistical testing, with its corresponding \(\varepsilon_{B}\) being a nonnegative measure that monotonically decreases towards zero as s tends towards a large value.
Table 1 Four tasks of integrative hypothesis tests

Task | Description
Task A (modelling) | Estimate \(\theta_{\omega}\) such that \(q(\textbf{x}|\theta_{\omega})\) models the corresponding population of samples, with the performance evaluated by its corresponding \(\varepsilon_{A}\), e.g. the average error or generalisation error.
Task B (comparison) | Develop a statistic s based on the resulting models to test \(H_{0}\) by Equation (22), with the performance evaluated by its corresponding \(\varepsilon_{B}\) that measures the difference between two populations, e.g. the p value.
Task C (classification) | Classify each sample to either \(\omega=1\) or \(\omega=0\), with the performance evaluated by its corresponding \(\varepsilon_{C}\), e.g. either the rate of incorrect classification by Equation (44) or alternatively the corresponding p value obtained by a test based on a statistic by Equation (47).
Task D (assurance) | Test whether a reliable separating boundary exists between the two populations of samples, with the performance evaluated by its corresponding \(\varepsilon_{D}\).

It is an open challenge to integrate ε A and ε B into one objective to optimise, because how to combine them has been little investigated. A preliminary empirical study has been made with the help of the 2D scattering plots of ε A versus ε B , as illustrated in Figure 2. Each scattering point denotes a performance pair (ε A ,ε B ) associated with one miRNA on the samples for gene expression. Those points located near the origin (e.g. those in the orange colour) act as the candidate points of interest.
https://static-content.springer.com/image/art%3A10.1186%2Fs40535-015-0007-5/MediaObjects/40535_2015_7_Fig2_HTML.gif
Figure 2

2D scattering plots for joint analyses.

Matrix-variate discriminative analysis

As addressed around Equation (11), the classic FDA seeks a projection y t =w T x t to maximize J y . Moreover, it follows from the bi-linear form by Equation (18) that a matrix-variate discriminative analysis is obtained by:
$$ {\small{\begin{aligned} \{\textbf{w}^{\ast}, \textbf{v}^{\ast}\}&=\arg\max\!{_{\textbf{w},\textbf{v}}} \,J(\textbf{w},\textbf{v}), J(\textbf{w},\textbf{v})= \frac{\textbf{w}^{T}\left({C_{1}^{x}}-{C_{0}^{x}}\right)\textbf{v}\textbf{v}^{T}\left({C_{1}^{x}}-{C_{0}^{x}}\right)^{T}\textbf{w} }{\textbf{w}^{T}\Sigma_{\textbf{v}}\textbf{w} },\\ &=\frac{\textbf{v}^{T}\left({C_{1}^{x}}-{C_{0}^{x}}\right)^{T}\textbf{w}\textbf{w}^{T}\left({C_{1}^{x}}-{C_{0}^{x}}\right)\textbf{v} }{\textbf{v}^{T}\Sigma_{\textbf{w}}\textbf{v} }, \Sigma_{\textbf{v}}\,=\,\!\sum_{\omega=0,1} \!\sum\limits_{t=1}^{N_{\omega}} \!\left(X_{t, \omega} -C^{x}_{\omega}\right) \textbf{v}\textbf{v}^{T}\left(X_{t, \omega} -C^{x}_{\omega}\right)^{T},\\ \Sigma_{\textbf{w}}&=\sum_{\omega=0,1} \sum\limits_{t=1}^{N_{\omega}} \left(X_{t, \omega} -C^{x}_{\omega}\right)^{T} \textbf{w}\textbf{w}^{T}(X_{t, \omega} -C^{x}_{\omega}),\cr \end{aligned}}} $$
(33)
which may be solved by iterating:
$$ \begin{aligned} &\text{fix \(\boldsymbol{v}\), get} ~~\mathbf{w}^{\ast}\propto \Sigma_{\textbf{v}}^{-1}\left({C_{1}^{x}}-{C_{0}^{x}}\right)\mathbf{v}, \mathbf{w}= \mathbf{w}^{\ast}/\Vert \mathbf{w}^{\ast}\Vert, \\ &\text{ fix \(\boldsymbol{w}\), get} ~~\mathbf{v}^{\ast}\propto \Sigma_{\mathbf{w}}^{-1}\left({C_{1}^{x}}-{C_{0}^{x}}\right)^{T}\mathbf{w}, \mathbf{v}= \mathbf{v}^{\ast}/\Vert \mathbf{v}^{\ast}\Vert, \end{aligned} $$
(34)
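The alternating updates of Equation (34) may be sketched as follows. This is only a minimal illustration under stated assumptions: samples are stored as NumPy arrays of shape (N, m, d), the function names `matrix_fda` and `fisher_score` are hypothetical, and the small `ridge` term is our own addition to keep the scatter matrices invertible (it is not part of Equation (34)).

```python
import numpy as np

def matrix_fda(X0, X1, n_iter=50, ridge=1e-6, seed=0):
    """Alternating updates of Equation (34) for the projection y_t = w^T X_t v.

    X0, X1: arrays of shape (N0, m, d) and (N1, m, d) (controls, cases).
    ridge: assumed small regulariser, not part of Equation (34).
    """
    rng = np.random.default_rng(seed)
    m, d = X0.shape[1:]
    C0, C1 = X0.mean(axis=0), X1.mean(axis=0)
    D = C1 - C0                                  # C_1^x - C_0^x
    R = np.concatenate([X0 - C0, X1 - C1])       # centred samples of both populations
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    w = np.ones(m) / np.sqrt(m)
    for _ in range(n_iter):
        # fix v: Sigma_v = sum_t (X_t - C) v v^T (X_t - C)^T, then w ∝ Sigma_v^{-1} D v
        Rv = R @ v                               # shape (N, m)
        Sigma_v = Rv.T @ Rv + ridge * np.eye(m)
        w = np.linalg.solve(Sigma_v, D @ v)
        w /= np.linalg.norm(w)
        # fix w: Sigma_w = sum_t (X_t - C)^T w w^T (X_t - C), then v ∝ Sigma_w^{-1} D^T w
        Rw = np.einsum("tij,i->tj", R, w)        # shape (N, d)
        Sigma_w = Rw.T @ Rw + ridge * np.eye(d)
        v = np.linalg.solve(Sigma_w, D.T @ w)
        v /= np.linalg.norm(v)
    return w, v

def fisher_score(X0, X1, w, v):
    """J(w, v) of Equation (33), evaluated on the projections y_t = w^T X_t v."""
    y0 = np.einsum("i,tij,j->t", w, X0, v)
    y1 = np.einsum("i,tij,j->t", w, X1, v)
    within = ((y0 - y0.mean()) ** 2).sum() + ((y1 - y1.mean()) ** 2).sum()
    return (y1.mean() - y0.mean()) ** 2 / within
```

The iteration is only guaranteed to reach a stationary point of J(w,v), so in practice several random initialisations of v may be compared via `fisher_score`.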
Generally, the bi-linear form by Equation (18) may also be rewritten into the following matrix format:
$$\begin{array}{@{}rcl@{}} Y_{t}=V^{T}X_{t}W, \end{array} $$
(35)

with an m×m s matrix V and a d×d s matrix W. It degenerates back to Equation (18) when m s =1,d s =1. Mapping into one variable y t may lose too much discriminative information. Instead, Equation (35) maps X t into a size-reduced matrix, a column vector, or a row vector according to the practical problem, e.g. for genomics data in genetic biology as well as image or table data in various tasks of big data analyses.

With X t replaced by Y t , Equations (25) to (29) are directly applicable. If X t comes from an MND, Y t comes from an MND too. Accordingly, Equation (33) becomes:
$$ \begin{aligned} \{W^{\ast}, V^{\ast}\}&=\arg\max\!{_{W,V}}\, J(W,V), \\ J(W,V)&= Tr\left[\Omega^{y\ -1}\left({C^{y}_{1}}-{C^{y}_{0}}\right)^{T} \Sigma^{y\ -1} \left({C^{y}_{1}}-{C^{y}_{0}}\right)\right], \end{aligned} $$
(36)
where the parameters are given in a way similar to Equation (28). Also, its solution may be obtained by iterating:
$$\begin{array}{@{}rcl@{}} \text{Fixing}\; W, \text{get}\; V\; \text{by solving}\; \nabla_{V} J(W,V)=0, \\ \text{Fixing}\; V, \text{get}\; W\; \text{by solving}\; \nabla_{W} J(W,V)=0. \end{array} $$
(37)
Actually, Equation (35) computes a set of the bi-linear matrix forms in parallel as follows:
$$\begin{array}{@{}rcl@{}} Y_{t}= \left[\!y_{t}^{(k,\ell)}\right], \ y_{t}^{(k,\ell)}=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{d} w^{(i,k)}v^{(j,\ell)}{ x}_{t}^{(i,j)}. \end{array} $$
(38)
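The equivalence between the matrix form by Equation (35) and the element-wise sums by Equation (38) can be checked numerically. In the sketch below, the neutral names `left` (acting on the row index i) and `right` (acting on the column index j) are our own and merely stand for the two projection matrices; the sizes are arbitrary.

```python
import numpy as np

def bilinear_project(X, left, right):
    """Matrix form: map an m x d sample into an m_s x d_s matrix."""
    return left.T @ X @ right

def bilinear_project_elementwise(X, left, right):
    """Element-wise form: y^(k,l) = sum_i sum_j left[i,k] * right[j,l] * x^(i,j)."""
    return np.einsum("ik,jl,ij->kl", left, right, X)

rng = np.random.default_rng(0)
m, d, m_s, d_s = 5, 4, 2, 3                  # illustrative sizes only
X = rng.standard_normal((m, d))
left = rng.standard_normal((m, m_s))         # acts on the row index i
right = rng.standard_normal((d, d_s))        # acts on the column index j
Y = bilinear_project(X, left, right)
Y_sum = bilinear_project_elementwise(X, left, right)
```

Setting m s =1 and d s =1 recovers a single bi-linear form, since `left` and `right` then reduce to column vectors.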

Each \(y_{t}^{(k,\ell)}\) above and the bi-linear form by Equation (18) suffer the limitation discussed after Equation (20), which is relaxed with v (j) replaced by \(v^{(j)}_{i}\) or \(v^{(j,\ell)}\) replaced by \(v^{(j,\ell)}_{i}\), i.e. adding another dimension by a subscript i.

Focusing on the former, we extend Equation (20) into:
$$\begin{array}{@{}rcl@{}} o^{(i,j)} = w^{(i)}v^{(j)}_{i}, \cr v^{(j)}_{i} \text{is subject to a constraint, e.g. one of} \cr \left\{\begin{array}{ll} \sum_{j=1}^{d} v^{(j)}_{i}=1, &\text{Choice (a)}, \\ \text{from a Gaussian density}, &\text{Choice (b)}, \\ \text{from a Laplace density}, &\text{Choice (c)}. \end{array}\right. \end{array} $$
(39)
Accordingly, we extend Equation (18) into:
$$ \begin{aligned} y_{t}&=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{d} w^{(i)}v^{(j)}_{i}{x}_{t}^{(i,j)} =Tr[\!\text{diag}[\!\mathbf{w}]X_{t}V] =\mathbf{w}^{T}\mathbf{x}_{t}^{\mathbf{v}}, \ \mathbf{x}_{t}^{\mathbf{v}}=\left[\!\mathbf{v}_{1}^{T}\mathbf{x}_{t}^{(1)}, \cdots, \mathbf{v}_{m}^{T}\mathbf{x}_{t}^{(m)} \right]^{T}, \\ V&=[\!\mathbf{v}_{1}, \cdots, \mathbf{v}_{m}], \mathbf{v}_{i}=\left[\!v^{(1)}_{i}, \cdots, v^{(d)}_{i}\right]^{T}, \text{diag}[\!\mathbf{w}]=\text{diag}\left[\!w^{(1)}, \cdots, w^{(m)}\right], \end{aligned} $$
(40)

where Tr[A] denotes the trace of the matrix A.

Putting it into Equation (11) and considering choice (a) in Equation (39), we get Equation (33) modified into:
$$ \begin{aligned} \{\mathbf{w}^{\ast}, V^{\ast}\}&=\arg\max\nolimits_{\mathbf{w},V} J(\mathbf{w},V), \text{subject} \ \text{to} : \ \Vert \mathbf{v}_{i} \Vert=1, \forall i, \\ J(\mathbf{w},V)&= \frac{\mathbf{w}^{T}\left(\mathbf{c}^{\mathbf{v}}_{1}-\mathbf{c}^{\mathbf{v}}_{0}\right) \left(\mathbf{c}^{\mathbf{v}}_{1}-\mathbf{c}^{\mathbf{v}}_{0}\right)^{T} \mathbf{w}} {\mathbf{w}^{T}\Sigma_{\mathbf{v}} \mathbf{w}} =\frac{Tr^{2}\left[\!\text{diag}\left[\mathbf{w}\right]\left({C_{1}^{x}}-{C_{0}^{x}}\right)V\right]}{\sum_{\omega=0,1} \sum\limits_{t=1}^{N_{\omega}} Tr^{2}\left[\text{diag}[\!\mathbf{w}]\left(X_{t, \omega}- C_{\omega}^{x}\right) V\right]},\\ \mathbf{c}^{\mathbf{v}}_{\omega}&=\left[\mathbf{v}_{1}^{T}\mathbf{c}_{\omega}^{x\ (1)}, \cdots, \mathbf{v}_{m}^{T}\mathbf{c}_{\omega}^{x\ (m)} \right]^{T}, \Sigma_{\mathbf{v}}=\sum_{\omega=0,1} \sum\limits_{t=1}^{N_{\omega}} \delta \mathbf{x}_{t, \omega}^{\mathbf{v}}\delta \mathbf{x}_{t, \omega}^{\mathbf{v}\ T},~~\\ \text{where}~~\delta \mathbf{x}_{t, \omega}^{\mathbf{v}}&=\left[\mathbf{v}_{1}^{T}\left(\mathbf{x}_{t, \omega}^{(1)}-\mathbf{c}_{\omega}^{x\ (1)}\right), \cdots, \mathbf{v}_{m}^{T}\left(\mathbf{x}_{t, \omega}^{(m)}-\mathbf{c}_{\omega}^{x\ (m)}\right)\right]^{T}. \end{aligned} $$
(41)
which may be solved by iterating:
$$ \begin{aligned} \text{fix}~~ \boldsymbol{w}, ~~\text{get}~~ V~~ \text{by solving}~~ \nabla_{V} J(\mathbf{w},V)=0,\\ \text{subject to} ~:~~ \Vert \mathbf{v}_{i} \Vert=1, \forall i. \\ \text{fix}~~ V, ~~\text{get}~~ \mathbf{w}^{\ast} \propto \Sigma_{\mathbf{v}}^{-1} \left(\mathbf{c}^{\mathbf{v}}_{1}-\mathbf{c}^{\mathbf{v}}_{0}\right), \mathbf{w}= \frac{\mathbf{w}^{\ast}}{\Vert \mathbf{w}^{\ast}\Vert}. \end{aligned} $$
(42)
For simplicity, we may approximately ignore the coupling across different subscript i and get:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \mathbf{v}_{i}^{\ast} \propto \Sigma^{^{(i)}\ -1}\left(\mathbf{c}_{1}^{(i)}- \mathbf{c}_{0}^{(i)}\right), \ \mathbf{v}_{i}= \mathbf{v}_{i}^{\ast}/\Vert \mathbf{v}_{i}^{\ast}\Vert, \\ \Sigma^{(i)}=\sum_{\omega=0,1} \sum\limits_{t=1}^{N_{\omega}} \left(\mathbf{x}_{t, \omega}^{(i)}-\mathbf{c}_{\omega}^{x\ (i)}\right)\left(\mathbf{x}_{t, \omega}^{(i)}-\mathbf{c}_{\omega}^{x\ (i)}\right)^{T}. \end{aligned} \end{array} $$
(43)

This solution does not depend on w, and thus, the job is done after getting w by Equation (34).
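The decoupled solution by Equation (43) may be sketched as below, one direction per row of X t . The array layout (N, m, d), the function names, and the small `ridge` regulariser are assumptions made for illustration only.

```python
import numpy as np

def rowwise_directions(X0, X1, ridge=1e-6):
    """Equation (43): one direction v_i per row i, ignoring coupling across rows.

    X0, X1: arrays of shape (N0, m, d) and (N1, m, d) (controls, cases).
    Returns V of shape (d, m) whose i-th column is the unit vector v_i.
    ridge: assumed small regulariser, not part of Equation (43).
    """
    m, d = X0.shape[1:]
    C0, C1 = X0.mean(axis=0), X1.mean(axis=0)
    V = np.empty((d, m))
    for i in range(m):
        # centred i-th rows of both populations, stacked into shape (N0 + N1, d)
        R = np.concatenate([X0[:, i, :] - C0[i], X1[:, i, :] - C1[i]])
        Sigma_i = R.T @ R + ridge * np.eye(d)
        v = np.linalg.solve(Sigma_i, C1[i] - C0[i])
        V[:, i] = v / np.linalg.norm(v)
    return V

def project_rows(X, V):
    """x_t^v of Equation (40): the i-th entry is v_i^T x_t^(i)."""
    return np.einsum("tij,ji->ti", X, V)
```

The output of `project_rows` is an ordinary m-dimensional sample, on which w can then be obtained by a vector-variate FDA.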

Also, we may update V by a gradient-based approach via \(\nabla_{V} J(\mathbf{w},V)\). Practically, a regularisation may be added on J(w,v) and J(w,V) via Gaussian priors on w,v, and V. Alternatively, we may make sparse learning via Laplace priors on w,v, and V.

As a complement to model-based two-sample tests that consider H 0 by Equation (22) from an overall perspective of populations, we may also perform the classification task in Table 1 to evaluate the goodness of the decomposition by Equation (10), measured by another quantity ε C , e.g. the following rate of incorrect classification:
$$\begin{array}{@{}rcl@{}} \varepsilon_{C}=\frac{\# X^{(1)}_{0}+ \#X^{(0)}_{1}}{\#X_{0}+\#X_{1}}. \end{array} $$
(44)
Classically, an optimal classification is given by:
$$\begin{array}{@{}rcl@{}} \omega=\arg\max_{j} [\!\alpha_{j} q(\xi |\theta_{j})], \end{array} $$
(45)

where ξ could be either of x t and X t or the corresponding projections y t and Y t . Mapping samples into the projections helps to reduce the dimension of x t and X t for tackling the overfitting difficulty of task A in Table 1, especially when the size of samples is not large enough. Also, it facilitates visualisation of two populations in a low dimension (especially in three dimensions or fewer) such that classification can be made with human interaction.
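As a small illustration of Equations (44) and (45), the following sketch classifies one-dimensional projections y t under two Gaussian models and then counts the misclassified samples. The Gaussian choice for q, the parameter values, and the toy samples are all assumptions made for the example, not results from the paper.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a univariate Gaussian, used here as an assumed q(y | theta)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def classify(y, alphas, means, variances):
    """Equation (45): pick the class j that maximises alpha_j * q(y | theta_j)."""
    scores = [a * gaussian_pdf(y, mu, s2)
              for a, mu, s2 in zip(alphas, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

def misclassification_rate(labels, predicted):
    """Equation (44): fraction of samples assigned to the wrong population."""
    wrong = sum(1 for t, p in zip(labels, predicted) if t != p)
    return wrong / len(labels)

# toy projections (hypothetical values): four controls followed by four cases
ys = [-1.2, -0.4, 0.3, -0.9, 1.1, 0.7, -0.2, 1.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [classify(y, (0.5, 0.5), (-0.5, 0.8), (1.0, 1.0)) for y in ys]
eps_c = misclassification_rate(labels, predicted)  # two of eight are wrong: 0.25
```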

Boundary-based tests

Actually, the FDA by Equation (11) finds w that defines the normal direction of the best discriminative hyperplane, as shown in Figure 3. In addition to Equation (45), the hyperplane often acts as a separating boundary as follows:
$$\begin{array}{@{}rcl@{}} g(x,\textbf{w})=\textbf{w}^{T}\textbf{x}+w_{0}=\textbf{w}^{T}(\textbf{x}-\mu)=0,\\ \text{by which \(\boldsymbol{x}\) is classified into} \\ \left\{\begin{array}{ll} \text{a case sample}, & if \ g(\textbf{x},\textbf{w})>0, \\ \text{a control sample}, & if \ g(\textbf{x},\textbf{w})\le 0. \end{array} \right. \end{array} $$
(46)
https://static-content.springer.com/image/art%3A10.1186%2Fs40535-015-0007-5/MediaObjects/40535_2015_7_Fig3_HTML.gif
Figure 3

Linear boundary-based statistics.

That is, it performs task C to get the decomposition by Equation (10) on which we may directly get the measure ε C by Equation (44).

Alternatively, testing Equation (1) may be made by the following statistics from Equation (10):
$$\begin{array}{@{}rcl@{}} s=\frac{\# X^{(1)}_{1}+ \#X^{(0)}_{0}}{\#X^{(1)}+\#X^{(0)}}, \\ \text{or} \ s=\frac{\# X^{(1)}_{1}+ \#X^{(0)}_{0}}{\# X^{(1)}_{0}+ \#X^{(0)}_{1}}. \end{array} $$
(47)
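The boundary rule by Equation (46) and the two statistics by Equation (47) can be sketched as follows. We read the subscript of each subset as the true population and the superscript as the assigned class, in line with Equation (44); the function names and toy samples are hypothetical.

```python
def boundary_counts(samples, labels, w, mu):
    """Count the cells of the decomposition; key = (assigned class, true class)."""
    counts = {(a, t): 0 for a in (0, 1) for t in (0, 1)}
    for x, t in zip(samples, labels):
        # Equation (46): g(x, w) = w^T (x - mu), case if g > 0, control otherwise
        g = sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mu))
        assigned = 1 if g > 0 else 0
        counts[(assigned, t)] += 1
    return counts

def agreement_statistics(counts):
    """Equation (47): correct over all samples, and correct over incorrect."""
    correct = counts[(1, 1)] + counts[(0, 0)]
    wrong = counts[(1, 0)] + counts[(0, 1)]
    s_ratio = correct / (correct + wrong)
    s_odds = correct / wrong if wrong else float("inf")
    return s_ratio, s_odds

# toy two-dimensional samples (hypothetical values)
samples = [(1.0, 1.0), (-1.0, -0.5), (0.5, -1.0), (2.0, 0.0), (-2.0, 1.0), (1.0, 0.5)]
labels = [1, 0, 1, 1, 0, 0]
counts = boundary_counts(samples, labels, w=(1.0, 1.0), mu=(0.0, 0.0))
s_ratio, s_odds = agreement_statistics(counts)
```

Here four samples are classified correctly and two incorrectly, so the first statistic is 4/6 and the second is 2.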
There are also two other choices in Table 2. Choice (1) is a model-based test for task B from the perspective of one-dimensional samples of y t =w T x t . Focusing on a most discriminative direction, this test puts attention only on salient differences. As addressed later in Table 3, the test can be made together with testing H 0 by Equation (5) such that the rest of the entire sample space is taken into consideration.
Table 2 Two boundary-based tests for task B

(1): on the projected samples of y t =w T x t , we use the one dimensional case of Equation (24) or the Welch’s t-test to test Equation (1) merely along the normal direction of the boundary.

(2): measuring the distances of samples from a separating boundary, we consider \( s_{B}=\frac{\sum_{\mathbf{x}\in X^{(1)}_{1}\cup X^{(0)}_{0}} \left| \frac{\mathbf{w}^{T}(\mathbf{x}-\mathbf{c}_{0})}{\Vert \mathbf{w} \Vert}\right|^{q}}{\sum_{\mathbf{x}\in X^{(1)}_{0}\cup X^{(0)}_{1}} \left| \frac{\mathbf{w}^{T}(\mathbf{x}-\mathbf{c}_{0})}{\Vert \mathbf{w} \Vert}\right|^{q}+\gamma_{B}}, \ q \ge 0, \) with q=2 for the square distance, q=1 for the Euclidean one.

Table 3 Four types of integrative hypothesis tests

Type-1 (model based IHT): For Task A, each of two populations is modelled by a parametric model, with ε A measured by the negative log-likelihood by Equation (32) or its extension to generalisation error. For Task B, a model based test is made to compare the difference between two parametric models, with ε B by the corresponding p-value. For Task C, we get the classification by Equation (45), with ε C by Equation (44) or the p-value by a BBT via a statistics obtained from Equation (10).

Type-2 (boundary based IHT): A separating boundary is modelled by a hyperplane with its normal w, based on which Task D is handled by a boundary existence test by Equation (5) with ε D measured by the corresponding p-value. For Task C we get the classification by Equation (46) with ε C by Equation (44) or alternatively the corresponding p-value obtained by Equation (47), and for Task B we get the p-value by one of two BBT choices in Table 2.

Type-3 (mixing IHT): Mix the above two types with two populations and their separating boundary all in parametric models. A basic one uses ε A ,ε B from Type-1 and ε C ,ε D from Type-2. The other uses ε C ,ε D from Type-2 while ε A ,ε B are modified by Equation (58).

Type-4 (Ying-Yang IHT): Instead of mixing, the parametric models are jointly learned for two populations of samples and their separating boundary. One example is the BYY harmony learning based formulation to be introduced after Equation (60).

Choice (2) in Table 2 provides a statistics for task B on samples without dimension reduction. The statistics s B comes from considering that samples of \(X^{(1)}_{1}, \ X^{(0)}_{0} \) should be distant from the boundary (as illustrated by two blue arrows in Figure 3) while samples of \( X^{(1)}_{0}, \ X^{(0)}_{1}\) should not be far from this boundary (see two red arrows). Actually, s B is a special case of the ones given by Equations (26) and (30) in (Xu 2013a). The only difference is that γ B >0 is added here to trade off the contribution from \(X^{(1)}_{0}\cup X^{(0)}_{1}\).
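The statistics s B of choice (2) in Table 2 may be computed as in the following sketch. We assume that c 0 is a reference point on the boundary and that a sample counts as correctly classified when the sign of its signed distance agrees with its label; both are our reading for illustration, and the values of q and γ B are arbitrary.

```python
import math

def boundary_distance_statistic(samples, labels, w, c0, q=2.0, gamma_b=1.0):
    """s_B: summed |distance|^q of correctly classified samples divided by
    the summed |distance|^q of misclassified samples plus gamma_b."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    numerator = 0.0
    denominator = 0.0
    for x, t in zip(samples, labels):
        # signed distance of x from the hyperplane through c0 with normal w
        dist = sum(wi * (xi - ci) for wi, xi, ci in zip(w, x, c0)) / norm_w
        assigned = 1 if dist > 0 else 0
        if assigned == t:
            numerator += abs(dist) ** q
        else:
            denominator += abs(dist) ** q
    return numerator / (denominator + gamma_b)

# toy samples (hypothetical values): two correct, one misclassified
samples = [(2.0, 0.0), (-1.0, 0.0), (0.5, 0.0)]
labels = [1, 0, 0]
s_b = boundary_distance_statistic(samples, labels, w=(1.0, 0.0), c0=(0.0, 0.0),
                                  q=1.0, gamma_b=0.5)
```

With these values the numerator is 2 + 1 and the denominator is 0.5 + 0.5, so s B equals 3. A larger γ B reduces the influence of the misclassified samples on the statistics.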

Both choices in Table 2 are based on the boundary (i.e. either Equation (10) or y t =w T x t ) and thus are called boundary-based two-sample tests, or BBT in short. Different choices of BBT are also coupled with how w is obtained; see some examples outlined in Table 4.
Table 4 Some choices for obtaining w

(a): get w via FDA by Equation (11), as addressed in the previous subsection.

(b): estimate w by maximizing L by Equation (4), as to be addressed in the next subsection.

(c): get w as the normal direction of a separating hyperplane by one of machine learning approaches, e.g., support vector machine (SVM) (Cortes and Vapnik 1995; Suykens et al. 2002).

Replacing Equation (11) with the matrix-variate FDA by Equation (33), we get the projection y t =w T X t v column by column along the direction w and row by row along the direction v. With every appearance of x replaced by \(\boldsymbol {x}_{t}^{v}=X_{t}\mathbf {v}\), all the above studies directly apply. Similarly, we may also consider the dual representation \(y_{t}= \mathbf {v}^{T}\mathbf {x}_{t}^{w} \) with \(\boldsymbol {x}_{t}^{w} ={X}_{t}^{T}\mathbf {w}\) to get a linear separating boundary featured by v. It follows from Equations (19) and (20) that w and v jointly form a linear boundary by vec[ O] to separate samples of vec[ X t ].

Furthermore, extension can be made on the generalised bi-linear form via Equation (40) and Equation (41), with each x replaced by \(\boldsymbol {x}_{t}^{v} \) given in Equation (40).

Extensions can be also made on the generalised bi-linear form by Equation (35). Samples of two populations are projected into a dimension-reduced matrix Y t =V T X t W, and then, a matrix-variate Hotelling test can be made by Equation (28) with X t replaced by Y t and the subscript x replaced by y, where the matrices W,V actually take the roles of the boundary.

Matrix-variate logistic regression

Testing H 0 by Equation (5) has been widely studied in the literature of logistic regression. Actually, the role of this w is the same as the one in Equation (46), i.e. a discriminative boundary that separates every sample into either ω=1 or ω=0. Thus, the choices in Table 4 can be cross-utilised for mutual benefit. For example, getting w via FDA by Equation (11) is relatively easy to compute and thus provides an initialisation for estimating w by Equation (4). Conversely, the advantage of Equation (3) over FDA is that dummy or design variables may be taken into consideration for learning w, e.g. we extend ζ t =y t +c in Equation (3) into:
$$\begin{array}{@{}rcl@{}} {\zeta}_{t} =y_{t}+z_{t}+c, \ z_{t}=\mathbf{b}^{T} \boldsymbol{\xi}_{t}, \end{array} $$
(48)

where ξ t consists of dummy variables. Moreover, random effects may also be added, in a way similar to that of the linear mixed model by Equation (15).

Testing H 0 by Equation (5) is typically handled with the Wald test by Equation (7) or Rao’s score test by Equation (8), for which the score vector and the information matrix are given as follows (Pan et al. 2014):
$$\begin{array}{@{}rcl@{}} \Delta(\mathbf{w})=\sum_{t =1}^{N} (\omega_{t} -\bar \omega)(\mathbf{x}_{t}-\mathbf{c}), \ \mathbf{c}=\frac{1}{N}\sum_{t =1}^{N}\mathbf{x}_{t}, \cr I(\mathbf{w})= \bar \omega (1-\bar \omega) \sum_{t =1}^{N} (\mathbf{x}_{t}-\mathbf{c}) (\mathbf{x}_{t}-\mathbf{c})^{T}, \end{array} $$
(49)

where \( \bar \omega \) denotes the mean of ω t .
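The score vector and the information matrix by Equation (49) are straightforward to compute. The sketch below also forms the quadratic statistic Δ T I −1 Δ, assuming that Rao's score test by Equation (8) takes this usual form; under H 0 such a statistic is conventionally compared with a chi-square distribution whose degrees of freedom equal the dimension of x. The function name is hypothetical.

```python
import numpy as np

def score_test_statistic(X, omega):
    """Score vector and information matrix of Equation (49), plus the
    quadratic form delta^T I^{-1} delta (assumed form of Rao's score statistic).

    X: array of shape (N, d); omega: array of N labels in {0, 1}.
    """
    X = np.asarray(X, dtype=float)
    omega = np.asarray(omega, dtype=float)
    c = X.mean(axis=0)                         # sample mean of x_t
    omega_bar = omega.mean()                   # mean of the labels
    Xc = X - c
    delta = Xc.T @ (omega - omega_bar)         # score vector, shape (d,)
    info = omega_bar * (1.0 - omega_bar) * (Xc.T @ Xc)   # information matrix, (d, d)
    stat = float(delta @ np.linalg.solve(info, delta))
    return delta, info, stat
```

A large value of the statistic speaks against H 0 by Equation (5), i.e. in favour of the existence of a boundary w.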

Being different from the BBT addressed in the previous subsection, testing H 0 by Equation (5) directly aims at whether a boundary w exists. Such a test is thus named boundary existence test. It is widely known as a test for regression analyses. Also, we may regard it as a two-sample test that is complementary to the BBT choice (1) in Table 2. The two tests jointly cover the entire space of samples.

The boundary existence test actually tackles another essential problem of discriminative analysis, namely, task D in Table 1. Given two populations with a finite sample size, it is not difficult to draw a boundary to separate them if there is no restriction on the complexity of the boundary. However, a boundary with a high complexity will be unreliable to separate new samples that come randomly from the same populations. To be reliable, the boundary should have an appropriate complexity too. It follows from Equation (45) that an optimal separating boundary is related to the models q(x|θ 1) and q(x|θ 0). In other words, an appropriate boundary complexity is related to an appropriate model complexity. Thus, task D and task A in Table 1 are coupled.

Typically, we consider a linear boundary because of its simple complexity. In the literature of pattern recognition (Cortes and Vapnik 1995; Cover 1965) efforts on whether samples of two populations are linearly separable by a hyperplane or a maximum-margin hyperplane can be regarded as examples related to task D in Table 1.

Next, we proceed to consider matrix-variate logistic regression. Putting the case and control samples into a paired set {X t ,ω t },t=1,…,N, we extend Equation (3) with the inner product y t =w T x t to be replaced by the bi-linear form by Equation (18) or its extension by Equation (40).

Given V, the above studies directly apply when \(\boldsymbol {x}_{t}^{\textbf {v}}\) in Equation (40) replaces x t in Equations (3), (4), (7), and (8). The task of learning w,V can be made via the matrix-variate FDA by Equations (34) or (42).

Alternatively, we may estimate w,V via the maximum likelihood L by Equation (4) with the advantage of taking the effect of covariates into consideration. With −L written as J(w,V), we get it solved by Equation (37) with W replaced by w, e.g. implemented by the following gradient-based updating (Hosmer et al. 2013):
$$ \begin{aligned} \mathbf{w}^{\text{new}}&= \mathbf{w}^{\text{old}}- \eta_{w} \nabla_{\mathbf{w}} J\left(\mathbf{w}^{\text{old}}, v^{\text{old}}\right), \\ v^{\text{new}}&=v^{\text{old}}- \eta_{V} \nabla_{\mathbf{v}} J\left(\mathbf{w}^{\text{new}},v^{\text{old}}\right), \end{aligned} $$
(50)

where η w >0,η V >0 are small learning step sizes.
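A minimal sketch of the updating by Equation (50) is given below for the bi-linear form y t =w T X t v. It assumes that Equation (3) is the usual logistic model p(ω t =1|X t )=1/(1+e −ζ t ) with ζ t =y t +c, uses a single step size `eta` for w, v, and c, and divides the gradients by N, which only rescales the step size. The function names and default values are hypothetical.

```python
import numpy as np

def neg_log_likelihood(X, omega, w, v, c):
    """J = -L for the assumed logistic model with zeta_t = w^T X_t v + c."""
    zeta = np.einsum("i,tij,j->t", w, X, v) + c
    # log(1 + exp(zeta)) - omega * zeta, evaluated in a numerically stable way
    return float(np.sum(np.logaddexp(0.0, zeta) - omega * zeta))

def _posterior(X, w, v, c):
    zeta = np.einsum("i,tij,j->t", w, X, v) + c
    return 0.5 * (1.0 + np.tanh(0.5 * zeta))       # stable logistic function

def fit_bilinear_logistic(X, omega, eta=0.1, n_iter=500, seed=0):
    """Gradient updates in the spirit of Equation (50): w first, then v
    with the freshly updated w. X has shape (N, m, d); omega holds 0/1 labels."""
    rng = np.random.default_rng(seed)
    N, m, d = X.shape
    w = 0.1 * rng.standard_normal(m)               # small random start (w = v = 0 is a saddle)
    v = 0.1 * rng.standard_normal(d)
    c = 0.0
    for _ in range(n_iter):
        r = omega - _posterior(X, w, v, c)         # residuals omega_t - p_t
        # grad_w J = -sum_t r_t X_t v, hence w_new = w_old + eta * mean_t(r_t X_t v)
        w = w + eta * np.einsum("t,tij,j->i", r, X, v) / N
        r = omega - _posterior(X, w, v, c)
        # grad_v J = -sum_t r_t X_t^T w, evaluated at the new w
        v = v + eta * np.einsum("t,tij,i->j", r, X, w) / N
        c = c + eta * r.mean()
    return w, v, c
```

Since only the product of the scales of w and v matters, one of them may be renormalised to unit length after each sweep if a unique representation is preferred.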

Also, we may test the dual problem of Equation (5) as follows:
$$\begin{array}{@{}rcl@{}} H_{0} : \ \mathbf{v}=\mathbf{0}, \end{array} $$
(51)

for the bi-linear form by Equation (18) simply with v replacing w in Equations (6), (7), (8), and (49). Similarly, extension may also be made to test H 0: v i =0, ∀i.

Moreover, we may also apply Equation (21) to develop a statistics as follows:
$$\begin{array}{@{}rcl@{}} s_{KL}=\sum_{t} KL (p(\omega_{t} | \boldsymbol{x_{t}}, \theta^{\ast}) || p(\omega_{t} | \boldsymbol{x_{t}}, \hat \theta)), \end{array} $$
(52)

with p(ω t |x t ,θ) given by Equation (3), where θ ∗ is estimated via maximising L by Equation (4) under H 0 by Equation (5) and \(\hat \theta \) is estimated via maximising L by Equation (4) without H 0.

Similarly, we may get a matrix-variate Cox regression with the inner product w T x t in Equation (13) replaced by the bi-linear form by Equation (18) or its extension by Equation (40). Accordingly, we test the H 0 by Equation (5) and the H 0 by Equation (51), using the Wald test by Equation (7) or Rao’s score test by Equation (8), with Δ(w),I(w) computed from Equation (6) but L given by the partial likelihood L(w).

Furthermore, the univariate y t can be extended into a vector or matrix Y t . One typical example is a bi-linear regression of Y t by Equation (35), that is we consider:
$$\begin{array}{@{}rcl@{}} Y_{t}=V^{T}X_{t}W+ E_{t}, \end{array} $$
(53)

where E t is independent of X t and comes from N(Y t −V T X t W|0,Λ,D) by Equation (26), while both Λ,D are diagonal matrices.

Again, there are two choices to estimate W,V. One is the matrix-variate FDA by Equation (36). The other is maximising the following likelihood:
$$\begin{array}{@{}rcl@{}} L= \sum_{t =1}^{N} \ln N \left(Y_{t} - V^{T}X_{t}W | 0, \Lambda, D\right). \end{array} $$
(54)
Particularly, when Λ=λ I,D=d I, we are led to the following least squares error approach:
$$\begin{array}{@{}rcl@{}} \min_{W,V} J(W,V), \ J(W,V)=\cr \sum\limits_{t =1}^{N} Tr\left[\left(Y_{t} - V^{T}X_{t}W\right) \left(Y_{t} - V^{T}X_{t}W\right)^{T}\right], \end{array} $$
(55)

which may be again handled by Equation (37) with w replaced by W.
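For the least squares error J(W,V) by Equation (55), each step of Equation (37) has a closed form, because J is quadratic in W for fixed V and quadratic in V for fixed W. The following sketch alternates the two normal-equation solves; the array layout, the random initialisation, and the function names are assumptions made for illustration.

```python
import numpy as np

def sq_error(Y, X, V, W):
    """J(W, V) of Equation (55): summed squared residuals of Y_t - V^T X_t W."""
    resid = Y - np.einsum("ik,tij,jl->tkl", V, X, W)
    return float((resid ** 2).sum())

def bilinear_als(Y, X, n_iter=30, seed=0):
    """Alternating least squares for Y_t ≈ V^T X_t W.

    X: shape (N, m, d); Y: shape (N, m_s, d_s).
    Returns V of shape (m, m_s) and W of shape (d, d_s).
    """
    rng = np.random.default_rng(seed)
    N, m, d = X.shape
    m_s, d_s = Y.shape[1:]
    V = rng.standard_normal((m, m_s))
    W = rng.standard_normal((d, d_s))
    for _ in range(n_iter):
        # fix V: with A_t = V^T X_t, solve (sum_t A_t^T A_t) W = sum_t A_t^T Y_t
        A = np.einsum("ik,tij->tkj", V, X)           # shape (N, m_s, d)
        lhs = np.einsum("tkj,tkl->jl", A, A)         # shape (d, d)
        rhs = np.einsum("tkj,tkl->jl", A, Y)         # shape (d, d_s)
        W = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
        # fix W: with B_t = X_t W, solve (sum_t B_t B_t^T) V = sum_t B_t Y_t^T
        B = X @ W                                    # shape (N, m, d_s)
        lhs = np.einsum("til,tjl->ij", B, B)         # shape (m, m)
        rhs = np.einsum("til,tkl->ik", B, Y)         # shape (m, m_s)
        V = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
    return V, W
```

Note that V and W are identifiable only up to a reciprocal scaling, since replacing V by aV and W by W/a leaves V T X t W unchanged.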

It can be observed that Equation (53) is an extension of Equation (17) with F=0. On the other hand, we may extend Equation (17) into a bi-linear extension as follows:
$$\begin{array}{@{}rcl@{}} Y=V^{T}XW+Z \mathbf{F} + \mathbf{E}, \end{array} $$
(56)
which degenerates to:
$$\begin{array}{@{}rcl@{}} \mathbf{y}=V^{T}X\mathbf{w}+Z \mathbf{f} + \mathbf{e}, \end{array} $$
(57)

as a bi-linear mixed model extended from Equation (15).

Integrative hypothesis test

Discriminative analysis and testing of H 0 by Equation (1) are made from either a model-based perspective (e.g. performing task A and task B in Table 1) or a boundary-based perspective (e.g. performing task C and task D in Table 1). Moreover, all the four tasks are associated with another problem called feature selection, that is, selecting a number of elements in x to form a subset x f such that one or more of the four tasks achieves a good enough performance.

In the existing efforts, each of four tasks has been studied individually, with each having its strength and limited coverage. However, performances of these tasks are coupled, and thus, a best set of features for one task may not be necessarily the best for the others.

The complementary nature of task B and task C was preliminarily discussed in Section VI in (Xu 2012a), where a model-based test for task B is named as A-test (a test in the observed data domain) and a boundary-based test for task C is named as I-test (a test in the inner representation domain). Under the name of IHT, good performances of task B and task C are demanded jointly (Xu 2013a, 2013b). This paper further extends IHT to include task A and task D.

We start with jointly optimising the performances of task B and task C. Its necessity and feasibility are empirically justified with the help of the 2D scattering plots of ε B by the p value for measuring the performance of task B and ε C by the misclassification rate for measuring the performance of task C. A small ε B indicates a big difference between q(x|θ 0) and q(x|θ 1) from an overall perspective, and a small ε C indicates a good classification of samples from a separating boundary perspective. Illustrated in Figure 4 are two examples obtained from one empirical study.
https://static-content.springer.com/image/art%3A10.1186%2Fs40535-015-0007-5/MediaObjects/40535_2015_7_Fig4_HTML.gif
Figure 4

Necessity and feasibility of evaluating the performances of tasks B and C, on the samples of gene expressions. A scattering point denotes a performance pair with the x-axis for p value and the y-axis for misclassification, associated with one miRNA for (A) and two miRNAs for (B).

As indicated by the blue vertical dashed line in Figure 4, there are many miRNAs that share the same small p value ε B but take different values of misclassification ε C over a wide range. Also, as indicated by the blue horizontal dashed line in Figure 4, there could be multiple miRNAs that take the same misclassification but different p values. In other words, though the performance of one task is optimised, the performance of the other can still be poor. Thus, we need to jointly seek good performances of both tasks, i.e. IHT is necessary. On the other hand, it is observable from the red dots within the blue circle in Figure 4 that there are indeed a few scattering points with each taking both a small p value ε B and a small misclassification ε C , i.e. it is also feasible to achieve the goal of IHT.

Such a 2D plot’s evaluation provides a tool for better joint performances of task B and task C, by which we may interactively observe the configuration of scattering points and locate the candidate points that are nearest to the origin of the coordinate space.
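Locating the candidate points nearest to the origin can also be automated, as in the sketch below. The Euclidean distance on the raw (ε B ,ε C ) pairs is an assumption made for illustration (the two axes may need rescaling in practice), and the feature names and values are made up rather than taken from Figure 4.

```python
import math

def rank_candidates(performance, top=3):
    """Return the names of the `top` candidates whose (eps_b, eps_c) pairs
    lie nearest to the origin, ordered from nearest to farthest."""
    ordered = sorted(performance, key=lambda name: math.hypot(*performance[name]))
    return ordered[:top]

# hypothetical performance pairs: name -> (p value, misclassification rate)
performance = {
    "feature_1": (0.01, 0.40),   # small p value but poor classification
    "feature_2": (0.02, 0.05),   # good on both axes
    "feature_3": (0.30, 0.04),   # good classification but large p value
    "feature_4": (0.01, 0.10),
}
best = rank_candidates(performance, top=2)
```

In this toy example, `feature_2` and `feature_4` are selected, whereas `feature_1` and `feature_3` each do well on only one of the two measures.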

Extensions can be further made to a joint evaluation of the IHT performance with task A and task D also included, such that the strengths of different tests and methods are integrated in a rather systemic way, for which we address four types of IHT in Table 3.

From the model-based perspective, the first type is an extension of the one addressed in Figure 2, with ε C added in to get a 3D plot for a joint evaluation of ε A , ε B , and ε C . Instead of Equation (45), we may get ε C by some nonparametric classifiers, e.g. the classic kNN classifier and the kernel classifiers (Williams 2003). However, we are then unable to handle task D because the boundary involved here does not have an explicit expression to be tested.

From the boundary-based perspective, the second type considers samples jointly by a separating boundary and projected samples, evaluated by ε D for the existence of boundary, ε C for the misclassification by the boundary, and ε B for measuring the difference of two populations either along the normal direction of the boundary or according to the sample deviations from the boundary. Again, we may use a 3D plot for a joint evaluation of ε B , ε C , and ε D . However, it is difficult to handle task A merely based on the boundary.

The mix-modelled type of IHT combines the above two types to avoid the weak points of each. Two typical examples are listed in Table 3. One picks ε A ,ε B from type (1) and ε C ,ε D from type (2) for a joint evaluation. The other modifies ε A ,ε B by taking the outcome by Equation (10) of the boundary into consideration, with the originally estimated θ 0 and θ 1 replaced by the following maximum likelihood estimation:
$$\begin{array}{@{}rcl@{}} {\small{\max_{\theta} q \left(X^{(0)}_{0} | \theta\right)\ \text{and} \ \max_{\theta} q\left(X^{(1)}_{1}| \theta\right).}} \end{array} $$
(58)

Even better, we may estimate each θ ω by the maximum likelihood on the entire set X of samples but with the likelihood of each sample weighted by its corresponding posterior p(ω|sample) by Equation (3).

BYY-harmony-learning-based formulation

The 2D plots and 3D plots only provide a preliminary tool for IHT; we need further studies on not only appropriate combinations of multiple p values and misclassification rates but also simultaneous optimisation of multiple measures. For the latter purpose, the mix-modelled IHT in Table 3 is further extended via iteratively learning θ 0 and θ 1 by Equation (58) to update the models \(q\left (X^{(0)}_{0} | \theta _{0}\right), q\left (X^{(1)}_{1}| \theta _{1}\right)\) and also re-estimating the boundary w, e.g. by an FDA method based on the updated models.

Leaving the task D for a future study, in the sequel, we further understand the task of learning the models from a perspective of learning a Ying machine and the task of learning the boundary from a perspective of learning a Yang machine, which leads to a BYY-harmony-learning-based formulation for IHT.

We start from revisiting Equation (29) from an IHT perspective. From α 1 q(x|θ 1)=q(x|θ)−α 0 q(x|θ 0), we consider the task B by the following measure:
$$ {\small{\begin{aligned} {KL}_{10}&=KL (\alpha_{1} q(\mathbf{x} |\theta_{1}) || \alpha_{0} q(\mathbf{x} |\theta_{0})) \\ &=\int \alpha_{1} q(\mathbf{x} |\theta_{1})\ln{[\!\alpha_{1} q(\mathbf{x} |\theta_{1})]}d\mathbf{x}-e^{c}_{1,0}\\ &=L_{1}-\left(e^{c}_{0,1}+e^{c}_{1,0}\right), \\ L_{1}&=\int q(\mathbf{x} |\theta)\ln{[\!\alpha_{1} q(\mathbf{x} |\theta_{1})]}d\mathbf{x}, \\ e^{c}_{i,j}&=\int \alpha_{i} q(\mathbf{x} |\theta_{i})\ln{\left[\alpha_{j} q(\mathbf{x} |\theta_{j})\right]}d\mathbf{x}, \end{aligned}}} $$

from which we observe that a large \({KL}_{10}\) comes from a large L 1 that reflects a good modelling of α 1 q(x|θ 1) (i.e. a good performance of task A) and a small confusion error \(e^{c}_{0,1}+e^{c}_{1,0}\) that is closely related to a small misclassification (i.e. a good performance of task C). In other words, three tasks are coordinately optimised.

However, a good modelling of the control samples has not been taken into consideration in \({KL}_{10}\), which may be further improved by considering:
$$\begin{array}{@{}rcl@{}} {KL}_{\text{sum}}=\frac{L_{1}+L_{0}}{2}-\left(e^{c}_{0,1}+e^{c}_{1,0}\right), \cr L_{0}=\int q(\mathbf{x} |\theta)\ln{\left[\alpha_{0} q(\mathbf{x} |\theta_{0})\right]}d\mathbf{x}. \end{array} $$
(59)

With this \({KL}_{\text{sum}}\), we still need to get θ ω ,ω=0,1 by the ML learning. In other words, \({KL}_{\text{sum}}\) merely takes a role of evaluating the performances of task B and task C but does not have a port to accommodate samples for estimating θ ω ,ω=0,1. Favourably, such a port is provided in the BYY harmony learning such that task A, task B, and task C are all jointly implemented.

First proposed in (Xu 1995) and systematically developed over the past two decades, the BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing learning regularisation, and a class of algorithms that implement automatic model selection during parameter learning. Readers are referred to (Xu 2010, 2012b, 2015) for the latest introduction about the BYY harmony learning.

Briefly, a BYY system consists of a Yang machine and Ying machine corresponding to two types of decomposition, namely, Yang p(R|X)p(X) and Ying q(X|R)q(R), respectively. The data X is regarded as generated from its inner representation R that consists of latent variables Y and parameters θ. The harmony measure is mathematically expressed as follows:
$$\begin{array}{@{}rcl@{}} H(p||q)=\int p({R}| {X})p({X})\ln[q({X}| {R})q({R})]d{X}d{R}. \end{array} $$
(60)

Maximising this H(p||q) makes this Ying Yang pair not only best matched but also have the least complexity. Such an ability can also be further observed from several perspectives (see Section 4.1 in (Xu 2010)).

Applied to α 1 q(x|θ 1) and α 0 q(x|θ 0), we have:
$$ \begin{aligned} H(p||q)&= \sum_{\omega=0,1} \int p(\omega| \mathbf{x}_{t})p(\mathbf{x})\ln[\!\alpha_{\omega} q(\mathbf{x} |\theta_{\omega}) ]d\mathbf{x}, \\ p(\omega| \mathbf{x}_{t})&=\frac{\alpha_{\omega} q(\mathbf{x} |\theta_{\omega}) }{\sum_{\omega=0, 1} \alpha_{\omega} q(\mathbf{x} |\theta_{\omega})}, \end{aligned} $$
(61)

where p(x) provides a port to accommodate samples \(\{\mathbf {x}_{t}\}^{N}_{t=1}\) via an empirical \(p(\mathbf {x})=\frac {1}{N}\sum _{t} \delta (\mathbf {x}-\mathbf {x}_{t})\) with δ(x) being the Dirac delta, which thus makes it possible to estimate θ ω ,ω=0,1 via maximising H(p||q).

It follows from p(0|x t )+p(1|x t )=1 that we get:
$$ \begin{aligned} H(p||q)&= {L_{1}^{H}}+{L_{0}^{H}} -\left(e^{H}_{0,1}+e^{H}_{1,0}\right),\\[-3pt] L_{\omega}^{H}&=\int p(\mathbf{x})\ln{[\!\alpha_{\omega} q(\mathbf{x} |\theta_{\omega})]}d\mathbf{x},\\[-3pt] e^{H}_{i,j}&=\int p(j| \mathbf{x}_{t})p(\mathbf{x})\ln{[\!\alpha_{i} q(\mathbf{x} |\theta_{i})]}d\mathbf{x}. \end{aligned} $$
(62)

Approximately considering p(x)≈q(x|θ), \(e^{H}_{0,1}+e^{H}_{1,0}\approx e^{c}_{0,1}+e^{c}_{1,0}\), and \({L_{1}^{H}}+{L_{0}^{H}}\approx L_{1} +L_{0}\), we observe that H(p||q) shares a nature similar to K L sum in Equation (59), while a difference is that the modelling part \({L_{1}^{H}}+{L_{0}^{H}}\) is provided with a port p(x) to accommodate samples such that task A can be performed via maximising H(p||q) without a need of separately estimating θ ω by the ML learning.

For q(x|θ)=G(x|c,Σ), we implement the maximisation of H(p||q) to estimate θ ω by directly adopting the semi-supervised BYY harmony learning for Gaussian mixture given in (Xu 2015), i.e. its algorithm 9, by which the performances of task A, task B, and task C are coordinated. Moreover, H(p||q) can be extended into its matrix-variate counterpart. Particularly, algorithm 9 in (Xu 2015) can be extended into the algorithm given below for learning \(\alpha _{\omega }N\left (X|C^{x}_{\omega },\Omega ^{x}_{\omega },\Sigma ^{x}_{\omega }\right)\).

https://static-content.springer.com/image/art%3A10.1186%2Fs40535-015-0007-5/MediaObjects/40535_2015_7_Figa_HTML.gif

During implementation of the above algorithm, not only is task A performed but task C can also be handled simply in the Yang step by checking whether \(p_{1| \textbf {x}_{t}} \ge p_{0| \textbf {x}_{t}} \) to classify each sample as a case or a control. Also, task B can be performed after learning by putting the resulting parameters into s KL =K L 10 or s KL =K L sum to get the corresponding p value.

Last but not least, considering semi-supervised learning, we also propose an improved procedure in Table 5 for training, testing, and validating on a small size of samples.
Table 5 Semi-supervised testing and validating

Issue-1: Estimate the parameters by semi-supervised learning on the training set, from which we get the corresponding p-value p and a classifier. Using this classifier on the training set and the testing set, it follows from Equation (44) that we get \(\varepsilon _{C}^{tr}\) and \(\varepsilon _{C}^{te}\). This is what we traditionally get.

Issue-2: Lump the training samples and testing samples together and estimate the parameters by semi-supervised learning on the lumped set, from which we also get the corresponding \(\tilde {p}\), \(\tilde {\varepsilon }_{C}^{tr}\) and \(\tilde {\varepsilon }_{C}^{te}\).

Issue-3: \(\tilde {p}\) is actually more reliable than p because testing samples are used for regularising parameter estimation. This \(\tilde {p}\) is also different from the traditional compounded p-value because the label information of testing samples has not been compounded.

Issue-4: Without using the label information of testing samples, \(\tilde {\varepsilon }_{C}^{te}\) shares the same concept as \(\varepsilon _{C}^{te}\), but is actually more reliable because of regularisation.

Issue-5: Merge the training set and testing set to get a big training set and treat the validating set as a new testing set, which extends this procedure to improve the validation.

Integrating p values, inferring rejection domain, and S-space boundary-based tests

Each IHT type in Table 3 involves more than one measure, which raises the problem of how different measures are jointly evaluated. Though 2D or 3D plots provide a possible joint evaluation, how to appropriately scale each measure is still a challenging issue. In general, we need to integrate multiple measures into a scalar index based on which the joint performance can be evaluated, which relates closely to efforts made on combining multiple classifiers (Xu and Amari 2008; Xu et al. 1992b) and evidence combination (Barnett 2008).

For an IHT task, the final scalar index is typically the p value. When multiple measures are all in the p values, what we encounter becomes the task of p value combination, e.g. by the Fisher combination (Fisher 1948).
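For illustration, the Fisher combination can be sketched as follows, assuming that the k p values are independent under H 0 so that \(-2\sum_{i}\ln p_{i}\) follows a chi-square distribution with 2k degrees of freedom. The function name `fisher_combination` is hypothetical; the closed-form tail probability used here holds only for an even number of degrees of freedom, which is always the case in this setting.

```python
import math

def fisher_combination(p_values):
    """Combine independent p values; returns (statistic, combined p value)."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)   # each p must lie in (0, 1]
    # Survival function of chi-square with 2k degrees of freedom:
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return stat, math.exp(-half) * total
```

With a single p value the combination returns that p value unchanged, which is a convenient sanity check.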

In Table 3, ε B and ε D are already given in p values. But ε A is usually measured by a square error or negative log-likelihood, and ε C is measured by a misclassification rate. Alternatively, ε C may be given in a p value via the statistics in Equation (47). Letting s=−ε A or generally s=−ε for a monotonic measure ε≥0 that prefers values close to zero, we may get the corresponding p value with help of the permutation method.

However, p value combination has a weak point. Each p value is merely a positive number that indicates the false alarm probability, losing certain useful information already. Under the term meta-analysis (Evangelou and Ioannidis 2013), efforts have been made by transforming p values into multiple Z statistics such that the missing information is added in without or with help of information directly from data (Zaykin 2011).

Actually, the Hotelling T 2 statistics by Equation (24) and getting a statistics by Equation (21) may also be regarded as examples that get an integrated statistics s f . Generally, a multivariate hypothesis test may also be regarded as an integration of multiple univariate hypothesis tests.

Typically, an integrated statistics s f =g(s,Ψ)≥0 comes from s=[s (1),…,s (d)] such that s f ≥0 monotonically increases as the situation deviates further from H 0, where each s (i) comes from one univariate hypothesis test (e.g. s=c 1−c 0 in the Hotelling T 2 statistics) with a set Ψ of parameters shaping the integration (e.g. the covariance Σ in the Hotelling T 2 statistics). The set Ψ is specified without or with help of information obtained directly from input data. A critical value \({\tilde s}_{f}\) is computed from the original pair of the sample set X 0,X 1. Then, the false alarm probability \(p(s_{f}>{\tilde s}_{f}|H_{0})\) is obtained as the p value, where and hereafter p(·|H 0) denotes a probability under the condition that H 0 is satisfied.

However, choices for such an s f =g(s,Ψ) are very limited in the existing studies, mostly in a quadratic form such as Hotelling statistics, Rao’s score by Equation (8), and the Wald test by Equation (7). This is equivalent to approximately regarding s (1),…,s (d) as coming from a multivariate Gaussian distribution, while other distributions have seldom been studied.

Instead of seeking an integrated statistics s f , we directly seek the domain \(\Gamma (\boldsymbol {\tilde s})\) of rejecting H 0 in the space of s based on a critical vector \(\boldsymbol {\tilde s}\) as follows:
$$\begin{array}{@{}rcl@{}} \Gamma(\boldsymbol{\tilde s}) \ \text{with} \ \boldsymbol{\tilde s}_{{X}_{1||0}}=I_{nf}(X_{0}||X_{1}), \end{array} $$
(63)

where \(\boldsymbol {\tilde s}_{{X}_{1||0}}=I_{\textit {nf}}(X_{0}||X_{1})\) means that \(\boldsymbol {\tilde s}\) is inferred from the given sample set X 0,X 1 by an inferring method I nf , and the subscript X 1||0 is used as the abbreviation of X 1||X 0, which will be used whenever its omission will not cause confusion.

Then, the test is made by checking the probability that s falls in \(\Gamma (\boldsymbol {\tilde s})\) under H 0, that is:
$$\begin{array}{@{}rcl@{}} p\left(\boldsymbol{s}\in \Gamma\left(\boldsymbol{\tilde s}\right) |H_{0}\right)=p\left(\boldsymbol{s}\in \Gamma\left(\boldsymbol{\tilde s}_{{X}_{1||0}}\right) |H_{0}\right). \end{array} $$
(64)
We estimate the p value by a permutation test. That is, we get a new pair of sample sets \(X_{0}^{\pi }, X_{1}^{\pi }\) from X 0,X 1 by a permutation π that shuffles each label ω of x t,ω and then we obtain:
$$\begin{array}{@{}rcl@{}} p\left(\boldsymbol{s}\in \Gamma\left(\boldsymbol{\tilde s}\right) |H_{0}\right)=\frac{1}{\# \Pi} \left\{1+\sum_{\pi \in \Pi} I\left(\boldsymbol{s}_{X^{\pi}_{1|0}}\in \Gamma\left(\boldsymbol{\tilde s}\right)\right) \right\},\\ I(u)=\left\{\begin{array}{ll} 1, & u\ \text{is}\ \text{true}, \\ 0, & \text{otherwise}, \end{array}\right. \end{array} $$
(65)

where # S denotes the cardinality of a set S, the subscript \(X^{\pi }_{1|0}\) is used as the abbreviation of \(X_{0}^{\pi }||X_{1}^{\pi }\), and Π consists of a large enough set of permutations made by either enumeration or random shuffling, including that π=empty denotes the sample pair X 0,X 1.
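The permutation estimate of Equation (65) can be sketched in Python as below. This is only an illustration under stated assumptions: the names `permutation_p_value`, `statistic`, and `in_rejection` are hypothetical, Π is built by random shuffling rather than enumeration, and the original pair is counted once through the "+1" term, so that # Π equals the number of random shuffles plus one.

```python
import numpy as np

def permutation_p_value(x0, x1, statistic, in_rejection, n_perm=999, seed=0):
    """Estimate p(s in Gamma | H0) by shuffling the case-control labels.

    statistic(x0, x1) returns the statistic s of a sample pair;
    in_rejection(s) tells whether s falls in the rejection domain Gamma.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x0, x1], axis=0)
    n0 = len(x0)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))            # shuffle the labels
        s_perm = statistic(pooled[idx[:n0]], pooled[idx[n0:]])
        hits += bool(in_rejection(s_perm))
    # the "+1" accounts for the original, unshuffled pair
    return (1 + hits) / (1 + n_perm)
```

For a univariate mean difference, for instance, one may pass `statistic=lambda a, b: b.mean() - a.mean()` and a one-tail rejection rule built from the statistic of the original pair.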

Recalling the classic studies of getting an integrated statistics s f , we observe that \({\tilde s}_{f}=g(\boldsymbol {s},\Psi)\) actually defines a closed shell or boundary that divides the space of multivariate statistics s (shortly S-space) into two parts, with its inside as the acceptance domain and its outside as the rejection domain \(\Gamma (\boldsymbol {\tilde s})\). For example, the acceptance domain obtained by both the Hotelling statistics and Rao’s score by Equation (8) is a hyper-elliptic volume. We may further extend a hyper-elliptic volume to a bounded volume in another shape. Actually, a bounded acceptance domain corresponds to a probabilistic modelling by a single-mode distribution. Thus, the corresponding tests are called S-space model-based tests.

On the other hand, we also have an S-space boundary-based test (BBT), as summarised in Table 6. It should not be confused with the BBTs in the space of input data (shortly D-space), such as those previously addressed in Tables 2 and 3, as well as in Figure 3. Those are two-sample tests with the boundary separating two populations in the D-space, while the S-space BBTs may correspond to any tests in the D-space.
Table 6 S-space boundary based test (BBT)

Step (1): Infer \(\tilde {\mathbf {s}}=I_{\textit {nf}}(X_{0}|| X_{1})\) in the multidimensional space of statistics s, where \(\tilde {\mathbf {s}}_{{X}_{1||0}}=I_{\textit {nf}}(X_{0}||X_{1})\) means that \(\tilde {\mathbf {s}}\) is inferred from the given sample set X 0,X 1 by an inferring method I nf , and the subscript X 1||0 is used as the abbreviation of X 1||X 0, which will be used whenever its omission will not cause confusion.

Step (2): Use \(\tilde {\mathbf {s}}\) to design an unbounded boundary that divides the space of statistics s into two separated and unbounded half-spaces.

Step (3): Take the half-space that does not contain the origin 0 as the rejection domain \(\Gamma (\tilde {\mathbf {s}})\), with the corresponding boundary side named as the R-side. The other one is the acceptance domain.

Step (4): Tend to reject H 0 as s deviates from the R-side of the boundary with a nonzero distance. The larger the distance is, the more seriously H 0 breaks.

Also, integration can be made by considering the complementarity of S-space BBTs and S-space model-based tests, via combining \(\Gamma (\boldsymbol {\tilde s})\) and the acceptance domains, obtained from not only the above complementary aspects, but also different sources, e.g. a bottom-up source from univariate tests on input data and a top-down source inversely transformed from the p values via a meta-analysis (Evangelou and Ioannidis 2013). Also, based on the resulting \(\Gamma (\boldsymbol {\tilde s})\), an easily computed expression \(s_{f}=g(\boldsymbol {s},\Gamma (\boldsymbol {\tilde s}))\) may be obtained to get an asymptotic distribution \(p(s_{f}|\Gamma (\boldsymbol {\tilde s}))\) for a fast estimation of the p value; see the examples given after Equation (70).

S-space BBT for the multivariate zero mean

Testing H 0 by Equation (1) for the case-control studies can be formulated into testing whether a multivariate statistics s=[s (1),…,s (d)] takes a point far away from the origin of the multidimensional space. One example is a two-sample test that examines the following null:
$$\begin{array}{@{}rcl@{}} H_{0} : \textbf{s}=\textbf{c}_{1}-\textbf{c}_{0}=0, \end{array} $$
(66)

by the Hotelling T 2 statistics. The second example is the Wald testing statistics by Equation (7), and another example will be given in the next subsection.

In the existing studies, such a test is typically made via either the \({\chi ^{2}_{k}}\) statistics or Hotelling’s T 2 statistics. Also, Rao’s score by Equation (8) is such a type of statistics. As addressed in the previous subsection, they are all featured by an integrated statistics s f ≥0 that monotonically increases as s deviates away from the origin and belong to the S-space model-based tests. Also, all these tests may be regarded as extensions of one typical univariate two-tail test (e.g. by t 2 test), that is, a univariate statistics s deviates away from the origin s=0 via the value |s|.

The counterpart of a univariate two-tail test is a univariate one-tail test that examines how far s deviates from (−∞,0], i.e. testing the statement s≤0. When either rejecting s≤0 or rejecting s≥0 happens, we reject H 0:s=0. Even when the statement s≤0 is not rejected, there are still chances that H 0:s=0 will be rejected.

Typical studies of univariate one-tail tests include the one-tailed t-test and one-tailed z-test. However, it is not clear what their counterparts in multivariate tests are. As addressed above, Hotelling’s T 2 test can be regarded as a multivariate counterpart of a two-tailed test.

The S-space BBT given in Table 6 actually provides a road to extend univariate one-tail tests to multivariate ones. Observing univariate one-tail tests from the perspective of S-space BBT, we see that \({ \tilde s}=I_{\textit {nf}}(X_{0}|| X_{1})\) is actually a boundary point that results in:
$$\begin{array}{@{}rcl@{}} \Gamma({ \tilde s})=\{s: (s-{\tilde s})\,\text{sign}({\tilde s})>0\}\qquad \qquad \\ =\left\{\begin{array}{ll} [\!{\tilde s}, \infty), & \text{if} \ {\tilde s}>0, \\ (- \infty, {\tilde s}], & \text{if} \ {\tilde s}< 0, \end{array}\right.\ \text{with} \ \text{sign}(u)=\frac{u}{|u|}. \end{array} $$
(67)

Given \({ \tilde s}\) and thus \(\Gamma ({ \tilde s})\), any s obtained from the case-control samples under H 0 may cause a false alarm if s falls in \(\Gamma ({ \tilde s})\), which happens in a probability \(p(s \in \Gamma ({ \tilde s}) | H_{0})\), i.e. the p value by the inference \({ \tilde s}\). If it is small enough, the statement \(s \notin \Gamma ({ \tilde s}) \) will be rejected, which implies that s=0 or H 0 by Equation (1) is rejected.

We further consider a statistics s in the multidimensional space from the perspective of S-space BBT given in Table 6 (2). We start by observing an orthant of the R d space featured by \(\text {sign}(\boldsymbol {\tilde s}) =\left [\text {sign}\left ({ \tilde s^{(1)}}\right), \dots, \text {sign}\left ({ \tilde s}^{(d)}\right)\right ]^{T}\) and consider one separating boundary, as illustrated in Figure 5A. Such a boundary is equivalent to the following decomposition:
$$\begin{array}{@{}rcl@{}} \Gamma(\boldsymbol{\tilde s})=\Gamma({ \tilde s^{(1)}})\times \cdots \times \Gamma\left({ \tilde s^{(d)}}\right), \cr p(\boldsymbol{s}\in \Gamma(\boldsymbol{\tilde s}) |H_{0})=\prod_{i}p\left(\boldsymbol{s}^{(i)}\in \Gamma\left({ \tilde s}^{(i)}\right) |H_{0}\right), \end{array} $$
(68)
https://static-content.springer.com/image/art%3A10.1186%2Fs40535-015-0007-5/MediaObjects/40535_2015_7_Fig5_HTML.gif
Figure 5

Rejection domain \(\Gamma (\boldsymbol {\tilde s})\), e.g. point x in the rejection domain, and y in the acceptance domain. (A) One separating boundary that consists of d lines, each emitting from \(\boldsymbol {\tilde s}\) to infinity in parallel to one axis of the orthant. (B) Choice (c) defines a hyperplane that passes through \(\boldsymbol {\tilde s}\) and uses its vector direction as the normal direction, while choice (b) takes the primary diagonal direction of the orthant as the normal direction.

where each \(\Gamma ({ \tilde s}^{(i)})\) is given by Equation (67) for computing \(p\left (\textbf {s}^{(i)}\in \Gamma \left ({ \tilde s}^{(i)}\right) |H_{0}\right)\). This actually provides an example that extends a one-tail univariate hypothesis test to a vector-variate one.

In implementation, it is not easy to get the factorization of \(p(\boldsymbol {s}\in \Gamma (\boldsymbol {\tilde s}) |H_{0})\) by Equation (68). Instead, we approximately remove the second-order dependence by the following decorrelation:
$$\begin{array}{@{}rcl@{}} \mathbf{s}_{u}=\left\{\begin{array}{ll} U^{T}\mathbf{s}, & \text{Choice}\ (a), \\ \Lambda_{u}^{-0.5}U^{T}\mathbf{s}, & \text{Choice}\ (b), \end{array}\right. \ s.t.\ U^{T}U=I, \end{array} $$
(69)
where Λ u is a diagonal matrix consisting of the nonzero eigenvalues of the following covariance matrix:
$$\begin{array}{@{}rcl@{}} \Sigma_{\pi}=\frac{\sum_{\pi\in \Pi} \left(\mathbf{s}_{X^{\pi}_{1|0}}- \mu^{\pi}\right)\left(\mathbf{s}_{X^{\pi}_{1|0}}- \mu^{\pi}\right)^{T}}{\# \Pi}, \cr \mu^{\pi}=\frac{\sum_{\pi\in \Pi} \mathbf{s}_{X^{\pi}_{1|0}}}{\# \Pi}. \qquad \qquad \end{array} $$

and U is a d×m matrix with its columns consisting of the eigenvectors of Σ π such that Λ u =U T Σ π U.
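A minimal NumPy sketch of this decorrelation is given below, assuming that the permuted statistics are stacked as the rows of an array and that d≥2. The function name `decorrelate` and its arguments are hypothetical; the optional argument `m` keeps only the leading components, and `whiten=True` corresponds to choice (b) while `whiten=False` corresponds to choice (a).

```python
import numpy as np

def decorrelate(s_perm, s, m=None, whiten=True):
    """Apply the decorrelation of Equation (69) to a statistic vector s.

    s_perm: array (n_perm, d), one permuted statistic vector per row.
    s:      array (d,), the statistic vector to transform.
    """
    sigma = np.cov(s_perm, rowvar=False, bias=True)   # covariance, divided by #Pi
    eigval, eigvec = np.linalg.eigh(sigma)            # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]                  # reorder to descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval > 1e-12 * eigval[0]                 # drop (numerically) zero eigenvalues
    if m is not None:
        keep[m:] = False                              # keep at most the m largest
    u, lam = eigvec[:, keep], eigval[keep]
    s_u = u.T @ s                                     # choice (a)
    if whiten:
        s_u = s_u / np.sqrt(lam)                      # choice (b)
    return s_u, u, lam
```

Applying the same transform to the permuted statistics themselves yields components with an identity sample covariance under choice (b), which is the sense in which the second-order dependence is removed.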

Another issue is that only the major components in Equation (68) are useful while some components are not only useless but also disturbing, especially when we consider a limited size of samples. To address this, one may let the columns of the matrix U consist of the eigenvectors of Σ π corresponding to the m-largest diagonal elements of Λ u . Such an implementation of Equation (69) is typically called principal component analysis (PCA). How to decide an appropriate number of components is a model selection task (Tu and Xu 2011, 2012; Xu 2011). Moreover, one novel direction for this task will be addressed later in this paper between Equation (91) and Equation (99). Actually, Equation (69) only removes the second-order dependence. One may further consider non-Gaussian factor analysis (NFA) and binary factor analysis (BFA) to remove dependencies among non-Gaussian components (Tu and Xu 2014; Xu 2003, 2009; see also Section 5 in Xu 2012b).

Simply, we use the notation \(\boldsymbol {\tilde s}=I_{\textit {nf}}(X_{0}|| X_{1})\) to denote a procedure to obtain such major components and then use this \(\boldsymbol {\tilde s}\) to get a separating boundary and its corresponding \(\Gamma (\boldsymbol {\tilde s})\). Illustrated in Figure 5 are three examples as follows:
$$ \Gamma(\boldsymbol{\tilde s})= \left\{ \begin{array}{ll} \left\{\boldsymbol{s}: \left(\boldsymbol{s}^{(i)}-\boldsymbol{\tilde s}^{(i)}\right)\text{sign}\left(\boldsymbol{\tilde s}^{(i)}\right)>0, \forall i\right \}, & (a), \\ \left\{\boldsymbol{s}: \left(\boldsymbol{s}-\boldsymbol{\tilde s}\right)^{T}\text{sign}(\boldsymbol{\tilde s})>0\right\}, & (b),\\ \left\{\boldsymbol{s}: \left(\boldsymbol{s}-\boldsymbol{\tilde s}\right)^{T}\boldsymbol{\tilde s}>0\right\}, & (c). \end{array} \right. $$
(70)

Choice (a), illustrated in Figure 5A, is the same as the one in Equation (68) with each \(\Gamma ({ \tilde s}^{(i)})\) given by Equation (67). As illustrated in Figure 5B, each of the two other choices is a half space bounded by a hyperplane and lying on the side away from the origin. Choice (b) is more suitable after using Equation (69) with its choice (b). Except for the degenerate cases in which the normal direction of the hyperplane becomes parallel to one of the coordinate axes, choice (b) and choice (c) approximately describe a certain dependence across the components of s.
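The three rejection domains of Equation (70) amount to simple membership checks, sketched below in Python. The function name `in_rejection_domain` is hypothetical, and the sketch assumes that no component of the critical vector is exactly zero, since the sign would then be undefined.

```python
import numpy as np

def in_rejection_domain(s, s_tilde, choice="a"):
    """Check whether s falls in the rejection domain of Equation (70)."""
    s = np.asarray(s, dtype=float)
    s_tilde = np.asarray(s_tilde, dtype=float)
    diff = s - s_tilde
    if choice == "a":      # orthant-shaped domain: every component beyond s_tilde
        return bool(np.all(diff * np.sign(s_tilde) > 0))
    if choice == "b":      # half space with the orthant diagonal as normal direction
        return bool(diff @ np.sign(s_tilde) > 0)
    if choice == "c":      # half space with s_tilde itself as normal direction
        return bool(diff @ s_tilde > 0)
    raise ValueError("choice must be 'a', 'b' or 'c'")
```

Choice (a) is the most restrictive of the three: a point may lie beyond the hyperplanes of choices (b) and (c) while failing the component-wise condition of choice (a).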

After using Equation (69) to make the statistics s become an m-dimensional vector with the second-order dependence removed, we may observe that the scope of \(\Gamma (\boldsymbol {\tilde s})\) becomes narrowed as m reduces. When m=1, the scope of \(\Gamma (\boldsymbol {\tilde s})\) is narrowed to a one-tail test along the axis of only one component.

In implementation, we obtain \(p(\boldsymbol {s}\in \Gamma (\boldsymbol {\tilde s}) |H_{0})\) by Equation (64) via the permutation by Equation (65). Also, choice (b) and choice (c) may be understood from getting an integrated statistics as follows:
$$ \begin{aligned} \boldsymbol{s}_{\boldsymbol{w}}&=\boldsymbol{w}^{T}\boldsymbol{s}, \qquad \qquad \\ \boldsymbol{w}&=\text{sign}(\boldsymbol{\tilde s}) \ or \ \boldsymbol{w}=\boldsymbol{\tilde s}. \end{aligned} $$
(71)

Approximately, s w comes from a normal distribution with the mean μ w and the variance \(\sigma_{\mathbf{w}}^{2}\), based on which we can make a univariate one-tail test.
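One possible way to carry out this univariate test is sketched below, under the assumption that the mean and variance of s w under H 0 are estimated from its values over the permuted sample pairs. The function name `one_tail_z_p_value` is hypothetical. The upper tail is used because, for both choices of w in Equation (71), the observed value of s w computed from the critical vector is positive.

```python
import math

def one_tail_z_p_value(s_w, perm_values):
    """Upper-tail p value of s_w under a normal approximation.

    perm_values: values of the projected statistic over permuted sample pairs,
    used here to estimate the mean and the variance under H0.
    """
    n = len(perm_values)
    mu = sum(perm_values) / n
    var = sum((v - mu) ** 2 for v in perm_values) / n
    z = (s_w - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # P(Z >= z) for a standard normal Z
```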

SPD test and SPD discriminative analysis

Proposed in (Xu 2013a), the SPD method firstly examines the delta δ(x,y) by pairing every case sample x∈X 1 and every control sample y∈X 0 and then summarises such deltas as follows:
$$\begin{array}{@{}rcl@{}} D(X_{1}||X_{0})=\frac{1}{\# X_{0}\# X_{1}} \sum_{x \in X_{1}} \sum_{y \in X_{0}} \delta(x, y). \end{array} $$
(72)
Generally, δ(x,y) could be either symmetric or antisymmetric. One simple symmetric example is:
$$ \begin{aligned} \delta(x, y)&=\frac{(x - y)^{2}}{\alpha_{1}{\sigma_{1}^{2}}+\alpha_{0} {\sigma_{0}^{2}}}, \\ D(X_{1||0}) &=1+\frac{(c_{1}- c_{0})^{2}}{\alpha_{1}{\sigma_{1}^{2}}+\alpha_{0} {\sigma_{0}^{2}}}-\frac{r_{xy} }{\alpha_{1}{\sigma_{1}^{2}}+\alpha_{0} {\sigma_{0}^{2}}}, \end{aligned} $$
(73)

where \(c_{\omega }, \sigma ^{2}_{\omega }, \alpha _{\omega }\) are the sample mean, variance, and proportion of the samples in X ω , respectively, and r xy is the mutual correlation between x and y.

The above example can be extended to the case that both x,y are vectors with:
$$\begin{array}{@{}rcl@{}} \delta({\mathbf{x}, \mathbf{y}})=({\mathbf{x}-\mathbf{y}})^{T}[\!\alpha_{0}\Sigma_{0}+ \alpha_{1} \Sigma_{1}]^{-1}({\mathbf{x}-\mathbf{y}}). \end{array} $$
Also, we may consider an antisymmetric delta:
$$\begin{array}{@{}rcl@{}} \delta(x, y)=\rho(x-y), \ d\rho(u)/du>0, \end{array} $$
(74)
where ρ(u) is a monotonic function. One simplest example is ρ(u)=u as follows:
$$\begin{array}{@{}rcl@{}} \delta(x, y)=x-y, \end{array} $$
(75)
which is equivalent to testing the difference of two sample means. To find the collective inclining structure, we classify δ(x,y) into three groups by x>y,x=y,x<y and get the following decomposition:
$$\begin{array}{@{}rcl@{}} D(X_{1||0})=D_{+}(X_{1||0}) -D_{-}(X_{1||0}), \\ D_{+}(X_{1||0}) =\sum_{x>y} (x-y), \cr D_{-}(X_{1||0})=\sum_{x<y} (y-x) \end{array} $$
(76)

with D(X 1||0)>0 indicating that there is a collective inclining dominance (i.e. the representations of cases are bigger than the ones of controls), D(X 1||0)<0 indicating a reversed dominance, and D(X 1||0)=0 indicating no dominance.
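The decomposition can be illustrated with a short Python sketch for the simplest delta δ(x,y)=x−y. The function name `spd_decomposition` is hypothetical, and the sketch assumes that both sums are normalised by # X 0 # X 1 as in Equation (72), so that D equals the difference of the two sample means.

```python
def spd_decomposition(cases, controls):
    """Return (D, D_plus, D_minus) for the delta x - y over all case-control pairs."""
    norm = len(cases) * len(controls)
    # pairs in which the case value exceeds the control value
    d_plus = sum(x - y for x in cases for y in controls if x > y) / norm
    # pairs in which the control value exceeds the case value
    d_minus = sum(y - x for x in cases for y in controls if x < y) / norm
    return d_plus - d_minus, d_plus, d_minus
```

For cases [3, 5] and controls [1, 4], the four pairwise deltas are 2, −1, 4, and 1, so D + =7/4, D − =1/4, and D=1.5, which equals the difference 4−2.5 of the two sample means.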

Recalling Equation (66), it follows from \({\tilde s}=D(X_{1||0})= c_{1}-c_{0}\) that D(X 1||0) approximately follows a normal distribution. Thus, the above collective inclining dominance can be tested by the one-tailed t-test and one-tailed z-test addressed in the previous subsections. We may get the mean \( \mu \left (X_{1||0}^{\pi }\right)\) and the variance \(\sigma ^{2}\left (X_{1||0}^{\pi }\right)\) from \(\left \{ D\left (X_{1||0}^{\pi }\right), \pi \in \Pi \right \}\) and then approximately compute the p value by a univariate one-tail z-test.

When x,y are vectors, we consider:
$$\begin{array}{@{}rcl@{}} \textbf{s}=\left[D^{(1)}(X_{1||0}), \cdots, D^{(d)}(X_{1||0})\right]^{T}, \end{array} $$
(77)

with each D (i)(X 1||0) by Equation (76). The task is detecting whether there is a collective inclining dominance, i.e. whether s deviates far away from the origin such that H 0 by Equation (1) breaks. The task can be handled by the S-space BBT in Table 6 as a multivariate extension of a one-tail univariate hypothesis test, following the method introduced from Equations (68) to (71) given previously.

Also, we may consider this multivariate SPD study from a perspective similar to the FDA by Equation (11). When x,y are the d-dimensional vectors, we extend Equation (74) into:
$$\begin{array}{@{}rcl@{}} \delta(\mathbf{x}, \mathbf{y})=\rho(\mathbf{x}-\mathbf{y})^{T}\mathbf{w}, \end{array} $$
(78)
where ρ(u)=[ρ(u (1)),…,ρ(u (d))] T and ρ(u) is the same as the one in Equation (74). That is, the difference x−y is projected onto a most reasonable direction w. In the simplest case ρ(u)=u, we get δ(x,y)=(x−y) T w for the delta in Equation (72), which leads to s w =w T s in Equation (71) as follows:
$$\begin{array}{@{}rcl@{}} \mathbf{s}_{\mathbf{w}}=\frac{1}{\# X_{0}\# X_{1}} {\sum_{x \in X_{1}} \sum_{y \in X_{0}} (\mathbf{x}-\mathbf{y})^{T}\mathbf{w}} = \mathbf{w}^{T}\mathbf{s}. \end{array} $$
(79)
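The identity in Equation (79) can be checked numerically with the following NumPy sketch, in which the function name `spd_projection` is hypothetical. It averages the projected differences over all case-control pairs; with ρ(u)=u, the result coincides with the projection of the difference of the two sample mean vectors.

```python
import numpy as np

def spd_projection(x1, x0, w):
    """Average of (x - y)^T w over all pairs of rows x in x1 and y in x0."""
    diffs = x1[:, None, :] - x0[None, :, :]   # shape (n1, n0, d), all pairwise differences
    return float(np.mean(diffs @ w))          # projection onto w, then pairwise average
```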

Without losing generality, we consider that the components of s are mutually independent, e.g. obtaining a second-order independence by Equation (69). Then, we seek how to choose an appropriate w.

Under H 0, we expect that \( \mathbf {s}_{\mathbf {w}}^{\pi }=D_{\mathbf {w}}\left (X_{1||0}^{\pi }\right), \pi \in \Pi \) varies around its mean that is typically zero according to Equation (75), that is, we expect that the following standard deviation of \(\boldsymbol {s}_{\mathbf {w}}^{\pi }\) is minimised:
$$ \begin{aligned} \sigma_{\pi}(\mathbf{w})&=\sqrt{\mathbf{w}^{T}\Sigma_{\pi}\mathbf{w}}, \\ \Sigma_{\pi}&=E\left[\!\left(\mathbf{s}^{\pi}-E\mathbf{s}^{\pi}\right)\left(\mathbf{s}^{\pi}-E\mathbf{s}^{\pi}\right)^{T}\right]. \end{aligned} $$
(80)
Also, we expect that s w best preserves discriminative information underlying X 1,X 0, for which we maximise |s w |. We apply a bootstrapping method to enhance the reliability by maximising:
$$\begin{array}{@{}rcl@{}} \rho_{\gamma}(\mathbf{w})=\sum_{{\omega} \in \Omega} |\mathbf{w}^{T}\mathbf{s}^{\phi}|^{\gamma}, \ \gamma>0, \end{array} $$
(81)

which may tend to ∞ if w is unbounded. To avoid this, some bound is imposed on w.

For γ=1, we usually consider:
$$\begin{array}{@{}rcl@{}} \max_{\mathbf{w}} \rho_{\gamma=1}(\mathbf{w}), \ s.t.\ w^{(i)}\in \left[a^{(i)}, b^{(i)}\right], \forall i. \end{array} $$
(82)

by which the solution of w=[w (1),…,w (d)] T is reached at one vertex, i.e. w (i) takes either a (i) or b (i). Particularly, when Ω consists of only one pair X 1,X 0, the above maximisation leads to choice (b) in Equation (70) if we let −a (i)=b (i)=1 and to choice (c) if we let −a (i)=b (i)=|D (i)(X 1||0)|.

For γ=2, we consider:
$$\begin{array}{@{}rcl@{}} \max_{\mathbf{w}, \ s.t.\ \Vert \mathbf{w} \Vert^{2}=1,} \rho_{\gamma=2}(\mathbf{w}) =\mathbf{w}^{T} \Sigma^{\phi} \mathbf{w}, \end{array} $$
(83)

with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma ^{\phi } =\sum _{{\omega } \in \Omega } \mathbf {s}^{\phi }\mathbf {s}^{\phi \ T} \).

Integrating Equations (80) and (81), we consider maximising ρ γ (w) with \(\sigma _{\pi }^{\gamma }(\mathbf {w})\) minimised simultaneously or subject to a constraint \( \sigma _{\pi }^{\gamma }(\mathbf {w})\le \text {constant}\).

Alternatively, we may consider:
$$\begin{array}{@{}rcl@{}} \max_{\mathbf{w}} J(\mathbf{w}), \ J(\mathbf{w})=\frac{\rho_{\gamma}(\mathbf{w})}{\sigma_{\pi}^{\gamma}(\mathbf{w})}, \end{array} $$
(84)
which shares a spirit similar to the FDA by Equation (11). In the typical case γ=2, it becomes
$$\begin{array}{@{}rcl@{}} \max_{\mathbf{w}} J(\mathbf{w}), \ J(\mathbf{w})=\frac{\mathbf{w}^{T} \Sigma^{\phi} \mathbf{w}}{\mathbf{w}^{T}\Sigma_{\pi} \mathbf{w}}, \end{array} $$
(85)

with its solution given by the eigenvector that corresponds to the largest eigenvalue of \(\Sigma _{\pi }^{-0.5}\Sigma ^{\phi } \Sigma _{\pi }^{-0.5}\).
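A minimal NumPy sketch of this solution is given below, under the assumptions that the bootstrapped statistic vectors are stacked as rows and that Σ π is positive definite. The function name `spd_discriminative_direction` is hypothetical. The leading eigenvector u of the symmetrised matrix is mapped back by \(\mathbf{w}=\Sigma_{\pi}^{-0.5}\mathbf{u}\), which is the standard way to recover the maximiser of a generalised Rayleigh quotient.

```python
import numpy as np

def spd_discriminative_direction(s_boot, sigma_pi):
    """Maximise w^T Sigma_phi w / w^T Sigma_pi w for the case gamma = 2.

    s_boot:   array (n_boot, d), one bootstrapped statistic vector per row.
    sigma_pi: array (d, d), covariance of the permuted statistics.
    """
    sigma_phi = s_boot.T @ s_boot                       # sum of s s^T over the bootstrap set
    eigval, eigvec = np.linalg.eigh(sigma_pi)
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # Sigma_pi^{-0.5}
    sym = inv_sqrt @ sigma_phi @ inv_sqrt
    vals, vecs = np.linalg.eigh(sym)                    # ascending eigenvalues
    w = inv_sqrt @ vecs[:, -1]                          # map the top eigenvector back
    return w / np.linalg.norm(w), float(vals[-1])
```

The returned scalar is the largest eigenvalue, which equals the maximal value of J(w); the direction is normalised to unit length because J(w) does not depend on the scale of w.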

Furthermore, we proceed to consider that each D (i)(X 1||0) in Equation (79) is not a simple difference by Equation (76) but the following 1×2 row vector:
$$\begin{array}{@{}rcl@{}} D^{(i)}(X_{1||0})=\left[D_{+}^{(i)}(X_{1||0}), -D_{-}^{(i)}(X_{1||0})\right]. \end{array} $$
(86)
Also, we may extend x−y with each element x (i)−y (i) becoming a row vector [x (i),−y (i)]. Accordingly, we get:
$$\begin{array}{@{}rcl@{}} \mathbf{x}-\mathbf{y}=\Delta_{x-y} \mathbf{v}, \cr \delta(x,y)=\mathbf{w}^{T} \Delta_{x-y} \mathbf{v}, \end{array} $$
(87)
where v=[v (1),v (2)] T and Δ x−y is a d×2 matrix with the i-th row being [x (i),−y (i)]. It follows from Equation (72) that the above Equation (87) leads D (i)(X 1||0) to:
$$\begin{array}{@{}rcl@{}} D^{(i)}(X_{1||0})=\left[c_{1}^{(i)}, -c_{0}^{(i)}\right], \cr \mathbf{s}=D_{M}(X_{1||0})\mathbf{v}, \end{array} $$
(88)
where D M (X 1||0) is a d×2 matrix with D (i)(X 1||0) as its i-th row. Accordingly, the inner product by Equation (79) becomes:
$$\begin{array}{@{}rcl@{}} \mathbf{s}_{\mathbf{w}} =\mathbf{w}^{T}D_{M}(X_{1||0})\mathbf{v}. \end{array} $$
(89)

Given v as fixed, the study from Equations (79) and (84) applies directly for us to get w.

Given w as fixed, \( \mathbf {w}^{T}D_{M}\left (X_{1||0}\right)={D_{c}^{T}}\left (X_{1||0}\right)\) becomes a two-dimensional row vector, and it follows from Equation (89) that we have \(\boldsymbol {s}_{\mathbf {w}} =\mathbf {v}^{T}{D_{c}}\left (X_{1||0}\right)\) in the same form as Equation (79). With v in the place of w and D c (X 1||0) in the place of s, the study from Equations (79) and (84) similarly applies directly for us to get v. Generally, we iteratively update v with a fixed w and update w with a fixed v, for a number of cycles until convergence. Still, whether such an alternating iterative procedure converges is an open issue that demands further investigation.

The p values and testing complexity control

Recalling Equation (64) and Table 6, based on a given sample pair X 1||0 (i.e. X 0,X 1), we get a statistics vector \(\boldsymbol {\tilde s}_{{X}_{1||0}}=I_{\textit {nf}}(X_{0}|| X_{1})\) and a rejection domain \(\Gamma =\Gamma \left (\boldsymbol {\tilde s}_{{X}_{1||0}}\right)\) by the inferring method I nf . Then, we compute the following false alarm probability:
$$\begin{array}{@{}rcl@{}} p_{{X}_{1||0}}=p(\mathbf{s}\in \Gamma |\ I_{nf}, X_{1||0}, H_{0}) \end{array} $$
(90)

as the p value. This concept is the same as the one used in the conventional literature where X 1||0 and I nf are usually implied but not spelled out.

Being different from those studies considering a univariate statistics, the p value by a multidimensional statistics vector s highly depends on the dimension m of this vector or the complexity of the testing space. Given a limited sample size, the p value by Equation (90) will reduce as the value of m increases, causing a phenomenon similar to the overfitting problem in the studies of machine learning and statistical modelling. In other words, we encounter a ‘dimension curse’ in hypothesis testing too. Therefore, we need to appropriately control the complexity of testing space, i.e. selecting one appropriate m.

Given a criterion J(m), the problem of selecting a best subset is a typical problem of feature selection. Generally, it involves an exhaustive evaluation of all the combinations of m features (i.e. m components of s) and all the possible values of m, which is an NP-hard problem. Usually, the branch and bound policy (Narendra and Fukunaga 1977; Somol et al. 2004) and the best first strategy are used to save computing cost (Xu et al. 1988). In this paper, we only consider one simple selection strategy that evaluates the components of s incrementally one by one.

To facilitate this, we perform Equation (69) to decorrelate the components of s and start by picking the one component that gives the smallest value of a given criterion J(m). Then, we successively add the one component that makes J(m) drop the most, and so on until no further reduction occurs. Finally, the selected components form the resulting feature set with a size m.
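This incremental strategy is a greedy forward selection, which can be sketched in Python as follows. The function name `forward_select` and the callable `criterion` are hypothetical; `criterion` is assumed to map a list of component indices to the value of J for that subset, with smaller values being better.

```python
def forward_select(n_components, criterion):
    """Greedy forward selection: add one component at a time while J keeps dropping."""
    selected = []
    remaining = list(range(n_components))
    best = float("inf")
    while remaining:
        # evaluate J for each candidate extension and take the smallest value
        score, idx = min((criterion(selected + [i]), i) for i in remaining)
        if score >= best:          # no further reduction, so stop
            break
        best = score
        selected.append(idx)
        remaining.remove(idx)
    return selected, best
```

Being greedy, this procedure evaluates only on the order of d 2 subsets instead of all the combinations, at the price of possibly missing the best subset.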

For this purpose, using the p value by Equation (90) as J(m) does not work well because of its tendency of reducing as m increases, resulting in one m that is usually much bigger than the appropriate one. Instead, we consider another false alarm probability as follows:
$$ \begin{aligned} p(\mathbf{s}\in \Gamma |\ I_{nf}, H_{0}) = \int p\left(X^{\pi}_{1||0}\right) p\left(\mathbf{s}\in \Gamma |\ I_{nf}, X^{\pi}_{1||0}, H_{0}\right) dX^{\pi}_{1||0}, \end{aligned} $$
(91)

which is obtained over all the possible sets \(X^{\pi }_{1||0}\) that arise under \(H_{0}\), instead of merely on a given pair \(X_{1||0}\).

Though this probability is useless for judging whether \(X_{1||0}\) contains enough information to reject \(H_{0}\), it reflects how the complexity of the testing space affects a background portion of the false alarm probability. Actually, it reflects the inverse of the effective volume of the support in which the statistics s is located. As m increases, this volume increases exponentially, and thus \(p(\mathbf{s}\in \Gamma |\ I_{nf}, H_{0})\) decreases exponentially. Such an exponentially decreasing tendency is also contained in \(p(\mathbf{s}\in \Gamma |\ I_{nf}, X_{1||0}, H_{0})\) for the same reason, which affects the accuracy of the estimated p value.

To reduce this background disturbance, we consider Equations (90) and (91) jointly by the following a posteriori version of the p value:
$$ \begin{aligned} {pp}_{{X}_{1||0}}=p(\neg H_{0} |\ I_{nf}, X_{1||0}, H_{0})= \frac{ \int_{p_{X^{\pi}_{1|0}}\le p_{X_{1|0}}} p\left(X^{\pi}_{1||0}\right) p\left(\textbf{s}\in \Gamma |\ I_{nf}, X^{\pi}_{1||0}, H_{0}\right) dX^{\pi}_{1||0} }{ p(\textbf{s}\in \Gamma |\ I_{nf}, H_{0}) }, \end{aligned} $$

where and hereafter \(\neg H_{0}\) denotes rejecting \(H_{0}\). The denominator aims at cancelling out the disturbing portion in the numerator, such that \({pp}_{{X}_{1||0}}\) provides not only a better estimate of the false alarm probability of rejecting \(H_{0}\) but also a better criterion J(m) for selecting a best subset of the components of s and thus inferring an appropriate \(m^{\ast}\).

Instead of directly handling the above integral, we generate a large set Π of sample pairs \(X_{1}^{\pi }, X_{0}^{\pi }\), with each pair resulting from a permutation of \(X_{0}\) and \(X_{1}\). Using every pair \(X_{1}^{\pi }, X_{0}^{\pi } \) to infer \(I_{\textit {nf}}\left (X^{\pi }_{0}|| X^{\pi }_{1}\right)= \boldsymbol {\tilde s}_{X^{\pi }_{1|0}}\), we get a set of p values as follows:
$$ \begin{aligned} P_{\Pi}=\left\{ p_{X^{\pi}_{1|0}}, \pi \in \Pi\right\}, \ \text{with} ~~p_{X^{\pi}_{1|0}}=p\left(\mathbf{s}\in \Gamma |\ I_{nf}, X_{1||0}^{\pi}, H_{0}\right) \end{aligned} $$
(92)
based on which we compute:
$$ \begin{aligned} {pp}_{{X}_{1||0}}&= \frac{\sum_{\pi \in \Pi_{\Gamma}} p_{X^{\pi}_{1|0}}}{\sum_{{\pi} \in \Pi} p_{X^{\pi}_{1|0}}} =pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}, \\ pp^{o}_{{X}_{1||0}} &=\frac{n_{\Gamma}}{n_{\Pi}}, \ {rp}_{{X}_{1||0}}= \frac{\mu_{\Gamma}}{\mu}, \\ n_{\Gamma}&=\# \Pi_{\Gamma}, \ n_{\Pi}=\# \Pi, \\ \Pi_{\Gamma}&=\left\{\pi: p_{X^{\pi}_{1|0}}\le p_{X_{1|0}}, \forall \pi \in \Pi\right\}, \\ \mu_{\Gamma}&=\frac{\sum_{\pi \in \Pi_{\Gamma}} p_{X^{\pi}_{1|0}}}{ n_{\Gamma}}, \ \mu=\frac{\sum_{\pi \in \Pi} p_{X^{\pi}_{1|0}}}{n_{\Pi}}. \end{aligned} $$
(93)

We observe that the pp value has two factors. One is \(pp^{o}_{{X}_{1||0}}\), which describes the proportion of the pairs \(X_{1}^{\pi }, X_{0}^{\pi }\) with \(p_{X^{\pi }_{1|0}}\le p_{X_{1|0}}\); that is, on each of these pairs we should also reject \(H_{0}\) if we reject \(H_{0}\) on \(X_{1||0}\). In other words, \(pp^{o}_{{X}_{1||0}}\) reflects the information of relative difference contained in \(P_{\Pi}\). The other factor \({rp}_{{X}_{1||0}}\) is the ratio of the average false alarm probability per pair over the disturbing background per pair, reflecting the strength of discriminative information contained in \(P_{\Pi}\).

In implementation, we may use \({rp}_{{X}_{1||0}}\) for an initial screening. When \({rp}_{{X}_{1||0}}>1\), the inference is meaningless and no further computation should be made. Generally, \({rp}_{{X}_{1||0}}\) will be much smaller than 1, and thus \({pp}_{{X}_{1||0}}\) will be much smaller, while \(pp^{o}_{{X}_{1||0}}\) provides a worst case upper bound of \({pp}_{{X}_{1||0}}\).
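The quantities in Equations (92) and (93) can be sketched in a few lines. The following is a hedged illustration rather than the author's code: `permutation_p_values` and `pp_factors` are hypothetical names, `p_value_fn` is a user-supplied placeholder for the p value of Equation (90) (which is not reproduced here), and a permutation is assumed to mean a random relabelling of the pooled case and control samples.

```python
import random
from typing import Callable, List, Sequence, Tuple


def permutation_p_values(
    x1: Sequence,
    x0: Sequence,
    p_value_fn: Callable[[Sequence, Sequence], float],
    n_perm: int,
    seed: int = 0,
) -> List[float]:
    """Collect the set P_Pi of Equation (92) by relabelling pooled samples.

    `p_value_fn(cases, controls)` is a placeholder for the p value that the
    inference yields on one permuted pair.
    """
    rng = random.Random(seed)
    pooled = list(x1) + list(x0)
    n1 = len(x1)
    p_perm = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        p_perm.append(p_value_fn(pooled[:n1], pooled[n1:]))
    return p_perm


def pp_factors(p_obs: float, p_perm: Sequence[float]) -> Tuple[float, float, float]:
    """Return (pp, pp_o, rp) following the factorisation in Equation (93)."""
    if not p_perm:
        raise ValueError("p_perm must not be empty")
    mu = sum(p_perm) / len(p_perm)
    if mu == 0.0:
        raise ValueError("all permuted p values are zero")
    # Pi_Gamma: permuted pairs whose p value is not larger than the observed one.
    in_gamma = [p for p in p_perm if p <= p_obs]
    if not in_gamma:
        # rp is undefined here; returning 0.0 is a convention assumed for this sketch.
        return 0.0, 0.0, 0.0
    pp_o = len(in_gamma) / len(p_perm)
    rp = (sum(in_gamma) / len(in_gamma)) / mu
    return pp_o * rp, pp_o, rp


pp, pp_o, rp = pp_factors(0.01, [0.005, 0.2, 0.3, 0.01, 0.485])
print(pp, pp_o, rp)
```

In this toy call, two of the five permuted p values fall at or below the observed 0.01, so \(pp^{o}\) is 2/5, and the product \(pp^{o}\, rp\) equals the ratio of the two sums in the first line of Equation (93). A caller would check `rp > 1` first and stop if that screening fails.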

We should observe \({pp}_{{X}_{1||0}}\), \(pp^{o}_{{X}_{1||0}}\), and \({rp}_{{X}_{1||0}}\) not only at one same value of m but also at an appropriate \(m^{\ast}\). In addition to using \({pp}_{{X}_{1||0}}\) by Equation (93) as J(m) for making an incremental selection, we may also consider \(pp^{o}_{{X}_{1||0}}\) or \({rp}_{{X}_{1||0}}\) as J(m), resulting in \( m^{\ast }_{o} \) or \(m^{\ast }_{\textit {rp}}\). Also, it follows from some mathematical derivation that \( m^{\ast } \ge m^{\ast }_{\textit {rp}} \ge m^{\ast }_{o}\), with \(m^{\ast }_{o}\) being the most conservative lower bound. We will be more confident when all these values are identical or do not differ too much. Moreover, further insights can be obtained from the following considerations.

On one side, we desire that the exponentially decreasing tendency contained in \(p(\mathbf{s}\in \Gamma |\ I_{nf}, X_{1||0}, H_{0})\) is removed via the normalisation by \(p(\mathbf{s}\in \Gamma |\ I_{nf}, H_{0})\), such that \({pp}_{{X}_{1||0}}\) in Equation (93) no longer has such a decreasing tendency. With \(p_{X^{\pi }_{1|0}}=p(\textbf {s}\in \Gamma |\ I_{\textit {nf}}, X_{1||0}^{\pi }, H_{0})\) in Equation (92) replaced by \({pp}_{{X}_{1||0}}\), \(pp^{o}_{{X}_{1||0}}\), and \({rp}_{{X}_{1||0}}\), we may turn \(P_{\Pi}\) into its counterparts \(P_{pp}\), \(P_{pp^{o}}\), and \(P_{rp}\). We compute not only the varying curve of each of \({pp}_{{X}_{1||0}}\), \(pp^{o}_{{X}_{1||0}}\), and \({rp}_{{X}_{1||0}}\) as m increases, but also the varying curve of the mean of the elements in each of \(P_{pp}\), \(P_{pp^{o}}\), and \(P_{rp}\) as m increases. Then, we compare each curve with its corresponding mean curve and desire that the mean curve is as flat as possible, or at least flat around \(m^{\ast}\).

On the other side, a flat mean curve is not the sole principle. We also desire that the discriminative information is kept in each of \({pp}_{{X}_{1||0}}\), \(pp^{o}_{{X}_{1||0}}\), and \({rp}_{{X}_{1||0}}\) as much as possible. Observing the factorisation \({pp}_{{X}_{1||0}}= pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\) in Equation (93), the strength of discriminative information is contained in \({rp}_{{X}_{1||0}}\), with an exponentially decreasing tendency that is supposed to be mutually cancelled out by the denominator and the numerator but perhaps not completely, while the discriminative information of relative difference is contained in \(pp^{o}_{{X}_{1||0}}\) and is kept unchanged as long as every inequality between \(p_{X^{\pi }_{1|0}}\) and \(p_{X_{1|0}}\) remains unchanged.

Bi-test, twin p values, and P-space BBT

Putting the above two sides together, we observe that an S-space multivariate test is actually a bi-test that tests \(H_{0}\) together with the following hypothesis:
$$\begin{array}{@{}rcl@{}} I_{0}\ : \text{ the inference is not reliable.} \end{array} $$
(94)

We examine a decision in which both \(H_{0}\) and \(I_{0}\) are rejected, characterised by two p values.

As addressed after Equation (91), the multivariate statistics s inferred by \(I_{nf}\) suffers a systematic bias that makes \(I_{nf}\) unreliable. This unreliability varies with the dimension m, which plays an important role in \(I_{nf}\). Though corrected by the denominator in Equation (93), some residuals are not completely cancelled out; their effect still varies with m and reduces the reliability of \(I_{nf}\). The test of \(I_{0}\) is formulated for this reliability via controlling an appropriate m and a level of false alarm probability of rejecting \(I_{0}\).

One should notice the difference between testing \(H_{0}\) and testing \(I_{0}\). Testing \(H_{0}\) examines only the input, while testing \(I_{0}\) examines both the input and the performance of testing \(H_{0}\). The inference \(I_{nf}\) takes \(X_{1||0}\) as the input and produces the outcomes \(p_{{X}_{1||0}}, {pp}_{{X}_{1||0}}\), \(pp^{o}_{{X}_{1||0}}\), and \({rp}_{{X}_{1||0}}\). Using \(o_{{X}_{1||0}}\) to denote any one of these indices, regarding \(I_{nf}\) as reliable on \(X_{1||0}\) actually implies that it should also be regarded as reliable on any pair \(X_{1}^{\pi }, X_{0}^{\pi }\) with the corresponding \(o_{{X}^{\pi }_{1||0}}\) being smaller than \(o_{{X}_{1||0}}\). Thus, the false alarm probability of rejecting \(I_{0}\) is computed by \(p\left (o_{{X}^{\pi }_{1||0}}\le o_{{X}_{1||0}}|\neg H_{0}, H_{0}\right)\).

Interestingly, some mathematical derivation shows that letting \(o_{{X}_{1||0}}\) be any one of \(p_{{X}_{1||0}}, {pp}_{{X}_{1||0}}\), \(pp^{o}_{{X}_{1||0}}\), and \({rp}_{{X}_{1||0}}\) always results in the same false alarm probability as follows:
$$\begin{array}{@{}rcl@{}} p(\neg I_{0} |\neg H_{0}, H_{0}) =p\left(p_{X^{\pi}_{1|0}}\le p_{X_{1|0}}| H_{0}\right)=pp^{o}_{{X}_{1||0}}, \end{array} $$
(95)

where and hereafter \(\neg I_{0}\) denotes rejecting \(I_{0}\). Reflecting the discriminative information of relative difference, this p value of rejecting \(I_{0}\) is not affected as long as the exponentially decreasing tendency does not change any inequality between \(p_{X^{\pi }_{1|0}}\) and \(p_{X_{1|0}}\).

As summarised in Table 7, a multivariate test is actually a bi-test that tests not only the classic null but also a null about the ‘dimension curse’. The rejection of \(H_{0}\) is controlled by a given level α. If \({pp}_{{X}_{1||0}}\ge \alpha \), \(H_{0}\) will not be rejected, and thus there is no need to test \(I_{0}\). Accordingly, Equation (95) for the p value of rejecting \(I_{0}\) is also modified in Table 7. The bi-test is implemented with or without using stochastic simulation. Table 7 (2) outlines the previously addressed points for implementation via stochastic simulation, while Table 7 (3) outlines an alternative implementation that does not need stochastic simulation.
Table 7 Multivariate bi-test and implementations

(1) Test bi-hypotheses and twin p-values

Test \(H_{0}\): whether the case-control populations are different, by an inference \(I_{nf}\) in the space of multivariate statistics s based on samples from the two populations. \(H_{0}\) is rejected if \({pp}_{{X}_{1||0}}\le \alpha \), where the false alarm probability \({pp}_{{X}_{1||0}}=pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\) is given by Equation (93) and α is a prespecified level.

Test \(I_{0}\): whether the dimension m of s is appropriate such that \(I_{nf}\) is reliable, with the p-value given by \( p(\neg I_{0} |\neg H_{0}, H_{0}) = p({pp}_{X^{\pi }_{1|0}}\le \alpha |{pp}_{{X}_{1||0}}< \alpha, H_0), \) which is not smaller than \(pp^{o}_{{X}_{1||0}}\) that reflects the relative discriminative information among \({pp}_{{X}_{1||0}}\) while ignoring \({rp}_{{X}_{1||0}}\) that reflects the strength of discriminative information.

(2) Bi-test implementation, stochastic way

(a) Make the components of s decorrelated by Equation (69). (b) Get \(p({\mathbf {s}}\in \Gamma |\ I_{\textit {nf}}, X_{1||0}^{\pi }, H_0)=p({\mathbf {s}}\in \Gamma (\tilde {\bf {s}}) |H_0)\) by Equation (68) with \(\Gamma (\tilde {\bf s})\) taking one of the three choices in Equation (70), and then get \(P_{\Pi}\) by Equation (92). (c) Get \( {pp}_{{X}_{1||0}}, pp^{o}_{{X}_{1||0}}, {rp}_{{X}_{1||0}}\) by Equation (93) and then get \(p(\neg I_{0} |\neg H_{0}, H_{0})\) as above. (d) Use \(pp^{o}_{{X}_{1||0}}\) or \(p(\neg I_{0} |\neg H_{0}, H_{0})\) as J(m) to infer an appropriate \(m^{*}_{o}\) and select the \(m^{*}_{o}\) best components of s.

(3) Bi-test implementation, nonstochastic way

(a) Make the components of s decorrelated by Equation (69). (b) Get \(\{p_{i}\}\) with each p-value \(p_{i}\) obtained by a univariate test. (c) Get \( pp^{o}_{{X}_{1||0}}\) by Equation (99) and \({rp}_{{X}_{1||0}}\) by Equation (97) with \(p_{X_{1|0}}= \prod _{i} p_{i}\), and then get \(p(\neg I_{0} |\neg H_{0}, H_{0})\) as above. (d) The same as the above (2)(d).

This alternative comes from considering Γ in the choice (a) of Equation (70) by which we have:
$$\begin{array}{@{}rcl@{}} p\left(\textbf{s}\in \Gamma |\ I_{nf}, X_{1||0}^{\pi}, H_{0}\right)=\prod_{i \le m^{\ast}}p_{i}\prod_{i > m^{\ast}}\delta_{i}, \cr p_{i}= p\left(\textbf{s}^{(i)}\in \Gamma\left({ \tilde s}^{(i)}\right) |H_{0}\right), \end{array} $$
(96)

where the extra components of s will contribute a constant factor \(\prod _{i > m^{\ast }}\delta _{i}\) that will be cancelled out via the denominator and the numerator in Equation (93).

In such a case, we may get \({rp}_{{X}_{1||0}}={\mu _{\Gamma }}/{\mu }\) without stochastic simulation. First, we have \(\mu =\prod _{i} \mu ^{(i)}\). Each \({ \tilde s}^{(i)}\) under \(H_{0}\) is a random variable with a zero mean, and its corresponding false alarm probability \(p_{i}\) is uniformly distributed over [0, 0.5]. Thus, we get \(\mu^{(i)}=1/4\). Second, we also get \(\mu _{\Gamma }\le p_{X_{1|0}}\) by letting \( p(\textbf {s}\in \Gamma |\ I_{\textit {nf}}, X_{1||0}^{\pi }, H_{0}) \le p_{X_{1|0}}\) for each \(\pi \in \Pi_{\Gamma}\) be approximated by its upper bound \(p_{X_{1|0}}\). Putting the two together, we have:
$$\begin{array}{@{}rcl@{}} {rp}_{{X}_{1||0}}=\frac{\mu_{\Gamma}}{\mu}\le \frac{p_{X_{1|0}}}{\mu}, \ \mu=\left\{ \begin{array}{ll} \frac{1}{4^{m}}, & \text{one tail}, \\ \frac{1}{2^{m}}, & \text{two tails}. \end{array}\right. \end{array} $$
(97)
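The two values of μ in Equation (97) can be checked with the mean of a uniform variable. As a worked sketch, assume (as stated above) that \(p_{i}\) is uniform over [0, 0.5] in the one-tail case, assume by the same reasoning that it is uniform over [0, 1] in the two-tail case, and treat the decorrelated components as independent so that the means multiply:

$$ \begin{aligned} \mu^{(i)}_{\text{one tail}} &= \int_{0}^{1/2} 2p\, dp = \frac{1}{4}, \qquad \mu^{(i)}_{\text{two tails}} = \int_{0}^{1} p\, dp = \frac{1}{2}, \\ \mu &= \prod_{i=1}^{m} \mu^{(i)} = \frac{1}{4^{m}} \ \text{(one tail)} \quad \text{or} \quad \frac{1}{2^{m}} \ \text{(two tails)}. \end{aligned} $$

For instance, with m=3 one-tail components and \(p_{X_{1|0}}=10^{-4}\), the bound gives \({rp}_{{X}_{1||0}}\le 10^{-4}\times 4^{3}=6.4\times 10^{-3}\).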
Next, \(pp^{o}_{{X}_{1||0}}\) is also considered without stochastic simulation. From Equation (95), we have \( p\left (\neg I_{0} |\neg H_{0}, H_{0}\right) =pp^{o}_{{X}_{1||0}} =p\left (\prod _{i}p_{i}^{\pi }\le \prod _{i}p_{i}| H_{0}\right) =p\left (\prod _{i}\left (p_{i}^{\pi }\right)^{2}\le \prod _{i}{p_{i}^{2}} |H_{0}\right)\), which leads us to the well-known Fisher combination (Fisher 1948) that makes a test on the false alarm probabilities {p i } by the following combination:
$$ \begin{aligned} p_{F}&=p\left(\prod_{i} \left(p_{i}^{\pi}\right)^{2} <\prod_{i} p_{i}^{2 }\ |\ H_{0}\right)\\ &=p\left(\chi_{2m}^{2} >-2\sum_{i} \ln{p_{i}}\right), \ \chi_{2m}^{2}=-2\sum_{i} \ln{p_{i}^{\pi}}. \end{aligned} $$
(98)
This link provides new insights from two perspectives. From one perspective, we may adopt the Fisher combination approach to estimate \(pp^{o}_{{X}_{1||0}}\) as follows:
$$\begin{array}{@{}rcl@{}} pp^{o}_{{X}_{1||0}}=p\left(\chi_{2m}^{2} >-2\sum_{i} \ln{p_{i}} \right). \end{array} $$
(99)

Together with Equation (97), we get \(\phantom {\dot {i}\!}{pp}_{{X}_{1||0}}=pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\) for testing both H 0 and I 0 without stochastic simulation via permutation.
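The nonstochastic route can be sketched with the standard library only. This is a hedged illustration, not the author's implementation: the names `chi2_sf_even` and `nonstochastic_bi_test` are hypothetical, the univariate p values are assumed to come from decorrelated components, and because Equation (97) only bounds \({rp}_{{X}_{1||0}}\) from above, the returned product is an upper bound on \({pp}_{{X}_{1||0}}\) rather than its exact value. The chi-square tail uses the closed form that holds for an even number of degrees of freedom.

```python
import math
from typing import Sequence, Tuple


def chi2_sf_even(x: float, m: int) -> float:
    """P(chi-square with 2m degrees of freedom > x).

    For even degrees of freedom the tail equals
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return math.exp(-half) * total


def nonstochastic_bi_test(
    p_values: Sequence[float], one_tail: bool = True
) -> Tuple[float, float, float]:
    """Return (pp_bound, pp_o, rp_bound) from univariate p values.

    pp_o follows Equation (99) (Fisher combination); rp_bound follows the
    upper bound in Equation (97) with p_X taken as the product of the p_i.
    """
    if not p_values or any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    m = len(p_values)
    fisher_stat = -2.0 * sum(math.log(p) for p in p_values)
    pp_o = chi2_sf_even(fisher_stat, m)
    mu = (0.25 if one_tail else 0.5) ** m
    rp_bound = math.prod(p_values) / mu
    return pp_o * rp_bound, pp_o, rp_bound


pp_bound, pp_o, rp_bound = nonstochastic_bi_test([0.01, 0.02, 0.03])
print(pp_bound, pp_o, rp_bound)
```

Following Table 7, a caller would compare the first value with the level α to decide on \(H_{0}\) and read the second value as the p value associated with \(I_{0}\). A value of `rp_bound` above 1 signals that the screening on \({rp}_{{X}_{1||0}}\) cannot be passed on the basis of this bound alone.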

From the other perspective, we observe that the traditional p value \(p_{F}\) of the Fisher combination is actually the false alarm probability by Equation (95), only reflecting the discriminative information of relative difference between \(\prod _{i} p_{i}^{\pi }\) and \( \prod _{i} p_{i} \) but ignoring the strength of discriminative information contained in \( \prod _{i} p_{i}\). In other words, the Fisher combination tells only half of the story for combining \(\{p_{i}\}\), and we can use the formulation \( {pp}_{{X}_{1||0}}=pp^{o}_{{X}_{1||0}} {rp}_{{X}_{1||0}}\) to complete the whole story, using \(pp^{o}_{{X}_{1||0}}\) by Equation (99) and \({rp}_{{X}_{1||0}}\) by Equation (97) with \(p_{X_{1|0}}= \prod _{i} p_{i}\).

Last but not least, one should notice that the p value of testing \(H_{0}\) measures the chances in the S-space (i.e. the space of multivariate statistics), and the p value of testing \(I_{0}\) measures an event in the P-space (i.e. the space of false alarm probabilities). In other words, testing \(H_{0}\) involves an S-space BBT while testing \(I_{0}\) involves a P-space BBT.

Discussions

Gene expression analyses

Gene expression analyses play important roles in bioinformatics and computational genetics. Expression profiles are featured by a data matrix, with its rows indicating expressions of different samples \(t=1,\cdots,N\) and its columns consisting of expressions \(i=1,\cdots,m\) from different genes, miRNAs, and lncRNAs.

In recent years, developments in data acquisition techniques have led us to consider expression profiles in a cubic or even higher-dimensional array. As illustrated in Figure 1, one additional dimension \(j=1,\cdots,d\) is added for examining expressions under different conditions (Ji et al. 2009; Persson et al. 2011) and across different time points (Bar-Joseph et al. 2012). For example, current cancer studies consider each basic unit (i.e. a gene, a miRNA, a lncRNA) in paired expressions of normal and tumour tissue from the same individual; that is, each individual is featured at least by a 2×d matrix \(X_{t}\). Generally, each example \(X_{t}\) is an m×d matrix. In Table 8, we suggest a list of topics for such matrix-variate-based applications.

Typically, the number d of rows (i.e. genes, miRNAs, and lncRNAs) is huge, while the sample size n is small. It is difficult and also unreliable to consider the entire m×d matrix as a sample \(X_{t}\). Instead, we pick a k-tuple out of m rows to form an m×k matrix as a sample \(X_{t}\). Without loss of generality, we focus on the case in which each sample \(X_{t}\) is a 2×k matrix from paired expressions of normal tissue and tumour tissue.

In the existing studies, there are two types of efforts for dealing with such format of samples. The first one reduces each sample \(X_{t}=\left [x_{t}^{(i,j)}\right ], i=1,2; j=1,\cdots, k\) into a 1×k matrix \(x_{t}=\left [x_{t}^{(1)}, \cdots, x_{t}^{(k)}\right ]\) for multivariate hypothesis test. A typical reduction is given by:
$$\begin{array}{@{}rcl@{}} x_{t}^{(j)}= \ln{x_{t}^{(1,j)}}- \ln{x_{t}^{(2,j)}}. \end{array} $$
(100)
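The reduction in Equation (100) is a per-column difference of log expressions. A minimal sketch follows, with assumptions stated plainly: the function name `log_ratio_reduce` is hypothetical, the sample is given as two equally long rows of strictly positive expression values, and which row holds normal tissue and which holds tumour tissue is left to the caller, since the text does not fix the order.

```python
import math
from typing import List, Sequence


def log_ratio_reduce(sample: Sequence[Sequence[float]]) -> List[float]:
    """Reduce a 2 x k paired sample X_t to a length-k vector x_t.

    Implements x^(j) = ln x^(1,j) - ln x^(2,j) for j = 1, ..., k.
    """
    if len(sample) != 2 or len(sample[0]) != len(sample[1]):
        raise ValueError("expected a 2 x k matrix")
    row1, row2 = sample
    if any(v <= 0 for v in row1) or any(v <= 0 for v in row2):
        raise ValueError("expression values must be strictly positive")
    return [math.log(a) - math.log(b) for a, b in zip(row1, row2)]


# Toy paired sample with k = 3 (values invented for illustration).
x_t = log_ratio_reduce([[8.0, 2.0, 5.0], [4.0, 2.0, 10.0]])
print(x_t)
```

A column with equal paired expressions maps to zero, while a two-fold change in either direction maps to plus or minus ln 2, so the reduced vector is symmetric around zero with respect to the direction of change.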
The second type of efforts is a paired difference test, e.g. a paired t-test when k=1 and a paired Hotelling’s T-square test when k≥2. In Table 8, comparative empirical IHT studies are suggested on the samples of \(X_{t}\) in a 2×k matrix versus in a 1×k vector.
Table 8 Several IHT applications

Model based and mix-modelled

(a) Starting at the case that \(X_{t}\) degenerates into a 1×2 matrix, we conduct the Hotelling test by Equation (2) and its extension K L sum in Equation (31), in comparison with both a univariate t-test and a paired t-test. (b) For the general case with k≥2, we conduct a matrix-variate test by Equation (28), as well as by the matrix-variate counterparts of K L 1,0, K L sum , and K L s u m, in comparison with not only the Hotelling’s T-square test on the k dimensional vector \(x_{t}\) obtained from Equation (100) but also the paired Hotelling’s T-square test on 2×k matrix-variate samples of \(X_{t}\). (c) Considering each sample \(X_{t}\) in a 2×k matrix, we investigate the bi-linear discriminant analysis by Equations (18), (33), and (34), in comparison with the classic FDA by Equation (11) on the k dimensional vector \(x_{t}\) obtained from Equation (100). (d) Investigate the generalised bi-linear discriminant analysis by Equations (40), (41), and (34). For simplicity, we get \(v_{i}, i=1,\cdots,d\) by Equation (43) and then solve w by Equation (34). When k becomes too big, we further regularise the learning of \(v_{i}\) by minimising \( J_y=\frac {\alpha _{0} \sigma _{0}^{y\ 2} +\alpha _{1}\sigma _{1}^{y\ 2}} {(c^{y}_{0} -c^{y}_{1})^{2}}+ \sum _{i=1}^{m} \gamma _{i} \sum _{j=1}^{d} |u_{i}^{(j)} |^{q}, \) with q=2 for Tikhonov and q=1 for sparse learning.

Boundary based and mix-modelled

(a) Considering a logistic regression by Equation (3) with w in one of the ways given in Table 4, we test Equation (5) by the Rao’s score Equation (8), and get ε C by Equation (44) and ε B by the p-value with one of the choices in Table 2. (b) Extend all the above studies on Equation (3) with \(y_{t}=w^{T}x_{t}\) replaced by the bi-linear form Equation (18). (c) Make a survival analysis via the Cox regression by Equation (13) in comparison with its bi-linear extension by Equations (18) or (40). Again, IHT is made by ε D , ε C , and ε B in a way similar to the above.

BYY harmony

(a) Use either Algorithm 9 in Ref. (Xu, 2015) to get α (i),c (i), Σ (i),i=0,1 or Algorithm ?? to get α (i),C (i),Σ (i),Ω (i),i=0,1 for model based IHT. (b) Perform the procedure given in Table 5 for training, testing and validating in a small size of samples.

Exome sequencing analyses

The case-control study is also a major problem in a genome-wide association study (GWAS) or exome-sequencing analysis (DePristo et al. 2011; Purcell et al. 2007). Typically, a digit score (i.e. 0, 1, 2) is assigned to a Single Nucleotide Polymorphism (SNP) allele per site and per individual. In such a representation, each sample is univariate when the sites are considered one by one. A univariate two-sample test plays a fundamental role in detecting a single SNP in the GWAS, e.g. PLINK provides one widely used tool set (Purcell et al. 2007).

Moreover, each sample can be a vector when multiple sites are considered jointly. Recently, there have been ever-increasing efforts on finding multiple SNVs jointly (DePristo et al. 2011; Derkach et al. 2013; Evangelou and Ioannidis 2013; Lin et al. 2014; Liu et al. 2014; Pan et al. 2014). Also, we may test whether there is a collective inclining dominance of the representations of case samples over the ones of control samples, or vice versa, with help of the method proposed from Equations (79) and (84), as well as the extension introduced around Equations (87) and (89).

Alternatively, we may also consider a SNP allele per site and per individual with δ(x,y) in Equation (75) replaced by one 3×3 matrix \(\Delta (x,y)=\left [ \delta _{x-y}^{(i,j)}\right ]\) with:
$$\begin{array}{@{}rcl@{}} \delta_{x-y}^{(i,j)}=\left\{ \begin{array}{ll} \text{sign}(x-y), & i=x,\ j=y, \\ 0, &\text{otherwise}. \end{array} \right. \end{array} $$
(101)

It follows from Equation (72) that we get D(X 1||0) to be also a 3×3 matrix as a collective measure, which may be further examined to test whether two populations differ significantly. We may visualise the matrix by plotting them in two 2D histograms and observe their configurations.
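The 3×3 matrix of Equation (101) can be sketched as follows. This is an illustration under stated assumptions: the names `delta_matrix` and `pairing_sum` are hypothetical, genotype scores are taken as integers in {0, 1, 2}, and the collective measure is assumed here to be a plain sum of Δ(x, y) over all case-control pairs, since Equation (72) is not reproduced in this section and may include a normalisation.

```python
from typing import List, Sequence


def sign(v: int) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def delta_matrix(x: int, y: int) -> List[List[int]]:
    """3 x 3 matrix of Equation (101) for genotype scores x, y in {0, 1, 2}.

    Only the entry at row x, column y can be nonzero; it holds sign(x - y).
    """
    if x not in (0, 1, 2) or y not in (0, 1, 2):
        raise ValueError("genotype scores must be 0, 1, or 2")
    delta = [[0] * 3 for _ in range(3)]
    delta[x][y] = sign(x - y)
    return delta


def pairing_sum(case: Sequence[int], control: Sequence[int]) -> List[List[int]]:
    """Sum delta_matrix(x, y) over all case-control pairs (assumed aggregation)."""
    total = [[0] * 3 for _ in range(3)]
    for x in case:
        for y in control:
            total[x][y] += delta_matrix(x, y)[x][y]
    return total


# Toy genotype scores at one site (invented for illustration).
print(pairing_sum([2, 1, 2], [0, 1, 1]))
```

Entries below the diagonal count pairs in which the case score exceeds the control score, entries above the diagonal count the opposite with a negative sign, and the diagonal stays at zero, so the layout of the matrix shows which genotype combinations drive a dominance in either direction.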

Conclusions

Statistical analyses for case-control studies have been addressed rather comprehensively. First, a Kullback-Leibler divergence-based formulation is suggested to develop testing statistics and a discriminative criterion for the case-control studies. Based on this formulation, typical existing methods are revisited, and their matrix-variate counterparts are developed. Second, a bi-linear matrix form is proposed to obtain the matrix-variate counterparts from existing multivariate statistical analyses, such as discriminative analysis, logistic regression, Cox model, and linear mixed model. Third, the necessity and feasibility of integrative hypothesis tests (IHT) are addressed from the complementarity of BMTs and BBTs in the D-space, together with empirical illustration. Moreover, four basic components of IHT are elaborated, and four IHT types are summarised according to how the components are integrated. Then, in the space of multiple statistics (S-space for short), the S-space BBT is proposed to perform BBT based on an unbounded boundary, with the help of information-preserved decoupling. Moreover, an S-space BBT-based extension of the univariate one-tail z-test is developed to test the null of multivariate zero mean and then applied to a multivariate SPD test for detecting a collective inclining dominance for the case-control studies. Also, a SPD discriminative analysis is proposed with this multivariate SPD test improved and extended to matrix-variate ones. Furthermore, a multivariate bi-test is proposed to test not only the classic null but also a null about inference reliability due to the complexity of testing space, including a new insight on and a further development of the Fisher combination. Finally, possible applications have been suggested for expression-profile-based biomarker finding and exome-sequencing-based joint SNV detection.

Declarations

Acknowledgements

This work was supported by a CUHK Direct grant project 4055025 and by the Zhi-Yuan chair professorship by Shanghai Jiao Tong University.

Authors’ Affiliations

(1)
Department of Computer Science and Engineering, The Chinese University of Hong Kong
(2)
Department of Computer Science and Engineering, The Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University

References

  1. Bar-Joseph, Z, Gitter A, Simon I (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet 13(8): 552–564.
  2. Barnett, JA (2008) Computational methods for a mathematical theory of evidence. In: Yager L Liu L (eds) Classic Works of the Dempster-Shafer Theory of Belief Functions. Studies in Fuzziness and Soft Computing, 197–216. Springer, Berlin Heidelberg.
  3. Cortes, C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3): 273–297.
  4. Cox, DR, Oakes D (1984) Analysis of survival data. CRC Press, Chapman & Hall, Boca Raton, Florida.
  5. Cover, TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. Electronic Computers, IEEE Transactions on 14(3): 326–334.
  6. Demidenko, E (2013) Mixed models: theory and applications with R. Probability and Statistics. John Wiley & Sons, Hoboken, New Jersey.
  7. DePristo, MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5): 491–498.
  8. Derkach, A, Lawless JF, Sun L (2013) Robust and powerful tests for rare variants using Fisher’s method to combine evidence of association from two or more complementary tests. Genet Epidemiol 37(1): 110–121.
  9. Dutilleul, P (1999) The mle algorithm for the matrix normal distribution. J Stat Comput Simul 64(2): 105–123.
  10. Engle, RF (1984) Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Handb Econometrics 2: 775–826.
  11. Evangelou, E, Ioannidis JP (2013) Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet 14(6): 379–389.
  12. Fisher, RA (1948) Questions and answers# 14. Am Stat 2(5): 30–31.
  13. Gibson, G (2012) Rare and common variants: twenty arguments. Nat Rev Genet 13(2): 135–145.
  14. Hosmer Jr, DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression. John Wiley & Sons, Hoboken, New Jersey.
  15. Hotelling, H (1931) The generalization of Student’s ratio. Ann Math Stat 2(3): 360–378.
  16. Ji, J, Shi J, Budhu A, Yu Z, Forgues M, Roessler S, Ambs S, Chen Y, Meltzer PS, Croce CM, Qin L-X, Man K, Lo C-M, Lee J, Ng IOL, Fan J, Tang Z-Y, Sun H-C, Wang XW (2009) Microrna expression, survival, and response to interferon in liver cancer. New Engl J Med 361(15): 1437–1447.
  17. Koboldt, DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER (2013) The next-generation sequencing revolution and its impact on genomics. Cell 155(1): 27–38.
  18. Lin, W-Y, Lou X-Y, Gao G, Liu N (2014) Rare variant association testing by adaptive combination of p-values. PloS one 9(1): 85728.
  19. Liu, DJ, Peloso GM, Zhan X, Holmen OL, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem K, Melander O, Kathiresan S, Abecasis GR (2014) Meta-analysis of gene-level tests for rare variant association. Nat Genet 46(2): 200–204.
  20. Narendra, PM, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. Comput IEEE Trans 100(9): 917–922.
  21. Pan, W, Kim J, Zhang Y, Shen X, Wei P (2014) A powerful and adaptive association test for rare variants. Genetics 197(4): 1081–1095.
  22. Persson, H, Kvist A, Rego N, Staaf J, Vallon-Christersson J, Luts L, Loman N, Jonsson G, Naya H, Hoglund M, Borg A, Rovira C (2011) Identification of new microRNAs in paired normal and tumor breast tissue suggests a dual role for the erbb2/her2 gene. Cancer Res 71(1): 78–86.
  23. Purcell, S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ, Sham PC (2007) Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3): 559–575.
  24. Schwarz, G (1978) Estimating the dimension of a model. Ann Stat 6(2): 461–464.
  25. Simon, RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y (2003) Design and analysis of DNA microarray investigations. Springer-Verlag, New York.
  26. Somol, P, Pudil P, Kittler J (2004) Fast branch & bound algorithms for optimal feature selection. Pattern Anal Mach Intell IEEE Trans 26(7): 900–912.
  27. Stone, M (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological) 36(2): 111–147.
  28. Suykens, JA, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3): 293–300.
  29. Suykens, JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific Publishing, Singapore.
  30. Tu, S, Xu L (2011) An investigation of several typical model selection criteria for detecting the number of signals. Front Electrical Electronic Eng China 6(2): 245–255.
  31. Tu, S, Xu L (2012) A theoretical investigation of several model selection criteria for dimensionality reduction. Pattern Recognit Lett 33(9): 1117–1126.
  32. Tu, S, Xu L (2014) Learning binary factor analysis with automatic model selection. Neurocomputing 134: 149–158.
  33. Williams, CKI (2003) Learning kernel classifiers. J Am Stat Assoc 98(462): 489–490.
  34. Xu, L, Yan P, Chang T (1988) Best first strategy for feature selection. In: 9th International Conference on Pattern Recognition, 706–708. IEEE Computer Society Press, Piscataway, New Jersey.
  35. Xu, L (1995) Bayesian-Kullback coupled ying-yang machines: unified learnings and new results on vector quantization. In: Proc. Int. Conf. Neural Information Process (ICONIP ’95), 977–988. Publishing House of Electronics Industry, Beijing.
  36. Xu, L (2003) Independent component analysis and extensions with noise and time: a Bayesian ying-yang learning perspective. Neural Inform Process Lett Rev 1: 1–52.
  37. Xu, L (2009) Independent Subspaces. In: Encyclopedia of Artificial Intelligence, 892–901. IGI Global, Hershey, Pennsylvania.
  38. Xu, L (2010) Bayesian ying-yang system, best harmony learning, and five action circling. Front Electrical Electronic Eng China 5(3): 281–328.
  39. Xu, L (2011) Codimensional matrix pairing perspective of BYY harmony learning: hierarchy of bilinear systems, joint decomposition of data-covariance, and applications of network biology. Front Electr Electron Eng China 6: 86–119. A special issue on Machine Learning and Intelligence Science: IScIDE2010 (A).
  40. Xu, L (2012a) Semi-blind bilinear matrix system, BYY harmony learning, and gene analysis applications. In: Proceedings of The 6th International Conference on New Trends in Information Science, Service Science and Data Mining: 23-25 October 2012, 661–666. IEEE, Taipei.
  41. Xu, L (2012b) On essential topics of BYY harmony learning: current status, challenging issues, and gene analysis applications. Front Electrical Electronic Eng 7(1): 147–196.
  42. Xu, L (2013a) Integrative hypothesis test and A5 formulation: sample pairing delta, case control study, and boundary based statistics. In: Intelligence Science and Big Data Engineering. LNCS, 887–902. Springer, Berlin Heidelberg.
  43. Xu, L (2013b) Matrix-Variate discriminative analysis, integrative hypothesis testing, and geno-pheno A5 analyzer. In: Intelligent Science and Intelligent Data Engineering. LNCS, 866–875. Springer, Berlin Heidelberg.
  44. Xu, L (2015) Further advances on Bayesian ying yang harmony learning. Applied Informatics 2(5).
  45. Xu, L, Amari SI (2008) Combining classifiers and learning mixture-of-experts. In: J Ramon e.a. (ed) Encyclopedia of Artificial Intelligence, 318–326. IGI Global, Hershey: PA.
  46. Xu, L, Krzyzak A, Suen CY (1992b) Several methods for combining multiple classifiers and their applications in handwritten character recognition. IEEE Trans Syst Man Cybernet 22: 418–435.
  47. Zaykin, DV (2011) Optimally weighted z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 24(8): 1836–1841.

Copyright

© Xu; licensee Springer. 2015

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
