
Enviro-geno-pheno state approach and state based biomarkers for differentiation, prognosis, subtypes, and staging


Finding biomarkers for differentiation, prognosis, subtypes, and staging plays a key role in precision medicine and usually relies on association analysis between geno-measures and pheno-measures. Recent efforts have turned to identifying the role of a biomarker under a certain condition or in a particular environment, represented by a set of enviro-measures. This paper proposes to consider the joint domain of geno-measures, pheno-measures, and enviro-measures, in which each element (i.e., each triple jointly taken by the three measures) represents a possible behaviour of the bio-system under investigation. A collection of adjacent elements that share a common system status represents a ‘state’, and the system is characterised by a number of such states learned from samples. Instead of directly using one or a set of geno-measures as a biomarker, such an enviro-geno-pheno state (E-GPS) is considered as a biomarker, indicating ‘health/normal’ versus ‘risk/abnormal’ together with its associated enviro-geno-pheno conditions. Association analyses for differentiation, prognosis, subtypes, and staging can be performed between such E-GPS biomarkers and the measures representing clinical phenotypes and treatments, either on one state or across multiple states. Moreover, potential applications are suggested for analyses of expression data, sequencing data, and their integrative uses.


In an extended sense, we use a geno-measure g (g-measure for short) to refer to a genetic measure that takes either a real value or one of a few labels, e.g., the expression level of a gene, the frequency of a mutation, or the genotype of an SNP. Moreover, a g-measure can also be \(\mathbf{g}\), which denotes a vector or a matrix with each element being one such genetic measure. On the other hand, we use a pheno-measure \(\phi\) (\(\phi\)-measure for short) to refer to a phenomenon indicator that is typically a categorical label or an integer, indicating different subtypes or stages of a cancer or complex disease. Moreover, we may use a real-valued \(\phi\) for a phenomenon that has a large number of categories or is directly featured by a continuous measure, e.g., survival length. When multiple phenomena are considered jointly, a \(\phi\)-measure can also be a vector \(\mathbf{\pmb {\phi }}\) with each element being such an individual phenomenon indicator.

A biomarker that distinguishes abnormal from normal is a g-measure g that shows a significant difference between the case population and the control population; a biomarker that indicates subtypes or stages is a g-measure g that shows a significant characteristic in the samples of each corresponding group; and a biomarker of prognosis provides a good prediction of post-treatment survival. Some biomarkers are useful for all of these purposes, while others work for only one or two of them.

Typically, a g-measure g is an SNP in GWAS or the expression value of a gene in an expression profile. Moreover, a g-measure can be a vector \(\mathbf{g}\) that consists of multiple SNPs in a segment of DNA sequence, where the segment corresponds to a gene or a noncoding RNA (lncRNA, circRNA, etc.) under consideration. In addition, \(\mathbf{g}\) may consist of a number of features obtained from mutation analysis. For analysing an expression profile of tumour versus paired adjacent tissue, \(\mathbf{g}\) is a two-dimensional vector that consists simply of the expressions of the tumour and of the paired adjacent tissue; see page 36 in Xu (2015a) and Fig. 7 in Xu (2016). More generally, a vector \(\mathbf{g}\) may represent a bio-unit under consideration that covers more than one gene, e.g., the expressions of several mRNAs that group into a signature on a heat map, or certain features that represent one biological functional module.

Conventionally, whether a g-measure acts as a biomarker is examined on a set of case–control samples. For example, a gene expression biomarker for prognosis may take a high value to indicate a favourable survival outcome. Alternatively, there may also be a biomarker that takes a low value to indicate a favourable survival outcome. Traditionally, such a biomarker reflects a difference in the g-measure between samples without taking any specific condition into consideration. Recently, several reasons have emerged that support examining a biomarker under certain conditions or in a particular environment.

First, the meaning of a biomarker identified unconditionally may change considerably in a particular environment. An example is the recent finding of CDX2 as a prognostic gene expression biomarker in colon cancer (Dalerba et al. 2016). Without considering any condition, high expression is preferred because the rate of 5-year disease-free survival among patients with stage II CDX2-negative colon cancers was significantly lower than the rate among those with stage II CDX2-positive colon cancers. Interestingly, among patients with stage II CDX2-negative tumours, the rate of 5-year disease-free survival was significantly higher for those who were treated with adjuvant chemotherapy than for those who were not, i.e., low expression became preferred under the condition that adjuvant chemotherapy was given.

Second, the role of a biomarker that cannot be identified unconditionally may become detectable under a specific condition. One example is a recent study on IDH1-mutant glioma malignant progression (Bai et al. 2016). Considering all 82 sequenced gliomas, conditioned on the fact that they all have IDH1 mutations, the role of rare mutations of NOTCH1 and NOTCH2 was identified. These mutations occur within sequences encoding the EGF-like domains, which is consistent with inactivating mutations identified in squamous cell carcinomas.

Third, a comprehensive study demands jointly considering a biomarker that consists of multiple g-measures, or even jointly considering multiple biomarkers, for which one effective way is a hierarchical formulation. In the simplest case, two parts can be considered sequentially by first considering one part and then considering the remaining part conditioned on the first. One example is the recent molecular analysis of gastric cancer that identifies four subtypes by a binary tree with three layers (Cristescu et al. 2015), where the subtypes MSS/TP53+ and MSS/TP53− are identified by an integrated biomarker named the TP53 signature, conditioned on an integrated biomarker named the EMT signature and an integrated biomarker named the MSI signature.

In summary, a geno-phenotype study involves not only g-measures and \(\phi\)-measures, but also a set \(\mathbf{e}\) of enviro-measures (e-measures for short) that specify a certain condition or a particular environment underlying the study. In other words, we actually make an enviro-geno-pheno integrative study, which may be denoted by \(g \xrightarrow [e]{} \phi\) or \(\mathbf{g} \xrightarrow [\mathbf{e}]{} \mathbf{\pmb {\phi }}\), where each e-measure may represent a treatment, a patient characteristic, or a g-measure jointly under consideration.

In the rest of this paper, we propose a generic approach as summarised in Table 1. First, the approach identifies one or several convex subsets in the joint domain of g-measures, \(\phi\)-measures, and e-measures, with each subset representing a state of the bio-subsystem under investigation. Such a state is called an enviro-geno-pheno state (E-GPS) and acts as an E-GPS biomarker, indicating ‘health/normal’ versus ‘risk/abnormal’ together with its associated enviro-geno-pheno conditions. Second, the approach performs association analysis from such E-GPS states not only to \(\phi\)-measures or clinical phenotypes but also to e-measures, for various tasks that include but are not limited to differentiation, prognosis, subtyping, and staging.

More generally, g-measures need not be limited to genetic measures but can also be other measures that serve as the inner ground of a study, called ground measures (still g-measures for short). In other words, the E-GPS approach is also applicable to data-mining tasks that can be formulated in the format \(g \xrightarrow [e]{}\phi\).


Whether a living system survives healthily or a machine system runs normally is featured by its internal status, which can be one of several types. One major type is ‘health/good/normal’ or negative ‘−’; the other is ‘risk/bad/abnormal’ or positive ‘+’. There can be other types too, e.g., sub-health or slightly abnormal. In addition, there may be a type indicating ‘unknown/confusing’, or ‘?’ for short.

Specifically, a system status is measured via a set \(\mathbf{g}\) of internal intrinsic or ground factors and a set \(\mathbf{e}\) of environmental factors, as well as a set \(\pmb {\phi }\) of the external behaviours or phenotypes that the system demonstrates correspondingly. Let \(\mathbf{G} , \mathbf{E}\), and \(\pmb {\Phi }\) denote the domains of \(\mathbf{g}\), \(\mathbf{e}\), and \(\pmb {\phi }\), respectively, as illustrated in Fig. 1a. A system variate \(\xi \in \mathcal{D}_{g\phi e}=\mathbf{G} \times \mathbf{E}\times \pmb {\Phi }\) represents an enviro-geno-pheno triple and is associated with a label that indicates an instance of the system status, e.g., coloured green for ‘normal’ and red for ‘abnormal’ as illustrated in Fig. 1b. Moreover, a subset \(R_s\subset \mathcal{D}_{g\phi e}\) conceptually describes a possible relation among \(\mathbf{g}\), \(\mathbf{e}\), and \(\pmb {\phi }\). Not all possible subsets are interesting. We are interested in the case where \(R_s\) is convex and every element in \(R_s\) shares the same type of system status. The system behaves the same as long as \(\xi\) lies within \(R_s\); that is, \(R_s\) represents one enviro-geno-pheno state (E-GPS) in \(\mathcal{D}_{g\phi e}\), denoted by s for short. The system behaviour is actually an external manifestation of one or several such states, as illustrated in Fig. 1h. Each E-GPS state s carries two pieces of information. First, its associated type Type(s) indicates the system status; for simplicity we subsequently focus on Type(s) taking one of the values \(-,+,?\), and the study can be extended rather straightforwardly to other sub-health types. Second, the boundary of its corresponding convex set \(R_s\) describes the condition \(\mathcal{B}_s=COND(s)=Boundary(R_s)\) for staying in this state, as illustrated in Fig. 1c.

Because every element in \(R_s\) is required to share the same type of system status, an E-GPS state is featured by its dedication to one specific type of system status, and is thus called a dedicating state, or d-state for short, e.g., the green d-state \(s_{11}\) in Fig. 1f is dedicated to a ‘normal’ system status. To tolerate some error or disturbance, we may relax the requirement so that every element in \(R_s\) has a high enough probability of sharing the same type of system status; that is, we consider the concept of a d-state in a probabilistic sense, e.g., the red d-state \(s_{00}\) in Fig. 1f is dedicated to an ‘abnormal’ system status. In addition to d-states, we may also need to handle subsets in which samples of different types or unknown types are mixed; we regard such a subset as a confusing state, or c-state for short, e.g., \(s_{10}\) in Fig. 1f.

Instead of adopting the standard routine that directly uses one or a set of g-measures as a biomarker of the phenotypes of interest, we suggest using each d-state as a biomarker, called an E-GPS biomarker for short. Its difference from considering merely \(\mathbf{g}\) measures as a biomarker lies not just in jointly considering \(\mathbf{g}, \mathbf{e}\) as a biomarker. Even without \(\mathbf{e}\), an E-GPS biomarker \(R_s\subset \mathcal{D}_{g\phi e}\) degenerates to a binary relation, i.e., a subset of \(\mathbf{G} \times \pmb {\Phi }\), whereas traditionally we consider a special binary relation, namely a function \(F: \mathbf{G} \rightarrow \pmb {\Phi }\) or \(\pmb {\phi }=f(\mathbf{g})\). Indeed, the widely studied choice is a linear or logistic function \(f(\cdot )\), which is an example of merely considering \(\mathbf{g}\) measures as a biomarker. In other words, an E-GPS biomarker extends such a function not only to a binary relation but also further to a ternary relation \(R_s\subset \mathcal{D}_{g\phi e}\) with \(\mathbf{e}\) also taken into consideration, featured by the corresponding condition \(\mathcal{B}_s\) that summarises the boundary conditions on genotypes, phenotypes, and environments.

Fig. 1 Enviro-geno-pheno state as biomarker (E-GPS biomarker for short). (a) Each element of \(\mathcal{D}_{g\phi e}\) is generally a \({ d_g}\times { d_e}\times { d_{\phi }}\) data cube, where \({ d_g}\), \({ d_e}\), and \({ d_{\phi }}\) are the dimensionalities of \(\mathbf{g}\), \(\mathbf{e}\), and \(\mathbf{\pmb {\phi }}\), respectively. (b) When g, e, and \({ \phi }\) are univariate, the case is illustrated by a scatter plot, which degenerates into an \({ m_g}\times {m_e}\times {m_{\phi }}\) table that represents a discrete distribution when g, e, and \({ \phi }\) take \({ m_g}, {m_e}\), and \({m_{\phi }}\) discrete values, respectively. (c) A convex set \(R_s\) acts as an E-GPS biomarker, with the system status indicated by Type(s) and the boundary condition COND(s) on genotypes, phenotypes, and environments given by the boundary of \(R_s\). (d) The possible system statuses are featured by E-GPS states that are learned from given samples by minimising the criterion given by Eq. (1) or (4). (e) For a finite sample size, we prefer a simple parametric model, e.g., one of the two choices given in Eq. (7). (f) An E-GPS state corresponds to a convex subset with all its elements dedicated to the same status type, e.g., \(s_{11}\) is a biomarker of ‘green’. This may be relaxed to a probabilistic dedication, i.e., samples falling in a convex subset are mostly dedicated to the same status type. In contrast, a c-state is one in which two status types compete for samples, e.g., \(s_{01}\) and \(s_{10}\). (g) Prognosis analysis can be made per d-state, as addressed in Table 1 (3)(a). In addition, subtype analysis is made per state, with the top row indicating ‘green’ and ‘red’ samples and the other rows indicating subtypes in binary values. The relation between the E-GPS state under consideration and each subtype is examined by their intersection. (h) We may compare the configuration of states jointly. In addition, the results of phenotype analysis per state can be combined with the help of the weighting probability \(p(j|s_j)\) in accordance with the individual performance of each state. We may further make state transition analysis by estimating the transition probabilities \(p(s_i|s_j)\).

A representation of \(R_s\) is learned from a given set of samples, for which we may consider a d-state s to be described by the convex hull of samples of the corresponding type, as illustrated in Fig. 1c. It is better to jointly consider a d-state of type '+' (shortly \(s^{d_+}\)) by the convex hull \(\bar{H}^{d_+}\) of red samples and a d-state of type '−' (shortly \(s^{d_-}\)) by the convex hull \(\bar{H}^{d_-}\) of green samples, as illustrated in Fig. 1d. There may be a nonempty or rather large intersection \(\bar{S}_{\cap }\) that should be cut away from both \(\bar{H}^{d_+}\) and \(\bar{H}^{d_-}\), for which we shrink both \(\bar{H}^{d_+}\) and \(\bar{H}^{d_-}\) into a convex subset \({H}^{d_+}\subseteq \bar{H}^{d_+}\) and a convex subset \({H}^{d_-}\subseteq \bar{H}^{d_-}\) with minimal intersection \({S}_{\cap }\) but a maximal union \(S_{\cup }\) such that the red samples and the green samples are best represented by \({H}^{d_+}\) and \({H}^{d_-}\), respectively, which is implemented by minimising the following criterion

$$\begin{aligned} J(s^{d_+}, s^{d_-})=\frac{ |S_{\cap }|}{|S_{\cup }|}, \ \quad S_{\cap }={H}^{d_+}\cap {H}^{d_-}, \ \quad S_{\cup }={H}^{d_+}\cup {H}^{d_-}. \end{aligned} \tag{1}$$
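To make the overlap criterion concrete, the following is a minimal one-dimensional sketch in which each of \({H}^{d_+}\) and \({H}^{d_-}\) is taken to be a closed interval, the simplest convex set. It assumes that \(|\cdot |\) is read as length (the 1-D analogue of volume), which the text does not state explicitly; the function name and the interval representation are illustrative only.

```python
def overlap_criterion(h_plus, h_minus):
    """Return J = |S_cap| / |S_cup| for two closed intervals given as (low, high).

    Assumption: |.| is read as interval length, the 1-D analogue of volume.
    """
    lo_p, hi_p = h_plus
    lo_m, hi_m = h_minus
    # Length of the intersection; zero when the intervals are disjoint.
    inter = max(0.0, min(hi_p, hi_m) - max(lo_p, lo_m))
    # Inclusion-exclusion gives the length of the union.
    union = (hi_p - lo_p) + (hi_m - lo_m) - inter
    return inter / union if union > 0 else 0.0
```

Under this reading, shrinking either interval until the overlap vanishes drives J to 0, while two identical intervals give J = 1.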

As a whole, \({H}^{d_+}\) and \({H}^{d_-}\) jointly divide \(\mathcal{D}_{g\phi e}\) into four subsets

$$\begin{aligned}&\mathcal{S}=\{{S}^{d_+}_-, \ {S}^{d_-}_-, \ S_{\cap }, \ \lnot S_{\cup }\}, \\&{S}^{d_+}_-={H}^{d_+}-S_{\cap }, \ \quad {S}^{d_-}_-={H}^{d_-}-S_{\cap }, \ \quad \text{and} \quad \ \lnot S_{\cup }=\mathcal{D}_{g\phi e}-S_{\cup }, \end{aligned} \tag{2}$$

which includes those special cases of three subsets when \(S_{\cap }\) becomes one of \({H}^{d_+}\) and \({H}^{d_-}\), and also those special cases of two subsets when \({H}^{d_+}\) and \({H}^{d_-}\) are identical.
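The four-way division can be sketched as a membership test. The example below assumes that each of \({H}^{d_+}\) and \({H}^{d_-}\) is a hypersphere given by a centre and a radius, which is only one possible representation; the function names and the returned string labels are hypothetical.

```python
import math


def in_sphere(x, centre, radius):
    """True when point x lies inside or on the hypersphere."""
    return math.dist(x, centre) <= radius


def subset_of(x, sphere_plus, sphere_minus):
    """Name the subset of the four-way division that contains x.

    Each sphere is a (centre, radius) pair; both are hypothetical stand-ins
    for H^{d+} and H^{d-}.
    """
    in_plus = in_sphere(x, *sphere_plus)
    in_minus = in_sphere(x, *sphere_minus)
    if in_plus and in_minus:
        return "S_cap"          # intersection of the two sets
    if in_plus:
        return "S_plus_only"    # H^{d+} minus the intersection
    if in_minus:
        return "S_minus_only"   # H^{d-} minus the intersection
    return "not_S_cup"          # complement of the union
```

When the two spheres coincide, the two "only" subsets are empty, which corresponds to the special case of two subsets mentioned above.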

Each subset \(S\in \mathcal{S}\) is no longer guaranteed to be convex. Without requiring such convexity, we further characterise each subset by the following ratio of minority

$$\begin{aligned} r_{S}= { \min \{ n_S^{+},n_S^{-} \} \over n_S}, \quad \ n_{S}=n_S^{+}+n_S^{-}, \end{aligned} \tag{3}$$

where \(n_{S}\) is the number of samples in \(S\in \mathcal{S}\), of which \(n_S^{+}\) are red samples and \(n_S^{-}\) are green samples. We may regard S as a d-state when \(r_{S}\) falls below a threshold \(\gamma _s\) and \(n_{S}\) exceeds a minimum number \(n_0\). In this case, we expect to minimise \(r_{S}\) for every \(S\in \mathcal{S}\). However, there may be a subset S whose \(r_{S}\) cannot be reduced below \(\gamma _s\). Forcibly minimising such an \(r_{S}\) would unfavourably increase the minority ratios of other subsets, so it is better to exclude this \(r_{S}\) from the minimisation. Considering every \(S\in \mathcal{S}\) jointly, we minimise the following criterion

$$\begin{aligned} J( \mathcal{S})= \sum _{S \in \mathcal{S}, \ \text{s.t.}\ n_{S}\ge n_0\ \& \ r_{S} \le \gamma _s} \varepsilon ( r_{S}, d_{S}, \eta _S ), \end{aligned} \tag{4}$$

where \(\varepsilon ( u, v, w ) \ge 0\) is a function with

$$\begin{aligned} {\partial \varepsilon \over \partial u}\ge 0,\quad \ {\partial \varepsilon \over \partial v}\le 0, \quad \ {\partial \varepsilon \over \partial w}\le 0. \end{aligned} \tag{5}$$

That is, a smaller value of J favours a smaller \(r_{S }\), or equivalently a d-state. Moreover, \(d_{S }\) reflects the degree of separation between the samples inside and outside S. It follows from \({\partial \varepsilon \over \partial v}\le 0\) that a smaller value of J favours a bigger \(d_{S}\). Furthermore, \(\eta _S\) reflects the degree of balance in the numbers of samples over the subsets in \(\mathcal {S}\), e.g., we may consider the following entropy

$$\begin{aligned} \eta _S=-{1 \over \# \mathcal {S}}\sum _{S} { n_{S} \over \sum _{S} n_{S} } \ln { { n_{S} \over \sum _{S} n_{S} }} \ \quad {\text{or}} \ \quad \eta _S=- {1 \over \# \mathcal {S}} \sum _{S} \left[ { n_{S} \over \sum _{S} n_{S} } \right] ^2. \end{aligned} \tag{6}$$

It follows from \({\partial \varepsilon \over \partial w}\le 0\) that a smaller value of J favours a bigger \(\eta _S\), or a configuration with the least number of states and with samples dedicated to the states in a balanced way.
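The pieces above can be put together in a short sketch. The text leaves the function \(\varepsilon\) unspecified beyond its monotonicity, so the example assumes \(\varepsilon (r, \eta ) = r + \lambda e^{-\eta }\), which is non-negative, non-decreasing in r, and non-increasing in \(\eta\); the separation term \(d_S\) is left out. The default values of `gamma`, `n0`, and `lam` are hypothetical and chosen only for illustration.

```python
import math


def minority_ratio(n_plus, n_minus):
    """r_S of Eq. (3): share of the less frequent label in a subset."""
    return min(n_plus, n_minus) / (n_plus + n_minus)


def balance_entropy(sizes):
    """Entropy form of eta_S in Eq. (6), scaled by the number of subsets."""
    total = sum(sizes)
    probs = [n / total for n in sizes if n > 0]
    return -sum(p * math.log(p) for p in probs) / len(sizes)


def criterion(counts, gamma=0.2, n0=5, lam=0.5):
    """J(S) of Eq. (4) with an assumed eps(r, eta) = r + lam * exp(-eta).

    counts holds one (n_plus, n_minus) pair per subset. gamma, n0, lam and
    the form of eps are illustrative assumptions; d_S is left out.
    """
    eta = balance_entropy([p + m for p, m in counts])
    total = 0.0
    for n_plus, n_minus in counts:
        n = n_plus + n_minus
        if n < n0:
            continue                      # too few samples to qualify
        r = minority_ratio(n_plus, n_minus)
        if r <= gamma:                    # only qualifying d-states contribute
            total += r + lam * math.exp(-eta)
    return total
```

For instance, with counts `[(18, 2), (1, 19), (5, 5)]` only the first two subsets enter the sum, since the third has a minority ratio of 0.5 and is treated as a c-state.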

In implementation, there can be different choices for representing \({H}^{d_+}\) and \({H}^{d_-}\). Typically, a finite sample size leads us to prefer a simple parametric model. The two simplest choices are as follows:

$$\begin{aligned} \text{(a)}\ &\text{a hyper-sphere parameterised by a location vector } \mathbf{m} \text{ and a radius } a, \\ \text{(b)}\ &\text{a hyper-plane parameterised by a location vector } \mathbf{m} \text{ and a normal vector } \mathbf{a}. \end{aligned} \tag{7}$$

For example, Choice (a) is illustrated by the dashed circles in Fig. 1e, and Choice (b) is illustrated by the greyed planes in Fig. 1e. The former choice is similar to the general case in Fig. 1d and is parameterised by the least number of free parameters, which are estimated by minimising the criterion given by Eq. (1) or (4). However, there are two limitations. First, the spherical shape is not suitable for representing a sample population in an elongated configuration featured by some orientation. Second, there may be some subset in Eq. (2) that is not convex and thus loses the robustness of a convex set.

Though the first limitation may be overcome by replacing the hypersphere with a hyper-ellipsoid, doing so not only largely increases the number of free parameters, and thus becomes prone to overfitting, but also leaves open the possibility that some subset in Eq. (2) is still not convex. Favourably, Choice (b) requires only a small increment in free parameters, i.e., simply replacing the scalar a by a vector \(\mathbf{a}\), such that the second limitation is overcome and the first limitation is at least partially overcome. Specifically, \(\mathcal{D}_{g\phi e}\) is partitioned into at least two and at most four convex subsets by two hyper-planes, and the resulting subsets may also have some orientation. Again, we may estimate the two hyper-planes by minimising the criterion given by Eq. (1) or (4).

Fig. 1f illustrates a simple example in which \(\mathcal{D}_{g\phi e}\) is partitioned into four convex subsets \(S_{11}\), \(S_{01}\), \(S_{10}\), and \(S_{00}\) by two lines. Specifically, the subset \(S_{11}\) represents a d-state \(s_{11}\) that is a good biomarker of ‘green’ (i.e., normal), even though unconditionally using g as a biomarker cannot differentiate the normal from the abnormal. As a d-state, the samples of the state \(s_{11}\) are all dedicated to ‘green’, while the state \(s_{00}\) is almost a d-state that corresponds to the subset \(S_{00}\), which consists mostly of red samples. The other two subsets \(S_{01}\) and \(S_{10}\) act as c-states. Allowing the two lines in Fig. 1f to be adjusted freely, an optimal partition may be obtained by minimising the criterion given by Eq. (4).
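The partition by two hyper-planes can be sketched as follows. Each hyper-plane is given by a location vector m and a normal vector a, and a sample is assigned to a cell by the signs of \(\mathbf{a}^T(x-\mathbf{m})\) for the two planes. How the cell keys `(i, j)` map onto the subscripts of \(S_{ij}\) in the figure is an assumption, as are the function names.

```python
def side(x, m, a):
    """Return 1 if x lies on the side of the hyperplane that a points to, else 0."""
    return 1 if sum(ai * (xi - mi) for xi, mi, ai in zip(x, m, a)) >= 0 else 0


def partition_counts(samples, plane_1, plane_2):
    """Count '+' and '-' samples in each of the four cells cut by two hyperplanes.

    samples is a list of (x, label) pairs; each plane is a (m, a) pair.
    The cell key (i, j) is a hypothetical stand-in for the subscripts of S_ij.
    """
    counts = {(i, j): {"+": 0, "-": 0} for i in (0, 1) for j in (0, 1)}
    for x, label in samples:
        counts[(side(x, *plane_1), side(x, *plane_2))][label] += 1
    return counts


def cell_minority_ratio(cell):
    """Minority ratio of one cell, or None when the cell is empty."""
    n = cell["+"] + cell["-"]
    return min(cell["+"], cell["-"]) / n if n else None
```

A cell whose minority ratio is near zero is a candidate d-state, while a cell with a ratio near 0.5 behaves as a c-state; each cell is an intersection of half-spaces and is therefore convex.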

This approach differs from conventional linear discriminant analysis not just in that one hyper-plane is replaced by two hyper-planes, but also in that the classification error is replaced by the dedication degree of samples, while the c-states are excluded from disturbing the d-states. The samples of each c-state may then be further divided into two to four convex subsets using the same approach, e.g., the subset \(S_{01}\) in Fig. 1f can be further divided into two d-states still within \(\mathcal{D}_{g\phi e}\). Doing so recursively leads to a tree as illustrated in Fig. 2e. As a whole, red samples are represented jointly by a number of d-states, either in a probabilistic combination as illustrated in Fig. 1h or via a union of convex subsets, although this union may no longer be convex. Similarly, green samples are represented by a number of d-states too.

Fig. 2 Learning E-GPS and E-GPS-based analysis. (a) Learning E-GPS with \(J(\mathcal{S})\) given by Eq. (4) simplified into \(J(a_1, a_2)\), where \(\eta _S\) is given by Eq. (6) and \(\lambda \ge 0\) is a weight for the role of \(\eta _S\). (b) Two lines in special cases with only two free parameters. (c) We may get three convex subsets by considering two parallel lines featured by three free parameters. One way of learning the two lines is to use a support vector machine (SVM) (Suykens and Vandewalle 1999; Suykens et al. 2002) for several different choices of the margin a, among which we pick the best one according to J. (d) Alternatively, two parallel lines may also be obtained in two steps. First, we find the normal direction of the two lines by SVM and then project all the samples orthogonally onto that direction. Second, we treat the projected samples in the same way as in (b). Instead of SVM, the normal direction may also be determined by either Fisher discriminant analysis (FDA) or principal component analysis (PCA). (e) Recursively, we perform the division as illustrated in (c) or (d) on each c-state, and so on, until none of the remaining c-states can be cut further, see Table 1(1)(c). Finally, we get a tree with each d-state (e.g., \(s_{11}\)) as a leaf. (f) Defragmentation is performed from time to time by merging adjacent d-states \(s_2, s_3, s_4\) and merging c-states \(s_5, s_6\), see Table 1(3).

To avoid overfitting, in Eq. (4) we impose a lower bound on the number of samples in each d-state. In addition, we may merge the samples of adjacent c-states to form a big c-state before dividing one c-state into subsets on the next level, as illustrated in Fig. 2f. For a small sample size, we may further reduce the number of free parameters by constraining the two hyper-planes to be parallel, i.e., reducing one orientation vector \(\mathbf{a}\) to a scalar a that denotes the distance between the two parallel hyper-planes. Learning may then be simplified into a two-stage implementation as illustrated in Fig. 2c, d. First, the normal direction of the parallel hyper-planes is learned either directly by a support vector machine (SVM) (Suykens and Vandewalle 1999; Suykens et al. 2002) as shown in Fig. 2c or with the help of Fisher discriminant analysis (FDA) as shown in Fig. 2d. Second, samples are projected onto the normal direction and further divided into three subsets by minimising a simplified version of \(J(\mathcal{S})\) given by Eq. (4), as shown in Fig. 2a, b.
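The second stage can be sketched as a search over two thresholds on the projected samples. The simplified criterion \(J(a_1, a_2)\) is only given in Fig. 2a and is not reproduced here, so the example swaps in a stand-in objective: it maximises the number of samples that fall in outer groups qualifying as d-states. The projection direction is taken as given (in the text it would come from SVM, FDA, or PCA), and `gamma`, `n0`, and all function names are hypothetical.

```python
def project(x, direction):
    """Orthogonal projection of x onto a given direction (dot product)."""
    return sum(xi * di for xi, di in zip(x, direction))


def is_d_state(group, gamma, n0):
    """A group qualifies when it is large enough and nearly single-labelled."""
    n = len(group)
    return n >= n0 and min(group.count("+"), group.count("-")) / n <= gamma


def split_three(values, labels, gamma=0.1, n0=3):
    """Pick thresholds t1 <= t2 that cut projected samples into three groups.

    Stand-in objective (an assumption, not the paper's J): maximise the
    number of samples that fall in outer groups qualifying as d-states.
    Returns (t1, t2, covered). Assumes n0 >= 1.
    """
    pts = sorted(zip(values, labels))
    vals = sorted(set(values))
    # Candidate cuts: midpoints between neighbouring distinct values, plus both ends.
    cuts = [float("-inf")] + [(a + b) / 2 for a, b in zip(vals, vals[1:])] + [float("inf")]
    best = None
    for i, t1 in enumerate(cuts):
        for t2 in cuts[i:]:
            lower = [lab for v, lab in pts if v < t1]
            upper = [lab for v, lab in pts if v > t2]
            covered = sum(len(g) for g in (lower, upper) if is_d_state(g, gamma, n0))
            if best is None or covered > best[2]:
                best = (t1, t2, covered)
    return best
```

The samples left between the two thresholds form the middle c-state, which could then be handed to the same routine again in the recursive scheme described above.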

In Eq. (3), each sample within S contributes one count to either \(n_S^{+}\) or \(n_S^{-}\) regardless of where it is located. We may instead associate each sample \(x\in S\) with weight coefficients \(w^{+}(x)\) for red samples and \(w^{-}(x)\) for green samples, based on which we modify \(n_S^{+}\) and \(n_S^{-}\) in Eq. (3) as follows:

$$\begin{aligned}&n_S^{+}=\sum _{ \text{each red sample }\ x\in S }w^{+}(x), \\&n_S^{-}=\sum _{ \text{each green sample }\ x\in S }w^{-}(x). \end{aligned} \tag{8}$$

There are two types of choices for obtaining \(w^{+}(x)\) and \(w^{-}(x)\). One is based on how well x belongs to the corresponding state. A weight tends to be small if x marginally belongs to the state (e.g., it is located near the boundary of S) but large if x firmly belongs to the state (e.g., it is located deep inside S). The other is based on the distributions \(p(x|+)\) of red samples and \(p(x|-)\) of green samples, which may be given by nonparametric kernel estimation as follows:

$$\begin{aligned}&p(x|+)=\frac{1}{h^dn^{+}}\sum _{ \text{each red sample }\ \xi }K\left(\frac{x-\xi }{h}\right), \\&p(x|-)=\frac{1}{h^dn^{-}}\sum _{ \text{each green sample }\ \xi }K\left(\frac{x-\xi }{h}\right), \end{aligned} \tag{9}$$

where \(n^{+}, n^{-}\) are the total numbers of red and green samples, respectively, d is the dimension of x, and \(h>0\) is a small smoothing parameter. One simple example of \(K(\frac{x-\xi }{h})\) is a Gaussian kernel, for which each term \(\frac{1}{h^d}K(\frac{x-\xi }{h})\) is a Gaussian density with mean \(\xi\) and covariance \(h^2 I\).
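The kernel estimates can be sketched in a few lines with a standard Gaussian kernel. The text does not say how \(p(x|+)\) and \(p(x|-)\) are turned into the weights \(w^{+}(x)\) and \(w^{-}(x)\), so the `weights` function below normalises the two densities by their sum; that choice, like the function names, is an assumption made only for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian density at the vector u (zero mean, identity covariance)."""
    d = len(u)
    return math.exp(-0.5 * sum(ui * ui for ui in u)) / (2 * math.pi) ** (d / 2)


def kde(x, samples, h):
    """Kernel estimate of p(x | class) from the samples of one class."""
    d = len(x)
    total = sum(gaussian_kernel([(xi - si) / h for xi, si in zip(x, s)]) for s in samples)
    return total / (h ** d * len(samples))


def weights(x, red, green, h):
    """Hypothetical weights: each class density normalised by their sum."""
    p_plus, p_minus = kde(x, red, h), kde(x, green, h)
    z = p_plus + p_minus
    return (p_plus / z, p_minus / z) if z > 0 else (0.5, 0.5)
```

With this choice, a sample lying deep inside the red population receives a weight near 1 for red and near 0 for green, so that it counts almost fully towards \(n_S^{+}\) if it is red.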

Table 1 E-GPS states and E-GPS approach

As summarised in Table 1, the E-GPS approach is featured by identifying the system status via the E-GPS states, which are learned from a given set of samples as addressed previously in this section and then further refined by cutting, merging, and growing as addressed in Table 1(2). Subsequently, we conduct various conditional phenotype analyses based on the E-GPS states, as summarised in Table 1(3).

Table 2 Potential applications
Fig. 3 A possible extension: getting E-GPS biomarkers by deep learning. (a) Considering many genes and multiple conditional measures in a bio-system (e.g., a pathway) that consists of far more than a few genes, we may consider a multiple-layer network trained by deep learning, such as stacked RBMs (Hinton and Salakhutdinov 2006) and LMSER (Xu 1991, 1993), featured by unsupervised learning for a hierarchical abstraction of biomarkers. (b) According to the Turing–Church thesis, the class of partial recursive functions is precisely the class of functions that can be computed by Turing machines, which provides an interesting perspective for understanding deep learning. The basic functions are involved within each layer, and the operators of composition and primitive recursion correspond to forward processing across different layers (namely what is usually called ‘deep’), while the minimisation operator corresponds to a recurrent process from upper layers back to lower layers. From this perspective, we speculate that the class of functions performed by deep neural networks is also the class of functions that can be computed by Turing machines, for which ‘deep’ and ‘recurrent’ are indispensable.


The E-GPS approach may find many uses in genomic biomarkers and cancer genetics, of which several applications are summarised in Table 2, including not only expression analyses and transcriptomic analysis of mRNA, lncRNA, and circRNA but also whole genome sequencing-based joint SNV analyses, mutation analyses, and methylation analyses, etc.

Additionally, it is interesting to note the degenerate situations in which phenotype information is unknown, e.g., all the red or green points are turned into black dots. In such cases, all the states degenerate into the same type, namely an unknown state, or U-state for short. Each U-state actually represents a cluster of samples without any label information, and the task of identifying states degenerates into clustering analysis, for which one possible method is learning a mixture of multiple local subspaces, e.g., see Algorithm 5 in Xu (2015b). In addition, we may consider binary factor analysis, which describes \(2^m\) states with the number of free parameters significantly reduced, e.g., see Algorithm 6 and Algorithm 7 in Xu (2015b).

Without phenotype information, the nature of the task becomes quite different from that of the E-GPS approach, in which phenotype analysis plays a core role in various tasks. We may use unsupervised learning as a preprocessing stage and take the resulting clusters as an initial state configuration, on which the E-GPS study is then performed to take phenotype information into consideration.

In many biomarker searching tasks, the data may mix samples whose phenotypes are available with samples whose phenotypes are unknown or partially missing. Taking unlabelled data into account may help to improve performance, which relates to semi-supervised learning, e.g., see Algorithm 9 in Xu (2015b) for semi-supervised clustering and Algorithm 11 in Xu (2015b) for semi-supervised binary factor analysis.

Another possible extension is obtaining the E-GPS biomarkers with multiple-layer networks trained by deep learning, especially when we consider many genes and multiple conditional measures in a bio-system (e.g., a pathway) that consists of far more than a few genes. As illustrated in Fig. 3a, examples include stacked restricted Boltzmann machines (RBMs) (Hinton and Salakhutdinov 2006) and least mean square error reconstruction (LMSER) (Xu 1991, 1993). Interestingly, the class of functions performed by deep neural networks is speculated here to be equivalent to the class of functions that can be computed by Turing machines, from the perspective of partial recursive functions.


In the joint domain \(\mathcal{D}_{g\phi e}\) of geno-measures, pheno-measures, and enviro-measures, adjacent elements that lie in a convex subset are identified as forming a state, which serves as a biomarker. In place of a conventional biomarker that uses one or multiple g-measures unconditionally, the E-GPS approach provides a new biomarker analysis tool that considers not only geno-variables conditionally on a certain focused domain but also the joint enviro-geno-pheno effect, as well as E-GPS state based phenotype analyses such as differentiation, prognosis, subtyping, staging, and pathogenic progression. Specifically, a two-stage method is proposed for learning these E-GPS states, and several possible applications are suggested. Moreover, it is further addressed that such an E-GPS approach facilitates the integrative study of expression and sequencing.


  • Bai H, Harmancı AS, Erson-Omay EZ, Li J, Coşkun S, Simon M, Krischek B, Özduman K, Omay SB, Sorensen EA (2016) Integrated genomic characterization of IDH1-mutant glioma malignant progression. Nat Genet 48(1):59–66

  • Cristescu R, Lee J, Nebozhyn M, Kim K-M, Ting JC, Wong SS, Liu J, Yue YG, Wang J, Yu K (2015) Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes. Nat Med 21(5):449–456

  • Dalerba P, Sahoo D, Paik S, Guo X, Yothers G, Song N, Wilcox-Fogel N, Forgó E, Rajendran PS, Miranda SP (2016) CDX2 as a prognostic biomarker in stage II and stage III colon cancer. N Engl J Med 374(3):211–222

  • Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

  • Hotelling H (1931) The generalization of Student’s ratio. Ann Math Stat 2(3):360–378

  • Suykens JA, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3):293–300

  • Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific Publishing, Singapore

  • Xu L (1991) Least MSE reconstruction for self-organization: (i) multi-layer neural nets and (ii) further theoretical and experimental studies on one layer nets. In: Proceedings of the international joint conference on neural networks-1991-Singapore. pp 2363–2373

  • Xu L (1993) Least mean square error reconstruction principle for self-organizing neural-nets. Neural Netw 6(5):627–648

  • Xu L (2015a) Bi-linear matrix-variate analyses, integrative hypothesis tests, and case–control studies. Appl Inform 2(1):1–39

  • Xu L (2015b) Further advances on Bayesian Ying-Yang harmony learning. Appl Inform 2(5):1–45

  • Xu L (2016) A new multivariate test formulation: theory, implementation, and applications to genome-scale sequencing and expression. Appl Inform 3(1):1–23


This work was supported by the Zhi-Yuan chair professorship start-up Grant from Shanghai Jiao Tong University.

Competing interests

The authors declare that they have no competing interests.


Corresponding author

Correspondence to Lei Xu.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


Cite this article

Xu, L. Enviro-geno-pheno state approach and state based biomarkers for differentiation, prognosis, subtypes, and staging. Appl Inform 3, 4 (2016).
