Enviro-geno-pheno state approach and state based biomarkers for differentiation, prognosis, subtypes, and staging
- Lei Xu^{1, 2}
DOI: 10.1186/s40535-016-0020-3
© The Author(s) 2016
Received: 17 April 2016
Accepted: 20 July 2016
Published: 2 August 2016
Abstract
Finding biomarkers for differentiation, prognosis, subtypes, and staging plays a key role in precision medicine and is usually carried out by association analysis on geno-measures and pheno-measures. Recent efforts have turned to identifying the role of a biomarker under a certain condition or in a particular environment, represented by a set of enviro-measures. This paper proposes to consider the joint domain of geno-measures, pheno-measures, and enviro-measures, in which one element (i.e., each triple jointly taken by the three measures) represents a possible behaviour of the bio-system under investigation. A collection of elements that are located adjacently and share a common system status represents a ‘state’, and the system is characterised by a number of such states learned from samples. Instead of directly using one or a set of geno-measures as a biomarker, such an enviro-geno-pheno state (E-GPS) is considered as a biomarker, indicating ‘health/normal’ versus ‘risk/abnormal’ together with its associated enviro-geno-pheno conditions. Association analyses for differentiation, prognosis, subtypes, and staging can be performed between such E-GPS biomarkers and those measures representing clinical phenotypes and treatments, made either on one state or across multiple states. Moreover, potential applications are suggested for analyses of expression data, sequencing data, and their integrative uses.
Keywords
Enviro-geno-pheno state; Biomarker; Differentiation; Prognosis; Stage; Subtype; Integrative study; Case–control study; Genome-scale sequencing; Expression profile analysis
Background
In an extended sense, we use a geno-measure g (g-measure for short) to refer to a genetic measure that takes either a real value or one of a few labels, e.g., the expression level of a gene, the frequency of a mutation, or the genotype of an SNP. A g-measure can also be \(\mathbf{g}\), a vector or a matrix with each element being one such genetic measure. On the other hand, we use a pheno-measure \(\phi\) (\(\phi\)-measure for short) to refer to a phenomenon indicator that is typically a categorical label or an integer, indicating different subtypes or stages of a cancer or complex disease. We may also use a real-valued \(\phi\) for a phenomenon that has a large number of categories or is directly featured by a continuous measure, e.g., survival length. When multiple phenomena are considered jointly, a \(\phi\)-measure can also be a vector \(\mathbf{\pmb {\phi }}\) with each element being such an individual phenomenon indicator.
A biomarker that distinguishes abnormal from normal is a g-measure g that demonstrates a significant difference between the case population and the control population; a biomarker that indicates subtypes or stages is a g-measure g that demonstrates a significant characteristic underlying the samples of each corresponding group; and a biomarker of prognosis provides a good prediction of post-treatment survival. Some biomarkers are common in the sense that they are useful for all of these purposes, while others work for only one or two of them.
Typically, a g-measure g is an SNP in GWAS or the expression value of a gene in an expression profile. Moreover, a g-measure can be a vector \(\mathbf{g}\) that consists of multiple SNPs in a segment of DNA sequence, where the segment corresponds to a gene or a noncoding RNA (lncRNA, circRNA, etc.) under consideration. In addition, \(\mathbf{g}\) may consist of a number of features obtained from mutation analysis. For analysing an expression profile of the tumour versus its paired adjacent tissue, \(\mathbf{g}\) is a two-dimensional vector that consists simply of the expressions of the tumour and of the paired adjacent tissue, see page 36 in Ref. Xu (2015a) and Fig. 7 in Ref. Xu (2016). More generally, a vector \(\mathbf{g}\) may represent a bio-unit under consideration that covers more than one gene, e.g., expressions of several mRNAs that group in a signature on a heat map, or certain features that represent one biological functional module.
Conventionally, whether a g-measure acts as a biomarker is examined on a set of case–control samples. For example, a gene expression biomarker for prognosis may take a high value to indicate a favourable outlook for survival. Alternatively, there may also be a biomarker that takes a low value to indicate a favourable outlook for survival. Traditionally, such a biomarker reflects a difference between samples of the g-measure without taking any specific condition into consideration. Recently, several lines of evidence suggest that it would be better to examine a biomarker under certain conditions or in a particular environment.
First, the meaning of a biomarker identified unconditionally may change considerably in a particular environment. An example is the recent finding of CDX2 as a prognostic gene expression biomarker in colon cancer (Dalerba et al. 2016). Without considering any condition, its high expression is preferred because the rate of 5-year disease-free survival with stage II CDX2-negative colon cancers was significantly lower than the rate with stage II CDX2-positive colon cancers. Interestingly, it was found that the rate of 5-year disease-free survival among patients with stage II CDX2-negative tumours who were treated with adjuvant chemotherapy became significantly higher than the rate among those who were not treated with adjuvant chemotherapy, i.e., a low expression became preferred under the condition that adjuvant chemotherapy was given.
Second, the role of a biomarker that cannot be identified unconditionally may become detectable under a specific condition. One example is a recent study on IDH1-mutant glioma malignant progression (Bai et al. 2016). Considering all 82 sequenced gliomas, conditional on all of them having IDH1 mutations, the role of rare mutations of NOTCH1 and NOTCH2 was identified; these mutations occur within sequences encoding the EGF-like domains, which is consistent with inactivating mutations identified in squamous cell carcinomas.
Third, a comprehensive study demands jointly considering a biomarker that consists of multiple g-measures, or even jointly considering multiple biomarkers, for which one effective way is a hierarchical formulation. In its simplest case, two parts can be considered jointly in sequence, by first considering one part and then considering the remaining part conditional on the first. One example is the recent molecular analysis of gastric cancer that identifies four subtypes of gastric cancer by a binary tree with three layers (Cristescu et al. 2015), where the subtypes MSS/TP53+ and MSS/TP53− are identified by an integrated biomarker named TP53 signature, conditional on an integrated biomarker named EMT signature and an integrated biomarker named MSI signature.
In summary, a geno-phenotype study involves not only g-measures and \(\phi\)-measures, but also a set \(\mathbf{e}\) of enviro-measures (e-measures for short) that specify a certain condition or a particular environment underlying the study. In other words, we actually make an enviro-geno-pheno integrative study, which may be denoted for short by \(g \xrightarrow [e]{} \phi\) or \(\mathbf{g} \xrightarrow [\mathbf{e}]{} \mathbf{\pmb {\phi }}\), where each e-measure may represent a treatment, a patient characteristic, or a g-measure jointly under consideration.
In the rest of this paper, we propose a generic approach as summarised in Table 1. First, the approach identifies one or several convex subsets in the joint domain of g-measures, \(\phi\)-measures, and e-measures, with each subset representing a state of the bio-subsystem under investigation. Such a state is called an enviro-geno-pheno state (E-GPS) and acts as an E-GPS biomarker, indicating ‘health/normal’ versus ‘risk/abnormal’ together with its associated enviro-geno-pheno conditions. Second, the approach makes association analyses from such E-GPS states to not only \(\phi\)-measures or clinical phenotypes but also e-measures, towards various tasks that include but are not limited to differentiation, prognosis, subtype, and staging.
More generally, g-measures need not be limited to genetic measures but could also be other measures that serve as the inner ground of a study, called ground measures (still g-measures for short). In other words, the E-GPS approach is also applicable to those data-mining tasks that can be formulated in the format \(g \xrightarrow [e]{}\phi\).
Methods
Whether a living system survives healthily or a machine system runs normally is featured by its internal status, which could be one of several types. One major type is ‘health/good/normal’ or negative ‘−’; the other is ‘risk/bad/abnormal’ or positive ‘+’. There could be other types too, e.g., sub-health or slightly abnormal. In addition, there may be a type indicating ‘unknown/confusing’, or ‘?’ for short.
Specifically, a system status is measured via a set \(\mathbf{g}\) of internal intrinsic or ground factors and a set \(\mathbf{e}\) of environmental factors, as well as a set \(\pmb {\phi }\) of the external behaviours or phenotypes that the system demonstrates correspondingly. Let \(\mathbf{G} , \mathbf{E}\), and \(\pmb {\Phi }\) denote the domains of \(\mathbf{g}\), \(\mathbf{e}\) and \(\pmb {\phi }\), respectively, as illustrated in Fig. 1a. A system variate \(\xi \in \mathcal{D}_{g\phi e}=\mathbf{G} \times \mathbf{E}\times \pmb {\Phi }\) represents an enviro-geno-pheno triple and is associated with a label that indicates an instance of the system status, e.g., coloured green for ‘normal’ and coloured red for ‘abnormal’ as illustrated in Fig. 1b. Moreover, a subset \(R_s\subset \mathcal{D}_{g\phi e}\) conceptually describes a possible relation among \(\mathbf{g}\), \(\mathbf{e}\), and \(\pmb {\phi }\). Not all possible subsets are interesting. We are interested in the case that \(R_s\) is convex and every element in \(R_s\) shares the same type of system status. The system behaves the same as long as \(\xi\) is located within \(R_s\), namely \(R_s\) represents one enviro-geno-pheno state (E-GPS) in \(\mathcal{D}_{g\phi e}\), denoted by s for short. The system behaviour is actually an external manifestation of one or several such states, as illustrated in Fig. 1h. For each E-GPS state s, its associated type Type(s) indicates the system status, and the boundary of its corresponding convex set \(R_s\) describes the condition \(\mathcal{B}_s=COND(s)=Boundary(R_s)\) for staying at this state, as illustrated in Fig. 1c. For simplicity, we subsequently focus on Type(s) taking one of the values \(-,+,?\); the study can be extended rather straightforwardly to other sub-health types.
Since every element in \(R_s\) is required to share the same type of system status, an E-GPS state is featured by its dedication to one specific type of system status, and is thus called a dedicating state, or d-state for short, e.g., the green d-state \(s_{11}\) in Fig. 1f is dedicated to a ‘normal’ system status. To tolerate some error or disturbance, we may relax the requirement so that every element in \(R_s\) only needs a high enough probability of sharing the same type of system status, that is, we consider the concept of d-state in a probabilistic sense, e.g., the red d-state \(s_{00}\) in Fig. 1f is dedicated to an ‘abnormal’ system status. In addition to d-states, we may also need to handle subsets in which samples of different types or of unknown types are mixed; we regard such a subset as a confusing state, or c-state for short, e.g., \(s_{10}\) in Fig. 1f.
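The distinction between a rare subset, a d-state, and a c-state can be sketched in a few lines. This is a minimal illustration only: the function name `state_type` and the thresholds `min_count` and `min_purity` are hypothetical, and a fixed purity threshold is used here as a simple stand-in for the significance tests (e.g., the \(\chi ^2\) test or the Fisher exact test) listed in Table 1.

```python
def state_type(n_pos, n_neg, min_count=10, min_purity=0.9):
    """Label a candidate subset R_s from its sample counts.

    n_pos, n_neg: numbers of '+' and '-' samples falling in R_s.
    min_count and min_purity are illustrative thresholds (assumptions),
    not values taken from the paper.
    """
    total = n_pos + n_neg
    if total == 0 or total < min_count:
        return "rare"  # too few samples for the subset to count as a state
    if max(n_pos, n_neg) / total >= min_purity:
        # one status type clearly dominates: a dedicated state
        return "d-state(+)" if n_pos > n_neg else "d-state(-)"
    return "c-state"  # the two status types compete for the samples


example_counts = {"s11": (1, 30), "s00": (28, 2), "s10": (12, 15)}
example_types = {name: state_type(*c) for name, c in example_counts.items()}
```

In practice the purity threshold would be replaced by a proper test on the counts, and the lower bound on the sample number plays the role of the "not rare" requirement.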
Though the first limitation may be overcome by replacing the hypersphere with a hyper-ellipse, doing so not only largely increases the number of free parameters, and thus becomes prone to overfitting, but may also still leave some subset in Eq. (2) non-convex. Favourably, Choice (b) requires only a small increment in free parameters, i.e., simply with the scalar a replaced by a vector \(\mathbf{a}\), such that the second limitation is overcome and the first limitation is at least partially overcome. Specifically, \(\mathcal{D}_{g\phi e}\) is partitioned into at least two and at most four convex subsets by two hyper-planes, and the resulting subsets may also have some orientation. Again, we may estimate the two hyper-planes by minimising the criterion given by Eqs. (1) or (4).
Illustrated in Fig. 1f is a simple example in which \(\mathcal{D}_{g\phi e}\) is partitioned into four convex subsets \(S_{11}\), \(S_{01}\), \(S_{10}\), and \(S_{00}\) by two lines. Specifically, the subset \(S_{11}\) represents a d-state \(s_{11}\) that is a good biomarker of ‘green’ (i.e., normal), even though unconditionally using g as a biomarker cannot differentiate the normal from the abnormal. As a d-state, the samples of the state \(s_{11}\) are all dedicated to ‘green’, while the state \(s_{00}\) is almost a d-state that corresponds to the subset \(S_{00}\), which consists of mostly red samples. The other two subsets \(S_{01}\) and \(S_{10}\) act as c-states. If the two lines in Fig. 1f are allowed to be adjusted freely, an optimal partition may be obtained by minimising the criterion given by Eq. (4).
To avoid overfitting, in Eq. (4) we impose a lower bound on the number of samples in each d-state. In addition, we may merge samples of adjacent c-states to form a big c-state before dividing one c-state into subsets on the next level, as illustrated in Fig. 2f. For a small sample size, we may further reduce the number of free parameters by constraining the two hyper-planes to be parallel, i.e., reducing one orientation vector \(\mathbf{a}\) to a scalar a that denotes the distance between the two parallel hyper-planes. Learning may then be simplified into a two-stage implementation as illustrated in Fig. 2c, d. First, the normal direction of the parallel hyper-planes is learned either directly by a support vector machine (SVM) (Suykens and Vandewalle 1999; Suykens et al. 2002) as shown in Fig. 2c or with the help of Fisher discriminant analysis (FDA) as shown in Fig. 2d. Second, samples are projected onto the normal direction and further divided into three subsets by minimising a simplified version of J(S) given by Eq. (4), as shown in Fig. 2a, b.
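The partition of the joint domain by two lines into up to four convex subsets can be illustrated as follows. This is a sketch under stated assumptions: the helper names `side`, `assign_subset`, and `count_by_subset` are hypothetical, and the indexing convention (the first index of \(S_{ij}\) records the side of the first hyper-plane, the second index the side of the second) is assumed for illustration and may differ from the convention of Fig. 1f.

```python
def side(w, b, x):
    """Return 1 if x lies on the non-negative side of w.x + b = 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def assign_subset(x, plane1, plane2):
    """Map a sample x to one of S_11, S_10, S_01, S_00 (assumed convention)."""
    (w1, b1), (w2, b2) = plane1, plane2
    return f"S_{side(w1, b1, x)}{side(w2, b2, x)}"


def count_by_subset(samples, labels, plane1, plane2):
    """Count '+' and '-' samples per subset; each subset is convex because
    it is an intersection of two half-spaces."""
    counts = {}
    for x, y in zip(samples, labels):
        key = assign_subset(x, plane1, plane2)
        counts.setdefault(key, {"+": 0, "-": 0})[y] += 1
    return counts


# Two axis-parallel lines x0 = 0.5 and x1 = 0.5, chosen only for illustration.
plane1 = ((1.0, 0.0), -0.5)
plane2 = ((0.0, 1.0), -0.5)
samples = [(0.9, 0.8), (0.8, 0.7), (0.1, 0.2), (0.9, 0.1), (0.8, 0.2), (0.2, 0.7)]
labels = ["-", "-", "+", "+", "-", "+"]
counts = count_by_subset(samples, labels, plane1, plane2)
```

The per-subset counts are what a criterion such as Eq. (4) would be evaluated on when the two lines are adjusted; the criterion itself is not reproduced here.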
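The second stage of this two-stage implementation can be sketched as a search for two cut points along the projected direction. The sketch assumes that the first stage (SVM or FDA) has already produced one projected score per sample. Since Eq. (4) is not reproduced here, the cost below is a hypothetical stand-in, not the paper's J(S): it counts minority-type samples in the two outer subsets and adds a penalty `lam` per sample left in the middle c-state. The names `three_way_split`, `min_size`, and `lam` are assumptions made for illustration.

```python
def three_way_split(scores, labels, min_size=3, lam=0.4):
    """Split projected samples into lower d-state, middle c-state, upper d-state.

    scores: projections onto the learned normal direction (stage one).
    labels: '+' or '-' per sample.
    min_size: lower bound on the number of samples in each outer subset.
    lam: assumed penalty per sample left in the middle c-state.
    Returns three lists of sample indices (lower, middle, upper).
    """
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ys = [labels[k] for k in order]
    n = len(ys)
    # prefix[i] = number of '+' among the first i sorted samples
    prefix = [0]
    for y in ys:
        prefix.append(prefix[-1] + (y == "+"))

    def minority(lo, hi):
        """Number of samples of the less frequent type in sorted slice [lo, hi)."""
        pos = prefix[hi] - prefix[lo]
        return min(pos, (hi - lo) - pos)

    best = None
    for i in range(min_size, n - min_size + 1):
        for j in range(i, n - min_size + 1):
            cost = minority(0, i) + minority(j, n) + lam * (j - i)
            if best is None or cost < best[0]:
                best = (cost, i, j)
    if best is None:
        raise ValueError("too few samples for the requested min_size")
    _, i, j = best
    return order[:i], order[i:j], order[j:]


scores = [0.1, 0.2, 0.3, 0.45, 0.5, 0.55, 0.8, 0.9, 1.0]
labels = ["-", "-", "-", "+", "-", "+", "+", "+", "+"]
lower, middle, upper = three_way_split(scores, labels)
```

The exhaustive search is quadratic in the number of samples, which is acceptable for the small sample sizes that motivate the parallel hyper-plane constraint. Note that this stand-in cost does not force the two outer subsets to be dedicated to opposite status types.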
Table 1 E-GPS states and E-GPS approach
Term | Description |
---|---|
(1) Identification of system status by E-GPS states | |
(a) E-GPS state | It is a convex set \(R_s\subseteq \mathcal{D}_{g\phi e}\) with all its elements sharing the same status type, e.g., the state \(s_{11}\) in Fig. 1f, and the probability that the system visits this state (i.e., falls within \(R_s\)) is bigger than a threshold, i.e., the state is not rare. Empirically, the percentage of a given set of samples falling in \(R_s\) should be large enough |
(b) Prob. E-GPS state (d-state vs. c-state) | It is a state that is not rare but probabilistic (prob.) in the sense that each element in \(R_s\) is either of Type ‘+’, numbering \(n_S^{+}\), or of Type ‘−’, numbering \(n_S^{-}\). It falls into one of two categories. d-state (dedicated state): \(max\{n_S^{+},n_S^{-}\}\) is significantly bigger than \(min\{n_S^{+},n_S^{-}\}\); empirically, samples falling in \(R_s\) are mostly dedicated to the same status type, e.g., the state \(s_{00}\) in Fig. 1f. c-state (confusing state): otherwise, i.e., two status types compete for samples in \(R_s\), e.g., the states \(s_{10}\) and \(s_{01}\) in Fig. 1f |
(c) c-state (cuttable vs noncuttable) | It is a c-state with at least one convex subset that is able to be cut off as a d-state, e.g., the state \(s_{01}\) in Fig. 1f; otherwise the c-state is said to be noncuttable under the current settings of \(\mathcal{D}_{g\phi e}\), e.g., the state \(s_{10}\) in Fig. 1f |
(d) Learning configuration of states | Overall, a set of at least one d-state and c-states (if any) is learned from a given set of samples, featured by not only these states but also their configuration that encodes the locations and mutual relations of these states, as illustrated in Fig. 1h |
(2) Refinements of E-GPS states | |
(a) cutting | Cut a cuttable c-state by linear separation, e.g., SVM (Suykens and Vandewalle 1999; Suykens et al. 2002) or FDA by Eqs. (11) and (12) in Ref. Xu (2015a), via refining the condition, e.g., the red line cuts \(s_{01}\) in Fig. 2f, which results in one convex subset as a d-state and one size-reduced c-state that may still be a cuttable c-state |
(b) merging | Merge adjacent d-states if their union is still convex, e.g., merging \(s_2, s_3, s_4\) in Fig. 2f. In addition, merge adjacent c-states, e.g., \(s_5, s_6\) in Fig. 2f |
(c) growing | Grow each d-state s with \(Type(s)=\)‘green’ by including those adjacent ‘green’ samples if the enlarged subset is still convex, and also grow each d-state s with \(Type(s)=\)‘red’ by including those adjacent ‘red’ samples if the enlarged subset is still convex |
(d) treating | Use additional conditions (e.g., one more variable is added to \(\mathbf{\phi }\)) such that more ‘green’ samples in the c-states become adjacent to and able to be re-allocated into some d-states in the above ways |
(3) Conditional phenotype analyses based on E-GPS states | |
(a) analysis per d-state | Prognosis analyses test whether \(max\{n_S^{+},n_S^{-}\}\) differs significantly from \(min\{n_S^{+},n_S^{-}\}\) by the \(\chi ^2\) test or the Fisher exact test, to identify whether this state is good for prognosis, while the boundary of this state indicates the conditions under which the judgement is made. Moreover, prognosis of an unlabelled sample may be made by a one-class classifier obtained from these conditions. Survival analyses plot K-M curves on samples with survival records and apply the log-rank test or the Cox proportional hazards test. Subtype analyses stratify samples of this state into each subtype, test the enrichment of each subtype in this state, plot K-M curves on each stratification, and examine the correlation or the intersection of each subtype with good and bad prognosis, as shown in Fig. 1h |
(b) analysis across d-states | Differentiation tests whether there is a significant pairwise difference either between samples of different d-states or between samples associated with different values of a phenotype, in one of the following manners: (i) a t-test when we ignore e and merely consider a univariate g; (ii) a multivariate test, e.g., the Hotelling test (Hotelling 1931), the BBT test [see Table 6 in Ref. Xu (2015a)], or the property-oriented test [see Algorithm 1 in Ref. Xu (2016)]; (iii) the model-based test proposed by Eqs. (29–31) in Ref. Xu (2015a); (iv) logistic or Cox regression: in the model \(\eta (\phi _t)=\mathbf{b}^T g_t+ \mathbf{a}^T e_t +c+\varepsilon _t\), we test whether one or more of the coefficients of \(\mathbf{b}\) are zero and whether one or more of the coefficients of \(\mathbf{a}\) are zero (e.g., by the score test or the Wald test) to examine whether the corresponding variables play significant roles. Staging is related to subtypes but differs in that it involves subtypes in a temporal order. A later stage is usually more serious than an earlier stage, which may be learned via the transition probabilities \(p(s_i|s_j)\) across the states in Fig. 1h. Cross-state integration compares the configuration of states to enhance the differentiation study above. Moreover, cross-state combination can further provide better performance, as illustrated in Fig. 1h. Given the output measure \(\zeta _{j,t}\) (e.g., p value, classification error, and predicted regression) for a particular sample t, we may get one weighted average \(\zeta _t=\sum _j \zeta _{j,t} p(s_j|t)\), as well as a combined classification rule \(p(+|t)=\sum _j p(+|s_j) p(s_j|t)> p(-|t)=\sum _j p(-|s_j) p(s_j|t)\) |
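The cross-state combination in Table 1(3)(b) amounts to two weighted sums over states. The sketch below is illustrative only: the function names and the example numbers are hypothetical, and it assumes that \(p(-|s_j)=1-p(+|s_j)\), i.e., the ‘?’ type is ignored.

```python
def weighted_average(zeta_by_state, p_state_given_t):
    """zeta_t = sum_j zeta_{j,t} * p(s_j | t) for one sample t."""
    return sum(zeta_by_state[s] * w for s, w in p_state_given_t.items())


def combined_rule(p_pos_given_state, p_state_given_t):
    """Return (p(+|t), p(-|t), decision) by mixing per-state posteriors.

    Assumes p(-|s_j) = 1 - p(+|s_j); this is an assumption of the sketch.
    """
    p_pos = sum(p_pos_given_state[s] * w for s, w in p_state_given_t.items())
    p_neg = sum((1.0 - p_pos_given_state[s]) * w
                for s, w in p_state_given_t.items())
    return p_pos, p_neg, ("+" if p_pos > p_neg else "-")


# Made-up membership weights p(s_j | t) and per-state posteriors p(+ | s_j).
p_state_given_t = {"s11": 0.7, "s00": 0.2, "s10": 0.1}
p_pos_given_state = {"s11": 0.05, "s00": 0.90, "s10": 0.50}
p_pos, p_neg, decision = combined_rule(p_pos_given_state, p_state_given_t)
```

How the membership weights \(p(s_j|t)\) are obtained is left open here; they could come, for example, from the distance of the sample to each state, but that choice is not specified in the table.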
Table 2 Potential applications
Task | Study description |
---|---|
(a) Expression data differentiation | We find d-states as biomarkers by examining one \({g}_a\) vs \({e}_a\) (e.g., one \({g}_a\) or \({g}_b\)) on a 2D scatter map. Also, one \({e}_c\) can be jointly examined with one map for \({e}_c=1\) and one map for \({e}_c=0\) |
(b) Mutation analysis | We examine one \({g}_D\) vs \({e}_c\). First, get a \(2\times 2\) table for \({g}_D\). Then, the table is split into a 3D one with one slice for \({e}_c=1\) and the other for \({e}_c=0\). Also, we may use one additional \({g}_D\) as \({e}_c\) to get a 3D table. Moreover, each slice may be further split by considering a new \({e}_c\). All the resulting slices are analysed in a way similar to Table 1(3)(a) |
(c) SNP analysis | The situation is similar to the above except that a \(2\times 2\) table becomes a \(2\times 3\) table, with \({g}_D\) taking one of three values to denote AA, Aa, and aa. When another SNP is used as \({e}_c\), its tri-valued \({g}_D\) is replaced by a binary one that takes either 0 if the sample has no SNP on this site or 1 otherwise |
(f) High-risk samples | Based on the above studies, we estimate the posterior \(p(+ |x)\) per sample x and pick a sample with its value higher than a threshold as a high-risk sample, which is directly applicable to expression data. For sequencing data, and particularly for finding SNPs, it is difficult to get \(p(+ |x)\) because merely a few samples have variants on a particular site of \(g_c\). Instead, a sample is regarded as at risk simply when there is a variant on the site of \(g_c\) or a large enough number of variants on the sites of multiple SNPs |
(g) Expression-sequencing echoing | We obtain d-states and trees on expression data and sequencing data, and examine whether the results from the two types of data are in accordance with each other |
(h) Expression-sequencing combining (ESC) test | Assuming that the null \(H_0\) holds on both the E-side and the S-side, and using \(E_{\lnot {H}^*}\) and \(S_{\lnot {H}^*}\) to denote raising an alarm on the corresponding side, we get \(p(E_{\lnot {H}^*}, S_{\lnot {H}^*}|s)=p(E_{\lnot {H}^*}| S_{\lnot {H}^*},s)\,p_S\), with \(p_S=p( S_{\lnot {H}^*}|s)\) being the p value obtained on the S-side and \(p(E_{\lnot {H}^*}| S_{\lnot {H}^*},s )\approx Card(B_E)/Card(B_S)\) being the probability of rejecting \(H_0\) on the E-side conditional on \(H_0\) being rejected on the S-side, where \(B_S\) consists of biomarkers on which \(H_0\) is rejected significantly on the S-side, and \(B_E\subseteq B_S\) consists of those biomarkers on which \(H_0\) is also regarded as significantly rejected on the E-side |
(i) E-GPS based Integration | Integration may also be made by examining one \({g}_a\) from expression of a gene versus \({g}_c\) from multiple SNPs within the DNA sequence of the gene (e.g., either the number of or the average score of multiple SNPs) |
* General settings
\(\mathbf{g}\): each of its elements is a g-variable that could be
- \({g}_a\) a real variable for expression of an RNA unit, e.g., any of mRNA, lncRNA, and circRNA;
- \({g}_b\) a real variable for a signature expression (i.e., a collective expression of a set of RNA units);
- \({g}_c\) a discrete label for an SNP in a DNA sequence (there could be multiple SNPs per RNA unit);
- \({g}_D\) a binary variable that indicates whether there is a mutation within a bio-unit sequence (e.g., gene, pathway, etc.). There are usually multiple variables for different types of mutations
\(\mathbf{\pmb {\phi }}\): each of its elements is a \(\phi\)-variable that could be
- \({\phi }_a\) a binary variable that indicates ‘case vs control’ or ‘abnormal vs normal’;
- \({\phi }_b\) a binary or discrete variable that indicates clinical features;
- \({\phi }_c\) a discrete label that indicates one of the subtypes, grades, or stages;
- \({\phi }_D\) a real variable that indicates the occurrence of an event (e.g., survival time)
\(\mathbf{e}\): each of its elements is an e-variable that could be
- \({e}_a\) a g-variable that acts as a condition for our examination;
- \({e}_b\) a \(\phi\)-variable that acts as a condition for our examination;
- \({e}_c\) a binary variable that indicates whether a treatment is made, e.g., adjuvant chemotherapy;
- \({e}_D\) an environmental variable, either discrete (e.g., sex M/F) or real (e.g., age)
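The mutation analysis in Table 2(b) can be sketched as follows: a \(2\times 2\) table of \({g}_D\) against \({\phi }_a\) is split into one slice per value of \({e}_c\), and each slice is tested separately. The sketch is a minimal illustration with hypothetical function names and made-up records; the Fisher exact test is written out with the standard library only, using the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    # hypergeometric probability of every table with the same margins
    probs = [comb(r1, x) * comb(r2, c1 - x) / denom
             for x in range(max(0, c1 - r2), min(r1, c1) + 1)]
    p_obs = comb(r1, a) * comb(r2, c) / denom
    # small relative tolerance guards against floating-point ties
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))


def stratified_tables(records):
    """Build one 2x2 table per slice e_c in {0, 1}.

    records: iterable of (g_D, phi_a, e_c), each entry 0 or 1.
    tables[e][g][phi] counts the samples with that combination.
    """
    tables = {0: [[0, 0], [0, 0]], 1: [[0, 0], [0, 0]]}
    for g, phi, e in records:
        tables[e][g][phi] += 1
    return tables


def p_values_per_slice(records):
    """Fisher exact p value of g_D versus phi_a within each e_c slice."""
    return {e: fisher_exact_two_sided(t[0][0], t[0][1], t[1][0], t[1][1])
            for e, t in stratified_tables(records).items()}


# Made-up records: the association is visible only in the treated slice e_c = 1.
records = ([(1, 1, 1)] * 4 + [(0, 0, 1)] * 4
           + [(1, 1, 0), (1, 0, 0), (0, 1, 0), (0, 0, 0)])
p_by_slice = p_values_per_slice(records)
```

Splitting a slice further by a new \({e}_c\) corresponds to applying the same function to the subset of records in that slice; with many slices, some correction for multiple testing would normally be needed, which is outside this sketch.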
Discussions
The E-GPS approach may find many uses in genomic biomarkers and cancer genetics, of which several applications are summarised in Table 2, including not only expression analyses and transcriptomic analysis of mRNA, lncRNA, and circRNA but also whole genome sequencing-based joint SNV analyses, mutation analyses, and methylation analyses, etc.
Additionally, it is also interesting to notice those degenerate situations in which phenotype information is unknown, e.g., all the red or green coloured points are turned into black dots. In such cases, all the states degenerate into the same type, namely an unknown state, or U-state for short. Each U-state actually represents a cluster of samples without any label information, and the task of identifying states degenerates into clustering analysis, for which one possible method is learning a mixture of multiple local subspaces, e.g., see Algorithm 5 in Ref. Xu (2015b). In addition, we may consider the binary factor analysis that describes \(2^m\) states with the number of free parameters significantly reduced, e.g., see Algorithm 6 and Algorithm 7 in Ref. Xu (2015b).
Without considering phenotype information, the nature of the task becomes quite different from that of the E-GPS approach, in which phenotype analysis plays a core role in various tasks. We may use unsupervised learning as a preprocessing stage and the resulting clusters as an initial state configuration, on which the E-GPS study is further performed to take phenotype information into consideration.
In many biomarker searching tasks, the data may be a mix of samples with phenotypes available and samples with phenotypes unknown or partially missing. Considering unlabelled data may help to improve performance, which relates to semi-supervised learning, e.g., see Algorithm 9 in Ref. Xu (2015b) for semi-supervised clustering and Algorithm 11 in Ref. Xu (2015b) for semi-supervised binary factor analysis.
Another possible extension is obtaining the E-GPS biomarkers by deep learning with multiple-layer networks, especially when we consider many genes and multiple conditional measures in a bio-system (e.g., a pathway) that consists of far more than a few genes. As illustrated in Fig. 3a, examples include stacked restricted Boltzmann machines (RBMs) (Hinton and Salakhutdinov 2006) and least mean square error reconstruction (LMSER) (Xu 1991, 1993). Interestingly, the class of functions performed by deep neural networks is here speculated to be equivalent to the class of functions that can be computed by Turing machines, from the perspective of partial recursive functions.
Conclusion
In the joint domain \(\mathcal{D}_{g\phi e}\) of geno-measures, pheno-measures, and enviro-measures, those elements that are located adjacently in a convex subset are identified as forming a state that serves as a biomarker. In place of a conventional biomarker that uses one or multiple g-measures unconditionally, this E-GPS approach provides a new biomarker analysis tool that considers not only geno-variables conditionally on a certain focused domain but also the joint enviro-geno-pheno effect, as well as E-GPS state based phenotype analyses such as differentiation, prognosis, subtype, staging, and pathogenic progression. Specifically, a two-stage method is proposed for learning these E-GPS states, and several possible applications are suggested. Moreover, it is further addressed that such an E-GPS approach facilitates the integrative study of expression and sequencing.
Declarations
Acknowledgements
This work was supported by the Zhi-Yuan chair professorship start-up Grant from Shanghai Jiao Tong University.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- Bai H, Harmancı AS, Erson-Omay EZ, Li J, Coşkun S, Simon M, Krischek B, Özduman K, Omay SB, Sorensen EA (2016) Integrated genomic characterization of idh1-mutant glioma malignant progression. Nat Genet 48(1):59–66
- Cristescu R, Lee J, Nebozhyn M, Kim K-M, Ting JC, Wong SS, Liu J, Yue YG, Wang J, Yu K (2015) Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes. Nat Med 21(5):449–456
- Dalerba P, Sahoo D, Paik S, Guo X, Yothers G, Song N, Wilcox-Fogel N, Forgó E, Rajendran PS, Miranda SP (2016) Cdx2 as a prognostic biomarker in stage II and stage III colon cancer. N Engl J Med 374(3):211–222
- Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
- Hotelling H (1931) The generalization of student’s ratio. Ann Math Stat 2(3):360–378
- Suykens JA, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3):293–300
- Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific Publishing, Singapore
- Xu L (1991) Least mse reconstruction for self-organization: (i) multi-layer neural nets and (ii) further theoretical and experimental studies on one layer nets. In: Proceedings of the international joint conference on neural networks-1991-Singapore. pp 2363–2373
- Xu L (1993) Least mean square error reconstruction principle for self-organizing neural-nets. Neural Netw 6(5):627–648
- Xu L (2015a) Bi-linear matrix-variate analyses, integrative hypothesis tests, and case–control studies. Appl Inform 2(1):1–39
- Xu L (2015b) Further advances on bayesian ying yang harmony learning. Appl Inform 2(5):1–45
- Xu L (2016) A new multivariate test formulation: theory, implementation, and applications to genome-scale sequencing and expression. Appl Inform 3(1):1–23