Star-causality and factor analysis: old stories and new perspectives
Applied Informatics volume 4, Article number: 17 (2017)
Abstract
Advances in causal discovery from data have become a widespread topic in machine learning in recent years. In this paper, studies on conditional independence-based causality are briefly reviewed along a line of observable two-variable, three-variable, star-decomposable, and tree-decomposable structures, as well as their relationship to factor analysis. Then, developments along this line are further addressed from three perspectives with a number of issues, especially on learning approximate star-decomposable and tree-decomposable structures, as well as their generalisations to block star-causality analysis on factor analysis and block tree-decomposable analysis on linear causal models.
Background
From the view of probability, “two events \(X_1\) and \(X_2\) have no relationship” is described as two random variables \(X_1\) and \(X_2\) that are independent, i.e., their joint distribution satisfies \(p(x_1,x_2)=p(x_1)p(x_2)\). Away from this end, \(X_1\) and \(X_2\) must have some dependence, which can be one of several types. In most existing big data mining efforts, what is considered is correlation. The absence of correlation, \(E[X_1X_2]=E[X_1]E[X_2]\) or \({\rm Cov}[X_1,X_2]=E[X_1X_2]-E[X_1]E[X_2]=0\), means that there is no second-order dependence between the two events, but they may still be dependent in one of the higher-order types.
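The gap between zero correlation and independence can be seen in a small numerical sketch. The data below are hypothetical and chosen only for illustration: \(X_1\) is placed on a symmetric grid and \(X_2=X_1^2\), so the two are fully dependent although their covariance vanishes.

```python
import numpy as np

# Hypothetical toy data: X1 is symmetric around zero and X2 = X1**2,
# so X2 is a deterministic function of X1 (strong dependence).
x1 = np.linspace(-1.0, 1.0, 2001)
x2 = x1 ** 2

def cov(a, b):
    """Covariance E[ab] - E[a]E[b], computed with divisor n."""
    return np.mean(a * b) - np.mean(a) * np.mean(b)

second_order = cov(x1, x2)        # Cov[X1, X2]: vanishes by symmetry
higher_order = cov(x1 ** 2, x2)   # Cov[X1^2, X2]: clearly non-zero

print(f"Cov[X1, X2]   = {second_order:.6f}")
print(f"Cov[X1^2, X2] = {higher_order:.6f}")
```

The second-order statistic reports no dependence, while a higher-order statistic reveals it.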
In particular, the correlation \(E[X_1X_2]\) is symmetric, while there also exist relationships that are asymmetric and even more interesting. One example is causality, i.e., the occurrence of the event \(X_1\) causes the occurrence of the event \(X_2\), but not inversely. But an asymmetric relationship is not necessarily a causal relation. One example is the regression \(E(x_2 \mid x_1)\), which is widely considered in data analysis studies but does not necessarily represent a causal relation.
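As a hedged illustration of this asymmetry, the sketch below fits both linear regressions to the same hypothetical five-point sample. The two slopes differ, yet both are legitimate summaries of the same symmetric covariance, so neither direction is singled out as causal.

```python
import numpy as np

# Hypothetical paired observations of X1 and X2 (illustrative values only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 2.5, 4.5, 4.0, 7.0])

c = np.cov(x1, x2, bias=True)          # 2x2 covariance matrix, divisor n
corr = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

slope_x2_on_x1 = c[0, 1] / c[0, 0]     # slope of E(x2 | x1) under a linear fit
slope_x1_on_x2 = c[0, 1] / c[1, 1]     # slope of E(x1 | x2) under a linear fit

print(f"correlation        = {corr:.4f}")
print(f"slope of x2 on x1  = {slope_x2_on_x1:.4f}")
print(f"slope of x1 on x2  = {slope_x1_on_x2:.4f}")
```

The product of the two slopes equals the squared correlation, which shows that the asymmetric regressions carry no information beyond the symmetric second-order statistics.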
In fact, whether \(X_1\) and \(X_2\) have a causal relation also depends on their environment W, which was first made precise by the common cause principle of Reichenbach (1956). This principle makes it possible to infer a causal relation from a statistical relation. Specifically, from the non-correlation
or the conditional independence
we can infer that there must exist one of the three causal relationships \(X_1 \leftarrow W \rightarrow X_2, \ X_1\rightarrow W \rightarrow X_2, \ X_1\leftarrow W \leftarrow X_2\), though we cannot identify specifically which one. We may identify Eq. (1) or even Eq. (2) from samples of the variables \(x_1,x_2, w\) when they are binary. However, this becomes increasingly difficult when the variables take multiple or even continuous values, a task for which a kernel-based approach was proposed in Fukumizu et al. (2008). Even worse, the environment typically consists of a set of features \(W_1, \ldots, W_k\), which makes the task much more difficult still. Alternatively, the Rubin Causal Model, first proposed by Rubin in 1974 and subsequently studied for many years (Rubin and Rubin 2011), considers the so-called average causal effect (ACE) by computing \(E[X_2 \mid X_1,W]\) or its differences with \(X_1,W\) taking different values.
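As a minimal numerical sketch of the ACE computation mentioned above, suppose \(X_1\) and \(W\) are binary and the conditional expectations \(E[X_2 \mid X_1,W]\) are already known. All numbers below are hypothetical, and averaging the stratum-wise differences over \(p(w)\) is an assumption of this sketch (it presumes that \(W\) captures all confounding), not a formula taken from the text.

```python
import numpy as np

# Hypothetical numbers: binary treatment X1, binary environment W.
p_w = np.array([0.6, 0.4])                 # p(W=0), p(W=1)
# e_x2[x1, w] = E[X2 | X1 = x1, W = w]
e_x2 = np.array([[0.20, 0.50],
                 [0.30, 0.70]])

# Stratum-wise differences E[X2 | X1=1, W=w] - E[X2 | X1=0, W=w].
effect_per_w = e_x2[1] - e_x2[0]

# Assumption of this sketch: average the differences over p(w).
ace = float(np.dot(p_w, effect_per_w))

print("effect per stratum:", effect_per_w)
print("averaged effect   :", round(ace, 4))
```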
Pearl (1986) has shown that the following decomposable distribution
of dichotomous variables \(x_1,x_2,x_3,w\) can be identified by examining whether the observable threevariable distribution
satisfies a necessary and sufficient condition on seven joint-occurrence probabilities of one, two, and three dichotomous variables, where these joint-occurrence probabilities are estimated from samples of \(x_1,x_2,x_3\). Moreover, a necessary but not sufficient condition for \(p(x_1,x_2,x_3)\) to be star-decomposable (as illustrated in Fig. 1a, b and to be further described in "Methods") is that all correlation coefficients \(\rho _{ji}, \ i,j \in \{1,2,3\}\) obey the following triangle inequalities:
Furthermore, for a tree-decomposable distribution (as illustrated in Fig. 1c and to be further described in "Methods") of dichotomous variables, it is also shown in Pearl (1986) that the topology of this tree can be uncovered uniquely from the observed correlation coefficients between pairs of variables, based on the following TETRAD conditions (Spearman 1904; Anderson and Rubin 1956):
Subsequently, Xu (1986) and Xu and Pearl (1987) further proceeded to study the distribution Eq. (3) of Gaussian variables \(x_1,x_2,x_3,w\) with three new results as follows:

1. The analysing tool used in Pearl (1986) stems from Eqs. (3) and (4) on dichotomous variables (i.e., Eq. 24 in Pearl 1986), which consider the products of conditional independence indirectly in a linear mixture and lead to a set of constraint equations that are solved to get a necessary and sufficient condition. In contrast, a new tool is suggested in Xu (1986) and Xu and Pearl (1987), which stems from
$$\begin{aligned} p(x_1,x_2,x_3 \mid w)=p(x_1 \mid w)p(x_2 \mid w)p(x_3 \mid w) \end{aligned} \tag{7}$$
that directly considers the product of conditional independence for inferring the star structure or topology of causality, and subsequently identifies the parameters of the involved distributions by
$$\begin{aligned} p(x_1,x_2,x_3)=\int p(x_1,x_2,x_3,w)\,\text{d}w. \end{aligned} \tag{8}$$

2. Instead of following Pearl (1986), which considers joint probabilities to form constraint equations from Eq. (4), Eq. (7) is turned into one or a number of equations on statistics of different orders. In particular, for Eq. (7) with Gaussian variables \(x_1,x_2,x_3,w\), the block decomposition of the covariance matrix (Gigi 1977) is adopted, with equalities and inequalities on the second-order statistics as constraints, which are further simplified into Eq. (5).

3. Specifically, the necessary and sufficient condition for \(p(x_1,x_2,x_3)\) of Gaussian variables to be star-decomposable is simply that the triangle inequalities of Eq. (5) hold, i.e., the star-causality by Eq. (3) and the latent structure by Eq. (4) can be recovered from merely the second-order statistics, i.e., the correlation coefficients \(\rho _{ji}, \ i,j \in \{1,2,3\}\).
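The check described in the third result can be sketched in a few lines. The form of the triangle inequalities used here, \(\rho_{jk} \ge \rho_{ji}\rho_{ik}\) for each choice of the middle index \(i\) with all correlations taken positive, is an assumption of this sketch about Eq. (5); the function name is hypothetical.

```python
def satisfies_triangle_inequalities(r12, r13, r23):
    """Check rho_jk >= rho_ji * rho_ik for every choice of the middle index i.

    Assumes all three correlations are strictly positive; this form of
    Eq. (5) is an assumption of the sketch, not quoted from the text.
    """
    if min(r12, r13, r23) <= 0:
        raise ValueError("this sketch expects strictly positive correlations")
    return (r23 >= r12 * r13) and (r13 >= r12 * r23) and (r12 >= r13 * r23)

# Correlations generated by one common cause with loadings 0.9, 0.8, 0.7.
print(satisfies_triangle_inequalities(0.72, 0.63, 0.56))

# A valid correlation matrix that no single common cause can explain.
print(satisfies_triangle_inequalities(0.80, 0.80, 0.50))
```

The second triple forms a positive definite correlation matrix, so failing the check is a statement about star-decomposability rather than about the validity of the correlations.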
When all the variables are Gaussian, the latent structure by
with the star-causality by
is actually equivalent to the classical factor analysis with only one factor. Pioneered by Spearman (1904), the question of whether the factor analysis model (as illustrated in Fig. 1d and to be further described in the next section) is identifiable has been a classical topic for more than 100 years, from perspectives that are more or less similar to constraints on the second-order statistics obtained from Eq. (9). The well-known TETRAD equations or differences were already discovered in Spearman (1904) and have been used for constructing causal structures not just in Pearl (1986) but also by others (Spirtes and Glymour 2000; Bartholomew 1995; Bollen and Ting 2000). Moreover, Theorem 4.2 in Anderson and Rubin (1956) also gave a necessary and sufficient condition for identifying whether a covariance matrix can be that of a factor analysis model with one factor and three observed variables, which is actually equivalent to Eq. (5) but expressed in a different format.
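The link to one-factor analysis can be made concrete with a short sketch. It assumes a standardised one-factor model \(x_i = a_i w + \varepsilon_i\) with positive loadings, so that the correlation matrix is \(R = \mathbf{a}\mathbf{a}^{\rm T} + {\rm diag}(d)\); the loadings, the function name, and the closed-form recovery below are illustrative assumptions rather than the procedure of the cited works.

```python
import numpy as np

def one_factor_from_triplet(R):
    """Recover loadings and unique variances of a one-factor model from a
    3x3 correlation matrix R, assuming R = a a^T + diag(d) with a > 0."""
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    a = np.sqrt(np.array([r12 * r13 / r23,
                          r12 * r23 / r13,
                          r13 * r23 / r12]))
    d = 1.0 - a ** 2       # unique variances; must be positive for a valid model
    return a, d

a_true = np.array([0.9, 0.8, 0.7])     # hypothetical loadings
R = np.outer(a_true, a_true)
np.fill_diagonal(R, 1.0)

a_hat, d_hat = one_factor_from_triplet(R)
print("recovered loadings :", a_hat)
print("unique variances   :", d_hat)
```

Under these assumptions, requiring every recovered unique variance to be positive amounts to requiring each squared loading, such as \(\rho_{12}\rho_{13}/\rho_{23}\), to stay below one, which is the same kind of constraint as a triangle inequality on the correlations.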
Methods
Following Pearl (1986), the following decomposition of a joint distribution
is called a star-decomposable distribution, as illustrated in Fig. 1a, and in particular triplet star-decomposable in Fig. 1b. Here, w acts as a common cause that affects the observable variables \(x_1,\ldots , x_k\); we use star-causality to name such a simple but important causal structure. A typical tree-causality has a tree structure, as illustrated in Fig. 1c. Moreover, we say that a distribution \(p(x_1,\ldots , x_n)\) is tree-decomposable if it is the marginal of a distribution \(p(x_1,\ldots , x_n; w_1,\ldots , w_m), m\le n-2\), that supports a tree structure, such that \(W_1,\ldots , W_m\) correspond to the internal nodes of the tree and \(x_1,\ldots , x_n\) to its leaves.
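To make the tree-decomposable case tangible, the following sketch builds the leaf correlations implied by a hypothetical four-leaf tree with two latent internal nodes and compares the three pairwise products that enter the TETRAD conditions. It assumes standardised Gaussian variables, for which the correlation between two leaves is the product of the edge correlations along the path joining them; the edge values and names are illustrative only.

```python
import itertools
import numpy as np

# Hypothetical tree: latent w1 -- latent w2; leaves x0, x1 hang on w1 and
# leaves x2, x3 hang on w2. Edge weights are correlations of standardised
# Gaussian variables, so a leaf-to-leaf correlation is the product along the path.
leaf_to_parent = np.array([0.9, 0.8, 0.7, 0.6])   # corr(x_i, its latent parent)
parent_of = [0, 0, 1, 1]                          # index of each leaf's latent parent
latent_corr = 0.5                                 # corr(w1, w2)

n = 4
rho = np.eye(n)
for i, j in itertools.combinations(range(n), 2):
    bridge = 1.0 if parent_of[i] == parent_of[j] else latent_corr
    rho[i, j] = rho[j, i] = leaf_to_parent[i] * leaf_to_parent[j] * bridge

# The three ways of splitting four leaves into two pairs, with the product
# of within-pair correlations for each split.
splits = {
    "(01|23)": rho[0, 1] * rho[2, 3],
    "(02|13)": rho[0, 2] * rho[1, 3],
    "(03|12)": rho[0, 3] * rho[1, 2],
}
best_split = max(splits, key=splits.get)

for name, value in splits.items():
    print(name, round(value, 6))
print("pairing suggested by the largest product:", best_split)
```

In this noise-free setting the two products that cut across the latent edge coincide, so their difference is zero, while the product for the true pairing is larger. With sample estimates the equality holds only approximately, which is the sensitivity issue taken up in the considerations that follow.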
We further push forward developments of discovering causality along the line of Xu (1986) and Xu and Pearl (1987) from three perspectives.
First, the causal tree constructing procedure proposed in Pearl (1986), and also adopted in Xu (1986) and Xu and Pearl (1987), may be improved by the following three considerations:

(a) In that procedure, the causal tree is constructed by joining triplets through checking the TETRAD equations by Eq. (6), while triplets are detected by the triangle inequalities by Eq. (5). However, Pearl (1986) pointed out that the TETRAD equalities are unlikely to be satisfied exactly in practice because we often have only sample estimates of the correlation coefficients. Though Pearl (1986) also tried to decide the 4-tuple topology on the basis of the permutation of indices that minimises the difference \(T_\mathrm{e}^{(ijkl)}\), experiments found that the structure which evolves from such a method is very sensitive to inaccuracies in the estimates of the correlation coefficients. Here, we suggest considering the TETRAD equalities by minimising the difference \(T_\mathrm{e}^{(ijkl)}\) subject to the constraints by Eq. (5).

(b) Beyond the triplet star-decomposable case, the star-causality in Fig. 1a may also be considered along the same line as Xu (1986) and Xu and Pearl (1987). Here the necessary and sufficient condition for star-decomposability is not just satisfying the triangle inequalities by Eq. (5) but also \(0.5n(n-1)-n\) equalities, which is equivalent to Theorem 4.2 in Anderson and Rubin (1956) for the identifiability of a covariance matrix as that of a factor analysis model with one factor in general. In other words, the consideration above can be extended to the general case in a similar way.

(c) Moreover, we may also combine an edge-removing procedure as used in the well-known PC algorithm (Spirtes and Glymour 1993, 2000), by which the link between two nodes is removed by testing the independence between them conditional on the remaining nodes. This check is also affected by inaccuracies in the estimates of the correlation coefficients, for which we may consider adding it to the minimisation of \(T_\mathrm{e}^{(ijkl)}\) subject to the constraints by Eq. (5).

Second, in addition to the above improvements, we proceed to a new method. The existing procedure is characterised by testing based on the set of correlation coefficients between observable variables, while the new method first estimates another set of correlation coefficients between observable variables and latent variables, and then tests based on both sets. Specifically, we propose the following suggestions:

(d) Equations (11) and (12) in Xu and Pearl (1987) were derived from Eq. (11) and are rewritten below:
$$\begin{aligned}&T_\mathrm{e}^{(ijw)} =0, \ i \ne j, \quad {\rm and} \quad B_\mathrm{e}^{(iw)} > 0, \quad \forall i \\ &\mathrm{where}\quad T_\mathrm{e}^{(ijw)}= \sigma _{ij} - { \sigma _{iw} \sigma _{jw} \over \sigma _{ww} }, \quad {\rm and} \quad B_\mathrm{e}^{(iw)}= \sigma _{ii} - { \sigma ^2_{iw} \over \sigma _{ww} }. \end{aligned} \tag{12}$$
Star-causality can be constructed by learning \(\sigma _{iw} , \forall i\) and \(\sigma _{ww} >0\) (or simply setting \(\sigma _{ww} =1\)) by the following constrained optimisation
$$\begin{aligned} \max \sum _i [B_\mathrm{e}^{(iw)} - \rho R^{(iw)}], \ \mathrm{s.t.}\ T_\mathrm{e}^{(ijw)}=0, \quad \forall i \ne j, \end{aligned} \tag{13}$$
which may have different implementations, e.g., by the Lagrange method. Also, sparse learning is added via the term
$$\begin{aligned} R^{(iw)} = \sum _i |\sigma _{iw}|, \end{aligned} \tag{14}$$
which prefers to push \(\sigma _{iw}\) towards zero in order to reduce false or unreliable relations, where \(\rho\) is a coefficient that controls the strength. \(R^{(iw)}\) has no effect if we simply set \(\rho =0\) and a strong effect when \(\rho >0\) takes a large value. After learning, we test whether this star-causality is justified by testing \(T_\mathrm{e}^{(ijw)}=0, \ \forall i \ne j\), or with the help of some sum \(\sum T_\mathrm{e}^{(ijw)}\) as a statistic.

(e) Once a star-causality is constructed, the latent node can be treated in a way similar to observable nodes, such that a new star-causality can be constructed from a combination of observable nodes and learned latent nodes. Hence, tree-decomposable causality can be constructed from star-causality in at least two manners. First, a tree-decomposable structure can be grown from a star-causality by gradually learning and testing newly added observable nodes and latent nodes. Second, a number of star-causality structures can be constructed in parallel and then combined to form a tree-decomposable structure with the help of some composition of the above learning and testing.

(f) The above studies may be further extended to consider non-Gaussian variables in a two-stage approach. At the first stage, the topology of the star-causality, and even of tree-decomposable causality in general, can be obtained from the correlation coefficients. At the second stage, the conditional probabilities and the marginal probabilities of each latent node can be estimated from Eq. (9) in a way similar to that in Xu (1986) and Xu and Pearl (1987). Specifically, each link can still be a linear equation and the conditional distribution \(p(x_i \mid w)\) or \(p(x_i \mid w_k)\) is still Gaussian, while each inner node w or \(w_k\) can even come from a non-Gaussian distribution. Moreover, we may also obtain constraint equations of higher-order statistics from Eq. (11).

Third, beyond causality between variables, we further proceed to considering causality between sets or blocks of variables. Lumping the latent factors \(\{W_k\}\) into one vector factor \(\mathbf{W}\), the factor analysis model in Fig. 1d may be turned into a block star-decomposable structure still in the format of Fig. 1a, with w in Eq. (10) simply replaced by \(\mathbf{W}=[W_1,\ldots ,W_k]^{\rm T}\). In the sequel, we address further details.

(g) In a way similar to that adopted in Xu (1986) and Xu and Pearl (1987), we may obtain a necessary and sufficient condition for such a star-decomposable structure based on Theorem 1 given in Fig. 2c. For the block star-decomposable problem in Fig. 2a, this is equivalent to requiring that the solution \(\Sigma _{XW}, \Sigma _{WW}, D\) of the following matrix equation be unique:
$$\begin{aligned} \Sigma _{XX} - \Sigma _{XW} \Sigma _{WW}^{-1} \Sigma _{XW}^{\rm T} =D, \end{aligned} \tag{15}$$
where \(\Sigma _{XX}=[ \sigma _{x_ix_j}]\) is the covariance matrix of the vector \(X=[x_1,\ldots ,x_n]^{\rm T}\), \(\Sigma _{XW}=[ \sigma _{x_iw_k}]\) is the covariance matrix between the vector X and the lumped latent vector W, and \(D={\rm diag}[d_1,\ldots , d_n], d_j>0, \forall j\) is a diagonal matrix. Getting such a unique solution is generally difficult, but possible when \(\Sigma _{XW}, \Sigma _{WW}\) have some particular structures. A typical example is \(\Sigma _{WW}={\rm diag}[ \sigma _{w_1w_1} ,\ldots , \sigma _{w_mw_m} ]\), which equivalently leads to the necessary and sufficient condition for identifying the factor analysis model \(\mathbf{X}=A\mathbf{W}+\mathbf{\mu }+\mathbf{\varepsilon }\) with a diagonal covariance matrix of \(\mathbf{\varepsilon }\), e.g., Theorem 4.1 in Anderson and Rubin (1956). Additionally, from the same motivation as for Eq. (12), we can get
$$\begin{aligned}T_\mathrm{e}^{(ijw)} &=\sigma _{x_ix_j} - \sum _k { \sigma _{x_iw_k}\sigma _{x_jw_k} \over \sigma _{w_kw_k} }, \nonumber\\ B_\mathrm{e}^{(iw)}&= \sigma _{x_ix_i} - \sum _k { \sigma ^2_{x_iw_k}\over \sigma _{w_kw_k} }. \end{aligned} \tag{16}$$
Then, block star-causality can be constructed by learning \(\sigma _{x_iw_k}, \forall i,k\) and \(\sigma _{w_iw_i}>0, \forall i\) (or simply setting each \(\sigma _{w_iw_i}=1\)), again by the constrained optimisation Eq. (13) with \(R^{(iw)} = \sum _i |\sigma _{iw_k}|\) and the subsequent testing.

(h) Similar to what is addressed in (d) and (e) above, tree-decomposable causality can be constructed from star-causality. Moreover, the tree-decomposable structure in Fig. 2a can be turned into not only a problem of star-decomposable causality in Fig. 1a, by lumping the latent factors \(\{W_k\}\) into one vector factor, but also a problem of triplet star-decomposable causality in blocks, as illustrated in Fig. 2b. Then, we get \(\Sigma _{XW}\) in a block structure, which increases the chance that Eq. (15) can be uniquely solved. Considering the variables of W in some structure, we may also extend this line to study a linear causal structure (Zhang et al. 2017; Shimizu et al. 2011) with both latent variables and loading variables in structures. Also, causality may be further discovered within each subset of variables. In other words, we may discover causality on multiple levels of a hierarchy in a top-down manner, or even trade off the top-down manner against the bottom-up manner addressed in the first two perspectives.
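The quantities in Eqs. (15) and (16) can be evaluated numerically with a short sketch. The two-factor loading matrix, the diagonal matrix, and the helper name `residual` below are hypothetical; the example takes \(\Sigma_{WW}=I\), in which case \(\Sigma_{XW}\) equals the loading matrix. With a single latent variable the same computation reduces to Eq. (12).

```python
import numpy as np

def residual(S_xx, S_xw, S_ww):
    """Left-hand side of Eq. (15): S_xx - S_xw S_ww^{-1} S_xw^T."""
    return S_xx - S_xw @ np.linalg.solve(S_ww, S_xw.T)

# Hypothetical two-factor model with five observables and S_ww = I.
A = np.array([[0.8, 0.0],
              [0.7, 0.2],
              [0.6, 0.3],
              [0.1, 0.9],
              [0.0, 0.7]])
D_true = np.diag([0.36, 0.47, 0.55, 0.18, 0.51])
S_ww = np.eye(2)
S_xx = A @ A.T + D_true

# Off-diagonal entries play the role of T_e^{(ijw)} and diagonal entries
# the role of B_e^{(iw)} in Eq. (16).
E = residual(S_xx, A, S_ww)
off_diag = E - np.diag(np.diag(E))

# An orthogonal rotation of the loadings satisfies Eq. (15) equally well.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
E_rot = residual(S_xx, A @ Q, S_ww)

print("largest |T_e|     :", np.abs(off_diag).max())
print("all B_e positive  :", bool(np.all(np.diag(E) > 0)))
print("rotation also fits:", bool(np.allclose(E_rot, E)))
```

The rotated loadings differ from the original ones yet leave the residual unchanged, which is one way to see why additional structure on \(\Sigma_{XW}\), such as the block structure discussed above, is helpful for obtaining a unique solution.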
Concluding remarks
Minimising the TETRAD differences \(T_\mathrm{e}^{(ijkl)}\) subject to the constraints by Eq. (5) may open a new road for learning a causal model from samples to approximate tree-decomposable causality. Instead of making tests based on the set of correlation coefficients between observable variables, as is typical in the existing procedure, we first perform the optimisation by Eq. (13) to estimate another set of correlation coefficients between observable variables and latent variables, and then make tests based on both sets. We may further proceed along this road to make block star-causality analysis on factor analysis and block tree-decomposable analysis on linear causal models.
References
Anderson TW, Rubin H (1956) Statistical inference in factor analysis. Proc Third Berkeley Symp Math Stat Probab 5:111–150
Bartholomew DJ (1995) Spearman and the origin and development of factor analysis. Br J Math Stat Psychol 48(2):211–220
Bollen KA, Ting KF (2000) A tetrad test for causal indicators. Psychol Methods 5(1):3
Fukumizu K, Gretton A, Sun X, Schölkopf B (2008) Kernel measures of conditional dependence. In: Advances in neural information processing systems. pp 489–496
Gigi NC (1977) Multivariate statistical inference. Academic Press, New York
Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
Reichenbach H (1956) The direction of time. University of California Press, Berkeley
Rubin DB, Rubin JL (2011) Causal model. In: International encyclopedia of statistical science. pp 1263–1265. Springer, Berlin
Shimizu S, Inazumi T, Sogawa Y, Hyvärinen A, Kawahara Y, Washio T, Hoyer PO, Bollen K (2011) DirectLiNGAM: a direct method for learning a linear non-Gaussian structural equation model. J Mach Learn Res 12:1225–1248
Spearman C (1904) “General intelligence,” objectively determined and measured. Am J Psychol 15(2):201–292
Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search. MIT Press, Cambridge
Spirtes PG, Glymour C (1993) Causation, prediction and search. In: Lecture notes in statistics, vol 81. Springer, Berlin
Xu L (1986) Investigation on signal reconstruction, search technique, and pattern recognition. Ph.D. Dissertation, Tsinghua University
Xu L, Pearl J (1987) Structuring causal tree models with continuous variables. In: Proceedings of the 3rd annual conference on uncertainty in artificial intelligence. pp 170–179
Zhang K, Gong M, Ramsey J, Batmanghelich K, Spirtes P, Glymour C (2017) Causal discovery in the presence of measurement error: identifiability conditions. arXiv preprint arXiv:1706.03768
Acknowledgements
This work was supported by the ZhiYuan chair professorship startup grant (WF220103010) from Shanghai Jiao Tong University.
Competing interests
The author declares that they have no competing interests.
Ethics approval and consent to participate
Not applicable.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 Causal discovery
 Factor analysis
 Tree decomposable
 Block star-causality