# Star-causality and factor analysis: old stories and new perspectives

- Lei Xu

**Received:** 29 November 2017

**Accepted:** 5 December 2017

**Published:** 12 December 2017

## Abstract

Causal discovery from data has become a widespread topic in machine learning in recent years. In this paper, studies on conditional independence-based causality are briefly reviewed along a line from observable two-variable and three-variable cases to star-decomposable and tree-decomposable structures, together with their relationship to factor analysis. Developments along this line are then addressed from three perspectives with a number of issues, especially on learning approximately star-decomposable and tree-decomposable structures, as well as their generalisations to block star-causality analysis on factor analysis and block tree-decomposable analysis on linear causal models.

## Keywords

## Background

From the viewpoint of probability, “two events \(X_1\) and \(X_2\) have no relationship” is described as two random variables \(X_1\) and \(X_2\) being independent, i.e., their joint distribution factorises as \(p(x_1,x_2)=p(x_1)p(x_2)\). Otherwise, \(X_1\) and \(X_2\) must have some dependence, which can be of several different types. In most of the existing big data mining efforts, what is considered is correlation. Zero correlation, i.e., \(E[X_1X_2]=E[X_1]E[X_2]\) or \({\rm Cov}[X_1,X_2]=E[X_1X_2]-E[X_1]E[X_2]=0\), means that there is no second-order dependence between the two events, but they may still be dependent in one of the higher-order types.
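As a minimal numerical illustration of the last point, the sketch below draws a symmetric \(X_1\) and sets \(X_2=X_1^2\). This construction, the sample size, and the variable names are illustrative assumptions rather than anything taken from the paper. The two variables are uncorrelated, yet a fourth-order statistic exposes their dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200_000)
x2 = x1 ** 2  # deterministic function of x1, hence fully dependent on it

# Second-order statistic: Cov[X1, X2] = E[X1^3] - E[X1] E[X1^2] = 0 for symmetric X1
cov_12 = np.mean(x1 * x2) - np.mean(x1) * np.mean(x2)

# Higher-order statistic: Cov[X1^2, X2] = Var[X1^2] = 2 for a standard normal X1
cov_sq = np.mean(x1**2 * x2) - np.mean(x1**2) * np.mean(x2)

print(f"Cov[X1, X2]   = {cov_12:.4f}")
print(f"Cov[X1^2, X2] = {cov_sq:.4f}")
```

The first value is close to zero up to sampling noise, while the second is close to 2, so a test based on correlation alone would miss the relationship.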

In particular, the correlation \(E[X_1X_2]\) is symmetric, whereas some relationships are asymmetric and even more interesting. One example is causality, i.e., the occurrence of the event \(X_1\) causes the occurrence of the event \(X_2\), but not conversely. However, an asymmetric relationship is not necessarily a causal relation. One example is a regression \(E(x_2|x_1)\), which is widely considered in data analysis studies but does not necessarily represent a causal relation.

*W*, which was first made precise by the common cause principle of Reichenbach (1956). This principle makes it possible to infer causal relation from statistical relation. Specifically, it follows from the non-correlation

- 1.
The analysing tool used in Pearl (1986) stems from Eqs. (3) and (4) on dichotomous variables (i.e., Eq. 24 in Pearl 1986), which considers the products of conditional independence indirectly in a linear mixture and leads to a set of constraint equations that are solved to get a necessary and sufficient condition. Differently, a new tool is suggested in Xu (1986) and Xu and Pearl (1987), which stems from
$$\begin{aligned} p(x_1,x_2,x_3|w)=p(x_1|w)p(x_2|w)p(x_3|w) \end{aligned}$$(7)
that directly considers the product of conditional independence for inferring the star structure or topology of causality, and subsequently identifies the parameters of the involved distributions by
$$\begin{aligned} p(x_1,x_2,x_3)=\int p(x_1,x_2,x_3,w)\,\text{d}w. \end{aligned}$$(8)

- 2.
Instead of following Pearl (1986), which considers joint probabilities to form constraint equations from Eq. (4), Eq. (7) is turned into one or a number of equations on different orders of statistics. In particular, for Eq. (7) with Gaussian variables \(x_1,x_2,x_3,w\), the block decomposition of the covariance matrix (Gigi 1977) is adopted, with equalities and inequalities on the second-order statistics as constraints, which are further simplified into Eq. (5).

- 3.
Specifically, the necessary and sufficient condition for \(p(x_1,x_2,x_3)\) of Gaussian variables to be star-decomposable is simply that the triangle inequalities by Eq. (5) hold, i.e., the star-causality by Eq. (3) and the latent structure by Eq. (4) can be recovered from merely the second-order statistics, namely the correlation coefficients \(\rho _{ji}, \ i,j \in \{x_1,x_2,x_3\}\).
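The check on a Gaussian triplet can be sketched numerically. Eq. (5) is not reproduced in this section, so the sketch assumes the form that follows from a one-factor model \(x_i=a_iw+e_i\) with unit-variance variables: \(\rho _{ij}=a_ia_j\), hence \(a_i^2=\rho _{ij}\rho _{ik}/\rho _{jk}\), which may not exceed 1. The function name `star_loadings` and the sign convention are hypothetical choices made for illustration.

```python
import numpy as np

def star_loadings(rho12, rho13, rho23):
    """Return loadings (a1, a2, a3) of a one-factor model, or None if none exists.

    Assumes unit-variance variables and nonzero correlations.
    """
    # Signs must be consistent: rho12 * rho13 * rho23 = (a1 * a2 * a3)^2 > 0
    if rho12 * rho13 * rho23 <= 0:
        return None
    sq = np.array([
        rho12 * rho13 / rho23,  # a1^2
        rho12 * rho23 / rho13,  # a2^2
        rho13 * rho23 / rho12,  # a3^2
    ])
    # Assumed triangle-type inequalities: a squared loading cannot exceed the unit variance
    if np.any(sq > 1.0):
        return None
    a = np.sqrt(sq)
    # Fix a1 > 0 and read the remaining signs off rho12 and rho13
    return a * np.array([1.0, np.sign(rho12), np.sign(rho13)])

print(star_loadings(0.72, 0.63, 0.56))  # correlations generated by loadings 0.9, 0.8, 0.7
print(star_loadings(0.90, 0.90, 0.10))  # a1^2 would be 8.1, so no star structure exists
```

Only the three correlation coefficients enter the computation, which is the sense in which the structure is recovered from second-order statistics alone.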

## Methods

*w* acts as a common cause that affects the observable variables \(x_1,\ldots , x_k\); we use star-causality to name such a simple but important causal structure. A typical tree-causality has a tree structure, as illustrated in Fig. 1c. Moreover, we say that a distribution \(p(x_1,\ldots , x_n)\) is tree-decomposable if it is the marginal of a distribution \(p(x_1,\ldots , x_n; w_1,\ldots , w_m), m\le n-2\) that supports a tree-structured network, such that \(w_1,\ldots , w_m\) correspond to the internal nodes of a tree and \(x_1,\ldots , x_n\) to its leaves.

We further push forward developments of discovering causality along the line of Xu (1986) and Xu and Pearl (1987) from three perspectives.

- (a)
In that procedure, a causal tree is constructed by joining triplets through checking the TETRAD equations by Eq. (6), while the triplets are detected by the triangle inequalities by Eq. (5). However, Pearl (1986) pointed out that the TETRAD equalities are unlikely to be satisfied exactly in practice, because we often have only sample estimates of the correlation coefficients. Although Pearl (1986) also tried to decide the 4-tuple topology on the basis of the permutation of indices that minimises the difference \(T_\mathrm{e}^{(ijkl)}\), experiments found that the structure that evolves from such a method is very sensitive to inaccuracies in the estimates of the correlation coefficients. Here, we suggest handling the TETRAD equalities by minimising the difference \(T_\mathrm{e}^{(ijkl)}\) subject to the constraints by Eq. (5).

- (b)
Beyond triplet star-decomposability, the star-causality in Fig. 1a may also be considered along the same line as Xu (1986) and Xu and Pearl (1987). In this case, the necessary and sufficient condition for star-decomposability is not just satisfying the triangle inequalities by Eq. (5) but also \(0.5n(n-1)-n\) equalities, which is equivalent to Theorem 4.2 in Anderson and Rubin (1956) for the identifiability of a covariance matrix as that of a factor analysis model with one factor in general. In other words, the consideration above can be extended to the general case in a similar way.

- (c)
Moreover, we may also combine an edge-removing procedure as used in the well-known PC algorithm (Spirtes and Glymour 1993, 2000), by which the link between two nodes is removed by testing the independence between them conditional on the remaining nodes. This checking is also affected by inaccuracies in the estimates of the correlation coefficients, for which we may consider adding it to the minimisation of \(T_\mathrm{e}^{(ijkl)}\) subject to the constraints by Eq. (5).

Second, in addition to the above improvements, we proceed to a new method. The existing procedure is characterised by making tests based on the set of correlation coefficients between observable variables, whereas the new method first estimates another set of correlation coefficients between observable variables and latent variables, and then makes tests based on both sets. Specifically, we propose the following two suggestions:

- (d)
Equations (11) and (12) in Xu and Pearl (1987) were derived from Eq. (11) and are rewritten below:
$$\begin{aligned}&T_\mathrm{e}^{(ijw)} =0, \ i \ne j, \quad {\rm and} \quad B_\mathrm{e}^{(iw)} > 0, \quad \forall i \\ &\mathrm{where}\quad T_\mathrm{e}^{(ijw)}= \sigma _{ij} -{ \sigma _{iw} \sigma _{jw} \over \sigma _{ww} }, \quad {\rm and} \quad B_\mathrm{e}^{(iw)}= \sigma _{ii} -{ \sigma ^2_{iw} \over \sigma _{ww} }. \end{aligned}$$(12)
Constructing star-causality can be made by learning \(\sigma _{iw} , \forall i\) and \(\sigma _{ww} >0\) (or simply setting \(\sigma _{ww} =1\)) by the following constrained optimisation
$$\begin{aligned} \max \sum _i [B_\mathrm{e}^{(iw)}-\rho R^{(iw)}], \ \mathrm{s.t.}\ T_\mathrm{e}^{(ijw)}=0, \quad \forall i \ne j, \end{aligned}$$(13)
which may have different implementations, e.g., by the Lagrange method. Also, sparse learning is added via the term
$$\begin{aligned} R^{(iw)} = \sum _i |\sigma _{iw}|, \end{aligned}$$(14)
which prefers to push \(\sigma _{iw}\) towards zero in order to reduce a false or unreliable relation, where \(\rho\) is a coefficient that controls the strength. \(R^{(iw)}\) takes no effect if we simply set \(\rho =0\), and takes a strong effect when \(\rho >0\) is large. After learning, we test whether this star-causality is justified by testing \(T_\mathrm{e}^{(ijw)}=0, \ \forall i \ne j\), or with the help of some sum \(\sum T_\mathrm{e}^{(ijw)}\) as a statistic.

- (e)
Once a star-causality is constructed, the latent node can be treated in a way similar to observable nodes, such that a new star-causality can be constructed from a combination of observable nodes and learned latent nodes. Hence, tree-decomposable causality can be constructed from star-causality in at least two manners. First, a tree-decomposable structure can be grown from a star-causality by gradually learning and testing newly added observable nodes and latent nodes. Second, a number of star-causality structures can be constructed in parallel and then combined to form a tree-decomposable structure with the help of some composition of the above learning and testing.

- (f)
The above studies may be further extended to consider non-Gaussian variables in a two-stage approach. At the first stage, the topology of the star-causality, and even of tree-decomposable causality in general, can be obtained from the correlation coefficients. At the second stage, the conditional probabilities and the marginal probabilities of each latent node can be estimated from Eq. (9) in a way similar to that in Xu (1986) and Xu and Pearl (1987). Specifically, each link can still be a linear equation and the conditional distribution \(p(x_i|w)\) or \(p(x_i|w_k)\) is still Gaussian, while each inner node *w* or \(w_k\) can even come from a non-Gaussian distribution. Moreover, we may also obtain constraint equations of higher-order statistics from Eq. (11).

Third, beyond causality between variables, we further proceed to considering causality between sets or blocks of variables. Lumping latent factors \(\{W_k\}\) into one vector factor \(\mathbf{W}\), the factor analysis model in Fig. 1d may be turned into a block star-decomposable structure still in the format of Fig. 1a, with *w* in Eq. (10) simply replaced by \(\mathbf{W}=[W_1,\ldots ,W_k]^{\rm T}\). In the sequel, we address further details.

- (g)
In a way similar to that adopted in Xu (1986) and Xu and Pearl (1987), we may obtain a necessary and sufficient condition for such a star-decomposable structure based on Theorem 1 given in Fig. 2c. For the block star-decomposable problem in Fig. 2a, this is equivalent to requiring that the solution \(\Sigma _{XW}, \Sigma _{WW}, D\) of the following matrix equation is unique:
$$\begin{aligned} \Sigma _{XX} -\Sigma _{XW} \Sigma _{WW}^{-1} \Sigma _{XW}^{\rm T} =D, \end{aligned}$$(15)
where \(\Sigma _{XX}=[ \sigma _{x_ix_j}]\) is the covariance matrix of the vector \(X=[x_1,\ldots ,x_n]^{\rm T}\), \(\Sigma _{XW}=[ \sigma _{x_iw_k}]\) is the covariance matrix between the vector *X* and the lumped latent vector *W*, and \(D={\rm diag}[d_1,\ldots , d_n], d_j>0, \forall j\) is a diagonal matrix. Getting such a unique solution is generally difficult, but possible when \(\Sigma _{XW}, \Sigma _{WW}\) have some particular structures. A typical example is \(\Sigma _{WW}={\rm diag}[ \sigma _{w_1w_1} ,\ldots \sigma _{w_mw_m} ]\), which equivalently leads to getting the necessary and sufficient condition for identifying the factor analysis model \(\mathbf{X}=A\mathbf{W}+\mathbf{\mu }+\mathbf{\varepsilon }\) with a diagonal covariance matrix of \(\mathbf{\varepsilon }\), e.g., Theorem 4.1 in Anderson and Rubin (1956). Additionally, from the same motivation as getting Eq. (12) we can get
$$\begin{aligned}T_\mathrm{e}^{(ijw)} &=\sigma _{x_ix_j} - \sum _k { \sigma _{x_iw_k}\sigma _{x_jw_k} \over \sigma _{w_kw_k} }, \nonumber\\ B_\mathrm{e}^{(iw)}&= \sigma _{x_ix_i} - \sum _k { \sigma ^2_{x_iw_k}\over \sigma _{w_kw_k} }. \end{aligned}$$(16)
Then, constructing block star-causality can be made by learning \(\sigma _{x_iw_k}, \forall i,k\) and \(\sigma _{w_iw_i}>0, \forall i\) (or simply setting each \(\sigma _{w_iw_i}=1\)) again by the constrained optimisation Eq. (13) with \(R^{(iw)} = \sum _i |\sigma _{iw_k}|\) and the subsequent testing.

- (h)
Similar to what is addressed in (d) and (e) above, constructing tree-decomposable causality can be made from star-causality. Moreover, a tree-decomposable structure in Fig. 2a can be turned into not only a problem of star-decomposable causality in Fig. 1a by lumping latent factors \(\{W_k\}\) into one vector factor, but also a problem of triplet star-decomposable causality in blocks, as illustrated in Fig. 2b. Then, we get \(\Sigma _{XW}\) in a block structure, which increases the chance that Eq. (15) can be uniquely solved. Considering the variables of *W* in some structure, we may also extend this line to study a linear causal structure (Zhang et al. 2017; Shimizu et al. 2011) with both latent variables and loading variables in structures. Also, discovering causality may be further made within each subset of variables. In other words, we may discover causality on multiple levels of a hierarchy in a top-down manner, or even trade off the top-down manner against the bottom-up manner addressed in the first two perspectives.
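The block conditions of Eqs. (15) and (16) can be checked numerically for a candidate solution. The following is a minimal sketch, assuming the matrices are available as NumPy arrays; the function names, the tolerance `tol`, and the example loading matrix `A` are hypothetical and chosen only for illustration.

```python
import numpy as np

def block_star_residual(S_xx, S_xw, S_ww):
    """Left-hand side of Eq. (15): S_xx - S_xw S_ww^{-1} S_xw^T."""
    return S_xx - S_xw @ np.linalg.solve(S_ww, S_xw.T)

def is_block_star(S_xx, S_xw, S_ww, tol=1e-8):
    """Check that the residual is diagonal with positive entries (T_e = 0, B_e > 0)."""
    R = block_star_residual(S_xx, S_xw, S_ww)
    off_diag = R - np.diag(np.diag(R))  # T_e^{(ijw)} for i != j
    return bool(np.all(np.abs(off_diag) < tol) and np.all(np.diag(R) > 0))

# Hypothetical example: 4 observables, 2 latent factors with diagonal S_ww
A = np.array([[0.8, 0.0], [0.7, 0.2], [0.1, 0.9], [0.0, 0.6]])  # loading matrix
S_ww = np.diag([1.0, 2.0])
D = np.diag([0.3, 0.4, 0.2, 0.5])
S_xw = A @ S_ww             # Cov[X, W] for X = A W + eps
S_xx = A @ S_ww @ A.T + D   # Cov[X]

print(is_block_star(S_xx, S_xw, S_ww))
```

With a diagonal `S_ww`, the off-diagonal and diagonal entries of the residual coincide with \(T_\mathrm{e}^{(ijw)}\) and \(B_\mathrm{e}^{(iw)}\) in Eq. (16). The sketch only verifies a given solution; it does not address whether that solution is unique.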

## Concluding remarks

Minimising the TETRAD differences \(T_\mathrm{e}^{(ijkl)}\) subject to the constraints by Eq. (5) may open a new road for learning a causal model from samples to approximate tree-decomposable causality. Instead of making tests based on the set of correlation coefficients between observable variables, as is typical in the existing procedure, we first perform the optimisation by Eq. (13) to estimate another set of correlation coefficients between observable variables and latent variables, and then make tests based on both sets. We may further proceed along this road to make block star-causality analysis on factor analysis and block tree-decomposable analysis on linear causal models.
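The estimate-then-test idea can be sketched for a single latent factor. The code below is not the constrained optimisation of Eq. (13) itself: as a simplifying assumption it sets \(\sigma _{ww}=1\) and \(\rho =0\) and replaces the Lagrange formulation with a coordinate-wise least-squares fit of \(\sigma _{ij}\approx \sigma _{iw}\sigma _{jw}\) for \(i\ne j\). The function names, the starting point, and the number of sweeps are hypothetical.

```python
import numpy as np

def fit_star_loadings(S, sweeps=500):
    """Estimate s_i = sigma_{iw} (with sigma_ww = 1) by coordinate-wise least squares."""
    n = S.shape[0]
    s = np.sqrt(0.5 * np.diag(S))  # positive starting point
    for _ in range(sweeps):
        for i in range(n):
            others = np.arange(n) != i
            # Exact minimiser of sum_{j != i} (S_ij - s_i s_j)^2 over s_i
            s[i] = S[i, others] @ s[others] / (s[others] @ s[others])
    return s

def star_test_statistics(S, s):
    """Return off-diagonal T_e^{(ijw)} and diagonal B_e^{(iw)} of Eq. (12), sigma_ww = 1."""
    R = S - np.outer(s, s)
    T = R - np.diag(np.diag(R))
    B = np.diag(R).copy()
    return T, B

# Hypothetical covariance generated exactly by one factor with loadings a
a = np.array([0.9, 0.8, 0.7, 0.6])
S = np.outer(a, a)
np.fill_diagonal(S, 1.0)

s_hat = fit_star_loadings(S)
T, B = star_test_statistics(S, s_hat)
print(s_hat, np.max(np.abs(T)), B)
```

On sample covariances the off-diagonal residuals `T` would not vanish exactly, and their size would then serve as the quantity to test, together with the requirement that every entry of `B` stays positive.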

## Declarations

### Acknowledgements

This work was supported by the Zhi-Yuan chair professorship start-up grant (WF220103010) from Shanghai Jiao Tong University.

### Competing interests

The author declares that they have no competing interests.

### Ethics approval and consent to participate

Not applicable.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

- Anderson TW, Rubin H (1956) Statistical inference in factor analysis. Proc Third Berkeley Symp Math Stat Probab 5:111–150
- Bartholomew DJ (1995) Spearman and the origin and development of factor analysis. Br J Math Stat Psychol 48(2):211–220
- Bollen KA, Ting K-F (2000) A tetrad test for causal indicators. Psychol Methods 5(1):3
- Fukumizu K, Gretton A, Sun X, Schölkopf B (2008) Kernel measures of conditional dependence. In: Advances in neural information processing systems. pp 489–496
- Gigi NC (1977) Multivariate statistical inference. Academic Press, New York
- Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
- Reichenbach H (1956) The direction of time. University of California Press, Berkeley
- Rubin DB, Rubin JL (2011) Causal model. In: International encyclopedia of statistical science. pp 1263–1265. Springer, Berlin
- Shimizu S, Inazumi T, Sogawa Y, Hyvärinen A, Kawahara Y, Washio T, Hoyer PO, Bollen K (2011) DirectLiNGAM: a direct method for learning a linear non-Gaussian structural equation model. J Mach Learn Res 12:1225–1248
- Spearman C (1904) “General intelligence,” objectively determined and measured. Am J Psychol 15(2):201–292
- Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search. MIT Press, Cambridge
- Spirtes PG, Glymour C (1993) Causation, prediction and search. In: Lecture notes in statistics, vol 81. Springer, Berlin
- Xu L (1986) Investigation on signal reconstruction, search technique, and pattern recognition. Ph.D. Dissertation, Tsinghua University
- Xu L, Pearl J (1987) Structuring causal tree models with continuous variables. In: Proceedings of the 3rd annual conference on uncertainty in artificial intelligence. pp 170–179
- Zhang K, Gong M, Ramsey J, Batmanghelich K, Spirtes P, Glymour C (2017) Causal discovery in the presence of measurement error: identifiability conditions. arXiv preprint arXiv:1706.03768