 Research
 Open Access
Deep bidirectional intelligence: AlphaZero, deep IA-search, deep IA-infer, and TPC causal learning
 Lei Xu^{1, 2}
 Received: 16 March 2018
 Accepted: 10 September 2018
 Published: 29 September 2018
Abstract
This paper starts with a brief review of AlphaGoZero, Q-learning, and Monte-Carlo tree search (MCTS), in comparison with studies of decades ago on A* search and on CNneimA, an algorithm proposed in 1986 that shares a scouting technique similar to one used in MCTS. Then, we combine the strengths of AlphaGoZero and CNneimA, resulting in a family named deep IA-search that consists of Deep Scout A*, Deep CNneimA, Deep BiScout A*, and VAlphaGoZero, as well as their extensions. Moreover, the relation between search and reasoning motivates extending deep IA-search to Deep IA-infer for implementing reasoning. Especially, another early study (Xu and Pearl, Structuring causal tree models with continuous variables. In: Proceedings of the 3rd annual conference on uncertainty in artificial intelligence, pp 170–179, 1987) on structuring causal trees is developed into a three-phase causal learning approach, namely topology identification, parameter re-estimation, and causal \(\rho\)-tree search on a causal \(\rho\)-diagram that is defined by a set of pairwise correlation coefficients \(\rho\). Algorithms are sketched for discovering causal topologies of triplets, stars, and trees, as well as some topologies of a causal \(\rho\)-Directed Acyclic Graph (DAG), e.g. the ones for the Yule–Simpson paradox, Pearl's Sprinkler DAG, and the back-door DAG. Furthermore, the classic Boolean SAT problem is extended into a \(\rho\)-SAT problem, and the roles of four fundamental mechanisms in an intelligent system are elaborated, with insights on integrating these mechanisms to encode not only variables but also how they are organised, as well as on why deep networks are preferred while extra depth is unnecessary.
Keywords
 Deep learning
 Deep scouting
 Bayesian valuation
 MCTS
 Path consistency
 Star structure
 Causal tree
 Topology discovery
 Conditional independence
 \(\rho\)-Diagram
 \(\rho\)-SAT
 PRODSUM
 HIERARCHY
 Why deep
Background
The recent renaissance of artificial intelligence is basically marked by four types of major advances. First, deep learning and big data achieved high accuracy in recognising patterns, especially human faces and speech in a huge population. Second, the IBM Watson system demonstrated promising capabilities of natural language processing, hypothesis generation, and deep evidence scoring, with successes in conversation systems, healthcare decision support, contact centres, and financial and government services (Ferrucci et al. 2010, 2013). Third, AlphaGo by DeepMind impacted the world first by its 4–1 victory against the legendary player Mr. Lee Sedol (Silver et al. 2016) and subsequently by its evolution into AlphaGoZero several months ago, which learnt to play the game of Go simply via self-play, starting from completely random play (Silver et al. 2017). Last but not least, there have been astonishing developments of humanoid robots, from ASIMO by HONDA in 2000 to the recent Atlas by Boston Dynamics and Sophia by Hanson.
Mathematically, the Yang domain X accommodates a set \(x_0, x_1, \ldots , x_t\) of input data with each \(x_t\) in a task-dependent data type, and the Ying domain \(R=\{Y,\Theta \}\) accommodates the inner representations of the external world, consisting of the long-term memory \(\Theta\) that accommodates model parameters, and the short-term memory Y that accommodates the inner representations \(y_0, y_1, \ldots , y_t\) of \(x_0, x_1, \ldots , x_t\). Perceiving, recognising, and cognising via abstraction, the Yang passage \(X\rightarrow R\) involves various tasks related to words with an initial character "A", and is thus also named A-type mapping. Then, a thinking process is conducted in the Ying domain and outputs a selected representation \(\{Y,\Theta \}\) to reversely drive the Ying passage \(R\rightarrow X\) that implements various tasks, as illustrated in Fig. 1b, roughly classified into four categories: (1) identity mapping that calibrates whether the input can be well reconstructed; (2) interacting with the outside (informing, illustrating, communicating, etc.); (3) implementing, motoring, instructing, intending; and (4) imagining and creating.
Specifically, it is noted that short-term memory and long-term memory are far from being simply represented by the two sets \(\{Y,\Theta \}\) in the Ying domain. Elements of both Y and \(\Theta\) are actually represented in different data types and accommodated in certain procedural and hierarchical structures. In this paper, for example, short-term memory involves an inner state process \(s_0, s_1, \ldots , s_t\), not only in labels that indicate a trajectory of concept flow towards goals, but also associated with a flow of attributes \(\{y_0, y_1, \ldots , y_t\}\) that describe the concepts and instigate not only an action flow \(a_0, a_1, \ldots , a_t\) to control state transitions but also drive the other categories of outcomes. Moreover, \(s_0, s_1, \ldots , s_t\) is closely coupled with a value process \(v_{0}, v_{1}, \ldots ,v_{t}\) that evaluates the prospect of each state towards goals.
Formulated with the help of probability theory, the Ying machine and Yang machine describe the joint distribution of R, X by two kinds of decomposition, \(q=q(R)q(X|R)\) and \(p = p(X)p(R|X)\), respectively. The best harmony learning theory argues that the inner activities R, as well as the corresponding \(X\rightarrow R\) and \(R\rightarrow X\), including updating (or learning) parameters, are managed by a principle called Ying–Yang harmony maximisation^{1} that maximises \(H(p\Vert q) +H(q\Vert p)\), which is in short also referred to as BYY harmony or IA harmony. From \(H(p\Vert q) +H(q\Vert p) = -KL(p\Vert q) - KL(q\Vert p) - E(p) - E(q)\), we see an interpretation that Ying and Yang seek a best agreement via minimising \(KL(p\Vert q)+KL(q\Vert p)\) in a vitality or parsimony system via minimising \(E(p)+E(q)\).
This paper addresses further possible developments of such brain-like bidirectional systems, whose key nature is that deep implementing (or I-type mapping) and deep abstracting (or A-type mapping) work in harmony, shortly deep IA harmony or deep Ying–Yang harmony. Examining AlphaGoZero together with revisiting early studies on A* search, it is interestingly found that MCTS (one key ingredient of AlphaGoZero) actually shares a scouting technique with CNneimA, which was proposed in 1986. Integrating the strengths of AlphaGoZero and CNneimA, a new method named deep IA-search is proposed, including Deep Scout A* (DSA), Deep CNneimA (DCA), Deep BiScout A* (DBA), and VAlphaGoZero, as well as their extensions DSAE, DCAE, DBAE, and AlphaGoZeroE.
Considering the relation between search and reasoning, we are further motivated to implement reasoning with the help of deep IA-search, referred to as Deep IA-infer. Particularly, causal reasoning is addressed. Another early study (Xu 1986a; Xu and Pearl 1987) on structuring causal trees is developed into a three-phase causal learning approach, namely topology identification, parameter re-estimation, and causal \(\rho\)-tree search on a causal \(\rho\)-diagram that is defined by a set of pairwise correlation coefficients \(\rho\). Also, implementing procedures are further sketched for this TPC causal learning of triplets, stars, and trees, as well as some topologies of a causal \(\rho\)-Directed Acyclic Graph (DAG), e.g. the ones for the Yule–Simpson paradox, Pearl's Sprinkler DAG, and the back-door DAG.
Moreover, the classic Boolean SAT problem is extended into a \(\rho\)-SAT problem, and the roles of four fundamental mechanisms in an intelligent system are elaborated, with insights on integrating these mechanisms to encode not only variables but also how they are organised, from which we can interpret why deep networks are preferred while extra depth is unnecessary, and also adopt causal trees or hierarchical SEM equations as an alternative to deep learning.
Tree search and algorithm A*
A snapshot of tree search is illustrated in Fig. 2b, featured by a red-coloured boundary drawn as a dashed closure that divides the tree into two parts. One is the inner part of the tree, consisting of nodes inside the closure, with each node already expanded and put into a list called CLOSED. The other is the peripheral part of the tree, consisting of nodes outside the closure, with each node put into a list called OPEN, in which all the nodes are expandable but not expanded yet. Made step by step, each step of tree search expands one node in OPEN and is featured by two jobs. One is usually named Expansion, an action of putting the node \(S_E\) that is selected to be expanded next into CLOSED and all its child nodes into OPEN. The crucial point is that each node in either OPEN or CLOSED is associated with one or more attributes to indicate the value of this node. That is, each node should be evaluated before being put into OPEN, for which an appropriate measure is needed and should be valued as accurately as possible. The other job is named Selection, which chooses one node from OPEN as \(S_E\). The crucial point is that it is based on a strategy that integrates information not only from different attributes of each node but also from different nodes in OPEN.
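The two jobs above can be sketched in a few lines; the toy tree, cost g, and heuristic h below are hypothetical stand-ins for illustration, with \(f = g + h\) as in Eq. (1):

```python
import heapq

def a_star(root, children, f):
    """Best-first tree search sketch: repeatedly Select the best node in
    OPEN by its f value, then Expand it into CLOSED with its children
    put into OPEN."""
    OPEN = [(f(root), root)]           # priority queue keyed by f
    CLOSED = set()
    while OPEN:
        f_s, s = heapq.heappop(OPEN)   # Selection: best f value in OPEN
        if s in CLOSED:
            continue
        CLOSED.add(s)                  # Expansion: move s into CLOSED ...
        for c in children(s):          # ... and its children into OPEN
            if c not in CLOSED:
                heapq.heappush(OPEN, (f(c), c))
        yield s, f_s                   # order in which nodes get expanded

# Hypothetical toy tree with made-up costs g and heuristic values h
tree = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': []}
g = {'A': 0, 'B': 1, 'C': 1, 'D': 2}
h = {'A': 2, 'B': 1, 'C': 3, 'D': 0}
order = [s for s, _ in a_star('A', lambda s: tree[s], lambda s: g[s] + h[s])]
```

With these values, the low-f branch A, B, D is expanded before C, even though C enters OPEN earlier.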
The simplest case is encountered in WFS and DFS, where the value measure is simply an integer \(\ell\), an attribute indicating the depth at which a node locates, and the selection strategy picks the biggest \(\ell\) for DFS and the smallest \(\ell\) for WFS, respectively. However, this attribute does not directly reflect how good the corresponding node is. Usually, another number f is added as an attribute. The corresponding selection strategy chooses from OPEN the node whose associated f value is the best among all the other nodes in OPEN, e.g. as made in the best-first-search (BFS) methods (Xu et al. 1988).
There are also examples that combine the uses of the attributes \(\ell\) and f. DFS seeks nodes with the biggest \(\ell\) and quickly goes through a path from the root to a leaf, where the actual cost \(g^*\) becomes known and is then used as a bound to prune off those branches with current \(g(s)>g^*\). Typical examples are alpha–beta pruning for minimax search (Hart and Edwards 1961) and the branch-and-bound technique in many algorithms (Land and Doig 1960). Another example is a combination of DFS and BFS. When DFS tries to pick up nodes with the biggest \(\ell\), it may encounter more than one node with the biggest \(\ell\) and simply picks one randomly. Instead, we may use BFS to select the best node according to the f value among a subset \(\text{OPEN}_{s^*}\subset \text{OPEN}\), where nodes in \(\text{OPEN}_{s^*}\) are associated with the same biggest \(\ell\) and share the same father \(s^*\), resulting in a search path as illustrated in Fig. 2c. In the sequel, we will see that such a combination of DFS–BFS (shortly DBFS) leads to different search strategies by different combining ways and specific formulae for f, covering the ones in Q-learning and in MCTS.
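The DBFS combination can be sketched as below. The snapshot data are invented for illustration, and the sketch assumes the deepest nodes in OPEN share one father, as in the description above:

```python
def dbfs_select(OPEN, depth, father, f):
    """DFS-BFS selection sketch: restrict OPEN to the nodes at the
    biggest depth sharing one father, then pick the best f among them
    ('best' here means smallest, as in cost minimisation)."""
    d_max = max(depth[s] for s in OPEN)
    deepest = [s for s in OPEN if depth[s] == d_max]
    s_star = father[deepest[0]]            # assumed shared father
    subset = [s for s in deepest if father[s] == s_star]
    return min(subset, key=f)

# Hypothetical OPEN snapshot: three deepest siblings under father 'X'
OPEN = ['p', 'q', 'r', 'shallow']
depth = {'p': 3, 'q': 3, 'r': 3, 'shallow': 1}
father = {'p': 'X', 'q': 'X', 'r': 'X', 'shallow': 'root'}
f = {'p': 7, 'q': 4, 'r': 9, 'shallow': 2}.__getitem__
chosen = dbfs_select(OPEN, depth, father, f)
```

Depth dominates the choice: 'shallow' has the best f overall but is ignored, and 'q' wins among the deepest siblings.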
Qlearning and reinforcement learning
One should not confuse the notation Q in Eq. (2), where it is used as a cost or regret associated with a strategy of cost minimisation, with the notation Q in Eq. (3), where it is used as a reward associated with a strategy of reward maximisation. Actually, considering either of the two directions of optimising, i.e. {cost, min} and {reward, max}, makes no fundamental difference, and the two are interchangeable. In the sequel, we use Q, and also f in Eq. (1), without a particular clarification unless it may cause confusion. Also, we use \(v_h(s)\) to indicate that either v(s) is considered in reward maximisation or the subscript h(s) is considered in cost minimisation.
There are also efforts that apply the traditional Monte-Carlo technique to estimate Q(s, a) (Sutton and Barto 1998), still suffering the inferiority of node-to-node search. This is completely different from, and should not be confused with, the Monte-Carlo tree search below.
Monte-Carlo tree search versus CNneimA search
Monte-Carlo tree search (MCTS) has received remarkable interest due to its spectacular success in computer games (Kocsis and Szepesvári 2006; Browne et al. 2012), especially the unbelievable result that AlphaGo impacted the world first by its 4–1 victory against Mr. Lee Sedol (Silver et al. 2016).
Illustrated in Fig. 3a is a snapshot after circling the four steps for a certain number of times, and illustrated in Fig. 3b is a snapshot after running A* to expand a certain number of nodes. Differently from A*, which chooses the best node from OPEN according to its associated f value, MCTS starts a new circle at the step of selection, as illustrated in Fig. 3a, for picking one node \(s_E\) from OPEN to be expanded next. Search is implemented from the root of the tree T in a way similar to DBFS but with the f value replaced by Q(s, a), yielding a path that hits one node \(s_E\) in OPEN. This node is expanded in the step of expansion, with all its child nodes put into OPEN, and each child node is valued in the step of simulation, which runs a default policy to make a fast search (in the simplest case, uniform random moves) until reaching a terminal node that receives a reward value \(\Delta\) directly from the environment. Then, this \(\Delta\) is delivered up along the path in the step of back propagation, not only to \(s_E\) for updating its v value but also further back to the tree root, with the Q value on each of its edges updated by Eq. (4), which affects DBFS in the selection step of the next circling.
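The four steps of one circle may be sketched as follows. This is an illustrative miniature, not the exact formulae of Eqs. (4)–(5): the UCB-style selection rule, the toy two-move game, and the deterministic reward are all assumptions:

```python
import math, random

def mcts(root, children, rollout_reward, n_circles=200, c=1.4):
    """Minimal MCTS sketch: each circle runs selection (UCB descent),
    expansion, simulation (rollout reward), and back-propagation."""
    N = {root: 0}            # visit counts
    W = {root: 0.0}          # accumulated rewards
    parent = {root: None}    # expanded nodes -> parent (the search tree)
    for _ in range(n_circles):
        s = root
        while True:          # Selection: descend until an unexpanded child
            kids_all = children(s)
            in_tree = [k for k in kids_all if k in parent]
            fresh = [k for k in kids_all if k not in parent]
            if fresh or not kids_all:
                break
            s = max(in_tree, key=lambda k: W[k] / N[k]
                    + c * math.sqrt(math.log(N[s]) / N[k]))
        if fresh:            # Expansion: put one new child into the tree
            k = random.choice(fresh)
            parent[k] = s; N[k] = 0; W[k] = 0.0
            s = k
        delta = rollout_reward(s)   # Simulation: default-policy reward
        while s is not None:        # Back-propagation: deliver delta up
            N[s] += 1; W[s] += delta
            s = parent[s]
    return {k: N[k] for k in children(root) if k in parent}

# Hypothetical one-level game: moving 'L' always pays 1, 'R' pays 0
kids = {'root': ['L', 'R'], 'L': [], 'R': []}
random.seed(0)
freq = mcts('root', lambda s: kids[s], lambda s: 1.0 if s == 'L' else 0.0)
```

The returned visit counts on the root edges play the role of the frequencies from which \(\pi_i\) is computed: the rewarding move accumulates far more visits.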
After a prefixed number of circlings, the frequency \(\pi _i\) of passing each edge from the root is calculated and used as a probabilistic policy by Eq. (5) to guide a move from \(\mathbf{s}\) to one of its children, as illustrated in Fig. 2c. One crucial difference between A* and MCTS is that the search by A* is made as illustrated in Fig. 3b without calculation of \(\pi _i\), while MCTS moves as illustrated in Fig. 2c based on \(\pi _i\) that comes from the scouting made in Fig. 3a.
Actually, scouting is a strategy that is widely encountered in real life. For example, as illustrated in Fig. 2e, when corps meet a junction, scouts are sent out to collect information before making a choice. The strength of MCTS comes from such a scout for computing each \(\pi _i\). There is a similar scout made in Algorithm CNneimA, which was proposed more than 30 years ago (Xu 1986a; Xu et al. 1987). As illustrated in Fig. 3c, each subtree of the root \(\mathbf{s}\) is scouted with a prefixed number \(n_i\) of nodes expanded by A*, obtaining the average \(\mu _i\) of the f values associated with those expanded nodes in place of the f value associated with each child i of the root \(\mathbf{s}\). Then, a move is made from \(\mathbf{s}\) to its child according to the best of \(\{ \mu _i\}\), as illustrated in Fig. 2c.
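A minimal sketch of such scouting, under an invented toy tree and assumed f values, may read:

```python
import heapq

def scout_average(child, children, f, n_scout):
    """Scout the subtree under `child` by A*, expanding n_scout nodes,
    and return the average of the f values over the expanded nodes."""
    OPEN = [(f(child), child)]
    expanded_f = []
    while OPEN and len(expanded_f) < n_scout:
        fs, s = heapq.heappop(OPEN)
        expanded_f.append(fs)
        for c in children(s):
            heapq.heappush(OPEN, (f(c), c))
    return sum(expanded_f) / len(expanded_f)

def cnneima_move(root, children, f, n_scout=3):
    """Move from the root to the child with the best (smallest, for cost
    minimisation) scouted average mu_i."""
    mus = {c: scout_average(c, children, f, n_scout) for c in children(root)}
    return min(mus, key=mus.get), mus

# Hypothetical f values: child 'B' looks better immediately, but deeper
# scouting reveals that subtree 'A' is the cheaper one on average
kids = {'root': ['A', 'B'], 'A': ['A1', 'A2'], 'B': ['B1', 'B2'],
        'A1': [], 'A2': [], 'B1': [], 'B2': []}
fval = {'A': 5, 'A1': 5, 'A2': 9, 'B': 4, 'B1': 8, 'B2': 9}
move, mus = cnneima_move('root', lambda s: kids[s], fval.__getitem__)
```

Here \(\mu\) averaged over scouted nodes overrides the misleading immediate f value of 'B', which is exactly the intended benefit of scouting.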
Apparently, selection by \(\{ \mu _i\}\) is different from that by \(\{ \pi _i\}\), but not really. Consider a simple case in which there is only one goal G within subtree i from the root s; we randomly assign each node located on the path \(s\rightarrow G\) the value 1 with probability \(\pi _i\) and 0 with probability \(1-\pi _i\), while all the other nodes are assigned 1 with a much smaller probability \(\mathbf{e}\) and 0 with probability \(1-\mathbf{e}\). Also, we turn min-selection into max-selection and consider simply DFS by Eq. (3) with ties broken uniformly. We may roughly observe that \(\mu _i\) tends to \(\pi _i\) as long as \(n_i\) is large, and thus the moving policy \(\mathbf{\pi }\) may be regarded as merely an extreme case. This may explain why AlphaGoZero needs a temperature parameter \(\tau\) to adjust (Silver et al. 2017). Containing more information, \(\mu _i\) already takes such an adjustment into consideration.
Some issues about computing complexity
The idea of scouting subtrees first appeared in an algorithm named SA* (Zhang and Zhang 1985). Though SA* and CNneimA share the common point of scouting subtrees to collect more information to aid expanding, CNneimA differs from SA* critically. CNneimA simply considers the sample mean of the f values on a possible optimal path scouted by A* in the subtree, justified by the fact that f values on the optimal path should be identical and thus can be regarded as coming from a same distribution.
In contrast, SA* focuses on Pearl's simplified search space (Pearl 1984; Zhang and Zhang 1985). It is a uniform m-ary tree with one root node and a unique goal node at depth N, where each edge simply has a unit cost 1, and thus we have \(g(s)=d\) for a node s at depth d. For a node s in the subtree rooted at the ith node on the optimal path away from the tree's root, we observe \(h^*(s)=N-d\) if s is on the optimal path and otherwise \(h^*(s)=N+d-2i\). What SA* considers is the statistic \(a(s)= 0.5[d-N+h(s)]/d\), based on which an SPRT test is made to examine whether the population of such a(s) values in a subtree under inspection differs significantly from the population of a(s) values in a subtree that contains the optimal path (Zhang and Zhang 1985).
What can be obtained from such a scouting mechanism? There was a serious confusion in the early theoretical study (Zhang and Zhang 1985). It was surprisingly claimed in Zhang and Zhang (1985) that a mean complexity of order \(O(d\ln {d})\) was achieved by SA* search in the same space as Pearl's simplified search space, contradicting the well-known fact that the mean complexity of A* search is of an order growing exponentially with d (Pearl 1984). However, it follows from the investigations in Xu (1986a, b, 1987) that this surprising claim unfortunately turned out to be a mistake, due to an incorrect and even contradictory theoretical formulation made in Zhang and Zhang (1985). In Pearl's simplified search space, the fact is that the a(s) values actually do not come from a same population, even in a subtree that contains the optimal path.
Nevertheless, the idea of scouting subtrees was newly proposed at that time, being different from those DFS-aided look-ahead techniques, e.g. alpha–beta pruning (Hart and Edwards 1961) and branch-and-bound (Land and Doig 1960). Sharing the idea of scouting subtrees but implementing a selection policy instead of making an SPRT test, it was shown by mathematical analysis (Xu 1986a; Xu et al. 1987) that CNneimA gets a mean complexity of order \(O(d^2)\) in general, and even of order \(O(d\ln {d})\) under some constraint in a particularly assumed search space.
Though the search space considered in the above studies is very different from Pearl's simplified search space, the above results do cast insights on those classic results about the mean complexity of A* search. The scouting-averaging technique used in CNneimA can improve A* for two reasons (Xu 1986a; Xu et al. 1987). First, the heuristic h becomes easier to estimate for nodes located deeper. Second, the variance of the average of a number of random variables is smaller than the variance of each individual random variable, that is, averaging reduces inaccuracy.
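The second reason can be checked numerically: the variance of the average of n independent estimates is \(\sigma^2/n\). The per-sample variance and scout size below are arbitrary illustrative choices:

```python
import random, statistics

random.seed(1)
sigma2 = 4.0   # assumed per-sample variance of a noisy f-value estimate
n = 25         # assumed number of scouted nodes averaged per subtree

# Draw many averages, each over n independent noisy estimates
averages = [statistics.fmean(random.gauss(0.0, sigma2 ** 0.5) for _ in range(n))
            for _ in range(20000)]
var_of_average = statistics.pvariance(averages)
# var_of_average is close to sigma2 / n = 0.16, far below the per-sample 4.0
```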
Deep reinforcement learning and AlphaGo Zero
No doubt, deep learning (LeCun et al. 2015) and Monte-Carlo tree search (MCTS) (Kocsis and Szepesvári 2006; Browne et al. 2012) are two of the most popular achievements in the AI field over the past decades. Recently, the integration of deep learning, reinforcement learning, and MCTS has yielded not only what is called deep reinforcement learning (Mnih et al. 2015; Clark and Storkey 2015), but also AlphaGo (Silver et al. 2016). The key point is using a deep network to model a mapping from the state configuration s to a value function \(v_{\theta }(s)\) and a probabilistic policy \(p_{\rho }(a|s)\), such that value and policy estimation are turned into deep learning tasks (Sutton et al. 2000), which enhances both the valuing measure and the selection strategy in MCTS, as in AlphaGo (Silver et al. 2016) and AlphaGoZero (Silver et al. 2017). We focus on AlphaGoZero since it is the latest and outperforms the former significantly.
Running over a prefixed number of circlings, \(\pi\) is estimated by \(\pi \propto N(\mathbf{s}, a)^{1/\tau }\), representing the frequency of passing each edge from the root \(\mathbf{s}\), where \(\tau >0\) is a controlling temperature. Moreover, \(\theta\) is updated by stochastic gradient learning to minimise a given loss function L, once the search guided by \(\mathbf{\pi }\) and illustrated in Fig. 2c eventually reaches a game-over state with an indicator z received.
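The estimate \(\pi \propto N(\mathbf{s}, a)^{1/\tau }\) can be sketched directly; the visit counts below are hypothetical:

```python
def visit_count_policy(counts, tau):
    """pi_i proportional to N(s, a_i)^(1/tau): tau = 1 gives the plain
    visit frequencies, tau -> 0 sharpens toward the greedy argmax."""
    weights = [n ** (1.0 / tau) for n in counts]
    total = sum(weights)
    return [w / total for w in weights]

counts = [60, 30, 10]                             # hypothetical root-edge visits
pi_freq = visit_count_policy(counts, tau=1.0)     # plain frequencies
pi_cold = visit_count_policy(counts, tau=0.25)    # sharpened toward argmax
```

Lowering \(\tau\) concentrates the policy on the most visited edge, which is precisely the adjustment role of the temperature discussed above.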
Method
Deep IA-search family
With the help of a deep learning (DL) network for estimating \([v_h,g](s)=f_{\theta }(s)\), we can extend A* search and CNneimA into three DL-based tree search techniques summarised in Table 1(A), as variants or counterparts of AlphaGoZero.
Table 1 Deep IA-search family

| (A) | Deep Scout A* (DSA) | Deep CNneimA (DCA) | Deep BiScout A* (DBA) | VAlphaGoZero |
| Deep learning | \([v_h, g](s) = f_{\theta}(s)\) (shared by DSA, DCA, and DBA) | | | \([v_h, \mathbf{p}](s) = f_{\theta}(s)\) |
| Selection step | Get expanding node \(S_E\) by A* | In each child tree, get expanding node by A* | Get expanding node by A*, with subtree scout for \(\mu\) by A* | Get expanding node by DBFS |
| Valuating | Eq. (1) by \(f = g + h\) | Eq. (1) by \(f = g + h\) | Eq. (1) with \(\mu\) replacing f | Q and p by Eq. (6) |
| Moving policy | Frequency \(\pi_i\) | Mean \(\mu_i\) | Frequency \(\pi_i\) | Frequency \(\pi_i\) |

| (B) | DSAE | DCAE | DBAE | AlphaGoZeroE |
| Deep learning | \([v_h, g, \mathbf{p}]_{\theta}(s) = f_{\theta}(s)\) (shared by all four) |
| Selection step | Get expanding node \(S_E\) by DBFS\(_n\)A* selection (shared by all four) |
| Bayesian valuation | Make action either stochastically by value \(\mathbf{q}\) or by \(\max_a q_a\), upon the posteriori \(\mathbf{q} = [q_a]\), \(q_a = p_a e_a / \mathbf{p}^T \mathbf{E}\), \(\mathbf{E} = [e_a]\) (shared by all four) |
|  | Type-Q: \(e_a = \rho(Q(s, s_a))\) or \(e_a = \rho(r + v_h(s_a))\), where \(s_a = a(s)\); Type-F: \(e_a = \rho(f(s_a))\) | Type-F: \(e_a = \rho(\mu_a)\), \(\mu = [\mu_a]\) | Type-F: \(e_a = \rho(f(s_a))\) | Type-Q: \(e_a = Q(s, a)\) |
|  | If \(q_a\) is larger than a prespecified threshold, put \(s_a\) into OPEN, otherwise into WAIT. When OPEN becomes empty, move some nodes from WAIT to OPEN. Note: \(\rho(r)\) is monotonically increasing for reward maximisation or decreasing for cost minimisation (shared by all four) |
| OPEN revision | Revise f values in OPEN by backward-forward propagation after each expanding (shared by all four) |
| Others | Same as DSA | Same as DCA | Same as DBA | Same as AlphaGoZero |
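The Bayesian valuation row amounts to the posterior \(q_a = p_a e_a/\mathbf{p}^T\mathbf{E}\). A minimal sketch, with invented prior and evidence values standing in for \(\mathbf{p}\) and \(\mathbf{E}\):

```python
def bayesian_valuation(p, e):
    """Posterior q_a = p_a * e_a / sum_b p_b * e_b, combining the prior
    policy p with evidence e derived from the Q, f, or v_h estimates."""
    z = sum(pa * ea for pa, ea in zip(p, e))
    return [pa * ea / z for pa, ea in zip(p, e)]

# Hypothetical prior over three actions and Type-Q style evidence values
p = [0.5, 0.3, 0.2]
e = [0.2, 0.9, 0.9]
q = bayesian_valuation(p, e)

# With a uniform prior the posterior just renormalises the evidence,
# recovering the Table 1(A) behaviour of acting on the estimates alone
q_uniform = bayesian_valuation([1/3, 1/3, 1/3], e)
```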
The second is named Deep CNneimA, or shortly DCA. As illustrated in Fig. 3c, each child tree of the root \(\mathbf{s}\) is scouted by A*, expanding a prespecified number of nodes and obtaining the average \(\mu _i\) of the f values on the longest path from s to a leaf. Then, a move is made from \(\mathbf{s}\) to its child according to the best of \(\{ \mu _i\}\), instead of the best of \(\{ \pi _i\}\) used by AlphaGoZero, as illustrated in Fig. 2c.
The third is named Deep BiScout A*, or shortly DBA, which combines DSA and DCA and is featured by two levels of scouting. The first level is similar to DSA, making a move based on \(\pi _i\) in Fig. 2c. As illustrated in Fig. 3d, the second level expands the node \(S_E\) based on the average of the f values on the longest path from \(S_E\) to a leaf after a subtree is scouted by A* with a prefixed number of nodes expanded; that is, this average is used in place of the original f value associated with the node \(S_E\) to guide the A* search on the first level.
Moreover, AlphaGoZero also gets a priori \(p=p(s,a)=p(a|s)\) by the deep network and uses it in Eq. (6) to make an action, while there is no such a priori considered in these A*-based techniques.
(1) DL priori policy: the deep learning network \([v_h,g](s)=f_{\theta }(s)\) is extended into \([v_h,g,\mathbf{p}](s)=f_{\theta }(s)\) to add an output for a priori policy \(p=p(s,a)=p(a|s)\).
(2) Bayesian valuation: combining such a priori policy and turning the estimates \(Q, f, v_h\) into a sort of likelihood, a posteriori \(\mathbf{q}\) is obtained by the Bayesian formula, and action is then made based on this posteriori. It degenerates back to Table 1(A) when the elements of \(\mathbf{p}\) are the same.
(3) OPEN beaming: only a part of the children of s are put into OPEN based on \(\mathbf{q}\), with the rest put into WAIT, a preparatory list that stores these nodes, some of which will be moved back to OPEN once OPEN becomes empty. It degenerates back to Table 1(A) when all the children of s are put into OPEN.
(4) DBFS\(_n\)A* selection: a spectrum of selection strategies can be obtained by combining the one from A* and the one from MCTS by varying \(n=0\) to \(n=d\), where d is the length of the path from the root to the node \(S_E\) hit by DBFS. As illustrated in Fig. 3f, one end is DBFS\(_0\)A* that hits \(S_E\) by MCTS without using A*. Generally, DBFS\(_n\)A* conducts DBFS search and then returns back for n nodes along the path from \(S_E\) back to the root. For example, DBFS\(_1\)A* returns back to the father node, while A* (or precisely BFS) selects the best among all its children (i.e. the ones indicated by double circles) as \(S_E\); DBFS\(_2\)A* returns back for two nodes, while BFS selects the best among all its children and grandchildren (i.e. adding in the two double-circled nodes) as \(S_E\), and so forth. The other end is DBFS\(_d\)A* that returns back to the root and becomes equivalent to A*, which picks the best node among OPEN.
(5) OPEN revision: A* search will not revise the f values in OPEN and thus expanding does not affect future expanding, while AlphaGoZero or MCTS uses back propagation to revise the Q values along the path from \(S_E\) to the root of the tree, which affects the search of the next circling to hit a node to expand. We may integrate this idea to revise the f values in OPEN per expanding, such that the next expanding will be affected. As illustrated in Fig. 3g, after expanding \(S_E\) and getting the best value \(f^*\) of its children's f values, we back-propagate \(f^*\) to revise the f values of nodes one by one along its path \(path_b\) back to the root of the tree, e.g. the f value of \(s_0\) is updated by a weighted average of its old value and \(f^*\), and then the f value of \(s_1\) is updated by a weighted average of its old value and the new f value of \(s_0\). Precisely, we make a revision by
$$\begin{aligned} f^{\text{new}}(s_f) &= (1-\eta )\, f^{\text{old}}(s_f) + \eta\, f^{\text{new}}(s_s), \quad \mathrm{on~a~backward~path}, \\ f^{\text{new}}(s_s) &= (1-\eta )\, f^{\text{old}}(s_s) + \eta\, f^{\text{new}}(s_f), \quad \mathrm{on~a~forward~path}, \end{aligned}$$ (7)
where \(1> \eta > 0\). On a backward path, the notation \((s_f,s_s)\) is moved up after each step, until the f value of the tree root is revised. Similarly, on a forward path, the notation \((s_f,s_s)\) is moved down after each step, from each node on \(path_b\) to reach a node in OPEN.
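The backward revision of Eq. (7) can be sketched as follows (the forward direction is symmetric); the path and the stale f values are hypothetical:

```python
def revise_open(f, path_b, eta=0.3):
    """Back-propagate a revised f value along path_b (from the expanded
    node up to the tree root) by Eq. (7)'s weighted average:
    f_new(s_f) = (1 - eta) * f_old(s_f) + eta * f_new(s_s)."""
    f = dict(f)  # keep the input untouched
    for s_son, s_father in zip(path_b, path_b[1:]):
        f[s_father] = (1 - eta) * f[s_father] + eta * f[s_son]
    return f

# Hypothetical path: the best child value f* = 2 sits at 'sE', whose
# father 's0' and grandfather 'root' still hold stale, larger f values
f_old = {'sE': 2.0, 's0': 5.0, 'root': 6.0}
f_new = revise_open(f_old, ['sE', 's0', 'root'], eta=0.5)
# s0: 0.5*5 + 0.5*2 = 3.5 ; root: 0.5*6 + 0.5*3.5 = 4.75
```

Each ancestor is pulled toward the freshly revised value of its son, so the next expansion sees updated f values in OPEN.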
Each of the search techniques in Table 1(B) degenerates back to its counterpart given in Table 1(A) when all the above modifications are shut off. Whether each of these modifications gets in action is task dependent, and correspondingly we may have various special cases in place of the ones in Table 1(B). They all can be regarded as examples of the deep bidirectional intelligence addressed at the beginning of this paper. Deep implementation of problem solving search is driven by deep abstraction or A-type mapping via a deep learning network in harmony. These exemplars constitute a family that can be shortly named deep IA-search, to indicate not only its nature of deep IA harmony but also its feature of making scouting-aided search.
Deep learning, path consistency, and domain knowledge embedding
If we ignore the reward in Eq. (4) or the regret in Eq. (2), discard \(\vert g(s)-g^*(s) \vert ^{\gamma _e}\), and let \(w_c=0\), \(L (\theta )\) returns back to the same loss function used in Silver et al. (2017). In general, the above \(L (\theta )\) generalises along two directions. First, the reward or regret is considered such that it becomes applicable to problem solving tasks beyond games like Go, e.g. tasks traditionally considered by A*. Second, \(L_s^c(\theta )\) is added with \(w_c>0\) to enhance a nature named path consistency, that is, f values on one optimal path should be identical. The previous OPEN revision by Eq. (7) is actually rooted in this nature. In the sequel, we use this nature to improve learning.
Learning is performed in two phases. The main phase updates \(\theta\) to minimise the overall \(L(\theta )\) after reaching a terminal node, where \(v_h^*(s) =0\) such that \(f^{*}(s) = g^{*}(s)\) becomes the actual reward or regret received from the environment. There is also a complementary phase made before reaching a terminal node. We let \(w_s=0\) to shut off \(L_s(\theta )\), which is unavailable yet because the value \(v_h\) of each node on this incomplete PATH is unknown. Still, we have \(f_{\theta }(s)\) given by the DL network, together with \(f^{*}(s)\) given by the average of such \(f_{\theta }(s), \forall s \in W_s\), based on which we may update \(\theta\) to minimise a part of \(L(\theta )\) to ensure path consistency.
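As an illustration only (a stand-in, not the paper's exact \(L_s^c(\theta )\) term), a squared-deviation penalty on the f values along one path captures the path-consistency idea that they should all equal their average \(f^*\):

```python
def path_consistency_loss(f_values):
    """Illustrative path-consistency penalty: f values on one optimal
    path should be identical, so penalise the mean squared deviation
    from their average f* (an assumed surrogate for L_s^c)."""
    f_star = sum(f_values) / len(f_values)
    return sum((f - f_star) ** 2 for f in f_values) / len(f_values)

consistent = path_consistency_loss([4.0, 4.0, 4.0, 4.0])  # identical f values
drifting = path_consistency_loss([3.0, 4.0, 5.0, 6.0])    # inconsistent path
```

Minimising such a term pushes the network outputs \(f_{\theta}(s)\) along a path toward a common value, even before a terminal reward is available.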
Additionally, there are also task-dependent constraints. One classic example is the travelling salesman problem (TSP), which not only finds a minimum path passing a number of cities but also satisfies the constraint of visiting each city once and only once. Another example is considering a portfolio of assets, e.g. stocks, bonds, currencies, and metals. Changing the holding of one or more assets leads to a new state, incurring a profit gain or loss \(r_t\). Within a given period, the task of changing states for a best profit can be formulated as a tree search problem.

- A manifold in high-dimensional space: each state denotes a manifold, a convex set, or a cluster of points in a high-dimensional space. Similar states correspond topologically to neighbouring manifolds or clusters.
- A 2D or higher-dimensional image: each state denotes a collection of images either varying continuously or sharing common critical features. Differently from the above vector representation, image representation is good at encoding dependence structure related to element locations.
- For portfolio management, we turn time series of assets into images with the help of time–frequency analysis, such as the short-time Fourier transform (STFT).
- For the TSP, a typical representation of a state is a contour or trace from the starting city to the end city in a 2D image with the locations of the n cities. We may turn the 2D contour into 3D by time–frequency analysis, scale-space representation, wavelet analysis, etc.
- For natural language understanding, we associate words with their corresponding speech signals or image patterns, and then treat the signals and patterns as above to generate image inputs to deep learning networks. Also, we may embed each word or a parsing tree of a sentence into a high-dimensional space to generate vector input to deep learning networks.
Deep IA-infer
A process of tree search, such as in AlphaGo, may be regarded as a process that infers whether the statement "the skill of player A is higher than that of player B" is true, which is proved or disproved at a terminal state of tree search. Though the output is deterministically one of win, loss, or tie, the result has uncertainty since it comes from just one game. Uncertainty may be reduced by playing a number of games. Also, the value v is attached to each state as a belief value against uncertainty. Finally, we can backtrack the path from the tree root to the terminal state, and get to know the reason for winning.
Generally, the search process of solving Boolean satisfiability and other constraint satisfaction problems may be regarded as an uncertainty reasoning process of inferring a statement via checking whether certain conditions are satisfied.
During the last wave of AI studies in the eighties of the last century, tree search not only found many applications in the areas of pattern recognition (Xu et al. 1989), but also took important roles in problem solving tasks, especially in the hot topic called expert systems. Starting from a set of preconditions towards a set of ending conditions that specify a consequence, the reasoning process is actually a tree search, with each elementary unit being a star structure. Each state is associated with a number of attributes that specify an antecedent, and several production rules that match this antecedent act as branches emitted from this state. After reaching the state that represents a targeted consequence, a path backtracked to the tree root will provide a sequence of IF-THEN rules that explain the reasoning process.
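Such star-structured rule reasoning can be sketched as a tiny forward-chaining loop; the rule base below is an invented toy, not from the original studies:

```python
def forward_chain(facts, rules):
    """Forward-chaining sketch: each state is a set of attributes, and
    the production rules whose antecedents match act as branches of a
    star structure; fired rules are recorded so the path can later be
    backtracked as a sequence of IF-THEN steps explaining the result."""
    facts = set(facts)
    fired = []                        # the reasoning trace
    changed = True
    while changed:
        changed = False
        for name, antecedent, consequent in rules:
            if antecedent <= facts and consequent not in facts:
                facts.add(consequent)
                fired.append(name)
                changed = True
    return facts, fired

# Hypothetical toy rule base: (name, IF-attributes, THEN-attribute)
rules = [('R1', {'rain'}, 'wet_grass'),
         ('R2', {'wet_grass', 'cold'}, 'frost_risk')]
facts, trace = forward_chain({'rain', 'cold'}, rules)
```

The trace plays the role of the backtracked path: it lists the IF-THEN rules, in order, that led from the preconditions to the targeted consequence.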
Regrettably, studies on these tasks gradually faded out since the late eighties because the implementation of tree search confronted great challenges in computing complexity. Nowadays, AlphaGoZero and each of the deep IA-search techniques in Table 1(B), as well as learning by Eq. (10), open a new direction for rebooting these early studies with the help of deep learning and recent advances in tree search.
The tree expanding process \(\{s_t, y_t,h_t\}, t=1, \ldots , T\) records a reasoning sequence, where \(y_t\) consists of the attributes associated with state \(s_t\), each transition \(s_{t}\rightarrow s_{t+1}\) denotes an implementation of the production rule whose antecedent best matches \(y_t\) among several production rules and whose postcedent is attached to \(s_{t+1}\) as its attributes, and \(h_{t-1}\rightarrow h_t\) represents the corresponding change of uncertainty about whether the search is guaranteed to reach a targeted consequence. The matching between \(y_t\) and the antecedent of each rule is usually inexact and will increase uncertainty. There is a sequence of actions \(a_1,a_2, \ldots , a_t\) that controls the tree expanding process in order to maintain low uncertainty. In such a way, reasoning is implemented with the help of the techniques of deep IA-search, which is here referred to as Deep IA-infer.
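The reasoning loop just described can be sketched in a few lines; the set-based rule encoding, the matching score, and the uncertainty update \(h_t = h_{t-1} + (1 - \mathrm{match})\) are all hypothetical stand-ins for the paper’s quantities:

```python
# Hypothetical sketch of forward chaining with uncertainty: at each state,
# the rule whose antecedent best matches the current attributes y_t fires,
# and the inexact match increases the accumulated uncertainty h_t.

def match_score(antecedent, attributes):
    """Fraction of antecedent conditions present among the attributes."""
    return len(antecedent & attributes) / len(antecedent)

def forward_chain(rules, y0, goal, max_steps=10):
    y, h, trace = set(y0), 0.0, []
    for _ in range(max_steps):
        if goal <= y:
            return trace, h                      # targeted consequence reached
        applicable = [r for r in rules if not (r[1] <= y)]  # skip spent rules
        if not applicable:
            break
        best = max(applicable, key=lambda r: match_score(r[0], y))
        score = match_score(best[0], y)
        if score == 0.0:
            break                                # no rule matches at all
        h += 1.0 - score                         # inexact match adds uncertainty
        y |= best[1]                             # attach the rule's postcedent
        trace.append(best)
    return None, h

rules = [({'fever', 'cough'}, {'flu'}),
         ({'flu'}, {'rest'})]
trace, h = forward_chain(rules, {'fever', 'cough'}, {'rest'})
print(len(trace), h)                             # 2 0.0
```

The returned `trace` is exactly the backtrackable sequence of fired IF-THEN rules, and `h` accumulates the uncertainty contributed by inexact matches.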
The star structure illustrated either in Fig. 4b or by the shadowed area in Fig. 4a takes a fundamental role in tree search or tree reasoning. Typically, each star structure (i.e. a number of edges emitted from a state) is prespecified according to domain-dependent knowledge, though it is also possible to identify it from data in some cases. Then, certain quantities, such as \(p_{\rho }(a|s)\), Q(s, a), and \(p(s_{t+1}|s_t)\), are learned from data collected either in advance or during search.
In general, there are two types of star structure. As illustrated by the shadowed area in Fig. 4a, each link of the first type acts as a ‘perceptor’ or ‘controller’ that perceives and maps the current state information into one of the actions. Such actions can be regarded as examples of abstract representations such as decision indices, features, and control signals, and thus such a link may be named an A-type link. Each link of the second type acts as an ‘actuator’ that is driven by the selected action to implement a certain mechanism that moves the state \(s_t\) to the next \(s_{t+1}\), and thus may be named an I-type link.
Both types usually contain randomness and uncertainty and, thus, are modelled by the probabilities \(p(a|s_t)\) and \(p(s_{t+1}|a)\), respectively. Collapsing the A-type links and I-type links between a state and its direct descendant states, the situation illustrated in Fig. 4a simplifies to the star structure illustrated in Fig. 4b. However, such a collapsing degrades the representation ability except in the special case shown in Fig. 4c, such as the ones considered by AlphaGoZero and deep IA-search.
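The collapsing amounts to marginalising out the action, \(p(s_{t+1}|s_t)=\sum_a p(s_{t+1}|a)\,p(a|s_t)\); a minimal numeric sketch (the probabilities are purely illustrative):

```python
import numpy as np

# Collapsing A-type links p(a|s) and I-type links p(s'|a) into one star
# p(s'|s) by marginalising over actions: p(s'|s) = sum_a p(s'|a) p(a|s).
p_a_given_s = np.array([0.7, 0.3])             # two actions from state s
p_next_given_a = np.array([[0.9, 0.1, 0.0],    # action 0 -> three successors
                           [0.2, 0.3, 0.5]])   # action 1

p_next_given_s = p_a_given_s @ p_next_given_a
print(p_next_given_s)    # [0.69 0.16 0.15]
```

The collapsed distribution still sums to one, but the identity of the chosen action is lost, which is the degradation of representation ability noted above.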
Causal analysis: model-based versus data-based approaches

Dependence: we observe dependence between changes of x and y.

Directional: x changes before y changes, for at least an infinitesimal interval.

De-interference: changes of y are caused by neither y itself nor its environment.
The above three natures (the ‘3D’ natures) jointly tell that the changes of y are indeed caused by x at least partially, in a general scope but a weak sense. We may observe a stronger causality in a more restricted scope where y changes with x in a regular manner, e.g. satisfying a certain dynamic law or following a particular mechanism, which actually leads to one currently popular direction of causal analysis on non-experimental data, featured by the model-based approach. That is, we attempt to fit non-experimental data by a model that generates y from x in a manner usually regarded as causal, e.g. possessing at least the above three natures, and interpret according to the model that there is a causality underlying the observed data. Typically, the topology of such a model is a Directed Acyclic Graph (DAG). Examples include not only Bayesian networks (Pearl 1988) and the classical structural equation model (SEM) (Wright 1921; Pearl 2010), but also LiNGAM (Shimizu 2006), the post-nonlinear (PNL) model (Zhang and Hyvärinen 2009), and the additive noise model (ANM) (Hoyer et al. 2009).
This direction of studies has two critical problems. The first is how to judge whether data fit the model well. A conventional fitting error or likelihood may not be enough, since a small fitting error may be obtained at the risk of undermining or violating causality. The second is how to judge whether a model describes causality well. A numeric measure for this is still lacking; instead, one typically checks whether some restrictions are satisfied. Current ways of tackling the two problems estimate two or a few candidate models to minimise error and then pick one via checking which one satisfies the restrictions well.
Yet, there is still a distance towards a best solution from the perspectives of both minimising description error and ensuring causality, for which we need to seek a causality-orientated fitting error measure. Such a measure should cover not only the conventional fitting error but also the deviation from causality. The latter is about deviation from the DAG topology that models causality, related to not only the complexity of the DAG but also whether the causal direction is consistent with the corresponding directions of the DAG topology. Therefore, a reasonable guess is that such a measure is somewhat a kind of directional generalisation error.
Actually, the model-based approaches represent just one category of a dichotomy of causal analysis. The other category is the data-based approach, discovering causality underlying data without assuming a data generative model but with efforts that ensure the 3D natures. One example is the Rubin causal model (RCM) (Rubin 1974; Rubin and John 2011). The key idea is intervening on the change of a designated action, e.g. typically changing between two states usually denoted by the case \(x=1\) and the control \(x=0\), and then observing whether the corresponding effect on y ensures the first two natures, while the nature of de-interference is considered by weighted adjustment via the conditional independence (CI) \(x \perp y \mid u\) on a covariate that describes the environment u, or via its propensity score that represents a sufficient dimension reduction.
Another example came even much earlier. Considering that x and y share one common factor u with \(x \perp y \mid u\), it follows from Reichenbach (1956) that the causality is described by one of three causal tree topologies that share a common CI topology \(x - u - y\). First, \(x\leftarrow u \rightarrow y\) is a simple tree of type \(T_1\) with its root being hidden and two leaf nodes located on the first layer, where for simplicity we use \(T_{\ell }\) to denote a type of trees with total depth \(\ell\). Either of the latter two, \(x\rightarrow u \rightarrow y\) and \(x\leftarrow u \leftarrow y\), is a simplest tree of type \(T_2\) with its root being visible and each of its two layers having only one node.
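A toy simulation of the \(T_1\) case (parameters of our own choosing) reproduces both the CI nature and the product rule \(\rho_{xy}=\rho_{xu}\rho_{yu}\) for unit-variance variables:

```python
import numpy as np

# Toy simulation of Reichenbach's common-cause tree x <- u -> y: with
# zero-mean unit-variance Gaussians, rho_xy = rho_xu * rho_yu, and the
# partial correlation of x, y given u vanishes (the CI nature).
rng = np.random.default_rng(0)
n = 200_000
rho_xu, rho_yu = 0.8, 0.6

u = rng.standard_normal(n)
x = rho_xu * u + np.sqrt(1 - rho_xu**2) * rng.standard_normal(n)
y = rho_yu * u + np.sqrt(1 - rho_yu**2) * rng.standard_normal(n)

rho_xy = np.corrcoef(x, y)[0, 1]
partial = (rho_xy - rho_xu * rho_yu) / np.sqrt((1 - rho_xu**2) * (1 - rho_yu**2))
print(round(rho_xy, 2), round(abs(partial), 2))   # ~0.48  ~0.0
```

Note that the same sample statistics would arise from \(x\rightarrow u \rightarrow y\) or \(x\leftarrow u \leftarrow y\), which is exactly why CI testing alone cannot orientate the edges.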
Using CI testing, we are able to recover the CI topology but unable to determine which one among the three causal topologies holds, i.e. we are unable to orientate the direction of each edge. Such CI test-based studies apply generally to cases with many variables too. The best known examples are the inductive causation (IC) algorithm (Pearl 1991) and its refined counterpart named the PC algorithm (Spirtes and Glymour 1991). The latter starts with a completely connected graph and recursively deletes edges based on CI tests, resulting in the CI topology. Next, the non-directional edges of every V-structure are turned into directional edges. However, merely a part of the edges can be orientated in such a way; the rest remain non-directional. One way is to orientate each remaining edge randomly, in consistence with conditional independence.
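A rough sketch of the PC algorithm’s deletion stage for Gaussian variables, with a plain threshold standing in for a proper CI hypothesis test:

```python
import numpy as np
from itertools import combinations

# Sketch of the PC algorithm's first stage: start fully connected and
# delete an edge (i, j) when some single conditioning variable k makes
# the partial correlation vanish. (A real implementation conditions on
# growing subsets and uses a significance test, not a fixed threshold.)

def skeleton(corr, threshold=0.05):
    n = corr.shape[0]
    edges = set(combinations(range(n), 2))
    for (i, j) in list(edges):
        for k in set(range(n)) - {i, j}:
            pr = (corr[i, j] - corr[i, k] * corr[j, k]) / np.sqrt(
                (1 - corr[i, k]**2) * (1 - corr[j, k]**2))
            if abs(pr) < threshold:
                edges.discard((i, j))   # CI given k: remove the edge
                break
    return edges

# chain x0 -> x1 -> x2 with rho01 = rho12 = 0.7, hence rho02 = 0.49
corr = np.array([[1.0, 0.7, 0.49],
                 [0.7, 1.0, 0.7],
                 [0.49, 0.7, 1.0]])
print(sorted(skeleton(corr)))           # [(0, 1), (1, 2)]
```

The recovered skeleton is the undirected CI topology; orientating its edges requires the V-structure rules or the extra devices discussed next.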
Generally speaking, the CI nature can recover the CI topology but is unable to orientate every edge direction. We may use an approach similar to RCM for the direction of each unorientated edge. Also, techniques used in the above-mentioned LiNGAM, PNL, and ANM, as well as SADA (Cai et al. 2013), may be adopted for this purpose.
Discovering star structure and TPC causal learning
The CI topology obtained above consists of merely visible variables, which is not enough in real situations where certain hidden factors need to be taken into consideration. One early study was made by Pearl (1986) on the problem of binary visible nodes together with a number of binary hidden variables.
Subsequently, the study was extended to Gaussian visible nodes \(x_1,x_2,x_3\) in Refs. Xu (1986a) and Xu and Pearl (1987), in which it was found that the necessary and sufficient condition for identifying this CI topology is the satisfaction of the above triangular inequalities.
Theorem 1
To identify the star topology by Eq. (14), the satisfaction of the triangular inequalities by Eq. (12), or equivalently the equalities by Eq. (13), is (a) the necessary condition for random variables \(x_1,x_2,\ldots , x_n\) in general, and (b) the necessary and sufficient condition for Gaussian variables \(x_1,x_2,\ldots , x_n\) in particular.
Even when the theorem is satisfied, it should be noted that there are still two types of indeterminacy. First, though the satisfaction of Eq. (13) uniquely specifies the star topology, this topology is shared by an equivalence class that consists of not only one causal topology of a \(T_1\) tree with its root hidden but also three different causal topologies of \(T_2\) trees with their roots visible, as illustrated in Fig. 4d. Second, the solution for the triplet \((\rho _{12}, \rho _{13}, \rho _{32})\) to satisfy Eq. (13) is not unique; there may be infinitely many.
Typically, identifying the star or CI topology and estimating unknown parameters are handled jointly (Spearman 1904; Wright 1921; Anderson and Rubin 1956; Pearl 1986, 2010; Shimizu 2006; Zhang and Hyvärinen 2009; Hoyer et al. 2009). Differently, the above theorem implies the possibility of separating topology identification and parameter estimation. Taking the issue of the equivalence class into consideration too, we get a three-phase causal learning approach as summarised in Table 2.
The other way is based on a star causal model with a set \({\varvec{\rho }}\) of n unknown parameters \(\rho _{iw}\), from which we obtain Eq. (13), consisting of \(0.5n(n-1)\) joint equations. As summarised in Table 2, we examine the possible outcomes of the joint equations to identify the star causality. The best outcome is that the joint equations have a unique solution, which identifies that the causal tree topology in consideration is among an equivalence class of four simple causal \(\rho\)-trees as illustrated in Fig. 4d, but fails to identify uniquely which one.
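For a triplet, assuming the equalities of Eq. (13) take the factorised star form \(\rho_{ij}=\rho_{iw}\rho_{jw}\) for the hidden centre w (our reading of the star model in Fig. 4d), the joint equations admit a familiar closed-form solution, unique up to a joint sign flip:

```python
import numpy as np

# T-phase joint equations for a triplet, assuming the factorised star form
#     rho_ij = rho_iw * rho_jw.
# The closed form below exists only when each ratio inside sqrt lies in
# (0, 1], which corresponds to the triangular inequalities of Eq. (12).
def star_solution(r12, r13, r23):
    rho_1w = np.sqrt(r12 * r13 / r23)
    rho_2w = np.sqrt(r12 * r23 / r13)
    rho_3w = np.sqrt(r13 * r23 / r12)
    return rho_1w, rho_2w, rho_3w

# leaf correlations generated by rho_1w = 0.9, rho_2w = 0.8, rho_3w = 0.5
sol = star_solution(0.72, 0.45, 0.40)
print([round(float(v), 2) for v in sol])    # [0.9, 0.8, 0.5]
```

The sign ambiguity (flipping all three \(\rho_{iw}\) leaves every product unchanged) is one concrete face of the equivalence-class indeterminacy discussed above.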
TPC causal learning
Solving joint equations for a CI \(\rho\)-tree by Theorem 2 and for a CI \(\rho\)-DAG by Theorems 3 and 4

Joint equations  Topology identification (T-phase)  Parameter reestimation (P-phase)  Causal \(\rho\)-tree search (C-phase)
No solution  Inconsistent with data  NA  NA 
Unique solution  As the necessary and sufficient condition for identifying this CI topology of a \(\rho\)-tree or \(\rho\)-DAG, which is an equivalence class of a number of causal \(\rho\)-trees or \(\rho\)-DAGs; at least one of them models the data well  Each member of this equivalence class is modelled by SEM equations with coefficients as parameters, which are reestimated via iterating an SEM-based constrained sparse optimisation with the coefficients obtained in T-phase as initialisation  Search for the best one among the equivalence class, with each one enumerated by a search strategy, estimated in P-phase, and evaluated by a measure that considers both best-fit and causality to get a directional generalisation error
Many or infinitely many solutions  As a necessary condition that this CI topology satisfies
The third phase is named C-phase, which aims at choosing the best one among the equivalence class; for this we need a search strategy to enumerate the equivalence class and also a measure that evaluates both best-fit and causality to get a directional generalisation error.
In summary, the separation principle that was first revealed in Refs. Xu (1986a) and Xu and Pearl (1987) has here been further developed into a three-phase learning method, namely T-phase, P-phase, and C-phase, or shortly TPC causal learning.
Structuring causal trees: TRIPLET, STAR, and their recursions

Step 1 Select a set V of significant variables, where each \(x\in V\) is selected from the set U of all the variables if x is regarded as “significant” by either a hypothesis test or according to a criterion. One example is that each \(x\in V\) is a significant SNV selected from a U that consists of all the SNVs in a GWAS study. Another example is that each \(x\in V\) is a significant biomarker selected from a U that consists of the expressions of all the genes in genomics analyses.

Step 2 For every \(x\in V\), we screen every triplet (x, y, z) by enumerating every pair (y, z) from the difference set \(U-V\). This triplet is ignored if it is already in the set \(T^d\) that accommodates all the detected triplets; otherwise, it is examined by either an inequality test based on Eqs. (12) and (15) or an equality test based on Eqs. (13) and (16), and it is added into \(T^d\) if the test is passed.

Step 3 For each triplet \((x,y,z)\in T^d\), we solve Eq. (20) and choose the best \({\varvec{\rho }}^*\) in the best model among the four candidates.
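The screening loop of Steps 1-3 can be sketched as follows; `passes_star_test` is a hypothetical stand-in for the inequality/equality tests of Eqs. (12)-(13) and (15)-(16), here using the factorised-star ratios:

```python
from itertools import combinations

def passes_star_test(r12, r13, r23):
    """Hypothetical stand-in for the Eq. (12)/(13) tests: a factorised star
    rho_ij = rho_iw * rho_jw requires every ratio rho_ij * rho_ik / rho_jk,
    which equals rho_iw**2, to lie in (0, 1]."""
    if r12 * r13 * r23 <= 0:
        return False
    return all(0 < a * b / c <= 1
               for a, b, c in [(r12, r13, r23), (r12, r23, r13), (r13, r23, r12)])

def screen_triplets(V, U, rho):
    """Step 2: enumerate triplets (x, y, z) with x in V and (y, z) from U - V."""
    detected = set()
    for x in V:
        for y, z in combinations(sorted(set(U) - set(V)), 2):
            key = tuple(sorted((x, y, z)))
            if key not in detected and passes_star_test(rho(x, y), rho(x, z), rho(y, z)):
                detected.add(key)
    return detected

# toy correlations generated by a star with loadings 0.9, 0.8, 0.5
corr = {('x', 'y'): 0.72, ('x', 'z'): 0.45, ('y', 'z'): 0.40}
detected = screen_triplets(['x'], ['x', 'y', 'z'],
                           lambda a, b: corr[tuple(sorted((a, b)))])
print(detected)     # {('x', 'y', 'z')}
```

Each detected triplet would then enter Step 3, where Eq. (20) is solved to pick the best \({\varvec{\rho }}^*\) among the four candidate causal trees.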

Pick from \(U-S\) a visible variable u that has the strongest correlation with any one of the variables in S, and then perform T-phase to examine a star that consists of u and all the variables in S by either an inequality test based on Eqs. (12) and (15) or an equality test based on Eqs. (13) and (16).

If the test fails, discard u. If it succeeds, put u into S and choose candidate causal trees by choosing \(a\rightarrow u\) or \(u \rightarrow a\) so as to avoid a V-structure.

Perform P-phase to solve Eq. (20) and choose the best \({\varvec{\rho }}^*\) in the best model among the candidates obtained above.

Examine whether a prespecified terminating condition is satisfied; if yes, stop; otherwise, go to the beginning line above.
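The STAR growth loop above can be sketched as follows; the `star_test` argument is a placeholder for the T-phase test of Eqs. (12)-(13) and (15)-(16), and the toy test used in the example merely caps the star size:

```python
def grow_star(S, U, rho, star_test, max_size=10):
    """Incremental STAR growth: repeatedly pull the unvisited variable most
    correlated with the current star S, run the T-phase star test, and
    absorb the variable only on success."""
    S = list(S)
    candidates = [u for u in U if u not in S]
    while candidates and len(S) < max_size:
        # strongest correlation with any current member of S
        u = max(candidates, key=lambda c: max(abs(rho(c, s)) for s in S))
        candidates.remove(u)
        if star_test(S + [u], rho):
            S.append(u)          # success: u joins the star
    return S

# toy correlations from a factorised star with the loadings below
loading = {'a': 0.9, 'b': 0.8, 'c': 0.5, 'd': 0.3}
rho = lambda p, q: loading[p] * loading[q]
grown = grow_star(['a'], ['a', 'b', 'c', 'd'], rho, lambda S, r: len(S) <= 3)
print(grown)     # ['a', 'b', 'c']
```

After each successful absorption, P-phase (Eq. (20)) would be run on the enlarged star before the terminating condition is checked.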

Step 1 As illustrated in Fig. 4e, we choose one edge \(l_{az}\) of a known triplet a-xyz and consider locating at the middle of \(l_{az}\) an additional edge \(l_{bu}\) that has one end being a new hidden variable b and the other end being a new visible variable u. We thus get a new triplet b-zua.

Step 2 Perform T-phase to test this new triplet b-zua in the same way as Step 2 of the above TRIPLET procedure. If it fails, \(l_{bu}\) is discarded.

Step 3 If the test succeeds, perform P-phase by Eq. (20) on the triplet b-zua as follows:

Step 3(a) Solve Eq. (20) on the triplet b-zua and choose the best \({\varvec{\rho }}^*\) in the best model \(T^*\) among the four candidates. Also, record the corresponding best fitting measure \(L^*_{az}\).

Step 3(b) Repeat Step 1 to Step 3 on the other two edges \(l_{ay}\) and \(l_{ax}\), respectively. Choose the best \(\xi ^*\) according to \(L^*_{a\xi }\) for \(\xi =x,y,z\), and finally add the edge \(l_{bu}\) to the middle of the corresponding edge \(l_{a\xi ^*}\).
Alternatively, we may also locate the hidden centre of a triplet at the middle point of edge \(l_{az}\). In such a way, the middle point actually forms a star topology of five edges. Even more, we may locate the hidden centre of a star with more edges at the middle point, or merge the hidden centres of two triplets or stars to form a bigger star. Then, we can check this star by either an inequality test based on Eqs. (12) and (15) or an equality test based on Eqs. (13) and (16). If the test is passed, we solve Eq. (20) and choose the best \({\varvec{\rho }}^*\) in the best model among all the candidates. Proceeding in this way, we keep adding a visible node, a triplet, or a star into the current tree until all the visible nodes have been examined.
CI \(\rho\)-diagram and causal \(\rho\)-tree discovery
In some real applications, a possible topology of a causal tree may come from domain knowledge. Also, we may be asked to compare causal topologies resulting from different existing methods of causal analysis. Thus, it is also demanded to further examine whether a given topology is a CI diagram that defines an equivalence class of a number of causal trees. Beyond tree topology, a CI \(\rho\)-diagram is a Directed Acyclic Graph (DAG) or Bayesian network, on which extensive studies have been made (Pearl 1988, 2010; Spirtes and Glymour 1993; Spirtes et al. 2000).
As illustrated in Fig. 4f, we consider a diagram with visible nodes \(x_1,x_2,\ldots , x_n\) and hidden nodes \(w_1, \ldots , w_m\), in which again not only is each \(x_i\) normalised to zero mean and unit variance but each \(w_j\) is also assumed to have zero mean and unit variance. In this diagram, each edge is associated with the correlation coefficient \(\rho\) between its two nodes. Such a diagram is completely defined by a set of pairwise correlation coefficients \(\rho\), shortly called a \(\rho\)-diagram.
Since independence implies decorrelation (i.e. an edge can be removed if its associated correlation coefficient \(\rho =0\)), a CI diagram must be a CI \(\rho\)-diagram. On the other hand, two variables \(x_1, x_2\) may not be independent even if there is no correlation between them, because there may still be higher order dependence between them; i.e. a conditional decorrelation (CD) \(\rho\)-diagram may not be a CI diagram. For simplicity, we still use the name CI \(\rho\)-diagram in the narrow sense that the diagram is a CI diagram while we consider each edge merely by its associated correlation coefficient \(\rho\). That is, we consider a restricted form of CI diagram, namely the CI \(\rho\)-diagram.
In the sequel, we start with a special case in which the diagram is a tree, namely a CI \(\rho\)-tree, for which we may extend Eq. (13) into the following theorem:
Theorem 2
Given a conditional independence (CI) undirected \(\rho\)-tree topology, considering a pair of nodes \(\xi , \eta\) linked by an undirected path \(\xi - x_1 - \cdots - x_m - \eta\) and adding an additional edge between \(\xi ,\eta\) associated with \(\rho _{\xi \eta }\) to form a loop, we have \(\rho _{\xi \eta }=\rho _{\xi x_1}\rho _{x_1x_2}\cdots \rho _{x_{m-1}x_m}\rho _{x_m \eta }.\)
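A quick simulation check of Theorem 2 on a four-node chain (toy edge correlations of our own choosing):

```python
import numpy as np

# Toy check of Theorem 2 on the chain xi - x1 - x2 - eta with edge
# correlations 0.9, 0.8, 0.7: the end-to-end correlation should equal
# their product, 0.9 * 0.8 * 0.7 = 0.504.
rng = np.random.default_rng(1)
n = 400_000

def child(parent, rho):
    """Unit-variance variable correlated with `parent` at level rho."""
    return rho * parent + np.sqrt(1 - rho**2) * rng.standard_normal(n)

xi = rng.standard_normal(n)
eta = child(child(child(xi, 0.9), 0.8), 0.7)
r = np.corrcoef(xi, eta)[0, 1]
print(round(r, 2))    # ~0.5
```

The product rule is what makes the T-phase joint equations for a \(\rho\)-tree linear in the log-\(|\rho|\) domain, one edge correlation per factor.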
As illustrated in Fig. 5a–c, a pair of nodes \(\xi , \eta\) is said to be directionally correlated if the pair is linked by a path in the pattern \(\leftarrow \leftarrow \cdots \leftarrow _j \rightarrow \rightarrow \cdots \rightarrow\), where j can be located at \(\xi\) or \(\eta\) as well as at any middle point.
Theorem 3
Given a directed \(\rho\)-tree topology, considering a pair of nodes \(\xi , \eta\) that are directionally correlated by a path \(\xi \leftarrow x_1\leftarrow \cdots \leftarrow x_j \rightarrow \cdots \rightarrow x_m\rightarrow \eta\), we have \(\rho _{\xi \eta }=\rho _{\xi x_1}\rho _{x_1x_2}\cdots \rho _{x_{j-1}x_j}\rho _{x_jx_{j+1}}\cdots \rho _{x_{m-1}x_m}\rho _{x_m \eta }.\)
Based on this theorem, we can perform T-phase to check whether a given directed \(\rho\)-tree topology is consistent with a given set of samples from the visible nodes. Following a procedure similar to turning causal star topologies into Eqs. (18) and (19), we turn each directed causal \(\rho\)-tree into its corresponding SEM equations, based on which we perform P-phase to find the best \({\varvec{\rho }}^*\) via the optimisation by Eq. (20). Based on these SEM equations, we may infer the first-order statistics of variables, subject to errors propagated along the paths within the tree, e.g. the previous \(\rho _{2w}e_{w1}+e_{2w},\ \rho _{3w}e_{w1}+e_{3w}\) for the cases in Fig. 4d.
The information flows in a directed \(\rho\)-tree topology as illustrated in Fig. 5b diverge from the tree’s root or hidden nodes to the visible nodes; thus, it may be called a diverging causal tree. Obviously, it is a generative model or Ying causal model, implementing an I-type mapping. Reversing the direction of every edge, the topology becomes one that may be named a converging causal tree, as illustrated in Fig. 5a, in which the information flows converge from the visible nodes to the tree’s root. Accordingly, it is called a representative model or Yang causal model, implementing an A-type mapping. Strictly speaking, such a converging causal tree is a Directed Acyclic Graph (DAG) but no longer a directed tree, for which Theorem 3 does not apply.
In comparison with structuring a causal tree incrementally by TRIPLET, STAR, and their recursions, the above causal \(\rho\)-tree approach balances the estimation of the parameters \({\varvec{\rho }}\) in a systematic manner. Yet, there lacks an effective technique for performing C-phase, especially for enumerating all the candidate causal \(\rho\)-tree topologies. Still, it is useful for the practical needs of not only examining one or more topologies obtained from domain knowledge but also comparing causal topologies resulting from different existing methods.
Causal \(\rho\)-DAG discovery: Yule–Simpson paradox, SPRINKLER, and BACKDOOR
Without loss of generality, we again normalise variables to zero means and unit variances. It follows from Eq. (24) that \(Ezy=Ez(y_z+y_x)=\rho _{yz}Ez^2 + \rho _{yx}\rho _{zx}Ez^2= \rho _{yz} +\rho _{yx}\rho _{zx}\). In other words, the simple product in Theorem 2 is here extended into a summation of products; shortly, the PROD format is extended into a SUM-PROD format.
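This SUM-PROD identity is easy to confirm by simulation (toy coefficients of our own choosing):

```python
import numpy as np

# Toy check of the SUM-PROD identity E[zy] = rho_yz + rho_yx * rho_zx
# for x -> z and (x, z) -> y, all variables zero mean and unit variance.
rng = np.random.default_rng(2)
n = 400_000
rho_zx, rho_yz, rho_yx = 0.6, 0.5, 0.3

x = rng.standard_normal(n)
z = rho_zx * x + np.sqrt(1 - rho_zx**2) * rng.standard_normal(n)
y = rho_yz * z + rho_yx * x        # noise term omitted: it drops out of E[zy]

e_zy = np.mean(z * y)
print(round(e_zy, 2))    # ~0.68 = 0.5 + 0.3 * 0.6
```

The two summands are exactly the two treks between z and y: the direct edge \(z\rightarrow y\), and the path through their common cause x.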

Projection: pick a suspicious variable z that affects either or both of x, y. Generally, such a z comes from a projection of a multidimensional vector that consists of many environmental variables, e.g. from a principal component;

Detection: make a hypothesis test to detect the inequalities by Eq. (28) or its probabilistic version in a way similar to Eq. (15) discussed previously; if the test fails, discard z; if it passes, go to the next step;

Correction: put the corrected \(\rho _{yz} , \rho _{yx}, \rho _{zx}\) by Eq. (27) into the original \(\rho\) value-based test to check the corrected effects \(x\rightarrow y\), \(z\rightarrow y\), and \(z\rightarrow x\).
There are 10 equations with 5 unknowns; that is, the problem may be overconstrained, especially when \(\{ \rho _{\xi \eta }^o\}\) are estimated from a set of samples. Therefore, it is better to consider a least error approach, e.g. like Eq. (23).
It deserves to be recalled that Wright’s path analysis (Wright 1934) also obtained a set of joint equations like the ones in Eq. (29) or Eq. (33), by considering the second-order statistics along paths among variables in a multiple system from a set of SEM equations. In other words, our approach in this paper inherits the spirit of getting joint equations of second-order statistics by analysing paths. But there are two major differences between the two studies. First, Wright aimed at solving variable quantities based on the joint equations from certain known variables, usually in a small scale system with a few variables. Our approach, however, aims at using these joint equations to examine whether the topology of a specific DAG is consistent with the one underlying the observed data samples, that is, making topology identification in T-phase as summarised in Table 2, while not only are parameters and unknowns in the system reestimated in P-phase via optimisation for the least error or maximum likelihood, but also a best model is searched for in C-phase by considering both best-fit and causality. Second, our approach focuses on the relation between \({\rho }\) values and the topology in consideration by assuming that variables are normalised to zero means and unit variances. In other words, the joint equations considered in our approach merely involve correlation coefficients, which considerably reduces the number of free parameters that are not helpful to topology identification. In contrast, Wright’s path analysis considered joint equations that contain not only correlation coefficients but also path coefficients as well as variances.
Causal \(\rho\)-DAG discovery: general case, \(\rho\)-SAT problem, and basic mechanisms
For \(\rho _{x_5x_7}\), there is not merely a path in the pattern \(\leftarrow \leftarrow \cdots \leftarrow _j \rightarrow \rightarrow \cdots \rightarrow\) with \(\rho _{x_5w_3}\rho _{w_3w_2}\rho _{w_2w_4}\rho _{w_4x_7}\) and j at \(w_3\). There is also a flow that injects in at \(w_2\), and thus \(\rho _{x_5w_3}\rho _{w_3w_2}\) should be replaced by \([(\rho _{x_1x_4}\rho _{x_1w_1}+\rho _{x_2x_4}\rho _{x_2w_1}+\rho _{x_3x_4}\rho _{x_3w_1})\rho _{w_1w_2} + \rho _{x_5w_3} \rho _{w_3w_2}].\) Similarly, we can obtain \(\rho _{x_5x_6}.\)
In total, we get \(4+4+4+3=15\) equations jointly for identifying whether the \(\rho\)-DAG illustrated in Fig. 5d underlies a given set of samples, by checking whether these joint equations are solvable, for which we are led to Table 2 again.

Step 1 Find every pair of nodes \(\xi , \eta\) that are directionally correlated by a path in the pattern \(\leftarrow \leftarrow \cdots \leftarrow _j \rightarrow \rightarrow \cdots \rightarrow\), where j can be located at \(\xi\) or \(\eta\) as well as at any middle point;

Step 2 On every path found above, identify the junction nodes featured by in-degrees bigger than 1, and pool all the related paths into a hierarchy similar to the one in Eq. (34).

Step 3 Attach to each edge \(a\rightarrow b\) in this hierarchy its corresponding \(\rho _{ab}\);

Step 4 Sum up at each junction node the \(\rho\)-products of subpaths from the bottom up, in a way similar to the one in Eq. (34), until the top of the hierarchy.
This is also generally true. There are three possible scenarios for a pair of nodes \(\xi , \eta\) in a \(\rho\)-DAG. First, the pair can be ignored if it is not linked by any path in the pattern \(\xi \leftarrow \leftarrow \cdots \leftarrow _j \rightarrow \rightarrow \cdots \rightarrow \eta\). Second, it is linked by only one such path, and thus Theorem 3 applies. Third, the pair is linked by a number of paths in the pattern \(\xi \leftarrow \leftarrow \cdots \leftarrow _j \rightarrow \rightarrow \cdots \rightarrow \eta\) due to the existence of junction nodes. For the last two scenarios, we may extend Theorem 3 into the following one.
Theorem 4
Given a pair of nodes \(\xi , \eta\) in a \(\rho\)-DAG, if the pair is linked by a number \(n_{\xi \eta }\) of paths, each in the pattern \(\xi \leftarrow x_1^{(r)}\leftarrow \cdots \leftarrow x_j^{(r)}\rightarrow \cdots \rightarrow x_{m_r}^{(r)}\rightarrow \eta , \ r=1, \ldots , n_{\xi \eta }\), where j may be located at \(\xi\) or \(\eta\) as well as at any middle point, we have \(\rho _{\xi \eta }^o=\sum _{r=1}^{n_{\xi \eta }} \rho ^{(r)}_{\xi \eta }\) and \(\rho ^{(r)}_{\xi \eta }= \rho ^{(r)}_{\xi x_1}\rho ^{(r)}_{x_1x_2}\cdots \rho ^{(r)}_{x_{j-1}x_j}\rho ^{(r)}_{x_jx_{j+1}}\cdots \rho ^{(r)}_{x_{m_{r}-1}x_{m_r}}\rho ^{(r)}_{x_{m_r} \eta }.\)
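Steps 1-4 and Theorem 4 can be sketched as a path-enumeration routine; the sketch assumes a tree-like diagram in which the two branches of a trek do not re-meet after diverging, which holds for the small examples here:

```python
from itertools import product

def directed_paths(edges, src, dst, path=()):
    """All directed paths src -> ... -> dst, each as a tuple of edges."""
    if src == dst:
        yield path
    for (a, b) in edges:
        if a == src:
            yield from directed_paths(edges, b, dst, path + ((a, b),))

def rho_pair(edges, xi, eta):
    """Sum, over all treks xi <- ... <- j -> ... -> eta, of the products of
    edge rho's (Theorem 4). Assumes the two branches of a trek never
    re-meet after diverging (true for tree-like diagrams)."""
    nodes = {v for e in edges for v in e}
    total = 0.0
    for j in nodes:
        for left, right in product(directed_paths(edges, j, xi),
                                   directed_paths(edges, j, eta)):
            if left and right and left[0][1] == right[0][1]:
                continue        # branches share their first step: not a trek
            p = 1.0
            for e in left + right:
                p *= edges[e]
            total += p
    return total

# x -> z (0.6), z -> y (0.5), x -> y (0.3): two treks between z and y
edges = {('x', 'z'): 0.6, ('z', 'y'): 0.5, ('x', 'y'): 0.3}
print(round(rho_pair(edges, 'z', 'y'), 2))   # 0.68 = 0.5 + 0.6*0.3
```

This reproduces the SUM-PROD value derived earlier for \(Ezy\): one summand per trek, one factor per edge.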
Further insight on Eq. (34) can be obtained by imagining a special case in which each \(\rho\)-parameter takes values either around 0 or around 1. Then, a product of two \(\rho\)-parameters acts like a logical ‘AND’ gate, while a sum of two \(\rho\)-parameters acts like a logical ‘OR’ gate. The problem of identifying \(\rho\)-parameters by solving equations like Eq. (34) becomes somewhat similar to the problem of Boolean satisfiability or propositional satisfiability (shortly, SAT), which has wide real applications in artificial intelligence, circuit design, and automatic theorem proving (Vizel et al. 2015). On the other hand, we may regard the classic SAT problem as being extended into the above \(\rho\)-SAT problem of solving equations like Eq. (34), from binary valued to real valued and from deterministic to probabilistic. Many studies have been made on the classical SAT problem, which may provide references for further study on the \(\rho\)-based SUM-PROD system.
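The analogy can be made concrete with a toy truth table: products of near-0/near-1 \(\rho\) values behave like AND, and saturating sums like OR (the 0.02/0.98 values are arbitrary illustrations):

```python
# Toy truth table for the rho-SAT analogy: with rho values near 0 or 1,
# a rho-product behaves like AND and a saturating rho-sum like OR.
def AND(a, b):
    return a * b

def OR(a, b):
    return min(1.0, a + b)

for a in (0.02, 0.98):
    for b in (0.02, 0.98):
        print(round(a), round(b), round(AND(a, b)), round(OR(a, b)))
# 0 0 0 0
# 0 1 0 1
# 1 0 0 1
# 1 1 1 1
```

Away from the extremes, the gates become graded, which is exactly the real-valued, probabilistic extension from SAT to \(\rho\)-SAT.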
It follows from observing Eq. (34) and Fig. 5d as well as Fig. 6a that SUM operates on a node v with \(deg^-(v)>1\) and integrates arrows that act in the same dimension or subspace, while PROD operates on an arrow or a directed path of several arrows along the same direction and integrates arrows that act in different dimensions or subspaces, with each arrow adding one dimension. Actually, SUM is one of several choices that implement the FAN-in mechanism, as elaborated in Fig. 6e. One end is SUM, which treats each fan-in flow evenly and thus considers the average of all the fan-in flows, e.g. a classic neuron model, while the other end is WTA (winner-take-all), which treats fan-in flows competitively to pick the best one as winner, e.g. the pooling operation in conventional neural networks. Between the two ends, SUM may be replaced by some weighted average, and WTA may be replaced by some soft version of competition, e.g. based on a finite mixture.
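The two ends of the FAN-in mechanism, and a soft competition in between, can be illustrated on a toy vector of fan-in flows (the softmax temperature is arbitrary):

```python
import numpy as np

# FAN-in on a toy vector of incoming flows: SUM end (even average),
# WTA end (winner-take-all), and a softmax-weighted average in between.
flows = np.array([0.2, 0.9, 0.4])

avg = flows.mean()                    # SUM end: treat flows evenly
wta = flows.max()                     # WTA end: keep only the winner
w = np.exp(5 * flows)                 # temperature 5, chosen arbitrarily
soft = (w / w.sum()) @ flows          # soft competition between the ends
print(round(float(avg), 2), float(wta), round(float(soft), 2))   # 0.5 0.9 0.84
```

Raising the temperature pushes `soft` towards `wta`; lowering it towards `avg`, spanning the continuum between the two ends.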
The information flows in Fig. 6a converge from the observation nodes to the tree root, featured by \(deg^-(v)>1\) for every inner node v in the tree. Shortly, we may use ‘converging tree’ to name such a tree, which is actually an example of an Abstraction model or Yang model. Reversing the direction of every edge, the information flows in Fig. 6b diverge from the tree’s root or hidden nodes to the visible nodes, featured by \(deg^+(v)>1\) for every inner node v in the tree. Shortly, we may use ‘diverging tree’ to name such a tree, which is actually an example of a Generative model or Ying model.
In Fig. 6b, the FAN-in mechanism is replaced by its counterpart named the FAN-out mechanism, as elaborated in Fig. 6f. At one end, the counterpart of SUM is ASSIGN, which evenly emits outflows from every node v with \(deg^+(v)>1\), e.g. an outer product generative unit in a deconvolution network or in the reconstruction part of LMSER learning networks (Xu 1993); the other end is again WTA, which emits merely the best one out of all the fan-out flows competitively. Again, there may also be weighted averages and soft competition between the two ends.
Particularly, we suggest that such a WTA FAN-out mechanism may improve image reconstruction by performing WTA subpixel interpolation or, alternatively, mixture subpixel interpolation, as sketched in Fig. 6g. The bottom level of a reconstruction or deconvolution network does not terminate at the pixel level. Instead of minimising the error between each pixel and its reconstruction, the bottom level is designed for reconstructing subpixels. For each pixel in a sample image, a WTA competition is made among the reconstructed subpixels underlying the pixel that corresponds to this sample pixel, and we minimise the error between this sample pixel and the winning reconstructed subpixel. Alternatively, we may also replace WTA by a soft competition, e.g. by a posterior weighting \(p(i|\mathrm{pixel})\) via considering \(p(\mathrm{pixel})=\sum _i \alpha _i p(\mathrm{subpixel}_i)\).
Next, the locations of these FAN-in/FAN-out nodes specify a hierarchy of the nodes and edges in consideration, which corresponds to a HIERARCHY mechanism that defines a specific combination of those nodes’ locations, or a partial order hierarchy.
Moreover, all the previous analyses in this paper assume that variables are normalised to zero means and unit variances, which is actually one example of implementing the BOUND mechanism. A variable with bounded variance actually implies that this variable varies with a bounded energy. In computation, requiring a bounded variance is equivalent to requiring a unit variance, implemented simply by normalisation. Equivalently, such a bounding mechanism may be implemented by a nonlinear transform as sketched in Fig. 6h, e.g. a sigmoid nonlinearity, its piecewise approximation, or the widely used LUT nonlinearity.
In summary, the PROD and FAN-in/FAN-out mechanisms jointly describe dependence among variables, with PROD harmonising effects from different parts of one individual and the FAN-in/FAN-out mechanisms gathering from and allocating among different individuals; the HIERARCHY mechanism defines conditional independence and how variables are organised, and the BOUND mechanism ensures practical feasibility. These four basic mechanisms coordinately operate a causal \(\rho\)-DAG model or even a general intelligent system.
Removing the HIERARCHY mechanism, e.g. the tree hierarchy in Fig. 6b, makes the structure collapse such that the tree root becomes the centre of a star topology as illustrated in Fig. 6c, while every path between the tree root and each leaf collapses into an edge of the star. Further removing the PROD mechanism, the corresponding \(\rho\) product collapses into one variable. Apparently, both the tree hierarchy in Fig. 6b and the star topology in Fig. 6c have the same number of SEM equations and thus the same representational capacity. However, the HIERARCHY and PROD mechanisms additionally impose higher order joint equations on these SEM equations, encoding not only variables but also how they are organised. Similar understandings may be obtained from the converging tree illustrated in Fig. 6a and its collapsed star topology in Fig. 6d.
As illustrated in Fig. 6j, we top the factors \(\mathbf{f}\) with another node that exclusively selects one of the factors, such that visible nodes are no longer simultaneously shared by different stars; this actually acts as a mixture of star topologies that take effect exclusively on the visible nodes, and thus reduces the coupling width from m towards 1. In general, a HIERARCHY mechanism defines a multiple-level partial order hierarchy via specifying the locations of these FAN-in/FAN-out nodes and their corresponding mechanisms, featured alternately at the two ends.
The HIERARCHY mechanism is not just about increasing layers to go ’deep’. Multiple linear layers, obtained by topping one layer on another, collapse into merely one layer, because the variables are Gaussian and their summations are still Gaussian. Instead, HIERARCHY defines a partial order hierarchy, which will not collapse because of sparse links and the constraints imposed by higher order joint equations that are solvable as described in Table 2. Similar arguments apply to a neural network as illustrated in Fig. 6k.
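The collapse of stacked linear layers is a one-line matrix identity, illustrated numerically below with arbitrary random weights (the dimensions chosen are only for the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 6))   # linear layer 1: 6 inputs -> 4 units
W2 = rng.standard_normal((3, 4))   # linear layer 2: 4 inputs -> 3 units
y = rng.standard_normal(6)

two_layer = W2 @ (W1 @ y)          # stacking two linear layers...
one_layer = (W2 @ W1) @ y          # ...is exactly one linear layer W2 W1
```

Whatever the depth, a purely linear stack reduces to the single matrix product of its weight matrices, so depth alone adds nothing without sparse links, nonlinearity, or the joint-equation constraints discussed above.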
Understandably, the more layers there are, the easier it becomes to accommodate such a hierarchy, which echoes the opinion in Ref. Xu (2017) and explains why deep learning is preferred, that is, we need to encode not only variables but also how they are organised. In a particular task domain, we only need networks with a limited depth, because the patterns underlying samples are expressed in hierarchies of limited depth. Also, a partial order is not affected by adding one extra edge weighted simply with 1. Thus, increasing depth further will not improve performance, but will waste computing cost in both memory and learning time.
Last but not least, the BOUND mechanism not only ensures practical feasibility, but also helps the HIERARCHY mechanism. First, embedding nonlinearity after each summation, i.e., \(x^{(i)}=s(\sum _j w_{ij} y^{(j)}+\varepsilon ),\) will remedy the above-mentioned collapsing of multiple linear layers. Second, it has been discovered in Ref. Xu (1993) that adding a sigmoid nonlinearity after a summation drives hidden nodes towards mutual independence and self-organisation, which again echoes the opinion made in Ref. Xu (2017) and explains why it is beneficial to perform bottom-up unsupervised learning as pre-training. Third, a linear transform from one level of variables \(\mathbf{y}=\{ y^{(j)}\}\) to the next level of variables \(\mathbf{x}=\{ x^{(i)}\}\) will not only map a specific value of \(\mathbf{y}\) into a specific value of \(\mathbf{x}\) but also preserve the neighbourhood or topological relation of \(\mathbf{y}\) after being mapped to that of \(\mathbf{x}\). Thus, we should only consider those post-summation nonlinearities s(.) that preserve this nature, e.g. the ones in Fig. 6h.
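The remedy in the first point can be demonstrated directly: with a sigmoid embedded after each summation, two stacked layers no longer reduce to a single linear map. The sketch below uses arbitrary random weights and zero noise \(\varepsilon\) purely for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def layer(y, W, eps=0.0):
    """One level x = s(W y + eps): a summation followed by a monotone,
    neighbourhood-preserving squashing nonlinearity s(.)."""
    return sigmoid(W @ y + eps)

rng = np.random.default_rng(3)
W1 = rng.standard_normal((4, 6))
W2 = rng.standard_normal((3, 4))
y = rng.standard_normal(6)

x = layer(layer(y, W1), W2)        # two nonlinear levels
collapsed = sigmoid(W2 @ W1 @ y)   # what a collapsed linear stack would give
```

Because the sigmoid is strictly monotone, nearby values of \(\mathbf{y}\) still map to nearby values of \(\mathbf{x}\), so the topological relation is preserved even though the collapse is prevented.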
Concluding remarks
Examining AlphaGoZero together with revisiting early studies on A* search, MCTS is found to share a scouting technique with CNneim-A, which was proposed in 1986. The strengths of AlphaGoZero and CNneim-A are further integrated to develop a new family named deep IA-search, including DSA, DCA, DBA, and VAlphaGoZero, as well as their extensions DSAE, DCAE, DBAE, and AlphaGoZeroE. We are further motivated to perform reasoning with the help of deep IA-search. Especially, causal reasoning is addressed and a correlation-coefficient-based approach is proposed for identifying causal \(\rho\)-trees and causal \(\rho\)-DAGs, featured by performing TPC learning to discover causality in three phases. Algorithms are sketched for discovering causal topologies of triplets, stars, trees, and \(\rho\)-DAGs, with further details on Yule–Simpson’s paradox, Pearl’s Sprinkler DAG, and the Back-door DAG. Moreover, the classic Boolean SAT problem is extended into a \(\rho\)-SAT problem, and the roles of four fundamental mechanisms in an intelligent system are elaborated, with insights on integrating these mechanisms to encode not only variables but also how they are organised, as well as on why deep networks are preferred while extra depth is unnecessary.

(a) Though the proposed \(\rho\)-based TPC learning only considers second order statistics, implying that all the variables are Gaussian and all the SEM equations are linear relations, it is directly applicable to tasks with non-Gaussian variables and nonlinear relations as a sort of approximation. It is likely that the obtained topologies of \(\rho\)-tree and \(\rho\)-DAG may already provide good approximations, which motivates a direction of extension that performs the T-phase as it is but modifies each SEM equation, with its linear relation replaced by a post-linear nonlinearity and its driving noise by a non-Gaussian variable. For example, we may extend Eq. (24) into
$$\begin{aligned}&x=s_{zx}(\rho _{zx}z) +e_{zx}, \quad y=y_z+y_x,\\&y_{z}=s_{yz}(\rho _{yz}z) +e_{yz}, \quad y_x =s_{yx}( \rho _{yx}x) +e_{yx}, \end{aligned}$$
(36)
where each of \(s_{zx}(\cdot), s_{yz}(\cdot), s_{yx}(\cdot)\) is some nonlinear scalar function, e.g. a sigmoid nonlinearity, and at least one of \(e_{zx}, e_{yz}, e_{yx}\) is a non-Gaussian variable. It may be observed that the higher order components of the nonlinear functions get not only each single variable but also multiple variables involved in higher order statistics.
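A generative sketch of the extended SEM of Eq. (36) is given below. The concrete choices are assumptions for illustration only: tanh stands in for the nonlinear \(s(\cdot)\), a uniform variable supplies the required non-Gaussian driving noise, and the \(\rho\) values and noise scales are arbitrary.

```python
import numpy as np

def sample_nonlinear_sem(n, rho_zx=0.7, rho_yz=0.6, rho_yx=0.5, seed=0):
    """Draw n samples from the post-linear-nonlinearity SEM of Eq. (36),
    using tanh as an illustrative s(.) and a uniform (non-Gaussian) e_yx."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)                       # root cause
    e_zx = np.sqrt(1 - rho_zx**2) * rng.standard_normal(n)
    x = np.tanh(rho_zx * z) + e_zx                   # x = s_zx(rho_zx z) + e_zx
    e_yz = 0.3 * rng.standard_normal(n)
    e_yx = rng.uniform(-0.5, 0.5, n)                 # non-Gaussian driving noise
    y_z = np.tanh(rho_yz * z) + e_yz                 # y_z = s_yz(rho_yz z) + e_yz
    y_x = np.tanh(rho_yx * x) + e_yx                 # y_x = s_yx(rho_yx x) + e_yx
    y = y_z + y_x                                    # y = y_z + y_x
    return z, x, y

z, x, y = sample_nonlinear_sem(2000)
```

Data drawn this way could serve to test whether the T-phase, run as-is on the sample correlation coefficients, still recovers the z → x → y topology under the nonlinear, non-Gaussian perturbations.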

(b) Embedding the understandings obtained from causal trees into deep learning may improve performance, especially when there is merely a small size of samples. A causal tree, or more precisely hierarchical SEM equations, may be used as an alternative to deep learning. Learning methods may be developed based on Theorem 4 to conduct a constrained optimisation similar to Eq. (20), in a way similar to identifying the SPRINKLER DAG and BACK-DOOR DAG. This alternative may also lay a road that turns the black box of deep learning into interpretable causal analyses.

(c) We may further jointly consider a Yang model as illustrated in Fig. 6k and a Ying model as illustrated in Fig. 6l. One early example is the bidirectional multilayer neural network proposed under the name Lmser in 1991 (Xu 1991, 1993). Letting the output y in Fig. 6k be directly the input f in Fig. 6l, we are led to the classical auto-association or autoencoder (Bourlard and Kamp 1988). Differently, Lmser extended the autoencoder in four aspects. First, the weight matrix A of each layer in Fig. 6k is directly used as the weight matrix W of the corresponding layer in Fig. 6l, simply by letting \(A=W^T\), i.e. currently so-called weight sharing or domain transferring. Second, the corresponding nodes are also forced to be the same, that is, \(v_i=u_i, i=1, \ldots , m\) as illustrated in Fig. 6k, l. Third, the dynamic effect is approximately considered by re-inputting the reconstruction \(\overline{\mathbf{x}}\) of the input \(\mathbf{x}\) produced by the model in Fig. 6l into the model in Fig. 6k, resulting in a learning rule that consists of a term equivalent to Hinton’s wake-sleep algorithm, plus one correcting term that reduces confusion in the boundary area. Fourth, for labelled data, supervised signals may also be injected into the top-down signals, performing supervised and unsupervised learning jointly (see Section 6 in Ref. Xu (1991)), i.e. currently so-called semi-supervised learning. With the help of these developments, Lmser may be used not only for both reconstruction (i.e. currently so-called data generation) and pattern recognition, but also for concept-driven imaginary recall that visualises thinking, pre-activation-driven top-down attention, associative memory, pattern transform, and interpreting the development of cortical field templates, as well as creative mapping.
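The first aspect above, the \(A=W^T\) weight sharing, can be sketched as a minimal one-hidden-layer autoencoder whose top-down pass reuses the transpose of the bottom-up matrix. This is a toy sketch of the weight-sharing idea only, not of the full Lmser rule (no node sharing, no re-input dynamics); the class name, dimensions, and learning rate are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

class TiedAutoencoder:
    """One-hidden-layer autoencoder with Lmser-style weight sharing:
    the top-down (Ying) matrix is the transpose of the bottom-up (Yang) one."""
    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.1 * rng.standard_normal((n_hid, n_in))

    def encode(self, x):
        return sigmoid(self.W @ x)           # bottom-up pass, y = s(W x)

    def decode(self, y):
        return self.W.T @ y                  # top-down pass reuses W (A = W^T)

    def step(self, x, lr=0.05):
        """One descent step on the reconstruction error |x - W^T s(Wx)|^2;
        W gets gradient terms (up to a constant factor) from BOTH passes."""
        y = self.encode(x)
        xbar = self.decode(y)
        err = xbar - x
        grad_y = (self.W @ err) * y * (1 - y)          # backprop through encoder
        self.W -= lr * (np.outer(grad_y, x) + np.outer(y, err))
        return float((err ** 2).sum())

ae = TiedAutoencoder(n_in=4, n_hid=3, seed=1)
x = np.array([0.5, -0.2, 0.1, 0.3])
first_loss = ae.step(x)
for _ in range(100):
    last_loss = ae.step(x)
```

Because one matrix serves both directions, every update shapes the bottom-up recognition and the top-down reconstruction simultaneously, which is the point of the sharing.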
Moreover, as addressed at the end of the last section of this paper, the nature of preserving the neighbourhood or topological relation by Lmser and the autoencoder also facilitates concept forming and organising in the top encoding domain, i.e. the domain of y and f in Fig. 6k, l, which is superior to those models lacking such preservation, e.g. the variational autoencoder (Schmidhuber 2015), deep generative models (Rezende et al. 2016), and generative adversarial networks (Goodfellow et al. 2014). Furthermore, improvements may be developed as RPCL-Lmser and LVQ-Lmser, which perform vector quantisation in the encoding domain of Lmser by LVQ (Kohonen 1995) for labelled data and by RPCL (Xu et al. 1993) for unlabelled data.
See \(H(p\Vert q)=\int \ln \mathcal{Q}\, d\mathcal{P} =\int p\ln {q}\, d\mu\), \(KL(p\Vert q)=\int \ln {\mathcal{P} \over \mathcal{Q}}\, d\mathcal{P} =\int p\ln {p\over q}\, d\mu\), and \(E(p)=\int \ln {d\mathcal{P} \over d\mu }\, d\mathcal{P}\).
Declarations
Authors’ contributions
All from the sole author. The author read and approved the final manuscript.
Acknowledgements
This work was supported by the ZhiYuan chair professorship startup Grant (WF220103010) from Shanghai Jiao Tong University.
Competing interests
The author declares that there are no competing interests.
Availability of data and materials
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
WF220103010, Shanghai Jiao Tong University.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
Anderson TW, Rubin H (1956) Statistical inference in factor analysis. In: Proceedings of the third Berkeley symposium on mathematical statistics and probability, vol 5, pp 111–150
Bourlard H, Kamp Y (1988) Auto-association by multilayer perceptrons and singular value decomposition. Biol Cybernet 59(4–5):291–294
Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Tavener S, Perez D, Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI Games 4(1):1–43
Cai R, Zhang Z, Hao Z (2013) SADA: a general framework to support robust causation discovery. In: International conference on machine learning, pp 208–216
Clark C, Storkey A (2015) Training deep convolutional neural networks to play Go. In: International conference on machine learning, pp 1766–1774
Ferrucci D, Brown E, Chu-Carroll J, Fan J, Gondek D, Kalyanpur AA, Lally A, Murdock JW, Nyberg E, Prager J (2010) Building Watson: an overview of the DeepQA project. AI Mag 31(3):59–79
Ferrucci D, Levas A, Bagchi S, Gondek D, Mueller ET (2013) Watson: beyond Jeopardy!. Artif Intell 199:93–105
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Hart T, Edwards D (1961) The alpha-beta heuristic
Hart PE, Nilsson NJ, Raphael B (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Trans Syst Sci Cybern 4(2):100–107
Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. In: Advances in neural information processing systems, pp 689–696
Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. In: European conference on machine learning. Springer, Berlin, pp 282–293
Kohonen T (1995) Self-organizing maps. Learning vector quantization. Springer, Berlin, pp 175–189
Land AH, Doig AG (1960) An automatic method of solving discrete programming problems. Econometrica 28:497–520
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529
Pearl J (1984) Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley, Reading
Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Mateo
Pearl J (2010) An introduction to causal inference. Int J Biostat 6(2):1–62
Pearl J, Verma TS (1991) Equivalence and synthesis of causal models. In: Proceedings of sixth conference on uncertainty in artificial intelligence, pp 220–227
Reichenbach H (1956) The direction of time. University of California Press, Berkeley
Rezende DJ, Mohamed S, Danihelka I, Gregor K, Wierstra D (2016) One-shot generalization in deep generative models. arXiv:1603.05106
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66(5):688
Rubin DB, John L (2011) Rubin causal model. In: International encyclopedia of statistical science. Springer, Berlin, pp 1263–1265
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A (2006) A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 7:2003–2030
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354
Spearman C (1904) “General intelligence”, objectively determined and measured. Am J Psychol 15(2):201–292
Spirtes P, Glymour C (1991) An algorithm for fast recovery of sparse causal graphs. Soc Sci Comput Rev 9(1):62–72
Spirtes P, Glymour C (1993) Causation, prediction and search. Lecture notes in statistics, vol 81. Springer, Berlin
Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search. MIT Press, New York
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction, vol 1. MIT Press, Cambridge
Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems, pp 1057–1063
Vizel Y, Weissenbacher G, Malik S (2015) Boolean satisfiability solvers and their applications in model checking. Proc IEEE 103(11):2021–2035
Watkins CJ, Dayan P (1992) Q-learning. Springer, Berlin, pp 279–292
Wright S (1921) Correlation and causation. J Agric Res 20(7):557–585
Wright S (1934) The method of path coefficients. Ann Math Stat 5(3):161–215
Xu L (1986a) A note on a new heuristic search technique algorithm SA. In: Proc of 8th international conference on pattern recognition, vol 2. IEEE Press, Paris, pp 992–994
Xu L (1986b) Investigation on signal reconstruction, search technique, and pattern recognition. PhD dissertation, Tsinghua University
Xu L (1987) Can SA beat the exponential explosion? In: Proc of 2nd international conference on computers and applications. IEEE Press, Beijing, pp 706–713
Xu L (1991) Least MSE reconstruction for self-organization: (i) multilayer neural nets and (ii) further theoretical and experimental studies on one layer nets. In: Proceedings of the international joint conference on neural networks, Singapore, pp 2363–2373
Xu L (1993) Least mean square error reconstruction principle for self-organizing neural-nets. Neural Netw 6(5):627–648
Xu L (1995) Bayesian-Kullback coupled Ying–Yang machines: unified learnings and new results on vector quantization. In: Proceedings of the international conference on neural information processing (ICONIP ’95). Publishing House of Electronics Industry, Beijing, pp 977–988
Xu L (1996) A unified learning scheme: Bayesian–Kullback Ying–Yang machine. In: Advances in neural information processing systems, pp 444–450
Xu L (2010) Bayesian Ying–Yang system, best harmony learning, and five action circling. Front Electr Electron Eng China 5(3):281–328
Xu L (2017) The third wave of artificial intelligence. KeXue 69(3):1–5 (in Chinese)
Xu L, Pearl J (1987) Structuring causal tree models with continuous variables. In: Proceedings of the 3rd annual conference on uncertainty in artificial intelligence, pp 170–179
Xu L, Yan P, Chang T (1987) Algorithm CNneim-A and its mean complexity. In: Proc of 2nd international conference on computers and applications. IEEE Press, Beijing, pp 494–499
Xu L, Yan P, Chang T (1988) Best first strategy for feature selection. In: Proc of 9th international conference on pattern recognition, vol 2. IEEE Press, Rome, pp 706–709
Xu L, Yan P, Chang T (1989) Application of state space heuristic search technique in pattern recognition. Comput Appl Softw 6(1):27–34
Xu L, Krzyzak A, Oja E (1993) Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans Neural Netw 4(4):636–649
Zhang B, Zhang L (1985) A new heuristic search technique: algorithm SA. IEEE Trans Pattern Anal Mach Intell 1:103–107
Zhang K, Hyvärinen A (2009) On the identifiability of the post-nonlinear causal model. In: Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, pp 647–655