Deep bidirectional intelligence: AlphaZero, deep IA-search, deep IA-infer, and TPC causal learning

Applied Informatics

Table 1 Deep IA-search family

(A)	Deep Scout A* (DSA)	Deep CNneim-A (DCA)	Deep Bi-Scout A* (DBA)	V-AlphaGoZero
Deep learning	\([v_{h} ,g](\varvec{s})~=~\varvec{f}_{\varvec{\theta }}~\varvec{(s)}\)			\([v_{h} ,\varvec{p}](\varvec{s})~=~\varvec{f}_{\varvec{\theta }}~\varvec{(s)}\)
Selection step	Get expanding node \(\varvec{S}_{\mathbf{E}}\) by A*	In each child tree, get expanding node by A*	Get expanding node by A*; subtree scout for \({\mu }\), by A*	Get expanding node by DBFS
Valuation	Equation (1) with \(f = g + h\)	Equation (1) with \(f = g + h\)	Equation (1) with \({\varvec{\mu }}\) replacing \(f\)	Q and p by Eq. (6)
Moving policy	Frequency \({\pi }_{i}\)	Mean \({\mu }_{i}\)	Frequency \({\pi }_{i}\)	Frequency \({\pi }_{i}\)

(B)	DSA-E	DCA-E	DBA-E	AlphaGoZero-E
Deep learning	\([v_{h}, g, \varvec{p}]_{\varvec{\theta }}(\varvec{s})~=~\varvec{f}_{\varvec{\theta }}~\varvec{(s)}\)
Selection step	Get expanding node \(\varvec{S}_{\mathbf{E}}\) by DBFS-n-A* selection
Bayesian valuation	Choose the action either stochastically according to \(\varvec{q}\) or by \(\max_{a} q_{a}\), based on the posterior \(\varvec{q}=[q_{a}],~q_{a}=p_{a}e_{a}/\varvec{p}^{T}\varvec{E},~\varvec{E}=[e_{a}]\)
	\(TypeQ: e_{a} = \rho (Q(s,s_{a}))~\mathrm{or}~e_{a} = \rho (r + v_{h}(s_{a})), ~\mathrm{where}~s_{a} = a(s)\)		\(TypeF: e_{a} = \rho ({\mu }_{a}), {\varvec{\mu }} = [{\mu }_{a}]\)	\(TypeQ: e_{a} = Q(s,a)\)
	\(TypeF: e_{a} = \rho (f(s_{a})),~\mathrm{where}~s_{a} = a(s)\)			\(TypeF: e_{a} = \rho (f(s_{a}))\)
	If \(q_{a}\) is larger than a pre-specified threshold, put \(s_{a}\) into OPEN; otherwise put it into WAIT. When OPEN becomes empty, move some nodes from WAIT to OPEN. Note: \(\rho (r)\) is monotonically increasing for reward maximisation and monotonically decreasing for cost minimisation
OPEN revision	Revise \(f\) values in OPEN by back-forward propagation after each expansion
Others	Same as DSA	Same as DCA	Same as DBA	Same as AlphaGoZero
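
The Bayesian valuation row of part (B) can be illustrated with a minimal Python sketch. It computes the posterior \(q_{a}=p_{a}e_{a}/\varvec{p}^{T}\varvec{E}\), picks an action either stochastically or greedily, and sorts children into OPEN and WAIT by a threshold. Several details are assumptions made only for illustration and are not specified in the table: the exponential form of \(\rho\), the `temperature` parameter, the function names, and the rule that the highest-posterior nodes are the ones moved from WAIT back to OPEN.

```python
import math
import random


def rho_reward(x, temperature=1.0):
    """Assumed monotonically increasing rho, for reward maximisation."""
    return math.exp(x / temperature)


def rho_cost(x, temperature=1.0):
    """Assumed monotonically decreasing rho, for cost minimisation."""
    return math.exp(-x / temperature)


def posterior(priors, evidences):
    """Return q with q_a = p_a * e_a / sum_b(p_b * e_b)."""
    z = sum(p * e for p, e in zip(priors, evidences))
    if z <= 0:
        raise ValueError("normaliser p^T E must be positive")
    return [p * e / z for p, e in zip(priors, evidences)]


def choose_action(q, stochastic=False, rng=random):
    """Sample an action index from q, or take the index of max_a q_a."""
    if stochastic:
        return rng.choices(range(len(q)), weights=q, k=1)[0]
    return max(range(len(q)), key=lambda a: q[a])


def split_open_wait(children, q, threshold):
    """Put s_a into OPEN if q_a exceeds the threshold, otherwise into WAIT."""
    open_list, wait_list = [], []
    for state, q_a in zip(children, q):
        (open_list if q_a > threshold else wait_list).append((state, q_a))
    return open_list, wait_list


def refill_open(open_list, wait_list, k=1):
    """If OPEN is empty, move the k highest-posterior nodes from WAIT.

    Which nodes to move is not stated in the table; top-k by q_a is an
    assumption of this sketch.
    """
    if open_list:
        return
    wait_list.sort(key=lambda item: item[1], reverse=True)
    open_list.extend(wait_list[:k])
    del wait_list[:k]


if __name__ == "__main__":
    priors = [0.5, 0.3, 0.2]   # p_a from the policy head
    f_values = [3.0, 1.0, 2.0]  # TypeF evidence source: f(s_a), treated as costs
    q = posterior(priors, [rho_cost(f) for f in f_values])
    open_list, wait_list = split_open_wait(["s0", "s1", "s2"], q, threshold=0.2)
    print(q, open_list, wait_list, choose_action(q))
```

In this toy case the child with the lowest cost receives the largest posterior even though its prior is not the largest, which is the intended effect of combining \(p_{a}\) with the evidence \(e_{a}\). Swapping `rho_cost` for `rho_reward` gives the reward-maximisation variant.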