Open Access

A spatial-constrained multi-target regression model for human brain activity prediction

Applied Informatics 2016, 3:10

DOI: 10.1186/s40535-016-0026-x

Received: 14 September 2016

Accepted: 9 November 2016

Published: 24 November 2016


Analyzing functional magnetic resonance imaging (fMRI) data from the encoding perspective provides a powerful tool to explore human vision. Using voxel-wise encoding models, previous studies predicted the brain activity evoked by external stimuli successfully. However, these models constructed a regularized regression model for each single voxel separately, which overlooked the intrinsic spatial property of fMRI data. In this work, we proposed a multi-target regression model that predicts the activities of adjacent voxels simultaneously. Different from the previous models, the spatial constraint is considered in our model. The effectiveness of the proposed model is demonstrated by comparing it with two state-of-the-art voxel-wise models on a publicly available dataset. Results indicate that the proposed method can predict voxel responses more accurately than the competing methods.


fMRI; Encoding; Spatial constraint; Multi-target regression


One important goal of neuroscience is to understand the relationship between external visual stimuli and human brain activity. We can gain this understanding by analyzing fMRI data from the mirror perspectives of neural decoding and neural encoding (Naselaris et al. 2011). In neural decoding, we attempt to predict information about the stimuli from measured brain activity. Numerous studies have explored human vision using decoding models (Haxby et al. 2001, 2014; Norman et al. 2006). Conversely, in neural encoding, we model how brain activity varies with the external stimulus and attempt to predict brain activity from stimulus features. Previous studies have indicated that encoding models are more efficient in describing the function of brain areas than decoding models (Naselaris et al. 2011), suggesting the advantages of analyzing fMRI data from the encoding perspective.

In recent years, voxel-based encoding models have been proposed and have attracted much attention (Kay et al. 2008). A typical encoding model can be divided into two parts. The first part finds a feature space that describes the external stimulus. The second part constructs regression models that use the stimulus features to predict the corresponding brain activity. Much effort has been devoted to finding ways to represent the stimulus images. Previous studies used the Gabor wavelet pyramid model (Kay et al. 2008; Vu et al. 2011), a two-layer sparse coding model (Güçlü and van Gerven 2014), and convolutional neural networks (Agrawal et al. 2014) to extract features that represent natural images effectively. However, fewer studies have focused on constructing efficient regression models.

In the regression part of encoding, regularized linear regression models such as lasso (Kay et al. 2008), ridge regression (Güçlü and van Gerven 2014) and graph-constrained elastic net (Kay et al. 2008; Schoenmakers et al. 2013) were most commonly used. Recently, a more advanced sparse nonparametric regression model was proposed (Vu et al. 2011). In spite of the successful prediction of brain activity using these models, one drawback of these voxel-wise models in previous studies is that the response of each voxel is modeled separately; thus, the estimated parameters of different voxels are independent. As a result, these regression models cannot fully employ the correlations between voxels and brain regions. Numerous studies have indicated the benefits of taking the spatial smoothness of fMRI data into account. For example, in the decoding models, when the spatial structure of the data is considered, higher decoding accuracies and more informative and interpretable results can be obtained (Michel et al. 2011; de Brecht and Yamagishi 2012). In functional brain mapping, combining local brain activity often results in more consistent patterns across subjects (Kriegeskorte et al. 2006). All these results suggest that spatial structure of fMRI data should also be considered in encoding models.

In this paper, we focus on the regression part of encoding models, i.e., given the features of the external stimulus images, we try to construct a regression model that predicts internal brain activity efficiently. We exploit the spatial smoothness of fMRI data and construct a multi-target linear regression model (Evgeniou and Pontil 2004; Argyriou et al. 2008) in which the activities of locally adjacent voxels are predicted simultaneously, and a spatial constraint is proposed to restrict the model parameters. To demonstrate the effectiveness of this model, we compare the brain activity prediction performance of the proposed method with that of two state-of-the-art voxel-wise models on a public fMRI dataset.


Data description

The publicly available fMRI data (Kay et al. 2011) were used for model validation; this dataset is widely used for comparing models (Güçlü and van Gerven 2014; Naselaris et al. 2009; Agrawal et al. 2014), and detailed experiment information is available in the original papers (Kay et al. 2008; Naselaris et al. 2009). The fMRI responses were recorded while human subjects viewed grayscale natural images and fixated on a central white square. Two subjects took part in the experiments. They viewed 1750 training images (for encoding model training), each presented twice, and 120 validation images (for encoding model testing), each presented ten times. For each subject, the data were acquired in five scanner sessions on five different days. Each scan session consisted of five training runs, each lasting 11 min, and two validation runs, each lasting 12 min.

Brain activity from the occipital cortex was recorded at a spatial resolution of 2 mm × 2 mm × 2.5 mm and a temporal resolution of 1 s using a 4T INOVA MR scanner (Varian, Inc.). Brain volumes were co-registered to correct for head movements, and the time-series data were deconvolved to account for the delay in the hemodynamic response (Friston et al. 1994). Thus, after preprocessing, each stimulus image corresponds to one brain volume. The voxels in early visual areas were further divided into visual area one (V1), visual area two (V2), and visual area three (V3). We only considered brain activity prediction in these areas in this study.

Problem formulation

In a standard regression framework, the design matrix \(X \in \mathfrak {R}^{N\times M}\) is formed by the \(1\times M\) feature vectors \(x_{s},s=1,2,\dots ,N\) of N samples. The goal is to predict the value of an \(N\times 1\) target vector y, which contains the target values corresponding to the \(x_{s}\). In this work, the design matrix comprises the features of N stimulus images, and the target vector is composed of the intensities of one voxel, with each intensity corresponding to one image feature vector. Thus the problem here is to find a model that accurately predicts voxel activity in response to stimuli.

To evaluate the encoding performance of the prediction models, we calculate the coefficient of determination (\(R^{2}\)) between the observed and predicted voxel responses across the samples in the validation set. The \(R^2\) is defined as
$$\begin{aligned} R^2 = 1 - \frac{\Vert y - \hat{y}\Vert ^{2}}{\Vert y - \bar{y}\Vert ^{2}}, \end{aligned}$$
where \(\Vert \cdot \Vert\) is the Euclidean norm in \(\mathfrak {R}^n\), y is the recorded true response vector, \(\hat{y}\) is the predicted response vector, and \(\bar{y}\) is the mean response vector, i.e., the vector whose entries all equal the mean of y. A higher \(R^2\) means that the model predicts the responses better.
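The definition above can be sketched in a few lines of NumPy. This is only an illustrative helper; the function name `r_squared` and the toy vectors are assumptions and do not come from the original study.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination between observed and predicted responses."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # ||y - y_hat||^2
    ss_tot = np.sum((y - y.mean()) ** 2)   # ||y - y_bar||^2
    return 1.0 - ss_res / ss_tot

# Toy responses (hypothetical values, for illustration only)
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r_squared(y_obs, y_obs)                        # perfect prediction
r2_mean = r_squared(y_obs, np.full(4, y_obs.mean()))        # predicting the mean
r2_partial = r_squared(y_obs, np.array([1.0, 2.0, 3.0, 5.0]))
```

Note that a model that always predicts the mean response obtains \(R^2 = 0\), and a model that does worse than the mean obtains a negative value.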

Voxel-wise models

Most voxel-wise models proposed in previous studies assume that voxel response is a weighted sum of the transformed image features. The regression model for each voxel is constructed separately, i.e., the model of voxel v is
$$\begin{aligned} y_{v} = Xb_{v} + \varepsilon _v, \quad v=1,2,\dots ,V, \end{aligned}$$
where \(X \in \mathfrak {R}^{N \times M}\) is the design matrix that contains the features of the stimulus images, \(b_{v} \in \mathfrak {R}^{M}\) is the parameter vector of the model, M is the number of features of each stimulus image, V is the total number of voxels, and \(\varepsilon _v\) is a zero-mean Gaussian random vector.
A common problem that often occurs in regression is the so-called over-fitting, which may result in models with good performance in training data, but poor generalization performance in testing data. To estimate the model and control over-fitting, the common method is to find parameters that minimize sum-of-squares error function with an additional regularization term added:
$$\begin{aligned} L(b_v) = \Vert y_{v}-Xb_{v}\Vert ^{2} + \lambda _{v} J(b_{v}), \end{aligned}$$
where X is the known design matrix and \(b_v\) is the parameter vector to be estimated. The first term on the right-hand side is the usual sum of squared errors, \(J(b_{v})\) is a penalty function of \(b_{v}\), and \(\lambda _{v}\) is the regularization coefficient that controls the relative importance of the error term and the penalty term. One widely used \(J(b_{v})\) is the sum of squares of the weight vector elements:
$$\begin{aligned} J(b_{v}) = \frac{1}{2}\Vert b_{v}\Vert ^{2} . \end{aligned}$$
This is often termed the ridge regularizer. Minimizing \(L(b_v)\) with the ridge regularizer controls over-fitting and yields a closed-form solution.
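As a minimal sketch of that closed-form solution: setting the gradient of \(L(b_v)\) to zero with \(J(b_{v}) = \frac{1}{2}\Vert b_{v}\Vert ^{2}\) gives \((X^TX + \frac{\lambda _v}{2} I)b_v = X^Ty_v\). The function name `ridge_fit` and the synthetic data below are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * 0.5 * ||b||^2 in closed form."""
    M = X.shape[1]
    # The factor 0.5 comes from the 1/2 in the penalty J(b) defined in the text
    A = X.T @ X + 0.5 * lam * np.eye(M)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
b_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ b_true + 0.1 * rng.standard_normal(50)

b_hat = ridge_fit(X, y, lam=1.0)
# Gradient of the objective at the solution; it should vanish numerically
grad = -2.0 * X.T @ (y - X @ b_hat) + 1.0 * b_hat
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is usually the more stable choice.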
Fig. 1

Mean prediction \(R^2\). The mean prediction \(R^2\) of the voxels that survived the threshold of 0.1, across the two subjects, in brain areas V1, V2, and V3. Error bars show \(\pm 1\) SEM across the voxels
Fig. 2

Distribution of prediction \(R^2\). The distribution of the prediction \(R^2\) values that survived the threshold of 0.1 in brain areas V1, V2, and V3. Results of the two subjects were merged
Fig. 3

Prediction \(R^2\) for each voxel. Points above the diagonals indicate the superiority of the model on the y-axis over the one on the x-axis

Another popular regularizer is the \(\ell _1\) norm of the weight vector:
$$\begin{aligned} J(b_{v}) = \Vert b_{v}\Vert _{1}, \end{aligned}$$
where \(\Vert \cdot \Vert _{1}\) is the \(\ell _1\) norm in \(\mathfrak {R}^n\). This regularizer is often termed the Lasso (Tibshirani 1996). The Lasso regularizer often results in sparse parameter estimates, with many parameters shrunk exactly to zero.
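The sparsity induced by the \(\ell _1\) penalty can be illustrated with a short proximal-gradient (ISTA) sketch. The text does not state which solver was used for the Lasso, so ISTA is an assumption chosen for brevity; the function names and the synthetic data are likewise hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrinks entries toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by iterative soft-thresholding."""
    # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the gradient's Lipschitz constant
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ b)          # gradient of the squared-error term
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Synthetic data with only two informative features (illustration only)
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 10))
b_true = np.zeros(10)
b_true[[0, 3]] = [2.0, -1.5]
y = X @ b_true + 0.1 * rng.standard_normal(80)

b_sparse = lasso_ista(X, y, lam=20.0)
# Above lam_max = 2 * max|X^T y| the all-zero vector is optimal
lam_max = 2.0 * np.max(np.abs(X.T @ y))
b_zero = lasso_ista(X, y, lam=1.1 * lam_max)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

With a sufficiently large \(\lambda _v\) every coefficient is shrunk exactly to zero, which is the behavior that distinguishes the Lasso from the ridge penalty.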

To determine the optimal \(\lambda _{v}\) in the models, we conduct a nested threefold cross-validation and choose, for each voxel v, the model that maximizes the correlation between \(Xb_{v}\) and \(y_{v}\) on hold-out data. As done in a previous study (Schoenmakers et al. 2013), we sample \(\lambda _{v}\) in the range \((10^{-5},10^{5})\) on a log scale. For convenience, we refer to the voxel-wise model with the ridge regularizer as Ridge and to the model with the Lasso regularizer as Lasso.
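The selection step can be sketched as follows for the ridge case. This shows only the inner threefold loop on training data; the grid of 11 log-spaced values, the random fold assignment, and all function names are assumptions, since the text specifies only the range and the log scale.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate for the penalty lam * 0.5 * ||b||^2."""
    return np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(X.shape[1]), X.T @ y)

def select_lambda(X, y, lambdas, n_folds=3, seed=0):
    """Pick the lambda with the highest mean hold-out correlation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    mean_corr = []
    for lam in lambdas:
        corrs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            b = ridge_fit(X[train], y[train], lam)
            # Correlation between predicted and observed hold-out responses
            corrs.append(np.corrcoef(X[test] @ b, y[test])[0, 1])
        mean_corr.append(np.mean(corrs))
    mean_corr = np.array(mean_corr)
    return lambdas[int(np.argmax(mean_corr))], mean_corr

# Synthetic single-voxel example (illustration only)
rng = np.random.default_rng(1)
X = rng.standard_normal((90, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(90)
lambdas = np.logspace(-5, 5, 11)   # assumed grid resolution
best_lam, scores = select_lambda(X, y, lambdas)
```

In the full procedure this loop would be run separately for every voxel, using only the training sessions.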

Proposed model

The voxel-wise models proposed in previous studies construct a regression model for each voxel separately and ignore the dependencies between voxels. However, fMRI data are spatially smooth, and voxels from the same local brain area often exhibit similar properties. To improve the performance of brain activity prediction, we exploit this spatial smoothness and construct a multi-target regression model.

For each voxel v, we construct the response matrix
$$\begin{aligned} Y_{v} = [y_{v}, y_{v1}, \dots , y_{v(q-1)}], \end{aligned}$$
where q is the number of voxels in the neighborhood of voxel v (v itself included), and \(y_{vj}, j = 1,2,\dots ,q-1\) are the response vectors of voxel v’s neighbors. The neighborhood of v is defined as the set of voxels contained in a sphere centered on voxel v. In this work, we set the radius of the sphere to 3 voxel sizes, which results in a neighborhood of 33 voxels, i.e., q equals 33. We minimize the total error function for voxel v:
$$\begin{aligned} L(B_{v})=Tr[(XB_{v}-Y_{v})^{T}(XB_{v}-Y_{v}) + \lambda _{1}(B_{v}R_{v})^{T}(B_{v}R_{v}) + \lambda _{2}B_{v}^{T}B_{v}], \end{aligned}$$
where X is the same as in the voxel-wise models, \(B_{v}\in \mathfrak {R}^{M\times q}\) is the parameter matrix to be determined, Tr[A] denotes the trace of a matrix A, and \(R_{v}\) (written R when the voxel index is dropped) is a \(q\times q\) matrix whose (i, j) element is
$$R_{i,j} = \begin{cases} q-1 & \text{if } i = j, \\ -1 & \text{otherwise.} \end{cases}$$
The first term in the trace operator is the sum-of-squares error function, which ensures that the predicted response matrix \(\hat{Y}_v\) is similar to the true response matrix. The second term is a regularizer on the parameter matrix \(B_{v}\) that tends to make the estimated parameters of voxel v similar to those of its neighbors. Here, we hypothesize that a voxel responds to external stimuli in a similar way to the voxels located around it; thus, these voxels may have similar parameters in the regression model. The third term is a regularizer similar to the ridge penalty that controls over-fitting.
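To see why the second term pulls neighboring parameters together, note that with the R defined above, column j of \(B_vR\) equals \(q(b_j - \bar{b})\), where \(b_j\) is column j of \(B_v\) and \(\bar{b}\) is the mean of its columns. The penalty is therefore \(q^2\) times the summed squared deviation of each voxel's parameters from the neighborhood mean. The small check below uses toy sizes (q = 5, M = 3) that are assumptions made for illustration.

```python
import numpy as np

q, M = 5, 3                                   # toy sizes, not the paper's q = 33, M = 500
R = q * np.eye(q) - np.ones((q, q))           # diagonal q - 1, off-diagonal -1

rng = np.random.default_rng(0)
B = rng.standard_normal((M, q))               # one parameter column per voxel
B_mean = B.mean(axis=1, keepdims=True)        # neighborhood-mean parameter vector

# Spatial penalty as written in the total error function
penalty_trace = np.trace((B @ R).T @ (B @ R))
# Equivalent form: q^2 times the squared deviation from the neighborhood mean
penalty_dev = q ** 2 * np.sum((B - B_mean) ** 2)

# If every voxel shares the same parameters, the penalty vanishes
B_same = np.tile(B[:, :1], (1, q))
penalty_same = np.trace((B_same @ R).T @ (B_same @ R))
```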

Model estimation

To estimate the model, we consider the gradient of the total error function:
$$\begin{aligned} \nabla L(B_{v}) = 2\times [X^TXB_v - X^TY_v + B_v(\lambda _1RR^T+\lambda _2I)], \end{aligned}$$
where I is the \(q\times q\) identity matrix. The two regularization coefficients (\(\lambda _1\) and \(\lambda _2\)) are determined using nested cross-validation. For convenience, the gradient is expressed as
$$\begin{aligned} \nabla L(B_{v}) = 2\times [X^TXB_v - X^TY_v + B_v\hat{R}], \end{aligned}$$
where \(\hat{R} = \lambda _1RR^T+\lambda _2I\). Setting this gradient to zero gives
$$\begin{aligned} X^TXB_v + B_v\hat{R} = X^TY_v . \end{aligned}$$
Note that this equation differs from the usual normal equation of penalized least-squares regression: the unknown \(B_v\) is multiplied from the left by \(X^TX\) in the first term but from the right by \(\hat{R}\) in the second term, so the equation cannot be written directly in the form Ax = b.

In fact, this is a Sylvester equation, with \(B_v\) the unknown parameter matrix to be determined. The equation can be solved efficiently (Bartels and Stewart 1972). As in the estimation of the Ridge and Lasso models, we used a nested threefold cross-validation to determine \(\lambda _1, \lambda _2\) in the range \((10^{-5},10^{5})\) on a log scale.
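A minimal sketch of this estimation step, assuming SciPy is available: `scipy.linalg.solve_sylvester(a, b, q)` solves \(AX + XB = Q\), which matches the equation above with \(A = X^TX\), \(B = \hat{R}\), and \(Q = X^TY_v\). The function name `fit_multitarget`, the toy sizes, and the regularization values are assumptions; this is not the authors' implementation.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def fit_multitarget(X, Y, lam1, lam2):
    """Solve X^T X B + B R_hat = X^T Y for the parameter matrix B."""
    q = Y.shape[1]
    R = q * np.eye(q) - np.ones((q, q))            # diagonal q - 1, off-diagonal -1
    R_hat = lam1 * R @ R.T + lam2 * np.eye(q)
    return solve_sylvester(X.T @ X, R_hat, X.T @ Y)

# Toy problem (hypothetical sizes and data, for illustration only)
rng = np.random.default_rng(0)
N, M, q = 40, 6, 4
X = rng.standard_normal((N, M))
Y = rng.standard_normal((N, q))
lam1, lam2 = 0.1, 1.0

B_hat = fit_multitarget(X, Y, lam1, lam2)

# The gradient of the total error function should vanish at the solution
R = q * np.eye(q) - np.ones((q, q))
grad = 2.0 * (X.T @ X @ B_hat - X.T @ Y
              + B_hat @ (lam1 * R @ R.T + lam2 * np.eye(q)))
```

Because \(X^TX\) is positive semidefinite and \(\hat{R}\) is positive definite whenever \(\lambda _2 > 0\), the two matrices share no eigenvalues of opposite sign and the solution is unique.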


Similar to the widely used searchlight strategy (Kriegeskorte et al. 2006) in brain mapping, we move a spherical searchlight through the brain volume. For each center voxel v, we obtain the estimated parameter matrix \(\hat{B}_{v}\) by solving the Sylvester equation above (Eq. 11). The predicted response matrix \(\hat{Y}_v\) is then calculated as
$$\begin{aligned} \hat{Y}_{v} = X\hat{B}_{v} . \end{aligned}$$
The predicted activities of voxel v and its neighbors are thus the columns of \(\hat{Y}_v\). Note that with this strategy the activity of voxel v is predicted by several models: it is the center voxel of one model and a neighbor in several others. To obtain a smooth prediction, we set the response of voxel v to the mean of these predictions.
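The averaging step can be sketched as follows. The data layout (a list of voxel-index arrays, one per sphere, paired with the corresponding predicted response matrices) and the function name are assumptions made for illustration.

```python
import numpy as np

def average_searchlight_predictions(neighborhoods, predictions, n_voxels):
    """Average each voxel's predictions over all spheres that contain it.

    neighborhoods[c] : indices of the voxels in the sphere centered on voxel c
    predictions[c]   : (N, len(neighborhoods[c])) predicted responses for that sphere
    """
    n_samples = predictions[0].shape[0]
    total = np.zeros((n_samples, n_voxels))
    count = np.zeros(n_voxels)
    for idx, Y_hat in zip(neighborhoods, predictions):
        idx = np.asarray(idx)
        total[:, idx] += Y_hat      # indices within one sphere are unique
        count[idx] += 1
    return total / np.maximum(count, 1)   # avoid division by zero for uncovered voxels

# Three voxels on a line; each sphere holds the center and its direct neighbors
neighborhoods = [[0, 1], [1, 0, 2], [2, 1]]
predictions = [np.full((2, 2), 1.0), np.full((2, 3), 2.0), np.full((2, 2), 4.0)]
Y_smooth = average_searchlight_predictions(neighborhoods, predictions, n_voxels=3)
```

In this toy case the middle voxel appears in all three spheres, so its final prediction is the mean of three model outputs, while the end voxels average two.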

Implementation details

In this work, we used the Gabor wavelet pyramid model (Jones and Palmer 1987) with six frequencies and eight possible orientations to extract stimulus features. To address the residual nonlinearity in the model, we applied an additional nonlinear transformation
$$\begin{aligned} f(x) = \log (1+\sqrt{x}) \end{aligned}$$
to each stimulus feature, as done in previous studies (Kay et al. 2008). This resulted in a \(1\times 10,920\) feature vector for each stimulus image. Optimizing the regression models is time consuming when X is this large, so for computational reasons we first reduced the features by principal component analysis (PCA) (Bishop 2006), a common dimension-reduction strategy in machine learning. Only the 500 largest components were retained; these components capture over \(80\%\) of the variance, so the transformed feature vector is \(1\times 500\) for each stimulus.
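The transformation and the PCA reduction can be sketched with NumPy alone. The sketch assumes non-negative Gabor outputs, so that the square root is defined, and uses toy sizes in place of the 10,920 features and 500 components; the function name `compress_features` is hypothetical.

```python
import numpy as np

def compress_features(F, n_components):
    """Apply f(x) = log(1 + sqrt(x)) and project onto the leading principal components.

    F : (N, D) array of non-negative stimulus features (assumption)
    Returns the (N, n_components) reduced features and the variance fraction retained.
    """
    Z = np.log1p(np.sqrt(F))                    # compressive nonlinearity
    Zc = Z - Z.mean(axis=0)                     # center each feature before PCA
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    X = Zc @ Vt[:n_components].T                # scores on the leading components
    return X, explained[n_components - 1]

# Toy stand-in for the Gabor features (random values, for illustration only)
rng = np.random.default_rng(0)
F = rng.random((60, 200))
X, frac = compress_features(F, n_components=10)
gram = X.T @ X
```

The projection learned on the training images would then be applied unchanged to the validation images, so that both sets share the same 500-dimensional feature space.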

Results and discussion

Here, we present results obtained by different models on the dataset. We compare our proposed multi-target model with two state-of-the-art voxel-wise models (Ridge and Lasso); these two models were widely employed in fMRI encoding models (Agrawal et al. 2014; Kay et al. 2008; Schoenmakers et al. 2013; Güçlü and van Gerven 2014). Only data from training sessions were used to construct models and select regularization coefficients \(\lambda _1, \lambda _2\); data from validation sessions were used to validate the model performances.
Table 1

Percentage of voxels that survived an \(R^2\) threshold of 0.1 for different models

Subject 1: V1 (%), V2 (%), V3 (%)

Subject 2: V1 (%), V2 (%), V3 (%)

Table 1 lists the percentage of voxels that survived an \(R^2\) threshold of 0.1 for the different models in brain areas V1, V2, and V3; the activity of these voxels is considered well predicted. For all models, the performance in V1 is better than in V2 and V3. For subject 1, the percentage of surviving voxels decreased systematically from 29% in V1 to 10% in V3 with the proposed method, whereas for the voxel-wise models (Ridge and Lasso) it decreased from about 25% in V1 to 6% in V3. A similar trend is observed for subject 2, although the performance is not as good as for subject 1.

Figure 1 compares the mean \(R^2\) of the different models across the surviving voxels. The mean \(R^2\) of the proposed method is about 0.26 in V1 and decreases systematically to 0.19 in V3. In contrast, the mean \(R^2\) values of the voxel-wise Ridge and Lasso models are similar to each other, decreasing systematically from 0.24 in V1 to 0.17 in V3.

Figures 2 and 3 compare the performance of the different models across voxels and brain areas. Figure 2 shows the distribution of the prediction \(R^2\) for the surviving voxels. For most values of \(R^2\), the proposed method yields more voxels than the Ridge and Lasso models. The prediction \(R^2\) values for all voxels are displayed in Fig. 3, where points above the diagonals indicate the superiority of the model on the y-axis over the one on the x-axis. Most voxels in each brain area are better predicted by the proposed model than by the traditional voxel-wise models.


In this paper, we proposed a multi-target regression model to predict brain activity when subjects view grayscale images. Based on the hypothesis that the properties of a voxel are similar to those of its local neighbors, we constructed a spatial constraint on the model parameters. The parameters can be estimated efficiently. We showed that the proposed method achieves better prediction performance on a public fMRI dataset than the voxel-wise Ridge and Lasso models. The prediction \(R^2\) of the proposed model was higher than that of the voxel-wise models, and more voxels survived an \(R^2\) threshold of 0.1. These results suggest the benefit of considering the spatial structure of fMRI data in encoding models.



fMRI: functional magnetic resonance imaging

\(R^2\): the coefficient of determination

V1: visual area one

V2: visual area two

V3: visual area three

PCA: principal component analysis


Authors' contributions

ZW and YL participated in the design of the study, performed the data analysis, and drafted the manuscript. Both authors read and approved the final manuscript.


This work was supported by the National Key Basic Research Program of China (973 Program) under the Grant 2015CB351703, the National Natural Science Foundation of China under the Grants 61633010, 91420302, and 61573150, and Guangdong Natural Science Foundation under the Grant 2014A030312005.

Competing interests

The authors declare that they have no competing interests.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

Center for Brain Computer Interfaces and Brain Information Processing, South China University of Technology
Guangzhou Key Laboratory of Brain Computer Interaction and Applications, South China University of Technology


  1. Agrawal P, Stansbury D, Malik J, Gallant JL (2014) Pixels to voxels: modeling visual representation in the human brain. arXiv preprint arXiv:1407.5104
  2. Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Mach Learn 73(3):243–272
  3. Bartels RH, Stewart G (1972) Solution of the matrix equation AX + XB = C [F4]. Commun ACM 15(9):820–826
  4. Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
  5. de Brecht M, Yamagishi N (2012) Combining sparseness and smoothness improves classification accuracy and interpretability. Neuroimage 60(2):1550–1561
  6. Evgeniou T, Pontil M (2004) Regularized multi-task learning. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 109–117
  7. Friston KJ, Holmes AP, Worsley KJ, Poline J, Frith CD, Frackowiak RS (1994) Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp 2(4):189–210
  8. Güçlü U, van Gerven MA (2014) Unsupervised feature learning improves prediction of human brain activity in response to natural images. PLoS Comput Biol 10:e1003724
  9. Haxby JV, Connolly AC, Guntupalli JS (2014) Decoding neural representational spaces using multivariate pattern analysis. Annu Rev Neurosci 37:435–456
  10. Haxby JV, Gobbini MI, Furey ML, Ishai A, Schouten JL, Pietrini P (2001) Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293(5539):2425–2430
  11. Jones JP, Palmer LA (1987) An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J Neurophysiol 58(6):1233–1258
  12. Kay KN, Naselaris T, Prenger RJ, Gallant JL (2008) Identifying natural images from human brain activity. Nature 452(7185):352–355
  13. Kay K, Naselaris T, Gallant J (2011) fMRI of human visual areas in response to natural images. Accessed 18 June 2015
  14. Kriegeskorte N, Goebel R, Bandettini P (2006) Information-based functional brain mapping. Proc Natl Acad Sci USA 103(10):3863–3868
  15. Michel V, Gramfort A, Varoquaux G, Eger E, Thirion B (2011) Total variation regularization for fMRI-based prediction of behavior. IEEE Trans Med Imaging 30(7):1328–1340
  16. Naselaris T, Kay KN, Nishimoto S, Gallant JL (2011) Encoding and decoding in fMRI. Neuroimage 56(2):400–410
  17. Naselaris T, Prenger RJ, Kay KN, Oliver M, Gallant JL (2009) Bayesian reconstruction of natural images from human brain activity. Neuron 63(6):902–915
  18. Norman KA, Polyn SM, Detre GJ, Haxby JV (2006) Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn Sci 10(9):424–430
  19. Schoenmakers S, Barth M, Heskes T, van Gerven M (2013) Linear reconstruction of perceived images from human brain activity. Neuroimage 83:951–961
  20. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
  21. Vu VQ, Ravikumar P, Naselaris T, Kay KN, Gallant JL, Yu B (2011) Encoding and decoding V1 fMRI responses to natural images with sparse nonparametric models. Ann Appl Stat 5(2B):1159


© The Author(s) 2016