### Data description

Publicly available fMRI data (Kay et al. 2011) were used for model validation. This dataset is widely used for comparing models (Güçlü and van Gerven 2014; Naselaris et al. 2009; Agrawal et al. 2014), and detailed experimental information is available in the original papers (Kay et al. 2008; Naselaris et al. 2009). The fMRI responses were recorded while human subjects viewed grayscale natural images and fixated on a central white square. Two subjects took part in the experiments. They viewed 1750 training images (for encoding model training), each presented twice, and 120 validation images (for encoding model testing), each presented ten times. For each subject, the data were acquired in five scanner sessions on five different days. Each scan session consisted of five training runs, each lasting 11 min, and two validation runs, each lasting 12 min.

Brain activity from the occipital cortex was recorded at a spatial resolution of 2 mm × 2 mm × 2.5 mm and a temporal resolution of 1 s using a 4T INOVA MR scanner (Varian, Inc.). Brain volumes were co-registered to correct for head movements, and the time-series data were deconvolved to account for the delay in the hemodynamic response (Friston et al. 1994). Thus, after preprocessing, each stimulus image corresponds to one brain volume. The voxels in early visual areas were further divided into visual area one (V1), visual area two (V2), and visual area three (V3). In this study, we considered brain activity prediction only in these areas.

### Problem formulation

In a standard regression framework, the design matrix \(X \in \mathfrak {R}^{N\times M}\) is formed by stacking the \(1\times M\) feature vectors \(x_{s},s=1,2,\dots ,N\) of *N* samples. The goal is to predict an \(N\times 1\) target vector *y*, whose entries are the target values corresponding to the \(x_{s}\). In this work, the design matrix comprises the features of *N* stimulus images, and the target vector is composed of the intensities of a single voxel, with each intensity corresponding to one image feature vector. Thus, the problem is to find a model that accurately predicts voxel activity in response to stimuli.

To evaluate the encoding performance of the prediction models, we calculate the coefficient of determination (\(R^{2}\)) between the observed and predicted voxel responses across the samples in the validation set. The \(R^2\) is defined as

$$\begin{aligned} R^2 = 1 - \frac{\Vert y - \hat{y}\Vert ^{2}}{\Vert y - \bar{y}\Vert ^{2}}, \end{aligned}$$

(1)

where \(\Vert \cdot \Vert\) is the Euclidean norm in \(\mathfrak {R}^n\), *y* is the recorded response vector, \(\hat{y}\) is the predicted response vector, and \(\bar{y}\) is the mean response vector, i.e., the vector whose entries all equal the mean of *y*. A higher \(R^2\) indicates better prediction performance.
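As a concrete reading of Eq. (1), the following minimal NumPy sketch computes \(R^2\) for a single voxel. The function name `r_squared` and the toy vectors are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination as written in Eq. (1)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residual = np.sum((y - y_hat) ** 2)   # ||y - y_hat||^2
    total = np.sum((y - y.mean()) ** 2)   # ||y - y_bar||^2
    return 1.0 - residual / total

# Toy responses (hypothetical values, for illustration only).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y_true, y_true))                       # perfect prediction
print(r_squared(y_true, np.full(4, y_true.mean())))    # predicting the mean
print(r_squared(y_true, np.array([1.0, 2.0, 3.0, 5.0])))
```

Note that \(R^2\) computed this way can be negative when the prediction is worse than simply predicting the mean response.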

### Voxel-wise models

Most voxel-wise models proposed in previous studies assume that the voxel response is a weighted sum of the transformed image features. The regression model for each voxel is constructed separately, i.e., the model of voxel *v* is

$$\begin{aligned} y_{v} = Xb_{v} + \varepsilon _v, \quad v=1,2,\dots ,V, \end{aligned}$$

(2)

where \(X \in \mathfrak {R}^{N \times M}\) is the design matrix that contains the features of the stimulus images, \(b_{v} \in \mathfrak {R}^{M}\) is the parameter vector of the model, *M* is the number of features of each stimulus image, *V* is the total number of voxels, and \(\varepsilon _v\) is a zero-mean Gaussian random vector.

A common problem in regression is over-fitting, which may yield models that perform well on the training data but generalize poorly to test data. To estimate the model while controlling over-fitting, a common approach is to find the parameters that minimize the sum-of-squares error function with an added regularization term:

$$\begin{aligned} L(b_v) = \Vert y_{v}-Xb_{v}\Vert ^{2} + \lambda _{v} J(b_{v}), \end{aligned}$$

(3)

where *X* is the known design matrix and \(b_v\) is the parameter vector to be estimated. The first term on the right-hand side is the usual sum of squared errors, \(J(b_{v})\) is a penalty function of \(b_{v}\), and \(\lambda _{v}\) is the regularization coefficient that controls the relative importance of the error term and the penalty term. One widely used choice of \(J(b_{v})\) is the sum of squares of the weight vector elements:

$$\begin{aligned} J(b_{v}) = \frac{1}{2}\Vert b_{v}\Vert ^{2} . \end{aligned}$$

(4)

This is often termed the ridge regularizer. Minimizing \(L(b_v)\) with the ridge regularizer controls over-fitting and yields a closed-form solution.
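For reference, setting the gradient of Eq. (3) with the penalty of Eq. (4) to zero gives \((X^{T}X + \tfrac{\lambda _v}{2}I)\,b_v = X^{T}y_v\); the factor 1/2 comes from the scaling in Eq. (4). The sketch below solves this linear system with NumPy. The function name `ridge_fit` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - X b||^2 + lam * 0.5 * ||b||^2."""
    n_features = X.shape[1]
    # Normal equations: (X^T X + (lam / 2) I) b = X^T y
    return np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(n_features), X.T @ y)

# Synthetic data (hypothetical sizes, not the fMRI dataset).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
b_true = rng.standard_normal(5)
y = X @ b_true + 0.1 * rng.standard_normal(50)

lam = 1.0
b_hat = ridge_fit(X, y, lam)
# The gradient of the objective should vanish at the solution.
grad = 2.0 * X.T @ (X @ b_hat - y) + lam * b_hat
print(np.max(np.abs(grad)))
```

Solving the linear system directly avoids forming an explicit matrix inverse, which is usually the more stable choice.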

Another popular regularizer is the \(\ell _1\) norm of the weight vector:

$$\begin{aligned} J(b_{v}) = \Vert b_{v}\Vert _{1}, \end{aligned}$$

(5)

where \(\Vert \cdot \Vert _{1}\) is the \(\ell _1\) norm in \(\mathfrak {R}^n\). This regularizer is often termed the Lasso (Tibshirani 1996). The Lasso regularizer often results in sparse parameter estimates, with many parameters shrunk exactly to zero.
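The sparsity induced by the \(\ell _1\) penalty can be illustrated with a small coordinate-descent solver for the objective in Eq. (3) with the penalty of Eq. (5). This is only a sketch: the text does not state which Lasso solver was used, and the function names, the fixed number of sweeps, and the synthetic data are all assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(X, y, lam, n_sweeps=200):
    """Coordinate descent for ||y - X b||^2 + lam * ||b||_1."""
    n_features = X.shape[1]
    b = np.zeros(n_features)
    col_sq = np.sum(X ** 2, axis=0)        # x_j^T x_j for each column
    residual = y - X @ b
    for _ in range(n_sweeps):
        for j in range(n_features):
            residual += X[:, j] * b[j]     # remove feature j's contribution
            rho = X[:, j] @ residual       # correlation with partial residual
            b[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * b[j]     # add the updated contribution back
    return b

# Synthetic data with a sparse ground truth (hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
b_true = np.zeros(10)
b_true[:3] = [3.0, -2.0, 1.5]
y = X @ b_true + 0.1 * rng.standard_normal(100)

b_hat = lasso_fit(X, y, lam=20.0)
print(np.count_nonzero(b_hat), "non-zero coefficients out of", b_hat.size)

# A sufficiently large lam shrinks every coefficient to zero.
lam_max = 2.0 * np.max(np.abs(X.T @ y)) + 1.0
b_zero = lasso_fit(X, y, lam=lam_max)
```

The soft-thresholding step is what sets coefficients exactly to zero, in contrast to the ridge penalty, which only shrinks them.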

To determine the optimal \(\lambda _{v}\) in the models, we conduct a nested threefold cross-validation and, for each voxel *v*, choose the model that maximizes the correlation between \(Xb_{v}\) and \(y_{v}\) on held-out data. As in a previous study (Schoenmakers et al. 2013), we sample \(\lambda _{v}\) in the range \((10^{-5},10^{5})\) on a log scale. For convenience, we refer to the voxel-wise model with the ridge regularizer as Ridge and to the model with the Lasso regularizer as Lasso.
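The selection of \(\lambda _{v}\) can be sketched as follows for the Ridge model. The sketch covers only the inner threefold loop (the outer loop of the nested scheme is omitted), and the grid of 11 log-spaced values, the random fold assignment, and the function names are assumptions, since the text specifies only the range and the log scale.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution for ||y - X b||^2 + lam * 0.5 * ||b||^2."""
    return np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(X.shape[1]), X.T @ y)

def select_lambda(X, y, lambdas, n_folds=3, seed=0):
    """Pick the lambda with the highest mean held-out correlation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    scores = []
    for lam in lambdas:
        fold_scores = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            b = ridge_fit(X[train], y[train], lam)
            # Pearson correlation between prediction and held-out response
            fold_scores.append(np.corrcoef(X[test] @ b, y[test])[0, 1])
        scores.append(np.mean(fold_scores))
    scores = np.array(scores)
    return lambdas[int(np.nanargmax(scores))], scores

# Synthetic single-voxel data (hypothetical sizes).
rng = np.random.default_rng(1)
X = rng.standard_normal((90, 20))
y = X @ rng.standard_normal(20) + rng.standard_normal(90)

lambdas = np.logspace(-5, 5, 11)   # assumed grid: 10^-5, 10^-4, ..., 10^5
best_lam, scores = select_lambda(X, y, lambdas)
print(best_lam)
```

In the voxel-wise setting this selection would be repeated independently for every voxel, so each voxel may end up with a different \(\lambda _{v}\).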

### Proposed model

The voxel-wise models proposed in previous studies construct a regression model for each voxel separately and ignore the dependencies between voxels. However, fMRI data are often spatially smooth, and voxels from the same local brain area often exhibit similar properties. To improve the performance of brain activity prediction, we exploit the spatial smoothness of fMRI data and construct a multi-target regression model.

For each voxel *v*, we construct the response matrix

$$\begin{aligned} Y_{v} = [y_{v}, y_{v1}, \dots , y_{v(q-1)}], \end{aligned}$$

(6)

where *q* is the total number of voxels in the neighborhood of voxel *v* (the voxel itself plus its \(q-1\) neighbors), and \(y_{vj}, j = 1,2,\dots ,q-1\) are the response vectors of voxel *v*’s neighbors. The neighbors of *v* are defined as the voxels contained in a sphere centered on voxel *v*. In this work, we set the radius of the sphere to 3 voxel size, which results in a neighborhood of 33 voxels, i.e., *q* equals 33. We minimize the total error function for voxel *v*:

$$\begin{aligned} L(B_{v})=Tr[(XB_{v}-Y_{v})^{T}(XB_{v}-Y_{v}) + \lambda _{1}(B_{v}R_{v})^{T}(B_{v}R_{v}) + \lambda _{2}B_{v}^{T}B_{v}], \end{aligned}$$

(7)

where *X* is the same as in the voxel-wise models, \(B_{v}\in \mathfrak {R}^{M\times q}\) is the parameter matrix to be determined, *Tr*[·] denotes the trace of a matrix, and \(R_{v}\) is a \(q\times q\) matrix whose (*i*, *j*) element is

$$\begin{aligned} R_{i,j} = \left\{ \begin{array}{ll} q-1 & \text {if } i = j, \\ -1 & \text {otherwise}. \end{array} \right. \end{aligned}$$

(8)

The first term inside the trace operator is the sum-of-squares error function, which ensures that the predicted response matrix \(\hat{Y}_v\) is close to the true response matrix. The second term is a regularizer on the parameter matrix \(B_{v}\) that tends to make the estimated parameters of voxel *v* similar to those of its neighbors. Here, we hypothesize that a voxel responds to external stimuli in a similar way to the voxels located around it; thus, these voxels may have similar parameters in the regression model. The third term is a regularizer similar to the ridge penalty that controls over-fitting.
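The effect of the second term can be made explicit with a short derivation from Eq. (8). The symbols \(b_j\) (the *j*-th column of \(B_v\)), \(\bar{b}\) (the mean of these columns) and \(\mathbf {1}\) (the all-ones vector of length *q*) are notation introduced here for illustration only:

$$\begin{aligned} R_{v}&= qI - \mathbf {1}\mathbf {1}^{T}, \\ B_{v}R_{v}&= q\,[\,b_{1}-\bar{b},\ \dots ,\ b_{q}-\bar{b}\,], \qquad \bar{b} = \frac{1}{q}\sum _{j=1}^{q} b_{j}, \\ Tr[(B_{v}R_{v})^{T}(B_{v}R_{v})]&= q^{2}\sum _{j=1}^{q}\Vert b_{j}-\bar{b}\Vert ^{2} . \end{aligned}$$

Under this reading, the penalty measures how far each voxel's parameter vector lies from the neighborhood average and vanishes only when all *q* columns of \(B_v\) coincide.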

### Model estimation

To estimate the model, we consider the gradient of the total error function:

$$\begin{aligned} \nabla L(B_{v}) = 2\times [X^TXB_v - X^TY_v + B_v(\lambda _1R_vR_v^T+\lambda _2I)], \end{aligned}$$

(9)

where *I* is the \(q\times q\) identity matrix. The two regularization coefficients (\(\lambda _1\) and \(\lambda _2\)) are determined using nested cross-validation. For notational convenience, the gradient is expressed as

$$\begin{aligned} \nabla L(B_{v}) = 2\times [X^TXB_v - X^TY_v + B_v\hat{R}], \end{aligned}$$

(10)

where \(\hat{R} = \lambda _1R_vR_v^T+\lambda _2I\). Setting this gradient to zero gives

$$\begin{aligned} X^TXB_v + B_v\hat{R} = X^TY_v . \end{aligned}$$

(11)

Note that this equation differs from the usual normal equations of penalized least-squares regression: the unknown \(B_v\) appears on the right of \(X^TX\) in the first term but on the left of \(\hat{R}\) in the second term, so the equation cannot be written directly in the form *Ax* = *b*.

In fact, this is a Sylvester equation, with \(B_v\) the unknown parameter matrix to be determined. The equation can be solved efficiently (Bartels and Stewart 1972). As in the estimation of the Ridge and Lasso models, we used a nested threefold cross-validation to determine \(\lambda _1\) and \(\lambda _2\) in the range \((10^{-5},10^{5})\) on a log scale.
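A minimal sketch of solving Eq. (11) is given below. It does not implement the Bartels and Stewart algorithm cited above; instead it uses the fact that both \(X^TX\) and \(\hat{R}\) are symmetric, so the equation decouples entry by entry after two eigendecompositions. A general-purpose routine such as `scipy.linalg.solve_sylvester(a, b, q)`, which solves *AX* + *XB* = *Q*, could be used instead. The function names, sizes, and data are assumptions for illustration.

```python
import numpy as np

def smoothness_matrix(q):
    """R from Eq. (8): q - 1 on the diagonal, -1 elsewhere."""
    return q * np.eye(q) - np.ones((q, q))

def solve_multi_target(X, Y, lam1, lam2):
    """Solve X^T X B + B R_hat = X^T Y for B (requires lam2 > 0)."""
    q = Y.shape[1]
    R = smoothness_matrix(q)
    R_hat = lam1 * R @ R.T + lam2 * np.eye(q)
    # Both coefficient matrices are symmetric, so diagonalize them.
    a_vals, U = np.linalg.eigh(X.T @ X)
    r_vals, W = np.linalg.eigh(R_hat)
    C = U.T @ (X.T @ Y) @ W
    # In the rotated basis: (a_i + r_j) * B_tilde[i, j] = C[i, j]
    B_tilde = C / (a_vals[:, None] + r_vals[None, :])
    return U @ B_tilde @ W.T

# Synthetic neighborhood (hypothetical sizes, far smaller than in the study).
rng = np.random.default_rng(0)
N, M, q = 40, 6, 5
X = rng.standard_normal((N, M))
Y = rng.standard_normal((N, q))
lam1, lam2 = 0.1, 1.0

B_hat = solve_multi_target(X, Y, lam1, lam2)
R = smoothness_matrix(q)
R_hat = lam1 * R @ R.T + lam2 * np.eye(q)
residual = X.T @ X @ B_hat + B_hat @ R_hat - X.T @ Y
print(np.max(np.abs(residual)))
```

With \(\lambda _1 = 0\) the coupling between voxels disappears and the solution reduces to an independent ridge-type fit for each column of \(Y_v\).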

### Prediction

Similar to the widely used searchlight strategy (Kriegeskorte et al. 2006) in brain mapping, we move a spherical searchlight through the brain volume. For each center voxel *v*, we obtain the estimated parameter matrix \(\hat{B}_{v}\) by solving Eq. (11). The predicted response matrix \(\hat{Y}_v\) is then calculated as

$$\begin{aligned} \hat{Y}_{v} = X\hat{B}_{v} . \end{aligned}$$

(12)

The predicted responses of voxel *v* and its neighbors are thus the columns of \(\hat{Y}_v\). Note that with this strategy the brain activity of voxel *v* is predicted by several models: it appears as the center voxel in one model and as a neighbor in the models centered on nearby voxels. To obtain a smooth prediction, we set the predicted response of voxel *v* to the mean of these predictions.
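The averaging over overlapping searchlights can be sketched as below. Everything here is an illustrative assumption: the helper names, the toy 3 × 3 × 3 volume, and the neighborhood convention (integer voxel offsets within a Euclidean radius, with neighborhoods clipped at the volume boundary). Under this convention a radius of two voxel units yields 33 offsets including the center; whether this matches the counting convention used in the study is not stated in the text.

```python
import itertools
import numpy as np

def sphere_offsets(radius):
    """Integer (dx, dy, dz) offsets within `radius` voxel units, center first."""
    r = int(np.floor(radius))
    offsets = [o for o in itertools.product(range(-r, r + 1), repeat=3)
               if 0 < sum(c * c for c in o) <= radius ** 2]
    return [(0, 0, 0)] + offsets

def build_neighborhoods(shape, offsets):
    """For every voxel, list the flat indices of its in-volume neighborhood."""
    index = np.arange(np.prod(shape)).reshape(shape)
    neighborhoods = []
    for center in itertools.product(*(range(s) for s in shape)):
        members = []
        for off in offsets:
            p = tuple(c + o for c, o in zip(center, off))
            if all(0 <= p[a] < shape[a] for a in range(3)):
                members.append(int(index[p]))
        neighborhoods.append(members)
    return neighborhoods

def average_searchlights(neighborhoods, predictions, n_voxels):
    """Average each voxel's predictions over all searchlights containing it."""
    n_samples = predictions[0].shape[0]
    total = np.zeros((n_samples, n_voxels))
    counts = np.zeros(n_voxels)
    for members, Y_hat in zip(neighborhoods, predictions):
        total[:, members] += Y_hat       # columns of Y_hat follow `members`
        counts[members] += 1
    return total / np.maximum(counts, 1)

# Toy volume; each "prediction" is simply the true response here.
shape = (3, 3, 3)
neighborhoods = build_neighborhoods(shape, sphere_offsets(1))
rng = np.random.default_rng(0)
truth = rng.standard_normal((4, 27))
predictions = [truth[:, members] for members in neighborhoods]
averaged = average_searchlights(neighborhoods, predictions, 27)
print(len(sphere_offsets(2)), averaged.shape)
```

In practice each entry of `predictions` would be the matrix \(\hat{Y}_v\) from Eq. (12) for one searchlight center.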

### Implementation details

In this work, we used the Gabor wavelet pyramid model (Jones and Palmer 1987) with six frequencies and eight possible orientations to extract stimulus features. To address the residual nonlinearity in the model, we applied an additional nonlinear transformation

$$\begin{aligned} f(x) = \log (1+\sqrt{x}) \end{aligned}$$

(13)

to each stimulus feature, as done in a previous study (Kay et al. 2008). This resulted in a \(1\times 10{,}920\) feature vector for each stimulus image. Optimizing the regression models is time-consuming when *X* is this large, so for computational reasons we first reduced the features with principal component analysis (PCA) (Bishop 2006), a common dimension-reduction strategy in machine learning. Only the largest 500 components were retained; these components capture over \(80\%\) of the variance, so the transformed feature vector is \(1\times 500\) for each stimulus.
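The feature pipeline described above (the transformation of Eq. (13) followed by PCA) can be sketched as follows. The SVD-based PCA with mean centering, the function name, and the toy sizes (200 features reduced to 10 components rather than 10,920 to 500) are assumptions for illustration; the Gabor wavelet feature extraction itself is replaced here by random non-negative values.

```python
import numpy as np

def compress_features(F, n_components):
    """Apply f(x) = log(1 + sqrt(x)) and project onto leading principal axes."""
    Z = np.log1p(np.sqrt(F))               # Eq. (13); F must be non-negative
    mean = Z.mean(axis=0)
    Zc = Z - mean                          # center before PCA
    _, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    components = Vt[:n_components]         # principal axes as rows
    explained = np.sum(s[:n_components] ** 2) / np.sum(s ** 2)
    return Zc @ components.T, explained, mean, components

# Stand-in for Gabor wavelet outputs (hypothetical random values).
rng = np.random.default_rng(0)
F_train = rng.random((30, 200))
X_train, explained, mean, components = compress_features(F_train, 10)
print(X_train.shape, explained)

# Validation features would reuse the training mean and components.
F_val = rng.random((5, 200))
X_val = (np.log1p(np.sqrt(F_val)) - mean) @ components.T
```

Reusing the training mean and components for the validation images keeps the test data out of the dimension-reduction step.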