Classification within single-participant sessions
The classification result within a single-participant session is established here as the benchmark against which the cross-session predictions are compared. Classification accuracy was measured as the proportion of single trials whose semantic category (animal or tool) was correctly determined. The two classes are balanced, so chance performance was 50%, and accuracies above 55.8% were significant (p < 0.05, binomial test over independent trials, chance 50%, n = 240). This threshold matched the result of a permutation test with random labelling.

As shown in Figure 3, for experiment I (Korean-Chinese bilinguals), classification accuracy was well above this threshold for all sessions, both in the Korean-to-Chinese language-switching condition (K → C, L1 → L2; image captions in Korean, covert generation in Chinese) and in the Chinese-to-Korean (C → K, L2 → L1) condition. In the K → C condition, the mean accuracy was 93.4% (SD = 4%), ranging from 70.7% to 98.8%. In the C → K condition, the mean accuracy was 90.8% (SD = 9%), ranging from 77.5% to 96.7%.

Figure 4 shows the equivalent results for experiment II (Chinese-Japanese second language learners), where all individual accuracies were again well beyond the significance threshold (55.8%). In the C → J (L1 → L2) condition (captions in Chinese, covert task in Japanese), the mean accuracy was 91.6% (SD = 7%), ranging from 77.5% to 97.5%. In the J → C (L2 → L1) condition, the mean accuracy was 92.7% (SD = 5%), ranging from 83.3% to 96.7%.
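These significance thresholds can be reproduced with a short binomial computation. The following is a minimal sketch assuming SciPy; accuracy_threshold is an illustrative helper, not code from the original study.

```python
# Sketch: smallest accuracy whose one-sided binomial p-value beats alpha,
# given n independent trials at 50% chance (assumes SciPy is available).
from scipy.stats import binom

def accuracy_threshold(n_trials, alpha=0.05, chance=0.5):
    # isf gives the largest trial count that is still non-significant;
    # one more correct trial reaches significance.
    k = int(binom.isf(alpha, n_trials, chance)) + 1
    return k / n_trials

print(accuracy_threshold(240))  # 0.5583... -> the 55.8% quoted above
# Bonferroni-corrected variants follow by passing alpha = 0.05 / n_tests,
# e.g. accuracy_threshold(240, 0.05 / 168) for the pairwise analyses below.
```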
Within-participant cross-session classification
Here, we test whether category-specific activation patterns are shared between different sessions in the same participant. The PLR classifier with ANOVA feature selection was trained on all 240 trials from one session and then tested directly, discriminating between animal and tool presentations, on the 240 trials from the same participant's other experimental session.

Compared to within-session classification, cross-session prediction was slightly less successful, as shown in Figures 5 and 6. Still, all results were significantly above chance at the same threshold of 55.8%.

For experiment I (early Korean-Chinese bilingual participants), the (K → C) → (C → K) analysis (training on data from the Korean caption/Chinese production session; testing on data from the Chinese caption/Korean production session) achieved a mean accuracy over seven participants of 83.6% (SD = 10%), ranging from 63.7% to 94.2% (Figure 5, left panel). In the other direction, (C → K) → (K → C), the classification accuracy was also significant, with a mean of 82.9% (SD = 11%), ranging from 62.9% to 95.8% (Figure 5, right panel). For the late Chinese-Japanese bilingual group (Figure 6), the mean classification accuracy was 84.0% (SD = 8%) in the (J → C) → (C → J) analysis (range 70.4% to 97.5%) and 86.7% (SD = 5%) in the (C → J) → (J → C) analysis (range 81.2% to 93.3%).
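As an illustration of this decoding step, the following is a minimal scikit-learn sketch, assuming the PLR classifier can be approximated by an L2-penalized logistic regression and that SelectKBest/f_classif stands in for the ANOVA selector; the number of retained voxels, n_voxels, is illustrative.

```python
# Minimal cross-session decoding sketch (assumptions: L2-penalized logistic
# regression for PLR; SelectKBest/f_classif for the ANOVA voxel selector;
# n_voxels is illustrative). X_* are trials-by-voxels arrays, y_* are labels.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def cross_session_accuracy(X_train, y_train, X_test, y_test, n_voxels=500):
    clf = make_pipeline(
        SelectKBest(f_classif, k=n_voxels),  # selection on the training session only
        LogisticRegression(penalty="l2", max_iter=1000),
    )
    clf.fit(X_train, y_train)         # all 240 trials of the training session
    return clf.score(X_test, y_test)  # proportion correct on the other session
```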
Cross-participant classification: pairwise and groupwise
Here, we performed an analysis similar to that of the previous section (PLR classifier, ANOVA feature selection, separate training and testing sessions), but classified the data from one participant after training on the data of other participants. We first train on data from single-participant sessions, and then on data from whole groups of participants.
In Figure 7, we present the classification accuracies obtained when performing ANOVA feature selection and training on one dataset (y-axis) and testing on another (x-axis). For comparison, the alternating cells just off the diagonal show the within-participant cross-session accuracies (as already shown in Figures 5 and 6).
The mean classification accuracy for the remaining cells in Figure 7 (where training and testing data come from different participants) is 64.6%. This is considerably lower than the within-session and within-participant results, but 69.0% of the individual training/testing pairs were above the significance threshold of 61.3% (binomial test with Bonferroni correction, n = 168).
In Figure 8, ANOVA feature selection and training are performed over the data from 13 participants (26 sessions, 6,240 stimulus trials in total), while the trials from the remaining held-out participant (2 sessions) are classified. Of the 28 classification analyses, 92.9% reach a significance threshold of 60% (with Bonferroni correction, n = 28). Mean classification accuracy clearly improves, reaching 74.5%, compared to 64.6% in the previous analysis, which used only one training session at a time rather than the 26 used here. However, it is still considerably lower than the mean accuracy of 83.2% seen for cross-session analyses using only a single training session from the same participant (Figures 5 and 6; t = 7.7, p < 2.8 × 10⁻⁸).
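A sketch of this leave-one-participant-out scheme follows; the sessions mapping (participant ID to a list of per-session (X, y) arrays) and the build_clf factory, which could return the pipeline from the sketch above, are assumed conventions rather than the authors' code.

```python
# Leave-one-participant-out sketch (hypothetical data layout: sessions maps
# each participant ID to a list of (X, y) arrays, one pair per session).
import numpy as np

def groupwise_accuracy(sessions, held_out, build_clf):
    train = [(X, y) for p, runs in sessions.items() if p != held_out
             for X, y in runs]
    X_train = np.vstack([X for X, _ in train])     # e.g. 26 sessions, 6,240 trials
    y_train = np.concatenate([y for _, y in train])
    clf = build_clf().fit(X_train, y_train)        # selection + training on the group
    # Mean accuracy over the held-out participant's two sessions.
    return float(np.mean([clf.score(X, y) for X, y in sessions[held_out]]))
```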
JRFS cross-participant classification: pairwise and groupwise
In the last section, we saw a clear performance penalty for training across participants relative to cross-validated training and testing within a single participant, even when the amount of training data was dramatically increased by including multiple sessions. Here, we introduce a joint feature selection strategy to address this. Until now, feature selection has used only the same source data as training. Here, feature selection is performed jointly on all of the source data plus one half of the target session dataset at a time. Training is still carried out on the source data only, and testing uses the other, held-out half of the target set that did not contribute to feature selection. In each analysis, this process is performed twice (i.e. twofold), so that each half of the target dataset is tested once, and the accuracies reported are the mean of these two.
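To make the JRFS partitioning concrete, here is a minimal sketch under the same assumed scikit-learn stand-ins as above; the deterministic half-split and n_voxels are illustrative choices, not necessarily the paper's exact partitioning.

```python
# Minimal JRFS sketch (assumed stand-ins: SelectKBest/f_classif for the ANOVA
# selector, L2-penalized LogisticRegression for PLR).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

def jrfs_accuracy(X_src, y_src, X_tgt, y_tgt, n_voxels=500):
    half_a, half_b = np.array_split(np.arange(len(y_tgt)), 2)
    accs = []
    for sel_half, test_half in ((half_a, half_b), (half_b, half_a)):
        # Joint feature selection over the source data plus one target half.
        sel = SelectKBest(f_classif, k=n_voxels).fit(
            np.vstack([X_src, X_tgt[sel_half]]),
            np.concatenate([y_src, y_tgt[sel_half]]),
        )
        # Training still uses the source data only.
        clf = LogisticRegression(penalty="l2", max_iter=1000)
        clf.fit(sel.transform(X_src), y_src)
        # Testing uses the held-out target half the selector never saw.
        accs.append(clf.score(sel.transform(X_tgt[test_half]), y_tgt[test_half]))
    return float(np.mean(accs))  # mean of the two fold accuracies
```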
Figure 9 shows the cross-session pairwise classification accuracies, as in Figure 7, but now using the modified JRFS strategy. Considering the cross-participant analyses only (off-diagonal cells in the plot), 89.7% were above the significance threshold of 61.3% (binomial test, p < 0.05, with a Bonferroni correction of n = 168). The mean accuracy was 71.3%, an average improvement of 6.7 percentage points over the pure cross-participant modelling (feature selection and training on one participant, testing on another; Figure 7). While there was a strong correlation of r = 0.78 between the conventional and JRFS accuracies, the group-level difference was highly significant (t = −36.69, p < 7.79 × 10⁻⁸²).
Figure 10 shows the JRFS groupwise classification accuracies (corresponding to Figure 8, which used conventional source-dataset-only feature selection). The mean classification accuracy was 80.0%, significantly higher than that obtained with conventional feature selection (74.5%, t = −5.76, p < 3.97 × 10⁻⁶) and 8.7 percentage points above the pairwise JRFS analyses in Figure 9.
Here, all the sessions exceed the significance threshold of 60% (with Bonferroni correction, n = 28). Notably, the modelling for participant ‘e_P5’, whose dataset recorded the worst accuracy in the within-participant prediction, improved considerably as a result of the feature co-selection technique. In this case, the accuracies of the Korean-to-Chinese and Chinese-to-Korean predictions increased under JRFS from 59.6% to 66.2% and from 58.3% to 70.8%, respectively.
DJFS cross-participant classification: pairwise and groupwise
In this section, the results of DJFS are reported in comparison with those of JRFS. For DJFS, voxels are selected from one half of the target participant T's dataset at a time, the model is trained on the whole dataset of a source participant S, and testing is performed on the held-out half of T's dataset. Figure 11 shows the results of the pairwise DJFS classification, which outperformed all the other between-participant classification techniques. The mean classification accuracy was 75.7%, significantly higher than that of JRFS (71.3%), an improvement of 4.3 percentage points (t = −24.87, p < 3.2172 × 10⁻⁹⁹). Of the participant combinations, 94.8% were above the significance threshold of 61.3% (binomial test, p < 0.05, with Bonferroni correction of n = 168).
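Under the same assumptions as the JRFS sketch above, DJFS differs only in the selection step, as this minimal sketch illustrates.

```python
# Minimal DJFS sketch: identical to the JRFS sketch above except that voxel
# selection uses one target half alone, without the source data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

def djfs_accuracy(X_src, y_src, X_tgt, y_tgt, n_voxels=500):
    half_a, half_b = np.array_split(np.arange(len(y_tgt)), 2)
    accs = []
    for sel_half, test_half in ((half_a, half_b), (half_b, half_a)):
        # Feature selection on one half of the target participant's data only.
        sel = SelectKBest(f_classif, k=n_voxels).fit(X_tgt[sel_half], y_tgt[sel_half])
        # Training on the whole dataset of the source participant S.
        clf = LogisticRegression(penalty="l2", max_iter=1000)
        clf.fit(sel.transform(X_src), y_src)
        # Testing on the held-out half of the target dataset.
        accs.append(clf.score(sel.transform(X_tgt[test_half]), y_tgt[test_half]))
    return float(np.mean(accs))
```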
The DJFS groupwise classification accuracies likewise ranked highest among all the groupwise modelling analyses with the same configuration. The mean accuracy was 82.0%, an increase of 2 percentage points over the JRFS groupwise classification (t = −3.08, p < 0.00471). The classification accuracy of every session was greater than the significance threshold of 60% (with Bonferroni correction, n = 28) (Figure 12).
JRFS and DJFS using a searchlight selector
JRFS and DJFS using a cross-validated searchlight proved similarly effective to the ANOVA-based JRFS and DJFS described above. After excluding the within-participant session predictions as before, we calculated (1) the mean and (2) the standard deviation of the classification accuracy, and (3) the proportion of session pairs reaching significance (accuracy greater than 61.3% at p < 0.05). For a JRFS searchlight selector using radii of 0, 1, 2 and 3, these statistics were {72.1%, 0.079, 91.1%}, {69.6%, 0.076, 87.4%}, {67.7%, 0.076, 79.3%} and {66.3%, 0.076, 73.6%}, respectively. The corresponding numbers for DJFS were consistently superior: {76.3%, 0.082, 95.3%}, {72.6%, 0.081, 91.9%}, {69.7%, 0.076, 86.9%} and {68.0%, 0.074, 83.1%}, respectively. Figures 13 and 14 show the results of the cross-participant prediction based on JRFS and DJFS with the searchlight.
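For orientation, a cross-validated searchlight selector could be sketched as below; the voxel-grid coords array, the radius-in-voxels convention and the 5-fold cross-validation are assumptions mirroring the radii 0 to 3 above, not details confirmed by the original study.

```python
# Rough sketch of a cross-validated searchlight selector (plain NumPy and
# scikit-learn; coords holds each voxel's 3-D grid position).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def searchlight_scores(X, y, coords, radius):
    """Score each voxel by the cross-validated accuracy of its sphere."""
    scores = np.empty(X.shape[1])
    for v in range(X.shape[1]):
        # All voxels within `radius` grid units of voxel v form its sphere;
        # radius 0 reduces to the single voxel itself.
        sphere = np.flatnonzero(np.linalg.norm(coords - coords[v], axis=1) <= radius)
        scores[v] = cross_val_score(
            LogisticRegression(penalty="l2", max_iter=1000),
            X[:, sphere], y, cv=5,
        ).mean()
    # The top-scoring voxels then replace the ANOVA F-ranking in JRFS/DJFS.
    return scores
```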
The pattern of DJFS outperforming JRFS seen in the last analysis was repeated at all searchlight sizes. Contrary to our expectations, the searchlight selector produced no significant improvement over the ANOVA selector, despite a small apparent advantage for searchlights with a radius of 0 (i.e. a volume of a single voxel). Larger searchlights performed significantly worse than the ANOVA selector (Bonferroni-corrected multiple comparisons following a one-way ANOVA).
Summary of results
Figure 15 summarizes the results of all analyses, showing the mean classification accuracy over all datasets for each feature selection and data partitioning strategy examined. The results clearly illustrate the cross-session penalty in classification accuracy established above and, in the groupwise results, an advantage as the number of training sessions and trials increases, owing to an improved signal-to-noise ratio and a broader sampling of trials and participants. Both of our feature selection and partitioning strategies outperform the conventional methods, and DJFS has an advantage over JRFS, approaching the benchmark levels of within-participant analysis. In terms of the feature selector, cross-validated searchlight results are slightly higher than ANOVA, but not significantly, and only for the minimal searchlight (radius 0, a single voxel).