Generalization test results and robustness of learning

4.1 The Effect of High Variability Phonetic Training on L2 Speech Perception

4.1.3 Generalization test results and robustness of learning – Generalization 2

In this section, the results of the generalization test are presented and analyzed in order to discuss transfer of perceptual learning obtained via training. This test assessed generalization to stimuli characterized by having one or more features that were absolutely new to participants, that is, that had never been trained nor heard before in the program. The posttest/generalization test had nine conditions resulting from the combination of the levels Trained, Untrained, and Novel of the variables Pseudoword identity and Talker identity (for more information, cf. Section 3.4.1.3). Table 20 presents those conditions.

Table 20 (Pretest/Posttest and) Generalization Test Conditions

                            Talkers
Pseudowords     Trained     Untrained     Novel
Trained         Cond. 1     Cond. 2       Cond. 5
Untrained       Cond. 3     Cond. 4       Cond. 6
Novel           Cond. 9     Cond. 8       Cond. 7

Note. Cond. = Condition. Conditions 1 to 4 refer to the pretest/posttest. Conditions 5 to 9 correspond to the generalization test.
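The design in Table 20 can also be written out as a small lookup structure. The sketch below is only illustrative: the names `CONDITIONS` and `phase_of` are hypothetical and not part of the original study materials. It shows that every condition containing at least one Novel level belongs to the generalization test.

```python
from itertools import product

LEVELS = ("Trained", "Untrained", "Novel")

# Condition numbers copied from Table 20, keyed by
# (pseudoword level, talker level).
CONDITIONS = {
    ("Trained", "Trained"): 1,
    ("Trained", "Untrained"): 2,
    ("Trained", "Novel"): 5,
    ("Untrained", "Trained"): 3,
    ("Untrained", "Untrained"): 4,
    ("Untrained", "Novel"): 6,
    ("Novel", "Trained"): 9,
    ("Novel", "Untrained"): 8,
    ("Novel", "Novel"): 7,
}


def phase_of(pseudoword: str, talker: str) -> str:
    """Return the test a condition belongs to (hypothetical helper).

    A condition with a Novel level on either variable is part of the
    generalization test; all others belong to the pretest/posttest.
    """
    if "Novel" in (pseudoword, talker):
        return "generalization test"
    return "pretest/posttest"


for pseudoword, talker in product(LEVELS, repeat=2):
    cond = CONDITIONS[(pseudoword, talker)]
    print(f"Cond. {cond}: {pseudoword.lower()} pseudowords, "
          f"{talker.lower()} talkers -> {phase_of(pseudoword, talker)}")
```

Under this mapping, Conditions 1 to 4 fall in the pretest/posttest and Conditions 5 to 9 in the generalization test, in line with the note to Table 20.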


A first approach to this analysis consisted in comparing overall performance between pretest, posttest and the generalization test across groups. Table 21 presents the mean accuracy score in the three testing phases for both groups.

Table 21 Mean Percent Accuracy Score in the Pretest, in the Posttest and in the Generalization Test for the Trained and the Control Groups

Group          Pretest Mean (SD)     Posttest Mean (SD)     Gen. test Mean (SD)
TG (n=17)      57.43 (9.26)          68.09 (11.50)          65.13 (9.22)
CG (n=17)      58.53 (8.65)          60.29 (8.56)           61.61 (9.19)

Note. TG = Trained group; CG = Control group; SD = Standard deviation; Gen. = Generalization.

From the pretest to the posttest, both groups improved their scores, but, as seen before, the increase was significant only for the trained group. From the posttest to the generalization test, the trained group's accuracy score decreased by about three percentage points (68.09% to 65.13%), whereas the control group's correct identification score increased by about one percentage point (60.29% to 61.61%). Globally, that is, from the pretest to the generalization test, accuracy increased by about eight percentage points in the trained group (57.43% to 65.13%) and by about three percentage points in the control group (58.53% to 61.61%). Figure 15 illustrates the evolution of scores from the pretest to the generalization test for both groups.
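The percentage-point differences discussed above can be recomputed directly from the group means in Table 21. The following is a minimal sketch for checking the arithmetic; the names `MEANS` and `change` are hypothetical.

```python
# Group means (percent correct) copied from Table 21.
MEANS = {
    "TG": {"pretest": 57.43, "posttest": 68.09, "gen": 65.13},
    "CG": {"pretest": 58.53, "posttest": 60.29, "gen": 61.61},
}


def change(group: str, start: str, end: str) -> float:
    """Difference in percentage points between two testing phases."""
    return round(MEANS[group][end] - MEANS[group][start], 2)


for group in MEANS:
    print(
        f"{group}: "
        f"pretest->posttest {change(group, 'pretest', 'posttest'):+.2f}, "
        f"posttest->gen. test {change(group, 'posttest', 'gen'):+.2f}, "
        f"pretest->gen. test {change(group, 'pretest', 'gen'):+.2f}"
    )
```

For the trained group this gives +10.66, -2.96 and +7.70 percentage points; for the control group, +1.76, +1.32 and +3.08. These are the unrounded values behind the figures quoted in the text.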

Figure 15. Mean accuracy score at the three testing phases for the trained and control groups. [Line chart. X-axis: Pretest, Posttest, Gen. Test. Y-axis: % Accuracy score. Series: Trained group, Control group.]

Given the decrease in performance from the posttest to the generalization test in the trained group, it was necessary to verify whether the generalization test score differed significantly from the posttest score. A significantly lower score would be interpreted as an indicator that learning with trained tokens did not transfer to novel stimuli (differing in pseudoword identity, specifically the following vowel, and/or talker identity). If the generalization test score did not differ from the posttest score under statistical scrutiny, that would be taken as an indication that transfer of learning took place, since performance with novel stimuli was similar to that with trained tokens.

A two-way mixed ANOVA was run, with the Pretest, Posttest and Generalization test accuracy scores as the three levels of the repeated-measures factor Time, and the Trained and the Control participants as the levels of the between-subjects factor Group.²⁸ Results revealed that the main effect of Time was significant, F(1.54, 49.28) = 8.39, p = .002, indicating that, when all other variables were ignored, scores differed depending on the moment of testing. Post hoc tests showed that, all participants considered, there was a significant difference between the pretest mean accuracy score (57.98%) and both the posttest (64.19%) and the generalization test (63.37%) mean accuracy rates (all ps < .05), whereas the difference between the generalization test identification rate and the posttest mean score was deemed non-significant (p > .05). The interaction effect between time of testing and the group the participants belonged to was also significant, F(1.54, 49.28) = 3.65, p = .04. This indicates that the way Time affected scores was different in the control and in the trained groups. For the trained group, post hoc comparisons revealed that there was a significant effect of Time between the pretest and the posttest accuracy rates and between the pretest and the generalization test correct identification scores (all ps < .05); the comparison between the posttest score and the generalization test score was non-significant (p > .05), suggesting that performance of trained participants in both testing tasks was similar. Regarding the control group, no significant effect of Time was found in any of the comparisons between the factor levels (all ps > .05).
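The fractional degrees of freedom in F(1.54, 49.28) come from the Greenhouse-Geisser correction, which multiplies both the effect and the error degrees of freedom by an estimate of sphericity, epsilon. The sketch below reproduces them from the design. Note that epsilon itself is not reported in the text; it is inferred here from the reported numerator degrees of freedom, which is an assumption of this example.

```python
# Design values taken from the text: three testing phases (Time),
# two groups (Group) of 17 participants each.
n_levels = 3
n_groups = 2
n_participants = 17 + 17

# Uncorrected degrees of freedom for the within-subjects effects
# in a two-way mixed design.
df_time = n_levels - 1                                   # 2
df_error = (n_levels - 1) * (n_participants - n_groups)  # 64

# Assumption: epsilon is not given in the source; it is inferred from
# the reported corrected numerator df (1.54).
epsilon = 1.54 / df_time

corrected_df_time = epsilon * df_time
corrected_df_error = epsilon * df_error

print(f"epsilon = {epsilon:.2f}")
print(f"F({corrected_df_time:.2f}, {corrected_df_error:.2f})")
```

With an inferred epsilon of 0.77, the corrected error degrees of freedom come out at 49.28, matching the value reported for both the Time effect and the Time by Group interaction.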

Overall, results suggest that the trained group significantly improved their scores from pretest to posttest and that the level of performance achieved at posttest was maintained in the generalization test, which indicates that trained participants transferred learning to novel stimuli. In contrast, the control group did not change their scores significantly over time, that is, their level of performance was similar at posttest compared to pretest and in the generalization test compared to the posttest. Taken together, these results suggest that training induced robust learning.

28 In Section 4.1.1, it was noted that scores of the trained and control groups in the pretest and posttest were normally distributed. The Kolmogorov-Smirnov test of normality also indicated that scores in the generalization test did not deviate significantly from a normal distribution for both groups (all ps > .05). Levene’s tests showed that variances were homogeneous in the control and trained groups at all levels of the factor Time (ps > .05). This means that the assumption of homogeneity of variances was met. Mauchly’s sphericity test for the repeated-measures effect in the model showed that the main effect of Time violated the assumption of sphericity (p = .004) and so the Greenhouse-Geisser-corrected F-values are reported for this effect.

A second approach to this analysis involved comparing performance between posttest and the generalization test on a per-condition basis in the trained group. With nine conditions across tests, a large number of pairwise comparisons would be possible. However, the effect of the Untrained feature of stimuli was already discussed in Section 4.1.2, and it was not within the scope of this work to analyze the differences in performance between untrained and novel pseudowords/talkers. As such, although Conditions 6 and 8 were tested to ensure a balanced design, no comparisons involving the Untrained level of the Talker or the Pseudoword variables were made at this point, for the sake of brevity, clarity and purposefulness. The discussion is therefore centered on the influence of novel talkers and novel pseudowords on perceptual performance. Three generalization test conditions are compared to Condition 1 in the posttest (trained pseudowords produced by trained talkers, TWTT), which was used as a baseline category: Condition 5 (trained pseudowords produced by novel talkers, TWNT), Condition 7 (novel pseudowords produced by novel talkers, NWNT) and Condition 9 (novel pseudowords produced by trained talkers, NWTT).

Table 22 presents the mean accuracy score of the trained group in Conditions 5, 7 and 9 of the generalization test compared to posttest condition 1. It can be seen that participants scored highest on Condition 1 of the posttest (73.82%), which contained trained pseudowords produced by trained talkers. The three generalization test conditions under analysis presented scores below that of the baseline category and, among these, Condition 5 (TWNT) showed the greatest difference relative to posttest condition 1 (13.82 percentage points less).

Table 22 Mean Percent Accuracy Score in Posttest Condition 1 and Conditions 5, 7 and 9 of the Generalization Test for the Trained Group

Group          Posttest TWTT     Gen. test TWNT     Gen. test NWNT     Gen. test NWTT
               Mean (SD)         Mean (SD)          Mean (SD)          Mean (SD)
TG (n=17)      73.82 (14.53)     60.00 (14.03)      69.12 (10.32)      71.08 (12.10)

Note. TG = Trained group; Gen. = Generalization; TWTT = Trained pseudowords by trained talkers (Condition 1); TWNT = Trained pseudowords by novel talkers (Condition 5); NWNT = Novel pseudowords by novel talkers (Condition 7); NWTT = Novel pseudowords by trained talkers (Condition 9).
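As a quick check on the descriptive comparison, the gap between the baseline category and each generalization condition can be recomputed from the means in Table 22. This is an illustrative sketch; the names `BASELINE`, `GEN_MEANS` and `gaps` are hypothetical.

```python
# Trained-group means (percent correct) copied from Table 22.
BASELINE = 73.82  # posttest Condition 1 (TWTT)
GEN_MEANS = {
    "TWNT (Cond. 5)": 60.00,
    "NWNT (Cond. 7)": 69.12,
    "NWTT (Cond. 9)": 71.08,
}

# Gap to baseline in percentage points (positive = below baseline).
gaps = {label: round(BASELINE - mean, 2) for label, mean in GEN_MEANS.items()}

for label, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {gap:.2f} percentage points below baseline")
```

The largest gap is the 13.82 points for Condition 5, followed by 4.70 for Condition 7 and 2.74 for Condition 9. Whether any of these gaps is statistically reliable is a separate question, addressed by the repeated-measures analysis.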

The next step was to test whether the differences in accuracy scores between the generalization conditions and the baseline category were statistically significant. Any significant difference would be interpreted to suggest inability to transfer learning to the perception of the feature(s) along which generalization stimuli differed from trained tokens, depending on the condition being compared.

A one-way repeated-measures ANOVA was run with scores from the trained group. Stimuli was defined as a within-subjects factor with four levels (Posttest condition 1, Generalization test condition 5, Generalization test condition 7, Generalization test condition 9).²⁹
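For readers less familiar with this test, the sketch below shows how the F ratio of a one-way repeated-measures ANOVA is obtained from a participants-by-conditions table of scores. It is a generic textbook computation, not the software procedure used in the study, and the toy scores are invented for illustration only. The function name `rm_anova_oneway` is hypothetical.

```python
def rm_anova_oneway(scores):
    """One-way repeated-measures ANOVA on a participants x conditions table.

    `scores` is a list of rows, one row per participant and one column
    per condition. Returns (F, df_condition, df_error).
    """
    n = len(scores)       # number of participants
    k = len(scores[0])    # number of conditions

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    # Partition the total variability into condition, participant
    # and residual (error) components.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_cond = n * sum((m - grand_mean) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand_mean) ** 2 for m in subj_means)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error


# Invented toy data: 3 participants x 2 conditions.
toy_scores = [[1, 2], [2, 4], [3, 6]]
f_value, df_cond, df_error = rm_anova_oneway(toy_scores)
print(f"F({df_cond}, {df_error}) = {f_value:.2f}")
```

With 17 participants and four levels of Stimuli, the same formulas give 3 and 48 degrees of freedom, which is the form F(3, 48) in which the result is reported.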

Results revealed that the main effect of Stimuli was significant, F(3, 48) = 6.27, p = .001, indicating that the trained participants performed differently across conditions. Post hoc comparisons showed that scores in Condition 5 of the generalization test (TWNT) were significantly lower than rates obtained in the baseline category (TWTT) by the trained group (p = .03). The remaining comparisons were non-significant (ps > .05), indicating that scores in Conditions 7 (NWNT) and 9 (NWTT) of the generalization test were similar to the level of accuracy obtained with trained pseudowords produced by trained talkers (baseline condition). Thus, participants in the trained group were able to transfer learning from trained tokens to novel pseudowords produced by trained talkers and to entirely novel stimuli, that is, novel pseudowords produced by novel talkers, which differed from trained and pre/posttest items specifically in the vowel following the target stop. A different pattern of results was found with trained pseudowords produced by novel talkers. In this condition, performance was significantly worse than in the baseline category, which, on a preliminary analysis, could be taken to indicate that participants were not able to transfer learning to the perception of novel voices. However, performance in Condition 7, with novel pseudowords produced by the same two novel talkers who produced stimuli for Condition 5, undermines that interpretation and suggests that participants were, indeed, capable of generalizing learning to different voices.
One possibility is that the specific pseudoword-talker combination in Condition 5 rendered stimuli less intelligible, so results from the native speakers’ validation of stimuli were re-inspected to verify whether approval rates differed between conditions of the generalization test. Conditions 5, 7 and 9 in the generalization test obtained a 100% approval rate by native speakers (Table 23), which seems to rule out the possibility that stimuli in Condition 5 were less intelligible. Still, the fact that stimuli in these conditions were equally intelligible to native speakers does not necessarily mean that they did not have differential degrees of intelligibility for L2 learners. Therefore, the reasons why trained participants performed worse in Condition 5 than in the other generalization conditions involving the same novel talkers remain unclear.

29 Scores in Condition 1 of the posttest and in the three conditions of the generalization test under analysis were normally distributed for the trained group: D(17) = .14, p = .20, D(17) = .17, p = .20, D(17) = .15, p = .20, and D(17) = .13, p = .20 for scores in Condition 1 of the posttest and in Conditions 5, 7 and 9 of the generalization test, respectively. Mauchly’s sphericity test for the repeated-measures effect showed that the main effect of Stimuli did not violate the assumption of sphericity (p = .34).

Table 23 Native Speakers’ Percent Approval Rate of Stimuli by Testing Condition

                                    Talkers
Pseudowords     T1-T2                 T7-T8                T9-T10                Mean
Trained         98.00 (Cond. 1)       94.00 (Cond. 2)      100.00 (Cond. 5)      97.00
Untrained       96.00 (Cond. 3)       95.00 (Cond. 4)      100.00 (Cond. 6)      97.00
Novel           100.00 (Cond. 9)      91.67 (Cond. 8)      100.00 (Cond. 7)      97.22
Mean            98.00                 93.56                100.00

Note. Cond. = Condition. Conditions 1 to 4 refer to the pretest/posttest. Conditions 5 to 9 correspond to the generalization test.

All aspects considered, it can be said that trainees were able to generalize learning to bilabial stops followed by novel vocalic contexts and produced by novel voices.

In the document: Diana Moreira de Oliveira (pages 117-122)