
ESTER experiments

No documento Xuan Zhu (páginas 104-107)

In 2005, the multi-stage speaker diarization system discussed in Section 4.2 was also entered in the French ESTER Broadcast News evaluation [Galliano et al., 2005], where it provided the best speaker diarization performance of all submitted systems. This section presents the experiments conducted with the multi-stage system on the ESTER databases. The default system configuration uses the same settings as in the RT-04F evaluation, i.e. λ = 5.5 for the c-bic system and λ = 3.5, δ = 0.1 for the c-sid system.

4.4.1 Database description

The data used in ESTER were extracted from French radio broadcast news shows and were provided by ELDA (Evaluations and Language resources Distribution Agency) and the DGA (Délégation Générale pour l'Armement). The training corpus contains a total of 82 hours of data from the France Inter, France Info, RFI and RTM radio stations, recorded in 1998, 2000 and 2003, with audio file durations ranging from 10 minutes to 1 hour. The development corpus contains 8 hours of data in 14 audio files, recorded from April to July 2003 from the same stations as the training corpus. The evaluation corpus comprises 18 audio files recorded from October to December 2004, with a total duration of 10 hours. It contains data from two radio stations ('France Culture' and 'Radio Classique') not present in the training or development corpora. There is also a 14-month interval between the recording periods of the development and test data (July 2003 to October 2004). Both data sets show a large variability in audio file durations (10 minutes to 1 hour). Table 4.7 summarizes all databases used in the ESTER evaluation.

source            train      development                evaluation
                  duration   duration         epoch     duration              epoch
France Inter      33 hr.     2×1 hr.          2003      2×1 hr.               2004
France Info       8 hr.      2×1 hr.          2003      1×1 hr. + 2×30 min.   2004
RFI               23 hr.     2×1 hr.          2003      4×30 min.             2004
RTM               18 hr.     8×(10-20 min.)   2003      7×(14-22 min.)        2004
France Culture    -          -                          1×1 hr.               2004
Radio Classique   -          -                          1×1 hr.               2004

Table 4.7: The databases used in the ESTER evaluation.

4.4.2 Results on ESTER development data

For the French ESTER data, SAD was performed using the same American English speech/non-speech acoustic models as in the RT-04F evaluation, plus an additional speech-over-music model trained on French broadcast news data to better recognize the jingles found in the French radio data. The UBMs employed in the SID clustering stage were trained on the same English Broadcast News data as in the RT-04F evaluation. The optimal threshold for the SID clustering on the development data was δ = 1.5. As can be seen in Table 4.8, the c-sid system has low purity and coverage error rates (4.7% and 5.2% respectively) on the ESTER development data. Adding the c-sid system to the c-bic system reduces the overall error rate by about 50% relative (from 15.8% to 8.0%). The δ threshold found for the RT-04F data (i.e. δ = 0.1) did not carry over to the ESTER data. This may be due to the larger variability in show sources, durations and types observed in ESTER, and to the mismatch between the UBMs and the ESTER data.

system                     purity error (%)   coverage error (%)   overall DER (%)
c-bic (λ = 5.5)            7.2                10.6                 15.8
c-sid (λ = 3.5, δ = 1.5)   4.7                5.2                  8.0

Table 4.8: The purity, coverage and overall diarization error rates from the c-std, c-bic and c-sid systems on the ESTER development dataset.
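The relative gain from adding the SID clustering stage can be checked directly from the overall DER column of Table 4.8. The snippet below is a minimal sketch; the helper name `relative_reduction` is introduced here for illustration only and is not part of the evaluation tooling.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, in percent, going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# Overall DER values (%) taken from Table 4.8 (ESTER development data).
der_c_bic = 15.8
der_c_sid = 8.0

reduction = relative_reduction(der_c_bic, der_c_sid)
print(f"{reduction:.1f}% relative reduction")
```

The result is just under 50%, which matches the reduction quoted in the text.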

4.4.3 SID clustering threshold

Figure 4.8 shows the speaker match error and the cluster purity error as a function of the SID clustering threshold for the c-sid system. The result is similar to that observed on the RT-04F development data (cf. Figure 4.6), with the BIC penalty weight set to 3.5 in both cases: as the threshold δ is reduced within the positive range, the speaker match error decreases while the cluster purity error remains stably low; as δ is reduced further, the speaker match error increases rapidly, and the purity error follows the same trend. However, the minimum speaker match error is obtained at different δ values on the two development datasets (around zero for RT-04F and around 1.5 for ESTER). The best threshold for stopping the SID clustering therefore depends strongly on the specific data. In addition, the minimum speaker match error on the RT-04F data is lower than that achievable on the ESTER data. A possible reason is that the UBMs trained on English broadcast news data for RT-04F were used unchanged on the ESTER development data.
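Tuning δ on development data amounts to a one-dimensional sweep: run the system at each candidate threshold, score it, and keep the threshold with the lowest error. The sketch below illustrates only that selection step. The function `toy_dev_error` is a hypothetical stand-in for "run c-sid at threshold δ and score the output"; its values are invented for illustration and are not read from Figure 4.8. The grid simply mirrors the range of the figure's horizontal axis.

```python
from typing import Callable, Iterable, Tuple


def select_threshold(
    dev_error: Callable[[float], float], grid: Iterable[float]
) -> Tuple[float, float]:
    """Return (best_delta, error) minimizing dev_error over the grid."""
    error, delta = min((dev_error(d), d) for d in grid)
    return delta, error


def toy_dev_error(delta: float) -> float:
    """Hypothetical error curve with its minimum at delta = 1.5.

    A real implementation would run the diarization system with this
    threshold on the development set and return the measured error.
    """
    return (delta - 1.5) ** 2


# Candidate thresholds from -1.5 to 2.5 in steps of 0.5.
grid = [step / 2 for step in range(-3, 6)]

best_delta, best_error = select_threshold(toy_dev_error, grid)
print(best_delta, best_error)
```

Because the best δ differs between RT-04F and ESTER, a sweep of this kind would need to be repeated on development data that matches each new target domain.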

4.4.4 ESTER evaluation results

Results on the ESTER evaluation data are given in Table 4.9, with the parameter settings optimized on the development data. The overall diarization error was reduced from 13.8% for the c-bic system to 11.5% for the c-sid system. The submitted system also had the best performance for the speaker diarization task in the ESTER evaluation [Galliano et al., 2005]. Note that these scores are computed over all the transcribed audio except long regions of music and advertisements. Since long silence segments are also evaluated in this case, the diarization system must be able to detect them; otherwise a high false alarm error is produced. However, when the performance metrics are measured strictly on labeled speaker segments and the long silence regions are not taken into account, the overall DER is reduced, in particular the false alarm error (e.g. for the c-sid system, the DER and false alarm error are reduced to 10.2% and 0 respectively). In a post-evaluation experiment, a 20% relative reduction of the overall diarization error was observed for the c-sid system with the best a posteriori threshold, showing that the error rate is highly dependent on the clustering threshold.

[Figure: error rate (%) versus SID clustering threshold δ; curves: speaker match error, cluster purity error.]

Figure 4.8: Speaker match error and purity error rates on the ESTER development dataset for the c-sid system as a function of the SID clustering threshold δ.

system                      missed speaker   false alarm speaker   speaker match   overall
                            error (%)        error (%)             error (%)       DER (%)
results submitted to ESTER
c-bic (λ = 5.5)             0.7              1.0                   12.1            13.8
c-sid (λ = 3.5, δ = 1.5)    0.7              1.0                   9.8             11.5
post-evaluation results
c-sid (λ = 3.5, δ = 2.0)*   0.7              1.0                   7.4             9.1

Table 4.9: Performance of the c-bic and c-sid systems on the ESTER evaluation data.
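The rows of Table 4.9 are consistent with the overall DER being the sum of the missed speaker, false alarm speaker and speaker match errors. The following sketch recomputes the overall figures and the relative gain of the post-evaluation threshold; the class `DiarizationScore` and its field names are illustrative assumptions, not part of any scoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiarizationScore:
    """Error components in percent, as listed in Table 4.9."""
    missed: float
    false_alarm: float
    speaker_match: float

    @property
    def overall(self) -> float:
        # Overall DER as the sum of the three components, rounded to
        # the one-decimal precision used in the table.
        return round(self.missed + self.false_alarm + self.speaker_match, 1)


scores = {
    "c-bic (submitted)": DiarizationScore(0.7, 1.0, 12.1),
    "c-sid (submitted)": DiarizationScore(0.7, 1.0, 9.8),
    "c-sid (post-evaluation)": DiarizationScore(0.7, 1.0, 7.4),
}

for name, score in scores.items():
    print(f"{name}: overall DER = {score.overall}%")

submitted = scores["c-sid (submitted)"].overall
post_eval = scores["c-sid (post-evaluation)"].overall
gain = 100.0 * (submitted - post_eval) / submitted
print(f"relative reduction with the a posteriori threshold: {gain:.1f}%")
```

The recomputed relative reduction is close to 21%, in line with the figure of about 20% quoted in the text.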

For the submitted c-sid system with the best performance, the per-show and total diarization errors on the ESTER evaluation data are given in Figure 4.9. As with the RT-04F evaluation results (see Figure 4.7), the system performance varies widely across shows of different durations. To investigate whether the acoustic characteristics of the data correlate with the diarization error rates, the system output for the show "1026_30_RFI", which has the highest speaker match error (approximately 30%), was analyzed. This 30-minute show contains a 20-minute on-location report from an American evangelical community on the American presidential campaign, in which the same reporter appears against different background noise, music or noisy background speech from different people, and the English speech of the interviewees is overlapped, totally or alternately, by its French translation. Because of this complexity in the acoustic environments, the diarization system tends to split the speech of a single speaker into several clusters corresponding to different acoustic environments, which yields a higher speaker match error.

[Figure: per-show DER (%), with legend entries SPE, FA and MS, for the shows 1008_30_INFO, 1012_30_INFO, 1013_60_INFO, 1007_60_INTER, 1011_60_INTER, 1006_60_CLASSIQUE, 1006_60_CULTURE, 1025_30_RFI, 1026_30_RFI, 1027_30_RFI, 1124_30_RFI, 1217_22_RTM, 1218_14_RTM, 1219_14_RTM, 1220_14_RTM, 1221_21_RTM, 1222_20_RTM, 1223_18_RTM, and for all shows combined (ALL).]

Figure 4.9: Per-show and total ESTER evaluation results from the c-sid system.
