

In document Xuan Zhu (pages 115-118)


5.1.1 Audio input conditions

The purpose of the analysis of the audio input conditions is to illustrate the different complexity of the input signals in the broadcast news and meeting domains. This is particularly necessary for meeting data because of the different types of microphones used. In addition, the Speech-to-Noise Ratio (SNR) is estimated on the signals of each dataset. Although this is a simple SNR approximation based on a bimodal model of the log energy, it provides coherent SNR values that allow signal quality to be compared across data domains.
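Such a bimodal log-energy SNR estimate can be sketched as follows. This is an illustrative implementation, not the exact one used in this work: it fits a two-component Gaussian mixture to the frame log-energies with EM, takes the higher-mean component as speech and the lower as noise, and returns the gap between the two means (already in dB) as the SNR. Frame sizes and initialisation are assumptions.

```python
import numpy as np

def estimate_snr(signal, frame_len=400, hop=160, iters=20):
    """Rough SNR estimate via bimodal modelling of frame log-energies."""
    # Frame-level log-energies in dB
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    e = np.array([10.0 * np.log10(np.dot(f, f) + 1e-12) for f in frames])

    # Initialise a 2-component 1-D Gaussian mixture from the energy quartiles
    mu = np.percentile(e, [25, 75])
    var = np.array([np.var(e), np.var(e)]) + 1e-6
    w = np.array([0.5, 0.5])

    for _ in range(iters):
        # E-step: per-frame responsibilities of the two components
        ll = (-0.5 * (e[:, None] - mu) ** 2 / var
              - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        g = np.exp(ll - ll.max(axis=1, keepdims=True))
        g /= g.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        n = g.sum(axis=0)
        w = n / len(e)
        mu = (g * e[:, None]).sum(axis=0) / n
        var = (g * (e[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6

    # Gap between the speech-energy mode and the noise-energy mode, in dB
    return abs(mu[1] - mu[0])
```

On a signal whose "speech" segments are 40 dB above its "noise" segments, this estimator returns a value close to 40 dB, which is the kind of coarse but coherent figure needed for cross-domain comparison.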

Regarding the broadcast news shows first, there is a single audio input for each BN show. In order to further investigate the signal quality, the estimated SNR of each show from the RT-04F development database is given in Table 5.1. A large variation in SNR values is observed on broadcast news data, with an average SNR of 28.8 dB. This phenomenon can be explained by the organization of BN programs. On the one hand, each broadcast news show normally has an anchor who takes charge of the whole program; the corresponding part of the signal is recorded with a very good quality microphone in a closed studio, where the acoustic environment is very well controlled. On the other hand, many shows contain reports from the field, recorded in the presence of various background noises. Hence, the signal quality of broadcast news shows varies widely according to the amount of live reporting.

show                           SNR
20010206_1830_1900_ABC_WNT     70.2
20010217_1000_1030_VOA_ENG     28.2
20010220_2000_2100_PRI_TWD     22.5
20010221_1830_1900_NBC_NNW     69.5
20010225_0900_0930_CNN_HDL     71.5
20010228_2100_2200_MNB_NBW     43.7
20031115_180413_CSPAN_ENG      23.1
20031118_050200_CNN_ENG        48.7
20031120_003511_PBS_ENG        19.7
20031127_183655_ABC_ENG        22.0
20031129_000712_CNNHL_ENG      23.5
20031201_203000_CNBC_ENG       22.9
average                        28.8

Table 5.1: SNR estimations on the RT-04F broadcast news development dataset.

In the framework of NIST meeting recognition evaluations, the meeting domain has been divided into two sub-domains since 2005, with different microphone setups in each sub-domain:

Conference room meetings: these are carried out by several participants sitting around a meeting table, each wearing a high-quality close-talking microphone. Since the goal is a meeting, all participants are engaged in the conversation at a similar level of interaction.

Lecture room meetings: these are also called seminar-like meetings, where a lecturer normally stands in front of all the other participants. These seminars typically consist of a presentation by the lecturer followed by a question/answer session or discussion period. Lecture meetings thus have less interactivity between participants than conference meetings.

Different types of microphones are used in each meeting room, so multiple audio recordings are available, providing synchronized signals for each meeting. In the NIST meeting recognition evaluations, several audio input conditions were proposed for the different evaluation tasks and data types. Following the RT-06S meeting recognition evaluation [NIST, 2006], the list below explains each audio input condition and indicates which sub-domain it applies to:

Single Distant Microphone (SDM): this condition is defined as the audio recorded from a single omnidirectional microphone, centrally located on the meeting table. This audio recording is always included in the set of MDM channels. This condition is available for both conference and lecture meetings.

Multiple Distant Microphone (MDM): this evaluation condition comprises the audio collected from several omnidirectional microphones placed on a table between the participants. It is required as the primary task for the conference and lecture sub-domains.

Multiple Mark III Microphone Arrays (MM3A): in some lecture meeting rooms, the Mark III array is used to provide 64 signal channels. A beamformed signal generated by the University of Karlsruhe is available to each participating team for RT-06S data.

Multiple Source Localization Microphone Arrays (MSLA): this is an array of four digital microphones arranged in a "T" topology. As this microphone array was built and used by the CHIL project, this condition exists only in the lecture meeting sub-domain.

All Distant Microphone (ADM): this evaluation condition permits the use of all the distant microphones presented previously.

Individual Head Microphone (IHM): this condition includes the audio recordings collected from close-talking microphones worn by some of the participants in the meeting. It was not evaluated for the speaker diarization task, but was used for the speech-to-text and speech activity detection tasks.

Since the MDM condition is the primary evaluation condition for both conference and lecture meetings, the analysis of the audio input conditions in the meeting domain has been performed on the MDM condition. The number of channels available for the MDM task and the SNR estimates are given in Tables 5.2 and 5.3 for the RT-06S conference and lecture meeting development datasets, respectively. Compared to the single audio recording available for a broadcast news show, the MDM condition provides at least two audio inputs for the same meeting. The MDM channels are recorded using different distant microphones located at various positions on a meeting table, which leads to differences in the arrival time of the same sound source between the multiple distant channels. How to effectively exploit the multiple signals from the different microphones is a crucial problem for the MDM speaker diarization task. In the field of speaker diarization for meetings, various techniques have been proposed to use the information from all available channels [Jin et al., 2004; Anguera et al., 2005a; Anguera et al., 2005b; Istrate et al., 2005a; Pardo et al., 2006a; Pardo et al., 2006b].
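One common way to exploit these inter-channel arrival-time differences is delay-and-sum beamforming: estimate the delay of each channel relative to a reference channel, then align and average the channels into a single enhanced signal. The sketch below uses GCC-PHAT for the delay estimation; it is illustrative only and is not the implementation of any of the cited systems. The function names and the 20 ms maximum-delay bound are assumptions.

```python
import numpy as np

def gcc_phat_delay(ref, ch, fs, max_delay=0.02):
    """Delay of `ch` relative to `ref` in seconds, via GCC-PHAT."""
    n = len(ref) + len(ch)
    # Cross-power spectrum with PHAT weighting: keep the phase, drop the magnitude
    R = np.fft.rfft(ch, n) * np.conj(np.fft.rfft(ref, n))
    R /= np.abs(R) + 1e-12
    cc = np.fft.irfft(R, n)
    # Restrict the peak search to physically plausible delays
    max_shift = int(max_delay * fs)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(channels, fs):
    """Align every channel to the first one and average them."""
    out = np.array(channels[0], dtype=float)
    for ch in channels[1:]:
        d = int(round(gcc_phat_delay(channels[0], ch, fs) * fs))
        out += np.roll(ch, -d)  # advance a lagging channel by d samples
    return out / len(channels)
```

Averaging the aligned channels reinforces the speech (coherent across microphones after alignment) relative to the noise (largely incoherent), which is why such beamformed signals are a popular front end for MDM diarization.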

meeting               channels   min. SNR   max. SNR   ave. SNR
AMI_20041210-1052        12        10.6       14.0       12.6
AMI_20050204-1206        16        10.6       12.8       11.5
CMU_20050228-1615         3         9.6       10.3       10.0
CMU_20050301-1415         3        12.3       13.3       12.7
ICSI_20010531-1030        6         6.9       10.1        8.5
ICSI_20011113-1100        6         8.3       10.1        9.2
NIST_20050412-1303        7        13.7       24.4       19.4
NIST_20050427-0939        7         8.3       14.9       11.3
VT_20050304-1300          2        17.2       19.7       18.4
VT_20050318-1430          2        15.6       18.5       17.1
all                       -         6.9       24.4       12.4

Table 5.2: SNR estimations on the RT-06S conference development dataset.

Looking into the details of the SNR measurements, it can be seen that the meeting data have lower SNR estimates than the broadcast news data. In detail, the average SNR of both the conference and lecture meeting recordings is around 12.5 dB, while the broadcast news shows have an average SNR close to 30 dB. This shows that meeting recordings usually have lower signal quality than broadcast news ones. A possible reason is that the distant microphones employed in the meeting domain commonly have lower quality than those used in broadcast studios. Moreover, for some meetings (e.g. the "ITC_20050429_Segment1" seminar), the SNR estimate varies largely across the different microphone channels.
