On the development of an automatic voice pleasantness classification and intensity estimation system

(1)

ComputerSpeechandLanguagexxx(2012)xxx–xxx

On

the

development

of

an

automatic

voice

pleasantness

classification

and

intensity

estimation

system

夽

Luis

Pinto-Coelho

a,b,∗

,

Daniela

Braga

c

,

Miguel

Sales-Dias

b,d

,

Carmen

Garcia-Mateo

e a_Polytechnic_Institute_of_Porto,_Portugal

b_Microsoft_Language_Development_Center,_Portugal c_TTS_Team,_Microsoft,_China

d_ISCTE-Lisbon_University_Institute,_Portugal e_University_of_Vigo,_Spain

Received9May2011;receivedinrevisedform19January2012;accepted20January2012

Abstract

Inthelastfewyears,thenumberofsystemsanddevicesthatusevoicebasedinteractionhasgrownsignificantly.Foracontinued useofthesesystems,theinterfacemustbereliableandpleasantinordertoprovideanoptimaluserexperience.Howeverthereare currentlyveryfewstudiesthattrytoevaluatehowpleasantisavoicefromaperceptualpointofviewwhenthefinalapplication isaspeechbasedinterface.Inthispaperwepresentanobjectivedefinitionforvoicepleasantnessbasedonthecompositionofa representativefeaturesubsetandanewautomaticvoicepleasantnessclassificationandintensityestimationsystem.Ourstudyis basedonadatabasecomposedbyEuropeanPortuguesefemalevoicesbutthemethodologycanbeextendedtomalevoicesorto otherlanguages.Intheobjectiveperformanceevaluationthesystemachieveda9.1%errorrateforvoicepleasantnessclassification anda15.7%errorrateforvoicepleasantnessintensityestimation.

Keywords: Voicepleasantness;Subtleemotions;Perceptualspeechanalysis;Text-to-Speechsynthesis

1. Introduction

Text-to-Speech(TTS)technologyhasdramaticallyimprovedinthelastfewyearsandtheinclusionofthismature technology in our dailylives is now a reality.We caneasily find examples such as GPSs, smartphones, reading assistants,interactivevoiceresponse(IVR)andautomotiveapplicationsthatassistusinseveraltasks.Intelligibility isfullyrequiredforanycommercialsystemandseveralsystemsarebecomingprogressivelymorenatural.However thereareusersthatdon’tfeelfullysatisfiedwiththeirproductsandoftensaythatthevoiceisboring,thatthestyleis notveryfriendlyoreventhattheydislikethesoundofthevoice,amongothers.Totrytoanswertothesecomplaints, wedecidedtostudytheconceptofvoicepleasantnessaccordingtothedefinitionsfoundinFellbaum(1998)inwhich “Pleasantnessisthefeelingcausedbyagreeablestimuli”,andinOxfordDictionary(2009)wherepleasantnessiswhat

夽 _This_paper_has_been_recommended_for_acceptance_by_‘Björn_Schuller,_PhD’.

∗_{Corresponding}_author_at:_Polytechnic_Institute_of_Porto,_Portugal._Tel.:₊₃₅₁₉₁₄₂₁₆_862;_fax:₊₃₅₁₂₁₄₂₁₈_488. E-mailaddress:[email protected](L.Pinto-Coelho).

(2)

gives“asenseofhappysatisfactionorenjoyment”.Thisisdistinctfromtheconceptofattractiveness(“appealingto thesenses”,“sexuallyalluring”OxfordDictionary,2009)whichwasalsoevaluatedbutisoutofthescopeofthiswork. Theconceptofvoicepleasantnessandthesuitabilityofagivenvoicetobeusedonagivencommunicativesituation arebarelytouchedbyscientificliterature(Schröder,2009).Recently,Campbell(2008)integratedtheconceptinthe areaofexpressive/affectivespeechasaformofsubtleemotionwhereitappearsasadesiredfeatureinTTSsystems whiletheinteractionbetweenspeakerandlistenerisshiftedfromatypical“readspeechtowardsamoreconversational styleofspeech”.Tocharacterizeexpressivespeechweoftenfindreferencestovoicequalityandprosodyparameters andMonteroetal.(1999)andAudibert etal.(2006) showedthat thesecanbecombinedwithdistinct weightsto conveyagivenexpression.Additionally,voicepleasantnesscanbeseenasamorepermanentattributethantheintense andtimelimited emotionsoftenfoundintheexpressivespeechliterature.Thischaracteristicallowstoenvisionan integrationonaspeakeridentificationframework.Fromthesestudiesanddistinctperspectiveswestarttoseethatvoice pleasantnessisindeedacomplexconceptthatencompassesseveraldimensionsand,inthisway,itsanalysiscanbenefit fromareassuchasvoicequality,speakeridentificationandemotionrecognition.Basedonourtheoreticalframework wewillmainlyrelyonthelaterbutalwayswithsupportfromthefirsttwo.

WehavepreviouslyexploredcorrelationsbetweensubjectiveratingsandobjectivefeaturesinBragaetal.(2007b)

andthe creation of the database on whichwe relied is described in Bragaet al.(2007a).A first study on voice pleasantnessclassification,usingasmallmultilingualcorpus,waspublishedinCoelhoetal.(2011).

Inthisworkthemaingoalsandnoveltiesare(a)thepresentationofanobjectivedefinitionforvoicepleasantness basedonthecompositionofarepresentativefeaturesubsetand(b)theoptimizationofavoicepleasantnessevaluation pipelinethatallowstoautomaticallyidentifyoccurrencesoftheconceptaswellasestimatingitsintensity.

WehavespecificallyfocusedonEuropeanPortuguese(EP)sincewehadaveryhomogeneousdatabase withan extensivebaseofsubjectiveevaluations.Themulti-lingualproblemwillbeanalysedinafuturework.

1.1. Background

Fromthelistener’spointofview,whenevaluatingtheimpactofthemessageasstimuli,thereareveryfewstudiesbut thetopicisgainingimportance.Infirstreportedstudiestheauthorsmainlyexplorecorrelationsbetweensubjectiveand objectivedataandthesupportingdatabasesareoftensmallandnotspecificfortheanalysisofthevoicepleasantness topic.Forexample,Syrdaletal.(1998)conductedastudyinordertocheckthesuitabilityofaspeakers’voiceto developaTTSsystem,basedontheassumptionthattheperceivedqualityofanaturalvoicedoesnotnecessarilymean synthesizedvoicequality.Theauthorshaveexploredthecorrelationbetweenacousticcharacteristicsandthesubjective attributesofsyntheticspeechquality(intelligibility,naturalnessandpleasantness).Thespeakers’selectionprocessis saidtohavebeenmadeempiricallyalthoughallthecandidates(6femalesand9males)wereprofessionalspeakers. AnotherexampleistheworkofYabuokaetal.(2000)whoevaluatedpsychometricscalesandcorrelatedtheresults withanobjectiveevaluationoftheacousticsignal.Theanalysiswasmainlydirectedtoinvestigatethebehaviourofthe scalesandnottotheobjectivedefinitionofthepsychometricscales.Theauthorsshowedthattheobjectivevariables canbegroupedintotwofactors:onerelatedto“clarity”andtheotheronerelatedto“sensation”.However,concerning thetypeandstructureofthesubjectivetest,verylittledetailsweregiven.

Fromanemotionperceptionpointofview,GoblandChasasaide(2003)andYanushevskayaetal.(2008)showed thatitcanbehighlyinfluencedbyvoicequalityparametersandLuggerandYang(2008)haveusedthesameparameters toidentifyhappy,bored,neutral,sad,angryandanxiousstateswithverygoodresultsforasetof10speakers.However thereisstillalackofagreementinexistentliteratureonthedefinitionoftheoptimalfeaturesforcharacterizingemotions (Schuller,2010).

On similarfields, we find the workof Biadsyet al.(2008)that explored judgements of charisma inStandard American Englishand Palestinian Arabic speech ina series of perceptual experiments.The authors investigated acoustic,prosodic,andlexicalcluesthat couldidentifycharismaticspeech. Fromanobjectivepointofview, using acoustic–prosodicfeatures,theauthorsfound,forbothcultures,thatlongerspeechsegments,withfrequentchangesin

f0,adynamicusageofdifferentspeakingratesandvariationsinintensityacrossintonationalphrases,wereperceived asbeingcharismatic.

Morerecently,wecanfindstudiesthatencompassanextendedsetoffeaturesandrelyonmachinelearningtechniques foradeeperanalysis.WeissandBurkhardt(2010)correlatedlistener’slikabilityratingswithanacousticanalysisusing emotionallyneutral sentencesfromemoDB(BurkhardtandPaeschke,2005).Higharticulationrate,lowerspectral

(3)

Fig.1.System’sdevelopmentpipeline.

centreofgravityandhigherspectralstandarddeviationandskewnessprovidedsomeofthehighestcorrelationsfor femalespeakers.Inpart,theseresultsarealignedwiththoseobtainedbyBragaetal.(2007a).Theauthorshavealso developedabinaryclassificationsystemthatachieved62.9%accuracy.Burkhardtetal.(2011)havepresentedresults forabinaryclassificationofvoicepleasantnessusingtheAgenderdatabase(Burkhardtetal.,2010).Despitethelarge numberofspeakers,theevaluationswereperformedusingsentenceswithacommandembeddedandnotalwayswith thesamecontent.Thereportedaccuracyofthesystemwas67.6%.

1.2. Workoverviewanddocumentstructure

Forourpurposeswewillfollowageneralmachinelearningframework.Thelackofagreementinexistingliterature onthedefinitionoftheoptimalfeaturesforcharacterizingemotionsandthescarcenumberofstudiesthatspecifically covervoicepleasantnessarethemainmotivationsfortheuseofthisapproachsince,inthepresenceofsuchconditions, itcanprovidearobustandflexibleframeworkindependentlyofdata’snatureandallowstheexplorationof alarge numberofparametersforsystem’sfinetuning.Ourpipelineisorganizedonasixstagesequence,asdepictedinFig.1, withfunctionblocksrepresentingthemaintaskssets.Thespecificfunctionsperformedoneachblockwillbedetailed. Inthenextsection,wepresentadescriptionof theusedmethodologystartingwithanoverviewofthesystem’s architecturewhichwillthenbedetailedintheensuingsub-sections.Wewillthoroughlycovertheconstructionofour databaseandhowweanalysedtheparalinguisticinformationconveyedinthemessageinparallelwiththepsychometric variablestomeasurethereactionofthelistenertotheprovidedstimuli.Afterwards,wewilldescribetheextraction andselectionoffeaturesaswellasthedevelopmentofamodelthatcouldholdthemainvariablesthatdescribevoice pleasantnessandhow theyinteractbetween themselves.Finally,after training andevaluatingthe system,we will presentathoroughdiscussionoftheobtained resultsandthefinalconclusionsfollowedbysomeenvisionedfuture work.

2. Experimentalframework

2.1. Database

Thedatabasethatsupportedthedevelopmentofthisworkisexclusivelycomposedbyfemalevoices.Thesevoices belongtoprofessionalspeakers,withradio,theatreorothervocalexperience,andwererecordedduringvoicetalent selectionprocessesforthedevelopmentofnewTTSsystems.Therelatedrecordingprocedureaswellasthequality requirementsthatwereimposedduringtheprocesseshavebeenpreviouslypublished(Bragaetal.,2007b)andforthis workwewillonlymentiontherelevantfeatures.Ourdatabasewascomposedby77recordingsofdistinctspeakers, around3minofspeecheach,containingphoneticallyandprosodicallyrichsentenceswhichallowedtheexpression of emotions.The speakers, with EP as mother tongue, had an age range between 20 and 42years andwere all native,speakingthestandarddialectalvariety(sinceregionalvarietiescanleadtolesspositiveevaluationsEklund and Lindström, 2001; van Bezooijen, 2005). All the speakers hadtheir own very personal speaking style which allowedtoobtainagooddiversityforeachparameter.

2.2. Subjectiveevaluation

Toevaluatetherecordedspeechweconductedasurvey,usingawebapplication,whereeachutterancewasrated accordingtopleasantness(andotherfactors)usinga5pointsscale.Theexactstatementtoberatedwas“Thisvoiceis themostpleasant,hastherightmelodyandisnotmonotonousatall”.Theutteranceswerepresentedinrandomorder andthelistenerswereaskedtouseheadphonesduringtheevaluation.Thesurveyreceived112responses,frommale andfemalelisteners,nativeandnon-nativespeakers,inanagerangefrom23to60yearsold.Theinter-rateragreement

(4)

wasanalysedusingtheKendal’s(KendallandSmith,1939)coefficientofconcordance.Theobtainedvaluewas0.673 whichindicatesmoderatetosubstantialagreementamonglisteners.

Thesubjectiveevaluationraisedseveralissuesrelatedtopotentialbiasingfactors.Sex,age,expertiseor native/non-nativespeaker,factorsthatgobeyondthesimpleselectionofparameters,canpossiblybiasthelisteners’judgement analysis.Onemajorconcernwasthelevelofexpertiseonspeechprocessing,sincethelistenershaddistinctbackgrounds. AsimilarproblemisaddressedbyKreimanetal.(1990)thatconcludes“thatperceptualstrategiesbetweenmoreand lessexperiencedlistenersarenotdifferent,butratherthattheselistenersadoptdifferentbaselinesduringperceptual tasks”.Toreducethegroupvariancethelistenerswereaskedtoratethevoicesmoreemotionallyratherthanusingany oftheirpreviousexperienceonthesubject.

2.3. Protocol

Usingthelisteners’voicepleasantnessratings(1–5)wedividedthedatabaserecordsintotwomajorclasses:pleasant (P)andnotpleasant(NP).Thefirstclasscontainedthevoicesrattedinthetwohighestscalepositions(35voices)while thesecondwascomposedbythevoicesrankedintheremainingpositions(42voices).Thevoicepleasantnessratings wereusedtotrainthevoicepleasantnessintensitymodelwhiletheclasseswereusedtotraintheclassificationmodule. Wehaveuseda10-foldcrossvalidationmethodologyandoneverygroupwehavetriedtokeepasimilarP/NPratio.

3. Featureextractionandselection

Theconstructionofanoptimalfeaturevectorisalwaysamajorconcerninmachinelearningproblems.Ononehand, becauseitisverydifficult,inthepresenceonanewproblem,toidentify,fromanextremelylargesetofpossibilities, whatfeatureswillbetterdescribeourclassesandleadtothebestresults.Ontheotherhand,becausethecomplexity ofmostmethodologiesgrowsexponentiallywhenthenumberoffeaturesincreases,havingareducedfeaturesetcan helptosignificantlyreducethecomputationalrequirements.

3.1. Featureextraction

Forthestudyofspeechwemainlyfoundtwoapproaches:one,whereasmallsetofdescriptorsiscarefullyselected basedontheauthorsexpertiseandanotherwhereahugenumberoffeaturesisextractedandprocessedinaposterior stage.In thelatestyears,withthe appearanceof severalfeatureextractiontools,suchas OpenEAR(Eyben etal., 2009),andwiththeimprovementoffeatureselectionalgorithms,wehaveassistedtoanincreaseonthenumberof featuresallowingtoexploreadditionaldimensions.Forthesereasonsthesecondapproachisbecomingmorepopular. Inourcase,sincewewereworkingwithasmalldatabase,wehadanextraconcernbecausemostmachinelearning algorithmsmethodsrequireasufficientnumberofrecordsinrelationtothenumberoffeatures inordertoprovide robustandmeaningfulresults.Wedecidedtouseamixedapproachwherewefirstselectedasetoffeatures,based ontheexperienceofotherauthorsinthisareaandinadjacentareas,andsecondweusedafeatureselectionstageto obtainanoptimizedsubset.

Hencewestartedwithfeaturesfromtheclinicalandvoicequalityareas.TheGRBAS(grade,roughness,breathiness, astenyandstrain)scale(PinhoandPontes,2002)isatypicalsubjectivescaleintheclinicalfieldalthoughsomeobjective featuresarealsoused.Fundamentalfrequencyandassociateddynamics,aswellasshimmer,jitter,HarmonictoNoise Ratio(HNR)(Lopesetal.,2008)andNormalizedAmplitudeCoefficient(NAQ)(Alkuetal.,2002),arepopularfeatures. Fromanintelligibility perspectivewe seethat itsassessmentishighly dependentonthe properarticulationof the languagesoundsaccordingtoitsstandardandthatitcanvarysignificantlywhendialectalvariationsareinvolved,even amongspeakersofthesamelanguage.HoweverAmano-KusumotoandHosom(2009)andlaterCoelhoetal.(2010)

showedthatparameterssuchasformanttrajectoriesinvowelsandsegmentaldurationshaveparamountimportance inspeechintelligibility.From the naturalnesspointof view wehave foundreferences tomeasuresof artefacts or discontinuities(Mayoetal.,2011).Fromthespeakeridentificationpointofviewwemainlyfindacousticfeaturesor spectralmodelswhicharecombinedintoaspeaker modelthatcanbeseenasauniquevoicesignature(Campbell, 1997;KinnunenandLi,2010).Finally,fromtheemotionalpointofview,thereisstillalackofagreementconcerning thefeaturesthatleadtooptimalidentificationresults(Schuller,2010)whilethetaskcomplexityhasenlargedwiththe highincreaseonthenumberofinvolvedfeaturesovertheyears.Forexample,Ververidisetal.(2004)used87acoustic

(5)

featurestorecognize5emotions,Schulleretal.(2006)used4000parameterstoidentify7emotionswhileinStuhlsatz etal.(2011)a6552dimensionalvectorwasused.Suchalargenumberoffeaturesrequiresverylargedatabasesin ordertoachievesufficientmodelrobustnessandonlycross-corporastudiescanleadtomoregeneralsolutions(Schuller etal.,2010;Stuhlsatzetal.,2011).

Onthevoicepleasantnessareatherearesmalldifferencesfromtheemotionanalysisintermsofextractedfeatures. IntheworkofSyrdaletal.(1998),RMSenergy,breathiness,long-termspectra,f0,formantsandtheirbandwidths, speakingrate,andTTSconcatenationandtargetcostsareusedforTTSvoicepleasantnessevaluation.Ontheother hand,Yabuokaetal.(2000)showedthattheobjectivevariables,suchascepstrumdistance,amplitudedistortion,and waveformdistortion,canprovideasensationof“clarity”whilephasedistortionanddifferentialspectrumdistortion canberelatedto“sensation”.Chattopadhyayetal.(2003)usedspeechrate,pausingandpitchtoidentifyapositive attitudetowardsvoicesadvertisement.Alsoinapersuasionrelatedstudy,WeissandBurkhardt(2010)used988features forcorrelatingvoicelikabilitywithsubjectiveratingsforclassificationpurposes.Burkhardtetal.(2011)used4368 featurescomprisinggroupsoflow-leveldescriptorsandasetoffunctionalsforeachgroup.Auditoryspectralfeatures providedthebestcontributionforreliablebinaryvoicelikabilityclassificationestimates.

NeverthelessandasinSchulleretal.(2007),formostcasesacombinationofprosody,voicequalityandspectral parametersandtheirdynamicsalwaysseemtoprovidethebestresults.Hence,ourprototypevectorencompasseda broadrangeofsignalaspects,coveringintra-andinter-periodcharacteristics,timeandfrequencydomainscontents andseveralstatisticsthatcouldcomplementtherawinformation.Itwasorganizedinfourgroups:(a)acousticfeatures, (b)signalfeatures,(c)periodicityfeaturesand(d)phonationspeedfeatures.

ForthecreationoftheprototypefeaturevectorwealsotookintoaccounttherecommendationsofWolf(1972)who advocatesthattheusedfeaturesshouldoccurnaturallyandfrequentlyinnormalspeech,beeasilymeasurable,have highvariabilitybetweenspeakers,beconsistentforeachspeaker,notchangeovertimeorbeaffectedbythespeaker’s healthandnot beaffectedbyreasonable backgroundnoise.In the firstgroup (a),we consideredthe fundamental frequency(f0)envelopeanditsfirst(f0)andsecondderivatives(f0).Fromthesewecalculatedfourfirstorder statistics,namelyaverage(Av),standarddeviation(Std),skewness(Sk)andkurtosis(Kt),andextractedthemaximum (Max)andminimum(Min)valuesoftheenvelope(minimumwasobtainedexcludingzeros).Forfourvowels,namely /a/,/6/,/i/and/u/(usingIPA-SAMPArepresentation)whicharethemostfrequentforEP(Teixeiraetal.,2001),we haveextractedthefirstfourformantsandtheirrelatedbandwidth([V]irepresentsthefrequencyofformantiforvowel [V])andalsocalculatedthesixabove-mentionedfunctionals.Thesecondgroup(b)includedtheinstantaneouspower (P)obtainedbyfollowingasimilarproceduretotheonedescribedforf0.Apossiblevoicequalityfactoristhestability andcrossperiodcoherenceofthesignalinvoicedsounds.Thisinformationwasincludedinthethirdgroup(c)where we haveconsideredjitter(J),shimmer (S) andharmonic-to-noiseratio (HNR).Foreachparameterwe considered severalvarietiessincetheycouldprovidenon-redundantinformation.Forjitterwehaveconsideredlocaljitter(Jloc) wehavealsoincludedothersimilarmetrics(thatcanprovidenon-redundantinformation):absolutejitter(Jabs),relative averageperturbation(Jrap),periodperturbationquotientandperiodicdifference(Jddp).Weaccountedforsixvarieties of shimmer:local(Sloc),localindB(SdB), periodicdifference(Sddp)andthreeamplitudeperturbationquotients computedfor distinctneighbourhoods using3, 5and11points. Wealsousedharmonic tonoiseratio(HNR) and thesamevalueindB(HNRdb).Alljitter,shimmerandharmonicityparameterswerecalculatedaccordingtoPraat’s (Boersma,2001)description.Finally,thelastfeaturegroup(d)composedofphonationspeedmetrics,includesthe wordrate(WR),aswordspersecond,speakingrate(SR),asphonemespersecond,andpauserate(PR),astherelation betweenpausetimeandtotalspeakingtime.ThefinalprototypevectorcompositionisshowninTable1.

Besidesthedescribedparameterswealsohaveconsideredaspeakermodelbasedon16MFCCsextractedfrom 20mswindowswith5ms overlaps,considering4 Gaussianmixtures.Thismodel,basedonperceptuallyweighted parameters,canprovideusefulinformationonspeakeridentificationsystemsthatcanshareseveralsimilaritieswith thevoicepleasantnessevaluationproblem.Sincethismodelcanonlybeconsideredasawholewetreateditseparately andnoMFCCfeatureswereusedduringthefeatureselectionstage.

3.2. Multi-criteriafeatureselection

Theprototypefeaturevectordescribedintheprevioussectioniscomposedby179dimensions.Thisnumberbrings an increased complexity for the development of the classifier andsome of the components may provide little or redundantinformationfor thetask.Additionally, fromamachinelearningpointof view, itis necessarytohavea

(6)

Table1

Compositionofprototypefeaturevectorforvoicepleasantnessanalysis.

Group Feat. Stat. #

Acoustic f0,f0,f0 Av,Std,Kt,Sk,Min,Max 18

4× Vow.:4Fmt Av,Std,Kt,Sk,Min,Max 64

4× Vow.:4Bw Av,Std,Kt,Sk,Min,Max 64

Signal P,P,P Av,Std,Kt,Sk,Min,Max 18

Periodicity Jitter – 4

Shimmer – 6

HNR,HNRdb – 2

Phonationspeed WR,SR,PR – 3

sufficientlylargedatabasetoensureanadequategeneralizationperformanceoftheinvolvedmodels.Thisisusually notthecase.Sincetheincreaseinthenumberofrecordsmaynotbeeasy,itiscommontousetwodistinctcategories ofalgorithmstoreducethesizeofthefeature vector:oneoperatesbytransformingthefeaturespaceandtheother oneisagoalorientedsearchinordertofindanoptimizedfeaturesubsetspace.Inbothcasestheaimistomaximize theclassdiscriminativepowerandtoreduceinformationredundancy(TheodoridisandKoutroumbas,2006).Inthe firstcategorywecanfindlinearmethods,suchasPrincipalComponentsAnalysisorMultidimensionalScaling(Borg andGroenen,2005),andnon-linear methods,likeSelfOrganizingMaps(Kohonen,2000)orISOMap(Tenenbaum etal.,2000).vanderMaatenandHinton(2008)andvanderMaatenandvandenHerik(2009)havecompared32 populardimensionalityreductiontechniquesandhaveconcludedthat“nonlineartechniquesareoftennotcapableof outperformingtraditionaltechniquessuchasPCA”.Inthesecondcategorythereisawidevarietyofmethodologiesand thelargeamountofpublishedarticles,oftencontradictory,makesitdifficulttoclearlyselecttheoptionthatwillbest suitagivenproblemordataset.Zhaoetal.(2009)performedacomparisonoftwelvecommonlyusedfeatureselection methodologiesusingseveraldistinctdatasetsandshowedthatsimplemethodologies,suchasthet-test,canperform asgoodasothermodernandmorecomplexapproaches,liketheKruskal–Wallismethodology(CoverandThomas, 1991).Intheemotionrecognitionarena,Luengo(2010)hasdevelopedasystemcentredinthebasicEkman’sbigsix (Ekmanetal.,1969),thatachievedgoodresultsusingthemRMRfeatureselectionmethodology(Pengetal.,2005) (alsoincludedinthestudyperformedbyZhaoetal.(2009)).TheuseofSequentialForwardSelection(Ververidisetal., 2004)andrelatedalgorithms,suchasSequentialForwardFloatingSelection(Pudiletal.,1994;HassanandDamper, 2009;LuggerandYang,2008),isalsopopularbutitcanleadtosub-optimalsolutions.

Inourcasewewantedtoobtainanoptimalsubsetthatcouldonlybeguaranteedbyanexhaustivesearch.However, withsuchalargenumberof features,thedirectapplicationofthisapproachwouldbecomputationallyunfeasible. Thiswaywehadtoapplyamulti-criteriamethodologyforfeaturepre-processingthatwedesignatedbymulti-criteria featureselection(MCFS).Ourpipelinewascomposedbyfourstages:feature normalization,featurepre-selection, compositefeatureselectionandexhaustivesearchselection.Hence,consideringthat thevaluesoneach dimension followedaGaussiandistribution,wenormalizedthefeaturecomponentsbycalculatingtheirZ-values.Thisprocedure centredthepointsintheoriginandequalizedtherangeof valuespreventingthedominationofattributes thatvary inhighernumericranges.Wethendefinedasoutliersallthepointsthatexceedinvaluetwostandarddeviations,in anyvectordimension,andremovedthem(thisallowstokeeparound95%ofthevaluesaroundthemeanvalue).A variationoftheKolmogorov–Smirnovtest(Lilliefors,1967)wasusedtoverifytheinitialassumptionthatthevalues oneachdimensionfollowedanormaldistribution.Thenat-test,witha90%confidenceinterval,wasusedtoanalyse ifagivenfeaturecouldbeusefultodiscriminatebetweentwoclasses.Thefeaturesthathavenotpassedthesetests wherediscarded.Thenwestartedthecompositefeatureselectionbyfirstcreatingafeaturelistsortedaccordingwith thefeature’sclass-discriminatorypower.Addingintoaccounttheinter-featurecorrelationwehaveiterativelycreateda newlistwhereeachfeaturewasselectedbyhavingthehighestclass-discriminatoryandthelowestcorrelationwiththe previouslyselectedfeatures.Fromthislistwehaveselectedthe12highestrankedfeaturesandperformedanexhaustive searchforcombinationsofnfeaturesusingscattermatrices(Fukunaga,1990).Ascostfunctionforclassseparability measurementweusedtheJ3criteriadefinedas:

(7)

whereS_wistheintraclassscattermatrixandSbistheinterclassscattermatrix.Foreachnweretainedthehighest scoredsubset.Thisprocessallowsobtainingafeaturelistwherethebestrankedfeaturesmaximizethediscriminative powerandminimizetheredundancy.Toevaluatetheperformanceofourapproachwecomparedtheresultswiththe onesobtainedusingPCAandISOMap,twopopularmethodologies.

4. Classiﬁcationandintensityestimation

Voicepleasantnessclassificationisaspecificpatternrecognitionproblemunderthemachinelearningtopic.Inthis case,thegeneraltaskofassigningalabelorvaluetoagiveninputpatternappearsintheformofdefiningifavoice ispleasantornotusingasetofobjectivefeaturesextractedfromagiveninputaudiostream.Thevoicepleasantness intensitymeasurementtaskinvolvesmeasuringthedegreeofpleasantnesswhenavoiceisclassifiedasbeingpleasant ornotpleasant.

4.1. Classification

Fromtheemotionrecognitionarea,thathassimilaritieswithourproblem,weoftenfind,forclassificationpurposes, methodologiesbasedonArtificialNeuralNetworks(ANN)(Unluturketal.,2009)orSupportVectorMachines(SVM) (ShaukatandChen,2008),withthelaterbeing morepopular.Infact,ShamiandVerhelst(2007)andCasaleetal. (2008)performedacomparisonofseveralclassificationtechniquesandobtainedthebestresultsusingSVMs.More recently,newmethodologies,suchasLogitBoostAlternatingDecisionTrees(WeissandBurkhardt,2010)orensembles ofREPTrees(Burkhardtetal.,2011),releasedasplug-insfortheWEKAtoolkit(WittenandFrank,2005),arealso becomingpopular.Howevertheperformanceofthenewapproachesmaybehighlyconditionedbythedatadistribution whileANNandSVMoftenofferamoreconsistentgoodperformanceacrossdatabases(Schulleretal.,2010).Onthe otherhand,HassanandDamper(2009)reportedtobeatthebestreportedclassificationresultsontheBerlinandDES emotiondatabasesusingastrategybasedonasimpleclassifier(k-nearestneighbour).Inourcaseweconsideredthe ANNandSVMapproaches,duetotheirrobustnessandflexiblemodels,andcomparedtheirperformance.

Sincevoicepleasantnesscanbeamorepermanentcharacteristicand,assumingthatitisnotmanipulatedoraffected byanillstate,itcanalsobeunderstoodasbeingintrinsictothespeaker.Inthissensewealsoevaluatedthesuitability ofaspeakeridentificationmodelwherethespeaker’sacousticsignatureisusedtorepresentvoicepleasantnessidentity. ForthisweusedaBayesianclassifierandaspeakermodelbasedonGaussianmixtures(GMM),whichcanleadto goodresults(Beigi,2011).TheimplementationwasinspiredonReynoldsetal.(2000).

The ANNandGMM basedapproaches wereimplementedusing Matlab scripts.TheANN hadafeed-forward structurewith2hiddenlayersand14neuronseach(thisconfigurationwasobtainedbyoptimizingthecostfunction). Thesizeoftheinputlayerwasequaltothedimensionoftheinputvector(whichwasoneofthevariedparameters). Foroutputwehadasingleneuron,withasigmoidactivationfunction,where0and1representednotpleasantand pleasant,respectively.Theback-propagationalgorithmwasusedfortraining.TheGMMmodelwasbuiltasdescribed inthefeaturessectionandtheExpectationMaximization(EM)(Dempsteretal.,1977)algorithmwasusedfortraining. TheSVMapproachwasbasedontheimplementationprovidedbytheLibSVM(ChangandLin,2011)toolkitusing aC-SVCSVM modelandradial basisfunctionswithagammavalueequal to8 andacost parameterof2 (these parameterswereadjustedbyminimizingclassificationerrorwithagridsearchmethodology).

Theclassifiers’training/testingprocesseswasperformedusing a10-foldcrossvalidationthat allowedtoobtain statisticallymeaningfulresults.Alatefusionbasedonaweightedvotingscheme,adjustedbycalculatingtherelative numberoftruepositivesforeachclassifieragainstthetotalnumberoftruepositives,allowedtoreachahybridmodel forcategoryestimation.

4.2. Voicepleasantnessintensityestimation

Besidesidentifyingvoicepleasantnessitwasalsoourgoaltomeasureitsintensity.Thisisdifferentfromtheconcept of“ranking”,that oftenappearsinthemachinelearningliterature,wheretheprobabilityassociatedwiththebinary classificationdecisionisusedtoprovideadditionalinformationconcerningitstrustfulness.Thiscanbeviewedasan “intensityestimation”fortheclassbutforourcasethisisoflimiteduse.Forexample,whenagivenpointislocated veryfarfromthedecisionboundaryandrightinsidetheclass,itwillhaveassociatedahighprobabilitybutitmight

(8)

Fig.2. Systemarchitecture.

notrepresentan optimallocation interms ofvoice pleasantness. The feature selectionprocesstested features for normalitythusmakingthedataGaussiancompatible.Hence,forvoicepleasantnessintensityestimationwehaveused amultivariateGaussiandistributionwhoseparametersweretrainedwithvoicejudgementdatafromthetwoclasses andusingtheoptimizedfeaturevectorpreviouslydevised.Eachclasshasitsownmodelthatischosenafterhaving thevoicepleasantnessclassestimation.Thisapproachallowedtouseanonlinearclassdiscriminationschemethatwill firstdividethefeaturespaceandthenwilluseaclassdependentmodel,basedonamultivariateGaussiandistribution, toreachafinalvoicepleasantnessintensityestimation.

Asanalternativescenariowehaveusedonlythefirsttwobestfeatures/componentscalculatedinaccordancewith thepresentedmethods.Foreachparameterpair(p1,p2)weassociatedapleasantnessrankingasadependentfunction

r=f(p1,p2).Thenusingauniform bicubicB-splinesurfaceS(x,y), westarted withaplane that wassuccessively deformedinordertoapproximatethetruerankingvalueforthegivenpoints(Barsky,1982).Foreachclasswehave adistinctsurfacethatwasadjustedtotherelatedpoints.Thisapproachcanprovidegoodresultsthoughitposessome limitationsduetoitsnon-parametricnature.

5. Evaluationandresults

ThebestresultswereobtainedusingtheconfigurationdepictedinFig.2.Thisoptimizedpipelineallowedtoachieve afinalvoiceclassificationerrorrateof9.1±2.1%andavoicepleasantnessintensityestimationabsoluteerrorrateof 15.7±2.5,bothfora90%confidenceinterval.

(9)

Table2

Listofthefirst84featuresintherankedlist.

0 12 24 36 48 60 72 84

1 Jrap f0Min VoB4Sk VeF1Sk VaB1Kt VeF1Kt NHR ViF2Max

2 SR PSk VeB1Av f0Max VeB3Max VeB3Min ViF1Max VaF3Max

3 VaB1Av VaF2Kt VaF1Std VoB4Av VaB1Max VoB4Max VoF2Kt VaB3Kt

4 PAv VoB3Std VeF3Max VoF3Min VoB4Std ViF3Std ViB2Max ViF2Sk

5 f0Sk PMin VoB3Av VoB1Av VaB2Std VeF4Min ViF2Min VeB1Sk

6 f0SD VoB3Sk PSD f0SD ViF4Min VeB4Min VoF4Sk VeF3Min

7 Sapq3 PMax PR VoB1Min VeF4Sk ViF3Min VoF2Sk Sdda

8 HNR(dB) VaB4Min HAC VaB2Max ViF1Av VeF2Std VeB3Sk Jddp

9 f0Av VeB4Kt f0K VaF4Kt ViB2Av VoB3Min VeF3Av VoF4Max

10 f0Av JLAbs VeB2Kt PAv PAv f0Min VoF4Min VaB3Max

11 PKt VeB3Std JLAbs2 VoB2Sk ViF2Kt VoF1Std VeF2Sk VaB1Sk

12 f0Max PSK VoF2Std VaF4Sk ViF1Min VoB1Std VaB4Std VoB3Max

Table3

Referencevaluesforthebestrankedfeatures.Valuesareabsolutewitharelativevariationwithina90%confidenceinterval.

Feature Jrap PAv f0Sk Sapq3 f0Av f0Max

Value 0.00946±9.3% 0.0110±12.1% 0.576±4.7% 0.02697±5.9% 196±5.3% 34.24±6.2%

The systemoptimizationwas performedbyindividuallyadjusting eachof theparameterson everystagewhile observingtheperformanceimprovementsintheoutput.Theperformancewasalwaysmeasuredintermsofclassification error(sumoffalsepositivesandfalsenegativesoverthetotalnumberofevaluatedcases)anda10-foldcrossvalidation allowedtoreachthemeanerrorrateaswellastheconfidenceinterval.

Wewillfollowthepipelinefromthebeginning.Afterextractingthefeatures,wehaveappliedtheproposedMCFS methodology.Therankedfeaturelist,aftercompositefeatureselection,ispresentedinTable2.

Fromtherankedfeaturelistwewenttoinvestigatetheamountof necessaryfeaturestoprovidethebestresults. UsingaSVMclassifierwithaRBFkernel,whosegammaandcostparameterwereoptimizedineachexperiment,we testedseveralcombinationsofnoutofthe12bestrankedfeaturesinthelist.Thebestresultswereobtainedusinga6 elementsvectorcomposedbythefeaturesrepresentedinboldfaceinthefirstcolumnofTable2.Therelatedvaluesare presentedinTable3.Additionallywewantedtoevaluateifitwaspossibletoobtainbetterresultsusingdimensionality reductionapproachessuchasPCAorISOMap.InFig.3wecanobservetheobtainedclassificationerrorandtherelated 90%confidenceintervalforthethreeapproacheswhilevaryingthenumberoffeaturesorvectorcomponents.Wecan seethatPCAandISOMapclearlyoutperformMCFSwhenareducednumberofcomponentsareconsideredandthat theevaluationofasmallsetofindependentfeaturescannotprovideasufficientdiscriminantpower.Theperformance ofallmethodsimproveswiththesizeofthefeaturevectoruntilacertainpointwithPCAandISOMapachievingthe bestresultswith5componentsandMCFSwith6featuresandtheoverallsmallesterrorrate.Whenconsideringalarger feature vectorwecanseethat theMCFSperformanceisdegradedwhichmayindicatedifficultiesduringclassifier

0 5 10 15 20 25 3 2 4 5 6 7 8 9 10 MCFS PCA ISOMap

(10)

Table4

CompositionofthefirstPCAcomponent.

Component 1 2 3 4 5 6

Feature SR Sapq3 f0Av EMax Jrap VeB3Kt

Weight 12.2 10.8 8.5 5.5 5.3 3.9

Acc.weight 12.2 23.0 31.5 37.0 42.3 46.2

Component 7 8 9 10 11 12

Feature f0SD ESK VeB1Av HAC EMax PR

Weight 3.7 3.0 2.8 2.7 2.5 2.3

Acc.weight 49.9 52.9 55.7 58.4 60.9 63.2

trainingwhendealingwithadditionaldimensions.Ontheotherhand,PCAandISOMap,showedtobemuchmore robusttovectorsize.

SincePCAandISOMapperformed wellwithsmall sizeddescription vectorsandwewantedtounderstandthe features thatbest represented voicepleasantnesswe investigated the composition of thefirst PCA component. In

Table4weseethe12highestweightedfeaturesaswellastheiraccumulatedweight.Wecanobservethatmostofthese featuresarealsoamongthehighestrankedintheMCFSfeaturelist(Table2)whichreinforcestheirimportanceonthe definitionofvoicepleasantness.Itisalsonoticeablethatthemostimportantfeaturesinbothapproachesaremainly voicequalityandprosodyparameters.Theaveragef0,themaximumf0,speakingrateandpauserate,amongothers, arealsolisted ashavingahighcorrelationwithvoicelikabilityinBragaetal.(2007a),whichcanbringadditional confidenceintheresults.

Intheclassificationstage,andstillthinkinginthepreviousstage,wecomparedtheperformanceoftheSVMand ANNapproacheswhilevaryingthefeaturevectordimension.TheresultscanbeseeninFig.4(a)whereweobserve thattheSVMclassifierstillholdsthebestperformance.(TheGMMclassifierwasevaluatedseparatelyandwehave notvariedthedimensionoftheinputvector.)

For classification we evaluated independently three classifiers, ANN, SVM and GMM as can be seen on

Table5(a)–(c),respectively.TheSVMclassifierperformedbetterandtooptimallytunetheSVMclassifier,we evalu-atedthesystem’sperformancewhenvaryingthetypeofkernelfunctionandrelatedparameters,asshowninthefirst rowofTable6(beforefusion).WecanobservethattheRBFkernelperformedbetterbutveryclosely,followedbythe quadraticpolynomialkernel(despitethemuchlongertrainingtimerequiredbythelast).

InTable5(c)wecanobservethattheGMM–Bayesclassifiershowedtohaveabetterperformancewhenidentifying thenotsopleasantvoices.TofurtherimprovethesystemwecombinedtheSVMclassifier,whichachievedthebest overallperformance,withtheGMM/Bayesclassifierusingaweightedvotingscheme(performedduringtraining). Thishybridapproachallowedtoimprovetheoverallresults,whichareshowninthesecondrowofTable6.Forthe RBFkernel,arelativeweightof30%GMMand70%SVMwasobtainedbyanerrorminimizationsearchprocedure. Additionallywealsoevaluatedinwhatextentthefeatureselectionprocessandtheinclusionofasecondparallel classifier(GMM)helpedtoimprovetheclassificationresults.InTable7,wecanobservetheimprovementsintroduced

0 5 10 15 20 25 30 10 9 8 7 6 5 4 3 2 SVM ANN

(a) Classification error

0,00 0,05 0,10 0,15 0,20 0,25 0,30 8 7 6 5 4 3 2 9 10

(b) Voice pleasantness intensity estimation error

Fig.4.Performanceevaluationaccordingtofeaturevectordimension.Horizontalaxisindicatesthedimensionofthefeaturevectorandvertical axisshowserrorinpercentage(a)andabsolutevalue(b).

(11)

Table5

Confusionmatriceswithclassificationresults,inpercentage,fordistinctapproaches.Classesarepleasant(P)ornotpleasant(NP).Ontheleftwe havetherealclasses(R)andonthefirstrowwehavetheestimatedclasses(E).

R\E P NP (a)ANN P 80.5 21.1 NP 19.5 78.9 (b)SVM P 89.8 12.7 NP 10.2 87.3 (c)GMM P 68.5 11.4 NP 31.5 88.6 Table6

Voicepreferenceclassificationerrorfora90%confidenceintervalusingdifferentkerneltypeswithaSVMbasedclassifierandafterfusionwitha GMM/Bayesclassifier.(Twoclassesoutlierfreedataset,errorwithina90%confidenceinterval.)

Linear Quadratic RBF Sigmoid

SVMalone 23.7±3.7% 18.8±3.2% 11.5±2.6% 24.2±3.9%

Class.fusion 19.3±3.7% 14.5±2.1% 9.1±2.1% 21.9±4.2%

Table7

Classificationerrorrateaftertheintroductionofeachcomponent.(Errorwithina90%confidenceinterval.)

Non-tunedsystem Independentfeat.sel. Compositefeat.sel. Optimizedfeat.subset Compositeclassification

38.5±7.6% 18.9±5.2% 14.4±2.3% 12.3±3.1 9.1±2.1%

byeachblockofthepipelineusingasreferencetheresultsobtainedwithaRBFkernel(sinceitperformedbetter)and consideringalltheavailableobjectivefeatures.

Afterobtainingaclassificationresultweperformedthevoicepleasantnessintensityestimation.Theperformance evaluationwasbasedonthemeanabsoluteestimationerror,orbyotherwords,themeanabsolutedifferencebetween thetruehumanjudgedpleasantnessrank(ona1–5scale)andthevalueestimatedbythesystem.Theresultsobtained forthe multivariateGaussianmodelapproacharedepictedinFig.4(b)wherewe canobservethat thebestresults, 15.7±2.5%fora90%confidenceinterval,areobtainedagainfora6dimensionsvector.Thesurfacebasedapproach wasimplementedusinga20×20patchgridsupportedonthebesttworankedfeaturesusingtheMCFSprocessand thefirsttwocomponents/vectorsofthePCAandISOMapmethodologies.Theobtainederrorrateswere27.3±5.7%, 15.8±3.2%and15.6±3.9%,respectively,fora90%confidenceinterval.Asurfacemappedfromabidimensional referencebasedon rawfeatures wasclearlyinappropriatehowever,theuseof PCA orISOMap, ledtosuccessful results.EspeciallytheISOMapbasedreferenceprovidedaresiduallylowererrorratebutwithahighervariationinthe confidenceinterval.Forthisreasonwepreferredtheparametricapproachwithbetterrepeatability.

6. Conclusionsandfuturework

Inthispaperwehavepresentedanoptimizedpipelinefortheautomaticevaluationofvoicepleasantness.Westarted bydefiningthepleasantnessconcept,whichhasacentralroleinourwork,andshowedhowitcanberelatedtothe perceptionofvoiceasstimuli.Anextensivepresentationofthemostsignificantstudiesintheareawasalsopresented. Aftershowingan overviewof theprocessingpipeline,we havedetailed thefunctionsof eachstage. Weprovided abriefpresentationofthedatabase,whichwasspecificallydevelopedforthestudyofvoicepleasantness.Then,we thoroughlyexplainedtheprocedurethatwasfollowedtoobtainanoptimalfeaturevectorforachievingimprovedresults. Westartedbyexploringabroadrangeofdimensions,coveringacoustic,signal,periodicityandphonationspeedaspects

(12)

andsuccessivelyselectedthebestfeaturesbymaximizingtheirclassdiscriminatorypowerandbyreducingthe inter-featureredundancy.Thisapproachprovidedthebestresultsforourdataafteracomparisonwithothermethodologies. Ouroptimizedfeaturevector, mainlycomposedbyprosodic andvoicequalityparameters,included Jrap(relative averageperturbation),PAv(derivativeofaveragepower),f0Sk(skewnessoffundamentalfrequencyderivative),

Sapq3(amplitudeperturbationquotient),f0Avandf0Max.ASVMclassifier wassuccessfullycombined witha GMM/Bayesmethodologyin alatefusion scheme.Thisnewapproach,motivatedbythe specificnature of voice pleasantnessanalysis,allowedtoachieveafinalvoicepleasantnessclassificationerrorrateof9.1±2.1%,whichis betterthanotherreportedmarksforsimilartasks.Finally,forestimatingvoicepleasantnessintensity,weachievedthe bestresultsusingaclassdependentmultivariateGaussianmodel,whichprovidedanerrorrateof15.7±2.5%.The obtainedresultswereobtainedusingadatabasethatwasspecificallydevelopedforthestudyofvoicepleasantness.

Thepresentedsystemisapowerfulobjectiveanalysistool thatcansupportsubjectiveevaluations duringvoice talentselectionstagefornewvoicefontsrecording,thuscontributingtoreduceoverallTTSdevelopmentcosts.This workalsoprovidesanadditionalinsightonthestudyofpleasantness,oneofthekeydimensionsofexpressivespeech, andcancontributetotheenhancementofTTSsystems’voicequalityaswellastoleveragetheintroductionofthis technologyinmorereallifeapplications.

Asfutureworkweplantoapplyasimilarmethodologytootherlanguagesallowingtosimultaneouslyincrease theavailabledataandtoevaluatethecross-languageperformance.Anextensionoftheworkforcharacterizingmale voicesisalsoenvisioned.

Acknowledgements

ThisworkwaspartiallysupportedbyERDFfunds,theSpanishGovernment(TEC2009-14094-C04-04),andXunta deGalicia(CN2011/019,2009/062).

References

Alku,P.,Backstrom,T.,Vikman,E.,2002.Normalizedamplitudequotientforparametrizationoftheglottalflow.AcousticalSocietyofAmerica2, 701–710.

Amano-Kusumoto,A.,Hosom,J.-P.,2009.Theeffectofformanttrajectoriesandphonemedurationsonvowelintelligibility.In:Proc.IEEEConf. onAcoustics,Speech,andSignalProcessing,pp.4677–4680.

Audibert,N.,Vincent,D.,Aubergé,V.,Rosec,O.,2006.Expressivespeechsynthesis:evaluationofavoicequalitycenteredcoderonthedifferent acousticdimensions.In:Proc.of3rdInternationalConferenceonSpeechProsody,Dresden,Germany.

Barsky,B.A.,1982.Endconditionsandboundaryconditionsforuniformb-splinecurveandsurfacerepresentations.ComputersinIndustry3,17–29. Beigi,H.,2011.FundamentalsofSpeakerRecognition.Springer.

Biadsy,F.,Rosenberg,A.,Carlson,R.,Hirschberg,J.,Strangert,E.,2008.Across-culturalcomparisonofAmerican,Palestinian,andSwedish perceptionofcharismaticspeech.In:Proc.SpeechProsody,Campinas,Brazil.

Boersma,P.,2001.Praat:doingphoneticsbycomputer.GlotInternational9/10(5),341–345,URLhttp://www.fon.hum.uva.nl/praat/. Borg,I.,Groenen,P.,2005.ModernMultidimensionalScaling:TheoryandApplications.Springer-Verlag.

Braga,D.,Coelho,L.,Junior,F.G.R.,Dias,M.S.,2007,July.SubjectiveandobjectiveevaluationofBrazilianPortugueseTTSvoicefontquality. In:Int.WorkshoponAdvancesinSpeechTechnology,Maribor,Slovenia,pp.306–311.

Braga,D.,Coelho,L.,Junior,F.G.V.R.,Dias,M.S.,2007,October.SubjectiveandobjectiveassessmentofTTSvoicefontquality.In:Proc.of InternationalConferenceonSpeechandComputers(SPECOM2007),Moscow,pp.306–311.

Burkhardt,F.,Eckert,M.,Johannsen,W.,Stegmann,J.,2010.Adatabaseofageandgenderannotatedtelephonespeech.In:Proc.Languages ResourcesEvaluationConference.

Burkhardt,F.,Paeschke,A.,2005.AdatabaseofGermanemotionalspeech.In:Proc.Interspeech.

Burkhardt,F.,Schuller,B.,Weiss,B.,Weninger,F.,2011,August.“Wouldyoubuyacarfromme?”–onthelikabilityoftelephonevoices.In:Proc. Interspeech,Florence.

Campbell,J.P.,1997.Speakerrecognition:atutorial.ProceedingsoftheIEEE85(9),1437–1462.

Campbell, N., 2008. Expressive/affective speech synthesis. In: Springer Handbook on Speech Processing. Springer, http://www.speech-data.jp/fastnet/ApplicantCV-Campbell.pdf,pp.505–517.

Casale,S.,Russo,A.,Scebba,G.,Serrano,S.,2008.Speechemotionclassificationusingmachinelearningalgorithms.In:Proceedingsofthe2008 IEEEInternationalConferenceonSemanticComputing.IEEEComputerSociety,Washington,DC,USA,pp.158–165.

Chang,C.-C.,Lin,C.-J.,2011.LIBSVM:alibraryforsupportvectormachines.ACMTransactionsonIntelligentSystemsandTechnology2, 27:1–27:27.

Chattopadhyay,A.,Dahl,D.W.,Ritchie,R.J.,Shahin,K.N.,2003.Hearingvoices:theimpactofannouncerspeechcharacteristicsonconsumer responsetobroadcastadvertising.JournalofConsumerPsychology13(3),198–204.

(13)

Coelho,L.,Braga,D.,Dias,M.,Mateo,C.,2011.Anautomaticvoicepleasantnessclassificationsystembasedonprosodicandacousticpatternsof voicepreference.In:Proc.ofInterspeech.

Coelho,L.,Braga,D.,Garcia-Mateo,C.,2010.KalmantrackinglinearpredictorforvowelintelligibilityenhancementonEuropeanPortuguese HMMbasedspeechsynthesis.In:ProcofICASSP.No.1,pp.4734–4737.

Cover,T.,Thomas,J.,1991.ElementsofInformationTheory.JohnWiley&Sons,NewYork.

Dempster,A.P.,Laird,N.M.,Rubin,D.B.,1977.Maximum-likelihoodfromincompletedataviatheEMalgorithm.JournaloftheRoyalStatistical Society:SeriesB39.

Eklund,R.,Lindström,A.,2001.Xenophones:aninvestigationofphonesetexpansioninSwedishandimplicationsforspeechrecognitionand speechsynthesis.SpeechCommunication35(2),81–102.

Ekman,P.,Sorenson,E.,Friesen,W.V.,1969.Pan-culturalelementsinfacialdisplaysofemotions.Science164,86–88.

Eyben,F.,Wöllmer,M.,Schuller,B.,2009.OpenEAR–introducingtheMunichopen-sourceemotionandaffectrecognitiontoolkit.In:Proc.4th Int.HUMAINEAssociationConf.onAffectiveComputingandIntelligentInteraction.

Fellbaum,C.,1998.WordNet:AnElectronicalLexicalDatabase.TheMITPress,Cambridge,MA. Fukunaga,K.,1990.IntroductiontoStatisticalPatternRecognition.AcademicPress.

Gobl,C.,Chasasaide,A.N.,2003.Theroleofvoicequalityincommunicatingemotion,moodandattitude.SpeechCommunication40,189–212. Hassan,A.,Damper,R.I.,2009.Emotionrecognitionfromspeechusingextendedfeatureselectionandasimpleclassifier.In:Proc.Interspeech,

Brighton.

Kendall,M.G.,Smith,B.,1939.Theproblemofmrankings.TheAnnalsofMathematicalStatistics10(3),275–287.

Kinnunen,T.,Li,H.,2010.Anoverviewoftext-independentspeakerrecognition:fromfeaturestosupervectors.SpeechCommunication52(January), 12–40.

Kohonen,T.,2000.SelfOrganizingMaps.Springer.

Kreiman,J.,Gerratt,B.,Precoda,K.,1990.Listenerexperienceandperceptionofvoicequality.JournalofSpeechandHearingResearch33, 103–115.

Lilliefors,H.W.,1967.OntheKolmogorov–Smirnovtestfornormalitywithmeanandvarianceunknown.JournaloftheAmericanStatistical Association62,399–402.

Lopes,J.,Freitas,S.,Sousa,R.,Matos,J.,Abreu,F.,Ferreira,A.,2008.AmedidaHNR:Suarelevâncianaanáliseacústicadavozesuaestimac¸ão precisa.In:Proceedingsof“IJornadassobreTecnologiaeSaúde”,Guarda,Portugal.

Luengo,I.,2010.Análisisyevaluacióndeparámetrosparaidentificaciónautomáticadeemocionesenelhabla.Ph.D.Thesis,UniversidaddelPaís Vasco.

Lugger,M.,Yang,B.,2008.calmotivatedmulti-stageemotionclassificationexploitingvoicequalityfeatures.In:SpeechRecognition,Technologies andApplications,pp.395–410.

Mayo,C.,Clark,R.A.J.,King,S.,2011.Listeners’weightingofacousticcuestosyntheticspeechnaturalness:amultidimensionalscalinganalysis. SpeechCommunication53(March),311–326.

Montero,J.M.,Gutiérrez-Arriola,J.,Colas,J.,Enriquez,E.,Pardo,J.M.,1999.AnalysisandmodellingofemotionalspeechinSpanish.ICPhS99 2,957–960.

OxfordDictionary,2009[Online;accessed13.03.2011].URLhttp://oxforddictionaries.com/.

Peng,H.,Long,F.,Ding,C.,2005.Featureselectionbasedonmutualinformation:criteriaofmax-dependency,max-relevanceandmin-redundancy. IEEETransactionsonPatternAnalysisandMachineIntelligence27,1226–1238.

Pinho,S.,Pontes,P.,2002.Escaladeavaliac¸ãoperceptivadafonteglótica:Rasat.VoxBrasilis8(3),8–13.

Pudil,P.,Ferri,F.,Novovicova,J.,Kittler,J.,1994.Floatingsearchmethodforfeatureselectionwithnonmonotoniccriterionfunctions.Pattern Recognition2,279–283.

Reynolds,D.A.,Quatieri,T.F.,Dunn,R.B.,2000.SpeakerverificationusingadaptedGaussianmixturemodels.DigitalSignalProcessing10,19–41. Schuller,B.,2010.Ontheacousticsofemotioninspeech:desperatelyseekingastandard.JournaloftheAcousticalSocietyofAmerica127(3),

1995.

Schuller,B.,Arsic,D.,Wallhoff,F.,Rigoll,G.,2006.Emotionrecognitioninthenoiseapplyinglargeacousticfeaturesets.In:Proc.SpeechProsody, Dresden.

Schuller,B.,Batliner,A.,Seppi,D.,Steidl,S.,Vogt,T.,Wagner,J.,Devillers,L.,Vidrascu,L.,Amir,N.,Kossous,L.,Aharonson,V.,2007,August. Therelevanceoffeaturetypefortheautomaticclassificationofemotionaluserstates:lowleveldescriptorsandfunctionals.In:Proceedingsof Interspeech,Antwerp,Belgium.

Schuller,B.,Vlasenko,B.,Eyben,F.,Wollmer,M.,Stuhlsatz,A.,Wendemuth,A.,Rigoll,G.,2010.Cross-corpusacousticemotionrecognition: variancesandstrategies.IEEETransactionsonAffectiveComputing1(July),119–131.

Schröder,M.,2009.Expressivespeechsynthesis:past,present,andpossiblefutures.In:AffectiveInformationProcessing.Springer,London,pp. 111–126.

Shami,M.,Verhelst,W.,2007.Anevaluationoftherobustnessofexistingsupervisedmachinelearningapproachestotheclassificationofemotions inspeech.SpeechCommunication49(March),201–212.

Shaukat,A.,Chen,K.,2008.Towardsautomaticemotionalstatecategorizationfromspeechsignals.In:INTERSPEECH’08,pp.2771–2774. Stuhlsatz,A.,Meyer,C.,Eyben,F.,Zielke,T.,Meier,H.-G.,Schuller,B.,2011.Deepneuralnetworksforacousticemotionrecognition:raisingthe

benchmarks.In:ProcofICASSP.IEEE,pp.5688–5691.

Syrdal,A.,Conkie,A.,Stylianou,Y.,1998.Explorationofacousticcorrelatesinspeakerselectionforconcatenativesynthesis.In:Proc.ofInt.Conf. SpeechandLang.Processing98,Lisbon,Portugal.

Teixeira,J.P.,Freitas,D.,Braga,D.,Barros,M.J.,Latsch,V.,2001.PhoneticeventsfromthelabellingtheEuropeanPortuguesedatabaseforspeech synthesis,FEUP/IPB-DB.In:Proc.ofEurospeech.

(14)

Tenenbaum,J.B.,deSilva,V.,Langford,J.C.,2000.Aglobalgeometricframeworkfornonlineardimensionalityreduction.Science290,2319–2323. Theodoridis,S.,Koutroumbas,K.,2006,February.PatternRecognition,3rded.AcademicPress.

Unluturk,M.S.,Oguz,K.,Atay,C.,2009.Emotionrecognitionusingneuralnetworks.In:Proceedingsofthe10thWSEASInternationalConference onNeuralNetworks.WSEAS,StevensPoint,WI,USA,pp.82–85.

vanBezooijen,R.,2005.Approximant/r/inDutch:routesandfeelings.SpeechCommunication47(1),15–31.

L.J.P.vanderMaaten,E.P.,vandenHerik,H.,2009.Dimensionalityreduction:acomparativereview.Tech.rep.,TilburgUniversityTechnical Report.

vanderMaaten,L.,Hinton,G.,2008.Visualizinghigh-dimensionaldatausingt-sne.JournalofMachineLearningResearch9,2579–2605. Ververidis,D.,Kotropoulos,C.,Pitas,I.,2004.Automaticemotionalspeechclassification.In:Proc.IEEEInternationalConferenceonAcoustics,

Speech,andSignalProcessing(ICASSP’04),vol.1,pp.593–596.

Weiss,B.,Burkhardt,F.,2010,September.Voiceattributesaffectinglikabilityperception.In:Proc.ofInterspeech,Makuhari,Japan. Witten,I.H.,Frank,E.,2005.DataMining:PracticalMachineLearningToolsandTechniques.MorganKaufmann.

Wolf,J.J.,1972.Efficientacousticparametersforspeakerrecognition.JournaloftheAmericanStatisticalAssociation51,2044–2056.

Yabuoka,H.,Nakayama,T.,Kitabayashi,Y.,Asakawa,Y.,2000.Investigationsofindependenceofdistortionscalesinobjectiveevaluationof synthesizedspeechquality.ElectronicsandCommunicationsinJapan(PartIII:FundamentalElectronicScience)83(5),14–22.

Yanushevskaya,I.,Gobl,C.,Chasaide,A.N.,2008.Voicequalityandloudnessinaffectperception.In:Proc.ofSpeechProsody.

Zhao,Z.,Morstatter,F.,Sharma,S.,Alelyani,S.,Anand,A.,Liu,H.,2009.Advancingfeatureselectionresearch–ASUfeatureselectionrepository. Tech.rep.