ComputerSpeechandLanguagexxx(2012)xxx–xxx
On
the
development
of
an
automatic
voice
pleasantness
classification
and
intensity
estimation
system
夽
Luis
Pinto-Coelho
a,b,∗,
Daniela
Braga
c,
Miguel
Sales-Dias
b,d,
Carmen
Garcia-Mateo
e aPolytechnicInstituteofPorto,PortugalbMicrosoftLanguageDevelopmentCenter,Portugal cTTSTeam,Microsoft,China
dISCTE-LisbonUniversityInstitute,Portugal eUniversityofVigo,Spain
Received9May2011;receivedinrevisedform19January2012;accepted20January2012
Abstract
Inthelastfewyears,thenumberofsystemsanddevicesthatusevoicebasedinteractionhasgrownsignificantly.Foracontinued useofthesesystems,theinterfacemustbereliableandpleasantinordertoprovideanoptimaluserexperience.Howeverthereare currentlyveryfewstudiesthattrytoevaluatehowpleasantisavoicefromaperceptualpointofviewwhenthefinalapplication isaspeechbasedinterface.Inthispaperwepresentanobjectivedefinitionforvoicepleasantnessbasedonthecompositionofa representativefeaturesubsetandanewautomaticvoicepleasantnessclassificationandintensityestimationsystem.Ourstudyis basedonadatabasecomposedbyEuropeanPortuguesefemalevoicesbutthemethodologycanbeextendedtomalevoicesorto otherlanguages.Intheobjectiveperformanceevaluationthesystemachieveda9.1%errorrateforvoicepleasantnessclassification anda15.7%errorrateforvoicepleasantnessintensityestimation.
©2012ElsevierLtd.Allrightsreserved.
Keywords: Voicepleasantness;Subtleemotions;Perceptualspeechanalysis;Text-to-Speechsynthesis
1. Introduction
Text-to-Speech(TTS)technologyhasdramaticallyimprovedinthelastfewyearsandtheinclusionofthismature technology in our dailylives is now a reality.We caneasily find examples such as GPSs, smartphones, reading assistants,interactivevoiceresponse(IVR)andautomotiveapplicationsthatassistusinseveraltasks.Intelligibility isfullyrequiredforanycommercialsystemandseveralsystemsarebecomingprogressivelymorenatural.However thereareusersthatdon’tfeelfullysatisfiedwiththeirproductsandoftensaythatthevoiceisboring,thatthestyleis notveryfriendlyoreventhattheydislikethesoundofthevoice,amongothers.Totrytoanswertothesecomplaints, wedecidedtostudytheconceptofvoicepleasantnessaccordingtothedefinitionsfoundinFellbaum(1998)inwhich “Pleasantnessisthefeelingcausedbyagreeablestimuli”,andinOxfordDictionary(2009)wherepleasantnessiswhat
夽 Thispaperhasbeenrecommendedforacceptanceby‘BjörnSchuller,PhD’.
∗Correspondingauthorat:PolytechnicInstituteofPorto,Portugal.Tel.:+351914216862;fax:+351214218488. E-mailaddress:[email protected](L.Pinto-Coelho).
0885-2308/$–seefrontmatter©2012ElsevierLtd.Allrightsreserved. doi:10.1016/j.csl.2012.01.006
gives“asenseofhappysatisfactionorenjoyment”.Thisisdistinctfromtheconceptofattractiveness(“appealingto thesenses”,“sexuallyalluring”OxfordDictionary,2009)whichwasalsoevaluatedbutisoutofthescopeofthiswork. Theconceptofvoicepleasantnessandthesuitabilityofagivenvoicetobeusedonagivencommunicativesituation arebarelytouchedbyscientificliterature(Schröder,2009).Recently,Campbell(2008)integratedtheconceptinthe areaofexpressive/affectivespeechasaformofsubtleemotionwhereitappearsasadesiredfeatureinTTSsystems whiletheinteractionbetweenspeakerandlistenerisshiftedfromatypical“readspeechtowardsamoreconversational styleofspeech”.Tocharacterizeexpressivespeechweoftenfindreferencestovoicequalityandprosodyparameters andMonteroetal.(1999)andAudibert etal.(2006) showedthat thesecanbecombinedwithdistinct weightsto conveyagivenexpression.Additionally,voicepleasantnesscanbeseenasamorepermanentattributethantheintense andtimelimited emotionsoftenfoundintheexpressivespeechliterature.Thischaracteristicallowstoenvisionan integrationonaspeakeridentificationframework.Fromthesestudiesanddistinctperspectiveswestarttoseethatvoice pleasantnessisindeedacomplexconceptthatencompassesseveraldimensionsand,inthisway,itsanalysiscanbenefit fromareassuchasvoicequality,speakeridentificationandemotionrecognition.Basedonourtheoreticalframework wewillmainlyrelyonthelaterbutalwayswithsupportfromthefirsttwo.
WehavepreviouslyexploredcorrelationsbetweensubjectiveratingsandobjectivefeaturesinBragaetal.(2007b)
andthe creation of the database on whichwe relied is described in Bragaet al.(2007a).A first study on voice pleasantnessclassification,usingasmallmultilingualcorpus,waspublishedinCoelhoetal.(2011).
Inthisworkthemaingoalsandnoveltiesare(a)thepresentationofanobjectivedefinitionforvoicepleasantness basedonthecompositionofarepresentativefeaturesubsetand(b)theoptimizationofavoicepleasantnessevaluation pipelinethatallowstoautomaticallyidentifyoccurrencesoftheconceptaswellasestimatingitsintensity.
WehavespecificallyfocusedonEuropeanPortuguese(EP)sincewehadaveryhomogeneousdatabase withan extensivebaseofsubjectiveevaluations.Themulti-lingualproblemwillbeanalysedinafuturework.
1.1. Background
Fromthelistener’spointofview,whenevaluatingtheimpactofthemessageasstimuli,thereareveryfewstudiesbut thetopicisgainingimportance.Infirstreportedstudiestheauthorsmainlyexplorecorrelationsbetweensubjectiveand objectivedataandthesupportingdatabasesareoftensmallandnotspecificfortheanalysisofthevoicepleasantness topic.Forexample,Syrdaletal.(1998)conductedastudyinordertocheckthesuitabilityofaspeakers’voiceto developaTTSsystem,basedontheassumptionthattheperceivedqualityofanaturalvoicedoesnotnecessarilymean synthesizedvoicequality.Theauthorshaveexploredthecorrelationbetweenacousticcharacteristicsandthesubjective attributesofsyntheticspeechquality(intelligibility,naturalnessandpleasantness).Thespeakers’selectionprocessis saidtohavebeenmadeempiricallyalthoughallthecandidates(6femalesand9males)wereprofessionalspeakers. AnotherexampleistheworkofYabuokaetal.(2000)whoevaluatedpsychometricscalesandcorrelatedtheresults withanobjectiveevaluationoftheacousticsignal.Theanalysiswasmainlydirectedtoinvestigatethebehaviourofthe scalesandnottotheobjectivedefinitionofthepsychometricscales.Theauthorsshowedthattheobjectivevariables canbegroupedintotwofactors:onerelatedto“clarity”andtheotheronerelatedto“sensation”.However,concerning thetypeandstructureofthesubjectivetest,verylittledetailsweregiven.
Fromanemotionperceptionpointofview,GoblandChasasaide(2003)andYanushevskayaetal.(2008)showed thatitcanbehighlyinfluencedbyvoicequalityparametersandLuggerandYang(2008)haveusedthesameparameters toidentifyhappy,bored,neutral,sad,angryandanxiousstateswithverygoodresultsforasetof10speakers.However thereisstillalackofagreementinexistentliteratureonthedefinitionoftheoptimalfeaturesforcharacterizingemotions (Schuller,2010).
On similarfields, we find the workof Biadsyet al.(2008)that explored judgements of charisma inStandard American Englishand Palestinian Arabic speech ina series of perceptual experiments.The authors investigated acoustic,prosodic,andlexicalcluesthat couldidentifycharismaticspeech. Fromanobjectivepointofview, using acoustic–prosodicfeatures,theauthorsfound,forbothcultures,thatlongerspeechsegments,withfrequentchangesin
f0,adynamicusageofdifferentspeakingratesandvariationsinintensityacrossintonationalphrases,wereperceived asbeingcharismatic.
Morerecently,wecanfindstudiesthatencompassanextendedsetoffeaturesandrelyonmachinelearningtechniques foradeeperanalysis.WeissandBurkhardt(2010)correlatedlistener’slikabilityratingswithanacousticanalysisusing emotionallyneutral sentencesfromemoDB(BurkhardtandPaeschke,2005).Higharticulationrate,lowerspectral
Fig.1.System’sdevelopmentpipeline.
centreofgravityandhigherspectralstandarddeviationandskewnessprovidedsomeofthehighestcorrelationsfor femalespeakers.Inpart,theseresultsarealignedwiththoseobtainedbyBragaetal.(2007a).Theauthorshavealso developedabinaryclassificationsystemthatachieved62.9%accuracy.Burkhardtetal.(2011)havepresentedresults forabinaryclassificationofvoicepleasantnessusingtheAgenderdatabase(Burkhardtetal.,2010).Despitethelarge numberofspeakers,theevaluationswereperformedusingsentenceswithacommandembeddedandnotalwayswith thesamecontent.Thereportedaccuracyofthesystemwas67.6%.
1.2. Workoverviewanddocumentstructure
Forourpurposeswewillfollowageneralmachinelearningframework.Thelackofagreementinexistingliterature onthedefinitionoftheoptimalfeaturesforcharacterizingemotionsandthescarcenumberofstudiesthatspecifically covervoicepleasantnessarethemainmotivationsfortheuseofthisapproachsince,inthepresenceofsuchconditions, itcanprovidearobustandflexibleframeworkindependentlyofdata’snatureandallowstheexplorationof alarge numberofparametersforsystem’sfinetuning.Ourpipelineisorganizedonasixstagesequence,asdepictedinFig.1, withfunctionblocksrepresentingthemaintaskssets.Thespecificfunctionsperformedoneachblockwillbedetailed. Inthenextsection,wepresentadescriptionof theusedmethodologystartingwithanoverviewofthesystem’s architecturewhichwillthenbedetailedintheensuingsub-sections.Wewillthoroughlycovertheconstructionofour databaseandhowweanalysedtheparalinguisticinformationconveyedinthemessageinparallelwiththepsychometric variablestomeasurethereactionofthelistenertotheprovidedstimuli.Afterwards,wewilldescribetheextraction andselectionoffeaturesaswellasthedevelopmentofamodelthatcouldholdthemainvariablesthatdescribevoice pleasantnessandhow theyinteractbetween themselves.Finally,after training andevaluatingthe system,we will presentathoroughdiscussionoftheobtained resultsandthefinalconclusionsfollowedbysomeenvisionedfuture work.
2. Experimentalframework
2.1. Database
Thedatabasethatsupportedthedevelopmentofthisworkisexclusivelycomposedbyfemalevoices.Thesevoices belongtoprofessionalspeakers,withradio,theatreorothervocalexperience,andwererecordedduringvoicetalent selectionprocessesforthedevelopmentofnewTTSsystems.Therelatedrecordingprocedureaswellasthequality requirementsthatwereimposedduringtheprocesseshavebeenpreviouslypublished(Bragaetal.,2007b)andforthis workwewillonlymentiontherelevantfeatures.Ourdatabasewascomposedby77recordingsofdistinctspeakers, around3minofspeecheach,containingphoneticallyandprosodicallyrichsentenceswhichallowedtheexpression of emotions.The speakers, with EP as mother tongue, had an age range between 20 and 42years andwere all native,speakingthestandarddialectalvariety(sinceregionalvarietiescanleadtolesspositiveevaluationsEklund and Lindström, 2001; van Bezooijen, 2005). All the speakers hadtheir own very personal speaking style which allowedtoobtainagooddiversityforeachparameter.
2.2. Subjectiveevaluation
Toevaluatetherecordedspeechweconductedasurvey,usingawebapplication,whereeachutterancewasrated accordingtopleasantness(andotherfactors)usinga5pointsscale.Theexactstatementtoberatedwas“Thisvoiceis themostpleasant,hastherightmelodyandisnotmonotonousatall”.Theutteranceswerepresentedinrandomorder andthelistenerswereaskedtouseheadphonesduringtheevaluation.Thesurveyreceived112responses,frommale andfemalelisteners,nativeandnon-nativespeakers,inanagerangefrom23to60yearsold.Theinter-rateragreement
wasanalysedusingtheKendal’s(KendallandSmith,1939)coefficientofconcordance.Theobtainedvaluewas0.673 whichindicatesmoderatetosubstantialagreementamonglisteners.
Thesubjectiveevaluationraisedseveralissuesrelatedtopotentialbiasingfactors.Sex,age,expertiseor native/non-nativespeaker,factorsthatgobeyondthesimpleselectionofparameters,canpossiblybiasthelisteners’judgement analysis.Onemajorconcernwasthelevelofexpertiseonspeechprocessing,sincethelistenershaddistinctbackgrounds. AsimilarproblemisaddressedbyKreimanetal.(1990)thatconcludes“thatperceptualstrategiesbetweenmoreand lessexperiencedlistenersarenotdifferent,butratherthattheselistenersadoptdifferentbaselinesduringperceptual tasks”.Toreducethegroupvariancethelistenerswereaskedtoratethevoicesmoreemotionallyratherthanusingany oftheirpreviousexperienceonthesubject.
2.3. Protocol
Usingthelisteners’voicepleasantnessratings(1–5)wedividedthedatabaserecordsintotwomajorclasses:pleasant (P)andnotpleasant(NP).Thefirstclasscontainedthevoicesrattedinthetwohighestscalepositions(35voices)while thesecondwascomposedbythevoicesrankedintheremainingpositions(42voices).Thevoicepleasantnessratings wereusedtotrainthevoicepleasantnessintensitymodelwhiletheclasseswereusedtotraintheclassificationmodule. Wehaveuseda10-foldcrossvalidationmethodologyandoneverygroupwehavetriedtokeepasimilarP/NPratio.
3. Featureextractionandselection
Theconstructionofanoptimalfeaturevectorisalwaysamajorconcerninmachinelearningproblems.Ononehand, becauseitisverydifficult,inthepresenceonanewproblem,toidentify,fromanextremelylargesetofpossibilities, whatfeatureswillbetterdescribeourclassesandleadtothebestresults.Ontheotherhand,becausethecomplexity ofmostmethodologiesgrowsexponentiallywhenthenumberoffeaturesincreases,havingareducedfeaturesetcan helptosignificantlyreducethecomputationalrequirements.
3.1. Featureextraction
Forthestudyofspeechwemainlyfoundtwoapproaches:one,whereasmallsetofdescriptorsiscarefullyselected basedontheauthorsexpertiseandanotherwhereahugenumberoffeaturesisextractedandprocessedinaposterior stage.In thelatestyears,withthe appearanceof severalfeatureextractiontools,suchas OpenEAR(Eyben etal., 2009),andwiththeimprovementoffeatureselectionalgorithms,wehaveassistedtoanincreaseonthenumberof featuresallowingtoexploreadditionaldimensions.Forthesereasonsthesecondapproachisbecomingmorepopular. Inourcase,sincewewereworkingwithasmalldatabase,wehadanextraconcernbecausemostmachinelearning algorithmsmethodsrequireasufficientnumberofrecordsinrelationtothenumberoffeatures inordertoprovide robustandmeaningfulresults.Wedecidedtouseamixedapproachwherewefirstselectedasetoffeatures,based ontheexperienceofotherauthorsinthisareaandinadjacentareas,andsecondweusedafeatureselectionstageto obtainanoptimizedsubset.
Hencewestartedwithfeaturesfromtheclinicalandvoicequalityareas.TheGRBAS(grade,roughness,breathiness, astenyandstrain)scale(PinhoandPontes,2002)isatypicalsubjectivescaleintheclinicalfieldalthoughsomeobjective featuresarealsoused.Fundamentalfrequencyandassociateddynamics,aswellasshimmer,jitter,HarmonictoNoise Ratio(HNR)(Lopesetal.,2008)andNormalizedAmplitudeCoefficient(NAQ)(Alkuetal.,2002),arepopularfeatures. Fromanintelligibility perspectivewe seethat itsassessmentishighly dependentonthe properarticulationof the languagesoundsaccordingtoitsstandardandthatitcanvarysignificantlywhendialectalvariationsareinvolved,even amongspeakersofthesamelanguage.HoweverAmano-KusumotoandHosom(2009)andlaterCoelhoetal.(2010)
showedthatparameterssuchasformanttrajectoriesinvowelsandsegmentaldurationshaveparamountimportance inspeechintelligibility.From the naturalnesspointof view wehave foundreferences tomeasuresof artefacts or discontinuities(Mayoetal.,2011).Fromthespeakeridentificationpointofviewwemainlyfindacousticfeaturesor spectralmodelswhicharecombinedintoaspeaker modelthatcanbeseenasauniquevoicesignature(Campbell, 1997;KinnunenandLi,2010).Finally,fromtheemotionalpointofview,thereisstillalackofagreementconcerning thefeaturesthatleadtooptimalidentificationresults(Schuller,2010)whilethetaskcomplexityhasenlargedwiththe highincreaseonthenumberofinvolvedfeaturesovertheyears.Forexample,Ververidisetal.(2004)used87acoustic
featurestorecognize5emotions,Schulleretal.(2006)used4000parameterstoidentify7emotionswhileinStuhlsatz etal.(2011)a6552dimensionalvectorwasused.Suchalargenumberoffeaturesrequiresverylargedatabasesin ordertoachievesufficientmodelrobustnessandonlycross-corporastudiescanleadtomoregeneralsolutions(Schuller etal.,2010;Stuhlsatzetal.,2011).
Onthevoicepleasantnessareatherearesmalldifferencesfromtheemotionanalysisintermsofextractedfeatures. IntheworkofSyrdaletal.(1998),RMSenergy,breathiness,long-termspectra,f0,formantsandtheirbandwidths, speakingrate,andTTSconcatenationandtargetcostsareusedforTTSvoicepleasantnessevaluation.Ontheother hand,Yabuokaetal.(2000)showedthattheobjectivevariables,suchascepstrumdistance,amplitudedistortion,and waveformdistortion,canprovideasensationof“clarity”whilephasedistortionanddifferentialspectrumdistortion canberelatedto“sensation”.Chattopadhyayetal.(2003)usedspeechrate,pausingandpitchtoidentifyapositive attitudetowardsvoicesadvertisement.Alsoinapersuasionrelatedstudy,WeissandBurkhardt(2010)used988features forcorrelatingvoicelikabilitywithsubjectiveratingsforclassificationpurposes.Burkhardtetal.(2011)used4368 featurescomprisinggroupsoflow-leveldescriptorsandasetoffunctionalsforeachgroup.Auditoryspectralfeatures providedthebestcontributionforreliablebinaryvoicelikabilityclassificationestimates.
NeverthelessandasinSchulleretal.(2007),formostcasesacombinationofprosody,voicequalityandspectral parametersandtheirdynamicsalwaysseemtoprovidethebestresults.Hence,ourprototypevectorencompasseda broadrangeofsignalaspects,coveringintra-andinter-periodcharacteristics,timeandfrequencydomainscontents andseveralstatisticsthatcouldcomplementtherawinformation.Itwasorganizedinfourgroups:(a)acousticfeatures, (b)signalfeatures,(c)periodicityfeaturesand(d)phonationspeedfeatures.
ForthecreationoftheprototypefeaturevectorwealsotookintoaccounttherecommendationsofWolf(1972)who advocatesthattheusedfeaturesshouldoccurnaturallyandfrequentlyinnormalspeech,beeasilymeasurable,have highvariabilitybetweenspeakers,beconsistentforeachspeaker,notchangeovertimeorbeaffectedbythespeaker’s healthandnot beaffectedbyreasonable backgroundnoise.In the firstgroup (a),we consideredthe fundamental frequency(f0)envelopeanditsfirst(f0)andsecondderivatives(f0).Fromthesewecalculatedfourfirstorder statistics,namelyaverage(Av),standarddeviation(Std),skewness(Sk)andkurtosis(Kt),andextractedthemaximum (Max)andminimum(Min)valuesoftheenvelope(minimumwasobtainedexcludingzeros).Forfourvowels,namely /a/,/6/,/i/and/u/(usingIPA-SAMPArepresentation)whicharethemostfrequentforEP(Teixeiraetal.,2001),we haveextractedthefirstfourformantsandtheirrelatedbandwidth([V]irepresentsthefrequencyofformantiforvowel [V])andalsocalculatedthesixabove-mentionedfunctionals.Thesecondgroup(b)includedtheinstantaneouspower (P)obtainedbyfollowingasimilarproceduretotheonedescribedforf0.Apossiblevoicequalityfactoristhestability andcrossperiodcoherenceofthesignalinvoicedsounds.Thisinformationwasincludedinthethirdgroup(c)where we haveconsideredjitter(J),shimmer (S) andharmonic-to-noiseratio (HNR).Foreachparameterwe considered severalvarietiessincetheycouldprovidenon-redundantinformation.Forjitterwehaveconsideredlocaljitter(Jloc) wehavealsoincludedothersimilarmetrics(thatcanprovidenon-redundantinformation):absolutejitter(Jabs),relative averageperturbation(Jrap),periodperturbationquotientandperiodicdifference(Jddp).Weaccountedforsixvarieties of shimmer:local(Sloc),localindB(SdB), periodicdifference(Sddp)andthreeamplitudeperturbationquotients computedfor distinctneighbourhoods using3, 5and11points. Wealsousedharmonic tonoiseratio(HNR) and thesamevalueindB(HNRdb).Alljitter,shimmerandharmonicityparameterswerecalculatedaccordingtoPraat’s (Boersma,2001)description.Finally,thelastfeaturegroup(d)composedofphonationspeedmetrics,includesthe wordrate(WR),aswordspersecond,speakingrate(SR),asphonemespersecond,andpauserate(PR),astherelation betweenpausetimeandtotalspeakingtime.ThefinalprototypevectorcompositionisshowninTable1.
Besidesthedescribedparameterswealsohaveconsideredaspeakermodelbasedon16MFCCsextractedfrom 20mswindowswith5ms overlaps,considering4 Gaussianmixtures.Thismodel,basedonperceptuallyweighted parameters,canprovideusefulinformationonspeakeridentificationsystemsthatcanshareseveralsimilaritieswith thevoicepleasantnessevaluationproblem.Sincethismodelcanonlybeconsideredasawholewetreateditseparately andnoMFCCfeatureswereusedduringthefeatureselectionstage.
3.2. Multi-criteriafeatureselection
Theprototypefeaturevectordescribedintheprevioussectioniscomposedby179dimensions.Thisnumberbrings an increased complexity for the development of the classifier andsome of the components may provide little or redundantinformationfor thetask.Additionally, fromamachinelearningpointof view, itis necessarytohavea
Table1
Compositionofprototypefeaturevectorforvoicepleasantnessanalysis.
Group Feat. Stat. #
Acoustic f0,f0,f0 Av,Std,Kt,Sk,Min,Max 18
4× Vow.:4Fmt Av,Std,Kt,Sk,Min,Max 64
4× Vow.:4Bw Av,Std,Kt,Sk,Min,Max 64
Signal P,P,P Av,Std,Kt,Sk,Min,Max 18
Periodicity Jitter – 4
Shimmer – 6
HNR,HNRdb – 2
Phonationspeed WR,SR,PR – 3
sufficientlylargedatabasetoensureanadequategeneralizationperformanceoftheinvolvedmodels.Thisisusually notthecase.Sincetheincreaseinthenumberofrecordsmaynotbeeasy,itiscommontousetwodistinctcategories ofalgorithmstoreducethesizeofthefeature vector:oneoperatesbytransformingthefeaturespaceandtheother oneisagoalorientedsearchinordertofindanoptimizedfeaturesubsetspace.Inbothcasestheaimistomaximize theclassdiscriminativepowerandtoreduceinformationredundancy(TheodoridisandKoutroumbas,2006).Inthe firstcategorywecanfindlinearmethods,suchasPrincipalComponentsAnalysisorMultidimensionalScaling(Borg andGroenen,2005),andnon-linear methods,likeSelfOrganizingMaps(Kohonen,2000)orISOMap(Tenenbaum etal.,2000).vanderMaatenandHinton(2008)andvanderMaatenandvandenHerik(2009)havecompared32 populardimensionalityreductiontechniquesandhaveconcludedthat“nonlineartechniquesareoftennotcapableof outperformingtraditionaltechniquessuchasPCA”.Inthesecondcategorythereisawidevarietyofmethodologiesand thelargeamountofpublishedarticles,oftencontradictory,makesitdifficulttoclearlyselecttheoptionthatwillbest suitagivenproblemordataset.Zhaoetal.(2009)performedacomparisonoftwelvecommonlyusedfeatureselection methodologiesusingseveraldistinctdatasetsandshowedthatsimplemethodologies,suchasthet-test,canperform asgoodasothermodernandmorecomplexapproaches,liketheKruskal–Wallismethodology(CoverandThomas, 1991).Intheemotionrecognitionarena,Luengo(2010)hasdevelopedasystemcentredinthebasicEkman’sbigsix (Ekmanetal.,1969),thatachievedgoodresultsusingthemRMRfeatureselectionmethodology(Pengetal.,2005) (alsoincludedinthestudyperformedbyZhaoetal.(2009)).TheuseofSequentialForwardSelection(Ververidisetal., 2004)andrelatedalgorithms,suchasSequentialForwardFloatingSelection(Pudiletal.,1994;HassanandDamper, 2009;LuggerandYang,2008),isalsopopularbutitcanleadtosub-optimalsolutions.
Inourcasewewantedtoobtainanoptimalsubsetthatcouldonlybeguaranteedbyanexhaustivesearch.However, withsuchalargenumberof features,thedirectapplicationofthisapproachwouldbecomputationallyunfeasible. Thiswaywehadtoapplyamulti-criteriamethodologyforfeaturepre-processingthatwedesignatedbymulti-criteria featureselection(MCFS).Ourpipelinewascomposedbyfourstages:feature normalization,featurepre-selection, compositefeatureselectionandexhaustivesearchselection.Hence,consideringthat thevaluesoneach dimension followedaGaussiandistribution,wenormalizedthefeaturecomponentsbycalculatingtheirZ-values.Thisprocedure centredthepointsintheoriginandequalizedtherangeof valuespreventingthedominationofattributes thatvary inhighernumericranges.Wethendefinedasoutliersallthepointsthatexceedinvaluetwostandarddeviations,in anyvectordimension,andremovedthem(thisallowstokeeparound95%ofthevaluesaroundthemeanvalue).A variationoftheKolmogorov–Smirnovtest(Lilliefors,1967)wasusedtoverifytheinitialassumptionthatthevalues oneachdimensionfollowedanormaldistribution.Thenat-test,witha90%confidenceinterval,wasusedtoanalyse ifagivenfeaturecouldbeusefultodiscriminatebetweentwoclasses.Thefeaturesthathavenotpassedthesetests wherediscarded.Thenwestartedthecompositefeatureselectionbyfirstcreatingafeaturelistsortedaccordingwith thefeature’sclass-discriminatorypower.Addingintoaccounttheinter-featurecorrelationwehaveiterativelycreateda newlistwhereeachfeaturewasselectedbyhavingthehighestclass-discriminatoryandthelowestcorrelationwiththe previouslyselectedfeatures.Fromthislistwehaveselectedthe12highestrankedfeaturesandperformedanexhaustive searchforcombinationsofnfeaturesusingscattermatrices(Fukunaga,1990).Ascostfunctionforclassseparability measurementweusedtheJ3criteriadefinedas:
whereSwistheintraclassscattermatrixandSbistheinterclassscattermatrix.Foreachnweretainedthehighest scoredsubset.Thisprocessallowsobtainingafeaturelistwherethebestrankedfeaturesmaximizethediscriminative powerandminimizetheredundancy.Toevaluatetheperformanceofourapproachwecomparedtheresultswiththe onesobtainedusingPCAandISOMap,twopopularmethodologies.
4. Classificationandintensityestimation
Voicepleasantnessclassificationisaspecificpatternrecognitionproblemunderthemachinelearningtopic.Inthis case,thegeneraltaskofassigningalabelorvaluetoagiveninputpatternappearsintheformofdefiningifavoice ispleasantornotusingasetofobjectivefeaturesextractedfromagiveninputaudiostream.Thevoicepleasantness intensitymeasurementtaskinvolvesmeasuringthedegreeofpleasantnesswhenavoiceisclassifiedasbeingpleasant ornotpleasant.
4.1. Classification
Fromtheemotionrecognitionarea,thathassimilaritieswithourproblem,weoftenfind,forclassificationpurposes, methodologiesbasedonArtificialNeuralNetworks(ANN)(Unluturketal.,2009)orSupportVectorMachines(SVM) (ShaukatandChen,2008),withthelaterbeing morepopular.Infact,ShamiandVerhelst(2007)andCasaleetal. (2008)performedacomparisonofseveralclassificationtechniquesandobtainedthebestresultsusingSVMs.More recently,newmethodologies,suchasLogitBoostAlternatingDecisionTrees(WeissandBurkhardt,2010)orensembles ofREPTrees(Burkhardtetal.,2011),releasedasplug-insfortheWEKAtoolkit(WittenandFrank,2005),arealso becomingpopular.Howevertheperformanceofthenewapproachesmaybehighlyconditionedbythedatadistribution whileANNandSVMoftenofferamoreconsistentgoodperformanceacrossdatabases(Schulleretal.,2010).Onthe otherhand,HassanandDamper(2009)reportedtobeatthebestreportedclassificationresultsontheBerlinandDES emotiondatabasesusingastrategybasedonasimpleclassifier(k-nearestneighbour).Inourcaseweconsideredthe ANNandSVMapproaches,duetotheirrobustnessandflexiblemodels,andcomparedtheirperformance.
Sincevoicepleasantnesscanbeamorepermanentcharacteristicand,assumingthatitisnotmanipulatedoraffected byanillstate,itcanalsobeunderstoodasbeingintrinsictothespeaker.Inthissensewealsoevaluatedthesuitability ofaspeakeridentificationmodelwherethespeaker’sacousticsignatureisusedtorepresentvoicepleasantnessidentity. ForthisweusedaBayesianclassifierandaspeakermodelbasedonGaussianmixtures(GMM),whichcanleadto goodresults(Beigi,2011).TheimplementationwasinspiredonReynoldsetal.(2000).
The ANNandGMM basedapproaches wereimplementedusing Matlab scripts.TheANN hadafeed-forward structurewith2hiddenlayersand14neuronseach(thisconfigurationwasobtainedbyoptimizingthecostfunction). Thesizeoftheinputlayerwasequaltothedimensionoftheinputvector(whichwasoneofthevariedparameters). Foroutputwehadasingleneuron,withasigmoidactivationfunction,where0and1representednotpleasantand pleasant,respectively.Theback-propagationalgorithmwasusedfortraining.TheGMMmodelwasbuiltasdescribed inthefeaturessectionandtheExpectationMaximization(EM)(Dempsteretal.,1977)algorithmwasusedfortraining. TheSVMapproachwasbasedontheimplementationprovidedbytheLibSVM(ChangandLin,2011)toolkitusing aC-SVCSVM modelandradial basisfunctionswithagammavalueequal to8 andacost parameterof2 (these parameterswereadjustedbyminimizingclassificationerrorwithagridsearchmethodology).
Theclassifiers’training/testingprocesseswasperformedusing a10-foldcrossvalidationthat allowedtoobtain statisticallymeaningfulresults.Alatefusionbasedonaweightedvotingscheme,adjustedbycalculatingtherelative numberoftruepositivesforeachclassifieragainstthetotalnumberoftruepositives,allowedtoreachahybridmodel forcategoryestimation.
4.2. Voicepleasantnessintensityestimation
Besidesidentifyingvoicepleasantnessitwasalsoourgoaltomeasureitsintensity.Thisisdifferentfromtheconcept of“ranking”,that oftenappearsinthemachinelearningliterature,wheretheprobabilityassociatedwiththebinary classificationdecisionisusedtoprovideadditionalinformationconcerningitstrustfulness.Thiscanbeviewedasan “intensityestimation”fortheclassbutforourcasethisisoflimiteduse.Forexample,whenagivenpointislocated veryfarfromthedecisionboundaryandrightinsidetheclass,itwillhaveassociatedahighprobabilitybutitmight
Fig.2. Systemarchitecture.
notrepresentan optimallocation interms ofvoice pleasantness. The feature selectionprocesstested features for normalitythusmakingthedataGaussiancompatible.Hence,forvoicepleasantnessintensityestimationwehaveused amultivariateGaussiandistributionwhoseparametersweretrainedwithvoicejudgementdatafromthetwoclasses andusingtheoptimizedfeaturevectorpreviouslydevised.Eachclasshasitsownmodelthatischosenafterhaving thevoicepleasantnessclassestimation.Thisapproachallowedtouseanonlinearclassdiscriminationschemethatwill firstdividethefeaturespaceandthenwilluseaclassdependentmodel,basedonamultivariateGaussiandistribution, toreachafinalvoicepleasantnessintensityestimation.
Asanalternativescenariowehaveusedonlythefirsttwobestfeatures/componentscalculatedinaccordancewith thepresentedmethods.Foreachparameterpair(p1,p2)weassociatedapleasantnessrankingasadependentfunction
r=f(p1,p2).Thenusingauniform bicubicB-splinesurfaceS(x,y), westarted withaplane that wassuccessively deformedinordertoapproximatethetruerankingvalueforthegivenpoints(Barsky,1982).Foreachclasswehave adistinctsurfacethatwasadjustedtotherelatedpoints.Thisapproachcanprovidegoodresultsthoughitposessome limitationsduetoitsnon-parametricnature.
5. Evaluationandresults
ThebestresultswereobtainedusingtheconfigurationdepictedinFig.2.Thisoptimizedpipelineallowedtoachieve afinalvoiceclassificationerrorrateof9.1±2.1%andavoicepleasantnessintensityestimationabsoluteerrorrateof 15.7±2.5,bothfora90%confidenceinterval.
Table2
Listofthefirst84featuresintherankedlist.
0 12 24 36 48 60 72 84
1 Jrap f0Min VoB4Sk VeF1Sk VaB1Kt VeF1Kt NHR ViF2Max
2 SR PSk VeB1Av f0Max VeB3Max VeB3Min ViF1Max VaF3Max
3 VaB1Av VaF2Kt VaF1Std VoB4Av VaB1Max VoB4Max VoF2Kt VaB3Kt
4 PAv VoB3Std VeF3Max VoF3Min VoB4Std ViF3Std ViB2Max ViF2Sk
5 f0Sk PMin VoB3Av VoB1Av VaB2Std VeF4Min ViF2Min VeB1Sk
6 f0SD VoB3Sk PSD f0SD ViF4Min VeB4Min VoF4Sk VeF3Min
7 Sapq3 PMax PR VoB1Min VeF4Sk ViF3Min VoF2Sk Sdda
8 HNR(dB) VaB4Min HAC VaB2Max ViF1Av VeF2Std VeB3Sk Jddp
9 f0Av VeB4Kt f0K VaF4Kt ViB2Av VoB3Min VeF3Av VoF4Max
10 f0Av JLAbs VeB2Kt PAv PAv f0Min VoF4Min VaB3Max
11 PKt VeB3Std JLAbs2 VoB2Sk ViF2Kt VoF1Std VeF2Sk VaB1Sk
12 f0Max PSK VoF2Std VaF4Sk ViF1Min VoB1Std VaB4Std VoB3Max
Table3
Referencevaluesforthebestrankedfeatures.Valuesareabsolutewitharelativevariationwithina90%confidenceinterval.
Feature Jrap PAv f0Sk Sapq3 f0Av f0Max
Value 0.00946±9.3% 0.0110±12.1% 0.576±4.7% 0.02697±5.9% 196±5.3% 34.24±6.2%
The systemoptimizationwas performedbyindividuallyadjusting eachof theparameterson everystagewhile observingtheperformanceimprovementsintheoutput.Theperformancewasalwaysmeasuredintermsofclassification error(sumoffalsepositivesandfalsenegativesoverthetotalnumberofevaluatedcases)anda10-foldcrossvalidation allowedtoreachthemeanerrorrateaswellastheconfidenceinterval.
Wewillfollowthepipelinefromthebeginning.Afterextractingthefeatures,wehaveappliedtheproposedMCFS methodology.Therankedfeaturelist,aftercompositefeatureselection,ispresentedinTable2.
Fromtherankedfeaturelistwewenttoinvestigatetheamountof necessaryfeaturestoprovidethebestresults. UsingaSVMclassifierwithaRBFkernel,whosegammaandcostparameterwereoptimizedineachexperiment,we testedseveralcombinationsofnoutofthe12bestrankedfeaturesinthelist.Thebestresultswereobtainedusinga6 elementsvectorcomposedbythefeaturesrepresentedinboldfaceinthefirstcolumnofTable2.Therelatedvaluesare presentedinTable3.Additionallywewantedtoevaluateifitwaspossibletoobtainbetterresultsusingdimensionality reductionapproachessuchasPCAorISOMap.InFig.3wecanobservetheobtainedclassificationerrorandtherelated 90%confidenceintervalforthethreeapproacheswhilevaryingthenumberoffeaturesorvectorcomponents.Wecan seethatPCAandISOMapclearlyoutperformMCFSwhenareducednumberofcomponentsareconsideredandthat theevaluationofasmallsetofindependentfeaturescannotprovideasufficientdiscriminantpower.Theperformance ofallmethodsimproveswiththesizeofthefeaturevectoruntilacertainpointwithPCAandISOMapachievingthe bestresultswith5componentsandMCFSwith6featuresandtheoverallsmallesterrorrate.Whenconsideringalarger feature vectorwecanseethat theMCFSperformanceisdegradedwhichmayindicatedifficultiesduringclassifier
0 5 10 15 20 25 3 2 4 5 6 7 8 9 10 MCFS PCA ISOMap
Table4
CompositionofthefirstPCAcomponent.
Component 1 2 3 4 5 6
Feature SR Sapq3 f0Av EMax Jrap VeB3Kt
Weight 12.2 10.8 8.5 5.5 5.3 3.9
Acc.weight 12.2 23.0 31.5 37.0 42.3 46.2
Component 7 8 9 10 11 12
Feature f0SD ESK VeB1Av HAC EMax PR
Weight 3.7 3.0 2.8 2.7 2.5 2.3
Acc.weight 49.9 52.9 55.7 58.4 60.9 63.2
trainingwhendealingwithadditionaldimensions.Ontheotherhand,PCAandISOMap,showedtobemuchmore robusttovectorsize.
SincePCAandISOMapperformed wellwithsmall sizeddescription vectorsandwewantedtounderstandthe features thatbest represented voicepleasantnesswe investigated the composition of thefirst PCA component. In
Table4weseethe12highestweightedfeaturesaswellastheiraccumulatedweight.Wecanobservethatmostofthese featuresarealsoamongthehighestrankedintheMCFSfeaturelist(Table2)whichreinforcestheirimportanceonthe definitionofvoicepleasantness.Itisalsonoticeablethatthemostimportantfeaturesinbothapproachesaremainly voicequalityandprosodyparameters.Theaveragef0,themaximumf0,speakingrateandpauserate,amongothers, arealsolisted ashavingahighcorrelationwithvoicelikabilityinBragaetal.(2007a),whichcanbringadditional confidenceintheresults.
Intheclassificationstage,andstillthinkinginthepreviousstage,wecomparedtheperformanceoftheSVMand ANNapproacheswhilevaryingthefeaturevectordimension.TheresultscanbeseeninFig.4(a)whereweobserve thattheSVMclassifierstillholdsthebestperformance.(TheGMMclassifierwasevaluatedseparatelyandwehave notvariedthedimensionoftheinputvector.)
For classification we evaluated independently three classifiers, ANN, SVM and GMM as can be seen on
Table5(a)–(c),respectively.TheSVMclassifierperformedbetterandtooptimallytunetheSVMclassifier,we evalu-atedthesystem’sperformancewhenvaryingthetypeofkernelfunctionandrelatedparameters,asshowninthefirst rowofTable6(beforefusion).WecanobservethattheRBFkernelperformedbetterbutveryclosely,followedbythe quadraticpolynomialkernel(despitethemuchlongertrainingtimerequiredbythelast).
InTable5(c)wecanobservethattheGMM–Bayesclassifiershowedtohaveabetterperformancewhenidentifying thenotsopleasantvoices.TofurtherimprovethesystemwecombinedtheSVMclassifier,whichachievedthebest overallperformance,withtheGMM/Bayesclassifierusingaweightedvotingscheme(performedduringtraining). Thishybridapproachallowedtoimprovetheoverallresults,whichareshowninthesecondrowofTable6.Forthe RBFkernel,arelativeweightof30%GMMand70%SVMwasobtainedbyanerrorminimizationsearchprocedure. Additionallywealsoevaluatedinwhatextentthefeatureselectionprocessandtheinclusionofasecondparallel classifier(GMM)helpedtoimprovetheclassificationresults.InTable7,wecanobservetheimprovementsintroduced
0 5 10 15 20 25 30 10 9 8 7 6 5 4 3 2 SVM ANN
(a) Classification error
0,00 0,05 0,10 0,15 0,20 0,25 0,30 8 7 6 5 4 3 2 9 10
(b) Voice pleasantness intensity estimation error
Fig.4.Performanceevaluationaccordingtofeaturevectordimension.Horizontalaxisindicatesthedimensionofthefeaturevectorandvertical axisshowserrorinpercentage(a)andabsolutevalue(b).
Table5
Confusionmatriceswithclassificationresults,inpercentage,fordistinctapproaches.Classesarepleasant(P)ornotpleasant(NP).Ontheleftwe havetherealclasses(R)andonthefirstrowwehavetheestimatedclasses(E).
R\E P NP (a)ANN P 80.5 21.1 NP 19.5 78.9 (b)SVM P 89.8 12.7 NP 10.2 87.3 (c)GMM P 68.5 11.4 NP 31.5 88.6 Table6
Voicepreferenceclassificationerrorfora90%confidenceintervalusingdifferentkerneltypeswithaSVMbasedclassifierandafterfusionwitha GMM/Bayesclassifier.(Twoclassesoutlierfreedataset,errorwithina90%confidenceinterval.)
Linear Quadratic RBF Sigmoid
SVMalone 23.7±3.7% 18.8±3.2% 11.5±2.6% 24.2±3.9%
Class.fusion 19.3±3.7% 14.5±2.1% 9.1±2.1% 21.9±4.2%
Table7
Classificationerrorrateaftertheintroductionofeachcomponent.(Errorwithina90%confidenceinterval.)
Non-tunedsystem Independentfeat.sel. Compositefeat.sel. Optimizedfeat.subset Compositeclassification
38.5±7.6% 18.9±5.2% 14.4±2.3% 12.3±3.1 9.1±2.1%
byeachblockofthepipelineusingasreferencetheresultsobtainedwithaRBFkernel(sinceitperformedbetter)and consideringalltheavailableobjectivefeatures.
Afterobtainingaclassificationresultweperformedthevoicepleasantnessintensityestimation.Theperformance evaluationwasbasedonthemeanabsoluteestimationerror,orbyotherwords,themeanabsolutedifferencebetween thetruehumanjudgedpleasantnessrank(ona1–5scale)andthevalueestimatedbythesystem.Theresultsobtained forthe multivariateGaussianmodelapproacharedepictedinFig.4(b)wherewe canobservethat thebestresults, 15.7±2.5%fora90%confidenceinterval,areobtainedagainfora6dimensionsvector.Thesurfacebasedapproach wasimplementedusinga20×20patchgridsupportedonthebesttworankedfeaturesusingtheMCFSprocessand thefirsttwocomponents/vectorsofthePCAandISOMapmethodologies.Theobtainederrorrateswere27.3±5.7%, 15.8±3.2%and15.6±3.9%,respectively,fora90%confidenceinterval.Asurfacemappedfromabidimensional referencebasedon rawfeatures wasclearlyinappropriatehowever,theuseof PCA orISOMap, ledtosuccessful results.EspeciallytheISOMapbasedreferenceprovidedaresiduallylowererrorratebutwithahighervariationinthe confidenceinterval.Forthisreasonwepreferredtheparametricapproachwithbetterrepeatability.
6. Conclusionsandfuturework
Inthispaperwehavepresentedanoptimizedpipelinefortheautomaticevaluationofvoicepleasantness.Westarted bydefiningthepleasantnessconcept,whichhasacentralroleinourwork,andshowedhowitcanberelatedtothe perceptionofvoiceasstimuli.Anextensivepresentationofthemostsignificantstudiesintheareawasalsopresented. Aftershowingan overviewof theprocessingpipeline,we havedetailed thefunctionsof eachstage. Weprovided abriefpresentationofthedatabase,whichwasspecificallydevelopedforthestudyofvoicepleasantness.Then,we thoroughlyexplainedtheprocedurethatwasfollowedtoobtainanoptimalfeaturevectorforachievingimprovedresults. Westartedbyexploringabroadrangeofdimensions,coveringacoustic,signal,periodicityandphonationspeedaspects
andsuccessivelyselectedthebestfeaturesbymaximizingtheirclassdiscriminatorypowerandbyreducingthe inter-featureredundancy.Thisapproachprovidedthebestresultsforourdataafteracomparisonwithothermethodologies. Ouroptimizedfeaturevector, mainlycomposedbyprosodic andvoicequalityparameters,included Jrap(relative averageperturbation),PAv(derivativeofaveragepower),f0Sk(skewnessoffundamentalfrequencyderivative),
Sapq3(amplitudeperturbationquotient),f0Avandf0Max.ASVMclassifier wassuccessfullycombined witha GMM/Bayesmethodologyin alatefusion scheme.Thisnewapproach,motivatedbythe specificnature of voice pleasantnessanalysis,allowedtoachieveafinalvoicepleasantnessclassificationerrorrateof9.1±2.1%,whichis betterthanotherreportedmarksforsimilartasks.Finally,forestimatingvoicepleasantnessintensity,weachievedthe bestresultsusingaclassdependentmultivariateGaussianmodel,whichprovidedanerrorrateof15.7±2.5%.The obtainedresultswereobtainedusingadatabasethatwasspecificallydevelopedforthestudyofvoicepleasantness.
Thepresentedsystemisapowerfulobjectiveanalysistool thatcansupportsubjectiveevaluations duringvoice talentselectionstagefornewvoicefontsrecording,thuscontributingtoreduceoverallTTSdevelopmentcosts.This workalsoprovidesanadditionalinsightonthestudyofpleasantness,oneofthekeydimensionsofexpressivespeech, andcancontributetotheenhancementofTTSsystems’voicequalityaswellastoleveragetheintroductionofthis technologyinmorereallifeapplications.
Asfutureworkweplantoapplyasimilarmethodologytootherlanguagesallowingtosimultaneouslyincrease theavailabledataandtoevaluatethecross-languageperformance.Anextensionoftheworkforcharacterizingmale voicesisalsoenvisioned.
Acknowledgements
ThisworkwaspartiallysupportedbyERDFfunds,theSpanishGovernment(TEC2009-14094-C04-04),andXunta deGalicia(CN2011/019,2009/062).
References
Alku,P.,Backstrom,T.,Vikman,E.,2002.Normalizedamplitudequotientforparametrizationoftheglottalflow.AcousticalSocietyofAmerica2, 701–710.
Amano-Kusumoto,A.,Hosom,J.-P.,2009.Theeffectofformanttrajectoriesandphonemedurationsonvowelintelligibility.In:Proc.IEEEConf. onAcoustics,Speech,andSignalProcessing,pp.4677–4680.
Audibert,N.,Vincent,D.,Aubergé,V.,Rosec,O.,2006.Expressivespeechsynthesis:evaluationofavoicequalitycenteredcoderonthedifferent acousticdimensions.In:Proc.of3rdInternationalConferenceonSpeechProsody,Dresden,Germany.
Barsky,B.A.,1982.Endconditionsandboundaryconditionsforuniformb-splinecurveandsurfacerepresentations.ComputersinIndustry3,17–29. Beigi,H.,2011.FundamentalsofSpeakerRecognition.Springer.
Biadsy,F.,Rosenberg,A.,Carlson,R.,Hirschberg,J.,Strangert,E.,2008.Across-culturalcomparisonofAmerican,Palestinian,andSwedish perceptionofcharismaticspeech.In:Proc.SpeechProsody,Campinas,Brazil.
Boersma,P.,2001.Praat:doingphoneticsbycomputer.GlotInternational9/10(5),341–345,URLhttp://www.fon.hum.uva.nl/praat/. Borg,I.,Groenen,P.,2005.ModernMultidimensionalScaling:TheoryandApplications.Springer-Verlag.
Braga,D.,Coelho,L.,Junior,F.G.R.,Dias,M.S.,2007,July.SubjectiveandobjectiveevaluationofBrazilianPortugueseTTSvoicefontquality. In:Int.WorkshoponAdvancesinSpeechTechnology,Maribor,Slovenia,pp.306–311.
Braga,D.,Coelho,L.,Junior,F.G.V.R.,Dias,M.S.,2007,October.SubjectiveandobjectiveassessmentofTTSvoicefontquality.In:Proc.of InternationalConferenceonSpeechandComputers(SPECOM2007),Moscow,pp.306–311.
Burkhardt,F.,Eckert,M.,Johannsen,W.,Stegmann,J.,2010.Adatabaseofageandgenderannotatedtelephonespeech.In:Proc.Languages ResourcesEvaluationConference.
Burkhardt,F.,Paeschke,A.,2005.AdatabaseofGermanemotionalspeech.In:Proc.Interspeech.
Burkhardt,F.,Schuller,B.,Weiss,B.,Weninger,F.,2011,August.“Wouldyoubuyacarfromme?”–onthelikabilityoftelephonevoices.In:Proc. Interspeech,Florence.
Campbell,J.P.,1997.Speakerrecognition:atutorial.ProceedingsoftheIEEE85(9),1437–1462.
Campbell, N., 2008. Expressive/affective speech synthesis. In: Springer Handbook on Speech Processing. Springer, http://www.speech-data.jp/fastnet/ApplicantCV-Campbell.pdf,pp.505–517.
Casale,S.,Russo,A.,Scebba,G.,Serrano,S.,2008.Speechemotionclassificationusingmachinelearningalgorithms.In:Proceedingsofthe2008 IEEEInternationalConferenceonSemanticComputing.IEEEComputerSociety,Washington,DC,USA,pp.158–165.
Chang,C.-C.,Lin,C.-J.,2011.LIBSVM:alibraryforsupportvectormachines.ACMTransactionsonIntelligentSystemsandTechnology2, 27:1–27:27.
Chattopadhyay,A.,Dahl,D.W.,Ritchie,R.J.,Shahin,K.N.,2003.Hearingvoices:theimpactofannouncerspeechcharacteristicsonconsumer responsetobroadcastadvertising.JournalofConsumerPsychology13(3),198–204.
Coelho,L.,Braga,D.,Dias,M.,Mateo,C.,2011.Anautomaticvoicepleasantnessclassificationsystembasedonprosodicandacousticpatternsof voicepreference.In:Proc.ofInterspeech.
Coelho,L.,Braga,D.,Garcia-Mateo,C.,2010.KalmantrackinglinearpredictorforvowelintelligibilityenhancementonEuropeanPortuguese HMMbasedspeechsynthesis.In:ProcofICASSP.No.1,pp.4734–4737.
Cover,T.,Thomas,J.,1991.ElementsofInformationTheory.JohnWiley&Sons,NewYork.
Dempster,A.P.,Laird,N.M.,Rubin,D.B.,1977.Maximum-likelihoodfromincompletedataviatheEMalgorithm.JournaloftheRoyalStatistical Society:SeriesB39.
Eklund,R.,Lindström,A.,2001.Xenophones:aninvestigationofphonesetexpansioninSwedishandimplicationsforspeechrecognitionand speechsynthesis.SpeechCommunication35(2),81–102.
Ekman,P.,Sorenson,E.,Friesen,W.V.,1969.Pan-culturalelementsinfacialdisplaysofemotions.Science164,86–88.
Eyben,F.,Wöllmer,M.,Schuller,B.,2009.OpenEAR–introducingtheMunichopen-sourceemotionandaffectrecognitiontoolkit.In:Proc.4th Int.HUMAINEAssociationConf.onAffectiveComputingandIntelligentInteraction.
Fellbaum,C.,1998.WordNet:AnElectronicalLexicalDatabase.TheMITPress,Cambridge,MA. Fukunaga,K.,1990.IntroductiontoStatisticalPatternRecognition.AcademicPress.
Gobl,C.,Chasasaide,A.N.,2003.Theroleofvoicequalityincommunicatingemotion,moodandattitude.SpeechCommunication40,189–212. Hassan,A.,Damper,R.I.,2009.Emotionrecognitionfromspeechusingextendedfeatureselectionandasimpleclassifier.In:Proc.Interspeech,
Brighton.
Kendall,M.G.,Smith,B.,1939.Theproblemofmrankings.TheAnnalsofMathematicalStatistics10(3),275–287.
Kinnunen,T.,Li,H.,2010.Anoverviewoftext-independentspeakerrecognition:fromfeaturestosupervectors.SpeechCommunication52(January), 12–40.
Kohonen,T.,2000.SelfOrganizingMaps.Springer.
Kreiman,J.,Gerratt,B.,Precoda,K.,1990.Listenerexperienceandperceptionofvoicequality.JournalofSpeechandHearingResearch33, 103–115.
Lilliefors,H.W.,1967.OntheKolmogorov–Smirnovtestfornormalitywithmeanandvarianceunknown.JournaloftheAmericanStatistical Association62,399–402.
Lopes,J.,Freitas,S.,Sousa,R.,Matos,J.,Abreu,F.,Ferreira,A.,2008.AmedidaHNR:Suarelevâncianaanáliseacústicadavozesuaestimac¸ão precisa.In:Proceedingsof“IJornadassobreTecnologiaeSaúde”,Guarda,Portugal.
Luengo,I.,2010.Análisisyevaluacióndeparámetrosparaidentificaciónautomáticadeemocionesenelhabla.Ph.D.Thesis,UniversidaddelPaís Vasco.
Lugger,M.,Yang,B.,2008.calmotivatedmulti-stageemotionclassificationexploitingvoicequalityfeatures.In:SpeechRecognition,Technologies andApplications,pp.395–410.
Mayo,C.,Clark,R.A.J.,King,S.,2011.Listeners’weightingofacousticcuestosyntheticspeechnaturalness:amultidimensionalscalinganalysis. SpeechCommunication53(March),311–326.
Montero,J.M.,Gutiérrez-Arriola,J.,Colas,J.,Enriquez,E.,Pardo,J.M.,1999.AnalysisandmodellingofemotionalspeechinSpanish.ICPhS99 2,957–960.
OxfordDictionary,2009[Online;accessed13.03.2011].URLhttp://oxforddictionaries.com/.
Peng,H.,Long,F.,Ding,C.,2005.Featureselectionbasedonmutualinformation:criteriaofmax-dependency,max-relevanceandmin-redundancy. IEEETransactionsonPatternAnalysisandMachineIntelligence27,1226–1238.
Pinho,S.,Pontes,P.,2002.Escaladeavaliac¸ãoperceptivadafonteglótica:Rasat.VoxBrasilis8(3),8–13.
Pudil,P.,Ferri,F.,Novovicova,J.,Kittler,J.,1994.Floatingsearchmethodforfeatureselectionwithnonmonotoniccriterionfunctions.Pattern Recognition2,279–283.
Reynolds,D.A.,Quatieri,T.F.,Dunn,R.B.,2000.SpeakerverificationusingadaptedGaussianmixturemodels.DigitalSignalProcessing10,19–41. Schuller,B.,2010.Ontheacousticsofemotioninspeech:desperatelyseekingastandard.JournaloftheAcousticalSocietyofAmerica127(3),
1995.
Schuller,B.,Arsic,D.,Wallhoff,F.,Rigoll,G.,2006.Emotionrecognitioninthenoiseapplyinglargeacousticfeaturesets.In:Proc.SpeechProsody, Dresden.
Schuller,B.,Batliner,A.,Seppi,D.,Steidl,S.,Vogt,T.,Wagner,J.,Devillers,L.,Vidrascu,L.,Amir,N.,Kossous,L.,Aharonson,V.,2007,August. Therelevanceoffeaturetypefortheautomaticclassificationofemotionaluserstates:lowleveldescriptorsandfunctionals.In:Proceedingsof Interspeech,Antwerp,Belgium.
Schuller,B.,Vlasenko,B.,Eyben,F.,Wollmer,M.,Stuhlsatz,A.,Wendemuth,A.,Rigoll,G.,2010.Cross-corpusacousticemotionrecognition: variancesandstrategies.IEEETransactionsonAffectiveComputing1(July),119–131.
Schröder,M.,2009.Expressivespeechsynthesis:past,present,andpossiblefutures.In:AffectiveInformationProcessing.Springer,London,pp. 111–126.
Shami,M.,Verhelst,W.,2007.Anevaluationoftherobustnessofexistingsupervisedmachinelearningapproachestotheclassificationofemotions inspeech.SpeechCommunication49(March),201–212.
Shaukat,A.,Chen,K.,2008.Towardsautomaticemotionalstatecategorizationfromspeechsignals.In:INTERSPEECH’08,pp.2771–2774. Stuhlsatz,A.,Meyer,C.,Eyben,F.,Zielke,T.,Meier,H.-G.,Schuller,B.,2011.Deepneuralnetworksforacousticemotionrecognition:raisingthe
benchmarks.In:ProcofICASSP.IEEE,pp.5688–5691.
Syrdal,A.,Conkie,A.,Stylianou,Y.,1998.Explorationofacousticcorrelatesinspeakerselectionforconcatenativesynthesis.In:Proc.ofInt.Conf. SpeechandLang.Processing98,Lisbon,Portugal.
Teixeira,J.P.,Freitas,D.,Braga,D.,Barros,M.J.,Latsch,V.,2001.PhoneticeventsfromthelabellingtheEuropeanPortuguesedatabaseforspeech synthesis,FEUP/IPB-DB.In:Proc.ofEurospeech.
Tenenbaum,J.B.,deSilva,V.,Langford,J.C.,2000.Aglobalgeometricframeworkfornonlineardimensionalityreduction.Science290,2319–2323. Theodoridis,S.,Koutroumbas,K.,2006,February.PatternRecognition,3rded.AcademicPress.
Unluturk,M.S.,Oguz,K.,Atay,C.,2009.Emotionrecognitionusingneuralnetworks.In:Proceedingsofthe10thWSEASInternationalConference onNeuralNetworks.WSEAS,StevensPoint,WI,USA,pp.82–85.
vanBezooijen,R.,2005.Approximant/r/inDutch:routesandfeelings.SpeechCommunication47(1),15–31.
L.J.P.vanderMaaten,E.P.,vandenHerik,H.,2009.Dimensionalityreduction:acomparativereview.Tech.rep.,TilburgUniversityTechnical Report.
vanderMaaten,L.,Hinton,G.,2008.Visualizinghigh-dimensionaldatausingt-sne.JournalofMachineLearningResearch9,2579–2605. Ververidis,D.,Kotropoulos,C.,Pitas,I.,2004.Automaticemotionalspeechclassification.In:Proc.IEEEInternationalConferenceonAcoustics,
Speech,andSignalProcessing(ICASSP’04),vol.1,pp.593–596.
Weiss,B.,Burkhardt,F.,2010,September.Voiceattributesaffectinglikabilityperception.In:Proc.ofInterspeech,Makuhari,Japan. Witten,I.H.,Frank,E.,2005.DataMining:PracticalMachineLearningToolsandTechniques.MorganKaufmann.
Wolf,J.J.,1972.Efficientacousticparametersforspeakerrecognition.JournaloftheAmericanStatisticalAssociation51,2044–2056.
Yabuoka,H.,Nakayama,T.,Kitabayashi,Y.,Asakawa,Y.,2000.Investigationsofindependenceofdistortionscalesinobjectiveevaluationof synthesizedspeechquality.ElectronicsandCommunicationsinJapan(PartIII:FundamentalElectronicScience)83(5),14–22.
Yanushevskaya,I.,Gobl,C.,Chasaide,A.N.,2008.Voicequalityandloudnessinaffectperception.In:Proc.ofSpeechProsody.
Zhao,Z.,Morstatter,F.,Sharma,S.,Alelyani,S.,Anand,A.,Liu,H.,2009.Advancingfeatureselectionresearch–ASUfeatureselectionrepository. Tech.rep.