Collaborative Filtering and empirical study on the algorithm selection problem for Metalearning and Recommender Systems: A literature review Information Sciences

(1)

ContentslistsavailableatScienceDirect

Information Sciences

journalhomepage:www.elsevier.com/locate/ins

Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering

Tiago Cunha

^a^,^∗

, Carlos Soares

^a

, André C.P.L.F. de Carvalho

^b

aFaculdade de Engenharia da Universidade do Porto, Portugal

bICMC, Universidade de São Paulo - São Carlos, Brasil

a rt i c l e i n f o

Article history:

Received 15 March 2017 Revised 4 September 2017 Accepted 18 September 2017 Available online 20 September 2017 Keywords:

Metalearning Algorithm selection Recommendation system Collaborative Filtering

a b s t ra c t

TheproblemofinformationoverloadmotivatedtheappearanceofRecommenderSystems.

Fromthe severalopenproblemsin thisarea,thedecision ofwhichis thebestrecom- mendationalgorithmforaspeciﬁcproblemisoneofthemostimportantandlessstudied.

Thecurrenttrendtosolvethisproblemistheexperimentalevaluationofseveralrecom- mendationalgorithmsinahandfulofdatasets.However,thesestudies requireanexten- siveamountofcomputationalresources,particularlyprocessingtime.Toavoidthesedraw- backs,researchershaveinvestigatedtheuseofMetalearningtoselectthebestrecommen- dationalgorithmsindifferentscopes. Suchstudiesallowtounderstandtherelationships betweendatacharacteristicsandtherelativeperformanceofrecommendationalgorithms, whichcanbeusedtoselect thebestalgorithm(s) foranew problem.Thecontributions ofthisstudyaretwo-fold:1)toidentifyanddiscussthekeyconceptsofalgorithmselec- tionforrecommendationalgorithmsviaasystematicliteraturereviewand2)toperform anexperimentalstudyontheMetalearningapproachesreviewedinordertoidentifythe mostpromisingconceptsforautomaticselectionofrecommendationalgorithms.

1. Introduction

Recommender Systems (RSs) were proposed to deal withthe problem of information overload[5]. The basic idea is to embody a system withthe ability to recommend to a speciﬁc user the most relevant itemsamong a set of existing options.Thesuccessofthesesystemsisclearinboth academiaandindustry [25].However,therecommendationproblem isnot withoutﬂaws. One ofthecurrentchallengesis thelackofguidance regardingwhichRSalgorithm wouldbemore adequateforagiventask,and,moreimportantly,why.Currentstrategiestodealwiththisprobleminvolvetheevaluationof everyavailable algorithmforeachproblemtodecidewhichoneismoreadequate.Thisisanexpensiveapproach,notonly regardingtime,butalsoconsideringhumanandcomputationalresources.Anautomatic alternativeforalgorithmselection thatovercomesuchdrawbacksisMetalearning(MtL).

MtL [6] uses Machine Learning (ML) techniques to obtain predictive models, which associate characteristics of tasks to the respectiveperformance ofalgorithms. The methodology initially involves extractingcharacteristics froma dataset, named metafeatures,andassessing the relative performance ofa group ofalgorithms, whichwill be usedasmetalabels.

∗ Corresponding author.

E-mail addresses: [email protected] , [email protected] (T. Cunha), [email protected] (C. Soares),[email protected] (A.C.P.L.F. de Carvalho).

https://doi.org/10.1016/j.ins.2017.09.050

(2)

Afterwards,this information forseveral datasets isused to induce a predictive modelable to represent the relationship betweenthemetafeaturesandmetalabelsofthesedatasets.Ifthismodelhasahighpredictiveperformance,itwouldwill beabletopredictthemostpromisingalgorithmwithouttheoverheadofrunningafull-ﬂedgedempiricalevaluationandto explainwhyRSalgorithmsperformbest/worst[41].

Thiswork focusesﬁrstonproviding asystematicliterature review onthe problemofalgorithm selectionforRSs. The relatedworkisreviewedonthekeydimensionsrequiredtosolvethealgorithmselectionproblem.Ineachdimension,the extentof theresearch conductedso faris discussedand linesofresearch for futurework aresuggested. Afterwards, the mostsuitable relatedwork strategiesforalgorithm selection forCollaborative Filtering(CF) are experimentallyevaluated.

Theexperiments allowto comparethe effectsof thedifferentmetafeatures on thesame experimentalsetup. Alarge ex- perimentalstudy ofbaselevel datasets,algorithms andevaluationmetrics isconducted. Afterwards,each MtL strategy is implementedandevaluated onthesamemetalevelsetup,i.e.samemetatarget,meta-algorithmsandevaluationmeasures.

Finally,conclusionsaredrawnfromthebehaviorofthecurrentstateoftheartapproacheswithregardstoseveralaspects ofthealgorithmselectionproblem.

Thisdocumentisorganizedasfollows:Section2presentsabriefreviewonRS,focusingonCFandtheevaluationproto- col.Section3presentsthemethodologyofalgorithmselectionusingMtL.Next,Section4presentstheprocessforsystematic reviewontherelatedworkaboutalgorithmselectionforRSs.Section5 isresponsible topresentthe empiricalstudyper- formed on a subset ofthe Metalearning approaches discussed previously.Section 6 highlights the ﬁnal conclusionsand pointsoutfutureworkdirections.

2. RecommenderSystems

Thissection presentsthebasicaspects ofRSs,witha focusonCF recommendations.It alsodescribeshowthe perfor- manceRSsareusuallyevaluated.

2.1. Overview

SeveralRSs havebeen proposed inthe last decades. One ofthe main aspects that differentiates thesesystems is the followedapproach.Thesestrategiescanberoughly dividedintoeightapproaches:CollaborativeFiltering(CF)[37],Content basedFiltering[4],SocialbasedFiltering[23],KnowledgebasedFiltering[4],HybridFiltering[7],Context-awareFiltering[1], DeepLearning-basedRecommendations[20,44] andGroup Recommendations(GR)[30].Sincethemajorityofthefocusof the MtLstudies lieson CF, only this strategy will be further discussed in thispaper. Further information regarding the remainingstrategiesisavailableelsewhere[1,5,46].

2.2.CollaborativeFiltering

CFrecommendationsarebasedonthepremisethatauserwillliketheitemsfavoredbyasimilaruser,whichisanuser withsimilarpreferences.Itusesthe(implicitorexplicit)feedbackfromeachindividualusertorecommendeditemsamong similarusers[46].Explicitfeedbackisanumericalvalue,proportionaltotheuser’slikelinesstowardsanitem.Implicitfeed- back,ontheotherhand,derivesanumericalvaluefromtheuser’sinteractionswiththeitem(click-throughdata,like/dislike actions,timespentona task,etc.).ThedatastructureusedinCFisknownastheratingmatrixR.ItisdescribedasR^U^×^I, representingasetofusersU,whereu∈

{

¹,...,N

}

ândâ^setôfîtemsÎ,^whereⁱ∈

{

¹,...,M

}

^.^Each^element^Ruiistheuseru feedbacktoitemi.Fig.1presentssuchmatrix.

TheCFalgorithmsareorganizedintotwomajorclasses:memory-basedandmodel-based[5].Memory-basedalgorithms applyheuristicstoa ratingmatrixtocomputerecommendations,whereasmodel-basedalgorithmsinduceamodelfroma ratingmatrixandusethemodeltorecommenditems.Memory-basedalgorithmsaremostlyrepresentedbyNearestNeigh- bor(NN)strategies,whilemodel-basedalgorithmsworkwithMatrixFactorization(MF).Amemory-basedalgorithmfollows 3steps:1)calculatethesimilaritybetweenusersoritems,2)createaneighborhoodofsimilarusersoritemsand3)generate recommendationsbysortingsimilaritems.MFassumesthat theoriginal ratingmatrixvaluescanbeapproximatedby the multiplicationofmatriceswithlatentfeatures thatcapturetheunderlyingdatapatterns.SeveralMFalgorithmshavebeen successfullyusedinCF,namely,SVD++,SGDandALS.Furtherinformationregardingthistopicisavailableelsewhere[21]. 2.3.Evaluation

TheevaluationofRSscanbeperformedbyoﬄineandonlineprocedures.Oﬄineevaluationsplitsthedatasetintotrain- ingandtestingsubsets(usingsamplingstrategies,suchask-foldcross-validation[18])andassessestheperformanceofthe trainedmodelonthetestingdataset(seeFig.2). However,inordertocomparethepredictedandoriginalvalues,thereis animportant differencefromconventional k-foldcrossvalidation: thetest datamustbe splitinto hiddenandobservable instances.

Anotherimportantissueliesinthescopesfortheevaluationmetrics[25]:

• forratingaccuracy,errormetricssuchasMeanAbsoluteError(MAE)orRootMeanSquaredError(RMSE)mustbeused;

(3)

Fig. 1. Rating matrix.

Fig. 2. Diagram of oﬄine evaluation in RS [8] .

• forclassiﬁcationaccuracy,onemustusePrecision/RecallmeasuresorAreaUndertheCurve(AUC);

• forranking accuracy, the typical metrics are Normalized Discounted CumulativeGain (NDCG), Mean ReciprocalRank (MRR)andExpectedReciprocalRank(ERR)[22,24]

Theonlineevaluationarosefromthenecessityofassessingother aspectsbeyondthoseaccessibleoﬄine,inwhichthe actual input of a real usercannot be reproduced in a better way [18]. The most popular metric is the user acceptance ratio,whichrefersto theamountofitemstheuserliked dividedbythetotal amountofitemsrecommendedtotheuser.

Additionalonlineevaluationmetricsaredescribedelsewhere[43].

(4)

Fig. 3. Training process of a Metalearning process [6] .

3. Metalearning

MtLis concerned with discovering patterns indata and understanding the effecton the behavior of algorithms [41]. Ithas beenextensivelyused foralgorithm selection[32,35,39].The algorithm selection taskcanbe viewed asalearning problemitself,bycastingitasapredictivetask.Forsuch,itusesametadataset,whereeachmeta-examplecorrespondstoa dataset.Foreachmeta-example, thepredictivefeaturesarecharacteristics(metafeatures)extractedfromthecorresponding datasetandthetargetsaretheperformance(metalabels)ofasetofalgorithmswhentheyareappliedtothedataset[6].

A MtL process addresses the algorithm selection problem similarly to a traditional learning process. It induces a metamodel, which can be seen as the metaknowledge able to explain why certain algorithms work better than oth- ersin datasets withspeciﬁc characteristics. The metamodelcan also be used to predict the best algorithm(s) fora new dataset/problem[38].Fig.3presentsthetrainingstageoftheMtLprocess.

One of the main challenges in MtL is to deﬁne which are the metafeatures that effectively describe how strongly a datasetmatchesthebiasofeachalgorithm[6].TheMtLliteraturedividesthemetafeaturesintothreemaingroups[38,41]: statisticaland/orinformation-theoretical,model-basedandlandmarkers.

Statisticaland/orinformation-theoreticalmetafeatures describethedatasetcharacteristicsusingasetofmeasuresfrom statisticsandinformationtheory.Thesemetafeaturesassumethattherearepatternsinthedatasetwhichcanberelatedto themostsuitable algorithmsforthesedatasets. Examplesincludesimplemeasures, suchasthenumber ofexamplesand featuresin thedataset,tomore advancedmeasures, suchasentropy, skewness andkurtosis offeatures andevenmutual informationandcorrelationbetweenfeatures[6].

Model-basedcharacteristicsarepropertiesextractedfrommodelsinducedfromthedataset.Inaclassiﬁcationorregres- sionMtLscenario,theyrefer,forinstance,tothenumberofleafnodesinadecisiontree[6].Therationaleisthatthereis arelationshipbetweenmodelcharacteristicsandalgorithmperformance whichcannotbecapturedfromthedatadirectly.

Forinstance,adecisiontreewithanunusuallylargenumberofleafnodesmayindicateoverﬁtting.

Finally,landmarkers are fast estimatesof the algorithmperformance on thedataset. There are two differenttypes of landmarkers:thoseobtainedfromtheapplicationoffastandsimplealgorithmsoncompletedatasets(e.g.adecisionstump can be regarded asa simpliﬁed version of a decisiontree) and thosewhich are achieved by usingcomplete models for samplesofdatasets,alsoknownassubsamplinglandmarkers[6](e.g.applyingthefulldecisiontreeonasample).

4. AlgorithmselectionforRecommenderSystems

TheproblemofalgorithmselectionwasﬁrstconceptualizedbyRice[33],whichdeﬁnedthefollowingsearchspaces:

• theproblemspaceP,representingthesetofinstancesofaproblem;

• the feature space F, containingmeasurable characteristics of the instances generated by a feature extraction process appliedtoaninstanceofP;

(5)

• thealgorithmspaceA,asthesetofallsuitablealgorithmsfortheproblem;

• theperformancespaceY,whichrepresentsthemappingofeachalgorithmtoasetofperformancemeasures;

Consideringthisframework,theproblemofalgorithmselectioncanbestatedas:“foragivenprobleminstancex∈P,with featuresf(x)∈F,ﬁndtheselectionmapping S(f(x))intothealgorithmspaceA, suchthattheselectedalgorithm

α

∈A maximizes theperformancemappingy(

α

⁽^x⁾⁾∈Y.”[33].

Inthe scope ofthiswork, theproblemspaceP isa set ofrecommendationdatasets,the featurespaceF iscomposed by the metafeatures used to describe the datasets, the algorithm space A contains all the available algorithms and the performancespaceYisdescribedbyasetofsuitableevaluationmeasures.

4.1. Methodology

Inorderto performasystematicreview onalgorithmselection forRSs,severalonlinedatabases wereconsulted:Else- vier’s Scopus,ThomsonReuters’sWebofScience,IEEE XploreDigital Library,Google ScholarandACMDigital Library.The keywordscombinedthekeywords:“RecommenderSystem”,“metalearning”,“algorithmselection”and“performanceprediction”.

ThisdocumentreviewsonlyapproachesthatrelatetheperformanceofRSalgorithmswithdatacharacteristics.

4.2. Researchquestions

ConsideringtheMtLworkﬂow,themostimportantdimensionsofthealgorithmselectionproblemforRSwereidentiﬁed.

Thedimensionsaresplitintoofbase-andmetalevelsandaretranslatedintoResearchQuestions(RQ):

• RQ1:WhatisthecoverageofrecommendationstrategiesinMtLstudies?

• RQ2:Arethepublicdatasetsusedrepresentativeandenough?

• RQ3:Isthepoolofrecommendationalgorithmssuitableandcomplete?

• RQ4:HowwellaretheRSsperformancemeasurescovered?

• RQ5:Arethemetafeaturesuseddiversiﬁedinnatureandenough?

• RQ6:Inthestudies,whatisthetypicalmetatarget?

• RQ7:Whichalgorithmsareemployedinthemetalevel?

• RQ8:Aretheevaluationmeasuresusedinthemetalevelsuitable?

4.3. Relatedwork

TheliteratureprovidesafewapproachesfortheRSalgorithmselectionproblem.TheﬁrstworkstudiestheCFalgorithm selection problem by mappingthe data onto a graphinstead of a ratingmatrix [19].Graph-dependent metafeatures are derivedtoselectamongNNalgorithms.Theselectionprocesstakesadvantageofadomain-dependentrules-basedmodel.

Other papers analyzed the ratingmatrix usingstatistical and/or information-theoretical metafeatures to selectamong NNandMFalgorithms[2,13].Theproblemswereaddressedasregressiontasks,wherethegoalwastooptimizetheRMSE performance.

An alternative approach, using a decision tree regression model, was developed later [16]. It studied the underlying relationshipbetweenuserratingsandneighborhooddataandtheexpectederroroftherecommendationsofaNNalgorithm.

Thisapproachfocusedondescribingmetafeaturesforusers,insteadoftheentiredataset.

Anotherproposallookedtotheco-ratings insteadoftheoriginalratings [26].Thisextendedtheoriginal dataavailable intheratingmatrixwithrelationshipsamongusers.ItusesaregressionmodeltopredicttheRMSEperformanceofNNand MFalgorithms.

ThealgorithmselectionproblemwasextendedbeyondCF,whenaGRapproachwasproposed[49].Itusedseveralclas- siﬁcation algorithms torankvoteaggregationalgorithms. Forsuch,it derivesdomain-dependent metafeaturesandrelates themwiththeperformanceofthealgorithmsintermsoferror.

ThemostcomprehensiveMtLstudyforCFalgorithmselectionstudiestheselectionofMFalgorithmsusinganextensive experimentalsetup[10].Itconsidersmultiplealgorithmsandevaluationmetricsatboththebase-andmetalevels.Further- more,itproposeda setofsystematically generatedmetafeatures whichextendstheunionofalmost alltheonesavailable inthepreviousliterature.

Therelatedworkfortheliteraturereview ispresentedinTable1.EachworkisdescribedintermsoftheRQidentiﬁed earlier.SomeRQsaresub-divided,usingthenotation:

• Thebaselevelalgorithms(RQ3)areorganizedbytype-Heuristics(H),NearestNeighbors(NN),MatrixFactorization(MF) andothers(O).

• Thebase-evaluationmeasures(RQ4)areorganizedintoerrorbased(E),classiﬁcationaccuracy(CA)andrankingaccuracy (RA).

• Themetafeatures(RQ5)aredividedaccordingtothesubjectevaluated:user(U),item(I),ratings(RT),datastructure(S) andothers(O).

• Themetatargets(RQ6)are:1)bestalgorithm(BA),2)rankingofalgorithms(RA)and3)performanceestimation(PE).

• Themetalevelalgorithms(RQ7)useclassiﬁcation(C),regression(RG)orother(O)typesofalgorithms.

(6)

Table 1

Summary table on the algorithm selection problem for RS. Each column refers to the RQ identiﬁed in this work (see Section 4.2 ): RQ1 (recommendation strategies), RQ2 (baselevel datasets), RQ3 (baselevel algorithms), RQ4 (baselevel evaluation measures), RQ5 (metafeatures), RQ6 (metatarget), RQ7 (meta-algorithms) and RQ8 (metalevel evaluation). RQ 3, 4, 5, 6 and 7 are divided in several categories (see Section 4.3 ).

Baselevel Metalevel

Ref. RQ1 RQ2 RQ3 RQ4 RQ5 RQ6 RQ7 RQ8

H NN MF O E CA RA U I RT S O C RG O

[19] CF 3 2 2 – – – 3 1 – – – 2 2 BA – – 1 AUC

[2] CF 4 – 2 1 – 1 – – 1 1 1 3 – PE – 1 – Correlation

[13] CF 1 1 2 1 1 1 – – 3 – – – – PE – 1 – Correlation

[16] CF 3 – 1 – – 1 – – 11 – – – – PE – 1 – MAE

[26] CF 4 – 1 1 – 1 – – – – – 3 – PE – 1 – Correlation

[49] GR 4 11 – – – 1 – – 5 – – – – RA 4 – – MRR

[10] CF 32 4 – 13 – 3 1 3 30 30 10 4 – BA 10 – – Accuracy

4.4.Discussion

ThissectionanalysestheresultspresentedinTable1regardingtheseveralaspectsidentiﬁedintheRQ(seeSection4.2).

4.4.1. Recommendationstrategies

Sincethisresearchareaisstillintheearlystages(allworkswerepublishedinthelast6years),itisexpectedthatonly afewRSstrategieswouldhavebeenstudied.Infact,likeintheRSresearch area,themajorityoftheresearcheshasbeen performedonCF. This isjustiﬁed by the lackof publicframeworks beyondthisrecommendationstrategy. Theexception isarecentstudyonthealgorithmselectionproblemforGroupRecommendation(GR)[49].Therefore,itisessentialto 1) expandthescopeofRSstrategiesstudiedand2)performadeeperanalysisofthealgorithmselectionproblemforCF.

4.4.2. Datasets

Most related work use at most 4 datasets to investigate algorithm selection, with the exception of the most recent work[10].While onsome cases thismaybe acceptableif theproblemisappropriately modeled (for instance,selectthe bestalgorithmforeachuserinsteadofpereach dataset[16]),thissmallnumberisusually adrawback. Infact,algorithm selectionstrategiesmustusealargeamountofdiversedatasetstoensureaproperexplorationoftheproblemspaceP.This dimensionmustbeimproved,althoughtherearefewpublicdatasets.

Thepublic datasetsused intherelatedwork are: AmazonReviews[27],BookCrossing[50],Epinions[34], Flixter[48], Jester[15],LastFM[3],MovieLens[17],MovieTweetings[12],Netﬂix[29],Tripadvisor[42]andYahoo!Music[45].Onlytwo worksuseprivatedata[19,49],whichareexcludedfromfurtheranalysis.

The results show a large variation in how frequent each dataset is used in these works. The most common dataset belongsto the MovieLenscategory (used 5times out of7). This followsthe trend inthe RS research area, wherethese datasetsareconsideredbenchmarks. Thesecond choicesfellontheFlixter,JesterandNetflixdatasets.Whilethefirsttwo areusedinabouthalfoftheworks,theNetflixdatasetismorefrequentandshouldbepresentinmorestudies.Itbecame popularityafteraworld-wide competition thatfinished in2009 [21].However, themaindifficulty lieswiththe factthat thedatasetisnolongeravailablefordownload.Theremainingdatasetsappearonlyonce.

Althoughit isimportant to analyzethe relatedwork in termsof amountandsource of thedatasets, it iseven more importanttoreviewsomeoftheircharacteristics.Theinspectionofthedatasetscharacteristicsillustratesthefollowingas- pects:1)thenumberofusersanditemsisrathersmallwhencomparedtothenumberofratings,2)thesparsityvaluesare usuallygreaterthan0.9and3)themostcommonratingscalerangesfrom1to5.Overall,thedatasetscoverawiderange intermsofscaleandstructureoftheratingmatrixinCF.Thisshowsthatthecollectionofpublicdatasetsisrepresentative, althoughnotexhaustive.Severalpointsneedtobeimproved,includingtheanalysisofdifferentlevelsofsparsityandun- derstandingtheeffectsontheperformanceandtoextendthetypesofscalesused.Theseimprovementscanbeobtainedby creatingartiﬁcialdatasets.Thiscanbeachievedbyapplyingpermutationstotheoriginaldatasets(maintainingsomeglobal characteristics),bycreatingentirelynewdatarecurringtowell-knowndistributions,by addingnoise tothedatasetsorby creatingartiﬁcialdatasetsusingsimulation[9].

4.4.3. Baselevelalgorithms

The baselevelalgorithms frequently used are distributedin the following categories: Heuristics (18), MF (16),NN (8) andOthers(1).The fact that MF andNNapproaches areabundant isexpected, since they are themost commonlyused inCF.The largeamountofheuristicsrefers mostlytotheGRapproach,whichstudies 11algorithmsofthisnature.In CF, heuristicsaretypicallyassociatedwithnaiveapproaches,suchasrandomandmostpopularalgorithms.Itisalsoclearthat newerstudieschangethefocusfromNNalgorithmstoMF.Thisisexpected,sinceMFisnowthestandardinCF.

(7)

Themostfrequentlyusedalgorithmsareuser-basedNNandSVD++,closelyfollowedbyitem-basedNN.Thisrepresents themostbasicapproachesforCFandarethereforeavailableinalargeramountofrecommendationplatforms.Newerap- proaches,suchasMF(besidesSVD++),areusuallymorediﬃculttoﬁndinrecommendationplatforms.Thisisanimportant limitation,givenMF’srelevance.Averagesandmostpopularalgorithmsaremorecommonthanarandomapproach.Thisis expected,sincetheyarebetterbaselines[21].

AlthoughthealgorithmsusedaresuitableforCF,theauthorsdonotknowofanystudythattakesinaccountacomplete anddiverse pool ofalgorithms. Even themost recentandcomplete studyfails inpursuing NNalgorithms [10]. Inorder to properly explore the algorithm space A for CF, it is important to evaluate the new algorithms. Nowadays, there is a large number of resources to conduct experimental evaluations, speciallyfor CF. However, noneof them deals withthe algorithm selection problem. A platformfor modelmanagement that could integrate severalRS algorithms anddatasets, whileproviding faithfulandfairevaluationwouldbe thebestsolution tothisproblem. However, theclosest systemsare frameworksdevotedsolelytotheevaluationofRSs[36].

4.4.4. Baselevelevaluation

Table1showsthatmostevaluationmeasuresare errorbased(usedin6out of7works).Ontheother hand,accuracy measures (either classification orrankingbased) are only usedin2 works.The mostfrequentlyused errormeasures are RMSEandMAE,classificationaccuracyevaluated throughprecision,recallandAUCandrankingaccuracyassessedbyMRR and NDCG. The evaluation procedures usually assess only one aspect of the recommendation process, contradicting the guidelinesfromtheRSliterature[18].Infact,only2worksexpandedtheevaluationscope[10,19].Thisclearlydemonstrates theincompletenessoftheevaluationintherelatedwork.Furtherinvestigationsarerequiredtoimprovetheexplorationof theperformance spaceY,includingincrease thescopeofoffline evaluationmeasures andperformthesamestudiesusing onlineratherthanofflineevaluationprocedures.Thefirstcanbeachievedbyusingrecent,yetnotfullyvalidated,measures.

These try to assess non standard dimensions of the RS problem such as novelty, satisfaction and diversity. Also, online evaluationmeasurescouldbeusedtocomparetheactualfeedbackwiththepredictions,sincetheyareclaimedtobebetter.

4.4.5. Metafeatures

The metafeaturesused intherelatedworksare allstatisticalandinformation-theoreticalmeasures. Theycanbe orga- nizedintoseveralcategories:user(50),item(31),ratingsdistribution(11),datastructure(12)andothers(2).Mostofthem focusesonusers.Infact,someresearchusesonlymetafeaturesofthisdimensionoftheproblem[13,16,49].Characteristics relatedtothedatastructurearealsocommonandareavailable in4out of7works.Oneworkusedallpreviousmetafea- turesinanuserco-occurrencematrix[26].

The numberofmetafeatures usedusually ranges from3to11, withonlyone exception that increasesthisnumberto 74 [10].However, thisis notnecessarily an advantage.Asmaller numberavoidsthe curse ofdimensionalityandismore suitable whenasmallnumberofbaselinealgorithms areevaluated.However, assumingtheexistenceofmorerecommen- dationalgorithms,furtherandmorediversemetafeaturesshouldbestudiedtoproperlyexplorethefeaturespaceF.

The analysis of metafeatures in a deeper level allows us to understand which type of statistical and information- theoretical measureswere usedin therelatedwork.The functionsusedare mostlyratios, averagesandsums.Thisis expected, since they are thesimplest metafeatures found inthe MtLliterature. Entropy,Gini indexand standard deviation appearinthesecondposition(in3works).Allotherfunctionsappearonlyonce.

Despitethefact that diversemetafeaturesare proposed, fewstudies looktowards differentaspects oftheproblem. In fact, 4 studies focus their metafeatures on a singlesubject, which typically is the user.Plus, there are few examples of metafeaturesthatlooktowardsrelationshipsbetweenthedifferentsubjectsoftheproblem.Thismakesdifficulttofindcom- plexpatternsinthedata,restrictingthemetaknowledgeextracted.Recentworkshavesuccessfullyextendedthenatureand amountofmetafeatures,improvingdatasetcharacterizationforCFalgorithmselection.Theseextensionsinclude:1)propose andadaptnewmetafeaturesforotherRSstrategies;2)proposeproblem-specific(andeventuallydomain-specific)metafeatures and3) studynewmetafeatures besidesstatisticaland/or information-theoretical, such asforinstance,landmarkers.

Thislistisbynomeansexhaustive,sinceitisdiﬃculttoanticipatethenecessitiestoproperlydescribethedatasetsofother RSproblemsbesidesCF.Itisespeciallydiﬃcultwhenitcomestoguaranteetheirinformativepower.

4.4.6. Metatarget

RelatedworkonCFalgorithmselectionhasadoptedseveralmetatargets.Themostcommonistheperformanceestima- tion (PE),available in4 out of7works.There are alsotwo examplesof bestalgorithm (BA)andranking(RA) selections.

AlthoughrecentworkshavelookedbeyondPE[10,49],realapplicationscanbeneﬁtfromtheuseofBAorRAduetointer- pretabilityissues.

4.4.7. Metalevelalgorithms

Usually,onlyonealgorithmisusedinthemetalevel.Thisalgorithmmustmatchtherequiredmetatarget.Ascanbeseen inTable1,whenPEisusedasmetatarget,regressionalgorithmsareused(mainlylinearregressionalgorithms).ForBAand RA,popularclassiﬁcationalgorithms,likerulebasedclassiﬁers,NaiveBayes,SVMandkNN,areused.Anexceptionhappens whenacustomprocedurebasedonrules isused[19].Althoughthenumberofalgorithmsusedinthemetaleveldoesnot havethesameimpactasthenumberusedinthebaselevel,theuseofalargerandmorediversesetofalgorithmsinthe

(8)

metalevelincreasesthechancesofuncoverhiddenrelationshipsinthemetadataset.Onlytwostudiesusingmorethanone algorithminthemetalevel[10,49].ArelevantfutureworkistheapplicationofRSonboththebase-andmetalevel.Although thistopichas notreceived anyattentionso farfortheselection ofrecommendationalgorithms,it hasbeensuccessfulin other domains. Such worksare importantto understand whetherMtLapproaches are indeedthe bestway totackle the algorithmselectionproblem.

4.4.8. Metalevelevaluation

Thelast RQ focusesonthe evaluationmeasures usedinthe metalevel.Onceagain, thesemust bein conformitywith the metatarget and meta-algorithm. This is noted by the usage of error based measures or correlation assessments for PE[2,13,16,26],classiﬁcationaccuracy measuresforBA[10,19]andrankingaccuracymeasures [49].Onecan concludethat allrelatedworkperformsvalidationforthealgorithmselectiontask,whileusingsuitablemeasures.

4.5.Summary

AfterreviewinginextenteachkeytopicofthealgorithmselectionproblemforCF,wepresentthekeyconclusionsfound:

• WiththeexceptionofonestudyonGroupRecommendation,onlyCFhasbeenstudiedonthealgorithmselectionprob- lem.

• Thepublicdatasetsarerepresentative(althoughnotexhaustive)andenoughinrecentworks.However,previousstudies useonlyfewdatasets.

• Thepoolofrecommendationalgorithmsstudiedinalgorithmselectionstudiesisalwayssuitable,butnevercomplete.

• MostapproachesevaluateCFwithasinglemeasure.However,recentstudiesareshiftingtheparadigmtowardscombin- ingevaluationmeasures.

• Althoughadiversesetofmetafeaturesisavailable,therelatedworkstypicallyusefewmetafeatures.

• Althoughthetypicalmetatargetisperformanceestimation,thetrendhasshiftedtowardschoosingthebestalgorithm.

• Despitemoststudiesworkingwithregressionalgorithmstobuildmetamodels,thechangeinmetatargethasfueledthe shiftinmeta-algorithms.

• Suitablemetalevelevaluationmeasureswereusedinallrelatedworks.

5. Empiricalstudy

Asseenintheliteraturereview,mostrelatedworkusesmallscaleexperimentalsetups,whichpreventtheextractionof genericconclusionsforRS.Furthermore,thedatasetsandalgorithmsusedaredifferent,makingitdiﬃculttoquantitatively assessthemeritsofeachapproach.ThemaingoalofthisempiricalstudyistoreplicatetheMtLapproachesusedbyrelated worksandtocomparetheir performanceson thesameexperimental setup. Thus,thebaselevel experimentsareconstant forallapproaches, andso are themeta-algorithms,metatarget andmetalevelevaluationmeasures. The onlydifference is thesetofmetafeaturesemployedbyeachindependentalgorithmselectionapproach.Moreformally,thesearchspacesP,A andYareﬁxed,whilespaceFvaries.

5.1. Relatedwork

In order to choose the strategies which will be replicated in this experimental study, certain requirements must be establishedtoensureafairevaluation:

• Therecommendationstrategymustbethesame:thisstudywilldevoteitsattentiontoCF,sinceitisthemostpopular.

ThisﬁlterstheapproachforGR[49],sincethemetafeaturescannotbereproducedfortheCFdomain.

• Themetafeaturesmustbespeciﬁctothedatasetandnottousers:thegoalhereistocomparestrategiesthatselectthe bestalgorithmforawholedataset.Therefore,studieswhichdevotesattentiontotheselectionofalgorithmsperCFuser should be ruledout [13,16].However, studieswithadaptations, by assuming averagevaluesfortheentiredataset, are kept.Forthesemetafeatures,samplesofusersareused,whosesizeissetasthemaximumbetweenthetotalnumberof usersand1000.

• The metafeatures mustreﬂect thecharacteristics ofeitherimplicit orexplicitfeedback datasets:since the majorityof approaches aredesignedforexplicit feedback,the othersare ﬁltered[19].Thisis requiredbecausethe comparisonof metafeaturesfordifferentratingscaleswouldyieldunfairandunreliableresults.

Inordertoformalizethestrategiespresented,thisstudytakesadvantageofasystematicmetafeaturegenerationframe- work[31].Theframeworkrequiresthreemainelements:objecto,functionfandpost-functionpf.Theprocess appliesthe functiontotheobjectand, afterwards,thepost-functionstothatoutcomeinordertogenerateasinglemetafeaturevalue.

Thus,thisframeworkcanberepresentedusingthefollowingnotation:{o.f.pf}.Next,thedifferentmetafeaturesstrategiesare presented,whileformalizedinthisframework.

(9)

Fig. 4. Experimental procedure. This is an adaptation of Fig. 3 to the experiments carried out in this work.

5.1.1. StrategyA

This strategy generates metafeatures using the systematic framework [10]. It considers the objects o dataset, user and item and applies several functions f to characterize these objects: original ratings, sums of ratings, counts of ratings and average ratings. The post-functions pf used are maximum, minimum, mean, standard deviation, me- dian, mode, entropy, Gini index, skewness and kurtosis. Details regarding the calculation of these functions are available elsewhere [28,40]. Additionally, this strategy includes 4 simple statistics: number of users, items, ratings and matrix sparsity. This results in 74 metafeatures, which were reduced by correlation feature selection, end- ing up with: dataset.ratings.kurtosis, dataset.ratings.standard_deviation, dataset.nusers.∅, dataset.sparsity.∅, item.count.kurtosis, item.count.minimum, item.mean.entropy, item.sum.maximum, user.mean.minimum, user.sum.kurtosis, user.mean.skew and user.sum.entropy.

5.1.2. StrategyB

Thisstrategy[26] createsanauxiliary datastructurefromwhichthemetafeaturesare extracted.Thisstructure,known asco-ratingsmatrix,iscreatedby1)randomsampleofusersintheoriginalratingsmatrix,2)assigninguserstodifferent equivalence classes(EC)dependingon thenumberof ratingsassignedand3) creatinga co-ratings matrixthat compares eachpairofequivalenceclassesbytheaveragenumberofcommonco-rateditems.Themetafeaturesare:dataset.sparsity.∅, EC.co-ratings.giniandEC.co-ratings.entropy.

5.1.3. StrategyC

This strategy extracts the following metafeatures directly from the dataset [2]: dataset.density.∅, item.count.gini, item.count.skewness,user.count.gini,user.count.skewnessanddataset.ratings.variance.

5.1.4. StrategyD

Oneofthestrategiesthatrequiredadaptationforthisexperimentalstudylooksattheproblemastheselectionofalgo- rithmsperuser[16].Theadaptationsinvolvedaveragingvaluesofmetafeaturesamongallusersinthedatasetandworking withsamplesofusersformetafeatureswithhighcomputationalrequirements.Thecompletelistofmetafeaturesusedinthe strategy is:user.mean.mean, user.count.mean,item.count.mean,user.standard_dev.mean, item.mean.mean,user.neighbours.mean, user.average_similarity.mean,user.clustering_coef.mean,user_coratings.jaccard.mean,user.TFIDF.meananditem.entropy.mean.

5.1.5. StrategyE

Theotherstrategythatapproachedthealgorithmselectiononuserlevelwasalsoadapted[13].However,duethesimpler nature ofthemetafeatures used,itwasrequiredonlyto averagethevalues acrossusers.The metafeaturesproduced are:

user.count.mean,user.mean.meananduser.variance.mean. 5.2. Experimentalprocedure

Wepresentnowtheexperimentalprocedureusedtocomparethemeritsoftheseveralmetafeaturesstrategiespresented earlier.Fig.4presentsthebase-andmetalevelsintermsofdata,algorithmandevaluationmeasures.Nextwepresenteach levelindetail.

5.2.1. Baselevel

ThebaselevelexperimentsrefertotheCFproblem,whereacollectionofdatasetsisevaluatedonapoolofsuitablealgo- rithms.The38datasetsusedaresplitupintoseveraldomains,namelyAmazonReviews[27],BookCrossing[50],Flixter[48],

(10)

Table 2

Summary description about the datasets used in the experimental study.

Dataset #users #items #ratings

Amazon Apps 132,391 24,366 264,233 Amazon Automotive 85,142 73,135 138,039

Amazon Baby 53,188 23,092 91,468

Amazon Beauty 121,027 76,253 202,719 Amazon CD 157,862 151,198 371,275 Amazon Clothes 311,726 267,503 574,029 Amazon Digital Music 47,824 47,313 83,863 Amazon Food 76,844 51,139 130,235 Amazon Games 82,676 24,600 133,726 Amazon Garden 71,480 34,004 99,111 Amazon Health 185,112 84,108 298,802 Amazon Home 251,162 123,878 425,764 Amazon Instant Video 42,692 8882 58,437 Amazon Instruments 33,922 22,964 50,394 Amazon Kindle 137,107 131,122 308,158

Amazon Movies 7278 1847 11,215

Amazon Oﬃce 90,932 39,229 124,095 Amazon Pet Supplies 74,099 33,852 123,236 Amazon Phones 226,105 91,289 345,285 Amazon Sports 199,052 127,620 326,941 Amazon Tools 121,248 73,742 192,015 Amazon Toys 134,291 94,594 225,670

Bookcrossing 7780 29,533 39,944

Flixter 14,761 22,040 812,930

Jester1 2498 100 181,560

Jester2 2350 100 169,783

Jester3 2493 96 61,770

Movielens 100k 94 1202 9759

Movielens 10m 6987 9814 1,017,159

Movielens 1m 604 3421 106,926

Movielens 20m 13,849 16,680 2,036,552 Movielens Latest 22,906 17,133 2,111,176 MovieTweetings latest 3702 7358 39,097 MovieTweetings RecSys2014 2491 4754 20,913

Tripadvisor 77,851 10,590 151,030

Yahoo! Movies 764 4078 22,135

Yahoo! Music 613 4620 30,852

Yelp 55,233 46,045 211,627

Jester[15],MovieLens[17],MovieTweetings[12],Tripadvisor[42],Yahoo!MusicandMovies[45]andYelp[47].Itshouldbe notedthatadomainmaycontainmultipledatasets.Table2presentsthedatasetsandsomeoftheirbasiccharacteristics.

ExperimentswerecarriedoutwithMyMediaLite[14].TwotypesofCFproblemswereaddressed:RatingPrediction(RP) andItemRecommendation(IR).WhileinRPthegoalistopredictthemissingratinganuserwouldassigntoanewinstance, inIR the goalis torecommend a list ofranked itemsmatching the user’spreferences.Since the problemsare different, soare thealgorithms andevaluationmeasures. The followingCF algorithms wereused inthiswork fortheRP problem:

MatrixFactorization(MF),BiasedMatrix Factorization(BMF),Latent FeatureLog LinearModel(LFLLM)), SVD++,3 variants ofSigmoidAsymmetricFactorModel(SIAFM,SUAFMandSCAFM),UserItemBaseline(UIB)andGlobalAverage(GA).Inthe caseofIR,thealgorithmsusedareBPRMF,WeightedBPRMF(WBPRMF),SoftMarginRankingMF(SMRMF),WRMFandMost Popular(MP).Nearestneighboralgorithmsareexcludedduetoincompatibilitywiththesizeofthedatasets.

InIR,thealgorithms areevaluated usingtheranking accuracymetricNormalized DiscountedCumulativeGain (NDCG) andtheclassiﬁcation accuracy metricArea Underthe Curve(AUC). Inthe caseofRP, thealgorithms are evaluated using theerrorbasedmeasuresRootMeanSquaredError(RMSE)andNormalizedMeanAbsoluteError(NMAE).Allperformances wereassessedusing10-foldcross-validation.FollowingthestandardapproachinMtLresearch,thealgorithmsweretrained usingthedefaultparameters suggestedinthe literatureortheimplementationused.Bynottuningtheparameters ofthe baselevelalgorithms,theoptimalperformanceofeachalgorithmisnotobtained. However,itpreventsusfrombiasing the performanceresultsinfavorofanyofthealgorithms,thusensuringafairassessment.

5.2.2. Metalevel

Themetalevelsetupincludesthemetafeatures,themetatargetandthemeasuresusedtoevaluatethemetamodels.The metafeatureextractionprocessinvolves applyingallstrategiesdiscussed inSection 5.1toallCF datasetslisted inTable2. Theoutcomeis5differentsetsofmetafeatures.

(11)

Fig. 5. Performance of the metafeature strategies on the AUC metatarget.

Fig. 6. Performance of the metafeature strategies on the NDCG metatarget.

Themetatargetisbuiltbyidentifyingthebestalgorithmforeachspeciﬁcdataset.Sinceweusedifferentbaselevelevalu- ationmeasures,thenitisexpectedthatdifferentalgorithmsareselectedasthebestchoiceforaspeciﬁcdatasetfordifferent measures.Thus, NDCGandAUCareusedtoselectthebestalgorithmsinIRtasks,whileinRPtasks,RMSEandNMAEare used.Thiscreates4differentmetatargets,whichwhencombinedwiththe5differentsetsofmetafeatures,leadstothetotal amountof20metadatasetstobeusedinthealgorithmselectionproblem.

Theseproblemsareaddressedasclassiﬁcationtasks.Wetested11algorithmsoneach ofthosemetadatasets,withdif- ferent biases:ctree, C4.5, C5.0,kNN, LDA,Naive Bayes,SVM (linear, polynomial and radial kernels), random forest anda baseline algorithm:majorityvote. Themajorityvotedoesnot takeinto accountanymetafeaturesandalways predictsthe classwhichappearsmoreoften.Sincethemetadatasetshaveareducednumberofexamples,theaccuracyofthemetalevel algorithmswasestimatedusingaleaveoneoutstrategy.

5.3. Results

5.3.1. Metalevelaccuracy

Themetalevelperformanceofthestrategiesregardingtheaccuracyofthemetamodelsineachmetatargetispresentedin Figs.5–8.Ineachﬁgure,allstrategiesarepresentedalongsideeachother,tofacilitatetheircomparison.Withineachﬁgure, theaccuracyofthebesttunedmeta-algorithmsarepresented.Therearealsoindicationsregardingthebaseline,i.e.majority voting. Thus, a modelwhose performance islower than the baseline, isnot considered informative.Two other reference valuesarealsopresented: themeanaccuracyofallalgorithms andthe meanaccuracy ofthetop3bestmeta-algorithms.

Theirpurposeistounderstandtherobustnessofthestrategiesontheaverageandbestcasescenarios,respectively.

(12)

Fig. 7. Performance of the metafeature strategies on the RMSE metatarget.

Fig. 8. Performance of the metafeature strategies on the NMAE metatarget.

RegardingFig.5,one canobservethatall strategiesexceptChaveaverageperformancesabovethebaseline. StrategyC hasapoorperformance, sinceonly oneof themeta-algorithms(knn) beatsthe baseline.Thisstrategy isalso responsible bytheworstperformance,obtainedwithSVMwithlinearkernel.StrategyAbarelybeatsthebaseline,sincetwoalgorithms scorebelowandsixhavethesameperformanceofthebaseline.Thisstrategyonlyworkswiththemodelsbuiltwithc4.5 andNaive Bayes. All remaining strategies have comparable performance, although strategy B seems slightly better. One importantaspectistheperformanceofthec4.5 algorithm,whichobtainsthehighestperformanceacross allstrategies.In termsoftheaverageperformanceofthetop3models,allstrategiesoutperformthebaseline.However, strategiesBandD havehigherperformance,butwithdifferentalgorithms.WhilethebestalgorithmsinBarec4.5,c5.0andNaiveBayes,inD theyarectree,randomforestandSVMwithpolynomialkernel.

Theperformance resultsfortheNDCG metatargetarepresentedinFig.6.Here,all strategiesbeatthebaseline bothin termsofaverageofallmodelsandaverageofthebest3models.Infact,thereareonlythreeoccasionsinwhichthebaseline isnotbeaten:forthealgorithmNaiveBayesinstrategiesB,DandE.Thestrategywiththebestperformanceinbothtypes ofaveragesisD.Ithasthe4bestmodelsfoundforthismetatarget(c4.5,c5.0,ctreeandrandomforest).Theworststrategy byaverageofallalgorithmsisE,althoughthisismainlyduetothelowperformanceoftheNaiveBayesalgorithm.Interms oftheaverageofthetop3algorithms,strategyAperformsslightlyworstthantheothers.However,theworstalgorithmin thisstrategyisthectree,whoseperformanceissimilartothebaseline.

Fig.7presentstheresultsfortheRMSEmetatarget.Intermsoftheaverageperformanceofallalgorithms,Aistheworst, sinceitistheonlytoscorebelowthebaseline.Thisisfueledbytheperformance ofthreealgorithms (c5.0,LDAandSVM withlinearkernel).ThebeststrategiesinthiscaseareB,CandD,withlowvariationontheperformanceresults,although allhaveone algorithmthat wasunabletobeatthebaseline:LDA.Intermsoftheperformanceofthetop3algorithms, all

(13)

Fig. 9. Impact of meta-algorithms on the baselevel performance.

strategiesyieldgoodresults,wellabovethebaseline.ThebestandsecondbeststrategiesinthiscaseareD(withalgorithms SVMwithlinearandpolynomialkernels)andB(achievinghighestperformancewithNaiveBayes),respectively.

Finally, the accuracy results for the NMAE metatarget are presented in Fig. 8. This metatarget has the worst results when considering theaverage accuracy ofall models per strategy: fourstrategies presentvaluesbelow the baseline and onlystrategyBhasaslightlyhigheraccuracy,whichmaynotbesigniﬁcant.Inthismetatarget,severalalgorithmsperform poorly, such asthe examples ofthe c5.0 forstrategy E andLDA in strategy A.On top ofthis, at most3 algorithms per strategyareabletobeatthebaseline.TheonlyalgorithmtobeatthebaselineinallstrategiesisthepolynomialSVM.When thecomparisonconsiderstheaverageperformance ofthetop3algorithms,thescenariochanges: allstrategiesyieldgood results.ThebestresultsareachievedwithstrategyEandalgorithmLDA.

5.3.2. Baselevelimpact

AnimportantaspecttoevaluateinMtLproblemsistheimpactofthemetamodelstrainedonthebaselevelperformance.

Forsuch,themetamodelsareevaluatedbytheperformanceofthebaselevelalgorithmpredictedasthebestineachdataset.

The values are then averaged to obtain a singlevalue that depictsthe overall performance ofthe metamodel across all datasets.Inthesecomparisons,twoelementsarecrucial:theperformanceprovidedbythebaselinemetamodel(i.e.majority voting),thelower bound,andthesocalledoracle,whichcontainstheperformanceoftheactualbestalgorithm,theupper bound.Thisanalysisisimportanttounderstandtherealvalue ofthemetamodelsandespecially sinceitismissinginthe originalworks.

(14)

Table 3

Computational time required for the extraction of each type of metafeature strategy.

Strategy Total time (seconds) Average time (seconds)

A 47.75 1.26

B 11957.95 314.68

C 4.64 0.12

D 35167.35 925.46

E 259.73 6.83

Fig.9presentstheresultsfromallstrategiesacrossallmetatargets.Ineach pairstrategy-metatarget,thebaselevelper- formancefromallmeta-algorithmsisshown.Thebaselineandoracleresultsarepresentedastheredandgreenhorizontal lines,respectively.Theresultswerepresentedinaformattofacilitatetheirreadability.Allvaluesareindependentlyscaled andtheresultsforRMSEandNMAEwereinvertedtouniformtheanalysisprocess.Thus,althoughthegoalistominimize thesemeasures, theinvertedscaleallows thefollowinganalysis:an algorithmisbetter thananotherifitsperformance is higher.

Theresults showthat metatarget NDCGhasthe largestnumberofmetamodels whoseperformance is betterthan the baseline.In fact, inall strategies,there isonly oneexample belowthe baseline.On theother extreme, RMSEandNMAE presenttheworstresults.Here,mostalgorithmsperformsimilartoorworsethanthebaseline.InthecaseofRMSE,thegap betweenthebestmetamodelsandtheoracleisthehighest.WhiletheresultsforNDCGandNMAEareexpected,theRMSE caseisasurprise.Itshowsthat althoughthemetalevelaccuracyforthisproblemisacceptableforasubset ofmodels,the impactontheperformanceoftherecommendationproblemislow.

Regardingstrategies, theresults are overall balanced acrossmetatargets. However, there are some exceptions,such as strategiesC, DandEformetatarget NDCG andstrategy Dformetatarget AUC.In all thesecases,the resultsfromall al- gorithmsscoreabove the baseline.There isno pairstrategy-metatarget whosebaselevel performance ofall algorithms is alwaysequalto orlower thanthebaseline,although theperformanceofthebestmetamodels ontheRMSEmetatargetis justslightlyabovethebaseline.

5.3.3. Computationalcost

Oneimportantaspect to beevaluated in thecomparisonof metafeaturestrategiesis theamount oftime requiredfor metafeatureextraction.Table3presentstherecordedvaluesforthetotalandaverageamountoftimerequired.Whilethe ﬁrstrefers tothe timerequiredto extractall metafeatures forall datasets,the second indicatesthe averagetime forone dataset.Thislastvalueisanindicatorofthetimerequiredfortheapplicationonafuturenewproblem.

Accordingtotheresults,strategiesC,A andEare thefastest,respectively, andtheonlythatseem feasibleintermsof computationalresources. Theseresults areexpecteddueto factthat theirmetafeatures arethesimplesyandrequireonly theratingmatrixtobeproduced.Ontheotherhand,strategiesDandB arethemosttimeconsuming. Thereasonbehind itlieswiththeusageofalternativedatastructuresandtheextractionofmoredetailedmetafeatures.

5.4.Discussion

Themainobservationsfromthisempiricalstudyarethefollowing:

• Allstrategiesoutperformthebaselinewithregardstotheaverageofthetop3algorithms.However,intermsofaverage performance, thestrategies onlyseemto be effectivefortheNDCG metatarget,whilefailingforthe NMAE. Regarding AUC andRMSE, thereis noclearpattern. Theseobservations indicatethat solving theCF algorithmselection problem using MtL is feasible andan eﬃcientalternative to thestandard empirical study.However, it also sheds light on an importantproblem:thepractitionermustdecidewhatisthecriteriatoselectthebestalgorithm,sincethemetatargets willreﬂectdifferentbehaviors.Andifthemetatargetsaredifferent,thendifferentmetafeaturesarerequiredtosolvethe problem.

• When comparing the average performancesacross metatargets, strategy Dyields the best results,closely followedby B. Strategies A, C andE have the worst performances,although in different metatargets. The best strategies have in commonthefactthattheyexplorethedatasetsinmoredetailandfocusthemetafeatures(directlyorindirectly)onthe userinteractions.Thismaybean indicationthatfutureworkshouldlooktowards morecomplexmetafeatures suchas thoseusedintheseworks.

• The best andworst meta-algorithms across metatargetsappearto be thepolynomial SVM andLDA, respectively. This can be seen by inspectingthe top positions ofeach algorithm ineach combinationof strategyand metatarget.Other algorithms do not have such an evident pattern. It is hard to attempt to explain whydoes this phenomenon occur.

Clearlythebiasinthealgorithmsistheresponsiblefactor,althoughtheactualreasonsarestillunknown.

• Intermsofbaselevelimpact,thestrategiesworkwellfortheNDCGmetatargetandpoorlyforRMSE.Theperformance intermsofstrategiesiscomparable,althoughtherearesomespecialcases.Thispointstotheimportanceofthistypeof

(15)

Table 4

Tukey’s test p -values for the comparison among baseline, average of all algorithms and average of top 3 algorithms. The p -values are highlighted if the differences are signiﬁcant, i.e. p -value 0.05.

Pairwise AUC NDCG NMAE RMSE All

top3-avg 0.078571 0.003901 0.0 0 0 024 0.0 0 0 074 0.0 0 0474 base-avg 0.350451 0.0 0 0191 0.206408 0.0734732 0.030089 base-top3 0.006020 0.0 0 0 0 01 0.0 0 0357 0.0 0 0 0 03 0.0 0 0 0 0 0

Table 5

Tukey’s test p -values for the comparison among strategies. The p -values are highlighted if the differences are signiﬁcant, i.e. p -value 0.05.

Pairwise AUC NDCG NMAE RMSE All

B-A 0.060590 0.999934 0.443748 0.615853 0.593524 C-A 0.894940 0.965678 0.938393 0.705771 0.915770 D-A 0.292990 0.819273 0.840159 0.433962 0.170064 E-A 0.292990 0.987142 0.611546 0.978983 0.973180 base-A 0.999639 0.428473 0.526624 0.999722 0.949847 C-B 0.003110 0.910 0 09 0.938393 0.999991 0.990722 D-B 0.973442 0.699014 0.985458 0.999722 0.973180 E-B 0.973442 0.998061 0.999815 0.954044 0.958747 base-B 0.027991 0.563215 0.999994 0.788278 0.130146 D-C 0.027991 0.998061 0.999815 0.998017 0.746388 E-C 0.027991 0.699014 0.985458 0.978983 0.999887 base-C 0.973442 0.096515 0.967604 0.858851 0.410626 E-D 1.0 0 0 0 0 0 0.428473 0.998667 0.858851 0.593524 base-D 0.166905 0.033901 0.994775 0.615853 0.016084 base-E 0.166905 0.819273 0.999994 0.998017 0.566884

analysis:althoughone canrankthestrategiesaccordinglytothemetalevelaccuracy,theresultsshow thattheirimpact onthebaselevelislowerthanexpected.

• Thereisatrade-off betweenaccuracyandcomputationaltime.Themostaccuratestrategieshasaprocessingtimeupto onethousandtimeslowerthanthetimerequiredforthefasteststrategy.However,theperformanceisthesecasesisnot optimal.Itwillbeuptothepractitionertodecidewhichonetouse.

Inordertovalidate theobservations,statisticalsignificancetestswereperformed. Mostvalidationsarecalculated with ANOVA(tounderstandiftherearedifferencesinthepopulationsstudiedineachobservation)andTukey’shonestsignificant difference tests (to understand whetherthe differences are statistical significant or not). Table 4 presentsthe resultsof the Tukey’s test for thefollowing null hypothesis:there is no difference in performance among baseline (base), average performance ofallalgorithms(avg) andaverageperformanceofthetop3algorithms (top3)foreachmetatargetandona generalpointofview.Thep-valuesarehighlighted ifthedifferencesare significant,i.e.p-value<0.05. Onecanassertthat the average ofthe top 3algorithms areindeed better than thebaseline. The results alsoshow that theonly metatarget for which the average of all algorithms is statistical significantly better than the baseline is NDCG. The variance in the resultsfortheother metatargetsmakeit impossibleto makesuch assertionforthose cases.However, onthe analysisfor allmetatargets,thedifferenceremainsstatisticallysignificant.Itisalsopossibletoconcludethattheaveragetop3isbetter thanaverageofallalgorithmsfortheNDCG,NMAEandRMSEmetatargetsandalsoonaglobalpointofview.

Inorder tovalidate the statisticalsignificanceinthe rankingof strategies,theprevious tests, butwiththeresults ag- gregatedbystrategy,canbeperformed.Thecomparisonsaremadeforeachmetatargetandalsoforthecombinationofall metatargets. The p-valuesforthe pairwise comparisonusingTukey’s testare presentedin Table5.The resultsshow that theobservationsregardingtherankingofstrategiesdonotholdstatisticalsignificance,exceptforfewspecialcases.Infact, whenconsidering allmetatargets, theonlystatisticallysignificant differenceisthat strategyDisbetterthan thebaseline.

Allothercomparisonsinthesetofallmetatargetsdonotholdanystatisticalsigniﬁcantvalue.Thistrendextendswhenthe comparisonismadeforeachmetatarget,withtheexceptionofAUC.Here,thereisevidencetosupportthatBisbetterthan thebaselineandthatCisworsethanB,DandE.IntheNDCGmetatarget,Disalsoshowntobebetterthanthebaseline.

Thestatisticalsigniﬁcanceofmultiplealgorithmsonmultipledatasetscanbe veriﬁedusingCriticalDifference(CD)diagrams[11].Thisconsistsofplottingtheaverageranking positionforeachelementbeingcompared andcalculatetheCD intervalthat statesthatthereisnoevidenceinthedatatoshow thattheelementswithinthat intervalcanbeconsidered different.InCDplots,twoelementsnotconnectedbyalinecanbeconsidereddifferent,i.e.,onealgorithmisrankedhigher thantheother.Inthiswork,thedatasetsareprovidedbythecombinationsofmetafeaturestrategiesandmetatargets,which refertotheindependentanddependentvariables,respectively.

TheCDdiagram inFig.10allows toconﬁrmthat SVMpolynomial isthehighestrankedalgorithm.However, thereare nostatisticallysigniﬁcant differencesamongSVMpolynomial,SVMradial,C4.5,KNNandSVMlinear.Despitethis,onecan

(16)

Fig. 10. CD diagrams for meta-algorithms across strategies and metatargets.

assertthatSVMpolynomialisbetterthanNaiveBayes,ctree,randomforest,C5.0,LDAandthebaselinemajorityvoting.In termsofstatisticalsigniﬁcance,nootherconclusionscanbedrawnfromthisdiagram.

6. Conclusions

ThisworkanalyzesindepththeproblemofalgorithmselectionforRecommenderSystems,withfocusonCollaborative Filtering.It startswithasystematicliteraturereviewregardingrelatedwork.Inthisreview,theproblemisframedwithin theclassicalMetalearningconceptualframeworkandeachoftheavailableworksispresentedanddiscussedontheseveral dimensionsoftheproblem.Afterwards,anempiricalstudyisperformedtoassessthequalityofasubsetoftheapproaches discussedontheselectionofalgorithms.Experimentalresultsregardingthemetalevelaccuracy,baselevelimpactandcom- putationalresourcesarepresented,discussedandstatisticallyvalidated.

TheliteraturereviewshowsthatmostworkisperformedonCollaborativeFiltering,withareducednumberofdatasets, algorithms andevaluation measuresin thebase-level. However, recentworkshaveimprovedsuch dimensions, effectively contributingtoadvancesintheﬁeld.Intermsofmeta-level,thereareseveralapproacheswhichlookatthealgorithmselec- tionproblemindifferentandvalidperspectives.Mostworksusedregressionapproacheswithawiderangeofmetafeatures.

However,therewasnocomparisonamongthem,whichpreventstheunderstandingoftheirmerits.

Tosolve thisproblem, we conducted an empirical study. It has shown that, in the best casescenarios, all strategies areabletocreateeffectivemodelstosolvethealgorithmselectionproblem.However, inadeeperlook,allstrategieshave differentbehaviorsdependingonthemetatargetsandmeta-algorithmsused,whichindicatesthat theresearchproblemis notyetsolved.TheresultshavealsoshownthatstrategyDisstatisticalsigniﬁcantlybetterthanthebaseline,althoughthe sameconclusionscannot be drawn regardingthe other strategies.The mainconclusion ofthisstudylies inthe factthat thereisnosinglebeststrategythatalwaysoutperformstheothers,butratheraspectsonwhichtheperformance isbetter.

Theauthorshighlightedtheadvantagesandweaknessesofeach work,butleavethechoiceofmetafeaturestrategy tothe practitioner.Nevertheless,theknowledgegatheredinthisdocumentallowstopositivelydirectfutureworkinthisresearch topic.

Acknowledgments

ThisworkisﬁnancedbytheERDFFundthroughtheOperationalProgramforCompetitivenessandInternationalization- COMPETE2020ofPortugal2020throughNationalInnovationAgency(ANI)aspartofproject3506.Theauthorsalsowishto acknowledgethePortuguesefundinginstitutionFCT-FundaçãoparaaCiˇenciaeaTecnologiaforsupportingtheirresearch throughgrantSFRH/BD/117531/2016andtheBrazilianfundingagenciesCNPqandFAPESP.

References

[1] G. Adomavicius , R. Sankaranarayanan , S. Sen , A. Tuzhilin , Incorporating contextual information in recommender systems using a multidimensional approach, ACM Trans. Inf. Syst. 23 (1) (2005) 103–145 .

[2] G. Adomavicius , J. Zhang , Impact of data characteristics on recommender systems performance, ACM Trans. Manag. Inf. Syst. 3 (1) (2012) 1–17 . [3] T. Bertin-Mahieux , D.P.W. Ellis , B. Whitman , P. Lamere , The million song dataset, in: International Conference on Music Information Retrieval, 2011 . [4] Y. Blanco-Fernández , J. Pazos-Arias , A. Gil-solla , M. Ramos-Cabrer , Providing entertainment by content-based ﬁltering and semantic reasoning in intel-

ligent recommender systems, IEEE Trans. Consum. Electron. 54 (2) (2008) 727–735 .

[5] J. Bobadilla , F. Ortega , A. Hernando , A. Gutiérrez , Recommender systems survey, Knowl. Based Syst. 46 (2013) 109–132 . [6] P. Brazdil , C. Giraud-Carrier , C. Soares , R. Vilalta , Metalearning: Applications to Data Mining, 1, Springer, 2009 . [7] R. Burke , Hybrid web recommender systems, The adaptive web (2007) 377–408 .