MethodsX 8 (2021) 101567
ContentslistsavailableatScienceDirect
MethodsX
journal homepage:www.elsevier.com/locate/mex
Method Article
Automated cleansing and harmonization of international trade data
Sandra Oliveira
∗, César Capinha , Jorge Rocha
Centre for Geographical Studies and Associated Laboratory TERRA, Institute of Geography and Spatial Planning, Universidade de Lisboa, Lisbon, Portugal
abstract
Largevolumesofdataarebecomingincreasinglyavailableandcanbeveryvaluablefortheanalysisofdifferent phenomena. These data can originate from multiple sources and be recorded in diverse formats, requiring preliminaryscrutiny inordertobefurther used inscientificanalyses. Thisfirstcrucialphase offiltering and cleansing dataisusuallyacumbersomeandtime-consumingtask,butautomatedroutinescanbedevelopedto helpresearchers.AroutinecreatedwiththeRlanguageisherepresented,toscreen,harmonizeandaggregate international trade data,representingthe tradeflows betweencountries forspecificproducts, inatimeframe thatcoversmonthlyflowsforatleast15yearsformostcountries.TheRscriptimplementingtheseroutinesis provided,beingeasilyadaptedtootherdatasetswithsimilarissues.
• A step-by-stepprocedurefor cleansing and harmonizing international trade data,using Rprogramming language,ispresented
•Automatedroutinesareveryeffectiveinobtainingrobustandfiltereddatainputstointegrateinscientific models
•Spatialandtemporalpatternsofworldwidetraderelationscanbeexploredtoenhanceourunderstanding ofvariousassociatedphenomena
© 2021TheAuthors.PublishedbyElsevierB.V.
ThisisanopenaccessarticleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/)
article info
Method name: Cleansing and harmonization of international trade data
Keywords: Automated screening, Data harmonization, Time-series analysis, R software
Article history: Received 9 September 2021; Accepted 30 October 2021; Available online 2 November 2021
∗ Corresponding author.
E-mail address: sandra.oliveira1@campus.ul.pt (S. Oliveira).
https://doi.org/10.1016/j.mex.2021.101567
2215-0161/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ )
2 S. Oliveira, C. Capinha and J. Rocha / MethodsX 8 (2021) 101567 SpecificationsTable
Subject Area: Environmental Science
More specific subject area: Spatial and time-series data analysis
Method name: Cleansing and harmonization of international trade data Name and reference of
original method:
Not applicable
Resource availability: R script available as supplementary material
Dataacquisition
Accesstostatisticaldataconcerningtheinternationaltradeofgoods(imports)wasobtainedfrom theInternationalTradeCentre (ITC)[1].The ITCenlarged thedigitalopen accessofitsdatabasefor several months in2020, providing monthlystatistics of imported goods by countryfree ofcharge, via their website[1].The availableinformation forspecificcommodities potentially associatedwith theaccidentalspreadofAedesspp.mosquitoes,specificallytyres(newandused)andliveplants,was collectedfor79countries.
Theaccesstothisdatabasefacedseveralconstraints,inparticularthelimitedamountofdatathat could bedownloaded ineachweb interactionandtheirsavingformat. Duetotheselimitations,the dataretrievedforeachcountrywascomposedofasetofseparatefiles, thetotalnumberdepending ontheyearsavailable,andthelastcolumnofafilewasautomaticallyassignedasthefirstcolumnof thefollowingfile,creatingduplicates.Thestructureofthefilesforonesinglecountrycouldalsodiffer, specificallyrelatedtotwoissues:i)thenumberofrowsineachfilewasdifferent,becausethenumber of exporter countries varied over time; ii) the number andposition of month/year columns could differ,becauseotherinformationwassometimesaddedbesidesthedate(e.g.units).Inaddition,the recordsforeachindividualcountrycouldbepresentedindifferentunits(forexamplekilograms,tons ornumberofindividual units),insomecaseswithmixedquantitiesforthesamedate,whichcould resultinamissing(zero)orpartialvalueforthetotalamountofimports,calculatedasa“World” row thatwasprovidedintheoriginaldataalongsideeachspecificexportercountry.
Methoddetails
A cleansing and harmonization procedure to transform and organize the trade data obtained from ITC wasdeveloped, using R scripting tools and datawrangling packages [2–8]. The proposed methodology (Fig. 1) was applied to each country and product individually, storing the results in specific folders createdalong thescreening process, tomaintainthe databaseorganizedforfurther use.
Fig. 1. Scheme of the overall methodology developed with R programming language for the cleansing and harmonization of the international trade data, including the major issues, data examples and processing steps.
S. Oliveira, C. Capinha and J. Rocha / MethodsX 8 (2021) 101567 3 Separatefilesforeachcountry
Thefirstissuetoovercomewasthedatabeingsavedinseparatefilesforeachproductandcountry, due to table size limitations in the downloading process (Fig. 1). In each web interaction, only a restrictednumberofdatecolumnscouldbedownloaded(n=20),whichrecordedtheyearandmonth ofthetradeflow. Formostcountriesdatawasavailable between2004and2019,butinsomecases theavailableperiodwasshorter,thereforethenumberofdownloadedfilescouldalsodifferbetween countries. All the corresponding files (for product and country) were imported to R software and storedasalist,basedonstringpatternsearchwithintheworkingfolder,andwereafterwardsmerged inaloopbythenameofthecountriesfromwheretheproductisimported(Exporters).Theresultis a singlefilewiththeamount ofimportsper year-monthincolumns,andthe nameofthemultiple Exportercountriesinrows.
Datarecordedindifferentunits
The second issuearose fromtheregistrationofthe importsin differentunits,dependingon the Exporter country, the date or both. As such, some of the columns did not correspond to import amounts (herebycalledvalues),butinsteadtothedescriptionoftheunitassignedtothevalue that wasrecorded inthe previous column(Fig. 1),such askilograms orindividual unitsof theproduct.
Moreover,differentunitscould berecordedinthesamecolumn.Thesecharacteristicsoftheoriginal data required a harmonization step that consistedin searching, within the merged dataframe, the location of the values recorded in different units and subsequently converting them to the same weight unit (in this case, to tons).The search wasfollowed by an indexing function, basedon the positionofthecorrespondingcolumnsandrows,sincethishelpsdefinepointerstowherethevalues arestoredwithinadataframe.Thishadtobeimplementedintwodifferentways:
1) When the entirecolumnrecorded the same weight unit butdifferent fromtons, the search was based on a full string pattern that was found in the last row of the merged dataframe, such as “Imported quantity, Units”. The index position of these columns was then used to convert simultaneouslyallthevaluesofthecolumnsthatmatchedthepatternsearchcriteria.
2) Inthe caseof the columnswhere thevalues were recorded inmultiple weight units,the search foreach specific value that requiredconversion wasbased on theindexing of thecorresponding columnandrowwhereeachvaluewaslocated.
Toautomatethisstep,thesearch,indexingandconversionofthevaluesweredonewithiterating functions. Toensure the applicability of the script fordifferent countriesand the executionof the whole procedure without obstacles, and dueto the possible differentformats andstructure of the original datapercountryandproduct,eachoftheloopingfunctionswasonlyruniftheappropriate conditionsweremet,otherwisetheexecutionofthescriptwouldproceedtothefollowingstep.
Duplicatedandnodatacolumns
The thirdmain issuewasrelatedto theduplicationofcolumns,resultingfromthemergeofthe original files, and to the existence of columns that presented the description of the type of unit assigned,insteadofimportamounts.Thiswashandledthroughthecleaningandsortingofcolumns, using iterationsforstring patternsearch andreplacementbased oncolumnnames,the selectionof the columns whichmatched specific criteriaand thesubsequent filtering ofthe merged dataframe by excluding theselectedcolumns. Theremaining columnswere thensorted byascending orderof year-month,creatingastructuredtime-seriesofproductimportsforeachspecificcountry.
Missingorpartialworldsumvalues
The final issue was associated with the mixed types of units that were recorded in the same column, which resulted in missingor partial values being provided forthe row that collected the
“World” value (sum of all imports)for the product and country. As such, the original row named
4 S. Oliveira, C. Capinha and J. Rocha / MethodsX 8 (2021) 101567
“World” inthecolumnofExporterswasreplacedbytheupdatedsumofallrowsforeachyear-month column,andthiswasplacedasthelastrowofthedataframe,overtakingthestrictalphabeticalorder ofExportersnamesthatthepreviousstepshadautomaticallyimplemented.
Conclusions
The data screening procedure herepresented wasable to overcomethe several issues found in the data acquired. The functionsapplied are easily accessible from R software and corresponding packages, and it can be implemented by different users, even those with limited experience in programming tools. For each product and country, the whole script runs in about 1 minute with computationcapabilitiesupto16GbRAM,anditispossibletoadjustittorunforseveralcountries simultaneously.Thescreeneddatacontainsafullyharmonizedtime-seriesdatabasewiththeamount (weightintons)ofimports regardingliveplantsandtyres,mostbeingavailable between2004and 2019. The monthly data that becameaccessible after the implementationof thisprocedure, allows fortheanalysisof seasonalandannualtrendsoftrade flowsbetweencountriesandregions, which can be explored inassociation withother phenomena, such asthe potential spread ofspeciesthat disseminate vector-borne diseases [9–10]. The base script that assembles the whole procedure is availableasasupplementaryfileandcanbeadaptedtootherdatawithsimilarissues.
DeclarationofCompetingInterest
The authors declare that they have no known competing financial interests or personal relationshipsthatcouldhaveappearedtoinfluencetheworkreportedinthispaper.
Acknowledgments
This work was financed by national funds through FCT—Portuguese Foundationfor Science and Technology, I.P., under the framework of the Project “TRIAD—health Risk and social vulnerability to ArboviralDiseases inmainland Portugal” [PTDC/GES-OUT/30210/2017] and by theResearch Unit UIDB/00295/2020andUIDP/00295/2020. CC was fundedthrough FCT, I.P.,under the programmeof
‘StimulusofScientificEmployment—IndividualSupport’withinthecontract’CEECIND/02037/2017’.
Supplementarymaterials
Supplementarymaterialassociatedwiththisarticlecanbefound,intheonlineversion,atdoi:10.
1016/j.mex.2021.101567.
References
[1] ITC, International Trade Centre. Trade statistics for international business development. Available from www.trademap.org [Retrieved in June-July 2020]
[2] R Core TeamR: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2020 URL https://www.R-project.org/ .
[3] Wickham, H.; François, R.; Henry, L.; Müller, K. (2020). dplyr: a Grammar of Data Manipulation. R package version 1.0.2.
https://CRAN.R-project.org/package=dplyr
[4] Wickham, H. (2020). tidyr: Tidy Messy Data. R package version 1.1.2. https://CRAN.R-project.org/package=tidyr
[5] H. Wickham, M. Averick, J. Bryan, W. Chang, L.D.A. McGowan, R. François, . . . H. Yutani, Welcome to the Tidyverse, J. Open Source Softw. 4 (43) (2019) 1686, doi: 10.21105/joss.01686 .
[6] Wickham, H.; Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl [7] Dragulescu, A.; Arendt, C. (2020). xlsx: Read, Write, Format Excel 2007 and Excel 97/20 0 0/XP/20 03 Files. R package version
0.6.5. https://CRAN.R-project.org/package=xlsx
[8] Microsoft and Steve Weston (2020). foreach: Provides Foreach Looping Construct. R package version 1.5.1. https://CRAN.
R-project.org/package=foreach
[9] P. Reiter , Aedes albopictus and the world trade in used tires, 1988-1995: the shape of things to come? J. Am. Mosq. Control Assoc. 14 (1) (1998) 83–94 .
[10] P.E. Hulme , Unwelcome exchange: International trade as a direct and indirect driver of biological invasions worldwide, One Earth 4 (5) (2021) 666–679 .