• Nenhum resultado encontrado

Automated cleansing and harmonization of international trade data

N/A
N/A
Protected

Academic year: 2022

Share "Automated cleansing and harmonization of international trade data"

Copied!
4
0
0

Texto

(1)

MethodsX 8 (2021) 101567

ContentslistsavailableatScienceDirect

MethodsX

journal homepage:www.elsevier.com/locate/mex

Method Article

Automated cleansing and harmonization of international trade data

Sandra Oliveira

, César Capinha , Jorge Rocha

Centre for Geographical Studies and Associated Laboratory TERRA, Institute of Geography and Spatial Planning, Universidade de Lisboa, Lisbon, Portugal

abstract

Largevolumesofdataarebecomingincreasinglyavailableandcanbeveryvaluablefortheanalysisofdifferent phenomena. These data can originate from multiple sources and be recorded in diverse formats, requiring preliminaryscrutiny inordertobefurther used inscientificanalyses. Thisfirstcrucialphase offiltering and cleansing dataisusuallyacumbersomeandtime-consumingtask,butautomatedroutinescanbedevelopedto helpresearchers.AroutinecreatedwiththeRlanguageisherepresented,toscreen,harmonizeandaggregate international trade data,representingthe tradeflows betweencountries forspecificproducts, inatimeframe thatcoversmonthlyflowsforatleast15yearsformostcountries.TheRscriptimplementingtheseroutinesis provided,beingeasilyadaptedtootherdatasetswithsimilarissues.

A step-by-stepprocedurefor cleansing and harmonizing international trade data,using Rprogramming language,ispresented

Automatedroutinesareveryeffectiveinobtainingrobustandfiltereddatainputstointegrateinscientific models

Spatialandtemporalpatternsofworldwidetraderelationscanbeexploredtoenhanceourunderstanding ofvariousassociatedphenomena

© 2021TheAuthors.PublishedbyElsevierB.V.

ThisisanopenaccessarticleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/)

article info

Method name: Cleansing and harmonization of international trade data

Keywords: Automated screening, Data harmonization, Time-series analysis, R software

Article history: Received 9 September 2021; Accepted 30 October 2021; Available online 2 November 2021

Corresponding author.

E-mail address: sandra.oliveira1@campus.ul.pt (S. Oliveira).

https://doi.org/10.1016/j.mex.2021.101567

2215-0161/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ )

(2)

2 S. Oliveira, C. Capinha and J. Rocha / MethodsX 8 (2021) 101567 SpecificationsTable

Subject Area: Environmental Science

More specific subject area: Spatial and time-series data analysis

Method name: Cleansing and harmonization of international trade data Name and reference of

original method:

Not applicable

Resource availability: R script available as supplementary material

Dataacquisition

Accesstostatisticaldataconcerningtheinternationaltradeofgoods(imports)wasobtainedfrom theInternationalTradeCentre (ITC)[1].The ITCenlarged thedigitalopen accessofitsdatabasefor several months in2020, providing monthlystatistics of imported goods by countryfree ofcharge, via their website[1].The availableinformation forspecificcommodities potentially associatedwith theaccidentalspreadofAedesspp.mosquitoes,specificallytyres(newandused)andliveplants,was collectedfor79countries.

Theaccesstothisdatabasefacedseveralconstraints,inparticularthelimitedamountofdatathat could bedownloaded ineachweb interactionandtheirsavingformat. Duetotheselimitations,the dataretrievedforeachcountrywascomposedofasetofseparatefiles, thetotalnumberdepending ontheyearsavailable,andthelastcolumnofafilewasautomaticallyassignedasthefirstcolumnof thefollowingfile,creatingduplicates.Thestructureofthefilesforonesinglecountrycouldalsodiffer, specificallyrelatedtotwoissues:i)thenumberofrowsineachfilewasdifferent,becausethenumber of exporter countries varied over time; ii) the number andposition of month/year columns could differ,becauseotherinformationwassometimesaddedbesidesthedate(e.g.units).Inaddition,the recordsforeachindividualcountrycouldbepresentedindifferentunits(forexamplekilograms,tons ornumberofindividual units),insomecaseswithmixedquantitiesforthesamedate,whichcould resultinamissing(zero)orpartialvalueforthetotalamountofimports,calculatedasa“World” row thatwasprovidedintheoriginaldataalongsideeachspecificexportercountry.

Methoddetails

A cleansing and harmonization procedure to transform and organize the trade data obtained from ITC wasdeveloped, using R scripting tools and datawrangling packages [2–8]. The proposed methodology (Fig. 1) was applied to each country and product individually, storing the results in specific folders createdalong thescreening process, tomaintainthe databaseorganizedforfurther use.

Fig. 1. Scheme of the overall methodology developed with R programming language for the cleansing and harmonization of the international trade data, including the major issues, data examples and processing steps.

(3)

S. Oliveira, C. Capinha and J. Rocha / MethodsX 8 (2021) 101567 3 Separatefilesforeachcountry

Thefirstissuetoovercomewasthedatabeingsavedinseparatefilesforeachproductandcountry, due to table size limitations in the downloading process (Fig. 1). In each web interaction, only a restrictednumberofdatecolumnscouldbedownloaded(n=20),whichrecordedtheyearandmonth ofthetradeflow. Formostcountriesdatawasavailable between2004and2019,butinsomecases theavailableperiodwasshorter,thereforethenumberofdownloadedfilescouldalsodifferbetween countries. All the corresponding files (for product and country) were imported to R software and storedasalist,basedonstringpatternsearchwithintheworkingfolder,andwereafterwardsmerged inaloopbythenameofthecountriesfromwheretheproductisimported(Exporters).Theresultis a singlefilewiththeamount ofimportsper year-monthincolumns,andthe nameofthemultiple Exportercountriesinrows.

Datarecordedindifferentunits

The second issuearose fromtheregistrationofthe importsin differentunits,dependingon the Exporter country, the date or both. As such, some of the columns did not correspond to import amounts (herebycalledvalues),butinsteadtothedescriptionoftheunitassignedtothevalue that wasrecorded inthe previous column(Fig. 1),such askilograms orindividual unitsof theproduct.

Moreover,differentunitscould berecordedinthesamecolumn.Thesecharacteristicsoftheoriginal data required a harmonization step that consistedin searching, within the merged dataframe, the location of the values recorded in different units and subsequently converting them to the same weight unit (in this case, to tons).The search wasfollowed by an indexing function, basedon the positionofthecorrespondingcolumnsandrows,sincethishelpsdefinepointerstowherethevalues arestoredwithinadataframe.Thishadtobeimplementedintwodifferentways:

1) When the entirecolumnrecorded the same weight unit butdifferent fromtons, the search was based on a full string pattern that was found in the last row of the merged dataframe, such as “Imported quantity, Units”. The index position of these columns was then used to convert simultaneouslyallthevaluesofthecolumnsthatmatchedthepatternsearchcriteria.

2) Inthe caseof the columnswhere thevalues were recorded inmultiple weight units,the search foreach specific value that requiredconversion wasbased on theindexing of thecorresponding columnandrowwhereeachvaluewaslocated.

Toautomatethisstep,thesearch,indexingandconversionofthevaluesweredonewithiterating functions. Toensure the applicability of the script fordifferent countriesand the executionof the whole procedure without obstacles, and dueto the possible differentformats andstructure of the original datapercountryandproduct,eachoftheloopingfunctionswasonlyruniftheappropriate conditionsweremet,otherwisetheexecutionofthescriptwouldproceedtothefollowingstep.

Duplicatedandnodatacolumns

The thirdmain issuewasrelatedto theduplicationofcolumns,resultingfromthemergeofthe original files, and to the existence of columns that presented the description of the type of unit assigned,insteadofimportamounts.Thiswashandledthroughthecleaningandsortingofcolumns, using iterationsforstring patternsearch andreplacementbased oncolumnnames,the selectionof the columns whichmatched specific criteriaand thesubsequent filtering ofthe merged dataframe by excluding theselectedcolumns. Theremaining columnswere thensorted byascending orderof year-month,creatingastructuredtime-seriesofproductimportsforeachspecificcountry.

Missingorpartialworldsumvalues

The final issue was associated with the mixed types of units that were recorded in the same column, which resulted in missingor partial values being provided forthe row that collected the

“World” value (sum of all imports)for the product and country. As such, the original row named

(4)

4 S. Oliveira, C. Capinha and J. Rocha / MethodsX 8 (2021) 101567

“World” inthecolumnofExporterswasreplacedbytheupdatedsumofallrowsforeachyear-month column,andthiswasplacedasthelastrowofthedataframe,overtakingthestrictalphabeticalorder ofExportersnamesthatthepreviousstepshadautomaticallyimplemented.

Conclusions

The data screening procedure herepresented wasable to overcomethe several issues found in the data acquired. The functionsapplied are easily accessible from R software and corresponding packages, and it can be implemented by different users, even those with limited experience in programming tools. For each product and country, the whole script runs in about 1 minute with computationcapabilitiesupto16GbRAM,anditispossibletoadjustittorunforseveralcountries simultaneously.Thescreeneddatacontainsafullyharmonizedtime-seriesdatabasewiththeamount (weightintons)ofimports regardingliveplantsandtyres,mostbeingavailable between2004and 2019. The monthly data that becameaccessible after the implementationof thisprocedure, allows fortheanalysisof seasonalandannualtrendsoftrade flowsbetweencountriesandregions, which can be explored inassociation withother phenomena, such asthe potential spread ofspeciesthat disseminate vector-borne diseases [9–10]. The base script that assembles the whole procedure is availableasasupplementaryfileandcanbeadaptedtootherdatawithsimilarissues.

DeclarationofCompetingInterest

The authors declare that they have no known competing financial interests or personal relationshipsthatcouldhaveappearedtoinfluencetheworkreportedinthispaper.

Acknowledgments

This work was financed by national funds through FCT—Portuguese Foundationfor Science and Technology, I.P., under the framework of the Project “TRIAD—health Risk and social vulnerability to ArboviralDiseases inmainland Portugal” [PTDC/GES-OUT/30210/2017] and by theResearch Unit UIDB/00295/2020andUIDP/00295/2020. CC was fundedthrough FCT, I.P.,under the programmeof

‘StimulusofScientificEmployment—IndividualSupport’withinthecontract’CEECIND/02037/2017’.

Supplementarymaterials

Supplementarymaterialassociatedwiththisarticlecanbefound,intheonlineversion,atdoi:10.

1016/j.mex.2021.101567.

References

[1] ITC, International Trade Centre. Trade statistics for international business development. Available from www.trademap.org [Retrieved in June-July 2020]

[2] R Core TeamR: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2020 URL https://www.R-project.org/ .

[3] Wickham, H.; François, R.; Henry, L.; Müller, K. (2020). dplyr: a Grammar of Data Manipulation. R package version 1.0.2.

https://CRAN.R-project.org/package=dplyr

[4] Wickham, H. (2020). tidyr: Tidy Messy Data. R package version 1.1.2. https://CRAN.R-project.org/package=tidyr

[5] H. Wickham, M. Averick, J. Bryan, W. Chang, L.D.A. McGowan, R. François, . . . H. Yutani, Welcome to the Tidyverse, J. Open Source Softw. 4 (43) (2019) 1686, doi: 10.21105/joss.01686 .

[6] Wickham, H.; Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl [7] Dragulescu, A.; Arendt, C. (2020). xlsx: Read, Write, Format Excel 2007 and Excel 97/20 0 0/XP/20 03 Files. R package version

0.6.5. https://CRAN.R-project.org/package=xlsx

[8] Microsoft and Steve Weston (2020). foreach: Provides Foreach Looping Construct. R package version 1.5.1. https://CRAN.

R-project.org/package=foreach

[9] P. Reiter , Aedes albopictus and the world trade in used tires, 1988-1995: the shape of things to come? J. Am. Mosq. Control Assoc. 14 (1) (1998) 83–94 .

[10] P.E. Hulme , Unwelcome exchange: International trade as a direct and indirect driver of biological invasions worldwide, One Earth 4 (5) (2021) 666–679 .

Referências

Documentos relacionados

Assim a avaliação do risco de queda é intervenção essencial para a sua prevenção (Chang et. al, 2004), sendo para isso importante a corre- ta utilização da Escala de Quedas de

A primeira apresenta a classificação de curvas de Descartes, contendo uma análise completa das curvas necessárias para resolver o problema de Papus para quatro retas

As so, open access and social media, two the trending phenomena when dealing with information management place a number of challenges but also opportunities for those who are

Given the amount of data contained even in a small database like this, and in order to be possible to compare each of these values within all the tests, we decided to only present

The probability of attending school four our group of interest in this region increased by 6.5 percentage points after the expansion of the Bolsa Família program in 2007 and

están semánticamente vinculados, cada uno de estos forma parte de una unidad sémica bien diferenciada por el plano sintáctico: la estructura he de xynágousa

Para a OCDE, a concorrência fiscal é uma ameaça ao welfare state (estado de bem-estar) que impõem tributos altos, redistribuindo a riqueza. A OCDE recentemente

These results suggest that variations in the content and in the molecular mass of pectins, in addition to changes in their composition and structure could be related to