Integrating scanner data in Consumer Price Index calculation: Consumer Price Index calculation using scanner data


Master Degree Program in Data Science and Advanced Analytics


Kostas Griska Internship Report

presented as partial requirement for obtaining the Master Degree Program in Data Science and Advanced Analytics

NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação

Universidade Nova de Lisboa

MDSAA


NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação

Universidade Nova de Lisboa

INTEGRATING SCANNER DATA IN CONSUMER PRICE INDEX CALCULATION

by

Kostas Griska

Internship report presented as partial requirement for obtaining the Master’s degree in Advanced Analytics, with a specialisation in Data Science.

Supervisor: Roberto Henriques

Co-Supervisor: -

November 2022


STATEMENT OF INTEGRITY

I hereby declare having conducted this academic work with integrity. I confirm that I have not used plagiarism or any form of undue use of information or falsification of results along the process leading to its elaboration. I further declare that I have fully acknowledged the Rules of Conduct and Code of Honor from the NOVA Information Management School.

Kostas Griska Lisbon, 2022


ABSTRACT

The conducted research is a work project of a Nova IMS student employed by Statistics Lithuania (National Statistics Office). The primary purpose of this research is to propose an alternative methodology for Consumer Price Index calculation using new data sources, which will contribute to national and international level consumer price statistics.

The current methodology for consumer price index calculation is based on the Laspeyres method for index calculation, which was last updated in 2004. Two decades ago, the supply of scanner data was minimal. Therefore, the methodology was based on physical price collectors and survey statistics.

Such methodology is known to be extremely costly and inefficient. This research will investigate the possibilities of incorporating new data sources in the consumer price statistics and investigate the alternative index calculation methods that could potentially replace the old model, which is known to struggle with sampling bias and chain drift problems. The literature review will cover these issues thoroughly and present possible multilateral or bilateral index alternatives to counter them.

The research will also cover the classification and sampling procedures necessary to construct time-series data. Obtained multilateral and bilateral index values are then compared against the current methodology, followed by conclusions and discussion sections.

The research does not only consider scanner data but also explicitly discusses the applications of web-scraped data. The results section reveals that Jevons and GEKS index values are not correlated, indicating that sales turnover may not be rationally correlated with price movements. This is valid evidence that web-scraped data can also be beneficial in consumer price index calculation and official statistics as a supplementary source of information.

KEYWORDS

Consumer Price Index; Bilateral; Multilateral; Scanner data; Machine Learning Classification


INDEX

1. Introduction
2. Literature review
2.1. CPI applications in Lithuania
2.2. Current method for Consumer price index calculation
2.2.1. Seasonality
2.2.2. The Chain Drift problem
2.3. Solving chain drift according to web-scraped and scanner data
2.3.1. Bilateral Jevons index
2.3.2. Multilateral indexes
2.3.2.1. GEKS index
2.3.2.2. GEKS - Jevons index
2.3.2.3. CCDI index
2.3.2.4. Geary Khamis index
2.3.2.5. Time Product Dummy index
2.4. Key findings
3. Methodology
3.1. Data Classification
3.2. Data Sampling
3.2.1. Static approach
3.2.2. Dynamic approach
3.3. Constructing Time series
3.3.1. Replacement treatment
3.3.2. Seasonality and imputation
4. Results and discussion
4.1. Bilateral index
4.1.1. Jevons index
4.2. Multilateral index
4.2.1. GEKS, GEKS-Jevons and CCDI indices
4.2.2. Geary Khamis and Time Product Dummy
5. Conclusion
6. Limitations and recommendations for future works
7. References


LIST OF FIGURES

Figure 1 - Index computation process
Figure 2 - Bilateral Jevons index values
Figure 3 - Multilateral GEKS methods results: Coffee
Figure 4 - Multilateral GEKS methods results: Milk
Figure 5 - Multilateral GEKS methods results: Sugar
Figure 6 - Multilateral matched-model method results: Coffee
Figure 7 - Multilateral matched-model method results: Milk
Figure 8 - Multilateral matched-model method results: Sugar
Figure 9 - Full results: Coffee
Figure 10 - Full correlation results: Coffee
Figure 11 - Full results: Milk
Figure 12 - Full correlation results: Milk
Figure 13 - Full results: Sugar
Figure 14 - Full correlation results: Sugar


LIST OF TABLES

Table 1.1 – Price and quantity data for products “a” and “b”
Table 1.2 – Fixed and chained Fisher, Paasche, Laspeyres and Törnqvist indexes


LIST OF ABBREVIATIONS AND ACRONYMS

CPI Consumer price index
NSO National Statistics Office
GEKS Gini, Eltető, Köves and Szulc method
CCDI Caves, Christensen, Diewert and Inklaar method
GK Geary Khamis method
TPD Time Product Dummy method
ML Machine Learning
NLP Natural Language Processing
COICOP Classification of Individual Consumption according to Purpose
TRT Time Reversal Test
PBT Price Bounce Test


1. INTRODUCTION

As machine learning and big data applications are current hot topics in official statistics, one particular topic discussed in this paper is the consumer price index (CPI) calculation using alternative data sources. CPI is used to measure a product’s relative price change in a given timeframe and is typically determined by continuously tracking price changes of products in a representative basket of goods, set annually by the National Statistics Office (NSO) (Statistics Lithuania (2007)).

In Lithuania, the current methodology for CPI calculation still relies on physical price collectors. Collecting product prices by hand is known to be costly and inefficient, and it introduces a particular bias into the computational process (Harchaoui & Janssen (2018)). However, this process can be automated and applied to a broader scope of products, eliminating fixed-basket bias, by integrating alternative data sources. The concept is not new and was first introduced by ILO (2004) in the Consumer Price Index calculation manual using Scanner Data, where scanner data refers to product turnover information collected as digital receipts as soon as they are available in the database. Such information can be collected using two sources: a direct connection to the database or, alternatively, web-scraping.

In recent years, applications for scanner data have gained more attention in official statistics, as scanner data allows faster, cheaper, and more accurate CPI monitoring because data can be collected much more efficiently. The practical benefits of implementing these calculation methods are also evident from the research conducted by the Australian Bureau of Statistics (2017), Statistics Belgium (2018), Statistics Poland (2020), and various other statistical institutions.

Statistics Lithuania, the Lithuanian national statistics office (NSO), has yet to develop programs that could handle high volumes of data according to the practical guide for scanner data processing (Eurostat (2017)), which could then be applied with the models discussed in the updated CPI calculation manual (ILO (2017)). Therefore, this paper will focus on possible modeling alternatives and the steps needed to implement an alternative calculation method using the different data sources proposed.

The main objective is to review the possible applications of different data sources and to propose a CPI calculation method that could be applied in modern-day statistics, tracking short-term price movements of an individual product. In his latest remarks, Diewert (2022) states that CPI should be calculated using bilateral or multilateral index calculation methods to integrate high-frequency and broad-ranging product information. Furthermore, by introducing a new methodology, it is expected to:

• Reduce price collection time,
• Reduce price collection costs,
• Reduce respondent burden,
• Increase product coverage,
• Increase CPI tracking frequency.

Statistics Lithuania is currently in the process of collecting data that will be used to deepen the understanding of index trend patterns and how the alternative index calculation methods correlate with the old model. Therefore, it is necessary to investigate the similarities and differences among different bilateral and multilateral price indices, which can be calculated using, respectively, web-scraped data or scanner data acquired directly from the vendor. The bilateral approach does not require sales quantity information for the calculation of CPI, so its input can be acquired through web scraping. In contrast, all multilateral index alternatives require the volume of sales to be included in the index calculation, that is, scanner (transaction) data, which can be acquired only through a direct connection to the retailer’s database.

The current methodology for consumer price index calculation applied by the Lithuanian NSO is based on the modified Laspeyres formula, which was introduced in 1989 and last revised in 2004 (Statistics Lithuania (2017)). The calculation refers to a fixed product basket covering products and services divided into 12 sections, 42 groups, and 95 classes under the Classification of Individual Consumption according to Purpose (COICOP), assigning them certain comparative weights (Statistics Lithuania (2017)). However, it can only be applied to a fixed sample of goods, suggesting that the calculation method is biased toward the product basket (Haan, 2006). Moreover, the Laspeyres index calculation method is known to fail the time reversal test, leading to the chain drift problem (ILO (2017)). De Haan (2008) found evidence limiting the application of the Laspeyres index, corroborated by Van der Grient (2011): some products do not bounce back to normal consumption levels after a clearance sale in a supermarket.

The literature review will present the findings of de Haan and Van der Grient, discuss the chain drift issue in greater detail, and investigate the alternative solutions to the problem proposed in the revised consumer price index manual (ILO (2017)) and in Diewert’s (2020) remarks on it. Both suggest that the bilateral Jevons index or multilateral methods (GEKS, TPD, CCDI, …) should be applied to avoid the chain drift and replacement problems that the current index calculation methodology is facing.

The research will investigate similarities and differences between the currently employed method and the suggested multilateral and bilateral index calculation alternatives. Particular attention will be paid to the bilateral Jevons method, which does not require quantitative sales information and can therefore be applied using web-scraped data.

One of the main objectives of the methodology section is to build an automated process for CPI calculation: a process that is easily interpretable, capable of tracking daily price changes, and able to counter the chain drift problem.

This research is organized as follows. After the introduction, in the literature review chapter, the paper will present the importance of consumer price index calculation in the national context (2.1.). Section (2.2.) will then present the methodology currently employed and the flaws associated with it. Next, section (2.3.) will discuss Diewert’s (2020) suggestions on how bilateral (2.3.1.) or multilateral (2.3.2.) price indexes could solve the chain drift issue that the current methodology is facing, emphasizing that the CPI calculation methodology should be revised. Afterward, the methodology section (3) will present the process of computing an alternative consumer price index. Finally, the remaining three chapters conclude the paper by providing the results of the applied calculation methods (4) and then discussing the research’s findings and limitations in chapters (5) and (6).


2. LITERATURE REVIEW

2.1. CPI APPLICATIONS IN LITHUANIA

In Lithuania, the price collection concept was first introduced in 1989. At that time, the methodology covered retail production prices and their turnover in urbanized areas. Such information was collected through survey statistics. A year later, CPI was introduced to suburban and rural areas, expanding its geographical coverage. The calculation became necessary after the Maastricht Treaty was signed in 1992. By signing the treaty, all countries committed to ensuring economic convergence. One of the main criteria for progress monitoring is price stability, expressed through the rate of inflation. With the increased attention, post-1992, when the International Monetary Fund revised the applications of CPI, a new methodology was introduced (Statistics Lithuania (2007)).

The primary purpose of the Consumer Price Index in Lithuania is to measure the country’s inflation level, expressed as monetary costs for purchasing goods and services. CPI is the central unit of measurement necessary for calculating benefits established by the Personal Income Guarantee Act. The act covers pensions, wages, minimum and state-supported income, and social allowances. In addition, CPI directly and indirectly influences local tolls, corporate taxes, heritage and donated property laws, and court fines (Statistics Lithuania (2017)).

The Consumer Price Index is calculated using the modified Laspeyres formula, where prices of consumer goods and services are aggregated and weighted by national household budget survey results. According to the COICOP categories, classes and sub-classes of goods and services are divided into 12 sections, 42 groups, and 95 classes and assigned certain comparative weights. Services, durable goods, and non-durable goods obtain different yearly weights based on yearly consumption. For example, food and beverages cover approximately 30% of the basket, transportation about 10%, and education about 2% (Statistics Lithuania (2007)).

2.2. CURRENT METHOD FOR CONSUMER PRICE INDEX CALCULATION

Since the inception of CPI calculation, the methodology has relied on two special cases, which figure very prominently in the literature. Two German economists, Étienne Laspeyres and Hermann Paasche (1865), were working side by side on different methods to measure real price changes. The outcomes of their individual research were two price index calculation methods still used today. In practice, the Laspeyres index can be defined as the weighted average of the price relatives, where the weights are the base-period value shares:

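$$P_L^{0,t} = \sum_{i=1}^{n} \left( \frac{p_i^t}{p_i^0} \right) s_i^0 \tag{1}$$

The Paasche index is the corresponding weighted harmonic average of the price relatives, with current-period value shares as weights:

$$P_P^{0,t} = \left\{ \sum_{i=1}^{n} \left( \frac{p_i^t}{p_i^0} \right)^{-1} s_i^t \right\}^{-1} \tag{2}$$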
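The share-weighted definitions of the Laspeyres and Paasche indexes can be checked against the more familiar expenditure-ratio form with a few lines of Python. This is a minimal sketch: the function names and the two-product data are hypothetical and are not taken from the report.

```python
def value_shares(prices, quantities):
    """Value (expenditure) share of each product: p_i*q_i / sum_j p_j*q_j."""
    values = [p * q for p, q in zip(prices, quantities)]
    total = sum(values)
    return [v / total for v in values]


def laspeyres(p0, pt, q0):
    """Arithmetic mean of price relatives weighted by base-period value shares."""
    s0 = value_shares(p0, q0)
    return sum((new / old) * s for old, new, s in zip(p0, pt, s0))


def paasche(p0, pt, qt):
    """Harmonic mean of price relatives weighted by current-period value shares."""
    st = value_shares(pt, qt)
    return 1.0 / sum(s / (new / old) for old, new, s in zip(p0, pt, st))


# Hypothetical two-product example (not data from the report).
p0, q0 = [1.0, 2.0], [10, 5]   # base-period prices and quantities
pt, qt = [1.2, 1.8], [8, 6]    # current-period prices and quantities

print(laspeyres(p0, pt, q0))   # same value as sum(pt*q0) / sum(p0*q0)
print(paasche(p0, pt, qt))     # same value as sum(pt*qt) / sum(p0*qt)
```

The two forms agree because multiplying each price relative by its value share cancels the reference-period price, leaving a ratio of basket costs.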

$$I_L^{i} = \frac{\sum_{j=1}^{18} P_L^{i,j} S^{j}}{\sum_{j=1}^{18} S^{j}} \tag{3}$$

After assigning territorial weights, individual price indices are aggregated to upper-level price indices using the relative product weight in the basket. Aggregation of the products is based on COICOP classification levels as well as total CPI.

$$I_L = \sum_{i=1}^{n} w_0^{i} I_L^{i}, \quad \text{where} \quad \sum_{i=1}^{n} w_0^{i} = 1 \tag{4}$$

Laspeyres and Paasche are proven to be biased towards the basket, which may suffer from seasonal changes (Haan, 2006).

2.2.1. Seasonality

Seasonal products often have a particular pattern that must be identified to better understand the CPI. Seasonal price fluctuations are usually temporary and have a known duration. For example, in summer a person can buy fresh, locally grown vegetables which will not be available in the winter. With increased supply during the summer, prices are expected to drop, and vice versa.

According to the CPI Manual, there is no uniform list of seasonal products for all countries, and the list should be compiled according to the location. In addition, a distinction should be made between “strongly” seasonal products that are available only part of the year (in season) and “weakly” seasonal ones that are available throughout the year but whose prices and availability for purchase fluctuate significantly within the year. Seasonal products can be identified using Time Reversal or Price Bounce Tests.

Time Reversal Test (TRT) - General rule of the test: if all the data for two periods are interchanged, the resulting price index should equal the reciprocal of the original price index.

$$P^{0,t} \cdot P^{t,0} = 1 \tag{5}$$

Price Bounce Test (PBT) - if the ordering of price quotes for both periods is changed in possibly different ways, the elementary index has to remain unchanged.

After identifying the seasonal products, there are two approaches to dealing with seasonal prices:

1. A fixed-weight method that uses the annual weight for the seasonal product in all months, using an imputed price in the out-of-season months.

2. A seasonal weight method, where the weight is zero for out-of-season months and the annual weight is used for in-season months.

Statistics Lithuania employs both techniques for seasonal products. In both methods, the first step is to estimate the seasonal products’ typical prices of the previous month, calculated as the arithmetic average of the prices for every given territorial unit. Monthly prices would then be calculated as follows:

$$\bar{P}_m^{i,j} = \frac{1}{k_m^{i,j}} \sum_{a=1}^{k_m^{i,j}} p_{m,a}^{i,j} \tag{6}$$

Here:

$\bar{P}_m^{i,j}$ - average price of product or service (i) in the given month (m), calculated for territorial unit (j).

$p_{m,a}^{i,j}$ - price of the product or service (i) per measurement unit in the given month (m), collected in the territorial unit (j) at trading point (a).

$k_m^{i,j}$ - number of prices collected for the product or service (i) in the territorial unit (j) during the given month (m).

$a$ - trading point, where $a = 1, \dots, k_m^{i,j}$.
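The averaging step of equation (6) and the territorial weighting that follows it can be sketched in Python. This is an illustrative sketch only: the record layout (`product`, `territory`, `month`, `price`), the function names, and the sample quotes, indices, and weights are assumptions made for the example, not Statistics Lithuania's actual data structures.

```python
from collections import defaultdict


def average_prices(quotes):
    """Mean of the price quotes collected per (product, territory, month)."""
    grouped = defaultdict(list)
    for quote in quotes:
        key = (quote["product"], quote["territory"], quote["month"])
        grouped[key].append(quote["price"])
    # k = len(prices) is the number of trading points quoted for that key
    return {key: sum(prices) / len(prices) for key, prices in grouped.items()}


def territorial_aggregate(index_by_territory, weight_by_territory):
    """Weighted average of territorial indices using territorial weights."""
    total_weight = sum(weight_by_territory[j] for j in index_by_territory)
    weighted = sum(index_by_territory[j] * weight_by_territory[j]
                   for j in index_by_territory)
    return weighted / total_weight


# Hypothetical quotes: three trading points in one territory, one in another.
quotes = [
    {"product": "milk", "territory": "Vilnius", "month": "2022-01", "price": 1.00},
    {"product": "milk", "territory": "Vilnius", "month": "2022-01", "price": 1.10},
    {"product": "milk", "territory": "Vilnius", "month": "2022-01", "price": 1.20},
    {"product": "milk", "territory": "Kaunas", "month": "2022-01", "price": 0.90},
]
averages = average_prices(quotes)
print(averages)
```

Keeping the grouping key explicit makes it easy to extend the same function to daily data by changing only the `month` field.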

For products that are seasonal but available throughout the year, inter-season price weights would then be applied for every product identified (seasonal weight method).

The fixed-weight method is applied for strongly seasonal products (e.g., snowshoes). Every missing price for a product would be imputed as a weighted price average per territorial unit of similar products in the respective COICOP classes or categories. The prices of products available in both the previous and current months would replace the missing prices of seasonal products.

Using chained indexes is a viable solution if data is collected on a monthly basis and the supply remains constant. In practice, the CPI manual has failed to take into account the severity of chain drift¹, which is typically caused by product sales (discounted products). The example provided by Haan & Grient (2011)² proves that the chained Laspeyres index would not satisfy the price bounce test and would lead to chain drift.

2.2.2. The Chain Drift problem

Suppose that prices and purchased quantities for commodities “a” and “b” are available for four periods. Commodity a is subject to periodic sales where its price drops by 50%, and commodity b has no price or quantity deviations.

After announcing the sale of product a, the quantity sold surged to 5000 units, but the following month, as consumers had stocked up on the product during the discount, the quantity of units sold dropped to 1 and returned to the typical sales level only in period 4. This can be seen in the table below:

Table 1.1 – Price and quantity data for products “a” and “b”

Period t    p_a^t    p_b^t    q_a^t    q_b^t
1           1.0      1.0      10       100
2           0.5      1.0      5000     100
3           1.0      1.0      1        100
4           1.0      1.0      10       100

¹ “Chain drift occurs if a chained index does not return to unity when prices in the current period return to their levels in the base period” (ILO, 2004, p. 445).

² Example based on an actual example that used Dutch scanner data. When a detergent product went on sale in the Netherlands at approximately one half of the regular price, the volume sold shot up approximately one thousand fold; see de Haan (2006; 15) and Haan & Grient (2011).
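The effect of chaining on the Table 1.1 data can be reproduced with a short Python sketch. The function names are illustrative and the code is only one straightforward way to compute fixed-base and chained indexes; it is not the implementation used in the report. Every fixed-base index returns to 1 in period 4, while the chained Laspeyres index ends near 1.872 and the chained Paasche index near 0.512.

```python
import math

# Table 1.1: (price_a, price_b) and (quantity_a, quantity_b) for periods 1..4
prices = [(1.0, 1.0), (0.5, 1.0), (1.0, 1.0), (1.0, 1.0)]
quantities = [(10, 100), (5000, 100), (1, 100), (10, 100)]


def laspeyres(p0, pt, q0, qt):
    return sum(p * q for p, q in zip(pt, q0)) / sum(p * q for p, q in zip(p0, q0))


def paasche(p0, pt, q0, qt):
    return sum(p * q for p, q in zip(pt, qt)) / sum(p * q for p, q in zip(p0, qt))


def fisher(p0, pt, q0, qt):
    return math.sqrt(laspeyres(p0, pt, q0, qt) * paasche(p0, pt, q0, qt))


def tornqvist(p0, pt, q0, qt):
    v0 = [p * q for p, q in zip(p0, q0)]
    vt = [p * q for p, q in zip(pt, qt)]
    s0 = [v / sum(v0) for v in v0]
    st = [v / sum(vt) for v in vt]
    return math.exp(sum(0.5 * (a + b) * math.log(new / old)
                        for a, b, old, new in zip(s0, st, p0, pt)))


def fixed_base(index):
    """Compare every period directly with period 1."""
    return [index(prices[0], prices[t], quantities[0], quantities[t])
            for t in range(len(prices))]


def chained(index):
    """Cumulate period-on-period links."""
    series = [1.0]
    for t in range(1, len(prices)):
        link = index(prices[t - 1], prices[t], quantities[t - 1], quantities[t])
        series.append(series[-1] * link)
    return series


for name, fn in [("Laspeyres", laspeyres), ("Paasche", paasche),
                 ("Fisher", fisher), ("Tornqvist", tornqvist)]:
    print(name,
          [round(x, 3) for x in fixed_base(fn)],
          [round(x, 3) for x in chained(fn)])
```

The drift comes entirely from the quantity weights: the link from period 2 to period 3 is weighted by the sale-period quantity of 5000 units, so the price recovery is counted far more heavily than the preceding price drop.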


After calculating the fixed-base and chained index values for the Fisher, Laspeyres, Paasche, and Törnqvist indexes (Table 1.2), it is evident that all of the fixed-base methods would satisfy the Walsh test, but the chained indices would suffer from chain drift (Haan (2008)).

Table 1.2 – Fixed and chained Fisher, Paasche, Laspeyres, and Törnqvist indexes

            Fixed base                       Chained
Period t    Fisher   Laspeyres   Paasche    Fisher   Törnqvist   Laspeyres   Paasche
1           1.000    1.000       1.000      1.000    1.000       1.000       1.000
2           0.698    0.955       0.510      0.698    0.694       0.955       0.510
3           1.000    1.000       1.000      0.979    0.972       1.872       0.512
4           1.000    1.000       1.000      0.979    0.972       1.872       0.512

This proves that if the corresponding quantity does not return to its typical level in the following period, the chained indices will have a chain link bias.

To address the chain drift problem, Diewert (2022) identifies five possible solutions:

• Use a fixed base index
• Use a multilateral index
• Use annual weights for a past year
• Use Jevons index
• Use an adjusted superlative index

Two problems can be identified with the first option. First, using a fixed base index, the results will depend on the base period and product choice. Second, an index pegged to a specific product that might disappear over time will lack price and quantity representativeness.

According to Diewert & Fox (2017), the issue with the multilateral index is that each time an extra period of data becomes available, the index should be recomputed. This option is a good solution provided that the data for previous periods remains available at all times.

The third option will inevitably result in substitution bias. Disappearing seasonal products would be hard to match. Therefore, the index might drift and lose representativeness (Diewert, Huwiler, and Kohli (2009)).

Considering the fourth option, using the Jevons index would require less information about product consumption; however, it would require an expanded list of available products. The index relies purely on product prices and is the only bilateral (direct) index free from chain drift. The main issue is that products matched by COICOP categories might lead to a biased CPI, as categories would match unimportant products with prioritized ones.

Further, the project will narrow down Diewert’s (2022) propositions and evaluate the potential complexity of implementing different index calculation methods using scraped and scanner data.

Finally, in order to avoid substitution bias, Jevons and multilateral indexes will be considered.

2.3. SOLVING CHAIN DRIFT ACCORDING TO WEB-SCRAPED AND SCANNER DATA

To begin with, it is necessary to emphasize that web-scraped data carries limited information. Web-scraped product data does not include turnover information. The information retrieved from HTML pages carries product name, category, and price metadata. This is sufficient to apply several of the proposed methods, as they do not require information on sales but instead focus on regression and sampling techniques.

On the other hand, scanner data can provide precise quantities and prices of the products, which allows the data to be used with any multilateral index and thereby addresses the chain drift and substitution bias problems.
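The difference between the two sources can be made concrete with a small record type. The sketch below is hypothetical: the class name, field names, and sample values are assumptions chosen for illustration and do not describe any retailer's actual schema. The only point it encodes is that a web-scraped observation has no quantity, so quantity-weighted indexes cannot be computed from it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriceObservation:
    product_name: str
    category: str
    period: str
    price: float
    quantity: Optional[float] = None  # None: web-scraped, no turnover information


def supports_weighted_indices(observations):
    """True only when every observation carries a sold quantity."""
    return all(obs.quantity is not None for obs in observations)


# Hypothetical observations of the same product from the two sources.
scraped = [PriceObservation("Coffee 250 g", "coffee", "2022-01", 4.99)]
scanner = [PriceObservation("Coffee 250 g", "coffee", "2022-01", 4.99, quantity=120)]

print(supports_weighted_indices(scraped), supports_weighted_indices(scanner))
```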

2.3.1. Bilateral Jevons index

The first option, which is particularly important in the web-scraped data context, is the Jevons index ($P_J^{0,t}$). It is constructed as a bilateral price index under the assumption that all products that fall into the same category have the same probability of being selected. Expenditure data is therefore eliminated from the analysis, and the index is calculated by directly comparing each product’s price with its price in the reference period:

$$P_J^{0,t} = \prod_{i=1}^{n} \left( \frac{p_i^t}{p_i^0} \right)^{1/n} \tag{7}$$

The Jevons index is preferred in situations where expenditure data is limited. As mentioned in section 2.2.2, it is the only elementary aggregate index that is free from chain drift. In this case, where the NSO conducts household expenditure surveys only every 4-5 years, web-scraped prices would provide meaningful insight into price fluctuations at the elementary level.

The main downside of having an elementary index is that it will inevitably have a bias towards the product basket. As a result, irrelevant products might be matched in the same categories, and the index would not reflect the full scale of impact when an item in the category gets discounted.
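A Jevons index needs nothing but matched prices, which is why it suits web-scraped data. The Python sketch below is a minimal illustration with hypothetical product names and prices; the function name `jevons` and the dictionary layout are assumptions. It also checks two properties discussed in this section: the time reversal test, and the absence of chain drift when the product set stays fixed.

```python
import math


def jevons(p0, pt):
    """Unweighted geometric mean of price relatives over matched products."""
    matched = sorted(set(p0) & set(pt))
    if not matched:
        raise ValueError("no products are priced in both periods")
    log_sum = sum(math.log(pt[i] / p0[i]) for i in matched)
    return math.exp(log_sum / len(matched))


# Hypothetical web-scraped prices for one category over three months.
jan = {"coffee_a": 4.00, "coffee_b": 5.00, "coffee_c": 6.50}
feb = {"coffee_a": 3.00, "coffee_b": 5.00, "coffee_c": 6.50}  # coffee_a on sale
mar = dict(jan)                                               # prices return

direct = jevons(jan, mar)
chained = jevons(jan, feb) * jevons(feb, mar)
print(direct, chained)                        # both equal 1 up to rounding
print(jevons(jan, feb) * jevons(feb, jan))    # time reversal: product is 1
```

With a changing product set the matched sample differs from link to link, so the chained and direct values are no longer guaranteed to coincide.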

2.3.2. Multilateral indexes

2.3.2.1. GEKS index

The first multilateral method to be discussed is GEKS, which gives the index value between two periods (0, t). The GEKS index is calculated as the geometric average of the ratios of the matched-model bilateral price indices $I^{0,l}$ and $I^{t,l}$, where each period $l$ is taken as the base. Under the assumption that the bilateral price indices satisfy the time reversal test, $P^{a,b} \cdot P^{b,a} = 1$, the GEKS index can be written as:

$$P_{GEKS}^{0:t} = \prod_{l=0}^{T} \left( \frac{I^{0,l}}{I^{t,l}} \right)^{\frac{1}{T+1}} \tag{8}$$

In the standard form provided by Haan & Grient (2011), the index uses a bilateral Fisher index, which satisfies the time reversal test and is often applied as an elementary aggregator at the base level.

$$P_F = \left[ \frac{\sum_i p_i^t q_i^0}{\sum_i p_i^0 q_i^0} \cdot \frac{\sum_i p_i^t q_i^t}{\sum_i p_i^0 q_i^t} \right]^{1/2} \tag{9}$$

As mentioned, the Fisher index can also be replaced by alternative bilateral indices. Two alternative base indices are suggested in the CPI manual (ILO, 2017): the Jevons and Törnqvist indices.

2.3.2.2. GEKS - Jevons index

The GEKS - Jevons alternative is a viable option when expenditure data is unavailable, which is the case for web-scraped data. Given the sets of products priced in two countries and the number of products available in both countries, the binary elementary index can be written as:

(10)

However, Sergeev (2003) suggested a modified GEKS-Jevons index formula, which includes three extra components that weight the sets of indices available. These components represent the items available in the first country but not in the second, the items available in the second country but not in the first, and the items available in both countries. In this case, the binary elementary index is computed as a weighted geometric mean with weights proportional to the number of items in the different components.

(11)

Here, the first two components are the representative sets of products available in one country but not in the other, while the third is the set of products available in both countries.

As the binary price indices are not transitive by nature, because each index is based on the prices of a different set of products, transitivity is attained by applying the GEKS procedure to the binary estimates:

(12)

Issues associated with this method are that the values might be distorted, as they rely on exact product availability: a product might not be available in both countries due to seasonality and disappearing products. However, Eurostat (2017) indicates that the issue can be addressed by equalizing the number of products selected.
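The GEKS procedure can be sketched in Python as a geometric mean of bilateral index ratios over every possible base period. The code is an illustrative sketch, not the report's implementation: the function names and the per-period dictionary layout are assumptions, and the data are the four periods of Table 1.1. The bilateral formula is passed in as a function, so the Fisher index can be swapped for the unweighted Jevons index when no quantities are available.

```python
import math

# One dict per period (Table 1.1): product -> (price, quantity)
periods = [
    {"a": (1.0, 10), "b": (1.0, 100)},
    {"a": (0.5, 5000), "b": (1.0, 100)},
    {"a": (1.0, 1), "b": (1.0, 100)},
    {"a": (1.0, 10), "b": (1.0, 100)},
]


def fisher(base, comp):
    """Bilateral Fisher index on the products present in both periods."""
    items = sorted(set(base) & set(comp))
    laspeyres = (sum(comp[i][0] * base[i][1] for i in items)
                 / sum(base[i][0] * base[i][1] for i in items))
    paasche = (sum(comp[i][0] * comp[i][1] for i in items)
               / sum(base[i][0] * comp[i][1] for i in items))
    return math.sqrt(laspeyres * paasche)


def jevons(base, comp):
    """Bilateral Jevons index on matched products; quantities are ignored."""
    items = sorted(set(base) & set(comp))
    return math.exp(sum(math.log(comp[i][0] / base[i][0]) for i in items)
                    / len(items))


def geks(periods, t, bilateral=fisher):
    """Geometric mean over all base periods l of I(0,l) / I(t,l)."""
    n = len(periods)  # T + 1 periods, numbered 0..T
    log_sum = sum(math.log(bilateral(periods[0], periods[l])
                           / bilateral(periods[t], periods[l]))
                  for l in range(n))
    return math.exp(log_sum / n)


series = [geks(periods, t) for t in range(len(periods))]
print([round(x, 3) for x in series])
```

Because the last period repeats the prices and quantities of the first, the GEKS value for that period is exactly 1, in contrast to the chained bilateral indexes computed from the same data.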

2.3.2.3. CCDI index

Another alternative bilateral index that is proposed for use with the GEKS method is the Törnqvist index:

$$P_T^{0,t} = \prod_{i \in S} \left( \frac{p_i^t}{p_i^0} \right)^{\frac{s_i^0 + s_i^t}{2}} \tag{13}$$

Here, $s_i^0$ and $s_i^t$ are the respective expenditure shares of periods 0 and $t$.

$$s_i^0 = \frac{p_i^0 q_i^0}{\sum_{j \in S} p_j^0 q_j^0} \tag{14}$$

$$s_i^t = \frac{p_i^t q_i^t}{\sum_{j \in S} p_j^t q_j^t} \tag{15}$$

This application was first mentioned by Caves, Christensen and Diewert (1982), who suggested that the Törnqvist formula be used in the GINI³ methodology, creating the so-called CCD method. The main idea was to have an index that would provide quantity comparisons across production units. Haan & Grient (2011) proposed that a similar methodology could be applied in CPI calculation as well. After the proposal, Inklaar and Diewert (2015) extended the application of the Törnqvist index in the CCD method. In their version, the index provides price comparisons across all production units. The CCDI method:

$$P_{CCDI}^{0:t} = \prod_{l=0}^{T} \left( \frac{P_T^{0,l}}{P_T^{t,l}} \right)^{\frac{1}{T+1}} \tag{16}$$

The updated CPI manual (ILO (2017)) suggests that the same outcome can be achieved by taking the logarithm of the CCDI price level of period $t$, $p_{CCDI}^{t}$, defined as a comparison of the prices in period $t$ with the sample average prices $p_{\bullet n}$.

$$\ln p_{CCDI}^{t} \equiv \sum_{n=1}^{N} \frac{1}{2} \left( s_{tn} + s_{\bullet n} \right) \left( \ln p_{tn} - \ln p_{\bullet n} \right) \tag{17}$$

The sample average sales share for product $n$, $s_{\bullet n}$, and the sample average log price for product $n$, $\ln p_{\bullet n}$, for $n$ = 1,…,N are defined as follows:

$$s_{\bullet n} \equiv \sum_{t=1}^{T} \frac{1}{T} s_{tn} \tag{18}$$

$$\ln p_{\bullet n} \equiv \sum_{t=1}^{T} \frac{1}{T} \ln p_{tn} \tag{19}$$

The CCDI calculation method is expected to approximate the GEKS method, as the Törnqvist bilateral index approximates the bilateral Fisher index (Diewert (2022)).

³ […] shares data into a single statistic, which summarises the dispersion of income across the entire income distribution.
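The two routes to the CCDI index, GEKS applied to bilateral Törnqvist indexes and the log price level formulation, can be compared numerically. The Python sketch below assumes a complete panel (every product sold in every period), in which case the two computations coincide; the function names are illustrative and the data are again the Table 1.1 periods.

```python
import math

# One dict per period (Table 1.1): product -> (price, quantity)
periods = [
    {"a": (1.0, 10), "b": (1.0, 100)},
    {"a": (0.5, 5000), "b": (1.0, 100)},
    {"a": (1.0, 1), "b": (1.0, 100)},
    {"a": (1.0, 10), "b": (1.0, 100)},
]


def shares(period):
    """Expenditure share of each product within one period."""
    turnover = sum(p * q for p, q in period.values())
    return {i: p * q / turnover for i, (p, q) in period.items()}


def tornqvist(base, comp):
    """Bilateral Tornqvist index (assumes the same products in both periods)."""
    s_base, s_comp = shares(base), shares(comp)
    return math.exp(sum(0.5 * (s_base[i] + s_comp[i])
                        * math.log(comp[i][0] / base[i][0]) for i in base))


def ccdi_via_geks(periods, t):
    """GEKS procedure with the Tornqvist index as the bilateral formula."""
    n = len(periods)
    log_sum = sum(math.log(tornqvist(periods[0], periods[l])
                           / tornqvist(periods[t], periods[l]))
                  for l in range(n))
    return math.exp(log_sum / n)


def ccdi_log_levels(periods):
    """Log price level per period from sample-average shares and log prices."""
    n = len(periods)
    s = [shares(period) for period in periods]
    items = list(periods[0])
    s_avg = {i: sum(s_t[i] for s_t in s) / n for i in items}
    lnp_avg = {i: sum(math.log(period[i][0]) for period in periods) / n
               for i in items}
    return [sum(0.5 * (s[t][i] + s_avg[i])
                * (math.log(periods[t][i][0]) - lnp_avg[i]) for i in items)
            for t in range(n)]


levels = ccdi_log_levels(periods)
for t in range(len(periods)):
    print(t, ccdi_via_geks(periods, t), math.exp(levels[t] - levels[0]))
```

With missing products the bilateral Törnqvist indexes are computed on different matched sets, and the equivalence shown here no longer holds exactly.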

2.3.2.4. Geary Khamis index

Moving away from variations of GEKS indices towards matched-model multilateral index calculation methods, several options are available. One of these, used for international comparison of real consumer prices, is known as the Geary-Khamis (GK) method (Geary (1958), Khamis (1972)). Chessa (2016) adapted the formula for use in a time series context, where it foresees two stages: 1) clustering of similar items and calculating the reference weights; 2) compilation of GK indices.

The GK index can also be seen as a normalised Paasche index with imputed period 0 prices based on the reference prices $\hat{p}_i$:

$$P_{GK}^{0:t} = \frac{\sum_{i \in S^t} p_i^t q_i^t \,/\, \sum_{i \in S^t} \hat{p}_i q_i^t}{\sum_{i \in S^0} p_i^0 q_i^0 \,/\, \sum_{i \in S^0} \hat{p}_i q_i^0} \equiv \frac{\left[ \sum_{i \in S^t} s_i^t \left( \frac{p_i^t}{\hat{p}_i} \right)^{-1} \right]^{-1}}{\left[ \sum_{i \in S^0} s_i^0 \left( \frac{p_i^0}{\hat{p}_i} \right)^{-1} \right]^{-1}} \tag{20}$$

The numerator is viewed as the measure of turnover, whereas the denominator is a weighted quantity index. Reference prices are given as the weighted arithmetic average of deflated observed prices, with each period’s share in the total number of sales of the item across the sample serving as weights:

$$\hat{p}_i = \sum_{\tau \in S_i} \frac{q_i^\tau}{\sum_{r \in S_i} q_i^r} \cdot \frac{p_i^\tau}{P_{GK}^{0:\tau}} \tag{21}$$

Here $S_i$ is the set of time periods in which item $i$ is actually sold and its prices are available. Since $P_{GK}$ acts as the deflator, a system of equations must be established and solved simultaneously. According to Diewert (2022), this can be solved iteratively and is relatively simple to implement. He also indicates that the Geary-Khamis method would be preferred if the risk of substitution bias is low and the products are highly substitutable with diverging price and quantity trends.

2.3.2.5. Time Product Dummy index

The last model to be considered in this research is the geometric counterpart of the Geary-Khamis method, called the Time Product Dummy (TPD) model (Haan & Krsinich (2017)). TPD is a regression-based model suited to the case where most of the products are not sold throughout the year (disappearing products). The TPD regression model for the pooled data⁴:

$$\ln p_i^t = \alpha + \sum_{t=1}^{T} \delta^t D_i^t + \sum_{i=1}^{N-1} \gamma_i D_i + \epsilon_i^t \tag{22}$$

Here the dummy variable $D_i$ identifies the product: it is zero if the observation does not relate to product $i$ and one otherwise. Similarly, $D_i^t$ is a dummy variable that takes the value one if the observation relates to period $t$ and zero otherwise.

The weighted least squares regression estimates the model with the items’ expenditure shares in each period serving as weights. Similar to the GK method, TPD can be written as a system of equations and solved iteratively. Exponentiating the estimated time dummy parameter $\hat{\delta}^t$ results in the TPD index between periods 0 and $t$, $I_{TPD}^{0:t} = \exp(\hat{\delta}^t)$; the system is therefore written as:

$$I_{TPD}^{0:t} = \frac{\prod_{i \in S^t} \left( \frac{p_i^t}{\exp(\hat{\gamma}_i)} \right)^{s_i^t}}{\prod_{i \in S^0} \left( \frac{p_i^0}{\exp(\hat{\gamma}_i)} \right)^{s_i^0}} \tag{23}$$

$$\exp(\hat{\gamma}_i) = \prod_{\tau \in S_i} \left( \frac{p_i^\tau}{I_{TPD}^{0:\tau}} \right)^{\frac{s_i^\tau}{\sum_{r \in S_i} s_i^r}} \tag{24}$$

Here, the reference prices ($\exp(\hat{\gamma}_i)$) are equal to the expenditure-share weighted geometric averages of deflated prices.

Chessa (2015) argues that the method needs to be revised concerning its use of turnover in weight construction: the index value cannot simplify to a unit value index when all products are homogeneous. Diewert (2022) also notes that the index computation might be problematic, as it relies on a predefined set of products. While solving the chain drift problem, he proves that if there are no missing products and the products are strong substitutes, the index tends to have a downward bias. He also notes that if missing products re-appear in the first period, the index might have an upward bias.

⁴ Dummies for item N and period 0 are excluded to identify the model.

2.4. KEY FINDINGS

From the investigation of different methods applied to consumer price index calculation, it is evident that both scanner and web-scraped data can be applied. After researching the practices applied by national statistics institutions around the globe and the recent findings of Haan & Krsinich (2017), Chessa, Verburg and Willenborg (2017), and Diewert (2022), several conclusions can be drawn.

First and most important, the traditional consumer price index will inevitably suffer from chain drift. One option to avoid it is to apply multilateral price index calculation methods. The most common practices rely on the GEKS, CCDI, GK, and TPD methods. These are known to produce broadly similar results, and the differences come from product basket selection, price imputation, and outlier removal techniques. In his paper about CPI calculation using scanner data, Diewert (2022) proves that the GEKS and CCDI methods yield similar results, which are closest to the ground truth of price changes. He rules out the GK and TPD methods due to the proven upward bias of these two indexes, also taking these models’ lack of interpretability and transparency into account.

Accordingly, the GEKS or CCDI indices are preferred for use with scanner data. The choice between the two is of minor importance, as the superlative Fisher index approximates the superlative Törnqvist index (Diewert (1976)). One thing to be noted is the sensitivity of these indices. ABS (2016) and Chessa, Verburg and Willenborg (2017) both emphasise that GEKS has to account for products that are on clearance sale, as they affect the short- and long-term drift. Products about to be cleared will not go back to regular price or consumption levels, causing downward drift in the index and making it biased towards the product basket. Findings of the Polish (Bialek, 2020) and Belgian (Loon (2018)) NSOs show that downward drift can, in fact, be avoided by excluding clearance sales from CPI calculation. A similar procedure is applied in the current methodology.

In addition to the findings about the use of micro-data, one particular case that is mentioned in CPI methodology (2004) must be discussed. The Jevons index is proposed by ILO (2004) for cases where full micro-data is unavailable, which makes it a candidate for use with web-scraped data.

Nevertheless, it is known that the Jevons index suffers from a downward drift problem due to difficulties in product matching (Haan & Grient (2011), Chessa (2015)). The Jevons method was re-evaluated by Kevin Van Loon (2018), who proves that downward drift is caused by differences between products that fall under the same category, which can be avoided by introducing extra information about the product. If products are representative of their category, the downward drift will be avoided, and the Jevons index can be used with web-scraped data. In addition to that, the Jevons index could be used as a base for the multilateral GEKS index method by incorporating the results of the national household survey.
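To illustrate how a Jevons index can serve as the bilateral building block of GEKS, the following is a minimal sketch rather than the methodology adopted in this report. It assumes an unweighted Jevons index over products matched in both periods (no household survey weights), and the function names, product codes, and prices are hypothetical.

```python
from math import exp, log


def jevons(prices_a, prices_b):
    """Bilateral Jevons index from period a to period b: the geometric
    mean of price relatives of products observed in both periods."""
    matched = prices_a.keys() & prices_b.keys()
    if not matched:
        raise ValueError("no matched products between the two periods")
    mean_log = sum(log(prices_b[i] / prices_a[i]) for i in matched) / len(matched)
    return exp(mean_log)


def geks_jevons(periods, base, current):
    """GEKS index between `base` and `current`: the geometric mean, over
    every link period l in the window, of jevons(base, l) * jevons(l, current)."""
    logs = [
        log(jevons(periods[base], periods[l]) * jevons(periods[l], periods[current]))
        for l in range(len(periods))
    ]
    return exp(sum(logs) / len(logs))


# Hypothetical prices per product code for a three-month window.
window = [
    {"A": 1.00, "B": 2.00},
    {"A": 1.10, "B": 2.00, "C": 5.00},
    {"A": 1.21, "C": 5.50},
]
index_0_2 = geks_jevons(window, base=0, current=2)
```

Because every bilateral comparison enters symmetrically, the resulting index is transitive: chaining month 0 to 1 and month 1 to 2 gives the same value as the direct comparison of months 0 and 2, which is the property that removes chain drift.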


3. METHODOLOGY

As bilateral methods were not ruled out and most statistical agencies can make some use of the Jevons index, it is evident that both scanner and web-scraped data can be used in the context of Consumer Price Index calculation. This chapter will cover the necessary procedures to fulfill the literature review objectives and compile the CPI index for elementary price aggregates.

In general, there are three steps and two approaches specifying the sample of products for which the price index value is calculated.

Figure 1 – Index computation process

Scanner data (microdata) is typically acquired directly from the supplier's database and comes in full detail. Micro-level information regarding product descriptions, internal categories, prices, and quantities is available here. Assuming that the data provided by the supermarket is correct, the classification of products would be the first step toward index calculation. Here, the products would have to be assigned an ECOICOP category, which refers to a national-level classification below the subclass level of COICOP.

After the classification process, there are two options for selecting the representative sample of products: the static and the dynamic approach.

The static approach resembles the traditional fixed item basket, where only specified products are selected. The sample is drawn from year t and used for 12 months following December of year t. The sample is kept constant for the year, and replacements are made if necessary. On the other hand, the dynamic approach includes the whole product population by drawing a two-month matched sample, including all products with a reference price in the previous period (Eurostat (2017)). The second approach also requires additional programs that would filter price and turnover outliers.

One detail to be noted beforehand is that the data acquired from web scraping does not hold the sales quantity information that is needed to filter the data using the dynamic approach. However, web-scraped data contains the rest of the necessary product information used by the static approach.

3.1. DATA CLASSIFICATION

This research is expected to improve the quality of the CPI calculation without an increase in manual labor. Still, a part of the process involves the manual classification of products, according to ECOICOP.

The sample data was acquired from three leading supermarket chains in the region. Sample data contains information about all products sold during 13 months, starting December 2018 and covering the full year of 2019. Classifying the item codes manually each time a new period of data arrives is not advisable, especially when a single chain can have over 80 000 products to classify monthly. Here, modern machine learning techniques are preferred.


In order to identify and classify the products, data must be cleaned and processed. Here, natural language processing (NLP) techniques are applied. By using regular expressions to search for patterns, it is possible to extract product names, brands, and packaging information straight from the product title. After retrieving the product list and internal categories from the supplier or webpage, a one-time manual classification must be done to create a training dataset that a machine learning algorithm would use to classify the products iteratively in the future.
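The regular-expression step can be sketched as follows. This is an illustration under assumptions, not the extraction rules used in the project: the pattern, the list of units, the brand list, and the example titles are all hypothetical, and real product titles would need more patterns than shown here.

```python
import re

# Hypothetical packaging pattern: an optional multipack count ("6x"),
# an amount ("330", "1,5") and a unit ("kg", "g", "ml", "l").
PACK_RE = re.compile(
    r"(?:(?P<count>\d+)\s*x\s*)?(?P<amount>\d+(?:[.,]\d+)?)\s*(?P<unit>kg|g|ml|l)\b",
    re.IGNORECASE,
)


def parse_title(title, known_brands):
    """Split a raw product title into name, brand and packaging fields."""
    packaging = None
    rest = title
    pack = PACK_RE.search(title)
    if pack:
        count = int(pack.group("count") or 1)
        amount = float(pack.group("amount").replace(",", "."))
        packaging = (count, amount, pack.group("unit").lower())
        # Remove the packaging text so it does not pollute the name.
        rest = title[:pack.start()] + title[pack.end():]
    brand = None
    for candidate in known_brands:
        brand_re = re.compile(rf"\b{re.escape(candidate)}\b", re.IGNORECASE)
        if brand_re.search(rest):
            brand = candidate
            rest = brand_re.sub("", rest)
            break
    name = re.sub(r"\s+", " ", rest).strip().lower()
    return {"name": name, "brand": brand, "packaging": packaging}


parsed = parse_title("DVARO pienas 1 l", ["Dvaro"])
```

The cleaned name, brand, and packaging fields can then be used as input features for the classifier instead of the raw title.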

After evaluating multiple options (Logistic Regression, Multi-Layer Perceptron, Random Forest, and others), a support vector machine was selected for its good efficiency and accuracy metrics. Initial tests were based on manually tagged corpora consisting of more than 80 000 products belonging to 320 categories. The classifier's performance is evaluated by cross-validated F1 micro and receiver operating characteristic (ROC) scores. It is better to monitor the automated systems separately from the production cycle, using a sample of newly mapped item codes, accept that some mistakes are likely, and use the results to improve the automated classification procedures in the future.
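As an illustration of the micro-averaged F1 score used for evaluation, the sketch below pools true positives, false positives, and false negatives over all categories before taking the ratios. It is a plain-Python illustration on one set of predictions, assuming each product receives exactly one category; it does not reproduce the cross-validation procedure, and the category codes are hypothetical examples.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for single-label predictions.

    Counts are pooled over all classes (one-vs-rest view) before the
    ratios are computed, which is what "micro" averaging means.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    if total_tp == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical category codes for four products.
truth = ["01.1.1", "01.1.1", "01.1.2", "01.2.1"]
predicted = ["01.1.1", "01.1.2", "01.1.2", "01.2.1"]
scores = micro_f1(truth, predicted)
```

With one label per product, every misclassification counts once as a false positive and once as a false negative, so micro-averaged precision, recall, and F1 coincide with the share of correctly classified products.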

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{25}$$

$$ROC = \text{Sensitivity (TPR)} - (1 - \text{Specificity}) \tag{26}$$

Here,

$$\text{Precision} = \frac{TP}{TP + FP} \tag{27}$$

$$\text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN} \tag{28}$$

$$\text{Specificity} = \frac{TN}{FP + TN} \tag{29}$$

In the given case, the cross-validated F1 score for the SVC algorithm reached 92%, which is sufficient to state that the process can be automated in the future. Moreover, the classifier scores better for the first two sections of ECOICOP (Food and Beverages), where the F1 score reaches 95%, as the manually classified product samples were balanced and big enough to iterate through the monthly data. According to Eurostat (2017), it is better to monitor the automated systems separately from the production cycle, accept that some mistakes are likely to occur, and use the results to improve the automated classification procedures in the future.

3.2. DATA SAMPLING

As mentioned in the introduction of this chapter, sampling can be done in two ways, static and dynamic. The first one represents the fixed basket of products for which the prices are collected, whereas the second option iterates through all the data available. These methods will be discussed in the following sections.

3.2.1. Static approach

This approach closely mimics traditional price collection for the fixed-base index, where product information is gathered for the selected products only, covering all of the ECOICOP categories. Considering this option, retrieving product information would be easy to develop and maintain. Nonetheless, it would require much manual labor to be implemented and constantly monitored for replacements if an item becomes unavailable.

Here, only the actual price collection is automated, and the program would then imitate the actions of a human price collector (go to a shop, find an item and register the price for the amount sold). As the product basket is compiled according to ECOICOP categories, where each category contains multiple products, manually collecting links for thousands of items would be highly time-consuming.

Such methodology's advantages are easy maintenance and lack of need for classification. Here, ECOICOP categories can be assigned directly to the product, as the products in the basket remain constant. However, as mentioned in the literature review, the main weakness of every fixed basket CPI calculation method is product replacement. Once a product disappears, the price of a replacement product is taken instead, which heavily influences the outcome of the elementary aggregate, as the turnover of the replacement may not be on the same level. It is up to the NSO to ensure that the fixed product basket is representative of the population. Therefore, sampling data using a static approach involves human labor and cannot be fully automated.

3.2.2. Dynamic approach

An alternative way of sampling, applied for multilateral index calculation, is the dynamic approach. As described in the Practical Guide for Processing Supermarket Scanner Data (2017), this approach automatically selects representative product samples for each consecutive month by selecting all matched items relevant to the elementary aggregate. However, the monthly production process of the dynamic approach requires an extensive set of filters to be implemented to construct the time series.

First, the program excludes all non-definitive products whose composition is uncertain or constantly changing (e.g., “lunch box”).

The second filter is applied to eliminate products with a sharp decrease in price and turnover. Products that are sharply discounted compared to the previous period are most likely to leave the market at an untypically low price, yielding a downward drift of the index.

Third, the outlier filter eliminates products acquired at untypically low prices. In some instances, the products are acquired using a loyalty bonus, making prices drop nearly to zero.
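The second and third filters can be sketched as simple checks on two adjacent months. The threshold values below (`price_drop`, `turnover_drop`, `min_ratio`) are hypothetical placeholders chosen for illustration only; the text does not specify them, and in practice they would have to be calibrated per product group.

```python
from dataclasses import dataclass


@dataclass
class Obs:
    price: float     # average price paid (unit value) in the month
    turnover: float  # total expenditure on the product in the month


def is_dumped(prev, cur, price_drop=0.3, turnover_drop=0.5):
    """Second filter: flag a product whose price AND turnover both fall
    sharply compared with the previous month (likely clearance)."""
    return (
        cur.price < (1 - price_drop) * prev.price
        and cur.turnover < (1 - turnover_drop) * prev.turnover
    )


def is_price_outlier(prev, cur, min_ratio=0.25):
    """Third filter: flag an untypically low price relative to the previous
    month, e.g. purchases paid almost entirely with a loyalty bonus."""
    return cur.price / prev.price < min_ratio


previous = Obs(price=2.00, turnover=1000.0)
clearance = Obs(price=1.00, turnover=300.0)
bonus_purchase = Obs(price=0.05, turnover=900.0)

flags = {
    "clearance": (is_dumped(previous, clearance), is_price_outlier(previous, clearance)),
    "bonus": (is_dumped(previous, bonus_purchase), is_price_outlier(previous, bonus_purchase)),
}
```

Keeping the two checks separate makes it possible to report which filter removed a product, which helps when the thresholds are reviewed.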

Finally, product turnover is evaluated using the low sales algorithm. If product turnover significantly decreases compared to related products in the category, the product should be excluded as it is not representative of the sample. The low sales algorithm evaluates the turnover of an individual product in two adjacent months and excludes the product if a threshold is breached (see Equation 30). The threshold is determined by the number of items in the product group (Eurostat (2017)).
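A minimal sketch of the low sales algorithm is given below. It rests on several assumptions that the text does not settle: Equation 30 is read as the condition for keeping a product (average market share over two adjacent months above 1/(nλ)), market shares are computed among the matched products of one product group, and λ takes the value 1.25 stated with Equation 30. The function name and the turnover figures are hypothetical.

```python
def low_sales_filter(turnover_prev, turnover_cur, lam=1.25):
    """Return the product codes that pass the low-sales rule.

    A matched product is kept if its average market share over the two
    adjacent months exceeds 1 / (n * lam), where n is the number of
    matched products in the group.
    """
    matched = sorted(turnover_prev.keys() & turnover_cur.keys())
    n = len(matched)
    if n == 0:
        return []
    total_prev = sum(turnover_prev[i] for i in matched)
    total_cur = sum(turnover_cur[i] for i in matched)
    threshold = 1.0 / (n * lam)
    kept = []
    for i in matched:
        share_prev = turnover_prev[i] / total_prev
        share_cur = turnover_cur[i] / total_cur
        if (share_prev + share_cur) / 2 > threshold:
            kept.append(i)
    return kept


# Hypothetical monthly turnover per product code within one product group.
month_prev = {"A": 500.0, "B": 300.0, "C": 150.0, "D": 50.0}
month_cur = {"A": 450.0, "B": 350.0, "C": 180.0, "D": 20.0, "E": 100.0}
sample = low_sales_filter(month_prev, month_cur)
```

In this example four products are matched, so the threshold is 1/(4 × 1.25) = 0.2; products A and B have average shares of 0.475 and 0.325 and are kept, while C (0.165) and D (0.035) are dropped. Product E is not matched in the previous month and is not considered.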

$$\frac{s_m + s_{m-1}}{2} > \frac{1}{n \lambda} \tag{30}$$

Here, $s_m$ represents the market share of each matching product in month $m$, $n$ is the number of products, and $\lambda$ is equal to 1.25, which is empirically determined to represent ~80% of turnover.

Here, in order to construct an unbiased sample, instead of focusing on a fixed product basket, the focus is on finding as many representative products as possible. Compared to the static approach, this technique is more challenging from the development side but is preferred when high volumes of data are available, as the process can be automated (Diewert (2022)).

3.3. CONSTRUCTING TIME SERIES

The last step of the process is to adjust the time series for seasonality and replacements. The replacement problem is particularly relevant to the static approach, where the index is sensitive to a specific product. It appears when a product is withdrawn from the trade and is replaced by a new product or relaunched in a new package, acquiring a new EAN/GTIN code. In case of a missing item, the price for the product has to be imputed. The dynamic method partially solves this issue by automatically matching the reoccurring items; however, if product churn is high, few or no items may carry over from the previous sample. Therefore, replacements have to be dealt with separately to ensure data quality.

The second mandatory adjustment is the accommodation of seasonal patterns. Although multilateral indices following the dynamic approach can adjust for temporary items, they are evaluated separately, and the procedure is essential in the static approach.

3.3.1. Replacement treatment

The phenomenon of disappearing items is known as attrition and is a result of high product churn. The rate of disappearing items can reach up to 60%, depending on seasonal patterns (ILO (2017)). The old product is replaced by a new version, called a “relaunch”, which can be of two types:

• Newly introduced products,

• Upgraded products.

First, newly introduced products are evaluated based on seasonal patterns and included in the calculation if the replacement is apparent. This requires evaluating consumption patterns to identify the replacements that would be included in the CPI calculation (e.g., warm slippers replace beach slippers in the winter and vice versa).

Second, upgraded products refer to the same product whose package has changed. These products are launched under a new EAN/GTIN code and, therefore, cannot be identified using the old code. However, as the name of the product remains constant, the process can be automated by employing natural language techniques and calculating the Levenshtein distance between two strings.⁵
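The matching of relaunched products by name can be sketched with a standard dynamic-programming implementation of the Levenshtein distance. The matching rule around it is an assumption made for illustration: the `best_match` helper, the normalisation by the longer name, and the `max_ratio` threshold are hypothetical and not taken from the methodology.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to transform string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_match(name, candidates, max_ratio=0.2):
    """Closest candidate name, or None when even the closest one differs by
    more than max_ratio of the longer name (hypothetical threshold)."""
    if not candidates:
        return None
    scored = [
        (levenshtein(name.lower(), c.lower()) / max(len(name), len(c), 1), c)
        for c in candidates
    ]
    ratio, match = min(scored)
    return match if ratio <= max_ratio else None


# Hypothetical product names: the old name and names found under new codes.
relaunch = best_match("Pienas Dvaro 1 l", ["Pienas Dvaro 1l", "Kefyras Dvaro 1 l"])
```

A candidate accepted this way would still need a check that the package size is comparable, since a relaunch in a smaller package at the same price is a price change.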

3.3.2. Seasonality and imputation

As data is collected continuously, it will inevitably be faulty or missing at a specific time step. Here, it should be identified whether products are missing due to seasonal patterns or are temporarily missing and should be imputed, as it is impossible to go back in time and recollect the products' prices. For this reason, continuous monitoring of data validity is a must (ILO (2004)).

In the consumer price index context, strongly seasonal products are identified annually following the methodology discussed in section (2.3.1.) and excluded from the CPI calculation using a blacklist. Of course, the list of products is not uniform across countries. However, considering the case of Lithuania, such products as locally grown strawberries or potatoes are identified as strongly seasonal and are therefore excluded.

Weakly seasonal products are identified as items that follow some cyclical pattern. Typically, sales appear throughout the year, but the products might not be available in certain months (e.g., “oranges”). For such products, either a direct substitute must be found or the prices can be imputed following the guidelines for scanner data processing published by Eurostat (2017). According to the manual, temporarily missing data can be handled in several ways.

Overall mean imputation (OMI): this method uses the price changes of other similar varieties as estimates. It is a geometrical approach to missing values. The price that is missing in the current period would be imputed by taking the mean of all price relatives of the matched variety groups of

⁵ Also known as the edit distance-based algorithm, as it computes the number of edits required to transform one string into another.