4.2 Data Exploration

4.2.3 Interactive Application for Exploration

This section introduces the application that was developed for exploring the sales data. It is an R application that uses the Shiny platform for web development. The application features a number of different plots and analyses. To begin with, the time series for any (store, product) combination can be selected. It is also possible to analyze cumulative sales over stores and/or products with the selection "all". The related menu selectors can be seen on the left side of figure 4.3.

FIGURE 4.3: Menu selectors that allow the user to specify the store and product, as well as the feature of analysis (left). Menu selectors that allow the user to specify time granularity, degree of differencing and log transformation (right).

Furthermore, the user may also select (a) the time granularity (day, week, month) of the time series, (b) the amount of differencing for the time series and (c) whether or not to apply a logarithmic transformation to the data. The related menu selectors can be seen on the right side of figure 4.3. Via the aforementioned menu selectors, the user can specify the time series that will be the subject of the following analyses and plot depictions.
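The three selector-driven transformations can be illustrated with a minimal sketch. The application itself is written in R; the Python below is only an illustration, and the function names (`aggregate`, `log_transform`, `difference`, `prepare_series`), the use of log(1 + x) to tolerate zero sales, and the order in which the steps are applied are all assumptions rather than details taken from the application.

```python
import math
from datetime import date, timedelta


def aggregate(dates, values, granularity="day"):
    """Sum daily values per day, ISO week or calendar month (dates assumed sorted)."""
    totals = {}
    for d, v in zip(dates, values):
        if granularity == "day":
            key = d
        elif granularity == "week":
            iso_year, iso_week, _ = d.isocalendar()
            key = (iso_year, iso_week)
        elif granularity == "month":
            key = (d.year, d.month)
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        totals[key] = totals.get(key, 0.0) + v
    # dicts keep insertion order, so the result is still chronological
    return list(totals.values())


def log_transform(values):
    """Assumed variant: log(1 + x), so that days with zero sales stay defined."""
    return [math.log1p(v) for v in values]


def difference(values, order=1):
    """Apply first differencing `order` times."""
    result = list(values)
    for _ in range(order):
        result = [b - a for a, b in zip(result, result[1:])]
    return result


def prepare_series(dates, values, granularity="day", order=0, use_log=False):
    """Assumed order of operations: aggregate, then log, then difference."""
    series = aggregate(dates, values, granularity)
    if use_log:
        series = log_transform(series)
    return difference(series, order)


# Two weeks of made-up daily sales starting on a Monday.
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(14)]
sales = [float(i + 1) for i in range(14)]
print(prepare_series(dates, sales, granularity="week", order=1))  # [49.0]
```

Differencing after the log transformation means the differenced values approximate relative rather than absolute changes, which is one common reason to combine the two options.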

FIGURE 4.4: A plot of the user-selected time series. The menu selectors are set to Stores = All, Product = All and Time Serie = sumQty.

The application renders a plot for the selected time series, as in figure 4.4.

Furthermore, the lag correlograms (mentioned in section 2.4.5) are given for both the ACF and the PACF, as in figures 4.5 and 4.6 respectively. As mentioned previously, the ACF and PACF plots refer to the currently selected time series. In these illustrations, lag 7 can be seen as highly significant. This is statistical evidence for the importance of the day of the week in the RDF problem.

FIGURE 4.5: The importance of lag = 7 can be seen in the above correlogram.

FIGURE 4.6: The PACF diagram shows a low significance for lags 14 and 21.

In the ACF plot, the lags 14 and 21 also appear significant, but largely because they are multiples of 7 (the number of days in a week): the correlation at lag 7 propagates to its multiples. In contrast, the PACF, which removes the effect of the shorter intervening lags, shows a decreased significance for lags 14 and 21, thus making the appropriate correction. The application also features the Ljung-Box portmanteau test that was mentioned in section 2.4.6.
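For readers who want to see how these quantities are obtained, the following is a minimal sketch of the sample ACF, the PACF (via the Durbin-Levinson recursion) and the Ljung-Box Q statistic. It is written in Python for illustration and is not the code of the R application; the function names and the synthetic weekly series are assumptions made for this example only.

```python
import math
import random


def acf(x, max_lag):
    """Sample autocorrelations rho_0..rho_max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def pacf(x, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    rho = acf(x, max_lag)
    result = [1.0]
    phi_prev = []  # coefficients phi_{k-1,1..k-1} of the previous order
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


def ljung_box_q(x, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} rho_k^2 / (n-k)."""
    n = len(x)
    rho = acf(x, h)
    return n * (n + 2) * sum(rho[k] ** 2 / (n - k) for k in range(1, h + 1))


# Synthetic daily series with a weekly pattern plus noise (made-up numbers).
rng = random.Random(0)
pattern = [10, 12, 11, 13, 15, 20, 3]
x = [pattern[t % 7] + rng.gauss(0, 1) for t in range(140)]

bound = 1.96 / math.sqrt(len(x))  # approximate 95% significance band
for lag in (7, 14, 21):
    print(lag, round(acf(x, 21)[lag], 3), round(pacf(x, 21)[lag], 3), round(bound, 3))
print("Ljung-Box Q(21):", round(ljung_box_q(x, 21), 1))
```

Under the null hypothesis of no autocorrelation, Q is compared against a chi-squared distribution with h degrees of freedom; the p-value is omitted here because the Python standard library has no chi-squared distribution function.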

FIGURE 4.7: The seasonplot depicts a weekly pattern.

Furthermore, the application renders a seasonplot diagram, on which it is possible to detect weekly patterns. For instance, the seasonplot depicted in figure 4.7 shows a definite pattern in the total sales (all stores, all products). This pattern can be explained as follows: (a) the lowest sales numbers appear on Sundays, since the majority of stores are closed, (b) the highest sales numbers appear on Saturdays and (c) the sales numbers on weekdays lie in the middle range. By changing the values of the menu selectors, different patterns will appear in the seasonplot.
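The weekly pattern that a seasonplot makes visible can also be summarised numerically as the average sales per day of the week. The following Python sketch illustrates the idea; it is not part of the R application, and the function name `weekday_profile` and the sample numbers are hypothetical.

```python
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_profile(dates, values):
    """Average value per day of the week for a daily series."""
    sums = [0.0] * 7
    counts = [0] * 7
    for d, v in zip(dates, values):
        i = d.weekday()  # Monday = 0 ... Sunday = 6
        sums[i] += v
        counts[i] += 1
    return {WEEKDAYS[i]: sums[i] / counts[i] for i in range(7) if counts[i]}


# Two made-up weeks starting on a Monday: mid-range weekdays,
# a Saturday peak and a Sunday low.
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(14)]
sales = [5, 5, 5, 5, 5, 9, 1] * 2
profile = weekday_profile(dates, sales)
print(max(profile, key=profile.get), min(profile, key=profile.get))  # Sat Sun
```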

FIGURE 4.8: Menu selectors for clustering.

The next feature in this interactive application is cluster analysis. Since the clustering regards multiple time series, there are specific menu selectors for this analysis, which can be seen in figure 4.8. The Dynamic Time Warping (DTW) algorithm is used to compute the distance matrix from the time series. Specifically, the user may select a DTW variation that only approximates the real distance matrix, in exchange for faster execution.

After a distance matrix has been computed, the application can run a standard clustering algorithm to group the time series. With the aforementioned selectors, the user can control:

• The DTW algorithm variation for computing the distance matrix: dtw, dtw-basic, dtw-lb, dtw2, lbk, lbi, sbd and gak

• The clustering algorithm (k-means, hierarchical)

• The number of clusters

• The sample size on which to run the cluster analysis
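To make the trade-off between exact and approximate distances concrete, the sketch below implements a basic DTW distance with an optional warping window, the LB_Keogh lower bound, and the resulting pairwise distance matrix. It is a Python illustration under stated assumptions, not the implementation behind the selector options listed above: the absolute difference as local cost, the Sakoe-Chiba style window and all function names are choices made for this example.

```python
def dtw_distance(a, b, window=None):
    """DTW with absolute-difference local cost and an optional warping window."""
    n, m = len(a), len(b)
    w = max(n, m) if window is None else max(window, abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def lb_keogh(a, b, window):
    """Cheap lower bound on the windowed DTW distance (equal-length series)."""
    total = 0.0
    for i, v in enumerate(a):
        segment = b[max(0, i - window): i + window + 1]
        upper, lower = max(segment), min(segment)
        if v > upper:
            total += v - upper
        elif v < lower:
            total += lower - v
    return total


def distance_matrix(series, window=None):
    """Symmetric matrix of pairwise DTW distances."""
    k = len(series)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j], window)
    return dist


a = [1, 2, 3, 4, 3, 2, 1]
b = [1, 1, 2, 3, 4, 3, 2]  # the same shape, shifted by one step
print(dtw_distance(a, b))            # 1.0: warping absorbs the shift
print(dtw_distance(a, b, window=0))  # 6.0: no warping, point-by-point cost
print(lb_keogh(a, b, window=1))      # never exceeds dtw_distance(a, b, window=1)
```

The full computation is quadratic in the series length for every pair of series, which is why bounds or constrained variants become attractive as the sample size grows.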

A graphical illustration of the clustering results on a small sample can be seen in figure 4.9. The member time series of each of the 10 clusters are overlaid on the cluster's subplot. For the evaluation of the clusters, the application outputs a set of different cluster evaluation metrics, which are listed in table 4.5. A cluster analysis run on monthly sales for all products and stores with DTW + k-means was evaluated as recorded in table 4.6.

TABLE 4.5: Clustering Evaluation Metrics

Sil Silhouette index (Arbelaitz et al. (2013); to be maximized).

D Dunn index (Arbelaitz et al. (2013); to be maximized).

COP COP index (Arbelaitz et al. (2013); to be minimized).

DB Davies-Bouldin index (Arbelaitz et al. (2013); to be minimized).

DBstar Modified Davies-Bouldin (Kim and Ramakrishna (2005); to be minimized).

CH Calinski-Harabasz index (Arbelaitz et al. (2013); to be maximized).

SF Score Function (Saitta et al. (2007); to be maximized).

TABLE 4.6: Clustering Evaluation Results Example

Sil    SF    CH      DB     DBstar   D    COP
0.28   0     34.21   1.51   16.64    0    0.02
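As an illustration of how one of these metrics is obtained, the following Python sketch computes the Silhouette index from a precomputed distance matrix and a cluster assignment. It follows the standard definition of the index and is not the code used by the application; the function name, the convention of scoring singleton clusters as 0 and the toy data are assumptions.

```python
def silhouette_index(dist, labels):
    """Mean silhouette width, given a distance matrix and one label per item."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: four points on a line, with distances as absolute differences.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
print(round(silhouette_index(dist, [0, 0, 1, 1]), 3))  # 0.9 (well separated)
print(silhouette_index(dist, [0, 1, 0, 1]) < 0)        # True (poor grouping)
```

Values close to 1 indicate compact, well-separated clusters, so the 0.28 reported in table 4.6 corresponds to a comparatively weak cluster structure.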

The last feature of this application is the production of shadowplots. Shadowplots show the average daily sales over a set of products, as in figure 4.10 (dark line). The standard deviation of sales can also be seen in the same figure (green area).
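The data behind such a plot can be sketched as a per-period mean across products with a band of one standard deviation on either side. The Python below is an illustration only, not the application's R code; the function name `shadow_band`, the use of the population standard deviation and the band width of one standard deviation are assumptions.

```python
from statistics import mean, pstdev


def shadow_band(series_by_product):
    """Return (centre, lower, upper) across products for each time step."""
    rows = list(series_by_product.values())
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("all products need series of the same, non-zero length")
    centre = [mean(col) for col in zip(*rows)]    # the dark line
    spread = [pstdev(col) for col in zip(*rows)]  # half-width of the shaded area
    lower = [c - s for c, s in zip(centre, spread)]
    upper = [c + s for c, s in zip(centre, spread)]
    return centre, lower, upper


# Made-up sales of two hypothetical products over three periods.
centre, lower, upper = shadow_band({"p1": [1, 2, 3], "p2": [3, 4, 5]})
print(centre, lower, upper)
```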

FIGURE 4.9: Sample of time series clustering with the DTW algorithm.

FIGURE 4.10: The shadowplot that shows (a) average weekly sales and (b) the variance of sales.

Chapter 5

Problem Solution and Forecasting Results

For Retail Demand Forecasting

5.1 Introduction

In this section we present (a) a system architecture for the solution of the RDF problem, (b) the technological approach that we followed for the construction of a software prototype and (c) an analysis of the different modelling decisions that had to be made. The software prototype was based on the aforementioned system architecture. Finally, the modelling decisions regard the different ways to decompose an RDF problem into sub-problems.
