Applying Machine Learning For Retail Demand Forecasting

Academic year: 2024


Chapter 2 covers Time Series Forecasting, one of the most popular mathematical frameworks for solving the RDF problem. The chapter then presents the technological tools used to build the software (R, RStudio Server, SQL Server, Git). A section is then dedicated to the important issue of modeling decisions and challenges, which lies at the heart of the RDF problem.

Introduction

For example, one can predict a city's electricity demand at time t+h based on the city's population, temperature, and day of the week at that time, so that E(t+h) = f(population, temperature, day of week) + error. Models that combine time series modeling with predictor variables are known as dynamic regression models. A central question is whether there are patterns in the data that will continue into the future.
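To make the idea concrete, here is a minimal dynamic regression sketch. The thesis's system was built in R; this example uses Python/NumPy purely for illustration, and all data and coefficients below are invented for demonstration. Demand is modeled as a linear function of the predictor variables plus a lagged demand term, which is what links the regression to the series' own past.

```python
import numpy as np

# Hypothetical daily data for a city: population, temperature, and a
# weekend indicator drive electricity demand, plus random noise.
rng = np.random.default_rng(0)
n = 200
population = 1.0e6 + 500.0 * np.arange(n)              # slowly growing city
temperature = 20 + 8 * np.sin(2 * np.pi * np.arange(n) / 365)
weekend = (np.arange(n) % 7 >= 5).astype(float)
demand = (0.002 * population + 30 * temperature
          - 150 * weekend + rng.normal(0, 10, n))

# Design matrix: intercept, predictors at time t, and demand at t-1.
X = np.column_stack([
    np.ones(n - 1),
    population[1:],
    temperature[1:],
    weekend[1:],
    demand[:-1],       # lagged term: the "time series" part of the model
])
y = demand[1:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead forecast, using today's observed demand as the lag.
x_next = np.array([1.0, population[-1], temperature[-1],
                   weekend[-1], demand[-1]])
forecast = x_next @ coef
```

The lagged-demand column is what distinguishes a dynamic regression from an ordinary regression on external predictors alone.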

Forecasting In Business

For a well-established company, future estimates are mainly based on past data accumulated over the years. In contrast, newer firms have not yet generated enough past data, so they have to base their forecasts on other factors (see predictor variables in the previous section), such as data about their future customers, the performance of similar firms, competition, and the state of the economy. As mentioned in the previous section, a longer horizon increases the difficulty of forecasting, so forecasts tend to deviate more from the actual values.

Retail Demand Forecasting

To begin with, the product and store sets can be dynamic, as (a) a new product may arrive in the market, (b) a product may be removed from the shelves, (c) the retail company may open a new store, and (d) it may decide to close a store. For example, one can forecast the total sales of a group of products instead of the sales of a single product. Some marketing campaigns are run internally by the retail company and include special offers, product magazines, bonus points, etc.

Time Series Analysis and Forecasting

  • Moving Average Smoothing
  • Time Series Decomposition To Trend-Seasonal-Remainder
  • Fitted Values and Residuals
  • Autocorrelation
  • Portmanteau tests
  • Exponential Smoothing
  • ARIMA
  • Evaluation of forecast models

In the context of forecasting, one of the most commonly used mathematical forms of decomposition is the trend-seasonal-residual decomposition (TSRD). As mentioned in the previous subsection, it is desirable for the fitted residuals to be uncorrelated. According to the MA(q) model, the current value of a time series is a linear combination of the previous q forecast errors.
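The two forms mentioned above can be written explicitly. In standard notation (with T, S, R the decomposition components, θ_i the MA coefficients, and ε_t the forecast errors):

```latex
% Additive trend-seasonal-remainder decomposition of a series y_t:
y_t = T_t + S_t + R_t

% MA(q): the current value as a linear combination of the q previous
% forecast errors:
y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}
```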

Supervised Learning Methods

  • Linear Regression
  • Logistic Regression
  • k Nearest Neighbors
  • Support Vector Machines
  • Decision Tree Learning
  • Random Forest and Ensemble Learning
  • Perceptron
  • Multilayer Perceptron Network

In the traditional practice of computer programming, one has to specify the exact rules that govern the behavior of the program under development. The linear regression model assumes that the output is a linear function of the input vector. The purpose of the training procedure on a given training set is to adjust the parameters in the weight vector w so as to minimize the cost.

The cost function is optimized by applying Gradient Descent to each example and then taking the average. As can be seen from the above, the construction of the SVM model (i.e., the estimation of w and b), as well as the decision rule applied afterwards, depends only on dot products. The instance is placed at the root of the tree and then follows a path to a leaf.
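The "average over per-example gradients" step can be sketched briefly. This is a minimal illustration in Python/NumPy (not the thesis's code), assuming the usual quadratic cost for linear regression:

```python
import numpy as np

# One step of batch gradient descent for linear regression: compute the
# gradient for each example and average, as described in the text.
def gradient_descent_step(X, y, w, lr=0.1):
    errors = X @ w - y
    grad = X.T @ errors / len(y)   # average gradient over all examples
    return w - lr * grad

# Noise-free synthetic data with known weights [2, -3].
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = np.zeros(2)
for _ in range(500):
    w = gradient_descent_step(X, y, w)
# w now closely approximates true_w
```

Because the quadratic cost is convex, repeated averaged-gradient steps with a small enough learning rate converge to the least-squares solution.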

The prediction depends on the portion of the training data belonging to this leaf. Therefore, a first adjustable parameter of the random forest model is simply the number of trees. All the weights in the network are then updated according to the activations and the δ values. To complete the definition of the above scheme, the details of the δ calculation and the weight update are given below.

For the neurons belonging to the output layer, the delta is calculated as δ_j^(L) = f′(u_j^(L)) · (y_j − a_j^(L)), where f′ is the derivative of the activation function.

Unsupervised Learning

K-Means Clustering

Repeat steps 3 and 4 until the clusters converge, meaning they have not changed since the previous iteration. Because the result depends on the initial centers, the k-Means algorithm is usually executed several times to find an acceptable center initialization. To choose the right number of clusters for a given problem, the experimenter runs the k-Means algorithm over a range of different parameter values.

There are some problematic cases where the algorithm may enter an infinite loop while repeating steps 3 and 4. Moreover, the presence of outliers in the dataset can drastically reduce the quality of the clusters. A variation of the algorithm that is more robust to outliers is known as the k-Median algorithm.
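The assign/recompute loop and the convergence test above can be sketched compactly. This is an illustrative Python/NumPy version (not the thesis's R implementation); following the text's advice, it is run with several random initializations and the lowest-inertia result is kept:

```python
import numpy as np

# Minimal k-Means: assign each point to its nearest center, recompute
# the centers, repeat until the assignments stop changing.
def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: clusters unchanged since previous iteration
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    inertia = ((X - centers[labels]) ** 2).sum()
    return labels, centers, inertia

# Two well-separated blobs; the best of five restarts recovers them.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2))])
labels, centers, _ = min((k_means(X, 2, seed=s) for s in range(5)),
                         key=lambda r: r[2])
```

Keeping the lowest-inertia run out of several restarts is exactly the "executed several times" remedy for bad center initializations mentioned above.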

Hierarchical Clustering

The clusterDist function is also a distance function: it uses the dist function as a subroutine to measure the distance between clusters. The minimum intercluster distance is the minimum distance over all pairs formed by taking one sample from CA and one from CB; the maximum intercluster distance is the maximum distance over all such pairs.

The first clustering result contains m clusters, the second m−1 clusters, and so on, until a clustering with only one cluster is reached. There are empirical methods for choosing the cut point, so hierarchical clustering is considered better than k-Means for choosing the number of groups. In this case, a common approach is to create an initial set of k-Means clusters and then apply hierarchical clustering to each cluster.
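The agglomerative procedure can be sketched directly. This is a minimal Python/NumPy illustration (not the thesis's code) of the "maximum intercluster distance" (complete-linkage) variant defined above: start with every point as its own cluster and repeatedly merge the closest pair until the requested number of clusters remains.

```python
import numpy as np

def complete_linkage(X, n_clusters):
    clusters = [[i] for i in range(len(X))]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: max distance over all cross-cluster pairs
                dist = d[np.ix_(clusters[a], clusters[b])].max()
                if dist < best[2]:
                    best = (a, b, dist)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for ci, members in enumerate(clusters):
        labels[members] = ci
    return labels

# Two well-separated blobs; cutting at 2 clusters recovers them.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (15, 2)),
               rng.normal(4, 0.3, (15, 2))])
labels = complete_linkage(X, 2)
```

Swapping `.max()` for `.min()` in the inner loop would give the minimum-intercluster-distance (single-linkage) variant instead.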


Introduction

Problem Setting

System Requirements

The system is scalable, so that it continues to work according to its specifications after the introduction of new data.

Data Exploration

  • Data Features
  • Descriptive Statistics
  • Interactive Application for Exploration
  • System Architecture
  • Technological Approach
  • Modelling Decisions

The functions described in this section, as well as in the previous one, depend only on the SaleDate part of the key. Specifically, if a promotional activity occurred for the record's (store, date, product) combination, the related boolean function is set to TRUE; otherwise it is set to FALSE. While GroupDescription depends only on the ItemCode part of the key, GroupShare depends on the whole key (StoreID, ItemCode, SaleDate).

A few facts that can be gleaned from the descriptive statistics are (a) the time period spanned by the sample and (b) the 24% share of the sales records. This is statistical evidence of the importance of the day of the week in the RDF problem. By changing the values of the menu selectors, different patterns appear in the season plot.

The Dynamic Time Warping (DTW) algorithm is used to compute the distance matrix for the time series. The member time series of each of the 10 clusters overlap within subclusters. For cluster evaluation, the application outputs a set of different cluster evaluation metrics.
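For reference, the DTW distance itself is a short dynamic program. The sketch below is an illustrative Python/NumPy version (the thesis computed its distance matrix with R tooling): it finds the cheapest alignment between two series, allowing stretches along the time axis, which is why it tolerates the kind of time shifts common in weekly sales patterns.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW recurrence: D[i,j] = local cost + min of the three
    predecessor cells (insertion, deletion, match)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A sine wave and a time-shifted copy: DTW can "warp" one onto the
# other, so the distance is smaller than the pointwise mismatch.
t = np.linspace(0, 2 * np.pi, 50)
s1 = np.sin(t)
s2 = np.sin(t + 0.5)
d = dtw_distance(s1, s2)
```

Running this pairwise over all series yields exactly the distance matrix that the clustering step consumes.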

A produced model is stored in the appropriate location of the model repository. In this scenario, the module needs to extract the StoreID part of the key in order to select the correct model. This subsection elaborates on the technologies and tools adopted to implement the system.

Results

Feature Ranking

To measure the importance of each feature, the pIncMSE importance metric (Louppe et al., 2013) was used. For the evaluation procedure, a variation of cross-validation known as time-partition cross-validation was used. This variation ensures that the records of the training set chronologically precede the records of the test set.
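The chronological constraint can be sketched in a few lines. This is an illustrative Python helper (the split sizes and names are invented for demonstration): every training window ends strictly before the test window begins, so the model never sees the "future" of the records it is evaluated on.

```python
import numpy as np

def time_partition_splits(n_samples, n_splits):
    """Yield (train, test) index arrays where the training window
    always precedes the test window chronologically."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = np.arange(0, k * fold)
        test_idx = np.arange(k * fold, (k + 1) * fold)
        yield train_idx, test_idx

# Every fold trains only on records that precede its test records.
for train_idx, test_idx in time_partition_splits(100, 4):
    assert train_idx.max() < test_idx.min()
```

Unlike ordinary shuffled cross-validation, each successive fold simply extends the training window forward in time, mimicking how the model would be retrained in production.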

These were calculated on a data sample containing the 10 largest product groups for three stores. For each feature, the 10 plotted points correspond to the pIncMSE obtained for each product group. The feature Count_ItemBaskets is considered the best, with a pIncMSE higher than 5% for all groups.

It also showed the highest overall pIncMSE, exceeding 40% for one cluster (meaning that adding this feature reduced the MSE by 40% for that cluster). The variance of the metric is relatively small for these features, indicating that it will be positive for most product groups. Therefore, we can assume that, with high probability, these features will be important for most groups.

They can be used for a specific subset of the product groups where the importance is positive.

Model Evaluation

However, the MAE values are summarized only for products of the same group, which always have similar scales. Specifically, these features explain some of the variance in sales (for example, variance due to the characteristics of marketing campaigns) that was treated as noise by the previous methods. Using the average statistic, the methods are ranked from best to worst as 1) XGBoost, 2) Random Forest, 3) ARIMA, 4) the naive method, 5) the drift method, and 6) the average method.

Solving the real RDF (Retail Demand Forecasting) problem requires an understanding of the general forecasting problem as well as of the challenges that arise in practice. The last part of the thesis dealt with the evaluation of (a) the engineered data features and (b) a set of forecasting methods. For the purpose of this evaluation, sales records for ten major product groups were sampled from the RI database.

TSM1: SVD + stlf/ets – this model applied SVD to the training data as a preprocessing step and then forecasted each series with stlf(), using an exponential smoothing model (ets) for the non-seasonal component. TSM2: SVD + stlf/arima – the same, but with an ARIMA model for the non-seasonal component.

Then predictions were made and several of the closely correlated series were averaged together, before the original scale was restored.

Walmart Recruiting II - Sales in Stormy Weather

  • Example of forecasting on artificial data
  • Example of forecasting on financial data
  • The application of MAS on noisy stock market data
  • Decomposition of sales data into trend-seasonal-remainder
  • Artificial time series
  • The ACF plot for the artificial time series
  • Linear regression in two dimensions
  • The logistic function
  • Decision boundary: For the points on the line, the logistic re-
  • Decision boundary for 1-Nearest-Neighbors
  • Solution with Logistic Regression (black line) and SVM (green
  • Decision tree learning in two dimensions
  • k-Means with k=5 on two-dimensional data
  • Cluster Dendrogram generated on the Iris Dataset
  • Trigonometric representation of the day in week
  • Descriptive statistics for a subset of the numeric features
  • Menu selectors that allow the user to specify the store and
  • A plot of the user-selected time series. The menu selectors are
  • The importance of lag=7 can be seen in the above correlogram.
  • Menu selectors for clustering
  • Sample of time series clustering with the DTW algorithm
  • The shadowplot that shows (a) average weekly sales (b) the
  • System Architecture Design
  • Feature importance pIncMSE for 10 groups of products
  • Acquisition and Preprocessing of Data
  • Training Machine Learning Models
  • User Query Processing and Response
  • Evaluation of Results and Data
  • Clustering Evaluation Metrics
  • Clustering Evaluation Results Example

TSM5: non-seasonal ARIMA with Fourier series terms as regressors – this also used auto.arima(), but as a non-seasonal ARIMA model, with the seasonality captured by the regressors.
