Retail Demand Forecasting with Artificial Neural Networks

Some experiments on feature representation are presented first, followed by the full set of results. The approaches include (a) training only on previous sales, (b) training with multiple features, (c) training by store and group of products, (d) training by store and on a subset of a group of products, (e) training on the total sales of a group of products, and (f) training on the total sales (across all stores) of a single product.

Introduction

In the example of a new product such as a new brand of milk, we would use the following predictor variables. Some models combine time series modeling with predictor variables and are known in the literature as dynamic regression models.

Business and Forecasting

We now have historical data on a similar product and can give a well-founded assessment for a new brand of milk. In most cases, an expert supported by an algorithm could make better and more accurate decisions.

Retail Demand Forecasting

It is very important to have some knowledge about the reputation of the distributor and the demand for their products. The need to define the quantity arises from the limited storage capacity of the retail sector.

Time Series Analysis and Forecasting

Moving Average Smoothing
Time Series Decomposition
Baseline Methods of Forecasting
Fitted Values and Residuals
Autocorrelation
Statistical Significance
Exponential Smoothing
ARIMA
Evaluation of Forecast Models

By stationary we mean that the statistical properties of the time series do not depend on time. In such cases, it seems that no sufficiently good algorithm has been found so far. The main reason is that the behaviour of the series is so hard to describe.

Supervised Learning Methods

Linear Regression
Logistic Regression
Decision Tree
K-Nearest Neighbors
Support Vector Machines
Artificial Neural Networks

Decision tree learning is a method for solving classification or regression problems using tree structures. The tree is built from the features of the dataset. A tree has a root (which holds all the data) and, through its branches (which are created from the features), ends at the leaves (each of which has a label). To predict the output for a new instance, we simply follow the branches to a leaf. The output for this instance is then calculated from that leaf and its data, as we will see later.

With two separable classes, more than one hyperplane can separate them, as shown in Figure 3.5. The hyperplane can be written as $w^T x + b = 0$. The margin is the distance from the hyperplane to the nearest training data, and according to SVM the best hyperplane is the one with the largest margin. The training data closest to the hyperplane are called support vectors. So we need to maximize the margin under the above constraint. The margin is equal to $\frac{2}{\|w\|}$. For mathematical convenience, instead of maximizing $\frac{2}{\|w\|}$ we minimize $\frac{1}{2}\|w\|^2$.
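For reference, the margin maximization described here is usually written as the following optimization problem. This is a sketch of the standard hard-margin formulation, assuming labels $y_i \in \{-1, +1\}$ and the usual scaling in which the support vectors satisfy $|w^T x_i + b| = 1$; the constraint actually used in the text is not reproduced here, so it may differ in notation.

```latex
\min_{w,\,b} \ \frac{1}{2}\|w\|^2
\quad \text{subject to} \quad
y_i \left( w^T x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Under this scaling the two classes are separated by a band of width $\frac{2}{\|w\|}$, so minimizing $\frac{1}{2}\|w\|^2$ is equivalent to maximizing the margin.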

All the above calculations assume that the data are linearly separable. In reality, such a data set is very rare. So, two different algorithms have been proposed for SVM. The first is Soft-SVM.

Unsupervised Learning Methods

K-Means Clustering

It has been proven that weights close to zero or extremely large will lead to slow and even undesirable results. Simply repeat steps 2 and 3 until the maximum number of iterations is reached or the algorithm has converged. The algorithm is considered to have converged when no data sample changes its cluster.

On the other hand, the number of clusters is subjective, and several different values of k may be equally valid.
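The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming that steps 2 and 3 are the usual assignment and centroid-update steps and that the initial centroids are supplied by the caller; the function name `kmeans` and its parameters are hypothetical and not taken from the thesis.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Minimal k-means sketch on tuples of floats (assumed Euclidean distance)."""
    centroids = list(initial_centroids)  # copy so the caller's list is not modified
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Convergence test: stop when no sample changed its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

Calling `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` groups the first two and the last two points. Since the result depends on the chosen k and on the initial centroids, it is common to run the procedure several times.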

Hierarchical Clustering

So one would have to tune these two parameters to obtain a better clustering. The great advantage of this algorithm is that it computes the clustering for every possible number of clusters.

Principal Component Analysis

Maximizing Var(z) requires computing the eigenvectors of the matrix S. The first component will be the eigenvector with the largest eigenvalue, the second component will be the eigenvector with the second largest eigenvalue, and so on.
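As a rough illustration of this step, the sketch below estimates only the first component, assuming that S is the sample covariance matrix of the data. Instead of a full eigendecomposition it uses power iteration, a simpler substitute that converges to the eigenvector with the largest eigenvalue when that eigenvalue is strictly dominant and the start vector is not orthogonal to it. The function names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix S of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(S, iterations=200):
    """Approximate the leading eigenvector of S and its eigenvalue by power iteration."""
    d = len(S)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary start vector (an assumption)
    for _ in range(iterations):
        w = [sum(S[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # S maps v to zero; no direction can be recovered
        v = [x / norm for x in w]
    # Rayleigh quotient v^T S v gives the eigenvalue for a unit-length v.
    eigenvalue = sum(v[i] * sum(S[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue
```

For points lying on the line y = x, the returned direction has equal coordinates. Further components would require deflation or a library eigensolver.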

Reinforcement Learning Methods

Q-learning

The next action is chosen to maximize the Q value of the future state instead of following the current policy, and this is the reason why Q-learning is an off-policy algorithm.
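The update behind this statement can be sketched as follows. This is a minimal tabular version, assuming a dictionary-based Q table with unseen entries treated as zero; the learning rate `alpha`, the discount factor `gamma` and the function name are illustrative choices, not values from the thesis.

```python
def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; Q maps (state, action) pairs to values."""
    # Off-policy target: use the best action in the next state,
    # regardless of which action the behaviour policy will actually take.
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old_value = Q.get((state, action), 0.0)
    Q[(state, action)] = old_value + alpha * (reward + gamma * best_next - old_value)
    return Q[(state, action)]
```

SARSA differs only in the target: it would use the Q value of the action actually selected in `next_state` rather than the maximum.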

SARSA

In this chapter, we will examine the data set of a particular retail demand forecasting problem. The features will be explained in depth, and some specific categorizations of the data will be used to examine different examples.

Data Exploration

Feature Analysis

CosSaleMonth: The same technique can be used to represent the month of the data. GroupDesc4: This is a string that reveals which of the three groups the product belongs to. Quantity: It is the most important feature of the data set as it reveals the quantity of product that was sold by a store on the given date.

LAG7Qty: This is another useful feature as it provides the quantity of the same product that was sold a week earlier in the same store.
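The two derived features mentioned above could be computed roughly as follows. This is only a sketch: the exact formula behind CosSaleMonth is not given here, so a standard cyclical cosine encoding over 12 months is assumed, and a missing record seven days earlier is assumed to mean zero sales. The function names and the `sales_by_date` mapping are hypothetical.

```python
import math
from datetime import date, timedelta


def cos_sale_month(day):
    """Cyclical encoding of the month (assumed form: cosine over a 12-month cycle)."""
    return math.cos(2 * math.pi * (day.month - 1) / 12)


def lag7_quantity(sales_by_date, day):
    """Quantity sold 7 days before `day` for one store and one product.

    `sales_by_date` maps a date to the quantity sold on that date;
    dates without a record are treated as zero sales (an assumption).
    """
    return sales_by_date.get(day - timedelta(days=7), 0)
```

A cosine alone cannot distinguish months that are symmetric around the cycle, which is why such encodings are normally paired with the matching sine feature.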

Statistical Analysis

FIGURE 4.3: Time series breakdown of sales of certain products in one store over a period of 2 years. FIGURE 4.4: Time series correlogram and partial correlogram of sales of certain products, in one store over a period of 2 years. FIGURE 4.5: Time series correlogram and partial correlogram of sales of certain products, in one store over a period of 2 years.

Another important property of a time series is stationarity, as it reveals how the behaviour of the series changes over time.

Problem Description

In this approach, since the prediction is for all the stores, it should be adjusted for each store. Create one model for each store and all products: This approach should be the most efficient since each store has roughly stable sales per product. Create one model for each store and each product: This is similar to the approach above, the only difference being that no information about the other products is available.

Create a model for each store and each product group: This approach is designed to retain some useful information and also to solve the problem of sparse data.

Data Management

As shown in 4.9, this time the sparsity problem in the data is mostly solved. In this chapter we will present the prediction results of the implemented algorithms. The reason is that there are not many sudden changes in the target value in most regression problems. ARIMA is one of the most basic algorithms for time series and is discussed in 2.4.8.

As mentioned before, there are different approaches to the main problem, so each section begins with a short introduction.

Experiments per Single Store and Itemcode

Training with only one feature

FIGURE 5.5: ARIMA forecasts trained with no missing days and no scaling. FIGURE 5.6: SARIMA predictions trained with no missing days and no scaling. FIGURE 5.8: LSTM predictions trained with missing days and no scaling.

FIGURE 5.10: The predictions of SARIMA trained with missing days and without scaling.

Training with multiple features

As before, scaling the target value is not good for LSTM. FIGURE 5.12: LSTM predictions trained with missing days and no scaling. FIGURE 5.13: Predictions of an LSTM trained with missing days and with feature-only scaling.

FIGURE 5.14: Predictions of an LSTM trained with missing days and with feature and target scaling.

Main problem and solution

Instead of trying to predict the next day's sales, which will mostly be zero, we can predict the total sales for the next fortnight. This way, the target value will mostly be different from zero and the model can more easily learn to predict something useful. The naive method seems to perform better with this approach than with the previous one.

Overall, we cannot say that this approach is the worst, given the extreme sparsity of the dataset (only 173 of the 614 historical records we should have).
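The fortnight target described in this section can be built from a daily series as sketched below. This assumes the series has one entry per calendar day, with missing days already filled with zeros, and that the target for day t covers days t+1 to t+14; the function name and the `horizon` parameter are hypothetical.

```python
def fortnight_targets(daily_sales, horizon=14):
    """For each day t, total sales over the following `horizon` days.

    Only days with a complete future window receive a target, so the
    result is `horizon` entries shorter than the input.
    """
    targets = []
    for t in range(len(daily_sales) - horizon):
        targets.append(sum(daily_sales[t + 1 : t + 1 + horizon]))
    return targets
```

Even for a series that is zero on most days, the summed target is usually non-zero, which is the motivation given above.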

Experiments per Single Store and Group

Training with Subsets of Group

FIGURE 5.22: MSE of training and test data across training periods with all item codes. This can be easily explained by the fact that most item codes have sparse data. In the second experiment, we kept a subset of the group's item codes and added some time-based features, which, as Table 5.14 shows, reduced the model error even further.

FIGURE 5.26: Predictions of LSTM trained on a useful subset dataset with additional features.

Forecasting Total Sales of Groups

FIGURE 5.29: MSE of training and testing data across training periods for total sales forecasts. FIGURE 5.30: LSTM forecasts trained with missing days predicting total group sales.

Experiments for Total Stores’ Sales per Itemcode

FIGURE 5.32: The predictions of the SARIMA trained with missing days and total sales of an item code. FIGURE 5.34: The predictions of the LSTM trained with missing days and predicting the total sales of an item code. Stacking is in fact a simple model that takes the predictions of the other models as input.
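To make the idea of stacking concrete, the sketch below fits a very small meta-model on the predictions of two base models. The form of the stacking model is not specified here, so a linear combination without intercept, fitted by least squares, is assumed purely for illustration; the function names and the two-model restriction are hypothetical.

```python
def fit_stacker(pred_a, pred_b, actual):
    """Least-squares weights (w_a, w_b) such that actual ~ w_a * pred_a + w_b * pred_b."""
    # Entries of the 2x2 normal equations.
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    say = sum(a * y for a, y in zip(pred_a, actual))
    sby = sum(b * y for b, y in zip(pred_b, actual))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("base predictions are collinear; weights are not unique")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def stacked_prediction(weights, a, b):
    """Combine one prediction from each base model with the fitted weights."""
    return weights[0] * a + weights[1] * b
```

In practice the weights should be fitted on predictions for data the base models were not trained on, otherwise the meta-model tends to favour whichever base model overfits most.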

The particular problem studied in this section is the prediction of the total sales of a given product across all stores.

Actor-Critic

First Approach

From 6.1 and 6.2, we can see that by the 20th epoch the first LSTM has essentially learned all the useful information that the dataset provides. Another useful observation is that the second model only decreases its error after the first one does. On the other hand, the second model, which uses the predictions of the first combined with useful features, has achieved the best results, and this is evident even from 6.4.

FIGURE 6.3: The predictions of the first LSTM trained with only the past sales.

Second Approach

FIGURE 6.7: Predictions of the first LSTM trained only on past sales.

Stacking

Most of the features were analyzed and some statistical features of the dataset were presented. Several possible approaches to the original problem have been mentioned, combined with their corresponding solutions. It is a more useful forecast for store management and also easier to produce.

In Chapter 6, some more sophisticated machine learning models were used for predicting the total sales of a product in all the stores.