Time Series Analysis Project on Superstore Sales Data¶
Time Series Analysis (TSA) is an important part in the field of data science. TSA uses methods for analyzing time series data in order to identify useful patterns and extract meaningful statistics of the data. There are two major goals of TSA: 1) identifing patterns or features represented by the data; and 2) forecasting (using a model to predict future values based on previous data).
TSA is widely used for analysing non-stationary data, like weather data, stock price prediction, economic trend prediction and store sales prediction in this post. Different techniques in TSA will be demonstrated in this post. Let's get into it.
The data we will use is the superstore sales data, and it can be download here.
A Brief Exploratory Data Analysis (EDA)¶
# Import libraries
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
import warnings
warnings.filterwarnings('ignore')
Read in the data. We do a brief EDA to see what's in the data.
# Import data
df = pd.read_excel('data\Superstore.xls')
df.head()
df.info()
We have 21 columns in the dataset. Let's take a look what are they.
df.columns.values
Check if there is any missing values in the data. It turns out there's no missing values in the data.
df.isnull().any()
Let's take a look about the statistics of numeric values in the data.
df.describe()
We found that there is an interesting column Category which can divided the data into several subsets according to the product categories. Let's see how many categories of products we have in the data.
df.Category.unique()
df.Category.value_counts()
Now we can analyse different categories of products accordingly. In this post, let's take the Office Supplies data as an example. We will start our time series analysis on this type of products. You can also explore other categories in a similar way.
# Extract the data of Office Supplies
office_supplies = df.loc[df['Category'] == 'Office Supplies']
We have a good 4-year office supplies sales data. We will try to find some meaningful patterns in this data.
office_supplies['Order Date'].min(), office_supplies['Order Date'].max()
Data Processing¶
In this post, we only care about the sales data of office supplies. We'll start by droping other columns. First, let us set the index of our data using the 'Order Data' column.
Note that, the sales data on the same date should be integrated together.
office_supplies = office_supplies[['Order Date','Sales']]
office_supplies.head()
office_supplies = office_supplies.sort_values('Order Date')
office_supplies = office_supplies.groupby('Order Date')['Sales'].sum().reset_index()
office_supplies.head()
Indexing with the time series data¶
office_supplies = office_supplies.set_index('Order Date')
office_supplies.head()
office_supplies.index
As can be seen from the index, our current datetime data is not continuous and can be tricky to work with. Therefore, we will use the average daily sales value for that month instead, and we are using the start of each month as the timestamp. This requires us to resample the data.
monthly = office_supplies['Sales'].resample('MS').mean()
monthly
Visualizing the Sales Time Series Data¶
monthly.plot(figsize = (16, 7))
plt.show()
Some distinguishable patterns appear when we plot the data. The time-series has seasonality pattern, such as sales are always low at the beginning of the year and high at the end of the year. There is always an upward trend within any single year with a couple of low months in the mid of the year.
We can also visualize our data using a method called time-series decomposition that allows us to decompose our time series into three distinct components: trend, seasonality, and noise.
from pylab import rcParams
rcParams['figure.figsize'] = 16, 7
decomposition = sm.tsa.seasonal_decompose(monthly, model = 'additive')
fig = decomposition.plot()
plt.show()
The figure above clearly shows the seasonality in our data, and the trend is gradually increasing through the years.
Time Series Forecasting with ARIMA¶
Autoregressive Integrated Moving Average (ARIMA) is a commonly used method for time-series prediction. ARIMA models are denoted with the notation ARIMA(p, d, q). These three parameters account for seasonality, trend, and noise in data:
p = d = q = range(0,2)
pdq = list(itertools.product(p,d,q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p,d,q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
This step is parameter selection for our Office Supplies sales ARIMA Time Series Model. Our goal here is to use a "grid search" to find the optimal set of parameters that yields the best performance for our model.
for param in pdq:
for param_seasonal in seasonal_pdq:
try:
mod = sm.tsa.statespace.SARIMAX(monthly, order = param,
seasonal_order = param_seasonal,
enforce_stationarity = False,
enforce_invertibility = False)
results = mod.fit()
print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
except:
continue
Fitting the ARIMA Model using the Optimal parameters¶
mod = sm.tsa.statespace.SARIMAX(monthly, order = (1,1,1),
seasonal_order = (1,1,0,12),
enforce_stationarity = False,
enforce_invertibility = False)
results = mod.fit()
print(results.summary().tables[1])
Let's see the model diagnostics to investigate any unusual behavior.
results.plot_diagnostics(figsize = (16,7))
plt.show()
It is not a perfect model. However, our model diagnostics suggests that the model residuals are nearly normally distributed.
Validating Forecasts¶
To help us understand the accuracy of our forecasts, we compare predicted sales to real sales of the time series, and we set forecasts to start at 2017–01–01 to the end of the data.
pred = results.get_prediction(start = pd.to_datetime('2017-01-01'), dynamic = False)
pred_ci = pred.conf_int()
ax = monthly['2014':].plot(label = 'observed')
pred.predicted_mean.plot(ax = ax, label = 'One-step ahead Forecast', alpha = .7, figsize = (14, 6))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:,0],
pred_ci.iloc[:,1],
color = 'k',
alpha = .2)
ax.set_xlabel('Date')
ax.set_ylabel('Office Supplies Sales')
plt.legend()
plt.show()
The line plot is showing the observed values compared to the rolling forecast predictions. Overall, our forecasts align with the true values very well, showing an upward trend starts from the beginning of the year and captured the seasonality toward the end of the year.
The grey area shows the confidence interval.
monthly_forecasted = pred.predicted_mean
monthly_truth = monthly['2017-01-01':]
mse = ((monthly_forecasted -monthly_truth)**2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
The Root Mean Squared Error (RMSE) shows that our model was able to predict the average daily Office Supplies Sales in the test set within 256.6 of the real sales. Our Office Supplies daily sales range from around 50 to 1350. So, our model works pretty well so far.
Producting and Visualizing Forecasts¶
pred_uc = results.get_forecast(steps = 100)
pred_ci = pred_uc.conf_int()
ax = monthly.plot(label = 'observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax = ax, label ='Forecast')
ax.fill_between(pred_ci.index,
pred_ci.iloc[:,0],
pred_ci.iloc[:,1],
color = 'k',
alpha = .2)
ax.set_xlabel('Date')
ax.set_ylabel('Office Supplies Sales')
plt.legend()
plt.show()
Our ARIMA prediction model clearly captured the office supplies sales seasonality. As we forecast further out into the future, it is natural for us to become less confident in our predicted values (the expansion of the grey area).
The above time series analysis for Office Supplies makes me curious about other categories, and how do they compare with each other over time. Therefore, we are going to compare time series of office supplies and furniture.
Time Series of Office Supplies vs. Furniture¶
According to our data, the size of Office Supplies is much larger than the size of Furniture.
furniture = df.loc[df['Category'] == 'Furniture']
furniture = furniture[['Order Date', 'Sales']]
furniture = furniture.sort_values('Order Date')
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
furniture = furniture.set_index('Order Date')
furniture.head()
monthly_office = monthly
monthly_furniture = furniture['Sales'].resample('MS').mean()
furniture = pd.DataFrame({'Order Date':monthly_furniture.index, 'Sales':monthly_furniture.values})
office = pd.DataFrame({'Order Date': monthly_office.index, 'Sales': monthly_office.values})
store = furniture.merge(office, how = 'inner', on = 'Order Date')
store.rename(columns = {'Sales_x': 'furniture_sales', 'Sales_y': 'office_sales'}, inplace=True)
store.head()
Visualizing the Sales Data¶
plt.figure(figsize = (16, 7))
plt.plot(store['Order Date'], store['furniture_sales'], 'b-', label = 'Furniture')
plt.plot(store['Order Date'], store['office_sales'], 'r-', label = 'Office Supplies')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales Furniture vs. Office Supplies')
plt.legend()
plt.show()
We observe that sales of furniture and office supplies shared a similar seasonal pattern. Early of the year is the off season for both of the two categories. It seems summer time is quiet for office supplies too. in addition, average daily sales for furniture are higher than those of office supplies in most of the months. It is understandable, as the value of furniture should be much higher than those of office supplies. Occasionally, office supplies passed furniture on average daily sales. Let’s find out when was the first time office supplies’ sales surpassed those of furniture’s.
first_date = store.ix[np.min(list(np.where(store['office_sales'] > store['furniture_sales'])[0])), 'Order Date']
print("Office supplies first time produced higher sales than furniture is {}.".format(first_date.date()))
Time Series Modeling with Prophet¶
Prophet is designed for analyzing time-series that display patterns on different time scales such as yearly, weekly and daily. It also has advanced capabilities for modeling the effects of holidays on a time-series and implementing custom changepoints. Therefore, we are using Prophet to get a model up and running.
Let's take look at the Prophet forcasting models for Furniture and Office Supplies Sales.
from fbprophet import Prophet
furniture = furniture.rename(columns={'Order Date': 'ds', 'Sales': 'y'})
furniture_model = Prophet(interval_width = 0.95)
furniture_model.fit(furniture)
office = office.rename(columns={'Order Date': 'ds', 'Sales': 'y'})
office_model = Prophet(interval_width=0.95)
office_model.fit(office)
furniture_forecast = furniture_model.make_future_dataframe(periods = 36, freq = 'MS')
furniture_forecast = furniture_model.predict(furniture_forecast)
office_forecast = office_model.make_future_dataframe(periods = 36, freq = 'MS')
office_forecast = office_model.predict(office_forecast)
plt.figure(figsize = (16,7))
furniture_model.plot(furniture_forecast, xlabel = 'Date', ylabel = 'Sales')
plt.title('Furniture Sales')
plt.show()
plt.figure(figsize = (16,7))
office_model.plot(office_forecast, xlabel = 'Date', ylabel = 'Sales')
plt.title('Office Supplies Sales')
plt.show()
Compare the Two Forecasts¶
We already have the forecasts for three years for these two categories into the future. We will now join them together to compare their future forecasts.
furniture_names = ['furniture_%s' % column for column in furniture_forecast.columns]
office_names = ['office_%s' % column for column in office_forecast.columns]
merge_furniture_forecast = furniture_forecast.copy()
merge_office_forecast = office_forecast.copy()
merge_furniture_forecast.columns = furniture_names
merge_office_forecast.columns = office_names
forecast = pd.merge(merge_furniture_forecast, merge_office_forecast,
how = 'inner', left_on = 'furniture_ds',
right_on = 'office_ds')
forecast = forecast.rename(columns = {'furniture_ds': 'Date'}).drop('office_ds', axis = 1)
forecast.head()
Trend and Forecast Visualization¶
plt.figure(figsize = (12, 6))
plt.plot(forecast['Date'], forecast['furniture_trend'],'b-')
plt.plot(forecast['Date'], forecast['office_trend'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Sales Trend');
plt.show()
plt.figure(figsize=(10, 7))
plt.plot(forecast['Date'], forecast['furniture_yhat'], 'b-')
plt.plot(forecast['Date'], forecast['office_yhat'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Estimate');
Trends and Patterns¶
Now, we can use the Prophet Models to inspect different trends of these two categories in the data.
furniture_model.plot_components(furniture_forecast)
plt.show()
office_model.plot_components(office_forecast)
plt.show()
Good to see that the sales for both furniture and office supplies have been linearly increasing over time and will be keep growing, although office supplies’ growth seems slightly stronger.
Comments
comments powered by Disqus