Time Series Analysis Project on Superstore Sales Data¶

Time Series Analysis (TSA) is an important part in the field of data science. TSA uses methods for analyzing time series data in order to identify useful patterns and extract meaningful statistics of the data. There are two major goals of TSA: 1) identifing patterns or features represented by the data; and 2) forecasting (using a model to predict future values based on previous data).

TSA is widely used for analysing non-stationary data, like weather data, stock price prediction, economic trend prediction and store sales prediction in this post. Different techniques in TSA will be demonstrated in this post. Let's get into it.

The data we will use is the superstore sales data, and it can be download here.

A Brief Exploratory Data Analysis (EDA)¶

In [1]:

# Import libraries
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import statsmodels.api as sm
import matplotlib

matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

import warnings
warnings.filterwarnings('ignore')

Read in the data. We do a brief EDA to see what's in the data.

In [2]:

# Import data
df = pd.read_excel('data\Superstore.xls')
df.head()

Out[2]:

	Row ID	Order ID	Order Date	Ship Date	Ship Mode	Customer ID	Customer Name	Segment	Country	City	...	Postal Code	Region	Product ID	Category	Sub-Category	Product Name	Sales	Quantity	Discount	Profit
0	1	CA-2016-152156	2016-11-08	2016-11-11	Second Class	CG-12520	Claire Gute	Consumer	United States	Henderson	...	42420	South	FUR-BO-10001798	Furniture	Bookcases	Bush Somerset Collection Bookcase	261.9600	2	0.00	41.9136
1	2	CA-2016-152156	2016-11-08	2016-11-11	Second Class	CG-12520	Claire Gute	Consumer	United States	Henderson	...	42420	South	FUR-CH-10000454	Furniture	Chairs	Hon Deluxe Fabric Upholstered Stacking Chairs,...	731.9400	3	0.00	219.5820
2	3	CA-2016-138688	2016-06-12	2016-06-16	Second Class	DV-13045	Darrin Van Huff	Corporate	United States	Los Angeles	...	90036	West	OFF-LA-10000240	Office Supplies	Labels	Self-Adhesive Address Labels for Typewriters b...	14.6200	2	0.00	6.8714
3	4	US-2015-108966	2015-10-11	2015-10-18	Standard Class	SO-20335	Sean O'Donnell	Consumer	United States	Fort Lauderdale	...	33311	South	FUR-TA-10000577	Furniture	Tables	Bretford CR4500 Series Slim Rectangular Table	957.5775	5	0.45	-383.0310
4	5	US-2015-108966	2015-10-11	2015-10-18	Standard Class	SO-20335	Sean O'Donnell	Consumer	United States	Fort Lauderdale	...	33311	South	OFF-ST-10000760	Office Supplies	Storage	Eldon Fold 'N Roll Cart System	22.3680	2	0.20	2.5164

5 rows × 21 columns

In [3]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
Row ID           9994 non-null int64
Order ID         9994 non-null object
Order Date       9994 non-null datetime64[ns]
Ship Date        9994 non-null datetime64[ns]
Ship Mode        9994 non-null object
Customer ID      9994 non-null object
Customer Name    9994 non-null object
Segment          9994 non-null object
Country          9994 non-null object
City             9994 non-null object
State            9994 non-null object
Postal Code      9994 non-null int64
Region           9994 non-null object
Product ID       9994 non-null object
Category         9994 non-null object
Sub-Category     9994 non-null object
Product Name     9994 non-null object
Sales            9994 non-null float64
Quantity         9994 non-null int64
Discount         9994 non-null float64
Profit           9994 non-null float64
dtypes: datetime64[ns](2), float64(3), int64(3), object(13)
memory usage: 1.6+ MB

We have 21 columns in the dataset. Let's take a look what are they.

In [4]:

df.columns.values

Out[4]:

array(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'Country', 'City',
       'State', 'Postal Code', 'Region', 'Product ID', 'Category',
       'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
       'Profit'], dtype=object)

Check if there is any missing values in the data. It turns out there's no missing values in the data.

In [5]:

df.isnull().any()

Out[5]:

Row ID           False
Order ID         False
Order Date       False
Ship Date        False
Ship Mode        False
Customer ID      False
Customer Name    False
Segment          False
Country          False
City             False
State            False
Postal Code      False
Region           False
Product ID       False
Category         False
Sub-Category     False
Product Name     False
Sales            False
Quantity         False
Discount         False
Profit           False
dtype: bool

Let's take a look about the statistics of numeric values in the data.

In [6]:

df.describe()

Out[6]:

	Row ID	Postal Code	Sales	Quantity	Discount	Profit
count	9994.000000	9994.000000	9994.000000	9994.000000	9994.000000	9994.000000
mean	4997.500000	55190.379428	229.858001	3.789574	0.156203	28.656896
std	2885.163629	32063.693350	623.245101	2.225110	0.206452	234.260108
min	1.000000	1040.000000	0.444000	1.000000	0.000000	-6599.978000
25%	2499.250000	23223.000000	17.280000	2.000000	0.000000	1.728750
50%	4997.500000	56430.500000	54.490000	3.000000	0.200000	8.666500
75%	7495.750000	90008.000000	209.940000	5.000000	0.200000	29.364000
max	9994.000000	99301.000000	22638.480000	14.000000	0.800000	8399.976000

We found that there is an interesting column Category which can divided the data into several subsets according to the product categories. Let's see how many categories of products we have in the data.

In [7]:

df.Category.unique()

Out[7]:

array(['Furniture', 'Office Supplies', 'Technology'], dtype=object)

In [8]:

df.Category.value_counts()

Out[8]:

Office Supplies    6026
Furniture          2121
Technology         1847
Name: Category, dtype: int64

Now we can analyse different categories of products accordingly. In this post, let's take the Office Supplies data as an example. We will start our time series analysis on this type of products. You can also explore other categories in a similar way.

In [9]:

# Extract the data of Office Supplies
office_supplies = df.loc[df['Category'] == 'Office Supplies']

We have a good 4-year office supplies sales data. We will try to find some meaningful patterns in this data.

In [10]:

office_supplies['Order Date'].min(), office_supplies['Order Date'].max()

Out[10]:

(Timestamp('2014-01-03 00:00:00'), Timestamp('2017-12-30 00:00:00'))

Data Processing¶

In this post, we only care about the sales data of office supplies. We'll start by droping other columns. First, let us set the index of our data using the 'Order Data' column.

Note that, the sales data on the same date should be integrated together.

In [11]:

office_supplies = office_supplies[['Order Date','Sales']]
office_supplies.head()

Out[11]:

	Order Date	Sales
2	2016-06-12	14.620
4	2015-10-11	22.368
6	2014-06-09	7.280
8	2014-06-09	18.504
9	2014-06-09	114.900

In [12]:

office_supplies = office_supplies.sort_values('Order Date')
office_supplies = office_supplies.groupby('Order Date')['Sales'].sum().reset_index()
office_supplies.head()

Out[12]:

	Order Date	Sales
0	2014-01-03	16.448
1	2014-01-04	288.060
2	2014-01-05	19.536
3	2014-01-06	685.340
4	2014-01-07	10.430

Indexing with the time series data¶

In [13]:

office_supplies = office_supplies.set_index('Order Date')
office_supplies.head()

Out[13]:

	Sales
Order Date
2014-01-03	16.448
2014-01-04	288.060
2014-01-05	19.536
2014-01-06	685.340
2014-01-07	10.430

In [14]:

office_supplies.index

Out[14]:

DatetimeIndex(['2014-01-03', '2014-01-04', '2014-01-05', '2014-01-06',
               '2014-01-07', '2014-01-09', '2014-01-10', '2014-01-13',
               '2014-01-16', '2014-01-18',
               ...
               '2017-12-21', '2017-12-22', '2017-12-23', '2017-12-24',
               '2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
               '2017-12-29', '2017-12-30'],
              dtype='datetime64[ns]', name='Order Date', length=1148, freq=None)

As can be seen from the index, our current datetime data is not continuous and can be tricky to work with. Therefore, we will use the average daily sales value for that month instead, and we are using the start of each month as the timestamp. This requires us to resample the data.

In [15]:

monthly = office_supplies['Sales'].resample('MS').mean()

In [16]:

monthly

Out[16]:

Order Date
2014-01-01     285.357647
2014-02-01      63.042588
2014-03-01     391.176318
2014-04-01     464.794750
2014-05-01     324.346545
2014-06-01     588.774409
2014-07-01     756.060400
2014-08-01     541.879143
2014-09-01    1015.677704
2014-10-01     267.078815
2014-11-01     959.372714
2014-12-01     692.556231
2015-01-01     129.198571
2015-02-01     335.504187
2015-03-01     690.545522
2015-04-01     502.342320
2015-05-01     364.549440
2015-06-01     560.407737
2015-07-01     205.214739
2015-08-01     558.814667
2015-09-01     772.230680
2015-10-01     361.392083
2015-11-01     757.790357
2015-12-01     540.055800
2016-01-01     331.230125
2016-02-01     357.597368
2016-03-01     693.877240
2016-04-01     462.932478
2016-05-01     449.489724
2016-06-01     436.072400
2016-07-01     587.474727
2016-08-01     344.605385
2016-09-01     830.847786
2016-10-01     678.408083
2016-11-01     787.972231
2016-12-01    1357.055929
2017-01-01     967.013136
2017-02-01     389.882737
2017-03-01     538.899481
2017-04-01     558.229296
2017-05-01     508.776444
2017-06-01     650.463038
2017-07-01     393.902615
2017-08-01    1156.148154
2017-09-01    1139.137250
2017-10-01     886.045846
2017-11-01    1124.012036
2017-12-01    1049.549724
Freq: MS, Name: Sales, dtype: float64

Visualizing the Sales Time Series Data¶

In [17]:

monthly.plot(figsize = (16, 7))
plt.show()

Some distinguishable patterns appear when we plot the data. The time-series has seasonality pattern, such as sales are always low at the beginning of the year and high at the end of the year. There is always an upward trend within any single year with a couple of low months in the mid of the year.

We can also visualize our data using a method called time-series decomposition that allows us to decompose our time series into three distinct components: trend, seasonality, and noise.

In [18]:

from pylab import rcParams
rcParams['figure.figsize'] = 16, 7

decomposition = sm.tsa.seasonal_decompose(monthly, model = 'additive')
fig = decomposition.plot()
plt.show()

The figure above clearly shows the seasonality in our data, and the trend is gradually increasing through the years.

Time Series Forecasting with ARIMA¶

Autoregressive Integrated Moving Average (ARIMA) is a commonly used method for time-series prediction. ARIMA models are denoted with the notation ARIMA(p, d, q). These three parameters account for seasonality, trend, and noise in data:

In [19]:

p = d = q = range(0,2)
pdq = list(itertools.product(p,d,q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p,d,q))]

print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)

This step is parameter selection for our Office Supplies sales ARIMA Time Series Model. Our goal here is to use a "grid search" to find the optimal set of parameters that yields the best performance for our model.

In [20]:

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(monthly, order = param,
                                           seasonal_order = param_seasonal,
                                           enforce_stationarity = False,
                                           enforce_invertibility = False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue

ARIMA(0, 0, 0)x(0, 0, 0, 12)12 - AIC:747.1990404227043
ARIMA(0, 0, 0)x(0, 0, 1, 12)12 - AIC:1464.4967660059162
ARIMA(0, 0, 0)x(0, 1, 0, 12)12 - AIC:500.7335398750499
ARIMA(0, 0, 0)x(1, 0, 0, 12)12 - AIC:516.0876543936834
ARIMA(0, 0, 0)x(1, 0, 1, 12)12 - AIC:1459.9493195280877
ARIMA(0, 0, 0)x(1, 1, 0, 12)12 - AIC:347.44488563489716
ARIMA(0, 0, 1)x(0, 0, 0, 12)12 - AIC:702.6891395292475
ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:nan
ARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:487.76014158147206
ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:517.4952646585557
ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:3022.312250877466
ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:345.69220944601045
ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:666.4572045007284
ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:1360.057691075371
ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:503.0840747609876
ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:500.0109385290892
ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:1483.149298292236
ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:335.7148959418817
ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:637.3530008828178
ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:3075.7023343964047
ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:473.71539674554265
ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:489.9419970027669
ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:2578.319436918695
ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:330.7061513093243
ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:680.4032716562347
ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:1588.3750282900137
ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:501.7031226672368
ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:500.2190534421442
ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:1455.6307367581146
ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:331.0719973254865
ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:651.1768264308408
ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:28112.535708773736
ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:488.4314196132838
ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:489.254640043546
ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:2661.0163451942744
ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:331.7136802273554
ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:658.0713305703066
ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:1417.437544166757
ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:494.907679737394
ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:482.80837530249664
ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:1329.3989860683482
ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:321.2325469884074
ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:638.7962401173565
ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:41587.6558850662
ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:475.71388519565295
ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:478.6987301743289
ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:2580.1698313906977
ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:318.20664790842784

Fitting the ARIMA Model using the Optimal parameters¶

In [21]:

mod = sm.tsa.statespace.SARIMAX(monthly, order = (1,1,1),
                               seasonal_order = (1,1,0,12),
                               enforce_stationarity = False,
                               enforce_invertibility = False)

results = mod.fit()
print(results.summary().tables[1])

==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.2479      0.328      0.755      0.450      -0.395       0.891
ma.L1         -0.9389      0.462     -2.030      0.042      -1.845      -0.032
ar.S.L12      -0.6135      0.348     -1.762      0.078      -1.296       0.069
sigma2      7.421e+04   2.68e+04      2.768      0.006    2.17e+04    1.27e+05
==============================================================================

Let's see the model diagnostics to investigate any unusual behavior.

In [22]:

results.plot_diagnostics(figsize = (16,7))
plt.show()

It is not a perfect model. However, our model diagnostics suggests that the model residuals are nearly normally distributed.

Validating Forecasts¶

To help us understand the accuracy of our forecasts, we compare predicted sales to real sales of the time series, and we set forecasts to start at 2017–01–01 to the end of the data.

In [23]:

pred = results.get_prediction(start = pd.to_datetime('2017-01-01'), dynamic = False)
pred_ci = pred.conf_int()

ax = monthly['2014':].plot(label = 'observed')
pred.predicted_mean.plot(ax = ax, label = 'One-step ahead Forecast', alpha = .7, figsize = (14, 6))

ax.fill_between(pred_ci.index, 
                pred_ci.iloc[:,0],
                pred_ci.iloc[:,1],
               color = 'k',
               alpha = .2)

ax.set_xlabel('Date')
ax.set_ylabel('Office Supplies Sales')
plt.legend()
plt.show()

The line plot is showing the observed values compared to the rolling forecast predictions. Overall, our forecasts align with the true values very well, showing an upward trend starts from the beginning of the year and captured the seasonality toward the end of the year.

The grey area shows the confidence interval.

In [24]:

monthly_forecasted = pred.predicted_mean
monthly_truth = monthly['2017-01-01':]

mse = ((monthly_forecasted -monthly_truth)**2).mean()

print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))

The Mean Squared Error of our forecasts is 65844.6

In [25]:

print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))

The Root Mean Squared Error of our forecasts is 256.6

The Root Mean Squared Error (RMSE) shows that our model was able to predict the average daily Office Supplies Sales in the test set within 256.6 of the real sales. Our Office Supplies daily sales range from around 50 to 1350. So, our model works pretty well so far.

Producting and Visualizing Forecasts¶

In [26]:

pred_uc = results.get_forecast(steps = 100)
pred_ci = pred_uc.conf_int()

ax = monthly.plot(label = 'observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax = ax, label ='Forecast')
ax.fill_between(pred_ci.index,
               pred_ci.iloc[:,0],
               pred_ci.iloc[:,1],
               color = 'k',
               alpha = .2)

ax.set_xlabel('Date')
ax.set_ylabel('Office Supplies Sales')
plt.legend()
plt.show()

Our ARIMA prediction model clearly captured the office supplies sales seasonality. As we forecast further out into the future, it is natural for us to become less confident in our predicted values (the expansion of the grey area).

The above time series analysis for Office Supplies makes me curious about other categories, and how do they compare with each other over time. Therefore, we are going to compare time series of office supplies and furniture.

Time Series of Office Supplies vs. Furniture¶

According to our data, the size of Office Supplies is much larger than the size of Furniture.

In [27]:

furniture = df.loc[df['Category'] == 'Furniture']
furniture = furniture[['Order Date', 'Sales']]
furniture = furniture.sort_values('Order Date')
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
furniture = furniture.set_index('Order Date')
furniture.head()

Out[27]:

	Sales
Order Date
2014-01-06	2573.820
2014-01-07	76.728
2014-01-10	51.940
2014-01-11	9.940
2014-01-13	879.939

In [28]:

monthly_office = monthly
monthly_furniture = furniture['Sales'].resample('MS').mean()

In [29]:

furniture = pd.DataFrame({'Order Date':monthly_furniture.index, 'Sales':monthly_furniture.values})
office = pd.DataFrame({'Order Date': monthly_office.index, 'Sales': monthly_office.values})

store = furniture.merge(office, how = 'inner', on = 'Order Date')
store.rename(columns = {'Sales_x': 'furniture_sales', 'Sales_y': 'office_sales'}, inplace=True)
store.head()

Out[29]:

	Order Date	furniture_sales	office_sales
0	2014-01-01	480.194231	285.357647
1	2014-02-01	367.931600	63.042588
2	2014-03-01	857.291529	391.176318
3	2014-04-01	567.488357	464.794750
4	2014-05-01	432.049188	324.346545

Visualizing the Sales Data¶

In [30]:

plt.figure(figsize = (16, 7))
plt.plot(store['Order Date'], store['furniture_sales'], 'b-', label = 'Furniture')
plt.plot(store['Order Date'], store['office_sales'], 'r-', label = 'Office Supplies')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales Furniture vs. Office Supplies')
plt.legend()
plt.show()

We observe that sales of furniture and office supplies shared a similar seasonal pattern. Early of the year is the off season for both of the two categories. It seems summer time is quiet for office supplies too. in addition, average daily sales for furniture are higher than those of office supplies in most of the months. It is understandable, as the value of furniture should be much higher than those of office supplies. Occasionally, office supplies passed furniture on average daily sales. Let’s find out when was the first time office supplies’ sales surpassed those of furniture’s.

In [31]:

first_date = store.ix[np.min(list(np.where(store['office_sales'] > store['furniture_sales'])[0])), 'Order Date']
print("Office supplies first time produced higher sales than furniture is {}.".format(first_date.date()))

Office supplies first time produced higher sales than furniture is 2014-07-01.

Time Series Modeling with Prophet¶

Prophet is designed for analyzing time-series that display patterns on different time scales such as yearly, weekly and daily. It also has advanced capabilities for modeling the effects of holidays on a time-series and implementing custom changepoints. Therefore, we are using Prophet to get a model up and running.

Let's take look at the Prophet forcasting models for Furniture and Office Supplies Sales.

In [32]:

from fbprophet import Prophet
furniture = furniture.rename(columns={'Order Date': 'ds', 'Sales': 'y'})
furniture_model = Prophet(interval_width = 0.95)
furniture_model.fit(furniture)

office = office.rename(columns={'Order Date': 'ds', 'Sales': 'y'})
office_model = Prophet(interval_width=0.95)
office_model.fit(office)

furniture_forecast = furniture_model.make_future_dataframe(periods = 36, freq = 'MS')
furniture_forecast = furniture_model.predict(furniture_forecast)

office_forecast = office_model.make_future_dataframe(periods = 36, freq = 'MS')
office_forecast = office_model.predict(office_forecast)

plt.figure(figsize = (16,7))
furniture_model.plot(furniture_forecast, xlabel = 'Date', ylabel = 'Sales')
plt.title('Furniture Sales')
plt.show()

INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.

<Figure size 1152x504 with 0 Axes>

In [33]:

plt.figure(figsize = (16,7))
office_model.plot(office_forecast, xlabel = 'Date', ylabel = 'Sales')

plt.title('Office Supplies Sales')
plt.show()

<Figure size 1152x504 with 0 Axes>

Compare the Two Forecasts¶

We already have the forecasts for three years for these two categories into the future. We will now join them together to compare their future forecasts.

In [34]:

furniture_names = ['furniture_%s' % column for column in furniture_forecast.columns]
office_names = ['office_%s' % column for column in office_forecast.columns]

merge_furniture_forecast = furniture_forecast.copy()
merge_office_forecast = office_forecast.copy()

merge_furniture_forecast.columns = furniture_names
merge_office_forecast.columns = office_names

forecast = pd.merge(merge_furniture_forecast, merge_office_forecast,
                   how = 'inner', left_on = 'furniture_ds',
                   right_on = 'office_ds')

forecast = forecast.rename(columns = {'furniture_ds': 'Date'}).drop('office_ds', axis = 1)

forecast.head()

Out[34]:

	Date	furniture_trend	furniture_yhat_lower	furniture_yhat_upper	furniture_trend_lower	furniture_trend_upper	furniture_additive_terms	furniture_additive_terms_lower	furniture_additive_terms_upper	furniture_yearly	...	office_additive_terms	office_additive_terms_lower	office_additive_terms_upper	office_yearly	office_yearly_lower	office_yearly_upper	office_yhat
0	2014-01-01	731.079361	266.970810	823.529872	731.079361	731.079361	-178.836100	-178.836100	-178.836100	-178.836100	...	-132.483942	-132.483942	-132.483942	-132.483942	-132.483942	-132.483942	297.865749
1	2014-02-01	733.206972	133.362486	678.590785	733.206972	733.206972	-324.048145	-324.048145	-324.048145	-324.048145	...	-288.226070	-288.226070	-288.226070	-288.226070	-288.226070	-288.226070	149.595672
2	2014-03-01	735.128684	395.356700	921.716107	735.128684	735.128684	-69.406915	-69.406915	-69.406915	-69.406915	...	0.829065	0.829065	0.829065	0.829065	0.829065	0.829065	445.399757
3	2014-04-01	737.256294	330.392503	863.616976	737.256294	737.256294	-140.477169	-140.477169	-140.477169	-140.477169	...	-89.156127	-89.156127	-89.156127	-89.156127	-89.156127	-89.156127	362.886617
4	2014-05-01	739.315271	273.750063	839.110706	739.315271	739.315271	-172.355011	-172.355011	-172.355011	-172.355011	...	-183.195734	-183.195734	-183.195734	-183.195734	-183.195734	-183.195734	276.078026

5 rows × 31 columns

Trend and Forecast Visualization¶

In [35]:

plt.figure(figsize = (12, 6))
plt.plot(forecast['Date'], forecast['furniture_trend'],'b-')
plt.plot(forecast['Date'], forecast['office_trend'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Sales Trend');
plt.show()

In [36]:

plt.figure(figsize=(10, 7))
plt.plot(forecast['Date'], forecast['furniture_yhat'], 'b-')
plt.plot(forecast['Date'], forecast['office_yhat'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Estimate');

Trends and Patterns¶

Now, we can use the Prophet Models to inspect different trends of these two categories in the data.

In [37]:

furniture_model.plot_components(furniture_forecast)
plt.show()

In [38]:

office_model.plot_components(office_forecast)
plt.show()

Good to see that the sales for both furniture and office supplies have been linearly increasing over time and will be keep growing, although office supplies’ growth seems slightly stronger.

Time Series Analysis on Superstore Sales Data

Time Series Analysis Project on Superstore Sales Data¶

A Brief Exploratory Data Analysis (EDA)¶

Data Processing¶

Indexing with the time series data¶

Visualizing the Sales Time Series Data¶

Time Series Forecasting with ARIMA¶

Fitting the ARIMA Model using the Optimal parameters¶

Validating Forecasts¶

Producting and Visualizing Forecasts¶

Time Series of Office Supplies vs. Furniture¶

Visualizing the Sales Data¶

Time Series Modeling with Prophet¶

Compare the Two Forecasts¶

Trend and Forecast Visualization¶

Trends and Patterns¶

Comments

Time Series Analysis Project on Superstore Sales Data¶

A Brief Exploratory Data Analysis (EDA)¶

Data Processing¶

Indexing with the time series data¶

Visualizing the Sales Time Series Data¶

Time Series Forecasting with ARIMA¶

Fitting the ARIMA Model using the Optimal parameters¶

Validating Forecasts¶

Producting and Visualizing Forecasts¶

Time Series of Office Supplies vs. Furniture¶

Visualizing the Sales Data¶

Time Series Modeling with Prophet¶

Compare the Two Forecasts¶

Trend and Forecast Visualization¶

Trends and Patterns¶

Related Posts:

Comments