Exploring Time Series

Time Series Manipulation using Pandas

In [2]:
# Creating a date range with hourly frequency

import pandas as pd
from datetime import datetime
import numpy as np

date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
In [3]:
date_rng
Out[3]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
               '2018-01-01 02:00:00', '2018-01-01 03:00:00',
               '2018-01-01 04:00:00', '2018-01-01 05:00:00',
               '2018-01-01 06:00:00', '2018-01-01 07:00:00',
               '2018-01-01 08:00:00', '2018-01-01 09:00:00',
               ...
               '2018-01-07 15:00:00', '2018-01-07 16:00:00',
               '2018-01-07 17:00:00', '2018-01-07 18:00:00',
               '2018-01-07 19:00:00', '2018-01-07 20:00:00',
               '2018-01-07 21:00:00', '2018-01-07 22:00:00',
               '2018-01-07 23:00:00', '2018-01-08 00:00:00'],
              dtype='datetime64[ns]', length=169, freq='H')
In [4]:
type(date_rng[0])
Out[4]:
pandas._libs.tslibs.timestamps.Timestamp

Now let's create an example dataframe with the timestamp data we just created.

In [5]:
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))

df.head()
Out[5]:
date data
0 2018-01-01 00:00:00 23
1 2018-01-01 01:00:00 73
2 2018-01-01 02:00:00 3
3 2018-01-01 03:00:00 10
4 2018-01-01 04:00:00 44

If we want to do time series manipulation, we'll need a datetime index so that our dataframe is indexed on the timestamp.

In [6]:
#Convert the dataframe index to a datetime index 

df['datetime'] = pd.to_datetime(df['date'])
df = df.set_index('datetime')
df.drop(['date'], axis=1, inplace=True)
df.head()
Out[6]:
data
datetime
2018-01-01 00:00:00 23
2018-01-01 01:00:00 73
2018-01-01 02:00:00 3
2018-01-01 03:00:00 10
2018-01-01 04:00:00 44
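
As a side note, we could have built the dataframe with the DatetimeIndex directly and skipped the set_index step. A minimal sketch, reusing date_rng and np from above:

# Equivalent construction: pass date_rng as the index up front
df_alt = pd.DataFrame({'data': np.random.randint(0, 100, size=len(date_rng))},
                      index=date_rng)
df_alt.index.name = 'datetime'
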
In [7]:
# Example of how to filter the data to keep only day 2.

df[df.index.day == 2]
Out[7]:
data
datetime
2018-01-02 00:00:00 27
2018-01-02 01:00:00 80
2018-01-02 02:00:00 9
2018-01-02 03:00:00 67
2018-01-02 04:00:00 26
2018-01-02 05:00:00 13
2018-01-02 06:00:00 22
2018-01-02 07:00:00 68
2018-01-02 08:00:00 48
2018-01-02 09:00:00 75
2018-01-02 10:00:00 93
2018-01-02 11:00:00 80
2018-01-02 12:00:00 36
2018-01-02 13:00:00 67
2018-01-02 14:00:00 54
2018-01-02 15:00:00 66
2018-01-02 16:00:00 54
2018-01-02 17:00:00 9
2018-01-02 18:00:00 38
2018-01-02 19:00:00 36
2018-01-02 20:00:00 19
2018-01-02 21:00:00 19
2018-01-02 22:00:00 30
2018-01-02 23:00:00 88
In [8]:
# Filtering data between two dates

df['2018-01-04':'2018-01-06']
Out[8]:
data
datetime
2018-01-04 00:00:00 61
2018-01-04 01:00:00 62
2018-01-04 02:00:00 24
2018-01-04 03:00:00 76
2018-01-04 04:00:00 41
... ...
2018-01-06 19:00:00 44
2018-01-06 20:00:00 14
2018-01-06 21:00:00 45
2018-01-06 22:00:00 45
2018-01-06 23:00:00 63

72 rows × 1 columns

We could take the min, max, average, sum, etc., of the data at a daily frequency instead of an hourly one, as in the example below, where we compute the daily average of the data:

In [9]:
df.resample('D').mean()
Out[9]:
data
datetime
2018-01-01 45.083333
2018-01-02 46.833333
2018-01-03 48.333333
2018-01-04 52.625000
2018-01-05 47.791667
2018-01-06 37.333333
2018-01-07 45.666667
2018-01-08 68.000000
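
resample accepts other aggregations too. A minimal sketch computing the daily min, max, and sum in a single call (the nested column names are the pandas defaults):

# Daily min, max, and sum of the hourly data
df.resample('D').agg(['min', 'max', 'sum'])
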
In [10]:
# Example of how to get the rolling sum over the last three values.

df['rolling_sum'] = df.rolling(3).sum()
df.head(10)
Out[10]:
data rolling_sum
datetime
2018-01-01 00:00:00 23 NaN
2018-01-01 01:00:00 73 NaN
2018-01-01 02:00:00 3 99.0
2018-01-01 03:00:00 10 86.0
2018-01-01 04:00:00 44 57.0
2018-01-01 05:00:00 9 63.0
2018-01-01 06:00:00 73 126.0
2018-01-01 07:00:00 83 165.0
2018-01-01 08:00:00 4 160.0
2018-01-01 09:00:00 95 182.0

The rolling sum only starts having valid values once there are three periods over which to look back.
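
If you would rather have partial sums than leading NaNs, the rolling window accepts a min_periods argument. A minimal sketch:

# With min_periods=1 the window emits a (partial) sum as soon as it has
# a single observation, so no leading NaNs appear
df['data'].rolling(3, min_periods=1).sum().head()
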

The following is a good opportunity to see how we can forward fill or backfill data when working with missing values.

In [11]:
df['rolling_sum_backfilled'] = df['rolling_sum'].fillna(method='backfill')
df.head()
Out[11]:
data rolling_sum rolling_sum_backfilled
datetime
2018-01-01 00:00:00 23 NaN 99.0
2018-01-01 01:00:00 73 NaN 99.0
2018-01-01 02:00:00 3 99.0 99.0
2018-01-01 03:00:00 10 86.0 86.0
2018-01-01 04:00:00 44 57.0 57.0

It’s often useful to fill missing data with realistic values, such as the average over a time period. But always remember: if you are working on a time series forecasting problem and want your data to be realistic, you should not backfill your data, because backfilling pulls information from the future into the past (data leakage).
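
A leakage-safe alternative is to fill gaps using only past values, for example with a forward fill. A minimal sketch using the rolling_sum column created above:

# Forward fill: propagate the last known value forward, using only the past;
# the first two rows stay NaN because there is no earlier value to copy
df['rolling_sum_ffilled'] = df['rolling_sum'].fillna(method='ffill')
df.head()
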

When working with time series data, you may come across time values that are in Unix time. Unix time, also called Epoch time, is the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC) on Thursday, 1 January 1970.

How do we convert epoch time to real time?

In [12]:
epoch_t = 1529272655
real_t = pd.to_datetime(epoch_t, unit='s')
real_t
Out[12]:
Timestamp('2018-06-17 21:57:35')
In [13]:
# Now, let's convert it to Pacific time

real_t.tz_localize('UTC').tz_convert('US/Pacific')
Out[13]:
Timestamp('2018-06-17 14:57:35-0700', tz='US/Pacific')
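
Going the other way, from a Timestamp back to epoch seconds, can be done with integer arithmetic on the underlying nanosecond value. A minimal sketch:

# Timestamp.value holds nanoseconds since the Unix epoch; divide to get seconds
real_t.value // 10**9   # 1529272655
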

Use case:

In the following example we will only take in data from a univariate time series. That means we are only considering the relationship between the y-axis values and the x-axis time points. We’re not considering outside factors that may be affecting the time series.

A common mistake beginners make is to immediately apply ARIMA forecasting models to data that is driven by many outside factors.

In [1]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/electric_production.csv', index_col=0)
data.head()
Out[1]:
IPG2211A2N
DATE
1939-01-01 3.3335
1939-02-01 3.3590
1939-03-01 3.4353
1939-04-01 3.4607
1939-05-01 3.4607
In [48]:
data.tail()
Out[48]:
Energy Production
DATE
2022-02-01 114.3064
2022-03-01 102.7846
2022-04-01 91.4573
2022-05-01 95.5598
2022-06-01 104.3661

Our index is actually just a list of strings that look like dates, so we need to convert them to timestamps so that our forecasting analysis can interpret these values.

In [2]:
data.index = pd.to_datetime(data.index)

Let's also rename our IPG2211A2N column to a friendlier name.

In [3]:
data.columns = ['Energy Production']
In [5]:
pip install chart_studio cufflinks statsmodels
Collecting plotly
  Downloading plotly-5.9.0-py2.py3-none-any.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 79.4 MB/s eta 0:00:00
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.9.0 tenacity-8.0.1
WARNING: There was an error checking the latest version of pip.
Note: you may need to restart the kernel to use updated packages.
In [21]:
import cufflinks as cf
import plotly.offline as py
import matplotlib.pyplot as plt
In [24]:
data.plot(title="Energy Production 1939--2022", figsize=(15,6))
Out[24]:
<AxesSubplot:title={'center':'Energy Production 1939--2022'}, xlabel='DATE'>

It looks like the trend in the early years increases at a slightly higher rate than linear. Experimenting with additive versus multiplicative decomposition is easy to do in just a few lines of code with statsmodels:

In [66]:
from pylab import rcParams
rcParams['figure.figsize'] = 11, 9

import statsmodels.api as sm
decomposition = sm.tsa.seasonal_decompose(data, model='multiplicative')
fig = decomposition.plot()
plt.show()
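
For comparison, an additive decomposition only requires changing the model argument. A minimal sketch:

# Additive decomposition for comparison: data = trend + seasonal + residual
decomposition_add = sm.tsa.seasonal_decompose(data, model='additive')
fig = decomposition_add.plot()
plt.show()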

We can clearly see the seasonal component of the data, as well as the separate upward trend. It makes sense to use a seasonal ARIMA model. To do this, we will need to choose p, d, q values for the ARIMA part, and P, D, Q values for the seasonal component.

In [43]:
pip install pmdarima
Collecting pmdarima
  Downloading pmdarima-1.8.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 19.0 MB/s eta 0:00:00
Requirement already satisfied: setuptools!=50.0.0,>=38.6.0 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (62.3.2)
Requirement already satisfied: numpy>=1.19.3 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (1.23.1)
Requirement already satisfied: statsmodels!=0.12.0,>=0.11 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (0.13.2)
Requirement already satisfied: scikit-learn>=0.22 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (1.1.1)
Requirement already satisfied: pandas>=0.19 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (1.4.3)
Requirement already satisfied: joblib>=0.11 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (1.1.0)
Requirement already satisfied: scipy>=1.3.2 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (1.8.1)
Requirement already satisfied: urllib3 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (1.26.9)
Requirement already satisfied: Cython!=0.29.18,>=0.29 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pmdarima) (0.29.30)
Requirement already satisfied: pytz>=2020.1 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pandas>=0.19->pmdarima) (2022.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from pandas>=0.19->pmdarima) (2.8.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from scikit-learn>=0.22->pmdarima) (3.1.0)
Requirement already satisfied: patsy>=0.5.2 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from statsmodels!=0.12.0,>=0.11->pmdarima) (0.5.2)
Requirement already satisfied: packaging>=21.3 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from statsmodels!=0.12.0,>=0.11->pmdarima) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from packaging>=21.3->statsmodels!=0.12.0,>=0.11->pmdarima) (3.0.9)
Requirement already satisfied: six in /home/gitpod/.pyenv/versions/3.8.13/lib/python3.8/site-packages (from patsy>=0.5.2->statsmodels!=0.12.0,>=0.11->pmdarima) (1.16.0)
Installing collected packages: pmdarima
Successfully installed pmdarima-1.8.5
WARNING: There was an error checking the latest version of pip.
Note: you may need to restart the kernel to use updated packages.

The pmdarima library (formerly pyramid-arima) for Python allows us to quickly perform a grid search and even creates a model object that we can fit to the training data.

This library contains an auto_arima function that allows us to set a range of p, d, q, P, D, and Q values and then fit models for the candidate combinations. The function then keeps the combination that reported the best AIC value.

In [45]:
from pmdarima.arima import auto_arima
stepwise_model = auto_arima(data, start_p=1, start_q=1,
                           max_p=3, max_q=3, m=12,
                           start_P=0, seasonal=True,
                           d=1, D=1, trace=True,
                           error_action='ignore',  
                           suppress_warnings=True, 
                           stepwise=True)
print(stepwise_model.aic())
Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,1,1)[12]             : AIC=4023.136, Time=1.80 sec
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=4583.420, Time=0.11 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=4382.760, Time=0.52 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=4129.116, Time=1.20 sec
 ARIMA(1,1,1)(0,1,0)[12]             : AIC=4340.994, Time=0.34 sec
 ARIMA(1,1,1)(1,1,1)[12]             : AIC=4020.582, Time=2.72 sec
 ARIMA(1,1,1)(1,1,0)[12]             : AIC=4228.529, Time=1.62 sec
 ARIMA(1,1,1)(2,1,1)[12]             : AIC=3982.381, Time=5.49 sec
 ARIMA(1,1,1)(2,1,0)[12]             : AIC=4086.153, Time=4.49 sec
 ARIMA(1,1,1)(2,1,2)[12]             : AIC=3969.042, Time=17.31 sec
 ARIMA(1,1,1)(1,1,2)[12]             : AIC=4009.967, Time=9.90 sec
 ARIMA(0,1,1)(2,1,2)[12]             : AIC=4056.170, Time=13.98 sec
 ARIMA(1,1,0)(2,1,2)[12]             : AIC=4113.543, Time=11.19 sec
 ARIMA(2,1,1)(2,1,2)[12]             : AIC=3970.985, Time=22.43 sec
 ARIMA(1,1,2)(2,1,2)[12]             : AIC=3970.979, Time=15.87 sec
 ARIMA(0,1,0)(2,1,2)[12]             : AIC=4183.409, Time=10.02 sec
 ARIMA(0,1,2)(2,1,2)[12]             : AIC=3992.791, Time=19.49 sec
 ARIMA(2,1,0)(2,1,2)[12]             : AIC=4070.929, Time=11.41 sec
 ARIMA(2,1,2)(2,1,2)[12]             : AIC=3970.372, Time=29.53 sec
 ARIMA(1,1,1)(2,1,2)[12] intercept   : AIC=3971.031, Time=41.57 sec

Best model:  ARIMA(1,1,1)(2,1,2)[12]          
Total fit time: 221.072 seconds
3969.042177805799
In [49]:
# Train/test split

train = data.loc['1985-01-01':'2017-12-01']
test = data.loc['2018-01-01':]
In [50]:
# Train the model

stepwise_model.fit(train)
Out[50]:
 ARIMA(1,1,1)(2,1,2)[12]          

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics() method allows us to quickly generate model diagnostics and investigate any unusual behavior.

In [60]:
stepwise_model.fit(train).plot_diagnostics(figsize=(15, 12))
plt.show()

This is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In the top right plot, we see that the orange KDE line follows the N(0,1) line closely (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation 1). This is a good indication that the residuals are normally distributed.

The Q-Q plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of samples taken from a standard normal distribution N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
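
As a complementary quantitative check (not part of the original notebook), a Ljung-Box test evaluates the null hypothesis that the residuals are uncorrelated. A minimal sketch, assuming pmdarima's resid() method for the in-sample residuals:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Large p-values mean we cannot reject the null of uncorrelated residuals
residuals = stepwise_model.resid()
acorr_ljungbox(residuals, lags=[12])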

Now that the model has been fitted to the training data, we can forecast into the future. Recall that our test data set runs from 2018-01-01 all the way to 2022-06-01, so we have 54 monthly periods. That is the value we will use for our .predict() method call:

In [51]:
future_forecast = stepwise_model.predict(n_periods=54)

Let's create a dataframe that contains our future forecast and then concatenate it with the test data.

In [57]:
future_forecast = pd.DataFrame(future_forecast, index=test.index, columns=['Prediction'])
pd.concat([test, future_forecast], axis=1).plot()
Out[57]:
<AxesSubplot:xlabel='DATE'>

Now let's see our prediction in the larger context of the full data set.

In [58]:
pd.concat([data, future_forecast], axis=1).plot()
Out[58]:
<AxesSubplot:xlabel='DATE'>

Now it's your turn to refit our model to our entire data set and then forecast into the real future!
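
A minimal sketch of one way to approach it (the 24-month horizon is an arbitrary choice here):

# Refit on the full data set, then forecast 24 months past the last observation
stepwise_model.fit(data)
real_future = stepwise_model.predict(n_periods=24)

# Build a monthly index that starts right after the last observed month
future_index = pd.date_range(start=data.index[-1], periods=25, freq='MS')[1:]
real_future = pd.DataFrame(real_future, index=future_index, columns=['Prediction'])
pd.concat([data, real_future], axis=1).plot()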