# Creating a date range with hourly frequency
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
date_rng
type(date_rng[0])
Now let´s create an example dataframe with the timestamp data we just created
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
df.head()
If we want to do time series manipulation, we’ll need to have a date time index so that our data frame is indexed on the timestamp.
#Convert the dataframe index to a datetime index
df['datetime'] = pd.to_datetime(df['date'])
df = df.set_index('datetime')
df.drop(['date'], axis=1, inplace=True)
df.head()
# Example on how to filter data with only day 2.
df[df.index.day == 2]
# Filtering data between two dates
df['2018-01-04':'2018-01-06']
We could take the min, max, average, sum, etc., of the data at a daily frequency instead of an hourly frequency as per the example below where we compute the daily average of the data:
df.resample('D').mean()
# Example on how to get the sum of the last three values.
df['rolling_sum'] = df.rolling(3).sum()
df.head(10)
It only starts having valid values when there are three periods over which to look back.
The following is a good chance to see how we can do forward or backfilling of data when working with missing data values.
df['rolling_sum_backfilled'] = df['rolling_sum'].fillna(method='backfill')
df.head()
It’s often useful to be able to fill your missing data with realistic values such as the average of a time period, but always remember that if you are working with a time series problem and want your data to be realistic, you should not do a backfill of your data.
When working with time series data, you may come across time values that are in Unix time. Unix time, also called Epoch time is the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970.
How to convert epoch time to real time?
epoch_t = 1529272655
real_t = pd.to_datetime(epoch_t, unit='s')
real_t
# Now, let's convert it to Pacific time
real_t.tz_localize('UTC').tz_convert('US/Pacific')
In the following example we will only take in data from a uni-variate time series. That means we are only considering the relationship between the y-axis value the x-axis time points. We’re not considering outside factors that may be effecting the time series.
A common mistake beginners make is they immediately start to apply ARIMA forecasting models to data that has many outside factors.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/electric_production.csv', index_col=0)
data.head()
data.tail()
Our index is actually just a list of strings that look like a date so we need to adjust these to be timestamps, that way our forecasting analysis will be able to interpret these values.
data.index = pd.to_datetime(data.index)
Let's also rename our IPG2211A2N column with a more friendly name.
data.columns = ['Energy Production']
pip install chart_studio cufflinks statsmodels
import cufflinks as cf
import plotly.offline as py
import matplotlib.pyplot as plt
data.plot(title="Energy Production Jan 1985--Jan 2018", figsize=(15,6))
It looks like the trend in these earlier days is slightly increasing at a higher rate than just linear. Experimenting with additive versus multiplicative methods are made easy in just a few lines of code with statsmodels:
from pylab import rcParams
rcParams['figure.figsize'] = 11, 9
import statsmodels.api as sm
decomposition = sm.tsa.seasonal_decompose(data, model='multiplicative')
fig = decomposition.plot()
plt.show()
We can clearly see the seasonal component of the data, and we can also see the separated upward trend of the data. It makes sense to use a Seasonal ARIMA model. In order to do this we will need to choose p,d,q values for the ARIMA, and P,D,Q values for the Seasonal component.
pip install pmdarima
The pyramid-arima library for Python allows us to quickly perform a grid search and even creates a model object that you can fit to the training data.
This library contains an auto_arima function that allows us to set a range of p,d,q,P,D,and Q values and then fit models for all the possible combinations. Then the model will keep the combination that reported back the best AIC value.
from pmdarima.arima import auto_arima
stepwise_model = auto_arima(data, start_p=1, start_q=1,
max_p=3, max_q=3, m=12,
start_P=0, seasonal=True,
d=1, D=1, trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
print(stepwise_model.aic())
# Train test ssplit
train = data.loc['1985-01-01':'2017-12-01']
test = data.loc['2018-01-01':]
# Train the model
stepwise_model.fit(train)
When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics object allows us to quickly generate model diagnostics and investigate for any unusual behavior.
stepwise_model.fit(train).plot_diagnostics(figsize=(15, 12))
plt.show()
This is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.
In the top right plot, we see that the orange KDE line follows closely with the N(0,1) line (where N(0,1)) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.
The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.
Now that the model has been fitted to the training data, we can forecast into the future. Recall that our test data set is from 2018–01–01 all the way until 2022-06-01, so we have 54 periods. That is the value we will use for our .predict() method call:
future_forecast = stepwise_model.predict(n_periods=54)
Let's create a dataframe that contains our future forecast and then concatenating that with the original data.
future_forecast = pd.DataFrame(future_forecast,index = test.index,columns=['Prediction'])
pd.concat([test,future_forecast],axis=1).plot()
Now let's get a larger picture of the context of our prediction in the test set.
pd.concat([data,future_forecast],axis=1).plot()
Now it's your turn to refit our model to our entire data set and then forecast into the real future!
Source:
https://towardsdatascience.com/how-to-forecast-time-series-with-multiple-seasonalities-23c77152347e
https://towardsdatascience.com/time-series-analysis-in-python-an-introduction-70d5a5b1d52a
https://github.com/WillKoehrsen/Data-Analysis/tree/master/additive_models
https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea