TIME SERIES FORECASTING AND ANALYSIS: ARIMA AND SEASONAL-ARIMA

Subham Sarkar · Published in Analytics Vidhya · Apr 4, 2020

Time series is different from more traditional classification and regression predictive modeling problems.

The temporal nature adds an order to the observations. This imposed order means that important assumptions about the consistency of those observations need to be handled specifically.

The ability to make predictions based upon historical observations creates a competitive advantage. For example, if an organization has the capacity to better forecast the sales quantities of a product, it will be in a more favourable position to optimize inventory levels. This can result in increased liquidity of the organization's cash reserves, a decrease in working capital, and improved customer satisfaction by decreasing the backlog of orders.

In the domain of machine learning, there's a specific collection of methods and techniques particularly well suited for predicting the value of a dependent variable over time. In this article, we'll cover the AutoRegressive Integrated Moving Average (ARIMA) model.

We refer to a series of data points indexed (or graphed) in time order as a time series. A time series can be broken down into 3 components.

  • Trend: Upward & downward movement of the data over a long period of time (e.g. house price appreciation)
  • Seasonality: Seasonal variance (e.g. an increase in demand for ice cream during summer)
  • Noise: Spikes & troughs at random intervals

Types of Time Series:

Stationary Time Series:

  • The observations in a stationary time series are not dependent on time.
  • Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series are consistent over time, like the mean or the variance of the observations.
  • When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.
[Figure: a stationary time series]

Non-Stationary Time Series:

  • Observations from a non-stationary time series show seasonal effects, trends, and other structures that depend on the time index.
  • Summary statistics like the mean and variance do change over time, resulting in drift in the concepts a model may try to capture.
  • Classical time series analysis and forecasting methods are concerned with making non-stationary time series data stationary by identifying and removing trends and removing seasonal effects.
[Figure: a non-stationary time series]

Before applying any statistical model on a time series, we want to ensure it’s stationary.

For a time series to be stationary it has to satisfy the below conditions:

  • The mean of the series should not be a function of time. The red graph below is not stationary because the mean increases over time.
  • The variance of the series should not be a function of time. This property is also known as homoscedasticity. The red graph below is not stationary because of the varying spread of data over time.
  • Finally, the covariance of the i-th term and the (i+m)-th term should not be a function of time, i.e. it should only be a function of the gap m. In the red graph you can notice that the spread becomes narrower as time increases, hence the covariance is not constant over time.

Let's now discuss ARIMA models:

AutoRegressive Model (AR)

  • Autoregressive models operate under the premise that past values have an effect on current values. AR models are commonly used in analyzing nature, economics, and other time-varying processes. As long as this assumption holds, we can build a linear regression model that attempts to predict the value of the dependent variable today from the values it took on previous days. The parameter p specifies how many lagged observations are taken in.

The order of the AR model corresponds to the number of lagged days incorporated in the formula, sketched below.
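For reference, a sketch of the standard AR(p) form (the original article showed this formula as an image; here c is a constant, φ1, …, φp are the model coefficients and εt is a white noise error term):

Xt = c + φ1Xt−1 + φ2Xt−2 + … + φpXt−p + εt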

Integrated (I):

  • A model that uses the differencing of raw observations (e.g. subtracting an observation from the value at the previous time step). Differencing in statistics is a transformation applied to time-series data in order to make it stationary: it ensures the properties do not depend on the time of observation, eliminating trend and seasonality and stabilizing the mean of the time series.
  • For example, first-order differencing addresses linear trends and employs the transformation zi = yi − yi-1. Second-order differencing addresses quadratic trends and employs a first-order difference of a first-order difference, namely zi = (yi − yi-1) − (yi-1 − yi-2), and so on. A quick numeric sketch follows this list.
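A quick numeric sketch of both orders of differencing (the values are made up for illustration):

```python
import numpy as np

y = np.array([2, 4, 7, 11, 16])  # series with a quadratic trend
z1 = np.diff(y)                   # first-order difference  -> [2, 3, 4, 5]
z2 = np.diff(y, n=2)              # second-order difference -> [1, 1, 1] (constant: trend removed)
print(z1, z2)
```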

Moving Average Model (MA)

  • Assumes the value of the dependent variable on the current day depends on the previous days' error terms. The formula can be expressed as:

Xt = μ + εt + θ1εt−1 + θ2εt−2 + … + θqεt−q

where μ is the mean of the series, θ1, …, θq are the parameters of the model and εt, εt−1, …, εt−q are white noise error terms. The value of q is called the order of the MA model.

AutoRegressive Integrated Moving Average (ARIMA):

The ARIMA (aka Box-Jenkins) model adds differencing to an ARMA model. Differencing subtracts the current value from the previous one and can be used to transform a time series into a stationary one, exactly as described for the Integrated component above.

Three integers (p, d, q) are typically used to parametrize ARIMA models.

  • p: number of autoregressive terms (AR order)
  • d: number of nonseasonal differences (differencing order)
  • q: number of moving-average terms (MA order)

Let's walk through some code and understand time series and ARIMA models better:

The general process for ARIMA models is the following:

  • Visualise the Time Series Data
  • Make the time series data stationary
  • Plot the Correlation and AutoCorrelation Charts
  • Construct the ARIMA Model or Seasonal ARIMA based on the data
  • Use the model to make predictions

Let’s go through these steps!

  • Loading the data into a DataFrame.
  • Cleaning up the NaN entries in the Sales column.
  • Converting the Month column to a proper Python date-time format, since it is not one initially.
  • Visualising the data. (These steps are sketched in the code below.)
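A minimal sketch of these preprocessing steps, assuming a CSV with Month and Sales columns (the filename is an assumption; the actual dataset is in the GitHub repo linked at the end):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the monthly sales data (filename is an assumption)
df = pd.read_csv('sales_data.csv')

# Drop rows where the Sales column is NaN
df = df.dropna(subset=['Sales'])

# Convert the Month column to datetime and use it as the index
df['Month'] = pd.to_datetime(df['Month'])
df = df.set_index('Month')

# Visualise the series
df['Sales'].plot(figsize=(12, 4), title='Monthly Sales')
plt.show()
```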

As mentioned previously, before we can build a model, we must ensure that the time series is stationary. There are two primary ways to determine whether a given time series is stationary:

  • Rolling Statistics: Plot the rolling mean and rolling standard deviation. The time series is stationary if they remain constant over time (to the naked eye, check whether the lines are roughly straight and parallel to the x-axis).
  • Augmented Dickey-Fuller Test: The null hypothesis of the ADF test is that the time series is non-stationary. The series is considered stationary if the p-value is below the significance level (typically 0.05) and the ADF statistic is more extreme than the critical values at the 1%, 5% and 10% levels.
  • In our case, the ADF statistic is far from the critical values and the p-value is greater than the 0.05 threshold, so the null hypothesis is accepted and we conclude that the time series is not stationary. Both checks are sketched below.
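A sketch of both checks with pandas and statsmodels (the 12-month rolling window is an assumption suited to monthly data):

```python
from statsmodels.tsa.stattools import adfuller

# Rolling statistics: for a stationary series these should stay roughly flat
rolmean = df['Sales'].rolling(window=12).mean()
rolstd = df['Sales'].rolling(window=12).std()
df['Sales'].plot(label='Original', figsize=(12, 4))
rolmean.plot(label='Rolling Mean')
rolstd.plot(label='Rolling Std')
plt.legend()
plt.show()

# Augmented Dickey-Fuller test
adf_stat, p_value, _, _, critical_values, _ = adfuller(df['Sales'])
print(f'ADF Statistic: {adf_stat:.4f}')
print(f'p-value: {p_value:.4f}')
for level, value in critical_values.items():
    print(f'Critical value ({level}): {value:.4f}')
```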

To make it stationary we will use differencing:

  • Differencing is a transformation applied to time-series data in order to make it stationary: it ensures the properties do not depend on the time of observation, eliminating trend and seasonality and stabilizing the mean of the time series.
  • Let's perform the Augmented Dickey-Fuller test again to check whether the series is stationary post differencing (a sketch follows this list).
  • Now we can see that the p-value is less than the significance level of 0.05. Thus the null hypothesis is rejected and the time series is stationary.
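A sketch of the differencing step; since the data is monthly with a yearly pattern, a 12-period (seasonal) difference is a natural choice (which difference the original used is an assumption):

```python
from statsmodels.tsa.stattools import adfuller

# First-order difference and seasonal (12-month) difference
df['Sales First Difference'] = df['Sales'] - df['Sales'].shift(1)
df['Seasonal First Difference'] = df['Sales'] - df['Sales'].shift(12)

# shift() introduces NaNs at the start, so drop them before re-testing
adf_stat, p_value, *_ = adfuller(df['Seasonal First Difference'].dropna())
print(f'ADF Statistic: {adf_stat:.4f}, p-value: {p_value:.4f}')

df['Seasonal First Difference'].plot(figsize=(12, 4))
plt.show()
```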

Autocorrelation function plot (ACF):

  • Autocorrelation refers to how correlated a time series is with its past values, and the ACF is the plot used to see the correlation between the points up to and including the chosen lag. In an ACF plot, the number of lags is shown on the x-axis and the correlation coefficient on the y-axis.
  • The Autocorrelation function plot will let you know how the given time series is correlated with itself.
  • Identification of an MA model is often best done with the ACF rather than the PACF.
  • For an MA model, the theoretical PACF does not shut off, but instead tapers toward 0 in some manner. A clearer pattern for an MA model is in the ACF. The ACF will have non-zero autocorrelations only at lags involved in the model.

Partial Auto Correlation Function (PACF)

  • PACF is closely related to the ACF: it expresses the correlation between observations made at two points in time while removing the influence of the observations in between. We can use the PACF to determine the optimal number of terms to use in the AR model; the number of terms determines the order of the model.
  • Identification of an AR model is often best done with the PACF.
  • For an AR model, the theoretical PACF “shuts off” past the order of the model. The phrase “shuts off” means that in theory the partial autocorrelations are equal to 0 beyond that point. Put another way, the number of non-zero partial autocorrelations gives the order of the AR model. By the “order of the model” we mean the most extreme lag of x that is used as a predictor.

Let's plot the ACF and PACF on our dataset:
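A sketch using statsmodels' plotting helpers, applied to the seasonally differenced series (the choice of series and the 40 lags are assumptions):

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(df['Seasonal First Difference'].dropna(), lags=40, ax=axes[0])
plot_pacf(df['Seasonal First Difference'].dropna(), lags=40, ax=axes[1])
plt.show()
```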

Let's apply ARIMA for predictions:
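A minimal fit with statsmodels (the (1, 1, 1) order is an assumption; in practice read p off the PACF and q off the ACF):

```python
from statsmodels.tsa.arima.model import ARIMA

# Fit a non-seasonal ARIMA(p, d, q) on the raw series
model = ARIMA(df['Sales'], order=(1, 1, 1))
results = model.fit()
print(results.summary())
```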

Let's view the forecasting using ARIMA:
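Plotting in-sample predictions against the actuals; the start/end positions are assumptions and should cover the last stretch of your series:

```python
# dynamic=True uses the model's own predictions as lagged inputs
df['forecast'] = results.predict(start=90, end=103, dynamic=True)
df[['Sales', 'forecast']].plot(figsize=(12, 5))
plt.show()
```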

Observation :

  • Here we can see that the forecasting is not good using ARIMA, since the time series exhibits seasonality.

So, now we will implement Seasonal-ARIMA

Seasonal-ARIMA(SARIMA):

  • As the name suggests, this model is used when the time series exhibits seasonality. It is similar to an ARIMA model; we just have to add a few parameters to account for the seasons.

We write SARIMA as ARIMA(p, d, q)(P, D, Q)m, where:

  • p: the number of autoregressive terms
  • d: the degree of differencing
  • q: the number of moving-average terms
  • m: the number of periods in each season
  • (P, D, Q): the (p, d, q) orders for the seasonal part of the time series

Seasonal differencing takes the seasons into account and differences the current value against its value in the previous season, e.g. the difference for May 2018 would be its value minus the value in May 2017. A sketch of a SARIMA fit follows below.
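A sketch of a seasonal fit using statsmodels' SARIMAX (all orders are assumptions to be tuned; m = 12 for monthly data with a yearly season):

```python
import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(df['Sales'],
                                  order=(1, 1, 1),
                                  seasonal_order=(1, 1, 1, 12))
results = model.fit()

# Compare in-sample predictions with the actuals (start/end are assumptions)
df['forecast'] = results.predict(start=90, end=103, dynamic=True)
df[['Sales', 'forecast']].plot(figsize=(12, 5))
plt.show()
```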

Observation :

  • Here, we can see that forecasting using SARIMA gave awesome results since the data exhibited seasonality.

Now, let's add some data points to our original dataset synthetically, so that forecasting into the future can be observed.

Let's plot the future predictions:
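One common way to do this is to append empty rows carrying future monthly timestamps and then predict over them; the 24-month horizon is an assumption, and the index is assumed to have a regular monthly frequency so the forecast aligns by date:

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Append 24 future monthly timestamps with no Sales values
future_dates = [df.index[-1] + DateOffset(months=m) for m in range(1, 25)]
future_df = pd.DataFrame(index=future_dates, columns=df.columns)
full_df = pd.concat([df, future_df])

# Forecast over the appended period using the fitted SARIMA model
full_df['forecast'] = results.predict(start=len(df), end=len(full_df) - 1, dynamic=True)
full_df[['Sales', 'forecast']].plot(figsize=(12, 5))
plt.show()
```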

Observation:

  • This gave beautiful results.

Thanks for reading this blog. If you liked it please clap, follow and share.

Where can you find my code?

Github : https://github.com/SubhamIO/TimeSeriesForecasting-ARIMA-SARIMA

Conclusion :

What we learnt?

  • The importance of time series data being stationary for use with statistical modeling methods and even some modern machine learning methods.
  • How to use line plots and basic summary statistics to check if a time series is stationary.
  • How to calculate and interpret statistical significance tests to check if a time series is stationary.
  • How to use ARIMA and SARIMA models for forecasting.
  • In the domain of machine learning, there is a collection of techniques for manipulating and interpreting variables that depend on time. Among these is ARIMA, which can remove the trend component in order to accurately predict future values.
