Machine learning is, at its core, the study of data. Data takes many forms, such as images, text, or cross-sectional tabular records, and there is a special type associated with time: time series data. A time series is a sequence of data points indexed chronologically. This temporal ordering often makes time series data violate common statistical assumptions, so it requires special treatment when modelling. Time series data has numerous applications, just like any other machine learning task. For example, with time series clustering we can find items that follow similar sales seasonalities, so that we can bundle them together in our recommendation systems. With time series anomaly detection we can locate points that deviate from our data stream, to prepare ourselves for extreme events. I was involved in a time series forecasting project this summer: our task was to estimate the total amount of traffic for different multimedia systems at different times, so that network service providers could switch certain devices on or off to both save energy and handle busy hours. In this post, we will focus on time series forecasting.
Some special terminology is usually associated with time series data. First, stationarity. A time series is stationary when all of its statistical properties are preserved under shifts in time. Technically speaking, the joint distribution of the targets

(y(t+h), y(t+1+h), ..., y(t+n+h)),

where t represents a point in time and h is the shift period, is the same as that of (y(t), y(t+1), ..., y(t+n)) for every shift h; in other words, it does not depend on t. Sounds too good to be true, huh? There is also weak stationarity, which only requires that the series has the same mean at all time points, and that the covariance between the values at any pair of time points, t and t−k, depends only on the lag k and not on the actual time point t. With that being said, non-stationary time series data are those that fail to satisfy even the conditions of weak stationarity. Stationarity detection is an important first step in time series analysis, since stationary data are easier to model. In many cases where the conditions of stationarity cannot be satisfied, we transform the data in the preprocessing step to make it stationary, or at least weakly stationary.
Seasonality is another characteristic commonly observed in time series data. As its name suggests, seasonality describes a regular pattern that recurs at a fixed interval, such as the day of the week or the holidays of the year. One example I encountered was in telecommunication systems: network traffic has a clear pattern repeating daily, where activity starts around 5am, peaks around lunch hour, and calms down after 9pm. There are two major types of seasonality, additive and multiplicative, where the amplitude of the seasonal component either remains constant (additive) or amplifies/diminishes over time (multiplicative). It is important to recognize the presence of seasonality in time series analysis. For example, it can help us with anomaly detection when certain data points do not follow the estimated seasonal pattern.
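To make the additive/multiplicative distinction concrete, here is a small synthetic sketch (made-up numbers, not a real dataset): in the additive case the seasonal swing stays constant, while in the multiplicative case it scales with the trend:

```python
import numpy as np

t = np.arange(4 * 24)                        # four "days" of hourly samples
trend = 10 + 0.5 * t                         # rising baseline
season = np.sin(2 * np.pi * t / 24)          # daily cycle

additive = trend + 5 * season                # constant seasonal amplitude
multiplicative = trend * (1 + 0.3 * season)  # amplitude scales with the trend

# After removing the trend, the additive swing stays fixed while
# the multiplicative swing widens over time.
res_add = additive - trend
res_mul = multiplicative - trend
```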
Other terminology worth covering here includes autocorrelation and trend. Autocorrelation is the correlation between observations at different points in time. It measures the (linear) similarity of the time series with a lagged version of itself at a specified lag, and can be used to measure how much influence the past has on the future. The trend is a pattern that shows whether the time series moves toward relatively higher or lower values over the long term. With these definitions in mind, we are ready to move on to the modelling part.
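As a quick illustration (a hand-rolled sketch; libraries such as statsmodels provide an `acf` function for the same purpose), the lag-k autocorrelation can be computed directly from the definition:

```python
import numpy as np

def autocorr(x, k):
    """Sample autocorrelation of series x at lag k."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Correlate the series with a k-shifted copy of itself,
    # normalized by the overall variance.
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

t = np.arange(240)
daily = np.sin(2 * np.pi * t / 24)  # a perfectly periodic "daily" signal
# Strongly positive one full period apart, strongly negative half a period apart.
print(autocorr(daily, 24), autocorr(daily, 12))
```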
One of the most dominant statistical models for time series data is the ARIMA model. In a nutshell, ARIMA, which stands for autoregressive integrated moving average, consists of three key components:

 The autoregressive (AR) component models the dependence of the current value on its own lagged values.
 The integrated (I) component transforms the original data to make it stationary (e.g., by representing data values as differences from the previous values).
 The moving average (MA) component models the dependence between an observation and the residual errors from past observations.
Each component in the ARIMA model has a parameter with standard notation.

 Parameter p represents the number of lagged observations included in the autoregressive component (the lag order).
 Parameter d represents the degree of differencing in the integrated component, i.e., the number of times the original observations are differenced.
 Parameter q represents the order of the moving average component, i.e., the size of the moving average window.
ARIMA forecasting is achieved by fitting the model to the time series of interest. Many off-the-shelf statistical tools can identify a good setting of the (p, d, q) triple. We can then build an ARIMA model with the identified order using, for example, Python's statsmodels package. The model output gives us the coefficient of each term, which can be interpreted similarly to coefficients in generalized linear regression problems.
Yes, you guessed it, there are also attempts to model time series data with neural networks. The flexibility of neural networks enables us to tackle the time series problem in a multi-input, multi-output manner, by controlling the windowing settings in the preprocessing step. A windowing setting commonly contains three parameters: the input length, the offset step, and the output (label) length. Handling the indices can be tricky when batching windows from the training, validation, and test sets. This notebook provides a very efficient implementation of a window generator and its integration with tf.data.Dataset. After properly designing the inputs and outputs, we can set up different architectures and compare their performance. Performance can be evaluated in terms of both accuracy and efficiency, especially in a production environment. Common accuracy metrics include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Weighted Absolute Percentage Error (WAPE), and their variants. Efficiency can be measured by the time it takes to train the model and the time it takes to run inference.
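A bare-bones sketch of such a window generator (pure NumPy; the parameter names mirror the input length / offset / label length described above and are my own, not from the notebook):

```python
import numpy as np

def make_windows(series, input_len, offset, label_len):
    """Slice a 1-D series into (input, label) training pairs.

    Each input window of length `input_len` is paired with the
    `label_len` values starting `offset` steps after the window ends.
    """
    X, y = [], []
    last_start = len(series) - input_len - offset - label_len + 1
    for start in range(last_start):
        end = start + input_len
        X.append(series[start:end])
        y.append(series[end + offset:end + offset + label_len])
    return np.array(X), np.array(y)

series = np.arange(10)
X, y = make_windows(series, input_len=3, offset=0, label_len=1)
# e.g., the first pair is inputs [0, 1, 2] with label [3]
```

In practice you would build such windows per split (train/validation/test) so that no label leaks across the split boundary.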
Common deep learning architectures for modelling time series data overlap heavily with those for modelling generic sequential data: for example, RNNs, LSTMs, GRUs, ESNs, temporal CNNs, attention-based architectures, and their variants. This recent review gives a very detailed comparison of different architectures across commonly used time series datasets. Depending on your use case, it may offer insights on how to design your architecture in the first place.
Another interesting resource I found when digging into this topic is the Prophet project by Facebook. Similar to the ARIMA model, it is composed of three main components: trend, seasonality, and holidays. However, it uses non-linear functions to model these components. It can be expressed as

y(t) = g(t) + s(t) + h(t) + ε(t),

where ε(t) is an error term and

 y(t) is again our target;
 g(t) is a piecewise linear or logistic function for modeling nonperiodic changes;
 s(t) models the seasonality;
 h(t) models the effects of irregularities such as holidays.
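The additive structure above can be mimicked with a toy synthetic series (a sketch with made-up components, not Prophet's actual fitting procedure):

```python
import numpy as np

t = np.arange(365, dtype=float)          # one year of daily points

g = 100 + 0.1 * t                        # g(t): linear trend (a single piece)
s = 8 * np.sin(2 * np.pi * t / 7)        # s(t): weekly seasonality
h = np.where(t % 90 == 0, 25.0, 0.0)     # h(t): occasional "holiday" spikes
eps = np.random.default_rng(1).normal(scale=2, size=t.size)  # ε(t): noise

y = g + s + h + eps                      # y(t) = g(t) + s(t) + h(t) + ε(t)
```

Prophet's job, roughly, is the inverse: given y(t), recover interpretable estimates of g, s, and h.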
The Prophet model provides interpretable parameters that can be intuitively adjusted by analysts with domain knowledge. The main novelty of Prophet is that it treats the traditional time series task as a flexible curve-fitting problem. This makes model fitting easier and handles missing data and outliers more gracefully. It is available for both Python and R. For more detailed instructions, you can refer to the official documentation and the sample use cases in their GitHub repository.
I hope this blog at least helps you understand what time series data and time series analysis are, and what the common approaches to time series analysis look like. As always, I am a learner just like you, and my understanding is never perfect. If you have any comments or suggestions, feel free to reach out by email, or simply start a discussion in our Facebook group. Looking forward to hearing from you!