# Load required libraries
library(tseries)
library(tidyverse)
library(timetk) #for lag_vec()
library(stats) #for acf()
library(forecast) # Ljung-box test
rm(list=ls())

1 Definition

정의 1 A time series is said to be stationary if its statistical properties such as mean, variance, and autocorrelation remain constant over time. In other words, a stationary time series does not exhibit any trend, seasonality, or change in statistical properties over time.

time series modeling is the process of converting from non-stationary data into stationary data.

1.1 Stationary Data

시계열적인 특성이 없는 데이터

1.1.1 Constant Mean

A stationary time series has a mean that remains constant over time, which means that the average value of the time series does not change over time. A smoothing line of moving average with a certain window should show a constant trend.

1.1.2 Constant Variance

A stationary time series has a variance that remains constant over time, which means that the variability or spread of the time series data around its mean does not change over time. A moving variance with a certain window should show a constant trend.

1.1.3 Constant Autocorrelation

A stationary time series has autocorrelation that remains constant over time. Autocorrelation refers to the relationship between the values of a time series at different time lags. In a stationary time series, the strength and direction of autocorrelation do not change over time.

1.1.4 Absence of Trend

A stationary time series does not exhibit any trend, which means that there is no systematic upward or downward movement in the mean of the time series over time.

1.1.5 Absence of Seasonality

A stationary time series does not exhibit any seasonality, which means that there are no regular, repeating patterns or cycles in the data over time.

1.1.6 Statistical Properties are Time-Invariant

The statistical properties of a stationary time series, such as mean, variance, and autocorrelation, do not change with time. This property allows for the use of statistical techniques and models that assume constant statistical properties over time.

1.1.7 Example

white noise : no pattern about the time independent variable


# Generate a random time series data
set.seed(123)
ts_data <- rnorm(100)
plot(ts_data)
# Perform ADF test to check for stationarity
adf_test <- adf.test(ts_data)

# Print the results
cat("ADF Test Results:\n")
cat("Test statistic:", adf_test$statistic, "\n")
cat("P-value:", adf_test$p.value, "\n")

# Check if the time series is stationary
if (adf_test$p.value <= 0.05) {
  cat("Conclusion: The time series is stationary.\n")
} else {
  cat("Conclusion: The time series is not stationary.\n")
}

노트

stationary data could have a trend and seasonality, but its period is not constant and easy to be predicted.

# Set random seed for reproducibility
set.seed(123)

# Generate time series data with irregular trend and seasonality
n <- 100
t <- 1:n
trend <- 0.1 * t + 2 * sin(t * 0.05) * rnorm(n)
seasonality <- 2 * sin(t * 0.2 + 1) * rnorm(n)
irregular <- rnorm(n)
ts_data <- trend + seasonality + irregular

# Plot the time series data
ggplot(data = data.frame(t = t, ts_data = ts_data), aes(x = t, y = ts_data)) +
  geom_line() +
  labs(x = "Time", y = "Value", title = "Time Series Data with Irregular Trend and Seasonality")

stationary data with an irregular trend and seasonality using a combination of a linear trend (small coefficient), sinusoidal pattern(for varying amplitude), and random noise (irregular trend and seasonality).

1.2 Non-stationary Data

분석대상으로 시간 축에 대하여 분산(=정보)이 있음

1.2.1 Example

심장 박동 수 : 일정한 주기를 반복해야 건강한 상태

1.3 Conversion Process from Non-tationary to Stationary

1.3.1 Lag

A lag refers to the time interval between observations in a time series. It represents the number of time units (e.g., time periods, days, months) by which a variable is shifted or delayed in time.

정의 2 In time series analysis, a lag refers to the time interval between observations in a time series. Let $Y_t$ denote the value of a variable at time $t$, and $Y_{t-k}$ denote the value of the same variable at time $t$ lagged by $k$ time units. The lagged value $Y_{t-k}$ is defined as:

\[ Y_{t-k} = Y_{t-k} \] where $k$ is the lag.

The lag can be positive, indicating a forward shift in time, or negative, indicating a backward shift in time.

For example, suppose we have a time series of daily temperature data, and we want to examine the relationship between the temperature at a given day and the temperature on the same day one week ago (i.e., $k = 7$). In this case, $Y_t$ represents the temperature at time $t$, and $Y_{t-7}$ represents the temperature on the same day one week ago. By examining the lagged relationship between $Y_t$ and $Y_{t-7}$, we can analyze any patterns or trends in the temperature data over a one-week period.

1.3.1.1 Properties

1.3.1.1.1 Time Shifting

Lags allow for time shifting of a time series variable, where the value of the variable at a given time step is compared to its value at a previous time step. This allows for analyzing the temporal relationship and dependencies between values of a time series over different time intervals.

1.3.1.1.2 Autocorrelation

Lags are used to calculate autocorrelation, which is the correlation between a time series variable and its lagged values. Autocorrelation helps in understanding the persistence or pattern of the variable over time, and can be used to detect seasonality, trends, or other patterns in the data.

1.3.1.1.3 Trend Analysis

Lags can be used to analyze trends in time series data. By comparing a time series variable with its lagged values, trends can be identified and analyzed to understand the direction and magnitude of changes in the variable over time.

1.3.1.1.4 Seasonality Detection

Lags can be used to detect seasonality in time series data. By analyzing the relationship between a time series variable and its lagged values, patterns that repeat at regular intervals (e.g., daily, monthly, yearly) can be identified, indicating seasonality in the data.

1.3.1.1.5 Forecasting

Lags are used in time series forecasting models to make predictions about future values of a time series variable. By using lagged values of the variable as predictors, forecasting models can capture the historical patterns and trends in the data to make future predictions.

1.3.1.1.6 Data Transformation

Lags can be used to transform time series data into a different format, such as creating lagged variables or lagged differences, which can be used in various statistical techniques for analysis, modeling, and forecasting of time series data.

# Create an example time series data
set.seed(123)

n <- 100 # Number of observations
t <- 1:n
trend <- 0.5 * t # Linear trend component
seasonality <- 10 * sin(2 * pi * t/12) # Seasonal component
error <- rnorm(n, mean = 0, sd = 5) # Error component
ts_data <- trend + seasonality + error # Combine components to create time series data

# Create a data frame with lagged variables
lagged <- data.frame(
  Value = ts_data)%>%
  mutate(
  Lag1 = lag_vec(ts_data, lag = 1),
  Lag2 = lag_vec(ts_data, lag = 6),
  Lag3 = lag_vec(ts_data, lag = 12),
  n=1:n()
)
lagged_data<-lagged%>%
gather(key=lag,value=value,Value:Lag3)

knitr::kable(lagged%>%head(20))

# Plot the time series data and its lagged variables
ggplot(lagged_data, aes(x = n,y = value,color=lag)) +
  geom_line( size = 1.5, linetype = "solid")+
  labs(title = "Time Series Data and Lagged Variables",
       x = "Time Step", y = "Value") +
  scale_color_manual(values=c('darkred','darkgreen','darkblue','black'))+
  theme(legend.position = "right")

1.3.2 Difference

Differencing is a common technique used in time series analysis to transform a non-stationary time series into a stationary time series. It involves computing the difference between consecutive observations in the time series to remove trends or seasonality, and create a stationary time series that can be easier to analyze and model.

정의 3 The differenced time series $Y_t$ of an original time series $X_t$ of order $d$ can be defined as:

$Y_t = X_t - X_{t-d}$

where $X_t$ is the original time series value at time $t$, $ X_{t-d}$ is the original time series value at time $t-d$, and $d$ is the order of differencing.

The difference in data at a specific time interval could be used to represent time interval data such as year-on-year growth and month-on-month growth


# Create a data frame with lagged variables
lagged <- data.frame(
  Value = ts_data)%>%
  mutate(
  Lag1 = lag_vec(ts_data, lag = 1),
  Lag2 = lag_vec(ts_data, lag = 6),
  Lag3 = lag_vec(ts_data, lag = 12),
  n=1:n(),
  diff1= c(diff_vec(Value,lag=1)),
  diff2= c(diff_vec(Value,lag=6)),
  diff3= c(diff_vec(Value,lag=12)))%>%
  dplyr::select(n,everything())

knitr::kable(lagged%>%round(3)%>%head(20))

lagged_data<-lagged%>%
gather(key=process,value=value,Value:diff3)

# Plot the time series data and its lagged variables

ggplot(lagged_data%>%filter(!grepl('Lag',process)), aes(x = n,y = value,color=process)) +
  geom_line(size =1, linetype = "solid")+
  labs(title = "Time Series Data and Lagged Variables",
       x = "Time Step", y = "Value") +
  scale_color_manual(values=c('red','green','blue','black'))+
  theme(legend.position = "right")

1.4 ACF

ACF stands for autocorrelation function. ACF is a statistical tool used in time series analysis to measure the correlation between a time series and its lagged values. It helps to identify the presence of autocorrelation, which is the tendency of a time series to exhibit similar patterns or trends at different time points.

정의 4 The autocorrelation function $ (k) $ of a time series $ X_t $ at lag $ k $ can be defined as:

\[ \rho(k) = \frac{\text{Cov}(X_t, X_{t-k})}{\sqrt{\text{Var}(X_t) \cdot \text{Var}(X_{t-k})}} \]

where $X_t$ is the value of the time series at time $t$, $X_{t-k}$ is the value of the time series at time $t-k$, and $\text{Cov}(X_t, X_{t-k})$ and $\text{Var}(X_t)$ are the covariance and variance of the time series, respectively.

We can calculate the ACF of this time series to check for autocorrelation using R. Here’s an example code:

# Calculate autocorrelation function
acf_sales<-acf(ts_data)
acf_sales

1.4.1 ACF Interpretation

The lag on the x-axis represents the time lag between the current observation and the lagged observation for which the autocorrelation coefficient is calculated. Lags closer to 0 represent autocorrelation between neighboring observations, while larger lags represent autocorrelation between more distant observations.
The height of the autocorrelation coefficients on the y-axis indicates the strength of autocorrelation at different lags. Larger coefficients indicate stronger autocorrelation, while smaller coefficients indicate weaker autocorrelation.
- The ACF of the lag $1$ on the $x$ axis means the autocorrelation between the original data, Value and the Value lagged by $k = 1$, and the ACF of the lag $6$ on the $x$ axis means the autocorrelation between the original data, Value and the Value lagged by $k = 6$
The sign of the autocorrelation coefficient indicates the direction of autocorrelation. Positive coefficients indicate values tend to be similar at neighboring lags, while negative coefficients indicate values tend to be dissimilar at neighboring lags.
The horizontal dashed lines on the ACF plot represent the confidence intervals. Autocorrelation coefficients that fall outside these confidence intervals are considered statistically significant, indicating a high likelihood that the observed autocorrelation is not due to random chance.
- the line means the confidence interval = $\left[ \text{ACF}(k) \pm \frac{z_{\alpha/2}}{\sqrt{n}}\right]$ where $\operatorname{ACF}(k)$ is the autocorrelation coefficient at lag $k$, $z_{\alpha/2}$ is the critical value from the standard normal distribution for the desired confidence level (e.g., 1.96 for a 95% confidence level), $n$ is the sample size.
The pattern of autocorrelation coefficients can provide insights into the presence of trend, seasonality, or other underlying patterns in the time series data. For instance, a repeating pattern of positive and negative autocorrelation coefficients may indicate the presence of seasonality, while a gradual decline in autocorrelation coefficients may indicate the presence of a trend. For our example, we can see the decline and increasing pattern both in ACF and the original plot.

1.4.2 ACF’s Weakness

ACF has a weakness in that it can sometimes show spurious correlations due to the effect of earlier lags. This is known as the chain reaction or spillover effect. For example, if a time series has a strong autocorrelation at lag 1, it can cause subsequent lags to also exhibit autocorrelation, even if there is no true underlying relationship. To address this issue, the Partial Autocorrelation Function (PACF) was developed.

1.5 PACF

Partial Autocorrelation Function stands for PACF.

정의 5 The PACF at lag $k$, denoted as $\operatorname{PACF}(k)$, is defined as the autocorrelation between the original time series and its lagged values, with the effects of all shorter lags removed.

\[ \begin{align*} \text{PACF}(k) &= \phi_{kk} \\ &= \frac{\text{cov}(Y_t, Y_{t-k} | Y_{t-1}, Y_{t-2}, \ldots, Y_{1})}{\sqrt{\text{var}(Y_t | Y_{t-1}, Y_{t-2}, \ldots, Y_{1}) \cdot \text{var}(Y_{t-k} | Y_{t-1}, Y_{t-2}, \ldots, Y_{1})}} \end{align*} \]

where $\phi_{kk}$ represents the partial autocorrelation coefficient at lag $k$.

PACF measures the autocorrelation between the residuals of a time series after removing the effects of shorter lags. It provides a more direct measure of the linear relationship between the time series at a specific lag, while accounting for the effects of earlier lags. PACF helps to isolate the direct impact of a particular lag on the time series, without the spillover effect from earlier lags.

PACF in time series analysis measures the correlation between a time series value at a specific lag (denoted as $k$) and its lagged value at a previous time step (denoted as $t−k$), after removing the linear dependence on the intermediate lags ($1,2,…,k−1$).

PACF is useful in time series analysis for identifying the order of an autoregressive (AR) model, which is a common type of time series model. AR models use past values of the time series to predict future values. The PACF plot can help identify the significant lags that contribute to the prediction of the time series, and thus aid in model selection and forecasting accuracy.

노트

ACF(1) is equaivalent to PACF(1) because there is no in-between lags and chain reaction at lag $k = 1$

pacf_result <- stats::pacf(ts_data)

1.5.1 PACF Interpretation

A significant positive value at lag k in the PACF plot indicates a strong positive linear relationship between the value at lag k and the current value of the time series. This suggests that the value at lag k is an important predictor for the current value.
A significant negative value at lag k in the PACF plot indicates a strong negative linear relationship between the value at lag k and the current value of the time series. This suggests that the value at lag k is an important predictor for the current value, but with an inverse relationship compared to positive values.
Non-significant values close to zero in the PACF plot indicate that there is little or no autocorrelation at those lags. This suggests that the value at those lags does not significantly impact the current value of the time series.

1.6 Time Series Model

The fitted values, denoted as $\hat{Y}_t$, for a time series model are obtained by applying the estimated model parameters to the observed data points $Y_t$ up to time $t$, using the estimated model equations.

\[ \hat{Y}_t = \hat{\alpha} + \hat{\beta}_1 X_{1,t} + \hat{\beta}_2 X_{2,t} + \ldots + \hat{\beta}_p X_{p,t} \]

where $\hat{\alpha}$ is the estimated intercept term, $\hat{\beta}_1, \ldots, \hat{\beta}_p$ are the estimated coefficients for the explanatory variables $X_1, X_2, \ldots, X_p$ respectively, and $X_{1,t}, X_{2,t}, \ldots, X_{p,t}$ are the observed values of the explanatory variables at time $t$.

1.6.1 Fitted Value

model<-lm(Value~n,data=lagged)
fitted_data<-lagged%>%
  mutate(fit=fitted(model),
  residual=residuals(model))%>%
  gather(key=fit_output,value=output,c(Value,fit,residual))

ggplot(data=fitted_data,aes(x=n,y=output,color=fit_output))+
geom_line()+
geom_point()

As you can see the residual pattern is the same as the original data, Value when you fit the data with a linear regression. If you use the time index as the explanatory variable in a linear regression to fit a time series model, and the residuals will capture the deviations of the observed values from this linear trend. As a result, the residuals will exhibit the same pattern, since they represent the discrepancies between the observed values and the linear trend estimated from the row numbers. It’s important to note that using row numbers as the explanatory variable in a linear regression for time series analysis may not always be meaningful, as it does not take into account any underlying patterns, trends, or seasonality present in the data. It’s generally recommended to use appropriate time-related variables or other relevant explanatory variables in time series modeling to capture the inherent dynamics of the data.

1.7 White Noise

정의 6 White noise is a type of time series data that is characterized by random, uncorrelated, and identically distributed (i.i.d.) values. \[ Y_t \sim \operatorname{WN}(0,\sigma^2) \]

where

\[ \begin{align*} X_t & : \text{The value of the white noise at time t} \\ \text{WN} & : \text{Indicates that the data follows a white noise process} \\ 0 & : \text{The mean of the white noise process, which is typically assumed to be 0} \\ \sigma^2 & : \text{The variance of the white noise process, which determines the spread of the random values} \end{align*} \]

It is often used as a reference or benchmark series to compare against other time series data for identifying patterns or structures.

1.7.1 Properties

Randomness: White noise is a series of random values that are not predictable or follow any pattern.
Independence: The values in a white noise series are uncorrelated, meaning that the value at any time point does not depend on the values at other time points.
Identically Distributed: The values in a white noise series are drawn from the same distribution, typically assumed to have a constant mean and variance.
Constant Mean: The mean of a white noise series is typically assumed to be constant and equal to zero, although this can be adjusted to a different value if necessary.
Constant Variance: The variance of a white noise series is typically assumed to be constant, meaning that the spread of the values remains the same over time.
No Autocorrelation: White noise series have no autocorrelation, meaning that the correlation between the values at different time points is close to zero.
No Trend: White noise series do not exhibit any trend or pattern over time, as the values are purely random and do not follow any systematic behavior.
Useful as a Benchmark: White noise series are often used as a benchmark or reference to compare against other time series data for identifying patterns or structures.

1.7.2 Ljung-box Test

The Ljung-Box test is a statistical test used to assess whether a time series data exhibits significant autocorrelation at different lags. The null hypothesis ($H_0$) of the Ljung-Box test is that there is no autocorrelation in the time series data up to a certain lag, while the alternative hypothesis ($H_a$) is that there is significant autocorrelation present. It is used to assess the goodness-of-fit of a model by testing whether the autocorrelation coefficients of the residuals (or errors) of the model are significantly different from zero.

\[ Q(m) = n(n+2) \sum_{k=1}^{m} \frac{\hat{\rho}_k^2}{n-k} \sim \chi^2_{1-\alpha,h} \]

where: - $Q(m)$ is the Ljung-Box test statistic for a given maximum lag $m$ - $n$ is the sample size of the time series data - $\hat{\rho}_k$ is the sample autocorrelation at lag $k$ - Under $H_0$, $Q(m)$ assymptotically follows a $\chi^2_{1-\alpha,h}$ - $h$ is the number of lags being tested

Suppose we have a time series data vector $x$ of length $n$, and we want to perform a Ljung-Box test up to a maximum lag of $m$.

The Q statistic follows a chi-squared distribution with degrees of freedom equal to the number of autocorrelation coefficients being tested. The p-value associated with the Q statistic can be compared to a chosen significance level (e.g., 0.05) to determine if the residuals exhibit significant autocorrelation. If the p-value is below the chosen significance level, it suggests that the model may have inadequate fit and that there may be remaining autocorrelation in the residuals.

힌트

The Box-Pierce test is a modified version of the Ljung-Box test, \[ Q(m) = n \sum_{k=1}^{m} \hat{\rho}_k^2 \]

Ljung-Box test incorporates the sample size $n$ in the denominator. This makes the Ljung-Box test more appropriate for small sample sizes, while the Box-Pierce test is suitable for larger sample sizes.

1.7.3 Example

A classic example of white noise is a series of random numbers generated from a standard normal distribution, where each value in the series is independent and identically distributed with mean 0 and variance 1.

# Generate white noise series
set.seed(123)
n <- 100 # Number of observations
wn <- rnorm(n, mean = 0, sd = 1) # Generate random values from standard normal distribution

# Plot white noise series
wn_data <- data.frame(Time = 1:n, Value = wn)
ggplot(wn_data, aes(x = Time, y = Value)) +
  geom_line() +
  labs(title = "White Noise Series", x = "Time", y = "Value")

result<-acf(wn)
plot(result)
checkresiduals(wn)

1.8 Time Series Decomposition

The characteristics of time series data has trend, seasonality, and autocorrelation. To check if autocorrelation exists in data, we used ACF and PACF. Then, how to check the trend and seasonality characteristics? Time series decomposition can be used to check them.

The observed time series $y_t$ can be decomposed into four components: the trend component $T_t$, the seasonal component $S_t$, the cyclical component $C_t$, and the remainder or error component $E_t$. This can be expressed as:

\[ \begin{equation} y_t = T_t + S_t + C_t + E_t, \quad \text{where} \quad E_t \sim \text{WN}(0,\sigma^2), \end{equation} \]

where $T_t$ is the trend component, $S_t$ is the seasonal component, $C_t$ is the cyclical component, and $E_t$ is the error term, which is a random variable with a white noise distribution with mean 0 and variance $\sigma^2$.

The trend component $T_t$ represents the long-term behavior of the time series, and can be either deterministic or stochastic. The seasonal component $S_t$ represents the regular and repeated variations that occur within a single year or other fixed time period. The cyclical component $C_t$ represents the irregular variations in the time series that do not follow a fixed pattern.

힌트

Deterministic refers to a situation where the outcome is completely determined by the initial conditions, and there is no randomness involved. For example, if you drop a ball from a certain height, the time it takes to hit the ground can be determined exactly based on the initial height and the acceleration due to gravity. This is a deterministic process because there is no randomness involved.

Stochastic, on the other hand, refers to a situation where the outcome is uncertain and subject to randomness. For example, if you roll a fair six-sided die, you cannot predict with certainty what number will come up. The outcome is determined by chance, and is therefore stochastic.

There are two ways of decomposing time series data: additive decomposition and multiplicative decomposition This additive deomoposition method is suitable for the time series data with a constant variation of seasonality according the trend, while multiplicative decomposition is suitable for the time series data with a inconsistent variation of seasonality according the trend. The variation magnitude could increase or decrease.

1.8.1 Additive Decomposition

A common method for decomposing time series is the additive decomposition, which can be expressed as: \[ \begin{equation} y_t = T_t + S_t + E_t. \end{equation} \]

The trend component $T_t$ is estimated by smoothing the data using techniques such as moving averages or exponential smoothing. The seasonal component $S_t$ is estimated by computing the seasonal indices, which are ratios of the observed values to the estimated trend component. The cyclical component $C_t$ is usually not explicitly estimated, but can be identified as the fluctuations that remain after removing the trend and seasonal components.

For example, let $X_t$ be a quarterly time series of sales data for a company, with $t$ ranging from 1 to 20. We can decompose the time series using an additive model:

\[ \begin{equation} X_t = T_t + S_t + E_t, \end{equation} \]

where $T_t$ is the trend component, $S_t$ is the seasonal component, and $E_t$ is the error term. The trend component can be estimated using a 3-period moving average: \[ \begin{equation} T_t = \frac{1}{3} (X_{t-1} + X_t + X_{t+1}), \end{equation} \]

where $t=2,3,\ldots,19$. The seasonal component can be estimated by computing the average value of $X_t$ for each quarter:

\[ \begin{equation} S_t = \frac{1}{5} \sum_{i=0}^3 X_{t+4i}. \end{equation} \]

Finally, the error term $E_t$ can be computed as:

\[ \begin{equation} E_t = X_t - T_t - S_t. \end{equation} \]

This decomposition allows us to separate the long-term trend, seasonal patterns, and irregular fluctuations in the sales data, and can help us identify any underlying patterns or trends that may be present in the data.

1.9 Go to Blog Content List

Blog Content List