Forecasting ETF Prices in R: Regression and Time Series Analysis
In this post, we’ll use R to explore and forecast the adjusted close prices of a financial ETF using historical data. We’ll visualize the trend, apply a linear regression, and forecast future values using ARIMA models, all powered by ggplot2, forecast, and tseries.
1. Load and Prepare Data
# Required packages (install once; comment this line out on later runs)
install.packages(c("ggplot2", "forecast", "tseries", "lubridate", "dplyr"))
library(ggplot2)
library(forecast)
library(tseries)
library(lubridate)
library(dplyr)
# Load data, keep one fund, and sort by date
etf_data <- read.csv("ETF prices.csv")
aaa <- etf_data %>%
  filter(fund_symbol == "AAA") %>%
  mutate(price_date = as.Date(price_date),
         adj_close = as.numeric(adj_close)) %>%
  arrange(price_date)
2. Visualize Historical Prices
ggplot(aaa, aes(x = price_date, y = adj_close)) +
  geom_line(color = "steelblue") +
  labs(title = "Adjusted Close Price for AAA ETF",
       x = "Date", y = "Adjusted Close Price") +
  theme_minimal()
3. Linear Regression on Time Trend
# Dates converted to numbers count days since 1970-01-01
aaa$time_index <- as.numeric(aaa$price_date)
lm_model <- lm(adj_close ~ time_index, data = aaa)
summary(lm_model)
# Plot regression
ggplot(aaa, aes(x = price_date, y = adj_close)) +
  geom_line(color = "grey60") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(title = "Linear Trend in Adjusted Close (AAA)",
       y = "Adjusted Close", x = "Date") +
  theme_minimal()
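Because `time_index` is the numeric form of a Date, the fitted slope is measured in price units per calendar day. The sketch below shows this on synthetic data; the `toy` data frame and its values are made up for illustration and are not taken from the ETF file.

```r
# Synthetic data for illustration only: a price that rises by exactly
# 0.01 per calendar day, starting at 25
toy <- data.frame(
  price_date = seq(as.Date("2020-01-01"), by = "day", length.out = 200)
)
toy$time_index <- as.numeric(toy$price_date)   # days since 1970-01-01
toy$adj_close  <- 25 + 0.01 * (toy$time_index - toy$time_index[1])

toy_fit <- lm(adj_close ~ time_index, data = toy)

# Slope is in price units per calendar day
slope_per_day  <- unname(coef(toy_fit)["time_index"])
# Rough annualised drift, assuming 365 calendar days per year
slope_per_year <- slope_per_day * 365
```

Applied to the real model, the same reading would hold for `coef(lm_model)["time_index"]`.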
4. Time Series Forecast with ARIMA
# Convert to time series
aaa_ts <- ts(aaa$adj_close, frequency = 252) # Approx. trading days/year
# Stationarity test
adf.test(aaa_ts)
# Fit ARIMA model
arima_fit <- auto.arima(aaa_ts)
summary(arima_fit)
# Forecast 30 trading days
forecasted <- forecast(arima_fit, h = 30)
# Plot forecast
autoplot(forecasted) +
  labs(title = "30-Day Forecast for AAA ETF",
       y = "Adjusted Close Price") +
  theme_minimal()
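Before trusting a forecast, it may help to check whether the fitted model's residuals still show autocorrelation. The following is a minimal sketch using only base R's `stats` package on a simulated series; `sim_prices` and `sim_fit` are hypothetical stand-ins, not the ETF data or the `auto.arima()` fit.

```r
set.seed(42)

# Simulated stand-in for a price series: the cumulative sum of AR(1) increments,
# i.e. an ARIMA(1,1,0) process shifted to start near 25
increments <- arima.sim(model = list(ar = 0.5), n = 500)
sim_prices <- 25 + cumsum(as.numeric(increments))

# Fit the matching ARIMA(1,1,0) model
sim_fit <- arima(sim_prices, order = c(1, 1, 0))

# Ljung-Box test on the residuals; fitdf = 1 accounts for the one AR coefficient
lb <- Box.test(residuals(sim_fit), lag = 10, type = "Ljung-Box", fitdf = 1)
lb$p.value   # a large p-value means no evidence of leftover autocorrelation
```

For the model in this post, the same `Box.test()` call could be applied to `residuals(arima_fit)`, with `fitdf` set to the number of estimated AR and MA coefficients.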
5. Conclusion
Through simple regression and ARIMA forecasting, we modeled both the linear trend and the short-term autocorrelation in the AAA ETF's adjusted close prices. The same workflow can be applied to other funds by changing the fund_symbol filter.
Next step: Add technical indicators like moving averages, RSI, and MACD for richer financial modeling.
Mathematical Justification for Using ARIMA in Forecasting
The Autoregressive Integrated Moving Average (ARIMA) model is well suited to forecasting financial time series that are non-stationary and autocorrelated. We justify its suitability for the AAA ETF adjusted close prices in four steps:
1. Stationarity and Differencing
Let the observed price series be denoted as:
\[y_t \in \mathbb{R}, \quad t = 1, 2, \dots, n\]
We assume the raw series is non-stationary, which we test using the Augmented Dickey-Fuller (ADF) test. The null hypothesis is:
\[H_0: \text{The series has a unit root (non-stationary)}\]
Upon rejection of \(H_0\) after differencing, the transformed series:
\[w_t = \nabla^d y_t = (1 - B)^d y_t\]
(with \(B\) the backshift operator, \(B y_t = y_{t-1}\)) is approximately stationary, satisfying the assumptions for ARIMA modeling.
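The differencing operator maps directly onto base R's `diff()`. As a small illustration, assuming a simulated random walk `y` rather than the ETF prices, the sketch below checks that first and second differences match the backshift formulas and that differencing removes most of the lag-1 autocorrelation.

```r
set.seed(1)

# Simulated random walk: y_t = y_{t-1} + e_t (non-stationary by construction)
y <- cumsum(rnorm(300))

# First difference: w_t = (1 - B) y_t = y_t - y_{t-1}
w <- diff(y, differences = 1)

# Second difference: (1 - B)^2 y_t = y_t - 2 y_{t-1} + y_{t-2}
w2 <- diff(y, differences = 2)
w2_manual <- y[3:300] - 2 * y[2:299] + y[1:298]

# Lag-1 sample autocorrelation before and after differencing
acf_y <- acf(y, plot = FALSE)$acf[2]
acf_w <- acf(w, plot = FALSE)$acf[2]
c(levels = acf_y, differenced = acf_w)
```

In the post's workflow, the analogous check would be to run `adf.test(diff(aaa_ts))` alongside the test on the levels.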
2. Model Structure and Flexibility
The general ARIMA(p, d, q) model is defined as:
\[\phi(B)(1 - B)^d y_t = \theta(B)\varepsilon_t\]
where:
\[\begin{aligned} \phi(B) &= 1 - \phi_1 B - \dots - \phi_p B^p \quad \text{(AR polynomial)} \\ \theta(B) &= 1 + \theta_1 B + \dots + \theta_q B^q \quad \text{(MA polynomial)} \\ B &= \text{Backshift operator, where } B y_t = y_{t-1} \\ \varepsilon_t &\sim \mathcal{N}(0, \sigma^2) \quad \text{(White noise)} \end{aligned}\]
This structure captures both short-term autocorrelation (via MA) and momentum or mean-reversion effects (via AR), while accounting for non-stationarity through differencing \(d\).
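To make the notation concrete, the sketch below simulates an ARIMA(1, 1, 1) process with known coefficients and fits the same structure back with base R's `arima()`. The values 0.6 and 0.3 are arbitrary choices for illustration. R's `arima()` uses the same sign convention as the MA polynomial above, so the estimates can be compared with the true values directly.

```r
set.seed(123)

# Simulate (1 - 0.6 B)(1 - B) y_t = (1 + 0.3 B) e_t
sim <- arima.sim(
  model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.3),
  n = 1000
)

# Fit an ARIMA(1,1,1); no mean term is estimated when d = 1
fit <- arima(sim, order = c(1, 1, 1))

# Estimated phi_1 (ar1) and theta_1 (ma1), to compare with 0.6 and 0.3
est <- coef(fit)
est
```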
3. Model Selection and Validation
Let us consider a class of models \(\mathcal{M}\) (e.g., AR, MA, ARMA, ARIMA, Exponential Smoothing). The Akaike Information Criterion (AIC) is given by:
\[\text{AIC}(M) = 2k - 2\log(\hat{L}_M)\]
where:
\[\begin{aligned} k &= \text{Number of estimated parameters in model } M \\ \hat{L}_M &= \text{Maximized likelihood of model } M \end{aligned}\]
Among the ARIMA candidates it searches, auto.arima() selects the model that minimizes an information criterion (by default AICc, a small-sample correction of AIC), balancing fit and parsimony.
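The AIC formula can be reproduced by hand from a fitted model's log-likelihood. The sketch below uses base R's `arima()` on a simulated AR(1) series (an assumption for illustration, not the ETF data) and compares the manual value with R's built-in `AIC()`.

```r
set.seed(7)

# Simulated stationary AR(1) series for illustration
x <- arima.sim(model = list(ar = 0.5), n = 400)

# Two candidate models
fit_ar1    <- arima(x, order = c(1, 0, 0))
fit_arma11 <- arima(x, order = c(1, 0, 1))

# k counts the estimated coefficients (AR term and mean) plus the innovation variance
k <- length(coef(fit_ar1)) + 1

# AIC = 2k - 2 log(L), where logLik() returns the maximized log-likelihood
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit_ar1))

# Compare with R's built-in value, then compare the two candidates
c(manual = aic_manual, builtin = AIC(fit_ar1))
c(AR1 = AIC(fit_ar1), ARMA11 = AIC(fit_arma11))
```

The candidate with the lower AIC would be preferred, which is the comparison `auto.arima()` automates over a larger set of orders.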
4. Theoretical Result: Best Linear Forecast
For a stationary process \(\{X_t\}\), the minimum mean squared error (MMSE) predictor of \(X_{t+h}\) given the past is the conditional expectation:
\[\hat{X}_{t+h|t} = \mathbb{E}[X_{t+h} \mid X_t, X_{t-1}, \dots]\]
If \(\{X_t\}\) is Gaussian, this conditional expectation is linear in the past observations, so it coincides with the best linear predictor. For a Gaussian ARIMA process, the ARIMA forecast therefore equals the MMSE forecast:
\[\hat{y}_{t+h} = \arg\min_{\hat{y}} \mathbb{E}[(y_{t+h} - \hat{y})^2]\]
Thus, under these assumptions (a correctly specified model with Gaussian innovations), ARIMA provides the optimal forecast in terms of mean squared prediction error.
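As a worked illustration of the conditional-expectation forecast, consider the simplest case: a zero-mean stationary AR(1) process \(X_t = \phi X_{t-1} + \varepsilon_t\) with \(|\phi| < 1\). This special case is chosen for illustration and is not the model fitted to the ETF data. Iterating the recursion \(h\) times gives:
\[X_{t+h} = \phi^h X_t + \sum_{j=0}^{h-1} \phi^j \varepsilon_{t+h-j}\]
The future innovations have mean zero and are independent of the past, so the forecast and its error variance are:
\[\begin{aligned} \hat{X}_{t+h|t} &= \mathbb{E}[X_{t+h} \mid X_t, X_{t-1}, \dots] = \phi^h X_t \\ \mathbb{E}\big[(X_{t+h} - \hat{X}_{t+h|t})^2\big] &= \sigma^2 \sum_{j=0}^{h-1} \phi^{2j} = \sigma^2 \, \frac{1 - \phi^{2h}}{1 - \phi^2} \end{aligned}\]
For example, with \(\phi = 0.5\), \(X_t = 2\), and \(h = 2\), the forecast is \(0.5^2 \cdot 2 = 0.5\) and the error variance is \(\sigma^2 (1 + 0.25) = 1.25\,\sigma^2\). The forecast decays toward the mean and the uncertainty grows with the horizon, which matches the widening prediction intervals in a typical ARIMA forecast plot.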
Conclusion
ARIMA is a well-suited model for the AAA ETF dataset because:
- It handles non-stationary financial time series
- It systematically optimizes model fit via AIC
- It aligns with theoretical foundations of MMSE forecasting
- It allows for diagnostics and robustness checks
For practical forecasting in finance, ARIMA offers a mathematically grounded, data-driven, and interpretable solution.
Tags: #RStats #Finance #TimeSeries #Forecasting #ETF