Netflix Stock Price Prediction & Forecasting using Machine Learning in R

The stock market attracts a lot of attention because it offers high potential returns along with high risk. In simple terms, a “stock” is ownership of a small part of a company: the more shares you hold, the larger your ownership. Stock price prediction with machine learning aims to forecast the future value of a company’s stock. Because stock prices are dynamic and volatile, driven by many factors, predicting them accurately is challenging.

Table of Contents

  • DataSet Used for Netflix Stock Price Prediction
  • Model Used for Netflix Stock Price Prediction
  • How to Predict Netflix Stock Price using Machine Learning in R
    • Step 1: Importing the required libraries
    • Step 2: Loading the Netflix Stock Price Dataset
    • Step 3: Checking the dimension and missing values of our data
    • Step 4: Taking the summary of the data
    • Step 5: Plotting the data
    • Step 6: Model building
    • Step 7: Model Fitting
  • Executing and Checking the Model Summary
    • Checking Accuracy of Netflix Stock Price Prediction Model
    • Performance Comparison on Netflix Stock Price Prediction Model on Training vs Test Data Set
  • Predict Netflix Stock Price
    • Calculate Test accuracy score

DataSet Used for Netflix Stock Price Prediction

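For this R Machine Learning Project, we use Netflix (NFLX) stock price data requested for the period “2002-01-01” to “2024-01-01”. Because Netflix began trading in May 2002, the data actually runs from 2002-05-23 to 2023-12-29. The data can be obtained from either of the sources below: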

  1. Finance websites (such as Yahoo Finance)
    • To import this dataset, we can use the external package “quantmod” and fetch the required data with the getSymbols() function.
  2. CSV file containing Netflix stock price data (NFLX.csv)

Model Used for Netflix Stock Price Prediction

We use only the closing price of the Netflix stock and model it with an ARIMA(p, d, q) model, where p is the autoregressive order, d is the degree of differencing, and q is the moving-average order.
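For reference, the general ARIMA(p, d, q) model can be written with the backshift operator B, where B y_t = y_{t-1}. The notation below follows a common textbook convention and is an assumption about notation, not something fixed by this article: y_t is the closing price, ε_t is a white-noise error, c is a constant, the φ_i are autoregressive coefficients, and the θ_j are moving-average coefficients.

```latex
\left(1 - \phi_1 B - \dots - \phi_p B^p\right)(1 - B)^d \, y_t
  = c + \left(1 + \theta_1 B + \dots + \theta_q B^q\right)\varepsilon_t
```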

How to Predict Netflix Stock Price using Machine Learning in R

Step 1: Importing the required libraries

Below is the list of packages that we will need for this R Machine Learning Project:

Package      Uses
smooth       Smoothing techniques and forecasting models for time series analysis.
forecast     Forecasting time series data.
xts          Handling and manipulating time series data.
imputeTS     Functions for handling missing values in time series data.
fpp2         Datasets and additional forecasting tools.
tseries      Functions for time series analysis, including tests for stationarity.
ggfortify    Easy visualization of time series objects.
ggplot2      A popular package for creating complex and customizable plots in R.
quantmod     Tools to fetch, analyze, and visualize financial market data.

R
#Install and load libraries
#Smoothing techniques for time series analysis.
install.packages("smooth")
library(smooth)

# Used for forecasting time series data.
install.packages("forecast")
library(forecast)

#Used for handling and manipulating time series data
install.packages("xts")
library(xts)

#handle missing values in time series data
install.packages("imputeTS")
library(imputeTS)

#provides datasets
install.packages("fpp2")
library(fpp2)

#functions for time series analysis
install.packages("tseries")
library(tseries)

#visualization of time series objects 
install.packages("ggfortify")
library(ggfortify)

#customizable plots in R
install.packages("ggplot2")
library(ggplot2)

# fetch financial market data
install.packages("quantmod")
library(quantmod)

Step 2: Loading the Netflix Stock Price Dataset

Here we load the dataset from whichever of the two sources discussed above you prefer.

  • Loading dataset from CSV file
R
# Load the data from a local CSV file (if you use an external dataset)
df = read.csv("/content/NFLX.csv")
  • Loading dataset from Finance websites
R
# getSymbols() from quantmod fetches the data from Yahoo Finance
# and creates an xts object named NFLX
getSymbols('NFLX', from = '2002-01-01', to = '2024-01-01')
df = NFLX

# View dataset
head(df)

Output:

           NFLX.Open NFLX.High NFLX.Low NFLX.Close NFLX.Volume NFLX.Adjusted
2002-05-23 1.156429 1.242857 1.145714 1.196429 104790000 1.196429
2002-05-24 1.214286 1.225000 1.197143 1.210000 11104800 1.210000
2002-05-28 1.213571 1.232143 1.157143 1.157143 6609400 1.157143
2002-05-29 1.164286 1.164286 1.085714 1.103571 6757800 1.103571
2002-05-30 1.107857 1.107857 1.071429 1.071429 10154200 1.071429
2002-05-31 1.078571 1.078571 1.071429 1.076429 8464400 1.076429
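If you load the data from the CSV file instead, read.csv() returns a plain data frame rather than the xts object shown above, and the column order may differ. A minimal sketch of preparing such a data frame follows. It assumes the file has Date and Close columns (typical of a Yahoo Finance export, but not confirmed by this article) and uses a few inline rows copied from the output above in place of the real file.

```r
# Inline stand-in for NFLX.csv; the column layout is an assumption
csv_text <- "Date,Open,High,Low,Close,Volume
2002-05-23,1.156429,1.242857,1.145714,1.196429,104790000
2002-05-24,1.214286,1.225000,1.197143,1.210000,11104800
2002-05-28,1.213571,1.232143,1.157143,1.157143,6609400"

df_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Parse the dates and sort the rows chronologically
df_csv$Date <- as.Date(df_csv$Date)
df_csv <- df_csv[order(df_csv$Date), ]

# Select the closing price by name rather than by position
close_csv <- df_csv$Close
print(data.frame(Date = df_csv$Date, Close = close_csv))
```

Selecting Close by name avoids picking up a different column when a positional index such as df[, 4] is applied to a data frame whose first column is the date.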

Step 3: Checking the dimension and missing values of our data

Here we check the dimensions of the dataset and count the missing values in each column.

R
# Check the dimension of the dataset
dim(df)

# Check the missing values of all the columns of the dataset
colSums(is.na(df))

Output:

[1] 5439    6

NFLX.Open NFLX.High NFLX.Low NFLX.Close NFLX.Volume NFLX.Adjusted
0 0 0 0 0 0

Step 4: Taking the summary of the data

We look at the summary statistics to get a basic idea of the dataset.

R
# Checking the summary of the data
summary(df)

Output:

     Index              NFLX.Open          NFLX.High           NFLX.Low       
Min. :2002-05-23 Min. : 0.3779 Min. : 0.4107 Min. : 0.3464
1st Qu.:2007-10-16 1st Qu.: 4.1143 1st Qu.: 4.1936 1st Qu.: 4.0400
Median :2013-03-13 Median : 33.9957 Median : 34.5543 Median : 33.5100
Mean :2013-03-11 Mean :132.3833 Mean :134.4291 Mean :130.2730
3rd Qu.:2018-08-04 3rd Qu.:255.3800 3rd Qu.:261.5600 3rd Qu.:249.5550
Max. :2023-12-29 Max. :692.3500 Max. :700.9900 Max. :686.0900
NFLX.Close NFLX.Volume NFLX.Adjusted
Min. : 0.3729 Min. : 285600 Min. : 0.3729
1st Qu.: 4.1214 1st Qu.: 5922600 1st Qu.: 4.1214
Median : 33.9600 Median : 10018000 Median : 33.9600
Mean :132.4029 Mean : 15907149 Mean :132.4029
3rd Qu.:255.1150 3rd Qu.: 18833300 3rd Qu.:255.1150
Max. :691.6900 Max. :323414000 Max. :691.6900

Step 5: Plotting the data

We use the chartSeries() function from the quantmod package, which is designed for visualizing financial and stock market data. With type = ‘auto’, it automatically selects an appropriate chart type for the data provided.

R
chartSeries(df, type = 'auto')

Output:

[Figure: chartSeries() chart of the Netflix stock data]

Next, we plot the distribution of the closing price as a first, informal look at whether the data is stationary.

R
ggplot(df, aes(x = NFLX.Close))+
  geom_density(alpha = 0.5, fill = "blue") +
  geom_histogram(aes(y = after_stat(density)), 
                 color = "black", 
                 fill = "lightgray", 
                 bins = 30, alpha = 0.4) +
  labs(title = "Density and Histogram of Close Price",
       x = "Close Price",
       y = "Density") +
  theme_minimal()

Output:

[Figure: density and histogram of the Netflix closing price]

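The closing price is clearly not normally distributed: most values are concentrated at low prices, with a long right tail. Non-normality alone does not imply non-stationarity, but together with the strong upward trend over time it suggests that the series is non-stationary, which is consistent with auto.arima() choosing d = 1 in Step 7.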
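A more direct check of stationarity is the Augmented Dickey-Fuller test, available as adf.test() in the tseries package loaded in Step 1. The sketch below runs it on a simulated random walk (a stand-in, not the Netflix data) and on its first difference. With the real data you would pass as.numeric(df$NFLX.Close) instead of the simulated series.

```r
library(tseries)

set.seed(42)
# Simulated random walk with drift as a stand-in for a price series
walk <- cumsum(rnorm(500, mean = 0.1, sd = 1))

# Null hypothesis of the ADF test: the series has a unit root (non-stationary).
# suppressWarnings() hides the note tseries emits when the p-value
# falls outside the range of its lookup table.
adf_level <- suppressWarnings(adf.test(walk))
# First differences of a random walk are white noise, hence stationary
adf_diff <- suppressWarnings(adf.test(diff(walk)))

cat("p-value, levels:     ", adf_level$p.value, "\n")
cat("p-value, differences:", adf_diff$p.value, "\n")
```

A small p-value for the differenced series combined with a large one for the levels would support modelling the series with one order of differencing (d = 1).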

Step 6: Model building

We extract the closing price into df.close and split it chronologically in an 80:20 ratio: the first 80% of the observations are used for training and the remaining 20% for testing (validation).

After the split, we will fit an ARIMA model to the training data to predict the stock price.

R
# df.close holds only the closing price (4th column of the quantmod data)
df.close = df[, 4]

# Chronological 80:20 train/test split with an integer cut-off,
# so the two sets do not overlap and the last observation is kept
n.train = floor(0.8 * length(df.close))
df.close.train = df.close[1:n.train]
df.close.test = df.close[(n.train + 1):length(df.close)]
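The 80:20 chronological split can also be wrapped in a small helper. The sketch below uses a hypothetical function name, split_chronological, and a stand-in vector with the same length as the Netflix data (5439 rows). It also shows why the cut-off should be a whole number: R truncates fractional indices rather than rounding them.

```r
# Hypothetical helper: chronological split with an integer cut-off
split_chronological <- function(x, train_frac = 0.8) {
  n <- length(x)
  n_train <- floor(train_frac * n)
  list(train = x[seq_len(n_train)],
       test  = x[(n_train + 1):n])
}

# Stand-in series with the same length as the Netflix data
x <- seq_len(5439)
parts <- split_chronological(x)

length(parts$train)   # 4351
length(parts$test)    # 1088

# Fractional indices are truncated: 0.8 * 5439 = 4351.2 selects element 4351
x[0.8 * length(x)]
```

Because the split is by position and not random, the test set always lies after the training set in time, which is what a forecasting evaluation needs.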

Step 7: Model Fitting

R
# Let auto.arima() search for the best (p, d, q), ranked by AICc
df.close.arima = auto.arima(df.close.train,
                            seasonal = TRUE,
                            stepwise = TRUE,
                            nmodels = 100,
                            trace = TRUE,
                            biasadj = TRUE)

Output:

 Fitting models using approximations to speed things up...

ARIMA(2,1,2) with drift : 21853.71
ARIMA(0,1,0) with drift : 21847.69
ARIMA(1,1,0) with drift : 21848.52
ARIMA(0,1,1) with drift : 21847.56
ARIMA(0,1,0) : 21847.87
ARIMA(1,1,1) with drift : 21848.77
ARIMA(0,1,2) with drift : 21849.32
ARIMA(1,1,2) with drift : 21850.01
ARIMA(0,1,1) : 21847.64

Now re-fitting the best model(s) without approximations...

ARIMA(0,1,1) with drift : 21849.74

Best model: ARIMA(0,1,1) with drift

Executing and Checking the Model Summary

Now we will check the summary of the model.

R
# Summary of the model
summary(df.close.arima)

Output:

Series: df.close.train 
ARIMA(0,1,1) with drift

Coefficients:
ma1 drift
0.0220 0.0667
s.e. 0.0151 0.0462

sigma^2 = 8.883: log likelihood = -10921.87
AIC=21849.74 AICc=21849.74 BIC=21868.87

Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1.070102e-05 2.979396 1.175391 -1.252125 2.838547 1.008832 0.0001708062

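This ARIMA model:

  • Fits the training data with small one-step-ahead errors (RMSE ≈ 2.98, MAPE ≈ 2.84%), but its MASE of about 1.01 shows that it is no more accurate than a naive forecast that repeats the previous day’s close.
  • Has small ma1 and drift coefficients, each less than two standard errors from zero (0.0220 with s.e. 0.0151, and 0.0667 with s.e. 0.0462), so the fitted model is close to a random walk with a slight upward drift.
  • Reports log-likelihood, AIC, AICc, and BIC values, which are mainly useful for comparing this fit with other candidate models.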
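Written out with the estimated coefficients from the summary above, the fitted ARIMA(0,1,1) with drift is shown below, where y_t is the closing price, ε_t is the model’s error term, and T is the last training day. The forecast formula on the right is the standard result for this model class, given here as a sketch and not as output from the analysis.

```latex
y_t = y_{t-1} + 0.0667 + \varepsilon_t + 0.0220\,\varepsilon_{t-1}
\qquad\Longrightarrow\qquad
\hat{y}_{T+h} = y_T + 0.0220\,\hat{\varepsilon}_T + 0.0667\,h, \quad h \ge 1
```

In other words, the h-step forecast is a straight line that rises by about 0.067 per trading day from the last training value, so it cannot follow large swings in later prices.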

Checking Accuracy of Netflix Stock Price Prediction Model

Comparing the training and test accuracy of the Netflix stock price prediction model:

R
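df.close.forecast = forecast(df.close.arima, h = length(df.close.test))  # forecast over the test horizon
accuracy(df.close.forecast, df.close.test)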

Output:

                       ME       RMSE        MAE       MPE      MAPE       MASE         ACF1
Training set 1.070102e-05 2.979396 1.175391 -1.252125 2.838547 1.008832 0.0001708062
Test set 8.253246e+01 150.853167 122.694874 11.009547 29.253149 105.308391 NA

where:

  • ME (Mean Error) indicates the average error.
  • RMSE (Root Mean Square Error) shows the square root of the average squared errors.
  • MAE (Mean Absolute Error) is the average of the absolute errors.
  • MPE (Mean Percentage Error) represents the average percentage error.
  • MAPE (Mean Absolute Percentage Error) is the average of the absolute percentage errors.
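  • MASE (Mean Absolute Scaled Error) is the MAE divided by the in-sample MAE of a naive forecast (the previous value); values above 1 mean the model is less accurate than the naive method.
  • ACF1 (Autocorrelation at Lag 1) is the lag-1 autocorrelation of the forecast errors; values near zero indicate that little structure is left in the errors.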
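The error measures listed above can be computed by hand. The sketch below does so on toy numbers chosen for easy arithmetic (not Netflix data), taking the error as actual minus predicted; the variable names are illustrative.

```r
# Toy values chosen for easy arithmetic (not Netflix data)
train     <- c(90, 95, 100)     # in-sample series, used only for the MASE scale
actual    <- c(100, 200, 400)   # observed test values
predicted <- c(110, 190, 380)   # forecasts for the same periods

e  <- actual - predicted        # forecast errors
pe <- 100 * e / actual          # percentage errors

metrics <- c(
  ME   = mean(e),
  RMSE = sqrt(mean(e^2)),
  MAE  = mean(abs(e)),
  MPE  = mean(pe),
  MAPE = mean(abs(pe)),
  # Scale: in-sample MAE of the naive "previous value" forecast
  MASE = mean(abs(e)) / mean(abs(diff(train)))
)
print(round(metrics, 3))
```

Here the positive and negative percentage errors cancel, so MPE is 0 while MAPE is about 6.67. This is why the signed measures (ME, MPE) indicate bias and the absolute ones (MAE, MAPE) indicate the size of the errors.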

The training set has much lower error values across all metrics compared to the test set. This suggests the model performs well on the data it was trained on but does not generalize well to new data (the test set).

Performance Comparison on Netflix Stock Price Prediction Model on Training vs Test Data Set

  • The training set errors are small, but they are one-step-ahead errors: each fitted value uses the actual price of the previous day. A model with only two coefficients is unlikely to be overfitting in the usual sense.
  • The test set errors are much higher because the test forecasts are made up to 1088 trading days ahead without seeing any new prices, so the model cannot follow the large price movements in the test period.
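Part of the gap between the two rows comes from how the errors are computed: training errors are one-step-ahead, whereas test errors come from forecasts made many steps ahead from the end of the training data. The sketch below illustrates this on a simulated random walk with drift (a stand-in, not the Netflix data; all names are illustrative).

```r
set.seed(1)
# Simulated random walk with drift as a stand-in for a price series
y <- cumsum(rnorm(1000, mean = 0.05, sd = 1))
train <- y[1:800]
test  <- y[801:1000]
h     <- seq_along(test)

# Drift estimated from the training data only
drift <- mean(diff(train))

# Multi-step: every forecast starts from the last training value
multi_step <- train[length(train)] + drift * h
# One-step: each forecast starts from the previous actual value
one_step <- c(train[length(train)], test[-length(test)]) + drift

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
cat("Multi-step RMSE:", rmse(test, multi_step), "\n")
cat("One-step RMSE:  ", rmse(test, one_step), "\n")
```

The model and the data are the same in both cases; only the forecast horizon differs. A fair comparison with the training-set row would therefore use one-step-ahead forecasts on the test set as well.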

Predict Netflix Stock Price

Using the Arima() function from the forecast package, we can fit models with other values of (p, d, q), compare their accuracy, and look for better predictions. Here we try ARIMA(0,2,1):

R
df.arima1 = Arima(df.close.train, order = c(0, 2, 1))
# Predict as many steps ahead as there are test observations
pred1 = predict(df.arima1, n.ahead = length(df.close.test))
summary(df.arima1)

Output:

Series: df.close.train 
ARIMA(0,2,1)

Coefficients:
ma1
-0.9994
s.e. 0.0014

sigma^2 = 8.89: log likelihood = -10924.97
AIC=21853.93 AICc=21853.94 BIC=21866.69

Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.0380327 2.980599 1.165301 0.03106153 2.352286 1.000172 0.0219196

Calculate Test accuracy score

R
accuracy(pred1$pred, df.close.test)

Output:

               ME     RMSE      MAE     MPE     MAPE
Test set 60.76618 144.3566 118.3292 4.91788 29.72001

Compared with the auto.arima() model, df.arima1 has a lower test RMSE (144.36 vs 150.85) and MAE (118.33 vs 122.69) but a slightly higher MAPE (29.72 vs 29.25), so neither model is clearly better, and both are far from accurate on the test set. A likely reason is that we are using a very simple model for a task as complex as stock price prediction. The results may be improved by tuning the parameters or by systematically searching for more appropriate values of (p, d, q).
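The search for better (p, d, q) values can be sketched as a small grid search. This version uses arima() from base R’s stats package rather than the forecast package, ranks candidates by AIC, and runs on a simulated series. All of these are illustrative choices, not the procedure used in this article.

```r
set.seed(7)
# Simulated stand-in series; with the real data use as.numeric(df.close.train)
y <- cumsum(rnorm(300, mean = 0.05, sd = 1))

# Candidate orders; d is held at 1 because AIC values are only comparable
# between models fitted with the same amount of differencing
grid <- expand.grid(p = 0:2, d = 1, q = 0:2)
grid$aic <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- tryCatch(
    arima(y, order = c(grid$p[i], grid$d[i], grid$q[i]), method = "ML"),
    error = function(e) NULL   # skip orders that fail to fit
  )
  if (!is.null(fit)) grid$aic[i] <- AIC(fit)
}

# Rank the candidates by AIC (lower is better); failed fits sort last
grid <- grid[order(grid$aic), ]
print(head(grid, 3))
```

To compare models with different values of d, such as ARIMA(0,1,1) and ARIMA(0,2,1), out-of-sample accuracy on the test set is the more appropriate yardstick.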