Ordinary Least Squares (OLS) Regression in R

Ordinary Least Squares (OLS) regression allows researchers to quantify the impact of independent variables on a dependent variable and to make predictions based on the fitted model.

Ordinary Least Squares (OLS) regression is a powerful statistical method used to analyze the relationship between one or more independent variables and a dependent variable. It’s a cornerstone of regression analysis and is widely utilized across various disciplines, including economics, social sciences, finance, and more. OLS regression aims to find the best-fitting line (or hyperplane in multiple dimensions) through a set of data points, minimizing the sum of squared differences between observed and predicted values.

How is OLS Regression different from other regression algorithms?

Ordinary Least Squares (OLS) regression is a specific type of regression algorithm that differs from other regression algorithms in several ways:


  • Linear relationship: OLS regression assumes a linear relationship between the independent and dependent variables, whereas other algorithms can capture non-linear relationships by including higher-order terms or using non-linear functions (see the sketch after this list).
  • Minimization objective: OLS regression minimizes the sum of squared differences between observed and predicted values of the dependent variable, whereas other algorithms may use different optimization objectives, such as minimizing absolute errors (as in least absolute deviations regression), minimizing a penalized squared-error loss (as in Lasso and Ridge regression), or maximizing likelihood (as in logistic regression).
  • Assumptions: OLS regression relies on several assumptions, including linearity, homoscedasticity, independence of errors, and normality of errors, whereas other algorithms may have different sets of assumptions or may be more robust to violations of these assumptions.
  • Interpretability: OLS regression provides easily interpretable coefficients that represent the effect of each independent variable on the dependent variable, whereas other algorithms, such as decision trees or neural networks, may produce less interpretable models with complex structures.
  • Complexity: OLS regression is relatively simple and computationally efficient, making it suitable for small to moderately sized datasets with a limited number of predictors, whereas other algorithms may be more complex and computationally intensive, allowing for more flexibility and scalability but requiring larger datasets and more computational resources.
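For instance, the higher-order terms mentioned above can be fit within the same lm() framework. The following sketch uses simulated data (the variable names here are illustrative, not part of the weather example below):

R

# Simulated data with a quadratic relationship
set.seed(1)
x <- seq(-2, 2, length.out = 100)
y <- 1 + x^2 + rnorm(100, sd = 0.3)

linear_fit    <- lm(y ~ x)           # plain OLS: misses the curvature
quadratic_fit <- lm(y ~ x + I(x^2))  # higher-order term captures it

# Compare the variance explained by each model
summary(linear_fit)$r.squared
summary(quadratic_fit)$r.squared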

Mathematically, the OLS estimation formula can be represented as:

Given a dataset with n observations and k independent variables, denoted by X1, X2, . . ., Xk, and a dependent variable Y, the OLS estimation formula for the coefficients (β) is:

[Tex]\hat{\beta} = (X^T X)^{-1} X^T Y [/Tex]

Where:

  • β̂ is the vector of estimated coefficients.
  • X is the design matrix containing the independent variables (with dimensions n × (k+1), including a column of ones for the intercept).
  • Y is the vector of observed values of the dependent variable (with dimensions n × 1).
  • X^T represents the transpose of the matrix X.
  • (X^T X)^{-1} denotes the inverse of the matrix X^T X.

This formula yields the coefficient estimates β̂ that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted by the regression model.
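To make the formula concrete, the following sketch computes β̂ directly with matrix algebra on a small simulated dataset (the variable names are illustrative) and verifies that it matches the coefficients returned by R's lm():

R

# Simulate a small dataset: y depends linearly on one predictor plus noise
set.seed(42)
n <- 50
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)

# Design matrix X with a column of ones for the intercept
X <- cbind(1, x)

# OLS estimate: beta_hat = (X^T X)^{-1} X^T Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# The same estimates from lm()
coef(lm(y ~ x))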

The following step-by-step example shows how to perform OLS regression in R.

Step 1. Install and load the required libraries

We will install (if necessary) and load the ggplot2 library, which is used below to visualize the regression results.

R

# Install ggplot2 if it is not already available
# install.packages("ggplot2")

# Load the necessary library
library(ggplot2)

Step 2. Load the Dataset

Here we perform Ordinary Least Squares regression on a historical weather dataset to analyze the relationship between temperature and humidity.

Link: WeatherHistory

R

# Load the dataset
weather_data <- read.csv("E:/weatherHistory.csv")

# Inspect the first few rows
head(weather_data)

Output:

Formatted.Date Summary Precip.Type Temperature..C.
1 2006-04-01 00:00:00.000 +0200 Partly Cloudy rain 9.472222
2 2006-04-01 01:00:00.000 +0200 Partly Cloudy rain 9.355556
3 2006-04-01 02:00:00.000 +0200 Mostly Cloudy rain 9.377778
4 2006-04-01 03:00:00.000 +0200 Partly Cloudy rain 8.288889
5 2006-04-01 04:00:00.000 +0200 Mostly Cloudy rain 8.755556
6 2006-04-01 05:00:00.000 +0200 Partly Cloudy rain 9.222222
Apparent.Temperature..C. Humidity Wind.Speed..km.h. Wind.Bearing..degrees.
1 7.388889 0.89 14.1197 251
2 7.227778 0.86 14.2646 259
3 9.377778 0.89 3.9284 204
4 5.944444 0.83 14.1036 269
5 6.977778 0.83 11.0446 259
6 7.111111 0.85 13.9587 258
Visibility..km. Loud.Cover Pressure..millibars. Daily.Summary
1 15.8263 0 1015.13 Partly cloudy throughout the day.
2 15.8263 0 1015.63 Partly cloudy throughout the day.
3 14.9569 0 1015.94 Partly cloudy throughout the day.
4 15.8263 0 1016.41 Partly cloudy throughout the day.
5 15.8263 0 1016.51 Partly cloudy throughout the day.
6 14.9569 0 1016.66 Partly cloudy throughout the day.
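Note that read.csv sanitizes column headers, replacing characters such as spaces and parentheses with dots; this is why the temperature column appears as Temperature..C. in the output. The assigned names can be confirmed directly:

R

# Inspect the column names assigned by read.csv
names(weather_data)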

Step 3. Perform OLS Regression

Now we fit the model and perform Ordinary Least Squares (OLS) regression using R's lm() function.

R

# Perform OLS regression with Temperature..C. as the dependent variable
model <- lm(Temperature..C. ~ Humidity, data = weather_data)

# Summary of the regression model
summary(model)

Output:

Call:
lm(formula = Temperature..C. ~ Humidity, data = weather_data)

Residuals:
Min 1Q Median 3Q Max
-52.415 -5.091 0.378 5.741 18.804

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.6369 0.0927 373.7 <2e-16 ***
Humidity -30.8944 0.1219 -253.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.4 on 96451 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.3997
F-statistic: 6.423e+04 on 1 and 96451 DF, p-value: < 2.2e-16

In this linear model, Humidity has a strong negative relationship with Temperature (C). The coefficient for Humidity is -30.8944; since Humidity is measured on a 0-1 scale, moving from 0 to 1 (i.e., from 0% to 100% humidity) corresponds to a predicted temperature drop of about 30.89 °C. The model explains approximately 39.97% of the variance in Temperature (C), and both coefficients are highly statistically significant (p < 2e-16).
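Rather than reading these quantities off the printed summary, they can be extracted programmatically with standard accessors (a brief sketch, assuming model is the fitted object from the step above):

R

coef(model)                    # intercept and Humidity slope
confint(model, level = 0.95)   # 95% confidence intervals for the coefficients
summary(model)$r.squared       # proportion of variance explained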

Step 4. Visualize the OLS model

Now we will visualize the Ordinary Least Squares (OLS) Regression model.

R

# Scatter plot of Temperature..C. against Humidity with the fitted regression line
ggplot(data = weather_data, aes(x = Humidity, y = Temperature..C.)) +
  geom_point() +                            # observed data points
  geom_smooth(method = "lm", se = FALSE) +  # fitted OLS regression line
  labs(x = "Humidity", y = "Temperature (C)") +        # axis labels
  ggtitle("OLS Regression of Temperature vs Humidity")  # plot title

Output:

[Plot: OLS Regression of Temperature vs Humidity, showing the scatter of observed points with the fitted regression line]

The plot illustrates the linear relationship between Humidity and Temperature (C). The regression line's negative slope confirms that higher humidity levels are associated with lower temperatures, consistent with the estimated coefficients. This visualization helps convey the strength and direction of the relationship between the two variables.
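Because a fitted OLS model can also be used for prediction, predict() generates temperature estimates for new humidity values; the values below are arbitrary, chosen only for illustration:

R

# Predict temperature (with prediction intervals) for illustrative humidity values
new_data <- data.frame(Humidity = c(0.3, 0.6, 0.9))
predict(model, newdata = new_data, interval = "prediction")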

Importance of OLS Regression in Data Analysis

  1. Foundation of Statistical Modeling: OLS regression is a fundamental technique that forms the basis for many other statistical methods and machine learning algorithms.
  2. Simplicity and Interpretability: The method is straightforward to implement and the results are easy to interpret. Coefficients provide clear insights into the relationships between independent and dependent variables.
  3. Diagnostic Insights: OLS regression provides valuable diagnostic statistics, such as standard errors, t-values, p-values, and R-squared, which help assess model performance and predictor significance (see the diagnostic sketch after this list).
  4. Versatility: It can be applied to various fields, including economics, finance, healthcare, and social sciences, to understand and predict outcomes based on input data.
  5. Model Validation: The technique helps validate hypotheses and informs decision-making by quantifying the strength and direction of relationships between variables.
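As a sketch of the diagnostic checks mentioned in point 3, base R's plot() method for lm objects produces the four standard diagnostic plots (this assumes model is the fitted object from Step 3):

R

# Standard residual diagnostics for a fitted lm object
par(mfrow = c(2, 2))   # arrange the four plots in a 2x2 grid
plot(model)            # residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))   # reset the plotting layout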

Conclusion

In conclusion, Ordinary Least Squares (OLS) regression is a fundamental technique for modeling relationships between variables. Despite its simplicity, users must be aware of its assumptions, such as linearity, and of challenges such as multicollinearity. While OLS is valuable for its interpretability, it is important to supplement it with more advanced methods when dealing with complex datasets or non-linear relationships.