Implementing Hierarchical Linear Regression in R
In this article, we will use different real-world examples to understand our topic in detail, using different packages for analysis and visualization. We can perform this analysis in six steps:
- Step 1: Creating/Loading a Dataset
- Step 2: Loading packages
- Step 3: Model Fitting
- Step 4: Running an Analysis of Variance (ANOVA) and Assessing Model Fit
- Step 5: Making a prediction
- Step 6: Interpretation and Visualization
We will try to understand the relationship between teaching method and student performance, i.e., the marks obtained by students. We will create a fictional dataset, calling the set.seed() function first so that the randomly generated data is reproducible.
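As a quick aside, set.seed() itself does not generate data; it fixes the state of R's random number generator so that subsequent random draws come out the same on every run. A minimal sketch:

```r
# Fixing the seed makes random draws reproducible
set.seed(123)
first_draw <- rnorm(3)    # three random normal values

set.seed(123)             # reset to the same seed
second_draw <- rnorm(3)

identical(first_draw, second_draw)   # TRUE: same seed, same values
```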
Hierarchical linear regression using the mtcars dataset in R
In this example we will use a built-in dataset in the R programming language called “mtcars”. This dataset contains information about different types of cars, such as weight, horsepower, quarter-mile time, and miles per gallon (mpg). We will try to understand how these attributes affect each other at different levels.
Step 1: Loading the Dataset
R
# Loading the dataset
data(mtcars)
We can also view our data using functions such as head(), tail(), or summary(). The head() function returns the first six rows of the dataset; similarly, tail() returns the last six rows.
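For example, a minimal sketch of these inspection functions:

```r
# Quick looks at the loaded data
head(mtcars)     # first 6 rows
tail(mtcars)     # last 6 rows
summary(mtcars)  # min, quartiles, median, mean, and max of each column
```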
Step 2: Model Specification
In this step, we’ll specify the hierarchical linear regression models. We’ll consider two levels of hierarchy, car features and car make, and try to understand the influence of car make (brand) on mpg.
R
# Car features level
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Car make level
model_by_make <- lm(mpg ~ disp + drat + wt, data = mtcars)

# Summary of both models
summary(model)
summary(model_by_make)
Output:
Call:
lm(formula = mpg ~ wt + hp + qsec, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.8591 -1.6418 -0.4636 1.1940 5.6092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.61053 8.41993 3.279 0.00278 **
wt -4.35880 0.75270 -5.791 3.22e-06 ***
hp -0.01782 0.01498 -1.190 0.24418
qsec 0.51083 0.43922 1.163 0.25463
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.578 on 28 degrees of freedom
Multiple R-squared: 0.8348, Adjusted R-squared: 0.8171
F-statistic: 47.15 on 3 and 28 DF, p-value: 4.506e-11
summary(model_by_make)
Call:
lm(formula = mpg ~ disp + drat + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.2342 -2.3719 -0.3148 1.6315 6.2820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.043257 7.099792 4.372 0.000154 ***
disp -0.016389 0.009578 -1.711 0.098127 .
drat 0.843965 1.455051 0.580 0.566537
wt -3.172482 1.217157 -2.606 0.014495 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.951 on 28 degrees of freedom
Multiple R-squared: 0.7835, Adjusted R-squared: 0.7603
F-statistic: 33.78 on 3 and 28 DF, p-value: 1.92e-09
In the above model specification, the first model, “model”, examines how car features (weight, horsepower, and quarter-mile time) influence the miles per gallon (mpg) performance of the cars. The second model, “model_by_make”, focuses on how a different set of features (displacement, rear axle ratio, and weight) affects mpg, which we use to reflect differences between car makes in the dataset. Together, these two levels form the hierarchical structure of the analysis.
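As a side note, classical hierarchical (sequential) regression is usually carried out by comparing nested models, adding predictors in blocks; a minimal sketch of that workflow on mtcars (the model names m1 and m2 are our own):

```r
# Block 1: weight only
m1 <- lm(mpg ~ wt, data = mtcars)

# Block 2: weight plus horsepower and quarter-mile time
m2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# F-test for the improvement contributed by the second block
anova(m1, m2)

# Incremental R-squared of the second block
summary(m2)$r.squared - summary(m1)$r.squared
```

Because m1 is nested inside m2, anova(m1, m2) tests whether the added block significantly improves the fit.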
Step 3: Running an Analysis of Variance (ANOVA) and Assessing Model Fit
As discussed in the previous example, this step is important for understanding the significance of the model fit.
R
# Conducting Analysis of Variance (ANOVA) for both models
anova_model <- anova(model)
anova_model_by_make <- anova(model_by_make)

# Printing ANOVA results for both models
cat("ANOVA for model:\n")
print(anova_model)
cat("\nANOVA for model by make:\n")
print(anova_model_by_make)
Output:
ANOVA for model:
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.73 847.73 127.5739 6.131e-12 ***
hp 1 83.27 83.27 12.5319 0.00142 **
qsec 1 8.99 8.99 1.3527 0.25463
Residuals 28 186.06 6.64
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA for model by make:
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
disp 1 808.89 808.89 92.9171 2.152e-10 ***
drat 1 14.26 14.26 1.6383 0.2111
wt 1 59.14 59.14 6.7937 0.0145 *
Residuals 28 243.75 8.71
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The incremental R-squared of 0.05123624, computed later as the difference between the two models’ R-squared values, shows that “model” (weight, horsepower, quarter-mile time) explains about 5.12 percentage points more of the variance in mpg than “model_by_make”.
Overall, the results indicate that the predictors in “model” capture the variance in the response variable somewhat better than those in “model_by_make”.
Assessing Sums of Squares for both models
R
# Assessing sums of squares for both models.
# In an ANOVA table, the last row holds the residual (error) sum of
# squares; the remaining rows are the sequential sums of squares of
# the predictors, which together give the regression sum of squares.
SSE <- tail(anova_model$`Sum Sq`, 1)          # error SS
SSR <- sum(head(anova_model$`Sum Sq`, -1))    # regression SS
SST <- SSR + SSE                              # total SS

SSE_model_by_make <- tail(anova_model_by_make$`Sum Sq`, 1)
SSR_model_by_make <- sum(head(anova_model_by_make$`Sum Sq`, -1))
SST_model_by_make <- SSR_model_by_make + SSE_model_by_make

cat("\nSums of Squares:\n")
cat("SST:", round(SST, 2), "\n")
cat("SSR:", round(SSR, 2), "\n")
cat("SSE:", round(SSE, 2), "\n")
Output:
Sums of Squares:
SST: 1126.05
SSR: 939.99
SSE: 186.06
Calculating Difference in Sum of Squares for both models
R
# Difference between total and error sums of squares
# (this equals the regression sum of squares of each model)
SSR_diff <- SST - SSE
SSR_diff_model_by_make <- SST_model_by_make - SSE_model_by_make

# Extracting the F-statistic and p-value of the first ANOVA row
# of each model
F_statistic <- anova_model$`F value`[1]
p_value <- anova_model$`Pr(>F)`[1]
F_statistic_model_by_make <- anova_model_by_make$`F value`[1]
p_value_model_by_make <- anova_model_by_make$`Pr(>F)`[1]

cat("\nDifference in Sum of Squares for model by make (SSR_diff_model_by_make):",
    round(SSR_diff_model_by_make, 1), "\n")
cat("\nF-Statistic for model by make:", F_statistic_model_by_make, "\n")
cat("P-Value for model by make:", p_value_model_by_make, "\n")
Output:
Difference in Sum of Squares for model by make (SSR_diff_model_by_make): 882.3
F-Statistic for model by make: 92.91705
P-Value for model by make: 2.151799e-10
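To make the connection explicit, the F-statistic and p-value in the first ANOVA row can be reproduced by hand from the mean squares, using pf() for the tail probability of the F-distribution (a sketch, refitting the same model so the snippet is self-contained):

```r
model_by_make <- lm(mpg ~ disp + drat + wt, data = mtcars)
tab <- anova(model_by_make)

ms_disp <- tab$`Sum Sq`[1] / tab$Df[1]   # mean square for disp
ms_res  <- tab$`Sum Sq`[4] / tab$Df[4]   # residual mean square

F_manual <- ms_disp / ms_res             # matches the F value above (~92.92)
p_manual <- pf(F_manual, tab$Df[1], tab$Df[4], lower.tail = FALSE)
```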
Computing Incremental R-Squared
R
# Computing incremental R-squared: the difference between the
# R-squared values of the two models
incremental_R_squared <- summary(model)$r.squared - summary(model_by_make)$r.squared
cat("\nIncremental R-Squared:", incremental_R_squared, "\n")
Output:
Incremental R-Squared: 0.05123624
Step 4: Making Predictions
We’ll predict the mpg for a new, hypothetical car using “model_by_make”. Here we assign values for the car’s displacement (disp), rear axle ratio (drat), and weight in thousands of pounds (wt).
R
# Predicting mpg for a new observation
new_data <- data.frame(disp = 120, drat = 3.9, wt = 2.8)
predicted_mpg <- predict(model_by_make, newdata = new_data)
cat("Predicted MPG:", predicted_mpg, "\n")
Output:
Predicted MPG: 23.48507
The value 23.48507 is the estimated miles per gallon (mpg) for this car, based on the fitted “model_by_make” model.
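predict() can also quantify the uncertainty around this point estimate; a sketch using the same hypothetical car (assuming the model_by_make fit from Step 2):

```r
new_data <- data.frame(disp = 120, drat = 3.9, wt = 2.8)

# 95% confidence interval for the mean mpg at these feature values
predict(model_by_make, newdata = new_data, interval = "confidence")

# Wider 95% prediction interval for a single new car
predict(model_by_make, newdata = new_data, interval = "prediction")
```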
Step 5: Interpretation and Visualization
R
library(ggplot2)

# Adding a new variable 'make' derived from the row names of mtcars
mtcars$make <- rownames(mtcars)

# 'make' on the x-axis and 'mpg' on the y-axis
ggplot(mtcars, aes(x = make, y = mpg)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "MPG by Car Make", x = "Car Make", y = "MPG") +
  # Angle the x-axis text for better readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Output:
The barplot shows the mpg (miles per gallon) of each car in the dataset, labelled by make and model. The graph provides a clear visualization of mpg performance across the cars, allowing for easy comparison.
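Beyond the barplot, it is worth checking the regression assumptions visually; a minimal sketch of base-R residual diagnostics for model_by_make (assuming the fit from Step 2):

```r
# Four standard diagnostic plots: residuals vs fitted values,
# normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model_by_make)
par(mfrow = c(1, 1))   # restore the default plotting layout
```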
In this example, we used the built-in R dataset “mtcars” to perform hierarchical regression. We specified two models for different levels of our dataset and predicted mpg values with the predict() function, which applies the patterns and trends the model learned from the data. We also plotted mpg by car make (brand) for better understanding.
Conclusion:
This article focused on different real-life examples. With their help, we saw how hierarchical linear regression works, how it uses multilevel data to understand the relationship between dependent and independent variables, and how those relationships drive prediction. We also saw the role that patterns and trends in historical data play in prediction.
Hierarchical linear regression using R
A linear regression model establishes a relationship between two or more variables, which are either dependent or independent. Linear regression in the R programming language is used to make predictions from the given data about a particular topic; it provides valuable insights and conclusions that support future decisions. Hierarchical linear regression is an extension of standard linear regression that allows for the analysis of hierarchical data (grouped data, or data at different levels).