How to Calculate the Mean by Group in R DataFrame?
Calculating the mean by group in an R DataFrame involves splitting the data into subsets based on a specific grouping variable and then computing the mean of a numeric variable within each subgroup.
In this article, we will see how to calculate the mean by group in a DataFrame in the R Programming Language.
It can be done with three approaches:
- Using aggregate function
- Using dplyr Package
- Using data.table Package
Dataset creation: First, we create a dataset so that we can later apply each of these approaches and find the mean by group.
R
# Create the dataset named GFG
GFG <- data.frame(
  Category = c("A", "B", "C", "B", "C", "A", "C", "A", "B"),
  Frequency = c(9, 5, 0, 2, 7, 8, 1, 3, 7)
)

# Print the dataset
print(GFG)
Output:
  Category Frequency
1        A         9
2        B         5
3        C         0
4        B         2
5        C         7
6        A         8
7        C         1
8        A         3
9        B         7
The code above creates a dataset named “GFG”.
It has 2 columns named Category and Frequency. Running the code in an R console prints the nine rows shown in the output.
Before we discuss the approaches, let us first work out by hand the group means we expect to get:
- In the dataset above, we have two columns named Category and Frequency.
- In Category, the values A, B, and C each appear more than once.
- A group values (9,8,3), B group values (5,2,7), and C group values (0,7,1) are taken from the Frequency column.
- To find the mean, we use the formula
MEAN = Sum of terms / Number of terms
- Hence, the mean of each group (A, B, C) is calculated as follows:
Sum:
- A=9+8+3=20
- B=5+2+7=14
- C=0+7+1=8
Number of terms:
- A is repeated 3 times
- B is repeated 3 times
- C is repeated 3 times
Mean by group (A, B, C):
- A(mean) = Sum/Number of terms = 20/3 = 6.67
- B(mean) = Sum/Number of terms = 14/3 = 4.67
- C(mean) = Sum/Number of terms = 8/3 = 2.67
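The hand calculation above can be checked in base R. The snippet below is a minimal sketch, not one of the methods covered later in this article. It recreates the same dataset, and the names groups, group_sum, group_count, and manual_mean are chosen here only for illustration.

```r
# Recreate the example dataset from this article
GFG <- data.frame(
  Category = c("A", "B", "C", "B", "C", "A", "C", "A", "B"),
  Frequency = c(9, 5, 0, 2, 7, 8, 1, 3, 7)
)

# Split the Frequency values into one vector per Category
# (illustrative variable names, not from the original article)
groups <- split(GFG$Frequency, GFG$Category)

# Sum of terms and number of terms for each group
group_sum <- sapply(groups, sum)
group_count <- sapply(groups, length)

# MEAN = Sum of terms / Number of terms
manual_mean <- group_sum / group_count
print(round(manual_mean, 2))
```

Rounded to two decimals, the values should match the 6.67, 4.67, and 2.67 worked out above.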
Code Implementations
Method 1: Using aggregate function
Aggregate function: Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
Basic R syntax of the aggregate function: aggregate(x = data_column, by = group_list, FUN = any_function)
Now, let’s compute the mean of each group using the aggregate function:
R
group_mean <- aggregate(
  x = GFG$Frequency,        # Specify data column
  by = list(GFG$Category),  # Specify group indicator
  FUN = mean                # Specify function (i.e. mean)
)
print(group_mean)
Output:
  Group.1        x
1       A 6.666667
2       B 4.666667
3       C 2.666667
The aggregate function above takes three arguments:
- The first (x) is the data to summarise; in our case it is the Frequency column of “GFG”.
- The second (by) is a list of grouping values; in our case it is the Category column, which separates the data into three groups (A, B, C).
- The third (FUN) is the function (e.g. mean, sum) to apply to each group (A, B, C).
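The output columns are labelled Group.1 and x because the by list and the x vector carry no names. As an alternative sketch, aggregate also accepts a formula together with a data argument, which keeps the original column names. The variable name group_mean_formula is chosen here only for illustration.

```r
# Recreate the example dataset from this article
GFG <- data.frame(
  Category = c("A", "B", "C", "B", "C", "A", "C", "A", "B"),
  Frequency = c(9, 5, 0, 2, 7, 8, 1, 3, 7)
)

# Formula interface: summarise Frequency within each Category
# (group_mean_formula is an illustrative name)
group_mean_formula <- aggregate(Frequency ~ Category, data = GFG, FUN = mean)
print(group_mean_formula)
```

The result should hold the same three means, with columns named Category and Frequency instead of Group.1 and x.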
Method 2: Using dplyr Package
dplyr is a package that provides a set of tools for efficiently manipulating datasets in R.
Commonly used functions in the dplyr package:
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values to a single summary.
- arrange() changes the ordering of the rows.
Install this library:
install.packages("dplyr")
Load this library:
library("dplyr")
R
# Load the dplyr library
library("dplyr")

# Specify data frame
group_mean <- GFG %>%
  # Specify group indicator
  group_by(Category) %>%
  # Calculate the mean of the "Frequency" column for each group
  summarise_at(vars(Frequency), list(Mean_Frequency = mean))

# Print the resulting summary data frame
print(group_mean)
Output:
# A tibble: 3 × 2
  Category Mean_Frequency
  <chr>             <dbl>
1 A                  6.67
2 B                  4.67
3 C                  2.67
Code Steps:
- The %>% (pipe) operator passes the result of one step to the next, so the operations are performed one after another.
- group_by(Category) groups the data by the “Category” column. This means that subsequent operations will be performed separately for each unique value in the “Category” column.
- summarise_at() is given two arguments here: the first, vars(Frequency), selects the column to summarise, and the second, list(Mean_Frequency = mean), names the function to apply to it and the resulting column.
- The result is a new data frame called group_mean, which contains one row for each unique category and a column “Mean_Frequency” that holds the calculated means.
Finally, group_mean is printed to the console to display the summary statistics for each category.
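For a single column, the same result can also be obtained with summarise(), which is listed among the dplyr functions above. The dplyr documentation marks the scoped summarise_at() as superseded, so the sketch below may be preferable in new code. It assumes dplyr is installed, and the name group_mean_summarise is chosen here only for illustration.

```r
# Assumes the dplyr package is installed
library(dplyr)

# Recreate the example dataset from this article
GFG <- data.frame(
  Category = c("A", "B", "C", "B", "C", "A", "C", "A", "B"),
  Frequency = c(9, 5, 0, 2, 7, 8, 1, 3, 7)
)

# Group by Category, then reduce each group to its mean Frequency
# (group_mean_summarise is an illustrative name)
group_mean_summarise <- GFG %>%
  group_by(Category) %>%
  summarise(Mean_Frequency = mean(Frequency))

print(group_mean_summarise)
```

This should produce the same three-row tibble as the summarise_at() version.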
Method 3: Using data.table Package
The data.table package provides a concise and efficient way to calculate summary statistics by group. In this case, we calculate the mean of the “Frequency” column for each group defined by the “Category” column.
R
# Load the data.table library
library(data.table)

# Convert data.frame to data.table
gfg <- data.table(GFG)

# Calculate the mean by "Category" group
mean_by_category <- gfg[, .(Mean_Frequency = mean(Frequency)), by = Category]

# Print the result
print(mean_by_category)
Output:
   Category Mean_Frequency
1:        A       6.666667
2:        B       4.666667
3:        C       2.666667
Code Steps:
- The first line loads the data.table package, which is used for efficient data manipulation.
- Then we convert the existing data frame GFG into a data.table named gfg.
- Inside gfg[...], the expression .(Mean_Frequency = mean(Frequency)) calculates the mean of the Frequency column for each group and stores it in a new column named Mean_Frequency.
- The `by` argument specifies the grouping variable. It tells R to group the data by the “Category” column before applying the calculation.
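The same pattern extends to more than one statistic per group. The sketch below adds a row count next to the mean, using data.table's built-in .N symbol, which ties back to the "number of terms" in the manual calculation. It assumes data.table is installed, and the names summary_by_category and Count are chosen here only for illustration.

```r
# Assumes the data.table package is installed
library(data.table)

# Recreate the example dataset from this article
GFG <- data.frame(
  Category = c("A", "B", "C", "B", "C", "A", "C", "A", "B"),
  Frequency = c(9, 5, 0, 2, 7, 8, 1, 3, 7)
)

# Convert data.frame to data.table
gfg <- as.data.table(GFG)

# Mean and number of rows per Category; .N is the row count of each group
# (summary_by_category and Count are illustrative names)
summary_by_category <- gfg[, .(Mean_Frequency = mean(Frequency), Count = .N),
                           by = Category]

print(summary_by_category)
```

Each group should show a Count of 3 alongside the same means as before.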