How to Calculate the Mean by Group in R DataFrame

Calculating the mean by group in an R DataFrame involves splitting the data into subsets based on a specific grouping variable and then computing the mean of a numeric variable within each subgroup.

In this article, we will see how to calculate the mean by group in an R DataFrame using the R programming language.

It can be done with three approaches:

Using aggregate function
Using dplyr Package
Using data.table Package

Dataset creation: First, we create a dataset so that we can later apply these approaches and find the mean by group.

R

# Create the dataset named GFG
GFG <- data.frame(
   Category  = c("A","B","C","B","C","A","C","A","B"),
   Frequency = c(9,5,0,2,7,8,1,3,7)
)
 
# Prints the dataset
print(GFG)

Output:

  Category Frequency
1        A         9
2        B         5
3        C         0
4        B         2
5        C         7
6        A         8
7        C         1
8        A         3
9        B         7

The above code creates a dataset named “GFG” with 2 columns, Category and Frequency. Running it in an R session prints the table shown above.

Before we discuss those approaches, let us first see how the output values are obtained:

The dataset has two columns named Category and Frequency.
In Category, the values A, B, and C each occur several times.
A group values (9,8,3), B group values (5,2,7), and C group values (0,7,1) are taken from the Frequency column.
To find the mean, we use the formula

MEAN = Sum of terms / Number of terms

Hence, the mean of each group (A, B, C) is worked out as follows:

Sum:

A=9+8+3=20
B=5+2+7=14
C=0+7+1=8

Number of terms:

A appears 3 times
B appears 3 times
C appears 3 times

Mean by group (A, B, C):

A(mean) = Sum/Number of terms = 20/3 = 6.67
B(mean) = Sum/Number of terms = 14/3 = 4.67
C(mean) = Sum/Number of terms = 8/3 = 2.67
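
The hand calculation above can be checked in R. The following is a minimal sketch using only base R; it rebuilds the same GFG data frame so that it runs on its own, and the names sums, counts, and manual_mean are chosen here for illustration and are not part of the original code.

```r
# Recreate the dataset from the article so this block runs on its own
GFG <- data.frame(
   Category  = c("A","B","C","B","C","A","C","A","B"),
   Frequency = c(9,5,0,2,7,8,1,3,7)
)

# Sum of terms in each group (illustrative name)
sums <- tapply(GFG$Frequency, GFG$Category, sum)

# Number of terms in each group (illustrative name)
counts <- table(GFG$Category)

# MEAN = Sum of terms / Number of terms
manual_mean <- sums / as.vector(counts)

# Round to two decimals to compare with the hand-computed values
print(round(manual_mean, 2))
```

The rounded values should agree with the 6.67, 4.67, and 2.67 obtained by hand for groups A, B, and C.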

Code Implementations

Method 1: Using aggregate function

Aggregate function: Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

Syntax: aggregate(x = dataset_Name , by = group_list, FUN = any_function)

Now, let’s compute the mean of our data by group using the aggregate function:

R

group_mean <- aggregate(
  # Specify data column
  x = GFG$Frequency,
  # Specify group indicator
  by = list(GFG$Category),
  # Specify function (i.e. mean)
  FUN = mean)
print(group_mean)

Output:

  Group.1        x
1       A 6.666667
2       B 4.666667
3       C 2.666667

The aggregate function above takes three parameters:

The first (x) is the data to summarise; in our case it is the Frequency column, GFG$Frequency.
The second (by) is a list of grouping values; in our case it is the Category column, which splits the data into three groups (A, B, C).
The third (FUN) is the function (e.g. mean, sum) to apply to each of the groups (A, B, C).
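
The default column names Group.1 and x in the output are not very descriptive. As a hedged alternative, aggregate() also accepts a formula together with a data argument, which keeps the original column names. The sketch below rebuilds GFG so that it is self-contained; group_mean_formula is an illustrative name.

```r
# Recreate the dataset from the article so this block runs on its own
GFG <- data.frame(
   Category  = c("A","B","C","B","C","A","C","A","B"),
   Frequency = c(9,5,0,2,7,8,1,3,7)
)

# Formula interface: summarise Frequency within each level of Category
group_mean_formula <- aggregate(Frequency ~ Category, data = GFG, FUN = mean)

# The result keeps the column names Category and Frequency
print(group_mean_formula)
```

Here the Frequency column of the result holds the group means, so renaming it afterwards may make the output clearer.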

Method 2: Using dplyr Package

dplyr is a package that provides a set of tools for efficiently manipulating datasets in R.

Commonly used functions in the dplyr package:

mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values to a single summary.
arrange() changes the ordering of the rows.

Install this library:

install.packages("dplyr")

Load this library:

library("dplyr")

R

# load dplyr library
library("dplyr")
 
# Specify data frame
group_mean <- GFG %>%
    # Specify group indicator
    group_by(Category) %>%
    # Calculate the mean of the "Frequency" column for each group
    summarise_at(vars(Frequency),
                 list(Mean_Frequency = mean))
 
# Print the resulting summary data frame
print(group_mean)

Output:

# A tibble: 3 × 2
  Category Mean_Frequency
  <chr>             <dbl>
1 A                  6.67
2 B                  4.67
3 C                  2.67

Code Steps:

The %>% operator chains operations, passing the result of one step on to the next.
group_by(Category) groups the data by the “Category” column. This means that subsequent operations will be performed separately for each unique value in the “Category” column.
summarise_at() takes two parameters: the first selects the column(s) to summarise, and the second gives the function(s) to apply to them.
The result is a new data frame (a tibble) called group_mean, which contains one row for each unique category and a column “Mean_Frequency” that holds the calculated means.

Finally, group_mean is printed to the console to display the summary statistics for each category.
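
In dplyr 1.0.0 and later, summarise_at() is marked as superseded, although it still works. When only one column is summarised, a plain summarise() call is a simpler option. The sketch below assumes dplyr is installed, rebuilds GFG so that it runs on its own, and uses the illustrative name group_mean_alt.

```r
# Load dplyr (assumed to be installed)
library(dplyr)

# Recreate the dataset from the article so this block runs on its own
GFG <- data.frame(
   Category  = c("A","B","C","B","C","A","C","A","B"),
   Frequency = c(9,5,0,2,7,8,1,3,7)
)

group_mean_alt <- GFG %>%
    # Group the rows by Category
    group_by(Category) %>%
    # Name the new column and compute the mean directly
    summarise(Mean_Frequency = mean(Frequency))

print(group_mean_alt)
```

This should give the same three group means as the summarise_at() version above.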

Method 3: Using data.table Package

The data.table package provides a concise and efficient way to calculate summary statistics by group. In this case, we calculate the mean of the “Frequency” column for each group defined by the “Category” column.

R

# Load the data.table library
library(data.table)
 
# Convert data.frame to data.table
gfg <- data.table(GFG)
 
# Calculate the mean by "Category" group
mean_by_category <- gfg[, .(Mean_Frequency = mean(Frequency)), by = Category]
 
# Print the result
print(mean_by_category)

Output:

   Category Mean_Frequency
1:        A       6.666667
2:        B       4.666667
3:        C       2.666667

Code Steps:

The first line loads the data.table library in R. The data.table package is used for efficient data manipulation.
Then we convert the existing data frame GFG into a data.table named gfg.
The mean by “Category” group is calculated as follows:
- Inside the square brackets of gfg, .(Mean_Frequency = mean(Frequency)) computes the mean of the Frequency column for each group and stores it in a new column named Mean_Frequency.
- The `by` argument specifies the grouping variable. It tells data.table to group the rows by the “Category” column before applying the calculation.
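
Several summaries can be computed in the same grouped call. As a minimal sketch, assuming data.table is installed, the example below returns the number of terms in each group next to its mean, using data.table's built-in .N symbol. The names Count and summary_by_category are chosen for illustration.

```r
# Load data.table (assumed to be installed)
library(data.table)

# Recreate the dataset from the article so this block runs on its own
GFG <- data.frame(
   Category  = c("A","B","C","B","C","A","C","A","B"),
   Frequency = c(9,5,0,2,7,8,1,3,7)
)

# Convert data.frame to data.table
gfg <- data.table(GFG)

# Mean and number of rows (.N) for each Category group
summary_by_category <- gfg[, .(Mean_Frequency = mean(Frequency),
                               Count = .N),
                           by = Category]

print(summary_by_category)
```

Each group should show a Count of 3, matching the number of terms used in the hand calculation earlier.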