What is Manipulate R Data Frames Using SQL?

In this article, we will learn Manipulate R Data Frames Using SQL,This free SQL tutorial for complete beginners will help you learn SQL from scratch.

Manipulate R Data Frames Using SQL - ❤️SQL Tutorials In 2024

Manipulating data frames in R Programming using SQL can be easily done using the sqldf package. This package in R provides a mechanism that allows data frame manipulation with SQL and also helps to connect with a limited number of databases. The sqldf package in R is basically used for executing the SQL commands or statements on data frames. One can simply specify the SQL statement using data frame names instead of table names in R, and then the following things happen:

A database with proper schema or table layout is created
The data frames are loaded into the created database automatically
The specific SQL statement or command is executed
The result is retrieved back into R, and
Automatically the database gets deleted.

This makes the existence of the database quite transparent. This method can lead to faster R calculation. The result is obtained using some heuristics in order to determine the class which is to be assigned to each column of the resultant data frame.

Working With sqldf in R

A handful of SQL operations can be performed in R using the sqldf package. Let’s use two csv files from the Highway data.

crashes.csv which contains Year, Road, N_Crashes, and Volume.
roads.csv which contains Road, District, and Length.

In order to work with the sqldf package, first install it as follows:

install.packages("sqldf")

After proper installation, include the package in R script as follows:

library(sqldf)

Now load the data in the script. In order to do so, change the present directory to the directory which contains the csv files crashes.csv and roads.csv using the setwd() function.
Example:

r

# Importing required library
library(sqldf)
 
# Changing the directory
setwd("C:\\Users\\SHAONI\\Documents\\
                     R\\win-library")
 
# Reading the csv files
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
 
# Displaying the data in crashes.csv 
head(crashes)
tail(crashes)
 
# Displaying the data in roads.csv 
print(roads)

Output:

  Year     Road       N_Crashes Volume
1 1991 Interstate 65        25  40000
2 1992 Interstate 65        37  41000
3 1993 Interstate 65        45  45000
4 1994 Interstate 65        46  45600
5 1995 Interstate 65        46  49000
6 1996 Interstate 65        59  51000

    Year  Road           N_Crashes Volume
105 2007 Interstate 275        32  21900
106 2008 Interstate 275        21  21850
107 2009 Interstate 275        25  22100
108 2010 Interstate 275        24  21500
109 2011 Interstate 275        23  20300
110 2012 Interstate 275        22  21200

           Road       District   Length
1 Interstate 65     Greenfield    262
2 Interstate 70      Vincennes    156
3         US-36 Crawfordsville    139
4         US-40     Greenfield    150
5         US-52 Crawfordsville    172

Now perform any SQL operation on these data using the sqldf() function of the sqldf package.

Joining and Merging Data Frames

The most common SQL operation is the join operation. One can perform left join and inner join using sqldf(). Currently, sqldf() does not support the full outer join and right join operations. Along with the sqldf package we need to include the tcltk package.
Example 1: Performing left join operation

r

# Perform Left Join
 
# Importing required library
library(sqldf)
library(tcltk)
 
# Setting the directory
setwd("C:\\Users\\SHAONI\\Documents\\
                     R\\win-library")
 
# Reading the csv files
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
 
# Performing left join
join_string <- "select crashes.*, 
                 roads.District, 
                 roads.Length 
                 from crashes
                 left join 
                 roads on 
                 crashes.Road = roads.Road"
 
# Resultant data frame
crashes_join_roads <- sqldf(join_string, 
                stringsAsFactors = FALSE)
head(crashes_join_roads)
tail(crashes_join_roads)

Output:

  Year    Road        N_Crashes Volume  District   Length
1 1991 Interstate 65        25  40000 Greenfield    262
2 1992 Interstate 65        37  41000 Greenfield    262
3 1993 Interstate 65        45  45000 Greenfield    262
4 1994 Interstate 65        46  45600 Greenfield    262
5 1995 Interstate 65        46  49000 Greenfield    262
6 1996 Interstate 65        59  51000 Greenfield    262

    Year  Road          N_Crashes Volume     District     Length
105 2007 Interstate 275        32  21900     <NA>     NA
106 2008 Interstate 275        21  21850     <NA>     NA
107 2009 Interstate 275        25  22100     <NA>     NA
108 2010 Interstate 275        24  21500     <NA>     NA
109 2011 Interstate 275        23  20300     <NA>     NA
110 2012 Interstate 275        22  21200     <NA>     NA

Explanation:
The crashes_join_roads is a new data frame created by the sqldf statement which stores the result of the join operation. The sqldf() function or operation requires at least a string character along with the SQL operation. The stringsAsFactors parameter is used to assign character class to the categorical data instead of factor class.
Example 2: Performing inner join

r

# Perform Inner Join
 
# Importing required package
library(sqldf)
library(tcltk)
 
# Selecting the proper directory
setwd("C:\\Users\\SHAONI\\Documents\\
                     R\\win-library")
# Reading the csv files
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
 
# Performing the inner join
join_string2 <- "select crashes.*, 
                 roads.District,
                 roads.Length 
                 from crashes
                 inner join 
                 roads on 
                 crashes.Road = roads.Road"
 
# The new data frame
crashes_join_roads2 <- sqldf(join_string2, 
                 stringsAsFactors = FALSE)
head(crashes_join_roads2)
tail(crashes_join_roads2)

Output:

 
  Year     Road       N_Crashes Volume  District  Length
1 1991 Interstate 65        25  40000 Greenfield    262
2 1992 Interstate 65        37  41000 Greenfield    262
3 1993 Interstate 65        45  45000 Greenfield    262
4 1994 Interstate 65        46  45600 Greenfield    262
5 1995 Interstate 65        46  49000 Greenfield    262
6 1996 Interstate 65        59  51000 Greenfield    262

   Year  Road  N_Crashes Volume   District      Length
83 2007 US-36        49  24000 Crawfordsville    139
84 2008 US-36        52  24500 Crawfordsville    139
85 2009 US-36        55  24700 Crawfordsville    139
86 2010 US-36        35  23000 Crawfordsville    139
87 2011 US-36        33  21000 Crawfordsville    139
88 2012 US-36        31  20500 Crawfordsville    139

Here only the matching rows are kept in the resultant data frame.
Now let’s see how the merge() function works. In R, the merge operation is capable of performing left join, right join, inner join, and full outer join, unlike the sqldf() function. Also, one can easily perform the equivalent operation like sqldf() using the merge() operation.
Example 3:

r

# Perform Merge operation
 
# Import required library
library(sqldf)
library(tcltk)
 
setwd("C:\\Users\\SHAONI\\Documents\\
                     R\\win-library")
 
# Reading the two csv files
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
# Merge the two data frames 
crashes_merge_roads2 <- merge(crashes, 
                              roads, 
                              by = c("Road"),
                              all.x = TRUE)
head(crashes_merge_roads2)
tail(crashes_merge_roads2)

Output:

 
     Road        Year  N_Crashes Volume    District    Length
1 Interstate 275 1994        21  21200     <NA>     NA
2 Interstate 275 1995        28  23200     <NA>     NA
3 Interstate 275 1996        22  20000     <NA>     NA
4 Interstate 275 1997        27  18000     <NA>     NA
5 Interstate 275 1998        21  19500     <NA>     NA
6 Interstate 275 1999        22  21000     <NA>     NA

     Road Year N_Crashes Volume   District  Length
105 US-40 2003        94  55200 Greenfield    150
106 US-40 2004        25  55300 Greenfield    150
107 US-40 2009        67  65000 Greenfield    150
108 US-40 2010       102  67000 Greenfield    150
109 US-40 2011        87  67500 Greenfield    150
110 US-40 2012        32  67500 Greenfield    150

We will see that the rows in the resultant data frames are rearranged when we are using the merge() function.

Using where Clause

R can perform the exact operations as SQL. Hence to use a SQL statement where to include any condition use the where clause.
Example:
Let’s see how to perform inner join using the combination of merge and subset operation by including the where clause in the query.

r

# Using where clause
 
# Importing required library
library(sqldf)
library(plyr)
library(tcltk)
 
setwd("C:\\Users\\SHAONI\\Documents\\
                     R\\win-library")
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
 
# Using the where clause
join_string2 <- "select crashes.*, 
                 roads.District, 
                 roads.Length 
                 from crashes
                 inner join 
                 roads on 
                 crashes.Road = roads.Road 
                 where 
                 crashes.Road = 'US-40'"  
               
crashes_join_roads4 <- sqldf(join_string2, 
                 stringsAsFactors = FALSE)
head(crashes_join_roads4)
tail(crashes_join_roads4)

Output:

  Year  Road  N_Crashes Volume District   Length
1 1991 US-40        46  21000 Greenfield    150
2 1992 US-40       101  21500 Greenfield    150
3 1993 US-40        76  23000 Greenfield    150
4 1994 US-40        72  21000 Greenfield    150
5 1995 US-40        75  24000 Greenfield    150
6 1996 US-40       136  23500 Greenfield    150

   Year  Road N_Crashes  Volume  District  Length
17 2007 US-40        45  59500 Greenfield    150
18 2008 US-40        23  61000 Greenfield    150
19 2009 US-40        67  65000 Greenfield    150
20 2010 US-40       102  67000 Greenfield    150
21 2011 US-40        87  67500 Greenfield    150
22 2012 US-40        32  67500 Greenfield    150

Aggregate Functions

In the sqldf package, the aggregate operations can be performed using the group by clause.
Example:

r

# Perform aggregate operations
 
# Import required library
library(sqldf)
library(tcltk)
 
setwd("C:\\Users\\SHAONI\\Documents\\
                     R\\win-library")
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
 
# Group by clause
group_string <- "select crashes.Road, 
                 avg(crashes.N_Crashes) 
                 as Mean_Crashes 
                 from crashes
                 left join 
                 roads on 
                 crashes.Road = roads.Road 
                 group by 1"
sqldf(group_string)

Output:

      Road         Mean_Crashes
1 Interstate 275     24.95455
2  Interstate 65    107.81818
3  Interstate 70     65.18182
4          US-36     48.00000
5          US-40     68.68182

The sqldf() function can be used for performing certain kinds of data manipulations. To overcome these limitations, use the plyr package in the R Script. Hadley Wickham’s plyr package can be used to perform advanced calculations and data manipulations. Let’s see how it works.
Example:

r

# Importing required library
library(sqldf)
library(plyr)
library(tcltk)
 
setwd("C:\\Users\\SHAONI\\Documents\\
                     R\\win-library")
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
 
ddply(
crashes_merge_roads,
c("Road"),
function(X) 
data.frame(
  Mean_Crashes = mean(X$N_Crashes),
  Q1_Crashes = quantile(X$N_Crashes, 0.25),
  Q3_Crashes = quantile(X$N_Crashes, 0.75),
  Median_Crashes = quantile(X$N_Crashes, 0.50))
)

Output:

 
     Road        Mean_Crashes   Q1_Crashes Q3_Crashes   Median_Crashes
1 Interstate 65    107.81818      63.25     140.25          108.5
2 Interstate 70     65.18182      52.00      75.50           66.5
3         US-36     48.00000      42.00      57.25           47.0
4         US-40     68.68182      45.25      90.75           70.0