Manipulate R Data Frames Using SQL
Manipulating data frames in R Programming using SQL can be easily done using the sqldf package. This package in R provides a mechanism that allows data frame manipulation with SQL and also helps to connect with a limited number of databases. The sqldf package in R is basically used for executing the SQL commands or statements on data frames. One can simply specify the SQL statement using data frame names instead of table names in R, and then the following things happen:
- A database with proper schema or table layout is created
- The data frames are loaded into the created database automatically
- The specific SQL statement or command is executed
- The result is retrieved back into R, and
- Automatically the database gets deleted.
This makes the existence of the database quite transparent. This method can lead to faster R calculation. The result is obtained using some heuristics in order to determine the class which is to be assigned to each column of the resultant data frame.
A handful of SQL operations can be performed in R using the sqldf package. Let’s use two csv files from the Highway data.
- crashes.csv which contains Year, Road, N_Crashes, and Volume.
- roads.csv which contains Road, District, and Length.
In order to work with the sqldf package, first install it as follows:
install.packages("sqldf")
After proper installation, include the package in R script as follows:
library(sqldf)
Now load the data in the script. In order to do so, change the present directory to the directory which contains the csv files crashes.csv and roads.csv using the setwd() function.
Example:
r
# Importing required library library (sqldf) # Changing the directory setwd ("C:\\Users\\SHAONI\\Documents\\ R\\win-library") # Reading the csv files crashes <- read.csv ( "crashes.csv" ) roads <- read.csv ( "roads.csv" ) # Displaying the data in crashes.csv head (crashes) tail (crashes) # Displaying the data in roads.csv print (roads) |
Output:
Year Road N_Crashes Volume 1 1991 Interstate 65 25 40000 2 1992 Interstate 65 37 41000 3 1993 Interstate 65 45 45000 4 1994 Interstate 65 46 45600 5 1995 Interstate 65 46 49000 6 1996 Interstate 65 59 51000 Year Road N_Crashes Volume 105 2007 Interstate 275 32 21900 106 2008 Interstate 275 21 21850 107 2009 Interstate 275 25 22100 108 2010 Interstate 275 24 21500 109 2011 Interstate 275 23 20300 110 2012 Interstate 275 22 21200 Road District Length 1 Interstate 65 Greenfield 262 2 Interstate 70 Vincennes 156 3 US-36 Crawfordsville 139 4 US-40 Greenfield 150 5 US-52 Crawfordsville 172
Now perform any SQL operation on these data using the sqldf() function of the sqldf package.
Joining and Merging Data Frames
The most common SQL operation is the join operation. One can perform left join and inner join using sqldf(). Currently, sqldf() does not support the full outer join and right join operations. Along with the sqldf package we need to include the tcltk package.
Example 1: Performing left join operation
r
# Perform Left Join # Importing required library library (sqldf) library (tcltk) # Setting the directory setwd ("C:\\Users\\SHAONI\\Documents\\ R\\win-library") # Reading the csv files crashes <- read.csv ( "crashes.csv" ) roads <- read.csv ( "roads.csv" ) # Performing left join join_string <- "select crashes.*, roads.District, roads.Length from crashes left join roads on crashes.Road = roads.Road" # Resultant data frame crashes_join_roads <- sqldf (join_string, stringsAsFactors = FALSE ) head (crashes_join_roads) tail (crashes_join_roads) |
Output:
Year Road N_Crashes Volume District Length 1 1991 Interstate 65 25 40000 Greenfield 262 2 1992 Interstate 65 37 41000 Greenfield 262 3 1993 Interstate 65 45 45000 Greenfield 262 4 1994 Interstate 65 46 45600 Greenfield 262 5 1995 Interstate 65 46 49000 Greenfield 262 6 1996 Interstate 65 59 51000 Greenfield 262 Year Road N_Crashes Volume District Length 105 2007 Interstate 275 32 21900 <NA> NA 106 2008 Interstate 275 21 21850 <NA> NA 107 2009 Interstate 275 25 22100 <NA> NA 108 2010 Interstate 275 24 21500 <NA> NA 109 2011 Interstate 275 23 20300 <NA> NA 110 2012 Interstate 275 22 21200 <NA> NA
Explanation:
The crashes_join_roads is a new data frame created by the sqldf statement which stores the result of the join operation. The sqldf() function or operation requires at least a string character along with the SQL operation. The stringsAsFactors parameter is used to assign character class to the categorical data instead of factor class.
Example 2: Performing inner join
r
# Perform Inner Join # Importing required package library (sqldf) library (tcltk) # Selecting the proper directory setwd ("C:\\Users\\SHAONI\\Documents\\ R\\win-library") # Reading the csv files crashes <- read.csv ( "crashes.csv" ) roads <- read.csv ( "roads.csv" ) # Performing the inner join join_string2 <- "select crashes.*, roads.District, roads.Length from crashes inner join roads on crashes.Road = roads.Road" # The new data frame crashes_join_roads2 <- sqldf (join_string2, stringsAsFactors = FALSE ) head (crashes_join_roads2) tail (crashes_join_roads2) |
Output:
Year Road N_Crashes Volume District Length 1 1991 Interstate 65 25 40000 Greenfield 262 2 1992 Interstate 65 37 41000 Greenfield 262 3 1993 Interstate 65 45 45000 Greenfield 262 4 1994 Interstate 65 46 45600 Greenfield 262 5 1995 Interstate 65 46 49000 Greenfield 262 6 1996 Interstate 65 59 51000 Greenfield 262 Year Road N_Crashes Volume District Length 83 2007 US-36 49 24000 Crawfordsville 139 84 2008 US-36 52 24500 Crawfordsville 139 85 2009 US-36 55 24700 Crawfordsville 139 86 2010 US-36 35 23000 Crawfordsville 139 87 2011 US-36 33 21000 Crawfordsville 139 88 2012 US-36 31 20500 Crawfordsville 139
Here only the matching rows are kept in the resultant data frame.
Now let’s see how the merge() function works. In R, the merge operation is capable of performing left join, right join, inner join, and full outer join, unlike the sqldf() function. Also, one can easily perform the equivalent operation like sqldf() using the merge() operation.
Example 3:
r
# Perform Merge operation # Import required library library (sqldf) library (tcltk) setwd ("C:\\Users\\SHAONI\\Documents\\ R\\win-library") # Reading the two csv files crashes <- read.csv ( "crashes.csv" ) roads <- read.csv ( "roads.csv" ) # Merge the two data frames crashes_merge_roads2 <- merge (crashes, roads, by = c ( "Road" ), all.x = TRUE ) head (crashes_merge_roads2) tail (crashes_merge_roads2) |
Output:
Road Year N_Crashes Volume District Length 1 Interstate 275 1994 21 21200 <NA> NA 2 Interstate 275 1995 28 23200 <NA> NA 3 Interstate 275 1996 22 20000 <NA> NA 4 Interstate 275 1997 27 18000 <NA> NA 5 Interstate 275 1998 21 19500 <NA> NA 6 Interstate 275 1999 22 21000 <NA> NA Road Year N_Crashes Volume District Length 105 US-40 2003 94 55200 Greenfield 150 106 US-40 2004 25 55300 Greenfield 150 107 US-40 2009 67 65000 Greenfield 150 108 US-40 2010 102 67000 Greenfield 150 109 US-40 2011 87 67500 Greenfield 150 110 US-40 2012 32 67500 Greenfield 150
We will see that the rows in the resultant data frames are rearranged when we are using the merge() function.
Using where Clause
R can perform the exact operations as SQL. Hence to use a SQL statement where to include any condition use the where clause.
Example:
Let’s see how to perform inner join using the combination of merge and subset operation by including the where clause in the query.
r
# Using where clause # Importing required library library (sqldf) library (plyr) library (tcltk) setwd ("C:\\Users\\SHAONI\\Documents\\ R\\win-library") crashes <- read.csv ( "crashes.csv" ) roads <- read.csv ( "roads.csv" ) # Using the where clause join_string2 <- "select crashes.*, roads.District, roads.Length from crashes inner join roads on crashes.Road = roads.Road where crashes.Road = 'US-40' " crashes_join_roads4 <- sqldf (join_string2, stringsAsFactors = FALSE ) head (crashes_join_roads4) tail (crashes_join_roads4) |
Output:
Year Road N_Crashes Volume District Length 1 1991 US-40 46 21000 Greenfield 150 2 1992 US-40 101 21500 Greenfield 150 3 1993 US-40 76 23000 Greenfield 150 4 1994 US-40 72 21000 Greenfield 150 5 1995 US-40 75 24000 Greenfield 150 6 1996 US-40 136 23500 Greenfield 150 Year Road N_Crashes Volume District Length 17 2007 US-40 45 59500 Greenfield 150 18 2008 US-40 23 61000 Greenfield 150 19 2009 US-40 67 65000 Greenfield 150 20 2010 US-40 102 67000 Greenfield 150 21 2011 US-40 87 67500 Greenfield 150 22 2012 US-40 32 67500 Greenfield 150
Aggregate Functions
In the sqldf package, the aggregate operations can be performed using the group by clause.
Example:
r
# Perform aggregate operations # Import required library library (sqldf) library (tcltk) setwd ("C:\\Users\\SHAONI\\Documents\\ R\\win-library") crashes <- read.csv ( "crashes.csv" ) roads <- read.csv ( "roads.csv" ) # Group by clause group_string <- "select crashes.Road, avg (crashes.N_Crashes) as Mean_Crashes from crashes left join roads on crashes.Road = roads.Road group by 1" sqldf (group_string) |
Output:
Road Mean_Crashes 1 Interstate 275 24.95455 2 Interstate 65 107.81818 3 Interstate 70 65.18182 4 US-36 48.00000 5 US-40 68.68182
The sqldf() function can be used for performing certain kinds of data manipulations. To overcome these limitations, use the plyr package in the R Script. Hadley Wickham’s plyr package can be used to perform advanced calculations and data manipulations. Let’s see how it works.
Example:
r
# Importing required library library (sqldf) library (plyr) library (tcltk) setwd ("C:\\Users\\SHAONI\\Documents\\ R\\win-library") crashes <- read.csv ( "crashes.csv" ) roads <- read.csv ( "roads.csv" ) ddply ( crashes_merge_roads, c ( "Road" ), function (X) data.frame ( Mean_Crashes = mean (X$N_Crashes), Q1_Crashes = quantile (X$N_Crashes, 0.25), Q3_Crashes = quantile (X$N_Crashes, 0.75), Median_Crashes = quantile (X$N_Crashes, 0.50)) ) |
Output:
Road Mean_Crashes Q1_Crashes Q3_Crashes Median_Crashes 1 Interstate 65 107.81818 63.25 140.25 108.5 2 Interstate 70 65.18182 52.00 75.50 66.5 3 US-36 48.00000 42.00 57.25 47.0 4 US-40 68.68182 45.25 90.75 70.0