Visualizing missing data for all columns
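Let's create a function that transforms the dataframe into a TRUE/FALSE count matrix, where the TRUE row holds the number of missing values and the FALSE row holds the number of non-missing values for each column, and then visualize it using a barplot in R.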
Example: Visualizing missing data for all columns
R
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# function convert dataframe to binary TRUE/FALSE matrix
toBinaryMatrix <- function(df) {
  m <- c()
  for (i in colnames(df)) {
    x <- sum(is.na(df[, i]))
    # missing value count
    m <- append(m, x)
    # non-missing value count
    m <- append(m, nrow(df) - x)
  }

  # adding column and row names to matrix
  a <- matrix(m, nrow = 2)
  rownames(a) <- c("TRUE", "FALSE")
  colnames(a) <- colnames(df)

  return(a)
}

# function call
binMat <- toBinaryMatrix(df)
binMat
Output:
      age name grade
TRUE    2    3     0
FALSE   4    3     6
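A count matrix like this one can be passed straight to base R's barplot(), which stacks the rows of a matrix inside each column's bar. The following is a minimal sketch rather than the exact plot this article may have intended: the title, axis labels, colours, and legend labels are illustrative assumptions, and the matrix is rebuilt with sapply() so the block runs on its own.

```r
# Rebuild the sample data frame so this block is self-contained
age   <- c(12, 34, NA, 7, 15, NA)
name  <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# Same counts as the matrix above: row "TRUE" = missing, row "FALSE" = present
binMat <- sapply(df, function(x) {
  c("TRUE" = sum(is.na(x)), "FALSE" = sum(!is.na(x)))
})

# Stacked barplot: one bar per column, split into missing and present counts
# (title, labels, and colours are illustrative choices)
mids <- barplot(binMat,
                main = "Missing values per column",
                xlab = "Column",
                ylab = "Number of rows",
                col = c("tomato", "steelblue"),
                legend.text = c("Missing", "Present"))
```

Each bar has the same total height (the number of rows), so the share of missing values in a column can be read directly from the size of its lower segment.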
Visualizing Missing Data with Barplot in R
In this article, we will discuss how to visualize missing data with a barplot using the R programming language.
Missing data are data points that were not recorded, i.e. not entered in the dataset. They are usually represented as NA, NaN, or an empty cell.
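As a quick illustration of how these representations behave in R (a sketch using made-up vectors, not the dataset from this article): is.na() flags both NA and NaN, while an empty string is an ordinary value until it is explicitly converted to NA.

```r
# Illustrative vectors (hypothetical values, not the article's dataset)
nums  <- c(1, NA, NaN, 4)
chars <- c("a", "", NA)

is.na(nums)    # NA and NaN are both flagged as missing
is.nan(nums)   # only NaN is flagged here
is.na(chars)   # the empty string is not flagged as missing

# Convert empty strings to NA so that they are counted as missing
chars[which(chars == "")] <- NA
sum(is.na(chars))
```

This matters when data is read from a file: empty cells in text columns may need to be mapped to NA before the counts in the examples that follow reflect them.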
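Dataset in use: a dataframe with six rows and three columns (age, name, grade), created at the start of each example.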
In larger datasets, a few missing values might not affect the overall information, whereas in smaller datasets they can mean a huge loss of information. Missing values are either removed or imputed, depending on the dataset. To decide how to deal with them, we'll first see how to visualize the missing data points.
Let us first count the total number of missing values.
Example: Counting missing values
R
# Creating a sample dataframe using 3 vectors
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# count the total number of missing values
sum(is.na(df))
Output:
5
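This works because is.na(df) returns a logical matrix with the same shape as the dataframe, and sum() counts its TRUE entries. A small sketch along the same lines follows; the variable name na_mask and the overall proportion computed with mean() are additions for illustration, not part of the original example.

```r
# Same sample dataframe as in the example above
age   <- c(12, 34, NA, 7, 15, NA)
name  <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

na_mask <- is.na(df)   # logical matrix, TRUE where a value is missing
dim(na_mask)           # same shape as df: 6 rows, 3 columns
sum(na_mask)           # total number of missing values
mean(na_mask)          # share of all cells that are missing
anyNA(df)              # quick check: is anything missing at all?
```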
We can also find out how many missing values there are in each attribute/column.
Example: Count missing values in each attribute/column
R
# Creating a sample dataframe using 3 vectors
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# count number of missing values in each
# attribute/column
sapply(df, function(x) sum(is.na(x)))
Output:
  age  name grade
    2     3     0
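The same per-column counts can also be obtained without an anonymous function. The sketch below uses colSums() on the logical matrix from is.na(), and colMeans() to express the result as a fraction of rows; the variable names na_counts and na_share are illustrative.

```r
# Same sample dataframe as in the example above
age   <- c(12, 34, NA, 7, 15, NA)
name  <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

na_counts <- colSums(is.na(df))    # number of missing values per column
na_share  <- colMeans(is.na(df))   # fraction of rows missing per column

na_counts
round(na_share, 2)
```

Fractions are often easier to compare than raw counts when columns come from datasets of different sizes.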