How to Calculate Correlation in R with Missing Values
When we calculate correlation in the R Programming Language on data that contains missing values, cor() by default uses use = "everything", so any coefficient that involves a variable with an NA is itself returned as NA. To get a numeric result, we have to tell cor() how to treat the missing values or handle them ourselves before the call. In this article, we will learn about different approaches by which we can calculate correlation in R with missing values.
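It is worth checking what cor() returns when no use argument is supplied. The short sketch below uses two made-up vectors, x and y (illustrative data, not taken from a real dataset), and suggests that the plain call gives NA rather than silently dropping incomplete pairs:

```r
# Two short vectors, each with one missing value (illustrative data)
x <- c(1, 2, 3, NA, 5)
y <- c(4, NA, 6, 7, 8)

# With no `use` argument, cor() falls back to use = "everything",
# so an NA in the inputs propagates to the result
default_result <- cor(x, y)
print(default_result)

# Restricting the calculation to complete pairs gives a numeric result
pairwise_result <- cor(x, y, use = "pairwise.complete.obs")
print(pairwise_result)
```

The rest of the article covers the options that turn that NA into a usable coefficient.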
Below are some of the ways by which we can calculate correlation in R with missing values:
- Using cor() with complete.obs
- Using cor() with pairwise.complete.obs
- Handling Missing Values Manually
- Using the cov() and cor() Functions with Imputation
Calculate Correlation with Missing Values Using cor() with complete.obs
In this example, we use the cor() function to calculate the correlation matrix for the columns A, B and C of a data frame. Specifying use = "complete.obs" makes cor() drop every row that has a missing value in any column and calculate the coefficients from the remaining complete rows only. The resulting correlation matrix is then printed to the console.
R
# Sample dataset with missing values
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)

# Calculate correlation with missing values using cor() with complete.obs
correlation_matrix <- cor(data, use = "complete.obs")

# Print the correlation matrix
print(correlation_matrix)
Output:
  A B C
A 1 1 1
B 1 1 1
C 1 1 1
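A matrix in which every coefficient equals 1 deserves a second look. One way to investigate, sketched here with the same sample data, is to count how many rows are complete using complete.cases(); the variable names complete_rows and n_complete are chosen only for this illustration:

```r
# Same sample data as in the example above
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)

# TRUE for rows that have no missing value in any column
complete_rows <- complete.cases(data)

# Number of rows that use = "complete.obs" can actually work with
n_complete <- sum(complete_rows)
print(n_complete)

# The rows themselves
print(data[complete_rows, ])
```

Only two rows survive here, and a correlation computed from two points is always exactly 1 or -1, so the matrix above says little about the real relationship between the columns.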
Calculate Correlation with Missing Values Using cor() with pairwise.complete.obs
In this example, we use the cor() function again. Specifying use = 'pairwise.complete.obs' makes cor() calculate each coefficient from all observations that are complete for that particular pair of variables, so different entries of the matrix can be based on different rows. The resulting correlation matrix is then printed to the console.
R
# Create sample data frame with missing values
df <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(4, NA, 6, 7, 8)
)

# Calculate correlation matrix
correlation_matrix <- cor(df, use = 'pairwise.complete.obs')
print(correlation_matrix)
Output:
  x y
x 1 1
y 1 1
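With pairwise deletion, it can help to know how many observations stand behind each coefficient. The following is a minimal sketch, assuming a three-column data frame like the one from the complete.obs example; the names present and n_pairs are hypothetical and exist only for this illustration:

```r
# Three columns, each missing a value in a different row (illustrative data)
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)

# 1 where a value is present, 0 where it is missing
present <- (!is.na(as.matrix(data))) * 1

# Entry [i, j] counts the rows in which both column i and column j are observed
n_pairs <- crossprod(present)
print(n_pairs)

# Pairwise correlations, for comparison with the counts
print(cor(data, use = "pairwise.complete.obs"))
```

Each off-diagonal pair in this data is observed together in three rows, while use = "complete.obs" would keep only the rows that are complete across all three columns.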
Calculate Correlation with Missing Values by Handling Missing Values Manually
In this approach, rows with missing values are removed manually with na.omit() before the correlation matrix is calculated. This ensures that only complete rows are used in the correlation calculation.
R
# Example data with missing values
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Remove rows with missing values
complete_data <- na.omit(data)

# Calculate correlation matrix with complete data
correlation_matrix <- cor(complete_data)

# View the correlation matrix
correlation_matrix
Output:
          x         y
x 1.0000000 0.9819805
y 0.9819805 1.0000000
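Removing incomplete rows by hand should lead to the same matrix as asking cor() to use complete observations. A small check along these lines, with the variable names manual, built_in and same_result chosen only for this sketch, could look like this:

```r
# Same example data as above
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Manual route: drop incomplete rows first, then correlate
manual <- cor(na.omit(data))

# Built-in route: let cor() drop the incomplete rows
built_in <- cor(data, use = "complete.obs")

# all.equal() compares the two matrices up to numerical tolerance
same_result <- isTRUE(all.equal(manual, built_in))
print(same_result)
```

The manual route is mainly useful when the cleaned data frame is needed for later steps as well, since complete_data can be reused.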
Calculate Correlation with Missing Values Using the cov() and cor() Functions with Imputation
In this method, we replace each missing value with the mean of its column and then calculate the covariance and correlation matrices on the imputed data, so every row contributes to the result.
R
# Example data with missing values
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Impute missing values with mean
imputed_data <- apply(data, 2, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Calculate covariance matrix
covariance_matrix <- cov(imputed_data)

# Calculate correlation matrix
correlation_matrix <- cor(imputed_data)

# View the correlation matrix
correlation_matrix
Output:
          x         y
x 1.0000000 0.8882165
y 0.8882165 1.0000000
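The imputed coefficient differs from the one obtained after removing incomplete rows on the same data. The sketch below puts the two side by side; impute_mean is a hypothetical helper written for this illustration, not a function from base R or any package:

```r
# Same example data as above
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Hypothetical helper: replace NAs with the mean of the observed values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Apply the helper to every column
imputed <- as.data.frame(lapply(data, impute_mean))

# Correlation after mean imputation versus complete rows only
r_imputed <- cor(imputed$x, imputed$y)
r_complete <- cor(data$x, data$y, use = "complete.obs")

print(c(imputed = r_imputed, complete = r_complete))
```

On this data the imputed coefficient is the smaller of the two. An imputed value sits exactly at its column mean, so it adds nothing to the covariance while the other variable still contributes to its own variance, which tends to pull the correlation toward zero.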
Conclusion
In this article, we looked at how to calculate correlation coefficients when the data contains missing values: restricting the calculation to complete rows, using pairwise complete observations, removing incomplete rows manually, and imputing missing values with column means. Each method treats the missing entries differently and can give a different coefficient on the same data, so the choice should take into account how much data is missing and how the result will be used.