Create a Correlation Matrix Using Python

In data science and machine learning, a correlation matrix helps in understanding the relationships between variables. It shows how strongly, and in which direction, each pair of variables in a dataset is related.

For anyone working with data, knowing how to build and read a correlation matrix makes it easier to draw meaningful insights. In this article, we will explore the step-by-step process of creating a correlation matrix in Python.

What is correlation?

Correlation is a statistical indicator that quantifies the degree to which two variables change in relation to each other. It indicates the strength and direction of the linear relationship between two variables. The correlation coefficient is denoted by “r”, and it ranges from -1 to 1.

  • If r = -1, it means that there is a perfect negative correlation.
  • If r = 0, it means that there is no linear correlation between the two variables.
  • If r = 1, it means that there is a perfect positive correlation.

There are two popular methods used to find the correlation coefficients:

Pearson’s product-moment correlation coefficient

The Pearson correlation coefficient (r) is a measure of the linear relationship between two variables.

r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}

Here,

  • n is the number of data points
  • ∑xy is the sum of the product of corresponding values of x and y
  • ∑x is the sum of all the values of x
  • ∑y is the sum of all the values of y
  • ∑x² is the sum of the squares of all the values of x
  • ∑y² is the sum of the squares of all the values of y
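The formula can be checked by hand with a few lines of plain Python. The sketch below is only an illustration of the sums defined above; the function name `pearson_r` and the sample lists are made up for this example and do not come from any library.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed from the raw sums in the formula above."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)                                  # number of data points
    sum_x = sum(x)                              # sum of x
    sum_y = sum(y)                              # sum of y
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of x*y
    sum_x2 = sum(a * a for a in x)              # sum of x squared
    sum_y2 = sum(b * b for b in y)              # sum of y squared

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        # One of the variables is constant, so r is undefined.
        raise ValueError("correlation is undefined for a constant variable")
    return numerator / denominator


# Hypothetical sample data, chosen so the result is easy to verify.
values = [1, 2, 3, 4]
doubled = [2, 4, 6, 8]     # rises exactly in step with values
reversed_ = [8, 6, 4, 2]   # falls exactly in step with values

print(pearson_r(values, doubled))    # 1.0
print(pearson_r(values, reversed_))  # -1.0
```

In practice a library routine is usually preferable to this raw-sums form, which can lose precision on large values.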

Spearman’s rank correlation coefficient

Spearman’s rank correlation coefficient is a measure of the monotonic relationship between two variables. It is based on the ranks of the data rather than the actual data values.

\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}

Here,

  • n is the number of paired observations
  • d is the difference between the rank of corresponding values of the two variables.
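As a rough illustration, the rank-based formula can be written in plain Python. This is a minimal sketch that assumes every value within a variable is distinct, since the simple formula above does not handle tied ranks. The names `ranks` and `spearman_rho` and the sample lists are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks (smallest value gets rank 1).

    Assumes all values are distinct; ties are not handled.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from the sum of squared rank differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)  # number of paired observations
    if n < 2:
        raise ValueError("at least two paired observations are required")
    rank_x, rank_y = ranks(x), ranks(y)
    # d is the difference between the ranks of each pair
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Hypothetical sample data with no ties.
x = [10, 20, 30, 40, 50]
y = [2, 1, 4, 3, 5]

print(ranks(y))            # [2, 1, 4, 3, 5]
print(spearman_rho(x, y))  # 0.8
```

Here the rank differences are -1, 1, -1, 1 and 0, so ∑d² = 4 and ρ = 1 - 24/120 = 0.8.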

What is a Correlation Matrix?

A correlation matrix is a table that displays the correlation coefficients between the variables in a dataset, indicating the strength and direction of each pairwise relationship. Each cell holds the correlation between two specific variables. The entries on the diagonal are always 1, since every variable is perfectly correlated with itself, and the matrix is symmetric. The matrix serves several purposes: it summarizes the relationships in the data, acts as input for more sophisticated analyses, and works as a diagnostic aid for advanced analytical procedures. Because it gives an overview of all inter-variable correlations at once, it helps in discerning patterns, guiding further analysis, and identifying areas of interest or concern in the dataset, which makes it a common first step in many data analyses.

Interpreting the correlation matrix

  • Strong correlations, indicated by values close to 1 or -1, suggest a robust connection, while weak correlations, near 0, imply a less pronounced association. Identifying these degrees of correlation helps in understanding how tightly variables are linked within the dataset, which supports targeted analysis and decision-making.
  • Positive correlations (values > 0) signify that as one variable increases, the other tends to increase as well. Conversely, negative correlations (values < 0) imply an inverse relationship: when one variable increases, the other tends to decrease. Investigating these directional associations shows how variables move together, which is useful for formulating hypotheses and predictions. Note that a correlation on its own does not show that one variable causes the other.
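The two points above can be turned into a small helper that lists each pair of variables with its direction, strongest relationship first. This is only a sketch: the function name `strongest_pairs`, the variable names, and the matrix values are invented for illustration and are not computed from real data.

```python
def strongest_pairs(names, matrix):
    """List each variable pair once, sorted by |r| from strongest to weakest."""
    pairs = []
    for i in range(len(names)):
        # Start at i + 1 to skip the diagonal and the mirrored lower half.
        for j in range(i + 1, len(names)):
            r = matrix[i][j]
            if r > 0:
                direction = "positive"
            elif r < 0:
                direction = "negative"
            else:
                direction = "none"
            pairs.append((names[i], names[j], r, direction))
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


# Hypothetical correlation matrix (made-up values for illustration only).
names = ["sales", "temperature", "rainfall"]
matrix = [
    [1.0, 0.9, -0.4],
    [0.9, 1.0, -0.2],
    [-0.4, -0.2, 1.0],
]

for a, b, r, direction in strongest_pairs(names, matrix):
    print(f"{a} vs {b}: r = {r} ({direction})")
```

Sorting by the absolute value keeps strong negative relationships near the top, alongside strong positive ones.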

How to create a correlation matrix in Python?

A correlation matrix can be created using either of the following two libraries:

  1. NumPy Library
  2. Pandas Library

Creating a correlation matrix using NumPy Library

NumPy is a library for mathematical computations. It can be used to create correlation matrices, which help to analyze the relationships between variables through a matrix representation.

Example 1

Suppose an ice cream shop keeps track of its total ice cream sales and the temperature on each day. To measure the correlation between the two, we will use the NumPy library.

In the following code snippet, x and y represent the total sales in dollars and the corresponding temperature for each day of sale, and the np.corrcoef() function is used to compute the correlation matrix.

Python3

import numpy as np

# x represents the total sale in dollars
x = [215, 325, 185, 332, 406, 522, 412,
     614, 544, 421, 445, 408]

# y represents the temperature on each day of sale
y = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 
     19.4, 25.1, 23.4, 18.1, 22.6, 17.2]

# create correlation matrix
matrix = np.corrcoef(x, y)
print(matrix)
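For two one-dimensional inputs, np.corrcoef() returns a 2×2 array: the diagonal holds the correlation of each variable with itself, and the two off-diagonal cells both hold the correlation between x and y. The sketch below reuses the same data to show one way of reading the result; the variable name `r` is chosen here for illustration.

```python
import numpy as np

# Same data as in the example above
x = [215, 325, 185, 332, 406, 522, 412,
     614, 544, 421, 445, 408]
y = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
     19.4, 25.1, 23.4, 18.1, 22.6, 17.2]

matrix = np.corrcoef(x, y)

print(matrix.shape)      # (2, 2)
print(np.diag(matrix))   # each variable against itself

# Row 0 is x and row 1 is y, so matrix[0, 1] is the
# correlation between sales and temperature.
r = matrix[0, 1]
print(round(r, 2))
```

For this data r comes out close to 0.96, a strong positive correlation: sales tend to be higher on warmer days.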