
How to create a Pearson correlation coefficient matrix

Pearson Correlation Matrix

In statistics, the Pearson correlation coefficient is widely used to measure the strength of the linear relationship between random variables. In this tutorial, we will learn to create a correlation matrix and display it as a heat matrix (heatmap).

Pearson correlation coefficient

Let X_1, X_2, X_3, \dots, X_n be n random variables. The Pearson correlation coefficient matrix is an n\times n matrix whose entry (i,j) is the correlation between variables X_i and X_j.
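
As a small illustration of this definition, the sketch below builds a 3\times 3 correlation matrix with NumPy and checks two properties that follow from it: the matrix is symmetric, and every diagonal entry is 1 because each variable is perfectly correlated with itself. The data values are hypothetical and chosen only for the example.

```python
import numpy as np

# Hypothetical data: 3 variables (one per row), 5 observations each.
data = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.8],
    [5.0, 3.0, 4.0, 1.0, 2.0],
])

# np.corrcoef treats each row as one variable by default,
# so the result has one row and one column per variable.
corr_matrix = np.corrcoef(data)

print(corr_matrix.shape)

# Entry (i, j) equals entry (j, i), and the diagonal holds corr(X_i, X_i) = 1.
is_symmetric = np.allclose(corr_matrix, corr_matrix.T)
has_unit_diagonal = np.allclose(np.diag(corr_matrix), 1.0)
print(is_symmetric, has_unit_diagonal)
```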

The Pearson correlation coefficient between two variables x and y is given by

(1)   \begin{equation*} corr(x,y)=\frac{cov(x,y)}{sd(x)\,sd(y)} \end{equation*}

Here, cov(x,y) is the covariance between x and y, and sd(x) is the standard deviation of x (sd(y) is defined analogously):

    \[cov(x,y)=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}\]

    \[sd(x)=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\]

where \bar{x} is the mean of x, \bar{y} is the mean of y, and n is the number of points in the sample.
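
Substituting the definitions of cov and sd into equation (1) shows that the n-1 factors cancel. A short derivation, writing x_i and y_i for the individual observations:

    \[corr(x,y)=\frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]

So the coefficient is the same whether n or n-1 is used as the divisor, as long as the same choice is made for the covariance and for both standard deviations.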

Python Code for Pearson correlation coefficient

import numpy as np
x=np.array([12,55,23,67,86,34])
y=np.array([34,90,56,134,162,78])

# Calculate mean for x and y
x_bar=np.mean(x)
y_bar=np.mean(y)

# number of data points in x
n=len(x)

# numerator and denominator for the sample covariance
cov_numerator=np.sum((x-x_bar)*(y-y_bar))
cov_denominator=(n-1)

# covariance computation
cov=cov_numerator/cov_denominator

# standard deviation computation
x_diff=(x-x_bar)**2
y_diff=(y-y_bar)**2

std_x=np.sqrt(np.sum(x_diff)/(n-1))
std_y=np.sqrt(np.sum(y_diff)/(n-1))

cor=cov/(std_x*std_y)
print('Correlation coefficient:',cor)
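
One way to sanity-check the manual computation is to compare it with NumPy's built-in np.corrcoef. The sketch below reuses the same x and y arrays, recomputes the coefficient in vectorized form (with ddof=1 so that the standard deviations use n-1, matching the formulas above), and confirms that the two results agree. The variable names manual and builtin are introduced only for this illustration.

```python
import numpy as np

x = np.array([12, 55, 23, 67, 86, 34])
y = np.array([34, 90, 56, 134, 162, 78])

# Vectorized form of the manual computation (n - 1 in every denominator).
n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# np.corrcoef returns the 2x2 correlation matrix of x and y;
# the off-diagonal entry [0, 1] is corr(x, y).
builtin = np.corrcoef(x, y)[0, 1]

print('Manual :', manual)
print('Builtin:', builtin)
print('Match  :', np.isclose(manual, builtin))
```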

Pearson correlation coefficient matrix in Heat matrix format

We will use the iris dataset to generate the correlation coefficient matrix and display it as a heat matrix.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read iris dataset
dataset=pd.read_csv('iris.csv')

# Print attribute names
print(dataset.columns.values)

# Drop the non-numeric 'species' column from the dataset
df=dataset.drop(columns='species')

# Generate Pearson correlation matrix
cor=df.corr(method='pearson')
print(cor)

# Plot the correlation matrix as a heat matrix
cm=plt.cm.viridis
sns.heatmap(cor,cmap=cm,linewidths=0.1,linecolor='white',annot=True)
plt.show()

Output

['sepal_length' 'sepal_width' 'petal_length' 'petal_width' 'species']

               sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.109369      0.871754     0.817954
sepal_width      -0.109369     1.000000     -0.420516    -0.356544
petal_length      0.871754    -0.420516      1.000000     0.962757
petal_width       0.817954    -0.356544      0.962757     1.000000
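
To read such a matrix programmatically rather than by eye, one option is to list each pair of attributes once and rank the pairs by the absolute value of their correlation. The sketch below is one possible approach: it rebuilds the matrix from the values printed above (so it does not need iris.csv), keeps only the upper triangle to avoid duplicates and the diagonal, and sorts the remaining pairs. The names mask, pairs and ranked are illustrative.

```python
import numpy as np
import pandas as pd

# Correlation matrix copied from the printed output above.
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
cor = pd.DataFrame(
    [[1.000000, -0.109369, 0.871754, 0.817954],
     [-0.109369, 1.000000, -0.420516, -0.356544],
     [0.871754, -0.420516, 1.000000, 0.962757],
     [0.817954, -0.356544, 0.962757, 1.000000]],
    index=cols, columns=cols)

# Keep only entries above the diagonal (k=1) so each pair appears once.
mask = np.triu(np.ones(cor.shape, dtype=bool), k=1)
pairs = cor.where(mask).stack().dropna()

# Rank pairs from strongest to weakest linear relationship, ignoring sign.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this output the first entry is the pair petal_length and petal_width, whose correlation of 0.962757 is the largest off-diagonal value in the matrix.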