
Data feature analysis skills - correlation test

2022-06-25 15:10:00 A window full of stars and milky way


Correlation analysis examines two or more variables that may be related and measures how strongly they are related.
There are four common methods:
- Judging from a plot
- Pearson correlation coefficient
- Spearman rank correlation coefficient
- Cosine similarity (cosine correlation coefficient)
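Cosine similarity is listed above but not worked through in this article. As a minimal sketch (the helper name `cosine_similarity` and the two small arrays are illustrative assumptions, not part of the original post), it is the dot product of two vectors divided by the product of their lengths. After subtracting each vector's mean, it equals the Pearson correlation coefficient:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative data, chosen only for this sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

cos_raw = cosine_similarity(x, y)                             # uses the raw values
cos_centered = cosine_similarity(x - x.mean(), y - y.mean())  # uses mean-centered values
pearson = np.corrcoef(x, y)[0, 1]

print("cosine similarity (raw):     ", round(cos_raw, 4))
print("cosine similarity (centered):", round(cos_centered, 4))
print("Pearson r:                   ", round(pearson, 4))
```

The raw and centered values generally differ, because cosine similarity is sensitive to the mean level of the data while Pearson's r is not.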

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

Judging from a plot

When two variables are strongly correlated, a scatter plot is usually enough to judge qualitatively whether they are related.

data1 = pd.Series(np.random.rand(50)*100).sort_values()
data2 = pd.Series(np.random.rand(50)*50).sort_values()
data3 = pd.Series(np.random.rand(50)*500).sort_values(ascending = False)
# Create three samples: data1 is random numbers in 0-100 sorted ascending, data2 is random numbers in 0-50 sorted ascending, data3 is random numbers in 0-500 sorted descending

fig = plt.figure(figsize = (10,4))
ax1 = fig.add_subplot(1,2,1)
ax1.scatter(data1, data2)
plt.grid()
#  Positive linear correlation 

ax2 = fig.add_subplot(1,2,2)
ax2.scatter(data1, data3)
plt.grid()
#  Negative linear correlation 
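The strong linear pattern in these plots comes from sorting, not from the random numbers themselves. A small sketch makes this visible by comparing the correlation before and after sorting. The seed and the variable names here are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.random(50) * 100
y = rng.random(50) * 50

# Independent draws: no systematic relationship is expected
r_raw = np.corrcoef(x, y)[0, 1]
# Both sorted ascending: the values now rise together
r_sorted = np.corrcoef(np.sort(x), np.sort(y))[0, 1]
# One ascending, one descending: the values move in opposite directions
r_reversed = np.corrcoef(np.sort(x), np.sort(y)[::-1])[0, 1]

print(f"unsorted:       {r_raw:.3f}")
print(f"both ascending: {r_sorted:.3f}")
print(f"one descending: {r_reversed:.3f}")
```

This is why the sorted samples are useful as demo data for positive and negative correlation, but say nothing about real dependence between the underlying variables.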


# (2) Judge the relationships among several variables with a scatter-plot matrix

data = pd.DataFrame(np.random.randn(200,4)*100, columns = ['A','B','C','D'])
pd.plotting.scatter_matrix(data,figsize=(8,8),
                         c = 'k',
                         marker = '+',
                         diagonal='hist',
                         alpha = 0.8,
                         range_padding=0.1)
data.head()
            A           B          C           D
0   83.463300  108.208281 -16.441879  -69.039664
1 -114.341786 -176.341932 -64.282506   54.378911
2 -108.781464  116.223511  11.996554    4.445215
3 -124.358401  -74.357458 -46.089528  -73.539092
4   87.330398  205.767923  59.964420  137.955811
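A scatter-plot matrix gives a qualitative picture. To put numbers next to it, one option is to print the correlation matrix of the same kind of data. The sketch below assumes a seeded generator (the seed 42 is arbitrary) instead of the unseeded `np.random.randn` call, so the result is repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed for a repeatable example
data = pd.DataFrame(rng.standard_normal((200, 4)) * 100,
                    columns=['A', 'B', 'C', 'D'])

corr = data.corr()  # Pearson correlation between every pair of columns
print(corr.round(3))

# Collect the off-diagonal entries (the diagonal is always 1)
off_diag = corr.values[~np.eye(4, dtype=bool)]
print("largest |r| off the diagonal:", np.abs(off_diag).max().round(3))
```

Since the four columns are generated independently, the off-diagonal coefficients should be close to 0, which matches the shapeless clouds in the scatter-plot matrix.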


Pearson correlation coefficient

Requirement: the samples should follow a normal distribution.
- The Pearson correlation coefficient of two variables is defined as their covariance divided by the product of their standard deviations; its value lies between -1 and 1.

  • Formulas:

     Covariance:

$$ s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) $$

     Standard deviation:

$$ s_x = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2} $$

     Pearson correlation coefficient:

$$ r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2} \, \sqrt{\sum_{k=1}^{n} (y_k - \bar{y})^2}} $$

data1 = pd.Series(np.random.rand(100)*100).sort_values()
data2 = pd.Series(np.random.rand(100)*50).sort_values()
data = pd.DataFrame({'value1': data1.values,
                     'value2': data2.values})
print(data.head())
print('------')
# Create sample data

u1, u2 = data['value1'].mean(), data['value2'].mean()  # Means
std1, std2 = data['value1'].std(), data['value2'].std()  # Sample standard deviations
print('value1 normality test:\n', stats.kstest(data['value1'], 'norm', (u1, std1)))
print('value2 normality test:\n', stats.kstest(data['value2'], 'norm', (u2, std2)))
print('------')
# Normality test (Kolmogorov-Smirnov) → p-value > 0.05 means normality is not rejected

data['(x-u1)*(y-u2)'] = (data['value1'] - u1) * (data['value2'] - u2)
data['(x-u1)**2'] = (data['value1'] - u1)**2
data['(y-u2)**2'] = (data['value2'] - u2)**2
print(data.head())
print('------')
# Build the intermediate columns needed for the Pearson correlation coefficient

r = data['(x-u1)*(y-u2)'].sum() / (np.sqrt(data['(x-u1)**2'].sum() * data['(y-u2)**2'].sum()))
print('Pearson correlation coefficient: %.4f' % r)
# Compute r
# |r| > 0.8 → highly linearly correlated

     value1    value2
0  0.438432  0.486913
1  2.974424  0.663775
2  4.497743  1.417196
3  5.490366  2.047252
4  6.216346  3.455314
------
value1 normality test:
 KstestResult(statistic=0.07534983222255448, pvalue=0.6116837468934935)
value2 normality test:
 KstestResult(statistic=0.11048646902786918, pvalue=0.1614817955196972)
------
     value1    value2  (x-u1)*(y-u2)    (x-u1)**2   (y-u2)**2
0  0.438432  0.486913    1201.352006  2597.621877  555.603052
1  2.974424  0.663775    1133.009967  2345.549928  547.296636
2  4.497743  1.417196    1062.031735  2200.319086  512.612654
3  5.490366  2.047252    1010.628854  2108.181383  484.479509
4  6.216346  3.455314     931.020494  2042.041746  424.476709
------
Pearson correlation coefficient: 0.9937
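The normality check above uses a Kolmogorov-Smirnov test whose mean and standard deviation are estimated from the same sample, which tends to make the test conservative. As an alternative sketch, not the method used in this article, the Shapiro-Wilk test in `scipy.stats.shapiro` needs no distribution parameters. The helper name `normality_pvalue`, the seed, and the two synthetic samples are assumptions for illustration:

```python
import numpy as np
from scipy import stats

def normality_pvalue(sample):
    """Return the Shapiro-Wilk p-value for a 1-D sample (hypothetical helper)."""
    statistic, pvalue = stats.shapiro(sample)
    return float(pvalue)

rng = np.random.default_rng(1)  # arbitrary seed
normal_sample = rng.normal(loc=50, scale=10, size=200)  # drawn from a normal distribution
skewed_sample = rng.exponential(scale=10, size=200)     # strongly right-skewed

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    p = normality_pvalue(sample)
    verdict = "not rejected" if p > 0.05 else "rejected"
    print(f"{name}: p = {p:.4f}, normality {verdict} at the 5% level")
```

As with the Kolmogorov-Smirnov test, a p-value above 0.05 only means the data give no evidence against normality; it does not prove the data are normal.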
# Pearson correlation coefficient - built-in method

data1 = pd.Series(np.random.rand(100)*100).sort_values()
data2 = pd.Series(np.random.rand(100)*50).sort_values()
data = pd.DataFrame({'value1': data1.values,
                     'value2': data2.values})
print(data.head())
print('------')
# Create sample data

data.corr()
# pandas correlation method: data.corr(method='pearson', min_periods=1) → returns the correlation matrix of the columns directly
# method defaults to 'pearson'
     value1    value2
0  0.983096  0.368653
1  1.107613  0.509117
2  1.130588  0.755587
3  2.996367  0.909899
4  3.283088  1.233879
------
          value1    value2
value1  1.000000  0.996077
value2  0.996077  1.000000
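As a cross-check, the hand-written formula can be compared with library implementations. The sketch below assumes a hypothetical helper `pearson_manual` and a seeded generator (seed 7 is arbitrary). `scipy.stats.pearsonr` also returns a p-value for the null hypothesis of no correlation:

```python
import numpy as np
from scipy import stats

def pearson_manual(x, y):
    """Pearson r from the definition: sum of cross-deviations over the root of the squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

rng = np.random.default_rng(7)  # arbitrary seed
x = np.sort(rng.random(100) * 100)
y = np.sort(rng.random(100) * 50)

r_manual = pearson_manual(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
r_scipy, p_value = stats.pearsonr(x, y)

print(f"manual: {r_manual:.6f}")
print(f"numpy:  {r_numpy:.6f}")
print(f"scipy:  {r_scipy:.6f} (p-value {p_value:.3g})")
```

All three values should agree up to floating-point rounding.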

Spearman rank correlation coefficient

The Pearson correlation coefficient is mainly used for continuous variables that follow a normal distribution. For variables that do not follow a normal distribution, or for ordered categorical (ranked) variables, the Spearman rank correlation coefficient can be used instead. It is also called simply the rank correlation coefficient.

Calculation method:
- Sort each variable by value from smallest to largest; Rx is the rank of Xi and Ry is the rank of Yi
- If two values of a variable are tied, each gets the average of their positions, (index1 + index2) / 2
- di = Rx - Ry
Formula:
$$ \rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
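The steps above can be sketched with `pandas.Series.rank(method='average')`, which assigns tied values the average of their positions. The helper names and the small arrays are assumptions for illustration. The d² formula is exact only when there are no ties; with ties, the usual definition is the Pearson correlation of the ranks, which is also what `scipy.stats.spearmanr` computes:

```python
import pandas as pd
from scipy import stats

def spearman_shortcut(x, y):
    """1 - 6*sum(d^2) / (n*(n^2-1)), using average ranks (hypothetical helper)."""
    rx = pd.Series(x).rank(method='average')
    ry = pd.Series(y).rank(method='average')
    d = rx - ry
    n = len(d)
    return float(1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1)))

def spearman_rank_based(x, y):
    """Pearson correlation of the average ranks (hypothetical helper)."""
    rx = pd.Series(x).rank(method='average')
    ry = pd.Series(y).rank(method='average')
    return float(rx.corr(ry))

# No ties: both approaches give the same value
x1, y1 = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
print(spearman_shortcut(x1, y1), spearman_rank_based(x1, y1))

# With ties: the shortcut formula drifts away from the rank-based value
x2, y2 = [1, 2, 2, 3], [1, 3, 2, 2]
rho_scipy, _ = stats.spearmanr(x2, y2)
print(spearman_shortcut(x2, y2), spearman_rank_based(x2, y2), rho_scipy)
```

For data with many ties, the rank-based version is the safer choice.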

data = pd.DataFrame({'IQ': [106,86,100,101,99,103,97,113,112,110],
                     'TV hours per week': [7,0,27,50,28,29,20,12,6,17]})
print(data)
print('------')
# Create sample data

data.sort_values('IQ', inplace=True)
data['range1'] = np.arange(1,len(data)+1)
data.sort_values('TV hours per week', inplace=True)
data['range2'] = np.arange(1,len(data)+1)
print(data)
print('------')
# Sort by "IQ" and then by "TV hours per week" in ascending order and assign ranks 1..n
# (numbering by position like this assumes there are no tied values)

data['d'] = data['range1'] - data['range2']
data['d2'] = data['d']**2
print(data)
print('------')
# Compute di and di**2

n = len(data)
rs = 1 - 6 * (data['d2'].sum()) / (n * (n**2 - 1))
print('Spearman rank correlation coefficient: %.4f' % rs)
# Compute rs

    IQ  TV hours per week
0  106                  7
1   86                  0
2  100                 27
3  101                 50
4   99                 28
5  103                 29
6   97                 20
7  113                 12
8  112                  6
9  110                 17
------
    IQ  TV hours per week  range1  range2
1   86                  0       1       1
8  112                  6       9       2
0  106                  7       7       3
7  113                 12      10       4
9  110                 17       8       5
6   97                 20       2       6
2  100                 27       4       7
4   99                 28       3       8
5  103                 29       6       9
3  101                 50       5      10
------
    IQ  TV hours per week  range1  range2  d  d2
1   86                  0       1       1  0   0
8  112                  6       9       2  7  49
0  106                  7       7       3  4  16
7  113                 12      10       4  6  36
9  110                 17       8       5  3   9
6   97                 20       2       6 -4  16
2  100                 27       4       7 -3   9
4   99                 28       3       8 -5  25
5  103                 29       6       9 -3   9
3  101                 50       5      10 -5  25
------
Spearman rank correlation coefficient: -0.1758
# Spearman correlation coefficient - built-in method

data = pd.DataFrame({'IQ': [106,86,100,101,99,103,97,113,112,110],
                     'TV hours per week': [7,0,27,50,28,29,20,12,6,17]})
print(data)
print('------')
# Create sample data

data.corr(method='spearman')
# pandas correlation method: data.corr(method='pearson', min_periods=1) → returns the correlation matrix of the columns directly
# method defaults to 'pearson'; pass method='spearman' for the rank correlation

    IQ  TV hours per week
0  106                  7
1   86                  0
2  100                 27
3  101                 50
4   99                 28
5  103                 29
6   97                 20
7  113                 12
8  112                  6
9  110                 17
------
                         IQ  TV hours per week
IQ                 1.000000          -0.175758
TV hours per week -0.175758           1.000000
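The coefficient alone does not say whether the relationship is statistically distinguishable from zero. As a sketch that is not part of the original walkthrough, `scipy.stats.spearmanr` returns the same coefficient together with a p-value for the IQ and TV sample used above (the list names `iq` and `tv` are illustrative):

```python
from scipy import stats

# Same sample as in the article: IQ and weekly TV hours for 10 people
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]

rho, p_value = stats.spearmanr(iq, tv)
print(f"Spearman rho = {rho:.4f}, p-value = {p_value:.4f}")

if p_value > 0.05:
    print("No significant monotonic relationship at the 5% level")
else:
    print("Significant monotonic relationship at the 5% level")
```

With only 10 observations and a coefficient this close to 0, the test does not reject the hypothesis of no monotonic relationship.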
Original site

Copyright notice
This article was written by [A window full of stars and milky way]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202200508198573.html