当前位置:网站首页>Data feature analysis skills - correlation test
Data feature analysis skills - correlation test
2022-06-25 15:10:00 【A window full of stars and milky way】
Data feature analysis skills —— Correlation test
Correlation analysis refers to the analysis of two or more variable elements with correlation , thus Measure the correlation between the two variables
There are four common methods :
- Drawing judgment
- pearson( Pearson ) The correlation coefficient
- sperman( Spearman ) The correlation coefficient
- Cosine similarity ( Cosine correlation coefficient )
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
% matplotlib inlineDraw a graph to judge
Generally, for two variables with strong correlation , Drawing a picture can qualitatively judge whether it is relevant
data1 = pd.Series(np.random.rand(50)*100).sort_values()
data2 = pd.Series(np.random.rand(50)*50).sort_values()
data3 = pd.Series(np.random.rand(50)*500).sort_values(ascending = False)
# Create three data :data1 by 0-100 Random numbers and arrange them from small to large ,data2 by 0-50 Random numbers and arrange them from small to large ,data3 by 0-500 Random numbers and arrange them from large to small ,
fig = plt.figure(figsize = (10,4))
ax1 = fig.add_subplot(1,2,1)
ax1.scatter(data1, data2)
plt.grid()
# Positive linear correlation
ax2 = fig.add_subplot(1,2,2)
ax2.scatter(data1, data3)
plt.grid()
# Negative linear correlation 
# (2) The relationship between multivariable is judged by scatter graph matrix
data = pd.DataFrame(np.random.randn(200,4)*100, columns = ['A','B','C','D'])
pd.plotting.scatter_matrix(data,figsize=(8,8),
c = 'k',
marker = '+',
diagonal='hist',
alpha = 0.8,
range_padding=0.1)
data.head()| A | B | C | D | |
|---|---|---|---|---|
| 0 | 83.463300 | 108.208281 | -16.441879 | -69.039664 |
| 1 | -114.341786 | -176.341932 | -64.282506 | 54.378911 |
| 2 | -108.781464 | 116.223511 | 11.996554 | 4.445215 |
| 3 | -124.358401 | -74.357458 | -46.089528 | -73.539092 |
| 4 | 87.330398 | 205.767923 | 59.964420 | 137.955811 |

pearson( Pearson ) The correlation coefficient
requirement The sample satisfies the normal distribution
- The Pearson correlation coefficient between two variables is defined as the Quotient of covariance and standard deviation , Its value is between -1 And 1 Between
The formula :
covariance :sxy=1n−1∑nk=1(xk−x¯)(yk−y¯) s x y = 1 n − 1 ∑ k = 1 n ( x k − x ¯ ) ( y k − y ¯ )
Standard deviation :sx=1n−1∑nk=1(xk−x¯)2−−−−−−−−−−−−−−−√ s x = 1 n − 1 ∑ k = 1 n ( x k − x ¯ ) 2
Pearson correlation coefficient :sxysxsy=∑nk=1(xk−x¯)(yk−y¯)∑nk=1(xk−x¯)2√∑nk=1(yk−y¯)2√ s x y s x s y = ∑ k = 1 n ( x k − x ¯ ) ( y k − y ¯ ) ∑ k = 1 n ( x k − x ¯ ) 2 ∑ k = 1 n ( y k − y ¯ ) 2
data1 = pd.Series(np.random.rand(100)*100).sort_values()
data2 = pd.Series(np.random.rand(100)*50).sort_values()
data = pd.DataFrame({
'value1':data1.values,
'value2':data2.values})
print(data.head())
print('------')
# Create sample data
u1,u2 = data['value1'].mean(),data['value2'].mean() # Calculate the mean
std1,std2 = data['value1'].std(),data['value2'].std() # Calculate the standard deviation
print('value1 Normality test :\n',stats.kstest(data['value1'], 'norm', (u1, std1)))
print('value2 Normality test :\n',stats.kstest(data['value2'], 'norm', (u2, std2)))
print('------')
# Normality test → pvalue >0.05
data['(x-u1)*(y-u2)'] = (data['value1'] - u1) * (data['value2'] - u2)
data['(x-u1)**2'] = (data['value1'] - u1)**2
data['(y-u2)**2'] = (data['value2'] - u2)**2
print(data.head())
print('------')
# Make Pearson Correlation coefficient evaluation table
r = data['(x-u1)*(y-u2)'].sum() / (np.sqrt(data['(x-u1)**2'].sum() * data['(y-u2)**2'].sum()))
print('Pearson The correlation coefficient is 0 :%.4f' % r)
# Find out r
# |r| > 0.8 → Highly linear correlation value1 value2
0 0.438432 0.486913
1 2.974424 0.663775
2 4.497743 1.417196
3 5.490366 2.047252
4 6.216346 3.455314
------
value1 Normality test :
KstestResult(statistic=0.07534983222255448, pvalue=0.6116837468934935)
value2 Normality test :
KstestResult(statistic=0.11048646902786918, pvalue=0.1614817955196972)
------
value1 value2 (x-u1)*(y-u2) (x-u1)**2 (y-u2)**2
0 0.438432 0.486913 1201.352006 2597.621877 555.603052
1 2.974424 0.663775 1133.009967 2345.549928 547.296636
2 4.497743 1.417196 1062.031735 2200.319086 512.612654
3 5.490366 2.047252 1010.628854 2108.181383 484.479509
4 6.216346 3.455314 931.020494 2042.041746 424.476709
------
Pearson The correlation coefficient is 0 :0.9937
# Pearson The correlation coefficient - Algorithm
data1 = pd.Series(np.random.rand(100)*100).sort_values()
data2 = pd.Series(np.random.rand(100)*50).sort_values()
data = pd.DataFrame({
'value1':data1.values,
'value2':data2.values})
print(data.head())
print('------')
# Create sample data
data.corr()
# pandas Correlation method :data.corr(method='pearson', min_periods=1) → The correlation coefficient matrix of the data field is given directly
# method Default pearson value1 value2 0 0.983096 0.368653 1 1.107613 0.509117 2 1.130588 0.755587 3 2.996367 0.909899 4 3.283088 1.233879 —— | value1 | value2 | |
|---|---|---|
| value1 | 1.000000 | 0.996077 |
| value2 | 0.996077 | 1.000000 |
Sperman Rank correlation coefficient
Pearson correlation coefficient is mainly used for continuous variables following normal distribution , For variables that do not obey a normal distribution , Categorical relevance can be used Sperman Rank correlation coefficient , Also known as Rank correlation coefficient
computing method :
- Rank the two variables from small to large according to their values ,Rx representative Xi Rank of ,Ry representative Yi Rank of
- If two variables have the same rank , Then the rank is (index1+index2)/ 2
- di = Rx -Ry
The formula :
ρs=1−6∑d2in(n2−1) ρ s = 1 − 6 ∑ d i 2 n ( n 2 − 1 )
data = pd.DataFrame({
' Intelligence quotient (IQ) ':[106,86,100,101,99,103,97,113,112,110],
' TV hours per week ':[7,0,27,50,28,29,20,12,6,17]})
print(data)
print('------')
# Create sample data
data.sort_values(' Intelligence quotient (IQ) ', inplace=True)
data['range1'] = np.arange(1,len(data)+1)
data.sort_values(' TV hours per week ', inplace=True)
data['range2'] = np.arange(1,len(data)+1)
print(data)
print('------')
# “ Intelligence quotient (IQ) ”、“ TV hours per week ” Reorder from small to large , And set the rank index
data['d'] = data['range1'] - data['range2']
data['d2'] = data['d']**2
print(data)
print('------')
# Find out di,di2
n = len(data)
rs = 1 - 6 * (data['d2'].sum()) / (n * (n**2 - 1))
print('Sperman The rank correlation coefficient is :%.4f' % rs)
# Find out rs Intelligence quotient (IQ) TV hours per week
0 106 7
1 86 0
2 100 27
3 101 50
4 99 28
5 103 29
6 97 20
7 113 12
8 112 6
9 110 17
------
Intelligence quotient (IQ) TV hours per week range1 range2
1 86 0 1 1
8 112 6 9 2
0 106 7 7 3
7 113 12 10 4
9 110 17 8 5
6 97 20 2 6
2 100 27 4 7
4 99 28 3 8
5 103 29 6 9
3 101 50 5 10
------
Intelligence quotient (IQ) TV hours per week range1 range2 d d2
1 86 0 1 1 0 0
8 112 6 9 2 7 49
0 106 7 7 3 4 16
7 113 12 10 4 6 36
9 110 17 8 5 3 9
6 97 20 2 6 -4 16
2 100 27 4 7 -3 9
4 99 28 3 8 -5 25
5 103 29 6 9 -3 9
3 101 50 5 10 -5 25
------
Sperman The rank correlation coefficient is :-0.1758
# spearman The correlation coefficient - Algorithm
data = pd.DataFrame({
' Intelligence quotient (IQ) ':[106,86,100,101,99,103,97,113,112,110],
' TV hours per week ':[7,0,27,50,28,29,20,12,6,17]})
print(data)
print('------')
# Create sample data
data.corr(method='spearman')
# pandas Correlation method :data.corr(method='pearson', min_periods=1) → The correlation coefficient matrix of the data field is given directly
# method Default pearson Intelligence quotient (IQ) TV hours per week 0 106 7 1 86 0 2 100 27 3 101 50 4 99 28 5 103 29 6 97 20 7 113 12 8 112 6 9 110 17 —— | Intelligence quotient (IQ) | TV hours per week | |
|---|---|---|
| Intelligence quotient (IQ) | 1.000000 | -0.175758 |
| TV hours per week | -0.175758 | 1.000000 |
边栏推荐
- How to package rpm
- QQ love talk candy love talk content acquisition and storage
- QT pop up open file dialog box QFileDialog
- 【Try to Hack】vulnhub DC1
- QT source code online view
- Review of arrays and pointers triggered by a topic
- HMS Core机器学习服务实现同声传译,支持中英文互译和多种音色语音播报
- Boost listening port server
- Yolov3 spp Darknet version to caffemodel and then to OM model
- Go语言Zap库Logger的定制化和封装使用详解
猜你喜欢

有哪个瞬间让你觉得这个世界出bug了?

semget No space left on device

【Try to Hack】vulnhub DC1

如何裁剪动图大小?试试这个在线照片裁剪工具

Source code analysis of zeromq lockless queue

Sequential programming 1

Learning C language today is the first time to learn C language. In college, C linguistics is not good, but I want to make progress, so I found a beep video on the Internet to learn C language

Learning notes on February 18, 2022 (C language)

basic_ String mind map

电源自动测试系统NSAT-8000,精准高速可靠的电源测试设备
随机推荐
Bessie's weight problem [01 backpack]
Dynamic memory allocation
定位position(5种方式)
Master XSS completely from 0 to 1
QT pop up open file dialog box QFileDialog
1090.Phone List
The robot is playing an old DOS based game
QT inline dialog
Std:: vector minutes
Build a minimalist gb28181 gatekeeper and gateway server, establish AI reasoning and 3D service scenarios, and then open source code (I)
Qmake uses toplevel or topbuilddir
QQ情话糖果情话内容获取并保存
Golang channel reading data
What moment makes you think there is a bug in the world?
分饼干问题
QQ love talk candy love talk content acquisition and storage
High precision addition
p1408
搭建极简GB28181 网守和网关服务器,建立AI推理和3d服务场景,然后开源代码(一)
Qcodeeditor - QT based code editor