【相关分析】理论与实现

工具

X\ Y	分类	连续
分类	交叉表（列联表）	ttest ANOVA
连续	ttest ANOVA	相关分析

参看统计推断

相关系数（更新）

相关系数	定义	描述	H0	统计量	代码
Pearson	$r=\dfrac{cov(x,y)}{\sqrt{DxDy}}$	成对的连续数据接近正态的单峰分布	$r=0$	$t=\dfrac{r\sqrt{n-2}}{1-r^2}\sim t(n-2)$	r, p_value = stats.pearsonr
Spearman	计算秩的pearson，等价于： $r=1-\dfrac{6\sum d_i^2}{n(n^2-1)}$ $d_i=R_i-Q_i$	成对的等级数据无论分布	r=0	小样本：参数为n-2的 Spearman 分布大样本：$t=\dfrac{r\sqrt{n-2}}{1-r^2}\sim t(n-2)$	stats.spearmanr
Kendall	$\tau_a=2(C-D)/(n(n-1))$ $\tau_b = (P - Q) / \sqrt{(P + Q + T) * (P + Q + U)}$			小样本：Kendall分布大样本$U=3\tau\sqrt{\dfrac{n(n-1)}{2(2n-5)}}$	stats.kendalltau stats.weightedtau

范围(-1,1)，-1:完全负相关，1：完全正相关，0：不相关
pearson的358原则:
- $\mid r\mid \geq 0.8$表示两个变量高度相关
- $\mid r\mid \in [0.5,0.8]$表示两个变量中度相关
- $\mid r\mid \in [0.3,0.5]$表示两个变量低度相关
- $\mid r\mid \in [0,0.3]$表示两个变量几乎不相关
Kendall
This is the 1945 “tau-b” version of Kendall’s tau $\tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))$ where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U.
（1938 “tau-a” version）

代码示例

from scipy import stats
import numpy as np
n = 10
x = np.random.rand(n)
y = np.random.rand(n)

# Pearson
r, p_value = stats.pearsonr([1,2,3,4,5], [5,6,7,8,7])

# Spearman
tau, pvalue = stats.spearmanr(x,y)

# 如果输入 n×m 的数据，返回的是相关系数矩阵
x = np.random.rand(n,3)
tau, p_value = stats.spearmanr(x)

# Kendall
tau, p_value = stats.kendalltau(x,y)
# 用的是 tau_b 算法

其它：

stats.theilslopes
stats.weightedtau

列联表分析

以两离散变量分别都是两类举例

H0：X，Y独立
H1：X，Y不独立

step1：取得源数据

	0	1
0	n11	n12
1	n21	n22

step2：求边缘密度

	0	1
0	n11	n12	a1=(n11+n12)/n
1	n21	n22	a2=(n21+n22)/n
	b1=(n11+n21)/n	b2=(n12+n22)/n

step3：求期望概率（假设独立）

	0	1
0	a1×b1	a1×b1	a1
1	a1×b1	a1×b1	a2
	b1	b2

step4:求期望频数

	0	1
0	a1×b1×n	a1×b1×n
1	a1×b1×n	a1×b1×n

step5：期望频数与原频数的差，得到的数字平方和后服从卡方分布

step6：卡方检验

【相关分析】理论与实现

工具

相关分析

相关系数（更新）

代码示例

列联表分析

参考资料