Pearson or Spearman correlation

Pearson vs. Spearman Correlation: When to Use Each?

Both Pearson and Spearman correlations measure the strength and direction of the association between two variables, but they suit different situations depending on the type of data and the assumptions you can make.

1. Pearson Correlation (Pearson correlation coefficient)

Measures the linear relationship between two continuous variables.
Assumes a linear relationship; the standard significance test for r additionally assumes that both variables are approximately normally distributed.
Returns a value between -1 and 1:
- +1 → Perfect positive linear correlation
- 0 → No linear correlation (a nonlinear relationship may still exist)
- -1 → Perfect negative linear correlation

Formula:

$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \cdot \sqrt{\sum (Y_i - \bar{Y})^2}}$

where:

$X_i, Y_i$ are the data points,
$\bar{X}, \bar{Y}$ are the means of $X$ and $Y$.
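The formula can be translated almost term by term into code. The following is a minimal sketch in plain Python; the function name `pearson_r` and the height/weight numbers are made up for illustration and are not from any dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed directly from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of the deviations from the means
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Hypothetical data: heights (cm) and weights (kg), invented for this example
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 90]
r = pearson_r(heights, weights)
print(round(r, 2))  # roughly 0.99: a strong positive linear relationship
```

In practice a library routine would normally be used instead; the hand-written version is only meant to make each term of the formula visible.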

When to Use Pearson?

When data is continuous (e.g., height, weight, temperature).
When the relationship between variables is linear.
When data is approximately normally distributed (this matters mainly for significance tests and confidence intervals).

Example:

Relationship between height and weight.
Relationship between study hours and exam scores (assuming a linear trend).

2. Spearman Correlation (Spearman's rank correlation coefficient)

Measures the monotonic relationship between two variables (not necessarily linear).
Works for ordinal, interval, or ratio data.
Does not assume normality.
Instead of using raw values, it ranks the data before computing correlation.

Formula:

$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$

where:

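$d_i$ is the difference between the ranks of $X_i$ and $Y_i$,
$n$ is the number of observations.

This shortcut formula is exact only when there are no tied ranks; with ties, compute the Pearson correlation on the ranks instead.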
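The ranking step and the rank-difference formula can be sketched as follows. This is an illustrative plain-Python version that assumes no tied values; the helper names `ranks` and `spearman_rs` and the sample numbers are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks (smallest value gets rank 1). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman_rs(xs, ys):
    """Spearman's r_s via the rank-difference formula (no ties allowed)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    if len(set(xs)) != n or len(set(ys)) != n:
        raise ValueError("this shortcut formula requires data without ties")
    rank_x, rank_y = ranks(xs), ranks(ys)
    # d_i is the difference between the two ranks of observation i
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Hypothetical data: two neighbouring pairs are swapped in y
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
rs = spearman_rs(x, y)
print(rs)  # d = [-1, 1, -1, 1, 0], sum of d^2 = 4, so r_s = 1 - 24/120 = 0.8
```

Because only the ranks enter the calculation, replacing `y` with any order-preserving transformation of itself (for example, squaring positive values) leaves the result unchanged.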

When to Use Spearman?

When data is not normally distributed.
When data has outliers (Spearman is more robust than Pearson because ranks limit the influence of extreme values).
When the relationship is monotonic but not necessarily linear.
When working with ranked (ordinal) data.

Example:

Relationship between customer satisfaction and service rating (each rated on a 1-5 scale).
Relationship between income and happiness (where higher income generally means higher happiness, but not in a strict linear way).
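To make the "monotonic but not linear" and outlier cases concrete, here is a small self-contained sketch in plain Python. It computes Spearman as the Pearson correlation of the ranks, which is the general definition when there are no ties. All data is synthetic and the helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation from the definition."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * \
          math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return num / den

def rank(values):
    """1-based ranks; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out

def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))

# Case 1: monotonic but strongly nonlinear (exponential growth)
x = list(range(1, 11))
y_curved = [math.exp(v) for v in x]
pearson_curved = pearson(x, y_curved)    # well below 1: the trend is not a line
spearman_curved = spearman(x, y_curved)  # approximately 1: the order is perfectly preserved

# Case 2: a perfect line y = 2x, except the last point is an extreme outlier
y_outlier = [2.0 * v for v in range(1, 10)] + [-100.0]
pearson_outlier = pearson(x, y_outlier)    # roughly -0.39: one point flips the sign
spearman_outlier = spearman(x, y_outlier)  # roughly 0.45: lower, but still positive

print(pearson_curved, spearman_curved)
print(pearson_outlier, spearman_outlier)
```

Note that in the second case Spearman is affected too; ranking limits how far a single extreme value can pull the coefficient, but it does not make the outlier disappear.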

Key Differences:

Feature	Pearson Correlation	Spearman Correlation
Measures	Linear relationship	Monotonic relationship
Data Type	Continuous	Continuous or Ordinal
Normality Assumption	Needed for the standard significance test (approximately normal data)	No assumption about normality
Sensitivity to Outliers	High (outliers can distort results)	Low (ranks reduce the effect of outliers)
Best Use Case	When the relationship is approximately linear	When the relationship is monotonic but not necessarily linear

Which One to Use?

If your data is approximately normally distributed, free of extreme outliers, and you expect a linear relationship → Use Pearson.
If your data is not normally distributed, contains outliers, or has a monotonic but not linear relationship → Use Spearman.