SciTech-Mathematics-Probability+Statistics- Descriptive stats + fully understanding percentile + quartile + median + percentiles() in NumPy+Pandas+SciPy.stats

Descriptive Stats + percentiles in numpy and scipy.stats

https://dev.to/sayemmh/descriptive-stats-percentiles-in-numpy-and-scipystats-59a7

Abbreviations of Statistics:

CDF vs. PDF: What’s the Difference?, by Zach Bobbitt, posted on June 13, 2019

PDF: P.D.F.(Probability Density Function)
CDF: C.D.F.(Cumulative Distribution Function)

Quantile: the inverse of the C.D.F.

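Definition: A quantile is the point of a continuous distribution function that corresponds to a probability p.
For a probability 0<p<1, the quantile $x_p$ of a random variable X (or of its probability distribution) is the real number that satisfies $P(X \le x_p) = p$ [1].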
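
As a small sketch of this definition, assuming a standard normal distribution as the example (the choice of distribution is ours, not the source's), `scipy.stats.norm.ppf` returns the quantile for a given probability and `norm.cdf` maps it back to that probability:

```python
from scipy.stats import norm

p = 0.975
x_p = norm.ppf(p)        # quantile: the point x_p with P(X <= x_p) = p
p_back = norm.cdf(x_p)   # applying the C.D.F. recovers p

print(round(x_p, 2))     # 1.96
print(round(p_back, 3))  # 0.975
```

The round trip works because the quantile function is the inverse of the C.D.F. for a continuous distribution.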

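Quantile (also called quantile point)

Quantiles are the cut points that divide the range of a random variable's probability distribution into several parts of equal probability. Common ones are the median (the 2-quantile), quartiles, and percentiles.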

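Median (middle value)

The median is the value at the middle position of a data series sorted in order. It is a technical term in statistics.
It is a single value that represents a sample, a population, or a probability distribution and splits the set of values into a lower half and an upper half of equal size.

For a finite set of numbers, sort all observations in ascending order and take the one at the exact middle position as the median.
If the number of observations is even, the mean of the two middle values is usually taken as the median.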
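
The odd and even cases can be sketched with Python's standard `statistics.median`; the two small data sets are made up for illustration:

```python
import statistics

odd = [7, 1, 5]       # sorted: 1, 5, 7 -> the middle value is 5
even = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7 -> the mean of 3 and 5 is 4

median_odd = statistics.median(odd)
median_even = statistics.median(even)
print(median_odd, median_even)   # 5 4.0
```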

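Quartile (also called quartile point)

In statistics, quartiles are the values at the three cut points when all values are sorted in ascending order and divided into four equal parts.
They are widely used for drawing box plots. They are the values at the 25%, 50%, and 75% positions of the sorted data.

The quartiles use 3 points to divide all the data into 4 equal parts, each containing 25% of the data.
The middle quartile is the median, so "the quartiles" usually refers to:
- Lower quartile: the value at the 25% position,
- Upper quartile: the value at the 75% position.
To compute quartiles from ungrouped data:
- first sort the data,
- then determine the positions of the quartiles,
- the values at those positions are the quartiles.
- This is broadly similar to computing the median, but
  unlike the median, there are several ways to determine the quartile positions,
  and each gives slightly different results, although the differences are small. [1]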
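
To see how two positioning methods can disagree, here is a sketch using the standard library's `statistics.quantiles`, which offers an `"inclusive"` and an `"exclusive"` method. The data set is made up for illustration, and these two methods are only examples of the several in use:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]

# "inclusive" treats the data as the whole population (min and max included);
# "exclusive" treats it as a sample drawn from a larger population.
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")

print(q_inclusive)   # [2.75, 4.5, 6.25]
print(q_exclusive)   # [2.25, 4.5, 6.75]
```

Both methods agree on the median (4.5) but place the lower and upper quartiles slightly differently.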

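The function $\large percentile(P)\text{ where } P \in [0, 1]$

finds a target value $\large V = percentile(P)$ in the data set

such that, over the whole data set,

at least (P)*100% of the data are less than or equal to $\large V$, and

at least (1 - P)*100% of the data are greater than or equal to $\large V$.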
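
The two conditions can be checked directly. The helper below is a hypothetical illustration (its name and the small tolerance for floating-point comparisons are our own choices), not part of any library:

```python
def satisfies_percentile_definition(data, v, p, tol=1e-9):
    """Return True if v meets both conditions of the definition for fraction p."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= v)   # count of values <= v
    at_or_above = sum(1 for x in data if x >= v)   # count of values >= v
    return at_or_below >= p * n - tol and at_or_above >= (1 - p) * n - tol

data = [1, 1, 2, 2, 3]
print(satisfies_percentile_definition(data, 2, 0.5))   # True
print(satisfies_percentile_definition(data, 3, 0.5))   # False
```

For P = 0.5 the value 2 qualifies (4 of 5 values are at or below it, 3 of 5 at or above), while 3 fails because only 1 of 5 values is at or above it.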

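percentile() is the inverse of the empirical C.D.F.

Sort the data set into a sequence;
use the percentage to determine the position (rank) of the target value;
finally, use that position to look up values in the sorted sequence and compute the target percentile.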

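Suppose the data set $\large D$ (the sample space) contains $\large N$ values in total, and we want the percentile for the fraction $\large P$:

Sort the data set: arrange it in ascending order as a $\large pandas.Series$ and call it $\large S$ (positions are counted from 1 here);
Determine the target position: use the formula $\large Index = N * P$
Use $\large Index$ as the subscript into $\large S$ and compute the $percentile$ value:
- If $\large Index$ is fractional, round it up and take the single value of $\large S$ at that position:
  that is, $\large percentile(P) = S[ \lceil Index \rceil ]$.
- If $\large Index$ is an integer, take the value at that position and the one after it and average them:
  that is, $\large percentile(P) = \frac{S[Index] + S[Index + 1]}{2}$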
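
The rule above can be sketched in plain Python. The function name is hypothetical, and two details are our own assumptions: the result is clamped at the ends (where position 0 or N + 1 does not exist), and a small tolerance decides whether the index counts as an integer:

```python
import math

def textbook_percentile(data, p):
    """Percentile by the sort / Index = N * P rule, using 1-based positions."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    s = sorted(data)
    n = len(s)
    index = n * p
    nearest = round(index)
    if abs(index - nearest) < 1e-9:   # Index is (numerically) an integer
        # Assumption: clamp at the ends, where S[0] or S[N + 1] does not exist.
        if nearest <= 0:
            return s[0]
        if nearest >= n:
            return s[-1]
        # 1-based S[Index], S[Index + 1] are 0-based s[nearest - 1], s[nearest].
        return (s[nearest - 1] + s[nearest]) / 2
    # Fractional Index: round up, then convert the 1-based position to 0-based.
    return s[math.ceil(index) - 1]

print(textbook_percentile([3, 2, 2, 1, 1], 0.5))   # 2
print(textbook_percentile([3, 2, 2, 1, 1], 0.8))   # 2.5
```

Library functions such as `numpy.percentile` interpolate by default, so they will not always return the same numbers as this rule.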

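quartile() is a special case of percentile()

quartile() is the quartile function: it computes the 25%, 50%, and 75% percentile values, which split the data set into four parts:
\[\large ( percentile(0.25), percentile(0.50), percentile(0.75) ) \]
quartile() is especially useful for summarizing the spread of a data set, for example in a box plot.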
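
In pandas, the three quartiles can be requested in one call with `Series.quantile`. This is a sketch on a made-up series of the integers 0 to 10:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Series.quantile takes fractions in [0, 1] and interpolates linearly by
# default, so its results can differ from the Index = N * P rule.
quartiles = s.quantile([0.25, 0.50, 0.75])
print(quartiles.tolist())   # [2.5, 5.0, 7.5]
```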

Worked examples: finding the P percentile:

Q.: "Find the 50th percentile of the data set 3, 2, 2, 1, 1."
Answer:
We have N = 5 and P = 0.5 (because 0.5 = 50/100).
1. Make the corresponding sorted sequence S: "1, 1, 2, 2, 3"
2. Calculate the index: $\large Index = N * P = 5 * 0.5 = 2.5 $
3. $\large percentile(0.5) = S[ \lceil 2.5 \rceil ] = S[3] = 2 $
If P = 0.8, then:
$\large Index = N * P = 5 * 0.8 = 4 $
$\large percentile(0.8) = \frac{S[4] +S[4+1]}{2}= \frac{2 + 3}{2} = 2.5 $
Q.: "Find percentile(43%) and percentile(80%) of the data set "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"."
Answer:
For "percentile(43%)":
We have N = 10 and P = 0.43.
1. Make the corresponding sorted sequence S: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
2. Calculate the index: $\large Index = N * P = 10 * 0.43 = 4.3 $
3. $\large percentile(43\%) = S[ \lceil 4.3 \rceil ] = S[5] = 5 $
For "percentile(80%)":
We have N = 10 and P = 0.80.
1. Make the corresponding sorted sequence S: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
2. Calculate the index: $\large Index = N * P = 10 * 0.80 = 8 $
3. $\large percentile(80\%) = \frac{S[8] +S[8+1]}{2} =\frac{8 + 9}{2} = 8.5 $

DEV Community
Sayem Hoque, Posted on Oct 13, 2022 • Updated on Nov 16, 2022

Descriptive stats + percentiles in numpy and scipy.stats
To get the measures of central tendency in a pandas DataFrame, we can use the built-in functions to calculate mean, median, and mode:

import pandas as pd
import numpy as np


# Load the data
df = pd.read_csv("data.csv")

df.mean()    # arithmetic mean of each column
df.median()  # median of each column
df.mode()    # most frequent value(s) of each column

To measure dispersion, we can use built-in functions to calculate std. deviation, variance, interquartile range, and skewness.

A low std. deviation means the data tends to be bunched closer around the mean, and a high std. deviation means it is more spread out. The iqr is the difference between the 75th and 25th percentile. To calculate this, scipy.stats is used. Skew refers to how symmetric a distribution is about its mean. A perfectly symmetric, unimodal distribution has equal mean, median, and mode.
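
As a quick check of these definitions, the sketch below uses a small symmetric array (made up for illustration) to compare `scipy.stats.iqr` with the difference of the two percentiles, and to show that symmetric data has zero skewness:

```python
import numpy as np
from scipy.stats import iqr, skew

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q25, q75 = np.percentile(data, [25, 75])
manual_iqr = q75 - q25
print(manual_iqr)   # 4.0
print(iqr(data))    # 4.0, the same difference computed by scipy.stats

# The data are symmetric about their mean, so the skewness is 0
# (up to floating-point rounding).
print(skew(data))
```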

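from scipy.stats import iqr

df.std()            # sample standard deviation of each column
iqr(df['column1'])  # 75th minus 25th percentile of one column
df.skew()           # sample skewness of each column

from scipy import stats

stats.percentileofscore([1, 2, 3, 4], 3)
>> 75.0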

The result of the percentileofscore function is the percentage of values within a distribution that are equal to or below the target. In this case, [1, 2, 3] are <= 3, so 3/4 = 75% of the values are at or below the target.
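
`percentileofscore` also accepts a `kind` argument that changes how values equal to the target are counted. A short sketch on the same list shows the four options side by side:

```python
from scipy import stats

values = [1, 2, 3, 4]
scores = {
    kind: float(stats.percentileofscore(values, 3, kind=kind))
    for kind in ("rank", "weak", "strict", "mean")
}
for kind, score in scores.items():
    print(kind, score)
# rank 75.0
# weak 75.0
# strict 50.0
# mean 62.5
```

Here `"weak"` counts values <= 3, `"strict"` counts values < 3, and `"mean"` averages those two results.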

numpy.percentile is not the exact inverse of stats.percentileofscore. numpy.percentile takes a parameter q and returns the q-th percentile of an array of elements. The function sorts the array and locates the fractional position (n - 1) * q/100 between the minimum (position 0) and the maximum (position n - 1). The percentile is then computed from the two nearest neighbors of that position. The method parameter (named interpolation in older NumPy versions) selects the numerical method used to combine the two neighbors. The default method is linear interpolation, which weights the two neighbors by how close the position lies to each, so it equals their plain average only when the position falls exactly halfway between them.
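
A minimal sketch of linear interpolation between the two nearest neighbors, assuming the fractional position is computed as (n - 1) * q / 100 as in NumPy's default method. The helper name `linear_percentile` is hypothetical, and it is compared against `np.percentile` on a made-up data set:

```python
import math
import numpy as np

def linear_percentile(data, q):
    """q-th percentile (q in [0, 100]) by linear interpolation between neighbors."""
    s = sorted(data)
    pos = (len(s) - 1) * q / 100   # fractional 0-based position in sorted data
    lo = math.floor(pos)           # index of the neighbor at or below pos
    hi = min(lo + 1, len(s) - 1)   # index of the neighbor above (clamped at the end)
    frac = pos - lo                # how far pos lies from lo toward hi
    return s[lo] + (s[hi] - s[lo]) * frac

data = [1, 1, 2, 2, 3]
manual = linear_percentile(data, 80)
reference = np.percentile(data, 80)
print(manual, reference)
```

Both values come out at about 2.2: the position is 3.2, so the result lies 20% of the way from the neighbor 2 to the neighbor 3 rather than at their plain average of 2.5.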

Example:

arr = [0,1,2,3,4,5,6,7,8,9,10]
print("50th percentile of arr : ",
       np.percentile(arr, 50))
print("25th percentile of arr : ",
       np.percentile(arr, 25))
print("75th percentile of arr : ",
       np.percentile(arr, 75))

>>> 50th percentile of arr :  5.0
>>> 25th percentile of arr :  2.5
>>> 75th percentile of arr :  7.5

Now, using scipy.stats, we can compute the percentile at which a particular value is within a distribution of values. In this example, we are trying to see the percentile score for cur within the non-null values in the column ep_30.

# `features` is a DataFrame with a numeric column 'ep_30'
non_nan = features['ep_30'].dropna()  # keep only the non-null values
cur = features['ep_30'].iloc[-1]      # last value, selected by position

print(f"Cur is at the {round(stats.percentileofscore(non_nan, cur, kind='mean'), 2)}th percentile of the distribution.")

>>> Cur is at the 7.27th percentile of the distribution.

posted @ 2024-07-19 21:04 abaelhe
