SciTech-Mathematics-Probability+Statistics - Descriptive stats + a complete guide to percentile + quartile + median + percentiles() in NumPy + Pandas + SciPy.stats

Descriptive Stats + percentiles in numpy and scipy.stats

https://dev.to/sayemmh/descriptive-stats-percentiles-in-numpy-and-scipystats-59a7

Abbreviations of Statistics:

CDF vs. PDF: What’s the Difference?, by Zach Bobbitt, posted on June 13, 2019

  • PDF: P.D.F.(Probability Density Function)
  • CDF: C.D.F.(Cumulative Distribution Function)

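A quantile is the inverse of the C.D.F.

Definition: a quantile is the point of a continuous distribution function that corresponds to a given probability.
For a probability $\alpha$ with $0 < \alpha < 1$, the $\alpha$-quantile $P_\alpha$ of a random variable $X$ (or of its probability distribution) is the real number satisfying $P(X \le P_\alpha) = \alpha$ [1].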
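As a small illustration of this definition, the sketch below uses the standard normal distribution from Python's standard library (the choice of distribution and of the probability 0.975 are assumptions made for the example). It computes a quantile with the inverse C.D.F. and then feeds it back into the C.D.F. to recover the probability. In SciPy, scipy.stats.norm.ppf and scipy.stats.norm.cdf play the same two roles.

```python
from statistics import NormalDist

alpha = 0.975                 # example probability, chosen for illustration
dist = NormalDist()           # standard normal: mean 0, standard deviation 1

# inv_cdf is the quantile function: the x with P(X <= x) = alpha
q = dist.inv_cdf(alpha)

# Feeding the quantile back into the C.D.F. recovers alpha
recovered = dist.cdf(q)

print(round(q, 4), round(recovered, 4))
```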

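Quantile (also called a quantile point)

A quantile, also called a quantile point, is a value that divides the range of a random variable's probability distribution into parts of equal probability. The common ones are the median (the 2-quantile), the quartiles, and the percentiles.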

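Median

The value in the middle position of a data sequence sorted in order; a technical term in statistics.
It is a single value representing a sample, population, or probability distribution that splits the set of values into an upper and a lower half of equal size.

  • For a finite set of numbers, sort all observations in ascending order and take the one in the exact middle position as the median.
  • If the number of observations is even, the mean of the two middle values is usually taken as the median.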
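The two rules above can be sketched in a few lines of plain Python. The function name median and the sample lists are illustrative only; in practice statistics.median, numpy.median, or pandas' Series.median() do the same job.

```python
def median(values):
    """Median of a finite list: middle value, or mean of the two middle values."""
    s = sorted(values)          # ascending order
    n = len(s)
    mid = n // 2
    if n % 2 == 1:              # odd count: the single middle element
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # even count: mean of the two middle elements

print(median([3, 1, 2]))       # odd number of observations
print(median([4, 1, 3, 2]))    # even number of observations
```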

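Quartile (also called a quartile point)

In statistics, the quartiles are the values at the three cut points obtained when all values are sorted in ascending order and split into four equal parts.
They are mostly used for drawing box plots. They are the values at the 25%, 50%, and 75% positions of the sorted data.

  • The quartiles use 3 points to divide the data into 4 equal parts, each containing 25% of the data.
  • The middle quartile is the median, so "the quartiles" usually refers to:
    • the lower quartile: the value at the 25% position,
    • the upper quartile: the value at the 75% position.
  • To compute quartiles from ungrouped data:
    • first sort the data,
    • then determine the positions of the quartiles,
    • the values at those positions are the quartiles.
    • This is broadly similar to computing the median, but unlike the median,
      there are several methods for determining the quartile positions.
      The methods give slightly different results, though the differences are usually small. [1]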
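To see how the choice of position method changes the result, here is a minimal sketch using statistics.quantiles from the Python standard library, which offers two methods, "inclusive" and "exclusive". The eight-value dataset is made up for the example.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]   # illustrative data

# n=4 asks for the three cut points that split the data into four parts
inclusive = statistics.quantiles(data, n=4, method="inclusive")
exclusive = statistics.quantiles(data, n=4, method="exclusive")

print(inclusive)   # lower quartile, median, upper quartile
print(exclusive)   # same median, slightly different lower and upper quartiles
```

Both methods agree on the median here, and differ only in the lower and upper quartiles.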

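The function percentile(P), where $P \in [0, 1]$

Find a target value V = percentile(P) in the dataset

such that, over the whole dataset,

at least $P \times 100\%$ of the data are less than or equal to V, and

at least $(1 - P) \times 100\%$ of the data are greater than or equal to V.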
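The two conditions can be checked directly by counting. The helper below is a hypothetical name introduced only for this sketch; it tests whether a candidate value V satisfies both conditions for a given P.

```python
def satisfies_percentile_conditions(data, p, v):
    """True if at least p of the data are <= v and at least (1 - p) are >= v."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= v)
    at_or_above = sum(1 for x in data if x >= v)
    return at_or_below >= p * n and at_or_above >= (1 - p) * n

data = [1, 1, 2, 2, 3]
print(satisfies_percentile_conditions(data, 0.5, 2))   # 2 is a valid 50th percentile
print(satisfies_percentile_conditions(data, 0.5, 3))   # 3 is not: too few values are >= 3
```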

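percentile() is the sample version of the quantile

  • Sort the "dataset" into a "sequence";
  • use the "percentage" to determine the "position number" of the target value;
  • finally use this "position number" to look up values in the "sequence (dataset)" and compute the target value, the percentile.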

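Suppose the dataset D has N values in total; to find its percentile at percent P:

  1. Sort the dataset: sort it in ascending order into a pandas.Series(), referred to as S (positions are counted from 1);
  2. Determine the target position number with the formula $\mathrm{Index} = N \cdot P$;
  3. Use Index as the subscript to look up values in the sequence S and compute the percentile:
    • If Index is a fraction, round it up and take that one element of S as the target value:
      $\mathrm{percentile}(P) = S[\lceil \mathrm{Index} \rceil]$.
    • If Index is an integer, take the element at that position and the next one, and use their mean as the target value:
      $\mathrm{percentile}(P) = \dfrac{S[\mathrm{Index}] + S[\mathrm{Index} + 1]}{2}$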
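The procedure above can be sketched in plain Python. This is one possible reading of the rule, not a library function: it assumes 0 < P < 1, uses a plain sorted list instead of a pandas.Series, translates the 1-based positions of the text to Python's 0-based indexing, and uses a small tolerance (an assumption of this sketch) to decide whether N·P is an integer despite floating-point noise.

```python
import math

def percentile(data, p):
    """Percentile by the N*P rule described above (positions counted from 1)."""
    if not 0 < p < 1:
        raise ValueError("this sketch only handles 0 < p < 1")
    s = sorted(data)                     # step 1: ascending order
    n = len(s)
    index = n * p                        # step 2: Index = N * P
    nearest = round(index)
    if abs(index - nearest) < 1e-9 and 1 <= nearest <= n - 1:
        # Index is an integer: mean of S[Index] and S[Index + 1] (1-based)
        return (s[nearest - 1] + s[nearest]) / 2
    # Index is a fraction: round up and take S[ceil(Index)] (1-based)
    k = min(max(math.ceil(index), 1), n)
    return s[k - 1]

print(percentile([3, 2, 2, 1, 1], 0.5))
print(percentile(range(1, 11), 0.80))
```

Note that numpy.percentile uses a different default rule (linear interpolation), so its results can differ from this function on the same data.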

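quartile() is a special case of percentile()

  • quartile() is the quartile function: it computes the 25%, 50%, and 75% percentile values, which split the dataset into four parts:

    (percentile(0.25), percentile(0.50), percentile(0.75))

  • quartile() is widely used, for example in box plots.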
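A minimal sketch of quartile() built on the N·P percentile rule from this post. Both function names mirror the text but are illustrative, not library functions, and the compact percentile helper assumes non-empty data and 0 < p < 1.

```python
import math

def percentile(data, p):
    """N*P rule from this post (1-based positions), for 0 < p < 1."""
    s = sorted(data)
    index = len(s) * p
    if abs(index - round(index)) < 1e-9:   # integer position: average two neighbours
        k = round(index)
        return (s[k - 1] + s[k]) / 2
    return s[math.ceil(index) - 1]         # fractional position: round up

def quartile(data):
    """The three values that split the sorted data into four parts."""
    return tuple(percentile(data, p) for p in (0.25, 0.50, 0.75))

q1, q2, q3 = quartile(range(1, 11))
print(q1, q2, q3)          # lower quartile, median, upper quartile
print(q3 - q1)             # interquartile range
```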

Worked examples of finding the P percentile:

  1. Q.: "Find the 50th percentile of the dataset 3, 2, 2, 1, 1."
    Answer:
    We have N = 5 and P = 0.5 (because 0.5 = 50/100).

    1. Make the corresponding sorted sequence S: "1, 1, 2, 2, 3"
    2. Calculate the index number: $\mathrm{Index} = N \cdot P = 5 \times 0.5 = 2.5$
    3. $\mathrm{percentile}(0.5) = S[\lceil 2.5 \rceil] = S[3] = 2$

    If P = 0.8, then:
    $\mathrm{Index} = N \cdot P = 5 \times 0.8 = 4$
    $\mathrm{percentile}(0.8) = \dfrac{S[4] + S[4+1]}{2} = \dfrac{2 + 3}{2} = 2.5$

  2. Q.: "Find percentile(43%) and percentile(80%) of the dataset "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"."
    Answer:
    For "percentile(43%)":
    We have N = 10 and P = 0.43.

    1. Make the corresponding sorted sequence S: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
    2. Calculate the index number: $\mathrm{Index} = N \cdot P = 10 \times 0.43 = 4.3$
    3. $\mathrm{percentile}(43\%) = S[\lceil 4.3 \rceil] = S[5] = 5$

    For "percentile(80%)":
    We have N = 10 and P = 0.80.

    1. Make the corresponding sorted sequence S: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
    2. Calculate the index number: $\mathrm{Index} = N \cdot P = 10 \times 0.80 = 8$
    3. $\mathrm{percentile}(80\%) = \dfrac{S[8] + S[8+1]}{2} = \dfrac{8 + 9}{2} = 8.5$

DEV Community
Sayem Hoque, Posted on Oct 13, 2022 • Updated on Nov 16, 2022

Descriptive stats + percentiles in numpy and scipy.stats
To get the measures of central tendency in a pandas DataFrame, we can use the built-in methods to calculate the mean, median, and mode:

import pandas as pd
import numpy as np


# Load the data
df = pd.read_csv("data.csv")

df.mean()
df.median()
df.mode()
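Since data.csv is not included with the post, here is a self-contained variant of the same calls on a small made-up DataFrame. The column name column1 is borrowed from the snippets in this post, and the values are invented for illustration.

```python
import pandas as pd

# Made-up data standing in for data.csv
df = pd.DataFrame({"column1": [1, 2, 2, 3, 7]})

print(df.mean())     # arithmetic mean of each column
print(df.median())   # middle value of each column
print(df.mode())     # most frequent value(s) of each column; can return several rows
```

Here the single large value 7 pulls the mean above the median, while the mode is the repeated value 2.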

To measure dispersion, we can use built-in functions to calculate std. deviation, variance, interquartile range, and skewness.

A low std. deviation means the data tends to be bunched closely around the mean, and a high std. deviation means it is spread out. The iqr is the difference between the 75th and 25th percentiles. To calculate this, scipy.stats is used. Skew refers to how asymmetric a distribution is about its mean. A perfectly symmetric, unimodal distribution has equal mean, median, and mode, and zero skew.
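A small sketch with made-up numbers makes these three measures concrete. It compares a symmetric series with one that has a long right tail; the series names and values are assumptions for the example.

```python
import pandas as pd
from scipy.stats import iqr

symmetric = pd.Series([1, 2, 3, 4, 5])       # evenly spread around the mean
right_tailed = pd.Series([1, 2, 3, 4, 50])   # one large value stretches the right tail

# IQR computed by hand from the 25th and 75th percentiles, then with scipy
q1, q3 = symmetric.quantile([0.25, 0.75])
print(q3 - q1, iqr(symmetric))

# The outlier inflates the standard deviation and makes the skew positive
print(symmetric.std(), right_tailed.std())
print(symmetric.skew(), right_tailed.skew())
```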

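from scipy.stats import iqr

df.std()                # standard deviation of each column
iqr(df['column1'])      # interquartile range of one column
df.skew()               # skewness of each column

To find the percentile rank of a score within a list of values, scipy.stats provides percentileofscore:

from scipy import stats

stats.percentileofscore([1, 2, 3, 4], 3)
>> 75.0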

The result of the percentileofscore function is the percentage of values within a distribution that are equal to or below the target. In this case, [1, 2, 3] are <= 3, so 3/4 = 75% of the values are at or below the target.
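When the target value occurs more than once, the answer depends on the kind argument of percentileofscore. The sketch below runs all four kinds ("rank", the default, plus "weak", "strict", and "mean") on a list with a repeated 3; the list itself is just an example.

```python
from scipy import stats

values = [1, 2, 3, 3, 4]   # the score 3 appears twice

rank = stats.percentileofscore(values, 3)                   # default kind="rank"
weak = stats.percentileofscore(values, 3, kind="weak")      # share of values <= score
strict = stats.percentileofscore(values, 3, kind="strict")  # share of values < score
mean = stats.percentileofscore(values, 3, kind="mean")      # average of weak and strict

print(rank, weak, strict, mean)
```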

numpy.percentile is actually not the inverse of stats.percentileofscore. numpy.percentile takes a parameter q (between 0 and 100) and returns the q-th percentile of an array of elements. With the default method, the function sorts the array and computes the fractional position (n - 1) * q / 100 in the sorted array (counting from 0, with n the number of elements). The percentile is then computed from the two nearest neighbors of that position. The method parameter (named interpolation before NumPy 1.22) controls how the two neighbors are combined. The default method is linear interpolation, which weights the two neighbors by the fractional part of the position; it equals their plain average only when the position falls exactly halfway between them.
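The default linear rule can be reproduced by hand. The sketch below is a simplified re-implementation for 1-D lists (the function name linear_percentile is hypothetical), compared against numpy.percentile itself.

```python
import math
import numpy as np

def linear_percentile(data, q):
    """Linear-interpolation percentile for a 1-D list, with q between 0 and 100."""
    s = sorted(data)
    pos = (len(s) - 1) * q / 100        # fractional 0-based position in the sorted data
    lo = math.floor(pos)                # left neighbour
    hi = min(lo + 1, len(s) - 1)        # right neighbour (clamped at the last element)
    frac = pos - lo                     # how far the position lies past the left neighbour
    return s[lo] + (s[hi] - s[lo]) * frac

arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for q in (25, 43, 50, 75):
    print(q, linear_percentile(arr, q), np.percentile(arr, q))

# Not the inverse of percentileofscore: 3 scores 75.0 in [1, 2, 3, 4],
# but the 75th percentile of [1, 2, 3, 4] is not 3.
print(np.percentile([1, 2, 3, 4], 75))
```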

Example:

arr = [0,1,2,3,4,5,6,7,8,9,10]
print("50th percentile of arr : ",
       np.percentile(arr, 50))
print("25th percentile of arr : ",
       np.percentile(arr, 25))
print("75th percentile of arr : ",
       np.percentile(arr, 75))

>>> 50th percentile of arr :  5.0
>>> 25th percentile of arr :  2.5
>>> 75th percentile of arr :  7.5

Now, using scipy.stats, we can compute the percentile at which a particular value is within a distribution of values. In this example, we are trying to see the percentile score for cur within the non-null values in the column ep_30.

# features is the author's DataFrame (not shown in the post); ep_30 is one of its columns
non_nan = features['ep_30'].dropna()   # keep only the non-null values
cur = features['ep_30'].iloc[-1]       # last value in the column, selected by position

print(f'''Cur is at the {round(stats.percentileofscore(non_nan, cur, kind='mean'), 2)}th percentile of the distribution.''')

Cur is at the 7.27th percentile of the distribution.
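The features DataFrame is not part of the post, so the 7.27 above cannot be reproduced here. The sketch below runs the same steps on a small stand-in DataFrame; its values are invented for illustration and produce a different score.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the author's features DataFrame
features = pd.DataFrame({"ep_30": [0.2, np.nan, 0.5, 0.9, np.nan, 0.7, 0.3]})

non_nan = features["ep_30"].dropna()   # drop the missing values
cur = features["ep_30"].iloc[-1]       # most recent value, selected by position

# kind="mean" averages the "strict" (<) and "weak" (<=) percentages
score = stats.percentileofscore(non_nan, cur, kind="mean")
print(f"Cur is at the {round(score, 2)}th percentile of the distribution.")
```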
posted @ abaelhe