Pandas Series (Part 13) - Other Common Functionality

I. Counting Value Frequencies

 

 1. value_counts

pd.value_counts(df.column_name)
df.column_name.value_counts()

Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
Return a Series containing counts of unique values.

  Parameter details

normalize : boolean, default False
If True then the object returned will contain the relative frequencies of the unique values.

sort : boolean, default True
Sort by values.

ascending : boolean, default False
Sort in ascending order.

bins : integer, optional
Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.

dropna : boolean, default True
Don’t include counts of NaN.

  Parameter examples

import pandas as pd
import numpy as np

index = pd.Index([3, 1, 2, 3, 4, np.nan])
index.value_counts()
Out[144]: 
3.0    2
4.0    1
2.0    1
1.0    1
dtype: int64
index.value_counts(normalize=True)
Out[145]: 
3.0    0.4
4.0    0.2
2.0    0.2
1.0    0.2
dtype: float64
index.value_counts(bins=3)
Out[146]: 
(2.0, 3.0]      2
(0.996, 2.0]    2
(3.0, 4.0]      1
dtype: int64
index.value_counts(dropna=False)
Out[148]: 
 3.0    2
NaN     1
 4.0    1
 2.0    1
 1.0    1
dtype: int64 
In [21]:  data=pd.DataFrame(pd.Series([1,2,3,4,5,6,11,1,1,1,1,2,2,2,2,3]).values.reshape(4,4),columns=['a','b','c','d'])

In [22]: data
Out[22]: 
   a  b   c  d
0  1  2   3  4
1  5  6  11  1
2  1  1   1  2
3  2  2   2  3

In [23]: pd.value_counts(data.a)
Out[23]: 
1    2
2    1
5    1
Name: a, dtype: int64

In [26]: pd.value_counts(data.a).sort_index()
Out[26]: 
1    2
2    1
5    1
Name: a, dtype: int64

In [27]: pd.value_counts(data.a).sort_index().index
Out[27]: Int64Index([1, 2, 5], dtype='int64')

In [28]: pd.value_counts(data.a).sort_index().values
Out[28]: array([2, 1, 1], dtype=int64)
value_counts examples

 2. Number of unique values per categorical column

# Method 1
cat_uniques = []
for cat in cat_features:
    cat_uniques.append(len(train[cat].unique()))
uniq_values_in_categories = pd.DataFrame({'cat_name': cat_features, 'unique_values': cat_uniques})

# Method 2
list(map(lambda x: train[x].nunique(), cat_features))
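
The snippet above assumes a training DataFrame named train and a list cat_features from surrounding context; as a self-contained illustration (the data below is made up), the same counts can be obtained with nunique():

import pandas as pd

# hypothetical data standing in for `train`
train = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'],
    'size':  ['S', 'M', 'M', 'L'],
})
cat_features = ['color', 'size']

# number of distinct values in each categorical column
uniq_values_in_categories = pd.DataFrame({
    'cat_name': cat_features,
    'unique_values': [train[c].nunique() for c in cat_features],
})
print(uniq_values_in_categories)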

 3. Unique values
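
A minimal sketch, reusing the data DataFrame built in the value_counts example above: unique() lists the distinct values and nunique() counts them.

data.a.unique()            # distinct values, e.g. array([1, 5, 2], dtype=int64)
data.a.nunique()           # number of distinct values: 3
data.a.drop_duplicates()   # Series keeping the first occurrence of each value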

 

II. Deduplication

 

    def drop_duplicates(self, subset=None, keep='first', inplace=False):
        """
        Return DataFrame with duplicate rows removed, optionally only
        considering certain columns

        Parameters
        ----------
        subset : column label or sequence of labels, optional
            Only consider certain columns for identifying duplicates, by
            default use all of the columns
        keep : {'first', 'last', False}, default 'first'
            - ``first`` : Drop duplicates except for the first occurrence.
            - ``last`` : Drop duplicates except for the last occurrence.
            - False : Drop all duplicates.
        inplace : boolean, default False
            Whether to drop duplicates in place or to return a copy

        Returns
        -------
        deduplicated : DataFrame
        """

  1. The duplicated() method
  • Check whether the values in a single column are duplicated
dataframe.colname.duplicated()
  • Check whether entire rows are duplicated
dataframe.duplicated()
  • Check for duplicates on a subset of columns
dataframe.duplicated(subset = [])
  2. The drop_duplicates() method
    Used to drop the duplicate rows.
dataframe.drop_duplicates()
  3. The keep parameter
    keep can be "first" or "last", choosing whether the first or the last occurrence of a duplicate is kept.
dataframe.duplicated(keep = "first")
dataframe.duplicated(keep = "last")

It can also be set to the boolean False, in which case all duplicated rows are flagged (a usage sketch follows below).

dataframe.duplicated(keep = False)
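
A minimal runnable sketch of duplicated() and drop_duplicates() on a small made-up DataFrame (the column names are illustrative only):

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Ann', 'Ann'],
                   'city': ['NY', 'SF', 'NY', 'LA']})

df.duplicated()                      # row 2 duplicates row 0 -> [False, False, True, False]
df.duplicated(subset=['name'])       # duplicates judged on 'name' only
df.drop_duplicates()                 # keep the first of each duplicated row
df.drop_duplicates(subset=['name'], keep='last')   # keep the last occurrence per name
df[~df.duplicated(keep=False)]       # rows that have no full duplicate at all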

 III. Data Type Conversion
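
A minimal sketch of the usual conversion tools, astype(), pd.to_numeric() and pd.to_datetime(), on made-up data:

import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'],
                   'b': ['1.5', 'x', '2.5'],
                   'ts': ['2019-03-01', '2019-03-02', '2019-03-03']})

df['a'] = df['a'].astype(int)                      # plain cast
df['b'] = pd.to_numeric(df['b'], errors='coerce')  # invalid strings become NaN
df['ts'] = pd.to_datetime(df['ts'])                # parse strings to datetime64
print(df.dtypes)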

IV. Aggregation Methods

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
pd.set_option('display.max_columns',None)
df = pd.read_csv('911.csv')
print(df.head(1))
         lat        lng                                               desc  \
0  40.297876 -75.581294  REINDEER CT & DEAD END;  NEW HANOVER; Station ...   
       zip                   title            timeStamp          twp  \
0  19525.0  EMS: BACK PAINS/INJURY  2015-12-10 17:10:52  NEW HANOVER   
                     addr  e  
0  REINDEER CT & DEAD END  1  
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249737 entries, 0 to 249736
Data columns (total 9 columns):
lat          249737 non-null float64
lng          249737 non-null float64
desc         249737 non-null object
zip          219391 non-null float64
title        249737 non-null object
timeStamp    249737 non-null object
twp          249644 non-null object
addr         249737 non-null object
e            249737 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 17.1+ MB
# extract the categories
temp_list = df.title.str.split(':').tolist()
cate_list = list(set([i[0] for i in temp_list]))
cate_list
Out[152]: ['Fire', 'Traffic', 'EMS']
# build an all-zeros DataFrame, one column per category
zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(cate_list))),columns=cate_list)
# set the matching category of each row to 1 (use .loc to avoid chained assignment)
for cate in cate_list:
    zeros_df.loc[df.title.str.contains(cate), cate] = 1
print(zeros_df)
        Fire  Traffic  EMS
0        0.0      0.0  1.0
1        0.0      0.0  1.0
2        1.0      0.0  0.0
3        0.0      0.0  1.0
4        0.0      0.0  1.0
5        0.0      0.0  1.0
6        0.0      0.0  1.0
7        0.0      0.0  1.0
8        0.0      0.0  1.0
9        0.0      1.0  0.0
10       0.0      1.0  0.0
11       0.0      1.0  0.0
12       0.0      1.0  0.0
13       0.0      1.0  0.0
14       0.0      1.0  0.0
15       0.0      1.0  0.0
16       0.0      0.0  1.0
17       0.0      0.0  1.0
18       0.0      0.0  1.0
19       0.0      1.0  0.0
20       0.0      1.0  0.0
21       0.0      1.0  0.0
22       1.0      0.0  0.0
23       0.0      1.0  0.0
24       0.0      1.0  0.0
25       0.0      0.0  1.0
26       0.0      0.0  1.0
27       1.0      0.0  0.0
28       0.0      1.0  0.0
29       0.0      1.0  0.0
      ...      ...  ...
249707   0.0      1.0  0.0
249708   1.0      0.0  0.0
249709   0.0      0.0  1.0
249710   0.0      1.0  0.0
249711   0.0      1.0  0.0
249712   0.0      0.0  1.0
249713   1.0      0.0  0.0
249714   1.0      0.0  0.0
249715   0.0      1.0  0.0
249716   0.0      0.0  1.0
249717   0.0      0.0  1.0
249718   1.0      0.0  0.0
249719   0.0      0.0  1.0
249720   0.0      0.0  1.0
249721   0.0      0.0  1.0
249722   0.0      1.0  0.0
249723   0.0      0.0  1.0
249724   0.0      0.0  1.0
249725   0.0      0.0  1.0
249726   1.0      0.0  0.0
249727   1.0      0.0  0.0
249728   0.0      1.0  0.0
249729   0.0      0.0  1.0
249730   0.0      0.0  1.0
249731   0.0      1.0  0.0
249732   0.0      0.0  1.0
249733   0.0      0.0  1.0
249734   0.0      0.0  1.0
249735   1.0      0.0  0.0
249736   0.0      1.0  0.0
[249737 rows x 3 columns]
sum_ret = zeros_df.sum(axis=0)
print(sum_ret)
Fire        37432.0
Traffic     87465.0
EMS        124844.0
dtype: float64
Counting the different types of emergencies
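
As an alternative to building the indicator matrix with a loop, the same one-hot columns can be produced in a single expression with Series.str.get_dummies(); a sketch, assuming the same 911.csv layout (title values like 'EMS: BACK PAINS/INJURY'):

# one indicator column per category, then sum to get the counts per type
dummies = df.title.str.split(':').str[0].str.get_dummies()
print(dummies.sum(axis=0))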
df = pd.read_csv('911.csv')
# print(df.head(1))
# print(df.info())
# extract the category of each record
temp_list = df.title.str.split(':').tolist()
cate_list = [i[0] for i in temp_list]
df['cate'] = pd.DataFrame(np.array(cate_list).reshape(df.shape[0],1))
print(df.groupby(by='cate').count()['title'])
cate
EMS        124840
Fire        37432
Traffic     87465
Name: title, dtype: int64
Example 2
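
The same per-category counts can also be read directly off value_counts() on the derived column; a minimal sketch under the same assumptions (every title is non-null, as df.info() above shows):

# equivalent to df.groupby(by='cate').count()['title']
df['cate'] = df.title.str.split(':').str[0]
print(df['cate'].value_counts())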

V. Distribution Analysis

pd.cut(data['col_names'], bins, labels=None)

 Example

import numpy
import pandas
from pandas import read_csv
data = read_csv('E:/python/data_analysis/data/dis-cut.csv')
bins = [data['年龄'].min()-1,20,30,40,max(data.年龄)+1]
labels = ['20岁及以下','21岁到30岁','31岁到40岁','41岁以上']
age_cut = pandas.cut(data.年龄,bins,labels=labels)
data['年龄分层'] = age_cut
result = data.groupby(by=['年龄分层'])['年龄'].agg(['size','mean'])
result.rename(columns= {'size': '人数','mean': '平均年龄'})
Out[171]: 
            人数       平均年龄
年龄分层                     
20岁及以下    2061  19.302280
21岁到30岁  46858  25.759081
31岁到40岁   8729  33.095773
41岁以上     1453  50.625602
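
Since the example above depends on a local CSV, here is a self-contained sketch of the same cut-then-aggregate pattern on synthetic ages (column names in English, purely illustrative):

import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame({'age': np.random.randint(16, 60, size=1000)})

bins = [data['age'].min() - 1, 20, 30, 40, data['age'].max() + 1]
labels = ['<=20', '21-30', '31-40', '>=41']
data['age_group'] = pd.cut(data['age'], bins, labels=labels)

result = data.groupby('age_group')['age'].agg(['size', 'mean'])
print(result.rename(columns={'size': 'count', 'mean': 'mean_age'}))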

 VI. Cross Analysis

import numpy
import pandas
from pandas import read_csv
data = read_csv('E:/python/data_analysis/data/pivot_table.csv')
bins = [data['年龄'].min() - 1, 20, 30, 40, max(data.年龄) + 1]
labels = ['20岁及以下', '21岁到30岁', '31岁到40岁', '41岁以上']
age_cut = pandas.cut(data.年龄, bins, labels=labels)
data['年龄分层'] = age_cut
r1 = data.pivot_table(
    values=['年龄'],
    index=['年龄分层'],
    columns=['性别'],
    aggfunc=[numpy.size, numpy.mean]
)
r2 = data.pivot_table(
    values=['年龄'],
    index=['年龄分层'],
    columns=['性别'],
    aggfunc=[numpy.std],
)
print(r1.index)
print(r1['size']['年龄']['女'])
print(r1.join(r2))    
CategoricalIndex(['41岁以上', '21岁到30岁', '31岁到40岁', '20岁及以下'], categories=['20岁及以下', '21岁到30岁', '31岁到40岁', '41岁以上'], ordered=True, name='年龄分层', dtype='category')
年龄分层
41岁以上       111
21岁到30岁    2903
31岁到40岁     735
20岁及以下      567
Name: 女, dtype: int64
         size              mean                  std          
           年龄                年龄                   年龄          
性别          女      男          女          男         女         男
年龄分层                                                          
41岁以上     111   1950  18.972973  19.321026  1.708053  1.044185
21岁到30岁  2903  43955  25.954874  25.746149  2.453642  2.361298
31岁到40岁   735   7994  33.213605  33.084939  2.316704  2.200319
20岁及以下    567    886  51.691358  49.943567  6.761848  7.914171
Cross-analysis example with pivot_table
import pandas as pd
import numpy as np
data = pd.DataFrame({'Sample': range(1, 11), 'Gender': ['Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'], 
                    'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed']})
# Suppose we want to summarize this data by gender and handedness.
# This can be done with pivot_table(), but pandas.crosstab() is more convenient:
# Method 1: pivot_table
# (honestly, I don't think it's troublesome at all ε=(´ο`*))) sigh)
data.pivot_table(index=['Gender'], columns='Handedness', aggfunc=len, margins=True)
Out[173]: 
                Sample                 
Handedness Left-handed Right-handed All
Gender                                 
Female               1            4   5
Male                 2            3   5
All                  3            7  10
# Method 2: crosstab
pd.crosstab(data.Gender, data.Handedness, margins=True)
Out[174]: 
Handedness  Left-handed  Right-handed  All
Gender                                    
Female                1             4    5
Male                  2             3    5
All                   3             7   10
Cross-analysis example with crosstab
  • pivot_table() is a grouped-aggregation function; the aggfunc parameter decides which statistic is computed;
  • crosstab() is a special case of pivot_table() dedicated to computing group frequencies (see the sketch below).

For details, see https://www.cnblogs.com/onemorepoint/p/8425300.html
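
Since crosstab() is aimed at frequencies, its normalize parameter turns the counts into proportions directly; a minimal sketch on the same data DataFrame:

# proportions within each row instead of raw counts
pd.crosstab(data.Gender, data.Handedness, normalize='index')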

 VII. Formatting Output

To print a pandas DataFrame in full, without large outputs being abbreviated with ellipses:

pd.set_option('display.max_columns',None)

or, to change the settings only temporarily:

with pd.option_context('display.max_rows', 10, 'display.max_columns', 5):
    ...  # code that should run with the temporary display settings
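
A small usage sketch: inside the with block the truncation limits apply, and they revert to the previous values afterwards:

import pandas as pd
import numpy as np

big = pd.DataFrame(np.arange(200).reshape(20, 10))
with pd.option_context('display.max_rows', 10, 'display.max_columns', 5):
    print(big)   # printed with at most 10 rows and 5 columns shown
print(pd.get_option('display.max_rows'))   # back to the previous setting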

 VIII. Indexing
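
A minimal sketch of the basic indexing operations (loc, iloc, boolean indexing, set_index / reset_index) on made-up data:

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cid'],
                   'score': [90, 85, 77]})

df.loc[0, 'name']            # label-based access -> 'Ann'
df.iloc[-1]                  # position-based access: the last row
df.loc[df['score'] > 80]     # boolean indexing: rows with score > 80
df2 = df.set_index('name')   # use the 'name' column as the index
df2.loc['Bob', 'score']      # -> 85
df2.reset_index()            # move the index back into a column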

  

 
