Data aggregation and group operations in pandas

After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

GroupBy mechanics

split-apply-combine:
In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
Once the first step is done, a function is applied to each group, producing a new value.
Finally, the results of all those function applications are combined into a result object.
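
To make the three stages concrete, here is a minimal sketch that spells them out by hand and then repeats the computation with a single groupby call. The demo frame is a small hypothetical example, separate from the dataset used below.

# a tiny hypothetical frame used only for this illustration
demo=pd.DataFrame({'key':['a','b','a','b'],'val':[1.,2.,3.,4.]})
chunks=dict(list(demo.groupby('key')))                          # split: one chunk per key value
applied={k:chunk['val'].mean() for k,chunk in chunks.items()}   # apply: reduce each chunk to one number
pd.Series(applied)                                              # combine: gather the per-group results
demo.groupby('key')['val'].mean()                               # the same result in a single call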

help(pd.Series.groupby)
Help on function groupby in module pandas.core.generic:

groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
    Group series using mapper (dict or key function, apply given function
    to group, return result as series) or by a series of columns.
    
    Parameters
    ----------
    by : mapping function / list of functions, dict, Series, or tuple /
        list of column names.
        Called on each element of the object index to determine the groups.
        If a dict or Series is passed, the Series or dict VALUES will be
        used to determine the groups
    axis : int, default 0
    level : int, level name, or sequence of such, default None
        If the axis is a MultiIndex (hierarchical), group by a particular
        level or levels
    as_index : boolean, default True
        For aggregated output, return object with group labels as the
        index. Only relevant for DataFrame input. as_index=False is
        effectively "SQL-style" grouped output
    sort : boolean, default True
        Sort group keys. Get better performance by turning this off.
        Note this does not influence the order of observations within each
        group.  groupby preserves the order of rows within each group.
    group_keys : boolean, default True
        When calling apply, add group keys to index to identify pieces
    squeeze : boolean, default False
        reduce the dimensionality of the return type if possible,
        otherwise return a consistent type
    
    Examples
    --------
    DataFrame results
    
    >>> data.groupby(func, axis=0).mean()
    >>> data.groupby(['col1', 'col2'])['col3'].mean()
    
    DataFrame with hierarchical index
    
    >>> data.groupby(['col1', 'col2']).mean()
    
    Returns
    -------
    GroupBy object
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                'key2':['one','two','one','two','one'],
                'data1':np.random.randn(5),
                'data2':np.random.randn(5)});df
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
4 -1.110942 0.581883 a one
  • The first method:
    Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:
grouped=df['data1'].groupby(df['key1'])
help(grouped)
Help on SeriesGroupBy in module pandas.core.groupby object:

class SeriesGroupBy(GroupBy)
 |  Class for grouping and aggregating relational data. See aggregate,
 |  transform, and apply functions on this object.
 |  
 |  It's easiest to use obj.groupby(...) to use GroupBy, but you can also do:
 |  
 |  ::
 |  
 |      grouped = groupby(obj, ...)
 |  
 |  Parameters
 |  ----------
 |  obj : pandas object
 |  axis : int, default 0
 |  level : int, default None
 |      Level of MultiIndex
 |  groupings : list of Grouping objects
 |      Most users should ignore this
 |  exclusions : array-like, optional
 |      List of columns to exclude
 |  name : string
 |      Most users should ignore this
 |  
 |  Notes
 |  -----
 |  After grouping, see aggregate, apply, and transform functions. Here are
 |  some other brief notes about usage. When grouping by multiple groups, the
 |  result index will be a MultiIndex (hierarchical) by default.
 |  
 |  Iteration produces (key, group) tuples, i.e. chunking the data by group. So
 |  you can write code like:
 |  
 |  ::
 |  
 |      grouped = obj.groupby(keys, axis=axis)
 |      for key, group in grouped:
 |          # do something with the data
 |  
 |  Function calls on GroupBy, if not specially implemented, "dispatch" to the
 |  grouped data. So if you group a DataFrame and wish to invoke the std()
 |  method on each group, you can simply do:
 |  
 |  ::
 |  
 |      df.groupby(mapper).std()
 |  
 |  rather than
 |  
 |  ::
 |  
 |      df.groupby(mapper).aggregate(np.std)
 |  
 |  You can pass arguments to these "wrapped" functions, too.
 |  
 |  See the online documentation for full exposition on these topics and much
 |  more
 |  
 |  Returns
 |  -------
 |  **Attributes**
 |  groups : dict
 |      {group name -> group labels}
 |  len(grouped) : int
 |      Number of groups
 |  
 |  Method resolution order:
 |      SeriesGroupBy
 |      GroupBy
 |      _GroupBy
 |      pandas.core.base.PandasObject
 |      pandas.core.base.StringMixin
 |      pandas.core.base.SelectionMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  agg = aggregate(self, func_or_funcs, *args, **kwargs)
 |  
 |  aggregate(self, func_or_funcs, *args, **kwargs)
 |      Apply aggregation function or functions to groups, yielding most likely
 |      Series but in some cases DataFrame depending on the output of the
 |      aggregation function
 |      
 |      Parameters
 |      ----------
 |      func_or_funcs : function or list / dict of functions
 |          List/dict of functions will produce DataFrame with column names
 |          determined by the function names themselves (list) or the keys in
 |          the dict
 |      
 |      Notes
 |      -----
 |      agg is an alias for aggregate. Use it.
 |      
 |      Examples
 |      --------
 |      >>> series
 |      bar    1.0
 |      baz    2.0
 |      qot    3.0
 |      qux    4.0
 |      
 |      >>> mapper = lambda x: x[0] # first letter
 |      >>> grouped = series.groupby(mapper)
 |      
 |      >>> grouped.aggregate(np.sum)
 |      b    3.0
 |      q    7.0
 |      
 |      >>> grouped.aggregate([np.sum, np.mean, np.std])
 |         mean  std  sum
 |      b  1.5   0.5  3
 |      q  3.5   0.5  7
 |      
 |      >>> grouped.agg({'result' : lambda x: x.mean() / x.std(),
 |      ...              'total' : np.sum})
 |         result  total
 |      b  2.121   3
 |      q  4.95    7
 |      
 |      See also
 |      --------
 |      apply, transform
 |      
 |      Returns
 |      -------
 |      Series or DataFrame
 |  
 |  count(self)
 |      Compute count of group, excluding missing values
 |  
 |  filter(self, func, dropna=True, *args, **kwargs)
 |      Return a copy of a Series excluding elements from groups that
 |      do not satisfy the boolean criterion specified by func.
 |      
 |      Parameters
 |      ----------
 |      func : function
 |          To apply to each group. Should return True or False.
 |      dropna : Drop groups that do not pass the filter. True by default;
 |          if False, groups that evaluate False are filled with NaNs.
 |      
 |      Examples
 |      --------
 |      >>> grouped.filter(lambda x: x.mean() > 0)
 |      
 |      Returns
 |      -------
 |      filtered : Series
 |  
 |  nlargest(self, n=5, keep='first')
 |      Return the largest `n` elements.
 |      
 |      Parameters
 |      ----------
 |      n : int
 |          Return this many descending sorted values
 |      keep : {'first', 'last', False}, default 'first'
 |          Where there are duplicate values:
 |          - ``first`` : take the first occurrence.
 |          - ``last`` : take the last occurrence.
 |      take_last : deprecated
 |      
 |      Returns
 |      -------
 |      top_n : Series
 |          The n largest values in the Series, in sorted order
 |      
 |      Notes
 |      -----
 |      Faster than ``.sort_values(ascending=False).head(n)`` for small `n`
 |      relative to the size of the ``Series`` object.
 |      
 |      See Also
 |      --------
 |      Series.nsmallest
 |      
 |      Examples
 |      --------
 |      >>> import pandas as pd
 |      >>> import numpy as np
 |      >>> s = pd.Series(np.random.randn(1e6))
 |      >>> s.nlargest(10)  # only sorts up to the N requested
 |  
 |  nsmallest(self, n=5, keep='first')
 |      Return the smallest `n` elements.
 |      
 |      Parameters
 |      ----------
 |      n : int
 |          Return this many ascending sorted values
 |      keep : {'first', 'last', False}, default 'first'
 |          Where there are duplicate values:
 |          - ``first`` : take the first occurrence.
 |          - ``last`` : take the last occurrence.
 |      take_last : deprecated
 |      
 |      Returns
 |      -------
 |      bottom_n : Series
 |          The n smallest values in the Series, in sorted order
 |      
 |      Notes
 |      -----
 |      Faster than ``.sort_values().head(n)`` for small `n` relative to
 |      the size of the ``Series`` object.
 |      
 |      See Also
 |      --------
 |      Series.nlargest
 |      
 |      Examples
 |      --------
 |      >>> import pandas as pd
 |      >>> import numpy as np
 |      >>> s = pd.Series(np.random.randn(1e6))
 |      >>> s.nsmallest(10)  # only sorts up to the N requested
 |  
 |  nunique(self, dropna=True)
 |      Returns number of unique elements in the group
 |  
 |  transform(self, func, *args, **kwargs)
 |      Call function producing a like-indexed Series on each group and return
 |      a Series with the transformed values
 |      
 |      Parameters
 |      ----------
 |      func : function
 |          To apply to each group. Should return a Series with the same index
 |      
 |      Examples
 |      --------
 |      >>> grouped.transform(lambda x: (x - x.mean()) / x.std())
 |      
 |      Returns
 |      -------
 |      transformed : Series
 |  
 |  value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  all
 |      
 |      Return whether all elements are True over requested axis
 |      
 |      Parameters
 |      ----------
 |      axis : {index (0)}
 |      skipna : boolean, default True
 |          Exclude NA/null values. If an entire row/column is NA, the result
 |          will be NA
 |      level : int or level name, default None
 |          If the axis is a MultiIndex (hierarchical), count along a
 |          particular level, collapsing into a scalar
 |      bool_only : boolean, default None
 |          Include only boolean columns. If None, will attempt to use everything,
 |          then use only boolean data. Not implemented for Series.
 |      
 |      Returns
 |      -------
 |      all : scalar or Series (if level specified)
 |  
 |  any
 |      
 |      Return whether any element is True over requested axis
 |      
 |      Parameters
 |      ----------
 |      axis : {index (0)}
 |      skipna : boolean, default True
 |          Exclude NA/null values. If an entire row/column is NA, the result
 |          will be NA
 |      level : int or level name, default None
 |          If the axis is a MultiIndex (hierarchical), count along a
 |          particular level, collapsing into a scalar
 |      bool_only : boolean, default None
 |          Include only boolean columns. If None, will attempt to use everything,
 |          then use only boolean data. Not implemented for Series.
 |      
 |      Returns
 |      -------
 |      any : scalar or Series (if level specified)
 |  
 |  corr
 |      Compute correlation with `other` Series, excluding missing values
 |      
 |      Parameters
 |      ----------
 |      other : Series
 |      method : {'pearson', 'kendall', 'spearman'}
 |          * pearson : standard correlation coefficient
 |          * kendall : Kendall Tau correlation coefficient
 |          * spearman : Spearman rank correlation
 |      min_periods : int, optional
 |          Minimum number of observations needed to have a valid result
 |      
 |      
 |      Returns
 |      -------
 |      correlation : float
 |  
 |  cov
 |      Compute covariance with Series, excluding missing values
 |      
 |      Parameters
 |      ----------
 |      other : Series
 |      min_periods : int, optional
 |          Minimum number of observations needed to have a valid result
 |      
 |      Returns
 |      -------
 |      covariance : float
 |      
 |      Normalized by N-1 (unbiased estimator).
 |  
 |  cummax
 |          Return cumulative max over requested axis.
 |      
 |      Parameters
 |      ----------
 |      axis : {index (0)}
 |      skipna : boolean, default True
 |          Exclude NA/null values. If an entire row/column is NA, the result
 |          will be NA
 |      
 |      Returns
 |      -------
 |      cummax : scalar
 |  
 |  cummin
 |          Return cumulative minimum over requested axis.
 |      
 |      Parameters
 |      ----------
 |      axis : {index (0)}
 |      skipna : boolean, default True
 |          Exclude NA/null values. If an entire row/column is NA, the result
 |          will be NA
 |      
 |      Returns
 |      -------
 |      cummin : scalar
 |  
 |  describe
 |      Generate various summary statistics, excluding NaN values.
 |      
 |      Parameters
 |      ----------
 |      percentiles : array-like, optional
 |          The percentiles to include in the output. Should all
 |          be in the interval [0, 1]. By default `percentiles` is
 |          [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
 |      include, exclude : list-like, 'all', or None (default)
 |          Specify the form of the returned result. Either:
 |      
 |          - None to both (default). The result will include only
 |            numeric-typed columns or, if none are, only categorical columns.
 |          - A list of dtypes or strings to be included/excluded.
 |            To select all numeric types use numpy numpy.number. To select
 |            categorical objects use type object. See also the select_dtypes
 |            documentation. eg. df.describe(include=['O'])
 |          - If include is the string 'all', the output column-set will
 |            match the input one.
 |      
 |      Returns
 |      -------
 |      summary: NDFrame of summary statistics
 |      
 |      Notes
 |      -----
 |      The output DataFrame index depends on the requested dtypes:
 |      
 |      For numeric dtypes, it will include: count, mean, std, min,
 |      max, and lower, 50, and upper percentiles.
 |      
 |      For object dtypes (e.g. timestamps or strings), the index
 |      will include the count, unique, most common, and frequency of the
 |      most common. Timestamps also include the first and last items.
 |      
 |      For mixed dtypes, the index will be the union of the corresponding
 |      output types. Non-applicable entries will be filled with NaN.
 |      Note that mixed-dtype outputs can only be returned from mixed-dtype
 |      inputs and appropriate use of the include/exclude arguments.
 |      
 |      If multiple values have the highest count, then the
 |      `count` and `most common` pair will be arbitrarily chosen from
 |      among those with the highest count.
 |      
 |      The include, exclude arguments are ignored for Series.
 |      
 |      See Also
 |      --------
 |      DataFrame.select_dtypes
 |  
 |  diff
 |      1st discrete difference of object
 |      
 |      Parameters
 |      ----------
 |      periods : int, default 1
 |          Periods to shift for forming difference
 |      
 |      Returns
 |      -------
 |      diffed : Series
 |  
 |  dtype
 |      return the dtype object of the underlying data
 |  
 |  fillna
 |      Fill NA/NaN values using the specified method
 |      
 |      Parameters
 |      ----------
 |      value : scalar, dict, Series, or DataFrame
 |          Value to use to fill holes (e.g. 0), alternately a
 |          dict/Series/DataFrame of values specifying which value to use for
 |          each index (for a Series) or column (for a DataFrame). (values not
 |          in the dict/Series/DataFrame will not be filled). This value cannot
 |          be a list.
 |      method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
 |          Method to use for filling holes in reindexed Series
 |          pad / ffill: propagate last valid observation forward to next valid
 |          backfill / bfill: use NEXT valid observation to fill gap
 |      axis : {0, 'index'}
 |      inplace : boolean, default False
 |          If True, fill in place. Note: this will modify any
 |          other views on this object, (e.g. a no-copy slice for a column in a
 |          DataFrame).
 |      limit : int, default None
 |          If method is specified, this is the maximum number of consecutive
 |          NaN values to forward/backward fill. In other words, if there is
 |          a gap with more than this number of consecutive NaNs, it will only
 |          be partially filled. If method is not specified, this is the
 |          maximum number of entries along the entire axis where NaNs will be
 |          filled.
 |      downcast : dict, default is None
 |          a dict of item->dtype of what to downcast if possible,
 |          or the string 'infer' which will try to downcast to an appropriate
 |          equal type (e.g. float64 to int64 if possible)
 |      
 |      See Also
 |      --------
 |      reindex, asfreq
 |      
 |      Returns
 |      -------
 |      filled : Series
 |  
 |  hist
 |      Draw histogram of the input series using matplotlib
 |      
 |      Parameters
 |      ----------
 |      by : object, optional
 |          If passed, then used to form histograms for separate groups
 |      ax : matplotlib axis object
 |          If not passed, uses gca()
 |      grid : boolean, default True
 |          Whether to show axis grid lines
 |      xlabelsize : int, default None
 |          If specified changes the x-axis label size
 |      xrot : float, default None
 |          rotation of x axis labels
 |      ylabelsize : int, default None
 |          If specified changes the y-axis label size
 |      yrot : float, default None
 |          rotation of y axis labels
 |      figsize : tuple, default None
 |          figure size in inches by default
 |      bins: integer, default 10
 |          Number of histogram bins to be used
 |      kwds : keywords
 |          To be passed to the actual plotting function
 |      
 |      Notes
 |      -----
 |      See matplotlib documentation online for more on this
 |  
 |  idxmax
 |      Index of first occurrence of maximum of values.
 |      
 |      Parameters
 |      ----------
 |      skipna : boolean, default True
 |          Exclude NA/null values
 |      
 |      Returns
 |      -------
 |      idxmax : Index of maximum of values
 |      
 |      Notes
 |      -----
 |      This method is the Series version of ``ndarray.argmax``.
 |      
 |      See Also
 |      --------
 |      DataFrame.idxmax
 |      numpy.ndarray.argmax
 |  
 |  idxmin
 |      Index of first occurrence of minimum of values.
 |      
 |      Parameters
 |      ----------
 |      skipna : boolean, default True
 |          Exclude NA/null values
 |      
 |      Returns
 |      -------
 |      idxmin : Index of minimum of values
 |      
 |      Notes
 |      -----
 |      This method is the Series version of ``ndarray.argmin``.
 |      
 |      See Also
 |      --------
 |      DataFrame.idxmin
 |      numpy.ndarray.argmin
 |  
 |  mad
 |      
 |      Return the mean absolute deviation of the values for the requested axis
 |      
 |      Parameters
 |      ----------
 |      axis : {index (0)}
 |      skipna : boolean, default True
 |          Exclude NA/null values. If an entire row/column is NA, the result
 |          will be NA
 |      level : int or level name, default None
 |          If the axis is a MultiIndex (hierarchical), count along a
 |          particular level, collapsing into a scalar
 |      numeric_only : boolean, default None
 |          Include only float, int, boolean columns. If None, will attempt to use
 |          everything, then use only numeric data. Not implemented for Series.
 |      
 |      Returns
 |      -------
 |      mad : scalar or Series (if level specified)
 |  
 |  name
 |      since we are a series, we by definition only have
 |      a single name, but may be the result of a selection or
 |      the name of our object
 |  
 |  pct_change
 |      Percent change over given number of periods.
 |      
 |      Parameters
 |      ----------
 |      periods : int, default 1
 |          Periods to shift for forming percent change
 |      fill_method : str, default 'pad'
 |          How to handle NAs before computing percent changes
 |      limit : int, default None
 |          The number of consecutive NAs to fill before stopping
 |      freq : DateOffset, timedelta, or offset alias string, optional
 |          Increment to use from time series API (e.g. 'M' or BDay())
 |      
 |      Returns
 |      -------
 |      chg : NDFrame
 |      
 |      Notes
 |      -----
 |      
 |      By default, the percentage change is calculated along the stat
 |      axis: 0, or ``Index``, for ``DataFrame`` and 1, or ``minor`` for
 |      ``Panel``. You can change this with the ``axis`` keyword argument.
 |  
 |  quantile
 |      Return value at the given quantile, a la numpy.percentile.
 |      
 |      Parameters
 |      ----------
 |      q : float or array-like, default 0.5 (50% quantile)
 |          0 <= q <= 1, the quantile(s) to compute
 |      interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
 |          .. versionadded:: 0.18.0
 |      
 |          This optional parameter specifies the interpolation method to use,
 |          when the desired quantile lies between two data points `i` and `j`:
 |      
 |              * linear: `i + (j - i) * fraction`, where `fraction` is the
 |                fractional part of the index surrounded by `i` and `j`.
 |              * lower: `i`.
 |              * higher: `j`.
 |              * nearest: `i` or `j` whichever is nearest.
 |              * midpoint: (`i` + `j`) / 2.
 |      
 |      Returns
 |      -------
 |      quantile : float or Series
 |          if ``q`` is an array, a Series will be returned where the
 |          index is ``q`` and the values are the quantiles.
 |      
 |      Examples
 |      --------
 |      >>> s = Series([1, 2, 3, 4])
 |      >>> s.quantile(.5)
 |      2.5
 |      >>> s.quantile([.25, .5, .75])
 |      0.25    1.75
 |      0.50    2.50
 |      0.75    3.25
 |      dtype: float64
 |  
 |  rank
 |      Compute numerical data ranks (1 through n) along axis. Equal values are
 |      assigned a rank that is the average of the ranks of those values
 |      
 |      Parameters
 |      ----------
 |      axis: {0 or 'index', 1 or 'columns'}, default 0
 |          index to direct ranking
 |      method : {'average', 'min', 'max', 'first', 'dense'}
 |          * average: average rank of group
 |          * min: lowest rank in group
 |          * max: highest rank in group
 |          * first: ranks assigned in order they appear in the array
 |          * dense: like 'min', but rank always increases by 1 between groups
 |      numeric_only : boolean, default None
 |          Include only float, int, boolean data. Valid only for DataFrame or
 |          Panel objects
 |      na_option : {'keep', 'top', 'bottom'}
 |          * keep: leave NA values where they are
 |          * top: smallest rank if ascending
 |          * bottom: smallest rank if descending
 |      ascending : boolean, default True
 |          False for ranks by high (1) to low (N)
 |      pct : boolean, default False
 |          Computes percentage rank of data
 |      
 |      Returns
 |      -------
 |      ranks : same type as caller
 |  
 |  skew
 |      
 |      Return unbiased skew over requested axis
 |      Normalized by N-1
 |      
 |      Parameters
 |      ----------
 |      axis : {index (0)}
 |      skipna : boolean, default True
 |          Exclude NA/null values. If an entire row/column is NA, the result
 |          will be NA
 |      level : int or level name, default None
 |          If the axis is a MultiIndex (hierarchical), count along a
 |          particular level, collapsing into a scalar
 |      numeric_only : boolean, default None
 |          Include only float, int, boolean columns. If None, will attempt to use
 |          everything, then use only numeric data. Not implemented for Series.
 |      
 |      Returns
 |      -------
 |      skew : scalar or Series (if level specified)
 |  
 |  take
 |      return Series corresponding to requested indices
 |      
 |      Parameters
 |      ----------
 |      indices : list / array of ints
 |      convert : translate negative to positive indices (default)
 |      
 |      Returns
 |      -------
 |      taken : Series
 |      
 |      See also
 |      --------
 |      numpy.ndarray.take
 |  
 |  tshift
 |      Shift the time index, using the index's frequency if available.
 |      
 |      Parameters
 |      ----------
 |      periods : int
 |          Number of periods to move, can be positive or negative
 |      freq : DateOffset, timedelta, or time rule string, default None
 |          Increment to use from the tseries module or time rule (e.g. 'EOM')
 |      axis : int or basestring
 |          Corresponds to the axis that contains the Index
 |      
 |      Notes
 |      -----
 |      If freq is not specified then tries to use the freq or inferred_freq
 |      attributes of the index. If neither of those attributes exist, a
 |      ValueError is thrown
 |      
 |      Returns
 |      -------
 |      shifted : NDFrame
 |  
 |  unique
 |      Return np.ndarray of unique values in the object.
 |      Significantly faster than numpy.unique. Includes NA values.
 |      The order of the original is preserved.
 |      
 |      Returns
 |      -------
 |      uniques : np.ndarray
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from GroupBy:
 |  
 |  backfill(self, limit=None)
 |      Backward fill the values
 |      
 |      Parameters
 |      ----------
 |      limit : integer, optional
 |          limit of how many values to fill
 |      
 |      See Also
 |      --------
 |      Series.fillna
 |      DataFrame.fillna
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  bfill = backfill(self, limit=None)
 |  
 |  cumcount(self, ascending=True)
 |      Number each item in each group from 0 to the length of that group - 1.
 |      
 |      Essentially this is equivalent to
 |      
 |      >>> self.apply(lambda x: Series(np.arange(len(x)), x.index))
 |      
 |      Parameters
 |      ----------
 |      ascending : bool, default True
 |          If False, number in reverse, from length of group - 1 to 0.
 |      
 |      Examples
 |      --------
 |      
 |      >>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
 |      ...                   columns=['A'])
 |      >>> df
 |         A
 |      0  a
 |      1  a
 |      2  a
 |      3  b
 |      4  b
 |      5  a
 |      >>> df.groupby('A').cumcount()
 |      0    0
 |      1    1
 |      2    2
 |      3    0
 |      4    1
 |      5    3
 |      dtype: int64
 |      >>> df.groupby('A').cumcount(ascending=False)
 |      0    3
 |      1    2
 |      2    1
 |      3    1
 |      4    0
 |      5    0
 |      dtype: int64
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  cumprod(self, axis=0, *args, **kwargs)
 |      Cumulative product for each group
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  cumsum(self, axis=0, *args, **kwargs)
 |      Cumulative sum for each group
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  expanding(self, *args, **kwargs)
 |      Return an expanding grouper, providing expanding
 |      functionaility per group
 |      
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  ffill = pad(self, limit=None)
 |  
 |  first = f(self)
 |      Compute first of group values
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  head(self, n=5)
 |      Returns first n rows of each group.
 |      
 |      Essentially equivalent to ``.apply(lambda x: x.head(n))``,
 |      except ignores as_index flag.
 |      
 |      Examples
 |      --------
 |      
 |      >>> df = DataFrame([[1, 2], [1, 4], [5, 6]],
 |                         columns=['A', 'B'])
 |      >>> df.groupby('A', as_index=False).head(1)
 |         A  B
 |      0  1  2
 |      2  5  6
 |      >>> df.groupby('A').head(1)
 |         A  B
 |      0  1  2
 |      2  5  6
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  irow(self, i)
 |      DEPRECATED. Use ``.nth(i)`` instead
 |  
 |  last = f(self)
 |      Compute last of group values
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  max = f(self)
 |      Compute max of group values
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  mean(self, *args, **kwargs)
 |      Compute mean of groups, excluding missing values
 |      
 |      For multiple groupings, the result index will be a MultiIndex
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  median(self)
 |      Compute median of groups, excluding missing values
 |      
 |      For multiple groupings, the result index will be a MultiIndex
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  min = f(self)
 |      Compute min of group values
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  nth(self, n, dropna=None)
 |      Take the nth row from each group if n is an int, or a subset of rows
 |      if n is a list of ints.
 |      
 |      If dropna, will take the nth non-null row, dropna is either
 |      Truthy (if a Series) or 'all', 'any' (if a DataFrame);
 |      this is equivalent to calling dropna(how=dropna) before the
 |      groupby.
 |      
 |      Parameters
 |      ----------
 |      n : int or list of ints
 |          a single nth value for the row or a list of nth values
 |      dropna : None or str, optional
 |          apply the specified dropna operation before counting which row is
 |          the nth row. Needs to be None, 'any' or 'all'
 |      
 |      Examples
 |      --------
 |      
 |      >>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
 |      ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
 |      >>> g = df.groupby('A')
 |      >>> g.nth(0)
 |           B
 |      A
 |      1  NaN
 |      2  3.0
 |      >>> g.nth(1)
 |           B
 |      A
 |      1  2.0
 |      2  5.0
 |      >>> g.nth(-1)
 |           B
 |      A
 |      1  4.0
 |      2  5.0
 |      >>> g.nth([0, 1])
 |           B
 |      A
 |      1  NaN
 |      1  2.0
 |      2  3.0
 |      2  5.0
 |      
 |      Specifying ``dropna`` allows count ignoring NaN
 |      
 |      >>> g.nth(0, dropna='any')
 |           B
 |      A
 |      1  2.0
 |      2  3.0
 |      
 |      NaNs denote group exhausted when using dropna
 |      
 |      >>> g.nth(3, dropna='any')
 |          B
 |      A
 |      1 NaN
 |      2 NaN
 |      
 |      Specifying ``as_index=False`` in ``groupby`` keeps the original index.
 |      
 |      >>> df.groupby('A', as_index=False).nth(1)
 |         A    B
 |      1  1  2.0
 |      4  2  5.0
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  ohlc(self)
 |      Compute sum of values, excluding missing values
 |      For multiple groupings, the result index will be a MultiIndex
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  pad(self, limit=None)
 |      Forward fill the values
 |      
 |      Parameters
 |      ----------
 |      limit : integer, optional
 |          limit of how many values to fill
 |      
 |      See Also
 |      --------
 |      Series.fillna
 |      DataFrame.fillna
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  prod = f(self)
 |      Compute prod of group values
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  resample(self, rule, *args, **kwargs)
 |      Provide resampling when using a TimeGrouper
 |      Return a new grouper with our resampler appended
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  rolling(self, *args, **kwargs)
 |      Return a rolling grouper, providing rolling
 |      functionaility per group
 |      
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  sem(self, ddof=1)
 |      Compute standard error of the mean of groups, excluding missing values
 |      
 |      For multiple groupings, the result index will be a MultiIndex
 |      
 |      Parameters
 |      ----------
 |      ddof : integer, default 1
 |          degrees of freedom
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  shift(self, periods=1, freq=None, axis=0)
 |      Shift each group by periods observations
 |      
 |      Parameters
 |      ----------
 |      periods : integer, default 1
 |          number of periods to shift
 |      freq : frequency string
 |      axis : axis to shift, default 0
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  size(self)
 |      Compute group sizes
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  std(self, ddof=1, *args, **kwargs)
 |      Compute standard deviation of groups, excluding missing values
 |      
 |      For multiple groupings, the result index will be a MultiIndex
 |      
 |      Parameters
 |      ----------
 |      ddof : integer, default 1
 |          degrees of freedom
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  sum = f(self)
 |      Compute sum of group values
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  tail(self, n=5)
 |      Returns last n rows of each group
 |      
 |      Essentially equivalent to ``.apply(lambda x: x.tail(n))``,
 |      except ignores as_index flag.
 |      
 |      Examples
 |      --------
 |      
 |      >>> df = DataFrame([['a', 1], ['a', 2], ['b', 1], ['b', 2]],
 |                         columns=['A', 'B'])
 |      >>> df.groupby('A').tail(1)
 |         A  B
 |      1  a  2
 |      3  b  2
 |      >>> df.groupby('A').head(1)
 |         A  B
 |      0  a  1
 |      2  b  1
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  var(self, ddof=1, *args, **kwargs)
 |      Compute variance of groups, excluding missing values
 |      
 |      For multiple groupings, the result index will be a MultiIndex
 |      
 |      Parameters
 |      ----------
 |      ddof : integer, default 1
 |          degrees of freedom
 |      
 |      
 |      See also
 |      --------
 |      pandas.Series.groupby
 |      pandas.DataFrame.groupby
 |      pandas.Panel.groupby
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from _GroupBy:
 |  
 |  __getattr__(self, attr)
 |  
 |  __init__(self, obj, keys=None, axis=0, level=None, grouper=None, exclusions=None, selection=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Groupby iterator
 |      
 |      Returns
 |      -------
 |      Generator yielding sequence of (name, subsetted object)
 |      for each group
 |  
 |  __len__(self)
 |  
 |  __unicode__(self)
 |      Return a string representation for a particular object.
 |      
 |      Invoked by unicode(obj) in py2 only. Yields a Unicode String in both
 |      py2/py3.
 |  
 |  apply(self, func, *args, **kwargs)
 |      Apply function and combine results together in an intelligent way. The
 |      split-apply-combine combination rules attempt to be as common sense
 |      based as possible. For example:
 |      
 |      case 1:
 |      group DataFrame
 |      apply aggregation function (f(chunk) -> Series)
 |      yield DataFrame, with group axis having group labels
 |      
 |      case 2:
 |      group DataFrame
 |      apply transform function ((f(chunk) -> DataFrame with same indexes)
 |      yield DataFrame with resulting chunks glued together
 |      
 |      case 3:
 |      group Series
 |      apply function with f(chunk) -> DataFrame
 |      yield DataFrame with result of chunks glued together
 |      
 |      Parameters
 |      ----------
 |      func : function
 |      
 |      Notes
 |      -----
 |      See online documentation for full exposition on how to use apply.
 |      
 |      In the current implementation apply calls func twice on the
 |      first group to decide whether it can take a fast or slow code
 |      path. This can lead to unexpected behavior if func has
 |      side-effects, as they will take effect twice for the first
 |      group.
 |      
 |      
 |      See also
 |      --------
 |      aggregate, transform
 |  
 |  get_group(self, name, obj=None)
 |      Constructs NDFrame from group with provided name
 |      
 |      Parameters
 |      ----------
 |      name : object
 |          the name of the group to get as a DataFrame
 |      obj : NDFrame, default None
 |          the NDFrame to take the DataFrame out of.  If
 |          it is None, the object groupby was called on will
 |          be used
 |      
 |      Returns
 |      -------
 |      group : type of obj
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from _GroupBy:
 |  
 |  groups
 |      dict {group name -> group labels}
 |  
 |  indices
 |      dict {group name -> group indices}
 |  
 |  ngroups
 |  
 |  plot
 |      Class implementing the .plot attribute for groupby objects
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pandas.core.base.PandasObject:
 |  
 |  __dir__(self)
 |      Provide method name lookup and completion
 |      Only provide 'public' methods
 |  
 |  __sizeof__(self)
 |      Generates the total memory usage for a object that returns
 |      either a value or Series of values
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pandas.core.base.StringMixin:
 |  
 |  __bytes__(self)
 |      Return a string representation for a particular object.
 |      
 |      Invoked by bytes(obj) in py3 only.
 |      Yields a bytestring in both py2/py3.
 |  
 |  __repr__(self)
 |      Return a string representation for a particular object.
 |      
 |      Yields Bytestring in Py2, Unicode String in py3.
 |  
 |  __str__(self)
 |      Return a string representation for a particular Object
 |      
 |      Invoked by str(df) in both py2/py3.
 |      Yields Bytestring in Py2, Unicode String in py3.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from pandas.core.base.StringMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pandas.core.base.SelectionMixin:
 |  
 |  __getitem__(self, key)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from pandas.core.base.SelectionMixin:
 |  
 |  ndim
grouped.mean()
key1
a   -0.198323
b   -0.613947
Name: data1, dtype: float64
df['data1'].groupby(['a','a','b','b','a']).mean()
a   -0.198323
b   -0.613947
Name: data1, dtype: float64
p=df['data1'];p
0   -0.288087
1    0.804061
2   -1.672009
3    0.444115
4   -1.110942
Name: data1, dtype: float64
p.groupby(df['key1']).mean()
key1
a   -0.198323
b   -0.613947
Name: data1, dtype: float64

The important thing above is that the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the key1 column.

The result index has the name 'key1' because the DataFrame column df['key1'] did!
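
A quick way to confirm this (a small check added here, not part of the original notebook): the name attribute of the grouping Series is carried over to the result index, whereas grouping by a bare array attaches no name.

grouped.mean().index.name                                   # 'key1'
df['data1'].groupby(df['key1'].values).mean().index.name    # None: a plain array carries no name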

If instead we had passed multiple arrays as a list, we'd get something different:

means=df['data1'].groupby([df['key1'],df['key2']]).mean();means
key1  key2
a     one    -0.699514
      two     0.804061
b     one    -1.672009
      two     0.444115
Name: data1, dtype: float64
df['data1'].groupby([df['key2'],df['key1']]).mean()
key2  key1
one   a      -0.699514
      b      -1.672009
two   a       0.804061
      b       0.444115
Name: data1, dtype: float64

Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:

means.unstack()
key2 one two
key1
a -0.699514 0.804061
b -1.672009 0.444115

In this example, the group keys are all Series, though they could be any arrays of the right length:

states=np.array(['Ohio','California','Ohio','Ohio','California'])
years=np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
California  2005    0.804061
            2006   -1.110942
Ohio        2005    0.078014
            2006   -1.672009
Name: data1, dtype: float64
  • The second method:
    Frequently the grouping information is found in the same DataFrame as the data you want to work on. In this case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:
df.groupby('key1').mean()
data1 data2
key1
a -0.198323 -0.039899
b -0.613947 1.384352
df.groupby(['key1','key2']).mean()
               data1     data2
key1 key2
a    one   -0.699514 -0.386868
     two    0.804061  0.654039
b    one   -1.672009  1.780028
     two    0.444115  0.988676

You may have noticed in the first case of df.groupby('key1').mean() that there is no key2 column in the result. Because df['key2'] is not numeric data, it is said to be a nuisance column and is therefore excluded from the result. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset.
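
If you only care about particular numeric columns, you can make the subsetting explicit rather than relying on the automatic exclusion; a brief sketch (column subsetting is covered in more detail below):

df.groupby('key1')[['data1']].mean()            # aggregate only data1
df.groupby('key1')[['data1','data2']].mean()    # explicit numeric subset; key2 never enters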

help(df.groupby(['key1','key2']).size)
Help on method size in module pandas.core.groupby:

size() method of pandas.core.groupby.DataFrameGroupBy instance
    Compute group sizes
    
    See also
    --------
    pandas.Series.groupby
    pandas.DataFrame.groupby
    pandas.Panel.groupby

A generally useful GroupBy method is size, which returns a Series containing group sizes:

df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
df
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
4 -1.110942 0.581883 a one

We can see from the above that the count for ('a', 'one') is indeed 2.
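
Since size returns a Series indexed by the group keys, unstacking the inner level gives a compact table of counts per key combination (a small sketch added for illustration):

df.groupby(['key1','key2']).size().unstack()
# key2  one  two
# key1
# a       2    1
# b       1    1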

Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

for name,group in df.groupby('key1'):
    print(name)
    print(group)
a
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
4 -1.110942  0.581883    a  one
b
      data1     data2 key1 key2
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two

In the case of multiple keys, the first element in the tuple will be a tuple of key values.

for (k1,k2),group in df.groupby(['key1','key2']):
    print((k1,k2))
    print(group)
('a', 'one')
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
4 -1.110942  0.581883    a  one
('a', 'two')
      data1     data2 key1 key2
1  0.804061  0.654039    a  two
('b', 'one')
      data1     data2 key1 key2
2 -1.672009  1.780028    b  one
('b', 'two')
      data1     data2 key1 key2
3  0.444115  0.988676    b  two
df
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
4 -1.110942 0.581883 a one

A useful recipe is to compute a dict of the data pieces as a one-liner:

pieces=dict(list(df.groupby('key1')))
pieces
{'a':       data1     data2 key1 key2
 0 -0.288087 -1.355619    a  one
 1  0.804061  0.654039    a  two
 4 -1.110942  0.581883    a  one, 'b':       data1     data2 key1 key2
 2 -1.672009  1.780028    b  one
 3  0.444115  0.988676    b  two}
pieces['b']
data1 data2 key1 key2
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
list(df.groupby('key1'))
[('a',       data1     data2 key1 key2
  0 -0.288087 -1.355619    a  one
  1  0.804061  0.654039    a  two
  4 -1.110942  0.581883    a  one), ('b',       data1     data2 key1 key2
  2 -1.672009  1.780028    b  one
  3  0.444115  0.988676    b  two)]
dict([('a',1),('b',2)])
{'a': 1, 'b': 2}
dict([('a',1)])
{'a': 1}
dict([(1,2)])
{1: 2}

The above shows dict's built-in construction from a list of (key, value) pairs, which is what makes the one-liner work.

df
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
4 -1.110942 0.581883 a one
df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
grouped=df.groupby(df.dtypes,axis=1)  # note: with axis=1, the columns of df are split into two groups according to dtype
for dtype,group in grouped:
    print(dtype)
    print(group)
float64
      data1     data2
0 -0.288087 -1.355619
1  0.804061  0.654039
2 -1.672009  1.780028
3  0.444115  0.988676
4 -1.110942  0.581883
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
pp=dict(list(grouped))
pp
{dtype('float64'):       data1     data2
 0 -0.288087 -1.355619
 1  0.804061  0.654039
 2 -1.672009  1.780028
 3  0.444115  0.988676
 4 -1.110942  0.581883, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
i,j=pp.keys()
pp[i]
data1 data2
0 -0.288087 -1.355619
1 0.804061 0.654039
2 -1.672009 1.780028
3 0.444115 0.988676
4 -1.110942 0.581883
pp[j]
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
type(i)
numpy.dtype
pp[np.dtype('object')]
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one

Selecting a column or subset of columns

a=list(df.groupby('key1')['data1']);a
[('a', 0   -0.288087
  1    0.804061
  4   -1.110942
  Name: data1, dtype: float64), ('b', 2   -1.672009
  3    0.444115
  Name: data1, dtype: float64)]
list(df['data1'].groupby(df['key1']))
[('a', 0   -0.288087
  1    0.804061
  4   -1.110942
  Name: data1, dtype: float64), ('b', 2   -1.672009
  3    0.444115
  Name: data1, dtype: float64)]
df
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
4 -1.110942 0.581883 a one
df.groupby('key1').size()
key1
a    3
b    2
dtype: int64
a=df.groupby('key1')['data1'];a
<pandas.core.groupby.SeriesGroupBy object at 0x000001DDB5670F60>
b=df.groupby('key1')[['data1']];a
<pandas.core.groupby.SeriesGroupBy object at 0x000001DDB5670F60>
aa=df['data1'].groupby(df['key1']);aa
<pandas.core.groupby.SeriesGroupBy object at 0x000001DDB56707B8>
bb=df[['data1']].groupby(df['key1']);bb
<pandas.core.groupby.DataFrameGroupBy object at 0x000001DDB56703C8>

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation.
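
The form of the indexing also controls the return type: a single column name yields a SeriesGroupBy and hence a Series result, while a one-element list yields a DataFrameGroupBy and a one-column DataFrame. A small sketch:

df.groupby('key1')['data1'].mean()      # Series indexed by key1
df.groupby('key1')[['data1']].mean()    # one-column DataFrame indexed by key1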

df
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
4 -1.110942 0.581883 a one
list(df.groupby(['key1','key2'])['data2'])
[(('a', 'one'), 0   -1.355619
  4    0.581883
  Name: data2, dtype: float64), (('a', 'two'), 1    0.654039
  Name: data2, dtype: float64), (('b', 'one'), 2    1.780028
  Name: data2, dtype: float64), (('b', 'two'), 3    0.988676
  Name: data2, dtype: float64)]
df.groupby(['key1','key2'])[['data2']].mean()
data2
key1 key2
a one -0.386868
two 0.654039
b one 1.780028
two 0.988676
df[['data1','data2']].mean()
data1   -0.364572
data2    0.529801
dtype: float64

Grouping with Dicts and Series

Grouping information may exist in a form other than an array.

people=pd.DataFrame(np.random.randn(5,5),columns=list('abcde'),index=['Joe','Steve','Wes','Jim','Travis']);people
a b c d e
Joe 0.696631 -0.491001 -0.065796 -1.392325 -0.651280
Steve 0.965853 -1.694954 2.142343 -0.969325 1.129057
Wes -0.145624 -0.548971 -0.893291 2.421140 0.305064
Jim 0.434581 1.189071 0.455694 -0.929888 1.193285
Travis -0.321174 -1.400983 -0.529033 0.037673 1.221623

Now suppose we have a group correspondence for the columns and want to sum together the columns by group.

mapping={'a':'red','b':'red',
        'c':'blue','d':'blue',
        'e':'red','f':'orange'}
by_column=people.groupby(mapping,axis=1)
by_column.sum()
blue red
Joe -1.458121 -0.445651
Steve 1.173019 0.399957
Wes 1.527849 -0.389532
Jim -0.474194 2.816937
Travis -0.491360 -0.500533

The same functionality holds for Series, which can be viewed as a fixed-size mapping.


map_series=pd.Series(mapping);map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
people.groupby(map_series,axis=1).count()
blue red
Joe 2 3
Steve 2 3
Wes 2 3
Jim 2 3
Travis 2 3
people.groupby(['red','red','blue','blue','red']).mean()
a b c d e
blue 0.144479 0.320050 -0.218799 0.745626 0.749174
red 0.447103 -1.195646 0.515838 -0.774659 0.566467
people.groupby(['red','red','blue','blue','red'],axis=1).mean()
blue red
Joe -0.729060 -0.148550
Steve 0.586509 0.133319
Wes 0.763924 -0.129844
Jim -0.237097 0.938979
Travis -0.245680 -0.166844

Grouping with functions

Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names.

Suppose you want to group by the length of the names; while you could compute an array of string lengths, it's simpler to just pass the len function.

people.groupby(len).sum()
a b c d e
3 0.985588 0.149098 -0.503394 0.098927 0.847068
5 0.965853 -1.694954 2.142343 -0.969325 1.129057
6 -0.321174 -1.400983 -0.529033 0.037673 1.221623

Mixing functions with arrays, dicts, or Series is not a problem, as everything gets converted to arrays internally:

key_list=['one','one','one','two','two']
key_test=[1,1,1,1,1]
people.groupby([len,key_list]).min()
a b c d e
3 one -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
two 0.434581 1.189071 0.455694 -0.929888 1.193285
5 one 0.965853 -1.694954 2.142343 -0.969325 1.129057
6 two -0.321174 -1.400983 -0.529033 0.037673 1.221623
people.groupby([len,key_test]).min()
a b c d e
3 1 -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
5 1 0.965853 -1.694954 2.142343 -0.969325 1.129057
6 1 -0.321174 -1.400983 -0.529033 0.037673 1.221623

Grouping by Index levels

A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an index.

columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3,]],names=['City','tenor'])
hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns);hier_df
City US JP
tenor 1 3 5 1 3
0 -1.878462 1.360321 -1.444771 0.883927 0.995222
1 -1.082905 1.061055 -0.938387 -0.391143 1.383284
2 -0.626484 -0.298127 1.144298 0.941223 1.163550
3 -0.272500 -3.441699 0.804467 -0.029275 -0.003555

To group by level, pass the level number or name using the level keyword.

hier_df.groupby(level='City',axis=1).count()
City JP US
0 2 3
1 2 3
2 2 3
3 2 3

Data aggregation

Aggregations refer to any data transformation that produces scalar values from arrays.

description=['Number of non-NA values in the group',
            'Sum of non-NA values',
            'Mean of non-NA values',
            'Arithmetic median of non-NA values',
            'Unbiased (n-1 denominator) standard deviation and variance',
            'Minimum and maximum of non-NA values',
            'Product of non-NA values',
            'First and last non-NA value']

Tip: Custom aggregation functions are generally much slower than the optimized functions listed in the table below. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate chunks.

index=pd.Index(['count','sum','mean','median','std,var','min,max','prod','first,last'],name='Function name')
pd.DataFrame(description,index=index)
                          0
Function name
count            Number of non-NA values in the group
sum              Sum of non-NA values
mean             Mean of non-NA values
median           Arithmetic median of non-NA values
std,var          Unbiased (n-1 denominator) standard deviation an...
min,max          Minimum and maximum of non-NA values
prod             Product of non-NA values
first,last       First and last non-NA value
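
As a small illustration of the tip above, the built-in name and an equivalent Python function produce the same numbers, but the string dispatches to the optimized implementation while the lambda is called once per group (a sketch; timings not shown):

df.groupby('key1')['data1'].agg('mean')               # optimized built-in aggregation
df.groupby('key1')['data1'].agg(lambda x: x.mean())   # same result through a generic Python function, with extra overhead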

To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method.

grouped=df.groupby('key1')
def peak_to_peak(arr):
    return arr.max()-arr.min()
grouped.agg(peak_to_peak)
data1 data2
key1
a 1.915003 2.009658
b 2.116124 0.791351
df
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
4 -1.110942 0.581883 a one
def length(arr):
    l=0
    for i in arr:
        l+=i**2
    return l**0.5
        
grouped.agg(length)
data1 data2
key1
a 1.401321 1.613710
b 1.729986 2.036168
grouped.describe()
                data1     data2
key1
a    count   3.000000  3.000000
     mean   -0.198323 -0.039899
     std     0.960652  1.140018
     min    -1.110942 -1.355619
     25%    -0.699514 -0.386868
     50%    -0.288087  0.581883
     75%     0.257987  0.617961
     max     0.804061  0.654039
b    count   2.000000  2.000000
     mean   -0.613947  1.384352
     std     1.496326  0.559570
     min    -1.672009  0.988676
     25%    -1.142978  1.186514
     50%    -0.613947  1.384352
     75%    -0.084916  1.582190
     max     0.444115  1.780028

You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking.
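
This works because calls that GroupBy does not implement itself are dispatched to the grouped data, as the docstring shown earlier describes; for example (a sketch):

grouped.quantile(0.9)   # Series.quantile applied within each key1 group (numeric columns only)
grouped.head(2)         # first two rows of each group, glued back together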

Column-wise and multiple function application

tips=pd.read_csv(r'./pydata-book-2nd-edition/examples/tips.csv')
tips.head()
total_bill tip smoker day time size
0 16.99 1.01 No Sun Dinner 2
1 10.34 1.66 No Sun Dinner 3
2 21.01 3.50 No Sun Dinner 3
3 23.68 3.31 No Sun Dinner 2
4 24.59 3.61 No Sun Dinner 4
tips['tip_pct']=tips['tip']/tips['total_bill']

As we have seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once.

grouped=tips.groupby(['day','smoker'])
group_pct=grouped['tip_pct']
grouped.size()
day   smoker
Fri   No         4
      Yes       15
Sat   No        45
      Yes       42
Sun   No        57
      Yes       19
Thur  No        45
      Yes       17
dtype: int64
group_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

If you pass a list of functions or function names instead,you get back a DataFrame with column names taken from the functions:

group_pct.agg(['mean','std',peak_to_peak])
mean std peak_to_peak
day smoker
Fri No 0.151650 0.028123 0.067349
Yes 0.174783 0.051293 0.159925
Sat No 0.158048 0.039767 0.235193
Yes 0.147906 0.061375 0.290095
Sun No 0.160113 0.042347 0.193226
Yes 0.187250 0.154134 0.644685
Thur No 0.160298 0.038774 0.193350
Yes 0.163863 0.039389 0.151240

As an aside, you can also pass an array or list of values directly to groupby as the group keys and then select a single column; a small example:

df=pd.DataFrame(np.arange(12).reshape(3,4),index=['one','two','three'],columns=['o','t','th','f']);df
o t th f
one 0 1 2 3
two 4 5 6 7
three 8 9 10 11
list(df.groupby([0,0,1])['f'])
[(0, one    3
  two    7
  Name: f, dtype: int32), (1, three    11
  Name: f, dtype: int32)]

If you pass a list of (name,function) tuples,the first element of each tuple will be used as the DataFrame column names.

group_pct.agg([('foo','mean'),('bar',np.std)])
foo bar
day smoker
Fri No 0.151650 0.028123
Yes 0.174783 0.051293
Sat No 0.158048 0.039767
Yes 0.147906 0.061375
Sun No 0.160113 0.042347
Yes 0.187250 0.154134
Thur No 0.160298 0.038774
Yes 0.163863 0.039389

With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:

functions=['count','mean','max']
result=grouped[['tip_pct','total_bill']].agg(functions);result
tip_pct total_bill
count mean max count mean max
day smoker
Fri No 4 0.151650 0.187735 4 18.420000 22.75
Yes 15 0.174783 0.263480 15 16.813333 40.17
Sat No 45 0.158048 0.291990 45 19.661778 48.33
Yes 42 0.147906 0.325733 42 21.276667 50.81
Sun No 57 0.160113 0.252672 57 20.506667 48.17
Yes 19 0.187250 0.710345 19 24.120000 45.35
Thur No 45 0.160298 0.266312 45 17.113111 41.19
Yes 17 0.163863 0.241255 17 19.190588 43.11

As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using concat to glue the results together, with the column names used as the keys argument.

As before,a list of tuples with custom names can be passed:

ftuples=[('Durchs','mean'),('adkfd',np.var)]
grouped[['tip_pct','total_bill']].agg(ftuples)
tip_pct total_bill
Durchs adkfd Durchs adkfd
day smoker
Fri No 0.151650 0.000791 18.420000 25.596333
Yes 0.174783 0.002631 16.813333 82.562438
Sat No 0.158048 0.001581 19.661778 79.908965
Yes 0.147906 0.003767 21.276667 101.387535
Sun No 0.160113 0.001793 20.506667 66.099980
Yes 0.187250 0.023757 24.120000 109.046044
Thur No 0.160298 0.001503 17.113111 59.625081
Yes 0.163863 0.001551 19.190588 69.808518

Now, suppose you want to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:

grouped.agg({'tip':np.max,'size':'sum'})
tip size
day smoker
Fri No 3.50 9
Yes 4.73 31
Sat No 9.00 115
Yes 10.00 104
Sun No 6.00 167
Yes 6.50 49
Thur No 6.70 112
Yes 5.00 40
grouped.agg(ftuples)
total_bill tip size tip_pct
Durchs adkfd Durchs adkfd Durchs adkfd Durchs adkfd
day smoker
Fri No 18.420000 25.596333 2.812500 0.807292 2.250000 0.250000 0.151650 0.000791
Yes 16.813333 82.562438 2.714000 1.161369 2.066667 0.352381 0.174783 0.002631
Sat No 19.661778 79.908965 3.102889 2.696453 2.555556 0.616162 0.158048 0.001581
Yes 21.276667 101.387535 2.875476 2.658791 2.476190 0.743322 0.147906 0.003767
Sun No 20.506667 66.099980 3.167895 1.500099 2.929825 1.066416 0.160113 0.001793
Yes 24.120000 109.046044 3.516842 1.590501 2.578947 0.812865 0.187250 0.023757
Thur No 17.113111 59.625081 2.673778 1.645997 2.488889 1.391919 0.160298 0.001503
Yes 19.190588 69.808518 3.030000 1.239863 2.352941 0.492647 0.163863 0.001551
grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
tip_pct size
min max mean std sum
day smoker
Fri No 0.120385 0.187735 0.151650 0.028123 9
Yes 0.103555 0.263480 0.174783 0.051293 31
Sat No 0.056797 0.291990 0.158048 0.039767 115
Yes 0.035638 0.325733 0.147906 0.061375 104
Sun No 0.059447 0.252672 0.160113 0.042347 167
Yes 0.065660 0.710345 0.187250 0.154134 49
Thur No 0.072961 0.266312 0.160298 0.038774 112
Yes 0.090014 0.241255 0.163863 0.039389 40

A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
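
If you prefer flat column names after such an aggregation, one common approach is to join the two levels yourself (a sketch, reusing the grouped object from above):

flat=grouped[['tip_pct','total_bill']].agg(['count','mean','max'])
flat.columns=['_'.join(col) for col in flat.columns]   # e.g. 'tip_pct_mean', 'total_bill_max'
flat.head()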

Returning aggregated data without row indexes

In all of the examples up until now,the aggregated data comes back with an index,potentially hierarchical,composed from the unique group key combinations.Since this is not always desirable,you can disable this behaviour in most cases by passing as_index=False to groupby.

tips.groupby(['day','smoker'],as_index=False).mean()
day smoker total_bill tip size tip_pct
0 Fri No 18.420000 2.812500 2.250000 0.151650
1 Fri Yes 16.813333 2.714000 2.066667 0.174783
2 Sat No 19.661778 3.102889 2.555556 0.158048
3 Sat Yes 21.276667 2.875476 2.476190 0.147906
4 Sun No 20.506667 3.167895 2.929825 0.160113
5 Sun Yes 24.120000 3.516842 2.578947 0.187250
6 Thur No 17.113111 2.673778 2.488889 0.160298
7 Thur Yes 19.190588 3.030000 2.352941 0.163863
tips.groupby(['day','smoker']).mean()
total_bill tip size tip_pct
day smoker
Fri No 18.420000 2.812500 2.250000 0.151650
Yes 16.813333 2.714000 2.066667 0.174783
Sat No 19.661778 3.102889 2.555556 0.158048
Yes 21.276667 2.875476 2.476190 0.147906
Sun No 20.506667 3.167895 2.929825 0.160113
Yes 24.120000 3.516842 2.578947 0.187250
Thur No 17.113111 2.673778 2.488889 0.160298
Yes 19.190588 3.030000 2.352941 0.163863
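
It is also always possible to obtain the same non-indexed layout after the fact by calling reset_index on the result; a quick sketch:

tips.groupby(['day','smoker']).mean().reset_index()   # equivalent to passing as_index=False above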

Apply: general split-apply-combine

The most general-purpose GroupBy method is apply,which is the subject of the rest of this section.

tips.head()
total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

Returning to the tipping dataset from before,suppose you want to select the top five tip_pct values by group.First,write a function that selects the rows with the largest values in a particular column:

def top(df,n=5,column='tip_pct'):
    return df.sort_values(by=column)[-n:]
tips.sort_values(by='tip_pct')[-5:]
total_bill tip smoker day time size tip_pct
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
top(tips,n=6)
total_bill tip smoker day time size tip_pct
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345

Now, if we group by smoker and call apply with this function, we get:

tips.groupby('smoker').apply(top)
total_bill tip smoker day time size tip_pct
smoker
No 88 24.71 5.85 No Thur Lunch 2 0.236746
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345

If you pass a function to apply that takes other arguments or keywords,you can pass these after the function:

tips.groupby(['smoker','day']).apply(top,n=2,column='total_bill')
total_bill tip smoker day time size tip_pct
smoker day
No Fri 91 22.49 3.50 No Fri Dinner 2 0.155625
94 22.75 3.25 No Fri Dinner 2 0.142857
Sat 59 48.27 6.73 No Sat Dinner 4 0.139424
212 48.33 9.00 No Sat Dinner 4 0.186220
Sun 112 38.07 4.00 No Sun Dinner 3 0.105070
156 48.17 5.00 No Sun Dinner 6 0.103799
Thur 85 34.83 5.17 No Thur Lunch 4 0.148435
142 41.19 5.00 No Thur Lunch 5 0.121389
Yes Fri 90 28.97 3.00 Yes Fri Dinner 2 0.103555
95 40.17 4.73 Yes Fri Dinner 4 0.117750
Sat 102 44.30 2.50 Yes Sat Dinner 3 0.056433
170 50.81 10.00 Yes Sat Dinner 3 0.196812
Sun 184 40.55 3.00 Yes Sun Dinner 2 0.073983
182 45.35 3.50 Yes Sun Dinner 3 0.077178
Thur 83 32.68 5.00 Yes Thur Lunch 2 0.152999
197 43.11 5.00 Yes Thur Lunch 4 0.115982

Suppressing the group keys

In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby.

tips.groupby('smoker',group_keys=False).apply(top)
total_bill tip smoker day time size tip_pct
88 24.71 5.85 No Thur Lunch 2 0.236746
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345

Notice the difference between agg and apply: agg reduces each column of the GroupBy object to a single value per group, while apply passes each group to the function as a whole, so the result need not be a single value per column the way it is with agg.

tips.groupby('smoker').agg('max')
total_bill tip day time size tip_pct
smoker
No 48.33 9.0 Thur Lunch 6 0.291990
Yes 50.81 10.0 Thur Lunch 5 0.710345
tips.groupby('smoker').apply(max)
total_bill tip smoker day time size tip_pct
smoker
No 48.33 9.0 No Thur Lunch 6 0.291990
Yes 50.81 10.0 Yes Thur Lunch 5 0.710345

The two calls above return the same result. To some extent, agg is a special case of apply.
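
Where they differ is that apply may return more than one value per group. For instance, returning the two largest tips within each smoker group (a sketch; this output is not part of the original notebook):

tips.groupby('smoker')['tip'].apply(lambda s: s.nlargest(2))   # agg would require a single reducing value per group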

Quantile and bucket analysis

Combining qcut or cut with groupby makes it convenient to perform bucket or quantile analysis on a dataset.

frame=pd.DataFrame({'data1':np.random.randn(1000),
                   'data2':np.random.randn(1000)})
quantiles=pd.cut(frame.data1,4)
quantiles[:10]
0    (0.00174, 1.893]
1    (0.00174, 1.893]
2    (0.00174, 1.893]
3     (-3.788, -1.89]
4    (0.00174, 1.893]
5    (0.00174, 1.893]
6    (0.00174, 1.893]
7     (-3.788, -1.89]
8    (0.00174, 1.893]
9    (0.00174, 1.893]
Name: data1, dtype: category
Categories (4, object): [(-3.788, -1.89] < (-1.89, 0.00174] < (0.00174, 1.893] < (1.893, 3.784]]
help(pd.cut)
Help on function cut in module pandas.tools.tile:

cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
    Return indices of half-open bins to which each value of `x` belongs.
    
    Parameters
    ----------
    x : array-like
        Input array to be binned. It has to be 1-dimensional.
    bins : int or sequence of scalars
        If `bins` is an int, it defines the number of equal-width bins in the
        range of `x`. However, in this case, the range of `x` is extended
        by .1% on each side to include the min or max values of `x`. If
        `bins` is a sequence it defines the bin edges allowing for
        non-uniform bin width. No extension of the range of `x` is done in
        this case.
    right : bool, optional
        Indicates whether the bins include the rightmost edge or not. If
        right == True (the default), then the bins [1,2,3,4] indicate
        (1,2], (2,3], (3,4].
    labels : array or boolean, default None
        Used as labels for the resulting bins. Must be of the same length as
        the resulting bins. If False, return only integer indicators of the
        bins.
    retbins : bool, optional
        Whether to return the bins or not. Can be useful if bins is given
        as a scalar.
    precision : int
        The precision at which to store and display the bins labels
    include_lowest : bool
        Whether the first interval should be left-inclusive or not.
    
    Returns
    -------
    out : Categorical or Series or array of integers if labels is False
        The return type (Categorical or Series) depends on the input: a Series
        of type category if input is a Series else Categorical. Bins are
        represented as categories when categorical data is returned.
    bins : ndarray of floats
        Returned only if `retbins` is True.
    
    Notes
    -----
    The `cut` function can be useful for going from a continuous variable to
    a categorical variable. For example, `cut` could convert ages to groups
    of age ranges.
    
    Any NA values will be NA in the result.  Out of bounds values will be NA in
    the resulting Categorical object
    
    
    Examples
    --------
    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
    ([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533],
      (6.533, 9.7], (0.191, 3.367]]
    Categories (3, object): [(0.191, 3.367] < (3.367, 6.533] < (6.533, 9.7]],
    array([ 0.1905    ,  3.36666667,  6.53333333,  9.7       ]))
    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3,
               labels=["good","medium","bad"])
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]
    >>> pd.cut(np.ones(5), 4, labels=False)
    array([1, 1, 1, 1, 1], dtype=int64)
help(pd.qcut)
Help on function qcut in module pandas.tools.tile:

qcut(x, q, labels=None, retbins=False, precision=3)
    Quantile-based discretization function. Discretize variable into
    equal-sized buckets based on rank or based on sample quantiles. For example
    1000 values for 10 quantiles would produce a Categorical object indicating
    quantile membership for each data point.
    
    Parameters
    ----------
    x : ndarray or Series
    q : integer or array of quantiles
        Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately
        array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles
    labels : array or boolean, default None
        Used as labels for the resulting bins. Must be of the same length as
        the resulting bins. If False, return only integer indicators of the
        bins.
    retbins : bool, optional
        Whether to return the bins or not. Can be useful if bins is given
        as a scalar.
    precision : int
        The precision at which to store and display the bins labels
    
    Returns
    -------
    out : Categorical or Series or array of integers if labels is False
        The return type (Categorical or Series) depends on the input: a Series
        of type category if input is a Series else Categorical. Bins are
        represented as categories when categorical data is returned.
    bins : ndarray of floats
        Returned only if `retbins` is True.
    
    Notes
    -----
    Out of bounds values will be NA in the resulting Categorical object
    
    Examples
    --------
    >>> pd.qcut(range(5), 4)
    [[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]]
    Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]]
    >>> pd.qcut(range(5), 3, labels=["good","medium","bad"])
    [good, good, medium, bad, bad]
    Categories (3, object): [good < medium < bad]
    >>> pd.qcut(range(5), 4, labels=False)
    array([0, 0, 1, 2, 3], dtype=int64)

The Categorical object returned by cut can be passed directly to groupby. So we could compute a set of statistics for the data2 column like so:

def get_status(group):
    return {'min':group.min(),'max':group.max(),
           'count':group.count(),'mean':group.mean()}
grouped=frame.data2.groupby(quantiles)
all(frame.data2==frame['data2'])
True
df=pd.Series([1,2,3,5])
pd.cut(df,2)
0    (0.996, 3]
1    (0.996, 3]
2    (0.996, 3]
3        (3, 5]
dtype: category
Categories (2, object): [(0.996, 3] < (3, 5]]
pd.qcut(df,2)
0    [1, 2.5]
1    [1, 2.5]
2    (2.5, 5]
3    (2.5, 5]
dtype: category
Categories (2, object): [[1, 2.5] < (2.5, 5]]
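
The difference between the two is easiest to see in the resulting bucket sizes. On skewed data, cut produces equal-width bins with very unequal counts, while qcut produces roughly equal counts per bin (a sketch on made-up data):

skewed=pd.Series(np.random.exponential(size=1000))
pd.cut(skewed,4).value_counts()    # equal-width bins, counts differ a lot
pd.qcut(skewed,4).value_counts()   # quantile bins, roughly 250 each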
fig=plt.figure()
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1ddb56f2908>
grouped=frame.data2.groupby(quantiles)
grouped.size()
data1
(-3.788, -1.89]      34
(-1.89, 0.00174]    443
(0.00174, 1.893]    492
(1.893, 3.784]       31
dtype: int64

Here, pd.cut(frame.data1,4) returns a length-1000 Categorical that divides data1 into 4 equal-width bins.

grouped.apply(get_status)
data1                  
(-3.788, -1.89]   count     34.000000
                  max        1.935178
                  mean      -0.147920
                  min       -1.843208
(-1.89, 0.00174]  count    443.000000
                  max        2.547017
                  mean      -0.020845
                  min       -2.779875
(0.00174, 1.893]  count    492.000000
                  max        2.729491
                  mean      -0.066996
                  min       -3.854003
(1.893, 3.784]    count     31.000000
                  max        1.533001
                  mean      -0.070277
                  min       -1.639600
Name: data2, dtype: float64
grouped.apply(get_status).unstack()
count max mean min
data1
(-3.788, -1.89] 34.0 1.935178 -0.147920 -1.843208
(-1.89, 0.00174] 443.0 2.547017 -0.020845 -2.779875
(0.00174, 1.893] 492.0 2.729491 -0.066996 -3.854003
(1.893, 3.784] 31.0 1.533001 -0.070277 -1.639600

These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut. We pass labels=False to get just the quantile numbers:

grouping=pd.qcut(frame.data1,10,labels=False)
grouped=frame.data2.groupby(grouping)
grouped.size()
data1
0    100
1    100
2    100
3    100
4    100
5    100
6    100
7    100
8    100
9    100
dtype: int64
grouped.apply(get_status).unstack()
count max mean min
data1
0 100.0 2.547017 -0.013468 -2.357562
1 100.0 2.462236 -0.042549 -2.698867
2 100.0 2.546495 0.084219 -2.423998
3 100.0 2.225054 0.035150 -2.778011
4 100.0 1.669557 -0.134327 -2.779875
5 100.0 2.649044 -0.071834 -1.871068
6 100.0 2.729491 0.036419 -2.505988
7 100.0 2.014786 -0.009873 -2.671353
8 100.0 2.043135 -0.193331 -3.854003
9 100.0 2.491035 -0.184446 -2.033173

Example: Filling missing values with group-specific values

When cleaning up missing data, in some cases you will filter out missing observations using dropna, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data. fillna is the right tool to use:

s=pd.Series(np.random.randn(6))
s[::2]=np.nan
s
0         NaN
1   -0.875438
2         NaN
3    1.106316
4         NaN
5    0.020065
dtype: float64
s.fillna(s.mean())
0    0.083648
1   -0.875438
2    0.083648
3    1.106316
4    0.083648
5    0.020065
dtype: float64

Suppose you need the fill value to vary by group.One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.

states=['Ohio','New york','Vermont','Florida','Oregon','Nevada','California','Idaho']
group_key=['East']*4+['West']*4
data=pd.Series(np.random.randn(8),index=states)
data[['Vermont','Nevada','Idaho']]=np.nan
data.groupby(group_key).mean()
East    0.199033
West   -0.704578
dtype: float64

We can fill the NA values using the group means like so:

fill_mean=lambda g : g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio         -0.953675
New york      0.531946
Vermont       0.199033
Florida       1.018829
Oregon       -0.539930
Nevada       -0.704578
California   -0.869226
Idaho        -0.704578
dtype: float64
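
An equivalent way to do this is with transform, which broadcasts each group's mean back to the original index (a sketch using the same data and group_key as above):

data.fillna(data.groupby(group_key).transform('mean'))   # same result as the apply version above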

In another case,you might have predefined fill values in your code that vary by group.

fill_values={'East':0.5,'West':-1}
fill_func=lambda g:g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Ohio         -0.953675
New york      0.531946
Vermont       0.500000
Florida       1.018829
Oregon       -0.539930
Nevada       -1.000000
California   -0.869226
Idaho        -1.000000
dtype: float64

Example: Random sampling and permutation

Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the 'draws'; here we use the sample method of Series.

suits=list('HSCD')
card_val=(list(range(1,11))+[10]*3)*4
base_names=['A']+list(range(2,11))+['J','K','Q'];base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
cards=[]
for suit in ['H','S','C','D']:
    cards.extend(str(num)+suit for num in base_names)
deck=pd.Series(card_val,index=cards)
deck
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
AS      1
2S      2
3S      3
4S      4
5S      5
6S      6
7S      7
8S      8
9S      9
10S    10
JS     10
KS     10
QS     10
AC      1
2C      2
3C      3
4C      4
5C      5
6C      6
7C      7
8C      8
9C      9
10C    10
JC     10
KC     10
QC     10
AD      1
2D      2
3D      3
4D      4
5D      5
6D      6
7D      7
8D      8
9D      9
10D    10
JD     10
KD     10
QD     10
dtype: int64
def draw(deck,n=5):
    return deck.sample(n)
draw(deck)
4H     4
AD     1
8S     8
3C     3
KH    10
dtype: int64
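
sample draws without replacement by default; for a with-replacement draw (as mentioned above), pass replace=True, e.g.:

deck.sample(n=5,replace=True)   # the same card can appear more than once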

Suppose you want two random cards from each suit.Because the suit is the last character of each card name,we can group based on this and use apply.

get_suit=lambda card:card[-1]
deck.groupby(get_suit).apply(draw,n=2)
C  KC     10
   8C      8
D  5D      5
   8D      8
H  QH     10
   5H      5
S  4S      4
   10S    10
dtype: int64

Alternatively,we could write:

deck.groupby(get_suit,group_keys=False).apply(draw,2)
8C      8
KC     10
8D      8
9D      9
8H      8
QH     10
10S    10
4S      4
dtype: int64

Example: Group weighted average and correlation

Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, are possible.

df=pd.DataFrame({'category':['a','a','a','a','b','b','b','b'],
                'data':np.random.randn(8),'weights':np.random.randn(8)});df
category data weights
0 a 0.902349 1.730949
1 a 1.287855 0.768677
2 a -0.751881 -0.571836
3 a 1.208645 0.229177
4 b 0.986127 -0.262788
5 b 1.566069 -0.252579
6 b 0.285959 0.184662
7 b 1.059962 1.804769

The group weighted average by category would then be:

grouped=df.groupby('category')
get_wavg=lambda g:np.average(g['data'],weights=g['weights'])
list(grouped)
[('a',   category      data   weights
  0        a  0.902349  1.730949
  1        a  1.287855  0.768677
  2        a -0.751881 -0.571836
  3        a  1.208645  0.229177), ('b',   category      data   weights
  4        b  0.986127 -0.262788
  5        b  1.566069 -0.252579
  6        b  0.285959  0.184662
  7        b  1.059962  1.804769)]
grouped.apply(get_wavg)
category
a    1.510830
b    0.889441
dtype: float64
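
Because the weights here come from randn, some of them are negative, which is why the weighted average for group a above falls outside the range of its data values. For a conventional weighted average you would normally use nonnegative weights, e.g. drawn with np.random.rand (a sketch with a fresh frame):

df2=pd.DataFrame({'category':['a','a','a','a','b','b','b','b'],
                 'data':np.random.randn(8),'weights':np.random.rand(8)})   # nonnegative weights
df2.groupby('category').apply(lambda g:np.average(g['data'],weights=g['weights']))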
close_px=pd.read_csv('./pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px.head()
AAPL MSFT XOM SPX
2003-01-02 7.40 21.11 29.22 909.03
2003-01-03 7.45 21.14 29.24 908.59
2003-01-06 7.45 21.52 29.96 929.01
2003-01-07 7.43 21.93 28.95 922.93
2003-01-08 7.28 21.31 28.83 909.93
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
    Read CSV (comma-separated) file into DataFrame
    
    Also supports optionally iterating or breaking of the file
    into chunks.
    
    Additional help can be found in the `online docs for IO Tools
    <http://pandas.pydata.org/pandas-docs/stable/io.html>`_.
    
    Parameters
    ----------
    filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
        The string could be a URL. Valid URL schemes include http, ftp, s3, and
        file. For file URLs, a host is expected. For instance, a local file could
        be file ://localhost/path/to/table.csv
    sep : str, default ','
        Delimiter to use. If sep is None, will try to automatically determine
        this. Separators longer than 1 character and different from ``'\s+'`` will
        be interpreted as regular expressions, will force use of the python parsing
        engine and will ignore quotes in the data. Regex example: ``'\r\t'``
    delimiter : str, default ``None``
        Alternative argument name for sep.
    delim_whitespace : boolean, default False
        Specifies whether or not whitespace (e.g. ``' '`` or ``'    '``) will be
        used as the sep. Equivalent to setting ``sep='\s+'``. If this option
        is set to True, nothing should be passed in for the ``delimiter``
        parameter.
    
        .. versionadded:: 0.18.1 support for the Python parser.
    
    header : int or list of ints, default 'infer'
        Row number(s) to use as the column names, and the start of the data.
        Default behavior is as if set to 0 if no ``names`` passed, otherwise
        ``None``. Explicitly pass ``header=0`` to be able to replace existing
        names. The header can be a list of integers that specify row locations for
        a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
        specified will be skipped (e.g. 2 in this example is skipped). Note that
        this parameter ignores commented lines and empty lines if
        ``skip_blank_lines=True``, so header=0 denotes the first line of data
        rather than the first line of the file.
    names : array-like, default None
        List of column names to use. If file contains no header row, then you
        should explicitly pass header=None. Duplicates in this list are not
        allowed unless mangle_dupe_cols=True, which is the default.
    index_col : int or sequence or False, default None
        Column to use as the row labels of the DataFrame. If a sequence is given, a
        MultiIndex is used. If you have a malformed file with delimiters at the end
        of each line, you might consider index_col=False to force pandas to _not_
        use the first column as the index (row names)
    usecols : array-like, default None
        Return a subset of the columns. All elements in this array must either
        be positional (i.e. integer indices into the document columns) or strings
        that correspond to column names provided either by the user in `names` or
        inferred from the document header row(s). For example, a valid `usecols`
        parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
        results in much faster parsing time and lower memory usage.
    as_recarray : boolean, default False
        DEPRECATED: this argument will be removed in a future version. Please call
        `pd.read_csv(...).to_records()` instead.
    
        Return a NumPy recarray instead of a DataFrame after parsing the data.
        If set to True, this option takes precedence over the `squeeze` parameter.
        In addition, as row indices are not available in such a format, the
        `index_col` parameter will be ignored.
    squeeze : boolean, default False
        If the parsed data only contains one column then return a Series
    prefix : str, default None
        Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
    mangle_dupe_cols : boolean, default True
        Duplicate columns will be specified as 'X.0'...'X.N', rather than
        'X'...'X'. Passing in False will cause data to be overwritten if there
        are duplicate names in the columns.
    dtype : Type name or dict of column -> type, default None
        Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
        (Unsupported with engine='python'). Use `str` or `object` to preserve and
        not interpret dtype.
    engine : {'c', 'python'}, optional
        Parser engine to use. The C engine is faster while the python engine is
        currently more feature-complete.
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can either
        be integers or column labels
    true_values : list, default None
        Values to consider as True
    false_values : list, default None
        Values to consider as False
    skipinitialspace : boolean, default False
        Skip spaces after delimiter.
    skiprows : list-like or integer, default None
        Line numbers to skip (0-indexed) or number of lines to skip (int)
        at the start of the file
    skipfooter : int, default 0
        Number of lines at bottom of file to skip (Unsupported with engine='c')
    skip_footer : int, default 0
        DEPRECATED: use the `skipfooter` parameter instead, as they are identical
    nrows : int, default None
        Number of rows of file to read. Useful for reading pieces of large files
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values.  By default the following values are interpreted as
        NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
    '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'`.
    keep_default_na : bool, default True
        If na_values are specified and keep_default_na is False the default NaN
        values are overridden, otherwise they're appended to.
    na_filter : boolean, default True
        Detect missing value markers (empty strings and the value of na_values). In
        data without any NAs, passing na_filter=False can improve the performance
        of reading a large file
    verbose : boolean, default False
        Indicate number of NA values placed in non-numeric columns
    skip_blank_lines : boolean, default True
        If True, skip over blank lines rather than interpreting as NaN values
    parse_dates : boolean or list of ints or names or list of lists or dict, default False
    
        * boolean. If True -> try parsing the index.
        * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
          each as a separate date column.
        * list of lists. e.g.  If [[1, 3]] -> combine columns 1 and 3 and parse as
            a single date column.
        * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
          'foo'
    
        Note: A fast-path exists for iso8601-formatted dates.
    infer_datetime_format : boolean, default False
        If True and parse_dates is enabled, pandas will attempt to infer the format
        of the datetime strings in the columns, and if it can be inferred, switch
        to a faster method of parsing them. In some cases this can increase the
        parsing speed by ~5-10x.
    keep_date_col : boolean, default False
        If True and parse_dates specifies combining multiple columns then
        keep the original columns.
    date_parser : function, default None
        Function to use for converting a sequence of string columns to an array of
        datetime instances. The default uses ``dateutil.parser.parser`` to do the
        conversion. Pandas will try to call date_parser in three different ways,
        advancing to the next if an exception occurs: 1) Pass one or more arrays
        (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
        string values from the columns defined by parse_dates into a single array
        and pass that; and 3) call date_parser once for each row using one or more
        strings (corresponding to the columns defined by parse_dates) as arguments.
    dayfirst : boolean, default False
        DD/MM format dates, international and European format
    iterator : boolean, default False
        Return TextFileReader object for iteration or getting chunks with
        ``get_chunk()``.
    chunksize : int, default None
        Return TextFileReader object for iteration. `See IO Tools docs for more
        information
        <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ on
        ``iterator`` and ``chunksize``.
    compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
        For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
        bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',
        '.zip', or 'xz', respectively, and no decompression otherwise. If using
        'zip', the ZIP file must contain only one data file to be read in.
        Set to None for no decompression.
    
        .. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
    
    thousands : str, default None
        Thousands separator
    decimal : str, default '.'
        Character to recognize as decimal point (e.g. use ',' for European data).
    float_precision : string, default None
        Specifies which converter the C engine should use for floating-point
        values. The options are `None` for the ordinary converter,
        `high` for the high-precision converter, and `round_trip` for the
        round-trip converter.
    lineterminator : str (length 1), default None
        Character to break file into lines. Only valid with C parser.
    quotechar : str (length 1), optional
        The character used to denote the start and end of a quoted item. Quoted
        items can include the delimiter and it will be ignored.
    quoting : int or csv.QUOTE_* instance, default 0
        Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
        QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
    doublequote : boolean, default ``True``
       When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
       whether or not to interpret two consecutive quotechar elements INSIDE a
       field as a single ``quotechar`` element.
    escapechar : str (length 1), default None
        One-character string used to escape delimiter when quoting is QUOTE_NONE.
    comment : str, default None
        Indicates remainder of line should not be parsed. If found at the beginning
        of a line, the line will be ignored altogether. This parameter must be a
        single character. Like empty lines (as long as ``skip_blank_lines=True``),
        fully commented lines are ignored by the parameter `header` but not by
        `skiprows`. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3'
        with `header=0` will result in 'a,b,c' being
        treated as the header.
    encoding : str, default None
        Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
        standard encodings
        <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
    dialect : str or csv.Dialect instance, default None
        If None defaults to Excel dialect. Ignored if sep longer than 1 char
        See csv.Dialect documentation for more details
    tupleize_cols : boolean, default False
        Leave a list of tuples on columns as is (default is to convert to
        a Multi Index on the columns)
    error_bad_lines : boolean, default True
        Lines with too many fields (e.g. a csv line with too many commas) will by
        default cause an exception to be raised, and no DataFrame will be returned.
        If False, then these "bad lines" will dropped from the DataFrame that is
        returned. (Only valid with C parser)
    warn_bad_lines : boolean, default True
        If error_bad_lines is False, and warn_bad_lines is True, a warning for each
        "bad line" will be output. (Only valid with C parser).
    low_memory : boolean, default True
        Internally process the file in chunks, resulting in lower memory use
        while parsing, but possibly mixed type inference.  To ensure no mixed
        types either set False, or specify the type with the `dtype` parameter.
        Note that the entire file is read into a single DataFrame regardless,
        use the `chunksize` or `iterator` parameter to return the data in chunks.
        (Only valid with C parser)
    buffer_lines : int, default None
        DEPRECATED: this argument will be removed in a future version because its
        value is not respected by the parser
    compact_ints : boolean, default False
        DEPRECATED: this argument will be removed in a future version
    
        If compact_ints is True, then for any column that is of integer dtype,
        the parser will attempt to cast it as the smallest integer dtype possible,
        either signed or unsigned depending on the specification from the
        `use_unsigned` parameter.
    use_unsigned : boolean, default False
        DEPRECATED: this argument will be removed in a future version
    
        If integer columns are being compacted (i.e. `compact_ints=True`), specify
        whether the column should be compacted to the smallest signed or unsigned
        integer dtype.
    memory_map : boolean, default False
        If a filepath is provided for `filepath_or_buffer`, map the file object
        directly onto memory and access the data directly from there. Using this
        option can improve performance because there is no longer any I/O overhead.
    
    Returns
    -------
    result : DataFrame or TextParser
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL    2214 non-null float64
MSFT    2214 non-null float64
XOM     2214 non-null float64
SPX     2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
AAPL MSFT XOM SPX
2011-10-11 400.29 27.00 76.27 1195.54
2011-10-12 402.19 26.96 77.16 1207.25
2011-10-13 408.43 27.18 76.37 1203.66
2011-10-14 422.00 27.27 78.11 1224.58
help(pd.DataFrame.info)
Help on function info in module pandas.core.frame:

info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
    Concise summary of a DataFrame.
    
    Parameters
    ----------
    verbose : {None, True, False}, optional
        Whether to print the full summary.
        None follows the `display.max_info_columns` setting.
        True or False overrides the `display.max_info_columns` setting.
    buf : writable buffer, defaults to sys.stdout
    max_cols : int, default None
        Determines whether full summary or short summary is printed.
        None follows the `display.max_info_columns` setting.
    memory_usage : boolean/string, default None
        Specifies whether total memory usage of the DataFrame
        elements (including index) should be displayed. None follows
        the `display.memory_usage` setting. True or False overrides
        the `display.memory_usage` setting. A value of 'deep' is equivalent
        of True, with deep introspection. Memory usage is shown in
        human-readable units (base-2 representation).
    null_counts : boolean, default None
        Whether to show the non-null counts
    
        - If None, then only show if the frame is smaller than
          max_info_rows and max_info_columns.
        - If True, always show counts.
        - If False, never show counts.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
category    8 non-null object
data        8 non-null float64
weights     8 non-null float64
dtypes: float64(2), object(1)
memory usage: 272.0+ bytes

One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX. As one way to do this, we first create a function that computes the pairwise correlation of each column with the SPX column:

spx_corr=lambda x:x.corrwith(x['SPX'])
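
corrwith, given a Series, computes the correlation of each column with that Series. A tiny illustration on made-up data (not from the original notebook):

demo=pd.DataFrame({'x':[1.,2.,3.,4.],'y':[2.,1.,2.,1.],'SPX':[1.,2.,3.,4.]})
demo.corrwith(demo['SPX'])   # x is perfectly correlated with SPX; y much less so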
help(pd.DataFrame.pct_change)
Help on function pct_change in module pandas.core.generic:

pct_change(self, periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
    Percent change over given number of periods.
    
    Parameters
    ----------
    periods : int, default 1
        Periods to shift for forming percent change
    fill_method : str, default 'pad'
        How to handle NAs before computing percent changes
    limit : int, default None
        The number of consecutive NAs to fill before stopping
    freq : DateOffset, timedelta, or offset alias string, optional
        Increment to use from time series API (e.g. 'M' or BDay())
    
    Returns
    -------
    chg : NDFrame
    
    Notes
    -----
    
    By default, the percentage change is calculated along the stat
    axis: 0, or ``Index``, for ``DataFrame`` and 1, or ``minor`` for
    ``Panel``. You can change this with the ``axis`` keyword argument.
df=pd.DataFrame(np.arange(12).reshape((3,4)));df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df.pct_change()
0 1 2 3
0 NaN NaN NaN NaN
1 inf 4.0 2.000000 1.333333
2 1.000000 0.8 0.666667 0.571429

Above, row by row:
row 1: (4-0)/0=inf, (5-1)/1=4, (6-2)/2=2, (7-3)/3=1.3333
row 2: (8-4)/4=1, (9-5)/5=0.8, (10-6)/6=0.6667, (11-7)/7=0.5714

rets=close_px.pct_change().dropna()
rets.head()
AAPL MSFT XOM SPX
2003-01-03 0.006757 0.001421 0.000684 -0.000484
2003-01-06 0.000000 0.017975 0.024624 0.022474
2003-01-07 -0.002685 0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09 0.008242 0.029094 0.021159 0.019386

Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each datetime label:

get_year=lambda x:x.year
by_year=rets.groupby(get_year) # get_year will be applied to rets.index
list(by_year)
[(2003,                 AAPL      MSFT       XOM       SPX
  2003-01-03  0.006757  0.001421  0.000684 -0.000484
  2003-01-06  0.000000  0.017975  0.024624  0.022474
  2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
  2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
  2003-01-09  0.008242  0.029094  0.021159  0.019386
  2003-01-10  0.002725  0.001824 -0.013927  0.000000
  2003-01-13 -0.005435  0.008648 -0.004134 -0.001412
  2003-01-14 -0.002732  0.010379  0.008993  0.005830
  2003-01-15 -0.010959 -0.012506 -0.013713 -0.014426
  2003-01-16  0.012465 -0.016282  0.004519 -0.003942
  2003-01-17 -0.035568 -0.070345 -0.010381 -0.014017
  2003-01-21 -0.005674 -0.002473 -0.023077 -0.015702
  2003-01-22 -0.009986 -0.006445 -0.012885 -0.010432
  2003-01-23  0.021614  0.024950 -0.002175  0.010224
  2003-01-24 -0.026798 -0.046251 -0.021439 -0.029233
  2003-01-27  0.024638 -0.013783 -0.026736 -0.016160
  2003-01-28  0.031117 -0.007246  0.026326  0.013050
  2003-01-29  0.024691  0.022419  0.036431  0.006779
  2003-01-30 -0.041499 -0.033656 -0.018293 -0.022849
  2003-01-31  0.002793 -0.015831  0.027768  0.013130
  2003-02-03  0.020891  0.023056  0.013864  0.005399
  2003-02-04 -0.004093 -0.025681  0.000000 -0.014088
  2003-02-05 -0.010959 -0.007531 -0.014376 -0.005435
  2003-02-06  0.000000  0.009756 -0.008538 -0.006449
  2003-02-07 -0.020776 -0.017713 -0.007535 -0.010094
  2003-02-10  0.015559  0.017486  0.007592  0.007569
  2003-02-11  0.000000 -0.019871 -0.007176 -0.008098
  2003-02-12  0.002786  0.000000 -0.019877 -0.012687
  2003-02-13  0.009722  0.011507  0.012906 -0.001600
  2003-02-14  0.009629  0.028169  0.009465  0.021435
  ...              ...       ...       ...       ...
  2003-11-18 -0.034091  0.000000 -0.012604 -0.009084
  2003-11-19  0.000980  0.008024  0.008062  0.008016
  2003-11-20 -0.001959 -0.009950 -0.003665 -0.008432
  2003-11-21 -0.004907  0.000503  0.001003  0.001577
  2003-11-24  0.042406  0.024611  0.006014  0.016227
  2003-11-25 -0.021760 -0.012745  0.007307  0.001720
  2003-11-26  0.001934  0.001986  0.012859  0.004327
  2003-11-28  0.008687  0.009911 -0.002279 -0.000236
  2003-12-01  0.038278  0.005397  0.005546  0.011264
  2003-12-02 -0.007373 -0.007321  0.002920 -0.003271
  2003-12-03 -0.023213  0.000492  0.000971 -0.001772
  2003-12-04  0.004753  0.020639  0.013251  0.004687
  2003-12-05 -0.013245 -0.008185 -0.006380 -0.007684
  2003-12-08  0.008629  0.009709  0.016693  0.007348
  2003-12-09 -0.027567  0.005288  0.006315 -0.008529
  2003-12-10 -0.003910  0.008130  0.004079 -0.001066
  2003-12-11  0.040236  0.000949  0.003125  0.011482
  2003-12-12 -0.015094  0.001422  0.004050  0.002735
  2003-12-15 -0.033525  0.003313 -0.006826 -0.005679
  2003-12-16 -0.002973  0.011792  0.017807  0.006638
  2003-12-17 -0.011928 -0.000466  0.010436  0.001256
  2003-12-18  0.008048  0.013060  0.013670  0.011798
  2003-12-19 -0.016966 -0.001381  0.004795 -0.000477
  2003-12-22  0.008122 -0.006455 -0.003281  0.003931
  2003-12-23 -0.003021 -0.001392  0.007181  0.002818
  2003-12-24  0.030303 -0.003717  0.002080 -0.001807
  2003-12-26  0.018627  0.006063  0.005336  0.001691
  2003-12-29  0.017324  0.009272  0.013270  0.012401
  2003-12-30  0.006623  0.002297  0.002619  0.000144
  2003-12-31  0.004699 -0.005500  0.007837  0.002055
  
  [251 rows x 4 columns]),
 (2004,                 AAPL      MSFT       XOM       SPX
  2004-01-02 -0.004677  0.002765 -0.008929 -0.003094
  2004-01-05  0.042293  0.025276  0.023249  0.012395
  2004-01-06 -0.003607  0.003586 -0.006816  0.001292
  2004-01-07  0.022624 -0.001340 -0.007149  0.002367
  2004-01-08  0.033628 -0.001342 -0.002592  0.004963
  2004-01-09 -0.015411 -0.017913 -0.014727 -0.008888
  2004-01-12  0.031304 -0.003192  0.014068  0.004787
  2004-01-13  0.016863 -0.005032 -0.004335 -0.005332
  2004-01-14  0.003317  0.009655  0.000000  0.008295
  2004-01-15 -0.055372 -0.005920 -0.009869  0.001353
  2004-01-16 -0.006124  0.010078  0.005570  0.006872
  2004-01-20  0.000000  0.010431  0.009329 -0.000930
  2004-01-21 -0.004401  0.007181  0.013287  0.007772
  2004-01-22 -0.019452 -0.010250 -0.003421 -0.003207
  2004-01-23  0.017133  0.016659 -0.006007 -0.002089
  2004-01-26  0.020390  0.011072  0.012374  0.012106
  2004-01-27  0.001738 -0.018835 -0.008528 -0.009798
  2004-01-28 -0.023417 -0.019196 -0.009174 -0.013610
  2004-01-29  0.007105  0.007283  0.016204  0.004989
  2004-01-30 -0.005291 -0.009489 -0.016515 -0.002628
  2004-02-02 -0.010638 -0.009124 -0.002316  0.003651
  2004-02-03 -0.002688 -0.003683 -0.006965  0.000678
  2004-02-04 -0.020665 -0.010628 -0.001461 -0.008371
  2004-02-05  0.028440 -0.001868  0.000585  0.001838
  2004-02-06  0.012489  0.004679  0.003217  0.012555
  2004-02-09 -0.000881 -0.006521  0.011662 -0.002581
  2004-02-10  0.013228  0.004219  0.008934  0.005027
  2004-02-11  0.035683  0.004669  0.020280  0.010667
  2004-02-12 -0.003361 -0.006970 -0.000280 -0.004880
  2004-02-13 -0.030354 -0.013570  0.000000 -0.005468
  ...              ...       ...       ...       ...
  2004-11-18  0.008743 -0.003738  0.000922  0.001362
  2004-11-19 -0.003973 -0.007920  0.006680 -0.011161
  2004-11-22  0.112038 -0.007563  0.010526  0.005896
  2004-11-23 -0.000978 -0.004657  0.005661 -0.000255
  2004-11-24  0.045366  0.004254  0.005629  0.004095
  2004-11-26  0.007805 -0.001694  0.003135  0.000753
  2004-11-29  0.060099  0.006364 -0.005357 -0.003450
  2004-11-30 -0.020164  0.001686 -0.002469 -0.004030
  2004-12-01  0.011035  0.016414 -0.001800  0.014951
  2004-12-02 -0.038348 -0.005797 -0.019157 -0.000873
  2004-12-03 -0.038650  0.004998  0.001838  0.000706
  2004-12-06  0.049458  0.003730 -0.002523 -0.000772
  2004-12-07 -0.044086 -0.009496 -0.008738 -0.011073
  2004-12-08  0.006361  0.010838  0.006495  0.004877
  2004-12-09  0.011378 -0.004948  0.005992  0.005436
  2004-12-10  0.018125 -0.005387 -0.003436 -0.001043
  2004-12-13 -0.003683  0.006250  0.010345  0.008990
  2004-12-14  0.005853 -0.000828 -0.000228  0.003921
  2004-12-15 -0.000613 -0.004144 -0.002959  0.001945
  2004-12-16  0.020533  0.001665 -0.007989 -0.002082
  2004-12-17 -0.024324 -0.007478 -0.005983 -0.007488
  2004-12-20 -0.034780 -0.000419  0.025231  0.000377
  2004-12-21  0.015306  0.004606  0.012192  0.009040
  2004-12-22  0.001256 -0.003752 -0.007584  0.003418
  2004-12-23  0.004078  0.001674  0.013261  0.000463
  2004-12-27 -0.013433 -0.006266 -0.021074 -0.004305
  2004-12-28  0.016149  0.003783  0.003399  0.007154
  2004-12-29  0.004051 -0.001675  0.001581 -0.000074
  2004-12-30  0.005587 -0.005034 -0.002029  0.000082
  2004-12-31 -0.006173 -0.001686  0.004519 -0.001343
  
  [252 rows x 4 columns]),
 (2005,                 AAPL      MSFT       XOM       SPX
  2005-01-03 -0.017081  0.000845 -0.022717 -0.008119
  2005-01-04  0.010111  0.003797 -0.006904 -0.011671
  2005-01-05  0.008758 -0.002522 -0.005098 -0.003628
  2005-01-06  0.000930 -0.000843  0.012579  0.003506
  2005-01-07  0.072491 -0.002952 -0.006441 -0.001431
  2005-01-10 -0.004044  0.004653  0.003705  0.003423
  2005-01-11 -0.063805 -0.002526 -0.002537 -0.006100
  2005-01-12  0.013941  0.001688  0.014801  0.003981
  2005-01-13  0.066300 -0.018963 -0.000684 -0.008630
  2005-01-14  0.005731 -0.005584  0.010262  0.006005
  2005-01-18  0.006553  0.007775  0.008352  0.009675
  2005-01-19 -0.011039 -0.013288 -0.008731 -0.009490
  2005-01-20  0.008300 -0.004344 -0.006549 -0.007783
  2005-01-21  0.000284 -0.008290 -0.005456 -0.006415
  2005-01-24  0.003973  0.000880  0.013714 -0.003528
  2005-01-25  0.018372  0.013626  0.003833  0.004004
  2005-01-26  0.002498 -0.000434  0.007188  0.004844
  2005-01-27  0.005537  0.003905  0.001115  0.000409
  2005-01-28  0.018447  0.002593 -0.009356 -0.002716
  2005-01-31  0.039470  0.003879  0.006521  0.008460
  2005-02-01  0.008062  0.004294  0.032395  0.006891
  2005-02-02  0.027090  0.002565  0.012335  0.003178
  2005-02-03 -0.022859 -0.010661  0.010902 -0.002766
  2005-02-04  0.013368  0.005603  0.014168  0.011043
  2005-02-07  0.001268 -0.006429  0.001877 -0.001089
  2005-02-08  0.024829  0.003020  0.011863  0.000483
  2005-02-09 -0.026700 -0.006452 -0.004319 -0.008575
  2005-02-10 -0.004826 -0.000433  0.014460  0.004211
  2005-02-11  0.036243 -0.003032 -0.004072  0.006926
  2005-02-14  0.042118  0.001303  0.012676  0.000697
  ...              ...       ...       ...       ...
  2005-11-17 -0.006620  0.008441  0.003360  0.009414
  2005-11-18  0.000620  0.003587  0.015166  0.004401
  2005-11-21  0.006196  0.003177  0.019208  0.005271
  2005-11-22  0.024015 -0.008709  0.004950  0.005084
  2005-11-23  0.008870  0.000399  0.003599  0.003473
  2005-11-25  0.033229 -0.005988  0.003964  0.002086
  2005-11-28  0.004615 -0.000402 -0.022749 -0.008508
  2005-11-29 -0.022394 -0.002411 -0.006926  0.000016
  2005-11-30 -0.004112  0.000000 -0.005231 -0.006362
  2005-12-01  0.055736  0.007652  0.022785  0.012157
  2005-12-02  0.014385  0.004396 -0.004760  0.000324
  2005-12-05 -0.011152 -0.005969  0.007461 -0.002363
  2005-12-06  0.031050 -0.005604  0.003038  0.001276
  2005-12-07 -0.001350  0.002013 -0.011359 -0.005009
  2005-12-08  0.001758 -0.002009  0.006894 -0.001217
  2005-12-09  0.003375  0.000805 -0.015595  0.002811
  2005-12-12  0.007803 -0.009654  0.006182  0.000842
  2005-12-13  0.000934 -0.011373  0.003456  0.005554
  2005-12-14 -0.039611 -0.001643  0.013586  0.004190
  2005-12-15  0.002361 -0.006173 -0.006230 -0.001414
  2005-12-16 -0.014824 -0.000828 -0.024126 -0.002848
  2005-12-19  0.003797 -0.002487 -0.006035 -0.005839
  2005-12-20  0.010227  0.000831  0.003917 -0.000238
  2005-12-21  0.019276 -0.004566 -0.005657  0.002517
  2005-12-22  0.007075 -0.005421 -0.008633  0.004221
  2005-12-23 -0.009052  0.002096  0.000000  0.000426
  2005-12-27  0.011997 -0.006695 -0.021571 -0.009553
  2005-12-28 -0.008891 -0.002949  0.006675  0.001297
  2005-12-29 -0.028816 -0.004225 -0.000201 -0.002981
  2005-12-30  0.006158 -0.004667 -0.001206 -0.004887
  
  [252 rows x 4 columns]),
 (2006,                 AAPL      MSFT       XOM       SPX
  2006-01-03  0.039783  0.026428  0.041046  0.016430
  2006-01-04  0.002943  0.004568  0.001739  0.003673
  2006-01-05 -0.007870  0.000827 -0.005016  0.000016
  2006-01-06  0.025813 -0.002891  0.019779  0.009399
  2006-01-09 -0.003277 -0.002071 -0.000570  0.003656
  2006-01-10  0.063248  0.005396  0.007801 -0.000357
  2006-01-11  0.037596  0.010735  0.006796  0.003481
  2006-01-12  0.004648 -0.005310 -0.010501 -0.006274
  2006-01-13  0.015423  0.001643  0.022361  0.001205
  2006-01-17 -0.010282 -0.007380  0.009268 -0.003557
  2006-01-18 -0.026207 -0.005783 -0.013958 -0.003975
  2006-01-19 -0.041823  0.007063  0.013597  0.005564
  2006-01-20 -0.037323 -0.022690 -0.015803 -0.018326
  2006-01-23  0.020765 -0.002111  0.011016  0.001847
  2006-01-24 -0.020986 -0.002961 -0.003878  0.002405
  2006-01-25 -0.024198  0.004667 -0.012236 -0.001721
  2006-01-26 -0.025202  0.003801 -0.004317  0.007235
  2006-01-27 -0.004148  0.048801  0.022243  0.007764
  2006-01-30  0.041233  0.007621  0.029688  0.001145
  2006-01-31  0.006800  0.005175 -0.005731 -0.003976
  2006-02-01 -0.001192 -0.003960 -0.012608  0.001859
  2006-02-02 -0.044020 -0.012724  0.000000 -0.009061
  2006-02-03 -0.003467 -0.005236 -0.009121 -0.005359
  2006-02-06 -0.063326 -0.013360  0.009389  0.000783
  2006-02-07  0.004458 -0.008207 -0.022798 -0.008095
  2006-02-08  0.017899 -0.001241  0.002613  0.008663
  2006-02-09 -0.056096 -0.009114 -0.007818 -0.001478
  2006-02-10  0.036336  0.000836 -0.008068  0.002540
  2006-02-13 -0.038627 -0.011278  0.002837 -0.003260
  2006-02-14  0.045279  0.010139 -0.000943  0.010033
  ...              ...       ...       ...       ...
  2006-11-16  0.018560  0.012075 -0.028457  0.002284
  2006-11-17  0.002803 -0.002610  0.005645  0.001029
  2006-11-20  0.007222  0.016822 -0.008799 -0.000500
  2006-11-21  0.024633  0.001103  0.013162  0.001649
  2006-11-22  0.019300  0.000000 -0.006495  0.002338
  2006-11-24  0.014616 -0.005509 -0.007298 -0.003656
  2006-11-27 -0.022809 -0.009232  0.001225 -0.013555
  2006-11-28  0.025352 -0.002982  0.023405  0.003444
  2006-11-29 -0.000109  0.005981  0.025112  0.009202
  2006-11-30 -0.001525 -0.007061  0.010353  0.000822
  2006-12-01 -0.003709 -0.008234  0.005051 -0.002799
  2006-12-04 -0.002190  0.007170  0.007323  0.008885
  2006-12-05  0.001646 -0.006744  0.003849  0.004002
  2006-12-06 -0.015777 -0.004904 -0.022437 -0.001315
  2006-12-07 -0.031059 -0.004928 -0.007844 -0.003971
  2006-12-08  0.014017  0.019048 -0.002782  0.001812
  2006-12-11  0.005552  0.004860 -0.001909  0.002270
  2006-12-12 -0.029408 -0.003720  0.011768 -0.001047
  2006-12-13  0.033782  0.004108  0.014539  0.001169
  2006-12-14 -0.005615  0.017479  0.017770  0.008689
  2006-12-15 -0.009373  0.004020 -0.018164  0.001122
  2006-12-18 -0.025650 -0.009829 -0.023089 -0.003230
  2006-12-19  0.009828  0.003309  0.019524  0.002158
  2006-12-20 -0.017959  0.003298 -0.011807 -0.001417
  2006-12-21 -0.021944 -0.003652 -0.002768 -0.003674
  2006-12-22 -0.008444 -0.011364 -0.005991 -0.005316
  2006-12-26 -0.008394  0.011865  0.008379  0.004352
  2006-12-27  0.000123  0.001099  0.011953  0.007015
  2006-12-28 -0.007974 -0.001464  0.004898 -0.001479
  2006-12-29  0.049091 -0.004032 -0.009031 -0.004513
  
  [251 rows x 4 columns]),
 (2007,                 AAPL      MSFT       XOM       SPX
  2007-01-03 -0.012258  0.000000 -0.032981 -0.001199
  2007-01-04  0.022196 -0.001472 -0.018699  0.001228
  2007-01-05 -0.007121 -0.005898  0.007165 -0.006085
  2007-01-08  0.004938  0.010011 -0.008022  0.002220
  2007-01-09  0.083070  0.000734 -0.007782 -0.000517
  2007-01-10  0.047856 -0.009905 -0.015224  0.001940
  2007-01-11 -0.012371  0.035198 -0.000156  0.006340
  2007-01-12 -0.012317  0.016464  0.023583  0.004853
  2007-01-16  0.026210 -0.001408 -0.014037  0.000818
  2007-01-17 -0.022142 -0.002116  0.011452 -0.000894
  2007-01-18 -0.061927 -0.003180 -0.006885 -0.002971
  2007-01-19 -0.006399  0.003545  0.021876  0.002895
  2007-01-22 -0.019322 -0.012363 -0.008593 -0.005278
  2007-01-23 -0.012559  0.000358  0.021898  0.003542
  2007-01-24  0.011669  0.011441  0.005506  0.008501
  2007-01-25 -0.005190 -0.020502 -0.018499 -0.011270
  2007-01-26 -0.010087  0.005052  0.001206 -0.001208
  2007-01-29  0.006559 -0.002513 -0.005572 -0.001097
  2007-01-30 -0.004538 -0.001440  0.016356  0.005772
  2007-01-31  0.002104  0.012257 -0.004023  0.006593
  2007-02-01 -0.011548 -0.009615  0.013315  0.005354
  2007-02-02  0.000118 -0.012226  0.006053  0.001694
  2007-02-05 -0.009558 -0.018930  0.001761 -0.000967
  2007-02-06  0.002502 -0.003711 -0.002783  0.000698
  2007-02-07  0.023767 -0.004469 -0.004701  0.001395
  2007-02-08  0.000348 -0.003741  0.009004 -0.001179
  2007-02-09 -0.033767 -0.009763 -0.003218 -0.007077
  2007-02-12  0.019335 -0.001138 -0.008218 -0.003261
  2007-02-13 -0.002121  0.005695  0.011394  0.007597
  2007-02-14  0.007084  0.013590  0.002048  0.007644
  ...              ...       ...       ...       ...
  2007-11-16  0.012721  0.009634  0.007228  0.005230
  2007-11-19 -0.014664 -0.003817 -0.011661 -0.017460
  2007-11-20  0.029887  0.018519  0.044081  0.004486
  2007-11-21 -0.002310 -0.010345 -0.008817 -0.015927
  2007-11-23  0.018283 -0.003484  0.014282  0.016891
  2007-11-26  0.005830 -0.033376 -0.029521 -0.023239
  2007-11-27  0.013156  0.002631  0.008146  0.014930
  2007-11-28  0.030948  0.019351  0.017801  0.028560
  2007-11-29  0.022584 -0.003218  0.007690  0.000477
  2007-11-30 -0.011232  0.000323  0.006401  0.007770
  2007-12-03 -0.018439 -0.020329 -0.003425 -0.005887
  2007-12-04  0.005311 -0.004282 -0.008223 -0.006540
  2007-12-05  0.031645  0.042011  0.020418  0.015190
  2007-12-06  0.023989  0.011746  0.016857  0.015037
  2007-12-07  0.022901 -0.000628  0.000716 -0.001778
  2007-12-10 -0.000463  0.006593  0.005720  0.007510
  2007-12-11 -0.029195 -0.019027 -0.018960 -0.025271
  2007-12-12  0.012305  0.010811  0.018118  0.006050
  2007-12-13  0.005082  0.022020  0.008779  0.001224
  2007-12-14 -0.007507  0.002462 -0.016700 -0.013746
  2007-12-17 -0.031462 -0.026098 -0.014113 -0.015021
  2007-12-18 -0.007701  0.010088  0.016984  0.006280
  2007-12-19  0.000765  0.001561  0.000119 -0.001361
  2007-12-20  0.022335  0.020879  0.007634  0.004900
  2007-12-21  0.035789  0.015263  0.014205  0.016670
  2007-12-24  0.025218  0.014432  0.002451  0.008077
  2007-12-26  0.000755  0.000889  0.012225  0.000809
  2007-12-27 -0.001910 -0.017471 -0.011962 -0.014282
  2007-12-28  0.006345  0.004219  0.014203  0.001504
  2007-12-31 -0.008757 -0.014406 -0.013889 -0.006852
  
  [251 rows x 4 columns]),
 (2008,                 AAPL      MSFT       XOM       SPX
  2008-01-02 -0.016357 -0.010658 -0.001862 -0.014438
  2008-01-03  0.000462  0.004001  0.003382  0.000000
  2008-01-04 -0.076335 -0.027897 -0.018596 -0.024552
  2008-01-07 -0.013385  0.006623 -0.009356  0.003223
  2008-01-08 -0.035972 -0.033521 -0.012791 -0.018352
  2008-01-09  0.047591  0.029822  0.016711  0.013624
  2008-01-10 -0.007692 -0.003148  0.001072  0.007948
  2008-01-11 -0.029940 -0.012314 -0.014753 -0.013595
  2008-01-14  0.035266  0.014066  0.005796  0.010871
  2008-01-15 -0.054480 -0.011349 -0.019930 -0.024925
  2008-01-16 -0.055608 -0.022640 -0.027931 -0.005612
  2008-01-17  0.007830 -0.003589 -0.030246 -0.029093
  2008-01-18  0.002921 -0.002947  0.013905 -0.006045
  2008-01-22 -0.035449 -0.031856 -0.030890 -0.011085
  2008-01-23 -0.106464 -0.001018  0.012035  0.021442
  2008-01-24 -0.024951  0.041426  0.030580  0.010063
  2008-01-25 -0.041224 -0.009455 -0.023967 -0.015872
  2008-01-28  0.000000 -0.006583  0.014291  0.017548
  2008-01-29  0.011768 -0.003645 -0.000640  0.006160
  2008-01-30  0.004865 -0.012305  0.002307 -0.004764
  2008-01-31  0.024058  0.012458  0.004987  0.016772
  2008-02-01 -0.011894 -0.065846  0.002927  0.012237
  2008-02-04 -0.015701 -0.008544 -0.005963 -0.010463
  2008-02-05 -0.017395 -0.037343 -0.038928 -0.031995
  2008-02-06 -0.056895 -0.018650 -0.008234 -0.007624
  2008-02-07 -0.006230 -0.014063  0.009909  0.007886
  2008-02-08  0.034972  0.015420 -0.002254 -0.004204
  2008-02-11  0.031639 -0.012149  0.018472  0.005889
  2008-02-12 -0.035458  0.004612  0.013961  0.007266
  2008-02-13  0.036361  0.021806  0.013126  0.013604
  ...              ...       ...       ...       ...
  2008-11-18  0.020082  0.022185  0.040222  0.009838
  2008-11-19 -0.040262 -0.067824 -0.038106 -0.061156
  2008-11-20 -0.067215 -0.041327 -0.066851 -0.067123
  2008-11-21  0.025966  0.122647  0.106446  0.063248
  2008-11-24  0.125575  0.051379  0.039498  0.064723
  2008-11-25 -0.023131 -0.033951 -0.008685  0.006551
  2008-11-26  0.046256  0.025027  0.035455  0.035328
  2008-11-28 -0.024526 -0.013506 -0.009122  0.009643
  2008-12-01 -0.040358 -0.079516 -0.072849 -0.089295
  2008-12-02  0.039807  0.029176  0.044467  0.039941
  2008-12-03  0.037093  0.037799  0.016947  0.025836
  2008-12-04 -0.046820 -0.038565 -0.033735 -0.029308
  2008-12-05  0.028334  0.040111  0.004347  0.036499
  2008-12-08  0.060851  0.057311  0.039229  0.038387
  2008-12-09  0.003410 -0.019757 -0.017598 -0.023118
  2008-12-10 -0.018489  0.000517  0.023930  0.011894
  2008-12-11 -0.032685 -0.056302 -0.000668 -0.028524
  2008-12-12  0.034421 -0.004379  0.005345  0.007028
  2008-12-15 -0.035820 -0.016493 -0.006115 -0.012686
  2008-12-16  0.007177  0.055897  0.039856  0.051360
  2008-12-17 -0.065703 -0.022234 -0.025080 -0.009593
  2008-12-18  0.003028 -0.018408 -0.050000 -0.021163
  2008-12-19  0.006374 -0.009377 -0.025691  0.002937
  2008-12-22 -0.047333  0.003341 -0.001568 -0.018302
  2008-12-23  0.007464  0.004994  0.002570 -0.009717
  2008-12-24 -0.015513 -0.005522  0.009113  0.005781
  2008-12-26  0.009055 -0.002221  0.018485  0.005356
  2008-12-29  0.009323 -0.008904  0.010806 -0.003873
  2008-12-30 -0.003695  0.020213  0.007264  0.024407
  2008-12-31 -0.010893  0.004953  0.015784  0.014158
  
  [253 rows x 4 columns]),
 (2009,                 AAPL      MSFT       XOM       SPX
  2009-01-02  0.063269  0.046002  0.022773  0.031608
  2009-01-05  0.042204  0.009424 -0.000131 -0.004668
  2009-01-06 -0.016494  0.011411 -0.016374  0.007817
  2009-01-07 -0.021608 -0.060000 -0.025436 -0.030010
  2009-01-08  0.018569  0.031097  0.010659  0.003397
  2009-01-09 -0.022869 -0.029630 -0.019200 -0.021303
  2009-01-12 -0.021197 -0.002726 -0.013234 -0.022564
  2009-01-13 -0.010715  0.018043  0.018022  0.001758
  2009-01-14 -0.027135 -0.037057 -0.036229 -0.033460
  2009-01-15 -0.022852  0.007808  0.020789  0.001329
  2009-01-16 -0.012593  0.024903  0.018831  0.007562
  2009-01-20 -0.050164 -0.062635 -0.023275 -0.052816
  2009-01-21  0.059207  0.048963  0.038968  0.043491
  2009-01-22  0.066763 -0.117518 -0.012952 -0.015162
  2009-01-23  0.000000  0.005600 -0.002460  0.005378
  2009-01-26  0.014486  0.024752  0.007125  0.005553
  2009-01-27  0.012160  0.001812  0.004082  0.010926
  2009-01-28  0.038245  0.021700  0.004201  0.033558
  2009-01-29 -0.012739 -0.025369 -0.028336 -0.033120
  2009-01-30 -0.030860 -0.027845 -0.006805 -0.022789
  2009-02-02  0.015311  0.042964  0.002796 -0.000533
  2009-02-03  0.016064  0.037612  0.018544  0.015834
  2009-02-04  0.006130  0.006904 -0.005886 -0.007489
  2009-02-05  0.031106  0.022286  0.027403  0.016366
  2009-02-06  0.033796  0.032420  0.012063  0.026896
  2009-02-09  0.027978 -0.011370 -0.010727  0.001485
  2009-02-10 -0.045654 -0.032859 -0.042035 -0.049121
  2009-02-11 -0.010324  0.022084 -0.020402  0.007955
  2009-02-12  0.025305  0.002216  0.008559  0.001739
  2009-02-13 -0.001108 -0.008845 -0.008487 -0.009998
  ...              ...       ...       ...       ...
  2009-11-18 -0.005024  0.003814  0.003204 -0.000468
  2009-11-19 -0.026461 -0.011054 -0.008331 -0.013426
  2009-11-20 -0.002942 -0.005239 -0.003500 -0.003215
  2009-11-23  0.029812  0.010885  0.017704  0.013616
  2009-11-24 -0.006994 -0.001042  0.003590 -0.000533
  2009-11-25 -0.001223 -0.004172  0.006603  0.004504
  2009-11-27 -0.017631 -0.019204 -0.021047 -0.017233
  2009-11-30 -0.003390  0.006764  0.002792  0.003793
  2009-12-01 -0.014707  0.020156  0.012808  0.012075
  2009-12-02 -0.003757 -0.007626 -0.003162  0.000343
  2009-12-03  0.001274  0.001746 -0.010618 -0.008402
  2009-12-04 -0.016083  0.004881 -0.009895  0.005509
  2009-12-07 -0.022605 -0.006246 -0.006475 -0.002468
  2009-12-08  0.004869 -0.007332 -0.011051 -0.010252
  2009-12-09  0.041765  0.004573 -0.002292  0.003672
  2009-12-10 -0.006926  0.005602 -0.005313  0.005840
  2009-12-11 -0.008960 -0.000696  0.005919  0.003683
  2009-12-14  0.011866  0.008711 -0.043054  0.006959
  2009-12-15 -0.014265 -0.003109 -0.007499 -0.005547
  2009-12-16  0.004429  0.002772 -0.010728  0.001128
  2009-12-17 -0.016254 -0.016586 -0.003055 -0.011811
  2009-12-18  0.018607  0.025650 -0.000153  0.005830
  2009-12-21  0.014327  0.005139  0.004444  0.010504
  2009-12-22  0.010745  0.009884  0.000915  0.003564
  2009-12-23  0.008684  0.003375 -0.004572  0.002299
  2009-12-24  0.034339  0.002355  0.005818  0.005256
  2009-12-28  0.012294  0.005705  0.006241  0.001154
  2009-12-29 -0.011861  0.007007 -0.003480 -0.001401
  2009-12-30  0.012147 -0.013585 -0.001063  0.000195
  2009-12-31 -0.004300 -0.015788 -0.008511 -0.010050
  
  [252 rows x 4 columns]),
 (2010,                 AAPL      MSFT       XOM       SPX
  2010-01-04  0.015565  0.015700  0.014102  0.016043
  2010-01-05  0.001729  0.000336  0.003930  0.003116
  2010-01-06 -0.015906 -0.006382  0.008582  0.000546
  2010-01-07 -0.001849 -0.010142 -0.003135  0.004001
  2010-01-08  0.006648  0.006831 -0.003893  0.002882
  2010-01-11 -0.008822 -0.012890  0.011124  0.001747
  2010-01-12 -0.011375 -0.006529 -0.004906 -0.009381
  2010-01-13  0.014106  0.009339 -0.004034  0.008326
  2010-01-14 -0.005792  0.020219  0.000150  0.002427
  2010-01-15 -0.016712 -0.003359 -0.008250 -0.010823
  2010-01-19  0.044238  0.007752  0.002420  0.012500
  2010-01-20 -0.015392 -0.016388 -0.017954 -0.010598
  2010-01-21 -0.017286 -0.019041 -0.019511 -0.018945
  2010-01-22 -0.049599 -0.035009 -0.009088 -0.022141
  2010-01-25  0.026903  0.012572 -0.003795  0.004598
  2010-01-26  0.014133  0.006031  0.001111 -0.004203
  2010-01-27  0.009420  0.005994 -0.005708  0.004880
  2010-01-28 -0.041322 -0.017175 -0.008930 -0.011818
  2010-01-29 -0.036279 -0.033880 -0.008045 -0.009829
  2010-02-01  0.013902  0.008121  0.027088  0.014266
  2010-02-02  0.005803  0.001831  0.011845  0.012973
  2010-02-03  0.017206  0.006213 -0.005463 -0.005474
  2010-02-04 -0.036039 -0.027606 -0.028249 -0.031141
  2010-02-05  0.017756  0.006350  0.001292  0.002897
  2010-02-08 -0.006856 -0.010765 -0.000484 -0.008863
  2010-02-09  0.010664  0.010507  0.013232  0.013040
  2010-02-10 -0.005454 -0.000743 -0.005415 -0.002233
  2010-02-11  0.018194  0.004831  0.006085  0.009680
  2010-02-12  0.008607 -0.007027 -0.006844 -0.002745
  2010-02-16  0.015071  0.019739  0.022917  0.018001
  ...              ...       ...       ...       ...
  2010-11-18  0.026389  0.010359  0.018876  0.015357
  2010-11-19 -0.005512 -0.005915  0.003329  0.002540
  2010-11-22  0.021615  0.001587 -0.005049 -0.001575
  2010-11-23 -0.014775 -0.023762 -0.017254 -0.014284
  2010-11-24  0.019661  0.010142  0.013131  0.014923
  2010-11-26  0.000635 -0.004819 -0.009320 -0.007469
  2010-11-29  0.005937  0.002421  0.003234 -0.001379
  2010-11-30 -0.018052 -0.002013  0.001465 -0.006070
  2010-12-01  0.016873  0.031061  0.025457  0.021617
  2010-12-02  0.005531  0.032473  0.002140  0.012818
  2010-12-03 -0.002232  0.004926 -0.003986  0.002603
  2010-12-06  0.008537 -0.006787  0.001572 -0.001298
  2010-12-07 -0.006060  0.001139  0.002141  0.000515
  2010-12-08  0.008799  0.013652  0.005412  0.003702
  2010-12-09 -0.003894 -0.005612  0.002125  0.003843
  2010-12-10  0.002502  0.009406  0.002544  0.006002
  2010-12-13  0.003463 -0.003354  0.002538  0.000048
  2010-12-14 -0.004290  0.013837 -0.002531  0.000911
  2010-12-15  0.000219  0.008115 -0.004652 -0.005122
  2010-12-16  0.002778  0.005123  0.005241  0.006185
  2010-12-17 -0.001992 -0.003276 -0.000705  0.000837
  2010-12-20  0.004990 -0.003287  0.000705  0.002548
  2010-12-21  0.006176  0.009527  0.006904  0.006030
  2010-12-22  0.002961  0.004356  0.001120  0.003380
  2010-12-23 -0.004798  0.003975  0.005451 -0.001644
  2010-12-27  0.003337 -0.008279 -0.002641  0.000613
  2010-12-28  0.002433 -0.002178  0.005715  0.000771
  2010-12-29 -0.000553 -0.001455 -0.000693  0.001009
  2010-12-30 -0.005011 -0.004372 -0.000139 -0.001508
  2010-12-31 -0.003399  0.002195 -0.003329 -0.000191
  
  [252 rows x 4 columns]),
 (2011,                 AAPL      MSFT       XOM       SPX
  2011-01-03  0.021732  0.002556  0.019624  0.011315
  2011-01-04  0.005219  0.004006  0.004641 -0.001313
  2011-01-05  0.008180 -0.003264 -0.002717  0.005007
  2011-01-06 -0.000808  0.029476  0.006540 -0.002123
  2011-01-07  0.007161 -0.007777  0.005414 -0.001845
  2011-01-10  0.018833 -0.013181 -0.006058 -0.001376
  2011-01-11 -0.002365 -0.003971  0.007450  0.003725
  2011-01-12  0.008137  0.015585  0.011697  0.009008
  2011-01-13  0.003658 -0.012491  0.001728 -0.001711
  2011-01-14  0.008100  0.003975  0.014725  0.007385
  2011-01-18 -0.022469  0.012599  0.011243  0.001376
  2011-01-19 -0.005313 -0.006754 -0.006076 -0.010116
  2011-01-20 -0.018180 -0.004295 -0.006243 -0.001295
  2011-01-21 -0.017915 -0.011503  0.015838  0.002414
  2011-01-24  0.032842  0.012727 -0.004896  0.005836
  2011-01-25  0.011705  0.002513  0.001165  0.000263
  2011-01-26  0.007176  0.011819  0.012416  0.004221
  2011-01-27 -0.001861  0.002832  0.002683  0.002244
  2011-01-28 -0.020716 -0.038475 -0.011084 -0.017852
  2011-01-31  0.009580 -0.000734  0.021386  0.007663
  2011-02-01  0.016828  0.009184  0.039985  0.016694
  2011-02-02 -0.002058 -0.001820 -0.005943 -0.002723
  2011-02-03 -0.002556 -0.010212  0.000366  0.002354
  2011-02-04  0.008910  0.004422 -0.001829  0.002884
  2011-02-07  0.015527  0.015407  0.007698  0.006240
  2011-02-08  0.009435  0.002890 -0.006063  0.004185
  2011-02-09  0.008333 -0.011167 -0.005124 -0.002786
  2011-02-10 -0.010107 -0.016758  0.007848  0.000750
  2011-02-11  0.006515 -0.009263 -0.004623  0.005507
  2011-02-14  0.006529 -0.000374  0.025303  0.002385
  ...              ...       ...       ...       ...
  2011-09-02 -0.018319 -0.015643 -0.018370 -0.025282
  2011-09-06  0.015212 -0.011240 -0.013723 -0.007436
  2011-09-07  0.011034  0.019208  0.035137  0.028646
  2011-09-08  0.000547  0.008462 -0.011270 -0.010612
  2011-09-09 -0.017337 -0.018307 -0.024856 -0.026705
  2011-09-12  0.006517  0.005828  0.011688  0.006966
  2011-09-13  0.012318  0.005794 -0.002645  0.009120
  2011-09-14  0.012168  0.017665  0.013817  0.013480
  2011-09-15  0.009401  0.018491  0.018860  0.017187
  2011-09-16  0.019188  0.004817  0.007296  0.005707
  2011-09-19  0.027790  0.003319 -0.011402 -0.009803
  2011-09-20  0.004421 -0.008453  0.004206 -0.001661
  2011-09-21 -0.003168 -0.036694 -0.027564 -0.029390
  2011-09-22 -0.025040 -0.035783 -0.037932 -0.031883
  2011-09-23  0.006172  0.000000  0.001011  0.006082
  2011-09-26 -0.002795  0.015164  0.034771  0.023336
  2011-09-27 -0.009698  0.009041  0.016592  0.010688
  2011-09-28 -0.005635 -0.003506 -0.011521 -0.020691
  2011-09-29 -0.016221 -0.005082  0.025114  0.008114
  2011-09-30 -0.023683 -0.022004 -0.016919 -0.024974
  2011-10-03 -0.017623 -0.014464 -0.020377 -0.028451
  2011-10-04 -0.005606  0.033021  0.023612  0.022488
  2011-10-05  0.015436  0.021705  0.015378  0.017866
  2011-10-06 -0.002327  0.017381 -0.000811  0.018304
  2011-10-07 -0.020060 -0.003417 -0.004466 -0.008163
  2011-10-10  0.051406  0.026286  0.036977  0.034125
  2011-10-11  0.029526  0.002227 -0.000131  0.000544
  2011-10-12  0.004747 -0.001481  0.011669  0.009795
  2011-10-13  0.015515  0.008160 -0.010238 -0.002974
  2011-10-14  0.033225  0.003311  0.022784  0.017380
  
  [199 rows x 4 columns])]
rets.index[0].year
2003
by_year.apply(spx_corr)
AAPL MSFT XOM SPX
2003 0.541124 0.745174 0.661265 1.0
2004 0.374283 0.588531 0.557742 1.0
2005 0.467540 0.562374 0.631010 1.0
2006 0.428267 0.406126 0.518514 1.0
2007 0.508118 0.658770 0.786264 1.0
2008 0.681434 0.804626 0.828303 1.0
2009 0.707103 0.654902 0.797921 1.0
2010 0.710105 0.730118 0.839057 1.0
2011 0.691931 0.800996 0.859975 1.0
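
These yearly correlations with the index are easy to eyeball in a plot. A minimal sketch, reusing the matplotlib import from the top of the post and the by_year grouping and spx_corr function defined above:

corr_by_year = by_year.apply(spx_corr)
# Plot each stock's annual correlation with the S&P 500
corr_by_year[['AAPL', 'MSFT', 'XOM']].plot(marker='o', title='Annual correlation with SPX')
plt.show()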

We could also compute inter-column correlations. Here we compute the correlation between Apple and Microsoft over the full sample of daily returns:

rets['AAPL'].corr(rets['MSFT'])
0.44469678861732237
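
To get this correlation for each year separately rather than over the whole sample, a lambda could be applied to each yearly group (a sketch, assuming the by_year grouping used above):

# Year-by-year correlation between AAPL's and MSFT's daily returns
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))
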
df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df[0].corr(df[1])
1.0
help(pd.Series.corr)
Help on function corr in module pandas.core.series:

corr(self, other, method='pearson', min_periods=None)
    Compute correlation with `other` Series, excluding missing values
    
    Parameters
    ----------
    other : Series
    method : {'pearson', 'kendall', 'spearman'}
        * pearson : standard correlation coefficient
        * kendall : Kendall Tau correlation coefficient
        * spearman : Spearman rank correlation
    min_periods : int, optional
        Minimum number of observations needed to have a valid result
    
    
    Returns
    -------
    correlation : float
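
As the help shows, corr is not limited to Pearson; a rank-based measure can be requested instead (a small sketch):

# Spearman rank correlation, less sensitive to a few extreme daily returns
rets['AAPL'].corr(rets['MSFT'], method='spearman')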

Example: group-wise linear regression

You can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value. For example, I can define the following regress function (using the statsmodels econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:

import statsmodels.api as sm

def regress(data, yvar, xvars):
    # Run an OLS regression of data[yvar] on data[xvars] for a single group
    Y = data[yvar]
    X = data[xvars].copy()   # copy so the intercept column is not written back to the group
    X['intercept'] = 1       # constant term for the intercept
    result = sm.OLS(Y, X).fit()
    return result.params     # Series of fitted coefficients
by_year.apply(regress,'AAPL',['SPX'])
SPX intercept
2003 1.195406 0.000710
2004 1.363463 0.004201
2005 1.766415 0.003246
2006 1.645496 0.000080
2007 1.198761 0.003438
2008 0.968016 -0.001110
2009 0.879103 0.002954
2010 1.052608 0.001261
2011 0.806605 0.001514
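
Since the applied function only has to return a pandas object or a scalar, it could just as well report the goodness of fit alongside the coefficients. A sketch, where regress_stats is a hypothetical variant of the function above:

def regress_stats(data, yvar, xvars):
    # Same per-group OLS regression, but also return R-squared
    Y = data[yvar]
    X = sm.add_constant(data[xvars])   # statsmodels helper that adds a constant column
    result = sm.OLS(Y, X).fit()
    params = result.params
    params['rsquared'] = result.rsquared
    return params
by_year.apply(regress_stats, 'AAPL', ['SPX'])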

Pivot tables and cross-tabulation

help(pd.DataFrame.pivot_table)
Help on function pivot_table in module pandas.tools.pivot:

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
    Create a spreadsheet-style pivot table as a DataFrame. The levels in the
    pivot table will be stored in MultiIndex objects (hierarchical indexes) on
    the index and columns of the result DataFrame
    
    Parameters
    ----------
    data : DataFrame
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The list
        can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed, it
        is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The list
        can contain any of the other types (except list).
        Keys to group by on the pivot table column.  If an array is passed, it
        is being used as the same manner as column values.
    aggfunc : function or list of functions, default numpy.mean
        If list of functions passed, the resulting pivot table will have
        hierarchical columns whose top level are the function names (inferred
        from the function objects themselves)
    fill_value : scalar, default None
        Value to replace missing values with
    margins : boolean, default False
        Add all row / columns (e.g. for subtotal / grand totals)
    dropna : boolean, default True
        Do not include columns whose entries are all NaN
    margins_name : string, default 'All'
        Name of the row / column that will contain the totals
        when margins is True.
    
    Examples
    --------
    >>> df
       A   B   C      D
    0  foo one small  1
    1  foo one large  2
    2  foo one large  2
    3  foo two small  3
    4  foo two small  3
    5  bar one large  4
    6  bar one small  5
    7  bar two small  6
    8  bar two large  7
    
    >>> table = pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum)
    >>> table
              small  large
    foo  one  1      4
         two  6      NaN
    bar  one  5      4
         two  6      7
    
    Returns
    -------
    table : DataFrame
ex1=pd.read_csv('ex1.csv')
ex1
sex age grade province
0 m 20 89 H
1 f 23 90 B
2 m 21 86 B
3 m 22 56 H
4 f 23 78 J
5 f 22 90 B
6 m 21 100 H
7 m 20 98 J
8 m 25 67 B
9 f 23 89 J
ex1.pivot_table(index='sex',aggfunc=[np.mean,np.sum,np.min,np.max],margins=True)
mean sum amin amax
age grade age grade age grade province age grade province
sex
f 22.75 86.750000 91.0 347.0 22 78 B 23 90 J
m 21.50 82.666667 129.0 496.0 20 56 B 25 100 J
All 22.00 84.300000 220.0 843.0 20 56 B 25 100 J
ex1.pivot_table(index='sex')
age grade
sex
f 22.75 86.750000
m 21.50 82.666667

Note: mean and sum drop the non-numeric column 'province', while min and max keep it (comparing the strings lexicographically), which is why 'province' appears only under amin and amax above.
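
A pivot_table call with only an index is essentially a groupby followed by mean. A minimal sketch of the equivalence on the ex1 frame above:

# Same result as ex1.pivot_table(index='sex'): group by sex and average the numeric columns
ex1.groupby('sex')[['age', 'grade']].mean()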

df=pd.DataFrame(np.arange(4).reshape((2,2)),index=['a','b'],columns=['one','two']);df
one two
a 0 1
b 2 3
df.pivot_table(index='one',aggfunc=np.sum,margins=True)
two
one
0 1.0
2 3.0
All 4.0
tips.head()
total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808
tips.pivot_table(index=['day','smoker'])
size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588
tips.pivot_table(values=['tip_pct','size'],index=['time','day'],columns='smoker')
tip_pct size
smoker No Yes No Yes
time day
Dinner Fri 0.139622 0.165347 2.000000 2.222222
Sat 0.158048 0.147906 2.555556 2.476190
Sun 0.160113 0.187250 2.929825 2.578947
Thur 0.159744 NaN 2.000000 NaN
Lunch Fri 0.187735 0.188937 3.000000 1.833333
Thur 0.160311 0.163863 2.500000 2.352941
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker')  # same as above: the first positional argument is 'values'
tip_pct size
smoker No Yes No Yes
time day
Dinner Fri 0.139622 0.165347 2.000000 2.222222
Sat 0.158048 0.147906 2.555556 2.476190
Sun 0.160113 0.187250 2.929825 2.578947
Thur 0.159744 NaN 2.000000 NaN
Lunch Fri 0.187735 0.188937 3.000000 1.833333
Thur 0.160311 0.163863 2.500000 2.352941
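
The same table could be built by hand with groupby and unstack, which pivot_table wraps for convenience (a sketch):

# Equivalent groupby/unstack formulation of the pivot table above
tips.groupby(['time', 'day', 'smoker'])[['tip_pct', 'size']].mean().unstack('smoker')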

We could augment this table to include partial totals by passing margins=True. This has the effect of adding All row and column labels, with the corresponding values being the group statistics for all of the data within a single tier:

tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
tip_pct size
smoker No Yes All No Yes All
time day
Dinner Fri 0.139622 0.165347 0.158916 2.000000 2.222222 2.166667
Sat 0.158048 0.147906 0.153152 2.555556 2.476190 2.517241
Sun 0.160113 0.187250 0.166897 2.929825 2.578947 2.842105
Thur 0.159744 NaN 0.159744 2.000000 NaN 2.000000
Lunch Fri 0.187735 0.188937 0.188765 3.000000 1.833333 2.000000
Thur 0.160311 0.163863 0.161301 2.500000 2.352941 2.459016
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672

Here, the All values are means computed without regard to smoker versus non-smoker (the All columns) or to either of the two levels of grouping on the rows (the All row).

If some combinations are empty, you may wish to pass a fill_value:

tips.pivot_table('tip_pct',index=['time','size','smoker'],columns='day',aggfunc='mean',fill_value=0)
day Fri Sat Sun Thur
time size smoker
Dinner 1 No 0.000000 0.137931 0.000000 0.000000
Yes 0.000000 0.325733 0.000000 0.000000
2 No 0.139622 0.162705 0.168859 0.159744
Yes 0.171297 0.148668 0.207893 0.000000
3 No 0.000000 0.154661 0.152663 0.000000
Yes 0.000000 0.144995 0.152660 0.000000
4 No 0.000000 0.150096 0.148143 0.000000
Yes 0.117750 0.124515 0.193370 0.000000
5 No 0.000000 0.000000 0.206928 0.000000
Yes 0.000000 0.106572 0.065660 0.000000
6 No 0.000000 0.000000 0.103799 0.000000
Lunch 1 No 0.000000 0.000000 0.000000 0.181728
Yes 0.223776 0.000000 0.000000 0.000000
2 No 0.000000 0.000000 0.000000 0.166005
Yes 0.181969 0.000000 0.000000 0.158843
3 No 0.187735 0.000000 0.000000 0.084246
Yes 0.000000 0.000000 0.000000 0.204952
4 No 0.000000 0.000000 0.000000 0.138919
Yes 0.000000 0.000000 0.000000 0.155410
5 No 0.000000 0.000000 0.000000 0.121389
6 No 0.000000 0.000000 0.000000 0.173706

Cross-tabulation: crosstab

A cross-tabulation is a special case of a pivot table that computes group frequencies.

data=pd.read_csv('ex2.csv')
data
Sample Nationality Handeness
0 1 USA R
1 2 JAPAN L
2 3 USA R
3 4 JAPAN R
4 5 JAPAN L
5 6 JAPAN R
6 7 USA R
7 8 USA L
8 9 JAPAN R
9 10 USA R
pd.crosstab(data.Nationality,data.Handeness,margins=True)
Handeness L R All
Nationality
JAPAN 2 3 5
USA 1 4 5
All 3 7 10
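
crosstab here is just shorthand for counting rows per group; the same table (minus the margins) could be built directly with groupby (a sketch):

# Count rows per (Nationality, Handeness) pair, then pivot Handeness into columns
data.groupby(['Nationality', 'Handeness']).size().unstack(fill_value=0)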
pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
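
The normalize argument documented in the help below turns these counts into proportions, for instance the share of smokers within each (time, day) group (a sketch; normalize requires pandas 0.18.1 or later):

# Row-normalized version of the table above: each row sums to 1
pd.crosstab([tips.time, tips.day], tips.smoker, normalize='index')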
help(pd.crosstab)
Help on function crosstab in module pandas.tools.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
    Compute a simple cross-tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    aggfunc : function, optional
        If specified, requires `values` be specified as well
    rownames : sequence, default None
        If passed, must match number of row arrays passed
    colnames : sequence, default None
        If passed, must match number of column arrays passed
    margins : boolean, default False
        Add row/column margins (subtotals)
    dropna : boolean, default True
        Do not include columns whose entries are all NaN
    normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
        .. versionadded:: 0.18.1
    
    
    Notes
    -----
    Any Series passed will have their name attributes used unless row or column
    names for the cross-tabulation are specified.
    
    Any input passed containing Categorical data will have **all** of its
    categories included in the cross-tabulation, even if the actual data does
    not contain any instances of a particular category.
    
    In the event that there aren't overlapping indexes an empty DataFrame will
    be returned.
    
    Examples
    --------
    >>> a
    array([foo, foo, foo, foo, bar, bar,
           bar, bar, foo, foo, foo], dtype=object)
    >>> b
    array([one, one, one, two, one, one,
           one, two, two, two, one], dtype=object)
    >>> c
    array([dull, dull, shiny, dull, dull, shiny,
           shiny, dull, shiny, shiny, shiny], dtype=object)
    
    >>> crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b    one          two
    c    dull  shiny  dull  shiny
    a
    bar  1     2      1     0
    foo  2     2      1     2
    
    >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
    >>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
                            # but they still will be counted in the output
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0
    
    Returns
    -------
    crosstab : DataFrame
