Data aggregation and group operations in pandas
After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible groupby
interface, enabling you to slice, dice, and summarize datasets in a natural way.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
GroupBy mechanics
Group operations follow the split-apply-combine paradigm:
In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
Once the first step is done, a function is applied to each group, producing a new value.
Finally, the results of all those function applications are combined into a result object.
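To make the three stages concrete, here is a minimal sketch (using the imports above, with made-up data) that performs the split, apply, and combine steps by hand; the final comment notes the groupby call it is equivalent to.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['x', 'y', 'x', 'y'])
# split: one chunk of the data per distinct key
chunks = {key: s[s.index == key] for key in s.index.unique()}
# apply: reduce each chunk to a single value
applied = {key: chunk.mean() for key, chunk in chunks.items()}
# combine: glue the per-group results into a new result object
pd.Series(applied)  # same as s.groupby(level=0).mean()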
help(pd.Series.groupby)
Help on function groupby in module pandas.core.generic:
groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Group series using mapper (dict or key function, apply given function
to group, return result as series) or by a series of columns.
Parameters
----------
by : mapping function / list of functions, dict, Series, or tuple /
list of column names.
Called on each element of the object index to determine the groups.
If a dict or Series is passed, the Series or dict VALUES will be
used to determine the groups
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular
level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the
index. Only relevant for DataFrame input. as_index=False is
effectively "SQL-style" grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off.
Note this does not influence the order of observations within each
group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
Examples
--------
DataFrame results
>>> data.groupby(func, axis=0).mean()
>>> data.groupby(['col1', 'col2'])['col3'].mean()
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
Returns
-------
GroupBy object
df=pd.DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)});df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
- The first method: Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:
grouped=df['data1'].groupby(df['key1'])
help(grouped)
Help on SeriesGroupBy in module pandas.core.groupby object:
class SeriesGroupBy(GroupBy)
| Class for grouping and aggregating relational data. See aggregate,
| transform, and apply functions on this object.
|
| It's easiest to use obj.groupby(...) to use GroupBy, but you can also do:
|
| ::
|
| grouped = groupby(obj, ...)
|
| Parameters
| ----------
| obj : pandas object
| axis : int, default 0
| level : int, default None
| Level of MultiIndex
| groupings : list of Grouping objects
| Most users should ignore this
| exclusions : array-like, optional
| List of columns to exclude
| name : string
| Most users should ignore this
|
| Notes
| -----
| After grouping, see aggregate, apply, and transform functions. Here are
| some other brief notes about usage. When grouping by multiple groups, the
| result index will be a MultiIndex (hierarchical) by default.
|
| Iteration produces (key, group) tuples, i.e. chunking the data by group. So
| you can write code like:
|
| ::
|
| grouped = obj.groupby(keys, axis=axis)
| for key, group in grouped:
| # do something with the data
|
| Function calls on GroupBy, if not specially implemented, "dispatch" to the
| grouped data. So if you group a DataFrame and wish to invoke the std()
| method on each group, you can simply do:
|
| ::
|
| df.groupby(mapper).std()
|
| rather than
|
| ::
|
| df.groupby(mapper).aggregate(np.std)
|
| You can pass arguments to these "wrapped" functions, too.
|
| See the online documentation for full exposition on these topics and much
| more
|
| Returns
| -------
| **Attributes**
| groups : dict
| {group name -> group labels}
| len(grouped) : int
| Number of groups
|
| Method resolution order:
| SeriesGroupBy
| GroupBy
| _GroupBy
| pandas.core.base.PandasObject
| pandas.core.base.StringMixin
| pandas.core.base.SelectionMixin
| builtins.object
|
| Methods defined here:
|
| agg = aggregate(self, func_or_funcs, *args, **kwargs)
|
| aggregate(self, func_or_funcs, *args, **kwargs)
| Apply aggregation function or functions to groups, yielding most likely
| Series but in some cases DataFrame depending on the output of the
| aggregation function
|
| Parameters
| ----------
| func_or_funcs : function or list / dict of functions
| List/dict of functions will produce DataFrame with column names
| determined by the function names themselves (list) or the keys in
| the dict
|
| Notes
| -----
| agg is an alias for aggregate. Use it.
|
| Examples
| --------
| >>> series
| bar 1.0
| baz 2.0
| qot 3.0
| qux 4.0
|
| >>> mapper = lambda x: x[0] # first letter
| >>> grouped = series.groupby(mapper)
|
| >>> grouped.aggregate(np.sum)
| b 3.0
| q 7.0
|
| >>> grouped.aggregate([np.sum, np.mean, np.std])
| mean std sum
| b 1.5 0.5 3
| q 3.5 0.5 7
|
| >>> grouped.agg({'result' : lambda x: x.mean() / x.std(),
| ... 'total' : np.sum})
| result total
| b 2.121 3
| q 4.95 7
|
| See also
| --------
| apply, transform
|
| Returns
| -------
| Series or DataFrame
|
| count(self)
| Compute count of group, excluding missing values
|
| filter(self, func, dropna=True, *args, **kwargs)
| Return a copy of a Series excluding elements from groups that
| do not satisfy the boolean criterion specified by func.
|
| Parameters
| ----------
| func : function
| To apply to each group. Should return True or False.
| dropna : Drop groups that do not pass the filter. True by default;
| if False, groups that evaluate False are filled with NaNs.
|
| Examples
| --------
| >>> grouped.filter(lambda x: x.mean() > 0)
|
| Returns
| -------
| filtered : Series
|
| nlargest(self, n=5, keep='first')
| Return the largest `n` elements.
|
| Parameters
| ----------
| n : int
| Return this many descending sorted values
| keep : {'first', 'last', False}, default 'first'
| Where there are duplicate values:
| - ``first`` : take the first occurrence.
| - ``last`` : take the last occurrence.
| take_last : deprecated
|
| Returns
| -------
| top_n : Series
| The n largest values in the Series, in sorted order
|
| Notes
| -----
| Faster than ``.sort_values(ascending=False).head(n)`` for small `n`
| relative to the size of the ``Series`` object.
|
| See Also
| --------
| Series.nsmallest
|
| Examples
| --------
| >>> import pandas as pd
| >>> import numpy as np
| >>> s = pd.Series(np.random.randn(1e6))
| >>> s.nlargest(10) # only sorts up to the N requested
|
| nsmallest(self, n=5, keep='first')
| Return the smallest `n` elements.
|
| Parameters
| ----------
| n : int
| Return this many ascending sorted values
| keep : {'first', 'last', False}, default 'first'
| Where there are duplicate values:
| - ``first`` : take the first occurrence.
| - ``last`` : take the last occurrence.
| take_last : deprecated
|
| Returns
| -------
| bottom_n : Series
| The n smallest values in the Series, in sorted order
|
| Notes
| -----
| Faster than ``.sort_values().head(n)`` for small `n` relative to
| the size of the ``Series`` object.
|
| See Also
| --------
| Series.nlargest
|
| Examples
| --------
| >>> import pandas as pd
| >>> import numpy as np
| >>> s = pd.Series(np.random.randn(1e6))
| >>> s.nsmallest(10) # only sorts up to the N requested
|
| nunique(self, dropna=True)
| Returns number of unique elements in the group
|
| transform(self, func, *args, **kwargs)
| Call function producing a like-indexed Series on each group and return
| a Series with the transformed values
|
| Parameters
| ----------
| func : function
| To apply to each group. Should return a Series with the same index
|
| Examples
| --------
| >>> grouped.transform(lambda x: (x - x.mean()) / x.std())
|
| Returns
| -------
| transformed : Series
|
| value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| all
|
| Return whether all elements are True over requested axis
|
| Parameters
| ----------
| axis : {index (0)}
| skipna : boolean, default True
| Exclude NA/null values. If an entire row/column is NA, the result
| will be NA
| level : int or level name, default None
| If the axis is a MultiIndex (hierarchical), count along a
| particular level, collapsing into a scalar
| bool_only : boolean, default None
| Include only boolean columns. If None, will attempt to use everything,
| then use only boolean data. Not implemented for Series.
|
| Returns
| -------
| all : scalar or Series (if level specified)
|
| any
|
| Return whether any element is True over requested axis
|
| Parameters
| ----------
| axis : {index (0)}
| skipna : boolean, default True
| Exclude NA/null values. If an entire row/column is NA, the result
| will be NA
| level : int or level name, default None
| If the axis is a MultiIndex (hierarchical), count along a
| particular level, collapsing into a scalar
| bool_only : boolean, default None
| Include only boolean columns. If None, will attempt to use everything,
| then use only boolean data. Not implemented for Series.
|
| Returns
| -------
| any : scalar or Series (if level specified)
|
| corr
| Compute correlation with `other` Series, excluding missing values
|
| Parameters
| ----------
| other : Series
| method : {'pearson', 'kendall', 'spearman'}
| * pearson : standard correlation coefficient
| * kendall : Kendall Tau correlation coefficient
| * spearman : Spearman rank correlation
| min_periods : int, optional
| Minimum number of observations needed to have a valid result
|
|
| Returns
| -------
| correlation : float
|
| cov
| Compute covariance with Series, excluding missing values
|
| Parameters
| ----------
| other : Series
| min_periods : int, optional
| Minimum number of observations needed to have a valid result
|
| Returns
| -------
| covariance : float
|
| Normalized by N-1 (unbiased estimator).
|
| cummax
| Return cumulative max over requested axis.
|
| Parameters
| ----------
| axis : {index (0)}
| skipna : boolean, default True
| Exclude NA/null values. If an entire row/column is NA, the result
| will be NA
|
| Returns
| -------
| cummax : scalar
|
| cummin
| Return cumulative minimum over requested axis.
|
| Parameters
| ----------
| axis : {index (0)}
| skipna : boolean, default True
| Exclude NA/null values. If an entire row/column is NA, the result
| will be NA
|
| Returns
| -------
| cummin : scalar
|
| describe
| Generate various summary statistics, excluding NaN values.
|
| Parameters
| ----------
| percentiles : array-like, optional
| The percentiles to include in the output. Should all
| be in the interval [0, 1]. By default `percentiles` is
| [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
| include, exclude : list-like, 'all', or None (default)
| Specify the form of the returned result. Either:
|
| - None to both (default). The result will include only
| numeric-typed columns or, if none are, only categorical columns.
| - A list of dtypes or strings to be included/excluded.
| To select all numeric types use numpy numpy.number. To select
| categorical objects use type object. See also the select_dtypes
| documentation. eg. df.describe(include=['O'])
| - If include is the string 'all', the output column-set will
| match the input one.
|
| Returns
| -------
| summary: NDFrame of summary statistics
|
| Notes
| -----
| The output DataFrame index depends on the requested dtypes:
|
| For numeric dtypes, it will include: count, mean, std, min,
| max, and lower, 50, and upper percentiles.
|
| For object dtypes (e.g. timestamps or strings), the index
| will include the count, unique, most common, and frequency of the
| most common. Timestamps also include the first and last items.
|
| For mixed dtypes, the index will be the union of the corresponding
| output types. Non-applicable entries will be filled with NaN.
| Note that mixed-dtype outputs can only be returned from mixed-dtype
| inputs and appropriate use of the include/exclude arguments.
|
| If multiple values have the highest count, then the
| `count` and `most common` pair will be arbitrarily chosen from
| among those with the highest count.
|
| The include, exclude arguments are ignored for Series.
|
| See Also
| --------
| DataFrame.select_dtypes
|
| diff
| 1st discrete difference of object
|
| Parameters
| ----------
| periods : int, default 1
| Periods to shift for forming difference
|
| Returns
| -------
| diffed : Series
|
| dtype
| return the dtype object of the underlying data
|
| fillna
| Fill NA/NaN values using the specified method
|
| Parameters
| ----------
| value : scalar, dict, Series, or DataFrame
| Value to use to fill holes (e.g. 0), alternately a
| dict/Series/DataFrame of values specifying which value to use for
| each index (for a Series) or column (for a DataFrame). (values not
| in the dict/Series/DataFrame will not be filled). This value cannot
| be a list.
| method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
| Method to use for filling holes in reindexed Series
| pad / ffill: propagate last valid observation forward to next valid
| backfill / bfill: use NEXT valid observation to fill gap
| axis : {0, 'index'}
| inplace : boolean, default False
| If True, fill in place. Note: this will modify any
| other views on this object, (e.g. a no-copy slice for a column in a
| DataFrame).
| limit : int, default None
| If method is specified, this is the maximum number of consecutive
| NaN values to forward/backward fill. In other words, if there is
| a gap with more than this number of consecutive NaNs, it will only
| be partially filled. If method is not specified, this is the
| maximum number of entries along the entire axis where NaNs will be
| filled.
| downcast : dict, default is None
| a dict of item->dtype of what to downcast if possible,
| or the string 'infer' which will try to downcast to an appropriate
| equal type (e.g. float64 to int64 if possible)
|
| See Also
| --------
| reindex, asfreq
|
| Returns
| -------
| filled : Series
|
| hist
| Draw histogram of the input series using matplotlib
|
| Parameters
| ----------
| by : object, optional
| If passed, then used to form histograms for separate groups
| ax : matplotlib axis object
| If not passed, uses gca()
| grid : boolean, default True
| Whether to show axis grid lines
| xlabelsize : int, default None
| If specified changes the x-axis label size
| xrot : float, default None
| rotation of x axis labels
| ylabelsize : int, default None
| If specified changes the y-axis label size
| yrot : float, default None
| rotation of y axis labels
| figsize : tuple, default None
| figure size in inches by default
| bins: integer, default 10
| Number of histogram bins to be used
| kwds : keywords
| To be passed to the actual plotting function
|
| Notes
| -----
| See matplotlib documentation online for more on this
|
| idxmax
| Index of first occurrence of maximum of values.
|
| Parameters
| ----------
| skipna : boolean, default True
| Exclude NA/null values
|
| Returns
| -------
| idxmax : Index of maximum of values
|
| Notes
| -----
| This method is the Series version of ``ndarray.argmax``.
|
| See Also
| --------
| DataFrame.idxmax
| numpy.ndarray.argmax
|
| idxmin
| Index of first occurrence of minimum of values.
|
| Parameters
| ----------
| skipna : boolean, default True
| Exclude NA/null values
|
| Returns
| -------
| idxmin : Index of minimum of values
|
| Notes
| -----
| This method is the Series version of ``ndarray.argmin``.
|
| See Also
| --------
| DataFrame.idxmin
| numpy.ndarray.argmin
|
| mad
|
| Return the mean absolute deviation of the values for the requested axis
|
| Parameters
| ----------
| axis : {index (0)}
| skipna : boolean, default True
| Exclude NA/null values. If an entire row/column is NA, the result
| will be NA
| level : int or level name, default None
| If the axis is a MultiIndex (hierarchical), count along a
| particular level, collapsing into a scalar
| numeric_only : boolean, default None
| Include only float, int, boolean columns. If None, will attempt to use
| everything, then use only numeric data. Not implemented for Series.
|
| Returns
| -------
| mad : scalar or Series (if level specified)
|
| name
| since we are a series, we by definition only have
| a single name, but may be the result of a selection or
| the name of our object
|
| pct_change
| Percent change over given number of periods.
|
| Parameters
| ----------
| periods : int, default 1
| Periods to shift for forming percent change
| fill_method : str, default 'pad'
| How to handle NAs before computing percent changes
| limit : int, default None
| The number of consecutive NAs to fill before stopping
| freq : DateOffset, timedelta, or offset alias string, optional
| Increment to use from time series API (e.g. 'M' or BDay())
|
| Returns
| -------
| chg : NDFrame
|
| Notes
| -----
|
| By default, the percentage change is calculated along the stat
| axis: 0, or ``Index``, for ``DataFrame`` and 1, or ``minor`` for
| ``Panel``. You can change this with the ``axis`` keyword argument.
|
| quantile
| Return value at the given quantile, a la numpy.percentile.
|
| Parameters
| ----------
| q : float or array-like, default 0.5 (50% quantile)
| 0 <= q <= 1, the quantile(s) to compute
| interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
| .. versionadded:: 0.18.0
|
| This optional parameter specifies the interpolation method to use,
| when the desired quantile lies between two data points `i` and `j`:
|
| * linear: `i + (j - i) * fraction`, where `fraction` is the
| fractional part of the index surrounded by `i` and `j`.
| * lower: `i`.
| * higher: `j`.
| * nearest: `i` or `j` whichever is nearest.
| * midpoint: (`i` + `j`) / 2.
|
| Returns
| -------
| quantile : float or Series
| if ``q`` is an array, a Series will be returned where the
| index is ``q`` and the values are the quantiles.
|
| Examples
| --------
| >>> s = Series([1, 2, 3, 4])
| >>> s.quantile(.5)
| 2.5
| >>> s.quantile([.25, .5, .75])
| 0.25 1.75
| 0.50 2.50
| 0.75 3.25
| dtype: float64
|
| rank
| Compute numerical data ranks (1 through n) along axis. Equal values are
| assigned a rank that is the average of the ranks of those values
|
| Parameters
| ----------
| axis: {0 or 'index', 1 or 'columns'}, default 0
| index to direct ranking
| method : {'average', 'min', 'max', 'first', 'dense'}
| * average: average rank of group
| * min: lowest rank in group
| * max: highest rank in group
| * first: ranks assigned in order they appear in the array
| * dense: like 'min', but rank always increases by 1 between groups
| numeric_only : boolean, default None
| Include only float, int, boolean data. Valid only for DataFrame or
| Panel objects
| na_option : {'keep', 'top', 'bottom'}
| * keep: leave NA values where they are
| * top: smallest rank if ascending
| * bottom: smallest rank if descending
| ascending : boolean, default True
| False for ranks by high (1) to low (N)
| pct : boolean, default False
| Computes percentage rank of data
|
| Returns
| -------
| ranks : same type as caller
|
| skew
|
| Return unbiased skew over requested axis
| Normalized by N-1
|
| Parameters
| ----------
| axis : {index (0)}
| skipna : boolean, default True
| Exclude NA/null values. If an entire row/column is NA, the result
| will be NA
| level : int or level name, default None
| If the axis is a MultiIndex (hierarchical), count along a
| particular level, collapsing into a scalar
| numeric_only : boolean, default None
| Include only float, int, boolean columns. If None, will attempt to use
| everything, then use only numeric data. Not implemented for Series.
|
| Returns
| -------
| skew : scalar or Series (if level specified)
|
| take
| return Series corresponding to requested indices
|
| Parameters
| ----------
| indices : list / array of ints
| convert : translate negative to positive indices (default)
|
| Returns
| -------
| taken : Series
|
| See also
| --------
| numpy.ndarray.take
|
| tshift
| Shift the time index, using the index's frequency if available.
|
| Parameters
| ----------
| periods : int
| Number of periods to move, can be positive or negative
| freq : DateOffset, timedelta, or time rule string, default None
| Increment to use from the tseries module or time rule (e.g. 'EOM')
| axis : int or basestring
| Corresponds to the axis that contains the Index
|
| Notes
| -----
| If freq is not specified then tries to use the freq or inferred_freq
| attributes of the index. If neither of those attributes exist, a
| ValueError is thrown
|
| Returns
| -------
| shifted : NDFrame
|
| unique
| Return np.ndarray of unique values in the object.
| Significantly faster than numpy.unique. Includes NA values.
| The order of the original is preserved.
|
| Returns
| -------
| uniques : np.ndarray
|
| ----------------------------------------------------------------------
| Methods inherited from GroupBy:
|
| backfill(self, limit=None)
| Backward fill the values
|
| Parameters
| ----------
| limit : integer, optional
| limit of how many values to fill
|
| See Also
| --------
| Series.fillna
| DataFrame.fillna
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| bfill = backfill(self, limit=None)
|
| cumcount(self, ascending=True)
| Number each item in each group from 0 to the length of that group - 1.
|
| Essentially this is equivalent to
|
| >>> self.apply(lambda x: Series(np.arange(len(x)), x.index))
|
| Parameters
| ----------
| ascending : bool, default True
| If False, number in reverse, from length of group - 1 to 0.
|
| Examples
| --------
|
| >>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
| ... columns=['A'])
| >>> df
| A
| 0 a
| 1 a
| 2 a
| 3 b
| 4 b
| 5 a
| >>> df.groupby('A').cumcount()
| 0 0
| 1 1
| 2 2
| 3 0
| 4 1
| 5 3
| dtype: int64
| >>> df.groupby('A').cumcount(ascending=False)
| 0 3
| 1 2
| 2 1
| 3 1
| 4 0
| 5 0
| dtype: int64
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| cumprod(self, axis=0, *args, **kwargs)
| Cumulative product for each group
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| cumsum(self, axis=0, *args, **kwargs)
| Cumulative sum for each group
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| expanding(self, *args, **kwargs)
| Return an expanding grouper, providing expanding
| functionaility per group
|
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| ffill = pad(self, limit=None)
|
| first = f(self)
| Compute first of group values
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| head(self, n=5)
| Returns first n rows of each group.
|
| Essentially equivalent to ``.apply(lambda x: x.head(n))``,
| except ignores as_index flag.
|
| Examples
| --------
|
| >>> df = DataFrame([[1, 2], [1, 4], [5, 6]],
| columns=['A', 'B'])
| >>> df.groupby('A', as_index=False).head(1)
| A B
| 0 1 2
| 2 5 6
| >>> df.groupby('A').head(1)
| A B
| 0 1 2
| 2 5 6
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| irow(self, i)
| DEPRECATED. Use ``.nth(i)`` instead
|
| last = f(self)
| Compute last of group values
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| max = f(self)
| Compute max of group values
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| mean(self, *args, **kwargs)
| Compute mean of groups, excluding missing values
|
| For multiple groupings, the result index will be a MultiIndex
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| median(self)
| Compute median of groups, excluding missing values
|
| For multiple groupings, the result index will be a MultiIndex
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| min = f(self)
| Compute min of group values
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| nth(self, n, dropna=None)
| Take the nth row from each group if n is an int, or a subset of rows
| if n is a list of ints.
|
| If dropna, will take the nth non-null row, dropna is either
| Truthy (if a Series) or 'all', 'any' (if a DataFrame);
| this is equivalent to calling dropna(how=dropna) before the
| groupby.
|
| Parameters
| ----------
| n : int or list of ints
| a single nth value for the row or a list of nth values
| dropna : None or str, optional
| apply the specified dropna operation before counting which row is
| the nth row. Needs to be None, 'any' or 'all'
|
| Examples
| --------
|
| >>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
| ... 'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
| >>> g = df.groupby('A')
| >>> g.nth(0)
| B
| A
| 1 NaN
| 2 3.0
| >>> g.nth(1)
| B
| A
| 1 2.0
| 2 5.0
| >>> g.nth(-1)
| B
| A
| 1 4.0
| 2 5.0
| >>> g.nth([0, 1])
| B
| A
| 1 NaN
| 1 2.0
| 2 3.0
| 2 5.0
|
| Specifying ``dropna`` allows count ignoring NaN
|
| >>> g.nth(0, dropna='any')
| B
| A
| 1 2.0
| 2 3.0
|
| NaNs denote group exhausted when using dropna
|
| >>> g.nth(3, dropna='any')
| B
| A
| 1 NaN
| 2 NaN
|
| Specifying ``as_index=False`` in ``groupby`` keeps the original index.
|
| >>> df.groupby('A', as_index=False).nth(1)
| A B
| 1 1 2.0
| 4 2 5.0
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| ohlc(self)
| Compute sum of values, excluding missing values
| For multiple groupings, the result index will be a MultiIndex
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| pad(self, limit=None)
| Forward fill the values
|
| Parameters
| ----------
| limit : integer, optional
| limit of how many values to fill
|
| See Also
| --------
| Series.fillna
| DataFrame.fillna
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| prod = f(self)
| Compute prod of group values
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| resample(self, rule, *args, **kwargs)
| Provide resampling when using a TimeGrouper
| Return a new grouper with our resampler appended
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| rolling(self, *args, **kwargs)
| Return a rolling grouper, providing rolling
| functionaility per group
|
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| sem(self, ddof=1)
| Compute standard error of the mean of groups, excluding missing values
|
| For multiple groupings, the result index will be a MultiIndex
|
| Parameters
| ----------
| ddof : integer, default 1
| degrees of freedom
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| shift(self, periods=1, freq=None, axis=0)
| Shift each group by periods observations
|
| Parameters
| ----------
| periods : integer, default 1
| number of periods to shift
| freq : frequency string
| axis : axis to shift, default 0
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| size(self)
| Compute group sizes
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| std(self, ddof=1, *args, **kwargs)
| Compute standard deviation of groups, excluding missing values
|
| For multiple groupings, the result index will be a MultiIndex
|
| Parameters
| ----------
| ddof : integer, default 1
| degrees of freedom
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| sum = f(self)
| Compute sum of group values
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| tail(self, n=5)
| Returns last n rows of each group
|
| Essentially equivalent to ``.apply(lambda x: x.tail(n))``,
| except ignores as_index flag.
|
| Examples
| --------
|
| >>> df = DataFrame([['a', 1], ['a', 2], ['b', 1], ['b', 2]],
| columns=['A', 'B'])
| >>> df.groupby('A').tail(1)
| A B
| 1 a 2
| 3 b 2
| >>> df.groupby('A').head(1)
| A B
| 0 a 1
| 2 b 1
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| var(self, ddof=1, *args, **kwargs)
| Compute variance of groups, excluding missing values
|
| For multiple groupings, the result index will be a MultiIndex
|
| Parameters
| ----------
| ddof : integer, default 1
| degrees of freedom
|
|
| See also
| --------
| pandas.Series.groupby
| pandas.DataFrame.groupby
| pandas.Panel.groupby
|
| ----------------------------------------------------------------------
| Methods inherited from _GroupBy:
|
| __getattr__(self, attr)
|
| __init__(self, obj, keys=None, axis=0, level=None, grouper=None, exclusions=None, selection=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
| Initialize self. See help(type(self)) for accurate signature.
|
| __iter__(self)
| Groupby iterator
|
| Returns
| -------
| Generator yielding sequence of (name, subsetted object)
| for each group
|
| __len__(self)
|
| __unicode__(self)
| Return a string representation for a particular object.
|
| Invoked by unicode(obj) in py2 only. Yields a Unicode String in both
| py2/py3.
|
| apply(self, func, *args, **kwargs)
| Apply function and combine results together in an intelligent way. The
| split-apply-combine combination rules attempt to be as common sense
| based as possible. For example:
|
| case 1:
| group DataFrame
| apply aggregation function (f(chunk) -> Series)
| yield DataFrame, with group axis having group labels
|
| case 2:
| group DataFrame
| apply transform function ((f(chunk) -> DataFrame with same indexes)
| yield DataFrame with resulting chunks glued together
|
| case 3:
| group Series
| apply function with f(chunk) -> DataFrame
| yield DataFrame with result of chunks glued together
|
| Parameters
| ----------
| func : function
|
| Notes
| -----
| See online documentation for full exposition on how to use apply.
|
| In the current implementation apply calls func twice on the
| first group to decide whether it can take a fast or slow code
| path. This can lead to unexpected behavior if func has
| side-effects, as they will take effect twice for the first
| group.
|
|
| See also
| --------
| aggregate, transform
|
| get_group(self, name, obj=None)
| Constructs NDFrame from group with provided name
|
| Parameters
| ----------
| name : object
| the name of the group to get as a DataFrame
| obj : NDFrame, default None
| the NDFrame to take the DataFrame out of. If
| it is None, the object groupby was called on will
| be used
|
| Returns
| -------
| group : type of obj
|
| ----------------------------------------------------------------------
| Data descriptors inherited from _GroupBy:
|
| groups
| dict {group name -> group labels}
|
| indices
| dict {group name -> group indices}
|
| ngroups
|
| plot
| Class implementing the .plot attribute for groupby objects
|
| ----------------------------------------------------------------------
| Methods inherited from pandas.core.base.PandasObject:
|
| __dir__(self)
| Provide method name lookup and completion
| Only provide 'public' methods
|
| __sizeof__(self)
| Generates the total memory usage for a object that returns
| either a value or Series of values
|
| ----------------------------------------------------------------------
| Methods inherited from pandas.core.base.StringMixin:
|
| __bytes__(self)
| Return a string representation for a particular object.
|
| Invoked by bytes(obj) in py3 only.
| Yields a bytestring in both py2/py3.
|
| __repr__(self)
| Return a string representation for a particular object.
|
| Yields Bytestring in Py2, Unicode String in py3.
|
| __str__(self)
| Return a string representation for a particular Object
|
| Invoked by str(df) in both py2/py3.
| Yields Bytestring in Py2, Unicode String in py3.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from pandas.core.base.StringMixin:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from pandas.core.base.SelectionMixin:
|
| __getitem__(self, key)
|
| ----------------------------------------------------------------------
| Data descriptors inherited from pandas.core.base.SelectionMixin:
|
| ndim
grouped.mean()
key1
a -0.198323
b -0.613947
Name: data1, dtype: float64
df['data1'].groupby(['a','a','b','b','a']).mean()
a -0.198323
b -0.613947
Name: data1, dtype: float64
p=df['data1'];p
0 -0.288087
1 0.804061
2 -1.672009
3 0.444115
4 -1.110942
Name: data1, dtype: float64
p.groupby(df['key1']).mean()
key1
a -0.198323
b -0.613947
Name: data1, dtype: float64
The important thing above is that the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the key1 column.
The result index has the name 'key1' because the DataFrame column df['key1'] did!
If instead we had passed multiple arrays as a list, we'd get something different:
means=df['data1'].groupby([df['key1'],df['key2']]).mean();means
key1  key2
a     one    -0.699514
      two     0.804061
b     one    -1.672009
      two     0.444115
Name: data1, dtype: float64
df['data1'].groupby([df['key2'],df['key1']]).mean()
key2  key1
one   a      -0.699514
      b      -1.672009
two   a       0.804061
      b       0.444115
Name: data1, dtype: float64
Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:
means.unstack()
key2       one       two
key1                    
a    -0.699514  0.804061
b    -1.672009  0.444115
In this example, the group keys are all Series, though they could be any arrays of the right length:
states=np.array(['Ohio','California','Ohio','Ohio','California'])
years=np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
California  2005    0.804061
            2006   -1.110942
Ohio        2005    0.078014
            2006   -1.672009
Name: data1, dtype: float64
- The second method: Frequently the grouping information is found in the same DataFrame as the data you want to work on. In this case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:
df.groupby('key1').mean()
         data1     data2
key1                    
a    -0.198323 -0.039899
b    -0.613947  1.384352
df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2                    
a    one  -0.699514 -0.386868
     two   0.804061  0.654039
b    one  -1.672009  1.780028
     two   0.444115  0.988676
You may have noticed in the first case of df.groupby('key1').mean() that there is no key2 column in the result. Because df['key2'] is not numeric data, it is said to be a nuisance column, which is therefore excluded from the result. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset.
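If you prefer to be explicit rather than rely on the nuisance-column behaviour, you can subset the columns to aggregate yourself; a minimal sketch using the df defined above:
# aggregate only the numeric columns, leaving key2 out explicitly
df.groupby('key1')[['data1', 'data2']].mean()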
help(df.groupby(['key1','key2']).size)
Help on method size in module pandas.core.groupby:
size() method of pandas.core.groupby.DataFrameGroupBy instance
Compute group sizes
See also
--------
pandas.Series.groupby
pandas.DataFrame.groupby
pandas.Panel.groupby
A generally useful GroupBy method is size, which returns a Series containing group sizes:
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
As we can see above, the ('a', 'one') group indeed contains two rows.
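As an aside, size counts rows per group, while count counts non-null values per column; a small sketch with a hypothetical missing value inserted for illustration:
df2 = df.copy()
df2.loc[0, 'data1'] = np.nan           # hypothetical NaN, not in the original data
df2.groupby('key1').size()             # row counts per group are unchanged
df2.groupby('key1')['data1'].count()   # group 'a' now has only 2 non-null values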
Iterating over groups
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
for name,group in df.groupby('key1'):
print(name)
print(group)
a
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
4 -1.110942 0.581883 a one
b
data1 data2 key1 key2
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two
In the case of multiple keys, the first element in the tuple will be a tuple of key values.
for (k1,k2),group in df.groupby(['key1','key2']):
print((k1,k2))
print(group)
('a', 'one')
data1 data2 key1 key2
0 -0.288087 -1.355619 a one
4 -1.110942 0.581883 a one
('a', 'two')
data1 data2 key1 key2
1 0.804061 0.654039 a two
('b', 'one')
data1 data2 key1 key2
2 -1.672009 1.780028 b one
('b', 'two')
data1 data2 key1 key2
3 0.444115 0.988676 b two
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
A useful recipe is to compute a dict of the data pieces as a one-liner:
pieces=dict(list(df.groupby('key1')))
pieces
{'a': data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
4 -1.110942 0.581883 a one, 'b': data1 data2 key1 key2
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two}
pieces['b']
      data1     data2 key1 key2
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
list(df.groupby('key1'))
[('a', data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
4 -1.110942 0.581883 a one), ('b', data1 data2 key1 key2
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two)]
dict([('a',1),('b',2)])
{'a': 1, 'b': 2}
dict([('a',1)])
{'a': 1}
dict([(1,2)])
{1: 2}
The cells above illustrate dict's built-in construction from a list of (key, value) tuples.
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
df.dtypes
data1 float64
data2 float64
key1 object
key2 object
dtype: object
grouped=df.groupby(df.dtypes,axis=1)  # note axis=1: df is split into two DataFrames according to column dtype
for dtype,group in grouped:
print(dtype)
print(group)
float64
data1 data2
0 -0.288087 -1.355619
1 0.804061 0.654039
2 -1.672009 1.780028
3 0.444115 0.988676
4 -1.110942 0.581883
object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
pp=dict(list(grouped))
pp
{dtype('float64'): data1 data2
0 -0.288087 -1.355619
1 0.804061 0.654039
2 -1.672009 1.780028
3 0.444115 0.988676
4 -1.110942 0.581883, dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
i,j=pp.keys()
pp[i]
      data1     data2
0 -0.288087 -1.355619
1  0.804061  0.654039
2 -1.672009  1.780028
3  0.444115  0.988676
4 -1.110942  0.581883
pp[j]
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
type(i)
numpy.dtype
pp[np.dtype('object')]
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
Selecting a column or subset of columns
a=list(df.groupby('key1')['data1']);a
[('a', 0 -0.288087
1 0.804061
4 -1.110942
Name: data1, dtype: float64), ('b', 2 -1.672009
3 0.444115
Name: data1, dtype: float64)]
list(df['data1'].groupby(df['key1']))
[('a', 0 -0.288087
1 0.804061
4 -1.110942
Name: data1, dtype: float64), ('b', 2 -1.672009
3 0.444115
Name: data1, dtype: float64)]
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
df.groupby('key1').size()
key1
a 3
b 2
dtype: int64
a=df.groupby('key1')['data1'];a
<pandas.core.groupby.SeriesGroupBy object at 0x000001DDB5670F60>
b=df.groupby('key1')[['data1']];a
<pandas.core.groupby.SeriesGroupBy object at 0x000001DDB5670F60>
aa=df['data1'].groupby(df['key1']);aa
<pandas.core.groupby.SeriesGroupBy object at 0x000001DDB56707B8>
bb=df[['data1']].groupby(df['key1']);bb
<pandas.core.groupby.DataFrameGroupBy object at 0x000001DDB56703C8>
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation.
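In other words, the two spellings are interchangeable for aggregation; a quick check using the df from above:
# indexing after groupby is shorthand for grouping a selected column
left = df.groupby('key1')['data1'].mean()
right = df['data1'].groupby(df['key1']).mean()
left.equals(right)  # True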
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
list(df.groupby(['key1','key2'])['data2'])
[(('a', 'one'), 0 -1.355619
4 0.581883
Name: data2, dtype: float64), (('a', 'two'), 1 0.654039
Name: data2, dtype: float64), (('b', 'one'), 2 1.780028
Name: data2, dtype: float64), (('b', 'two'), 3 0.988676
Name: data2, dtype: float64)]
df.groupby(['key1','key2'])[['data2']].mean()
               data2
key1 key2           
a    one   -0.386868
     two    0.654039
b    one    1.780028
     two    0.988676
df[['data1','data2']].mean()
data1 -0.364572
data2 0.529801
dtype: float64
Grouping with Dicts and Series
Grouping information may exist in a form other than an array.
people=pd.DataFrame(np.random.randn(5,5),columns=list('abcde'),index=['Joe','Steve','Wes','Jim','Travis']);people
               a         b         c         d         e
Joe     0.696631 -0.491001 -0.065796 -1.392325 -0.651280
Steve   0.965853 -1.694954  2.142343 -0.969325  1.129057
Wes    -0.145624 -0.548971 -0.893291  2.421140  0.305064
Jim     0.434581  1.189071  0.455694 -0.929888  1.193285
Travis -0.321174 -1.400983 -0.529033  0.037673  1.221623
Now suppose we have a group correspondence for the columns and want to sum together the columns by group.
mapping={'a':'red','b':'red',
'c':'blue','d':'blue',
'e':'red','f':'orange'}
by_column=people.groupby(mapping,axis=1)
by_column.sum()
            blue       red
Joe    -1.458121 -0.445651
Steve   1.173019  0.399957
Wes     1.527849 -0.389532
Jim    -0.474194  2.816937
Travis -0.491360 -0.500533
The same functionality holds for Series, which can be viewed as a fixed-size mapping.
map_series=pd.Series(mapping);map_series
a red
b red
c blue
d blue
e red
f orange
dtype: object
people.groupby(map_series,axis=1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        2    3
Jim        2    3
Travis     2    3
people.groupby(['red','red','blue','blue','red']).mean()
             a         b         c         d         e
blue  0.144479  0.320050 -0.218799  0.745626  0.749174
red   0.447103 -1.195646  0.515838 -0.774659  0.566467
people.groupby(['red','red','blue','blue','red'],axis=1).mean()
            blue       red
Joe    -0.729060 -0.148550
Steve   0.586509  0.133319
Wes     0.763924 -0.129844
Jim    -0.237097  0.938979
Travis -0.245680 -0.166844
Grouping with functions
Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names.
Suppose you want to group by the length of the names; while you could compute an array of string lengths, it's simpler to just pass the len function.
people.groupby(len).sum()
          a         b         c         d         e
3  0.985588  0.149098 -0.503394  0.098927  0.847068
5  0.965853 -1.694954  2.142343 -0.969325  1.129057
6 -0.321174 -1.400983 -0.529033  0.037673  1.221623
Mixing functions with arrays, dicts, or Series is not a problem, as everything gets converted to arrays internally:
key_list=['one','one','one','two','two']
key_test=[1,1,1,1,1]
people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
  two  0.434581  1.189071  0.455694 -0.929888  1.193285
5 one  0.965853 -1.694954  2.142343 -0.969325  1.129057
6 two -0.321174 -1.400983 -0.529033  0.037673  1.221623
people.groupby([len,key_test]).min()
            a         b         c         d         e
3 1 -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
5 1  0.965853 -1.694954  2.142343 -0.969325  1.129057
6 1 -0.321174 -1.400983 -0.529033  0.037673  1.221623
Grouping by Index levels
A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an index.
columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3,]],names=['City','tenor'])
hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns);hier_df
City          US                            JP          
tenor          1         3         5         1         3
0      -1.878462  1.360321 -1.444771  0.883927  0.995222
1      -1.082905  1.061055 -0.938387 -0.391143  1.383284
2      -0.626484 -0.298127  1.144298  0.941223  1.163550
3      -0.272500 -3.441699  0.804467 -0.029275 -0.003555
To group by level, pass the level number or name using the level keyword.
hier_df.groupby(level='City',axis=1).count()
City  JP  US
0      2   3
1      2   3
2      2   3
3      2   3
Data aggregation
Aggregations refer to any data transformation that produces scalar values from arrays.
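For example, quantile is not implemented specifically for GroupBy, but because it reduces a Series to a scalar, the method dispatch described in the help text above makes it usable as a group aggregation; a small sketch using the df from earlier:
# 90th percentile of data1 within each group
df.groupby('key1')['data1'].quantile(0.9)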
description=['Number of non-NA values in the group',
             'Sum of non-NA values',
             'Mean of non-NA values',
             'Arithmetic median of non-NA values',
             'Unbiased (n-1 denominator) standard deviation and variance',
             'Minimum and maximum of non-NA values',
             'Product of non-NA values',
             'First and last non-NA value']
Tip: Custom aggregation functions are generally much slower than the optimized functions in the table below. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate chunks; a rough timing sketch follows the table.
index=pd.Index(['count','sum','mean','median','std,var','min,max','prod','first,last'],name='Function name')
pd.DataFrame(description,index=index)
                                                             0
Function name                                                 
count                    Number of non-NA values in the group
sum                                       Sum of non-NA values
mean                                     Mean of non-NA values
median                      Arithmetic median of non-NA values
std,var      Unbiased (n-1 denominator) standard deviation a...
min,max                   Minimum and maximum of non-NA values
prod                                  Product of non-NA values
first,last                         First and last non-NA value
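If you want to gauge that overhead yourself, a rough timing sketch in the notebook might look like the following (the relative numbers depend on your data and pandas version; none are reported here):
g = df.groupby('key1')['data1']
%timeit g.mean()                    # optimized built-in aggregation
%timeit g.agg(lambda x: x.mean())   # custom function: Python-level call per group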
To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method.
grouped=df.groupby('key1')
def peak_to_peak(arr):
    # spread of the values within each group
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)
         data1     data2
key1                    
a     1.915003  2.009658
b     2.116124  0.791351
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
def length(arr):
    # Euclidean (L2) norm of the values within each group
    l = 0
    for i in arr:
        l += i**2
    return l**0.5
grouped.agg(length)
         data1     data2
key1                    
a     1.401321  1.613710
b     1.729986  2.036168
grouped.describe()
               data1     data2
key1                          
a    count  3.000000  3.000000
     mean  -0.198323 -0.039899
     std    0.960652  1.140018
     min   -1.110942 -1.355619
     25%   -0.699514 -0.386868
     50%   -0.288087  0.581883
     75%    0.257987  0.617961
     max    0.804061  0.654039
b    count  2.000000  2.000000
     mean  -0.613947  1.384352
     std    1.496326  0.559570
     min   -1.672009  0.988676
     25%   -1.142978  1.186514
     50%   -0.613947  1.384352
     75%   -0.084916  1.582190
     max    0.444115  1.780028
You may notice that some methods like describe also work, even though they are not aggregations.
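Under the hood this relies on apply: calling describe on the GroupBy is roughly a shortcut for applying describe to each piece, something like the sketch below.
# roughly equivalent to grouped.describe()
grouped.apply(lambda g: g.describe())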
Column-wise and multiple function application
tips=pd.read_csv(r'./pydata-book-2nd-edition/examples/tips.csv')
tips.head()
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
tips['tip_pct']=tips['tip']/tips['total_bill']
As we have seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once.
grouped=tips.groupby(['day','smoker'])
group_pct=grouped['tip_pct']
grouped.size()
day   smoker
Fri   No         4
      Yes       15
Sat   No        45
      Yes       42
Sun   No        57
      Yes       19
Thur  No        45
      Yes       17
dtype: int64
group_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
group_pct.agg(['mean','std',peak_to_peak])
                 mean       std  peak_to_peak
day  smoker                                  
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
df=pd.DataFrame(np.arange(12).reshape(3,4),index=['one','two','three'],columns=['o','t','th','f']);df
       o  t  th   f
one    0  1   2   3
two    4  5   6   7
three  8  9  10  11
list(df.groupby([0,0,1])['f'])
[(0, one 3
two 7
Name: f, dtype: int32), (1, three 11
Name: f, dtype: int32)]
If you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names.
group_pct.agg([('foo','mean'),('bar',np.std)])
                  foo       bar
day  smoker                    
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:
functions=['count','mean','max']
result=grouped['tip_pct','total_bill'].agg(functions);result
            tip_pct                     total_bill                  
              count      mean       max      count       mean    max
day  smoker                                                          
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using concat to glue the results together with the column names as the keys argument (a sketch of that construction follows).
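A sketch of that equivalent concat construction, using the grouped and functions objects defined in this section, might look like this:
# builds the same hierarchical columns as the result above
pieces = [grouped[col].agg(functions) for col in ['tip_pct', 'total_bill']]
pd.concat(pieces, axis=1, keys=['tip_pct', 'total_bill'])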
As before, a list of tuples with custom names can be passed:
ftuples=[('Durchs','mean'),('adkfd',np.var)]
grouped['tip_pct','total_bill'].agg(ftuples)
              tip_pct           total_bill            
               Durchs     adkfd     Durchs       adkfd
day  smoker                                           
Fri  No      0.151650  0.000791  18.420000   25.596333
     Yes     0.174783  0.002631  16.813333   82.562438
Sat  No      0.158048  0.001581  19.661778   79.908965
     Yes     0.147906  0.003767  21.276667  101.387535
Sun  No      0.160113  0.001793  20.506667   66.099980
     Yes     0.187250  0.023757  24.120000  109.046044
Thur No      0.160298  0.001503  17.113111   59.625081
     Yes     0.163863  0.001551  19.190588   69.808518
Now, suppose you want to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:
grouped.agg({'tip':np.max,'size':'sum'})
               tip  size
day  smoker             
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40
grouped.agg(ftuples)
             total_bill                   tip                size            tip_pct          
                 Durchs       adkfd    Durchs     adkfd    Durchs     adkfd    Durchs     adkfd
day  smoker                                                                                    
Fri  No       18.420000   25.596333  2.812500  0.807292  2.250000  0.250000  0.151650  0.000791
     Yes      16.813333   82.562438  2.714000  1.161369  2.066667  0.352381  0.174783  0.002631
Sat  No       19.661778   79.908965  3.102889  2.696453  2.555556  0.616162  0.158048  0.001581
     Yes      21.276667  101.387535  2.875476  2.658791  2.476190  0.743322  0.147906  0.003767
Sun  No       20.506667   66.099980  3.167895  1.500099  2.929825  1.066416  0.160113  0.001793
     Yes      24.120000  109.046044  3.516842  1.590501  2.578947  0.812865  0.187250  0.023757
Thur No       17.113111   59.625081  2.673778  1.645997  2.488889  1.391919  0.160298  0.001503
     Yes      19.190588   69.808518  3.030000  1.239863  2.352941  0.492647  0.163863  0.001551
grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
              tip_pct                               size
                  min       max      mean       std  sum
day  smoker                                             
Fri  No      0.120385  0.187735  0.151650  0.028123    9
     Yes     0.103555  0.263480  0.174783  0.051293   31
Sat  No      0.056797  0.291990  0.158048  0.039767  115
     Yes     0.035638  0.325733  0.147906  0.061375  104
Sun  No      0.059447  0.252672  0.160113  0.042347  167
     Yes     0.065660  0.710345  0.187250  0.154134   49
Thur No      0.072961  0.266312  0.160298  0.038774  112
     Yes     0.090014  0.241255  0.163863  0.039389   40
A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
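With hierarchical columns in the result, you can select one top-level block to get back a flat frame; for example, with the result object computed earlier:
# just the tip_pct statistics, with flat column labels
result['tip_pct']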
Returning Aggregated data without row index
In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this is not always desirable, you can disable this behaviour in most cases by passing as_index=False to groupby.
tips.groupby(['day','smoker'],as_index=False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
tips.groupby(['day','smoker']).mean()
             total_bill       tip      size   tip_pct
day  smoker                                          
Fri  No       18.420000  2.812500  2.250000  0.151650
     Yes      16.813333  2.714000  2.066667  0.174783
Sat  No       19.661778  3.102889  2.555556  0.158048
     Yes      21.276667  2.875476  2.476190  0.147906
Sun  No       20.506667  3.167895  2.929825  0.160113
     Yes      24.120000  3.516842  2.578947  0.187250
Thur No       17.113111  2.673778  2.488889  0.160298
     Yes      19.190588  3.030000  2.352941  0.163863
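The same flat layout can also be obtained from the default, indexed result by calling reset_index afterwards, though as_index=False avoids the extra step:
# equivalent to passing as_index=False above
tips.groupby(['day', 'smoker']).mean().reset_index()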
Apply: General split-apply-combine
The most general-purpose GroupBy method is apply, which is the subject of the rest of this section.
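Conceptually, apply splits the object into pieces by group, invokes the passed function on each piece, and then glues the pieces (or scalar results) back together; a tiny sketch using the tips data from above:
# each group's tip_pct values are reduced by the lambda, then the scalars are combined
tips.groupby('smoker')['tip_pct'].apply(lambda s: s.max() - s.min())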
tips.head()
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
Returning to the tipping dataset from before, suppose you want to select the top five tip_pct values by group. First, write a function that selects the rows with the largest values in a particular column:
def top(df, n=5, column='tip_pct'):
    # rows with the n largest values in `column` (sort ascending, take the last n)
    return df.sort_values(by=column)[-n:]
tips.sort_values(by='tip_pct')[-5:]
     total_bill   tip smoker  day    time  size   tip_pct
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
top(tips,n=6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
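As an aside (my addition, not from the original notebook), DataFrame.nlargest selects the same six rows in one call, assuming a pandas version recent enough to provide it; note that it returns them sorted in descending rather than ascending order:
tips.nlargest(6, 'tip_pct')   # same rows as top(tips, n=6), largest first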
Now, if we group by smoker and call apply with this function, we get:
tips.groupby('smoker').apply(top)
             total_bill   tip smoker   day    time  size   tip_pct
smoker
No     88         24.71  5.85     No  Thur   Lunch     2  0.236746
       185        20.69  5.00     No   Sun  Dinner     5  0.241663
       51         10.29  2.60     No   Sun  Dinner     2  0.252672
       149         7.51  2.00     No  Thur   Lunch     2  0.266312
       232        11.61  3.39     No   Sat  Dinner     2  0.291990
Yes    109        14.31  4.00    Yes   Sat  Dinner     2  0.279525
       183        23.17  6.50    Yes   Sun  Dinner     4  0.280535
       67          3.07  1.00    Yes   Sat  Dinner     1  0.325733
       178         9.60  4.00    Yes   Sun  Dinner     2  0.416667
       172         7.25  5.15    Yes   Sun  Dinner     2  0.710345
If you pass a function to apply that takes other arguments or keywords, you can pass these after the function:
tips.groupby(['smoker','day']).apply(top,n=2,column='total_bill')
                  total_bill    tip smoker   day    time  size   tip_pct
smoker day
No     Fri  91         22.49   3.50     No   Fri  Dinner     2  0.155625
            94         22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  59         48.27   6.73     No   Sat  Dinner     4  0.139424
            212        48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  112        38.07   4.00     No   Sun  Dinner     3  0.105070
            156        48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 85         34.83   5.17     No  Thur   Lunch     4  0.148435
            142        41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  90         28.97   3.00    Yes   Fri  Dinner     2  0.103555
            95         40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  102        44.30   2.50    Yes   Sat  Dinner     3  0.056433
            170        50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  184        40.55   3.00    Yes   Sun  Dinner     2  0.073983
            182        45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 83         32.68   5.00    Yes  Thur   Lunch     2  0.152999
            197        43.11   5.00    Yes  Thur   Lunch     4  0.115982
Suppressing the group keys
In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby.
tips.groupby('smoker',group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
Notice the difference between agg and apply: agg reduces each column of a GroupBy object to a single value per group, whereas apply passes each group to the function as a whole, so the result need not be a single value per group the way it is with agg.
tips.groupby('smoker').agg('max')
        total_bill   tip   day   time  size   tip_pct
smoker
No           48.33   9.0  Thur  Lunch     6  0.291990
Yes          50.81  10.0  Thur  Lunch     5  0.710345
tips.groupby('smoker').apply(max)
        total_bill   tip smoker   day   time  size   tip_pct
smoker
No           48.33   9.0     No  Thur  Lunch     6  0.291990
Yes          50.81  10.0    Yes  Thur  Lunch     5  0.710345
The above two calls return essentially the same result. To some extent, agg is a specialized version of apply.
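To make that concrete, here is a small illustrative sketch (my own example, not a cell from the original notebook): apply may return several values per group, whereas agg is expected to reduce each column of each group to a single value.
tips.groupby('smoker')['tip_pct'].apply(lambda x: x.describe())   # multiple statistics per group from one function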
Quantile and bucket analysis
Combining qcut or cut with groupby makes it convenient to perform bucket or quantile analysis on a dataset.
frame=pd.DataFrame({'data1':np.random.randn(1000),
'data2':np.random.randn(1000)})
quantiles=pd.cut(frame.data1,4)
quantiles[:10]
0 (0.00174, 1.893]
1 (0.00174, 1.893]
2 (0.00174, 1.893]
3 (-3.788, -1.89]
4 (0.00174, 1.893]
5 (0.00174, 1.893]
6 (0.00174, 1.893]
7 (-3.788, -1.89]
8 (0.00174, 1.893]
9 (0.00174, 1.893]
Name: data1, dtype: category
Categories (4, object): [(-3.788, -1.89] < (-1.89, 0.00174] < (0.00174, 1.893] < (1.893, 3.784]]
help(pd.cut)
Help on function cut in module pandas.tools.tile:
cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
Return indices of half-open bins to which each value of `x` belongs.
Parameters
----------
x : array-like
Input array to be binned. It has to be 1-dimensional.
bins : int or sequence of scalars
If `bins` is an int, it defines the number of equal-width bins in the
range of `x`. However, in this case, the range of `x` is extended
by .1% on each side to include the min or max values of `x`. If
`bins` is a sequence it defines the bin edges allowing for
non-uniform bin width. No extension of the range of `x` is done in
this case.
right : bool, optional
Indicates whether the bins include the rightmost edge or not. If
right == True (the default), then the bins [1,2,3,4] indicate
(1,2], (2,3], (3,4].
labels : array or boolean, default None
Used as labels for the resulting bins. Must be of the same length as
the resulting bins. If False, return only integer indicators of the
bins.
retbins : bool, optional
Whether to return the bins or not. Can be useful if bins is given
as a scalar.
precision : int
The precision at which to store and display the bins labels
include_lowest : bool
Whether the first interval should be left-inclusive or not.
Returns
-------
out : Categorical or Series or array of integers if labels is False
The return type (Categorical or Series) depends on the input: a Series
of type category if input is a Series else Categorical. Bins are
represented as categories when categorical data is returned.
bins : ndarray of floats
Returned only if `retbins` is True.
Notes
-----
The `cut` function can be useful for going from a continuous variable to
a categorical variable. For example, `cut` could convert ages to groups
of age ranges.
Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting Categorical object
Examples
--------
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533],
(6.533, 9.7], (0.191, 3.367]]
Categories (3, object): [(0.191, 3.367] < (3.367, 6.533] < (6.533, 9.7]],
array([ 0.1905 , 3.36666667, 6.53333333, 9.7 ]))
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3,
labels=["good","medium","bad"])
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]
>>> pd.cut(np.ones(5), 4, labels=False)
array([1, 1, 1, 1, 1], dtype=int64)
help(pd.qcut)
Help on function qcut in module pandas.tools.tile:
qcut(x, q, labels=None, retbins=False, precision=3)
Quantile-based discretization function. Discretize variable into
equal-sized buckets based on rank or based on sample quantiles. For example
1000 values for 10 quantiles would produce a Categorical object indicating
quantile membership for each data point.
Parameters
----------
x : ndarray or Series
q : integer or array of quantiles
Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately
array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles
labels : array or boolean, default None
Used as labels for the resulting bins. Must be of the same length as
the resulting bins. If False, return only integer indicators of the
bins.
retbins : bool, optional
Whether to return the bins or not. Can be useful if bins is given
as a scalar.
precision : int
The precision at which to store and display the bins labels
Returns
-------
out : Categorical or Series or array of integers if labels is False
The return type (Categorical or Series) depends on the input: a Series
of type category if input is a Series else Categorical. Bins are
represented as categories when categorical data is returned.
bins : ndarray of floats
Returned only if `retbins` is True.
Notes
-----
Out of bounds values will be NA in the resulting Categorical object
Examples
--------
>>> pd.qcut(range(5), 4)
[[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]]
Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]]
>>> pd.qcut(range(5), 3, labels=["good","medium","bad"])
[good, good, medium, bad, bad]
Categories (3, object): [good < medium < bad]
>>> pd.qcut(range(5), 4, labels=False)
array([0, 0, 1, 2, 3], dtype=int64)
The Categorical object returned by cut can be passed directly to groupby, so we can compute a set of statistics for the data2 column like so:
def get_status(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}
grouped=frame.data2.groupby(quantiles)
all(frame.data2==frame['data2'])
True
df=pd.Series([1,2,3,5])
pd.cut(df,2)
0 (0.996, 3]
1 (0.996, 3]
2 (0.996, 3]
3 (3, 5]
dtype: category
Categories (2, object): [(0.996, 3] < (3, 5]]
pd.qcut(df,2)
0 [1, 2.5]
1 [1, 2.5]
2 (2.5, 5]
3 (2.5, 5]
dtype: category
Categories (2, object): [[1, 2.5] < (2.5, 5]]
fig=plt.figure()
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1ddb56f2908>
grouped=frame.data2.groupby(quantiles)
grouped.size()
data1
(-3.788, -1.89] 34
(-1.89, 0.00174] 443
(0.00174, 1.893] 492
(1.893, 3.784] 31
dtype: int64
Here, pd.cut(frame.data1, 4) returns a length-1000 Categorical that divides data1 into 4 equal-width bins.
grouped.apply(get_status)
data1
(-3.788, -1.89] count 34.000000
max 1.935178
mean -0.147920
min -1.843208
(-1.89, 0.00174] count 443.000000
max 2.547017
mean -0.020845
min -2.779875
(0.00174, 1.893] count 492.000000
max 2.729491
mean -0.066996
min -3.854003
(1.893, 3.784] count 31.000000
max 1.533001
mean -0.070277
min -1.639600
Name: data2, dtype: float64
grouped.apply(get_status).unstack()
                  count       max      mean       min
data1
(-3.788, -1.89]    34.0  1.935178 -0.147920 -1.843208
(-1.89, 0.00174]  443.0  2.547017 -0.020845 -2.779875
(0.00174, 1.893]  492.0  2.729491 -0.066996 -3.854003
(1.893, 3.784]     31.0  1.533001 -0.070277 -1.639600
These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut instead. We will pass labels=False to just get quantile numbers:
grouping=pd.qcut(frame.data1,10,labels=False)
grouped=frame.data2.groupby(grouping)
grouped.size()
data1
0 100
1 100
2 100
3 100
4 100
5 100
6 100
7 100
8 100
9 100
dtype: int64
grouped.apply(get_status).unstack()
       count       max      mean       min
data1
0      100.0  2.547017 -0.013468 -2.357562
1      100.0  2.462236 -0.042549 -2.698867
2      100.0  2.546495  0.084219 -2.423998
3      100.0  2.225054  0.035150 -2.778011
4      100.0  1.669557 -0.134327 -2.779875
5      100.0  2.649044 -0.071834 -1.871068
6      100.0  2.729491  0.036419 -2.505988
7      100.0  2.014786 -0.009873 -2.671353
8      100.0  2.043135 -0.193331 -3.854003
9      100.0  2.491035 -0.184446 -2.033173
Example: Filling missing values with group-specific values
When cleaning up missing data, in some cases you will remove observations using dropna, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data. fillna is the right tool to use:
s=pd.Series(np.random.randn(6))
s[::2]=np.nan
s
0 NaN
1 -0.875438
2 NaN
3 1.106316
4 NaN
5 0.020065
dtype: float64
s.fillna(s.mean())
0 0.083648
1 -0.875438
2 0.083648
3 1.106316
4 0.083648
5 0.020065
dtype: float64
Suppose you need the fill value to vary by group. One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.
states=['Ohio','New york','Vermont','Florida','Oregon','Nevada','California','Idaho']
group_key=['East']*4+['West']*4
data=pd.Series(np.random.randn(8),index=states)
data[['Vermont','Nevada','Idaho']]=np.nan
data.groupby(group_key).mean()
East 0.199033
West -0.704578
dtype: float64
We can fill the NA values using the group means like so:
fill_mean=lambda g : g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio -0.953675
New york 0.531946
Vermont 0.199033
Florida 1.018829
Oregon -0.539930
Nevada -0.704578
California -0.869226
Idaho -0.704578
dtype: float64
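A sketch of an equivalent approach (my assumption, not shown in the original notebook): transform broadcasts the group means back to the original index, and the result can be handed straight to fillna:
data.fillna(data.groupby(group_key).transform('mean'))   # same fill values as the apply version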
In another case, you might have predefined fill values in your code that vary by group.
fill_values={'East':0.5,'West':-1}
fill_func=lambda g:g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Ohio -0.953675
New york 0.531946
Vermont 0.500000
Florida 1.018829
Oregon -0.539930
Nevada -1.000000
California -0.869226
Idaho -1.000000
dtype: float64
Example: Random sampling and permutation
Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the 'draws'; here we use the sample method for Series.
suits=list('HSCD')
card_val=(list(range(1,11))+[10]*3)*4
base_names=['A']+list(range(2,11))+['J','K','Q'];base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
cards=[]
for suit in ['H','S','C','D']:
    cards.extend(str(num)+suit for num in base_names)
deck=pd.Series(card_val,index=cards)
deck
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
AS 1
2S 2
3S 3
4S 4
5S 5
6S 6
7S 7
8S 8
9S 9
10S 10
JS 10
KS 10
QS 10
AC 1
2C 2
3C 3
4C 4
5C 5
6C 6
7C 7
8C 8
9C 9
10C 10
JC 10
KC 10
QC 10
AD 1
2D 2
3D 3
4D 4
5D 5
6D 6
7D 7
8D 8
9D 9
10D 10
JD 10
KD 10
QD 10
dtype: int64
def draw(deck,n=5):
    return deck.sample(n)
draw(deck)
4H 4
AD 1
8S 8
3C 3
KH 10
dtype: int64
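The prose above mentioned drawing with or without replacement; sample draws without replacement by default, and its replace parameter switches to sampling with replacement. A small sketch of the latter:
deck.sample(n=5, replace=True)   # the same card may appear more than once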
Suppose you want two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply.
get_suit=lambda card:card[-1]
deck.groupby(get_suit).apply(draw,n=2)
C KC 10
8C 8
D 5D 5
8D 8
H QH 10
5H 5
S 4S 4
10S 10
dtype: int64
Alternatively, we could write:
deck.groupby(get_suit,group_keys=False).apply(draw,2)
8C 8
KC 10
8D 8
9D 9
8H 8
QH 10
10S 10
4S 4
dtype: int64
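As a side note (an assumption about newer pandas, not something available in the version this notebook was written against), releases 1.1 and later expose GroupBy.sample, which performs the per-group draw directly:
deck.groupby(get_suit).sample(n=2)   # requires pandas >= 1.1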
Example: Group weighted average and correlation
Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, become possible.
df=pd.DataFrame({'category':['a','a','a','a','b','b','b','b'],
'data':np.random.randn(8),'weights':np.random.randn(8)});df
  category      data   weights
0        a  0.902349  1.730949
1        a  1.287855  0.768677
2        a -0.751881 -0.571836
3        a  1.208645  0.229177
4        b  0.986127 -0.262788
5        b  1.566069 -0.252579
6        b  0.285959  0.184662
7        b  1.059962  1.804769
The group weighted average by category would then be:
grouped=df.groupby('category')
get_wavg=lambda g:np.average(g['data'],weights=g['weights'])
list(grouped)
[('a', category data weights
0 a 0.902349 1.730949
1 a 1.287855 0.768677
2 a -0.751881 -0.571836
3 a 1.208645 0.229177), ('b', category data weights
4 b 0.986127 -0.262788
5 b 1.566069 -0.252579
6 b 0.285959 0.184662
7 b 1.059962 1.804769)]
grouped.apply(get_wavg)
category
a 1.510830
b 0.889441
dtype: float64
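For reference, np.average with weights computes sum(weights * data) / sum(weights); group 'a' above works out to roughly 3.2588 / 2.1570 ≈ 1.5108, matching the printed result. A hand-rolled sketch of the same computation (my own, not a cell from the original notebook):
get_wavg_manual = lambda g: (g['data'] * g['weights']).sum() / g['weights'].sum()
grouped.apply(get_wavg_manual)   # agrees with grouped.apply(get_wavg)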
close_px=pd.read_csv('./pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px.head()
            AAPL   MSFT    XOM     SPX
2003-01-02  7.40  21.11  29.22  909.03
2003-01-03  7.45  21.14  29.24  908.59
2003-01-06  7.45  21.52  29.96  929.01
2003-01-07  7.43  21.93  28.95  922.93
2003-01-08  7.28  21.31  28.83  909.93
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
Read CSV (comma-separated) file into DataFrame
Also supports optionally iterating or breaking of the file
into chunks.
Additional help can be found in the `online docs for IO Tools
<http://pandas.pydata.org/pandas-docs/stable/io.html>`_.
Parameters
----------
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
The string could be a URL. Valid URL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. For instance, a local file could
be file ://localhost/path/to/table.csv
sep : str, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Separators longer than 1 character and different from ``'\s+'`` will
be interpreted as regular expressions, will force use of the python parsing
engine and will ignore quotes in the data. Regex example: ``'\r\t'``
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``' '``) will be
used as the sep. Equivalent to setting ``sep='\s+'``. If this option
is set to True, nothing should be passed in for the ``delimiter``
parameter.
.. versionadded:: 0.18.1 support for the Python parser.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data.
Default behavior is as if set to 0 if no ``names`` passed, otherwise
``None``. Explicitly pass ``header=0`` to be able to replace existing
names. The header can be a list of integers that specify row locations for
a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
specified will be skipped (e.g. 2 in this example is skipped). Note that
this parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of data
rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you
should explicitly pass header=None. Duplicates in this list are not
allowed unless mangle_dupe_cols=True, which is the default.
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
usecols : array-like, default None
Return a subset of the columns. All elements in this array must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid `usecols`
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
results in much faster parsing time and lower memory usage.
as_recarray : boolean, default False
DEPRECATED: this argument will be removed in a future version. Please call
`pd.read_csv(...).to_records()` instead.
Return a NumPy recarray instead of a DataFrame after parsing the data.
If set to True, this option takes precedence over the `squeeze` parameter.
In addition, as row indices are not available in such a format, the
`index_col` parameter will be ignored.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
prefix : str, default None
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than
'X'...'X'. Passing in False will cause data to be overwritten if there
are duplicate names in the columns.
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python'). Use `str` or `object` to preserve and
not interpret dtype.
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
true_values : list, default None
Values to consider as True
false_values : list, default None
Values to consider as False
skipinitialspace : boolean, default False
Skip spaces after delimiter.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
skip_footer : int, default 0
DEPRECATED: use the `skipfooter` parameter instead, as they are identical
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted as
NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'`.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False
* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
'foo'
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format
of the datetime strings in the columns, and if it can be inferred, switch
to a faster method of parsing them. In some cases this can increase the
parsing speed by ~5-10x.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses ``dateutil.parser.parser`` to do the
conversion. Pandas will try to call date_parser in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array
and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with
``get_chunk()``.
chunksize : int, default None
Return TextFileReader object for iteration. `See IO Tools docs for more
information
<http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ on
``iterator`` and ``chunksize``.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',
'.zip', or 'xz', respectively, and no decompression otherwise. If using
'zip', the ZIP file must contain only one data file to be read in.
Set to None for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
thousands : str, default None
Thousands separator
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
float_precision : string, default None
Specifies which converter the C engine should use for floating-point
values. The options are `None` for the ordinary converter,
`high` for the high-precision converter, and `round_trip` for the
round-trip converter.
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote : boolean, default ``True``
When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
whether or not to interpret two consecutive quotechar elements INSIDE a
field as a single ``quotechar`` element.
escapechar : str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning
of a line, the line will be ignored altogether. This parameter must be a
single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header` but not by
`skiprows`. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3'
with `header=0` will result in 'a,b,c' being
treated as the header.
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
dialect : str or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char
See csv.Dialect documentation for more details
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set False, or specify the type with the `dtype` parameter.
Note that the entire file is read into a single DataFrame regardless,
use the `chunksize` or `iterator` parameter to return the data in chunks.
(Only valid with C parser)
buffer_lines : int, default None
DEPRECATED: this argument will be removed in a future version because its
value is not respected by the parser
compact_ints : boolean, default False
DEPRECATED: this argument will be removed in a future version
If compact_ints is True, then for any column that is of integer dtype,
the parser will attempt to cast it as the smallest integer dtype possible,
either signed or unsigned depending on the specification from the
`use_unsigned` parameter.
use_unsigned : boolean, default False
DEPRECATED: this argument will be removed in a future version
If integer columns are being compacted (i.e. `compact_ints=True`), specify
whether the column should be compacted to the smallest signed or unsigned
integer dtype.
memory_map : boolean, default False
If a filepath is provided for `filepath_or_buffer`, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
Returns
-------
result : DataFrame or TextParser
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL 2214 non-null float64
MSFT 2214 non-null float64
XOM 2214 non-null float64
SPX 2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
help(pd.DataFrame.info)
Help on function info in module pandas.core.frame:
info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
Concise summary of a DataFrame.
Parameters
----------
verbose : {None, True, False}, optional
Whether to print the full summary.
None follows the `display.max_info_columns` setting.
True or False overrides the `display.max_info_columns` setting.
buf : writable buffer, defaults to sys.stdout
max_cols : int, default None
Determines whether full summary or short summary is printed.
None follows the `display.max_info_columns` setting.
memory_usage : boolean/string, default None
Specifies whether total memory usage of the DataFrame
elements (including index) should be displayed. None follows
the `display.memory_usage` setting. True or False overrides
the `display.memory_usage` setting. A value of 'deep' is equivalent
of True, with deep introspection. Memory usage is shown in
human-readable units (base-2 representation).
null_counts : boolean, default None
Whether to show the non-null counts
- If None, then only show if the frame is smaller than
max_info_rows and max_info_columns.
- If True, always show counts.
- If False, never show counts.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
category 8 non-null object
data 8 non-null float64
weights 8 non-null float64
dtypes: float64(2), object(1)
memory usage: 272.0+ bytes
One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX. As one way to do this, we first create a function that computes the pairwise correlation of each column with the SPX column:
spx_corr=lambda x:x.corrwith(x['SPX'])
help(pd.DataFrame.pct_change)
Help on function pct_change in module pandas.core.generic:
pct_change(self, periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
Percent change over given number of periods.
Parameters
----------
periods : int, default 1
Periods to shift for forming percent change
fill_method : str, default 'pad'
How to handle NAs before computing percent changes
limit : int, default None
The number of consecutive NAs to fill before stopping
freq : DateOffset, timedelta, or offset alias string, optional
Increment to use from time series API (e.g. 'M' or BDay())
Returns
-------
chg : NDFrame
Notes
-----
By default, the percentage change is calculated along the stat
axis: 0, or ``Index``, for ``DataFrame`` and 1, or ``minor`` for
``Panel``. You can change this with the ``axis`` keyword argument.
df=pd.DataFrame(np.arange(12).reshape((3,4)));df
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
df.pct_change()
          0    1         2         3
0       NaN  NaN       NaN       NaN
1       inf  4.0  2.000000  1.333333
2  1.000000  0.8  0.666667  0.571429
Above, each value is the percent change relative to the row above it:
(4-0)/0 = inf, (5-1)/1 = 4, (6-2)/2 = 2, (7-3)/3 = 1.33333,
(8-4)/4 = 1, (9-5)/5 = 0.8, (10-6)/6 = 0.66667, (11-7)/7 = 0.57143
rets=close_px.pct_change().dropna()
rets.head()
                AAPL      MSFT       XOM       SPX
2003-01-03  0.006757  0.001421  0.000684 -0.000484
2003-01-06  0.000000  0.017975  0.024624  0.022474
2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09  0.008242  0.029094  0.021159  0.019386
Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each datetime label:
get_year=lambda x:x.year
by_year=rets.groupby(get_year) # get_year will be applied to rets.index
list(by_year)
[(2003, AAPL MSFT XOM SPX
2003-01-03 0.006757 0.001421 0.000684 -0.000484
2003-01-06 0.000000 0.017975 0.024624 0.022474
2003-01-07 -0.002685 0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09 0.008242 0.029094 0.021159 0.019386
2003-01-10 0.002725 0.001824 -0.013927 0.000000
2003-01-13 -0.005435 0.008648 -0.004134 -0.001412
2003-01-14 -0.002732 0.010379 0.008993 0.005830
2003-01-15 -0.010959 -0.012506 -0.013713 -0.014426
2003-01-16 0.012465 -0.016282 0.004519 -0.003942
2003-01-17 -0.035568 -0.070345 -0.010381 -0.014017
2003-01-21 -0.005674 -0.002473 -0.023077 -0.015702
2003-01-22 -0.009986 -0.006445 -0.012885 -0.010432
2003-01-23 0.021614 0.024950 -0.002175 0.010224
2003-01-24 -0.026798 -0.046251 -0.021439 -0.029233
2003-01-27 0.024638 -0.013783 -0.026736 -0.016160
2003-01-28 0.031117 -0.007246 0.026326 0.013050
2003-01-29 0.024691 0.022419 0.036431 0.006779
2003-01-30 -0.041499 -0.033656 -0.018293 -0.022849
2003-01-31 0.002793 -0.015831 0.027768 0.013130
2003-02-03 0.020891 0.023056 0.013864 0.005399
2003-02-04 -0.004093 -0.025681 0.000000 -0.014088
2003-02-05 -0.010959 -0.007531 -0.014376 -0.005435
2003-02-06 0.000000 0.009756 -0.008538 -0.006449
2003-02-07 -0.020776 -0.017713 -0.007535 -0.010094
2003-02-10 0.015559 0.017486 0.007592 0.007569
2003-02-11 0.000000 -0.019871 -0.007176 -0.008098
2003-02-12 0.002786 0.000000 -0.019877 -0.012687
2003-02-13 0.009722 0.011507 0.012906 -0.001600
2003-02-14 0.009629 0.028169 0.009465 0.021435
... ... ... ... ...
2003-11-18 -0.034091 0.000000 -0.012604 -0.009084
2003-11-19 0.000980 0.008024 0.008062 0.008016
2003-11-20 -0.001959 -0.009950 -0.003665 -0.008432
2003-11-21 -0.004907 0.000503 0.001003 0.001577
2003-11-24 0.042406 0.024611 0.006014 0.016227
2003-11-25 -0.021760 -0.012745 0.007307 0.001720
2003-11-26 0.001934 0.001986 0.012859 0.004327
2003-11-28 0.008687 0.009911 -0.002279 -0.000236
2003-12-01 0.038278 0.005397 0.005546 0.011264
2003-12-02 -0.007373 -0.007321 0.002920 -0.003271
2003-12-03 -0.023213 0.000492 0.000971 -0.001772
2003-12-04 0.004753 0.020639 0.013251 0.004687
2003-12-05 -0.013245 -0.008185 -0.006380 -0.007684
2003-12-08 0.008629 0.009709 0.016693 0.007348
2003-12-09 -0.027567 0.005288 0.006315 -0.008529
2003-12-10 -0.003910 0.008130 0.004079 -0.001066
2003-12-11 0.040236 0.000949 0.003125 0.011482
2003-12-12 -0.015094 0.001422 0.004050 0.002735
2003-12-15 -0.033525 0.003313 -0.006826 -0.005679
2003-12-16 -0.002973 0.011792 0.017807 0.006638
2003-12-17 -0.011928 -0.000466 0.010436 0.001256
2003-12-18 0.008048 0.013060 0.013670 0.011798
2003-12-19 -0.016966 -0.001381 0.004795 -0.000477
2003-12-22 0.008122 -0.006455 -0.003281 0.003931
2003-12-23 -0.003021 -0.001392 0.007181 0.002818
2003-12-24 0.030303 -0.003717 0.002080 -0.001807
2003-12-26 0.018627 0.006063 0.005336 0.001691
2003-12-29 0.017324 0.009272 0.013270 0.012401
2003-12-30 0.006623 0.002297 0.002619 0.000144
2003-12-31 0.004699 -0.005500 0.007837 0.002055
[251 rows x 4 columns]),
(2004, AAPL MSFT XOM SPX
2004-01-02 -0.004677 0.002765 -0.008929 -0.003094
2004-01-05 0.042293 0.025276 0.023249 0.012395
2004-01-06 -0.003607 0.003586 -0.006816 0.001292
2004-01-07 0.022624 -0.001340 -0.007149 0.002367
2004-01-08 0.033628 -0.001342 -0.002592 0.004963
2004-01-09 -0.015411 -0.017913 -0.014727 -0.008888
2004-01-12 0.031304 -0.003192 0.014068 0.004787
2004-01-13 0.016863 -0.005032 -0.004335 -0.005332
2004-01-14 0.003317 0.009655 0.000000 0.008295
2004-01-15 -0.055372 -0.005920 -0.009869 0.001353
2004-01-16 -0.006124 0.010078 0.005570 0.006872
2004-01-20 0.000000 0.010431 0.009329 -0.000930
2004-01-21 -0.004401 0.007181 0.013287 0.007772
2004-01-22 -0.019452 -0.010250 -0.003421 -0.003207
2004-01-23 0.017133 0.016659 -0.006007 -0.002089
2004-01-26 0.020390 0.011072 0.012374 0.012106
2004-01-27 0.001738 -0.018835 -0.008528 -0.009798
2004-01-28 -0.023417 -0.019196 -0.009174 -0.013610
2004-01-29 0.007105 0.007283 0.016204 0.004989
2004-01-30 -0.005291 -0.009489 -0.016515 -0.002628
2004-02-02 -0.010638 -0.009124 -0.002316 0.003651
2004-02-03 -0.002688 -0.003683 -0.006965 0.000678
2004-02-04 -0.020665 -0.010628 -0.001461 -0.008371
2004-02-05 0.028440 -0.001868 0.000585 0.001838
2004-02-06 0.012489 0.004679 0.003217 0.012555
2004-02-09 -0.000881 -0.006521 0.011662 -0.002581
2004-02-10 0.013228 0.004219 0.008934 0.005027
2004-02-11 0.035683 0.004669 0.020280 0.010667
2004-02-12 -0.003361 -0.006970 -0.000280 -0.004880
2004-02-13 -0.030354 -0.013570 0.000000 -0.005468
... ... ... ... ...
2004-11-18 0.008743 -0.003738 0.000922 0.001362
2004-11-19 -0.003973 -0.007920 0.006680 -0.011161
2004-11-22 0.112038 -0.007563 0.010526 0.005896
2004-11-23 -0.000978 -0.004657 0.005661 -0.000255
2004-11-24 0.045366 0.004254 0.005629 0.004095
2004-11-26 0.007805 -0.001694 0.003135 0.000753
2004-11-29 0.060099 0.006364 -0.005357 -0.003450
2004-11-30 -0.020164 0.001686 -0.002469 -0.004030
2004-12-01 0.011035 0.016414 -0.001800 0.014951
2004-12-02 -0.038348 -0.005797 -0.019157 -0.000873
2004-12-03 -0.038650 0.004998 0.001838 0.000706
2004-12-06 0.049458 0.003730 -0.002523 -0.000772
2004-12-07 -0.044086 -0.009496 -0.008738 -0.011073
2004-12-08 0.006361 0.010838 0.006495 0.004877
2004-12-09 0.011378 -0.004948 0.005992 0.005436
2004-12-10 0.018125 -0.005387 -0.003436 -0.001043
2004-12-13 -0.003683 0.006250 0.010345 0.008990
2004-12-14 0.005853 -0.000828 -0.000228 0.003921
2004-12-15 -0.000613 -0.004144 -0.002959 0.001945
2004-12-16 0.020533 0.001665 -0.007989 -0.002082
2004-12-17 -0.024324 -0.007478 -0.005983 -0.007488
2004-12-20 -0.034780 -0.000419 0.025231 0.000377
2004-12-21 0.015306 0.004606 0.012192 0.009040
2004-12-22 0.001256 -0.003752 -0.007584 0.003418
2004-12-23 0.004078 0.001674 0.013261 0.000463
2004-12-27 -0.013433 -0.006266 -0.021074 -0.004305
2004-12-28 0.016149 0.003783 0.003399 0.007154
2004-12-29 0.004051 -0.001675 0.001581 -0.000074
2004-12-30 0.005587 -0.005034 -0.002029 0.000082
2004-12-31 -0.006173 -0.001686 0.004519 -0.001343
[252 rows x 4 columns]),
(2005, AAPL MSFT XOM SPX
2005-01-03 -0.017081 0.000845 -0.022717 -0.008119
2005-01-04 0.010111 0.003797 -0.006904 -0.011671
2005-01-05 0.008758 -0.002522 -0.005098 -0.003628
2005-01-06 0.000930 -0.000843 0.012579 0.003506
2005-01-07 0.072491 -0.002952 -0.006441 -0.001431
2005-01-10 -0.004044 0.004653 0.003705 0.003423
2005-01-11 -0.063805 -0.002526 -0.002537 -0.006100
2005-01-12 0.013941 0.001688 0.014801 0.003981
2005-01-13 0.066300 -0.018963 -0.000684 -0.008630
2005-01-14 0.005731 -0.005584 0.010262 0.006005
2005-01-18 0.006553 0.007775 0.008352 0.009675
2005-01-19 -0.011039 -0.013288 -0.008731 -0.009490
2005-01-20 0.008300 -0.004344 -0.006549 -0.007783
2005-01-21 0.000284 -0.008290 -0.005456 -0.006415
2005-01-24 0.003973 0.000880 0.013714 -0.003528
2005-01-25 0.018372 0.013626 0.003833 0.004004
2005-01-26 0.002498 -0.000434 0.007188 0.004844
2005-01-27 0.005537 0.003905 0.001115 0.000409
2005-01-28 0.018447 0.002593 -0.009356 -0.002716
2005-01-31 0.039470 0.003879 0.006521 0.008460
2005-02-01 0.008062 0.004294 0.032395 0.006891
2005-02-02 0.027090 0.002565 0.012335 0.003178
2005-02-03 -0.022859 -0.010661 0.010902 -0.002766
2005-02-04 0.013368 0.005603 0.014168 0.011043
2005-02-07 0.001268 -0.006429 0.001877 -0.001089
2005-02-08 0.024829 0.003020 0.011863 0.000483
2005-02-09 -0.026700 -0.006452 -0.004319 -0.008575
2005-02-10 -0.004826 -0.000433 0.014460 0.004211
2005-02-11 0.036243 -0.003032 -0.004072 0.006926
2005-02-14 0.042118 0.001303 0.012676 0.000697
... ... ... ... ...
2005-11-17 -0.006620 0.008441 0.003360 0.009414
2005-11-18 0.000620 0.003587 0.015166 0.004401
2005-11-21 0.006196 0.003177 0.019208 0.005271
2005-11-22 0.024015 -0.008709 0.004950 0.005084
2005-11-23 0.008870 0.000399 0.003599 0.003473
2005-11-25 0.033229 -0.005988 0.003964 0.002086
2005-11-28 0.004615 -0.000402 -0.022749 -0.008508
2005-11-29 -0.022394 -0.002411 -0.006926 0.000016
2005-11-30 -0.004112 0.000000 -0.005231 -0.006362
2005-12-01 0.055736 0.007652 0.022785 0.012157
2005-12-02 0.014385 0.004396 -0.004760 0.000324
2005-12-05 -0.011152 -0.005969 0.007461 -0.002363
2005-12-06 0.031050 -0.005604 0.003038 0.001276
2005-12-07 -0.001350 0.002013 -0.011359 -0.005009
2005-12-08 0.001758 -0.002009 0.006894 -0.001217
2005-12-09 0.003375 0.000805 -0.015595 0.002811
2005-12-12 0.007803 -0.009654 0.006182 0.000842
2005-12-13 0.000934 -0.011373 0.003456 0.005554
2005-12-14 -0.039611 -0.001643 0.013586 0.004190
2005-12-15 0.002361 -0.006173 -0.006230 -0.001414
2005-12-16 -0.014824 -0.000828 -0.024126 -0.002848
2005-12-19 0.003797 -0.002487 -0.006035 -0.005839
2005-12-20 0.010227 0.000831 0.003917 -0.000238
2005-12-21 0.019276 -0.004566 -0.005657 0.002517
2005-12-22 0.007075 -0.005421 -0.008633 0.004221
2005-12-23 -0.009052 0.002096 0.000000 0.000426
2005-12-27 0.011997 -0.006695 -0.021571 -0.009553
2005-12-28 -0.008891 -0.002949 0.006675 0.001297
2005-12-29 -0.028816 -0.004225 -0.000201 -0.002981
2005-12-30 0.006158 -0.004667 -0.001206 -0.004887
[252 rows x 4 columns]),
(2006, AAPL MSFT XOM SPX
2006-01-03 0.039783 0.026428 0.041046 0.016430
2006-01-04 0.002943 0.004568 0.001739 0.003673
2006-01-05 -0.007870 0.000827 -0.005016 0.000016
2006-01-06 0.025813 -0.002891 0.019779 0.009399
2006-01-09 -0.003277 -0.002071 -0.000570 0.003656
2006-01-10 0.063248 0.005396 0.007801 -0.000357
2006-01-11 0.037596 0.010735 0.006796 0.003481
2006-01-12 0.004648 -0.005310 -0.010501 -0.006274
2006-01-13 0.015423 0.001643 0.022361 0.001205
2006-01-17 -0.010282 -0.007380 0.009268 -0.003557
2006-01-18 -0.026207 -0.005783 -0.013958 -0.003975
2006-01-19 -0.041823 0.007063 0.013597 0.005564
2006-01-20 -0.037323 -0.022690 -0.015803 -0.018326
2006-01-23 0.020765 -0.002111 0.011016 0.001847
2006-01-24 -0.020986 -0.002961 -0.003878 0.002405
2006-01-25 -0.024198 0.004667 -0.012236 -0.001721
2006-01-26 -0.025202 0.003801 -0.004317 0.007235
2006-01-27 -0.004148 0.048801 0.022243 0.007764
2006-01-30 0.041233 0.007621 0.029688 0.001145
2006-01-31 0.006800 0.005175 -0.005731 -0.003976
2006-02-01 -0.001192 -0.003960 -0.012608 0.001859
2006-02-02 -0.044020 -0.012724 0.000000 -0.009061
2006-02-03 -0.003467 -0.005236 -0.009121 -0.005359
2006-02-06 -0.063326 -0.013360 0.009389 0.000783
2006-02-07 0.004458 -0.008207 -0.022798 -0.008095
2006-02-08 0.017899 -0.001241 0.002613 0.008663
2006-02-09 -0.056096 -0.009114 -0.007818 -0.001478
2006-02-10 0.036336 0.000836 -0.008068 0.002540
2006-02-13 -0.038627 -0.011278 0.002837 -0.003260
2006-02-14 0.045279 0.010139 -0.000943 0.010033
... ... ... ... ...
2006-11-16 0.018560 0.012075 -0.028457 0.002284
2006-11-17 0.002803 -0.002610 0.005645 0.001029
2006-11-20 0.007222 0.016822 -0.008799 -0.000500
2006-11-21 0.024633 0.001103 0.013162 0.001649
2006-11-22 0.019300 0.000000 -0.006495 0.002338
2006-11-24 0.014616 -0.005509 -0.007298 -0.003656
2006-11-27 -0.022809 -0.009232 0.001225 -0.013555
2006-11-28 0.025352 -0.002982 0.023405 0.003444
2006-11-29 -0.000109 0.005981 0.025112 0.009202
2006-11-30 -0.001525 -0.007061 0.010353 0.000822
2006-12-01 -0.003709 -0.008234 0.005051 -0.002799
2006-12-04 -0.002190 0.007170 0.007323 0.008885
2006-12-05 0.001646 -0.006744 0.003849 0.004002
2006-12-06 -0.015777 -0.004904 -0.022437 -0.001315
2006-12-07 -0.031059 -0.004928 -0.007844 -0.003971
2006-12-08 0.014017 0.019048 -0.002782 0.001812
2006-12-11 0.005552 0.004860 -0.001909 0.002270
2006-12-12 -0.029408 -0.003720 0.011768 -0.001047
2006-12-13 0.033782 0.004108 0.014539 0.001169
2006-12-14 -0.005615 0.017479 0.017770 0.008689
2006-12-15 -0.009373 0.004020 -0.018164 0.001122
2006-12-18 -0.025650 -0.009829 -0.023089 -0.003230
2006-12-19 0.009828 0.003309 0.019524 0.002158
2006-12-20 -0.017959 0.003298 -0.011807 -0.001417
2006-12-21 -0.021944 -0.003652 -0.002768 -0.003674
2006-12-22 -0.008444 -0.011364 -0.005991 -0.005316
2006-12-26 -0.008394 0.011865 0.008379 0.004352
2006-12-27 0.000123 0.001099 0.011953 0.007015
2006-12-28 -0.007974 -0.001464 0.004898 -0.001479
2006-12-29 0.049091 -0.004032 -0.009031 -0.004513
[251 rows x 4 columns]),
(2007, AAPL MSFT XOM SPX
2007-01-03 -0.012258 0.000000 -0.032981 -0.001199
2007-01-04 0.022196 -0.001472 -0.018699 0.001228
2007-01-05 -0.007121 -0.005898 0.007165 -0.006085
2007-01-08 0.004938 0.010011 -0.008022 0.002220
2007-01-09 0.083070 0.000734 -0.007782 -0.000517
2007-01-10 0.047856 -0.009905 -0.015224 0.001940
2007-01-11 -0.012371 0.035198 -0.000156 0.006340
2007-01-12 -0.012317 0.016464 0.023583 0.004853
2007-01-16 0.026210 -0.001408 -0.014037 0.000818
2007-01-17 -0.022142 -0.002116 0.011452 -0.000894
2007-01-18 -0.061927 -0.003180 -0.006885 -0.002971
2007-01-19 -0.006399 0.003545 0.021876 0.002895
2007-01-22 -0.019322 -0.012363 -0.008593 -0.005278
2007-01-23 -0.012559 0.000358 0.021898 0.003542
2007-01-24 0.011669 0.011441 0.005506 0.008501
2007-01-25 -0.005190 -0.020502 -0.018499 -0.011270
2007-01-26 -0.010087 0.005052 0.001206 -0.001208
2007-01-29 0.006559 -0.002513 -0.005572 -0.001097
2007-01-30 -0.004538 -0.001440 0.016356 0.005772
2007-01-31 0.002104 0.012257 -0.004023 0.006593
2007-02-01 -0.011548 -0.009615 0.013315 0.005354
2007-02-02 0.000118 -0.012226 0.006053 0.001694
2007-02-05 -0.009558 -0.018930 0.001761 -0.000967
2007-02-06 0.002502 -0.003711 -0.002783 0.000698
2007-02-07 0.023767 -0.004469 -0.004701 0.001395
2007-02-08 0.000348 -0.003741 0.009004 -0.001179
2007-02-09 -0.033767 -0.009763 -0.003218 -0.007077
2007-02-12 0.019335 -0.001138 -0.008218 -0.003261
2007-02-13 -0.002121 0.005695 0.011394 0.007597
2007-02-14 0.007084 0.013590 0.002048 0.007644
... ... ... ... ...
2007-11-16 0.012721 0.009634 0.007228 0.005230
2007-11-19 -0.014664 -0.003817 -0.011661 -0.017460
2007-11-20 0.029887 0.018519 0.044081 0.004486
2007-11-21 -0.002310 -0.010345 -0.008817 -0.015927
2007-11-23 0.018283 -0.003484 0.014282 0.016891
2007-11-26 0.005830 -0.033376 -0.029521 -0.023239
2007-11-27 0.013156 0.002631 0.008146 0.014930
2007-11-28 0.030948 0.019351 0.017801 0.028560
2007-11-29 0.022584 -0.003218 0.007690 0.000477
2007-11-30 -0.011232 0.000323 0.006401 0.007770
2007-12-03 -0.018439 -0.020329 -0.003425 -0.005887
2007-12-04 0.005311 -0.004282 -0.008223 -0.006540
2007-12-05 0.031645 0.042011 0.020418 0.015190
2007-12-06 0.023989 0.011746 0.016857 0.015037
2007-12-07 0.022901 -0.000628 0.000716 -0.001778
2007-12-10 -0.000463 0.006593 0.005720 0.007510
2007-12-11 -0.029195 -0.019027 -0.018960 -0.025271
2007-12-12 0.012305 0.010811 0.018118 0.006050
2007-12-13 0.005082 0.022020 0.008779 0.001224
2007-12-14 -0.007507 0.002462 -0.016700 -0.013746
2007-12-17 -0.031462 -0.026098 -0.014113 -0.015021
2007-12-18 -0.007701 0.010088 0.016984 0.006280
2007-12-19 0.000765 0.001561 0.000119 -0.001361
2007-12-20 0.022335 0.020879 0.007634 0.004900
2007-12-21 0.035789 0.015263 0.014205 0.016670
2007-12-24 0.025218 0.014432 0.002451 0.008077
2007-12-26 0.000755 0.000889 0.012225 0.000809
2007-12-27 -0.001910 -0.017471 -0.011962 -0.014282
2007-12-28 0.006345 0.004219 0.014203 0.001504
2007-12-31 -0.008757 -0.014406 -0.013889 -0.006852
[251 rows x 4 columns]),
(2008, AAPL MSFT XOM SPX
2008-01-02 -0.016357 -0.010658 -0.001862 -0.014438
2008-01-03 0.000462 0.004001 0.003382 0.000000
2008-01-04 -0.076335 -0.027897 -0.018596 -0.024552
2008-01-07 -0.013385 0.006623 -0.009356 0.003223
2008-01-08 -0.035972 -0.033521 -0.012791 -0.018352
2008-01-09 0.047591 0.029822 0.016711 0.013624
2008-01-10 -0.007692 -0.003148 0.001072 0.007948
2008-01-11 -0.029940 -0.012314 -0.014753 -0.013595
2008-01-14 0.035266 0.014066 0.005796 0.010871
2008-01-15 -0.054480 -0.011349 -0.019930 -0.024925
2008-01-16 -0.055608 -0.022640 -0.027931 -0.005612
2008-01-17 0.007830 -0.003589 -0.030246 -0.029093
2008-01-18 0.002921 -0.002947 0.013905 -0.006045
2008-01-22 -0.035449 -0.031856 -0.030890 -0.011085
2008-01-23 -0.106464 -0.001018 0.012035 0.021442
2008-01-24 -0.024951 0.041426 0.030580 0.010063
2008-01-25 -0.041224 -0.009455 -0.023967 -0.015872
2008-01-28 0.000000 -0.006583 0.014291 0.017548
2008-01-29 0.011768 -0.003645 -0.000640 0.006160
2008-01-30 0.004865 -0.012305 0.002307 -0.004764
2008-01-31 0.024058 0.012458 0.004987 0.016772
2008-02-01 -0.011894 -0.065846 0.002927 0.012237
2008-02-04 -0.015701 -0.008544 -0.005963 -0.010463
2008-02-05 -0.017395 -0.037343 -0.038928 -0.031995
2008-02-06 -0.056895 -0.018650 -0.008234 -0.007624
2008-02-07 -0.006230 -0.014063 0.009909 0.007886
2008-02-08 0.034972 0.015420 -0.002254 -0.004204
2008-02-11 0.031639 -0.012149 0.018472 0.005889
2008-02-12 -0.035458 0.004612 0.013961 0.007266
2008-02-13 0.036361 0.021806 0.013126 0.013604
... ... ... ... ...
2008-11-18 0.020082 0.022185 0.040222 0.009838
2008-11-19 -0.040262 -0.067824 -0.038106 -0.061156
2008-11-20 -0.067215 -0.041327 -0.066851 -0.067123
2008-11-21 0.025966 0.122647 0.106446 0.063248
2008-11-24 0.125575 0.051379 0.039498 0.064723
2008-11-25 -0.023131 -0.033951 -0.008685 0.006551
2008-11-26 0.046256 0.025027 0.035455 0.035328
2008-11-28 -0.024526 -0.013506 -0.009122 0.009643
2008-12-01 -0.040358 -0.079516 -0.072849 -0.089295
2008-12-02 0.039807 0.029176 0.044467 0.039941
2008-12-03 0.037093 0.037799 0.016947 0.025836
2008-12-04 -0.046820 -0.038565 -0.033735 -0.029308
2008-12-05 0.028334 0.040111 0.004347 0.036499
2008-12-08 0.060851 0.057311 0.039229 0.038387
2008-12-09 0.003410 -0.019757 -0.017598 -0.023118
2008-12-10 -0.018489 0.000517 0.023930 0.011894
2008-12-11 -0.032685 -0.056302 -0.000668 -0.028524
2008-12-12 0.034421 -0.004379 0.005345 0.007028
2008-12-15 -0.035820 -0.016493 -0.006115 -0.012686
2008-12-16 0.007177 0.055897 0.039856 0.051360
2008-12-17 -0.065703 -0.022234 -0.025080 -0.009593
2008-12-18 0.003028 -0.018408 -0.050000 -0.021163
2008-12-19 0.006374 -0.009377 -0.025691 0.002937
2008-12-22 -0.047333 0.003341 -0.001568 -0.018302
2008-12-23 0.007464 0.004994 0.002570 -0.009717
2008-12-24 -0.015513 -0.005522 0.009113 0.005781
2008-12-26 0.009055 -0.002221 0.018485 0.005356
2008-12-29 0.009323 -0.008904 0.010806 -0.003873
2008-12-30 -0.003695 0.020213 0.007264 0.024407
2008-12-31 -0.010893 0.004953 0.015784 0.014158
[253 rows x 4 columns]),
(2009, AAPL MSFT XOM SPX
2009-01-02 0.063269 0.046002 0.022773 0.031608
2009-01-05 0.042204 0.009424 -0.000131 -0.004668
2009-01-06 -0.016494 0.011411 -0.016374 0.007817
2009-01-07 -0.021608 -0.060000 -0.025436 -0.030010
2009-01-08 0.018569 0.031097 0.010659 0.003397
2009-01-09 -0.022869 -0.029630 -0.019200 -0.021303
2009-01-12 -0.021197 -0.002726 -0.013234 -0.022564
2009-01-13 -0.010715 0.018043 0.018022 0.001758
2009-01-14 -0.027135 -0.037057 -0.036229 -0.033460
2009-01-15 -0.022852 0.007808 0.020789 0.001329
2009-01-16 -0.012593 0.024903 0.018831 0.007562
2009-01-20 -0.050164 -0.062635 -0.023275 -0.052816
2009-01-21 0.059207 0.048963 0.038968 0.043491
2009-01-22 0.066763 -0.117518 -0.012952 -0.015162
2009-01-23 0.000000 0.005600 -0.002460 0.005378
2009-01-26 0.014486 0.024752 0.007125 0.005553
2009-01-27 0.012160 0.001812 0.004082 0.010926
2009-01-28 0.038245 0.021700 0.004201 0.033558
2009-01-29 -0.012739 -0.025369 -0.028336 -0.033120
2009-01-30 -0.030860 -0.027845 -0.006805 -0.022789
2009-02-02 0.015311 0.042964 0.002796 -0.000533
2009-02-03 0.016064 0.037612 0.018544 0.015834
2009-02-04 0.006130 0.006904 -0.005886 -0.007489
2009-02-05 0.031106 0.022286 0.027403 0.016366
2009-02-06 0.033796 0.032420 0.012063 0.026896
2009-02-09 0.027978 -0.011370 -0.010727 0.001485
2009-02-10 -0.045654 -0.032859 -0.042035 -0.049121
2009-02-11 -0.010324 0.022084 -0.020402 0.007955
2009-02-12 0.025305 0.002216 0.008559 0.001739
2009-02-13 -0.001108 -0.008845 -0.008487 -0.009998
... ... ... ... ...
2009-11-18 -0.005024 0.003814 0.003204 -0.000468
2009-11-19 -0.026461 -0.011054 -0.008331 -0.013426
2009-11-20 -0.002942 -0.005239 -0.003500 -0.003215
2009-11-23 0.029812 0.010885 0.017704 0.013616
2009-11-24 -0.006994 -0.001042 0.003590 -0.000533
2009-11-25 -0.001223 -0.004172 0.006603 0.004504
2009-11-27 -0.017631 -0.019204 -0.021047 -0.017233
2009-11-30 -0.003390 0.006764 0.002792 0.003793
2009-12-01 -0.014707 0.020156 0.012808 0.012075
2009-12-02 -0.003757 -0.007626 -0.003162 0.000343
2009-12-03 0.001274 0.001746 -0.010618 -0.008402
2009-12-04 -0.016083 0.004881 -0.009895 0.005509
2009-12-07 -0.022605 -0.006246 -0.006475 -0.002468
2009-12-08 0.004869 -0.007332 -0.011051 -0.010252
2009-12-09 0.041765 0.004573 -0.002292 0.003672
2009-12-10 -0.006926 0.005602 -0.005313 0.005840
2009-12-11 -0.008960 -0.000696 0.005919 0.003683
2009-12-14 0.011866 0.008711 -0.043054 0.006959
2009-12-15 -0.014265 -0.003109 -0.007499 -0.005547
2009-12-16 0.004429 0.002772 -0.010728 0.001128
2009-12-17 -0.016254 -0.016586 -0.003055 -0.011811
2009-12-18 0.018607 0.025650 -0.000153 0.005830
2009-12-21 0.014327 0.005139 0.004444 0.010504
2009-12-22 0.010745 0.009884 0.000915 0.003564
2009-12-23 0.008684 0.003375 -0.004572 0.002299
2009-12-24 0.034339 0.002355 0.005818 0.005256
2009-12-28 0.012294 0.005705 0.006241 0.001154
2009-12-29 -0.011861 0.007007 -0.003480 -0.001401
2009-12-30 0.012147 -0.013585 -0.001063 0.000195
2009-12-31 -0.004300 -0.015788 -0.008511 -0.010050
[252 rows x 4 columns]),
(2010, AAPL MSFT XOM SPX
2010-01-04 0.015565 0.015700 0.014102 0.016043
2010-01-05 0.001729 0.000336 0.003930 0.003116
2010-01-06 -0.015906 -0.006382 0.008582 0.000546
2010-01-07 -0.001849 -0.010142 -0.003135 0.004001
2010-01-08 0.006648 0.006831 -0.003893 0.002882
2010-01-11 -0.008822 -0.012890 0.011124 0.001747
2010-01-12 -0.011375 -0.006529 -0.004906 -0.009381
2010-01-13 0.014106 0.009339 -0.004034 0.008326
2010-01-14 -0.005792 0.020219 0.000150 0.002427
2010-01-15 -0.016712 -0.003359 -0.008250 -0.010823
2010-01-19 0.044238 0.007752 0.002420 0.012500
2010-01-20 -0.015392 -0.016388 -0.017954 -0.010598
2010-01-21 -0.017286 -0.019041 -0.019511 -0.018945
2010-01-22 -0.049599 -0.035009 -0.009088 -0.022141
2010-01-25 0.026903 0.012572 -0.003795 0.004598
2010-01-26 0.014133 0.006031 0.001111 -0.004203
2010-01-27 0.009420 0.005994 -0.005708 0.004880
2010-01-28 -0.041322 -0.017175 -0.008930 -0.011818
2010-01-29 -0.036279 -0.033880 -0.008045 -0.009829
2010-02-01 0.013902 0.008121 0.027088 0.014266
2010-02-02 0.005803 0.001831 0.011845 0.012973
2010-02-03 0.017206 0.006213 -0.005463 -0.005474
2010-02-04 -0.036039 -0.027606 -0.028249 -0.031141
2010-02-05 0.017756 0.006350 0.001292 0.002897
2010-02-08 -0.006856 -0.010765 -0.000484 -0.008863
2010-02-09 0.010664 0.010507 0.013232 0.013040
2010-02-10 -0.005454 -0.000743 -0.005415 -0.002233
2010-02-11 0.018194 0.004831 0.006085 0.009680
2010-02-12 0.008607 -0.007027 -0.006844 -0.002745
2010-02-16 0.015071 0.019739 0.022917 0.018001
... ... ... ... ...
2010-11-18 0.026389 0.010359 0.018876 0.015357
2010-11-19 -0.005512 -0.005915 0.003329 0.002540
2010-11-22 0.021615 0.001587 -0.005049 -0.001575
2010-11-23 -0.014775 -0.023762 -0.017254 -0.014284
2010-11-24 0.019661 0.010142 0.013131 0.014923
2010-11-26 0.000635 -0.004819 -0.009320 -0.007469
2010-11-29 0.005937 0.002421 0.003234 -0.001379
2010-11-30 -0.018052 -0.002013 0.001465 -0.006070
2010-12-01 0.016873 0.031061 0.025457 0.021617
2010-12-02 0.005531 0.032473 0.002140 0.012818
2010-12-03 -0.002232 0.004926 -0.003986 0.002603
2010-12-06 0.008537 -0.006787 0.001572 -0.001298
2010-12-07 -0.006060 0.001139 0.002141 0.000515
2010-12-08 0.008799 0.013652 0.005412 0.003702
2010-12-09 -0.003894 -0.005612 0.002125 0.003843
2010-12-10 0.002502 0.009406 0.002544 0.006002
2010-12-13 0.003463 -0.003354 0.002538 0.000048
2010-12-14 -0.004290 0.013837 -0.002531 0.000911
2010-12-15 0.000219 0.008115 -0.004652 -0.005122
2010-12-16 0.002778 0.005123 0.005241 0.006185
2010-12-17 -0.001992 -0.003276 -0.000705 0.000837
2010-12-20 0.004990 -0.003287 0.000705 0.002548
2010-12-21 0.006176 0.009527 0.006904 0.006030
2010-12-22 0.002961 0.004356 0.001120 0.003380
2010-12-23 -0.004798 0.003975 0.005451 -0.001644
2010-12-27 0.003337 -0.008279 -0.002641 0.000613
2010-12-28 0.002433 -0.002178 0.005715 0.000771
2010-12-29 -0.000553 -0.001455 -0.000693 0.001009
2010-12-30 -0.005011 -0.004372 -0.000139 -0.001508
2010-12-31 -0.003399 0.002195 -0.003329 -0.000191
[252 rows x 4 columns]),
(2011, AAPL MSFT XOM SPX
2011-01-03 0.021732 0.002556 0.019624 0.011315
2011-01-04 0.005219 0.004006 0.004641 -0.001313
2011-01-05 0.008180 -0.003264 -0.002717 0.005007
2011-01-06 -0.000808 0.029476 0.006540 -0.002123
2011-01-07 0.007161 -0.007777 0.005414 -0.001845
2011-01-10 0.018833 -0.013181 -0.006058 -0.001376
2011-01-11 -0.002365 -0.003971 0.007450 0.003725
2011-01-12 0.008137 0.015585 0.011697 0.009008
2011-01-13 0.003658 -0.012491 0.001728 -0.001711
2011-01-14 0.008100 0.003975 0.014725 0.007385
2011-01-18 -0.022469 0.012599 0.011243 0.001376
2011-01-19 -0.005313 -0.006754 -0.006076 -0.010116
2011-01-20 -0.018180 -0.004295 -0.006243 -0.001295
2011-01-21 -0.017915 -0.011503 0.015838 0.002414
2011-01-24 0.032842 0.012727 -0.004896 0.005836
2011-01-25 0.011705 0.002513 0.001165 0.000263
2011-01-26 0.007176 0.011819 0.012416 0.004221
2011-01-27 -0.001861 0.002832 0.002683 0.002244
2011-01-28 -0.020716 -0.038475 -0.011084 -0.017852
2011-01-31 0.009580 -0.000734 0.021386 0.007663
2011-02-01 0.016828 0.009184 0.039985 0.016694
2011-02-02 -0.002058 -0.001820 -0.005943 -0.002723
2011-02-03 -0.002556 -0.010212 0.000366 0.002354
2011-02-04 0.008910 0.004422 -0.001829 0.002884
2011-02-07 0.015527 0.015407 0.007698 0.006240
2011-02-08 0.009435 0.002890 -0.006063 0.004185
2011-02-09 0.008333 -0.011167 -0.005124 -0.002786
2011-02-10 -0.010107 -0.016758 0.007848 0.000750
2011-02-11 0.006515 -0.009263 -0.004623 0.005507
2011-02-14 0.006529 -0.000374 0.025303 0.002385
... ... ... ... ...
2011-09-02 -0.018319 -0.015643 -0.018370 -0.025282
2011-09-06 0.015212 -0.011240 -0.013723 -0.007436
2011-09-07 0.011034 0.019208 0.035137 0.028646
2011-09-08 0.000547 0.008462 -0.011270 -0.010612
2011-09-09 -0.017337 -0.018307 -0.024856 -0.026705
2011-09-12 0.006517 0.005828 0.011688 0.006966
2011-09-13 0.012318 0.005794 -0.002645 0.009120
2011-09-14 0.012168 0.017665 0.013817 0.013480
2011-09-15 0.009401 0.018491 0.018860 0.017187
2011-09-16 0.019188 0.004817 0.007296 0.005707
2011-09-19 0.027790 0.003319 -0.011402 -0.009803
2011-09-20 0.004421 -0.008453 0.004206 -0.001661
2011-09-21 -0.003168 -0.036694 -0.027564 -0.029390
2011-09-22 -0.025040 -0.035783 -0.037932 -0.031883
2011-09-23 0.006172 0.000000 0.001011 0.006082
2011-09-26 -0.002795 0.015164 0.034771 0.023336
2011-09-27 -0.009698 0.009041 0.016592 0.010688
2011-09-28 -0.005635 -0.003506 -0.011521 -0.020691
2011-09-29 -0.016221 -0.005082 0.025114 0.008114
2011-09-30 -0.023683 -0.022004 -0.016919 -0.024974
2011-10-03 -0.017623 -0.014464 -0.020377 -0.028451
2011-10-04 -0.005606 0.033021 0.023612 0.022488
2011-10-05 0.015436 0.021705 0.015378 0.017866
2011-10-06 -0.002327 0.017381 -0.000811 0.018304
2011-10-07 -0.020060 -0.003417 -0.004466 -0.008163
2011-10-10 0.051406 0.026286 0.036977 0.034125
2011-10-11 0.029526 0.002227 -0.000131 0.000544
2011-10-12 0.004747 -0.001481 0.011669 0.009795
2011-10-13 0.015515 0.008160 -0.010238 -0.002974
2011-10-14 0.033225 0.003311 0.022784 0.017380
[199 rows x 4 columns])]
rets.index[0].year
2003
by_year.apply(spx_corr)
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
We could also compute inter-column correlations. Here we compute the correlation between Apple and Microsoft over the full sample (a year-by-year version is sketched after the output):
rets['AAPL'].corr(rets['MSFT'])
0.44469678861732237
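For the year-by-year correlation between the two stocks, we can reuse the by_year grouping from above and apply a small function to each chunk; a minimal sketch (output not shown):
# Sketch: per-year correlation between the AAPL and MSFT return columns
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))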
df
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
df[0].corr(df[1])
1.0
help(pd.Series.corr)
Help on function corr in module pandas.core.series:
corr(self, other, method='pearson', min_periods=None)
Compute correlation with `other` Series, excluding missing values
Parameters
----------
other : Series
method : {'pearson', 'kendall', 'spearman'}
* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations needed to have a valid result
Returns
-------
correlation : float
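As the help text shows, the method argument switches to a rank-based measure; a quick sketch (result not shown) on the same return data:
# Sketch: Spearman rank correlation instead of the default Pearson
rets['AAPL'].corr(rets['MSFT'], method='spearman')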
Example: Group-wise linear regression
You can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value. For example, I can define the following regress function (using the statsmodels econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:
import statsmodels.api as sm

def regress(data, yvar, xvars):
    # regress yvar on xvars (plus an intercept) for one group of data
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1
    result = sm.OLS(Y, X).fit()
    return result.params
by_year.apply(regress,'AAPL',['SPX'])
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
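Assigning the intercept column in place mutates each group's data; a slightly cleaner variant, sketched here with statsmodels' add_constant helper, fits the same model (the constant is just labeled const instead of intercept):
# Sketch: equivalent regression using sm.add_constant instead of an
# explicit 'intercept' column
def regress2(data, yvar, xvars):
    Y = data[yvar]
    X = sm.add_constant(data[xvars])
    return sm.OLS(Y, X).fit().params

by_year.apply(regress2, 'AAPL', ['SPX'])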
Pivot tables and cross-tabulation
help(pd.DataFrame.pivot_table)
Help on function pivot_table in module pandas.tools.pivot:
pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Create a spreadsheet-style pivot table as a DataFrame. The levels in the
pivot table will be stored in MultiIndex objects (hierarchical indexes) on
the index and columns of the result DataFrame
Parameters
----------
data : DataFrame
values : column to aggregate, optional
index : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list
can contain any of the other types (except list).
Keys to group by on the pivot table index. If an array is passed, it
is being used as the same manner as column values.
columns : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list
can contain any of the other types (except list).
Keys to group by on the pivot table column. If an array is passed, it
is being used as the same manner as column values.
aggfunc : function or list of functions, default numpy.mean
If list of functions passed, the resulting pivot table will have
hierarchical columns whose top level are the function names (inferred
from the function objects themselves)
fill_value : scalar, default None
Value to replace missing values with
margins : boolean, default False
Add all row / columns (e.g. for subtotal / grand totals)
dropna : boolean, default True
Do not include columns whose entries are all NaN
margins_name : string, default 'All'
Name of the row / column that will contain the totals
when margins is True.
Examples
--------
>>> df
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
>>> table = pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum)
>>> table
small large
foo one 1 4
two 6 NaN
bar one 5 4
two 6 7
Returns
-------
table : DataFrame
ex1=pd.read_csv('ex1.csv')
ex1
  sex  age  grade province
0   m   20     89        H
1   f   23     90        B
2   m   21     86        B
3   m   22     56        H
4   f   23     78        J
5   f   22     90        B
6   m   21    100        H
7   m   20     98        J
8   m   25     67        B
9   f   23     89        J
ex1.pivot_table(index='sex',aggfunc=[np.mean,np.sum,np.min,np.max],margins=True)
      mean              sum          amin                amax
       age      grade   age  grade   age grade province   age grade province
sex
f    22.75  86.750000  91.0  347.0    22    78        B    23    90        J
m    21.50  82.666667 129.0  496.0    20    56        B    25   100        J
All  22.00  84.300000 220.0  843.0    20    56        B    25   100        J
ex1.pivot_table(index='sex')
      age      grade
sex
f   22.75  86.750000
m   21.50  82.666667
Note: mean and sum drop the non-numeric column 'province', whereas min and max keep it (strings can still be ordered, so they have a minimum and maximum).
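If you only want the numeric aggregates, one option is to restrict the table to those columns via the values argument; a sketch on the same ex1 frame:
# Sketch: keep only the numeric columns when aggregating with min/max
ex1.pivot_table(values=['age','grade'], index='sex', aggfunc=[np.min, np.max])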
df=pd.DataFrame(np.arange(4).reshape((2,2)),index=['a','b'],columns=['one','two']);df
df.pivot_table(index='one',aggfunc=np.sum,margins=True)
     two
one
0    1.0
2    3.0
All  4.0
tips.head()
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
tips.pivot_table(index=['day','smoker'])
                 size       tip   tip_pct  total_bill
day  smoker
Fri  No      2.250000  2.812500  0.151650   18.420000
     Yes     2.066667  2.714000  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes     2.476190  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842  0.187250   24.120000
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941  3.030000  0.163863   19.190588
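pivot_table here is essentially a convenience wrapper around groupby; a sketch of the equivalent aggregation:
# Sketch: the same table produced with a plain groupby mean
tips.groupby(['day','smoker']).mean()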
tips.pivot_table(values=['tip_pct','size'],index=['time','day'],columns='smoker')
              tip_pct                size
smoker             No       Yes        No       Yes
time   day
Dinner Fri   0.139622  0.165347  2.000000  2.222222
       Sat   0.158048  0.147906  2.555556  2.476190
       Sun   0.160113  0.187250  2.929825  2.578947
       Thur  0.159744       NaN  2.000000       NaN
Lunch  Fri   0.187735  0.188937  3.000000  1.833333
       Thur  0.160311  0.163863  2.500000  2.352941
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker') # same result as above: the values columns can also be passed positionally
              tip_pct                size
smoker             No       Yes        No       Yes
time   day
Dinner Fri   0.139622  0.165347  2.000000  2.222222
       Sat   0.158048  0.147906  2.555556  2.476190
       Sun   0.160113  0.187250  2.929825  2.578947
       Thur  0.159744       NaN  2.000000       NaN
Lunch  Fri   0.187735  0.188937  3.000000  1.833333
       Thur  0.160311  0.163863  2.500000  2.352941
We could augment this table to include partial totals by passing margins=True. This has the effect of adding All row and column labels, with the corresponding values being the group statistics for all the data within a single tier:
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
              tip_pct                         size
smoker             No       Yes       All       No       Yes       All
time   day
Dinner Fri   0.139622  0.165347  0.158916  2.000000  2.222222  2.166667
       Sat   0.158048  0.147906  0.153152  2.555556  2.476190  2.517241
       Sun   0.160113  0.187250  0.166897  2.929825  2.578947  2.842105
       Thur  0.159744       NaN  0.159744  2.000000       NaN  2.000000
Lunch  Fri   0.187735  0.188937  0.188765  3.000000  1.833333  2.000000
       Thur  0.160311  0.163863  0.161301  2.500000  2.352941  2.459016
All          0.159328  0.163196  0.160803  2.668874  2.408602  2.569672
Here, the All values are means that ignore the smoker versus non-smoker split (the All columns) and either of the two levels of grouping on the rows (the All row).
If some combinations are empty, you may wish to pass a fill_value:
tips.pivot_table('tip_pct',index=['time','size','smoker'],columns='day',aggfunc='mean',fill_value=0)
day                      Fri       Sat       Sun      Thur
time   size smoker
Dinner 1    No      0.000000  0.137931  0.000000  0.000000
            Yes     0.000000  0.325733  0.000000  0.000000
       2    No      0.139622  0.162705  0.168859  0.159744
            Yes     0.171297  0.148668  0.207893  0.000000
       3    No      0.000000  0.154661  0.152663  0.000000
            Yes     0.000000  0.144995  0.152660  0.000000
       4    No      0.000000  0.150096  0.148143  0.000000
            Yes     0.117750  0.124515  0.193370  0.000000
       5    No      0.000000  0.000000  0.206928  0.000000
            Yes     0.000000  0.106572  0.065660  0.000000
       6    No      0.000000  0.000000  0.103799  0.000000
Lunch  1    No      0.000000  0.000000  0.000000  0.181728
            Yes     0.223776  0.000000  0.000000  0.000000
       2    No      0.000000  0.000000  0.000000  0.166005
            Yes     0.181969  0.000000  0.000000  0.158843
       3    No      0.187735  0.000000  0.000000  0.084246
            Yes     0.000000  0.000000  0.000000  0.204952
       4    No      0.000000  0.000000  0.000000  0.138919
            Yes     0.000000  0.000000  0.000000  0.155410
       5    No      0.000000  0.000000  0.000000  0.121389
       6    No      0.000000  0.000000  0.000000  0.173706
Cross-tabulation: crosstab
A cross-tabulation is a special case of a pivot table that computes group frequencies.
data=pd.read_csv('ex2.csv')
data
   Sample Nationality Handeness
0       1         USA         R
1       2       JAPAN         L
2       3         USA         R
3       4       JAPAN         R
4       5       JAPAN         L
5       6       JAPAN         R
6       7         USA         R
7       8         USA         L
8       9       JAPAN         R
9      10         USA         R
pd.crosstab(data.Nationality,data.Handeness,margins=True)
Handeness    L  R  All
Nationality
JAPAN        2  3    5
USA          1  4    5
All          3  7   10
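crosstab is shorthand for a count-aggregating pivot table; a rough equivalent, sketched here by counting the Sample column, would be:
# Sketch: approximately the same frequency table via pivot_table
data.pivot_table(values='Sample', index='Nationality', columns='Handeness',
                 aggfunc='count', margins=True)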
pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
help(pd.crosstab)
Help on function crosstab in module pandas.tools.pivot:
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
Compute a simple cross-tabulation of two (or more) factors. By default
computes a frequency table of the factors unless an array of values and an
aggregation function are passed
Parameters
----------
index : array-like, Series, or list of arrays/Series
Values to group by in the rows
columns : array-like, Series, or list of arrays/Series
Values to group by in the columns
values : array-like, optional
Array of values to aggregate according to the factors.
Requires `aggfunc` be specified.
aggfunc : function, optional
If specified, requires `values` be specified as well
rownames : sequence, default None
If passed, must match number of row arrays passed
colnames : sequence, default None
If passed, must match number of column arrays passed
margins : boolean, default False
Add row/column margins (subtotals)
dropna : boolean, default True
Do not include columns whose entries are all NaN
normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
Normalize by dividing all values by the sum of values.
- If passed 'all' or `True`, will normalize over all values.
- If passed 'index' will normalize over each row.
- If passed 'columns' will normalize over each column.
- If margins is `True`, will also normalize margin values.
.. versionadded:: 0.18.1
Notes
-----
Any Series passed will have their name attributes used unless row or column
names for the cross-tabulation are specified.
Any input passed containing Categorical data will have **all** of its
categories included in the cross-tabulation, even if the actual data does
not contain any instances of a particular category.
In the event that there aren't overlapping indexes an empty DataFrame will
be returned.
Examples
--------
>>> a
array([foo, foo, foo, foo, bar, bar,
bar, bar, foo, foo, foo], dtype=object)
>>> b
array([one, one, one, two, one, one,
one, two, two, two, one], dtype=object)
>>> c
array([dull, dull, shiny, dull, dull, shiny,
shiny, dull, shiny, shiny, shiny], dtype=object)
>>> crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b one two
c dull shiny dull shiny
a
bar 1 2 1 0
foo 2 2 1 2
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar) # 'c' and 'f' are not represented in the data,
# but they still will be counted in the output
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
Returns
-------
crosstab : DataFrame
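The normalize parameter documented above turns the counts into proportions; for example (a sketch, output not shown):
# Sketch: share of smokers vs non-smokers within each (time, day) group
pd.crosstab([tips.time, tips.day], tips.smoker, normalize='index')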
#####
May you conquer new ground inch by inch, and renew yourself bit by bit.
#####