After loading, merging, and preparing a dataset, you may need to compute group statistics or pivot tables for reporting or visualization purposes. pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.
split-apply-combine:
In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
Once the first step is done, a function is applied to each group, producing a new value.
Finally, the results of all those function applications are combined into a result object.
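The df DataFrame used throughout the examples below is not constructed in this excerpt; a minimal sketch reproducing its shape (the random values will differ from the outputs shown) could be:
import numpy as np
import pandas as pd

# hypothetical reconstruction of the example DataFrame: two key columns and two numeric columns
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})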
help(pd.Series.groupby)
Help on function groupby in module pandas.core.generic:

groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
    Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

    Parameters
    ----------
    by : mapping function / list of functions, dict, Series, or tuple /
        list of column names.
        Called on each element of the object index to determine the groups.
        If a dict or Series is passed, the Series or dict VALUES will be
        used to determine the groups
    axis : int, default 0
    level : int, level name, or sequence of such, default None
        If the axis is a MultiIndex (hierarchical), group by a particular
        level or levels
    as_index : boolean, default True
        For aggregated output, return object with group labels as the
        index. Only relevant for DataFrame input. as_index=False is
        effectively "SQL-style" grouped output
    sort : boolean, default True
        Sort group keys. Get better performance by turning this off.
        Note this does not influence the order of observations within each
        group. groupby preserves the order of rows within each group.
    group_keys : boolean, default True
        When calling apply, add group keys to index to identify pieces
    squeeze : boolean, default False
        reduce the dimensionality of the return type if possible,
        otherwise return a consistent type

    Examples
    --------
    DataFrame results

    >>> data.groupby(func, axis=0).mean()
    >>> data.groupby(['col1', 'col2'])['col3'].mean()

    DataFrame with hierarchical index

    >>> data.groupby(['col1', 'col2']).mean()

    Returns
    -------
    GroupBy object
The first method:
Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:
grouped=df['data1'].groupby(df['key1'])
help(grouped)
Help on SeriesGroupBy in module pandas.core.groupby object:

class SeriesGroupBy(GroupBy)
 |  Class for grouping and aggregating relational data. See aggregate,
 |  transform, and apply functions on this object.
 |
 |  It's easiest to use obj.groupby(...) to use GroupBy, but you can also do:
 |
 |      grouped = groupby(obj, ...)
 |
 |  Parameters
 |  ----------
 |  obj : pandas object
 |  axis : int, default 0
 |  level : int, default None
 |      Level of MultiIndex
 |  groupings : list of Grouping objects
 |      Most users should ignore this
 |  exclusions : array-like, optional
 |      List of columns to exclude
 |  name : string
 |      Most users should ignore this
 |
 |  Notes
 |  -----
 |  After grouping, see aggregate, apply, and transform functions. Here are
 |  some other brief notes about usage. When grouping by multiple groups, the
 |  result index will be a MultiIndex (hierarchical) by default.
 |
 |  Iteration produces (key, group) tuples, i.e. chunking the data by group. So
 |  you can write code like:
 |
 |      grouped = obj.groupby(keys, axis=axis)
 |      for key, group in grouped:
 |          # do something with the data
 |
 |  Function calls on GroupBy, if not specially implemented, "dispatch" to the
 |  grouped data. So if you group a DataFrame and wish to invoke the std()
 |  method on each group, you can simply do:
 |
 |      df.groupby(mapper).std()
 |
 |  rather than
 |
 |      df.groupby(mapper).aggregate(np.std)
 |
 |  You can pass arguments to these "wrapped" functions, too.
 |
 |  See the online documentation for full exposition on these topics and much
 |  more.
 |
 |  Returns
 |  -------
 |  **Attributes**
 |  groups : dict
 |      {group name -> group labels}
 |  len(grouped) : int
 |      Number of groups
 |
 |  Method resolution order:
 |      SeriesGroupBy
 |      GroupBy
 |      _GroupBy
 |      pandas.core.base.PandasObject
 |      pandas.core.base.StringMixin
 |      pandas.core.base.SelectionMixin
 |      builtins.object
 |
 |  Methods defined here include agg/aggregate, count, filter, nlargest,
 |  nsmallest, nunique, transform, and value_counts, plus data descriptors
 |  such as all, any, corr, cov, cummax, cummin, describe, diff, dtype,
 |  fillna, hist, idxmax, idxmin, mad, name, pct_change, quantile, rank,
 |  skew, take, tshift, and unique.
 |
 |  Methods inherited from GroupBy include backfill/bfill, cumcount, cumprod,
 |  cumsum, expanding, ffill/pad, first, head, last, max, mean, median, min,
 |  nth, ohlc, prod, resample, rolling, sem, shift, size, std, sum, tail,
 |  and var.
 |
 |  Methods inherited from _GroupBy include __getattr__, __init__, __iter__
 |  (the groupby iterator, yielding a (name, subsetted object) pair for each
 |  group), __len__, apply, and get_group, plus the data descriptors groups,
 |  indices, ngroups, and plot.
grouped.mean()
key1
a   -0.198323
b   -0.613947
Name: data1, dtype: float64
df['data1'].groupby(['a','a','b','b','a']).mean()
a   -0.198323
b   -0.613947
Name: data1, dtype: float64
The important thing above is that the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the key1 column.
The result index has the name 'key1' because the DataFrame column df['key1'] did.
If instead we had passed multiple arrays as a list, we'd get something different:
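A minimal sketch of grouping by a list of two arrays (the resulting Series then has a hierarchical index made up of the unique pairs of keys; the exact values will differ from the other outputs in this section):
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means              # Series with a MultiIndex (key1, key2)
means.unstack()    # pivot the inner level into columns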
The second method:
Frequently the grouping information is found in the same DataFrame as the data you want to work on. In this case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:
df.groupby('key1').mean()
          data1     data2
key1
a     -0.198323 -0.039899
b     -0.613947  1.384352
df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2
a    one  -0.699514 -0.386868
     two   0.804061  0.654039
b    one  -1.672009  1.780028
     two   0.444115  0.988676
You may have noticed in the first case of df.groupby('key1').mean() that there is no key2 column in the result. Because df['key2'] is not numeric data, it is said to be a nuisance column, which is therefore excluded from the result. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset.
help(df.groupby(['key1','key2']).size)
Help on method size in module pandas.core.groupby:

size() method of pandas.core.groupby.DataFrameGroupBy instance
    Compute group sizes

    See also
    --------
    pandas.Series.groupby
    pandas.DataFrame.groupby
    pandas.Panel.groupby
A generally useful GroupBy method is size, which returns a Series containing group sizes:
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
We can see from the above that the size of group ('a', 'one') is indeed 2.
Iterating over groups
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
for name,group in df.groupby('key1'):
print(name)
print(group)
a
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
4 -1.110942  0.581883    a  one
b
      data1     data2 key1 key2
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
In the case of multiple keys, the first element in the tuple will be a tuple of key values.
for (k1,k2),group in df.groupby(['key1','key2']):
print((k1,k2))
print(group)
('a', 'one')
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
4 -1.110942  0.581883    a  one
('a', 'two')
      data1     data2 key1 key2
1  0.804061  0.654039    a  two
('b', 'one')
      data1     data2 key1 key2
2 -1.672009  1.780028    b  one
('b', 'two')
      data1     data2 key1 key2
3  0.444115  0.988676    b  two
df
      data1     data2 key1 key2
0 -0.288087 -1.355619    a  one
1  0.804061  0.654039    a  two
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
4 -1.110942  0.581883    a  one
A useful recipe is to compute a dict of the data pieces as a one-liner:
pieces=dict(list(df.groupby('key1')))
pieces
{'a': data1 data2 key1 key2
0 -0.288087 -1.355619 a one
1 0.804061 0.654039 a two
4 -1.110942 0.581883 a one, 'b': data1 data2 key1 key2
2 -1.672009 1.780028 b one
3 0.444115 0.988676 b two}
pieces['b']
      data1     data2 key1 key2
2 -1.672009  1.780028    b  one
3  0.444115  0.988676    b  two
list(df.groupby('key1'))
[('a',       data1     data2 key1 key2
  0 -0.288087 -1.355619    a  one
  1  0.804061  0.654039    a  two
  4 -1.110942  0.581883    a  one), ('b',       data1     data2 key1 key2
  2 -1.672009  1.780028    b  one
  3  0.444115  0.988676    b  two)]
grouped=df.groupby(df.dtypes,axis=1) # notice axis=1 here: df is split into two groups of columns according to dtype
for dtype,group in grouped:
print(dtype)
print(group)
float64
      data1     data2
0 -0.288087 -1.355619
1  0.804061  0.654039
2 -1.672009  1.780028
3  0.444115  0.988676
4 -1.110942  0.581883
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names.
Suppose you want to group by the length of the names; while you could compute an array of string lengths, it's simpler to just pass the len function.
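The people DataFrame used below is not built in this excerpt; a minimal sketch with the same shape (a 5x5 DataFrame of random numbers indexed by names of length 3, 5, and 6) might be:
# hypothetical reconstruction: index names of lengths 3, 5 and 6, columns a-e
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])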
people.groupby(len).sum()
          a         b         c         d         e
3  0.985588  0.149098 -0.503394  0.098927  0.847068
5  0.965853 -1.694954  2.142343 -0.969325  1.129057
6 -0.321174 -1.400983 -0.529033  0.037673  1.221623
Mixing functions with arrays, dicts, or Series is not a problem, as everything gets converted to arrays internally:
key_list=['one','one','one','two','two']
key_test=[1,1,1,1,1]
people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
  two  0.434581  1.189071  0.455694 -0.929888  1.193285
5 one  0.965853 -1.694954  2.142343 -0.969325  1.129057
6 two -0.321174 -1.400983 -0.529033  0.037673  1.221623
people.groupby([len,key_test]).min()
            a         b         c         d         e
3 1 -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
5 1  0.965853 -1.694954  2.142343 -0.969325  1.129057
6 1 -0.321174 -1.400983 -0.529033  0.037673  1.221623
Grouping by Index levels
A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index.
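A minimal sketch of grouping by an index level, assuming a hypothetical DataFrame with hierarchically indexed columns (level names 'cty' and 'tenor'):
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df.groupby(level='cty', axis=1).count()   # aggregate over the 'cty' level of the columns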
Aggregations refer to any data transformation that produces scalar values from arrays.
description=['Number of non-NA values in the group',
             'Sum of non-NA values',
             'Mean of non-NA values',
             'Arithmetic median of non-NA values',
             'Unbiased (n-1 denominator) standard deviation and variance',
             'Minimum and maximum of non-NA values',
             'Product of non-NA values',
             'First and last non-NA value']
Tip: custom aggregation functions are generally much slower than the optimized functions described in the list above. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate chunks.
As we have seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once.
If you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names.
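The grouped and group_pct objects used below are not defined in this excerpt; a plausible setup on the tips dataset (shown later in this section) would be:
# hypothetical setup: group the tips data by day and smoker, then select the tip_pct column
grouped = tips.groupby(['day', 'smoker'])
group_pct = grouped['tip_pct']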
group_pct.agg([('foo','mean'),('bar',np.std)])
                   foo       bar
day  smoker
Fri  No       0.151650  0.028123
     Yes      0.174783  0.051293
Sat  No       0.158048  0.039767
     Yes      0.147906  0.061375
Sun  No       0.160113  0.042347
     Yes      0.187250  0.154134
Thur No       0.160298  0.038774
     Yes      0.163863  0.039389
With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:
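The corresponding cell is not reproduced here; a sketch, assuming the three statistics are count, mean, and max:
functions = ['count', 'mean', 'max']          # the three statistics (assumed)
result = grouped['tip_pct', 'total_bill'].agg(functions)
result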
As you can see, the resulting DataFrame has hierarchical columns, the same as you would get aggregating each column separately and using concat to glue the results together with the column names as the keys argument.
As before, a list of tuples with custom names can be passed:
ftuples=[('Durchs','mean'),('adkfd',np.var)]
grouped['tip_pct','total_bill'].agg(ftuples)
              tip_pct            total_bill
               Durchs     adkfd      Durchs       adkfd
day  smoker
Fri  No      0.151650  0.000791   18.420000   25.596333
     Yes     0.174783  0.002631   16.813333   82.562438
Sat  No      0.158048  0.001581   19.661778   79.908965
     Yes     0.147906  0.003767   21.276667  101.387535
Sun  No      0.160113  0.001793   20.506667   66.099980
     Yes     0.187250  0.023757   24.120000  109.046044
Thur No      0.160298  0.001503   17.113111   59.625081
     Yes     0.163863  0.001551   19.190588   69.808518
Now, suppose you want to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:
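The cell itself is not shown here; a sketch of the two usual forms (the column names and functions are illustrative):
grouped.agg({'tip': np.max, 'size': 'sum'})
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'],
             'size': 'sum'})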
A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
Returning aggregated data without row indexes
In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this is not always desirable, you can disable this behaviour in most cases by passing as_index=False to groupby.
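A minimal sketch of the as_index=False form on the tips data (columns assumed from the table below):
tips.groupby(['day', 'smoker'], as_index=False).mean()   # group keys become ordinary columns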
The most general-purpose GroupBy method is apply, which is the subject of the rest of this section.
tips.head()
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
Returning to the tipping dataset from before, suppose you want to select the top five tip_pct values by group. First, write a function that selects the rows with the largest values in a particular column:
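The function itself is not reproduced in this excerpt; a sketch of one way to write it and apply it per group:
def top(df, n=5, column='tip_pct'):
    # return the rows with the n largest values in `column`
    return df.sort_values(by=column)[-n:]

tips.groupby('smoker').apply(top)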
In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby.
Notice the difference between agg and apply: agg aggregates each column of the GroupBy object into a single value per group, while apply applies a function to each group as a whole, so the result need not be a single value as with agg.
tips.groupby('smoker').agg('max')
        total_bill   tip   day   time  size   tip_pct
smoker
No           48.33   9.0  Thur  Lunch     6  0.291990
Yes          50.81  10.0  Thur  Lunch     5  0.710345
tips.groupby('smoker').apply(max)
        total_bill   tip smoker   day   time  size   tip_pct
smoker
No           48.33   9.0     No  Thur  Lunch     6  0.291990
Yes          50.81  10.0    Yes  Thur  Lunch     5  0.710345
The above two methods return essentially the same result. To some extent, agg is a specific version of apply.
Quantile and bucket analysis
Combining qcut or cut with groupby makes it convenient to perform bucket or quantile analysis on a dataset.
Help on function cut in module pandas.tools.tile:

cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
    Return indices of half-open bins to which each value of `x` belongs.

    Parameters
    ----------
    x : array-like
        Input array to be binned. It has to be 1-dimensional.
    bins : int or sequence of scalars
        If `bins` is an int, it defines the number of equal-width bins in the
        range of `x`. However, in this case, the range of `x` is extended
        by .1% on each side to include the min or max values of `x`. If
        `bins` is a sequence it defines the bin edges allowing for
        non-uniform bin width. No extension of the range of `x` is done in
        this case.
    right : bool, optional
        Indicates whether the bins include the rightmost edge or not. If
        right == True (the default), then the bins [1,2,3,4] indicate
        (1,2], (2,3], (3,4].
    labels : array or boolean, default None
        Used as labels for the resulting bins. Must be of the same length as
        the resulting bins. If False, return only integer indicators of the
        bins.
    retbins : bool, optional
        Whether to return the bins or not. Can be useful if bins is given
        as a scalar.
    precision : int
        The precision at which to store and display the bins labels
    include_lowest : bool
        Whether the first interval should be left-inclusive or not.

    Returns
    -------
    out : Categorical or Series or array of integers if labels is False
        The return type (Categorical or Series) depends on the input: a Series
        of type category if input is a Series else Categorical. Bins are
        represented as categories when categorical data is returned.
    bins : ndarray of floats
        Returned only if `retbins` is True.

    Notes
    -----
    The `cut` function can be useful for going from a continuous variable to
    a categorical variable. For example, `cut` could convert ages to groups
    of age ranges.

    Any NA values will be NA in the result. Out of bounds values will be NA in
    the resulting Categorical object

    Examples
    --------
    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
    ([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533],
      (6.533, 9.7], (0.191, 3.367]]
    Categories (3, object): [(0.191, 3.367] < (3.367, 6.533] < (6.533, 9.7]],
    array([ 0.1905    ,  3.36666667,  6.53333333,  9.7       ]))
    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3,
               labels=["good","medium","bad"])
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]
    >>> pd.cut(np.ones(5), 4, labels=False)
    array([1, 1, 1, 1, 1], dtype=int64)
help(pd.qcut)
Help on function qcut in module pandas.tools.tile:

qcut(x, q, labels=None, retbins=False, precision=3)
    Quantile-based discretization function. Discretize variable into
    equal-sized buckets based on rank or based on sample quantiles. For example
    1000 values for 10 quantiles would produce a Categorical object indicating
    quantile membership for each data point.

    Parameters
    ----------
    x : ndarray or Series
    q : integer or array of quantiles
        Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately
        array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles
    labels : array or boolean, default None
        Used as labels for the resulting bins. Must be of the same length as
        the resulting bins. If False, return only integer indicators of the
        bins.
    retbins : bool, optional
        Whether to return the bins or not. Can be useful if bins is given
        as a scalar.
    precision : int
        The precision at which to store and display the bins labels

    Returns
    -------
    out : Categorical or Series or array of integers if labels is False
        The return type (Categorical or Series) depends on the input: a Series
        of type category if input is a Series else Categorical. Bins are
        represented as categories when categorical data is returned.
    bins : ndarray of floats
        Returned only if `retbins` is True.

    Notes
    -----
    Out of bounds values will be NA in the resulting Categorical object

    Examples
    --------
    >>> pd.qcut(range(5), 4)
    [[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]]
    Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]]
    >>> pd.qcut(range(5), 3, labels=["good","medium","bad"])
    [good, good, medium, bad, bad]
    Categories (3, object): [good < medium < bad]
    >>> pd.qcut(range(5), 4, labels=False)
    array([0, 0, 1, 2, 3], dtype=int64)
The Categorical object returned by cut can be passed directly to groupby, so we could compute a set of statistics for the data2 column like so:
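The cell itself is not reproduced here; a minimal sketch, assuming a hypothetical frame with data1 and data2 columns:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)        # four equal-length buckets

def get_stats(group):
    return pd.Series({'min': group.min(), 'max': group.max(),
                      'count': group.count(), 'mean': group.mean()})

frame.data2.groupby(quartiles).apply(get_stats).unstack()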
Those were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut. We will pass labels=False to just get quantile numbers:
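Again a sketch, reusing the hypothetical frame and get_stats from above:
grouping = pd.qcut(frame.data1, 10, labels=False)   # decile numbers 0-9
frame.data2.groupby(grouping).apply(get_stats).unstack()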
Example: Filling missing values with group-specific values
When cleaning up missing data, in some cases you will remove data observations using dropna, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data. fillna is the right tool to use:
Suppose you need the fill value to vary by group. One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.
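The data Series and group_key used below are not constructed in this excerpt; a sketch consistent with the output shown (values will differ) would be:
states = ['Ohio', 'New york', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan   # introduce some missing values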
We can fill the NA values using the group means like so:
fill_mean=lambda g : g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio         -0.953675
New york      0.531946
Vermont       0.199033
Florida       1.018829
Oregon       -0.539930
Nevada       -0.704578
California   -0.869226
Idaho        -0.704578
dtype: float64
In another case, you might have predefined fill values in your code that vary by group.
fill_values={'East':0.5,'West':-1}
fill_func=lambda g:g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Ohio         -0.953675
New york      0.531946
Vermont       0.500000
Florida       1.018829
Oregon       -0.539930
Nevada       -1.000000
California   -0.869226
Idaho        -1.000000
dtype: float64
Example: Random sampling and permutation
Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the 'draws'; here we use the sample method for Series.
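A minimal sketch, assuming a hypothetical deck-of-cards Series; the point is that sample can be applied per group:
# hypothetical example: draw two random cards from each suit
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(num) + suit for suit in suits for num in base_names]
deck = pd.Series(card_val, index=cards)

get_suit = lambda card: card[-1]          # last letter is the suit
deck.groupby(get_suit).apply(lambda g: g.sample(2))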
Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, are possible.
[('a',   category      data   weights
 0        a  0.902349  1.730949
 1        a  1.287855  0.768677
 2        a -0.751881 -0.571836
 3        a  1.208645  0.229177), ('b',   category      data   weights
 4        b  0.986127 -0.262788
 5        b  1.566069 -0.252579
 6        b  0.285959  0.184662
 7        b  1.059962  1.804769)]
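The grouped weighted average itself is not shown in this excerpt; a sketch, assuming the DataFrame with category, data, and weights columns displayed above is called df:
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)     # one weighted average per category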
Help on function read_csv inmodule pandas.io.parsers:
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
ReadCSV (comma-separated) file into DataFrameAlso supports optionally iterating or breaking of the file
into chunks.
Additional help can be found in the `online docs for IO Tools
<http://pandas.pydata.org/pandas-docs/stable/io.html>`_.
Parameters
----------
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or anyobjectwith a read() method (such as a file handle or StringIO)
Thestring could be a URL. ValidURL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. For instance, a local file could
be file ://localhost/path/to/table.csv
sep : str, default','Delimiter to use. If sep is None, will try to automatically determine
this. Separators longer than 1 character and different from``'\s+'`` will
be interpreted as regular expressions, will force use of the python parsing
engine and will ignore quotes in the data. Regexexample: ``'\r\t'``
delimiter : str, default``None``Alternative argument name for sep.
delim_whitespace : boolean, defaultFalseSpecifies whether or not whitespace (e.g. ``' '`` or ``' '``) will be
used as the sep. Equivalent to setting ``sep='\s+'``. Ifthis option
is set to True, nothing should be passed infor the ``delimiter``
parameter.
.. versionadded:: 0.18.1 support for the Python parser.
header : int or list of ints, default'infer'Rownumber(s) to use as the column names, and the start of the data.
Default behavior is asif set to 0if no ``names`` passed, otherwise
``None``. Explicitly pass ``header=0`` to be able to replace existing
names. The header can be a list of integers that specify row locations for
a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
specified will be skipped (e.g. 2inthis example is skipped). Note that
this parameter ignores commented lines and empty lines if``skip_blank_lines=True``, so header=0 denotes the first line of data
rather than the first line of the file.
names : array-like, defaultNoneListof column names to use. If file contains no header row, then you
should explicitly pass header=None. Duplicatesinthis list are not
allowed unless mangle_dupe_cols=True, which is the default.
index_col : int or sequence or False, defaultNoneColumn to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
usecols : array-like, defaultNoneReturn a subset of the columns. All elements inthis array must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in`names` or
inferred from the document header row(s). For example, a valid `usecols`
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Usingthis parameter
results in much faster parsing time and lower memory usage.
as_recarray : boolean, defaultFalseDEPRECATED: this argument will be removed in a future version. Please call
`pd.read_csv(...).to_records()` instead.
Return a NumPy recarray instead of a DataFrame after parsing the data.
If set to True, this option takes precedence over the `squeeze` parameter.
In addition, as row indices are not available in such a format, the
`index_col` parameter will be ignored.
squeeze : boolean, defaultFalseIf the parsed data only contains one column then return a Series
prefix : str, defaultNonePrefix to add to column numbers when no header, e.g. 'X'forX0, X1, ...
mangle_dupe_cols : boolean, defaultTrueDuplicate columns will be specified as'X.0'...'X.N', rather than
'X'...'X'. PassinginFalse will cause data to be overwritten if there
are duplicate names in the columns.
dtype : Type name or dict of column -> type, defaultNoneDatatypefor data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupportedwith engine='python'). Use`str` or `object` to preserve and
not interpret dtype.
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, defaultNoneDictof functions for converting values in certain columns. Keys can either
be integers or column labels
true_values : list, defaultNoneValues to consider asTrue
false_values : list, defaultNoneValues to consider asFalse
skipinitialspace : boolean, defaultFalseSkip spaces after delimiter.
skiprows : list-like or integer, defaultNoneLine numbers to skip (0-indexed) or numberof lines to skip (int)
at the start of the file
skipfooter : int, default0Numberof lines at bottom of file to skip (Unsupportedwith engine='c')
skip_footer : int, default0DEPRECATED: use the `skipfooter` parameter instead, as they are identical
nrows : int, defaultNoneNumberof rows of file to read. Usefulfor reading pieces of large files
na_values : scalar, str, list-like, or dict, defaultNoneAdditional strings to recognize asNA/NaN. If dict passed, specific
per-column NA values. Bydefault the following values are interpreted asNaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'`.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they're appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False
* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
'foo'
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format
of the datetime strings in the columns, and if it can be inferred, switch
to a faster method of parsing them. In some cases this can increase the
parsing speed by ~5-10x.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses ``dateutil.parser.parser`` to do the
conversion. Pandas will try to call date_parser in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array
and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with
``get_chunk()``.
chunksize : int, default None
Return TextFileReader object for iteration. `See IO Tools docs for more
information
<http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ on ``iterator`` and ``chunksize``.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',
'.zip', or 'xz', respectively, and no decompression otherwise. If using
'zip', the ZIP file must contain only one data file to be read in.
Set to None for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
thousands : str, default None
Thousands separator
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
float_precision : string, default None
Specifies which converter the C engine should use for floating-point
values. The options are `None` for the ordinary converter,
`high` for the high-precision converter, and `round_trip` for the
round-trip converter.
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote : boolean, default ``True``
When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
whether or not to interpret two consecutive quotechar elements INSIDE a
field as a single ``quotechar`` element.
escapechar : str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning
of a line, the line will be ignored altogether. This parameter must be a
single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header` but not by
`skiprows`. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3' with
`header=0` will result in 'a,b,c' being treated as the header.
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
<https://docs.python.org/3/library/codecs.html#standard-encodings>`_
dialect : str or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char.
See csv.Dialect documentation for more details
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a MultiIndex on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will be dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set False, or specify the type with the `dtype` parameter.
Note that the entire file is read into a single DataFrame regardless;
use the `chunksize` or `iterator` parameter to return the data in chunks.
(Only valid with C parser)
buffer_lines : int, default None
DEPRECATED: this argument will be removed in a future version because its
value is not respected by the parser
compact_ints : boolean, default False
DEPRECATED: this argument will be removed in a future version
If compact_ints is True, then for any column that is of integer dtype,
the parser will attempt to cast it as the smallest integer dtype possible,
either signed or unsigned depending on the specification from the
`use_unsigned` parameter.
use_unsigned : boolean, default False
DEPRECATED: this argument will be removed in a future version
If integer columns are being compacted (i.e. `compact_ints=True`), specify
whether the column should be compacted to the smallest signed or unsigned
integer dtype.
memory_map : boolean, default False
If a filepath is provided for `filepath_or_buffer`, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
Returns
-------
result : DataFrame or TextParser
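The docstring above lists many options; as a quick illustration, here is a hedged sketch combining a few of the commonly used ones (the file name and column names below are hypothetical, not from this document):
import pandas as pd
# Hypothetical file 'prices.csv' with a 'date' column; parse_dates/index_col
# produce a DatetimeIndex, na_values adds extra missing-value markers, and
# chunksize returns a TextFileReader that yields DataFrames piece by piece.
reader = pd.read_csv('prices.csv',
                     parse_dates=['date'],
                     index_col='date',
                     na_values=['N/A', 'missing'],
                     chunksize=10000)
for chunk in reader:
    print(chunk.shape)   # each chunk is an ordinary DataFrame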
Help on function info in module pandas.core.frame:
info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
Concise summary of a DataFrame.
Parameters
----------
verbose : {None, True, False}, optional
Whether to print the full summary.
None follows the `display.max_info_columns` setting.
True or False overrides the `display.max_info_columns` setting.
buf : writable buffer, defaults to sys.stdout
max_cols : int, default None
Determines whether full summary or short summary is printed.
None follows the `display.max_info_columns` setting.
memory_usage : boolean/string, default None
Specifies whether total memory usage of the DataFrame
elements (including index) should be displayed. None follows
the `display.memory_usage` setting. True or False overrides
the `display.memory_usage` setting. A value of 'deep' is equivalent
of True, with deep introspection. Memory usage is shown in
human-readable units (base-2 representation).
null_counts : boolean, default None
Whether to show the non-null counts
- If None, then only show if the frame is smaller than
max_info_rows and max_info_columns.
- If True, always show counts.
- If False, never show counts.
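As a minimal sketch of how info() is typically called (the toy frame below is made up for illustration):
import pandas as pd
import numpy as np
frame = pd.DataFrame({'a': np.arange(5), 'b': ['x', None, 'y', 'z', 'w']})
frame.info()                      # dtypes, non-null counts, memory usage
frame.info(memory_usage='deep')   # deep introspection of object columns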
One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from the percent changes) with SPX. As one way to do this, we first create a function that computes the pairwise correlation of each column with the SPX column:
spx_corr=lambda x:x.corrwith(x['SPX'])
help(pd.DataFrame.pct_change)
Help on function pct_change in module pandas.core.generic:
pct_change(self, periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
Percent change over given number of periods.
Parameters
----------
periods : int, default 1
Periods to shift for forming percent change
fill_method : str, default 'pad'
How to handle NAs before computing percent changes
limit : int, default None
The number of consecutive NAs to fill before stopping
freq : DateOffset, timedelta, or offset alias string, optional
Increment to use from time series API (e.g. 'M' or BDay())
Returns
-------
chg : NDFrame
Notes
-----
By default, the percentage change is calculated along the stat
axis: 0, or ``Index``, for ``DataFrame`` and 1, or ``minor`` for
``Panel``. You can change this with the ``axis`` keyword argument.
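For context, the returns DataFrame rets used below could be built from a price DataFrame with pct_change; close_px stands for the price data prepared earlier in the workflow and is an assumption on my part:
# Assumption: close_px is a DataFrame of daily closing prices with columns
# such as 'AAPL', 'MSFT' and 'SPX'.
rets = close_px.pct_change().dropna()   # daily percent changes; drop the first all-NaN row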
Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each datetime label:
get_year=lambda x:x.year
by_year=rets.groupby(get_year) # get_year will be applied to rets.index
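Applying spx_corr to each yearly group then produces one row of correlations per year; a short sketch reusing rets and by_year from above:
yearly_corr = by_year.apply(spx_corr)   # one row per year, one column per ticker
yearly_corr.head()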
We could also compute inter-column correlations. Here we compute the overall correlation between the Apple and Microsoft returns; a year-by-year version reuses by_year, as shown in the sketch after the output:
rets['AAPL'].corr(rets['MSFT'])
0.44469678861732237
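A year-by-year version of this pairwise correlation can be obtained with the same by_year grouping; a sketch under that assumption:
# Correlation between the AAPL and MSFT return columns, computed within each year.
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))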
df
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
df[0].corr(df[1])
1.0
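The correlation is exactly 1.0 because each column of this small frame is a linear function of the others (every row just adds a constant offset). The construction is not shown in the original; a sketch assuming the frame came from a simple arange:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4))   # columns 0..3, rows 0..2
df[0].corr(df[1])                                # 1.0: column 1 is column 0 plus a constant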
help(pd.Series.corr)
Help on function corr in module pandas.core.series:
corr(self, other, method='pearson', min_periods=None)
Compute correlation with `other` Series, excluding missing values
Parameters
----------
other : Series
method : {'pearson', 'kendall', 'spearman'}
* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations needed to have a valid result
Returns
-------
correlation : float
Example: Group-wise Linear Regression
You can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value. For example, we can define a regress function (using the statsmodels econometrics library) that executes an ordinary least squares (OLS) regression on each chunk of data:
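The regress function itself is not reproduced in this section; a sketch of what it could look like with statsmodels (the yvar/xvars parameter names are my own choice, not taken from this document):
import statsmodels.api as sm
def regress(data, yvar, xvars):
    # OLS of one column on others within a single group; returns the fitted coefficients.
    Y = data[yvar]
    X = data[xvars].copy()
    X['intercept'] = 1.0            # add a constant term explicitly
    result = sm.OLS(Y, X).fit()
    return result.params
# Yearly regression of AAPL returns on SPX returns, reusing by_year from above.
by_year.apply(regress, 'AAPL', ['SPX'])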
Help on function pivot_table in module pandas.tools.pivot:
pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Create a spreadsheet-style pivot table as a DataFrame. The levels in the
pivot table will be stored in MultiIndex objects (hierarchical indexes) on
the index and columns of the result DataFrame
Parameters
----------
data : DataFrame
values : column to aggregate, optional
index : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list
can contain any of the other types (except list).
Keys to group by on the pivot table index. If an array is passed, it
is used in the same manner as column values.
columns : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list
can contain any of the other types (except list).
Keys to group by on the pivot table column. If an array is passed, it
is used in the same manner as column values.
aggfunc : function or list of functions, default numpy.mean
If list of functions passed, the resulting pivot table will have
hierarchical columns whose top level are the function names (inferred
from the function objects themselves)
fill_value : scalar, default None
Value to replace missing values with
margins : boolean, default False
Add all row / columns (e.g. for subtotal / grand totals)
dropna : boolean, default True
Do not include columns whose entries are all NaN
margins_name : string, default'All'
Name of the row/column that will contain the totals
when margins is True.
Examples
--------
>>> df
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
>>> table = pivot_table(df, values='D', index=['A', 'B'],
...                     columns=['C'], aggfunc=np.sum)
>>> table
          small  large
foo  one      1      4
     two      6    NaN
bar  one      5      4
     two      6      7
Returns
-------
table : DataFrame
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')  # the same result as above
              tip_pct                 size
smoker             No       Yes         No       Yes
time   day
Dinner Fri   0.139622  0.165347   2.000000  2.222222
       Sat   0.158048  0.147906   2.555556  2.476190
       Sun   0.160113  0.187250   2.929825  2.578947
       Thur  0.159744       NaN   2.000000       NaN
Lunch  Fri   0.187735  0.188937   3.000000  1.833333
       Thur  0.160311  0.163863   2.500000  2.352941
We could augment this table to include partial totals by passing margins=True. This has the effect of adding All row and column labels, with the corresponding values being the group statistics for all the data within a single tier:
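A sketch of that call, assuming the same tips DataFrame used above:
# Adds an 'All' row and an 'All' column holding the statistic over all groups.
tips.pivot_table('tip_pct', index=['time', 'day'], columns='smoker', margins=True)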
Help on function crosstab in module pandas.tools.pivot:
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
Compute a simple cross-tabulation of two (or more) factors. By default
computes a frequency table of the factors unless an array of values and an
aggregation function are passed
Parameters
----------
index : array-like, Series, or list of arrays/Series
Values to group by in the rows
columns : array-like, Series, or list of arrays/Series
Values to group by in the columns
values : array-like, optional
Array of values to aggregate according to the factors.
Requires `aggfunc` be specified.
aggfunc : function, optional
If specified, requires `values` be specified as well
rownames : sequence, default None
If passed, must match number of row arrays passed
colnames : sequence, default None
If passed, must match number of column arrays passed
margins : boolean, default False
Add row/column margins (subtotals)
dropna : boolean, default True
Do not include columns whose entries are all NaN
normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
Normalize by dividing all values by the sum of values.
- If passed 'all' or `True`, will normalize over all values.
- If passed 'index' will normalize over each row.
- If passed 'columns' will normalize over each column.
- If margins is `True`, will also normalize margin values.
.. versionadded:: 0.18.1
Notes
-----
Any Series passed will have their name attributes used unless row or column
names for the cross-tabulation are specified.
Any input passed containing Categorical data will have **all** of its
categories included in the cross-tabulation, even if the actual data does
not contain any instances of a particular category.
In the event that there aren't overlapping indexes an empty DataFrame will
be returned.
Examples
--------
>>> a
array([foo, foo, foo, foo, bar, bar,
       bar, bar, foo, foo, foo], dtype=object)
>>> b
array([one, one, one, two, one, one,
       one, two, two, two, one], dtype=object)
>>> c
array([dull, dull, shiny, dull, dull, shiny,
       shiny, dull, shiny, shiny, shiny], dtype=object)
>>> crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b    one        two
c    dull shiny dull shiny
a
bar  1    2     1    0
foo  2    2     1    2
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
...                     # but they still will be counted in the output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0
Returns
-------
crosstab : DataFrame
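crosstab is a convenient shortcut when only group frequencies are needed; a sketch using the same tips data as above (the column names are assumed from the earlier pivot_table example):
# Counts of smokers vs. non-smokers per (time, day) combination, with subtotals.
pd.crosstab([tips['time'], tips['day']], tips['smoker'], margins=True)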