pd.Series() in pandas
pd.Series is a one-dimensional array-like object and one of the main data structures in the pandas library. It can hold any data type (integers, strings, floating point numbers, Python objects, etc.) and has an associated array of data labels, called its index. The basic structure of a pd.Series object is very simple:
import pandas as pd
s = pd.Series(data, index=index)
Here's what each part means:
- data: The data can be many different things:
  - a Python list or other sequence
  - a Python dict
  - an ndarray
  - a scalar value (like 5)
- index: The axis labels. Labels must be hashable, and when the data is a list or ndarray the index must be the same length as the data. By default, it is a range of integers from 0 to len(data) - 1.
Here are some examples of creating a Series:
- From a list: Without an explicit index, pandas creates a default integer index.
  import numpy as np  # needed for np.nan
  s = pd.Series([1, 3, 5, np.nan, 6, 8])
- From a dict: When the data is a dict and no index is passed, the Series index follows the dict's insertion order.
  s = pd.Series({'a': 1, 'b': 2, 'c': 3})
- From a scalar value: If data is a scalar value, provide an index; the value is repeated to match the length of the index.
  s = pd.Series(5.0, index=['a', 'b', 'c', 'd', 'e'])
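To make the interplay between data and index more concrete, here is a minimal sketch. The variable names and sample values are invented for illustration. It shows that labels are paired with list values by position, while a dict combined with an explicit index is looked up by key.

```python
import pandas as pd

# A list with custom labels: labels and values are paired by position.
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

# A dict with an explicit index: values are looked up by key, so the
# index controls both the order and which entries are kept. "d" has no
# entry in the dict and becomes NaN; "a" is not requested and is left out.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3}, index=["c", "b", "d"])

print(temps)
print(from_dict)
```

Because of the NaN, the second Series is stored with a floating-point dtype even though the dict held integers.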
Each Series object has attributes and methods that allow for easy data manipulation and analysis. For example, s.index gives you the index of the Series s, and s.values gives you the data as a NumPy array.
Series also support operations such as slicing, filtering, and aggregation, which makes them a powerful tool for data analysis in Python.
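As a rough illustration of slicing, filtering, and aggregation, the sketch below uses a small made-up Series. The labels and values are assumptions chosen only to keep the results easy to check.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

first_two = s.iloc[:2]      # positional slice: the first two elements
b_to_c = s.loc["b":"c"]     # label slice: both endpoints are included
large = s[s > 15]           # boolean filter: keeps values greater than 15
total = s.sum()             # aggregation over all values

print(first_two, b_to_c, large, total, sep="\n")
```

Note the difference between the two slices: positional slices exclude the stop position, as with lists, while label slices include the stop label.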
The main differences between a Series and a list
A pd.Series object from the pandas library and a Python list have several key differences:
- Data Types:
  - A pd.Series has a single dtype shared by all of its elements. If you create a Series with integers and floats, the integers are upcast to floats. If the values have no common numeric type (for example, numbers mixed with strings), the Series falls back to the generic object dtype.
  - A Python list can hold different data types within the same list. You can have integers, strings, and objects all in the same list.
- Performance:
  - pd.Series is built on top of NumPy arrays, which makes it faster for certain types of operations, especially on large datasets. Operations on Series are vectorized, meaning they apply to all elements at once without explicit looping.
  - Python lists are general-purpose containers that are not optimized for numerical computations or data analysis tasks.
- Functionality:
  - pd.Series comes with many built-in methods for common data manipulation tasks such as summing values, calculating means, handling missing data, and more.
  - Python lists have methods for general-purpose tasks like adding and removing elements, but lack the advanced data manipulation capabilities of a pandas Series.
- Indexing:
  - Each element in a pd.Series has an index label associated with it (numeric or otherwise; labels do not have to be unique), and this index is used to access and modify data. Elements can also be accessed by position with .iloc.
  - A Python list is indexed with integers starting from zero, and you can only access data by these positional indices.
- Size Mutability:
  - Most operations that change the length of a pd.Series (such as dropna() or pd.concat()) return a new Series rather than resizing the original, although assigning to a new label does add an element in place. The values it contains can be changed freely.
  - A Python list is dynamic in size. You can append, insert, or remove items from a list, which changes its size.
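The data type and performance points can be seen in a short sketch. The sample values below are made up, and the comparison is only meant to show the difference in style, not to benchmark anything.

```python
import pandas as pd

# Integers mixed with floats are stored under one common dtype.
mixed_numbers = pd.Series([1, 2, 3.5])
print(mixed_numbers.dtype)    # float64

# Values with no common numeric type fall back to the object dtype.
mixed_objects = pd.Series([1, "two", 3.0])
print(mixed_objects.dtype)    # object

# Doubling every value: a list needs an explicit loop or comprehension,
# while the Series applies the operation to all elements at once.
prices = [10.0, 20.0, 30.0]
doubled_list = [p * 2 for p in prices]
doubled_series = pd.Series(prices) * 2
print(doubled_series.tolist())
```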
Below are examples of how to use a pandas Series to sum values, calculate the mean, and handle missing data.
First, let's import pandas and create a Series:
import pandas as pd
import numpy as np # for NaN (missing data)
# Create a Series with some numbers and one NaN (missing) value
s = pd.Series([1, 3, 5, np.nan, 7, 9, 11])
- Summing Values: You can sum the values in a Series using the .sum() method.
  total = s.sum()
  print("Sum:", total)
- Calculating Mean: To calculate the mean of the values in a Series, use the .mean() method.
  average = s.mean()
  print("Mean:", average)
- Handling Missing Data:
  - To fill missing data (NaN values) with a specific value, use the .fillna() method.
    filled_s = s.fillna(0)  # Replace NaN with 0
    print("Filled Series:\n", filled_s)
  - To drop rows with missing data, use the .dropna() method.
    dropped_s = s.dropna()  # Drop rows with NaN
    print("Dropped NaN Series:\n", dropped_s)
This code outputs the sum (36.0) and mean (6.0) of the non-missing values, because .sum() and .mean() skip NaN by default, and shows how the Series looks both after filling in missing data with zeros and after dropping any missing data.
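That the sum and mean cover only the non-missing values comes from the default skipna=True of these methods. The sketch below rebuilds the same Series and contrasts the default with skipna=False; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 7, 9, 11])

total = s.sum()                       # NaN is skipped: 1+3+5+7+9+11
average = s.mean()                    # divided by the 6 non-missing values
strict_total = s.sum(skipna=False)    # NaN propagates into the result
missing_count = s.isna().sum()        # number of missing entries

print(total, average, strict_total, missing_count)
```

Using skipna=False can be a reasonable choice when a missing value should invalidate the result rather than be silently ignored.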
The indexing of a pandas Series differs from a Python list primarily due to the design and purpose of each structure.
- Python List Indexing:
  - A Python list is indexed with integers starting from 0, and position is the only way to access its elements.
  - Lists are part of Python's core data structures and are intended for general-purpose use. They don't have an associated index other than this default numerical index.
- Pandas Series Indexing:
  - A pandas Series has an explicitly defined index associated with its elements. This index can be numeric (like a list), but it is also often composed of labels (strings or dates, for example).
  - The index in a Series can be thought of as a set of keys similar to a dictionary; each key (index label) is mapped to a value in the Series.
  - This index provides powerful data alignment features. When performing operations across multiple Series or between Series and DataFrames, pandas aligns data based on these index labels, not just the positional order.
  - The index in a Series does not need to be unique, which allows for more complex data manipulations and groupings.
  - The index also supports hierarchical/multi-level indexing, which enables representing higher-dimensional data in a one-dimensional Series.
In essence, the indexing system of a pandas Series is far more flexible and feature-rich than the simple, positional-only indexing of a Python list. This flexibility is one reason why pandas is so powerful for data analysis, as it allows for complex operations and easy subsetting of data based on sophisticated index criteria.
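The index features described above (label alignment, non-unique labels, and hierarchical indexing) can be sketched as follows. All labels and values are invented for illustration.

```python
import pandas as pd

# Alignment: values are matched by label, not by position. Labels that
# appear in only one operand produce NaN in the result.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])
aligned_sum = a + b
print(aligned_sum)

# Non-unique labels: selecting a repeated label returns every match.
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
dup_a = dup.loc["a"]
print(dup_a)

# Hierarchical index: two levels of labels on a one-dimensional Series.
multi = pd.Series(
    [100, 110, 200, 210],
    index=pd.MultiIndex.from_tuples(
        [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
        names=["year", "quarter"],
    ),
)
year_2024 = multi.loc["2024"]   # selects by the outer level
as_table = multi.unstack()      # pivots the inner level into columns
print(year_2024)
print(as_table)
```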
A special example of how the index of a Series is used:
In the snippet discussed here, pd.Series() is called inside the lambda function passed to the apply() method to create a new pandas Series from the list [x.split('_')[0], x.split('_')[2]]. When using apply() to return multiple new columns, the function needs to return a Series in which each element corresponds to a new column. Here is why it's used:
- Pandas Compatibility: By returning a pd.Series, you ensure that each element of the Series is treated as a separate column when assigning to df[['code', 'year']].
- Structure: The index of each returned Series (0 and 1 here) becomes the column labels of the DataFrame that apply() produces, and that DataFrame keeps the row index of the original column, so the result lines up with df row by row.
- Multiple Columns: Without pd.Series(), the lambda function would return a list, and apply() would produce a single Series whose elements are lists rather than separate columns. By using pd.Series(), you explicitly convert the list into a Series where each item becomes a separate column.
Here's the modified part of the code for clarity:
df[['code', 'year']] = df['column_to_split'].apply(
    lambda x: pd.Series([x.split('_')[0], x.split('_')[2]])
)
In this line, for each value in 'column_to_split', x.split('_') creates a list of split components, and pd.Series([x.split('_')[0], x.split('_')[2]]) creates a new Series with the first and third elements of that list as its two entries. When applied across the whole DataFrame, this results in two new columns being created.
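For a self-contained run of this pattern, the sketch below builds a small DataFrame. The sample strings are hypothetical, chosen only so that each has three underscore-separated parts. It also shows an alternative based on the vectorized .str.split(..., expand=True), which is not the approach described above but typically gives the same columns without calling a Python function per row.

```python
import pandas as pd

# Hypothetical sample data: three underscore-separated parts per value.
df = pd.DataFrame({"column_to_split": ["AB_x_2020", "CD_y_2021"]})

# The pattern from the text: return a Series per row so that apply()
# yields a DataFrame with one column per element.
df[["code", "year"]] = df["column_to_split"].apply(
    lambda x: pd.Series([x.split("_")[0], x.split("_")[2]])
)
print(df)

# Alternative (a swapped-in technique): split all rows at once, then keep
# and rename the first and third parts.
parts = df["column_to_split"].str.split("_", expand=True)
alt = parts[[0, 2]].rename(columns={0: "code", 2: "year"})
print(alt)
```

Both variants leave the year as a string; converting it with astype(int) would be a separate step.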