6 ways to Sort Pandas Dataframe: Pandas Tutorial

Often you want to sort Pandas data frame in a specific way. Typically, one may want to sort pandas data frame based on the values of one or more columns or sort based on the values of row index or row names of pandas dataframe. Pandas data frame has two useful functions

sort_values(): to sort pandas data frame by one or more columns
sort_index(): to sort pandas data frame by row index

Each of these functions come with numerous options, like sorting the data frame in specific order (ascending or descending), sorting in place, sorting with missing values, sorting by specific algorithm and so on.

Here is a quick Pandas tutorial on multiple ways of using sort_values() and sort_index() to sort pandas data frame using a real data set (gapminder).

Let us first load the gapminder data from software carpentry URL.

data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# print the first three rows
print(gapminder.head(n=3))

1. How to Sort Pandas Dataframe based on the values of a column?

We can sort pandas dataframe based on the values of a single column by specifying the column name wwe want to sort as input argument to sort_values(). For example, we can sort by the values of “lifeExp” column in the gapminder data like

sort_by_life = gapminder.sort_values('lifeExp')

print(sort_by_life.head(n=3))
          country  year        pop continent  lifeExp   gdpPercap
1292       Rwanda  1992  7290203.0    Africa   23.599  737.068595
0     Afghanistan  1952  8425333.0      Asia   28.801  779.445314
552        Gambia  1952   284320.0    Africa   30.000  485.230659

Note that by default sort_values sorts and gives a new data frame. The new sorted data frame is in ascending order (small values first and large values last). With head function we can see that the first rows have smaller life expectancy. Using tail function the sorted data frame, we can see that the last rows have higher life expectancy.

print(sort_by_life.tail(n=3))
             country  year          pop continent  lifeExp    gdpPercap
802            Japan  2002  127065841.0      Asia   82.000  28604.59190
671  Hong Kong China  2007    6980412.0      Asia   82.208  39724.97867
803            Japan  2007  127467972.0      Asia   82.603  31656.06806

2. How to Sort Pandas Dataframe based on the values of a column (Descending order)?

To sort a dataframe based on the values of a column but in descending order so that the largest values of the column are at the top, we can use the argument ascending=False.

sort_by_life = gapminder.sort_values('lifeExp',ascending=False)

In this example, we can see that after sorting the dataframe by lifeExp with ascending=False, the countries with largest life expectancy are at the top.

print(sort_by_life.head(n=3))
             country  year          pop continent  lifeExp    gdpPercap
803            Japan  2007  127467972.0      Asia   82.603  31656.06806
671  Hong Kong China  2007    6980412.0      Asia   82.208  39724.97867
802            Japan  2002  127065841.0      Asia   82.000  28604.59190

3. How to Sort Pandas Dataframe based on a column and put missing values first?

Often a data frame might contain missing values and when sorting a data frame on a column with missing value, we might want to have rows with missing values to be at the first or at the last.

We can specify the position we want for missing values using the argument na_position. With na_position=’first’, it will have the rows with missing values first.

sort_na_first = gapminder.sort_values('lifeExp',na_position='first')

In this example, there are NO missing values and that is why there is no na values at the top when sorted with the option na_position=’first’.

sort_na_first.head()
          country  year        pop continent  lifeExp   gdpPercap
1292       Rwanda  1992  7290203.0    Africa   23.599  737.068595
0     Afghanistan  1952  8425333.0      Asia   28.801  779.445314
552        Gambia  1952   284320.0    Africa   30.000  485.230659

4. How to Sort Pandas Dataframe based on a column in place?

By default sorting pandas data frame using sort_values() or sort_index() creates a new data frame. If you don’t want create a new data frame after sorting and just want to do the sort in place, you can use the argument “inplace = True”. Here is an example of sorting a pandas data frame in place without creating a new data frame.

gapminder.sort_values('lifeExp', inplace=True, ascending=False)

We can see that the data frame sorted as lifeExp values at the top are smallest and the row indices are not in order.

print(gapminder.head(n=3))
          country  year        pop continent  lifeExp   gdpPercap
1292       Rwanda  1992  7290203.0    Africa   23.599  737.068595
0     Afghanistan  1952  8425333.0      Asia   28.801  779.445314
552        Gambia  1952   284320.0    Africa   30.000  485.230659

Note that, the row index of the sorted data frame is different from the data frame before sorting.

5. How to Sort Pandas Dataframe based on Index (in place)?

We can use sort_index() to sort pandas dataframe to sort by row index or names. In this example, row index are numbers and in the earlier example we sorted data frame by lifeExp and therefore the row index are jumbled up. We can sort by row index (with inplace=True option) and retrieve the original dataframe.

gapminder.sort_index(inplace=True)

Now we can see that row indices start from 0 and sorted in ascending order. Compare it to the previous example, where the first row index is 1292 and row indices are not sorted.

print(gapminder.head(n=3))
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

6. How to Sort Pandas Dataframe Based on the Values of Multiple Columns?

Often, you might want to sort a data frame based on the values of multiple columns. We can specify the columns we want to sort by as a list in the argument for sort_values(). For example, to sort by values of two columns, we can do.

sort_by_life_gdp = gapminder.sort_values(['lifeExp','gdpPercap'])

We can see that lifeExp column is sorted in ascending order and for each values of lifeExp, gdpPercap is sorted.

print(sort_by_life_gdp.head())
          country  year        pop continent  lifeExp   gdpPercap
1292       Rwanda  1992  7290203.0    Africa   23.599  737.068595
0     Afghanistan  1952  8425333.0      Asia   28.801  779.445314
552        Gambia  1952   284320.0    Africa   30.000  485.230659

Note that when sorting by multiple columns, pandas sort_value() uses the first variable first and second variable next. We can see the difference by switching the order of column names in the list.

sort_by_life_gdp = gapminder.sort_values(['gdpPercap','lifeExp'])

print(sort_by_life_gdp.head())
             country  year         pop continent  lifeExp   gdpPercap
334  Congo Dem. Rep.  2002  55379852.0    Africa   44.966  241.165877
335  Congo Dem. Rep.  2007  64606759.0    Africa   46.462  277.551859
876          Lesotho  1952    748747.0    Africa   42.138  298.846212

posted @ 2019-09-03 09:00 andy_0212 阅读(229) 评论(0) 编辑收藏举报

刷新页面返回顶部

andy_0212