SQL 查询的 15 个 Pandas 函数

Import data from SQL to Python

In [1]:
# Let's start with connecting SQL with Python and Importing the SQL data as DataFrame
import pyodbc
import pandas as pd
import numpy as np
connection_string = ("Driver={SQL Server Native Client 11.0};"
            "Server=Your_Server_Name;"
            "Database=My_Database_Name;"
            "UID=Your_User_ID;"
            "PWD=Your_Password;")
connection = pyodbc.connect(connection_string)
# Using the same query as above to get the output in dataframe
# We are importing top 10 rows and all the columns of State_Population Table
population = pd.read_sql('SELECT TOP(10) * FROM State_Population', connection)
# OR
# write the query and assign it to variable
query = 'SELECT * FROM STATE_AREAS WHERE [area (sq. mi)] > 100000'
# use the variable name in place of query string
area = pd.read_sql(query, connection)
 
The output above is imported in Python as Pandas DataFrame.

Once we have the data in the form of DataFrame, now we can see how to manipulate them using Pandas in Python. In this article, we are going to see how we can replicate the SQL constructs in Python. There is no one “Best” way, but many good paths. You chose the one you wish for.

 

Basic SQL Queries

We are going to deconstruct the most basic of the SQL queries and see how the same result can be achieved in Python. The queries which we will discuss in this article are

  • SELECT column_name(s)
  • FROM table_name
  • WHERE condition
  • GROUP BY column_name(s)
  • HAVING condition
  • ORDER BY column_name(s)

The methodology we are going to adopt is like this: We will write a SQL query, and then list some possible ways in which the same result can be achieved in Python. We have three tables in the database which we are going to use, and we have imported two of them as DataFrames in Python already. We will use one of these Data Frames (population) to understand these concepts.

The table State_Population is already imported and the DataFrame is named as population.

In [2]:

population.head()

Out[2]:

 state/regionagesyearpopulation
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
 

Let us see how to replicate the SQL function in Python and get the same or similar results.

Note: The main headings are the broad SQL query names. And inside those headings, the actual SQL query being used to replicate are written in BOLD and note format. Below them, all the Python ways to replicate them are mentioned as numbered methods, one after the other.

SELECT column_name(s)

FROM table_name

SELECT * FROM State_Population;

This SQL query will fetch all the columns (and all the rows too) from the state_population table. The same result can be achieved by simply calling the DataFrame in Python.

1.  Call the DataFrame in Python

In [3]:
 
population
 
Out[3]:
 state/regionagesyearpopulation
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
2539 USA total 2010 309326295.0
2540 USA under18 2011 73902222.0
2541 USA total 2011 311582564.0
2542 USA under18 2012 73708179.0
2543 USA total 2012 313873685.0

2544 rows × 4 columns

 

SELECT year FROM State_Population;

 

此 SQL 查询将从 state_population 表中获取列(年)和所有行。在Python中,可以通过以下方式实现。这里要注意的一件事是,当我们只选择一列时,它会从 Pandas DataFrame 对象转换为 Pandas 系列对象。我们使用 DataFrame 函数将其转换回 DataFrame。

 

2.  Call the DataFrame.ColumnName

In [4]:
# By calling the dataframe.column
pd.DataFrame(population.year)

 

SELECT population, year FROM State_Population;

This query will fetch two columns(population and year) and all the rows from the state_population table. In Python, it can be achieved in the following ways.

3.  Call the DataFrame with column names (Selecting)

Notice the names of the columns as a list, inside the indexing brackets [].

In [5]:
population[['population', 'year']]

 

4.  Use the pandas .loc() method

The syntax for loc method is df.loc([row names], [column names]). If instead of the list of names, only “:” is passed, it means to consider all. So df.loc(: , [column names]) means fetch all rows for the given column names.

 
loc 方法的语法是df.loc([row names], [column names])如果只传递“ :”而不是名称列表,则表示考虑所有。所以df.loc(: , [column names])意味着获取给定列名的所有行
 
In [6]:
population.loc[:,['population', 'year']] 

The DataFrame above is the output from all the above codes. Different methods, same output.

 

SELECT column_name(s)

FROM table_name

WHERE condition

SELECT * FROM State_Population WHERE year = 2010;

This query will fetch all the columns and only those rows from the state_population table where the year column has a value equal to 2010. In Python, it can be achieved in the following ways.

5.  Use Python’s Slicing method

In [7]:
population[population.year == 2010]

 

6.  Use pandas .loc() method

In [8]:

population.loc[population.year == 2010,:]

 

7.  Use pandas .query() method

Notice that the input for df.query() is always a string.

In [9]:
population.query('year == 2010')

8.使用pandas lambda函数

请注意,使用了 apply 方法, lambda 函数应用于列的每个元素。然后将其结果送入索引括号内以对原始 DataFrame 进行切片。

In [10]:
population[population.apply(lambda x: x["year"] == 2010, axis=1)]

 

Out[10]:

 state/regionagesyearpopulation
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
90 AK under18 2010 187902.0
91 AK total 2010 713868.0
100 AZ under18 2010 1628563.0
2405 WY total 2010 564222.0
2490 PR total 2010 3721208.0
2491 PR under18 2010 896945.0
2538 USA under18 2010 74119556.0
2539 USA total 2010 309326295.0

106 rows × 4 columns

 

The DataFrame above is the output from all the above codes. Different methods, same output.

 

SELECT state/region, population, year FROM State_Population WHERE year = 2010 or 2012 and ages = under18;

此查询将获取这三列(州/地区、人口、年份)以及 state_population 表中年份列的值等于 2010 或 2012 且年龄列的值等于“under18”的那些行。在Python中,可以通过以下方式实现。

 

9.  Use Pythons indexing and slicing

In [11]:
# By using Pythons indexing and slicing
population[(population.year.isin([2010, 2012])) & (population.ages == "under18")][['state/region', 'population', 'year']]

 

Out[11]:

 state/regionpopulationyear
0 AL 1117489.0 2012
2 AL 1130966.0 2010
90 AK 187902.0 2010
94 AK 188162.0 2012
96 AZ 1617149.0 2012
2404 WY 135351.0 2010
2491 PR 896945.0 2010
2494 PR 841740.0 2012
2538 USA 74119556.0 2010
2542 USA 73708179.0 2012

106 rows × 3 columns

 

10.  Use pandas .loc() method

In [12]:
population.query('(year==2010 | year==2012) & ages == "under18"')[['state/region', 'population', 'year']]

 

Out[12]:

 state/regionpopulationyear
0 AL 1117489.0 2012
2 AL 1130966.0 2010
90 AK 187902.0 2010
94 AK 188162.0 2012
96 AZ 1617149.0 2012
2404 WY 135351.0 2010
2491 PR 896945.0 2010
2494 PR 841740.0 2012
2538 USA 74119556.0 2010
2542 USA 73708179.0 2012

106 rows × 3 columns

 

11.  Use pandas .query() method

Notice that the input for df.query() is always a string.

In [13]:
population.query('(year==2010 | year==2012) & ages == "under18"')[['state/region', 'population', 'year']]

 

Out[13]:

 state/regionpopulationyear
0 AL 1117489.0 2012
2 AL 1130966.0 2010
90 AK 187902.0 2010
94 AK 188162.0 2012
96 AZ 1617149.0 2012
2404 WY 135351.0 2010
2491 PR 896945.0 2010
2494 PR 841740.0 2012
2538 USA 74119556.0 2010
2542 USA 73708179.0 2012

106 rows × 3 columns

 

12.  Use lambda function

In [14]:
population[population.apply(lambda x: (x["year"] in [2010, 2012]) & (x["ages"] == "under18"), axis=1)]

 

Out[14]:

 state/regionagesyearpopulation
0 AL under18 2012 1117489.0
2 AL under18 2010 1130966.0
90 AK under18 2010 187902.0
94 AK under18 2012 188162.0
96 AZ under18 2012 1617149.0
2404 WY under18 2010 135351.0
2491 PR under18 2010 896945.0
2494 PR under18 2012 841740.0
2538 USA under18 2010 74119556.0
2542 USA under18 2012 73708179.0

106 rows × 4 columns

 

The DataFrame above is the output from all the above codes. Different methods, same output.

 

SELECT column_name(s)

FROM table_name

WHERE condition

GROUP BY column_name(s)

HAVING condition

SELECT * FROM State_Population WHERE ages = total GROUP BY state/region HAVING AVG(population) > 10000000;

SQL 和 Pandas 的 Group By 功能从表面上看是一样的,但 Pandas groupby 的能力和效率要高得多,尤其是对于更复杂的操作。为了在python中从SQL中实现上述操作,让我们近距离看看pandas groupby函数。

可以使用一列或多列进行分组。对于一列,只需传递列名,对于多列,将名称作为列表传递。
 
In [15]:
# grouped by state/region
population.groupby(by = 'state/region')

 

Out[15]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016982CD0408>
In [16]:
# grouped by state/region and year
population.groupby(by = ['state/region', 'year'])

 

 
Out[16]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016982D7A908>
 

输出是一个 groupby 对象。输出显示分组已经完成,并且 groupby 函数完成了它的工作。但是由于我们没有被告知按哪个函数聚合,因此输出不是 DataFrame 的形式。所以让我们现在就这样做。

 

 
In [17]:
population.groupby(by = ['state/region', 'year']).count()

 

Out[17]:

  agespopulation
state/regionyear  
AK1990 2 2
1991 2 2
1992 2 2
1993 2 2
1994 2 2
WY2009 2 2
2010 2 2
2011 2 2
2012 2 2
2013 2 2

1272 rows × 2 columns

 

我们可以将这个 groupby 对象分配给一个变量,然后使用该变量进行进一步的操作。

In [18]:
grouped = population.groupby(by = ['state/region', 'year'])

 

 

现在让我们复制 SQL 查询。要添加 HAVING 函数,我们需要使用 groupby 然后过滤条件。上述SQL代码的python实现如下。

13.  groupby and aggregate in Pandas

In [19]:
df = pd.DataFrame(population.loc[population.ages == 'total', :].groupby(by = 'state/region').aggregate('population').mean())
​df.loc[df.population > 10000000, :] 
 
Out[19]:
 population
state/region 
CA 3.433414e+07
FL 1.649654e+07
IL 1.237080e+07
NY 1.892581e+07
OH 1.134238e+07
PA 1.236960e+07
TX 2.160626e+07
USA 2.849979e+08
 

SELECT column_name(s)

FROM table_name

WHERE condition

GROUP BY column_name(s)

HAVING condition

ORDER BY column_name(s)

SELECT * FROM State_Population WHERE ages = total GROUP BY state/region HAVING AVG(population) > 10000000 ORDER BY population;

The order by in SQL is used to sort the table in the given order. In the above SQL code, the table needs to be ordered in ascending order (default). This task can be accomplished by using the pandas sort_values() method.

14.  Order by using sort_values() in Python

In [20]:
df = pd.DataFrame(population.loc[population.ages == 'total', :].groupby(by = 'state/region').aggregate('population').mean())
 df.loc[df .population > 10000000, :].sort_values(by = 'population')
 
Out[20]:
 population
state/region 
OH 1.134238e+07
PA 1.236960e+07
IL 1.237080e+07
FL 1.649654e+07
NY 1.892581e+07
TX 2.160626e+07
CA 3.433414e+07
USA 2.849979e+08
 

默认情况下按升序进行排序。要改变这一点,应使用升序 = False

In [21]:
df.loc[df.population > 10000000, :].sort_values(by = 'population', ascending = False)

Out[21]:

 population
state/region 
USA 2.849979e+08
CA 3.433414e+07
TX 2.160626e+07
NY 1.892581e+07
FL 1.649654e+07
IL 1.237080e+07
PA 1.236960e+07
OH 1.134238e+07
 

 

Bonus: 

按多列排序

Pandas 提供了按多列排序的功能。不仅如此,您还可以选择哪些列按升序排序,哪些列按降序排序。让我们对我们的人口数据集进行分组和排序。我们将根据州和年份对它们进行分组,然后根据年份和人口对它们进行排序。

15. sort_values() 在不止一列上

In [22]:
# Grouping by and the grouped table
grouped = population.groupby(by = ['state/region', 'year']).mean()
grouped
 
 
Out[22]:
  population
state/regionyear 
AK1990 365396.0
1991 376186.5
1992 386807.0
1993 393312.0
1994 395373.5
WY2009 347405.5
2010 349786.5
2011 351368.0
2012 356576.0
2013 360168.5

1272 rows × 1 columns

In [23]:
# Sorting the Grouped table in 
# Ascending order of Year and (Increasing Year)
# Descending order of population (decreasing population)
grouped.sort_values(by = ['year', 'population'], ascending=[True, False])

 

 
 
Out[23]:
  population
state/regionyear 
USA1990 156920663.0
CA1990 18970008.0
NY1990 11151213.5
TX1990 10981487.5
FL1990 8011057.0
AK2013 461632.0
ND2013 443040.5
DC2013 378961.5
VT2013 374665.5
WY2013 360168.5

1272 rows × 1 columns

 

6 结论:

您一定想知道组织如何管理其庞大的数据库。他们肯定不会将其保存在 Excel 或其他电子表格格式中。现实生活中的业务数据库维护在关系数据库系统中,该系统最常使用 SQL 创建和访问。因此,了解 SQL 是任何数据科学家的必备工具。但 SQL 不仅仅是一个数据挑选工具,功能更强大。它能够完成许多数据整理和数据操作任务。但 Python 也是如此。

现在没有一种语言足以完成所有的任务,具有操作效率。因此,对 SQL 和 Python 的深入理解将帮助您选择使用哪一个来完成哪个任务。

如果您只想对数据进行选择、筛选和基本操作,则可以在 SQL 中高效地完成。但是如果需要复杂的分组操作和更多的数据操作,Python 中的 Pandas 将是一个更合适的选择。

使用多种数据分析语言有很多好处,因为您可以自定义和使用非常适合您不断变化的需求的混合方法。

 

posted @   DaisyLinux  阅读(264)  评论(0编辑  收藏  举报
努力加载评论中...
点击右上角即可分享
微信分享提示