理解Python字典和集合的性能差异

本文为极客时间 Python核心技术与实战专栏的学习笔记

字典

在 Python3.7+，字典被确定为有序（注意：在 3.6 中，字典有序是一个 implementation detail，在 3.7 才正式成为语言特性，因此 3.6 中无法 100% 确保其有序性），而 3.6 之前是无序的，其长度大小可变，元素可以任意地删减和改变。

相比列表和元组，字典性能更优，可以在常数时间复杂度O(1)内完成查找、添加、删除操作。

常用创建方法

>>> d1 = {'name': 'Json', 'age': 20, 'gender': 'male'}
>>> d2 = dict( {'name': 'Json', 'age': 20, 'gender': 'male'})
>>> d3 = dict([('name', 'Json'),('age', 20),('gender','male')])
>>> d4 = dict(name='Json', age=20, gender='male')
>>> d1 == d2 == d3 == d4
True

索引

使用dict[key]格式索引，如果不存在，会抛出KeyError异常。

>>> d1['name']
'Json'
>>> d1['location']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'location'

使用get(key, default)方法不会抛出异常，此外当该键不存在时，可以指定返回的默认值。

>>> d1.get('name')
'Json'
>>> d1.get('location')

>>> d1.get('location', None)
None

判断某元素是否是字典的键：

>>> 'name' in d1		# 同d1.keys()等价
True
>>> 'name' in d1.keys()		
True
>>> 'Json' in d1		
False
>>> 'Json' in d1.values()
True

添加

>>> d = {'name': 'jason', 'age': 20}
>>> d['gender'] = 'male'
>>> d
{'name': 'jason', 'age': 20, 'gender': 'male'}

删除

>>> d.pop('gender')
'male'
>>> d
{'name': 'jason', 'age': 20}

排序

根据键排序

>>> d = {'b': 1, 'v': 20, 'a': 17}
>>> d_sorted_by_key = sorted(d.items(), key=lambda x: x[0])
>>> d_sorted_by_key
[('a', 17), ('b', 1), ('v', 20)]
>>> d
{'b': 1, 'v': 20, 'a': 17}

根据值排序

>>> d_sorted_by_value = sorted(d.items(), key = lambda x: x[1])
>>> d_sorted_by_value
[('b', 1), ('a', 17), ('v', 20)]

性能分析

举例：有1000万件产品，产品信息包括：产品ID、价格。现在需求是：给定某件产品的ID，找出其价格：

1.用列表来存储数据

存储结构如下：

products = [
    (143121312, 100), 
    (432314553, 30),
    (32421912367, 150) 
]

那么查找需要遍历整个列表，时间复杂度为O(n)。即使先对列表排序，然后二分查找，也会需要O(logn)的时间复杂度，并且排序还需要O(nlogn)的时间。

2. 用字典存储数据

存储结构如下：

products = {
	'143121312': 100,
	'432314553': 30,
	'32421912367': 150
}

因为字典内部结构是一张哈希表，所以可以在O(1)的时间复杂度内完成查找。

3. 效率对比

import time
import numpy
def find_product_price_list(products, product_id):
    for id, price in products:
        if id == product_id:
            return price
    return None
def find_product_price_dict(products, product_id):
    for id in products.keys():
        if id == product_id:
            return products_dict[id]
    return None
r = numpy.random.randint(0,10000000,10000000)       # 生成10000000个随机数
id = [str(x) for x in r]
price = [x for x in range(20000000, 30000000)]

products_list = list(zip(id, price))
products_dict = dict(zip(id, price))
# 添加新元素
products_list.append(('111111111', 300))	# 追加到列表末尾
products_dict['111111111'] = 300

start_using_dict = time.perf_counter()
find_product_price_dict(products_dict, '111111111')
end_using_dict = time.perf_counter()
print('time elapse using dict: {}'.format(end_using_dict - start_using_dict))

start_using_list = time.perf_counter()
find_product_price_list(products_dict, '111111111')
end_using_list = time.perf_counter()

print('time elapse using list: {}'.format(end_using_list - start_using_list))

# ===========运行结果============
time elapse using dict: 0.1983588489999999
time elapse using list: 0.41368435999999953

集合

而集合和字典基本相同，唯一的区别，就是集合没有键和值的配对，是一系列无序的、唯一的元素组合。

常用创建方法

>>> s1 = {1,2,3}
>>> s2 = set([1,2,3])
>>> s1 == s2
True

索引

集合并不支持索引操作，因为集合本质上是一个哈希表，和列表不一样。

进行如下操作，Python会抛出TypeError异常。

>>> s1[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable

只能判断某元素是否在集合中：

>>> 1 in s1
True
>>> 10 in s1
False

添加

>>> s = {1,2,3}
>>> s.add(4)
>>> s
{1, 2, 3, 4}

删除

>>> s.remove(4)
>>> s
{1, 2, 3}

注意：由于集合是无序的，所以无法确定pop()方法会删除哪个元素，所以谨慎使用。一般删除操作采用remove()即可。

排序

>>> s  = {2,4,546,34}
>>> sorted(s)
[2, 4, 34, 546]

集合运算

常用集合运算

语法	操作	说明
set(list1) \| set(list2)	union	包含 list1 和 list2 所有数据的新集合
set(list1) & set(list2)	intersection	包含 list1 和 list2 中共同元素的新集合
set(list1) - set(list2)	difference	在 list1 中出现但不在 list2 中出现的元素的集合

性能分析

还是以上面那个例子为例，现在要求计算出有多少种价格。为了节省时间，我们把产品数量降低到10万。

查找效率

1. 用列表存储数据

需要两层循环。那么，在最差情况下，需要 O(n^2) 的时间复杂度。

2. 用集合存储数据

由于集合是高度优化的哈希表，里面元素不能重复，并且其添加和查找操作只需 O(1) 的复杂度，那么，总的时间复杂度就只有 O(n)。

3. 效率对比

import time
import numpy

def find_unique_price_set(products):
    unique_price_set = set()
    for _, price in products:
        unique_price_set.add(price)
    return len(unique_price_set)

def find_unique_price_list(products):
    unique_price_list = []
    for _, price in products: # A
        if price not in unique_price_list: #B
            unique_price_list.append(price)
    return len(unique_price_list)


r = numpy.random.randint(0,1000000,100000)       # 生成100000个随机数
id = [str(x) for x in r]
price = [x for x in range(200000, 300000)]

products = list(zip(id, price))


start_using_set = time.perf_counter()
find_unique_price_set(products)
end_using_set = time.perf_counter()
print('time elapse using set: {}'.format(end_using_set - start_using_set))

start_using_list = time.perf_counter()
find_unique_price_list(products)
end_using_list = time.perf_counter()
print('time elapse using list: {}'.format(end_using_list - start_using_list))

# ===========运行结果============
time elapse using set: 0.00985934799999999
time elapse using list: 65.528253501

可以看出，仅10万数据，差距就已经很明显了。

交集、并集、差集运算

以求交集为例：

import time

list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 34, 53, 42, 44]
list_b = [2, 4, 6, 9, 23]

intersection = []

# 列表做交集
start_using_list = time.perf_counter()
for a in list_a:
    for b in list_b:
        if a == b:
            intersection.append(a)
end_using_list = time.perf_counter()

print(intersection)
print('time: {}'.format(end_using_list - start_using_list))

# 集合做交集
start_using_list = time.perf_counter()
intersection = list(set(list_a) & set(list_b))
end_using_list = time.perf_counter()

print(intersection)
print('time: {}'.format(end_using_list - start_using_list))

# ===========运行结果============
[2, 4, 6, 9]
time: 9.622000000000797e-06
[9, 2, 4, 6]
time: 4.169000000001782e-06

posted @ 2019-12-29 12:38 西加加先生阅读(3096) 评论(0) 收藏举报

刷新页面返回顶部

西加加先生

公众号：西加加先生更多原创干货文章首发公众号

理解Python字典和集合的性能差异

字典

常用创建方法

索引

添加

删除

排序

性能分析

集合

常用创建方法

索引

添加

删除

排序

集合运算

性能分析

查找效率

交集、并集、差集运算

公告

西加加先生

公众号：西加加先生 更多原创干货文章首发公众号

理解Python字典和集合的性能差异

字典

常用创建方法

索引

添加

删除

排序

性能分析

集合

常用创建方法

索引

添加

删除

排序

集合运算

性能分析

查找效率

交集、并集、差集运算

公告

公众号：西加加先生更多原创干货文章首发公众号