Reading the Python Cookbook: Data Structures and Algorithms
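1. Unpacking a sequence into separate variables
Any sequence (or, more generally, any iterable) can be unpacked into separate variables with a simple assignment. The only requirement is that the number and structure of the variables match those of the sequence.
data = ["Mike", 22, 73, (2017, 12, 28)]
name, age, score, (year, month, date) = data
print(name, age, score, year, month, date)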
Mike 22 73 2017 12 28
When unpacking, you can assign the values you do not need to a throwaway variable name (conventionally _) to discard them.
data = ["Mike", 22, 73, (2017, 12, 28)]
_, age, score, (_, _, date) = data
print(age, score, date)
22 73 28
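2. Unpacking elements from iterables of arbitrary length
With a star expression, the variable prefixed with * receives a list of n elements, where n can be 0 or arbitrarily large.
record = ("Jack", 22, "15012345678", "18099883311")
name, age, *phone = record
print(name, age, phone)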
Jack 22 ['15012345678', '18099883311']
Note: a single unpacking assignment may contain only one starred variable.
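As a small illustration of the star expression described above, the starred name may also sit in the middle of the targets, and it always receives a list, even when nothing is left over. The sample values below are made up for this sketch.

```python
# Hypothetical sample data, not from the original text
scores = [95, 88, 72, 64, 51]

# The starred target may appear in the middle; it collects whatever is left
first, *middle, last = scores
print(first, middle, last)   # 95 [88, 72, 64] 51

# With nothing left over, the starred target is an empty list
head, *rest = [1]
print(head, rest)            # 1 []
```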
3. Keeping the last N items
from collections import deque

def search(lines, pattern, history=5):
    # Yield each matching line together with the lines that preceded it
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, previous_lines
        previous_lines.append(line)

if __name__ == "__main__":
    with open("D:/Test1.txt") as f:
        for line, prelines in search(f, "456", 5):
            for pline in prelines:
                print(pline, end="")
            print(line, end="")
            print("-"*20)
123
456
--------------------
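deque from the collections module handles this job well: with maxlen set, the oldest item is discarded automatically once the queue is full, and inserting at either end of a deque is O(1).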
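The bounded behaviour mentioned above can be seen in isolation with a minimal sketch (the numbers are arbitrary):

```python
from collections import deque

# A bounded deque: once it holds maxlen items, appending discards the oldest
recent = deque(maxlen=3)
for n in range(1, 6):
    recent.append(n)
print(recent)          # deque([3, 4, 5], maxlen=3)

# Both ends support O(1) insertion and removal
recent.appendleft(0)   # the deque is full, so 5 falls off the right end
print(list(recent))    # [0, 3, 4]
```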
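4. Finding the largest or smallest N items
① Finding the single largest or smallest item
Use the min() and max() functions. Their time complexity is O(n), where n is the length of the sequence.
num = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
maxnum = max(num)
minnum = min(num)
print(maxnum, "----", minnum)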
42 ---- -4
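② N is very small relative to the length of the list (for example N=2)
Use heapify() from the heapq module to rearrange the sequence into a heap, so that the first element is always the smallest one. Calling heappop() then pops the smallest element, and the second smallest takes its place at the front.
import heapq

num = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
heap = list(num)
heapq.heapify(heap)
print(heap)
print("="*50)
print(heapq.heappop(heap))
print(heap)
print("="*50)
print(heapq.heappop(heap))
print(heap)
print("="*50)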
[-4, 2, 1, 23, 7, 2, 18, 23, 42, 37, 8]
==================================================
-4
[1, 2, 2, 23, 7, 8, 18, 23, 42, 37]
==================================================
1
[2, 2, 8, 23, 7, 37, 18, 23, 42]
==================================================
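Each heappop() call costs O(log n), where n is the size of the heap; building the heap with heapify() is a one-time O(n) step.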
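Putting the two steps together, a minimal sketch of collecting the N smallest values might look like the following. The helper name n_smallest is hypothetical and is not part of heapq.

```python
import heapq

def n_smallest(values, n):
    """Return the n smallest values in ascending order using a heap."""
    heap = list(values)          # copy so the caller's list is untouched
    heapq.heapify(heap)          # one-time O(len(values)) step
    # Each heappop is O(log len(heap)); never pop more items than exist
    return [heapq.heappop(heap) for _ in range(min(n, len(heap)))]

num = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
print(n_smallest(num, 2))   # [-4, 1]
```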
③ N is small relative to the length of the list (for example N=4)
Use the nlargest() and nsmallest() functions from the heapq module. Both accept a key parameter.
import heapq

data = [
    {"name": "Jack", "age": 21, "score": 99},
    {"name": "Ben", "age": 22, "score": 90},
    {"name": "Mark", "age": 20, "score": 72},
    {"name": "Cook", "age": 20, "score": 53},
    {"name": "Antony", "age": 23, "score": 94},
    {"name": "Chris", "age": 24, "score": 62},
    {"name": "Ken", "age": 22, "score": 81},
    {"name": "Jackie", "age": 20, "score": 85},
    {"name": "David", "age": 22, "score": 89},
    {"name": "Jackson", "age": 23, "score": 89},
    {"name": "Lucy", "age": 22, "score": 77}
]
scoreMax = heapq.nlargest(4, data, key=lambda s: s["score"])
scoreMin = heapq.nsmallest(4, data, key=lambda s: s["score"])
print(scoreMax)
print(scoreMin)
[{'name': 'Jack', 'age': 21, 'score': 99}, {'name': 'Antony', 'age': 23, 'score': 94}, {'name': 'Ben', 'age': 22, 'score': 90}, {'name': 'David', 'age': 22, 'score': 89}]
[{'name': 'Cook', 'age': 20, 'score': 53}, {'name': 'Chris', 'age': 24, 'score': 62}, {'name': 'Mark', 'age': 20, 'score': 72}, {'name': 'Lucy', 'age': 22, 'score': 77}]
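As the output shows, when two items tie (David and Jackson both have a score of 89), the one that appears earlier in the sequence is selected first.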
④ N is close to the size of the sequence
Use sorted() and then take a slice.
num = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
lst = sorted(num)
lstmin = lst[:8]        # the 8 smallest, ascending
print(lstmin)
lstrev = sorted(num, reverse=True)
lstmax = lstrev[:8]     # the 8 largest, descending
print(lstmax)
or
num = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
lst = sorted(num)
lstmin = lst[:8]        # the 8 smallest, ascending
print(lstmin)
lstmax = lst[-8:]       # the 8 largest, ascending
print(lstmax)
5. Implementing a priority queue
Use heappush() and heappop() from heapq (heap operations) to implement it.
import heapq

class PriorityQueue(object):
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item, priority):
        # Negate the priority so the highest priority is popped first;
        # the index breaks ties in insertion order
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[-1]

class Item(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

if __name__ == "__main__":
    q = PriorityQueue()
    q.push(Item("Jack"), 1)
    q.push(Item("Mike"), 2)
    q.push(Item("Ben"), 3)
    q.push(Item("David"), 1)
    for i in range(q._index):
        print(q.pop())
Ben
Mike
Jack
David
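The index stored in each heap entry does more than preserve insertion order. The sketch below, which uses a stripped-down Item class as an assumption, suggests why: without it, two entries with equal priority would force Python to compare the Item objects themselves, which do not define ordering.

```python
import heapq

class Item:
    def __init__(self, name):
        self.name = name

# Without an index, equal priorities fall through to comparing Item objects
heap = []
heapq.heappush(heap, (-1, Item("Jack")))
try:
    heapq.heappush(heap, (-1, Item("David")))
    tie_ok = True
except TypeError:
    tie_ok = False
print(tie_ok)   # False

# With a unique index in the middle, the Item is never compared
heap = []
heapq.heappush(heap, (-1, 0, Item("Jack")))
heapq.heappush(heap, (-1, 1, Item("David")))
print(heapq.heappop(heap)[-1].name)   # Jack
```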
6. Mapping keys to multiple values in a dictionary
Use the defaultdict class from the collections module.
When the values are lists:
from collections import defaultdict

d = defaultdict(list)
d["a"].append(1)
d["a"].append(1)
d["b"].append(2)
d["c"].append(3)
d["d"].append(4)
for key, values in d.items():
    print(key, ":", values)
a : [1, 1]
b : [2]
c : [3]
d : [4]
When the values are sets:
from collections import defaultdict

d = defaultdict(set)
d["a"].add(1)
d["a"].add(1)
d["b"].add(2)
d["c"].add(3)
d["d"].add(4)
for key, values in d.items():
    print(key, ":", values)
a : {1}
b : {2}
c : {3}
d : {4}
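Note, however, that defaultdict automatically creates an empty entry for a key the first time that key is accessed, even if the key is only being read.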
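A minimal sketch of that side effect, with made-up keys, and of two ways to look up a key without triggering it:

```python
from collections import defaultdict

d = defaultdict(list)
d["a"].append(1)

# Merely reading a missing key inserts an empty list for it
_ = d["missing"]
print(dict(d))            # {'a': [1], 'missing': []}

# get() and the in operator do not call the default factory
e = defaultdict(list)
print(e.get("x"), "x" in e, len(e))   # None False 0
```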
The same effect can be achieved with the setdefault() method of an ordinary dict.
d = {}
d.setdefault("a", []).append(1)
d.setdefault("a", []).append(2)
d.setdefault("b", []).append(3)
d.setdefault("c", []).append(4)
for key, values in d.items():
    print(key, ":", values)
a : [1, 2]
b : [3]
c : [4]
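However, this approach creates a new instance of the initial value ([] or set()) on every call, even when the key already exists.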
An example of inserting in a loop:
from collections import defaultdict

# pairs is assumed to be an iterable of (key, value) tuples defined elsewhere
d = defaultdict(list)
for key, value in pairs:
    d[key].append(value)
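For a version of the loop that runs on its own, here is a minimal sketch. The contents of pairs are hypothetical stand-ins for real input.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs standing in for real input
pairs = [("a", 1), ("b", 2), ("a", 3)]

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)   # no need to check whether key exists yet

print(dict(grouped))   # {'a': [1, 3], 'b': [2]}
```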
7. Keeping a dictionary in order
Use OrderedDict from the collections module.
from collections import OrderedDict

d = OrderedDict()
d["foo"] = 1
d["bar"] = 2
d["spam"] = 3
d["grok"] = 4
d["foo"] = 5
for k in d:
    print(k, ":", d[k])
foo : 5
bar : 2
spam : 3
grok : 4
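As shown, changing the value of a key that has already been inserted does not change that item's position in the ordered dictionary.
OrderedDict maintains a doubly linked list that orders the keys by insertion, so it can use more than twice the memory of an ordinary dict.
It is useful for controlling the order of fields when encoding data as JSON.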
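To illustrate the JSON use case, the following sketch encodes an OrderedDict with the standard json module. The keys are the sample ones from above; move_to_end() is shown as one way to reposition a field.

```python
import json
from collections import OrderedDict

d = OrderedDict()
d["foo"] = 1
d["bar"] = 2
d["spam"] = 3

# json.dumps emits the keys in the order the mapping yields them
encoded = json.dumps(d)
print(encoded)            # {"foo": 1, "bar": 2, "spam": 3}

# move_to_end() repositions an existing key without changing its value
d.move_to_end("foo")
reordered = json.dumps(d)
print(reordered)          # {"bar": 2, "spam": 3, "foo": 1}
```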
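8. Calculating with dictionaries
prices = {
    "ACME": 45.23,
    "AAPL": 612.78,
    "IBM": 205.55,
    "HPQ": 37.20,
    "FB": 10.75
}
# Put the value first so that the tuples compare by price rather than by name
print(max(zip(prices.values(), prices.keys())))
print(min(zip(prices.values(), prices.keys())))
print("-"*10)
prices_sorted = sorted(zip(prices.values(), prices.keys()))
for item in prices_sorted:
    print(item)
(612.78, 'AAPL')
(10.75, 'FB')
----------
(10.75, 'FB')
(37.2, 'HPQ')
(45.23, 'ACME')
(205.55, 'IBM')
(612.78, 'AAPL')
zip() can swap the keys and values without modifying the original dictionary. It returns an iterator, which can be consumed only once.
If you apply min(), max() or sorted() to the dictionary directly, only the keys are compared.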
Alternatively, pass a key function and look up the values afterwards:
prices = {
    "ACME": 45.23,
    "AAPL": 612.78,
    "IBM": 205.55,
    "HPQ": 37.20,
    "FB": 10.75
}
minItem = min(prices, key=lambda k: prices[k])
maxItem = max(prices, key=lambda k: prices[k])
minValue = prices[minItem]
maxValue = prices[maxItem]
print(minItem, maxItem, "="*5, minValue, maxValue)
FB AAPL ===== 10.75 612.78
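The earlier remark that zip() produces a one-shot iterator is easy to trip over. A minimal sketch, using a shortened copy of the prices data:

```python
prices = {"ACME": 45.23, "AAPL": 612.78, "IBM": 205.55}

pairs = zip(prices.values(), prices.keys())
lowest = min(pairs)          # this consumes the iterator
print(lowest)                # (45.23, 'ACME')

exhausted = False
try:
    max(pairs)               # nothing is left to iterate over
except ValueError:
    exhausted = True
print(exhausted)             # True
```

If the pairs are needed more than once, call zip() again or store them in a list first.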
9. Finding what two dictionaries have in common
Use set operations such as &, | and - on the views returned by keys() and items().
a = {"x": 1, "y": 2, "z": 3}
b = {"w": 10, "x": 11, "y": 2}
# Find common keys
print(a.keys() & b.keys())
# Find keys in a but not in b
print(a.keys() - b.keys())
# Find (key, value) pairs in common
print(a.items() & b.items())
# Create a new dictionary with certain keys removed
c = {key: a[key] for key in a.keys() - {"z", "w"}}
print(c)
{'y', 'x'}
{'z'}
{('y', 2)}
{'y': 2, 'x': 1}
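10. Removing duplicates from a sequence while preserving order
If the values in the sequence are hashable (objects that do not change during their lifetime and have a __hash__() method, such as integers, floats, strings and tuples):
def dedupe(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
lst = list(dedupe(a))
print(lst)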
[1, 5, 2, 9, 10]
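If the values are not hashable:
def dedupe(items, key=None):
    seen = set()
    for item in items:
        # key converts an unhashable item into a hashable value
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

b = [{"x": 1, "y": 2},
     {"x": 1, "y": 3},
     {"x": 1, "y": 2},
     {"x": 2, "y": 4}]
lst = list(dedupe(b, key=lambda k: (k["x"], k["y"])))
print(lst)
lst2 = list(dedupe(b, key=lambda k: k["x"]))
print(lst2)
The idea is to find a way to turn each unhashable item into a hashable one.
A set can also remove duplicates, but it does not guarantee that the original order is preserved.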
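For hashable values there is also a shorter route than the generator above. This is not from the original notes: it relies on the fact that plain dicts keep insertion order on Python 3.7 and later, so treat it as a sketch under that assumption.

```python
a = [1, 5, 2, 1, 9, 1, 5, 10]

# A set removes duplicates but makes no promise about order
unique_unordered = set(a)

# dict keys are unique and, on Python 3.7+, keep first-insertion order
unique_ordered = list(dict.fromkeys(a))
print(unique_ordered)   # [1, 5, 2, 9, 10]
```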
11. Naming a slice
s = "Hello World"
a = slice(2, 5)
print(s[a])
llo
The indices(size) method maps a slice onto a sequence of a given size, clamping it to a safe range.
s = "HelloWorld"
a = slice(5, 50, 2)
print(a.start)
print(a.stop)
print(a.step)
print(a.indices(len(s)))
for i in range(*a.indices(len(s))):
    print(s[i])
5
50
2
(5, 10, 2)
W
r
d
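This way, indexing with the resulting values will not raise an IndexError because the slice bounds exceed the length of the sequence.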
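The practical benefit of naming a slice is readability when positions are hard-coded. A minimal sketch, assuming a made-up fixed-width record layout:

```python
# Hypothetical fixed-width record: columns 0-5 hold a name, 6-8 hold a score
record = "Mike  073 2017-12-28"

NAME = slice(0, 6)
SCORE = slice(6, 9)

name = record[NAME].strip()
score = int(record[SCORE])
print(name, score)   # Mike 73
```

Compared with record[0:6] and record[6:9] scattered through the code, the named slices state what each range means and only need changing in one place.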
12. Finding the most frequently occurring items in a sequence
The Counter class in collections provides this.
from collections import Counter

words = [
    'ear', 'head', 'nose', 'ear', 'look', 'see', 'head', 'ear',
    'nose', 'ear', 'read', 'see', 'head', 'see', 'watch', 'look',
    'hair', 'see', 'ear', 'big', 'small', 'do', 'hair', 'nose',
    'head', 'big', 'large', 'ear', 'do', 'ear'
]
word_counter = Counter(words)
most_three_counter = word_counter.most_common(3)
print(most_three_counter)
[('ear', 7), ('head', 4), ('see', 4)]
Incrementing the counts manually:
from collections import Counter

words = [
    'ear', 'head', 'nose', 'ear', 'look', 'see', 'head', 'ear',
    'nose', 'ear', 'read', 'see', 'head', 'see', 'watch', 'look',
    'hair', 'see', 'ear', 'big', 'small', 'do', 'hair', 'nose',
    'head', 'big', 'large', 'ear', 'do', 'ear'
]
addwords = ['ear', 'head', 'small', 'big', 'do']
word_counter = Counter(words)
for word in addwords:
    word_counter[word] += 1
print(word_counter)
or
from collections import Counter

words = [
    'ear', 'head', 'nose', 'ear', 'look', 'see', 'head', 'ear',
    'nose', 'ear', 'read', 'see', 'head', 'see', 'watch', 'look',
    'hair', 'see', 'ear', 'big', 'small', 'do', 'hair', 'nose',
    'head', 'big', 'large', 'ear', 'do', 'ear'
]
addwords = ['ear', 'head', 'small', 'big', 'do']
word_counter = Counter(words)
word_counter.update(addwords)
print(word_counter)
Counter({'ear': 8, 'head': 5, 'see': 4, 'nose': 3, 'big': 3, 'do': 3, 'look': 2, 'hair': 2, 'small': 2, 'read': 1, 'watch': 1, 'large': 1})
Counter objects support addition and subtraction:
from collections import Counter

words = [
    'ear', 'head', 'nose', 'ear', 'look', 'see', 'head', 'ear',
    'nose', 'ear', 'read', 'see', 'head', 'see', 'watch', 'look',
    'hair', 'see', 'ear', 'big', 'small', 'do', 'hair', 'nose',
    'head', 'big', 'large', 'ear', 'do', 'ear'
]
addwords = ['ear', 'head', 'small', 'big', 'do']
word_counter = Counter(words)
addwords_counter = Counter(addwords)
mix = word_counter + addwords_counter
print(mix)
subtract = word_counter - addwords_counter
print(subtract)
Counter({'ear': 8, 'head': 5, 'see': 4, 'nose': 3, 'big': 3, 'do': 3, 'look': 2, 'hair': 2, 'small': 2, 'read': 1, 'watch': 1, 'large': 1})
Counter({'ear': 6, 'see': 4, 'head': 3, 'nose': 3, 'look': 2, 'hair': 2, 'read': 1, 'watch': 1, 'big': 1, 'do': 1, 'large': 1})
13. Sorting a list of dictionaries by a common key
Use the itemgetter() function from operator as the sort key.
from operator import itemgetter

data = [
    {'ID': 1, "Name": "Ben", "score": 88},
    {'ID': 2, "Name": "Jack", "score": 92},
    {'ID': 3, "Name": "Mike", "score": 73},
    {'ID': 4, "Name": "Mark", "score": 81},
    {'ID': 5, "Name": "Lucy", "score": 95},
]
rows_by_Name = sorted(data, key=itemgetter('Name'))
rows_by_Score = sorted(data, key=itemgetter('score'))
print(rows_by_Name)
print(rows_by_Score)
[{'ID': 1, 'Name': 'Ben', 'score': 88}, {'ID': 2, 'Name': 'Jack', 'score': 92}, {'ID': 5, 'Name': 'Lucy', 'score': 95}, {'ID': 4, 'Name': 'Mark', 'score': 81}, {'ID': 3, 'Name': 'Mike', 'score': 73}]
[{'ID': 3, 'Name': 'Mike', 'score': 73}, {'ID': 4, 'Name': 'Mark', 'score': 81}, {'ID': 1, 'Name': 'Ben', 'score': 88}, {'ID': 2, 'Name': 'Jack', 'score': 92}, {'ID': 5, 'Name': 'Lucy', 'score': 95}]
A lambda can be used in place of itemgetter(), but itemgetter() is usually somewhat faster.
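itemgetter() also accepts several keys at once, in which case it returns a tuple and the sort falls back to the later keys on ties. A minimal sketch with made-up rows:

```python
from operator import itemgetter

# Hypothetical rows with a tie on "score"
rows = [
    {"ID": 1, "Name": "Ben", "score": 88},
    {"ID": 2, "Name": "Jack", "score": 92},
    {"ID": 3, "Name": "Amy", "score": 88},
]

# With several keys, itemgetter returns a tuple: ties on score fall back to Name
by_score_then_name = sorted(rows, key=itemgetter("score", "Name"))
names = [r["Name"] for r in by_score_then_name]
print(names)   # ['Amy', 'Ben', 'Jack']

# The same key function works with min() and max()
top = max(rows, key=itemgetter("score"))
print(top["Name"])   # Jack
```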
14. Sorting objects that do not natively support comparison
For example, two objects can be compared through one of their attributes.
class User(object):
    def __init__(self, id):
        self.id = id

    def __repr__(self):
        return "User({})".format(self.id)

users = [User(2), User(3), User(1)]
print(users)
user_ordered = sorted(users, key=lambda k: k.id)
print(user_ordered)
[User(2), User(3), User(1)]
[User(1), User(2), User(3)]
You can also use attrgetter() from operator:
from operator import attrgetter

class User(object):
    def __init__(self, id):
        self.id = id

    def __repr__(self):
        return "User({})".format(self.id)

users = [User(2), User(3), User(1)]
print(users)
user_ordered = sorted(users, key=attrgetter('id'))
print(user_ordered)
Likewise, attrgetter() is usually a little faster than the lambda.
15. Grouping records by a field
This is done with itemgetter() together with groupby() from the itertools module.
from operator import itemgetter
from itertools import groupby
data = [
    {"Name": "Jack", "Age": 21},
    {"Name": "Ben", "Age": 22},
    {"Name": "Lucy", "Age": 23},
    {"Name": "Lily", "Age": 23},
    {"Name": "Mike", "Age": 21},
    {"Name": "Martin", "Age": 21},
    {"Name": "Susan", "Age": 23},
    {"Name": "Rose", "Age": 22},
    {"Name": "Hill", "Age": 22},
]
sortdata = sorted(data, key=itemgetter('Age'))
for age, rows in groupby(sortdata, key=itemgetter("Age")):
    print(age)
    for row in rows:
        print(" ", row)
21
  {'Name': 'Jack', 'Age': 21}
  {'Name': 'Mike', 'Age': 21}
  {'Name': 'Martin', 'Age': 21}
22
  {'Name': 'Ben', 'Age': 22}
  {'Name': 'Rose', 'Age': 22}
  {'Name': 'Hill', 'Age': 22}
23
  {'Name': 'Lucy', 'Age': 23}
  {'Name': 'Lily', 'Age': 23}
  {'Name': 'Susan', 'Age': 23}
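Sort the sequence by the grouping field before grouping, because groupby() only groups consecutive items that share the same key.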
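A minimal sketch of what goes wrong when the sort is skipped, using a few made-up rows:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, deliberately not sorted by Age
rows = [
    {"Name": "Jack", "Age": 21},
    {"Name": "Ben", "Age": 22},
    {"Name": "Mike", "Age": 21},
]

# Unsorted: age 21 shows up as two separate groups
unsorted_keys = [age for age, _ in groupby(rows, key=itemgetter("Age"))]
print(unsorted_keys)   # [21, 22, 21]

# Sorted first: each age appears exactly once
rows.sort(key=itemgetter("Age"))
sorted_keys = [age for age, _ in groupby(rows, key=itemgetter("Age"))]
print(sorted_keys)     # [21, 22]
```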
16. Filtering the elements of a sequence
The built-in filter() function:
def is_int(val):
    # isinstance already returns a bool, so no if/else is needed
    return isinstance(val, int)

a = [1, 2, "aaa", "b", 5, "cc"]
b = list(filter(is_int, a))
print(b)
[1, 2, 5]
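The same filter can be written inline as a list comprehension. The sketch below also points out one caveat of an isinstance(val, int) test: bool is a subclass of int, so True and False pass it. The sample list is made up.

```python
values = [1, 2, "aaa", True, 5, "cc"]

# A list comprehension expresses the same filter inline
ints = [v for v in values if isinstance(v, int)]
print(ints)          # [1, 2, True, 5]  (bool is a subclass of int)

# Excluding bool explicitly
strict_ints = [v for v in values
               if isinstance(v, int) and not isinstance(v, bool)]
print(strict_ints)   # [1, 2, 5]
```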
compress() from the itertools library:
from itertools import compress

name = ["Jack", "Lucy", "Ben", "Lily", "Mike"]
age = [8, 11, 12, 9, 11]
more = [n > 10 for n in age]
print(more)
selected = list(compress(name, more))
print(selected)
[False, True, True, False, True]
['Lucy', 'Ben', 'Mike']
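First build a sequence of Boolean values, then pass it to compress() as the selector; only the items whose selector is True are kept.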
17. Extracting a subset of a dictionary
Use a dictionary comprehension.
data = {
    "a": 1,
    "b": 3,
    "c": 5,
    "d": 7,
    "f": 8
}
sondata1 = {k: v for k, v in data.items() if v > 4}
dataname = ["a", "c", "f"]
sondata2 = {k: v for k, v in data.items() if k in dataname}
print(sondata1)
print(sondata2)
{'c': 5, 'd': 7, 'f': 8}
{'a': 1, 'c': 5, 'f': 8}
18. Mapping names to the elements of a sequence
Use namedtuple from the collections module.
from collections import namedtuple

name = ("Data", ["Name", "Id"])
Data = namedtuple(name[0], name[1])
data = Data("Mike", 1)
print(data)

def change_data(s):
    # Build a new tuple from the prototype, replacing fields from dict s
    return data._replace(**s)

a = {"Name": "Jack", "Id": 2}
b = change_data(a)
print(b)
Data(Name='Mike', Id=1)
Data(Name='Jack', Id=2)
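The fields of a namedtuple cannot be modified in place; _replace() returns a new instance with the changed values instead. When a large amount of data has to be updated, consider a class that defines __slots__.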
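The difference between the two options can be sketched as follows. MutableData is a hypothetical class name introduced only for this illustration.

```python
from collections import namedtuple

Data = namedtuple("Data", ["Name", "Id"])
data = Data("Mike", 1)

# Fields cannot be assigned in place
try:
    data.Id = 2
except AttributeError:
    print("namedtuple fields are read-only")

# _replace returns a new tuple; the original is unchanged
updated = data._replace(Id=2)
print(data, updated)   # Data(Name='Mike', Id=1) Data(Name='Mike', Id=2)

# A mutable alternative: a small class with __slots__ (hypothetical name)
class MutableData:
    __slots__ = ("Name", "Id")

    def __init__(self, name, id_):
        self.Name = name
        self.Id = id_

m = MutableData("Mike", 1)
m.Id = 2               # allowed
print(m.Name, m.Id)    # Mike 2
```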
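19. Transforming and reducing data at the same time
Use a comprehension to transform the data; the result can then be passed to a reducing function such as sum(), min() or max().
num = [1, 2, 3, 4, 5]
squares = [n * n for n in num]   # avoid naming this "sum", which would shadow the built-in
print(squares)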
[1, 4, 9, 16, 25]
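To combine the transformation and the reduction in one step, a generator expression can be passed directly as the function argument. A minimal sketch with the same numbers:

```python
num = [1, 2, 3, 4, 5]

# The generator expression feeds sum() one square at a time;
# no intermediate list is built
total = sum(n * n for n in num)
print(total)   # 55

# It works with other reducers too, for example any() and min()
has_large = any(n > 4 for n in num)
smallest_square = min(n * n for n in num)
print(has_large, smallest_square)   # True 1
```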
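20. Combining multiple mappings into a single mapping
Use ChainMap from the collections module.
When the same key appears in more than one mapping, the value from the first mapping is used. Lookups search the mappings in order, while operations that modify the ChainMap (inserting, updating, deleting) always affect the first mapping.
from collections import ChainMap

a = {"x": 1, "y": 2}
b = {"z": 3, "y": 4}
c = ChainMap(a, b)
print(c)
print(c["y"])
a["x"] = 5      # the ChainMap references a, so the change is visible
print(c)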
ChainMap({'x': 1, 'y': 2}, {'z': 3, 'y': 4})
2
ChainMap({'x': 5, 'y': 2}, {'z': 3, 'y': 4})
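The rule that modifications always land in the first mapping can be checked with a short sketch using the same two dictionaries:

```python
from collections import ChainMap

a = {"x": 1, "y": 2}
b = {"z": 3, "y": 4}
c = ChainMap(a, b)

c["z"] = 30        # "z" exists only in b, yet the write lands in a
print(a)           # {'x': 1, 'y': 2, 'z': 30}
print(b)           # {'z': 3, 'y': 4}

del c["y"]         # removes a's "y"; b's "y" now shows through
print(c["y"])      # 4

try:
    del c["y"]     # "y" is no longer in the first mapping
except KeyError:
    print("KeyError: 'y' is not in the first mapping")
```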
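Alternatively, build a new dictionary that merges the two with update(). The result is a separate copy, so later changes to the originals are not reflected in it, and for duplicate keys the value from the dictionary merged last wins.
a = {"x": 1, "y": 2}
b = {"z": 3, "y": 4}
c = dict(a)
c.update(b)
print(c)
a["x"] = 5
print(c)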
{'x': 1, 'y': 4, 'z': 3}
{'x': 1, 'y': 4, 'z': 3}