[6] Comprehending Data
I: Writing the program
We have four sets of stopwatch recordings, as follows:
- james.txt:2-34,3:21,2.34,2.45,3.01,2:01,2:01,3:10,2-22
- julie.txt:2.59,2.11,2:11,2:23,3-10,2-23,3:10,3.21,3-21
- mikey.txt:2:22,3.01,3:01,3.02,3:02,3.02,3:22,2.49,2:38
- sarah.txt:2:58,2.58,2:39,2-25,2-55,2:54,2.18,2:55,2:55
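1. Read the data from each file into its own list: write a short program that processes each file, creates a list for each athlete's data, and displays the lists on screen.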
james.txt
In [1]: with open('james.txt',"r") as jam:
   ...:     data=jam.readline()
   ...:     # strip(): removes surrounding whitespace, including the trailing newline
   ...:     # split(): splits the string (the quickest way to turn it into a list)
   ...:     james=data.strip().split(",")
   ...:

In [2]: james  # the variable james
Out[2]: ['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']

# contents of james.txt
In [3]: cat james.txt
2-34,3:21,2.34,2.45,3.01,2:01,2:01,3:10,2-22
julie.txt
In [4]: with open("julie.txt","r") as ju:
   ...:     data=ju.readline()
   ...:     julie=data.strip().split(",")
   ...:

In [5]: julie
Out[5]: ['2.59', '2.11', '2:11', '2:23', '3-10', '2-23', '3:10', '3.21', '3-21']

In [6]: cat julie.txt
2.59,2.11,2:11,2:23,3-10,2-23,3:10,3.21,3-21
mikey.txt
In [7]: with open("mikey.txt","r") as mi:
   ...:     data=mi.readline()
   ...:     mikey=data.strip().split(",")
   ...:

In [8]: mikey
Out[8]: ['2:22', '3.01', '3:01', '3.02', '3:02', '3.02', '3:22', '2.49', '2:38']

In [9]: cat mikey.txt
2:22,3.01,3:01,3.02,3:02,3.02,3:22,2.49,2:38
sarah.txt
In [12]: with open("sarah.txt","r") as sa:
    ...:     data=sa.readline()
    ...:     sarah=data.strip().split(",")
    ...:

In [13]: sarah
Out[13]: ['2:58', '2.58', '2:39', '2-25', '2-55', '2:54', '2.18', '2:55', '2:55']

In [14]: cat sarah.txt
2:58,2.58,2:39,2-25,2-55,2:54,2.18,2:55,2:55
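II: Two ways to sort
- In-place sort: the list method sort(), ascending by default
- Descending: sort(reverse=True)
- Copied sort: the sorted() BIF, ascending by default
- Descending: sorted(data, reverse=True)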
In [15]: data=[2,3,4,543221,333,1,2,3,2]

In [16]: data  # original data
Out[16]: [2, 3, 4, 543221, 333, 1, 2, 3, 2]

In [17]: data.sort()  # in-place sort (ascending)

In [18]: data
Out[18]: [1, 2, 2, 2, 3, 3, 4, 333, 543221]

In [19]: data=[2,3,4,543221,333,1,2,3,2]

In [20]: data2=sorted(data)  # copied sort

In [21]: data
Out[21]: [2, 3, 4, 543221, 333, 1, 2, 3, 2]

In [22]: data2
Out[22]: [1, 2, 2, 2, 3, 3, 4, 333, 543221]

In [24]: data.sort(reverse=True)  # in-place sort (descending)

In [25]: data
Out[25]: [543221, 333, 4, 3, 3, 2, 2, 2, 1]
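The session above does not show a descending copied sort, so here is a minimal sketch of it next to the in-place version. It uses the same sample data; the names `original`, `descending_copy` and `result` are only illustrative. Note that `reverse=True` is passed to `sorted()` together with the list.

```python
data = [2, 3, 4, 543221, 333, 1, 2, 3, 2]
original = list(data)            # keep a copy so we can check what changed

# sorted() returns a new list and leaves its argument untouched
descending_copy = sorted(data, reverse=True)
print(descending_copy)           # [543221, 333, 4, 3, 3, 2, 2, 2, 1]
print(data == original)          # True

# list.sort() reorders the list itself and returns None
result = data.sort(reverse=True)
print(result)                    # None
print(data == descending_copy)   # True
```

Because `sort()` returns `None`, writing `data = data.sort()` throws the list away; use `sorted()` when a return value is needed.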
1. Sorting julie
In [32]: julie
Out[32]: ['2.59', '2.11', '2:11', '2:23', '3-10', '2-23', '3:10', '3.21', '3-21']

In [33]: julie2=sorted(julie)

In [34]: julie2
Out[34]: ['2-23', '2.11', '2.59', '2:11', '2:23', '3-10', '3-21', '3.21', '3:10']

In [35]: julie
Out[35]: ['2.59', '2.11', '2:11', '2:23', '3-10', '2-23', '3:10', '3.21', '3-21']
# Note: this code still needs fixing
The output above shows that the inconsistent data format breaks the sort order: '2-23' ends up ahead of '2.11'.
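A short sketch of why this happens, assuming nothing beyond the julie data above: Python compares strings character by character using code points, and the three separators have different code points.

```python
julie = ['2.59', '2.11', '2:11', '2:23', '3-10', '2-23', '3:10', '3.21', '3-21']

# The separators sort by code point: '-' (45) < '.' (46) < ':' (58)
separators = sorted(['.', ':', '-'])
print(separators)                      # ['-', '.', ':']
print([ord(c) for c in separators])    # [45, 46, 58]

# So any '2-xx' sorts before any '2.xx', whatever the seconds are
print('2-23' < '2.11')                 # True
print(sorted(julie)[:2])               # ['2-23', '2.11']
```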
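Approach: 1. Write a function that takes one string from a list of stopwatch times, replaces any dash or colon it finds with a period, and returns the cleaned string.
2. Create an empty list, append the cleaned data to it, and then sort that list.
Note: if a string already contains a period, it needs no cleaning.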
2. The corrected version (sorting james properly):
In [55]: james  # original data
Out[55]: ['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']

In [56]: clean_james=[]  # create an empty list

In [57]: clean_james
Out[57]: []

# Define a function that normalizes the format (turns every ':' and '-' into '.')
In [58]: def sanitize(time_string):
    ...:     if '-' in time_string:
    ...:         splitter="-"
    ...:     elif ":" in time_string:
    ...:         splitter=":"
    ...:     else:
    ...:         return time_string
    ...:     (mins1,secs1)=time_string.split(splitter)
    ...:     return(mins1+"."+secs1)
    ...:

# Loop over the james list, convert each item to mins.secs form, and append it to clean_james
In [59]: for i in james:
    ...:     clean_james.append(sanitize(i))
    ...: print(clean_james)
    ...: print(sorted(clean_james))  # sort the cleaned list
    ...:
['2.34', '3.21', '2.34', '2.45', '3.01', '2.01', '2.01', '3.10', '2.22']
['2.01', '2.01', '2.22', '2.34', '2.34', '2.45', '3.01', '3.10', '3.21']
3. Another way to replace the colons and dashes:
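james=['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']
print(type(james))      # <class 'list'>
tihuan=str(james)       # convert the whole list into one string
tihuan1=tihuan.replace("-",".")
tihuan2=tihuan1.replace(":",".")
print(tihuan2)
print(type(tihuan2))    # <class 'str'>: the result is a string, not a list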
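Calling `replace()` on `str(james)` yields one long string rather than a list. If the result has to stay a list, one possible variation (a sketch, not part of the original notes; `as_text` and `replaced` are illustrative names) is to call `replace()` on each element instead:

```python
james = ['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']

# Replacing on str(james) works on the text of the whole list
as_text = str(james).replace("-", ".").replace(":", ".")
print(type(as_text).__name__)    # str

# Replacing inside each element keeps the result a list of strings
replaced = []
for t in james:
    replaced.append(t.replace("-", ".").replace(":", "."))
print(replaced)
# ['2.34', '3.21', '2.34', '2.45', '3.01', '2.01', '2.01', '3.10', '2.22']
```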
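List comprehensions
Transforming one list into another takes four steps, which a list comprehension condenses into a single line:
- Create a new list to hold the transformed data
- Iterate over each item in the original list
- Transform the item on each iteration
- Append the transformed item to the new list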
# Convert minutes to seconds
In [60]: mins=[1,2,3]

In [61]: secs=[m*60 for m in mins]

In [62]: secs
Out[62]: [60, 120, 180]

# Convert the lowercase strings in name to uppercase
In [63]: name=["my","name","is","huahua"]

In [66]: upper=[s.upper() for s in name]

In [67]: upper
Out[67]: ['MY', 'NAME', 'IS', 'HUAHUA']

# Convert the strings in data to float
In [68]: data=['2.01','2.22','9.66']

In [69]: data1=[float(q) for q in data]

In [70]: data1
Out[70]: [2.01, 2.22, 9.66]
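To make the equivalence explicit, here is a small sketch that writes the minutes-to-seconds conversion both ways and then adds an optional `if` filter. The names `secs_loop`, `secs_comp` and `long_secs` are only illustrative.

```python
mins = [1, 2, 3]

# Long form: create the list, iterate, transform, append
secs_loop = []
for m in mins:
    secs_loop.append(m * 60)

# Comprehension: the same four steps in one expression
secs_comp = [m * 60 for m in mins]
print(secs_loop == secs_comp)    # True

# An optional condition keeps only the items that pass the test
long_secs = [m * 60 for m in mins if m >= 2]
print(long_secs)                 # [120, 180]
```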
4. Simplifying the code from step 2:
In [71]: james
Out[71]: ['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']

In [72]: def sanitize(time_string):
    ...:     if '-' in time_string:
    ...:         splitter="-"
    ...:     elif ":" in time_string:
    ...:         splitter=":"
    ...:     else:
    ...:         return time_string
    ...:     (mins1,secs1)=time_string.split(splitter)
    ...:     return(mins1+"."+secs1)
    ...:

In [73]: print(sorted([sanitize(i) for i in james]))  # list comprehension
['2.01', '2.01', '2.22', '2.34', '2.34', '2.45', '3.01', '3.10', '3.21']
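List slicing
5. Remove duplicates by iteration and print the three fastest times
Approach:
- Create a new empty list
- Fill it with the unique items found in james (using not in)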
In [76]: james  # original james data
Out[76]: ['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']

# Function that normalizes the data format
In [77]: def sanitize(time_string):
    ...:     if '-' in time_string:
    ...:         splitter="-"
    ...:     elif ":" in time_string:
    ...:         splitter=":"
    ...:     else:
    ...:         return time_string
    ...:     (mins1,secs1)=time_string.split(splitter)
    ...:     return(mins1+"."+secs1)
    ...:

# Clean the james data and sort it
In [78]: james1=(sorted([sanitize(i)for i in james]))

# The sorted data
In [79]: james1
Out[79]: ['2.01', '2.01', '2.22', '2.34', '2.34', '2.45', '3.01', '3.10', '3.21']

# Empty list
In [80]: unique_james=[]

# Loop over james1 and append each element to unique_james only if it is not already there
In [81]: for s in james1:
    ...:     if s not in unique_james:
    ...:         unique_james.append(s)
    ...: print(unique_james[0:3])  # print the three fastest times
    ...:
['2.01', '2.22', '2.34']
Removing duplicates with a set
Note: a set cannot contain duplicate elements
In [82]: s={1,2,3,4,5,5,5,5,5,5,}

In [83]: s
Out[83]: {1, 2, 3, 4, 5}
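A small sketch of the same idea on stopwatch-style strings. One thing to keep in mind is that a set has no defined order, so it should be sorted before slicing. The `dict.fromkeys` line is an extra not used in these notes: it removes duplicates while keeping first-seen order, since dicts preserve insertion order from Python 3.7 on.

```python
times = ['2.34', '3.21', '2.34', '2.45', '3.01', '2.01', '2.01', '3.10', '2.22']

unique = set(times)
print(len(times), len(unique))   # 9 7

# Sort the set to get a list in a predictable order
print(sorted(unique))
# ['2.01', '2.22', '2.34', '2.45', '3.01', '3.10', '3.21']

# Alternative: drop duplicates but keep the original order of first appearance
in_order = list(dict.fromkeys(times))
print(in_order)
# ['2.34', '3.21', '2.45', '3.01', '2.01', '3.10', '2.22']
```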
6. Rewriting the code above with set and list slicing (printing the three fastest times)
In [86]: james
Out[86]: ['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']

In [87]: def sanitize(time_string):
    ...:     if '-' in time_string:
    ...:         splitter="-"
    ...:     elif ":" in time_string:
    ...:         splitter=":"
    ...:     else:
    ...:         return time_string
    ...:     (mins1,secs1)=time_string.split(splitter)
    ...:     return(mins1+"."+secs1)
    ...:

# 1. Turn the cleaned list into a set
# 2. Then sort the set
In [88]: james1=(sorted(set([sanitize(i)for i in james])))

In [89]: james1
Out[89]: ['2.01', '2.22', '2.34', '2.45', '3.01', '3.10', '3.21']

# Take the three fastest times
In [90]: james1=(sorted(set([sanitize(i)for i in james]))[0:3])

In [91]: james1
Out[91]: ['2.01', '2.22', '2.34']
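One caveat worth noting: the cleaned times are still strings, so `sorted()` compares them as text. That gives the right order here only because every time has a single-digit minute. The sketch below uses made-up sample values and a hypothetical helper, `to_seconds`, to show a numeric sort key for data where that assumption does not hold.

```python
def to_seconds(time_string):
    """Convert a cleaned 'mins.secs' string into a total number of seconds."""
    mins, secs = time_string.split(".")
    return int(mins) * 60 + int(secs)

cleaned = ['2.01', '10.05', '2.22', '9.59']    # made-up sample values

# Text comparison puts '10.05' first because '1' sorts before '2'
print(sorted(cleaned))                     # ['10.05', '2.01', '2.22', '9.59']

# A numeric key sorts by the actual duration
print(sorted(cleaned, key=to_seconds))     # ['2.01', '2.22', '9.59', '10.05']
```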
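Summary of key points:
sort()
- In-place sort (ascending)
- a.sort()
- Descending: a.sort(reverse=True)
sorted()
- Copied sort (ascending)
- data1=sorted(data)
- Descending: sorted(data, reverse=True)
List comprehensions
- [expression for variable in list] or [expression for variable in list if condition]
- [m*60 for m in mins]
List slicing
- For slicing purposes, list, tuple, and str (strings) are all sequences, and all of them can be sliced by the same rules
- Note that index 0 is the first element counting from the front and index -1 is the first element counting from the back (that is, the last element); a slice excludes its right boundary, so [0:3] covers elements 0, 1, and 2 but not 3.
- james[0:3]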
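A few slicing cases as a quick sketch, using the deduplicated james times from earlier plus a string and a tuple to show that the same rules apply to every sequence type:

```python
james1 = ['2.01', '2.22', '2.34', '2.45', '3.01', '3.10', '3.21']

print(james1[0:3])     # ['2.01', '2.22', '2.34']  (index 3 is excluded)
print(james1[:3])      # same result; a missing start means 0
print(james1[-1])      # '3.21'  (the last element)
print(james1[-3:])     # ['3.01', '3.10', '3.21']  (the last three)

# Strings and tuples slice the same way
print("2.01"[0:1])     # '2'
print((1, 2, 3)[1:])   # (2, 3)
```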
set
- A set is unordered
- It contains no duplicate elements (so a set can be used to remove duplicates)
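#coding=utf-8
"""
Overall goal: take the three fastest times from the four sets of stopwatch records
"""
# Read the contents of a file
def get_filecontent(filename):
    try:
        with open(filename) as f:
            data=f.readline().strip().split(",")
            return data
    except IOError as e:
        raise e

# Clean the data
def sanitize(time_string):
    if "-" in time_string:
        splitter='-'
    elif ":" in time_string:
        splitter=':'
    else:
        return time_string
    (mins,secs)=time_string.strip().split(splitter)
    return(mins+"."+secs)

# Sort and take the three fastest times
# Raw string, so the backslashes in the Windows path are not read as escape sequences
yssj=get_filecontent(r"D:\pydj\james.txt")
print(yssj)
# Walking through the comprehension
# 1. Iterate over each element of yssj and clean it into mins.secs format
# 2. set: turn the cleaned data into a set, which drops duplicates because a set cannot hold repeated elements
# 3. sorted: return a sorted copy of the set
# 4. [0:3]: take the first three items
print(sorted(set([sanitize(i)for i in yssj]))[0:3])
#clean_sj=[]
#for i in yssj:
#    clean_sj.append(sanitize(i))
#print(clean_sj)
#print(sorted(set(clean_sj))[0:3])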
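The script above handles only james.txt. As a rough sketch of how the same pipeline could cover all four records, the version below loops over every file. To stay self-contained it writes the sample data from the start of this section into a temporary directory instead of reading from D:\pydj, and it swaps the split-based `sanitize` for a shorter `replace`-based one. `top3` and `records` are hypothetical names introduced here.

```python
import os
import tempfile

def sanitize(time_string):
    """Normalize 'm-ss' and 'm:ss' to 'm.ss' (replace-based variant)."""
    return time_string.strip().replace("-", ".").replace(":", ".")

def top3(filename):
    """Return the three fastest unique times recorded in filename."""
    with open(filename) as f:
        times = f.readline().strip().split(",")
    return sorted(set(sanitize(t) for t in times))[0:3]

# Sample data copied from the four files listed at the start of this section
records = {
    "james": "2-34,3:21,2.34,2.45,3.01,2:01,2:01,3:10,2-22",
    "julie": "2.59,2.11,2:11,2:23,3-10,2-23,3:10,3.21,3-21",
    "mikey": "2:22,3.01,3:01,3.02,3:02,3.02,3:22,2.49,2:38",
    "sarah": "2:58,2.58,2:39,2-25,2-55,2:54,2.18,2:55,2:55",
}

results = {}
with tempfile.TemporaryDirectory() as folder:
    for name, line in records.items():
        path = os.path.join(folder, name + ".txt")
        with open(path, "w") as f:
            f.write(line + "\n")
        results[name] = top3(path)

for name, fastest in results.items():
    print(name, fastest)
# james ['2.01', '2.22', '2.34']
# julie ['2.11', '2.23', '2.59']
# mikey ['2.22', '2.38', '2.49']
# sarah ['2.18', '2.25', '2.39']
```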