Python Tutorial 004: Sets, File Operations, and Character Encoding/Transcoding

In this section

Sets
File operations
Character encoding and transcoding

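1. Set operations

A set is an unordered collection of unique items. Its main uses are:

Deduplication: converting a list to a set removes the duplicates automatically
Relationship testing: computing the intersection, difference, union, and so on between two groups of data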
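The deduplication use can be sketched in a few lines. The sample names below are made up for illustration. Note that set() does not keep the original order; dict.fromkeys() is one common way to deduplicate while preserving it.

```python
# Hypothetical sample data, not taken from the tutorial.
names = ["alice", "bob", "alice", "carol", "bob"]

# set() drops duplicates, but the resulting order is not guaranteed.
unique_names = set(names)

# dict.fromkeys() keeps the first occurrence of each item, in order
# (dicts preserve insertion order since Python 3.7).
ordered_unique = list(dict.fromkeys(names))

print(len(unique_names))  # 3
print(ordered_unique)     # ['alice', 'bob', 'carol']
```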

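Common operations: set()

s = set([3,5,9,10]) # create a set of numbers

t = set("Hello") # create a set of unique characters

a = t | s # union of t and s

b = t & s # intersection of t and s

c = t - s # difference (items in t but not in s)

d = t ^ s # symmetric difference (items in t or s, but not in both)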

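Basic operations:

t.add('x') # add one item

s.update([10,37,42]) # add several items to s

Use remove() to delete one item (it raises KeyError if the item is missing):

t.remove('H')

len(s)

The number of elements in the set

x in s

Tests whether x is a member of s; returns True if it is, False otherwise

x not in s

Tests whether x is not a member of s; returns True if it is not, False otherwise

s.issubset(t)

s <= t

Tests whether every element of s is in t; returns True if so, False otherwise

s.issuperset(t)

s >= t

Tests whether every element of t is in s; returns True if so, False otherwise

s.union(t)

s | t

Returns a new set containing every element of s and t

s.intersection(t)

s & t

Returns a new set containing the elements common to s and t

s.difference(t)

s - t

Returns a new set containing the elements that are in s but not in t

s.symmetric_difference(t)

s ^ t

Returns a new set containing the elements that are in exactly one of s and t

s.copy()

Returns a shallow copy of the set s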
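One difference between the method and operator forms listed above is worth a quick check: the methods accept any iterable, while the operators need a set on both sides. The sketch below also shows an in-place operator. The values are arbitrary.

```python
s = {3, 5, 9, 10}

# Methods accept any iterable, so a list can be passed directly.
merged = s.union([10, 37, 42])

# Operators require both operands to be sets; a list raises TypeError.
try:
    s | [10, 37, 42]
    operator_ok = True
except TypeError:
    operator_ok = False

# In-place forms modify the set instead of returning a new one.
t = s.copy()
t &= {3, 10, 99}  # same effect as t.intersection_update({3, 10, 99})

print(merged == {3, 5, 9, 10, 37, 42})  # True
print(operator_ok)                      # False
print(t == {3, 10})                     # True
```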

=====================================eg.set.py=================================

list_1 = [11,9,5,6,7,8,7,2]
list_1 = set(list_1)  # convert the list to a set (duplicates are dropped)
print(list_1,type(list_1))  # result: {2, 5, 6, 7, 8, 9, 11} <class 'set'> (the display order of a set may vary)

list_2 = set([2,6,0,66,22,8,11])
print(list_1,list_2)   # result: {2, 5, 6, 7, 8, 9, 11} {0, 2, 66, 6, 8, 11, 22}
# intersection
print(list_1 & list_2) # result: {8, 2, 11, 6}
# equivalent to
print(list_1.intersection(list_2))  # result: {8, 2, 11, 6}



# union
print(list_1.union(list_2))  # result: {0, 2, 66, 5, 6, 7, 8, 9, 11, 22}
# equivalent to
print(list_1 | list_2) # result: {0, 2, 66, 5, 6, 7, 8, 9, 11, 22}

# difference: in list_1 but not in list_2
print(list_1.difference(list_2)) # result: {9, 5, 7}
print(list_2.difference(list_1)) # in list_2 but not in list_1; result: {0, 66, 22}

# subset & superset
list_3 = set([8, 2, 11, 6])
print(list_3.issubset(list_2)) # is list_3 a subset of list_2? True
print(list_3 <= list_2) # subset test; True

print(list_1.issuperset(list_3)) # is list_1 a superset of list_3? True
print(list_1 >= list_3) # superset test; True

# symmetric difference (the union of list_1 and list_2 minus their intersection)
print(list_1.symmetric_difference(list_2)) # result: {0, 66, 5, 7, 9, 22}
print(list_1 ^ list_2)   # result: {0, 66, 5, 7, 9, 22}

print("---------------")
list_4 = set([5,4,3,1])
print(list_3.isdisjoint(list_4)) # True when the two sets share no elements, False otherwise; here True


# adding
list_1.add(999) # add one item
print(list_1) # result: {2, 5, 6, 7, 8, 9, 999, 11}

list_1.update([888,777,555]) # add several items
print(list_1)          # result: {2, 5, 6, 7, 8, 9, 999, 11, 777, 555, 888}


print(list_1.pop())  # removes and returns an arbitrary element; here 2
print(list_1)    # result: {5, 6, 7, 8, 9, 999, 11, 777, 555, 888}
#print(list_1.remove('ddd')) # remove() raises KeyError if 'ddd' is not in the set
list_1.discard('ddd') # removes 'ddd' if present; otherwise does nothing and raises no error
print(list_1) # result: {5, 6, 7, 8, 9, 999, 11, 777, 555, 888}
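The different failure behaviour of remove(), discard(), and pop() can be verified with a short sketch. The set contents are placeholders chosen for this example.

```python
tags = {"python", "set"}

# discard() is silent when the item is absent.
tags.discard("missing")

# remove() raises KeyError when the item is absent.
try:
    tags.remove("missing")
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# pop() removes and returns an arbitrary element ...
popped = tags.pop()

# ... and raises KeyError on an empty set.
empty = set()
try:
    empty.pop()
    pop_failed = False
except KeyError:
    pop_failed = True

print(missing_key)  # missing
print(len(tags))    # 1
print(pop_failed)   # True
```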

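2. File operations

Workflow for operating on a file

Open the file to get a file handle and assign it to a variable
Operate on the file through the handle
Close the file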
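The three steps map onto code roughly as follows. This is a minimal sketch that assumes a throwaway file in a temporary directory; try/finally makes sure the close step runs even if the operation in the middle fails.

```python
import os
import tempfile

# Assumption: a scratch file stands in for a real data file.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w", encoding="utf-8")  # 1. open the file, keep the handle
try:
    f.write("hello\n")                 # 2. operate on it through the handle
finally:
    f.close()                          # 3. close it, even if step 2 raised

print(f.closed)  # True
```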

The sample file used below is as follows

yesterday.txt

Somehow, it seems the love I knew was always the most destructive kind

不知为何，我经历的爱情总是最具毁灭性的的那种

Yesterday when I was young

昨日当我年少轻狂

The taste of life was sweet

生命的滋味是甜的

As rain upon my tongue

就如舌尖上的雨露

I teased at life as if it were a foolish game

我戏弄生命视其为愚蠢的游戏

The way the evening breeze

就如夜晚的微风

May tease the candle flame

逗弄蜡烛的火苗

The thousand dreams I dreamed

我曾千万次梦见

The splendid things I planned

那些我计划的绚丽蓝图

I always built to last on weak and shifting sand

但我总是将之建筑在易逝的流沙上

I lived by night and shunned the naked light of day

我夜夜笙歌逃避白昼赤裸的阳光

And only now I see how the time ran away

事到如今我才看清岁月是如何匆匆流逝

Yesterday when I was young

昨日当我年少轻狂

So many lovely songs were waiting to be sung

有那么多甜美的曲儿等我歌唱

So many wild pleasures lay in store for me

有那么多肆意的快乐等我享受

And so much pain my eyes refused to see

还有那么多痛苦我的双眼却视而不见

I ran so fast that time and youth at last ran out

我飞快地奔走最终时光与青春消逝殆尽

I never stopped to think what life was all about

我从未停下脚步去思考生命的意义

And every conversation that I can now recall

如今回想起的所有对话

Concerned itself with me and nothing else at all

除了和我相关的什么都记不得了

The game of love I played with arrogance and pride

我用自负和傲慢玩着爱情的游戏

And every flame I lit too quickly, quickly died

所有我点燃的火焰都熄灭得太快

The friends I made all somehow seemed to slip away

所有我交的朋友似乎都不知不觉地离开了

And only now I'm left alone to end the play, yeah

只剩我一个人在台上来结束这场闹剧

Oh, yesterday when I was young

噢昨日当我年少轻狂

So many, many songs were waiting to be sung

有那么那么多甜美的曲儿等我歌唱

So many wild pleasures lay in store for me

有那么多肆意的快乐等我享受

And so much pain my eyes refused to see

还有那么多痛苦我的双眼却视而不见

There are so many songs in me that won't be sung

我有太多歌曲永远不会被唱起

I feel the bitter taste of tears upon my tongue

我尝到了舌尖泪水的苦涩滋味

The time has come for me to pay for yesterday

终于到了付出代价的时间为了昨日

When I was young

当我年少轻狂

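Basic operations

f = open('yesterday.txt', encoding='utf-8') # open the file

first_line = f.readline()

print('first line:',first_line) # read one line

print('divider'.center(50,'-'))

data = f.read() # read everything that remains; avoid this for large files

print(data) # print the rest of the file

f.close() # close the file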

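File open modes:

r, read-only mode (the default).
w, write-only mode. [not readable; creates the file if it does not exist; truncates existing content]
a, append mode. [not readable; creates the file if it does not exist; writes always go to the end]

"+" opens the file for both reading and writing

r+, read and write. [readable; writable; the file must already exist] The most commonly used
w+, write and read. [truncates or creates the file first, then allows reading back what was written]
a+, append and read. [like a, but also readable]

"U" meant universal newlines: \r, \n and \r\n were all converted to \n when reading (used together with r or r+). Text mode in Python 3 does this by default, and the "U" flag was removed in Python 3.11.

"b" opens the file in binary mode (e.g. sending an ISO image over FTP). In Python 3, binary mode reads and writes bytes instead of str on every platform; in Python 2 the flag could be omitted on Linux but was required on Windows.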

file_op.py

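#========================1=================
data=open("yesterday.txt",encoding="utf-8").read() # the default encoding depends on the platform locale (gbk on Chinese Windows), so specify utf-8 explicitly
print(data)


#=======================2====================

f=open("yesterday.txt",encoding="utf-8") # f is the file handle (an in-memory object that tracks things such as the file name, encoding, and current position)
data = f.read() # the first read() moves the "file pointer" from start to end and returns the whole file
data2= f.read()  # the second read() returns nothing, because the "file pointer" is already at the end
print(data)  # result: the full text
print('----------data2--------',data2) # result: empty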

#========================3================

f=open("yesterday2.txt",mode='w',encoding="utf-8") # create a new file; an existing file is overwritten

# a = append only   # r = read only   # w = write only
f.write("when i was young i listen to the radio..\n")
f.write("waiting for my favorite song.....\n")
f.close() # close the file

f=open("yesterday2.txt",mode='r',encoding="utf-8")
print(f.read())  # result: only the two lines written above
f.close() # close the file


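#===================================4=================
# The snippets below are alternatives. Each one consumes the file, so f.seek(0) rewinds it before the next one.
f=open("yesterday.txt",'r',encoding="utf-8")

print(f.readlines()) # read the whole file into a list, one element per line; only suitable for small files


# print the whole document
f.seek(0)
for line in f.readlines():
    print(line)


# skip line 10 with enumerate() over readlines(); works, but loads the whole file into memory ====method 1==========
f.seek(0)
for index,line in enumerate(f.readlines()):
    if index == 9:
        continue
    print(line.strip()) # .strip() removes leading and trailing whitespace, including the newline


# list method, skip line 10 ===================method 2=======================
# Caution: list.index() returns the first matching line, so this misbehaves when the file contains duplicate lines (as this one does)
f.seek(0)
list_1=f.readlines()
for line in list_1:
    if list_1.index(line) == 9:
        continue
    print(line)


f.seek(0)
for i in range(5):    # print the first five lines
    print(f.readline()) # read one line at a time

# the best approach ==============method 3======================
f.seek(0)
count = 0
for line in f: # iterating over the file object reads one line at a time, so it stays memory-efficient even for large files
    count += 1
    if count == 10:
        continue
    print(line)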
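As a variation on method 3, enumerate() can be applied to the file object itself, which keeps the line-at-a-time behaviour and removes the manual counter. The sketch below assumes a temporary file with 12 numbered lines as a stand-in for yesterday.txt.

```python
import os
import tempfile

# Assumption: a generated 12-line file replaces yesterday.txt.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w", encoding="utf-8") as f:
    for n in range(1, 13):
        f.write(f"line {n}\n")

kept = []
with open(path, encoding="utf-8") as f:
    # enumerate() wraps the file iterator, so only one line is held at a time.
    for lineno, line in enumerate(f, start=1):
        if lineno == 10:
            continue  # skip the tenth line
        kept.append(line.rstrip("\n"))

print(len(kept))  # 11
print(kept[8:])   # ['line 9', 'line 11', 'line 12']
```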

file_op2.py

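f=open("yesterday.txt",'r',encoding='utf-8')

print(f.tell())   # tell() returns the current file position; 0 right after opening. It is a byte-based offset, not a character count
print(f.read(60))  # read(n) reads up to 60 characters starting at the current position
print(f.readline()) # read the rest of the current line
print(f.tell())  # result: 72
f.seek(0)           # move back to the start of the file
print(f.tell())  # result: 0

#f.detach()    # rarely used; it separates the underlying binary buffer from the text wrapper and returns it. It does not convert GBK to UTF-8, and f is unusable afterwards, so it is commented out here
print(f.encoding) # print the encoding used by this file object

print(f.errors)  # the error-handling policy for encoding/decoding errors, e.g. 'strict'

print(f.fileno()) # returns the OS-level file descriptor number; rarely needed
print(f.flush())  # flush() pushes buffered data out to the operating system; it returns None, so this prints None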
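The difference between characters and bytes in read(), tell(), and seek() shows up once the file contains non-ASCII text. In this sketch the file name and its one-line content are assumptions for the demonstration. In text mode the value returned by tell() is best treated as an opaque bookmark that is only passed back to seek(); binary mode gives plain byte offsets.

```python
import os
import tempfile

# Assumption: a one-line UTF-8 file with two Chinese characters in front.
path = os.path.join(tempfile.mkdtemp(), "pos.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("我爱Python\n")

with open(path, encoding="utf-8") as f:
    first_two = f.read(2)   # text mode: read(n) counts characters
    bookmark = f.tell()     # opaque position, valid as an argument to seek()
    rest = f.read()
    f.seek(bookmark)        # jump back to the bookmark
    rest_again = f.read()

with open(path, "rb") as fb:
    raw = fb.read(6)        # binary mode: read(n) counts bytes
    byte_pos = fb.tell()    # plain byte offset

print(len(first_two))      # 2
print(rest == rest_again)  # True
print(byte_pos)            # 6
```

Each of the two Chinese characters takes three bytes in UTF-8, which is why six bytes in binary mode cover the same text as two characters in text mode.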

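eg. Printing a progress bar

import sys,time
for i in range(50):
    sys.stdout.write('#')
    sys.stdout.flush()  # without flush() the output may be buffered and appear all at once after about five seconds; with it, each '#' shows up immediately
    time.sleep(0.1)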
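The same idea can be written with print(), whose flush=True argument does the job of sys.stdout.flush(). The helper name render_bar and the bar layout are inventions for this sketch; a carriage return redraws the bar on one line.

```python
import time

def render_bar(done, total, width=20):
    """Return a text bar such as '[#####---------------] 5/20'."""
    filled = width * done // total
    return "[" + "#" * filled + "-" * (width - filled) + f"] {done}/{total}"

for step in range(1, 11):
    # "\r" returns to the start of the line; flush=True forces immediate output.
    print("\r" + render_bar(step, 10), end="", flush=True)
    time.sleep(0.01)
print()  # finish the line
```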





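#====================== r+  =====================most commonly used====

f=open("yesterday2.txt",'r+',encoding='utf-8') # read and write
f.truncate(50)  # truncate the file to its first 50 bytes; the encoding is unchanged, but a multi-byte character may be cut in half
# open, read, then write
print(f.readline())
print(f.readline())
print(f.readline())
print(f.tell())
f.write("--------nice----") # for a small file this typically lands at the end, because reading has already buffered the whole file
f.close()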

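#====================== w+  =====================
f=open("yesterday2.txt",'w+',encoding='utf-8') # write and read: creates (or truncates) the file first, then writes to it
# rarely useful
print(f.readline())  # empty, because the file was just recreated
f.write("--------nice1----\n") # written as the first line
f.write("--------nice2----\n") # written as the next line
f.write("--------nice3----\n") # written as the next line
f.write("--------nice4----\n") # written as the next line

print(f.tell())   # result: 76 (on Windows, where each "\n" is stored as "\r\n"; 72 on Linux/macOS)
f.seek(0)         # move to the start of the file
print(f.tell())    # result: 0
print(f.readline()) # result: --------nice1----


f.write("should be at the row two.") # typically still lands at the end of the file; seek() followed by readline() does not place it on the second line
# in Python 2.x it would be written at the second line
f.close()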

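#====================== a+  =====================

f=open("yesterday2.txt",'a+',encoding='utf-8') # append and read
print(f.readline())  # empty, because the position starts at the end of the file

f.write("--------nice1----\n") # appended after the existing content
f.write("--------nice2----\n") # appended as the next line
f.write("--------nice3----\n") # appended as the next line
f.write("--------nice4----\n") # appended as the next line

print(f.tell())  # result: 177 (101 bytes left by the w+ example plus 76 new ones, counted on Windows)
f.seek(0)        # move to the start of the file
print(f.tell()) # result: 0
print(f.readline()) # result: --------nice1----


f.write("should be at the row two.") # still written at the end: in append mode every write goes to the end of the file regardless of seek()
# (reportedly Python 2.x could write at the current position here; this is platform-dependent)
f.close()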


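#==============wb============(binary mode)

f = open("yesterday2.txt",'wb')     # binary mode takes bytes, not str; network transfers in Python 3 also require bytes
f.write(("hello binary\n").encode()) # encode() turns the str into bytes (UTF-8 by default); 'wb' overwrites the previous content
f.close()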
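The mode behaviours exercised in file_op2.py can be checked side by side. This is a sketch under stated assumptions: a scratch file in a temporary directory replaces yesterday2.txt, and the three short lines are placeholder content. It also shows one way to make r+ write at the position just read to, by calling seek() with the value from tell() before writing.

```python
import os
import tempfile

# Assumption: a scratch file replaces yesterday2.txt.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# r+ never creates a file; opening a missing one fails.
try:
    open(path, "r+", encoding="utf-8")
    r_plus_created = True
except FileNotFoundError:
    r_plus_created = False

# w creates the file (or truncates an existing one).
with open(path, "w", encoding="utf-8") as f:
    f.write("aaaa\nbbbb\ncccc\n")

# r+ can overwrite in place: re-seek to the logical position after reading,
# then write. Existing characters are replaced; nothing is inserted.
with open(path, "r+", encoding="utf-8") as f:
    f.readline()
    f.seek(f.tell())
    f.write("XXXX")

# a+ ignores seek() for writes: data always goes to the end of the file.
with open(path, "a+", encoding="utf-8") as f:
    f.seek(0)
    f.write("tail\n")
    f.seek(0)
    content = f.read()

# w+ truncates on open, so an immediate read returns an empty string.
with open(path, "w+", encoding="utf-8") as f:
    after_truncate = f.read()

print(r_plus_created)        # False
print(content.splitlines())  # ['aaaa', 'XXXX', 'cccc', 'tail']
print(repr(after_truncate))  # ''
```

Because r+ overwrites rather than inserts, replacing text of a different length is usually done by writing a new file, as file_op3.py does.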

file_op3.py

f = open("yesterday2.txt",'r',encoding='utf-8')   # assumes yesterday2.txt holds the lyrics shown earlier
f_new = open("yesterday3.new",'w',encoding='utf-8')



#==================change part of a particular line============

for i in f:
    if "肆意的快乐" in i:  # locate the line to modify
        i=i.replace("我","旦总") # replace "我" with "旦总" in that line
    f_new.write(i)  # write every line of yesterday2.txt into yesterday3.new
f.close()
f_new.close()


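#===========sed===========replace function==========replace words in yesterday2.txt and write the result to yesterday3.new
# usage: python file_op3.py OLD NEW
import sys
f = open("yesterday2.txt",'r',encoding='utf-8')   # reopen both files; the handles used earlier are already closed
f_new = open("yesterday3.new",'w',encoding='utf-8')
for line in f:
    if sys.argv[1] in line:
        line = line.replace(sys.argv[1],sys.argv[2])  # replace argv[1] with argv[2]

    f_new.write(line)

f.close()
f_new.close()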

# copy the contents of yesterday3.new back into yesterday2.txt
f_2=open("yesterday3.new",'r',encoding='utf-8')
f_3=open("yesterday2.txt",'w',encoding='utf-8')
for line in f_2:
    f_3.write(line)
f_2.close()
f_3.close()

===============the example above is a simple implementation of sed-style replacement=========================
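The read, replace, and write-back steps above can be folded into one function. This is a sketch, not the tutorial's own code: the function name replace_in_file, the ".new" suffix for the scratch copy, and the sample text are assumptions, and os.replace() is swapped in for the manual copy-back loop.

```python
import os
import tempfile

def replace_in_file(path, old, new, encoding="utf-8"):
    """Replace old with new on every line of path; return the number of changed lines."""
    changed = 0
    tmp_path = path + ".new"  # hypothetical scratch name next to the original
    with open(path, "r", encoding=encoding) as src, \
            open(tmp_path, "w", encoding=encoding) as dst:
        for line in src:
            if old in line:
                line = line.replace(old, new)
                changed += 1
            dst.write(line)
    os.replace(tmp_path, path)  # move the new file over the original
    return changed

# Demonstration on a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "song.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("So many wild pleasures lay in store for me\nAnd so much pain\n")

changed = replace_in_file(path, "me", "you")
with open(path, encoding="utf-8") as f:
    updated = f.read()

print(changed)                  # 1
print(updated.splitlines()[0])  # So many wild pleasures lay in store for you
```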

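The with statement

To avoid forgetting to close a file after opening it, use a context manager:

with open('log','r') as f:

...

With this form, the file is closed and its resources are released automatically when the with block finishes.

Since Python 2.7, with also supports managing several files in one statement:

with open('log1') as obj1, open('log2') as obj2: pass

Note: PEP 8, the Python style guide, recommends limiting lines to 79 characters. A long with statement can be split by ending the line with ",\".
with open("yesterday2.txt",'r',encoding='utf-8') as f ,\
        open("yesterday3.new", 'r', encoding='utf-8') as f2:
    pass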
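That the file really is closed when the with block ends, even if the block raises, can be confirmed with a small sketch. The temporary file and the simulated RuntimeError are assumptions made for the demonstration.

```python
import os
import tempfile

# Assumption: a scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "log.txt")

try:
    with open(path, "w", encoding="utf-8") as f:
        f.write("partial")
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass

# The context manager closed the file on the way out, flushing what was written.
with open(path, encoding="utf-8") as check:
    saved = check.read()

print(f.closed)  # True
print(saved)     # partial
```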

3. Character encoding and transcoding

Further reading:

http://www.diveintopython3.net/strings.html

http://www.cnblogs.com/yuanchenqi/articles/5956943.html

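Things to know:

1. In Python 2 the default encoding is ASCII; in Python 3, str is Unicode and the default source encoding is UTF-8

2. Unicode has several encodings: UTF-32 (4 bytes per character), UTF-16 (2 or 4 bytes), and UTF-8 (1 to 4 bytes). Programs commonly hold text in memory as Unicode, while files are usually stored as UTF-8 because it saves space for ASCII-heavy text

3. In Python 3, encode() converts the encoding and also turns a str into bytes; decode() converts back and turns bytes into a str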
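The byte counts in point 2 and the type changes in point 3 can be checked directly. The sample characters are arbitrary; the "-le" codec variants are used so that no byte-order mark is added to the output.

```python
text = "中"  # a character in the Basic Multilingual Plane

utf8 = text.encode("utf-8")       # 3 bytes
utf16 = text.encode("utf-16-le")  # 2 bytes
utf32 = text.encode("utf-32-le")  # 4 bytes

# Characters outside the BMP need a surrogate pair in UTF-16: 4 bytes.
emoji_utf16 = "\U0001F600".encode("utf-16-le")

# ASCII characters take a single byte in UTF-8.
ascii_utf8 = "A".encode("utf-8")

print(len(utf8), len(utf16), len(utf32))  # 3 2 4
print(len(emoji_utf16), len(ascii_utf8))  # 4 1
print(type(utf8).__name__)                # bytes
print(type(utf8.decode("utf-8")).__name__)  # str
```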

The conversion diagram in the original post applies only to Python 2

python2

#-*-coding:utf-8-*-

__author__ = 'Kkk Li'

import sys

print(sys.getdefaultencoding()) # print the interpreter's default encoding; result: ascii

msg = "我爱北京天安门"

msg_gb2312 = msg.decode("utf-8").encode("gb2312") # decode to unicode first, then encode to gb2312

gb2312_to_gbk = msg_gb2312.decode("gb2312").encode("gbk")

print(msg)

print(msg_gb2312)

print(gb2312_to_gbk)

python3

#-*-coding:gb2312 -*- # optional; declares the encoding of this script file and must match how the file is actually saved

__author__ = 'Kkk Li'

import sys

print(sys.getdefaultencoding())

msg = "我爱北京天安门" # msg is a str, which is Unicode in Python 3

#msg_gb2312 = msg.decode("utf-8").encode("gb2312")

msg_gb2312 = msg.encode("gb2312") # str is already Unicode, so no decode() is needed first

gb2312_to_unicode = msg_gb2312.decode("gb2312") # decode() turns the bytes back into a str

gb2312_to_utf8 = msg_gb2312.decode("gb2312").encode("utf-8") # encode() produces bytes again, this time in UTF-8

print(msg)

print(msg_gb2312)

print(gb2312_to_unicode)

print(gb2312_to_utf8)
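The same decode-then-encode path applies when converting a whole file, and decoding with the wrong codec fails loudly. In this Python 3 sketch the file names gb.txt and utf8.txt are assumptions, and the sample string is the one used above.

```python
import os
import tempfile

msg = "我爱北京天安门"

# Decoding GBK bytes as UTF-8 raises UnicodeDecodeError for this text.
gbk_bytes = msg.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False

# Transcode a file: read it as gb2312 (bytes -> str), write it as utf-8 (str -> bytes).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "gb.txt")    # hypothetical input file
dst = os.path.join(workdir, "utf8.txt")  # hypothetical output file
with open(src, "w", encoding="gb2312") as f:
    f.write(msg + "\n")

with open(src, encoding="gb2312") as fin, \
        open(dst, "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line)

with open(dst, "rb") as f:
    raw = f.read()

print(wrong_codec_ok)                       # False
print(raw.decode("utf-8").strip() == msg)   # True
```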

Posted on 2018-01-17 11:36 by 小笨1987