A short Python summary before the 文计 computer-based exam

1. Fetching a web page

import urllib.request as req

# Fetch a web page with urllib
def getPage(url):
    page = req.urlopen(url)
    # The page is assumed to be GBK-encoded; undecodable bytes are dropped
    html = page.read().decode("gbk", errors="ignore")
    return html

import requests
from bs4 import BeautifulSoup

# Fetch a web page with requests
def getHtml(url):
    r = requests.get(url)
    r.encoding = "utf-8"  # decode the response body as UTF-8
    return r.text
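The decoding step in getPage can be tried without any network access. The sketch below is only an illustration: the sample bytes, decode_page and is_strictly_decodable are made up here and are not part of the notes above. It shows what errors="ignore" does when a page contains a byte that is not valid GBK.

```python
# Bytes as a GBK-encoded page might deliver them, with one invalid byte (0xff)
# spliced in to imitate a corrupted character. The sample text is made up.
raw = "红楼".encode("gbk") + b"\xff" + "梦".encode("gbk")


def decode_page(data, encoding="gbk"):
    """Decode page bytes, silently dropping anything that cannot be decoded."""
    return data.decode(encoding, errors="ignore")


def is_strictly_decodable(data, encoding):
    """Return True if data decodes under encoding without any error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


text = decode_page(raw)
print(text)
print(is_strictly_decodable(raw, "gbk"))
print(is_strictly_decodable("红楼梦".encode("gbk"), "gbk"))
```

Dropping bad bytes keeps the script running, but it also hides a wrong encoding guess, so if the output looks garbled it is worth checking the page's declared charset first.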

2. Parsing with Beautiful Soup

import re

def getlinks():  # get the links of the top-ten topics
    url = "https://bbs.pku.edu.cn/v2/home.php"
    html = getHtml(url)
    soup = BeautifulSoup(html, 'html.parser')
    bigten = soup.find('section', attrs={'class': "topic-block big-ten"})  # find the top-ten section
    links = bigten.find_all('a', attrs={'href': re.compile(r'post-read\.php\?bid=\d{2,4}')})  # all top-ten links
    # print(links[2]['href'])
    # exit()
    return links

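The returned links is a list of every tag that matches the condition (not a true two-dimensional array). The first index picks a tag; putting an attribute name such as 'href' in a second pair of brackets returns that attribute of the tag.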
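To see the indexing in action without fetching the real page, here is a minimal sketch that runs the same kind of query on a small HTML string. The sample markup is invented for illustration and only imitates the structure assumed by getlinks; the real page may differ.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page.
SAMPLE_HTML = """
<section class="topic-block big-ten">
  <a href="post-read.php?bid=12">First topic</a>
  <a href="post-read.php?bid=345">Second topic</a>
  <a href="about.php">About</a>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
section = soup.find("section", attrs={"class": "topic-block big-ten"})
# Keep only the <a> tags whose href matches the pattern.
links = section.find_all("a", attrs={"href": re.compile(r"post-read\.php\?bid=\d{2,4}")})

hrefs = [link["href"] for link in links]      # attribute access by name
titles = [link.get_text() for link in links]  # text inside each tag
print(hrefs)
print(titles)
```

Each element of links is a tag object, so links[0]['href'] reads one attribute and links[0].get_text() reads the visible text.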

2. lambda expressions

friend.sort(key=lambda item: item[1], reverse=True)  # sort by count (the element at index 1)

item is each element being sorted, item[i] is the value to sort by, and reverse=True sorts from largest to smallest.
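A small worked example of the sort above, using made-up (name, count) pairs. The data and the by_name variable are assumptions for illustration only.

```python
# Hypothetical (name, count) pairs, made up for illustration.
friend = [("alice", 3), ("bob", 7), ("carol", 5)]

# Sort in place by the count at index 1, largest first.
friend.sort(key=lambda item: item[1], reverse=True)
print(friend)

# sorted() returns a new list instead; here by name (index 0), ascending.
by_name = sorted(friend, key=lambda item: item[0])
print(by_name)
```

list.sort changes the list itself and returns None, while sorted leaves the original untouched, which is easy to mix up under exam pressure.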

3. Reading a file into a string

f = open(fileName, "r", encoding="utf-8")
content = "".join(f.readlines())  # concatenate all lines into one string
f.close()

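Note the use of "".join: the string between the quotes is the separator placed between the joined items. Here it is empty, so the lines are concatenated unchanged.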
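The sketch below shows the same read with a with statement, plus join used with a non-empty separator. The temporary file and its contents are invented so that the example can run on its own.

```python
import os
import tempfile

# Write a small UTF-8 file to read back (the path is temporary).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    fileName = tmp.name

# The with statement closes the file even if reading fails.
with open(fileName, "r", encoding="utf-8") as f:
    lines = f.readlines()

os.remove(fileName)

content = "".join(lines)                            # empty separator
piped = " | ".join(line.strip() for line in lines)  # visible separator
print(repr(content))
print(piped)
```

When only the full text is needed, f.read() gives the same string as "".join(f.readlines()) in one call.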

4. Regular expressions

import re

patstr = r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+'  # raw string so \. reaches the regex engine unchanged
pat = re.compile(patstr)
email = pat.findall(content)

The resulting email is a list of the matched strings.
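Here is the same pattern applied to a made-up sentence (the addresses are placeholders, not from the original notes). It also shows a limitation worth knowing: the pattern allows only one dot after the @, so a longer domain is cut short.

```python
import re

# Made-up text with two addresses for illustration.
content = "Write to alice@example.com or bob42@mail.example.org for details."

pat = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+')
email = pat.findall(content)
print(email)

# With no match, findall returns an empty list rather than None.
nothing = pat.findall("no addresses here")
print(nothing)
```

The second address comes back as bob42@mail.example because the pattern stops after one dot-separated part; allowing dots inside the domain part would need a wider character class.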

5. Converting to a list

friend = list(relation[people].items())

Note the .items() call: it yields the dict's (key, value) pairs, which list() turns into a list of tuples.
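A short sketch of what that line produces, assuming relation is a nested dict that maps each person to a dict of counts. The names and numbers are invented for illustration.

```python
# Hypothetical nested dict: person -> {friend name: count}.
relation = {"alice": {"bob": 2, "carol": 5, "dave": 1}}
people = "alice"

# items() yields (key, value) pairs; list() collects them as tuples.
friend = list(relation[people].items())
print(friend)

# Each tuple unpacks directly in a loop.
for name, count in friend:
    print(name, count)
```

Because the result is a plain list of tuples, it can be sorted or indexed, which a dict view does not allow.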

6. Using regular expressions to reduce the fetched HTML to plain text

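import re

def getText(html):
    html = re.compile(r"<title>.*?</title>").sub("", html)
    html = re.compile(r"<.*?>").sub("", html)  # strip tags
    html = re.compile(r"&.*?;").sub("", html)  # strip entities such as &nbsp;
    # Remove site boilerplate phrases; alternation is used because a [...] class
    # would delete every listed character individually
    html = re.compile(r"纯文学网站首页|《红楼梦》目录|purepen\.com").sub("", html)
    html = re.compile(r"上一回").sub("", html)  # "previous chapter" link text
    html = re.compile(r"下一回").sub("", html)  # "next chapter" link text
    html = re.compile(r"(\s*?\n)+").sub("\n", html)  # collapse runs of blank lines into one newline
    return html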
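One thing to watch when removing fixed phrases with a regex: square brackets define a character class, which matches any single listed character, not the phrase as a whole. The sketch below contrasts the two on a made-up line of text; the variable names are illustrative only.

```python
import re

# Made-up sample line with navigation text around a chapter title.
text = "上一回 第一回 甄士隐梦幻识通灵 下一回"

# A character class removes every single 上, 一 and 回 wherever it occurs.
with_class = re.sub(r"[上一回]", "", text)

# Alternation removes only the complete phrases.
with_alternation = re.sub(r"上一回|下一回", "", text).strip()

print(repr(with_class))
print(repr(with_alternation))
```

With the character class the chapter title itself loses characters, while alternation leaves it intact.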

Posted 2018-12-27 19:33 by timeaftertime
