20191118 Sun Yuan (孙源): "Python Programming" Lab 4 Report
Lab Report
Course: Python Programming
Lab title: Lab 4
Lab date: June 1, 2020
Student ID: 20191108
Name: Sun Yuan (孙源)
Instructor: Wang Zhiqiang (王志强)
Grade: Comments:
## Lab objectives and requirements
Use a Python crawler to scrape web page content.
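As a rough illustration of what "scraping page content" involves, the sketch below pulls link targets out of an HTML fragment. It is only a sketch under stated assumptions: `SAMPLE_HTML`, `LinkCollector`, and `extract_links` are hypothetical names, the example.com URLs are made up, and the standard library's `html.parser` stands in for a third-party parser so the snippet runs without extra packages. Only the class name `result-right` comes from the report's own code.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real crawler would download this over HTTP.
SAMPLE_HTML = """
<table>
  <tr><td><a class="result-right" href="http://example.com/solution/1/">Accepted</a></td></tr>
  <tr><td><a class="result-wrong" href="http://example.com/solution/2/">Wrong Answer</a></td></tr>
  <tr><td><a class="result-right" href="http://example.com/solution/3/">Accepted</a></td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # An attribute without a value is reported as None, hence the "or".
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_links(html, wanted_class):
    """Return the hrefs of all <a> tags in `html` that have `wanted_class`."""
    collector = LinkCollector(wanted_class)
    collector.feed(html)
    return collector.links


print(extract_links(SAMPLE_HTML, "result-right"))
```

A crawler typically repeats this step for every page it visits: fetch the HTML, pick out the links of interest, then fetch those in turn.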
## Lab design and implementation
import os
import re

import bs4
import requests

# Map the CSS class of the <pre> element holding the source code to a file extension.
code_class = {"sh_cpp": ".cpp",
              "sh_c": ".c", "": ".py",
              "sh_pascal": ".pas",
              "sh_java": ".java"}
# Login form fields; email and password are filled in at runtime.
mylog = {"redirectUrl": "http://openjudge.cn/",
         "password": "",
         "email": ""}
session_requests = requests.session()
own_url = "http://openjudge.cn/"


# Crawl the accepted submissions listed on the personal home page at `url`;
# if there is a next page, keep crawling.
def download_url(url):
    ans = session_requests.get(url)
    s = str(ans.content, encoding="utf-8")
    soup = bs4.BeautifulSoup(s, "html.parser")
    # blocks holds the links to the accepted submissions on this page.
    blocks = soup.find_all("a", class_="result-right")
    for i in blocks:
        solution_url = i["href"]
        solution = session_requests.get(solution_url)
        ss = str(solution.content, encoding="utf-8")
        s_soup = bs4.BeautifulSoup(ss, "html.parser")
        # Determine the language of this submission.
        for class_name in code_class:
            block = s_soup.find("pre", class_=class_name)
            if block is None:
                continue
            try:
                name = s_soup.find_all("h3")[1]
                name = name.text[:-5]
            except IndexError:
                # Without a title there is no file name to save under.
                print("Get name wrong!")
                continue
            # Drop the problem number in front of the first ':'.
            index = name.find(':')
            if index != -1:
                name = name[index + 1:]
            # Replace characters that are illegal in file names and trim surrounding spaces.
            name = re.sub(r"[\\/:*?#\"<>|]", " ", name).strip()
            path = "C:/tmp/" + name + code_class[class_name]
            if os.path.exists(path):
                # A file with the same name already exists.
                print(name + " has already been downloaded")
                continue
            # No file with this name exists yet.
            print("downloading your correct code " + name)
            try:
                with open(path, 'w', encoding="utf-8") as f:
                    f.write(block.text)
            except Exception as e:
                print(name + " can't be downloaded correctly")
                print(e)
    # next_link holds the relative path of the next page.
    next_link = soup.find("a", class_="nextprev", rel="next")
    if next_link is not None:
        download_url(own_url + next_link["href"])


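def spider():
    global own_url
    mylog["email"] = input("Enter the email account you use to log in to openjudge:\n")
    mylog["password"] = input("Enter your password:\n")
    login_url = "http://openjudge.cn/api/auth/login/"
    result = session_requests.post(  # send the login POST request to the server
        login_url,
        data=mylog,
        headers=dict(referer=login_url),
    )
    result = session_requests.get("http://openjudge.cn/")
    # Use a regular expression to find the URL of the personal home page.
    # "个人首页" ("personal home page") is the link text on the site itself, so it stays untranslated.
    pt = r"<a href=\"(http://[^\"]*)\">个人首页</a>"
    try:
        own_url = re.search(pt, result.text).group(1)
        print("This is your home page: " + own_url)
    except AttributeError:
        # re.search returned None: the link is missing, so the login failed.
        print("The account does not exist or the password is wrong! Please try again!")
        spider()
        return
    download_url(own_url)
    print("All of your accepted programs have been downloaded to the c:\\tmp folder!")


spider()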
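The file-naming step in `download_url` can be tried in isolation. The following is a minimal sketch, assuming the same character set as the report's code; the helper name `title_to_filename` and the sample titles are hypothetical and only serve to show the effect of stripping the problem number and replacing characters that Windows does not allow in file names.

```python
import re

# Characters replaced in file names; this set mirrors the one used in the report's code.
ILLEGAL_CHARS = re.compile(r'[\\/:*?#"<>|]')


def title_to_filename(title, extension):
    """Turn a problem title such as '1001:A+B Problem' into a safe file name."""
    # Drop the problem number in front of the first ':' if there is one.
    index = title.find(":")
    if index != -1:
        title = title[index + 1:]
    # Replace illegal characters with spaces, then trim both ends.
    return ILLEGAL_CHARS.sub(" ", title).strip() + extension


print(title_to_filename("1001:A+B Problem", ".py"))   # A+B Problem.py
print(title_to_filename(" What is 1/2? ", ".cpp"))    # What is 1 2.cpp
```

Keeping this logic in one small function makes it easy to check the naming rules without logging in or touching the network.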
## Course reflections
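At the start of the course I knew little about programming and did not review properly, so I was often lost during the lectures. Only after gradually working through the videos on 云班课 (Yunbanke) and reading tutorials online was I able to keep up with the teacher's pace. I chose this course hoping to learn a new technology, and learning Python has indeed been very rewarding. It has also helped a great deal with learning C.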
## References:
- ["Python Crawler Example"](https://www.jianshu.com/p/757d8981fdda)
- ["Python Network Programming"](https://www.runoob.com/python/python-socket.html)
## Gitee (码云) repository link:
[Lab 4](https://gitee.com/sunyuan1118/python-test-2020)