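Parsing a POJ page to extract a problem
The page looks like this: http://poj.org/problem?id=3334
From a page like this we want to extract the problem title, time limit, memory limit, problem description, input, output, sample input, sample output, hint, and source, and also collect the images that belong to the problem statement.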
#coding=utf-8
# Written for Python 2 with BeautifulSoup 3.
from BeautifulSoup import BeautifulSoup
import urllib
import re

def getpojhtml(pid):
    url = "http://poj.org/problem?id=" + str(pid)
    # Read the page into a string: it is parsed by BeautifulSoup
    # and also searched with a regular expression further down.
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Skip the leading prefix of the page title, keeping the problem name.
    title = soup.title.string[7:]
    # For each label, take the first matching text node and read what follows it.
    time_limit = soup.findAll(text=re.compile("Time Limit"))[0].next
    mem_limit = soup.findAll(text=re.compile("Memory Limit"))[0].next
    description = soup.findAll(text=re.compile("Description"))[0].next.contents
    input_desc = soup.findAll(text=re.compile("Input"))[0].next.contents
    output_desc = soup.findAll(text=re.compile("Output"))[0].next.contents
    sim_input = soup.findAll(text=re.compile("Sample Input"))[0].next.contents
    sim_output = soup.findAll(text=re.compile("Sample Output"))[0].next.contents
    # Hint and Source are missing on some pages, so fall back to an empty list.
    try:
        hint = soup.findAll(text=re.compile("Hint"))[0].next.contents
    except (IndexError, AttributeError):
        hint = []
    try:
        source = soup.findAll(text=re.compile("Source"))[0].next.contents
    except (IndexError, AttributeError):
        source = []
    # Problem images live under images/ and start with a four-digit number.
    pattern = re.compile(r'images/\d{4}[.\w]*')
    pic_url = ['http://poj.org/' + str(item) for item in pattern.findall(html)]
    return (title, time_limit, mem_limit, description, input_desc, output_desc,
            sim_input, sim_output, hint, source, pic_url)

if __name__ == '__main__':
    ret = getpojhtml(3344)
    for item in ret:
        print item
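Implementation approach
First, the whole page is fetched with the urllib module and then parsed with BeautifulSoup. Some pages have no Hint or Source section, so those two lookups are wrapped in try blocks to keep the script from exiting with an error.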
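The fallback for optional sections can also be written without exceptions. The following is only a minimal sketch, not the code used in this post: it assumes Python 3 and the standard library's html.parser instead of BeautifulSoup, and the names SectionParser and extract_sections as well as the sample markup are hypothetical. It treats any text node that equals a known heading as the start of a section, so a missing Hint simply comes back as an empty string.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect the text that follows each known heading (hypothetical helper)."""

    def __init__(self, headings):
        super().__init__()
        self.headings = set(headings)
        self.sections = {}   # heading -> list of text fragments after it
        self.current = None  # heading whose body is currently being collected

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if text in self.headings:
            # An exact match starts a new section, so "Input" does not
            # swallow "Sample Input".
            self.current = text
            self.sections[text] = []
        elif self.current is not None:
            self.sections[self.current].append(text)


def extract_sections(html, headings):
    """Return {heading: text}; headings absent from the page map to ''."""
    parser = SectionParser(headings)
    parser.feed(html)
    parser.close()
    return {h: " ".join(parser.sections.get(h, [])) for h in headings}


# Made-up markup for illustration only; note that it has no Hint section.
sample = (
    '<p class="pst">Description</p><div class="ptx">Betty dances.</div>'
    '<p class="pst">Source</p><a href="#">Tehran 2006</a>'
)
sections = extract_sections(sample, ["Description", "Hint", "Source"])
print(sections)
```

Matching headings by exact text is a simplification: body text that happens to equal a heading would start a new section, so a real page may need a stricter check on the surrounding tag.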
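The images could also be extracted with BeautifulSoup, but I chose a regular expression instead. The pattern only matches paths that start with images/ followed by four digits, so it pinpoints the images in the problem statement, whereas BeautifulSoup returns every image on the page, some of which I don't need.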
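To illustrate how the regular expression skips unrelated images, here is a small sketch run against a made-up HTML fragment. It assumes Python 3; the pattern and the base URL are taken from the script above, while find_problem_images and the sample string are hypothetical names introduced for the example.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://poj.org/"
# Same pattern as in the script: "images/" followed by four digits,
# then any run of word characters or dots (e.g. "_1.GIF").
IMAGE_PATTERN = re.compile(r"images/\d{4}[.\w]*")


def find_problem_images(html):
    """Return absolute URLs for image paths that look like problem images."""
    return [urljoin(BASE_URL, path) for path in IMAGE_PATTERN.findall(html)]


# Made-up fragment: one site image and one problem image.
sample = '<img src="images/logo.gif"><center><img src="images/3344_1.GIF" /></center>'
print(find_problem_images(sample))
```

Only the second image is returned, because the path of the first one does not begin with four digits after images/.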
Output
2000MS
65536K
[<div><p>Another boring Friday afternoon, Betty the Beetle thinks how to amuse herself. She goes out of her hiding place to take a walk around the living room in Bennett's house. Mr. and Mrs. Bennett are out to the theatre and there is a chessboard on the table! "The best time to practice my chessboard dance," Betty thinks! She gets so excited that she does not note that there are some pieces left on the board and starts the practice session! She has a script showing her how to move on the chessboard. The script is a sequence like the following example:</p><p><center><img src="images/3344_1.GIF" /></center></p><p>At each instant of time Betty, stands on a square of the chessboard, facing one of the four directions (up, down, left, right) when the board is viewed from the above. Performing a "move <i>n</i>" instruction, she moves <i>n</i> squares forward in her current direction. If moving <i>n</i> squares goes outside the board, she stays at the last square on the board and does not go out. There are three types of turns: turn right, turn left, and turn back, which change the direction of Betty. Note that turning does not change the position of Betty.</p><p>If Betty faces a chess piece when moving, she pushes that piece, together with all other pieces behind (a tough beetle she is!). This may cause some pieces fall of the edge of the chessboard, but she doesn't care! For example, in the following figure, the left board shows the initial state and the right board shows the state after performing the script in the above example. Upper-case and lower-case letters indicate the white and black pieces respectively. The arrow shows the position of Betty along with her direction. 
Note that during the first move, the black king (r) falls off the right edge of the board!</p><p><center><img src="images/3344_2.GIF" /></center></p><p>You are to write a program that reads the initial state of the board as well as the practice dance script, and writes the final state of the board after the practice.</p></div>]
[<div><p>There are multiple test cases in the input. Each test case has two parts: the initial state of the board and the script. The board comes in eight lines of eight characters. The letters r, d, t, a, c, p indicate black pieces, R, D, T, A, C, P indicate the white pieces and the period (dot) character indicates an empty square. The square from which Betty starts dancing is specified by one of the four characters <, >, ^, and v which also indicates her initial direction (left, right, up, and down respectively). Note that the input is not necessarily a valid chess game status.</p><p>The script comes immediately after the board. It consists of several lines (between 0 and 1000). In each line, there is one instruction in one of the following formats (<i>n</i> is a non-negative integer number):</p><p>move <i>n</i><br />turn left<br />turn right<br />turn back</p><p>At the end of each test case, there is a line containing a single # character. The last line of the input contains two dash characters.</p></div>]
[<p>The output for each test case should show the state of the board in the same format as the input. Write an empty line in the output after each board.</p>]
[u'.....c..\r\n.p..A..t\r\nD..>T.Pr\r\n....aP.P\r\np.d.C...\r\n.....p.R\r\n........\r\n........\r\nmove 2\r\nturn right\r\nmove 3\r\nturn left\r\nturn left\r\nmove 1\r\n#\r\n--\r\n']
[u'.....c..\r\n.p..A..t\r\nD.....TP\r\n....a..P\r\np.d.C^..\r\n.......R\r\n.....P..\r\n.....p..\r\n']
[]
[<a href="searchproblem?field=source&key=Tehran+2006">Tehran 2006</a>]
['http://poj.org/images/3344_1.GIF', 'http://poj.org/images/3344_2.GIF']
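The blogger ma6174 holds the copyright to the articles on this blog (except reposts); they may not be used for commercial purposes without permission. When reposting, please cite the source: http://www.cnblogs.com/ma6174/
If you have any thoughts or suggestions about this article, leave a comment or send an email to ma6174@163.com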