hdu4872 Beautiful Soup (simulation)
Beautiful Soup
Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)
Total Submission(s): 1912 Accepted Submission(s): 391
Problem Description
Coach Pang has a lot of hobbies. One of them is playing with “tag soup” with the help of Beautiful Soup. Coach Pang is satisfied with Beautiful Soup in every respect, except the prettify() method, which attempts to turn a soup into a nicely formatted string. He decides to rewrite the method to prettify an HTML document according to his personal preference. But Coach Pang is always very busy, so he gives this task to you. Considering that you do not know anything about “tag soup” or Beautiful Soup, Coach Pang kindly left some information with you:
In Web development, “tag soup” refers to formatted markup written for a web page that is very much like HTML but does not consist of correct HTML syntax and document structure. In short, “tag soup” refers to messy HTML code.
Beautiful Soup is a library for parsing HTML documents (including “tag soup”). It parses “tag soup” into regular HTML documents, and creates parse trees for the parsed pages.
The parsed HTML documents obey the rules below.
HTML
HTML stands for HyperText Markup Language.
HTML is a markup language.
A markup language is a set of markup tags.
The tags describe document content.
HTML documents consist of tags and texts.
Tags
HTML uses tags for its syntax.
A tag is composed of special characters: ‘<’, ‘>’ and ‘/’.
Tags usually come in pairs, the opening tag and the closing tag.
The opening tag starts with “<” and the tagname. It usually ends with a “>”.
The closing tag starts with “</” and the same tagname as the corresponding opening tag. It ends with a “>”.
There will not be any other angle brackets in the documents.
Tagnames are strings containing only lowercase letters.
Tags will contain no line break (‘\n’).
Apart from tags, anything that occurs in the document is considered text content.
Elements
An element is everything from an opening tag to the matching closing tag (including the two tags).
The element content is everything between the opening and the closing tag.
Some elements may have no content. They’re called empty elements, like <hr></hr>.
Empty elements can be closed in the opening tag, ending with a “/>” instead of “>”.
All elements are closed either with a closing tag or in the opening tag.
Elements can have attributes.
Elements can be nested (can contain other elements).
The <html> element is the container for all other elements, it will not have any attributes.
Attributes
Attributes provide additional information about an element.
Attributes are always specified in the opening tag after the tagname.
The tag name and attributes are separated by a single space.
An element may have several attributes.
Attributes come in name="value" pairs like class="icpc".
There will not be any space around the '='.
All attribute names are in lowercase.
A Simple Example <a href="http://icpc.baylor.edu/">ACM-ICPC</a>
The <a> element defines an HTML link with the <a> tag.
The link address is specified in the href attribute.
The content of the element is the text “ACM-ICPC”
You are feeling dizzy after reading all these, when Coach Pang shows up again. He starts to spout for hours about his personal preference and you catch his main points with difficulty. Coach Pang says:
Your task is to write a program that will turn parsed HTML documents into formatted parse trees. You should print each tag or text content on its own line, preceded by a number of spaces that indicates its depth in the parse tree. The depth of the root of the parse tree (the <html> tag) is 0. He is satisfied with the tags, so you shouldn’t change anything in any tag. For text content, throw away unnecessary white space, including space (ASCII code 32), tab (ASCII code 9) and newline (ASCII code 10), so that words (sequences of characters without white space) are separated by a single space. There should not be any trailing space at the end of a line, nor any blank line in the output. A line that contains only white space is also considered a blank line. You quickly realize that your only job is to deal with the white space.
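The whitespace rule for text content can be sketched as a small helper. This is only an illustration: `normalizeText` is a hypothetical name that does not appear in either solution further down, and it relies on `operator>>`, which also skips `\r`, `\v` and `\f` in addition to the three characters the statement names.

```cpp
#include <sstream>
#include <string>

// Collapse every run of whitespace in `raw` into a single space and drop
// leading and trailing whitespace. Assumption: treating '\r', '\v' and '\f'
// as whitespace too is acceptable (the statement only names 32, 9 and 10).
std::string normalizeText(const std::string& raw) {
    std::istringstream in(raw);
    std::string word, out;
    while (in >> word) {              // operator>> skips whitespace between words
        if (!out.empty()) out += ' '; // exactly one space between words
        out += word;
    }
    return out;                       // empty if `raw` held only whitespace
}
```

A text run that normalizes to the empty string should produce no output line at all, since blank lines are forbidden.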
Input
The first line of the input is an integer T representing the number of test cases.
Each test case is a valid HTML document that starts with an <html> tag and ends with an </html> tag. See the sample below for clarification of the input format.
The size of the input file will not exceed 20KB.
Output
For each test case, first output a line “Case #x:”, where x is the case number (starting from 1).
Then you should write to the output the formatted parse trees as described above. See sample below for clarification of the output format.
Sample Input
2
<html><body>
<h1>ACM
ICPC</h1>
<p>Hello<br/>World</p>
</body></html>
<html><body><p>
Asia Chengdu Regional</p>
<p class="icpc">
ACM-ICPC</p></body></html>
Sample Output
[pre]Case #1:
<html>
 <body>
  <h1>
   ACM ICPC
  </h1>
  <p>
   Hello
   <br/>
   World
  </p>
 </body>
</html>
Case #2:
<html>
 <body>
  <p>
   Asia Chengdu Regional
  </p>
  <p class="icpc">
   ACM-ICPC
  </p>
 </body>
</html>
[/pre]
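I debugged this for a whole afternoon and still don't know what is wrong with this code. One fairly obvious difference from the AC code is that I pack each run of plain text into a single string that contains spaces. Could that go wrong in a different environment? After analyzing the AC code, I rewrote my solution along the same lines and it passed right away. The AC code splits text containing whitespace into separate words while reading and only handles the formatting at output time, whereas my WA code does the processing while reading. That is the only difference in approach. I checked every detail and every possible failure point I could think of, and tested as much data as I could find. I have done my best.
Below is the code that gets WA for reasons I don't know.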
#include<iostream>
#include<cstdio>
#include<cstdlib>
#include<cstring>
#include<string>
#include<algorithm>
#define REP(i,a,b) for(int i=a;i<=b;i++)
#define MS0(a) memset(a,0,sizeof(a))
using namespace std;
typedef long long ll;
const int maxn=1000100;
const int INF=1e9+10;

string s[20010]; /// tokens of the current document, 1-based
int n;
char pc;bool pcv; /// pc: one character of lookahead, pcv: whether pc is valid

/// read one tag, starting at '<' and ending at '>'
void get1(char c)
{
    string t="";
    while(c!='>'){
        t+=c;
        c=getchar();
    }
    t+=c;
    s[++n]=t;
    pc=getchar();
    pcv=1;
}

/// read one run of text up to the next '<', collapsing whitespace into single spaces
void get2(char c)
{
    string t="";
    while(c!='<'){
        t+=c;
        c=getchar();
        if(c==' '||c==9||c=='\n') c=' ',t+=c;
        while(c==' '||c=='\n'||c==9) c=getchar();
    }
    int len=t.size();
    /// overwrite trailing spaces with '\0'; note that t.size() does not change
    while(t[len-1]==' ') t[len-1]='\0',len--;
    s[++n]=t;
    pc=c;
    pcv=1;
}

/// read the next token; returns 1 once </html> has been read
bool Get(char c)
{
    pcv=0;
    while(c==' '||c=='\n'||c==9) c=getchar();
    if(c=='<') get1(c);
    else get2(c);
    if(s[n]=="</html>") return 1;
    return 0;
}

void input()
{
    n=0;
    if(pcv==0) pc=getchar();
    while(!Get(pc));
}

void solve()
{
    //REP(i,1,n) cout<<s[i]<<endl;
    int tag=0; /// current depth
    REP(i,1,n){
        if(s[i][0]=='<'&&s[i][1]=='/') tag--;
        REP(j,1,tag) printf(" ");
        cout<<s[i]<<'\n';
        int len=s[i].size();
        if(s[i][0]=='<'&&s[i][1]!='/'&&(len<2||(len>=2&&s[i][len-2]!='/'))) tag++;
    }
}

int main()
{
    freopen("in.txt","r",stdin); /// local testing only
    int T;cin>>T;int casen=1;
    pc='#';
    pcv=0;
    while(T--){
        printf("Case #%d:\n",casen++);
        input();
        solve();
    }
    return 0;
}
/**
2
<html> <body> <h1> ACM ICPC </h1> <p> Hello <br/> World </p> </body></html>
<html> <body> <p> Asia Chengdu Regional</p> <p class = "icpc"> ACM-ICPC </p> </body></html>
*/
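One thing worth checking in the WA code above is the loop at the end of get2 that strips trailing spaces with `t[len-1]='\0'`. This is a possible cause, not something verified against the judge. Writing `'\0'` into a `std::string` does not shorten it, so `cout<<s[i]` still writes the NUL byte. A terminal usually does not show that byte, but a byte-by-byte comparison would. It happens whenever a text run is followed by whitespace before the next `<`, for example `World </p>`. The sketch below isolates that behaviour; `trimByOverwrite` and `trimByErase` are hypothetical names introduced only for this comparison.

```cpp
#include <string>

// Mirrors the trimming loop in get2: overwrite trailing spaces with '\0'.
// The string keeps its original size, so the NUL bytes remain part of it.
std::string trimByOverwrite(std::string t) {
    int len = static_cast<int>(t.size());
    while (len > 0 && t[len - 1] == ' ') t[len - 1] = '\0', len--;
    return t;
}

// Trims by actually removing the trailing spaces from the string.
std::string trimByErase(std::string t) {
    while (!t.empty() && t.back() == ' ') t.pop_back();
    return t;
}
```

With the input `"Hello "`, the first version returns a 6-character string whose last character is NUL, while the second returns the 5-character `"Hello"`.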
Below is the AC code, rewritten following someone else's approach.
#include<iostream>
#include<cstdio>
#include<cstdlib>
#include<cstring>
#include<string>
#include<algorithm>
#define REP(i,a,b) for(int i=a;i<=b;i++)
#define MS0(a) memset(a,0,sizeof(a))
using namespace std;
typedef long long ll;
const int maxn=1000100;
const int INF=1e9+10;

string s[20010]; /// tokens of the current document, 1-based
int n;

/// pc: one character of lookahead carried between test cases, pvc: whether pc is valid
void input(char &pc,bool pvc)
{
    n=0;
    char c=pc;
    if(pvc==0) c=getchar();
    while(1){
        string t="";
        while(c=='\n'||c==' '||c==9) c=getchar();
        if(c=='<'){ /// read a tag
            while(c!='>') t+=c,c=getchar();
            t+=c;c=getchar();
        }
        else{ /// read a single word of plain text
            while(c!=' '&&c!='\n'&&c!=9&&c!='<') t+=c,c=getchar();
        }
        s[++n]=t;
        pc=c;
        if(t=="</html>") return;
    }
}

void solve()
{
    //REP(i,1,n) cout<<s[i]<<endl;
    int tag=0; /// current depth
    int last=0,first=1;
    REP(i,1,n){
        string t=s[i];
        int len=t.size();
        if(t[0]=='<'){ /// a tag
            if(t[1]=='/'){ /// a closing tag
                tag--;
            }
            REP(j,1,tag) cout<<" ";
            cout<<t<<endl;
            if(t[1]!='/'&&t[len-2]!='/'){ /// an opening tag
                tag++;
            }
            first=1;
        }
        else{ /// a word of plain text
            if(s[i+1][0]=='<') last=1;
            else last=0;
            if(first) REP(j,1,tag) cout<<" ";
            if(first) first=0;
            cout<<t;
            if(last){ /// the last word of this text run
                cout<<endl;
            }
            else cout<<" ";
        }
    }
}

int main()
{
    freopen("in.txt","r",stdin); /// local testing only
    int T;cin>>T;int casen=1;
    char pc='#';bool pvc=0;
    while(T--){
        printf("Case #%d:\n",casen++);
        input(pc,pvc);
        solve();
        pvc=1;
    }
    return 0;
}
/**
2
<html> <body> <h1> ACM ICPC </h1> <p> Hello <br/> World </p> </body></html>
<html> <body> <p> Asia Chengdu Regional</p> <p class = "icpc"> ACM-ICPC </p> </body></html>
*/
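The same idea as the AC code above (split into tags and single words while reading, format only when printing) can also be written as a pure function from string to string. That makes it possible to check the output byte by byte without going through stdin. This is only a sketch: `tokenize` and `prettify` are hypothetical names, it assumes one space of indentation per depth level, and it assumes a well-formed document as the statement guarantees.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one document into tokens: complete tags and single words of text.
std::vector<std::string> tokenize(const std::string& doc) {
    auto isSpace = [](char c) { return c == ' ' || c == '\t' || c == '\n'; };
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < doc.size()) {
        if (isSpace(doc[i])) { ++i; continue; }
        std::size_t j = i;
        if (doc[i] == '<') {                      // a tag: read through '>'
            while (j < doc.size() && doc[j] != '>') ++j;
            if (j < doc.size()) ++j;
        } else {                                  // a word: stop at space or '<'
            while (j < doc.size() && !isSpace(doc[j]) && doc[j] != '<') ++j;
        }
        tokens.push_back(doc.substr(i, j - i));
        i = j;
    }
    return tokens;
}

// Format one document; every output line ends with '\n'.
std::string prettify(const std::string& doc) {
    std::string out;
    int depth = 0;
    bool inText = false;  // true while the current output line holds text words
    for (const std::string& t : tokenize(doc)) {
        if (t[0] == '<') {
            if (inText) { out += '\n'; inText = false; }
            bool closing = t.size() >= 2 && t[1] == '/';
            bool selfClosing = t.size() >= 3 && t[t.size() - 2] == '/';
            if (closing) --depth;
            out.append(static_cast<std::size_t>(depth), ' ');
            out += t;
            out += '\n';
            if (!closing && !selfClosing) ++depth;
        } else if (inText) {                      // later word on the same line
            out += ' ';
            out += t;
        } else {                                  // first word of a text run
            out.append(static_cast<std::size_t>(depth), ' ');
            out += t;
            inText = true;
        }
    }
    if (inText) out += '\n';
    return out;
}
```

Because words are joined only at output time, no trailing space is ever stored, so nothing needs to be trimmed afterwards.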
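If a simulation problem of this difficulty comes up in a contest, 30 minutes really is enough to code it, or about 40 minutes counting the time to read the statement. Of course, a problem like this belongs alongside computational geometry in the order of attack: after the other problems, but before the truly hard ones.
For me, simulation problems like this will never be a guaranteed pass, but I will keep going back to solve them after contests and slowly raise my odds. If I never practice them, then during a contest I can only count on a miracle. Compared with miracles, I trust probability more.
There is no problem that can't be AC'd, only ACMers who don't work hard!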