阳光不锈

Source code:

import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.URL;

import org.w3c.tidy.Tidy;

// Downloads an HTML page and converts it to XML with JTidy.
public class xml {
    private String url;
    private String outFileName;
    private String errOutFileName;

    public xml(String url, String outFileName, String errOutFileName) {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
    }

    public void convert() {
        Tidy tidy = new Tidy();

        // Tell Tidy to convert HTML to XML
        tidy.setXmlOut(true);

        // try-with-resources closes all three streams, even if parsing fails
        try (PrintWriter err = new PrintWriter(new FileWriter(errOutFileName), true);
             InputStream in = new BufferedInputStream(new URL(url).openStream());
             OutputStream out = new FileOutputStream(outFileName)) {

            // Set file for error messages
            tidy.setErrout(err);

            // Convert the page
            tidy.parse(in, out);

        } catch (IOException e) {
            System.err.println("Conversion failed: " + e);
        }
    }

    public static void main(String[] args) {
        /*
         * Hard-coded parameters:
         * URL of HTML file
         * Filename of output file
         * Filename of error file
         */
        String u = "http://www.yahoo.com";
        String o = "index.xml";
        String e = "error.xml";

        xml t = new xml(u, o, e);
        t.convert();
    }
}

 

Running the program produces an error file and an index file.

The error file contains:


Tidy (vers 2004-9-27) Parsing "InputStream"
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 23 column 90 - Warning: <spacer> is not approved by W3C
line 26 column 39 - Warning: <spacer> is not approved by W3C
line 33 column 1 - Warning: missing </font> before <form>
line 33 column 1 - Warning: trimming empty <font>
line 34 column 1 - Warning: inserting implicit <font>
line 37 column 2 - Warning: missing </font> before <table>
line 41 column 8 - Warning: inserting implicit <font>
line 41 column 34 - Warning: trimming empty <span>
line 42 column 1 - Warning: trimming empty <font>
line 49 column 27 - Warning: <spacer> is not approved by W3C
line 52 column 1 - Warning: <td> attribute "width" has invalid value "100%"
line 52 column 59 - Warning: <spacer> is not approved by W3C
line 54 column 1 - Warning: <td> attribute "height" has invalid value "100%"
line 54 column 54 - Warning: <spacer> is not approved by W3C
line 55 column 95 - Warning: <spacer> is not approved by W3C
line 57 column 95 - Warning: <spacer> is not approved by W3C
line 58 column 1 - Warning: <td> attribute "height" has invalid value "100%"
line 58 column 54 - Warning: <spacer> is not approved by W3C
line 60 column 1 - Warning: <td> attribute "width" has invalid value "100%"
line 60 column 66 - Warning: <spacer> is not approved by W3C
line 63 column 27 - Warning: <spacer> is not approved by W3C
line 66 column 1 - Warning: <td> attribute "width" has invalid value "100%"
line 66 column 59 - Warning: <spacer> is not approved by W3C
line 68 column 1 - Warning: <td> attribute "height" has invalid value "100%"
line 68 column 54 - Warning: <spacer> is not approved by W3C
line 69 column 95 - Warning: <spacer> is not approved by W3C
line 71 column 95 - Warning: <spacer> is not approved by W3C
line 72 column 1 - Warning: <td> attribute "height" has invalid value "100%"
line 72 column 54 - Warning: <spacer> is not approved by W3C
line 74 column 1 - Warning: <td> attribute "width" has invalid value "100%"
line 74 column 66 - Warning: <spacer> is not approved by W3C
line 81 column 90 - Warning: <spacer> is not approved by W3C
line 84 column 90 - Warning: <spacer> is not approved by W3C
line 85 column 90 - Warning: <spacer> is not approved by W3C
line 91 column 26 - Warning: <spacer> is not approved by W3C
line 97 column 1 - Warning: missing </font> before <table>
line 105 column -3 - Warning: discarding unexpected </font>
line 105 column 38 - Warning: <spacer> is not approved by W3C
line 108 column 27 - Warning: <spacer> is not approved by W3C
line 110 column 86 - Warning: <td> attribute "width" has invalid value "100%"
line 111 column 87 - Warning: <td> attribute "width" has invalid value "100%"
line 113 column 1 - Warning: <td> attribute "width" has invalid value "25%"
line 122 column 1 - Warning: <td> attribute "width" has invalid value "25%"
line 131 column 1 - Warning: <td> attribute "width" has invalid value "25%"
line 140 column 1 - Warning: <td> attribute "width" has invalid value "25%"
line 149 column 90 - Warning: <spacer> is not approved by W3C
line 151 column 109 - Warning: <spacer> is not approved by W3C
line 152 column 87 - Warning: <td> attribute "width" has invalid value "100%"
line 159 column 91 - Warning: <spacer> is not approved by W3C
line 160 column 100 - Warning: <spacer> is not approved by W3C
line 161 column 87 - Warning: <td> attribute "width" has invalid value "100%"
InputStream: Document content looks like HTML proprietary
52 warnings, no errors were found!
The table summary attribute should be used to describe
the table structure. It is very helpful for people using
non-visual browsers. The scope and headers attributes for
table cells are useful for specifying which headers apply
to each table cell, enabling non-visual browsers to provide
a meaningful context for each cell.
For further advice on how to make your pages accessible
see "http://www.w3.org/WAI/GL". You may also want to try
"http://www.cast.org/bobby/" which is a free Web-based
service for checking URLs for accessibility.
You are recommended to use CSS for controlling white
space (e.g. for indentation, margins and line spacing).
The proprietary <SPACER> element has limited vendor support.
You are recommended to use CSS to specify the font and
properties such as its size and color. This will reduce
the size of HTML files and make them easier to maintain
compared with using <FONT> elements.
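
Most of the warnings in this report repeat the same few messages. Below is a minimal sketch that counts how often each message occurs, assuming the report keeps the "line N column M - Warning: message" layout shown above. The class name TidyWarningSummary and the method summarize are hypothetical names chosen for illustration; they are not part of JTidy.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TidyWarningSummary {
    // Matches report lines such as:
    //   line 23 column 90 - Warning: <spacer> is not approved by W3C
    // The numbers may be negative (the report above contains "column -3").
    private static final Pattern WARNING =
            Pattern.compile("^line (-?\\d+) column (-?\\d+) - Warning: (.*)$");

    /** Counts how often each warning message occurs in the report text. */
    public static Map<String, Integer> summarize(String report) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : report.split("\\R")) {
            Matcher m = WARNING.matcher(line.trim());
            if (m.matches()) {
                // group(3) is the message without its line/column position
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few lines copied from the report above; in practice the text
        // would be read from the error file instead.
        String sample = String.join("\n",
                "line 1 column 1 - Warning: missing <!DOCTYPE> declaration",
                "line 23 column 90 - Warning: <spacer> is not approved by W3C",
                "line 26 column 39 - Warning: <spacer> is not approved by W3C",
                "line 105 column -3 - Warning: discarding unexpected </font>");
        summarize(sample).forEach((msg, n) -> System.out.println(n + " x " + msg));
    }
}
```

Lines that do not match the pattern, such as the summary and the accessibility advice at the end of the report, are simply skipped.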

 


The index file contains:


<html>
  <head>
    <meta name="generator"
    content="HTML Tidy for Java (vers. 27 九月 2004), see www.w3.org" />
    <title>Yahoo!</title>
    <meta http-equiv="Content-Type"
    content="text/html; charset=UTF-8" />
    <meta name="robots" content="noindex, nofollow" />
    <meta name="robots" content="noarchive" />
    <meta http-equiv="PICS-Label"
    content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l r (cz 1 lz 1 nz 1 oz 1 vz 1) gen true for "http://www.yahoo.com" r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l r (n 0 s 0 v 0 l 0) gen true for "http://www.yahoo.com" r (n 0 s 0 v 0 l 0))' />
    <base
    href="http://www.yahoo.com/_ylh=X3oDMTFnMzE5cjMyBF9TAzI3MTYxNDkEcGlkAzEyMzIyNzI5MjEEdGVzdAMwBHRtcGwDdGFibGUuaHRtbA--/"
     target="_top" />
<style type="text/css">
a{color:#16387c;}
a:link,a:visited{text-decoration:none;}
a:hover{text-decoration:underline;}
    </style>
<style type="text/css" media="all">
#p{width:310px;}
form{margin:0;}
    </style>
  </head>
  <body link="#16387C" vlink="#16387C">
    <center>
      <table cellpadding="0" cellspacing="0" border="0"
      bgcolor="#EEF3F6" width="760">
        <tr>
          <td colspan="3">
            <table cellpadding="0" cellspacing="0" border="0"
            width="0">
              <tr>
                <td width="0" height="7">
                  <spacer type="block" width="0" height="7" />
                </td>
              </tr>
            </table>
          </td>
        </tr>
        <tr>
          <td width="10" height="0" rowspan="2">
            <spacer type="block" width="10" height="0" />
          </td>
          <td height="56" valign="top">
            <img src="http://l.yimg.com/a/i/ww/beta/y3.gif"
            width="232" height="44" alt="Yahoo!" title="Yahoo" />
          </td>
          <td rowspan="2">
            <table cellpadding="1" cellspacing="0" border="0"
            bgcolor="#BFCFD7">
              <tr>
                <td>
                  <table cellpadding="20" cellspacing="0"
                  border="0" bgcolor="#F4F6F5">
                    <tr>
                      <td>
                      <form name="sf1"
                      action="r/sx/*-http://search.yahoo.com/search">
                        <font face="arial" size="-1">
                        <input type="hidden" name="ei"
                        value="UTF-8" />
                        
                        <input type="hidden" name="fr"
                        value="yfp-t-501" />
                        
                        <input type="hidden" name="cop"
                        value="mss" />
                        </font>
                        <table cellpadding="0" cellspacing="0"
                        border="0">
                          <tr>
                            <td>
                              <font size="+0">
                                <input type="text" name="p" id="p"
                                size="20" />
                              </font>
                            </td>
                            <td>
                              <input type="image"
                              value="Web Search"
                              src="http://l.yimg.com/a/i/ww/tbl/webs.gif"
                               alt="Web Search" border="0" />
                            </td>
                          </tr>
                        </table>
                      </form> </td>
                    </tr>
                  </table>
                </td>
              </tr>
            </table>
          </td>
        </tr>
        <tr>
          <td>
            <table cellpadding="0" cellspacing="0" border="0">
              <tr>
                <td width="10" height="0">
                  <spacer type="block" width="10" height="0" />
                </td>
                <td>
                  <table cellpadding="0" cellspacing="0"
                  border="0">
                    <tr>
                      <td width="100%" height="1" bgcolor="#DEE6E9"
                      colspan="5">
                        <spacer type="block" width="100%"
                        height="1" />
                      </td>
                    </tr>
                    <tr>
                      <td width="1" height="100%" bgcolor="#DEE6E9"
                      nowrap="nowrap">
                        <spacer type="block" width="1"
                        height="100%" />
                      </td>
                      <td width="5" height="1" bgcolor="#DDE8EA"
                      background="http://l.yimg.com/a/i/ww/tbl/fdbg.gif">
                        <spacer type="block" width="5"
                        height="1" />
                      </td>
                      <td>
                        <table cellpadding="2" cellspacing="0"
                        border="0" bgcolor="#DEE6E9">
                          <tr>
                            <td bgcolor="#DDE8EA"
                            background="http://l.yimg.com/a/i/ww/tbl/fdbg.gif"
                             nowrap="nowrap">
                              <a href="r/i1">
                                <font face="arial" size="-1">My
                                Yahoo!</font>
                              </a>
                            </td>
                          </tr>
                        </table>
                      </td>
                      <td width="5" height="1" bgcolor="#DDE8EA"
                      background="http://l.yimg.com/a/i/ww/tbl/fdbg.gif">
                        <spacer type="block" width="5"
                        height="1" />
                      </td>
                      <td width="1" height="100%" bgcolor="#586B7A"
                      nowrap="nowrap">
                        <spacer type="block" width="1"
                        height="100%" />
                      </td>
                    </tr>
                    <tr>
                      <td width="100%" height="1" bgcolor="#586B7A"
                      colspan="5" nowrap="nowrap">
                        <spacer type="block" width="100%"
                        height="1" />
                      </td>
                    </tr>
                  </table>
                </td>
                <td width="10" height="0">
                  <spacer type="block" width="10" height="0" />
                </td>
                <td>
                  <table cellpadding="0" cellspacing="0"
                  border="0">
                    <tr>
                      <td width="100%" height="1" bgcolor="#DEE6E9"
                      colspan="5">
                        <spacer type="block" width="100%"
                        height="1" />
                      </td>
                    </tr>
                    <tr>
                      <td width="1" height="100%" bgcolor="#DEE6E9"
                      nowrap="nowrap">
                        <spacer type="block" width="1"
                        height="100%" />
                      </td>
                      <td width="5" height="1" bgcolor="#DDE8EA"
                      background="http://l.yimg.com/a/i/ww/tbl/fdbg.gif">
                        <spacer type="block" width="5"
                        height="1" />
                      </td>
                      <td>
                        <table cellpadding="2" cellspacing="0"
                        border="0" bgcolor="#DEE6E9">
                          <tr>
                            <td bgcolor="#DDE8EA"
                            background="http://l.yimg.com/a/i/ww/tbl/fdbg.gif"
                             nowrap="nowrap">
                              <a href="r/m1">
                                <font face="arial" size="-1">My
                                Mail</font>
                              </a>
                            </td>
                          </tr>
                        </table>
                      </td>
                      <td width="5" height="1" bgcolor="#DDE8EA"
                      background="http://l.yimg.com/a/i/ww/tbl/fdbg.gif">
                        <spacer type="block" width="5"
                        height="1" />
                      </td>
                      <td width="1" height="100%" bgcolor="#586B7A"
                      nowrap="nowrap">
                        <spacer type="block" width="1"
                        height="100%" />
                      </td>
                    </tr>
                    <tr>
                      <td width="100%" height="1" bgcolor="#586B7A"
                      colspan="5" nowrap="nowrap">
                        <spacer type="block" width="100%"
                        height="1" />
                      </td>
                    </tr>
                  </table>
                </td>
              </tr>
            </table>
          </td>
        </tr>
        <tr>
          <td colspan="3">
            <table cellpadding="0" cellspacing="0" border="0"
            width="0">
              <tr>
                <td width="0" height="9">
                  <spacer type="block" width="0" height="9" />
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" border="0" width="0">
        <tr>
          <td width="0" height="5">
            <spacer type="block" width="0" height="5" />
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" border="0" width="0">
        <tr>
          <td width="0" height="5">
            <spacer type="block" width="0" height="5" />
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" border="0"
      width="760">
        <tr>
          <td width="292" valign="top">
            <table cellpadding="1" cellspacing="0" border="0"
            bgcolor="#FFCC00">
              <tr>
                <td>
                  <table cellpadding="5" cellspacing="0" border="0"
                  bgcolor="#FFFBC4">
                    <tr>
                      <td width="0" height="0">
                        <spacer type="block" width="0"
                        height="0" />
                      </td>
                      <td>
                        <font face="arial" size="-1">
                        <b>Why miss out?</b>
                        <br />To see all the new Yahoo! home page
                        has to offer, please upgrade to a more
                        recent browser.
                        <br />
                        <br /> Supported browsers include:
                        <br /></font>
                        <table cellpadding="0" cellspacing="0"
                        border="0">
                          <tr>
                            <td>
                              <a href="r/b7">
                                <font face="arial"
                                size="-2">Internet Explorer 7
                                optimized by Yahoo!</font>
                              </a>
                            </td>
                          </tr>
                          <tr>
                            <td>
                              <a href="r/bf">
                                <font face="arial"
                                size="-2">Firefox 3</font>
                              </a>
                            </td>
                          </tr>
                          <tr>
                            <td>
                              <a href="r/bg">
                                <font face="arial" size="-2">Safari
                                3</font>
                              </a>
                            </td>
                          </tr>
                          <tr>
                            <td>
                              <a href="r/bh">
                                <font face="arial" size="-2">Opera
                                9</font>
                              </a>
                            </td>
                          </tr>
                          <tr>
                            <td>
                              <a href="r/zy">
                                <font face="arial"
                                size="-2">Flock</font>
                              </a>
                            </td>
                          </tr><?php fp_add_short_links(array('r/b7','r/bf','r/bg','r/bh','r/zy')); ?>
                        </table>
                      </td>
                      <td width="0" height="0">
                        <spacer type="block" width="0"
                        height="0" />
                      </td>
                    </tr>
                  </table>
                </td>
              </tr>
            </table>
          </td>
          <td width="10" height="0">
            <spacer type="block" width="10" height="0" />
          </td>
          <td width="458" valign="top">
            <table cellpadding="1" cellspacing="0" border="0"
            bgcolor="#AFBDC6" width="100%">
              <tr>
                <td width="100%" valign="top">
                  <table cellpadding="10" cellspacing="0"
                  border="0" bgcolor="#FCFDFD" width="100%">
                    <tr>
                      <td width="100%">
                        <table cellpadding="1" cellspacing="0"
                        border="0" width="100%">
                          <tr>
                            <td width="25%" valign="top">
                              <font face="arial" size="-1">
                                <b>
                                  <a href="r/6n">Answers</a>
                                  <br />
                                  <a href="r/cr">Autos</a>
                                  <br />
                                  <a href="r/59">Entertainment</a>
                                  <br />
                                  <a href="r/sq">Finance</a>
                                  <br />
                                  <a href="r/pl">Games</a>
                                  <br />
                                  <a href="r/g3">Geocities</a>
                                  <br />
                                  <a href="r/gp">Groups</a>
                                  <br />
                                </b>
                              </font>
                            </td>
                            <td width="25%" valign="top">
                              <font face="arial" size="-1">
                                <b>
                                  <a href="r/wm">Health</a>
                                  <br />
                                  <a href="r/h1">Horoscopes</a>
                                  <br />
                                  <a href="r/jb">HotJobs</a>
                                  <br />
                                  <a href="r/yg">Kids</a>
                                  <br />
                                  <a href="r/0z">Local</a>
                                  <br />
                                  <a href="r/mp">Maps</a>
                                  <br />
                                  <a href="r/p1">Messenger</a>
                                  <br />
                                </b>
                              </font>
                            </td>
                            <td width="25%" valign="top">
                              <font face="arial" size="-1">
                                <b>
                                  <a href="r/6k">Movies</a>
                                  <br />
                                  <a href="r/uf">Music</a>
                                  <br />
                                  <a href="r/dn">News</a>
                                  <br />
                                  <a href="r/pr">Personals</a>
                                  <br />
                                  <a href="r/r1">Real Estate</a>
                                  <br />
                                  <a href="r/sh">Shopping</a>
                                  <br />
                                  <a href="r/ys">Sports</a>
                                  <br />
                                </b>
                              </font>
                            </td>
                            <td width="25%" valign="top">
                              <font face="arial" size="-1">
                                <b>
                                  <a href="r/h0">Tech</a>
                                  <br />
                                  <a href="r/ta">Travel</a>
                                  <br />
                                  <a href="r/tg">TV</a>
                                  <br />
                                  <a href="r/5m">Weather</a>
                                  <br />
                                  <a href="r/yp">Yellow Pages</a>
                                  <br />
                                  <a href="r/wl">Y!
                                  International</a>
                                  <br />
                                </b>
                              </font>
                            </td>
                          </tr>
                        </table>
                        <table cellpadding="0" cellspacing="0"
                        border="0" width="0">
                          <tr>
                            <td width="0" height="3">
                              <spacer type="block" width="0"
                              height="3" />
                            </td>
                          </tr>
                        </table>
                      </td>
                    </tr>
                  </table>
                  <table cellpadding="0" cellspacing="0" border="0"
                  width="0">
                    <tr>
                      <td width="10" height="0" bgcolor="#AFBCC6">
                        <spacer type="block" width="10"
                        height="0" />
                      </td>
                    </tr>
                  </table>
                  <table cellpadding="10" cellspacing="0"
                  border="0" bgcolor="#FCFDFD" width="100%">
                    <tr>
                      <td width="100%" align="right">
                        <a href="r/xy">
                          <img
                          src="http://l.yimg.com/a/i/ww/tbl/allys.gif"
                           width="138" height="20" border="0"
                          alt="All Yahoo! Services" />
                        </a>
                      </td>
                    </tr>
                  </table>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" border="0" width="0">
        <tr>
          <td width="0" height="15">
            <spacer type="block" width="0" height="15" />
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" border="0"
      width="760">
        <tr>
          <td bgcolor="#DDE4E9" height="1">
            <spacer type="block" width="0" height="1" />
          </td>
        </tr>
        <tr>
          <td>
            <table cellpadding="10" cellspacing="0" border="0"
            bgcolor="#FCFDFD" width="100%">
              <tr>
                <td width="100%" align="center">
                  <font face="arial" size="-2" color="#16387C">
                  <a href="r/ao">Advertise with us</a> |
                  <a href="r/o4">Search Marketing</a> |
                  <a href="r/pv">Privacy Policy</a> |
                  <a href="r/ts">Terms of Service</a> |
                  <a href="r/ad">Suggest a site</a> |
                  <a href="r/cb">Yahoo! en Español</a> |
                  <a href="r/1p">Send Feedback</a> |
                  <a href="r/hw">Help</a>
                  <br />
                  <br />
                  <font color="#999999">Copyright © 2008
                  Yahoo! Inc. All rights reserved.</font>
                  <a href="r/cy">Copyright/IP Policy</a> |
                  <a href="r/cp">Company Info</a> |
                  <a href="r/1q">Participate in Research</a> |
                  <a href="r/hr">Jobs</a></font>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
    </center>
<!-- pbt 1232272921 -->
  </body>
</html>


Opened in a browser, the result shows only part of the original page.
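
One possible explanation: the converted page itself contains the notice "please upgrade to a more recent browser", which suggests the server sent a reduced page to the Java client rather than Tidy dropping content. Below is a minimal sketch of sending a browser-like User-Agent header with the request. The class name BrowserLikeFetch and the User-Agent string are hypothetical, and whether the server would then return the full page is an assumption that has not been verified here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

public class BrowserLikeFetch {
    // Hypothetical User-Agent value, chosen only for illustration.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    /** Prepares (but does not yet open) a connection that sends USER_AGENT. */
    public static URLConnection prepare(String url) {
        try {
            // openConnection() only creates the connection object;
            // no network traffic happens until the stream is requested.
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", USER_AGENT);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection conn = prepare("http://www.yahoo.com");
        // conn.getInputStream() could then be wrapped in a BufferedInputStream
        // and passed to tidy.parse(in, out) in place of u.openStream().
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```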

 

posted on 2009-01-18 18:34 by 靳小透