Parsing XML with Jsoup

Working with XML files

Parsing (reading): load the data in a document into memory.
Writing: save in-memory data to an XML document for persistent storage.
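
To make the "writing" direction concrete, here is a minimal sketch that builds a small tree in memory and serializes it to XML text. It uses only the JDK's built-in JAXP classes, not Jsoup; the class name `XmlWriteSketch` and the method `buildStudentXml` are hypothetical names chosen for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
public class XmlWriteSketch {

    /** Builds a one-student document in memory and returns it as XML text. */
    static String buildStudentXml(String number, String name) {
        try {
            // Create an empty in-memory DOM document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element students = doc.createElement("students");
            doc.appendChild(students);

            Element student = doc.createElement("student");
            student.setAttribute("number", number);
            students.appendChild(student);

            Element nameEl = doc.createElement("name");
            nameEl.setTextContent(name);
            student.appendChild(nameEl);

            // Serialize the tree; the XML declaration is omitted for brevity.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked JAXP exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildStudentXml("heima_0001", "tom"));
    }
}
```

To persist the result to disk instead of a string, `StreamResult` also accepts a `java.io.File`.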

Ways to parse XML

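DOM: loads the whole markup document into memory at once and builds a DOM tree there
- Advantages:
  
  Convenient to work with; supports all CRUD (create, read, update, delete) operations on the document
- Disadvantages:
  
  Uses a lot of memory
SAX: reads the document sequentially (roughly line by line) and is event-driven
- Advantages:
  
  Uses very little memory
- Disadvantages:
  
  Read-only; the document cannot be created, updated, or deleted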
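
The difference between the two models can be sketched with the JDK's built-in JAXP parsers. This is only an illustration under assumptions: the class `DomVsSaxSketch`, its method names, and the inline sample XML are hypothetical. The DOM method queries a finished tree, while the SAX method collects values from callbacks as the input streams past.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, for illustration only.
public class DomVsSaxSketch {
    static final String XML =
            "<students>"
            + "<student number=\"heima_0001\"><name>tom</name></student>"
            + "<student number=\"heima_0002\"><name>jack</name></student>"
            + "</students>";

    /** DOM: the whole tree is built first, then queried (it could also be modified). */
    static int countStudentsWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("student").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** SAX: a callback fires for each start tag; no tree is kept in memory. */
    static List<String> studentNumbersWithSax(String xml) {
        List<String> numbers = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("student".equals(qName)) {
                    numbers.add(attrs.getValue("number"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(countStudentsWithDom(XML));
        System.out.println(studentNumbersWithSax(XML));
    }
}
```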

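Common parsers:

JAXP: the parser API provided by Sun; supports both the DOM and SAX approaches
DOM4J: a well-regarded third-party parser
Jsoup: a Java HTML parser that can directly parse a URL or HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
PULL: the parser built into the Android system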

Jsoup

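Quick start

Scrape and parse HTML from a URL, file, or string

Find and extract data, using DOM traversal or CSS selectors

Manipulate HTML elements, attributes, and text

Clean user-submitted content against a safe whitelist, to prevent XSS attacks

Output tidy HTML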


Steps:

1. Import the jar
2. Get the Document object
3. Get the Element objects for the tags you need
4. Get the data

Code:

XML file:

<?xml version="1.0" encoding="UTF-8" ?>
<students>
    <student number="heima_0001">
        <name id="cat">tom</name>
        <age>18</age>
        <sex>male</sex>
    </student>
    <student number="heima_0002">
        <name>jack</name>
        <age>12</age>
        <sex>male</sex>
    </student>
</students>

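Test code:

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        // Locate student.xml on the classpath
        String path = JsoupTest.class.getClassLoader().getResource("student.xml").getPath();
        // Parse the file into a Document
        Document document = Jsoup.parse(new File(path), "utf-8");
        // Get every <name> element
        Elements elements = document.getElementsByTag("name");
        System.out.println(elements.size());
        // Print the text of each element
        for (int i = 0; i < elements.size(); i++) {
            System.out.println(elements.get(i).text());
        }
    }
}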

Using the objects

Jsoup: a utility class that parses HTML or XML documents and returns a Document
1. parse methods
  1. Parse an XML or HTML file
```
public static Document parse(File in,String charsetName)throws IOException
```
    Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
  2. Parse an XML or HTML string
```
 public static Document parse(String html)
```
    Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
  3. Get the document object for an HTML or XML resource at a network URL
```
  public static Document parse(URL url,int timeoutMillis)throws IOException
```
    Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use `connect(String)` instead.
    
    The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.
    Further overloads are available as well.
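
The overloads above all parse their input as HTML. When the input is known to be XML, Jsoup also offers `Parser.xmlParser()`, which can be passed to the three-argument `Jsoup.parse(String, String, Parser)` overload. The sketch below assumes the Jsoup jar is on the classpath; the class `JsoupParseSketch` and the method `firstName` are hypothetical names used for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical class, for illustration only; requires the Jsoup jar.
public class JsoupParseSketch {
    static final String XML =
            "<students>"
            + "<student number=\"heima_0001\"><name>tom</name></student>"
            + "<student number=\"heima_0002\"><name>jack</name></student>"
            + "</students>";

    /** Parses the string with the XML parser and returns the text of the first <name>. */
    static String firstName(String xml) {
        // The empty string is the base URI; the sample has no relative URLs to resolve.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.getElementsByTag("name").first().text();
    }

    public static void main(String[] args) {
        System.out.println(firstName(XML));
    }
}
```

With the XML parser, Jsoup does not wrap the content in `html`, `head`, and `body` elements as the HTML parser does.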
Document: the document object; represents the DOM tree in memory
1. Getting Element objects
  1. Get a collection of elements by tag name
```
public Elements getElementsByTag(String tagName)
```
  Finds elements, including and recursively under this element, with the specified tag name.
  2. Get a collection of elements by attribute name
```
public Elements getElementsByAttribute(String key)
```
  Find elements that have a named attribute set. Case insensitive.
  3. Get a collection of elements by attribute name and value
```
public Elements getElementsByAttributeValue(String key, String value)
```
  Find elements that have an attribute with the specific value. Case insensitive.
  4. Get the unique element with a given ID attribute
```
public Element getElementById(String id)
```
  Find an element by ID, including or under this element.
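Elements: a collection of Element objects; can be used like an ArrayList
Element: the element object
1. Getting child elements (the same getElementsBy... methods apply, since Document is itself an Element)
2. Getting attribute values
  1. String attr(String key): get an attribute value by attribute name
3. Getting text content
  1. String text(): get the text content only
  2. String html(): get the full content of the tag body, including child tags
Node: the node object
- Superclass of Document and Element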
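
As a small illustration of the Element accessors listed above, the following sketch reads an attribute, the text, and the inner markup of one element. It assumes the Jsoup jar is on the classpath; the class `ElementAccessSketch`, the method `firstStudent`, and the inline sample are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical class, for illustration only; requires the Jsoup jar.
public class ElementAccessSketch {
    static final String XML =
            "<student number=\"heima_0001\"><name id=\"cat\">tom</name></student>";

    /** Parses the sample as XML and returns the first <student> element. */
    static Element firstStudent(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.getElementsByTag("student").first();
    }

    public static void main(String[] args) {
        Element student = firstStudent(XML);
        // attr(key): the value of the named attribute
        System.out.println(student.attr("number"));
        // text(): the text content, with all tags stripped
        System.out.println(student.text());
        // html(): the inner markup, child tags included
        System.out.println(student.html());
    }
}
```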

Quick query methods

selector: CSS selectors

Method used: Elements select(String cssQuery)

Example:

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        // Locate student.xml on the classpath
        String path = JsoupTest.class.getClassLoader().getResource("student.xml").getPath();
        // Parse the file into a Document
        Document document = Jsoup.parse(new File(path), "utf-8");
        // Select by tag name
        Elements names = document.select("name");
        System.out.println(names.get(0).text());
        // Select by id
        Elements cat = document.select("#cat");
        System.out.println(cat.text());
        System.out.println("******************");
        // Select the student whose number attribute equals heima_0001
        Elements student = document.select("student[number=\"heima_0001\"]");
        System.out.println(student);
        System.out.println("******************");
        // Select the age child of the student whose number equals heima_0001
        Elements age = document.select("student[number=\"heima_0001\"]>age");
        System.out.println(age);
    }
}

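XPath:

Explanation:

XPath is a language for finding information in XML documents.

XPath is a major element in XSLT.

XQuery and XPointer are both built on XPath expressions.

Using XPath with Jsoup requires importing an additional jar (JsoupXpath, which provides JXDocument).

Consult the w3cschool XPath reference and write the queries using XPath syntax: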

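import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// JXDocument, JXNode, and XpathSyntaxErrorException come from the JsoupXpath jar;
// their package depends on the library version, so import them from the jar you use.

public class JsoupXpath {
    public static void main(String[] args) throws IOException, XpathSyntaxErrorException {
        // Locate student.xml on the classpath
        String path = JsoupXpath.class.getClassLoader().getResource("student.xml").getPath();
        // Parse the file into a Document
        Document document = Jsoup.parse(new File(path), "utf-8");
        // Create a JXDocument from the Jsoup Document
        JXDocument jxDocument = new JXDocument(document);
        // Query with XPath syntax: all student elements
        List<JXNode> students = jxDocument.selN("//student");
        System.out.println(students);

        System.out.println("__________________________");
        // The student whose number attribute equals heima_0001
        List<JXNode> student = jxDocument.selN("//student[@number='heima_0001']");
        System.out.println(student);
    }
}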
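
For comparison, the same two XPath queries can be run without any extra jar by using the JDK's built-in `javax.xml.xpath` package on a W3C DOM document. This is a different technique from the JsoupXpath approach above, shown only as a sketch; the class `JdkXpathSketch`, its method names, and the inline sample are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class, for illustration only; uses the JDK, not Jsoup.
public class JdkXpathSketch {
    static final String XML =
            "<students>"
            + "<student number=\"heima_0001\"><name>tom</name></student>"
            + "<student number=\"heima_0002\"><name>jack</name></student>"
            + "</students>";

    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Counts the nodes matched by //student. */
    static int countStudents(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//student", parse(xml), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the name of the student with the given number attribute. */
    static String nameOfStudent(String xml, String number) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The number is concatenated directly, so it must be a trusted value.
            String query = "//student[@number='" + number + "']/name";
            return xpath.evaluate(query, parse(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countStudents(XML));
        System.out.println(nameOfStudent(XML, "heima_0001"));
    }
}
```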

posted @ 2019-08-03 22:37 PoetryAndYou
