Parsing DOM documents with the Jsoup library
DOM documents include HTML, XML, and similar markup formats.
How Jsoup obtains data
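    import java.io.File;
    import java.io.IOException;

    import android.os.Environment;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Three sources: an HTML string, a URL, and a local HTML file.

    // 1. Parse an HTML string. The <title> element is needed here;
    //    without it doc.title() returns an empty string.
    String html = "<html>"
            + "<head><title>carloz Jsoup usage</title></head>"
            + "<body>"
            + "<p><a href='http://baidu.com'>Articles about the jsoup project</a></p>"
            + "</body>"
            + "</html>";
    Document doc = Jsoup.parse(html);
    String title = doc.title();

    String url = "http://baidu.com";

    // 2a. Fetch the page with a GET request.
    try {
        Document doc2 = Jsoup.connect(url).get();
    } catch (IOException e) {
        e.printStackTrace();
    }

    // 2b. Fetch the page with a POST request (form data, 3000 ms timeout).
    try {
        Document doc3 = Jsoup.connect(url).data("key", "value").timeout(3000).post();
    } catch (IOException e) {
        e.printStackTrace();
    }

    // 3. Parse a local file. The third argument is the base URI used to
    //    resolve relative links inside the file.
    File input = new File(Environment.getDataDirectory() + "/index.html");
    try {
        Document doc4 = Jsoup.parse(input, "utf-8", "http://baidu.com");
    } catch (IOException e) {
        e.printStackTrace();
    }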
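The base URI passed to Jsoup.parse in the local-file case is easy to overlook, so here is a minimal sketch of what it does. The class name BaseUriSketch, the helper firstAbsoluteLink, and the example.com URLs are made up for illustration; only Jsoup.parse(String, String), select, and absUrl are real jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup or the original post.
final class BaseUriSketch {

    /** Parses html against baseUri and returns the first link as an absolute URL. */
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        // absUrl returns "" when the attribute cannot be resolved to an absolute URL.
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href='/news/index.html'>News</a></p>";
        // The relative href is resolved against the base URI.
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
    }
}
```

With an empty base URI the same call returns an empty string for relative links, which is why the file-based parse supplies one.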
Manipulating a DOM document with Jsoup
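    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.safety.Whitelist;
    import org.jsoup.select.Elements;

    // Parse an HTML string (a URL or a local file works the same way).
    String html = "<html>"
            + "<head><title>carloz Jsoup usage</title></head>"
            + "<body>"
            + "<p><a href='http://baidu.com'>Articles about the jsoup project</a></p>"
            + "</body>"
            + "</html>";
    Document doc = Jsoup.parse(html);
    String title = doc.title();

    // Select elements by tag name and read their attributes and text.
    Elements eles = doc.getElementsByTag("a");
    for (Element link : eles) {
        String linkHref = link.attr("href");
        String text = link.text();
    }

    // Select elements with CSS selectors.
    Elements links = doc.select("a[href]");              // <a> tags that have an href
    Elements pngImages = doc.select("img[src$=.png]");   // images whose src ends in .png
    Element firstDiv = doc.select("div.className").first(); // first <div class="className">, or null

    // Modify elements.
    doc.select("div.className").attr("key", "value");    // set key="value"
    doc.select("div.className").addClass("myClass");     // add class="myClass"
    doc.select("img").removeAttr("onclick");             // remove the onclick attribute

    // Sanitize: convert unsafe HTML into a safe form.
    // Newer jsoup releases rename Whitelist to Safelist.
    String htmls = "";
    String safeHtml = Jsoup.clean(htmls, Whitelist.basic());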
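The cleaning step above runs on an empty string, so its effect is not visible. The following sketch feeds it some deliberately unsafe markup instead. It assumes jsoup 1.14.2 or later, where the class is called Safelist (older releases use Whitelist with the same basic() factory). CleanSketch, sanitize, and stripOnclick are illustrative names, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical helper class for illustration only.
final class CleanSketch {

    /** Keeps only the tags and attributes allowed by the basic safelist. */
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    /** Removes inline onclick handlers from every <img> in a body fragment. */
    static String stripOnclick(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.select("img").removeAttr("onclick");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String unsafe = "<p>Hi<script>alert(1)</script> "
                + "<a href='http://example.com/' onclick='steal()'>link</a></p>";
        // The script element and the onclick attribute are dropped; the link survives.
        System.out.println(sanitize(unsafe));
        System.out.println(stripOnclick("<img src='a.png' onclick='x()'>"));
    }
}
```

Jsoup.clean is the better fit for untrusted input because it works from an allow-list; removeAttr only strips the one attribute you name.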
------------------------------------------------
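Worked examples
* 1. Parsing an HTML news list with Jsoup
Add the network access permission to AndroidManifest.xml: <uses-permission android:name="android.permission.INTERNET"/>
Network requests must run on a separate thread; on the main thread Android throws NetworkOnMainThreadException.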
    // This involves a network request, so run it on a new thread.
    new Thread(new Runnable() {
        @Override
        public void run() {
            parseHtml();
        }
    }).start();
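A bare Thread like the one above has no built-in way to hand a result back. One common alternative is an executor plus a callback, sketched below with only the JDK (Java 8 lambdas assumed). BackgroundFetchSketch, Callback, fetchAsync, and blockingFetch are hypothetical names; blockingFetch is a stand-in for the real blocking work such as parseHtml(). On Android the callback would still need to hop back to the main thread (for example with runOnUiThread) before touching views.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class for illustration only.
final class BackgroundFetchSketch {

    /** Receives the result on the worker thread. */
    interface Callback {
        void onResult(String result);
    }

    /** Stand-in for a blocking network call; replace with the real work. */
    static String blockingFetch(String url) {
        return "fetched:" + url;
    }

    /** Runs blockingFetch off the calling thread and reports through the callback. */
    static void fetchAsync(String url, Callback callback) {
        // One executor per call keeps the sketch short; a real app would
        // usually keep a single long-lived executor.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(() -> callback.onResult(blockingFetch(url)));
        // shutdown() lets the submitted task finish, then releases the thread.
        executor.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        fetchAsync("http://example.com/", result -> {
            System.out.println(result + " on " + Thread.currentThread().getName());
            done.countDown();
        });
        // Wait only so the demo does not exit before the callback runs.
        done.await(5, TimeUnit.SECONDS);
    }
}
```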
    private void parseHtml() {
        try {
            // `url` and `TAG` are fields of the enclosing class.
            Document doc = Jsoup.connect(url).get();
            // Inspect the page's HTML first to find a suitable selector.
            Elements units = doc.select("div.unit");
            for (Element unit : units) {
                Element heading = unit.getElementsByTag("h1").first();
                if (heading == null) {
                    continue; // skip blocks without a headline
                }
                Element link = heading.getElementsByTag("a").first();
                String title = heading.text();
                String href = (link == null) ? "" : link.attr("href");
                Log.i(TAG, title + ": " + href);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
Output
    08-10 16:17:59.427: I/CARLOZ(14728): Google产品总监:改良Apple Watch UI的若干设想: http://www.csdn.net/article/2015-08-10/2825421
    08-10 16:17:59.427: I/CARLOZ(14728): Android项目中如何用好构建神器Gradle?: http://www.csdn.net/article/2015-08-10/2825420
    08-10 16:17:59.427: I/CARLOZ(14728): Oculus PC SDK 0.7版即将发布,含新Direct Driver Mode: http://www.csdn.net/article/2015-08-10/2825419-Oculus-SDK-0-7-new-direct-driver-mode-and-a-major-overhaul
    08-10 16:17:59.427: I/CARLOZ(14728): 实战iOS 9:详解Xcode的Code Coverage工具: http://www.csdn.net/article/2015-08-07/2825410
    08-10 16:17:59.427: I/CARLOZ(14728): 《近匠》Remix周哲,让Android运行在“电脑”上!: http://www.csdn.net/article/2015-08-06/2825396-remix-tablet
    08-10 16:17:59.427: I/CARLOZ(14728): Android上玩玩Hook:Cydia Substrate实战: http://www.csdn.net/article/2015-08-07/2825405
    08-10 16:17:59.427: I/CARLOZ(14728): 易讯理想科技自研AR技术 面向商业场景推“幻视”AR应用: http://www.csdn.net/article/2015-08-06/2825398
    08-10 16:17:59.427: I/CARLOZ(14728): 令人惊叹的复杂之美:如何做一个iOS分形App?: http://www.csdn.net/article/2015-08-05/2825382
    08-10 16:17:59.427: I/CARLOZ(14728): 游戏设计中的色彩哲学:没想象的那么简单: http://www.csdn.net/article/2015-08-04/2825378
    08-10 16:17:59.427: I/CARLOZ(14728): iOS开发实战教学:在Swift怎样创建CocoaPod?: http://www.csdn.net/article/2015-08-05/2825383
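The extraction logic in parseHtml can be tried without a network connection by parsing a string instead. This is a rough sketch under a few assumptions: the sample markup is invented to mimic the div.unit / h1 / a structure the selector expects, and NewsListSketch and extractHeadlines are illustrative names. Only the jsoup calls themselves are real.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
final class NewsListSketch {

    /** Returns "title: absolute-url" for every div.unit that has a linked h1. */
    static List<String> extractHeadlines(String html, String baseUri) {
        List<String> headlines = new ArrayList<>();
        Document doc = Jsoup.parse(html, baseUri);
        for (Element unit : doc.select("div.unit")) {
            // first() returns null when nothing matches, so check before use.
            Element link = unit.select("h1 a[href]").first();
            if (link == null) {
                continue;
            }
            headlines.add(link.text() + ": " + link.absUrl("href"));
        }
        return headlines;
    }

    public static void main(String[] args) {
        // Invented sample markup; the second unit has no link and is skipped.
        String html = "<div class='unit'><h1><a href='/article/1'>First</a></h1></div>"
                + "<div class='unit'><h1>No link here</h1></div>";
        for (String line : extractHeadlines(html, "http://example.com/")) {
            System.out.println(line);
        }
    }
}
```

Keeping the parsing in a method that takes a string makes it easy to test separately from the network call.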
* 2. Parsing Epub3 with Jsoup
    private void parseEpub() {
        // 2. Parse an EPUB e-book with Jsoup.
        // Copy toc.ncx into the assets directory first.
        try (InputStream is = getAssets().open("toc.ncx")) {
            // Let Jsoup read the whole stream. A single read() into a buffer
            // sized by available() is not guaranteed to return the full file.
            Document doc = Jsoup.parse(is, "utf-8", "");
            Element docTitleElement = doc.getElementsByTag("docTitle").first();
            String docTitle = (docTitleElement == null) ? "" : docTitleElement.text();
            Elements navPoints = doc.getElementsByTag("navPoint");
            for (Element navPoint : navPoints) {
                String title = navPoint.text();
                Element content = navPoint.getElementsByTag("content").first();
                if (content == null) {
                    continue; // navPoint without a target file
                }
                String href = content.attr("src");
                Log.i(TAG, title + ": " + href);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
Output
    08-10 16:37:28.847: I/CARLOZ(16323): 第一章 童年: chapter_156815.xhtml
    08-10 16:37:28.847: I/CARLOZ(16323): 第二章 灾难: chapter_156816.xhtml
    08-10 16:37:28.847: I/CARLOZ(16323): 和尚的生涯: chapter_156817.xhtml
    08-10 16:37:28.847: I/CARLOZ(16323): 第三章 踏上征途: chapter_156818.xhtml
    08-10 16:37:28.847: I/CARLOZ(16323): 第四章 就从这里起步: chapter_156819.xhtml
    08-10 16:37:28.847: I/CARLOZ(16323): 第五章 储蓄资本: chapter_156820.xhtml
    08-10 16:37:28.847: I/CARLOZ(16323): 最后一个障碍: chapter_156821.xhtml
    08-10 16:37:28.847: I/CARLOZ(16323): 第六章 霸业的开始: chapter_156822.xhtml
    ...
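To see the table-of-contents extraction without an Android device, the same idea can be run against an in-memory NCX string. This is only a sketch: the two-chapter sample is invented, NcxSketch and extractToc are illustrative names, and it reads the label from navLabel rather than the whole navPoint so that nested navPoint entries do not leak their text into the parent title. Like the code above, it uses jsoup's default HTML parser, which matches tag names case-insensitively.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
final class NcxSketch {

    /** Returns "label: src" for every navPoint that has both a label and a target. */
    static List<String> extractToc(String ncx) {
        List<String> entries = new ArrayList<>();
        Document doc = Jsoup.parse(ncx);
        for (Element navPoint : doc.select("navPoint")) {
            // The first matching descendant belongs to this navPoint itself,
            // because its own label and content precede any nested navPoint.
            Element label = navPoint.select("navLabel").first();
            Element content = navPoint.select("content").first();
            if (label == null || content == null) {
                continue;
            }
            entries.add(label.text() + ": " + content.attr("src"));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Invented sample; a real toc.ncx carries more metadata.
        String ncx = "<ncx><docTitle><text>Sample Book</text></docTitle><navMap>"
                + "<navPoint id='n1'><navLabel><text>Chapter 1</text></navLabel>"
                + "<content src='ch1.xhtml'/></navPoint>"
                + "<navPoint id='n2'><navLabel><text>Chapter 2</text></navLabel>"
                + "<content src='ch2.xhtml'/></navPoint>"
                + "</navMap></ncx>";
        for (String entry : extractToc(ncx)) {
            System.out.println(entry);
        }
    }
}
```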