phpQuery Study Notes

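phpQuery is an open-source, server-side project written in PHP. It lets PHP developers work with DOM document content easily, for example to grab the headline of a news site. Better still, it borrows jQuery's approach: you can process page content the way you would with jQuery and pull out exactly the information you want.

The power of jQuery's selectors is plain to see. phpQuery gives PHP the same capability; it is effectively a server-side jQuery.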

 

Why it exists


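Sometimes we need to scrape a web page but only want a specific part of it. The usual answer is a regular expression, and that certainly works. Regex is a general-purpose solution, but in specific cases there is often a simpler and quicker way.

1. Writing the patterns is tedious

Especially for beginners, a pile of seemingly meaningless characters strung together is enough to make your head explode. If the target you want to isolate has no obvious distinguishing feature, the regex is even harder to write.

2. It is not efficient

In PHP, regex should be the last resort: whatever string functions can solve should not be handed to regex. When a regex has to process a file of 30-odd KB, there is no guarantee of good performance.

3. There is phpQuery

If you have used jQuery, you know that picking out a particular element is effortless there. phpQuery makes the same thing possible in PHP.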

A brief look at phpQuery


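phpQuery is built on DOMDocument, which was added in PHP 5. DOMDocument is designed specifically for handling HTML/XML. It provides powerful XPath selection and many other HTML/XML functions, which makes processing HTML/XML very convenient. So why not use it directly? Look at the function list in the manual ( http://www.php.net/manual/en/class.domdocument.php ) and you will see. If you have great confidence in your memory, feel free to try.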
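To make the comparison concrete, here is a minimal sketch of fetching the first link inside a container using only DOMDocument and DOMXPath. The markup and the class name news are made up for illustration. With phpQuery, the same lookup would be a one-line CSS selector such as pq('.news a:eq(0)').

```php
<?php
// Plain DOMDocument/DOMXPath: get the first link inside an element with class "news".
// The markup and the class name are hypothetical, chosen only for this sketch.
$html = '<div class="news"><a href="/a">First</a><a href="/b">Second</a></div>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// XPath equivalent of the CSS selector ".news a":
// match "news" as a whole class token, then take descendant <a> elements.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' news ')]//a";
$links = $xpath->query($query);

$first     = $links->item(0);
$firstText = $first !== null ? $first->textContent : '';
$firstHref = $first !== null ? $first->getAttribute('href') : '';

echo $firstText, ' -> ', $firstHref, "\n";
```

The XPath expression alone shows why a CSS-style selector layer on top of DOMDocument is attractive.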

 

Project home page: http://code.google.com/p/phpquery/

 

 

http://demo.jingwentian.com/phpQuery/

Scraping the headline

Let's start with an example. I want to scrape the top headline from Sina's domestic news page. The code:

include 'phpQuery/phpQuery.php'; 
phpQuery::newDocumentFile('http://news.sina.com.cn/china'); 
echo pq(".blk121 a:eq(0)")->html();


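Three simple lines are enough to get the headline. First include the phpQuery.php core file, then call newDocumentFile() to read the target page, and finally print the content under the matching tag.
pq() is a powerful function, very much like jQuery's $(). Nearly all jQuery selectors work in phpQuery; when chaining methods, just replace "." with "->". In the example above, pq(".blk121 a:eq(0)") looks inside the elements whose class is blk121 and takes the first a tag found there. The html() method then returns the content of that tag (HTML tags included), which is the headline we are after. If you use text() instead, you get only the text of the headline. Of course, the key to using phpQuery well is locating the right node in the document.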
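The difference between html() and text() is easy to see on a small inline document. This is only a sketch: the markup and the class name top are invented, and it assumes phpQuery is installed at phpQuery/phpQuery.php, as in the example above.

```php
<?php
// Assumption: the phpQuery library is available at this path.
require 'phpQuery/phpQuery.php';

// Hypothetical markup standing in for a real news page.
$markup = '<div class="top"><h1><a href="/story">Big <em>news</em></a></h1></div>';
phpQuery::newDocumentHTML($markup);

$headline = pq('.top h1:eq(0)');

$asHtml = $headline->html(); // inner markup of the h1, tags included
$asText = $headline->text(); // text content only, tags stripped

echo $asHtml, "\n";
echo $asText, "\n";
```

Use html() when you want to keep links and inline formatting, and text() when you only need the wording.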

 

 

 

Scraping an article list

Here is another example: fetching the blog list of http://www.jingwentian.com. The code:

include 'phpQuery/phpQuery.php'; 
phpQuery::newDocumentFile('http://www.jingwentian.com'); 
$artlist = pq(".post-list > .item-content"); 
foreach($artlist as $li){ 
   echo pq($li)->find('h1')->html()."<br>"; 
}


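Loop over the DIVs in the list, find each article title and print it. It is that simple.

Parsing an XML document

Suppose we have a test.xml document like this: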

<?xml version="1.0" encoding="utf-8"?> 
<root> 
  <contact> 
     <name>张三</name> 
     <age>22</age> 
  </contact> 
  <contact> 
     <name>王五</name> 
     <age>18</age> 
  </contact> 
</root>


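Now I want the age of the contact named 张三 (Zhang San), who is the first contact in the file. The code:

include 'phpQuery/phpQuery.php'; 
phpQuery::newDocumentFile('test.xml'); 
// text() returns only the text of the node; echoing the object itself prints the whole <age> element
echo pq('contact > age:eq(0)')->text();


Output: 22

As with jQuery, you locate the node precisely and print its content; parsing an XML document is that simple. You no longer need headache-inducing regex and tedious string replacement code to scrape site content. With phpQuery, everything becomes much easier.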
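Note that the selector above picks the first age element rather than searching by name. A lookup by name could be sketched as a loop over the contacts that compares the name text. The helper ageOf() is hypothetical, and the XML is loaded from an inline string with newDocumentXML() rather than from test.xml.

```php
<?php
// Assumption: the phpQuery library is available at this path.
require 'phpQuery/phpQuery.php';

$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<root>
  <contact><name>张三</name><age>22</age></contact>
  <contact><name>王五</name><age>18</age></contact>
</root>
XML;

phpQuery::newDocumentXML($xml);

// Hypothetical helper: return the age of the contact with the given name, or null.
function ageOf($wanted)
{
    foreach (pq('contact') as $contact) {
        // Iteration yields plain DOM nodes, so wrap each one with pq().
        $c = pq($contact);
        if (trim($c->find('name')->text()) === $wanted) {
            return trim($c->find('age')->text());
        }
    }
    return null;
}

echo ageOf('王五'), "\n";
```

This keeps working if the order of the contacts in the file changes, which a positional filter like :eq(0) does not.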

 

 

1. phpQuery "hello world"

A simple example:

include 'phpQuery.php'; 
phpQuery::newDocumentFile('http://www.phper.org.cn'); 
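echo pq("title")->text();	// get the page title
echo pq("div#header")->html();	// get the HTML content of the div with id "header"

In the example above, the first line includes the phpQuery.php file,

the second line loads a document with newDocumentFile,

the third line uses pq() to get the text of the title tag,

and the fourth line gets the HTML contained in the div whose id is header.

So there are two main actions: loading a document and reading content from it.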

 

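2. Loading documents

Documents are loaded mainly through phpQuery::newDocument and its variants, which let phpQuery read the specified file or markup on the server before you query it.

The main methods are: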

phpQuery::newDocument($html, $contentType = null)

phpQuery::newDocumentFile($file, $contentType = null)

phpQuery::newDocumentHTML($html, $charset = 'utf-8')

phpQuery::newDocumentXHTML($html, $charset = 'utf-8')

phpQuery::newDocumentXML($html, $charset = 'utf-8')

phpQuery::newDocumentPHP($html, $contentType = null)

phpQuery::newDocumentFileHTML($file, $charset = 'utf-8')

phpQuery::newDocumentFileXHTML($file, $charset = 'utf-8')

phpQuery::newDocumentFileXML($file, $charset = 'utf-8')

phpQuery::newDocumentFilePHP($file, $contentType) 
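As a rough illustration of the string-based and file-based loaders listed above, the sketch below loads the same markup twice: once from a string with newDocumentHTML() and once from a temporary file with newDocumentFileHTML(). The markup is invented, and the library path is an assumption.

```php
<?php
// Assumption: the phpQuery library is available at this path.
require 'phpQuery/phpQuery.php';

// Hypothetical page used only for this sketch.
$html = '<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>';

// 1) Load from a string. The returned object can be queried directly.
$doc   = phpQuery::newDocumentHTML($html, 'utf-8');
$title = $doc->find('title')->text();
// pq() queries the most recently created (selected) document.
$para  = pq('p')->text();

// 2) Load the same markup from a file.
$path = tempnam(sys_get_temp_dir(), 'pq');
file_put_contents($path, $html);
$fromFile      = phpQuery::newDocumentFileHTML($path, 'utf-8');
$titleFromFile = $fromFile->find('title')->text();
unlink($path);

echo $title, "\n", $para, "\n", $titleFromFile, "\n";
```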

 

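3. Using the pq() function

pq() is the heart of phpQuery. Its usage falls into two parts: selectors and filters.

[Selectors]

To understand phpQuery's selectors, it helps to learn jQuery's syntax first.

The most commonly used forms are:

pq('#id'): the ID selector, starting with #; selects the element with the given ID

pq('.classname'): the class selector, starting with a dot; selects the elements whose class matches

pq('parent > child'): the child selector; selects elements by hierarchy. For example, pq('.main > p') selects every p tag that is a direct child of an element with class main

See the jQuery documentation for more syntax.

[Filters]

The main ones are :first, :last, :not, :even, :odd, :eq(index), :gt(index), :lt(index), :header and :animated.

For example:

pq('p:last'): selects the last p tag

pq('tr:even'): selects the even-indexed rows of a table (indexes start at 0, so this is the first row, the third row, and so on)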
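The filters can be tried out on a small table. In this sketch the markup and the id t are invented, and the library path is an assumption. It collects the rows matched by :even and reads the paragraphs matched by :first and :last.

```php
<?php
// Assumption: the phpQuery library is available at this path.
require 'phpQuery/phpQuery.php';

// Hypothetical markup: four table rows and two paragraphs.
$html = '<table id="t">'
      . '<tr><td>r0</td></tr><tr><td>r1</td></tr>'
      . '<tr><td>r2</td></tr><tr><td>r3</td></tr>'
      . '</table><p class="note">a</p><p>b</p>';
phpQuery::newDocumentHTML($html);

// :even is zero-indexed, so it keeps the rows at index 0 and 2.
$even = array();
foreach (pq('#t tr:even') as $tr) {
    $even[] = trim(pq($tr)->text());
}

$firstP = trim(pq('p:first')->text());
$lastP  = trim(pq('p:last')->text());

echo implode(',', $even), "\n", $firstP, "\n", $lastP, "\n";
```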

 

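4. Method chaining in phpQuery

pq() returns a phpQuery object, so you can keep calling further operations on the result, for example:

 pq('a')->attr('href', 'newVal')->removeClass('className')->html('newHtml')->...

See the jQuery documentation for details. Usage is essentially the same; just mind the difference between . and ->.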
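A runnable version of such a chain might look like the sketch below. The markup and the class names old and fresh are made up, and the library path is an assumption.

```php
<?php
// Assumption: the phpQuery library is available at this path.
require 'phpQuery/phpQuery.php';

// Hypothetical markup with a single link.
phpQuery::newDocumentHTML('<div><a class="old" href="/x">link</a></div>');

// Each call returns a phpQuery object, so the calls can be chained.
pq('a')
    ->attr('href', '/new')      // set an attribute
    ->removeClass('old')        // drop one class
    ->addClass('fresh')         // add another
    ->html('<b>updated</b>');   // replace the inner HTML

$href     = pq('a')->attr('href');
$hasOld   = pq('a')->hasClass('old');
$hasFresh = pq('a')->hasClass('fresh');

echo pq('div')->html(), "\n";
```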

 

 

------------------------------------------------------------------------------------------------------------

Self-study notes from the official documentation.

  1. Basics
  2. Ported jQuery sections
    1. Selectors
    2. Attributes
    3. Traversing
    4. Manipulation
    5. Ajax
    6. Events
    7. Utilities
    8. Plugin ports
  3. PHP Support
  4. Command Line Interface
  5. Multi document support
  6. Plugins
    1. WebBrowser
    2. Scripts
  7. jQueryServer
  8. Debugging
  9. Bootstrap file

 

001

Loading documents

phpQuery::newDocument($html, $contentType = null) Creates a new document from markup. If no $contentType is given, it is autodetected from the markup. If detection fails, text/html in utf-8 is used.

phpQuery::newDocumentFile($file, $contentType = null) Creates a new document from a file. Works like newDocument().

phpQuery::newDocumentHTML($html, $charset = 'utf-8') Creates a document from HTML markup.

phpQuery::newDocumentXHTML($html, $charset = 'utf-8') Creates a document from XHTML markup.
phpQuery::newDocumentXML($html, $charset = 'utf-8') Creates a document from XML markup.
phpQuery::newDocumentPHP($html, $contentType = null) Creates a document from PHP markup. Read more about it on the PHPSupport page (https://code.google.com/p/phpquery/wiki/PHPSupport).


phpQuery::newDocumentFileHTML($file, $charset = 'utf-8')
phpQuery::newDocumentFileXHTML($file, $charset = 'utf-8')
phpQuery::newDocumentFileXML($file, $charset = 'utf-8')
phpQuery::newDocumentFilePHP($file, $contentType) Read more about it on PHPSupport page


pq function

pq($param, $context = null); is the equivalent of jQuery's $(); and can be used in three ways.

01. Importing markup

// Import into the selected document.
// The input string must not begin with a text node.
pq('<div/>')
// Import into the document whose ID comes from $pq->getDocumentID():
pq('<div/>', $pq->getDocumentID())

// Import into the same document that a DOMNode belongs to:
pq('<div/>', DOMNode)
// Import into the document of a phpQuery object:
pq('<div/>', $pq)


02. Running queries
// Run query on the last selected document:
pq('div.myClass')


// Run query on the document whose ID comes from $pq->getDocumentID():
pq('div.myClass', $pq->getDocumentID())

// Run query on the same document that a DOMNode belongs to, and use the node(s) as root for the query:
pq('div.myClass', DOMNode)
// Run query on the document of a phpQuery object
// and use the object's stack as root node(s) for the query:
pq('div.myClass', $pq)

03. Wrapping DOMNodes with phpQuery objects
foreach(pq('li') as $li) {
    // $li is a pure DOMNode, change it to a phpQuery object
    pq($li);
}

 

Manual, part two

  1. Ported jQuery sections

phpQuery is an almost complete port of the jQuery JavaScript library.

Ported Sections 

  1. Selectors
  2. Attributes
  3. Traversing
  4. Manipulation
  5. Ajax
  6. Events
  7. Utilities
  8. Plugin ports

Additional methods

phpQuery features many additional methods compared to jQuery:

 

Other Differences

Selectors

Selectors are the heart of the jQuery-like interface. Most of the CSS Level 3 syntax is implemented (to the same extent as in jQuery).

Example

pq(".class ul > li[rel='foo']:first:has(a)")->appendTo('.append-target-wrapper div')->...

Basics

Hierarchy

Basic Filters

  • :first Matches the first selected element.
  • :last Matches the last selected element.
  • :not(selector) Filters out all elements matching the given selector.
  • :even Matches even elements, zero-indexed.
  • :odd Matches odd elements, zero-indexed.
  • :eq(index) Matches a single element by its index.
  • :gt(index) Matches all elements with an index above the given one.
  • :lt(index) Matches all elements with an index below the given one.
  • :header Matches all elements that are headers, like h1, h2, h3 and so on.
  • :animated Matches all elements that are currently being animated.

Content Filters

  • :contains(text) Matches elements that contain the given text.
  • :empty Matches all elements that have no children (text nodes count as children).
  • :has(selector) Matches elements that contain at least one element matching the specified selector.
  • :parent Matches all elements that are parents (elements that contain other elements).

Visibility Filters

Attribute Filters

Child Filters

  • :nth-child(index/even/odd/equation) Matches all elements that are the nth-child of their parent or that are the parent's even or odd children.
  • :first-child Matches all elements that are the first child of their parent.
  • :last-child Matches all elements that are the last child of their parent.
  • :only-child Matches all elements that are the only child of their parent.

Forms

  • :input Matches all input, textarea, select and button elements.
  • :text Matches all input elements of type text.
  • :password Matches all input elements of type password.
  • :radio Matches all input elements of type radio.
  • :checkbox Matches all input elements of type checkbox.
  • :submit Matches all input elements of type submit.
  • :image Matches all input elements of type image.
  • :reset Matches all input elements of type reset.
  • :button Matches all button elements and input elements of type button.
  • :file Matches all input elements of type file.
  • :hidden Matches all elements that are hidden, or input elements of type "hidden".

Form Filters

  • :enabled Matches all elements that are enabled.
  • :disabled Matches all elements that are disabled.
  • :checked Matches all elements that are checked.
  • :selected Matches all elements that are selected.

 

***************

Attributes

pq('a')->attr('href', 'newVal')->removeClass('className')->html('newHtml')->...

 

**********

Attr

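  • attr($name) Access a property on the first matched element. This method makes it easy to retrieve a property value from the first matched element. If the element has no attribute with that name, undefined is returned.
  • attr($properties) Set a key/value object as properties on all matched elements.
  • attr($key, $value) Set a single property to a value on all matched elements.
  • attr($key, $fn) Set a single property to a computed value on all matched elements.
  • removeAttr($name) Remove an attribute from each of the matched elements.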

Class

  • addClass($class) Adds the specified class(es) to each of the set of matched elements.
  • hasClass($class) Returns true if the specified class is present on at least one of the set of matched elements.
  • removeClass($class) Removes all or the specified class(es) from the set of matched elements.
  • toggleClass($class) Adds the specified class if it is not present, removes the specified class if it is present.

HTML

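  • html() Get the html contents (innerHTML) of the first matched element. This property is not available on XML documents (although it will work for XHTML documents).
  • html($val) Set the html contents of every matched element. This property is not available on XML documents (although it will work for XHTML documents).

Text

  • text() Get the combined text contents of all matched elements.
  • text($val) Set the text contents of all matched elements.

Value

  • val() Get the content of the value attribute of the first matched element.
  • val($val) Set the value attribute of every matched element.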
  • val($val) Checks, or selects, all the radio buttons, checkboxes, and select options that match the set of values.
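The attribute and value methods above can be exercised on a tiny form. This is a sketch with invented markup, and the library path is an assumption.

```php
<?php
// Assumption: the phpQuery library is available at this path.
require 'phpQuery/phpQuery.php';

// Hypothetical form with one text input and one link.
phpQuery::newDocumentHTML(
    '<form><input type="text" name="q" value="php"><a href="/a" title="t">A</a></form>'
);

// attr($name): read an attribute from the first matched element.
$name = pq('input')->attr('name');

// attr($key, $value): set an attribute on every matched element.
pq('a')->attr('href', '/b');
$href = pq('a')->attr('href');

// removeAttr($name): drop an attribute from every matched element.
pq('a')->removeAttr('title');

// val() reads the value attribute, val($val) sets it.
$before = pq('input')->val();
pq('input')->val('dom');
$after = pq('input')->val();

echo $name, ' ', $href, ' ', $before, ' ', $after, "\n";
```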

 

**************

Traversing  

pq('div > p')->add('div > ul')->filter(':has(a)')->find('p:first')->nextAll()->andSelf()->...

**********

Filtering

  • eq($index) Reduce the set of matched elements to a single element.
  • hasClass($class) Checks the current selection against a class and returns true, if at least one element of the selection has the given class.
  • filter($expr) Removes all elements from the set of matched elements that do not match the specified expression(s).
  • filter($fn) Removes all elements from the set of matched elements that do not match the specified function.
  • is($expr) Checks the current selection against an expression and returns true, if at least one element of the selection fits the given expression.
  • map($callback) Translate a set of elements in the jQuery object into another set of values in an array (which may, or may not, be elements).
  • not($expr) Removes elements matching the specified expression from the set of matched elements.
  • slice($start, $end) Selects a subset of the matched elements.

Finding

  • add($expr) Adds more elements, matched by the given expression, to the set of matched elements.
  • children($expr) Get a set of elements containing all of the unique immediate children of each of the matched set of elements.
  • contents() Find all the child nodes inside the matched elements (including text nodes), or the content document, if the element is an iframe.
  • find($expr) Searches for all elements that match the specified expression. This method is a good way to find additional descendant elements with which to process.
  • next($expr) Get a set of elements containing the unique next siblings of each of the given set of elements.
  • nextAll($expr) Find all sibling elements after the current element.
  • parent($expr) Get a set of elements containing the unique parents of the matched set of elements.
  • parents($expr) Get a set of elements containing the unique ancestors of the matched set of elements (except for the root element). The matched elements can be filtered with an optional expression.
  • prev($expr) Get a set of elements containing the unique previous siblings of each of the matched set of elements.
  • prevAll($expr) Find all sibling elements before the current element.
  • siblings($expr) Get a set of elements containing all of the unique siblings of each of the matched set of elements. Can be filtered with an optional expressions.

Chaining

  • andSelf() Add the previous selection to the current selection.
  • end() Revert the most recent 'destructive' operation, changing the set of matched elements to its previous state (right before the destructive operation).

Read more at Traversing section on jQuery Documentation Site.

 

 

*****************

Manipulation  
Updated Feb 4, 2010 by tobiasz....@gmail.com

Example

pq('div.old')->replaceWith( pq('div.new')->clone() )->appendTo('.trash')->prepend('Deleted')->...

Changing Contents

  • html() Get the html contents (innerHTML) of the first matched element. This property is not available on XML documents (although it will work for XHTML documents).
  • html($val) Set the html contents of every matched element. This property is not available on XML documents (although it will work for XHTML documents).
  • text() Get the combined text contents of all matched elements.
  • text($val) Set the text contents of all matched elements.

Inserting Inside

Inserting Outside

Inserting Around

  • wrap($html) Wrap each matched element with the specified HTML content.
  • wrap($elem) Wrap each matched element with the specified element.
  • wrapAll($html) Wrap all the elements in the matched set into a single wrapper element.
  • wrapAll($elem) Wrap all the elements in the matched set into a single wrapper element.
  • wrapInner($html) Wrap the inner child contents of each matched element (including text nodes) with an HTML structure.
  • wrapInner($elem) Wrap the inner child contents of each matched element (including text nodes) with a DOM element.

Replacing

Removing

  • empty() Remove all child nodes from the set of matched elements.
  • remove($expr) Removes all matched elements from the DOM.

Copying

  • clone() Clone matched DOM Elements and select the clones.
  • clone($true) Clone matched DOM Elements, and all their event handlers, and select the clones.

Read more at Manipulation section on jQuery Documentation Site.

 

************************

Ajax  

Example

pq('#element')->load('http://somesite.com/page .inline-selector')->...

Server Side Ajax

Ajax, which stands for Asynchronous JavaScript and XML, is a combination of an HTTP client and an XML parser that doesn't lock the program's thread (it makes requests asynchronously).

phpQuery also offers such functionality, making use of the solid Zend_Http_Client. Unfortunately requests aren't asynchronous, but nothing is impossible. For now, instead of XMLHttpRequest you always get a Zend_Http_Client instance. API unification is planned.

Cross Domain Ajax

For security reasons, by default phpQuery doesn't allow connections to hosts other than the actual $_SERVER['HTTP_HOST']. The developer needs to grant rights to other hosts before making an Ajax request.

There are 2 methods for allowing other hosts:

  • phpQuery::ajaxAllowURL($url)
  • phpQuery::ajaxAllowHost($host)

 

// connect to google.com
phpQuery::ajaxAllowHost('google.com');
phpQuery::get('http://google.com/ig');
// or using same string
$url = 'http://google.com/ig';
phpQuery::ajaxAllowURL($url);
phpQuery::get($url);

Ajax Requests

Ajax Events

  • ajaxComplete($callback) Attach a function to be executed whenever an AJAX request completes. This is an Ajax Event.
  • ajaxError($callback) Attach a function to be executed whenever an AJAX request fails. This is an Ajax Event.
  • ajaxSend($callback) Attach a function to be executed before an AJAX request is sent. This is an Ajax Event.
  • ajaxStart($callback) Attach a function to be executed whenever an AJAX request begins and there is none already active. This is an Ajax Event.
  • ajaxStop($callback) Attach a function to be executed whenever all AJAX requests have ended. This is an Ajax Event.
  • ajaxSuccess($callback) Attach a function to be executed whenever an AJAX request completes successfully. This is an Ajax Event.

Misc

  • phpQuery::ajaxSetup($options) Setup global settings for AJAX requests.
  • serialize() Serializes a set of input elements into a string of data. This will serialize all given elements.
  • serializeArray() Serializes all forms and form elements (like the .serialize() method) but returns a JSON data structure for you to work with.

Options

A detailed description of the options is available at the jQuery Documentation Site.

  • async Boolean
  • beforeSend Function
  • cache Boolean
  • complete Function
  • contentType String
  • data Object, String
  • dataType String
  • error Function
  • global Boolean
  • ifModified Boolean
  • jsonp String
  • password String
  • processData Boolean
  • success Function
  • timeout Number
  • type String
  • url String
  • username String

********************

Events  

Example

pq('form')->bind('submit', 'submitHandler')->trigger('submit')->...
function submitHandler($e) {
  print 'Target: '.$e->target->tagName;
  print 'Bubbling ? '.$e->currentTarget->tagName;
}

Server Side Events

phpQuery supports server-side events, in the same way jQuery handles client-side ones. On the server there are, of course, no events such as mouseover (but they can be triggered).

By default, phpQuery automatically fires only the change event for form elements. If you load the WebBrowser plugin, submit and click will be handled properly, e.g. submitting a form with its inputs' data to the action URL via a new Ajax request.

The $this (this in JS) context for handler scope isn't available. You have to use one of the following manually:

  • $event->target
  • $event->currentTarget
  • $event->relatedTarget

 

Page Load

none

Event Handling

  • bind($type, $data, $fn) Binds a handler to one or more events (like click) for each matched element. Can also bind custom events.
  • one($type, $data, $fn) Binds a handler to one or more events to be executed once for each matched element.
  • trigger($type , $data ) Trigger a type of event on every matched element.
  • triggerHandler($type , $data ) This particular method triggers all bound event handlers on an element (for a specific event type) WITHOUT executing the browsers default actions.
  • unbind($type , $data ) This does the opposite of bind, it removes bound events from each of the matched elements.

Interaction Helpers

none

Event Helpers

  • change() Triggers the change event of each matched element.
  • change($fn) Binds a function to the change event of each matched element.
  • submit() Trigger the submit event of each matched element.
  • submit($fn) Bind a function to the submit event of each matched element.

Read more at Events section on jQuery Documentation Site.

 

**********

Utilities  

User Agent

none

Array and Object operations

Test operations

String operations

Read more at Utilities section on jQuery Documentation Site.

 

 

Examples

CLI

Fetch number of downloads of all release packages

phpquery 'http://code.google.com/p/phpquery/downloads/list?can=1' \
  --find '.vt.col_4 a' --contents \
  --getString null array_sum

PHP

Examples from demo.php

require('phpQuery/phpQuery.php');
// for PEAR installation use this
// require('phpQuery.php');

INITIALIZE IT

// $doc = phpQuery::newDocumentHTML($markup);
// $doc = phpQuery::newDocumentXML();
// $doc = phpQuery::newDocumentFileXHTML('test.html');
// $doc = phpQuery::newDocumentFilePHP('test.php');
// $doc = phpQuery::newDocument('test.xml', 'application/rss+xml');
// this one defaults to text/html in utf8
$doc = phpQuery::newDocument('<div/>');

FILL IT

// array syntax works like ->find() here
$doc['div']->append('<ul></ul>');
// array set changes inner html
$doc['div ul'] = '<li>1</li><li>2</li><li>3</li>';

MANIPULATE IT

// almost everything can be a chain
$li = null;
$doc['ul > li']
        ->addClass('my-new-class')
        ->filter(':last')
                ->addClass('last-li')
// save it anywhere in the chain
                ->toReference($li);

SELECT DOCUMENT

// pq(); is using selected document as default
phpQuery::selectDocument($doc);
// documents are selected when created or by above method
// query all unordered lists in last selected document
pq('ul')->insertAfter('div');

ITERATE IT

// all LIs from last selected DOM
foreach(pq('li') as $li) {
        // iteration returns PLAIN dom nodes, NOT phpQuery objects
        $tagName = $li->tagName;
        $childNodes = $li->childNodes;
        // so you NEED to wrap it within phpQuery, using pq();
        pq($li)->addClass('my-second-new-class');
}

PRINT OUTPUT

// 1st way
print phpQuery::getDocument($doc->getDocumentID());
// 2nd way
print phpQuery::getDocument(pq('div')->getDocumentID());
// 3rd way
print pq('div')->getDocument();
// 4th way
print $doc->htmlOuter();
// 5th way
print $doc;
// another...
print $doc['ul'];
posted @ 2015-11-12 15:11 的士特啰嗦司机