XPath tips

In the context of web scraping, [XPath](http://en.wikipedia.org/wiki/XPath) is a nice tool to have in your belt, as it lets you specify locations in a document more flexibly than CSS selectors do. In case you're looking for a tutorial, [here is an XPath tutorial with nice examples](http://www.zvon.org/comp/r/tut-XPath_1.html).



In this post, we'll show you some tips we found valuable when using XPath in the trenches, using the [Scrapy Selector API](http://doc.scrapy.org/en/latest/topics/selectors.html) for our examples.

## Avoid using `contains(.//text(), 'search text')` in your XPath conditions

Use `contains(., 'search text')` instead.

Here is why: the expression `.//text()` yields a collection of text nodes (a *node-set*). When a node-set is converted to a string, which happens when it is passed as an argument to a string function like `contains()` or `starts-with()`, the result is the text of the **first** node only.

    >>> from scrapy import Selector
    >>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
    >>> xp = lambda x: sel.xpath(x).extract()  # let's type this only once
    >>> xp('//a//text()')  # take a peek at the node-set
    [u'Click here to go to the ', u'Next Page']
    >>> xp('string(//a//text())')  # convert it to a string
    [u'Click here to go to the ']

A *node* converted to a string, however, yields its own text plus the text of all its descendants:

    >>> xp('//a[1]')  # select the first a node
    [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
    >>> xp('string(//a[1])')  # convert it to a string
    [u'Click here to go to the Next Page']

So, in general:

**GOOD:**

    >>> xp("//a[contains(., 'Next Page')]")
    [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

**BAD:**

    >>> xp("//a[contains(.//text(), 'Next Page')]")
    []

**GOOD:**

    >>> xp("substring-after(//a, 'Next ')")
    [u'Page']

**BAD:**

    >>> xp("substring-after(//a//text(), 'Next ')")
    [u'']

You can read [more detailed explanations about string values of nodes and node-sets in the XPath spec](http://www.w3.org/TR/xpath/#dt-string-value).
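
The conversion rule can also be mimicked without an XPath engine. The following is a minimal sketch that uses only Python's standard-library `xml.etree.ElementTree` (an assumption made for illustration; it is not the Scrapy Selector API). Here `a.text` plays the role of the first text node in the node-set, and joining `a.itertext()` plays the role of the string value of the element itself.

```python
import xml.etree.ElementTree as ET

# Same markup as in the examples above, parsed as XML for illustration.
a = ET.fromstring('<a href="#">Click here to go to the <strong>Next Page</strong></a>')

# What string(.//text()) sees: only the first text node of the node-set.
first_text_node = a.text

# What string(.) sees: the element's own text plus that of all its descendants.
string_value = "".join(a.itertext())

print(repr(first_text_node))           # 'Click here to go to the '
print(repr(string_value))              # 'Click here to go to the Next Page'
print("Next Page" in first_text_node)  # False, like contains(.//text(), 'Next Page')
print("Next Page" in string_value)     # True, like contains(., 'Next Page')
```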

## Beware of the difference between `//node[1]` and `(//node)[1]`

`//node[1]` selects all the nodes occurring first under their respective parents.

`(//node)[1]` selects all the nodes in the document, and then gets only the first of them.

    >>> from scrapy import Selector
    >>> sel = Selector(text="""
    ...     <ul class="list">
    ...         <li>1</li>
    ...         <li>2</li>
    ...         <li>3</li>
    ...     </ul>
    ...     <ul class="list">
    ...         <li>4</li>
    ...         <li>5</li>
    ...         <li>6</li>
    ...     </ul>""")
    >>> xp = lambda x: sel.xpath(x).extract()
    >>> xp("//li[1]")  # get every LI element that comes first under its parent, whatever the parent is
    [u'<li>1</li>', u'<li>4</li>']
    >>> xp("(//li)[1]")  # get the first LI element in the whole document
    [u'<li>1</li>']
    >>> xp("//ul/li[1]")  # get every LI element that comes first under a UL parent
    [u'<li>1</li>', u'<li>4</li>']
    >>> xp("(//ul/li)[1]")  # get the first LI element under a UL parent in the whole document
    [u'<li>1</li>']

Also,

`//a[starts-with(@href, '#')][1]` gets a collection of the local anchors that occur first under their respective parents.

`(//a[starts-with(@href, '#')])[1]` gets the first local anchor in the document.
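
The same distinction can be reproduced with Python's standard library. This is a minimal sketch, assuming `xml.etree.ElementTree`, which supports only a subset of XPath: it accepts `.//li[1]` but cannot parse the parenthesised form `(//li)[1]`, so the second case is emulated by slicing the full result list in Python.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root>"
    "<ul><li>1</li><li>2</li><li>3</li></ul>"
    "<ul><li>4</li><li>5</li><li>6</li></ul>"
    "</root>"
)

# .//li[1]: the position is evaluated per parent, so each <ul> contributes one <li>.
first_under_each_parent = [li.text for li in root.findall(".//li[1]")]

# Emulation of (//li)[1]: collect every <li> in the document first,
# then keep only the first element of that whole collection.
first_in_document = [li.text for li in root.findall(".//li")[:1]]

print(first_under_each_parent)  # ['1', '4']
print(first_in_document)        # ['1']
```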

## When selecting by class, be as specific as necessary

If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:

    *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]


Let's cook up some examples:

    >>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
    >>> xp = lambda x: sel.xpath(x).extract()

**BAD:** doesn't work because there are multiple classes in the attribute

    >>> xp("//*[@class='content']")
    []

**BAD:** gets more than we want

    >>> xp("//*[contains(@class,'content')]")
    [u'<p class="content-author">Someone</p>', u'<p class="content text-wrap">Some content</p>']

**GOOD:**

    >>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
    [u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

**ALSO GOOD:**

    >>> sel.css(".content").extract()
    [u'<p class="content text-wrap">Some content</p>']
    >>> sel.css('.content').xpath('@class').extract()
    [u'content text-wrap']

Read [more about what you can do with Scrapy's Selectors here](http://scrapy.readthedocs.org/en/latest/topics/selectors.html#nesting-selectors).
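
To see why the verbose class idiom works, it helps to spell out its string logic in plain Python. The sketch below is only an illustration: `has_class` is a hypothetical helper, not part of Scrapy, and `str.split()` is a close but not exact stand-in for `normalize-space()`, since Python treats a few more characters as whitespace than XPath does.

```python
def has_class(class_attr, name):
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    normalized = " ".join(class_attr.split())  # roughly normalize-space(@class)
    # Padding both sides with spaces turns a substring test into a whole-token test.
    return f" {name} " in f" {normalized} "

# The class attributes of the two <p> elements used above.
classes = ["content-author", "content text-wrap"]

exact = [c for c in classes if c == "content"]           # @class='content'
substring = [c for c in classes if "content" in c]       # contains(@class, 'content')
token = [c for c in classes if has_class(c, "content")]  # the verbose idiom

print(exact)      # []
print(substring)  # ['content-author', 'content text-wrap']
print(token)      # ['content text-wrap']
```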

## Learn to use all the different axes

It is handy to know how to use the axes. You can [follow through the examples given in the tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths) to quickly review them.

In particular, note that [following](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Following_axis) and [following-sibling](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Following-sibling_axis) are not the same thing; this is a common source of confusion. The same goes for [preceding](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Preceding_axis) and [preceding-sibling](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Preceding-sibling_axis), and for [ancestor](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Ancestor_axis) and [parent](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Parent_axis).
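
The difference between these axes can be made concrete with a small emulation. The following is a sketch under stated assumptions: it uses the standard-library `xml.etree.ElementTree`, which does not implement XPath axes, so `following_sibling`, `following` and `ancestors` are hypothetical helpers written for illustration, and they consider element nodes only (like `following-sibling::*`, `following::*` and `ancestor::*`).

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root>"
    "<section><h2>A</h2><p>a1</p><p>a2</p></section>"
    "<section><h2>B</h2><p>b1</p></section>"
    "</root>"
)
# ElementTree elements do not know their parent, so build a lookup table.
parent_of = {child: parent for parent in doc.iter() for child in parent}

def following_sibling(node):
    """Later children of the same parent."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(node) + 1:]

def following(node):
    """Every element after the node in document order, minus its descendants."""
    in_order = list(doc.iter())
    own_subtree = set(node.iter())  # the node itself plus its descendants
    position = in_order.index(node)
    return [n for n in in_order[position + 1:] if n not in own_subtree]

def ancestors(node):
    """The parent, the parent's parent, and so on up to the root."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain

h2 = doc.find("section/h2")  # the first <h2>, with text "A"
print([n.tag for n in following_sibling(h2)])  # ['p', 'p']
print([n.tag for n in following(h2)])          # ['p', 'p', 'section', 'h2', 'p']
print(parent_of[h2].tag)                       # section
print([n.tag for n in ancestors(h2)])          # ['section', 'root']
```

The sibling axis stays inside the first `section`, while `following` keeps going into the rest of the document; likewise `parent` is a single node, while `ancestor` is the whole chain up to the root.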

## Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:

    //*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content of `script` and `style` tags and also skips whitespace-only text nodes.

Source: http://stackoverflow.com/a/19350897/2572383
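
For readers who want to see what that expression selects, here is a rough pure-Python emulation. It is a sketch only, assuming the standard-library `xml.etree.ElementTree` and well-formed XML input; `interesting_text` is a hypothetical helper, and `str.strip()` only approximates the `normalize-space(.)` test because Python counts a few more characters as whitespace.

```python
import xml.etree.ElementTree as ET

def interesting_text(elem, skip=("script", "style")):
    """Yield non-blank text nodes in document order, ignoring script/style content."""
    keep = elem.tag not in skip
    if keep and elem.text and elem.text.strip():
        yield elem.text
    for child in elem:
        yield from interesting_text(child, skip)
        # A child's tail is a text node that belongs to `elem`, not to the child.
        if keep and child.tail and child.tail.strip():
            yield child.tail

page = ET.fromstring(
    "<body>"
    "<style>p { color: red }</style>"
    "<p>Hello <b>world</b>!</p>"
    "   "
    "<script>var x = 1;</script>"
    "<div>  </div>"
    "</body>"
)
print(list(interesting_text(page)))  # ['Hello ', 'world', '!']
```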

From: https://www.zyte.com/blog/xpath-tips-from-the-web-scraping-trenches/

posted @ 2021-02-08 11:58  公众号python学习开发