# XPath tips
In the context of web scraping, [XPath](http://en.wikipedia.org/wiki/XPath) is a nice tool to have in your belt, as it lets you specify document locations more flexibly than CSS selectors. In case you're looking for a tutorial, [here is an XPath tutorial with nice examples](http://www.zvon.org/comp/r/tut-XPath_1.html).
In this post, we'll show you some tips we found valuable when using XPath in the trenches, using the [Scrapy Selector API](http://doc.scrapy.org/en/latest/topics/selectors.html) for our examples.
## Avoid using `contains(.//text(), 'search text')` in your XPath conditions
Use `contains(., 'search text')` instead.
Here is why: the expression `.//text()` yields a collection of text nodes, a *node-set*. When a node-set is converted to a string, which happens when it is passed as an argument to a string function like `contains()` or `starts-with()`, the result is the text of the **first** node only.

    >>> from scrapy import Selector
    >>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
    >>> xp = lambda x: sel.xpath(x).extract()  # let's type this only once
    >>> xp('//a//text()')  # take a peek at the node-set
    [u'Click here to go to the ', u'Next Page']
    >>> xp('string(//a//text())')  # convert it to a string
    [u'Click here to go to the ']

A *node* converted to a string, however, puts together its own text plus the text of all its descendants:

    >>> xp('//a[1]')  # selects the first a node
    [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
    >>> xp('string(//a[1])')  # converts it to a string
    [u'Click here to go to the Next Page']
So, in general:

**GOOD:**

    >>> xp("//a[contains(., 'Next Page')]")
    [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

**BAD:**

    >>> xp("//a[contains(.//text(), 'Next Page')]")
    []

**GOOD:**

    >>> xp("substring-after(//a, 'Next ')")
    [u'Page']

**BAD:**

    >>> xp("substring-after(//a//text(), 'Next ')")
    [u'']

You can read [more detailed explanations about string values of nodes and node-sets in the XPath spec](http://www.w3.org/TR/xpath/#dt-string-value).
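
The node versus node-set distinction can also be reproduced without Scrapy. The sketch below is only an emulation built on the standard library's `xml.etree.ElementTree`, which does not implement XPath string functions; the variable names are illustrative. It shows that the first text node alone does not contain "Next Page", while the string value of the whole `a` element does.

```python
import xml.etree.ElementTree as ET

# Same markup as in the Scrapy examples in this section.
a = ET.fromstring('<a href="#">Click here to go to the <strong>Next Page</strong></a>')

# All descendant text nodes in document order, like //a//text().
text_nodes = list(a.itertext())

# What a string function sees when handed the node-set: only the first node.
first_text_node = text_nodes[0]

# What a string function sees when handed the element itself: all text joined.
string_value = "".join(text_nodes)

print("Next Page" in first_text_node)  # False, like contains(.//text(), 'Next Page')
print("Next Page" in string_value)     # True, like contains(., 'Next Page')
```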
## Beware of the difference between `//node[1]` and `(//node)[1]`
`//node[1]` selects all the nodes occurring first under their respective parents.

`(//node)[1]` selects all the nodes in the document, and then gets only the first of them.

    >>> from scrapy import Selector
    >>> sel = Selector(text="""
    ... <ul class="list">
    ... <li>1</li>
    ... <li>2</li>
    ... <li>3</li>
    ... </ul>
    ... <ul class="list">
    ... <li>4</li>
    ... <li>5</li>
    ... <li>6</li>
    ... </ul>""")
    >>> xp = lambda x: sel.xpath(x).extract()
    >>> xp("//li[1]")  # get every LI element that is first under its parent, whatever the parent is
    [u'<li>1</li>', u'<li>4</li>']
    >>> xp("(//li)[1]")  # get the first LI element in the whole document
    [u'<li>1</li>']
    >>> xp("//ul/li[1]")  # get every LI element that is first under a UL parent
    [u'<li>1</li>', u'<li>4</li>']
    >>> xp("(//ul/li)[1]")  # get the first LI element under a UL parent in the document
    [u'<li>1</li>']
Also,
`//a[starts-with(@href, '#')][1]` gets a collection of the local anchors that occur first under their respective parents.
`(//a[starts-with(@href, '#')])[1]` gets the first local anchor in the document.
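
The per-parent behaviour of a positional predicate can be checked with nothing but the standard library. This is a rough sketch using `xml.etree.ElementTree`, which supports only a subset of XPath: it understands `.//li[1]` but not the parenthesised form, so the "first in the whole document" case is emulated by indexing the result list in Python. The wrapper `<div>` is an assumption needed to give the document a single root.

```python
import xml.etree.ElementTree as ET

# A wrapper <div> is assumed here because ElementTree needs a single root element.
root = ET.fromstring(
    "<div>"
    "<ul><li>1</li><li>2</li><li>3</li></ul>"
    "<ul><li>4</li><li>5</li><li>6</li></ul>"
    "</div>"
)

# Like //li[1]: the predicate is applied per parent, so each <ul> contributes one <li>.
first_per_parent = [li.text for li in root.findall(".//li[1]")]

# Like (//li)[1]: collect every <li> in document order first, then take the first one.
# ElementTree cannot parse the parenthesised form, so the indexing is done in Python.
first_overall = root.findall(".//li")[0].text

print(first_per_parent)  # ['1', '4']
print(first_overall)     # 1
```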
## When selecting by class, be as specific as necessary
If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:

    *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let's cook up some examples:

    >>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
    >>> xp = lambda x: sel.xpath(x).extract()

**BAD:** doesn't work because there are multiple classes in the attribute

    >>> xp("//*[@class='content']")
    []

**BAD:** gets more than we want

    >>> xp("//*[contains(@class,'content')]")
    [u'<p class="content-author">Someone</p>', u'<p class="content text-wrap">Some content</p>']

**GOOD:**

    >>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
    [u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

**ALSO GOOD:**

    >>> sel.css(".content").extract()
    [u'<p class="content text-wrap">Some content</p>']
    >>> sel.css('.content').xpath('@class').extract()
    [u'content text-wrap']

Read [more about what you can do with Scrapy's Selectors here](http://scrapy.readthedocs.org/en/latest/topics/selectors.html#nesting-selectors).
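
To see why the verbose class test works, here is a small sketch in plain Python. Both `class_predicate` and `has_class` are hypothetical helpers, not part of Scrapy: the first only builds the XPath string (it assumes the class name contains no quotes), and the second emulates what `normalize-space()` and `concat()` do to the attribute value.

```python
def class_predicate(name):
    """Build the whitespace-safe XPath class test for `name` (hypothetical helper)."""
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name


def has_class(class_attr, name):
    """Pure-Python emulation of the predicate above, for illustration only."""
    # normalize-space(): trim and collapse runs of whitespace to single spaces.
    normalized = " ".join(class_attr.split())
    # concat(' ', ..., ' ') pads both ends so every class name is space-delimited.
    return (" %s " % name) in (" %s " % normalized)


print(class_predicate("content"))
print(has_class("content-author", "content"))      # False
print(has_class("content  text-wrap", "content"))  # True
print(has_class("text-wrap content", "content"))   # True
```

With a Scrapy selector, the generated predicate could then be used as `sel.xpath("//*[%s]" % class_predicate("content"))`.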
## Learn to use all the different axes
It is handy to know how to use the axes. You can [follow through the examples given in the tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths) to quickly review them.
In particular, note that [following](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Following_axis) and [following-sibling](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Following-sibling_axis) are not the same thing, which is a common source of confusion. The same goes for [preceding](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Preceding_axis) and [preceding-sibling](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Preceding-sibling_axis), and also for [ancestor](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Ancestor_axis) and [parent](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Parent_axis).
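
The difference between `following` and `following-sibling` can be sketched in plain Python. The two functions below are hypothetical emulations written for this illustration on top of `xml.etree.ElementTree` (which does not support these axes itself), and they consider element nodes only. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root>"
    "<section><h2>A</h2><p>a1</p><p>a2</p></section>"
    "<section><h2>B</h2><p>b1</p></section>"
    "</root>"
)


def following_sibling(root, node):
    """Emulate following-sibling::* : later children of the same parent."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child is node:
                return children[i + 1:]
    return []  # the root element has no siblings


def following(root, node):
    """Emulate following::* : every element after `node` in document order, minus its descendants."""
    ordered = list(root.iter())
    own_subtree = {id(e) for e in node.iter()}  # node and its descendants
    start = next(i for i, e in enumerate(ordered) if e is node)
    return [e for e in ordered[start + 1:] if id(e) not in own_subtree]


first_h2 = root.find("section/h2")
print([e.tag for e in following_sibling(root, first_h2)])  # ['p', 'p']
print([e.tag for e in following(root, first_h2)])          # ['p', 'p', 'section', 'h2', 'p']
```

Starting from the first `h2`, the sibling axis stays inside the first `section`, while the `following` axis also reaches the second `section` and everything in it.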
## Useful trick to get text content
Here is another XPath trick that you may use to get the interesting text contents:

    //*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content of `script` and `style` tags and also skips whitespace-only text nodes.
Source: http://stackoverflow.com/a/19350897/2572383
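
To make the logic of that expression concrete, here is a rough pure-Python emulation. The function name `interesting_text` and the sample markup are made up for this sketch, and the input is assumed to be well-formed XML because it is parsed with the standard library's `xml.etree.ElementTree`; for real HTML you would run the XPath through a Scrapy selector instead.

```python
import xml.etree.ElementTree as ET

XPATH_WHITESPACE = " \t\r\n"  # the characters XPath's normalize-space() treats as whitespace


def interesting_text(elem, skip=("script", "style")):
    """Yield non-blank text nodes in document order, skipping those owned by <script>/<style>."""
    owns_text = elem.tag not in skip
    if owns_text and elem.text and elem.text.strip(XPATH_WHITESPACE):
        yield elem.text
    for child in elem:
        yield from interesting_text(child, skip)
        # In ElementTree, text after a child's end tag is stored on the child as .tail,
        # but in XPath terms it is a text node of the parent.
        if owns_text and child.tail and child.tail.strip(XPATH_WHITESPACE):
            yield child.tail


body = ET.fromstring(
    "<body><style>p {color: red}</style>"
    "<p>Hello <b>world</b>!</p>\n  "
    "<script>var x = 1;</script><p>Bye</p></body>"
)
print(list(interesting_text(body)))  # ['Hello ', 'world', '!', 'Bye']
```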
from:https://www.zyte.com/blog/xpath-tips-from-the-web-scraping-trenches/