Using YQL as a crawler for JavaScript

Yahoo! Query Language (YQL) is good fun to play with. YQL is a service that enables applications to query, filter, and combine data from different sources across the Internet. Much of the data in the Yahoo! network can be retrieved through YQL with a SQL-like syntax.

SELECT * FROM flickr.photos.search WHERE text="cat"

This runs a Flickr search for photos whose text matches "cat". But what caught my attention is the ability to convert the content of an external site (an HTML page) into well-formed XML or JSON.

select * from html where url="http://news.yahoo.com/"
and xpath="/html/body/div[@id='doc4']/div[@id='bd']/div[@id='yui-main']/div/div[@id='top-story']/div/div[1]/div[2]/h2/a"

The YQL above returns the headline from Yahoo! News. The XPath part looks pretty scary, but with the XPather Firefox add-on you can get the XPath of any DOM element with right click -> Show in XPather. (P.S. One thing to watch for with XPather is the tbody tag: Firefox adds it to its DOM tree for tables even when it does not exist in the source HTML. An XPath containing this extra tbody makes YQL return nothing, because the element never exists in the HTML code.)
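To make the query above concrete, here is a minimal sketch of how the statement could be turned into a request URL. It assumes the public YQL REST endpoint at query.yahooapis.com, which took the statement in a `q` parameter and the output format in a `format` parameter. The helper names `buildYqlUrl` and `htmlQuery` are hypothetical and only serve the illustration.

```javascript
// Assumption: the public YQL REST endpoint, with the statement passed in `q`
// and the desired output format ("json" or "xml") passed in `format`.
const YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: build the request URL for a YQL statement.
// URLSearchParams takes care of percent-encoding the statement.
function buildYqlUrl(statement, format = "json") {
  const params = new URLSearchParams({ q: statement, format });
  return `${YQL_ENDPOINT}?${params.toString()}`;
}

// Hypothetical helper: build a statement against the `html` table
// from a page URL and an XPath expression.
function htmlQuery(pageUrl, xpath) {
  return `select * from html where url="${pageUrl}" and xpath="${xpath}"`;
}

// The headline query from the text, with the XPath copied verbatim.
const headlineXpath =
  "/html/body/div[@id='doc4']/div[@id='bd']/div[@id='yui-main']" +
  "/div/div[@id='top-story']/div/div[1]/div[2]/h2/a";
const statement = htmlQuery("http://news.yahoo.com/", headlineXpath);
const requestUrl = buildYqlUrl(statement);

console.log(requestUrl);
```

Keeping the statement and the URL building separate means the same `buildYqlUrl` can be reused for any other table, such as the Flickr search shown earlier.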

This is an excellent tool for JavaScript. Imagine you are going to implement an RSS reader. Without YQL, the application must prepare all the data on the server side and send it back to the client (as in Fig. 1). This is bad for performance because curl calls are blocking, whereas consuming YQL in the client browser can be asynchronous and parallel. It makes sense to offload the data crawling to the client (as in Fig. 2).
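The client-side approach can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the function name `runQueriesInParallel` is hypothetical, the endpoint is the assumed public YQL REST URL, and the transport is injected as `fetchJson` so that `fetch`, XMLHttpRequest, or JSONP could sit behind it.

```javascript
// Hypothetical helper: issue several YQL statements concurrently and
// collect the outcome of each one, whether it succeeded or failed.
async function runQueriesInParallel(statements, fetchJson) {
  // Assumption: the public YQL REST endpoint with `q` and `format` parameters.
  const endpoint = "https://query.yahooapis.com/v1/public/yql";

  // Start every request right away; none of them waits for another.
  const requests = statements.map((statement) => {
    const params = new URLSearchParams({ q: statement, format: "json" });
    return fetchJson(`${endpoint}?${params}`);
  });

  // Wait until every request has either fulfilled or rejected, so one
  // failing feed does not discard the results of the others.
  const settled = await Promise.allSettled(requests);

  return settled.map((result, i) => ({
    statement: statements[i],
    ok: result.status === "fulfilled",
    data: result.status === "fulfilled" ? result.value : null,
  }));
}
```

In a browser, `fetchJson` could be as small as `(url) => fetch(url).then((response) => response.json())`. For an RSS reader, each statement would select from one feed, and the page can render each feed from the returned array without the server having fetched anything itself.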

Fig. 1 The web application prepares all the data on the server side
