Crowbar - SIMILE
Crowbar is a web scraping environment based on a server-side, headless, Mozilla-based browser.
Its purpose is to let you run JavaScript scrapers against a DOM, automating the scraping of web sites while avoiding syntax normalization issues.
Requirements

Crowbar aims to run the exact same scrapers that are created with Solvent for use in Piggy Bank, so that scrapers do not have to be tweaked and tuned for implementation differences between parsers and DOM/XPath implementations.

For this reason, it needs to work inside an environment that is as close as possible to a browser. Luckily, Mozilla provides XULRunner: a Firefox-like execution environment.

You need to have XULRunner (version 1.8.1 or higher) in order for Crowbar to work on your system.
Where do I get it?

Crowbar is currently a work in progress and has not been released yet. You can try it out by checking it out with a Subversion client from its repository at

http://simile.mit.edu/repository/crowbar/trunk/

Design

Crowbar is implemented as a (rather simple, in fact) XULRunner application that provides a RESTful HTTP web service implemented in JavaScript (basically turning a web browser into a web server!) that you can use to 'remotely control' the browser.
How to Run Crowbar
Windows

After you have installed XULRunner, open up the command prompt and type:

c:\> %XULRUNNER_HOME%\xulrunner.exe --install-app %CROWBAR%\xulapp
c:\> cd %CROWBAR%\xulapp
c:\> %XULRUNNER_HOME%\xulrunner.exe application.ini

where %XULRUNNER_HOME% is the path where you installed XULRunner and %CROWBAR% is the folder where Crowbar resides. c:\> is the prompt: you don't type it, it just indicates that these are separate lines.

If all is successful, a small window named "Crowbar" will pop up.
Linux

After you have installed XULRunner, open your favorite shell and type:

$ $XULRUNNER_HOME/xulrunner --install-app $CROWBAR/xulapp
$ cd $CROWBAR/xulapp
$ $XULRUNNER_HOME/xulrunner application.ini

where $XULRUNNER_HOME is the path where you installed XULRunner and $CROWBAR is the folder where Crowbar resides. $ is the command prompt: you don't type it, it just indicates that these are separate lines.

If all is successful, a small window named "Crowbar" will pop up.
MacOSX

After you have installed XULRunner, open the Terminal and type:

$ /Library/Frameworks/XUL.framework/xulrunner-bin --install-app $CROWBAR/xulapp
$ /Applications/Crowbar.app/Contents/MacOS/xulrunner

where $CROWBAR is the folder where Crowbar resides. $ is the shell prompt: you don't type it, it just indicates that these are separate lines.

If all is successful, a small window named "Crowbar" will pop up.
Now what?

When Crowbar is running, it shows a small window that contains the address that is currently loading or has been loaded last. The real value of Crowbar is offered as a RESTful web service listening by default on port 10000.

You can use crowbar by simply pointing any other web browser to

http://127.0.0.1:10000/

Crowbar will reply with a web page where you can enter the URL you want Crowbar to fetch, execute and serialize back to you.

You can also use command line HTTP clients such as curl or wget to interact with the web service. For example,

curl -s --data "url=http://simile.mit.edu/&delay=1000" http://127.0.0.1:10000/

will write the serialized DOM of the http://simile.mit.edu/ page to STDOUT after waiting 1 second for the page to be fetched, loaded, parsed and made available to Crowbar by the underlying XULRunner.
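
The same request can be made from a script. The following is a minimal sketch in Python, assuming a Crowbar instance is already listening on the default address given above. The function names are illustrative and not part of Crowbar. The form body deliberately leaves ':' and '/' unescaped so that it matches what the curl command sends; whether Crowbar also accepts percent-escaped values is not stated here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Default Crowbar address, as given in the text above.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_form_body(page_url, delay_ms):
    """Encode the same two form fields that the curl example sends."""
    # ':' and '/' stay unescaped so the body matches curl's, byte for byte.
    fields = {"url": page_url, "delay": delay_ms}
    return urlencode(fields, safe=":/").encode("ascii")


def fetch_serialized_dom(page_url, delay_ms=1000, endpoint=CROWBAR_ENDPOINT):
    """POST to a running Crowbar instance and return the serialized DOM as text."""
    body = build_form_body(page_url, delay_ms)
    with urlopen(endpoint, data=body, timeout=60) as response:
        # Assumption: the response is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


# Usage (requires a running Crowbar instance):
#   print(fetch_serialized_dom("http://simile.mit.edu/"))
```

Passing data to urlopen makes it a POST with a form-encoded body, which is also what curl does with --data.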

To scrape data,

curl -s --data "url=http://simile.mit.edu/&delay=1000&mode=scrape&scraper=http://simile.mit.edu/repository/piggy-bank/trunk/src/extension/chrome/content/scrapers/generic-page-scraper.js" http://127.0.0.1:10000/

which will load the SIMILE page, then load and run the scraper as specified, returning RDF/XML of the scraped data to STDOUT.
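
A scrape request can be scripted in the same way. The sketch below, again in Python with illustrative function names, builds a POST with the four form fields used in the scrape command (url, delay, mode, scraper) and saves whatever Crowbar returns to a file. The file name output.rdf is only an assumption chosen for convenience; the endpoint is the default address mentioned earlier.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_scrape_request(page_url, scraper_url, delay_ms=1000,
                         endpoint="http://127.0.0.1:10000/"):
    """Build a POST request carrying the fields used by the scrape command."""
    fields = {
        "url": page_url,         # page to load in Crowbar
        "delay": delay_ms,       # milliseconds to wait before scraping
        "mode": "scrape",        # run a scraper instead of serializing the DOM
        "scraper": scraper_url,  # location of the JavaScript scraper
    }
    # ':' and '/' stay unescaped to mirror the body that curl sends.
    body = urlencode(fields, safe=":/").encode("ascii")
    return Request(endpoint, data=body, method="POST")


def scrape_to_file(request, output_path="output.rdf"):
    """Send the request to a running Crowbar and save the response bytes."""
    with urlopen(request, timeout=120) as response, open(output_path, "wb") as out:
        out.write(response.read())
    return output_path


# Usage (requires a running Crowbar instance):
#   req = build_scrape_request("http://simile.mit.edu/", "<URL of a scraper .js file>")
#   scrape_to_file(req)
```

Separating request construction from sending makes it possible to inspect the form body without contacting the server.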

For more information on what you can do with Crowbar, read the Crowbar web service description.
Results

You can upload your results to a Semantic Bank with ATM, a command-line RDF/XML uploading script in Python:

% ./atm.py -b http://simile.mit.edu/bank/ -u yourusername output.rdf

What functionality is planned?

See the Crowbar Todos list.
There are tons of scraping solutions, why another one?

Most scraping solutions work in the so-called 'syntax space' (that is, they extract information from the raw stream of data served as an HTTP response), so they have to cope with all the ways the same data can be serialized while still producing the same visual web page in a browser. While these solutions are easy to write, they tend to require a lot of maintenance over time just to cope with syntax-space changes that have no influence on how the page renders for users (and that page creators are therefore prone to make without worry). It is common, in fact, for these solutions to require constant tweaking.

Scraping in 'model space' is a better solution because it allows the scraper to work directly on the 'infoset' that contains the data and that the browser uses to render the page to the user. This means, at the very least, working on the data model that comes out of an HTML parser and then operating on the resulting DOM with DOM-aware query mechanisms (such as XPath).
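
The difference between the two spaces can be illustrated with a small sketch. It uses Python's standard html.parser as a stand-in for a real browser DOM (it is neither Crowbar's parser nor an XPath engine), and the two page snippets and the 'price' class are made up for the example. Both snippets would render identically, but only the parser-based extraction survives the change in serialization.

```python
import re
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collects the text inside every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # greater than 0 while inside a matching element
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1              # nested element inside a match
        elif dict(attrs).get("class") == "price":
            self._depth = 1
            self.prices.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.prices[-1] += data


def extract_model_space(html):
    """Query the parsed structure, independent of how the markup was written."""
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.prices]


def extract_syntax_space(html):
    """Match the raw characters; breaks as soon as the serialization changes."""
    return re.findall(r'<span class="price">(.*?)</span>', html)


# Two hypothetical serializations of the same visible content.
page_a = '<p>Total: <span class="price">42 EUR</span></p>'
page_b = "<P>Total: <SPAN class='price' >42 EUR</SPAN></P>"

print(extract_syntax_space(page_a), extract_syntax_space(page_b))
print(extract_model_space(page_a), extract_model_space(page_b))
```

The regular expression finds the value only in the first snippet, while the parser-based version returns it for both, because tag case, quoting style and whitespace are normalized away before the query runs.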

Solvent and Piggy Bank show how much easier, more natural and more solid it is to write scrapers against a DOM rather than against an array of characters (that is, the DOM serialized as HTML). Crowbar wants to enable the automation of those scraping processes, reusing the exact same scrapers that are created in Solvent and work in Piggy Bank.

Unfortunately, because real-world HTML varies so widely, writing an HTML parser and DOM producer that can cope with that variety is no easy task. For this reason, Crowbar is based on the very same browsing code that Firefox uses to parse HTML and create the resulting DOM.

An added benefit of using a full browsing environment as your crawler's agent is that the DOM can be accessed after the "onload" JavaScript hooks have executed, which means that we can scrape content that was not in the HTML page originally served by the web server and that was instead included client-side via AJAX or computed programmatically after the page loaded.
Licensing and legal issues

Crowbar is open source software and is licensed under the BSD license.

Portions of Crowbar are re-used from Tabulator, available under the W3C Software License.
Contributing

Crowbar is open source software built around the spirit of open participation and collaboration. There are several ways you can help:

Blog about Crowbar
Edit, fix or otherwise contribute content on the wiki
Subscribe to our mailing lists to show your interest and give us feedback
Report problems and ask for new features through our issue tracking system (but take a look at our todo list first)
Send us patches or fixes to the code

Credits

This software was created by the SIMILE project and in particular:

Stefano Mazzocchi (original author)
Ryan Lee
Juanan Pereira



Posted on 2011-10-26 12:07 by lexus