After I stopped jumping up and down with excitement, I played with Crowbar a
little bit. The idea is great, but the tool itself has some disappointing
flaws. First, it hangs if you point it to a URL that is not an HTML page (an
image, for example). Why would you want to fetch images with Crowbar? Because
some evil websites won't deliver them unless you have the appropriate "session
id" cookie. Which brings me to the second flaw: Even though Crowbar handles
cookies, there is no way to get at them from the outside. I've found
~/.crowbar/profile-name/, but the database is locked and
inaccessible while Crowbar is running. (And even if I could open it, it
probably wouldn't store session cookies.) The third flaw is
more subtle: When faced with pages that have non-ASCII characters, represented
with two bytes in UTF-8, Crowbar seems to silently drop the first byte. This
gives you corrupted data, leading to hours of keyboard-smashing frustration.
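To see why dropping the first byte is so destructive, here is a quick illustration (plain Python, nothing Crowbar-specific) of what happens to a two-byte UTF-8 character when its lead byte goes missing:

```python
# "é" (U+00E9) is one of those characters that takes two bytes in UTF-8
b = "é".encode("utf-8")
print(b)  # b'\xc3\xa9'

# Drop the first byte, the way Crowbar seems to, and you are left with
# a lone continuation byte -- not valid UTF-8 on its own
try:
    b[1:].decode("utf-8")
except UnicodeDecodeError as e:
    print("corrupted:", e)
```

The surviving byte can no longer be decoded, so every multi-byte character in the page comes out mangled.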
In the meantime, here are some alternatives I've come across:

- Using QtWebKit directly from Python to scrape pages.
- Selenium, which automates browsers for testing. I bet it could be used for scraping, too.
I haven't tried any of these, but I hold high hopes for the one that uses QtWebKit from within Python. If I understand it correctly, you should be able to get full control of the browser and peek at cookies, HTTP headers, and anything else you might want. Finally, here are some other random resources that might be useful:
Apparently there is a fair bit of money and controversy around web scraping.
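Coming back to the session-cookie flaw above: if a tool ever did let you read a session cookie out of the browser, replaying it to fetch a protected image from your own script would be trivial. A minimal sketch with Python's standard urllib (the URL and cookie value here are hypothetical placeholders, not anything Crowbar produces):

```python
import urllib.request

# Hypothetical values -- substitute whatever you pulled
# out of the browser's cookie store
url = "http://example.com/protected/image.png"
req = urllib.request.Request(url)
req.add_header("Cookie", "sessionid=abc123")

# The request now carries the session cookie; passing it to
# urllib.request.urlopen(req) should get the image bytes back
print(req.get_header("Cookie"))  # sessionid=abc123
```

That one header is all those "evil websites" are checking for, which is exactly why getting at the cookies matters so much.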