After I stopped jumping up and down with excitement, I played with Crowbar a
little bit. The idea is great, but the tool itself has some disappointing
flaws. First, it hangs if you point it to a URL that is not an HTML page (an
image, for example). Why would you want to fetch images with Crowbar? Because
some evil websites won't deliver them unless you have the appropriate "session
id" cookie. Which brings me to the second flaw: Even though Crowbar handles
cookies, there is no way to get at them from the outside. I've found
~/.crowbar/profile-name/, but the database is locked and
inaccessible while Crowbar is running. (And even if I could open it, it
probably wouldn't store session cookies.) The third flaw is
more subtle: When faced with pages that have non-ASCII characters, represented
with two bytes in UTF-8, Crowbar seems to silently drop the first byte. This
gives you corrupted data, leading to hours of keyboard-smashing frustration.
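To see why dropping the first byte is so destructive, here is a quick illustration (plain Python, nothing Crowbar-specific) of what happens to a two-byte UTF-8 character when its lead byte goes missing:

```python
# "é" (U+00E9) is one of those characters that takes two bytes in UTF-8
b = "é".encode("utf-8")
print(b)  # b'\xc3\xa9'

# Drop the first byte, the way Crowbar seems to, and you are left with
# a lone continuation byte -- not valid UTF-8 on its own
try:
    b[1:].decode("utf-8")
except UnicodeDecodeError as e:
    print("corrupted:", e)
```

The surviving byte can no longer be decoded, so every multi-byte character in the page comes out mangled.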
In the meantime, here are some alternatives I've come across:

- Using QtWebKit directly from Python to scrape pages.
- Selenium, which automates browsers for testing. I bet it could be used for scraping, too.
I haven't tried any of these, but I hold high hopes for the one that uses QtWebKit from within Python. If I understand it correctly, you should be able to get full control of the browser and peek at cookies, HTTP headers, and anything else you might want. Finally, here are some other random resources that might be useful:
Apparently there is a fair bit of money and controversy around web scraping.
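Coming back to the session-cookie flaw above: if a tool ever did let you read a session cookie out of the browser, replaying it to fetch a protected image from your own script would be trivial. A minimal sketch with Python's standard urllib (the URL and cookie value here are hypothetical placeholders, not anything Crowbar produces):

```python
import urllib.request

# Hypothetical values -- substitute whatever you pulled
# out of the browser's cookie store
url = "http://example.com/protected/image.png"
req = urllib.request.Request(url)
req.add_header("Cookie", "sessionid=abc123")

# The request now carries the session cookie; passing it to
# urllib.request.urlopen(req) should get the image bytes back
print(req.get_header("Cookie"))  # sessionid=abc123
```

That one header is all those "evil websites" are checking for, which is exactly why getting at the cookies matters so much.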