Monday, May 7, 2007

Mechanized Scraping

Ever needed to interface with a web application without any real APIs? Take one step back from looking for a traditional API and use WWW::Mechanize to bend the application to your will.

WWW::Mechanize (inspired by Andy Lester's Perl Mechanize module and written by Aaron Patterson) allows you to moonlight as a web user agent (browser) from the comfort of your Ruby scripting environment. It is great for building automated tests of your web applications, creating your favourite mashups and treating another web application's UI as its API.

I've been working on some code that needs to gather reporting information from our billing system, but I have no real access to the Oracle db at the back to get to the required stored procedures. So I decided to simply use the UI as my API to the data and dusted off my trusty old WWW::Mechanize (which uses Hpricot internally to parse and tokenise pages) for the challenge.

It provides you with all the tools required to log in to a site (with automatic cookie handling), click on links, submit forms and oh so much more. The only real feature currently lacking is support for JavaScript (they do, however, provide you with ideas on how to manoeuvre around some of the more mundane corners), which is becoming more and more painful in this Web 2.0 world of ours.

WWW::Mechanize is quite easy to use, so I am not going to write an exposé on the ins and outs of the lib or share with you the secrets that helped me to sate world hunger and bring peace to all. Instead, I will mention some of the bits that tripped me up while trying to make the web application dance to my flute.

Button Value Attributes
I was getting nowhere while trying to submit a form in the web application with some crafted values. Tinker here, tinker there and still no go. Try the same thing in a browser against the application itself and things work like Swiss cheesewatches.

Right, you mangy ASP application, it's time for the big guns! Out comes Wireshark and the debugging starts in earnest. First I dump a session from my script and then one from a browser.

From the diff of the POST requests I notice that the browser has the value attribute for the 'Save' button in the form set, whereas I didn't. Because the form was posting back to itself, I assume they had some code like (pseudocode):

if $submit == 'Submit'
  do your stuff when the form has been submitted
end
display the normal form

Adding something that resembles the following did the trick:

'some_convoluted_button_name').value = 'Submit'
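For what it's worth, the comparison I did by eye on the Wireshark dumps can be sketched in a few lines of plain Ruby. This is only an illustration: the `missing_fields` helper and the two POST bodies are made up for the example (only the button name is borrowed from above), and it assumes the form is posted as ordinary URL-encoded data.

```ruby
require 'uri'

# Returns the form fields that the browser's POST body carries but that are
# absent from, or have a different value in, the script's POST body.
# (Hypothetical helper, not part of WWW::Mechanize.)
def missing_fields(browser_body, script_body)
  browser = URI.decode_www_form(browser_body).to_h
  script  = URI.decode_www_form(script_body).to_h
  browser.reject { |name, value| script[name] == value }
end

# Made-up bodies standing in for the two captured sessions.
browser_body = 'customer=42&some_convoluted_button_name=Submit'
script_body  = 'customer=42'

# Prints a hash holding only the button field the script forgot to send.
p missing_fields(browser_body, script_body)
```

Anything this reports is a field the server may be testing for, which is exactly what the button value turned out to be.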

Out of Buffer Error
A few more form hoops later and I started getting an error like:

hpricot/parse.rb:44:in `scan': ran out of buffer space on element <group>, starting on line 361. (Hpricot::ParseError)


A quick look at the bug db for WWW::Mechanize on RubyForge turned up this closed bug, which applies to our situation. The error messages are not the same (I assume this is because an earlier version of Hpricot was in use when it was reported).

According to this TT it is an Hpricot issue, and it refers to this TT.

According to the problem description:

An 'OUT OF BUFFER SPACE' error shuts down my whole app when I try to parse through an aspx page with an abnormally (or normally?) large viewstate stuffed into an input. Here's what it looks like:

<input type="hidden" name="__VIEWSTATE"
value="dDw3NzQ0ODQ2ODQ ... 11954 characters in total ... DsXdJfP+k" />

If I remove the large value it works fine. Is there a way hpricot could not exit when trying to parse a page like this?


I am also scraping an ASP application and lo and behold I too have a ginormous __VIEWSTATE input tag in the page in question. I knew ASP was evil, but this?!

The limit on the buffer was of course a protection mechanism to ensure that a parsed page does not cause your computer to become the black hole of memory. The workaround for this is quite simple though: just increase the buffer size.

Okay, kids. [98] now has a buffer_size method.
Hpricot.buffer_size = 262144
doc = Hpricot(open(""))

Perhaps I will find the wherewithal to fix the parser to read these massive attributes, but on-the-other-hand I don't want to encourage this disastrous behavior by ASP.NET!! You know?

"That's all good and well but we're not really using Hpricot directly, we're using WWW::Mechanize!", you all shout in unison.

True, true. All you do is add the buffer_size declaration after instantiating your shiny new WWW::Mechanize object, like so:

agent = WWW::Mechanize.new
Hpricot.buffer_size = 204800

The default buffer size is defined in hpricot_scan.rl as:


#define BUFSIZE 16384


buffer_size = BUFSIZE;
if (rb_ivar_defined(self, rb_intern("@buffer_size")) == Qtrue) {
  bufsize = rb_ivar_get(self, rb_intern("@buffer_size"));
  if (!NIL_P(bufsize)) {
    buffer_size = NUM2INT(bufsize);
  }
}
buf = ALLOC_N(char, buffer_size);


That's a 16 KB buffer, which under normal circumstances would be more than ample space for an attribute, but working with ASP seems to be anything but normal.
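If you would rather not guess at a number, a rough way to pick one is to measure the page before handing it to the parser. The sketch below is plain Ruby and rests on an assumption of mine rather than anything in the Hpricot docs: that the buffer has to hold at least the largest single tag, which is what the error message naming one element suggests. Both helper names are made up for this post, and the regex is a simplification that would cut a tag short if an attribute value contained a '>'.

```ruby
# Size in bytes of the longest tag (from '<' to the next '>') in the HTML.
# Returns 0 when the string contains no tags at all.
def longest_tag_bytes(html)
  html.scan(/<[^>]*>/).map(&:bytesize).max || 0
end

# Doubles the default buffer size until it is at least `bytes`.
# The default mirrors BUFSIZE from hpricot_scan.rl.
def suggested_buffer_size(bytes, default = 16_384)
  size = default
  size *= 2 while size < bytes
  size
end

# A made-up page with an oversized hidden input, standing in for __VIEWSTATE.
html = '<form><input type="hidden" name="__VIEWSTATE" value="' +
       ('x' * 20_000) + '" /></form>'

longest = longest_tag_bytes(html)
puts "longest tag: #{longest} bytes, suggested buffer: #{suggested_buffer_size(longest)}"
```

The suggested value is what you would assign to Hpricot.buffer_size. It is probably wise to leave some headroom above the measured size, since the next page may carry an even fatter viewstate.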

In Closing
I have not had as much fun in quite some time. WWW::Mechanize had me clapping my little hands in glee while shouting "Wheeeeeeeeeee!" like a little kid that was given his first bunny rabbit just after having his second double espresso for the hour.
