Wednesday, April 18, 2007

Ruby (Hpricot) Program Guide - I

Do you live outside of South Africa and subscribe to the M-Net Africa service? Ever wanted to avoid the M-Net Africa site and just get the program guide for your region?

Well, look no further. Ruby and Hpricot to the rescue!

The M-Net Schedule site has changed quite often over the last few months, so chances are good that by the time you read this article the site will have changed again. Screen scraping web sites is generally fraught with pain, suffering and disappointment.

This is largely because you're providing a static way to read content that changes over time. Don't get discouraged though: build change detection into your screen scraper so that it can notify you when the site has changed and you can update it.
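A minimal sketch of such a check, assuming the scraper depends on a known set of class names appearing in the page. The names EXPECTED_CLASSES, missing_markers and layout_intact? are made up for illustration, and how you get notified (email, log entry, exception) is up to you; here it is just a warning on STDERR.

```ruby
# Class names the scraper relies on (taken from the markup shown later in
# this article). If any of them disappears, the layout has probably changed.
EXPECTED_CLASSES = %w[ScheduleChannel ScheduleDate ScheduleTime ScheduleTitle].freeze

# Returns the expected class names that do NOT occur in the given HTML.
def missing_markers(html, expected = EXPECTED_CLASSES)
  expected.reject { |name| html.include?(%(class="#{name}")) }
end

# Warns on STDERR when markers are missing; returns true if the page
# still contains everything the scraper looks for.
def layout_intact?(html)
  missing = missing_markers(html)
  warn "Schedule page changed, missing: #{missing.join(', ')}" unless missing.empty?
  missing.empty?
end
```

Running layout_intact? on each freshly downloaded page before parsing it is one cheap way to find out about a redesign before the scraper starts returning rubbish.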

To compound the problem, many sites (including the M-Net Schedule site) do not conform to the XHTML standard. This simply means that they have not used semantic markup to lay out their site and to separate its structure, content and behaviour. A quick check with the W3C Markup Validation Service confirms that the validator can't even determine the content encoding.

Embrace change - it is a lot less painful (not to mention more productive ;).

Analysis of Structure
The first step of parsing content from an outside source is to analyse the structure of the content to determine what strategies you are going to employ to read and parse the content. Below is an extract of the type of content we're interested in:
<tr>
<td colspan="5">
<font class="ScheduleSchedule">Today's Schedule for :</font>
<font class="ScheduleChannel">Cartoon Network (Africa)</font>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>
<tr>
<td colspan="5">
<font class="ScheduleDate">17&nbsp;April&nbsp;2007</font>
</td>
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">&nbsp;</td>
</tr>
<tr bgcolor="F5F5F5">
<td width="40">
<b><font class="ScheduleTime"> 06:25</font></b>
</td>
<!--Time-->
<td width="420">
<font class="ScheduleTitle">Codename: Kids Next Door
<!--Title-->
</font>
</td>
<td width="17"></td>
<td width="188"></td>
<!--SMS Reminder-->
<td width="50" align="right">
<a href="#" onclick="OpenAgeRestriction(1);return false;">Family</a>
</td>
<!--Age Restriction-->
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">
<p>A gang of 10 year olds takes on top secret missions, using fantastic home-made technology to safeguard their treehouse against attack and grown-ups.</p>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>
<tr bgcolor="F5F5F5">
<td width="40">
<b><font class="ScheduleTime"> 06:50</font></b>
</td>
<!--Time-->
<td width="420">
<font class="ScheduleTitle">The Powerpuff Girls
<!--Title-->
</font>
</td>
<td width="17"></td>
<td width="188"></td>
<!--SMS Reminder-->
<td width="50" align="right">
<a href="#" onclick="OpenAgeRestriction(1);return false;">Family</a>
</td>
<!--Age Restriction-->
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">
<p>The wild and wacky escapades of three girls with extraordinary powers. Blossom, Buttercup and Bubbles use their superpowers to fight crime and villainy in Townsville.</p>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>

The first table rows we're interested in are the ones that tell us which channel we are looking at and what the date is (lightly formatted for readability):
<tr>
<td colspan="5">
<font class="ScheduleSchedule">Today's Schedule for :</font>
<font class="ScheduleChannel">Cartoon Network (Africa)</font>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>
<tr>
<td colspan="5">
<font class="ScheduleDate">17&nbsp;April&nbsp;2007</font>
</td>
</tr>
The name of the channel resides in a font tag with a class attribute of "ScheduleChannel", and the date we're working with resides in a font tag with a class attribute of "ScheduleDate". How does the search for this information translate into code? I will be using an XPath query (Hpricot supports both XPath and CSS selector based queries) to find the first font tag that has the class attribute I am searching for:
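def channel
  # First font tag whose class attribute is "ScheduleChannel"
  @channel = @hp.at("font[@class='ScheduleChannel']").inner_html
end

def date
  # First font tag whose class attribute is "ScheduleDate"
  @date = @hp.at("font[@class='ScheduleDate']").inner_html
end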

That's all pretty plain Jane so far. Here is what a typical pair of table rows describing a single program looks like (reformatted for readability):
<tr bgcolor="F5F5F5">
<td width="40">
<b><font class="ScheduleTime"> 06:25</font></b>
</td>
<!--Time-->
<td width="420">
<font class="ScheduleTitle">Codename: Kids Next Door
<!--Title-->
</font>
</td>
<td width="17"></td>
<td width="188"></td>
<!--SMS Reminder-->
<td width="50" align="right">
<a href="#" onclick="OpenAgeRestriction(1);return false;">Family</a>
</td>
<!--Age Restriction-->
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">
<p>A gang of 10 year olds takes on top secret missions, using fantastic home-made technology to safeguard their treehouse against attack and grown-ups.</p>
</td>
</tr>
The time is similarly found in a font tag with a class attribute of "ScheduleTime" and the program title is found in a font tag with a class attribute of "ScheduleTitle". The program synopsis, however, is wrapped in a paragraph tag inside a table cell with a colspan of 5.
def time
  @time = @hp.at("font[@class='ScheduleTime']").inner_html
end

def title
  @title = @hp.at("font[@class='ScheduleTitle']").inner_html
end

def synopsis
  @synopsis = (@hp/"td[@colspan=5]/p").first.inner_html
end
You will notice that the extraction of the time and title holds no surprises. The synopsis extraction, however, is something new: I use Hpricot's search operator (/) to find the p tags inside td tags that have a colspan=5 attribute, and then take the contents (inner_html) of the first match.

If you were to print the values of the relevant variables you would see that there is still some cleaning up that needs to be done on them before they can be considered for programmatic consumption:
Channel: Cartoon Network (Africa)
Date: 17&nbsp;April&nbsp;2007
Time: 06:25
Title: Codename: Kids Next Door <!--Title-->


Synopsis: A gang of 10 year olds takes on top secret missions, using fantastic home-made technology to safeguard their treehouse against attack and grown-ups.
The channel and time look fine so we'll just skip them for now. The date has some HTML entities in it, so let's remove them using the handy HTMLEntities lib (I recommend installing it from the gem). The problem is that if they sneaked HTML entities into the date they may well do so elsewhere too, so let's not trust the input and instead sanitise all of it in a generic way:
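require 'htmlentities'  # from the htmlentities gem

def initialize(url)
  @coder = HTMLEntities.new
  p sanitize(url)  # print the decoded value for inspection
end

# Decode HTML entities (such as &nbsp;) in any string we read
def sanitize(string)
  @coder.decode(string)
end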
The only problem with this is that HTMLEntities decodes to UTF-8, so printing the date value gives (on my system) something like this:
"17\302\240April\302\2402007"
Not really ideal ... let's use the iconv lib to get the UTF-8 string forced into a US-ASCII encoding:
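require 'iconv'

def initialize(url)
  @coder = HTMLEntities.new
  # Convert UTF-8 to US-ASCII, transliterating characters that have no ASCII equivalent
  @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
  p sanitize(url)
end

# Decode HTML entities, then force the result into US-ASCII
def sanitize(string)
  @ic.iconv(@coder.decode(string))
end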
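A note for readers on newer Rubies: Iconv was removed from the standard library in Ruby 2.0, so the code above needs adapting there. Below is a minimal sketch of a similar clean-up using nothing but core String methods. It rests on a few assumptions: clean_text and ENTITIES are names made up for this example, the entity table is deliberately tiny (HTMLEntities knows the full set), and it also strips HTML comments such as the <!--Title--> that showed up in the printed title.

```ruby
# Small, deliberately incomplete entity table (an assumption for this sketch).
ENTITIES = { 'nbsp' => ' ', 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"' }.freeze

# Hypothetical helper: turns a scraped HTML fragment into plain text.
def clean_text(fragment)
  text = fragment.gsub(/<!--.*?-->/m, '')                 # drop HTML comments
  text = text.gsub(/&(\w+);/) { ENTITIES.fetch($1, $&) }  # decode known entities, keep unknown ones
  text.tr("\u00A0", ' ').strip                            # literal non-breaking spaces, then trim
end
```

With this, clean_text("17&nbsp;April&nbsp;2007") gives "17 April 2007", and the title comes back without the trailing comment and whitespace.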
Right, now to get back on track after that slight detour. To recap, we now have a strategy to get all the items we're interested in, but the solution above is a little naive because it assumes we only have one day with one program. The complete schedule for a channel could span several days, each with a different number of programs.
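One possible shape for handling many programs, sketched under assumptions: suppose the times, titles and synopses have each been collected into an array in page order, one entry per program (for instance by searching for all matches instead of only the first). Program and build_programs are hypothetical names, and grouping the programs by day is left out.

```ruby
# Hypothetical record for one schedule entry.
Program = Struct.new(:time, :title, :synopsis)

# Pairs up parallel arrays of already-extracted strings, one entry per program.
# Raises if the arrays differ in length, which usually signals a layout change.
def build_programs(times, titles, synopses)
  unless times.size == titles.size && titles.size == synopses.size
    raise ArgumentError, "mismatched counts: #{times.size}/#{titles.size}/#{synopses.size}"
  end

  times.zip(titles, synopses).map do |time, title, synopsis|
    Program.new(time.strip, title.strip, synopsis.strip)
  end
end
```

Refusing to build records when the counts differ is a deliberate choice: silently pairing the wrong synopsis with a title is worse than failing loudly.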

One can also extend the ideas above to make the scraper a lot more usable: have it download multiple channels, then pretty print the result, email it to yourself or drop it in a db for later processing or display.

I'll cover these in followup articles to come ...
