Monday, April 30, 2007

Ruby (Hpricot) Program Guide - II

For this installment we'll see if we can build on what we learnt last time to provide a less naive solution: getting a complete schedule for a channel that spans several days, each with a variable number of programs.

First things first though. Let's add the code that will retrieve the page for the channel we choose. Let's assume we want the schedule for Cartoon Network (Africa). The channel id for this channel happens to be 219 (as per the select list on the search page).

class Hash
require 'uri'

def urlencode
map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
end
end

class DSTVSchedule
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'htmlentities'
require 'iconv'

def initialize()
query_params = {
'startDate' => '30 Apr 2007',
'EndDate' => '01 May 2007',
'channelid' => 219
}
query_string = query_params.urlencode + '&sType=5&searchstring=&submit=Submit'
host = 'www.mnet.co.za'
cgi = '/schedules/default.asp?'
url = "http://#{host}#{cgi}#{query_string}"
@hp = Hpricot(open(url))
@ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
@coder = HTMLEntities.new
@channel = channel
@date = date
@time = time
@title = title
@synopsis = synopsis

printf "Channel: %s\nDate: %s\nTime: %s\nTitle: %s\nSynopsis: %s\n",
@channel, @date, @time, @title, @synopsis
end

def channel
sanitize(@hp.at("font[@class='ScheduleChannel']").inner_html)
end

def date
sanitize(@hp.at("font[@class='ScheduleDate']").inner_html)
end

def time
sanitize(@hp.at("font[@class='ScheduleTime']").inner_html)
end

def title
sanitize(@hp.at("font[@class='ScheduleTitle']").inner_html)
end

def synopsis
sanitize((@hp/"td[@colspan=5]/p").first.inner_html)
end

def sanitize(string)
@ic.iconv(@coder.decode(string))
end
end


#
# Main
#
schedule = DSTVSchedule.new()

So what interesting changes are there from our last try? The first thing you'll notice is that I monkey patched the Hash class and added a nifty urlencode method to encode my URL parameters that are used to construct the query string which we will be sending off to the search application.

Inside the DSTVSchedule class we've added query_params to temporarily hold our variable URL parameters. We then construct the URL we'll use for the query and simply pass that to the open() method from open-uri.
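On newer Rubies URI::encode is no longer available (it was removed in Ruby 3.0), so the Hash patch above will not run there. Below is a minimal sketch of the same URL construction using only the standard library. The schedule_url helper name is my own invention and not part of the class in this article. Note also that URI.encode_www_form encodes spaces as + rather than %20, which I am assuming the search application accepts.

```ruby
require 'uri'

# Hypothetical helper (not in the original class): builds the search URL
# from a hash of variable parameters plus the fixed ones used in the article.
def schedule_url(params)
  fixed = { 'sType' => 5, 'searchstring' => '', 'submit' => 'Submit' }
  URI::HTTP.build(
    host:  'www.mnet.co.za',
    path:  '/schedules/default.asp',
    query: URI.encode_www_form(params.merge(fixed))
  ).to_s
end

puts schedule_url(
  'startDate' => '30 Apr 2007',
  'EndDate'   => '01 May 2007',
  'channelid' => 219
)
```

The resulting string can be handed to open-uri in the same way as the hand-built url variable.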

The rest should all seem familiar to you (if you followed the previous article).

Now that we have that behind us do you notice we sit with a little dilemma? If we want multiple days' programs we cannot use the class as it stands because we will religiously only output the first program in the schedule. Let's replace all those methods (channel, time, date, title, synopsis) with one method that initialises an internal data structure which will represent the channel information.

def initialize()
query_params = {
'startDate' => '30 Apr 2007',
'EndDate' => '01 May 2007',
'channelid' => 219
}
query_string = query_params.urlencode + '&sType=5&searchstring=&submit=Submit'
host = 'www.mnet.co.za'
cgi = '/schedules/default.asp?'
url = "http://#{host}#{cgi}#{query_string}"
@hp = Hpricot(open(url))
@ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
@coder = HTMLEntities.new
@schedule = process_html(@hp)

self.print_schedule
end

def process_html(hp)
schedule = SequencedHash.new
date = ""
time = ""
(hp/"td").each do |line|
case line.inner_html
when /ScheduleChannel/
@channel = sanitize((line/"[@class='ScheduleChannel']").inner_html)
when /(date|ScheduleDate)/
date = utf7((line/"[@class=date]|[@class='ScheduleDate']").inner_html)
schedule[date] = SequencedHash.new
when /ScheduleTime/
time = sanitize((line/"[@class='ScheduleTime']").inner_html)
schedule[date][time] = []
when /ScheduleTitle/
schedule[date][time] << sanitize((line/"[@class='ScheduleTitle']").inner_html)
when /\<p\>/
schedule[date][time] << sanitize((line/"p").inner_html)
end
end

schedule
end

The process_html method replaces all the methods we removed. All we've done is use Hpricot to find every table cell (td) tag and its content, and then refine the search further in the case statement.

In the case structure I use simple regexps to find the classes I want and then use Hpricot to pull out the information contained in the matched tag. The structure I create is a hash of hashes that has the date and time as keys and the title and synopsis as 2 elements in an array (tuple).
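To make the shape of that structure concrete, here is a small sketch that builds it without Hpricot. The build_schedule name and the [kind, text] pairs are assumptions of mine: they stand in for the td cells that the case statement classifies, and a plain Hash stands in for SequencedHash. The synopsis strings are placeholders.

```ruby
# Hypothetical input: [kind, text] pairs in page order, standing in for the
# td cells that process_html classifies with its case statement.
def build_schedule(cells)
  schedule = {}
  date = nil
  time = nil
  cells.each do |kind, text|
    case kind
    when :date
      date = text
      schedule[date] = {}          # a new day starts
    when :time
      time = text
      schedule[date][time] = []    # a new program slot starts
    when :title, :synopsis
      schedule[date][time] << text # title first, synopsis second
    end
  end
  schedule
end

cells = [
  [:date, '30 April 2007'],
  [:time, '00:20'], [:title, "King Arthur's Disasters"], [:synopsis, 'Synopsis A'],
  [:time, '00:45'], [:title, 'Spaced Out'],              [:synopsis, 'Synopsis B'],
  [:date, '1 May 2007'],
  [:time, '00:20'], [:title, "King Arthur's Disasters"], [:synopsis, 'Synopsis C']
]

schedule = build_schedule(cells)
p schedule['30 April 2007']['00:45'] # => ["Spaced Out", "Synopsis B"]
```

The important point is that date and time are remembered between cells, so each title and synopsis lands under the most recent date and time seen.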

There is one strange case above; when searching for dates. The reason for this is to cope with the inconsistent semantics used in the HTML (as mentioned in the previous article). The first date is listed with a class attribute of 'ScheduleDate' while all the rest have a class attribute of 'date'.

Take note of the specialised hash, SequencedHash, that is used instead of the vanilla Hash included in the core of Ruby. SequencedHash is part of the Ruby Collections gem and keeps track of the order in which we add elements, so that we're able to pull them out in the same order.

I suspect storing the order of the keys may be a lot faster than trying to sort through a (potentially) large data set at the end to ensure the data is printed out in ascending date/time order.
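If you are on Ruby 1.9 or later, the built-in Hash already remembers insertion order, so a plain Hash can probably stand in for SequencedHash here. A tiny sketch to show the difference from sorting (the ordered_dates helper and its sample dates are made up for illustration):

```ruby
# Illustrative only: keys come back in the order they were inserted,
# not in sorted order.
def ordered_dates(dates)
  schedule = {}
  dates.each { |date| schedule[date] = {} }
  schedule.keys
end

p ordered_dates(['30 April 2007', '1 May 2007', '2 May 2007'])
# => ["30 April 2007", "1 May 2007", "2 May 2007"]
```

A plain string sort would have put '1 May 2007' first, which is exactly the reordering we want to avoid.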

The sanitize() method has changed in the following ways from the last article:

  1. Forcing the encoding to 7-bit US-ASCII has been moved to the utf7() method.

  2. Drop everything from the start of an HTML comment to the end of the line.

  3. Strip any leading and trailing white space.


Both methods are protected, so they can only be called from inside our class.

protected

def sanitize(string)
string.gsub!(/\<\!\-\-.+$/, '') # remove HTML comments to the end of the line
string.gsub!(/^\s+/, '') # remove leading whitespace
string.gsub!(/\s+$/, '') # remove trailing whitespace
string
end

def utf7(string="")
@ic.iconv(@coder.decode(string))
end
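One thing to be aware of is that sanitize() modifies the string it is given, because it uses gsub!. Below is a sketch of a non-destructive variant, under a hypothetical name (clean), that returns a new string instead. It is not a drop-in copy: strip only trims the two ends of the whole string, whereas the original also trims white space at the start and end of every line.

```ruby
# Sketch of a non-destructive variant: returns a new string and leaves
# the argument untouched.
def clean(text)
  text.gsub(/<!--.*$/, '') # drop HTML comments to the end of the line
      .strip               # drop leading and trailing white space
end

raw = "Codename: Kids Next Door\n<!--Title-->\n"
p clean(raw)      # => "Codename: Kids Next Door"
p clean(' 06:25') # => "06:25"
```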

We can now construct a valid query, execute the search and build an internal data structure that represents our schedule. We now need to find some way to output what we have internally.

def to_s
self.print_schedule("\t")
end

alias :to_tdt :to_s

def to_csv
##TODO - Add channel to the output
self.print_schedule(",")
end

def print_schedule(separator="||")
sep = separator
@schedule.keys.each do |date|
@schedule[date].keys.each do |time|
print [date, time, @schedule[date][time][0], @schedule[date][time][1]].join(sep) + "\n"
end
end
end

print_schedule() forms the basis of my output strategy. It takes an optional separator character(s) and walks the internal data structure to construct a schedule entry with data concatenated by the separator.

I reuse this method in the to_s() and to_csv() methods to print out TAB delimited and comma separated values, respectively. I also added a to_tdt (TAB Delimited Text) alias which is essentially just another name for to_s().
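As written, these methods print as a side effect, so to_s() does not actually hand back a string. A possible variation, sketched here under the made-up name format_schedule, is to build and return the text and let the caller decide what to do with it. It assumes the nested structure described earlier (date => time => [title, synopsis]) and uses a placeholder synopsis.

```ruby
# Hypothetical variant of print_schedule that returns the text instead of
# printing it. schedule is a nested hash: date => { time => [title, synopsis] }.
def format_schedule(schedule, separator = '||')
  schedule.flat_map do |date, programs|
    programs.map do |time, (title, synopsis)|
      [date, time, title, synopsis].join(separator)
    end
  end.join("\n")
end

sample = {
  '30 April 2007' => { '00:20' => ["King Arthur's Disasters", 'Synopsis A'] }
}
puts format_schedule(sample, "\t")
```

Keep in mind that a plain join does no quoting, so a synopsis that itself contains a comma will produce a broken row in the comma separated output.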

Running the class as it stands should give you something like this (extract):

30 April 2007||00:20||King Arthur's Disasters||Following the crazy adventures of King Arthur as he tries to find a present for his true love, Princess Guinevere.
30 April 2007||00:45||Spaced Out||'Death Of An Alien!'. George feels guilty when a Russian astronaut who saved his life is evicted from the space station.
30 April 2007||01:10||The Cramp Twins||Follow the fun and adventures of the troublesome twins, Lucien and Wayne Cramp, who are always fighting, arguing and embarrassing each other!
[...]
1 May 2007||00:20||King Arthur's Disasters||'The Ice Palace'. King Arthur and Merlin are sent to Switzerland to find Guinevere an ice palace that she can live inside.
1 May 2007||00:45||Spaced Out||'Invasion'. When cockroaches invade the space station, the Martins are asked by a cockroach prince to solve a conflict between his people and another clan.
1 May 2007||01:10||The Cramp Twins||Follow the fun and adventures of the troublesome twins, Lucien and Wayne Cramp, who are always fighting, arguing and embarrassing each other!
[...]

Feel free to play with the other output options for more fun.

Here is the complete class as it stands now:

class Hash
require 'uri'

def urlencode
map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
end
end

class DSTVSchedule
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'htmlentities'
require 'iconv'
require 'collections/sequenced_hash'

def initialize(channel='', period=30, time_offset=2)
query_params = {
'startDate' => '30 Apr 2007',
'EndDate' => '1 May 2007',
'channelid' => "219"
}
query_string = query_params.urlencode + '&sType=5&searchstring=&submit=Submit'
host = 'www.mnet.co.za'
cgi = '/schedules/default.asp?'
url = "http://#{host}#{cgi}#{query_string}"
@hp = Hpricot(open(url))
@ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
@coder = HTMLEntities.new
@schedule = process_html(@hp)
end

def process_html(hp)
schedule = SequencedHash.new
date = ""
time = ""
(hp/"td").each do |line|
case line.inner_html
when /ScheduleChannel/
@channel = sanitize((line/"[@class='ScheduleChannel']").inner_html)
when /(ScheduleDate|date)/
date = utf7((line/"[@class='ScheduleDate']|[@class=date]").inner_html)
schedule[date] = SequencedHash.new
when /ScheduleTime/
time = sanitize((line/"[@class='ScheduleTime']").inner_html)
schedule[date][time] = []
when /ScheduleTitle/
schedule[date][time] << sanitize((line/"[@class='ScheduleTitle']").inner_html)
when /\<p\>/
schedule[date][time] << sanitize((line/"p").inner_html)
end
end

schedule
end

def to_s
self.print_schedule("\t")
end

alias :to_tdt :to_s

def to_csv
##TODO - Add channel to the output
self.print_schedule(",")
end

def print_schedule(separator="||")
sep = separator
@schedule.keys.each do |date|
@schedule[date].keys.each do |time|
print [date, time, @schedule[date][time][0], @schedule[date][time][1]].join(sep) + "\n"
end
end
end

protected

def sanitize(string)
string.gsub!(/\<\!\-\-.+$/, '') # remove HTML comments to the end of the line
string.gsub!(/^\s+/, '') # remove leading whitespace
string.gsub!(/\s+$/, '') # remove trailing whitespace
string
end

def utf7(string="")
@ic.iconv(@coder.decode(string))
end
end


#
# Main
#
schedule = DSTVSchedule.new()
schedule.print_schedule

Further refactoring may see us adding some attributes to the constructor (channel name, time offset) and providing an example on how we can use objects from this class to collect and display multiple channels of our choice.

Sounds like there's another article in there somewhere.

Friday, April 27, 2007

Unholy Triumvirate: TextMate, MacPorts and Ruby

After switching back from an Ubuntu laptop to my MacBook Pro I was once again getting back to using TextMate to do some development and systems scripting. The combination of ruby and RubyGems has been a little bit rocky on OS X.

In part it was due to the default install of ruby on OS X, me using Fink for package management and then later switching from that to MacPorts.

Apple (and I presumably) suck cvyrf.

The problem I ran into was that after installing ruby and rb-rubygem via the ports system, TextMate no longer seems too interested in compiling ruby scripts when I hit CMD-R and provides me with a lovely:
"No such file to load -- rubygems"
Checking Google the first listing I get is this.

It did not provide me with an applicable solution but got me thinking ... Either I have some environment variables that are not being set (or set incorrectly) or my library paths are screwy somehow.

An easy way to confirm the former is to check if your shell environment also suffers from the same malady:
$ ruby -r rubygems -e "p 1"
1

Not the problem then. Next step, let's pull out find and off a hunting we go:
$ sudo find / -name ruby -type f
Password:
/opt/local/bin/ruby
/opt/local/var/db/dports/software/ruby/1.8.6_0/opt/local/bin/ruby
/usr/bin/ruby
Let's see if there is some disparity between the ruby binary in /opt/local/bin and /usr/bin:
$ /usr/bin/ruby -v
ruby 1.8.2 (2004-12-25) [universal-darwin8.0]
$ /opt/local/bin/ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.9.1]

Well, what do you know. The version in /usr/bin is older and also looks for its libs in a non /opt location which means that it won't pick up the good work port has done for me. I moved /usr/bin/ruby to /tmp and added a soft link for /opt/local/bin/ruby to /usr/bin.

Running my script in TextMate now works like a charm!
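A quick way to confirm which interpreter an editor is really running is to ask Ruby itself from inside a script. The sketch below assumes a reasonably recent Ruby, since RbConfig.ruby is not available on the 1.8 series discussed here. The interpreter_report name is my own.

```ruby
require 'rbconfig'

# Returns the facts that matter when two Ruby installs are competing.
def interpreter_report
  {
    version:   RUBY_VERSION,
    binary:    RbConfig.ruby,   # full path of the running interpreter
    load_path: $LOAD_PATH.dup   # where require will look for libraries
  }
end

report = interpreter_report
puts "version: #{report[:version]}"
puts "binary:  #{report[:binary]}"
puts report[:load_path]
```

Running this once from the shell and once from the editor makes any mismatch between the two environments obvious.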

Thursday, April 26, 2007

Puffing with SSHKeychain

In one of my previous articles I showed how you could use ssh-agent to your advantage to maximize lackadaisicalness. I have since then moved from the Ubuntu laptop that I was using at the time to my Mac that became available again.

I was looking for a nice and neat way to integrate ssh-agent into the Mac environment but could not get my shell scripting approach to gel elegantly. While doing the obligatory search on the web I found and fell in love with SSHKeychain.

This little app does all the hard work (running ssh-agent from the correct place and exporting your keys into memory with ssh-add) for you, and more ... It not only handles the ssh-agent side of things but also provides support for integrating with the Apple Keychain and for forwarding local ports over an ssh connection to set up ssh tunnels.

Go see the full feature list for more info.

Installation
Here are the steps from their site:
  • Download SSHKeychain.dmg and mount it.
  • Copy SSHKeychain (SSHKeychain.app) to your Applications folder.
  • Run SSHKeychain. This should open a dock item and a statusbar item.
  • Click either the Statusbar Item, the Dock Item, or Main Menu and open the Preferences.
  • Open the Environment tab.
  • Enable "Manage global environment variables". This will make SSHKeychain available for other applications.
  • Open the keys tab and see if any of your keys are missing (~/.ssh/id_dsa and ~/.ssh/identity are default).
  • Re-login to make the global variables work.
  • Start up SSHKeychain, and you're set.
I added SSHKeychain to my Login Items in the System Preferences panel to ensure the app was running after a restart or log out/in sequence.

Setup
If you followed the installation instructions above there should be nothing further to do (assuming you had some pre-created keys in the default place like I had).

Excellent!

When I now fire Terminal.app up and log into a box that has my public key on it no password is required and I am logged in without further ado.

Friday, April 20, 2007

ssh-agent for Developers

Have you ever wanted to automate the ssh pass phrase login procedure when connecting to remote systems that have your public key in their .ssh/authorized_keys?

This is done using ssh-agent (and ssh-add) which will be on your Debian or Ubuntu system if you have the openssh-client package installed. For other flavours of Linux, OS X or UNIX please refer to your package management documentation (or install from source) to see how you can install the required software.

On an Ubuntu system ssh-agent is started for you by default. Please refer to your system documentation for ssh-agent to find the correct way to run it on your system.

The following approach should work for situations where the client (the computer with the private key) is either a server (access is generally restricted to it via remote shell) or a desktop (includes laptops) with a graphical terminal program.

Add the following to the end of your ~/.bashrc (or other suitable shell setup configuration file):
# Run ssh-add if it has not been run already.
if ssh-add -l | grep -q 'The agent has no identities.'
then
eval "ssh-add"
fi
Save the addition to your .bashrc (or suitable alternative) and log out and back in.

You will be prompted for the pass phrase you chose when creating your public/private keys. Enter it and sigh with relief as your default key(s) are cached in memory.

When you now try and log into the remote system again there will be no passwords or pass phrases required for this session.

Wednesday, April 18, 2007

Ruby (Hpricot) Program Guide - I

Do you live outside of South Africa and subscribe to the M-Net Africa service? Ever wanted to avoid the M-Net Africa site and just get the program guide for your region?

Well, look no further. Ruby and Hpricot to the rescue!

The M-Net Schedule site has changed quite often over the last few months so chances are good that by the time you get to this article their site may have devolved again. Doing screen scraping on web sites is generally fraught with pain, suffering and disappointment.

This is generally due to the fact that you're providing a static way to read dynamic (over time) content. Don't get discouraged though, just build change detection into your screen scraper and ensure that it can notify you when things have changed so that you can update it.
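One cheap way to get that kind of notification is to check that the page still contains the hooks the scraper depends on before trying to parse it. The sketch below is only an illustration: the missing_markers helper is made up, and the marker list uses the class names that appear in the markup further down in this article.

```ruby
# Class names the scraper relies on (taken from the markup shown in this article).
EXPECTED_MARKERS = %w[ScheduleChannel ScheduleDate ScheduleTime ScheduleTitle].freeze

# Returns the markers that no longer appear in the page.
def missing_markers(html, markers = EXPECTED_MARKERS)
  markers.reject { |marker| html.include?(marker) }
end

html = '<font class="ScheduleChannel">Cartoon Network (Africa)</font>'
missing = missing_markers(html)
warn "Page layout may have changed, missing: #{missing.join(', ')}" unless missing.empty?
```

In a real scraper the warning could just as well be an email or a log entry; the point is to fail loudly instead of silently producing an empty schedule.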

To compound the problem, many sites (including the M-Net Schedule site) do not conform to the XHTML standard. This simply means that they have not used semantic markup to lay out their site in a way that separates structure, content and behaviour. A quick validation via the W3C Markup Validation Service confirms that the parser can't even determine the content encoding.

Embrace change - it is a lot less painful (not to mention more productive ;).

Analysis of Structure
The first step of parsing content from an outside source is to analyse the structure of the content to determine what strategies you are going to employ to read and parse the content. Below is an extract of the type of content we're interested in:
<tr>
<td colspan="5">
<font class="ScheduleSchedule">Today's Schedule for :</font>
<font class="ScheduleChannel">Cartoon Network (Africa)</font>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>
<tr>
<td colspan="5">
<font class="ScheduleDate">17&nbsp;April&nbsp;2007</font>
</td>
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">&nbsp;</td>
</tr>
<tr bgcolor="F5F5F5">
<td width="40">
<b><font class="ScheduleTime"> 06:25</font></b>
</td>
<!--Time-->
<td width="420">
<font class="ScheduleTitle">Codename: Kids Next Door
<!--Title-->
</font>
</td>
<td width="17"></td>
<td width="188"></td>
<!--SMS Reminder-->
<td width="50" align="right">
<a href="#" onclick="OpenAgeRestriction(1);return false;">Family</a>
</td>
<!--Age Restriction-->
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">
<p>A gang of 10 year olds takes on top secret missions, using fantastic home-made technology to safeguard their treehouse against attack and grown-ups.</p>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>
<tr bgcolor="F5F5F5">
<td width="40">
<b><font class="ScheduleTime"> 06:50</font></b>
</td>
<!--Time-->
<td width="420">
<font class="ScheduleTitle">The Powerpuff Girls
<!--Title-->
</font>
</td>
<td width="17"></td>
<td width="188"></td>
<!--SMS Reminder-->
<td width="50" align="right">
<a href="#" onclick="OpenAgeRestriction(1);return false;">Family</a>
</td>
<!--Age Restriction-->
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">
<p>The wild and wacky escapades of three girls with extraordinary powers. Blossom, Buttercup and Bubbles use their superpowers to fight crime and villainy in Townsville.</p>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>

The first table rows we're interested in are the ones that tell us which channel we are looking at and what this day's date is (lightly formatted for readability):
<tr>
<td colspan="5">
<font class="ScheduleSchedule">Today's Schedule for :</font>
<font class="ScheduleChannel">Cartoon Network (Africa)</font>
</td>
</tr>
<tr>
<td colspan="5">&nbsp;</td>
</tr>
<tr>
<td colspan="5">
<font class="ScheduleDate">17&nbsp;April&nbsp;2007</font>
</td>
</tr>
The name of the channel resides in a font tag with a class attribute of "ScheduleChannel" and the date we're working with resides in a font tag with a class attribute of "ScheduleDate". How does the search for this information translate into code? I will be using an XPath query (Hpricot supports both XPath and CSS selector based queries) to find the first font tag that has the class attribute I am searching for:
def channel
@channel = @hp.at("font[@class='ScheduleChannel']").inner_html
end

def date
@date = @hp.at("font[@class='ScheduleDate']").inner_html
end

That's all pretty plain Jane so far. Here is what a typical table row looks like that contains the time of the program (reformatted for readability):
<tr bgcolor="F5F5F5">
<td width="40">
<b><font class="ScheduleTime"> 06:25</font></b>
</td>
<!--Time-->
<td width="420">
<font class="ScheduleTitle">Codename: Kids Next Door
<!--Title-->
</font>
</td>
<td width="17"></td>
<td width="188"></td>
<!--SMS Reminder-->
<td width="50" align="right">
<a href="#" onclick="OpenAgeRestriction(1);return false;">Family</a>
</td>
<!--Age Restriction-->
</tr>
<tr bgcolor="F5F5F5">
<td colspan="5">
<p>A gang of 10 year olds takes on top secret missions, using fantastic home-made technology to safeguard their treehouse against attack and grown-ups.</p>
</td>
</tr>
The time is similarly found in a font tag with a class attribute of "ScheduleTime" and the program title is found in a font tag with a class attribute of "ScheduleTitle". The program synopsis is however wrapped in a table cell with a colspan of 5 and a paragraph tag.
def time
@time = @hp.at("font[@class='ScheduleTime']").inner_html
end

def title
@title = @hp.at("font[@class='ScheduleTitle']").inner_html
end

def synopsis
@synopsis = (@hp/"td[@colspan=5]/p").first.inner_html
end
You will notice that the extraction of the time and title holds no surprises. The synopsis extraction however is something new. I chose to use a CSS selector search for the synopsis by looking for the first td tag that has a colspan=5 attribute, followed by a p tag's contents (inner_html).

If you were to print the values of the relevant variables you would see that there is still some cleaning up that needs to be done on them before they can be considered for programmatic consumption:
Channel: Cartoon Network (Africa)
Date: 17&nbsp;April&nbsp;2007
Time: 06:25
Title: Codename: Kids Next Door <!--Title-->


Synopsis: A gang of 10 year olds takes on top secret missions, using fantastic home-made technology to safeguard their treehouse against attack and grown-ups.
The channel and time look fine so we'll just skip them for now. The date has some HTML entities in it so let's decode them using the handy HTMLEntities lib (I recommend installing it from the gem). The problem is that if they sneaked some HTML entities into the date they may choose to do this elsewhere as well, so let's not trust the input and ensure we sanitise all input in a generic way:
def initialize(url)
@coder = HTMLEntities.new
p sanitize(url)
end

def sanitize(string)
@coder.decode(string)
end
The only problem with this is that HTMLEntities uses UTF-8 encoding which outputs (on my system) something like this for the date value:
"17\302\240April\302\2402007"
Not really ideal ... let's use the iconv lib to get the UTF-8 string forced into a US-ASCII encoding:
def initialize(url)
@coder = HTMLEntities.new
@ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
p sanitize(url)
end

def sanitize(string)
@ic.iconv(@coder.decode(string))
end
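The iconv library is no longer bundled with Ruby from version 2.0 onwards, so on a newer interpreter the same idea needs a different tool. The sketch below is an assumption-laden stand-in rather than a replacement for the code above: it only knows about the non-breaking space (the one entity that shows up in the date), uses String#encode instead of Iconv, and replaces anything else that is not ASCII with a question mark instead of transliterating it. The to_ascii name is my own.

```ruby
NBSP = "\u00A0" # what &nbsp; decodes to; shown as \302\240 in the output above

# Hypothetical helper: turns non-breaking spaces into plain spaces and
# forces the result into US-ASCII.
def to_ascii(text)
  text.gsub('&nbsp;', ' ')
      .tr(NBSP, ' ')
      .encode('US-ASCII', invalid: :replace, undef: :replace, replace: '?')
end

p to_ascii('17&nbsp;April&nbsp;2007')    # => "17 April 2007"
p to_ascii("17\u00A0April\u00A02007")    # => "17 April 2007"
```

For anything beyond this one entity, a proper entity decoder such as HTMLEntities is still the better choice.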
Right, now to get back on track after that slight detour. To recap, we now have a strategy to get all the items we're interested in but the solution above is a little naive because it assumes we only have one day with one program. The complete schedule for a channel could span several days, each with a variable number of programs.

One can also extend the ideas above to make it a lot more usable by downloading multiple channels for you and possibly pretty print it, send it to yourself via email or drop it in a db for later processing or display.

I'll cover these in followup articles to come ...

About Me

I love solving real-world problems with code and systems (web apps, distributed systems and all the bits and pieces in-between).