Monday, April 30, 2007

Ruby (Hpricot) Program Guide - II

In this installment we'll build on what we learnt last time and work towards a less naive solution: retrieving the complete schedule for a channel over several days, each with a varying number of programs.

First things first though. Let's add the code that will retrieve the page for the channel we choose. Let's assume we want the schedule for Cartoon Network (Africa). The channel id for this channel happens to be 219 (as per the select list on the search page).

class Hash
  require 'uri'

  def urlencode
    map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
  end
end

class DSTVSchedule
  require 'rubygems'
  require 'hpricot'
  require 'open-uri'
  require 'htmlentities'
  require 'iconv'

  def initialize()
    query_params = {
      'startDate' => '30 Apr 2007',
      'EndDate' => '01 May 2007',
      'channelid' => 219
    }
    query_string = query_params.urlencode + '&sType=5&searchstring=&submit=Submit'
    host = 'www.mnet.co.za'
    cgi = '/schedules/default.asp?'
    url = "http://#{host}#{cgi}#{query_string}"
    @hp = Hpricot(open(url))
    @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
    @coder = HTMLEntities.new
    @channel = channel
    @date = date
    @time = time
    @title = title
    @synopsis = synopsis

    printf "Channel: %s\nDate: %s\nTime: %s\nTitle: %s\nSynopsis: %s\n",
      @channel, @date, @time, @title, @synopsis
  end

  def channel
    sanitize(@hp.at("font[@class='ScheduleChannel']").inner_html)
  end

  def date
    sanitize(@hp.at("font[@class='ScheduleDate']").inner_html)
  end

  def time
    sanitize(@hp.at("font[@class='ScheduleTime']").inner_html)
  end

  def title
    sanitize(@hp.at("font[@class='ScheduleTitle']").inner_html)
  end

  def synopsis
    sanitize((@hp/"td[@colspan=5]/p").first.inner_html)
  end

  def sanitize(string)
    @ic.iconv(@coder.decode(string))
  end
end


#
# Main
#
schedule = DSTVSchedule.new()

So what has changed since our last try? The first thing you'll notice is that I monkey-patched the Hash class with a nifty urlencode method. It encodes the URL parameters that make up the query string we send off to the search application.

Inside the DSTVSchedule class we've added query_params to temporarily hold our variable URL parameters. We then construct the URL we'll use for the query and simply pass that to the open() method from open-uri.
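
A note for anyone following along on a newer Ruby: URI::encode was removed in Ruby 3.0, so the urlencode patch won't run there as written. Here is a minimal sketch of the same idea using URI.encode_www_form from the standard library. The schedule_url helper is my own name for illustration and not part of the class, and I'm assuming the search application is happy with spaces encoded as '+' rather than '%20'.

```ruby
require 'uri'

# Hypothetical helper (not part of DSTVSchedule): builds the search URL
# from the same parameters the constructor uses.
def schedule_url(start_date, end_date, channel_id)
  params = {
    'startDate'    => start_date,
    'EndDate'      => end_date,
    'channelid'    => channel_id,
    'sType'        => 5,
    'searchstring' => '',
    'submit'       => 'Submit'
  }
  # encode_www_form escapes keys and values and joins the pairs with '&'
  "http://www.mnet.co.za/schedules/default.asp?#{URI.encode_www_form(params)}"
end

puts schedule_url('30 Apr 2007', '01 May 2007', 219)
```

This keeps all the parameters in one hash, so the fixed '&sType=5...' suffix no longer needs to be appended by hand.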

The rest should all seem familiar to you (if you followed the previous article).

Now that we have that behind us, notice that we're left with a little dilemma. If we want several days' worth of programs we cannot use the class as it stands, because it only ever outputs the first program in the schedule. Let's replace all those methods (channel, time, date, title, synopsis) with one method that initialises an internal data structure which will represent the channel information.

def initialize()
  query_params = {
    'startDate' => '30 Apr 2007',
    'EndDate' => '01 May 2007',
    'channelid' => 219
  }
  query_string = query_params.urlencode + '&sType=5&searchstring=&submit=Submit'
  host = 'www.mnet.co.za'
  cgi = '/schedules/default.asp?'
  url = "http://#{host}#{cgi}#{query_string}"
  @hp = Hpricot(open(url))
  @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
  @coder = HTMLEntities.new
  @schedule = process_html(@hp)

  self.print_schedule
end

def process_html(hp)
  schedule = SequencedHash.new
  date = ""
  time = ""
  (hp/"td").each do |line|
    case line.inner_html
    when /ScheduleChannel/
      @channel = sanitize((line/"[@class='ScheduleChannel']").inner_html)
    when /(date|ScheduleDate)/
      date = utf7((line/"[@class=date]|[@class='ScheduleDate']").inner_html)
      schedule[date] = SequencedHash.new
    when /ScheduleTime/
      time = sanitize((line/"[@class='ScheduleTime']").inner_html)
      schedule[date][time] = []
    when /ScheduleTitle/
      schedule[date][time] << sanitize((line/"[@class='ScheduleTitle']").inner_html)
    when /\<p\>/
      schedule[date][time] << sanitize((line/"p").inner_html)
    end
  end

  schedule
end

The process_html method replaces all the methods we removed. All we've done is use Hpricot to find every table cell (td) along with its content, and then refine the search further in the case statement.

In the case structure I use simple regexps to find the classes I want and then use Hpricot to pull out the information contained in the matched tag. The structure I create is a hash of hashes that has the date and time as keys and the title and synopsis as 2 elements in an array (tuple).
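
To make the shape of that structure concrete, here is a minimal sketch that builds it without Hpricot. The CELLS list is made-up sample data standing in for the table cells in page order, build_schedule is a hypothetical name, and I'm using plain hashes rather than SequencedHash.

```ruby
# Hypothetical pre-extracted cells: [kind, text] pairs in page order.
CELLS = [
  [:date,     '30 April 2007'],
  [:time,     '00:20'],
  [:title,    "King Arthur's Disasters"],
  [:synopsis, 'Following the crazy adventures of King Arthur.'],
  [:time,     '00:45'],
  [:title,    'Spaced Out'],
  [:synopsis, "'Death Of An Alien!'."],
  [:date,     '1 May 2007'],
  [:time,     '00:20'],
  [:title,    "King Arthur's Disasters"],
  [:synopsis, "'The Ice Palace'."]
].freeze

# Walk the cells once, remembering the most recent date and time so that
# each title and synopsis lands under the right pair of keys.
def build_schedule(cells)
  schedule = {}
  date = nil
  time = nil
  cells.each do |kind, text|
    case kind
    when :date
      date = text
      schedule[date] = {}
    when :time
      time = text
      schedule[date][time] = []
    when :title, :synopsis
      schedule[date][time] << text
    end
  end
  schedule
end

title, synopsis = build_schedule(CELLS)['30 April 2007']['00:45']
puts "#{title}: #{synopsis}"
```

The date and time variables act as a cursor: a title or synopsis cell carries no date or time of its own, so it is filed under whichever ones were seen last.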

There is one strange case above: the search for dates. It is there to cope with the inconsistent semantics used in the HTML (as mentioned in the previous article). The first date is listed with a class attribute of 'ScheduleDate' while all the rest have a class attribute of 'date'.
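
One caveat with the loose pattern: because the case statement tests the cell's whole inner HTML, /(date|ScheduleDate)/ matches any cell that merely contains the letters 'date', such as a synopsis with the word 'update' in it. A tighter pattern that looks at the class attribute avoids that. This is a sketch only: the sample cells are made up, and I'm assuming the attribute may appear quoted or unquoted.

```ruby
# The pattern used in process_html: matches "date" anywhere in the cell.
LOOSE_DATE = /(date|ScheduleDate)/

# Assumed tighter pattern: only matches a class attribute, quoted or not.
STRICT_DATE = /class=["']?(?:ScheduleDate|date)\b/

# Returns true when the cell's HTML matches the given pattern.
def date_cell?(html, pattern)
  !!(html =~ pattern)
end

# Made-up sample cells for illustration.
samples = [
  '<font class="ScheduleDate">30 April 2007</font>',
  '<font class=date>1 May 2007</font>',
  '<p>Tom tries to update his blog.</p>'
]

samples.each do |html|
  puts "#{date_cell?(html, LOOSE_DATE)} #{date_cell?(html, STRICT_DATE)} #{html}"
end
```

Both patterns accept the two real date cells, but only the loose one is fooled by the synopsis.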

Take note of the specialised SequencedHash used in place of the vanilla Hash from Ruby's core. SequencedHash is part of the Ruby Collections gem and keeps track of the order in which elements are added, so that we can pull them out again in the same order.

I suspect storing the order of the keys may be a lot faster than trying to sort through a (potentially) large data set at the end to ensure the data is printed out in ascending date/time order.
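
Speed aside, there is a second reason to keep insertion order: the date keys are display strings, so a plain sort at the end wouldn't even give chronological order without parsing them first. A small sketch, using an ordinary Hash, which has preserved insertion order since Ruby 1.9 (on the 1.8 series you'd still want SequencedHash). The method names are mine, for illustration only.

```ruby
# Dates in the order the page lists them.
PAGE_ORDER = ['30 April 2007', '1 May 2007'].freeze

# Build a hash keyed by date, adding keys in page order, and read them back.
def dates_in_insertion_order(dates)
  by_date = {}
  dates.each { |date| by_date[date] = {} }
  by_date.keys
end

# Sorting the keys as plain strings compares them character by character,
# so "1 May" sorts ahead of "30 April".
def dates_sorted_as_strings(dates)
  dates.sort
end

p dates_in_insertion_order(PAGE_ORDER) # => ["30 April 2007", "1 May 2007"]
p dates_sorted_as_strings(PAGE_ORDER)  # => ["1 May 2007", "30 April 2007"]
```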

The sanitize() method has changed in the following ways since the last article:

  1. Forcing the encoding has moved to the utf7() method (which, despite its name, transliterates UTF-8 to US-ASCII, as the Iconv setup shows).

  2. Everything from the start of an HTML comment to the end of the line is dropped.

  3. Leading and trailing white space is stripped.


Both methods are protected, so they can only be called from within our class.

protected

def sanitize(string)
  string.gsub!(/\<\!\-\-.+$/, '') # remove HTML comments to the end of the line
  string.gsub!(/^\s+/, '') # remove leading whitespace
  string.gsub!(/\s+$/, '') # remove trailing whitespace
  string
end

def utf7(string="")
  @ic.iconv(@coder.decode(string))
end
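
One thing to watch: gsub! modifies the string it is handed in place, so the caller's copy changes too (and a frozen string would raise). Here is a minimal non-mutating sketch of the same three steps. clean_cell is a hypothetical name, and note that strip only trims the two ends of the whole string, whereas the ^ and $ anchored patterns in sanitize() trim every line.

```ruby
# Hypothetical non-mutating variant of sanitize(); returns a new string
# and leaves its argument untouched.
def clean_cell(string)
  # drop everything from a comment opener to the end of that line
  without_comments = string.gsub(/<!--.*$/, '')
  # trim leading and trailing whitespace
  without_comments.strip
end

raw = "  Spaced Out <!-- end of title -->\n"
puts clean_cell(raw) # prints "Spaced Out"
```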

We can now construct a valid query, execute the search and build an internal data structure that represents our schedule. We now need to find some way to output what we have internally.

def to_s
  self.print_schedule("\t")
end

alias :to_tdt :to_s

def to_csv
  ##TODO - Add channel to the output
  self.print_schedule(",")
end

def print_schedule(separator="||")
  sep = separator
  @schedule.keys.each do |date|
    @schedule[date].keys.each do |time|
      print [date, time, @schedule[date][time][0], @schedule[date][time][1]].join(sep) + "\n"
    end
  end
end

print_schedule() forms the basis of my output strategy. It takes an optional separator (one or more characters) and walks the internal data structure, constructing one schedule entry per program with the fields joined by the separator.

I reuse this method in the to_s() and to_csv() methods to print out TAB delimited and comma separated values, respectively. I also added a to_tdt (TAB Delimited Text) alias which is essentially just another name for to_s().
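
Two things I'd tighten up here. A comma separator on its own isn't quite CSV, because the synopses are full of commas, and to_s conventionally returns a string rather than printing one. Below is a minimal sketch of a formatter that returns its output and quotes fields when the separator is a comma. csv_field and format_schedule are hypothetical names, and I'm using plain nested hashes in place of the SequencedHash.

```ruby
# Quote a field only when it contains a comma, a double quote or a line
# break, doubling any embedded quotes (the usual CSV convention).
def csv_field(value)
  text = value.to_s
  return text unless text.match?(/[",\r\n]/)
  '"' + text.gsub('"', '""') + '"'
end

# Hypothetical stand-in for print_schedule that returns a string instead
# of printing. Expects { date => { time => [title, synopsis] } }.
def format_schedule(schedule, separator = '||')
  lines = []
  schedule.each do |date, times|
    times.each do |time, (title, synopsis)|
      fields = [date, time, title, synopsis]
      fields = fields.map { |field| csv_field(field) } if separator == ','
      lines << fields.join(separator)
    end
  end
  lines.join("\n")
end

sample = {
  '30 April 2007' => {
    '00:45' => ['Spaced Out', "'Death Of An Alien!'. George feels guilty, again."]
  }
}
puts format_schedule(sample, ',')
puts format_schedule(sample, "\t")
```

With something like this in place, to_s and to_csv could simply return format_schedule(@schedule, "\t") and format_schedule(@schedule, ","), leaving the printing to the caller.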

Running the class as it stands should give you something like this (extract):

30 April 2007||00:20||King Arthur's Disasters||Following the crazy adventures of King Arthur as he tries to find a present for his true love, Princess Guinevere.
30 April 2007||00:45||Spaced Out||'Death Of An Alien!'. George feels guilty when a Russian astronaut who saved his life is evicted from the space station.
30 April 2007||01:10||The Cramp Twins||Follow the fun and adventures of the troublesome twins, Lucien and Wayne Cramp, who are always fighting, arguing and embarrassing each other!
[...]
1 May 2007||00:20||King Arthur's Disasters||'The Ice Palace'. King Arthur and Merlin are sent to Switzerland to find Guinevere an ice palace that she can live inside.
1 May 2007||00:45||Spaced Out||'Invasion'. When cockroaches invade the space station, the Martins are asked by a cockroach prince to solve a conflict between his people and another clan.
1 May 2007||01:10||The Cramp Twins||Follow the fun and adventures of the troublesome twins, Lucien and Wayne Cramp, who are always fighting, arguing and embarrassing each other!
[...]

Feel free to play with the other output options for more fun.

Here is the complete class as it stands now:

class Hash
  require 'uri'

  def urlencode
    map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
  end
end

class DSTVSchedule
  require 'rubygems'
  require 'hpricot'
  require 'open-uri'
  require 'htmlentities'
  require 'iconv'
  require 'collections/sequenced_hash'

  def initialize(channel='', period=30, time_offset=2)
    query_params = {
      'startDate' => '30 Apr 2007',
      'EndDate' => '1 May 2007',
      'channelid' => "219"
    }
    query_string = query_params.urlencode + '&sType=5&searchstring=&submit=Submit'
    host = 'www.mnet.co.za'
    cgi = '/schedules/default.asp?'
    url = "http://#{host}#{cgi}#{query_string}"
    @hp = Hpricot(open(url))
    @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
    @coder = HTMLEntities.new
    @schedule = process_html(@hp)
  end

  def process_html(hp)
    schedule = SequencedHash.new
    date = ""
    time = ""
    (hp/"td").each do |line|
      case line.inner_html
      when /ScheduleChannel/
        @channel = sanitize((line/"[@class='ScheduleChannel']").inner_html)
      when /(ScheduleDate|date)/
        date = utf7((line/"[@class='ScheduleDate']|[@class=date]").inner_html)
        schedule[date] = SequencedHash.new
      when /ScheduleTime/
        time = sanitize((line/"[@class='ScheduleTime']").inner_html)
        schedule[date][time] = []
      when /ScheduleTitle/
        schedule[date][time] << sanitize((line/"[@class='ScheduleTitle']").inner_html)
      when /\<p\>/
        schedule[date][time] << sanitize((line/"p").inner_html)
      end
    end

    schedule
  end

  def to_s
    self.print_schedule("\t")
  end

  alias :to_tdt :to_s

  def to_csv
    ##TODO - Add channel to the output
    self.print_schedule(",")
  end

  def print_schedule(separator="||")
    sep = separator
    @schedule.keys.each do |date|
      @schedule[date].keys.each do |time|
        print [date, time, @schedule[date][time][0], @schedule[date][time][1]].join(sep) + "\n"
      end
    end
  end

  protected

  def sanitize(string)
    string.gsub!(/\<\!\-\-.+$/, '') # remove HTML comments to the end of the line
    string.gsub!(/^\s+/, '') # remove leading whitespace
    string.gsub!(/\s+$/, '') # remove trailing whitespace
    string
  end

  def utf7(string="")
    @ic.iconv(@coder.decode(string))
  end
end


#
# Main
#
schedule = DSTVSchedule.new()
schedule.print_schedule

Further refactoring may see us adding some attributes to the constructor (channel name, time offset) and providing an example of how we can use objects from this class to collect and display multiple channels of our choice.

Sounds like there's another article in there somewhere.
