Wednesday, May 2, 2007

Ruby (Hpricot) Program Guide - III

As discussed in the previous article, our next steps are to refactor the constructor and to show how objects of the DSTVSchedule class can be used to collect and display the channels of our choice.

Let's change the constructor to take three parameters: the channel ID, a time offset in minutes (to account for different time zones) and the number of days ahead for which we want to gather schedule information. This means we can get rid of the custom hash class and tidy things up a little:

# offset is in minutes (120 = 2 hours); period is in days
def initialize(channel=219, offset=120, period=30)
  start_date, end_date = get_search_dates(period)
  url = build_url(build_query_string(channel, start_date, end_date))

  p "Start: #{start_date} End: #{end_date} URL: #{url}"

  @hp = Hpricot(open(url))
  @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
  @coder = HTMLEntities.new
  @schedule = process_html(@hp, offset)
end

def get_search_dates(period=30)
  [DateTime.now().strftime("%d %b %Y"), (DateTime.now()+period).strftime("%d %b %Y")]
end

def build_query_string(channel, start_date, end_date)
  urlencode({
    'channelid' => channel,
    'startDate' => start_date,
    'EndDate' => end_date}) +
    '&sType=5&searchstring=&submit=Submit'
end

def build_url(query_string)
  host = 'www.mnet.co.za'
  cgi = '/schedules/default.asp?'
  "http://#{host}#{cgi}#{query_string}"
end

def urlencode(hash)
  hash.map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
end

We no longer statically define the query parameters in the constructor and therefore have no real need for the custom hash. We can still use the urlencode() method though and add it as a helper in the class.

The start and end dates for the query are calculated based on today's date and the period provided to the constructor as an argument.
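To make the date arithmetic concrete, here is a minimal standalone sketch of the same calculation. The method name search_dates and its today parameter are my own additions for illustration (they are not part of the class); passing the date in simply makes the result reproducible:

```ruby
require 'date'

# Returns [start, end] as strings formatted like "02 May 2007".
# `today` is a hypothetical parameter, added so the example does not
# depend on the day it is run.
def search_dates(period = 30, today = Date.today)
  [today, today + period].map { |d| d.strftime("%d %b %Y") }
end

p search_dates(30, Date.new(2007, 5, 2))
# prints ["02 May 2007", "01 Jun 2007"]
```

Adding an integer to a Date (or DateTime) moves it forward by that many days, which is why the period is expressed in days.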

We also dumped all that horrible-looking query string and URL construction code into separate methods.
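If you want to see what the resulting URL looks like without hitting the site, the following sketch builds it using only the standard library. Note the assumptions: schedule_url is a name I made up for this example, and I have swapped in URI.encode_www_form for our urlencode helper because URI::encode is not available on newer Ruby versions. The swap encodes spaces as "+" rather than "%20"; I have not checked which form the site prefers.

```ruby
require 'uri'

HOST = 'www.mnet.co.za'
PATH = '/schedules/default.asp'

# Hypothetical helper: builds the search URL for one channel and date range.
def schedule_url(channel, start_date, end_date)
  query = URI.encode_www_form(
    'channelid'    => channel,
    'startDate'    => start_date,
    'EndDate'      => end_date,
    'sType'        => 5,
    'searchstring' => '',
    'submit'       => 'Submit'
  )
  "http://#{HOST}#{PATH}?#{query}"
end

puts schedule_url(219, '02 May 2007', '01 Jun 2007')
# prints http://www.mnet.co.za/schedules/default.asp?channelid=219&startDate=02+May+2007&EndDate=01+Jun+2007&sType=5&searchstring=&submit=Submit
```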

The next step is to automate the channel schedule collection for our example program. Look at the HTML source of any of the search pages and you'll see the following (excerpt):

<select name="channelid" class="ScheduleInputSelect">
<option value="" >CHANNEL</option>
<option value=246>actionX </option>
<option value=322>Activate </option>
<option value=496>Africa Magic</option>
<option value=487>Africa Magic Channel (C-Band) </option>
<option value=639>Africa Magic W4</option>
<option value=417>Animal Planet </option>
[...]
<option value=254>TV Globo </option>
<option value=493>TV5 Afrique </option>
<option value=110>TV5 Afrique (Africa) </option>
<option value=65>VH1 </option>
<option value=67>ZEE TV </option>
</select>

These are the channels that we can search for. What we need is to represent this information as an internal data structure that we can use to look up the channels we want. I suggest a hash that has the channel name as the key and a two-element array holding the channel ID and time offset as the value.
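As a quick sketch of that structure (only two entries here, with values taken from the full listing further down), a lookup by name hands back the ID and the offset in one go:

```ruby
# Channel name => [channel ID, time offset in minutes]
channels = {
  "BBC Food"           => [284, 120],
  "Sony Entertainment" => [228, 90],
}

# Array destructuring splits the value into its two parts.
id, offset = channels["Sony Entertainment"]
puts "ID #{id}, offset #{offset} minutes"
# prints ID 228, offset 90 minutes
```

Keying on the name keeps the hash readable, and the array leaves room for a per-channel offset.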

I am lazy, so I'd prefer to avoid typing all that information up or manually transforming it in the editor. Perhaps we can use some good old command-line Ruby to chew up the HTML and spit out the code we need, which we can then just cut 'n paste or import (depending on the editor you use).

Copy the HTML and drop it in a file somewhere. Let's call the file in.html and run it through this command line script (output is truncated):

$ ruby -n -e '$_=~/value=(\d+)\>(.+)\s+\</;if $1&&$2 then a=$1;b=$2;print "\# \"#{b.sub(/\s+$/,"")}\" => [#{a}, 120],\n" end' < in.html | head
# "actionX" => [246, 120],
# "Activate" => [322, 120],
# "Africa Magic Channel (C-Band)" => [487, 120],
# "Animal Planet" => [417, 120],
# "B4U Movies" => [227, 120],
# "BBC Food" => [284, 120],
# "BBC Prime" => [121, 120],
# "BBC World" => [5, 120],
# "Bloomberg Information TV" => [8, 120],
# "Boomerang" => [314, 120],
[...]
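One thing to be aware of: the pattern in the one-liner expects whitespace before the closing tag, so options such as "Africa Magic" (which has no trailing space in the HTML) are skipped, as you can see in the output above. If you want those as well, here is a rough script version of the same filter. The names channel_entries and DEFAULT_OFFSET are my own, and the pattern assumes the options look like the excerpt shown earlier:

```ruby
# Offset in minutes written into every generated entry.
DEFAULT_OFFSET = 120

# Turns <option value=ID>Name</option> markup into commented-out hash entries.
# Options without a numeric value (the "CHANNEL" placeholder) are ignored.
def channel_entries(html, offset = DEFAULT_OFFSET)
  html.scan(/<option value=(\d+)>([^<]+)</).map do |id, name|
    %(# "#{name.strip}" => [#{id}, #{offset}],)
  end
end

sample = <<~HTML
  <option value="" >CHANNEL</option>
  <option value=246>actionX </option>
  <option value=496>Africa Magic</option>
  <option value=65>VH1 </option>
HTML

puts channel_entries(sample)
# prints:
# # "actionX" => [246, 120],
# # "Africa Magic" => [496, 120],
# # "VH1" => [65, 120],
```

To run it against the saved file, replace the sample with File.read("in.html").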

Now take the output and place it in your script as a hash (as described above):

channels = {
# "actionX" => [246, 120],
# "Activate" => [322, 120],
# "Africa Magic Channel (C-Band)" => [487, 120],
# "Animal Planet" => [417, 120],
# "B4U Movies" => [227, 120],
"BBC Food" => [284, 120],
"BBC Prime" => [121, 120],
# "BBC World" => [5, 120],
# "Bloomberg Information TV" => [8, 120],
# "Boomerang" => [314, 120],
# "BVN" => [270, 120],
# "Canal+ Horizons" => [237, 120],
# "Cartoon Network" => [13, 120],
# "Cartoon Network (Africa)" => [219, 120],
# "Cartoon Network (W4)" => [182, 120],
# "Channel O - Sound Television" => [27, 120],
# "China Central Television 4" => [15, 120],
# "China Central Television 9 (Africa)" => [226, 120],
# "CNBC" => [90, 120],
# "CNBC (Africa)" => [194, 120],
# "CNBC (W4)" => [187, 120],
# "CNN International" => [18, 120],
# "Deukom - 3SAT" => [165, 120],
# "Deukom - ARD" => [93, 120],
# "Deukom - DW" => [94, 120],
# "Deukom - PRO 7" => [164, 120],
# "Deukom - RTL" => [91, 120],
# "Deukom - SAT 1" => [92, 120],
# "Deukom - ZDF" => [95, 120],
"Discovery Channel" => [21, 120],
# "E-Entertainment" => [646, 120],
"ESPN" => [24, 120],
# "eTV" => [111, 120],
# "Fashion TV" => [145, 120],
# "Fashion TV (Africa)" => [196, 120],
# "Fashion TV (W4)" => [216, 120],
"GO" => [542, 120],
# "Go (K-World Teen)" => [341, 120],
"Hallmark Entertainment Network" => [32, 120],
"History Channel" => [484, 120],
# "History Channel (Africa)" => [485, 120],
# "K-TV World" => [36, 120],
# "KTV (Indian Bouquet)" => [501, 120],
# "kykNET" => [112, 120],
# "M-Net Domestic" => [39, 120],
"M-Net East (Africa)" => [40, 120],
"M-Net Series" => [75, 120],
# "MK89" => [592, 120],
# "Movie Magic (Africa)" => [57, 120],
"Movie Magic 2 (Africa)" => [234, 120],
# "Movie Magic 2 (W4)" => [233, 120],
# "MTV" => [42, 120],
# "MTV Base" => [69, 120],
"National Geographic" => [102, 120],
# "NDTV" => [499, 120],
# "Parliamentary Service" => [45, 120],
# "Pay Per View" => [109, 120],
"Reality TV" => [248, 120],
# "Rhema Network" => [46, 120],
# "RTPi" => [48, 120],
# "SABC 1" => [84, 120],
# "SABC 2" => [85, 120],
# "SABC 3" => [86, 120],
# "SABC Africa" => [87, 120],
# "SIC" => [255, 120],
# "Sky News" => [120, 120],
"Sony Entertainment" => [228, 90],
# "Summit" => [104, 120],
# "Sun TV" => [500, 120],
# "SuperSport" => [52, 120],
# "SuperSport 2" => [54, 120],
# "SuperSport 3" => [80, 120],
# "SuperSport 3 (W4)" => [172, 120],
# "SuperSport 5" => [208, 120],
# "SuperSport 5 (Africa)" => [252, 120],
# "SuperSport 5 (W4)" => [251, 120],
# "SuperSport 6" => [209, 120],
# "SuperSport 7 (C-Band)" => [580, 120],
# "SuperSport Zone Mosaic" => [235, 120],
# "TellyTrack" => [34, 120],
# "Travel Channel" => [61, 120],
# "Trinity Broadcasting Network" => [276, 120],
# "Turner Classic Movies" => [59, 120],
# "Turner Classic Movies (Africa)" => [60, 120],
# "Turner Classic Movies (W4)" => [181, 120],
# "TV Globo" => [254, 120],
# "TV5 Afrique" => [493, 120],
# "TV5 Afrique (Africa)" => [110, 120],
# "VH1" => [65, 120],
# "ZEE TV" => [67, 120]
}

You'll notice I have uncommented the channels I want (I recommend you do the same for the channels you are interested in). The second value in each entry is a time offset in minutes: I used a default of 120 (2 hours) for most of the channels to adjust the times for my time zone. You can change the default in the command-line Ruby filter above to suit your needs.
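To show what the offset does to a listed time, here is a small standalone sketch of the adjustment that process_html performs (see the full listing below). The method name shift_time is made up for this example, and I pin the parse to UTC so the result does not depend on the machine's time zone:

```ruby
require 'time'

# Shifts an "HH:MM" listing time on the given date by offset_minutes
# and returns the adjusted "HH:MM" string.
def shift_time(date, time, offset_minutes)
  (Time.parse("#{date} #{time} UTC") + 60 * offset_minutes).strftime("%H:%M")
end

puts shift_time("02 May 2007", "18:30", 120)  # prints 20:30
puts shift_time("02 May 2007", "23:15", 90)   # prints 00:45
```

As the second call shows, a shifted time can roll past midnight. As far as I can tell, the script keeps such an entry under the original date, so keep that in mind when reading late-night listings.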

All we need to do now is wrap our object creation and the output from it in a loop and we're off:

channels.each do |name, (id, offset)|
  p "Channel: #{name}"
  schedule = DSTVSchedule.new(id, offset, 30)
  schedule.print_schedule
  print "\n\n"
end

All done. Here is the complete script source listing:

#!/usr/bin/ruby

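require 'rubygems'
require 'date' # DateTime, used in get_search_dates
require 'time' # Time.parse, used in process_html
require 'hpricot'
require 'open-uri'
require 'htmlentities'
require 'iconv'
require 'collections/sequenced_hash'

class DSTVSchedule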

  # offset is in minutes (120 = 2 hours); period is in days
  def initialize(channel=219, offset=120, period=30)
    start_date, end_date = get_search_dates(period)
    url = build_url(build_query_string(channel, start_date, end_date))

    p "Start: #{start_date} End: #{end_date} URL: #{url}"

    @hp = Hpricot(open(url))
    @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
    @coder = HTMLEntities.new
    @schedule = process_html(@hp, offset)
  end

  def process_html(hp, offset)
    schedule = SequencedHash.new
    date = ""
    time = ""
    (hp/"td").each do |line|
      case line.inner_html
      when /ScheduleChannel/
        @channel = sanitize((line/"[@class='ScheduleChannel']").inner_html)
      when /(ScheduleDate|date)/
        date = utf7((line/"[@class='ScheduleDate']|[@class=date]").inner_html)
        schedule[date] = SequencedHash.new
      when /ScheduleTime/
        time = sanitize((line/"[@class='ScheduleTime']").inner_html)
        time = (Time.parse("#{date} #{time}") + (60 * offset)).strftime("%H:%M")
        schedule[date][time] = []
      when /ScheduleTitle/
        schedule[date][time] << sanitize((line/"[@class='ScheduleTitle']").inner_html)
      when /\<p\>/
        schedule[date][time] << sanitize((line/"p").inner_html)
      end
    end

    schedule
  end

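  # Returns the schedule as tab-delimited text, one programme per line.
  def to_s
    format_schedule("\t")
  end

  alias :to_tdt :to_s

  # Returns the schedule as comma-separated text.
  def to_csv
    ##TODO - Add channel to the output
    format_schedule(",")
  end

  # Prints the schedule to standard output using the given separator.
  def print_schedule(separator="||")
    print format_schedule(separator)
  end

  # Builds the schedule text so that to_s and to_csv return a String
  # instead of printing as a side effect.
  def format_schedule(separator)
    lines = []
    @schedule.keys.each do |date|
      @schedule[date].keys.each do |time|
        lines << [date, time, @schedule[date][time][0], @schedule[date][time][1]].join(separator) + "\n"
      end
    end
    lines.join
  end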

  protected

  def sanitize(string)
    string.gsub!(/\<\!\-\-.+$/, '') # remove HTML comments to the end of the line
    string.gsub!(/^\s+/, '') # remove leading whitespace
    string.gsub!(/\s+$/, '') # remove trailing whitespace
    string
  end

  def utf7(string="")
    @ic.iconv(@coder.decode(string))
  end

  def get_search_dates(period=30)
    [DateTime.now().strftime("%d %b %Y"), (DateTime.now()+period).strftime("%d %b %Y")]
  end

  def build_query_string(channel, start_date, end_date)
    urlencode({
      'channelid' => channel,
      'startDate' => start_date,
      'EndDate' => end_date}) +
      '&sType=5&searchstring=&submit=Submit'
  end

  def build_url(query_string)
    host = 'www.mnet.co.za'
    cgi = '/schedules/default.asp?'
    "http://#{host}#{cgi}#{query_string}"
  end

  def urlencode(hash)
    hash.map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
  end
end


#
# Main
#
channels = {
# "actionX" => [246, 120],
# "Activate" => [322, 120],
# "Africa Magic Channel (C-Band)" => [487, 120],
# "Animal Planet" => [417, 120],
# "B4U Movies" => [227, 120],
"BBC Food" => [284, 120],
"BBC Prime" => [121, 120],
# "BBC World" => [5, 120],
# "Bloomberg Information TV" => [8, 120],
# "Boomerang" => [314, 120],
# "BVN" => [270, 120],
# "Canal+ Horizons" => [237, 120],
# "Cartoon Network" => [13, 120],
# "Cartoon Network (Africa)" => [219, 120],
# "Cartoon Network (W4)" => [182, 120],
# "Channel O - Sound Television" => [27, 120],
# "China Central Television 4" => [15, 120],
# "China Central Television 9 (Africa)" => [226, 120],
# "CNBC" => [90, 120],
# "CNBC (Africa)" => [194, 120],
# "CNBC (W4)" => [187, 120],
# "CNN International" => [18, 120],
# "Deukom - 3SAT" => [165, 120],
# "Deukom - ARD" => [93, 120],
# "Deukom - DW" => [94, 120],
# "Deukom - PRO 7" => [164, 120],
# "Deukom - RTL" => [91, 120],
# "Deukom - SAT 1" => [92, 120],
# "Deukom - ZDF" => [95, 120],
"Discovery Channel" => [21, 120],
# "E-Entertainment" => [646, 120],
"ESPN" => [24, 120],
# "eTV" => [111, 120],
# "Fashion TV" => [145, 120],
# "Fashion TV (Africa)" => [196, 120],
# "Fashion TV (W4)" => [216, 120],
"GO" => [542, 120],
# "Go (K-World Teen)" => [341, 120],
"Hallmark Entertainment Network" => [32, 120],
"History Channel" => [484, 120],
# "History Channel (Africa)" => [485, 120],
# "K-TV World" => [36, 120],
# "KTV (Indian Bouquet)" => [501, 120],
# "kykNET" => [112, 120],
# "M-Net Domestic" => [39, 120],
"M-Net East (Africa)" => [40, 120],
"M-Net Series" => [75, 120],
# "MK89" => [592, 120],
# "Movie Magic (Africa)" => [57, 120],
"Movie Magic 2 (Africa)" => [234, 120],
# "Movie Magic 2 (W4)" => [233, 120],
# "MTV" => [42, 120],
# "MTV Base" => [69, 120],
"National Geographic" => [102, 120],
# "NDTV" => [499, 120],
# "Parliamentary Service" => [45, 120],
# "Pay Per View" => [109, 120],
"Reality TV" => [248, 120],
# "Rhema Network" => [46, 120],
# "RTPi" => [48, 120],
# "SABC 1" => [84, 120],
# "SABC 2" => [85, 120],
# "SABC 3" => [86, 120],
# "SABC Africa" => [87, 120],
# "SIC" => [255, 120],
# "Sky News" => [120, 120],
"Sony Entertainment" => [228, 90],
# "Summit" => [104, 120],
# "Sun TV" => [500, 120],
# "SuperSport" => [52, 120],
# "SuperSport 2" => [54, 120],
# "SuperSport 3" => [80, 120],
# "SuperSport 3 (W4)" => [172, 120],
# "SuperSport 5" => [208, 120],
# "SuperSport 5 (Africa)" => [252, 120],
# "SuperSport 5 (W4)" => [251, 120],
# "SuperSport 6" => [209, 120],
# "SuperSport 7 (C-Band)" => [580, 120],
# "SuperSport Zone Mosaic" => [235, 120],
# "TellyTrack" => [34, 120],
# "Travel Channel" => [61, 120],
# "Trinity Broadcasting Network" => [276, 120],
# "Turner Classic Movies" => [59, 120],
# "Turner Classic Movies (Africa)" => [60, 120],
# "Turner Classic Movies (W4)" => [181, 120],
# "TV Globo" => [254, 120],
# "TV5 Afrique" => [493, 120],
# "TV5 Afrique (Africa)" => [110, 120],
# "VH1" => [65, 120],
# "ZEE TV" => [67, 120]
}

channels.each do |name, (id, offset)|
  p "Channel: #{name}"
  schedule = DSTVSchedule.new(id, offset, 30)
  schedule.print_schedule
  print "\n\n"
end

I hope these articles have tickled your lobes and get you exploring Hpricot and the Wonderful World of Web Scraping.
