Let's change the constructor to take three parameters: the channel ID, a time offset in minutes (to account for different time zones), and the period, in days ahead of today, for which we want to gather schedule information. This lets us get rid of the custom hash class and tidy things up a little:
# offset is in minutes, period is in days
def initialize(channel=219, offset=120, period=30)
  start_date, end_date = get_search_dates(period)
  url = build_url(build_query_string(channel, start_date, end_date))
  p "Start: #{start_date} End: #{end_date} URL: #{url}"
  @hp = Hpricot(open(url))
  @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
  @coder = HTMLEntities.new
  @schedule = process_html(@hp, offset)
end

# Returns today's date and the date `period` days from now, e.g. "15 Jan 2008"
def get_search_dates(period=30)
  today = DateTime.now
  [today.strftime("%d %b %Y"), (today + period).strftime("%d %b %Y")]
end

def build_query_string(channel, start_date, end_date)
  urlencode({
    'channelid' => channel,
    'startDate' => start_date,
    'EndDate' => end_date}) +
    '&sType=5&searchstring=&submit=Submit'
end

def build_url(query_string)
  host = 'www.mnet.co.za'
  cgi = '/schedules/default.asp?'
  "http://#{host}#{cgi}#{query_string}"
end

# Turns a hash into a URL-encoded "key=value&key=value" string
def urlencode(hash)
  hash.map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
end
We no longer define the query parameters statically in the constructor, so there is no real need for the custom hash. The urlencode() method is still useful, though, so it stays on as a helper in the class.
The start and end dates for the query are calculated from today's date and the period (in days) passed to the constructor.
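As a minimal sketch of that date arithmetic, the helper below takes the reference date as an extra argument. That parameter is an assumption made here purely so the result is repeatable; the method in the class always starts from today. Adding an integer to a Date moves it forward by that many days.

```ruby
require 'date'

# Returns the start and end of the search window as "DD Mon YYYY" strings.
# `today` is a hypothetical extra parameter, not part of the original method.
def search_dates(period = 30, today = Date.today)
  fmt = "%d %b %Y"
  [today.strftime(fmt), (today + period).strftime(fmt)]
end

start_date, end_date = search_dates(30, Date.new(2008, 1, 15))
puts "Start: #{start_date} End: #{end_date}"
```

With the fixed date above, the window runs from 15 Jan 2008 to 14 Feb 2008.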
We also moved all that horrible-looking query string and URL construction code into separate methods.
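If you are on a newer Ruby, the standard library's URI.encode_www_form can stand in for the hand-rolled urlencode() helper. The sketch below is an alternative, not the code used in this article: it reuses the same parameter names and host, but it encodes spaces as "+" rather than "%20", and I have not checked how the site treats that difference.

```ruby
require 'uri'

# Builds the query string from the same parameters the article uses.
def build_query_string(channel, start_date, end_date)
  URI.encode_www_form({
    'channelid'    => channel,
    'startDate'    => start_date,
    'EndDate'      => end_date,
    'sType'        => 5,
    'searchstring' => '',
    'submit'       => 'Submit'
  })
end

def build_url(query_string)
  "http://www.mnet.co.za/schedules/default.asp?#{query_string}"
end

puts build_url(build_query_string(219, "15 Jan 2008", "14 Feb 2008"))
```

Keeping all the parameters in one hash also removes the need to glue the trailing '&sType=5...' string on by hand.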
The next step is to automate the channel schedule collection for our example program. Look at the HTML in any of the search pages and you'll see the following (excerpt):
<select name="channelid" class="ScheduleInputSelect">
<option value="" >CHANNEL</option>
<option value=246>actionX </option>
<option value=322>Activate </option>
<option value=496>Africa Magic</option>
<option value=487>Africa Magic Channel (C-Band) </option>
<option value=639>Africa Magic W4</option>
<option value=417>Animal Planet </option>
[...]
<option value=254>TV Globo </option>
<option value=493>TV5 Afrique </option>
<option value=110>TV5 Afrique (Africa) </option>
<option value=65>VH1 </option>
<option value=67>ZEE TV </option>
</select>
These are the channels that we can search for. What we need is an internal data structure that represents this information and lets us look up the channels we want. I suggest a hash keyed by channel name, where each value is a two-element array holding the channel ID and the time offset.
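To make the shape of that hash concrete, here is a small sketch using two of the channels. The constant name CHANNELS and the reverse-lookup helper are only illustrations and do not appear in the final script.

```ruby
# Channel name => [channel ID, time offset in minutes]
CHANNELS = {
  "BBC Food"           => [284, 120],
  "Sony Entertainment" => [228, 90]
}

# Look up by name: the two-element array unpacks directly
id, offset = CHANNELS["BBC Food"]
puts "BBC Food: id=#{id}, offset=#{offset} minutes"

# Hypothetical helper: returns the channel name for a given ID, or nil
def channel_name_for(channels, wanted_id)
  match = channels.find { |_name, (id, _offset)| id == wanted_id }
  match && match.first
end

puts channel_name_for(CHANNELS, 228)
```

Because the name is the key, picking a channel by name is a single hash lookup, while going from ID back to name needs a scan.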
I am lazy, so I'd prefer to avoid typing all that information up or transforming it by hand in the editor. Instead, we can use some good old command-line Ruby to chew up the HTML and spit out the code we need, which we can then cut and paste or import (depending on the editor you use).
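Copy the HTML and drop it in a file somewhere. Let's call the file in.html and run it through this command-line script (output is truncated). Note that the pattern expects whitespace before the closing tag, so an entry without it, such as Africa Magic in the excerpt above, is skipped; add those by hand if you need them: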
$ ruby -n -e '$_=~/value=(\d+)\>(.+)\s+\</;if $1&&$2 then a=$1;b=$2;print "\# \"#{b.sub(/\s+$/,"")}\" => [#{a}, 120],\n" end' < in.html | head
# "actionX" => [246, 120],
# "Activate" => [322, 120],
# "Africa Magic Channel (C-Band)" => [487, 120],
# "Animal Planet" => [417, 120],
# "B4U Movies" => [227, 120],
# "BBC Food" => [284, 120],
# "BBC Prime" => [121, 120],
# "BBC World" => [5, 120],
# "Bloomberg Information TV" => [8, 120],
# "Boomerang" => [314, 120],
[...]
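The same transformation can also be written as a short script, which is easier to tweak than the one-liner. This is a sketch rather than the command used above: the method name and the sample input are assumptions, and the pattern is deliberately looser so that entries with no space before the closing tag are picked up too.

```ruby
# Sample input taken from the <select> excerpt shown earlier.
html = <<~HTML
  <option value="" >CHANNEL</option>
  <option value=246>actionX </option>
  <option value=496>Africa Magic</option>
  <option value=487>Africa Magic Channel (C-Band) </option>
HTML

DEFAULT_OFFSET = 120 # minutes

# Returns one commented-out hash entry per <option> that has a numeric value.
def channel_lines(html, offset = DEFAULT_OFFSET)
  html.scan(/value=(\d+)>([^<]+)</).map do |id, name|
    %(# "#{name.strip}" => [#{id}, #{offset}],)
  end
end

puts channel_lines(html)
```

The placeholder option with an empty value is ignored because the pattern only accepts digits after value=.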
Now take the output and place it in your script as a hash (as described above):
channels = {
# "actionX" => [246, 120],
# "Activate" => [322, 120],
# "Africa Magic Channel (C-Band)" => [487, 120],
# "Animal Planet" => [417, 120],
# "B4U Movies" => [227, 120],
"BBC Food" => [284, 120],
"BBC Prime" => [121, 120],
# "BBC World" => [5, 120],
# "Bloomberg Information TV" => [8, 120],
# "Boomerang" => [314, 120],
# "BVN" => [270, 120],
# "Canal+ Horizons" => [237, 120],
# "Cartoon Network" => [13, 120],
# "Cartoon Network (Africa)" => [219, 120],
# "Cartoon Network (W4)" => [182, 120],
# "Channel O - Sound Television" => [27, 120],
# "China Central Television 4" => [15, 120],
# "China Central Television 9 (Africa)" => [226, 120],
# "CNBC" => [90, 120],
# "CNBC (Africa)" => [194, 120],
# "CNBC (W4)" => [187, 120],
# "CNN International" => [18, 120],
# "Deukom - 3SAT" => [165, 120],
# "Deukom - ARD" => [93, 120],
# "Deukom - DW" => [94, 120],
# "Deukom - PRO 7" => [164, 120],
# "Deukom - RTL" => [91, 120],
# "Deukom - SAT 1" => [92, 120],
# "Deukom - ZDF" => [95, 120],
"Discovery Channel" => [21, 120],
# "E-Entertainment" => [646, 120],
"ESPN" => [24, 120],
# "eTV" => [111, 120],
# "Fashion TV" => [145, 120],
# "Fashion TV (Africa)" => [196, 120],
# "Fashion TV (W4)" => [216, 120],
"GO" => [542, 120],
# "Go (K-World Teen)" => [341, 120],
"Hallmark Entertainment Network" => [32, 120],
"History Channel" => [484, 120],
# "History Channel (Africa)" => [485, 120],
# "K-TV World" => [36, 120],
# "KTV (Indian Bouquet)" => [501, 120],
# "kykNET" => [112, 120],
# "M-Net Domestic" => [39, 120],
"M-Net East (Africa)" => [40, 120],
"M-Net Series" => [75, 120],
# "MK89" => [592, 120],
# "Movie Magic (Africa)" => [57, 120],
"Movie Magic 2 (Africa)" => [234, 120],
# "Movie Magic 2 (W4)" => [233, 120],
# "MTV" => [42, 120],
# "MTV Base" => [69, 120],
"National Geographic" => [102, 120],
# "NDTV" => [499, 120],
# "Parliamentary Service" => [45, 120],
# "Pay Per View" => [109, 120],
"Reality TV" => [248, 120],
# "Rhema Network" => [46, 120],
# "RTPi" => [48, 120],
# "SABC 1" => [84, 120],
# "SABC 2" => [85, 120],
# "SABC 3" => [86, 120],
# "SABC Africa" => [87, 120],
# "SIC" => [255, 120],
# "Sky News" => [120, 120],
"Sony Entertainment" => [228, 90],
# "Summit" => [104, 120],
# "Sun TV" => [500, 120],
# "SuperSport" => [52, 120],
# "SuperSport 2" => [54, 120],
# "SuperSport 3" => [80, 120],
# "SuperSport 3 (W4)" => [172, 120],
# "SuperSport 5" => [208, 120],
# "SuperSport 5 (Africa)" => [252, 120],
# "SuperSport 5 (W4)" => [251, 120],
# "SuperSport 6" => [209, 120],
# "SuperSport 7 (C-Band)" => [580, 120],
# "SuperSport Zone Mosaic" => [235, 120],
# "TellyTrack" => [34, 120],
# "Travel Channel" => [61, 120],
# "Trinity Broadcasting Network" => [276, 120],
# "Turner Classic Movies" => [59, 120],
# "Turner Classic Movies (Africa)" => [60, 120],
# "Turner Classic Movies (W4)" => [181, 120],
# "TV Globo" => [254, 120],
# "TV5 Afrique" => [493, 120],
# "TV5 Afrique (Africa)" => [110, 120],
# "VH1" => [65, 120],
# "ZEE TV" => [67, 120]
}
You'll notice I have uncommented the channels I want (I recommend you do the same for the channels you are interested in). I also added a default time offset of 2 hours (120 minutes) for most of the channels to adjust the times for my time zone. You can change this value in the command-line Ruby filter above to suit your needs.
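To see what the offset actually does, here is a minimal sketch of the same arithmetic that process_html applies to each listed time. The helper name and the sample date and time strings are assumptions; the real input format depends on what the site returns.

```ruby
require 'time'

# Shifts a listed "HH:MM" time by `offset` minutes and returns "HH:MM".
def local_time(date, time, offset)
  (Time.parse("#{date} #{time}") + (60 * offset)).strftime("%H:%M")
end

puts local_time("15 Jan 2008", "18:30", 120)
puts local_time("15 Jan 2008", "18:30", 90)
```

Only the time of day is shifted. A programme listed at 23:30 with a 120-minute offset comes out as 01:30 but is still filed under the original date, which is worth keeping in mind when reading late-night entries.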
All we need to do now is wrap our object creation and the output from it in a loop and we're off:
channels.keys.each do |channel|
  p "Channel: #{channel}"
  schedule = DSTVSchedule.new(channels[channel][0], channels[channel][1], 30)
  schedule.print_schedule
  print "\n\n"
end
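Since each pass through the loop makes a network request, one failed fetch will abort the whole run. A possible variation, sketched below, skips a failing channel and carries on. The fetch lambda is a stand-in for DSTVSchedule.new(id, offset, 30) so that the example runs without the site, and each_schedule is a hypothetical helper, not part of the script.

```ruby
channels = {
  "BBC Food" => [284, 120],
  "ESPN"     => [24, 120]
}

# Stand-in for the real fetch; it fails for one channel on purpose
# to exercise the rescue path.
fetch = lambda do |id, offset|
  raise IOError, "timed out" if id == 24
  "schedule for channel #{id} (offset #{offset} min)"
end

# Calls `fetch` for every channel and collects the results by name,
# warning about (and skipping) any channel that raises.
def each_schedule(channels, fetch)
  results = {}
  channels.each do |name, (id, offset)|
    begin
      results[name] = fetch.call(id, offset)
    rescue StandardError => e
      warn "Skipping #{name}: #{e.message}"
    end
  end
  results
end

p each_schedule(channels, fetch)
```

The block parameters |name, (id, offset)| unpack the key and the two-element value in one go, which avoids the repeated channels[channel][0] and channels[channel][1] lookups.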
All done. Here is the complete script source listing:
#!/usr/bin/ruby
class DSTVSchedule
  require 'rubygems'
  require 'hpricot'
  require 'open-uri'
  require 'htmlentities'
  require 'iconv'
  require 'date' # DateTime
  require 'time' # Time.parse
  require 'collections/sequenced_hash'

  # channel: channel ID, offset: minutes to add to each listed time,
  # period: number of days ahead to fetch
  def initialize(channel=219, offset=120, period=30)
    start_date, end_date = get_search_dates(period)
    url = build_url(build_query_string(channel, start_date, end_date))
    p "Start: #{start_date} End: #{end_date} URL: #{url}"
    @hp = Hpricot(open(url))
    @ic = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
    @coder = HTMLEntities.new
    @schedule = process_html(@hp, offset)
  end

  # Walks the table cells and builds schedule[date][time] = [title, description]
  def process_html(hp, offset)
    schedule = SequencedHash.new
    date = ""
    time = ""
    (hp/"td").each do |line|
      case line.inner_html
      when /ScheduleChannel/
        @channel = sanitize((line/"[@class='ScheduleChannel']").inner_html)
      when /(ScheduleDate|date)/
        date = utf7((line/"[@class='ScheduleDate']|[@class=date]").inner_html)
        schedule[date] = SequencedHash.new
      when /ScheduleTime/
        time = sanitize((line/"[@class='ScheduleTime']").inner_html)
        # shift the listed time by the offset (in minutes)
        time = (Time.parse("#{date} #{time}") + (60 * offset)).strftime("%H:%M")
        schedule[date][time] = []
      when /ScheduleTitle/
        schedule[date][time] << sanitize((line/"[@class='ScheduleTitle']").inner_html)
      when /\<p\>/
        schedule[date][time] << sanitize((line/"p").inner_html)
      end
    end
    schedule
  end

  # Returns the schedule as tab-delimited text
  def to_s
    format_schedule("\t")
  end
  alias :to_tdt :to_s

  # Returns the schedule as comma-separated text
  def to_csv
    ##TODO - Add channel to the output
    format_schedule(",")
  end

  def print_schedule(separator="||")
    print format_schedule(separator)
  end

  protected

  # Builds one line per programme: date, time, title, description
  def format_schedule(separator)
    lines = []
    @schedule.keys.each do |date|
      @schedule[date].keys.each do |time|
        lines << [date, time, @schedule[date][time][0], @schedule[date][time][1]].join(separator) + "\n"
      end
    end
    lines.join
  end

  def sanitize(string)
    string.gsub!(/\<\!\-\-.+$/, '') # remove HTML comments to the end of the line
    string.gsub!(/^\s+/, '') # remove leading whitespace
    string.gsub!(/\s+$/, '') # remove trailing whitespace
    string
  end

  # Decodes HTML entities and transliterates the result to US-ASCII
  def utf7(string="")
    @ic.iconv(@coder.decode(string))
  end

  # Returns today's date and the date `period` days from now, e.g. "15 Jan 2008"
  def get_search_dates(period=30)
    today = DateTime.now
    [today.strftime("%d %b %Y"), (today + period).strftime("%d %b %Y")]
  end

  def build_query_string(channel, start_date, end_date)
    urlencode({
      'channelid' => channel,
      'startDate' => start_date,
      'EndDate' => end_date}) +
      '&sType=5&searchstring=&submit=Submit'
  end

  def build_url(query_string)
    host = 'www.mnet.co.za'
    cgi = '/schedules/default.asp?'
    "http://#{host}#{cgi}#{query_string}"
  end

  # Turns a hash into a URL-encoded "key=value&key=value" string
  def urlencode(hash)
    hash.map {|k, v| "#{URI::encode(k.to_s)}=#{URI::encode(v.to_s)}"}.join('&')
  end
end
#
# Main
#
channels = {
# "actionX" => [246, 120],
# "Activate" => [322, 120],
# "Africa Magic Channel (C-Band)" => [487, 120],
# "Animal Planet" => [417, 120],
# "B4U Movies" => [227, 120],
"BBC Food" => [284, 120],
"BBC Prime" => [121, 120],
# "BBC World" => [5, 120],
# "Bloomberg Information TV" => [8, 120],
# "Boomerang" => [314, 120],
# "BVN" => [270, 120],
# "Canal+ Horizons" => [237, 120],
# "Cartoon Network" => [13, 120],
# "Cartoon Network (Africa)" => [219, 120],
# "Cartoon Network (W4)" => [182, 120],
# "Channel O - Sound Television" => [27, 120],
# "China Central Television 4" => [15, 120],
# "China Central Television 9 (Africa)" => [226, 120],
# "CNBC" => [90, 120],
# "CNBC (Africa)" => [194, 120],
# "CNBC (W4)" => [187, 120],
# "CNN International" => [18, 120],
# "Deukom - 3SAT" => [165, 120],
# "Deukom - ARD" => [93, 120],
# "Deukom - DW" => [94, 120],
# "Deukom - PRO 7" => [164, 120],
# "Deukom - RTL" => [91, 120],
# "Deukom - SAT 1" => [92, 120],
# "Deukom - ZDF" => [95, 120],
"Discovery Channel" => [21, 120],
# "E-Entertainment" => [646, 120],
"ESPN" => [24, 120],
# "eTV" => [111, 120],
# "Fashion TV" => [145, 120],
# "Fashion TV (Africa)" => [196, 120],
# "Fashion TV (W4)" => [216, 120],
"GO" => [542, 120],
# "Go (K-World Teen)" => [341, 120],
"Hallmark Entertainment Network" => [32, 120],
"History Channel" => [484, 120],
# "History Channel (Africa)" => [485, 120],
# "K-TV World" => [36, 120],
# "KTV (Indian Bouquet)" => [501, 120],
# "kykNET" => [112, 120],
# "M-Net Domestic" => [39, 120],
"M-Net East (Africa)" => [40, 120],
"M-Net Series" => [75, 120],
# "MK89" => [592, 120],
# "Movie Magic (Africa)" => [57, 120],
"Movie Magic 2 (Africa)" => [234, 120],
# "Movie Magic 2 (W4)" => [233, 120],
# "MTV" => [42, 120],
# "MTV Base" => [69, 120],
"National Geographic" => [102, 120],
# "NDTV" => [499, 120],
# "Parliamentary Service" => [45, 120],
# "Pay Per View" => [109, 120],
"Reality TV" => [248, 120],
# "Rhema Network" => [46, 120],
# "RTPi" => [48, 120],
# "SABC 1" => [84, 120],
# "SABC 2" => [85, 120],
# "SABC 3" => [86, 120],
# "SABC Africa" => [87, 120],
# "SIC" => [255, 120],
# "Sky News" => [120, 120],
"Sony Entertainment" => [228, 90],
# "Summit" => [104, 120],
# "Sun TV" => [500, 120],
# "SuperSport" => [52, 120],
# "SuperSport 2" => [54, 120],
# "SuperSport 3" => [80, 120],
# "SuperSport 3 (W4)" => [172, 120],
# "SuperSport 5" => [208, 120],
# "SuperSport 5 (Africa)" => [252, 120],
# "SuperSport 5 (W4)" => [251, 120],
# "SuperSport 6" => [209, 120],
# "SuperSport 7 (C-Band)" => [580, 120],
# "SuperSport Zone Mosaic" => [235, 120],
# "TellyTrack" => [34, 120],
# "Travel Channel" => [61, 120],
# "Trinity Broadcasting Network" => [276, 120],
# "Turner Classic Movies" => [59, 120],
# "Turner Classic Movies (Africa)" => [60, 120],
# "Turner Classic Movies (W4)" => [181, 120],
# "TV Globo" => [254, 120],
# "TV5 Afrique" => [493, 120],
# "TV5 Afrique (Africa)" => [110, 120],
# "VH1" => [65, 120],
# "ZEE TV" => [67, 120]
}
channels.keys.each do |channel|
  p "Channel: #{channel}"
  schedule = DSTVSchedule.new(channels[channel][0], channels[channel][1], 30)
  schedule.print_schedule
  print "\n\n"
end
I hope these articles have tickled your lobes and get you to go and explore Hpricot and the Wonderful World of Web Scraping.