Okay, so by request, here’s the Ruby script I’ve used to scrape vinmonopolet.no for the wines found in Vino.
Caveat: This is not production-grade code. It is completely hacky and might not work for your setup, and it will probably break if vinmonopolet.no changes its page structure even slightly. You will need some coding knowledge to get any use out of it. I post it here strictly for its educational value, so please bear that in mind.
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'sqlite3'
require 'iconv'
require 'HTMLEntities.rb'

# Small String helpers used while cleaning up scraped values.
class String
  # Convert a UTF-8 string to ISO-8859-1.
  def to_iso
    Iconv.conv('ISO-8859-1', 'utf-8', self)
  end

  # Swap non-breaking spaces for ordinary ones, decode the remaining
  # HTML entities and trim. (The "&nbsp;" literal is an assumption.)
  def fixx
    self.gsub("&nbsp;", " ").decode_entities.strip
  end

  # "SOME WINE NAME" -> "Some Wine Name"
  def camelcase
    self.downcase.capitalize.gsub(/ (.)/) { " #{$1.upcase}" }
  end
end

# The database and its vino table must already exist.
db = SQLite3::Database.new('vino.db')
base = "http://www.vinmonopolet.no/is-bin/INTERSHOP.enfinity/WFS/store-vmp-Site/no_NO/-/NOK/"
nexturl = ARGV[0]

while nexturl
  puts nexturl
  doc = Hpricot(open([base, nexturl].join))

  doc.search("#productList tr").each do |row|
    next if row.at("th") # skip header rows

    wine = {}
    wine[:polnumber] = row.at("td.id > p").inner_html.match(/(([0-9]+))/)[1]
    wine[:volume] = row.at("td.price em").inner_html.fixx

    # Skip wines that are already stored. Bind parameters keep quotes
    # in the scraped values from breaking the query.
    if db.get_first_value("SELECT 1 FROM vino WHERE polnumber = ? AND volume = ?;", wine[:polnumber], wine[:volume])
      puts "XX %s" % wine[:polnumber]
      next
    else
      puts "-- %s" % wine[:polnumber]
    end

    wine[:link] = row.at("h3 a").attributes['href']
    wine[:title] = row.at("h3 a").inner_html.fixx

    # Fetch the wine's own page for the price and product data.
    wdoc = Hpricot(open(wine[:link]))
    wine[:price] = wdoc.at("td.price strong").inner_html.gsub("Kr. ", "").fixx
    wdoc.search(".productData li").each do |data|
      wine[:region] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "distrikt"
      wine[:varietals] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "stoff"
      wine[:winery] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "Produsent"
      wine[:type] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "Varetype"
    end

    # The vintage is taken to be the last four-digit number in the title.
    wine[:vintage] = wine[:title].scan(/[0-9]{4}/).last
    db.execute("INSERT INTO vino VALUES (:title,:volume,:price,:region,:varietals,:winery,:polnumber,:type,:vintage,:link);", wine)
  end # wine

  # Follow the "Neste" (next) link if there is one. Otherwise stop, so a
  # page without pagination cannot send the loop round the same URL forever.
  lastlink = doc.at("table.pages tr > td:last a:last")
  if lastlink && lastlink.inner_html.include?("Neste")
    nexturl = lastlink.attributes['href']
  else
    nexturl = nil
  end
end
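If you want to see what the title handling does without running the scraper, here is a small stand-alone sketch of two of the rules above. It uses plain methods instead of the String extensions, and the sample titles are made up, so treat it as an illustration rather than part of the script.

```ruby
# Stand-alone copies of two string rules from the script.
# The sample titles below are invented for illustration.

# Same transformation as String#camelcase in the script, as a plain method.
def camelcase(text)
  text.downcase.capitalize.gsub(/ (.)/) { " #{$1.upcase}" }
end

# Same rule as the script: the last run of four digits in the title,
# or nil when the title contains none.
def vintage_from(title)
  title.scan(/[0-9]{4}/).last
end

puts camelcase("CHATEAU EXEMPLE ROUGE")
puts vintage_from("Exemple 1882 Riserva 2006").inspect
puts vintage_from("Exemple Brut").inspect
```

Taking the last match rather than the first is what keeps a year that is part of a producer's name from being mistaken for the vintage, as long as the vintage comes at the end of the title.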
Usage: Save the file as vinoscan.rb in the same folder as HTMLEntities.rb, open Terminal, go to that folder and type
ruby vinoscan.rb "v_PerformSearch-Prod?query=*&nav=true&filter25=Hvitvin&sort=21"
with v_PerformSearch… being the page you want the script to start scraping. (To find this, go to vinmonopolet.no and look for matching URLs.) The script will follow the Next (“Neste”) links until it can’t find any more, and save all the wines it comes across into the vino table of a SQLite database, vino.db. It never creates that table itself, so the table has to exist before the first run.
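The follow-the-Next-link loop is easier to see without the HTML parsing around it. The sketch below is not the scraper itself: the PAGES hash is a hypothetical stand-in for the site, and the `visited` guard is an extra I added so that a page linking back to itself cannot keep the loop going.

```ruby
# Hypothetical stand-in for the site: each "page" lists some wine
# numbers and may name the page that follows it.
PAGES = {
  "page1" => { wines: ["1001", "1002"], next_url: "page2" },
  "page2" => { wines: ["1003"],         next_url: "page3" },
  "page3" => { wines: ["1004"],         next_url: nil }
}

# Visit pages starting at start_url and keep following next_url until
# there is none. The visited list stops the loop if a page points back
# to one that has already been seen.
def crawl(pages, start_url)
  collected = []
  visited = []
  url = start_url
  while url && !visited.include?(url)
    visited << url
    page = pages.fetch(url)
    collected.concat(page[:wines])
    url = page[:next_url]
  end
  collected
end

puts crawl(PAGES, "page1").inspect
```

In the real script the part played by `next_url` is the href of the last pagination link, and only when that link’s text contains “Neste”.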
Note that this database is not directly compatible with Vino: I have another script that cleans the data before I copy it into Vino, but that script is such a mess I find it hard to believe anyone would gain anything from looking at it.
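Since the script inserts into a table named vino without ever creating it, you will need something like the following before the first run. The column names and their order are read straight off the INSERT statement in the script; storing everything as TEXT is my assumption, so adjust the types if you know better.

```ruby
# Column names and order taken from the script's INSERT statement.
# The TEXT type for every column is an assumption.
COLUMNS = %w[title volume price region varietals winery polnumber type vintage link]

column_defs = COLUMNS.map { |name| "#{name} TEXT" }.join(", ")
create_sql = "CREATE TABLE IF NOT EXISTS vino (#{column_defs});"

puts create_sql
```

The printed statement can be pasted into the sqlite3 command-line shell opened on vino.db, or passed to `db.execute` once at the top of the script.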
Many thanks to Paul Battley for the HTML Entities library, and even more so to why the lucky stiff for the magnificent hpricot web scraping library.