Vinmonopolet Scraper Code

Okay, so by request here’s the Ruby script I’ve used to scrape vinmonopolet.no for the wines found in Vino.

Caveat: This is not production-grade code. It is completely hacky and might not work on your setup, and it will probably break if vinmonopolet.no changes its page structure even slightly. You will need some coding knowledge to get any use out of it. I post it here strictly for educational value, so please bear that in mind.


require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'sqlite3'
require 'iconv'

require 'HTMLEntities.rb'

class String
  def to_iso
    Iconv.conv('ISO-8859-1', 'utf-8', self)
  end
  def fixx
    # Turn non-breaking-space entities into plain spaces before decoding the rest.
    self.gsub("&nbsp;"," ").decode_entities.strip
  end
  def camelcase
    self.downcase.capitalize.gsub(/ (.)/) { " #{$1.upcase}" }
  end
end

db = SQLite3::Database.new('vino.db')

wines = {}

base = "http://www.vinmonopolet.no/is-bin/INTERSHOP.enfinity/WFS/store-vmp-Site/no_NO/-/NOK/"
nexturl = ARGV[0]
while nexturl
  puts nexturl
  doc = Hpricot(open([base,nexturl].join))
  doc.search("#productList tr").each do |row|
    next if row.at("th")
    wine = {}
    wine[:polnumber] = row.at("td.id > p").inner_html.match(/(([0-9]+))/)[1]
    wine[:volume] = row.at("td.price em").inner_html.fixx
    # Skip wines already stored; bind variables keep quotes in the values from breaking the SQL.
    if db.get_first_value("SELECT 1 FROM vino WHERE polnumber = ? AND volume = ?;", [wine[:polnumber], wine[:volume]])
      puts "XX %s" % wine[:polnumber]
      next
    else
      puts "-- %s" % wine[:polnumber]
    end
    wine[:link] = row.at("h3 a").attributes['href']
    wine[:title] = row.at("h3 a").inner_html.fixx
    wdoc = Hpricot(open(wine[:link]))
    wine[:price] = wdoc.at("td.price strong").inner_html.gsub("Kr. ","").fixx
    wdoc.search(".productData li").each do |data|
      wine[:region] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "distrikt"
      wine[:varietals] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "stoff"
      wine[:winery] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "Produsent"
      wine[:type] = data.at("span").inner_html.fixx if data.at("strong").inner_html.include? "Varetype"
    end
    wine[:vintage] = wine[:title].scan(/[0-9]{4}/).last
    db.execute("INSERT INTO vino VALUES (:title,:volume,:price,:region,:varietals,:winery,:polnumber,:type,:vintage,:link);",wine)
  end #wine
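  # Follow the "Neste" (next) link if the page has one; otherwise stop.
  lastlink = doc.at("table.pages tr > td:last a:last")
  if lastlink && lastlink.inner_html.include?("Neste")
    nexturl = lastlink.attributes['href']
  else
    # No pager or no next link: end the loop rather than refetching the same page.
    nexturl = nil
  end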
end
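
The two small text steps in the script, picking the vintage out of the title and the camelcase helper, are easy to try on their own. Here is a minimal sketch; the method names extract_vintage and title_case and the sample titles are made up for illustration and are not part of the script.

```ruby
# Standalone versions of two text steps from the script, for trying out in irb.
# The method names and sample titles are hypothetical.

# Return the last four-digit run in a title, or nil if there is none.
def extract_vintage(title)
  title.scan(/[0-9]{4}/).last
end

# Lowercase everything, then capitalize the first letter of each
# space-separated word (same expression as String#camelcase in the script).
def title_case(text)
  text.downcase.capitalize.gsub(/ (.)/) { " #{$1.upcase}" }
end

p extract_vintage("Domaine Exemple Chablis 2006")
p extract_vintage("Maison 1855 Rouge 2005")
p extract_vintage("Husets Hvitvin")
p title_case("DOMAINE EXEMPLE")
```

Because the last match wins, a four-digit number in a producer name does not get mistaken for the vintage as long as the real vintage comes later in the title. A title without any four-digit number gives nil, which ends up as NULL in the database.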

Usage: Save the file as vinoscan.rb alongside HTMLEntities.rb, open Terminal, go to the folder where it is saved, and type

ruby vinoscan.rb "v_PerformSearch-Prod?query=*&nav=true&filter25=Hvitvin&sort=21"

with v_PerformSearch… being the page you want the script to start scraping from. (To find it, go to vinmonopolet.no and look for matching URLs.) The script follows the “Neste” (next) links until it can’t find any more, and saves every wine it comes across into a SQLite database, vino.db.
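
The idea of following next links until they run out can be pictured without any network access. The sketch below is only an illustration: the PAGES hash, its keys and the crawl method are hypothetical stand-ins for the real result pages and their pager links.

```ruby
# Hypothetical stand-in for the site: each page has some rows and the
# href of its next link, which is nil on the last page.
PAGES = {
  "page1" => { rows: ["wine A", "wine B"], next_href: "page2" },
  "page2" => { rows: ["wine C"], next_href: nil }
}

# Visit pages starting at `start`, collecting rows until a page has no next link.
def crawl(start, pages)
  seen = []
  nexturl = start
  while nexturl
    page = pages.fetch(nexturl)
    seen.concat(page[:rows])
    nexturl = page[:next_href] # nil on the last page, which ends the loop
  end
  seen
end

p crawl("page1", PAGES)
```

The important part is that the loop variable becomes nil when there is no next link. If it kept its old value, the same page would be fetched again and again.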

Note that this database is not directly compatible with Vino. I have another script that cleans the data before I copy it into Vino, but that script is such a mess that I find it hard to believe anyone could gain anything from looking at it.
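
The script never creates its table, so vino.db needs a table called vino with ten columns before the first run. Below is a rough sketch of how the CREATE TABLE statement could be built. Only polnumber and volume are named in the script's SQL; the other column names simply mirror the keys in the INSERT, and the TEXT types are an assumption.

```ruby
# Column order has to match the positional VALUES list in the script's INSERT.
# Only polnumber and volume appear by name in the script's SQL; the remaining
# names and all the TEXT types are assumptions.
VINO_COLUMNS = %w[title volume price region varietals winery polnumber type vintage link]

# Build a CREATE TABLE statement for the given table and column names.
def create_table_sql(table = "vino", columns = VINO_COLUMNS)
  definitions = columns.map { |name| "#{name} TEXT" }.join(", ")
  "CREATE TABLE IF NOT EXISTS #{table} (#{definitions});"
end

puts create_table_sql

# With the sqlite3 gem loaded, the statement could be run once before scraping:
#   SQLite3::Database.new('vino.db').execute(create_table_sql)
```

Storing everything as text matches how the script scrapes the values (they are all strings pulled out of the HTML), and leaves any conversion to the cleanup step.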

Many thanks to Paul Battley for the HTML Entities library, and even more so to why the lucky stiff for the magnificent hpricot web scraping library.
