Getting Wukong and Pig Working Together on Amazon Elastic Map Reduce

Apache Pig is a great language for processing large amounts of data on a Hadoop cluster without delving into the minutiae of map reduce.

Wukong is a great library for writing map/reduce jobs for Hadoop in Ruby.

Together they can be really great: problems that can’t be solved in Pig without resorting to writing a custom function in Java can instead be solved by streaming data through an external script, which Wukong wraps nicely. The Data Chef blog has a great example of using Pig to choreograph the data flow and Ruby/Wukong to compute the Jaccard similarity of sets.
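To make the streaming idea concrete, here is a minimal plain-Ruby sketch of the kind of script Pig can stream through. It deliberately does not use Wukong’s API. The input layout (a key plus two comma-separated sets, tab-delimited) and the names jaccard and stream_jaccard are assumptions made up for this illustration, not taken from the Data Chef example.

```ruby
require 'set'

# Jaccard similarity: size of the intersection divided by size of the union.
def jaccard(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Reads "key<TAB>a,b,c<TAB>b,c,d" lines from input and writes
# "key<TAB>similarity" lines to output (hypothetical record layout).
def stream_jaccard(input, output)
  input.each_line do |line|
    key, left, right = line.chomp.split("\t", 3)
    a = Set.new(left.to_s.split(","))
    b = Set.new(right.to_s.split(","))
    output.puts [key, jaccard(a, b)].join("\t")
  end
end
```

In a real streaming script you would finish with a call such as stream_jaccard($stdin, $stdout), since a streamed script receives tab-delimited records on standard input and emits them on standard output.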

Working with Wukong on Elastic Map Reduce

Elastic Map Reduce is a great resource: it’s very easy to quickly have a small Hadoop cluster at your disposal to process some data. Getting Wukong working requires an extra step: installing the wukong gem on all the machines in the cluster.

Fortunately, Elastic Map Reduce allows the use of bootstrap scripts located on S3, which run on boot on every machine in the cluster. I used the following script (based on an example on Stack Overflow):

sudo apt-get update
sudo apt-get -y install rubygems
sudo gem install wukong --no-rdoc --no-ri

Using Amazon’s command line utility, starting a cluster that is ready to use in Pig interactive mode looks like this:

elastic-mapreduce --create --bootstrap-action [S3 path to] --num-instances [a number] --slave-instance-type [ machine type ] --pig-interactive --ssh

The web tool for creating clusters has a space for specifying the path to a bootstrap script.

Next step: upload your Pig script and its accompanying Wukong script to the name node, and launch the job. (It’s also possible to do all of that when starting the cluster by passing more arguments to elastic-mapreduce, with the added advantage that the cluster will terminate when your job finishes.)

(Ab)using memoize to quickly solve tricky n+1 problems

Usually, discovering n+1 problems in your Rails application that can’t be fixed with an :include statement means lots of changes to your views. Here’s a workaround that skips the view changes, which I discovered while working with Rich to improve the performance of some Dribbble pages. It uses memoize to convince your n model instances that they already have all the information needed to render the page.

While simple belongs_to relationships are easy to fix with :include, let’s take a look at a concrete example where that won’t work:

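class User < ActiveRecord::Base
  has_many :likes
end

class Item < ActiveRecord::Base
  has_many :likes

  def liked_by?(user)
    # body not shown in the original; assumed to run one query per item, e.g.:
    likes.exists?(:user_id => user.id)
  end
end

class Like < ActiveRecord::Base
  belongs_to :user
  belongs_to :item
end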

A view presenting a set of items that called Item#liked_by? on each one would be an n+1 problem that wouldn’t be well solved by :include. Instead, we’d have to come up with a single query to get the Likes by this user for the whole set of items.


Then we’d have to store that in a controller instance variable, and change all the views that called item.liked_by?(user) to access the instance variable instead.

Active Support’s memoize functionality stores the results of method calls so they’re only evaluated once. What if we could trick the method into thinking it’s already been called? We can do just that by writing data into the instance variables that memoize uses to save results on each of the model instances. First, we memoize liked_by?:

memoize :liked_by?

Then bulk load the relevant likes and stash them into memoize’s internal state:

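def precompute_data(items, user)
  likes = Like.of_item(items).by_user(user).index_by {|like| like.item_id}
  items.each do |item|
    # reconstructed (the loop body is missing from the original):
    # record the answer as if liked_by?(user) had already been called
    item.write_memo(:liked_by?, likes.has_key?(item.id), user)
  end
end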

The write_memo method is implemented as follows.

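def write_memo(method, return_value, args=nil)
  # name of the instance variable memoize uses for this method
  ivar = ActiveSupport::Memoizable.memoized_ivar_for(method)
  if args
    # methods with arguments are cached in a hash keyed by the argument list
    if hash = instance_variable_get(ivar)
      hash[Array(args)] = return_value
    else
      instance_variable_set(ivar, {Array(args) => return_value})
    end
  else
    # methods without arguments are cached as a one-element array
    instance_variable_set(ivar, [return_value])
  end
end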

The problem described here could be solved with some crafty left joins added to the query that fetched the items in the first place, but when there are several different hard-to-prefetch properties, such a query would likely become unmanageable, if not terribly slow.
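To see the trick in isolation, here is a plain-Ruby sketch with no Rails involved. Everything in it is a stand-in invented for illustration: LIKES plays the role of the likes table, QUERY_LOG counts simulated database round trips, and the memo cache is a hand-rolled hash rather than ActiveSupport’s internals.

```ruby
LIKES = [[1, 10], [1, 12]]   # [user_id, item_id] pairs standing in for the likes table
QUERY_LOG = []               # one entry per simulated database query

class Item
  attr_reader :id

  def initialize(id)
    @id = id
    @liked_by_memo = {}      # memo cache keyed by the argument list
  end

  # Memoized: only "hits the database" when the answer is not cached yet.
  def liked_by?(user_id)
    key = [user_id]
    return @liked_by_memo[key] if @liked_by_memo.key?(key)
    QUERY_LOG << "per-item lookup for item #{id}"
    @liked_by_memo[key] = LIKES.include?([user_id, id])
  end

  # Seed the cache from outside, so liked_by? believes it has already run.
  def write_memo(return_value, *args)
    @liked_by_memo[args] = return_value
  end
end

# One bulk lookup for all items, then stash each answer in the item's cache.
def precompute_likes(items, user_id)
  QUERY_LOG << "bulk lookup for #{items.size} items"
  liked_ids = LIKES.select { |uid, _| uid == user_id }.map { |_, item_id| item_id }
  items.each { |item| item.write_memo(liked_ids.include?(item.id), user_id) }
end

items = [10, 11, 12].map { |id| Item.new(id) }
precompute_likes(items, 1)
answers = items.map { |item| item.liked_by?(1) }
```

After the precompute step, the three liked_by? calls are answered from the cache, so QUERY_LOG holds only the single bulk lookup instead of one entry per item.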

Deferring Index costs for table to table copies in PostgreSQL

When bulk copying data to a table, it is much faster if the destination table is index and constraint free, because it is cheaper to build an index once than maintain it over many inserts. For postgres, the pg_restore and SQL COPY commands can do this, but they both require that data be copied from the filesystem rather than directly from another table.

For table to table copying (and transformations) the situation isn’t as straightforward. Recently I was working on a problem where we needed to perform some poor-man’s ETL, copying and transforming data between tables in different schemas. Since some of the destination tables were heavily indexed (including a full text index), the task took quite a while. In talking with a colleague about the problem, we came up with the idea of dropping the indexes and constraints prior to the data load, and restoring them afterwards.

First stop: how to get the DDL for the indices on a table in Postgres? Poking around the Postgres catalogs, I found a function, pg_get_indexdef, that returns the DDL for an index. Combining that with a query I found in a forum somewhere and altered, I came up with a query that returns the names and DDL of all the indices on a table, excluding the primary key index.

With that and the query to do the same for constraints, it’s straightforward to build a helper function that gets the DDL for all indices and constraints, drops them, yields to evaluate a block, and then restores the indices and constraints.

Because the load itself happens inside the block, this solution also allows for arbitrarily complex transformations in Ruby as well as pure SQL.

For my task, loading and transforming data into about 20 tables, doing this reduced the execution time by two-thirds. Of course, your mileage may vary depending on how heavily indexed your destination tables are.

Here’s the whole module:
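What follows is a minimal sketch of such a module, not the original listing. It covers indices only, and several things in it are assumptions: the names DeferredIndexes and without_indexes, a connection object that responds to select_rows and execute (as ActiveRecord connection adapters do), a trusted table name, and index names that resolve on the current search_path. Indices that back unique constraints would need to be dropped through their constraint instead (pg_get_constraintdef gives the matching DDL), which this sketch does not attempt.

```ruby
module DeferredIndexes
  # Names and DDL of every index on a table, except the primary key index.
  INDEX_SQL = <<-SQL
    SELECT c.relname, pg_get_indexdef(i.indexrelid)
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    JOIN pg_class t ON t.oid = i.indrelid
    WHERE t.relname = '%s' AND NOT i.indisprimary
  SQL

  # Drops the table's indices, runs the block, then recreates the indices
  # from the DDL captured beforehand. Returns the block's result.
  # `conn` is assumed to respond to select_rows(sql) and execute(sql).
  def self.without_indexes(conn, table)
    indexes = conn.select_rows(format(INDEX_SQL, table))
    indexes.each { |name, _ddl| conn.execute("DROP INDEX #{name}") }
    result = yield
    indexes.each { |_name, ddl| conn.execute(ddl) }
    result
  end
end
```

Usage would look something like DeferredIndexes.without_indexes(connection, "destination_table") { ... bulk insert here ... }. If the block raises, the indices stay dropped, so running the whole thing inside a transaction is a sensible precaution (Postgres DDL is transactional).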

Creating thumbnails of PDFs with attachment_fu

We needed to create some thumbnails from uploaded PDF files for a new site feature. We’re using attachment_fu, which doesn’t support that (yet?), but we’re using RMagick as our processor and it understands PDF files.

I came up with the hack below (warning: first draft, only briefly tested), which works without having to modify the attachment_fu plugin itself. One day I’ll loop back and figure out a cleaner way to do this, and see which of attachment_fu’s other image processors can even support PDFs.

There are three methods to override to make a go of this:

  1. self.image? : consider pdf files as an image so thumbnail process will happen
  2. thumbnail_name_for : change the extension of the saved thumbnail filename to png
  3. resize_image: override to change format via block passed to to_blob

Apologies for the crappy source formatting, I have to install a plugin to do that well one of these days

### Hacks to allow creation of PNG thumbnails for PDF uploads - depends on RMagick being the configured processor
## likely very fragile

# treat PDFs as images so the thumbnail process runs for them
def self.image?(content_type)
  (content_types + ['application/pdf']).include?(content_type)
end

alias_method :original_thumbnail_name_for, :thumbnail_name_for
def thumbnail_name_for(thumbnail=nil)
  return original_thumbnail_name_for(thumbnail) unless (content_type == 'application/pdf' && !thumbnail.blank?)
  ext = nil
  basename = filename.gsub /\.\w+$/ do |s|
    ext = s; ''
  end
  # reconstructed (this line is missing from the listing): save PDF thumbnails with a .png extension
  "#{basename}_#{thumbnail}.png"
end

#copied from rmagick_processor with change in last few lines
def resize_image(img, size)
  size = size.first if size.is_a?(Array) && size.length == 1 && !size.first.is_a?(Fixnum)
  if size.is_a?(Fixnum) || (size.is_a?(Array) && size.first.is_a?(Fixnum))
    size = [size, size] if size.is_a?(Fixnum)
    img.thumbnail!(*size) # reconstructed from attachment_fu's rmagick_processor
  else
    img.change_geometry(size.to_s) { |cols, rows, image| image.resize!(cols<1 ? 1 : cols, rows<1 ? 1 : rows) }
  end
  img.strip! unless attachment_options[:keep_profile]
  if content_type == 'application/pdf' # here force the output format to PNG if it's a PDF
    self.temp_path = write_to_temp_file(img.to_blob {self.format = 'PNG'})
  else
    self.temp_path = write_to_temp_file(img.to_blob)
  end
end

Ruby operator precedence (the ors and ands of it)

I found out (by introducing a bug into the application I’ve been working on) that “or” and “||” do not have equal precedence in Ruby.

More importantly, the assignment operator “=” has higher precedence than “or”, which means that while the expression

>> foo = nil || 2
=> 2
>> foo
=> 2

results in foo being assigned the value 2 as you might expect, the following expression leaves foo assigned the value nil.

>> foo = nil or 2
=> 2
>> foo
=> nil

This is well covered ground online (see this post) but I was surprised that this oddity didn’t warrant an explicit mention in the operator precedence section of the Pickaxe book.
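Adding the implied parentheses makes the difference easier to see. The snippet below is just an illustration of how the two forms are grouped; the variable names are arbitrary.

```ruby
a = nil || 2      # parsed as a = (nil || 2)
b = nil or 2      # parsed as (b = nil) or 2, so b stays nil
c = (nil or 2)    # explicit parentheses make `or` run before the assignment

p [a, b, c]       # prints [2, nil, 2]
```

The same grouping applies to “and” versus “&&”, which is why the word forms are usually reserved for control flow rather than for computing values.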

Boston Ruby User’s Group meeting

I attended my first Boston Ruby User’s group meeting earlier tonight. I wasn’t sure what to expect exactly, but I was surprised how many people attended (in the neighborhood of a hundred I would guess).
Both of the speakers were quite interesting.

  • David Black gave an interesting talk on the way Ruby implements inheritance, with a particular emphasis on giving objects that “spring from” the same class different behaviors without defining additional classes.
    I learned a lot from this exercise in metaprogramming because I’ve really only dabbled in Ruby so far.
  • Zed Shaw gave a really energetic, engaging and entertaining presentation touching on his HTTP server, Mongrel, its competitors, evildoers and anti-social behavior on the internet, and how he aims to address that with his Utu project.

The sessions were videotaped, so they’ll apparently be up on Google Video sometime soon. You don’t really have to know or care about Ruby to enjoy and learn from Zed’s talk.

One of the great things about living somewhere like the Boston area is that people I’ve heard of before show up at things like this. Attendees of the meeting tonight included Martin Fowler and John Resig (who wrote jQuery), along with many other folks much smarter than me.

Gas prices, state by state, with and without state taxes

In the image below I’ve plotted the average gas price in each state for 4/25/07 (data from here) with and without state per-gallon taxes included. Without the taxes included, it becomes obvious that gas prices increase on the west coast, perhaps due to transportation costs? (A quick search didn’t turn up any port-by-port oil import stats.)


I created this using ruby-shapelib and rmagick as mentioned previously.

Loading and drawing maps with Ruby

Loading geographic map data and drawing maps is pretty easy to do with two Ruby tools: ruby-shapelib (to load the map data) and RMagick (to create the drawings).

I didn’t see any tutorials or sample code, so I’m posting this sample as is – it will draw every shape part of every shape in a given shape file. Note this code does not perform any geographic projections.

require 'rubygems'
require 'RMagick'
require 'rvg/rvg'
require 'shapelib'
include ShapeLib
include Magick


def drawshape shape, canvas
  #each shape can have multiple shape parts...
  #iterate over each shape part in this shape -
  0.upto(shape.part_start.length-1) do |index|
    part_begin = shape.part_start[index]
    unless shape.part_start[index+1].nil? then
      part_end = shape.part_start[index+1]-1
    else
      #reconstructed (branch missing from the listing): the last part runs to the end of the point arrays
      part_end = shape.xvals.length-1
    end
    #NOTE we're assuming all the parts are polygons for now...
    #draw a polygon with the current subset of the xvals and yvals point arrays
    canvas.polygon(shape.xvals.slice(part_begin..part_end),shape.yvals.slice(part_begin..part_end)).styles(:fill =>"green",:stroke=>"black",:stroke_width=>0.01)
  end
end

#create a viewbox with lat/long coordinate space in the correct range
def create_canvas rvg, shapefile
  width = shapefile.maxbound[0] - shapefile.minbound[0]
  height = shapefile.maxbound[1] - shapefile.minbound[1]
  #puts "viewport #{shapefile.minbound[0]},#{shapefile.minbound[1]} - width= #{width} height= #{height}"
  #invert the y axis so "up" is bigger and map the coordinate space to the shape's bounding box
  canvas = rvg.translate(0,rvg.height).scale(1,-1).viewbox(shapefile.minbound[0],shapefile.minbound[1],width,height).preserve_aspect_ratio('xMinYMin', 'meet')
end

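#NOTE: the shapefile path and the canvas width are missing from this listing;
#SHAPEFILE_PATH and CANVAS_WIDTH are placeholders you need to define yourself
shapefile = ShapeFile.open(SHAPEFILE_PATH,"rb")
#create a new RVG object
rvg = RVG.new(CANVAS_WIDTH,100)
canvas = create_canvas rvg, shapefile
shapefile.each { |shape| drawshape(shape,canvas) }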


I’m using the US state boundary file from the National Atlas website.

Recursive deletes

Posting this little Ruby snippet so I can reference it later. Need to recursively delete directories with a certain name in a large tree? The simplest example is scrubbing those pesky .svn directories in a Subversion working copy, which can be done like so:

require 'fileutils'
Dir.glob("**/.svn/") {|fname| FileUtils.rm_r(fname) }

Another use case I have here at work is to scrub extra Maven-generated versions of code out of each Java project (so as to keep Eclipse sane). In this case, we want to delete all directories (and their contents) named “target” except for the target directory at the root (because mvn clean is “too clean” in this instance):

require 'fileutils'
Dir.glob("**/target/") {|fname| FileUtils.rm_r(fname) unless /^target.*/ =~ fname}

It gets a bit more complicated if you want to exclude a list of directories from the operation. Here I found the detect method from Ruby’s Enumerable module quite handy: it checks each directory against the list of exclusion regexes and short-circuits at the first match.

require 'fileutils'
@exclude= [/^foo.*/ , /^bar.*/ , /^james.*/, /^target.*/]
Dir.glob("**/target/") do |fname|
  # erase only when no exclusion pattern matches this path
  @erase = @exclude.detect{ |r| r =~ fname }.nil?
  if @erase
    puts "erasing #{fname}"
    FileUtils.rm_r(fname)
  else
    puts "skipping #{fname}"
  end
end
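The same idea can be wrapped in a small reusable method. This is only a sketch: remove_dirs_named is a hypothetical helper name, and it assumes the exclusion patterns are matched against paths relative to the root you pass in.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: delete every directory called `name` under `root`,
# skipping relative paths that match any regex in `exclude`.
# Returns the relative paths that were removed.
def remove_dirs_named(root, name, exclude = [])
  removed = []
  Dir.chdir(root) do
    # Collect the matches up front so nothing is deleted while globbing.
    Dir.glob("**/#{name}/").sort.each do |dir|
      next if exclude.any? { |pattern| pattern =~ dir }
      next unless File.directory?(dir)   # may already be gone with a parent
      FileUtils.rm_r(dir)
      removed << dir
    end
  end
  removed
end
```

For the Maven case above, a call might look like remove_dirs_named(".", "target", [/^target/]), which keeps the root target directory and removes the nested ones.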
