Archive for the ‘Programming’ Category

Plotting Game by Game Winning Percentages

Tuesday, April 6th, 2010

Another baseball season is upon us, and fans are quick to project the results of their favorite team from the first few games. I wondered if many teams tend to arrive at a winning percentage near their whole-season results, and then oscillate around a little, versus having early results that differ substantially from the final winning percentage.

I created an interactive plot to look at the results for the 2009 season, team by team.

Take Boston. Seen below, Boston started slow, but pretty quickly arrived at their ultimate winning level.

On the other hand, the Yankees started even slower, and in fact didn’t reach their ultimate winning level until very late in the season.

See the results for the other teams on the visualization page.

The visualization was created using Javascript and the Raphaël JS library.

Multiple Phrase Search in PostgreSQL

Sunday, January 17th, 2010

Tsearch, the full text search engine in PostgreSql, is great at rapidly searching for keywords (and combinations of keywords) in large bodies of text. It does not, however, excel at matching multi-word phrases. There are some techniques to work around that to let your application leverage tsearch to find phrases.

Before I go on, I’ll credit Paul Sephton’s Understanding Full Text Search for opening my eyes to some of the possibilities to enable phrase search on top of tsearch’s existing capabilities.

Tsearch operates on tsvectors and tsqueries. Tsvectors are a bag of words like structure – a list of the unique words appearing in a piece of text, along with their positions in the text. Searches are performed constructing a tsquery, which is boolean expression combining words with AND(&), OR(|), and NOT(!) operators, then comparing the tsquery against candidate tsvectors with the @@ operator.

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub';

will match articles where the the body contains the word meatball and the word sub. If there’s an index on to_tsvector(‘english’,articles.body), this query is a very efficient index lookup.

Single Phrase Search

Now how do we match articles with the phrase “meatball sub”, anywhere in the article’s body? Doing the naive query

select * from articles where body like '%meatball sub%'

will work, but it will be slow because the leading wildcard kills any chance of using an index on that column. What we can do to make this go fast is the following:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub' AND body like '%meatball sub%'

This will use the full text index to find the set of articles where the body has both words, then that (presumably) smaller set of articles can be scanned for the words together.

Multi Phrase Search

It’s simple to extend the above query to match two phrases:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like '%meatball sub%' AND body like '%ham sandwich%';

That query can be tightened up using postgres’s support for arrays:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like ALL('{"%meatball sub%","%ham sandwich%"}')

Stepping back a bit, let’s define create a table called “concepts” to allow users of an application to store searches on lists of phrases, and let’s also allow the user to specify that all phrases must match, or just one of them.

CREATE TABLE concepts
(
   id serial,
   match_all boolean,
   phrases character varying[],
   query tsquery
)

Now we can specify and execute that previous search this way:

insert into concepts(match_all,phrases,query) VALUES(TRUE,'{"%meatball sub%","%ham sandwich%"}','meatball & sub & ham & sandwich');
select articles.*, join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)));

Where this approach really shines compared with an external text search tools is aggregate queries like counting up matching articles by date.

select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)))
group by articles.date

The logic to combine lists of phrases into the appropriate query based on the desire to match any or all of the phrases is easy to write at the application layer. It’s desirable not to have to include the wildcards into the phrase array, and it’s easy to write a function to do that at runtime.

CREATE OR REPLACE FUNCTION wildcard_wrapper(list varchar[]) RETURNS varchar[] AS $$
      DECLARE
       return_val varchar[];
      BEGIN
        for idx in 1 .. array_upper(list, 1)
        loop
          return_val[idx] := '%' || list[idx] || '%';
        end loop;
        return return_val;
      END;
      $$ LANGUAGE plpgsql;

With that function good to go we can make that long query just a little longer:

select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(wildcard_wrapper(phrases))) OR (not match_all AND body like ANY(wildcard_wrapper(phrases))))
group by articles.date

It’s straightforward to collapse most, if not all of the sql on clause into a plpgsql function call without adversely affecting the query plan – it’s important that the tsvector index be involved in the query for adequate performance.

Further Work

This approach works well for lists of phrases. To support boolean logic on phrases, one approach might be to compile the request down to a tsquery as above, along with a regular expression to winnow down the matches to those containing the phrases.

Deferring Index costs for table to table copies in PostgreSQL

Tuesday, April 14th, 2009

When bulk copying data to a table, it is much faster if the destination table is index and constraint free, because it is cheaper to build an index once than maintain it over many inserts. For postgres, the pg_restore and SQL COPY commands can do this, but they both require that data be copied from the filesystem rather than directly from another table.

For table to table copying (and transformations) the situation isn’t as straight-forward. Recently I was working on a problem where we needed to perform some poor-man’s ETL, copying and transforming data between tables in different schemas. Since some of the destination tables were heavily indexed(including a full text index) the task took quite a while. In talking with a colleague about the problem, we came up with the idea of dropping the indexes and constraints prior to the data load, and restoring them afterwards.

First stop: how to get the DDL for indices on a table in postgres? Poking around the postgres catalogs, I managed to find a function pg_get_indexdef that would return the DDL for an index. Combining that with a query I found in a forum somewhere and altered, I came up with this query to get the names and DDL of all the indices on a table. (this one excludes the primary key index)

With that and the query to do the same for constraints its straightforward to build a helper function that will get the DDL for all indices and constraints, drop them, yield to evaluate a block and then restore the indices and constraints. The method is below:

Use of the function would look like the snippet below. This solution would also allow for arbitrarily complex transformations in Ruby as well as pure SQL.

For my task loading and transforming data into about 20 tables, doing this reduced the execution time by two-thirds. Of course, your mileage may vary depending how heavily indexed your destination tables are.

Here’s the whole module:

PNG Thumbnails for PDF files. Take two

Tuesday, October 7th, 2008

Updating my previous post, I finished up the work of extending attachment_fu to optionally create PNG thumbnails of updated PDF files. Check out the fork on github

Creating thumbnails of PDFs with attachment_fu

Tuesday, September 16th, 2008

We needed to create some thumbnails from uploading PDF files for a new site feature – We’re using attachment_fu which doesn’t support that (yet?), but we’re using RMagick as our processor and it understands PDF files.

I came up with the hack below (warning, first draft, only briefly tested) which works without having to modify the attachment_fu plugin itself. One day I’ll loop back and figure out a cleaner way to do this and see which of attachment_fu’s other image processors can even support pdfs.

There are three methods to override to make a go of this:

  1. self.image? : consider pdf files as an image so thumbnail process will happen
  2. thumbnail_name_for : change the extension of the saved thumbnail filename to png
  3. resize_image: override to change format via block passed to to_blob

Apologies for the crappy source formatting, I have to install a plugin to do that well one of these days


###Hacks to allow creation of png thumbnails for pdf uploads - depends on RMagic being the configured processor
## likely very fragile

def self.image?(content_type)
(content_types +  ['application/pdf']).include?(content_type)
end

alias_method :o riginal_thumbnail_name_for, :thumbnail_name_for
def thumbnail_name_for(thumbnail=nil)
return original_thumbnail_name_for(thumbnail) unless (content_type == 'application/pdf' && !thumbnail.blank?)
basename = filename.gsub /\.\w+$/ do |s|
ext = s; ''
end
"#{basename}_#{thumbnail}.png"
end
#copied from rmagick_processor with change in last few lines
def resize_image(img, size)
size = size.first if size.is_a?(Array) && size.length == 1 && !size.first.is_a?(Fixnum)
if size.is_a?(Fixnum) || (size.is_a?(Array) && size.first.is_a?(Fixnum))
size = [size, size] if size.is_a?(Fixnum)
img.thumbnail!(*size)
else
img.change_geometry(size.to_s) { |cols, rows, image| image.resize!(cols<1 ? 1 : cols, rows<1 ? 1 : rows) }
end
img.strip! unless attachment_options[:keep_profile]
if content_type == 'application/pdf' # here force the output format to PNG if its a pdf
self.temp_path = write_to_temp_file(img.to_blob {self.format = 'PNG'})
else
self.temp_path = write_to_temp_file(img.to_blob)
end
end

Livepipe UI controls for JS

Thursday, August 21st, 2008

I’ve been pretty pleased so far with the popup window control from LivePipe. It plays nice with Prototype and is easy to style with regular CSS. We had considered using Prototype Window but I was put off that all their default styles looks like operating system windows and restyling their windows required a table and 9 images.

I’d recommend anyone looking for a popup window solution at least consider Livepipe. There are downsides however, chiefly that the project is pretty immature – technically I suppose this is an alpha release since Beta One is being worked on, so the community remains small. While there are some folks already submitting patches, progress on merging the patches is alarmingly slow, as one can see from their lighthouse page.

If you’re doing RESTful stuff in Rails however, you will need the contents of ticket #10 which modifies the popup window to accept an option to use different HTTP verbs.

Scriptaculous docs

Thursday, August 21st, 2008

For all the complaining I often do about the poor documentation of the scriptaculous project, I finally did something to help that today, creating (very thin) documentation for their new (if released in January is new) Effect.Tween function here on their github wiki.

I was creating a method to scroll the viewport so that the contents of an AJAX-loaded div would be fully visible on the screen – the (still undocumented) Effect.ScrollTo doesn’t quite do it because it doesn’t consider the height of the element it scrolls to, but in doing so I stumbled over Tween in the code. Once the math to figure out how much scrolling is needed, its easy to use Effect.Tween to smoothly scroll the window by repeatedly calling window.scrollBy();

This certainly isn’t rocket science, but here’s an outline of how to do it (this code only deals with downward vertical scrolling):

var elementHeight = element.getHeight();
var screenHeight = document.viewport.getHeight();
var elementScreenPos =element.viewportOffset()[1];
var amountToScroll = elementHeight - (screenHeight - elementScreenPos);
if (amountToScroll > 0){
var scrollPos = document.viewport.getScrollOffsets().top;
new Effect.Tween(null,scrollPos,scrollPos+amountToScroll,{},function(n) { window.scrollTo(0,n);});
}
}

The short circuit article

Thursday, April 10th, 2008

About three and a half years ago I co-wrote an article named “Increase stability and responsiveness by short-circuiting code” for IBM’s developer works site, and for some reason in the past few days it has repeatedly asked for attention (“hi this is 2004, your article is on the line, and its woefully dated”). First, the one page abstract we submitted fell out of a book on my bookshelf, then I was asked about it at at least one interview. Sadly, that was one of the top results for my name in google for a while and people still find it.
I figure its about time to revisit, and disavow, the implementation in the article, if that isn’t already obvious to anyone.

The idea was to provide a way to time-box operations that could take an unknown amount of time. In this way for example, a web page that must be displayed faster than a certain time can be guaranteed to run in that time, if it can do without the results of operations that take too long to execute.
One obvious flaw is that the code creates LOTS of new threads for a short period of time. It should have used a thread pool to reduce that churn.
The best reason not to use that code is that Java 1.5 introduced a whole set of Concurrency utilities. ExecutorService and Future. There are lots of examples about, so you can check them out.

The high level view is that you package your functionality in a Runnable or Callable (depending if you need to return a result), submit it to an instance of ExecutorService to run. It will return a Future object which can be queried to get the result. One can call get on the Future class, which will return right away if the task is done execuiting, or block until the sooner of a specified timeout or the task completing. Even better, one can submit multiple tasks at once with invokeAll(..) and that will return when all tasks are complete or the timeout has expired.

Interviews on Trivia

Monday, April 7th, 2008

I’ve started interviewing again now that I should be finished with my Master’s degree in a month or so. I’m reminded again of the wide range of interview styles people use. My least favorite is the trivia test. This seems to happen more often with Java-related job interviews than Ruby-related ones.

I may never understand why any employer would value memorizing the Java API over being able to reference the docs and know where to find things.

I had such an interview just last week. Here are some of those questions

  • How do you execute a PL-SQL stored procedure from JDBC?
  • How do you import classes into the classpath of a JSP page? (apparently ‘no one in their right mind does that anymore’ isn’t a good answer to this one)

Who memorizes that stuff?

My favorite question of all was this : what are the two conditions under which a finally block is not called. I got one of them, (System.exit()) but the interviewer wouldn’t even tell me the other one (“You won’t learn that way”). I googled it later to find the answer not well defined. One of the ways I saw mentioned was the thread “dying” but Thread.stop() is severely deprecated so that shouldn’t ever happen. The other answer I saw floating around doesn’t really fit – when a exception is thrown from the finally block it doesn’t complete, but the finally block is still called.

I was talking about this with Frank and he came up with another way: infinite loop in the try block. I then thought of calling PowerSystem.getMainPower().setPosition(OFF).

Now I can’t wait to get that question again!

Ruby operator precedence (the ors and ands of it)

Friday, March 14th, 2008

I found out (by introducing a bug into the application I’ve been working on) that “or” and “||” do not have equal precedence in Ruby.

More importantly, the assignment operator “=” has higher precedence than “or” so that means that while the expression


>> foo = nil || 2
=> 2
>> foo
=> 2

results in foo being assigned the value 2 as you might expect, the following expression leaves foo assigned the value nil.


>> foo = nil or 2
=> 2
>> foo
=> nil

This is well covered ground online (see this post) but I was surprised that this oddity didn’t warrant an explicit mention in the operator precedence section of the Pickaxe book.