OSX VPN Problems: Kill the racoon

Occasionally my Mac will refuse to connect to work’s IPSec VPN with the error message:
“A configuration error occurred. Verify your settings and try reconnecting.”

This usually happens to me after a long time between reboots, and a reboot usually allows me to successfully connect again. Rebooting when I’m in the middle of something can be a pain, so I did some research and found a better way. There’s a process called “racoon” – it performs the key exchange operations that set up IPSec tunnels. Kill it (using kill or Activity Monitor) and your VPN will start working again.
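If you’d rather not dig through Activity Monitor, a one-liner from the terminal should do it (racoon runs as root, so sudo is needed):

sudo killall racoon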

Works on OSX 10.6.5 and 10.6.6

(Ab)using memoize to quickly solve tricky n+1 problems

Usually, discovering n+1 problems in your Rails application that can’t be fixed with an :include statement means lots of changes to your views. Here’s a workaround I discovered while working with Rich to improve the performance of some Dribbble pages; it skips the view changes entirely. It uses memoize to convince your n model instances that they already have all the information needed to render the page.

While simple belongs_to relationships are easy to fix with :include, let’s take a look at a concrete example where that won’t work:

class User < ActiveRecord::Base
  has_many :likes
end

class Item < ActiveRecord::Base
  has_many :likes

  def liked_by?(user)
    likes.by_user(user).present?
  end
end

class Like < ActiveRecord::Base
  belongs_to :user
  belongs_to :item
end
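(liked_by? and the bulk query below lean on a couple of scopes on Like that aren’t shown in the post; roughly something like this, assuming Rails 3-style scopes – use named_scope on older Rails:)

class Like < ActiveRecord::Base
  # assumed scopes, not shown in the original
  scope :by_user, lambda { |user| where(:user_id => user.id) }
  scope :of_item, lambda { |items| where(:item_id => Array(items).map(&:id)) }
end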

A view presenting a set of items and calling Item#liked_by? on each would have an n+1 query problem that :include can’t solve well. Instead, we’d have to come up with a query to get the Likes for the set of items by this user:

Like.of_item(@items).by_user(user)

Then we’d have to store that in a controller instance variable, and change all the views that called item.liked_by?(user) to access the instance variable instead.

Active Support’s memoize functionality stores the results of method calls so they’re only evaluated once. What if we could trick the method into thinking it’s already been called? We can do just that by writing data into the instance variables that memoize uses to save results on each of the model instances. First, we memoize liked_by?:

memoize :liked_by?
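For memoize to be available, the model has to extend ActiveSupport::Memoizable, so in context the Item class ends up looking something like this:

class Item < ActiveRecord::Base
  extend ActiveSupport::Memoizable

  has_many :likes

  def liked_by?(user)
    likes.by_user(user).present?
  end
  memoize :liked_by?
end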

Then bulk load the relevant likes and stash them into memoize’s internal state:

def precompute_data(items, user)
  likes = Like.of_item(items).by_user(user).index_by { |like| like.item_id }
  items.each do |item|
    item.write_memo(:liked_by?, likes[item.id].present?, user)
  end
end
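The post doesn’t say where precompute_data lives; assuming it’s defined as a class method on Item (def self.precompute_data), the controller just calls it once after loading the items and before the view renders:

# in the controller action, before rendering the list
@items = Item.all # however the page already loads its items
Item.precompute_data(@items, current_user) # current_user: your auth helper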

The write_memo method is implemented as follows.

def write_memo(method, return_value, args=nil)
  ivar = ActiveSupport::Memoizable.memoized_ivar_for(method)
  if args
    if hash = instance_variable_get(ivar)
      hash[Array(args)] = return_value
    else
      instance_variable_set(ivar, { Array(args) => return_value })
    end
  else
    instance_variable_set(ivar, [return_value])
  end
end

The problem described here could be solved with some crafty left joins added to the query that fetches the items in the first place, but when there are several different hard-to-prefetch properties, such a query would likely become unmanageable, if not terribly slow.

Idiot Calling on Twitter – Frequency of You’re vs Your

At the risk of being forever branded a grammar elitist, let’s take a quick look at the use of the phrase “your an idiot” on Twitter.

Inspired by a tweet from @doctorzaius linking to Twitter’s search page for “your an idiot”, I used Twitter’s streaming API to collect a sample of 6581 tweets containing the word “idiot” overnight, over about 12 hours.

Of these 6581 tweets, 65 contained our friend “your an idiot”. 161, two and a half times as many, contained “you’re an idiot”. Additionally, there were 2 tweets with “your such an idiot”, and just one “you’re such an idiot”. The forces of good grammar have won this round?

Note: This is a very small sample. It may be interesting to compare Facebook status updates to see what the you’re/your ratio looks like there one day…

FluidSurveys Data Export Issue, Solved with iconv

I recently ran a survey at work using FluidSurveys. Their survey building tools are excellent, and they have great support, but I ran into a time-consuming issue when it came time to process the responses, because the exported files are double-byte Unicode, UTF-16LE to be specific. Turns out knowing that is 90% of the battle.

On first inspection the files are a bit strange: although they come from a CSV export button, they’re tab-delimited, but with CSV-style quoting conventions. That’s easy enough to work around, but R and Ruby both barfed reading the files. I cottoned on to the fact that the files had some odd characters in them, so I recruited JRuby and Ruby 1.9, with their better Unicode support, to try to load them, but still couldn’t quite get the parameters right.

Then I thought of iconv, the character set converting utility. Since in this case the only special character was the ellipsis, I was happy to strip those out, and the following command does the trick:

iconv -f UTF-16LE -t US-ASCII -c responses.csv > converted_responses.csv

And, as they say, Bob’s your uncle.
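If you’d rather stay in Ruby, 1.9’s String#encode can do roughly the same thing as iconv. Here’s a sketch, assuming the export really is UTF-16LE as above:

# read the UTF-16LE export, transcode to ASCII, and drop anything unmappable
text = File.read("responses.csv", :encoding => "UTF-16LE")
ascii = text.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "")
File.open("converted_responses.csv", "w") { |f| f.write(ascii) }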

Plotting Game by Game Winning Percentages

http://www.monkeyatlarge.com/projects/drillable-stacked-time-series/
Another baseball season is upon us, and fans are quick to project the results of their favorite team from the first few games. I wondered whether many teams arrive early at a winning percentage near their whole-season result and then just oscillate around it, or whether early results tend to differ substantially from the final winning percentage.

I created an interactive plot to look at the results for the 2009 season, team by team.

Take Boston. Seen below, Boston started slow, but pretty quickly arrived at their ultimate winning level.

On the other hand, the Yankees started even slower, and in fact didn’t reach their ultimate winning level until very late in the season.

See the results for the other teams on the visualization page.

The visualization was created using Javascript and the Raphaël JS library.

Multiple Phrase Search in PostgreSQL

Tsearch, the full text search engine in PostgreSQL, is great at rapidly searching for keywords (and combinations of keywords) in large bodies of text. It does not, however, excel at matching multi-word phrases. There are some techniques to work around that and let your application leverage tsearch to find phrases.

Before I go on, I’ll credit Paul Sephton’s Understanding Full Text Search for opening my eyes to some of the possibilities to enable phrase search on top of tsearch’s existing capabilities.

Tsearch operates on tsvectors and tsqueries. A tsvector is a bag-of-words-like structure: a list of the unique words appearing in a piece of text, along with their positions in the text. Searches are performed by constructing a tsquery, a boolean expression combining words with AND (&), OR (|), and NOT (!) operators, and comparing it against candidate tsvectors with the @@ operator.

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub';

will match articles where the body contains the word meatball and the word sub. If there’s an index on to_tsvector('english', articles.body), this query is a very efficient index lookup.
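For reference, such an index is typically a GIN index on the expression, along these lines (the index name is arbitrary):

create index articles_body_fts on articles using gin(to_tsvector('english', body));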

Single Phrase Search

Now how do we match articles with the phrase “meatball sub”, anywhere in the article’s body? Doing the naive query

select * from articles where body like '%meatball sub%'

will work, but it will be slow because the leading wildcard kills any chance of using an index on that column. What we can do to make this go fast is the following:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub' AND body like '%meatball sub%'

This will use the full text index to find the set of articles where the body has both words, then that (presumably) smaller set of articles can be scanned for the words together.

Multi Phrase Search

It’s simple to extend the above query to match two phrases:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like '%meatball sub%' AND body like '%ham sandwich%';

That query can be tightened up using postgres’s support for arrays:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like ALL('{"%meatball sub%","%ham sandwich%"}')

Stepping back a bit, let’s create a table called “concepts” to allow users of an application to store searches on lists of phrases, and let’s also allow the user to specify whether all phrases must match, or just one of them.

CREATE TABLE concepts
(
  id serial,
  match_all boolean,
  phrases character varying[],
  query tsquery
);

Now we can specify and execute that previous search this way:

insert into concepts(match_all,phrases,query) VALUES(TRUE,'{"%meatball sub%","%ham sandwich%"}','meatball & sub & ham & sandwich');
select articles.* from articles join concepts on (concepts.query @@ to_tsvector('english', body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)));

Where this approach really shines compared with external text search tools is in aggregate queries, like counting up matching articles by date.

select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector('english', body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)))
group by articles.date

The logic to combine a list of phrases into the appropriate tsquery, based on whether any or all of the phrases should match, is easy to write at the application layer (a sketch follows below). It’s also desirable not to have to store the wildcards in the phrase array, and it’s easy to write a function that adds them at runtime.

CREATE OR REPLACE FUNCTION wildcard_wrapper(list varchar[]) RETURNS varchar[] AS $$
DECLARE
  return_val varchar[];
BEGIN
  for idx in 1 .. array_upper(list, 1)
  loop
    return_val[idx] := '%' || list[idx] || '%';
  end loop;
  return return_val;
END;
$$ LANGUAGE plpgsql;

With that function good to go, we can make that long query just a little longer:

select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector('english', body)) AND ((match_all AND body like ALL(wildcard_wrapper(phrases))) OR (not match_all AND body like ANY(wildcard_wrapper(phrases))))
group by articles.date

It’s straightforward to collapse most, if not all, of the SQL ON clause into a plpgsql function call without adversely affecting the query plan – the important thing is that the tsvector index stays involved in the query for adequate performance.
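As for the application-layer logic mentioned above, turning a phrase list into the tsquery string is only a few lines of Ruby. A sketch, with made-up names:

# Build the tsquery text for a list of phrases.
# match_all: every phrase must appear; otherwise any phrase may match.
def phrases_to_tsquery(phrases, match_all)
  groups = phrases.map { |phrase| phrase.split(/\s+/).join(' & ') }
  match_all ? groups.join(' & ') : groups.map { |g| "(#{g})" }.join(' | ')
end

phrases_to_tsquery(["meatball sub", "ham sandwich"], true) # => "meatball & sub & ham & sandwich"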

Further Work

This approach works well for lists of phrases. To support boolean logic on phrases, one approach might be to compile the request down to a tsquery as above, along with a regular expression to winnow down the matches to those containing the phrases.

Another MBTA visualization, this time with the commuter rail

Following up on the surprising success of my first MBTA visualization, I made a new version that adds the commuter rail lines. This does have the unfortunate effect of squishing the system’s rapid transit lines, because I’m not distorting the distances in any way. I also reduced the size of the markers, which may be too small for viewing at YouTube resolution unless watched at full screen.

Also, for Frank and Elias, I shortened the video, enlarged the numbers, and added a little visualization showing active trips sampled at 5-minute intervals, with a “you are here” indicator to give viewers a sense of how busy the system is at that moment relative to the whole day.

Day in the life of the MBTA system

I put together an animation of all the rail traffic in the course of a day on the MBTA’s red, blue, green and orange lines, including the Mattapan line. It’s a great way to see just how complicated the system is that takes me to work every day, and perhaps to be a little more patient next time things go less than perfectly!

The current version of the animation assumes stops take no time (as does the scheduling data).

I’d thought about doing this before, but it would have required screen-scraping schedule information off the MBTA’s site. I learned recently, through a developer outreach program the Massachusetts Department of Transportation is running, that the MBTA had released their schedule information in the Google Transit Feed Specification (GTFS) format. With the data in hand, I went to work using the ruby-processing wrapper of the excellent Processing graphics toolkit.
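For anyone curious about the data side, GTFS is just a bundle of CSV files, so pulling out, say, stop coordinates for drawing is only a couple of lines of Ruby (a sketch; column names come from the GTFS spec, the path is assumed):

require 'csv'

# stops.txt maps each stop_id to a name and lat/lon
stops = CSV.read('gtfs/stops.txt', :headers => true).map do |row|
  { :id => row['stop_id'], :name => row['stop_name'],
    :lat => row['stop_lat'].to_f, :lon => row['stop_lon'].to_f }
end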


See the video on YouTube.