<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Brain Lint &#187; Programming</title>
	<atom:link href="http://www.monkeyatlarge.com/archives/category/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.monkeyatlarge.com</link>
	<description>Random musings on life, technology and other miscellany.</description>
	<lastBuildDate>Fri, 03 Feb 2012 21:34:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Another use for generate_series: row multiplier</title>
		<link>http://www.monkeyatlarge.com/archives/2012/02/03/another-use-for-generate_series-row-multiplier/</link>
		<comments>http://www.monkeyatlarge.com/archives/2012/02/03/another-use-for-generate_series-row-multiplier/#comments</comments>
		<pubDate>Fri, 03 Feb 2012 21:33:26 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=442</guid>
		<description><![CDATA[I had a request the other day: how many simultaneous users are on the site, by time of day. I already have a session database that&#8217;s computed nightly from weblogs: it contains the times at which each session started and ended. I thought for sure the next step would be to dump some data, then [...]]]></description>
			<content:encoded><![CDATA[<p>I had a request the other day: how many simultaneous users are on the site, by time of day. I already have a session database that&#8217;s computed nightly from weblogs: it contains the times at which each session started and ended. </p>
<pre class="brush: sql; title: ; notranslate">
CREATE TABLE sessions
(
  user_id integer NOT NULL,
  start_at timestamp without time zone,
  end_at timestamp without time zone,
  duration double precision,
  views integer
)
</pre>
<p>I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.</p>
<p>Then I came up with a nice solution in SQL (Postgres). Stepping back, if I can sample from sessions at, say, one-minute intervals, I can count the number of distinct users with a session open at each minute. What I need is a row per session per minute spanned. <a href="http://www.postgresql.org/docs/9.1/static/functions-srf.html">generate_series</a> is a &#8220;set returning function&#8221; that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans. </p>
<p>From there, it&#8217;s easy to do a straightforward group by, counting distinct user_id:</p>
<pre class="brush: sql; title: ; notranslate">
with rounded_sessions as (
select user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions
where start_at between '2012-01-21' and '2012-01-28'
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1
</pre>
<p>The date_trunc call is important because it aligns every session&#8217;s rows to whole minutes. Without it, each session would produce timestamps at its own offset within the minute, and none of the rows would line up for the counts.</p>
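<p>To see the row multiplication in isolation, here is a minimal sketch using one made-up session running from 10:00:30 to 10:03:10 (the timestamps are only for illustration). With date_trunc the series lands on whole minutes; without it, every row carries the session&#8217;s own :30 offset and would not line up with rows from other sessions:</p>
<pre class="brush: sql; title: ; notranslate">
-- Truncated start: yields 10:00, 10:01, 10:02 and 10:03 (four rows)
select generate_series(date_trunc('minute', timestamp '2012-01-21 10:00:30'),
                       timestamp '2012-01-21 10:03:10',
                       interval '1 minute') as to_the_minute;

-- Untruncated start: yields 10:00:30, 10:01:30 and 10:02:30 (three rows)
select generate_series(timestamp '2012-01-21 10:00:30',
                       timestamp '2012-01-21 10:03:10',
                       interval '1 minute') as to_the_minute;
</pre>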
<p>That result won&#8217;t include minutes in which no users were logged in. To add them, the query below uses generate_series again to generate all the minutes from the first minute present to the last, then left joins the counts to that set, coalescing missing entries to zero.</p>
<pre class="brush: sql; title: ; notranslate">

with rounded_sessions as (
select user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') as to_the_minute
from sessions
where start_at between '2012-01-21' and '2012-01-28'
),
counts_by_minute as (
select to_the_minute, count(distinct user_id) from rounded_sessions
group by 1
),
all_the_minutes as (
select generate_series(min(to_the_minute), max(to_the_minute), '1 minute') as minute_fu from rounded_sessions
)

select minute_fu, coalesce(count, 0) as users from all_the_minutes
left join counts_by_minute on all_the_minutes.minute_fu = counts_by_minute.to_the_minute
</pre>
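<p>The gap-filling step can be tried on its own with a few literal rows. This is only a sketch: the counts are invented, and the column is called user_count here purely for the example. The 10:01 minute has no row in the counts, so the left join leaves it null and coalesce turns that into zero:</p>
<pre class="brush: sql; title: ; notranslate">
with counts_by_minute(to_the_minute, user_count) as (
  -- made-up counts with a gap at 10:01
  values (timestamp '2012-01-21 10:00', 3),
         (timestamp '2012-01-21 10:02', 1)
),
all_the_minutes as (
  select generate_series(timestamp '2012-01-21 10:00',
                         timestamp '2012-01-21 10:02',
                         interval '1 minute') as minute_fu
)
-- select the generated minute, not the joined one, so empty minutes keep their timestamp
select minute_fu, coalesce(user_count, 0) as users
from all_the_minutes
left join counts_by_minute on all_the_minutes.minute_fu = counts_by_minute.to_the_minute
order by minute_fu;
</pre>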
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2012/02/03/another-use-for-generate_series-row-multiplier/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Computing Distinct Items Across Sliding Windows in SQL</title>
		<link>http://www.monkeyatlarge.com/archives/2012/02/03/computing-distinct-items-across-sliding-windows-in-sql/</link>
		<comments>http://www.monkeyatlarge.com/archives/2012/02/03/computing-distinct-items-across-sliding-windows-in-sql/#comments</comments>
		<pubDate>Fri, 03 Feb 2012 21:05:40 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=425</guid>
		<description><![CDATA[As a member of PatientsLikeMe&#8216;s Data team, from time to time we&#8217;re asked to compute how many unique users did action X on the site within a date range, say 28 days, or several date ranges (1,14,28 days for example). It&#8217;s easy enough to do that for a given day, but to do that for [...]]]></description>
			<content:encoded><![CDATA[<p>As a member of <a href="http://www.patientslikeme.com">PatientsLikeMe</a>&#8217;s Data team, I&#8217;m asked from time to time to compute how many unique users did action X on the site within a date range, say 28 days, or several date ranges (1, 14, 28 days for example). It&#8217;s easy enough to do that for a given day, but doing it for every day over a span of time (in one query) took some thinking. Here&#8217;s what I came up with.</p>
<p><strong>One day at a time</strong></p>
<p>First, a simplified example table:</p>
<pre class="brush: sql; title: ; notranslate">
create table events (
  user_id integer,
  event varchar,
  date date
)
</pre>
<p>Getting unique user counts by event for any given window is easy. Below, we&#8217;ll get the counts of unique users by event for the 7 days up to and including Valentine&#8217;s Day:</p>
<pre class="brush: sql; title: ; notranslate">
select count(distinct user_id), event from events
where date between '2011-02-08' and '2011-02-14'
group by 2
</pre>
<p><strong>Now Do That For Every Day</strong></p>
<p>The simplest thing that could possibly work is to just issue that query once for each day in the desired time span. We&#8217;re looking for something faster, and a bit more elegant.</p>
<p>Stepping back a bit, for a seven day time window, we&#8217;re asking that an event on 2/7/2011 count for that day, and also count for the 6 following days &#8211; effectively we&#8217;re mapping the events of each day onto itself and 6 other days. That sounds like a SQL join waiting to happen. Once the join happens, it&#8217;s easy to group by the mapped date, and do a distinct count.</p>
<p>With a table like the one below</p>
<table>
<thead>
<tr>
<th>from_date</th>
<th>to_date</th>
</tr>
</thead>
<tbody>
<tr>
<td>2011-01-01</td>
<td>2011-01-01</td>
</tr>
<tr>
<td>2011-01-01</td>
<td>2011-01-02</td>
</tr>
<tr>
<td>2011-01-01</td>
<td>2011-01-03</td>
</tr>
<tr>
<td>2011-01-01</td>
<td>2011-01-04</td>
</tr>
<tr>
<td>2011-01-01</td>
<td>2011-01-05</td>
</tr>
<tr>
<td>2011-01-01</td>
<td>2011-01-06</td>
</tr>
<tr>
<td>2011-01-01</td>
<td>2011-01-07</td>
</tr>
<tr>
<td>2011-01-02</td>
<td>2011-01-02</td>
</tr>
<tr>
<td colspan='2'>&#8230;</td>
</tr>
</tbody>
</table>
<p>This SQL becomes easy.</p>
<pre class="brush: sql; title: ; notranslate">
select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
group by 1,2
</pre>
<table>
<thead>
<tr>
<th>to_date</th>
<th>event</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan='3'>&#8230;</td>
</tr>
<tr>
<td>2011-01-05</td>
<td>bar</td>
<td>20</td>
</tr>
<tr>
<td>2011-01-05</td>
<td>baz</td>
<td>27</td>
</tr>
<tr>
<td>2011-01-05</td>
<td>foo</td>
<td>24</td>
</tr>
<tr>
<td>2011-01-06</td>
<td>bar</td>
<td>31</td>
</tr>
<tr>
<td colspan='3'>&#8230;</td>
</tr>
</tbody>
</table>
<p>You&#8217;ll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.<br />
That works for me on PostgreSQL 8.4. Your mileage may vary with other databases.</p>
<p><strong>How Do I Get One of Those?</strong><br />
A dates table like that is a one-liner using the generate_series function:</p>
<pre class="brush: sql; title: ; notranslate">
select date::date as from_date, date::date+plus_day as to_date from
 generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
 generate_series(0,6,1) as plus_day ;
</pre>
<p>There we get the Cartesian product of the set of dates in the desired range, and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets, and you&#8217;re done.</p>
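<p>Putting the pieces together, the dates table does not have to be stored at all. The sketch below inlines it as a common table expression and trims the ends in the same query. It assumes the events table holds data for the whole of January and February 2011; the trim dates follow from that assumption, since 2011-01-07 is the first day with a full 7-day window behind it:</p>
<pre class="brush: sql; title: ; notranslate">
with dates_plus_7 as (
  -- every date in the range paired with itself and the 6 days after it
  select d::date as from_date, d::date + plus_day as to_date
  from generate_series('2011-01-01'::date, '2011-02-28'::date, interval '1 day') as d,
       generate_series(0, 6, 1) as plus_day
)
select to_date, event, count(distinct user_id)
from events
join dates_plus_7 on events.date = dates_plus_7.from_date
-- keep only the days whose window lies entirely inside the data
where to_date between '2011-01-07' and '2011-02-28'
group by 1, 2
order by 1, 2;
</pre>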
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2012/02/03/computing-distinct-items-across-sliding-windows-in-sql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting Wukong and Pig Working Together on Amazon Elastic Map Reduce</title>
		<link>http://www.monkeyatlarge.com/archives/2011/03/16/getting-wukong-and-pig-working-together-on-amazon-elastic-map-reduce/</link>
		<comments>http://www.monkeyatlarge.com/archives/2011/03/16/getting-wukong-and-pig-working-together-on-amazon-elastic-map-reduce/#comments</comments>
		<pubDate>Wed, 16 Mar 2011 17:06:26 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Pig]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=416</guid>
		<description><![CDATA[Apache Pig is a great language for processing large amounts of data on a Hadoop cluster without delving into the minutiae of map reduce. Wukong is a great library to write map/reduce jobs for Hadoop from ruby. Together they can be really great, because problems unsolvable in pig without resorting to writing a custom function in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://pig.apache.org/">Apache Pig</a> is a great language for processing large amounts of data on a Hadoop cluster without delving into the minutiae of map reduce. </p>
<p><a href="https://github.com/mrflip/wukong">Wukong</a> is a great library to write map/reduce jobs for Hadoop from ruby. </p>
<p>Together they can be really great, because problems unsolvable in Pig without resorting to writing a custom function in Java can be solved by streaming data through an external script, which Wukong nicely wraps. The Data Chef blog has a <a href="http://thedatachef.blogspot.com/2011/02/brute-force-graph-crunching-with-pig.html">great example</a> of using Pig to choreograph the data flow, and Ruby/Wukong to compute Jaccard Similarity of sets.</p>
<h3>Working with Wukong on Elastic Map Reduce</h3>
<p>Elastic Map Reduce is a great resource &#8211; it&#8217;s very easy to quickly have a small Hadoop cluster at your disposal to process some data. Getting Wukong working requires an extra step: installing the wukong gem on all the machines in the cluster.</p>
<p>Fortunately, elastic map reduce allows the use of bootstrap scripts located on S3, which run on boot for all the machines in the cluster. I used the following script (based on an example on <a href="http://stackoverflow.com/questions/4336842/when-bootstrapping-an-amazon-elastic-map-reduce-job-can-my-script-use-sudo">stackoverflow</a>):</p>
<pre class="brush: plain; title: ; notranslate">
sudo apt-get update
sudo apt-get -y install rubygems
sudo gem install wukong --no-rdoc --no-ri
</pre>
<p>Using Amazon&#8217;s command line utility, starting the cluster ready to use in Pig interactive mode looks like this:</p>
<p>elastic-mapreduce --create --bootstrap-action [S3 path to wukong-bootstrap.sh] --num-instances [a number] --slave-instance-type [ machine type ] --pig-interactive --ssh</p>
<p>The web tool for creating clusters has a space for specifying the path to a bootstrap script.</p>
<p>Next step: upload your Pig script and its accompanying Wukong script to the name node, and launch the job. (It&#8217;s also possible to do all of that when starting the cluster with more arguments to elastic-mapreduce, with the added advantage that the cluster will terminate when your job finishes.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2011/03/16/getting-wukong-and-pig-working-together-on-amazon-elastic-map-reduce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redundant Indexing in PostgreSQL</title>
		<link>http://www.monkeyatlarge.com/archives/2011/02/08/redundant-indexing-in-postgresql/</link>
		<comments>http://www.monkeyatlarge.com/archives/2011/02/08/redundant-indexing-in-postgresql/#comments</comments>
		<pubDate>Tue, 08 Feb 2011 23:17:56 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=406</guid>
		<description><![CDATA[If you have a table with a column included as the first column in a multi-column index and then again with its own index, you may be over-indexing. Postgres will use the multi-column index for queries on the first column. From the docs A multicolumn B-tree index can be used with query conditions that [...]]]></description>
			<content:encoded><![CDATA[<p>If you have a table with a column included as the first column in a multi-column index and then again with its own index, you may be over-indexing. Postgres will use the multi-column index for queries on the first column. </p>
<p>From the <a href="http://www.postgresql.org/docs/8.4/static/indexes-multicolumn.html">docs</a></p>
<blockquote><p>A multicolumn B-tree index can be used with query conditions that involve any subset of the index&#8217;s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.</p></blockquote>
<p><strong><br />
Performance</strong></p>
<p>If you click around that section of the docs, you&#8217;ll surely come across the section on multi-column indexing and performance, in particular this <a href="http://www.postgresql.org/docs/8.4/static/indexes-bitmap-scans.html">section</a> (bold emphasis mine):</p>
<blockquote><p>You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. <strong>For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone</strong></p></blockquote>
<p>Life is full of performance tradeoffs, so we should explore just how much slower it is to use a multi-column index for single-column queries.</p>
<p>First, let&#8217;s create a dummy table:</p>
<pre class="brush: sql; title: ; notranslate">
CREATE TABLE foos_and_bars
(
  id serial NOT NULL,
  foo_id integer,
  bar_id integer,
  CONSTRAINT foos_and_bars_pkey PRIMARY KEY (id)
)
</pre>
<p>Then, using R, we&#8217;ll create 3 million rows of nicely distributed data:</p>
<pre class="brush: r; title: ; notranslate">
rows = 3000000
foo_ids = seq(1,250000,1)
bar_ids = seq(1,20,1)
data = data.frame(foo_id = sample(foo_ids, rows,TRUE), bar_id= sample(bar_ids,rows,TRUE))
</pre>
<p>Dump that to a text file and load it up with \copy and we&#8217;re good to go.</p>
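<p>If you would rather skip the round trip through R, roughly equivalent data can be generated inside Postgres. This is an alternative sketch, not the method used for the numbers in this post; it draws foo_id uniformly from 1 to 250000 and bar_id from 1 to 20:</p>
<pre class="brush: sql; title: ; notranslate">
-- random() returns a value in [0, 1), so floor(random() * n) is 0 .. n-1
insert into foos_and_bars (foo_id, bar_id)
select 1 + floor(random() * 250000)::int,
       1 + floor(random() * 20)::int
from generate_series(1, 3000000);

-- refresh planner statistics after the bulk load
analyze foos_and_bars;
</pre>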
<p>Create the compound index</p>
<pre class="brush: sql; title: ; notranslate">
CREATE INDEX foo_id_and_bar_id_index
  ON foos_and_bars
  USING btree
  (foo_id, bar_id);
</pre>
<p>Run a simple query to make sure the index is used:</p>
<pre class="brush: plain; title: ; notranslate">
test_foo=# explain analyze select * from foos_and_bars where foo_id = 123;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foos_and_bars  (cost=4.68..55.74 rows=13 width=12) (actual time=0.026..0.038 rows=8 loops=1)
   Recheck Cond: (foo_id = 123)
   -&gt;  Bitmap Index Scan on foo_id_and_bar_id_index  (cost=0.00..4.68 rows=13 width=0) (actual time=0.020..0.020 rows=8 loops=1)
         Index Cond: (foo_id = 123)
 Total runtime: 0.072 ms
(5 rows)
</pre>
<p>Now we&#8217;ll make 100 queries by foo_id with this index, and then repeat with a single-column index on foo_id installed instead, using this code:</p>
<pre class="brush: ruby; title: ; notranslate">
require 'rubygems'
require 'benchmark'
require 'pg'

TEST_IDS = [...] #randomly selected 100 ids in R

conn = PG.connect(:dbname =&gt; 'test_foo') # PGconn.open in older versions of the pg gem
def perform_test(conn,foo_id)
  time = Benchmark.realtime do
    res = conn.exec(&quot;select * from foos_and_bars where foo_id = #{foo_id}&quot;)
    res.clear
  end
end

TEST_IDS.map {|id| perform_test(conn,id)} #warm things up?
data = TEST_IDS.map {|id| perform_test(conn,id)}

data.each do |d|
puts d
end
</pre>
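<p>For the second run, the compound index has to be swapped for a single-column one. A sketch of that step, where foo_id_index is a name made up for this example:</p>
<pre class="brush: sql; title: ; notranslate">
-- remove the compound index so the planner cannot fall back on it
DROP INDEX foo_id_and_bar_id_index;

CREATE INDEX foo_id_index
  ON foos_and_bars
  USING btree
  (foo_id);

ANALYZE foos_and_bars;
</pre>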
<p>How do things stack up? I&#8217;d say about evenly:</p>
<p><a href="http://www.monkeyatlarge.com/blog/wp-content/uploads/2011/02/query-perf-comparison.png"><img src="http://www.monkeyatlarge.com/blog/wp-content/uploads/2011/02/query-perf-comparison.png" alt="" title="query-perf-comparison" width="569" height="452" class="aligncenter size-full wp-image-410" /></a></p>
<p>Remember: Indexing isn&#8217;t free, and Postgres is pretty good at using (and reusing) your indexes, so you may not need to create as many as you think. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2011/02/08/redundant-indexing-in-postgresql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>(Ab)using memoize to quickly solve tricky n+1 problems</title>
		<link>http://www.monkeyatlarge.com/archives/2010/12/08/abusing-memoize-to-quickly-solve-tricky-n1-problems/</link>
		<comments>http://www.monkeyatlarge.com/archives/2010/12/08/abusing-memoize-to-quickly-solve-tricky-n1-problems/#comments</comments>
		<pubDate>Thu, 09 Dec 2010 03:41:23 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=385</guid>
		<description><![CDATA[Usually, discovering n+1 problems in your Rails application that can&#8217;t be fixed with an :include statement means lots of changes to your views. Here&#8217;s a workaround that skips the view changes that I discovered working with Rich to improve performance of some Dribbble pages. It uses memoize to convince your n model instances that they [...]]]></description>
			<content:encoded><![CDATA[<p>Usually, discovering n+1 problems in your Rails application that can&#8217;t be fixed with an :include statement means lots of changes to your views. Here&#8217;s a workaround that skips the view changes that I discovered working with <a href="http://twitter.com/frogandcode">Rich</a> to improve performance of some <a href="http://dribbble.com/">Dribbble</a> pages.  It uses memoize to convince your n model instances that they already have all the information needed to render the page.</p>
<p>While simple belongs_to relationships are easy to fix with :include, let&#8217;s take a look at a concrete example where that won&#8217;t work:</p>
<pre class="brush: ruby; light: true; title: ; notranslate">
class User &lt; ActiveRecord::Base
  has_many :likes
end

class Item &lt; ActiveRecord::Base
  has_many :likes
  def liked_by?(user)
     likes.by_user(user).present?
  end
end

class Like &lt; ActiveRecord::Base
  belongs_to :user
  belongs_to :item
end
</pre>
<p>A view presenting a set of items that called Item#liked_by? would be an n+1 problem that wouldn&#8217;t be well solved by :include.  Instead, we&#8217;d have to come up with a query to get the Likes for the set of items by this user:</p>
<pre class="brush: ruby; light: true; title: ; notranslate">
Like.of_item(@items).by_user(user)
</pre>
<p>Then we&#8217;d have to store that in a controller instance variable, and change all the views that called item.liked_by?(user) to access the instance variable instead.</p>
<p>Active Support&#8217;s memoize functionality stores the results of function calls so they&#8217;re only evaluated once. What if we could trick the method into thinking it&#8217;s already been called? We can do just that by writing data into the instance variables that memoize uses to save results on each of the model instances. First, in the Item class, we memoize liked_by?:</p>
<pre class="brush: ruby; light: true; title: ; notranslate">
  extend ActiveSupport::Memoizable # provides the memoize class method
  memoize :liked_by?
</pre>
<p>Then bulk load the relevant likes and stash them into memoize&#8217;s internal state:</p>
<pre class="brush: ruby; light: true; title: ; notranslate">
def precompute_data(items, user)
  likes = Like.of_item(items).by_user(user).index_by {|like| like.item_id}
  items.each do |item|
    item.write_memo(:liked_by?,likes[item.id].present?,user)
  end
end
</pre>
<p>The write_memo method is implemented as follows.</p>
<pre class="brush: ruby; light: true; title: ; notranslate">
  def write_memo(method, return_value, args=nil)
    ivar = ActiveSupport::Memoizable.memoized_ivar_for(method)
    if args
      if hash = instance_variable_get(ivar)
        hash[Array(args)] = return_value
      else
        instance_variable_set(ivar, {Array(args) =&gt; return_value})
      end
    else
      instance_variable_set(ivar, [return_value])
    end
  end
</pre>
<p>The problem described here could be solved with some crafty left joins added to the query that fetched the items in the first place, but when there are several different hard-to-prefetch properties, such a query would likely become unmanageable, if not terribly slow.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2010/12/08/abusing-memoize-to-quickly-solve-tricky-n1-problems/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>FluidSurveys Data Export Issue, Solved with iconv</title>
		<link>http://www.monkeyatlarge.com/archives/2010/11/24/fluidsurveys-data-export-issue-solved-with-iconv/</link>
		<comments>http://www.monkeyatlarge.com/archives/2010/11/24/fluidsurveys-data-export-issue-solved-with-iconv/#comments</comments>
		<pubDate>Wed, 24 Nov 2010 16:28:35 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=373</guid>
		<description><![CDATA[I recently ran a survey at work using FluidSurveys. Their survey building tools are excellent, and they have great support, but I ran into a time consuming issue when it came time to process the responses because they&#8217;re double byte unicode, UTF-16LE to be specific. Turns out knowing that is 90% of the battle. The [...]]]></description>
			<content:encoded><![CDATA[<p>I recently ran a survey at work using <a href="http://fluidsurveys.com/">FluidSurveys</a>.  Their survey building tools are excellent, and they have great support, but I ran into a time-consuming issue when it came time to process the responses because they&#8217;re double-byte Unicode, UTF-16LE to be specific. Turns out knowing that is 90% of the battle.</p>
<p>The files on first inspection are a bit strange, because although they spring from a CSV export button, they&#8217;re tab-delimited, but with CSV-style quoting conventions. That&#8217;s easy enough to work around, but R and Ruby both barfed reading the files. I cottoned on to the fact that the files had some odd characters in them, so I recruited JRuby and Ruby 1.9 to try to load them, due to better Unicode support, but still couldn&#8217;t quite get the parameters right.</p>
<p>Then I thought of <a href="http://www.gnu.org/software/libiconv/">iconv</a>, the character set converting utility. Since the only special character in this case was the ellipsis, I was happy to strip it out. The following command does the trick; the -c flag drops any characters that can&#8217;t be represented in the target encoding:</p>
<pre class="brush: bash; light: true; title: ; notranslate">
iconv -f UTF-16LE -t US-ASCII -c responses.csv &gt; converted_responses.csv
</pre>
<p>And, as they say, Bob&#8217;s your uncle.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2010/11/24/fluidsurveys-data-export-issue-solved-with-iconv/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Plotting Game by Game Winning Percentages</title>
		<link>http://www.monkeyatlarge.com/archives/2010/04/06/plotting-game-by-game-winning-percentages/</link>
		<comments>http://www.monkeyatlarge.com/archives/2010/04/06/plotting-game-by-game-winning-percentages/#comments</comments>
		<pubDate>Wed, 07 Apr 2010 01:49:53 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Javascript]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=353</guid>
		<description><![CDATA[Another baseball season is upon us, and fans are quick to project the results of their favorite team from the first few games. I wondered if many teams tend to arrive at a winning percentage near their whole-season results, and then oscillate around a little, versus having early results that differ substantially from the final [...]]]></description>
			<content:encoded><![CDATA[<p>Another baseball season is upon us, and fans are quick to project the results of their favorite team from the first few games. I wondered if many teams tend to arrive at a winning percentage near their whole-season results, and then oscillate around a little, versus having early results that differ substantially from the final winning percentage.</p>
<p>I created an interactive plot to look at the results for the 2009 season, team by team.</p>
<p>Take Boston. Seen below, Boston started slow, but pretty quickly arrived at their ultimate winning level.<br />
<img src="http://www.monkeyatlarge.com/blog/wp-content/uploads/2010/04/bos-chart3.png" alt="" title="bos-chart" width="499" height="248" class="aligncenter size-full wp-image-359" /></p>
<p>On the other hand, the Yankees started even slower, and in fact didn&#8217;t reach their ultimate winning level until very late in the season.<br />
<img src="http://www.monkeyatlarge.com/blog/wp-content/uploads/2010/04/nyy-chart.png.png" alt="" title="nyy-chart.png" width="498" height="249" class="aligncenter size-full wp-image-357" /></p>
<p>See the results for the other teams <a href="http://www.monkeyatlarge.com/projects/baseball-game-winning-pct/">on the visualization page</a>.</p>
<p>The visualization was created using JavaScript and the <a href="http://raphaeljs.com/">Raphaël JS</a> library.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2010/04/06/plotting-game-by-game-winning-percentages/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Multiple Phrase Search in PostgreSQL</title>
		<link>http://www.monkeyatlarge.com/archives/2010/01/17/multiple-phrase-search-in-postgresql/</link>
		<comments>http://www.monkeyatlarge.com/archives/2010/01/17/multiple-phrase-search-in-postgresql/#comments</comments>
		<pubDate>Mon, 18 Jan 2010 01:34:27 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=341</guid>
		<description><![CDATA[Tsearch, the full text search engine in PostgreSql, is great at rapidly searching for keywords (and combinations of keywords) in large bodies of text. It does not, however, excel at matching multi-word phrases. There are some techniques to work around that to let your application leverage tsearch to find phrases. Before I go on, I&#8217;ll [...]]]></description>
			<content:encoded><![CDATA[<p>Tsearch, the full text search engine in PostgreSQL, is great at rapidly searching for keywords (and combinations of keywords) in large bodies of text. It does not, however, excel at matching multi-word phrases. There are some techniques to work around that to let your application leverage tsearch to find phrases. </p>
<p>Before I go on, I&#8217;ll credit Paul Sephton&#8217;s <a href="http://linuxgazette.net/164/sephton.html">Understanding Full Text Search</a> for opening my eyes to some of the possibilities to enable phrase search on top of tsearch&#8217;s existing capabilities.</p>
<p>Tsearch operates on tsvectors and tsqueries. A tsvector is a bag-of-words-like structure &#8211; a list of the unique words appearing in a piece of text, along with their positions in the text. Searches are performed by constructing a tsquery, which is a boolean expression combining words with AND(&#038;), OR(|), and NOT(!) operators, then comparing the tsquery against candidate tsvectors with the @@ operator.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select * from articles where to_tsvector('english',articles.body) @@ 'meatball &amp; sub';
</pre>
<p>will match articles where the body contains the word meatball and the word sub. If there&#8217;s an index on to_tsvector(&#8216;english&#8217;,articles.body), this query is a very efficient index lookup.</p>
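<p>For reference, an expression index of that kind could be created as sketched below. The index name is made up for this example. Note that the planner only considers an expression index when the query uses the same expression, so the two-argument form with the 'english' configuration should appear in the query as well:</p>
<pre class="brush: sql; light: true; title: ; notranslate">
-- articles_body_fts_idx is a hypothetical name
CREATE INDEX articles_body_fts_idx
  ON articles
  USING gin (to_tsvector('english', body));
</pre>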
<h3>
Single Phrase Search</h3>
<p>Now how do we match articles with the phrase &#8220;meatball sub&#8221;, anywhere in the article&#8217;s body? Doing the naive query</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select * from articles where body like '%meatball sub%'
</pre>
<p>will work, but it will be slow because the leading wildcard kills any chance of using an index on that column. What we can do to make this go fast is the following:</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select * from articles where to_tsvector('english',articles.body) @@ 'meatball &amp; sub' AND body like '%meatball sub%'
</pre>
<p>This will use the full text index to find the set of articles where the body has both words, then that (presumably) smaller set of articles can be scanned for the words together.</p>
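<p>A small sketch with two made-up bodies shows why both conditions are needed. Each body contains both words, so the full text condition alone matches either one, while only the first contains the actual phrase. One caveat to keep in mind: like is case sensitive while the full text match is not, so ilike may be the better fit if capitalization varies.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select body,
       to_tsvector('english', body) @@ to_tsquery('english', 'meatball &amp; sub') as has_both_words,
       body like '%meatball sub%' as has_phrase
from (values ('I had a meatball sub for lunch'),
             ('The sub shop was out of meatball sauce')) as t(body);
-- first row:  has_both_words = true, has_phrase = true
-- second row: has_both_words = true, has_phrase = false
</pre>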
<h3>Multi Phrase Search</h3>
<p>It&#8217;s simple to extend the above query to match two phrases:</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select * from articles where to_tsvector('english',articles.body) @@ 'meatball &amp; sub &amp; ham &amp; sandwich' AND body like '%meatball sub%' AND body like '%ham sandwich%';
</pre>
<p>That query can be tightened up using postgres&#8217;s support for arrays:</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select * from articles where to_tsvector('english',articles.body) @@ 'meatball &amp; sub &amp; ham &amp; sandwich' AND body like ALL('{&quot;%meatball sub%&quot;,&quot;%ham sandwich%&quot;}')
</pre>
<p>Stepping back a bit, let&#8217;s create a table called &#8220;concepts&#8221; to allow users of an application to store searches on lists of phrases, and let&#8217;s also allow the user to specify that all phrases must match, or just one of them.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
CREATE TABLE concepts
(
   id serial,
   match_all boolean,
   phrases character varying[],
   query tsquery
)
</pre>
<p>Now we can specify and execute that previous search this way:</p>
<pre class="brush: sql; light: true; title: ; notranslate">
insert into concepts(match_all,phrases,query) VALUES(TRUE,'{&quot;%meatball sub%&quot;,&quot;%ham sandwich%&quot;}','meatball &amp; sub &amp; ham &amp; sandwich');
select articles.* from articles join concepts on (concepts.query @@ to_tsvector('english',body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)));
</pre>
<p>Where this approach really shines compared with external text search tools is in aggregate queries, like counting up matching articles by date.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector('english',body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)))
group by articles.date
</pre>
<p>The logic to combine lists of phrases into the appropriate query, based on whether any or all of the phrases should match, is easy to write at the application layer. It&#8217;s desirable not to have to include the wildcards in the phrase array, and it&#8217;s easy to write a function to add them at runtime.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
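CREATE OR REPLACE FUNCTION wildcard_wrapper(list varchar[]) RETURNS varchar[] AS $$
DECLARE
  return_val varchar[];
BEGIN
  -- array_upper returns NULL for an empty or NULL array, and a FOR loop
  -- cannot take a NULL bound, so fall back to 0 (the loop body is skipped)
  for idx in 1 .. coalesce(array_upper(list, 1), 0)
  loop
    -- wrap each phrase in wildcards so it can be used with LIKE
    return_val[idx] := '%' || list[idx] || '%';
  end loop;
  return return_val;
END;
$$ LANGUAGE plpgsql;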
</pre>
<p>With that function good to go we can make that long query just a little longer:</p>
<pre class="brush: sql; light: true; title: ; notranslate">
select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector('english',body)) AND ((match_all AND body like ALL(wildcard_wrapper(phrases))) OR (not match_all AND body like ANY(wildcard_wrapper(phrases))))
group by articles.date
</pre>
<p>It&#8217;s straightforward to collapse most, if not all, of the SQL ON clause into a plpgsql function call without adversely affecting the query plan &#8211; it&#8217;s important that the tsvector index be involved in the query for adequate performance.</p>
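<p>The step of turning a phrase list into a tsquery can also be sketched directly in SQL rather than at the application layer, assuming the built-in plainto_tsquery function and the tsquery &amp;&amp; and || operators suit the job. plainto_tsquery ANDs together the words of one phrase and normalizes them with the named configuration, so the result lines up with what to_tsvector(&#8216;english&#8217;,body) stores, whereas a string literal cast straight to tsquery is taken as written.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
-- match_all = true: the words of every phrase must be present
select plainto_tsquery('english', 'meatball sub') &amp;&amp; plainto_tsquery('english', 'ham sandwich');
-- match_all = false: the words of any one phrase are enough
select plainto_tsquery('english', 'meatball sub') || plainto_tsquery('english', 'ham sandwich');
</pre>
<p>Either result could be stored in the query column of the concepts table, with the phrases themselves still checked by the like clauses.</p>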
<h3>Further Work</h3>
<p>This approach works well for lists of phrases. To support boolean logic on phrases, one approach might be to compile the request down to a tsquery as above, along with a regular expression to winnow down the matches to those containing the phrases.</p>
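<p>As a rough sketch of that idea, and only one of several ways it might be done, a request for &#8220;meatball sub&#8221; OR &#8220;ham sandwich&#8221; could compile down to something like the query below. It assumes the same articles table and the &#8216;english&#8217; configuration used earlier; the tsquery narrows the candidates through the index and the regular expression confirms that one of the phrases appears verbatim.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
-- the tsquery mirrors the boolean structure of the request,
-- the regular expression winnows the matches down to actual phrases
select * from articles
where to_tsvector('english',articles.body) @@ to_tsquery('english', '(meatball &amp; sub) | (ham &amp; sandwich)')
  and body ~ '(meatball sub|ham sandwich)';
</pre>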
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2010/01/17/multiple-phrase-search-in-postgresql/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Deferring Index costs for table to table copies in PostgreSQL</title>
		<link>http://www.monkeyatlarge.com/archives/2009/04/14/deferring-index-costs-for-table-to-table-copies-in-postgresql/</link>
		<comments>http://www.monkeyatlarge.com/archives/2009/04/14/deferring-index-costs-for-table-to-table-copies-in-postgresql/#comments</comments>
		<pubDate>Tue, 14 Apr 2009 14:27:55 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=318</guid>
		<description><![CDATA[When bulk copying data to a table, it is much faster if the destination table is index and constraint free, because it is cheaper to build an index once than maintain it over many inserts. For postgres, the pg_restore and SQL COPY commands can do this, but they both require that data be copied from [...]]]></description>
			<content:encoded><![CDATA[<p>When bulk copying data to a table, it is much faster if the destination table is index and constraint free, because it is cheaper to build an index once than maintain it over many inserts. For postgres, the pg_restore and SQL COPY commands can do this, but they both require that data be copied from the filesystem rather than directly from another table.</p>
<p>For table to table copying (and transformations) the situation isn&#8217;t as straightforward. Recently I was working on a problem where we needed to perform some poor-man&#8217;s <a href="http://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a>, copying and transforming data between tables in different schemas. Since some of the destination tables were heavily indexed (including a full text index), the task took quite a while. In talking with a colleague about the problem, we came up with the idea of dropping the indexes and constraints prior to the data load, and restoring them afterwards.</p>
<p>First stop: how to get the DDL for indices on a table in postgres? Poking around the postgres catalogs, I managed to find a function, pg_get_indexdef, that returns the DDL for an index. Combining that with a query I found in a forum somewhere and altered, I came up with this query to get the names and DDL of all the indices on a table. (This one excludes the primary key index.)</p>
<p><script src="http://gist.github.com/94854.js"></script></p>
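<p>For readers without access to the gist, a minimal sketch of such a query might look like the following. It assumes a placeholder table named my_table and is not necessarily the same as the query in the gist; it relies only on the pg_index and pg_class catalogs and on pg_get_indexdef.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
-- name and CREATE INDEX statement for every non-primary-key index on my_table
-- (my_table is a hypothetical table name)
select c.relname as index_name, pg_get_indexdef(i.indexrelid) as ddl
from pg_index i
join pg_class c on c.oid = i.indexrelid
where i.indrelid = 'my_table'::regclass
  and not i.indisprimary;
</pre>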
<p>With that and the query to do the same for constraints, it&#8217;s straightforward to build a helper function that will get the DDL for all indices and constraints, drop them, yield to evaluate a block, and then restore the indices and constraints. The method is below:</p>
<p><script src="http://gist.github.com/95196.js"></script></p>
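<p>The constraint side can be sketched in a similar way. The query below is an illustration under the same assumptions (a placeholder table named my_table) rather than the code in the gist: it uses pg_get_constraintdef to build a statement that would re-create each constraint after the load.</p>
<pre class="brush: sql; light: true; title: ; notranslate">
-- name and ALTER TABLE statement for every constraint on my_table
-- (my_table is a hypothetical table name)
select conname,
       'alter table ' || conrelid::regclass::text ||
       ' add constraint ' || quote_ident(conname) ||
       ' ' || pg_get_constraintdef(oid) as ddl
from pg_constraint
where conrelid = 'my_table'::regclass;
</pre>
<p>Saving these statements before dropping anything is the important part, since the definitions cannot be read back from the catalogs once the constraints are gone.</p>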
<p>Use of the function would look like the snippet below. This solution would also allow for arbitrarily complex transformations in Ruby as well as pure SQL.</p>
<p><script src="http://gist.github.com/94867.js"></script></p>
<p>For my task, loading and transforming data into about 20 tables, doing this reduced the execution time by two-thirds. Of course, your mileage may vary depending on how heavily indexed your destination tables are.</p>
<p>Here&#8217;s the whole module:</p>
<p><script src="http://gist.github.com/94853.js"></script></p>
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2009/04/14/deferring-index-costs-for-table-to-table-copies-in-postgresql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PNG Thumbnails for PDF files. Take two</title>
		<link>http://www.monkeyatlarge.com/archives/2008/10/07/png-thumbnails-for-pdf-files-take-two/</link>
		<comments>http://www.monkeyatlarge.com/archives/2008/10/07/png-thumbnails-for-pdf-files-take-two/#comments</comments>
		<pubDate>Wed, 08 Oct 2008 02:16:20 +0000</pubDate>
		<dc:creator>James Kebinger</dc:creator>
				<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.monkeyatlarge.com/?p=305</guid>
		<description><![CDATA[Updating my previous post, I finished up the work of extending attachment_fu to optionally create PNG thumbnails of updated PDF files. Check out the fork on github]]></description>
			<content:encoded><![CDATA[<p>Updating my <a href="http://www.monkeyatlarge.com/archives/2008/09/16/creating-thumbnails-of-pdfs-with-attachment_fu/">previous post</a>, I finished up the work of extending attachment_fu to optionally create PNG thumbnails of updated PDF files. Check out the <a href="http://github.com/jkebinger/attachment_fu/tree/master">fork on github</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.monkeyatlarge.com/archives/2008/10/07/png-thumbnails-for-pdf-files-take-two/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

