Computing Distinct Items Across Sliding Windows in SQL

If you have a table with a column included as the first column in a multi-column index and then again with its own index, you may be over-indexing. Postgres will use the multi-column index for queries on the first column. First, a pointer to the Postgres docs that I can never find, and then data on the performance of multi-column indexes vs. single-column ones.

From the docs

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.


Performance

If you click around that section of the docs, you’ll surely come across the section on multi-column indexing and performance, in particular this section (bold emphasis mine):

You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone

Life is full of tradeoffs performance-wise, so we should explore just how much slower it is to use a multi-column index for single-column queries.

First, let's create a dummy table:

CREATE TABLE foos_and_bars
(
  id serial NOT NULL,
  foo_id integer,
  bar_id integer,
  CONSTRAINT foos_and_bars_pkey PRIMARY KEY (id)
);

Then, using R, we’ll create 3 million rows of nicely distributed data:

rows = 3000000
foo_ids = seq(1,250000,1)
bar_ids = seq(1,20,1)
data = data.frame(foo_id = sample(foo_ids, rows,TRUE), bar_id= sample(bar_ids,rows,TRUE))

Dump that to a text file and load it up with copy and we’re good to go.
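For readers without R handy, a rough Python equivalent of the data generation step might look like the sketch below. The function name `write_sample_rows`, the output path, and the fixed seed are assumptions for illustration; the default row count and id ranges mirror the R snippet above.

```python
import csv
import random


def write_sample_rows(path, rows=3_000_000, n_foo=250_000, n_bar=20, seed=42):
    """Write `rows` lines of (foo_id, bar_id), each sampled uniformly with replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(rows):
            # randint is inclusive on both ends, like sampling from seq(1, n, 1) in R
            writer.writerow([rng.randint(1, n_foo), rng.randint(1, n_bar)])
```

Calling `write_sample_rows("foos_and_bars.csv")` (a hypothetical file name) produces a two-column CSV, which could then be loaded from psql with something like `\copy foos_and_bars (foo_id, bar_id) from 'foos_and_bars.csv' with csv`.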

Create the compound index

CREATE INDEX foo_id_and_bar_id_index
ON foos_and_bars
USING btree
(foo_id, bar_id);

Run a simple query to make sure the index is used:

test_foo=# explain analyze select * from foos_and_bars where foo_id = 123;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foos_and_bars  (cost=4.68..55.74 rows=13 width=12) (actual time=0.026..0.038 rows=8 loops=1)
Recheck Cond: (foo_id = 123)
->  Bitmap Index Scan on foo_id_and_bar_id_index  (cost=0.00..4.68 rows=13 width=0) (actual time=0.020..0.020 rows=8 loops=1)
Index Cond: (foo_id = 123)
Total runtime: 0.072 ms
(5 rows)

Now we’ll make 100 queries by foo_id with this index, and then repeat with the single index installed using this code:

require 'rubygems'
require 'benchmark'
require 'pg'

TEST_IDS = [...] # 100 randomly selected ids, chosen in R

conn = PGconn.open(:dbname => 'test_foo')

# Returns the wall-clock time, in seconds, of a single lookup by foo_id
def perform_test(conn, foo_id)
  Benchmark.realtime do
    res = conn.exec("select * from foos_and_bars where foo_id = #{foo_id}")
    res.clear
  end
end

TEST_IDS.map { |id| perform_test(conn, id) } # warm things up?
data = TEST_IDS.map { |id| perform_test(conn, id) }

data.each do |d|
  puts d
end

How do things stack up? I’d say about evenly:


If you’re hooking up a Mac OS X machine to a 1080p monitor via a Mini DisplayPort to HDMI adapter, you may find your display settings don’t have a 1920×1080 option, and the 1080p setting produces an image with the edges cut off. Adjusting the overscan/underscan slider will make the image fit, but it turns fuzzy.

Solution: check the monitor’s settings. In my ViewSonic VX2453 the HDMI inputs have 2 settings “AV” and “PC”. Switching it to PC solved the problem, and now the picture is exactly the right size and crisp.

I spent some time futzing around with SwitchRes and several fruitless reboots before discovering the setting, so I hope this saves someone time!

Apache Pig is a great language for processing large amounts of data on a Hadoop cluster without delving into the minutiae of map reduce.

Wukong is a great library to write map/reduce jobs for Hadoop from ruby.

Together they can be really great, because problems unsolvable in Pig without resorting to writing a custom function in Java can be solved by streaming data through an external script, which Wukong nicely wraps. The Data Chef blog has a great example of using Pig to choreograph the data flow, and Ruby/Wukong to compute Jaccard similarity of sets.

Working with Wukong on Elastic Map Reduce

Elastic map reduce is a great resource – it’s very easy to quickly have a small hadoop cluster at your disposal to process some data. Getting wukong working requires an extra step: installing the wukong gem on all the machines in the cluster.

Fortunately, elastic map reduce allows the use of bootstrap scripts located on S3, which run on boot for all the machines in the cluster. I used the following script (based on an example on stackoverflow):

sudo apt-get update
sudo apt-get -y install rubygems
sudo gem install wukong --no-rdoc --no-ri

Using Amazon’s command line utility, starting the cluster ready to use in pig interactive mode looks like this

elastic-mapreduce --create --bootstrap-action [S3 path to wukong-bootstrap.sh] --num-instances [a number] --slave-instance-type [ machine type ] --pig-interactive -ssh

The web tool for creating clusters has a space for specifying the path to a bootstrap script.

Next step: upload your Pig script and its accompanying Wukong script to the name node, and launch the job. (It’s also possible to do all of that when starting the cluster by passing more arguments to elastic-mapreduce, with the added advantage that the cluster will terminate when your job finishes.)

As a member of PatientsLikeMe’s Data team, from time to time we’re asked to compute how many unique users did action X on the site within a date range, say 28 days, or several date ranges (1, 14, 28 days for example). It’s easy enough to do that for a given day, but to do that for every day over a span of time (in one query) took some thinking. Here’s what I came up with.

One day at a time

First, a simplified example table:

create table events (
  user_id integer,
  event varchar,
  date date
);

Getting unique user counts by event on any given day is easy. Below, we’ll get the counts of unique users by events for the 7 days leading up to Valentine’s day:

select count(distinct user_id), event from events
-- between is inclusive on both ends, so this range covers exactly 7 days
where date between '2011-02-08' and '2011-02-14'
group by 2
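To try the single-window query without a Postgres server, here is a minimal sketch using Python's built-in `sqlite3` module. The sample rows are invented for illustration, and SQLite stands in for Postgres, so treat it as a demonstration of the idea rather than the original setup. Note that `between` includes both endpoints, so a seven-day window ending on 2011-02-14 starts on 2011-02-08.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events (user_id integer, event varchar, date date)")
conn.executemany(
    "insert into events values (?, ?, ?)",
    [
        (1, "foo", "2011-02-08"),
        (1, "foo", "2011-02-10"),  # same user again: counted once
        (2, "foo", "2011-02-14"),
        (3, "foo", "2011-02-01"),  # outside the window: ignored
        (1, "bar", "2011-02-09"),
    ],
)

# Distinct users per event within the seven days ending on 2011-02-14.
window_counts = conn.execute(
    """
    select count(distinct user_id), event from events
    where date between '2011-02-08' and '2011-02-14'
    group by event
    order by event
    """
).fetchall()
print(window_counts)
```

User 1 appears twice for `foo` inside the window but contributes only once to the count, which is the whole point of `count(distinct ...)`.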

Now Do That For Every Day

The simplest thing that could possibly work is to just issue that query once for each day in the desired time span. We’re looking for something faster, and a bit more elegant.

Stepping back a bit, for a seven-day time window, we’re asking that an event on 2/7/2011 count for that day, and also count for the 6 following days. Effectively, we’re mapping the events of each day onto itself and 6 other days. That sounds like a SQL join waiting to happen. Once the join happens, it’s easy to group by the mapped date and do a distinct count.

Suppose we have a table, dates_plus_7, like the one below:

from_date to_date
2011-01-01 2011-01-01
2011-01-01 2011-01-02
2011-01-01 2011-01-03
2011-01-01 2011-01-04
2011-01-01 2011-01-05
2011-01-01 2011-01-06
2011-01-01 2011-01-07
2011-01-02 2011-01-02

With that table in place, the SQL becomes easy:

select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
group by 1,2

to_date event count
2011-01-05 bar 20
2011-01-05 baz 27
2011-01-05 foo 24
2011-01-06 bar 31

You’ll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.
That works for me on PostgreSQL 8.4. Your mileage may vary with other brands.
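The join can be checked end to end on a toy data set. The sketch below again uses Python's built-in `sqlite3` in place of Postgres, with a handful of made-up events, and builds the mapping table in Python instead of SQL. The table and column names follow the post; everything else is an assumption for illustration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("create table events (user_id integer, event varchar, date date)")
conn.execute("create table dates_plus_7 (from_date date, to_date date)")

events = [
    (1, "foo", "2011-01-01"),
    (1, "foo", "2011-01-03"),
    (2, "foo", "2011-01-05"),
]
conn.executemany("insert into events values (?, ?, ?)", events)

# Map every day in a small range onto itself and the 6 following days.
start = date(2011, 1, 1)
mapping = [
    ((start + timedelta(days=d)).isoformat(),
     (start + timedelta(days=d + offset)).isoformat())
    for d in range(10)
    for offset in range(7)
]
conn.executemany("insert into dates_plus_7 values (?, ?)", mapping)

# Each event now counts toward its own day and the 6 days after it.
rows = conn.execute(
    """
    select to_date, event, count(distinct user_id) from events
    join dates_plus_7 on events.date = dates_plus_7.from_date
    group by to_date, event
    order by to_date
    """
).fetchall()
trailing_counts = {to_date: n for to_date, _event, n in rows}
for to_date, n in trailing_counts.items():
    print(to_date, n)
```

Each `to_date` row holds the number of distinct users seen in the seven days ending on that date. The last few dates only see part of a window's worth of events, which is the edge effect that needs trimming.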

How Do I Get One of Those?
A dates table like that is a one-liner using the generate_series function:

select date::date as from_date, date::date+plus_day as to_date from
generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
generate_series(0,6,1) as plus_day ;

There we get the cartesian product of the set of dates in the desired range, and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets and you’re done.
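The same Cartesian product can be sketched outside the database, which may help when checking what the mapping table should contain. The helper name `dates_plus` and its `window_days` parameter are hypothetical; the date range matches the query above.

```python
from datetime import date, timedelta
from itertools import product


def dates_plus(start, end, window_days=7):
    """Return (from_date, to_date) pairs: each day mapped onto itself and the days after it."""
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    offsets = range(window_days)  # 0..6 for a seven-day window
    # Cartesian product of days and offsets; adding the offset gives to_date.
    return [(d, d + timedelta(days=off)) for d, off in product(days, offsets)]


pairs = dates_plus(date(2011, 1, 1), date(2011, 2, 28))
print(len(pairs))  # 59 days x 7 offsets
print(pairs[:2])
```

Changing `window_days` to 14 or 28 would give the mapping for the other window sizes mentioned at the top of the post.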
