Let’s Fight Voter ID Laws By Putting An ID In Every Voter’s Pocket

In Texas alone, it’s estimated that between 400,000 and 600,000 people don’t have the identification that is required to vote. Let’s take for granted that requiring an ID to vote is regressive and discriminatory. Fighting this on principle in the courts has had mixed success. Let’s play the long game on this.

The Plan

  1. Raise some money
  2. Massive ground campaign to help people get the paperwork they need to get an ID
  3. Continued ground campaign to physically take people to their local government office to get a license or other ID and cover the fees
  4. People with their shiny new IDs vote the jerks who tried to keep them down out of office

Sounds Expensive

Let’s estimate $200 a head for government fees, transportation, and compensation for lost wages. That’s $80 million to solve Texas. That’s a lot of money in the real world, but campaigns flush more money than that down the toilet of useless TV ads. Let’s get some voters the ID they need and improve some lives along the way. If the government doesn’t want to help people, then let’s help the people and then help them get a government that wants to help them in the future.

Why Has No One Tried This Yet

Anyone know? I’m all ears.

At work, our health insurance has been switched to a high-deductible PPO. Not to worry, we’ve also been granted Health Savings Accounts (HSAs) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling that every time legislation comes out to encourage some activity (retirement, education savings, health care), the only winner is the financial services industry.

Here’s why: all of these activities require one to maroon a slice of money in an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells Fargo HSA we’ve got costs $4.25 a month (paid, for now, by work). That’s $51 a year just to hold money. The interest rate is a paltry 0.1%, so with $2,000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00 (net -$49 if I were paying the fees, as I will one day). Thanks for nothing. Further, while some banks graciously waive fees for meeting minimum balances, it’s harder for many people to meet the balance since their money is split so many ways.

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees and headaches: more statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code instead, so that all medical expenses, not just those over a certain threshold, are deductible, rather than handing the banks these shameless gifts? Let me deduct things come tax time.

Not long ago, a surprising story appeared on The Daily Show. Apparently there was a bill (HR 3472) in the last session of Congress that would have required health insurance companies to provide discounts on premiums for healthy activities performed by subscribers. Sounds great, right? It even required actual evidence that the healthy activities were improving health:

Requires any healthy behavior or improvement toward healthy behavior to be supported by medical test result information which is certified by a licensed physician, and the individual to whom it relates, as being complete, accurate, and current.

So at no cost to taxpayers, this would have promoted healthy behaviors for Americans!

As surprising as this bill is, more surprising still is that the American Cancer Society,

Computing Distinct Items Across Sliding Windows in SQL

As a member of PatientsLikeMe’s Data team, from time to time we’re asked to compute how many unique users did action X on the site within a date range, say 28 days, or several date ranges (1, 14, and 28 days, for example). It’s easy enough to do that for a given day, but doing it for every day over a span of time (in one query) took some thinking. Here’s what I came up with.

One day at a time

First, a simplified example table:

create table events (
user_id integer,
event varchar,
date date
)

Getting unique user counts by event on any given day is easy. Below, we’ll get the counts of unique users by events for the 7 days leading up to Valentine’s day:

select count(distinct user_id), event from events
where date between '2011-02-07' and '2011-02-14'
group by 2

Now Do That For Every Day

The simplest thing that could possibly work is to issue that query once for each day in the desired time span. We’re looking for something faster and a bit more elegant.

Stepping back a bit, for a seven-day time window we’re asking that an event on 2/7/2011 count for that day and also for the 6 following days – effectively we’re mapping the events of each day onto itself and 6 other days. That sounds like a SQL join waiting to happen. Once the join happens, it’s easy to group by the mapped date and do a distinct count.

With a table like the one below

from_date to_date
2011-01-01 2011-01-01
2011-01-01 2011-01-02
2011-01-01 2011-01-03
2011-01-01 2011-01-04
2011-01-01 2011-01-05
2011-01-01 2011-01-06
2011-01-01 2011-01-07
2011-01-02 2011-01-02

This SQL becomes easy.

select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
group by 1,2

to_date event count
2011-01-05 bar 20
2011-01-05 baz 27
2011-01-05 foo 24
2011-01-06 bar 31

You’ll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.
That works for me on PostgreSQL 8.4. Your mileage may vary with other brands.
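Here’s a rough sketch of that trimming, assuming the events and dates_plus_7 tables above; the +6 offset matches the seven-day windows and is my own illustration, not part of the original query:

-- keep only windows that fall entirely inside the range of dates present in events
select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
where to_date >= (select min(date) from events) + 6
  and to_date <= (select max(date) from events)
group by 1,2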

How Do I Get One of Those?
A dates table like that is a one-liner using the generate_series function:

select date::date as from_date, date::date+plus_day as to_date from
generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
generate_series(0,6,1) as plus_day ;

There we get the Cartesian product of the set of dates in the desired range and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets, and you’re done.
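To feed the join above, one way (a sketch; a CTE or view would work just as well, and the date range is whatever covers your data) is to materialize that one-liner as the dates_plus_7 table:

-- materialize the mapping table used in the join earlier in the post
create temporary table dates_plus_7 as
select date::date as from_date, date::date + plus_day as to_date
from generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
     generate_series(0, 6, 1) as plus_day;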


I’m James Kebinger, currently a Software Engineer at PatientsLikeMe.
I’m an experienced Software Engineer and Web Developer with a variety of skills, including Java and Ruby/Ruby on Rails, and interests including usability, data analysis, and data visualization. I recently got a Master’s degree in Computer Science from Tufts University, and I’m determined to one day understand statistics.

HBase: Avoid ScannerTimeoutException looking for needles in the haystack with RandomRowFilter

Scanner timeout exceptions happen in HBase when no network activity occurs between the client and server within the timeout period. This can happen for a variety of reasons, but the one we’ll focus on here is the needle-in-a-haystack case: you’re using a highly selective row filter, so the region server is scanning and discarding lots of data. While it’s great for performance that the data doesn’t come back to the client, the connection may time out.

The first easy fix is to reduce the caching you’re setting up on the connection. When caching is set up, there’s only network activity every n rows (n = cache size), so a smaller cache means rows come back to the client more often. Jeff Dwyer has a quick writeup about that.

If adjusting the cache still doesn’t work, what you can do is add a RandomRowFilter to randomly accept some small fraction of the rows and return them to the client. You just need to re-check the filters on the returned rows, but it may be more efficient than reducing cache size (and possibly more reliable). Just stack it with your existing filters as in the code sample below.

// Randomly accept roughly 0.1% of rows regardless of the other filters
RandomRowFilter randomFilter = new RandomRowFilter(.001f);
// OR the random filter with the scan's existing filter (MUST_PASS_ONE = "or"),
// so something reaches the client often enough to keep the scanner alive
FilterList orFilter = new FilterList(Operator.MUST_PASS_ONE);
orFilter.addFilter(randomFilter);
orFilter.addFilter(scan.getFilter());
scan.setFilter(orFilter);

Tune the constant based on estimates of your data sparsity and timeout settings, and away you go.

Another use for generate_series: row multiplier

I had a request the other day: how many simultaneous users are on the site, by time of day? I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended.

CREATE TABLE sessions
(
user_id integer NOT NULL,
start_at timestamp without time zone,
end_at timestamp without time zone,
duration double precision,
views integer
)

I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.

Until I came up with a nice solution in SQL (Postgres). Stepping back, if I can sample from sessions at, say, one-minute intervals, I can count the number of distinct sessions open at each minute. What I need is a row per session per minute spanned. generate_series is a “set returning function” that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans.

From there, it’s easy to do a straightforward group by, counting distinct user_id:

with rounded_sessions as (
select user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions
where start_at between '2012-01-21' and '2012-01-28'
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1

The date_trunc call is important so that session rows are aligned to whole minutes; if that’s not done, then none of the rows will align for the counts.
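To make that concrete, here’s a tiny illustration with a made-up session time; date_trunc snaps the series start to the top of the minute so every session lands on the same grid:

-- a hypothetical session running from 10:02:30 to 10:05:10 expands to four aligned rows
select generate_series(date_trunc('minute', timestamp '2012-01-21 10:02:30'),
                       timestamp '2012-01-21 10:05:10', '1 minute') as to_the_minute;
-- 2012-01-21 10:02:00, 10:03:00, 10:04:00, 10:05:00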

That set won’t include rows for minutes when no users were logged in. To fill those in, the query below uses generate_series again to generate all the minutes from the first minute present to the last, then left joins the counts to that set, coalescing missing entries to zero.


with rounded_sessions as (
select user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') as to_the_minute
from sessions
where start_at between '2012-01-21' and '2012-01-28'
),
counts_by_minute as (
select to_the_minute, count(distinct user_id) from rounded_sessions
group by 1
),
all_the_minutes as (
select generate_series(min(to_the_minute), max(to_the_minute), '1 minute') as minute_fu from rounded_sessions
)

select minute_fu as to_the_minute, coalesce(count, 0) as users from all_the_minutes
left join counts_by_minute on all_the_minutes.minute_fu = counts_by_minute.to_the_minute
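Since the request was framed as “by time of day”, one possible final rollup (my own sketch, assuming the query above has been saved as a users_by_minute table or CTE with columns to_the_minute and users) is to average the per-minute counts by hour of day:

-- users_by_minute(to_the_minute, users) is assumed to hold the result of the query above
select extract(hour from to_the_minute) as hour_of_day,
       avg(users) as avg_concurrent_users
from users_by_minute
group by 1
order by 1;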

If you have a table with a column included as the first column in a multi-column index and then again with its own index, you may be over-indexing. Postgres will use the multi-column index for queries on the first column. First, a pointer to the Postgres docs that I can never find, and then data on the performance of multi-column indexes vs. single.

From the docs

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.


Performance

If you click around that section of the docs, you’ll surely come across the section on multi-column indexing and performance, in particular this section (bold emphasis mine):

You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone

Life is full of performance tradeoffs, so we should explore just how much slower it is to use a multi-column index for single-column queries.

First, let’s create a dummy table:

CREATE TABLE foos_and_bars
(
id serial NOT NULL,
foo_id integer,
bar_id integer,
CONSTRAINT foos_and_bars_pkey PRIMARY KEY (id)
)

Then, using R, we’ll create 3 million rows of nicely distributed data:

rows = 3000000
foo_ids = seq(1,250000,1)
bar_ids = seq(1,20,1)
data = data.frame(foo_id = sample(foo_ids, rows,TRUE), bar_id= sample(bar_ids,rows,TRUE))

Dump that to a text file and load it up with copy and we’re good to go.
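For example, assuming the data frame was written out as a headerless CSV at /tmp/foos_and_bars.csv (the file name and format are my assumptions; the post doesn’t show this step), the load could be:

COPY foos_and_bars (foo_id, bar_id) FROM '/tmp/foos_and_bars.csv' WITH CSV;

(Or use psql’s \copy instead if the file lives on the client rather than the database server.)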

Create the compound index

CREATE INDEX foo_id_and_bar_id_index
ON foos_and_bars
USING btree
(foo_id, bar_id);

Run a simple query to make sure the index is used:

test_foo=# explain analyze select * from foos_and_bars where foo_id = 123;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foos_and_bars  (cost=4.68..55.74 rows=13 width=12) (actual time=0.026..0.038 rows=8 loops=1)
Recheck Cond: (foo_id = 123)
->  Bitmap Index Scan on foo_id_and_bar_id_index  (cost=0.00..4.68 rows=13 width=0) (actual time=0.020..0.020 rows=8 loops=1)
Index Cond: (foo_id = 123)
Total runtime: 0.072 ms
(5 rows)

Now we’ll make 100 queries by foo_id with this index, and then repeat with the single-column index installed, using this code:

require 'rubygems'
require 'benchmark'
require 'pg'

TEST_IDS = [...] # randomly selected 100 ids in R

conn = PGconn.open(:dbname => 'test_foo')

# time a single lookup by foo_id
def perform_test(conn, foo_id)
  Benchmark.realtime do
    res = conn.exec("select * from foos_and_bars where foo_id = #{foo_id}")
    res.clear
  end
end

TEST_IDS.map { |id| perform_test(conn, id) } # warm things up?
data = TEST_IDS.map { |id| perform_test(conn, id) }

data.each do |d|
  puts d
end
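The single-column index for the second run isn’t shown in the original; the one I’d assume looks like this (whether the compound index gets dropped first isn’t stated either):

CREATE INDEX foo_id_index
ON foos_and_bars
USING btree
(foo_id);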

How do things stack up? I’d say about evenly.


If you’re hooking up a Mac OS X machine to a 1080p monitor via a Mini DisplayPort to HDMI adapter, you may find your display settings don’t have a 1920×1080 option, and the 1080p setting produces an image with the edges cut off. Adjusting the overscan/underscan slider will make the image fit, but it turns fuzzy.

Solution: check the monitor’s settings. On my ViewSonic VX2453, the HDMI inputs have two settings, “AV” and “PC”. Switching it to PC solved the problem, and now the picture is exactly the right size and crisp.

I spent some time futzing around with SwitchRes and several fruitless reboots before discovering the setting, so I hope this saves someone time!

Apache Pig is a great language for processing large amounts of data on a Hadoop cluster without delving into the minutiae of MapReduce.

Wukong is a great library for writing map/reduce jobs for Hadoop from Ruby.

Together they can be really great, because problems unsolvable in Pig without resorting to writing a custom function in Java can be solved by streaming data through an external script, which Wukong nicely wraps. The Data Chef blog has a great example of using Pig to choreograph the data flow and Ruby/Wukong to compute the Jaccard similarity of sets.

Working with Wukong on Elastic MapReduce

Elastic MapReduce is a great resource – it’s very easy to quickly have a small Hadoop cluster at your disposal to process some data. Getting Wukong working requires an extra step: installing the wukong gem on all the machines in the cluster.

Fortunately, Elastic MapReduce allows the use of bootstrap scripts located on S3, which run on boot on all the machines in the cluster. I used the following script (based on an example on Stack Overflow):

sudo apt-get update
sudo apt-get -y install rubygems
sudo gem install wukong --no-rdoc --no-ri

Using Amazon’s command line utility, starting the cluster ready to use in Pig interactive mode looks like this:

elastic-mapreduce --create --bootstrap-action [S3 path to wukong-bootstrap.sh] --num-instances [a number] --slave-instance-type [machine type] --pig-interactive -ssh

The web tool for creating clusters has a space for specifying the path to a bootstrap script.

Next step: upload your Pig script and its accompanying Wukong script to the name node, and launch the job. (It’s also possible to do all of that when starting the cluster with more arguments to elastic-mapreduce, with the added advantage that the cluster will terminate when your job does.)

As a member of PatientsLikeMe‘s Data team, from time to time we’re asked to compute how many unique users did action X on the site within a date range, say 28 days, or several date ranges (1, 14, and 28 days, for example). It’s easy enough to do that for a given day, but to do that for every day over a span of time (in one query) took some thinking. Here’s what I came up with.

One day at a time

First, a simplified example table:

create table events (
  user_id integer,
  event varchar,
  date date
)

Getting unique user counts by event on any given day is easy. Below, we’ll get the counts of unique users by event for the 7 days leading up to Valentine’s Day:

select count(distinct user_id), event from events
where date between '2011-02-07' and '2011-02-14'
group by 2

Now Do That For Every Day

The simplest thing that could possibly work is to just issue that query once for each day in the span desired. We’re looking for something faster, and a bit more elegant.

Stepping back a bit, for a seven-day time window, we’re asking that an event on 2/7/2011 count for that day, and also count for the 6 following days – effectively we’re mapping the events of each day onto itself and 6 other days. That sounds like a SQL join waiting to happen. Once the join happens, it’s easy to group by the mapped date, and do a distinct count.

With a table like the one below

from_date to_date
2011-01-01 2011-01-01
2011-01-01 2011-01-02
2011-01-01 2011-01-03
2011-01-01 2011-01-04
2011-01-01 2011-01-05
2011-01-01 2011-01-06
2011-01-01 2011-01-07
2011-01-02 2011-01-02

This SQL becomes easy.

select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
group by 1,2

to_date      event  count
2011-01-05   bar       20
2011-01-05   baz       27
2011-01-05   foo       24
2011-01-06   bar       31

You’ll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.
That works for me on PostgreSQL 8.4. Your mileage may vary with other brands.

How Do I Get One of Those?
A dates table like that is a one-liner using the generate_series method:

select date::date as from_date, date::date+plus_day as to_date from
generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
generate_series(0,6,1) as plus_day ;

There we get the Cartesian product of the set of dates in the desired range and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets, and you’re done.
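
If you want that mapping materialized as the dates_plus_7 table used in the join earlier, the same query can simply be wrapped in CREATE TABLE AS (table and column names here just mirror the example above):

create table dates_plus_7 as
select date::date as from_date, date::date + plus_day as to_date
from generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
generate_series(0,6,1) as plus_day;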

Archipelago Of Accounts – The Banks Always Win

At work, our health insurance has been switched to a high-deductible PPO. Not to worry, we’ve also been granted Health Savings Accounts (HSA) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling that every time legislation comes out to encourage some activity (retiring, saving for education, paying for health care), the only winner is the financial services industry.

Here’s why: each of these activities requires one to maroon a slice of money in an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells-Fargo HSA we’ve got is $4.25 a month (paid, for now, by work). That’s $51 a year to hold money. The interest rate is a paltry 0.1%, so with $2000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00 (net -$49 if I were paying the fees, as I will one day). Thanks for nothing. Further, while some banks graciously waive fees for meeting minimum balances, it’s harder for many people to meet the balance since their money is split so many ways.

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees and headaches. More statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code instead, so that all medical expenses, not just those over a certain threshold, are deductible? That would beat these shameless handouts to the banks. Let me deduct things come tax time.

Redundant Indexing in PostgreSQL

If you have a table with a column included as the first column in a multi-column index and then again with its own index, you may be over-indexing. Postgres will use the multi-column index for queries on the first column. First, a pointer to the Postgres docs that I can never find, and then data on the performance of multi-column indexes vs. single-column ones.

From the docs

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.


Performance

If you click around that section of the docs, you’ll surely come across the section on multi-column indexing and performance, in particular this passage:

You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone

Life is full of tradeoffs performance-wise, so we should explore just how much slower it is to use a multi-column index for single-column queries.

First, let’s create a dummy table:

CREATE TABLE foos_and_bars
(
  id serial NOT NULL,
  foo_id integer,
  bar_id integer,
  CONSTRAINT foos_and_bars_pkey PRIMARY KEY (id)
)

Then, using R, we’ll create 3 million rows of nicely distributed data:

rows = 3000000
foo_ids = seq(1,250000,1)
bar_ids = seq(1,20,1)
data = data.frame(foo_id = sample(foo_ids, rows,TRUE), bar_id= sample(bar_ids,rows,TRUE))

Dump that to a text file, load it up with copy, and we’re good to go.
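
For completeness, that dump-and-load step might look roughly like this (the file path is made up, and write.table’s default quoting is fine here since both columns are numeric). In R:

write.table(data, "/tmp/foos_and_bars.txt", sep="\t", row.names=FALSE, col.names=FALSE)

and then in psql:

\copy foos_and_bars (foo_id, bar_id) from '/tmp/foos_and_bars.txt'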

Create the compound index

CREATE INDEX foo_id_and_bar_id_index
ON foos_and_bars
USING btree
(foo_id, bar_id);

Run a simple query to make sure the index is used:

test_foo=# explain analyze select * from foos_and_bars where foo_id = 123;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foos_and_bars  (cost=4.68..55.74 rows=13 width=12) (actual time=0.026..0.038 rows=8 loops=1)
Recheck Cond: (foo_id = 123)
->  Bitmap Index Scan on foo_id_and_bar_id_index  (cost=0.00..4.68 rows=13 width=0) (actual time=0.020..0.020 rows=8 loops=1)
Index Cond: (foo_id = 123)
Total runtime: 0.072 ms
(5 rows)

Now we’ll make 100 queries by foo_id with this index in place, and then repeat with only a single-column index on foo_id installed, using this code:

require 'rubygems'
require 'benchmark'
require 'pg'

TEST_IDS = [...] # randomly selected 100 ids in R

conn = PGconn.open(:dbname => 'test_foo')

def perform_test(conn, foo_id)
  time = Benchmark.realtime do
    res = conn.exec("select * from foos_and_bars where foo_id = #{foo_id}")
    res.clear
  end
end

TEST_IDS.map {|id| perform_test(conn, id)} # warm things up?
data = TEST_IDS.map {|id| perform_test(conn, id)}

data.each do |d|
  puts d
end
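
One thing not shown above is the single-column index for the second run; presumably the compound index gets dropped and replaced with something along these lines (the index name here is assumed):

DROP INDEX foo_id_and_bar_id_index;

CREATE INDEX foo_id_index
ON foos_and_bars
USING btree
(foo_id);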

How do things stack up? I’d say about evenly.

Remember: Indexing isn’t free, and Postgres is pretty good at using (and reusing) your indexes, so you may not need to create as many as you think.

OSX VPN Problems: Kill the racoon

Occasionally my Mac will refuse to connect to work’s IPSec VPN with the error message:
“A configuration error occurred. Verify your settings and try reconnecting.”

This usually happens to me after a long time between reboots, and a reboot usually allows me to successfully connect again. Rebooting when I’m in the middle of something can be a pain, so I did some research and found a better way. There’s a process called “racoon” – it performs key exchange operations to set up IPSec tunnels. Kill it (using kill or Activity Monitor) and your VPN will start working again.
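
In terminal terms that’s a one-liner (racoon runs as root, so sudo is required; it should be relaunched on demand the next time a tunnel is negotiated):

sudo killall racoon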

Works on OSX 10.6.5 and 10.6.6

I recently ran a survey at work using FluidSurveys. Their survey building tools are excellent, and they have great support, but I ran into a time-consuming issue when it came time to process the responses, because they’re double-byte Unicode, UTF-16LE to be specific. Turns out knowing that is 90% of the battle.

The files on first inspection are a bit strange: although they spring from a CSV export button, they’re tab-delimited, but with CSV-style quoting conventions. That’s easy enough to work around, but R and Ruby both barfed reading the files. I cottoned on to the fact that the files had some odd characters in them, so I recruited JRuby and ruby 1.9 to try to load them, due to their better Unicode support, but still couldn’t quite get the parameters right.
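
For what it’s worth, the ruby 1.9 incantation was probably something close to this (file name assumed; the byte-order mark and stray characters aren’t handled, which is part of why iconv won out):

require 'csv'

# read the UTF-16LE export, transcoding to UTF-8 on the way in
text = File.read('responses.csv', mode: 'rb:UTF-16LE:UTF-8')

# tab-delimited, but with CSV-style quoting
rows = CSV.parse(text, col_sep: "\t")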

Then I thought of iconv, the character set conversion utility. Since in this case the only special character was the ellipsis, I was happy to strip those out, and the following command does the trick:

iconv -f UTF-16LE -t US-ASCII -c responses.csv > converted_responses.csv

And, as they say, Bob’s your uncle.

At the risk of being forever branded a grammar elitist, let’s take a quick look at the use of the phrase “your an idiot” on Twitter.

Inspired by the tweet by @doctorzaius referencing a URL to Twitter’s search page for “your an idiot”, I used Twitter’s streaming API to download a sample of 6581 tweets containing the word “idiot” overnight, for about 12 hours.

Of these 6581 tweets, 65 contained our friend “your an idiot”. 161, two and a half times as many, contained “you’re an idiot”. Additionally, there were 2 tweets with “your such an idiot”, and just one “you’re such an idiot”. The forces of good grammar have won this round?

Note: This is a very small sample. It may be interesting to compare Facebook status updates to see what the you’re/your ratio looks like there one day…

(Ab)using memoize to quickly solve tricky n+1 problems

Usually, discovering n+1 problems in your Rails application that can’t be fixed with an :include statement means lots of changes to your views. Here’s a workaround, discovered while working with Rich to improve performance of some Dribbble pages, that skips the view changes. It uses memoize to convince your n model instances that they already have all the information needed to render the page.

While simple belongs_to relationships are easy to fix with :include, let’s take a look at a concrete example where that won’t work:

class User < ActiveRecord::Base
  has_many :likes
end

class Item < ActiveRecord::Base
  has_many :likes

  def liked_by?(user)
    likes.by_user(user).present?
  end
end

class Like < ActiveRecord::Base
  belongs_to :user
  belongs_to :item
end

A view presenting a set of items that called Item#liked_by? would be an n+1 problem that wouldn’t be well solved by :include. Instead, we’d have to come up with a query to get the Likes for the set of items by this user:

Like.of_item(@items).by_user(user)

Then we’d have to store that in a controller instance variable, and change all the views that called item.liked_by?(user) to access the instance variable instead.

Active Support’s memoize functionality stores the results of function calls so they’re only evaluated once. What if we could trick the method into thinking it’s already been called? We can do just that by writing data into the instance variables that memoize uses to save results on each of the model instances. First, we memoize liked_by?:

memoize :liked_by?

Then bulk load the relevant likes and stash them into memoize’s internal state:

def precompute_data(items, user)
  likes = Like.of_item(items).by_user(user).index_by {|like| like.item_id}
  items.each do |item|
    item.write_memo(:liked_by?, likes[item.id].present?, user)
  end
end

The write_memo method is implemented as follows.

def write_memo(method, return_value, args=nil)
  ivar = ActiveSupport::Memoizable.memoized_ivar_for(method)
  if args
    if hash = instance_variable_get(ivar)
      hash[Array(args)] = return_value
    else
      instance_variable_set(ivar, {Array(args) => return_value})
    end
  else
    instance_variable_set(ivar, [return_value])
  end
end

The problem described here could be solved with some crafty left joins added to the query that fetched the items in the first place, but when there are several different hard-to-prefetch properties, such a query would likely become unmanageable, if not terribly slow.
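
Putting it together, the call site ends up looking something like this (where precompute_data lives, and how the items are fetched, are assumptions rather than details from the original code):

# in the controller: fetch the items, then warm the memoized values with a single Like query
@items = Item.all # however the page's items are normally loaded
Item.precompute_data(@items, current_user)

# in the view, item.liked_by?(current_user) now reads the memo instead of hitting the database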