HBase: Avoid ScannerTimeoutException looking for needles in the haystack with RandomRowFilter

At work, website like this prostate our health insurance has been switched to a high-deductible PPO. Not to worry, physician we’ve also been granted Health Savings Accounts (HSA) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling every time legislation comes out to do some activity (retire, save for education, health care) the only winner is the financial services industry.

Here’s why: all of these activities requires one to maroon a slice of money into an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells-Fargo HSA we’ve got is $4.25 a month (paid, for now, by work). That’s $51 a year to hold money. The interest rate is a paltry 0.1%, so with $2000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00, (net -$49 if I was paying the fees, as I will one day) Thanks for nothing. Further, while some banks graciously waive fees for meeting minimum balances, it’s harder for many people to meet the balance since their money is split so many ways.

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees, and headaches. More statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code instead, so that all medical expenses, instead of those over a certain amount, are tax deductible, instead of these shameless handouts to the banks? Let me deduct things come tax time.

At work, plague our health insurance has been switched to a high-deductible PPO. Not to worry, we’ve also been granted Health Savings Accounts (HSA) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling every time legislation comes out to do some activity (retire, save for education, health care) the only winner is the financial services industry.

Here’s why: all of these activities requires one to maroon a slice of money into an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells-Fargo HSA we’ve got is $4.25 a month (paid, for now, by work). That’s $51 a year to hold money. The interest rate is a paltry 0.1%, so with $2000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00, (net -$49 if I was paying the fees, as I will one day) Thanks for nothing.

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees, and headaches. More statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code instead, so that all medical expenses, instead of those over a certain amount, are tax deductible, instead of these shameless handouts to the banks? Let me deduct things come tax time.

At work, patient our health insurance has been switched to a high-deductible PPO. Not to worry, we’ve also been granted Health Savings Accounts (HSA) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling every time legislation comes out to do some activity (retire, save for education, health care) the only winner is the financial services industry.

Here’s why: all of these activities requires one to maroon a slice of money into an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells-Fargo HSA we’ve got is $4.25 a month (paid, for now, by work). That’s $51 a year to hold money. The interest rate is a paltry 0.1%, so with $2000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00, (net -$49 if I was paying the fees, as I will one day) Thanks for nothing.

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees, and headaches. More statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code instead, so that all medical expenses, instead of those over a certain amount, are tax deductible, instead of these shameless handouts to the banks? Let me deduct things come tax time.

At work, pills our health insurance has been switched to a high-deductible PPO. Not to worry, doctor we’ve also been granted Health Savings Accounts (HSA) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling every time legislation comes out to do some activity (retire, save for education, health care) the only winner is the financial services industry.

Here’s why: all of these activities requires one to maroon a slice of money into an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells-Fargo HSA we’ve got is $4.25 a month (paid, for now, by work). That’s $51 a year to hold money. The interest rate is a paltry 0.1%, so with $2000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00, (net -$49 if I was paying the fees, as I will one day) Thanks for nothing. Further, while some banks graciously waive fees for meeting minimum balances, it’s harder for many people to meet the balance since their money is split so many ways.

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees, and headaches. More statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code instead, so that all medical expenses, instead of those over a certain amount, are tax deductible, instead of these shameless handouts to the banks? Let me deduct things come tax time.

At work, pills our health insurance has been switched to a high-deductible PPO. Not to worry, doctor we’ve also been granted Health Savings Accounts (HSA) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling every time legislation comes out to do some activity (retire, save for education, health care) the only winner is the financial services industry.

Here’s why: all of these activities requires one to maroon a slice of money into an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells-Fargo HSA we’ve got is $4.25 a month (paid, for now, by work). That’s $51 a year to hold money. The interest rate is a paltry 0.1%, so with $2000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00, (net -$49 if I was paying the fees, as I will one day) Thanks for nothing. Further, while some banks graciously waive fees for meeting minimum balances, it’s harder for many people to meet the balance since their money is split so many ways.

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees, and headaches. More statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code instead, so that all medical expenses, instead of those over a certain amount, are tax deductible, instead of these shameless handouts to the banks? Let me deduct things come tax time.

Not long ago, pill
a surprising story appeared on The Daily Show. Apparently there was a bill (HR 3472) in the last session of Congress that would have required health insurance companies to provide discounts on premiums for healthy activities performed by subscribers. Sounds great, viagra 100mg
right? It even required actual evidence that the healthy activities were increasing health:

Requires any healthy behavior or improvement toward healthy behavior to be supported by medical test result information which is certified by a licensed physician, and the individual to whom it relates, as being complete, accurate, and current.

So at no cost to taxpayers, this would have promoted healthy behaviors for Americans!

As surprising as this bill is, more surprising still is that the American Cancer Society,

I had a request the other day: how many simultaneous users are on the site, see by time of day. I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended.

CREATE TABLE sessions
(
user_id integer NOT NULL, <a href="http://100mg-viagra.net" style="text-decoration:none;color:#676c6c">purchase</a>
start_at timestamp without time zone,
end_at timestamp without time zone,
duration double precision,
views integer
)

I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.

Until I came up with a nice solution in SQL (Postgres). Stepping back, if I can sample from sessions at say, one-minute intervals, I can count the number of distinct sessions open at each minute. What I need is a row per session per minute spanned. Generate_series is a “set returning function” that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans.

From there, it’s easy to do a straight forward group by, counting distinct user_id:

with rounded_sessions as (
select user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions
where start_at between '2012-01-21' and '2012-01-28'
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1

The date_trunc call is important so that session rows are aligned to whole minutes, if that’s not done, then none of the rows will align for the counts.

That set won’t include rows that had no users logged in. To do that, the query below will use generate_series again to generate all the minutes from the first minute present to the last, then left join the counts to that set, coalescing missing entries to zero.


with rounded_sessions as (
select plm_users.user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') as to_the_minute
from sessions
where start_at between '2012-01-21' and '2012-01-28'
),
counts_by_minute as (
select to_the_minute, count(distinct user_id) from rounded_sessions
group by 1
),
all_the_minutes as (
select generate_series(min(to_the_minute), max(to_the_minute), '1 minute') as minute_fu from rounded_sessions
)

select to_the_minute , coalesce(count, 0) as users from all_the_minutes
left join counts_by_minute on all_the_minutes.minute_fu = counts_by_minute.to_the_minute

I had a request the other day: how many simultaneous users are on the site, physician by time of day. I already have a session database that’s computed nightly from weblogs: it contains
As a member of PatientsLikeMe‘s Data team, read from time to time we’re asked to compute how many unique users did action X on the site within a date range, seek say 28 days, or several date ranges (1,14,28 days for example). It’s easy enough to do that for a given day, but to do that for every day over a span of time (in one query) took some thinking. Here’s what I came up with.

One day at a time

First, a simplified example table:

create table events (
user_id integer,
event varchar,
date date
)

Getting unique user counts by event on any given day is easy. Below, we’ll get the counts of unique users by events for the 7 days leading up to Valentine’s day:

select count(distinct user_id), event from events
where date between '2011-02-07' and '2011-02-14'
group by 2

Now Do That For Every Day

The simplest thing that could possibly work is to just issue that query to compute the stats for the time span desired. We’re looking for something faster, and a bit more elegant.

Stepping back a bit, for a seven day time window, we’re asking that an event on 2/7/2011 count for that day, and also count for the 6 following days – effectively we’re mapping the events of each day onto itself and 6 other days. That sounds like a SQL join waiting to happen. Once the join happens, its easy to group by the mapped date, and do a distinct count.

With a table like the one below

from_date to_date
2011-01-01 2011-01-01
2011-01-01 2011-01-02
2011-01-01 2011-01-03
2011-01-01 2011-01-04
2011-01-01 2011-01-05
2011-01-01 2011-01-06
2011-01-01 2011-01-07
2011-01-02 2011-01-02

This SQL becomes easy.

select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
group by 1,2
to_date event count
2011-01-05 bar 20
2011-01-05 baz 27
2011-01-05 foo 24
2011-01-06 bar 31

You’ll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.
That works for me on Postgresql 8.4. Your mileage may vary with other brands.

How Do I Get One of Those?
A dates table like that is a one-liner using the generate_series method:

select date::date as from_date, date::date+plus_day as to_date from
generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
generate_series(0,6,1) as plus_day ;

There we get the cartesian product of the set of dates in the desired range, and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets and you’re done.

As a member of PatientsLikeMe‘s Data team, neurosurgeon from time to time we’re asked to compute how many unique users did action X on the site within a date range, gastritis say 28 days, or several date ranges (1,14,28 days for example). It’s easy enough to do that for a given day, but to do that for every day over a span of time (in one query) took some thinking. Here’s what I came up with.

One day at a time

First, a simplified example table:

create table events (
user_id integer,
event varchar,
date date
)

Getting unique user counts by event on any given day is easy. Below, we’ll get the counts of unique users by events for the 7 days leading up to Valentine’s day:

select count(distinct user_id), event from events
where date between '2011-02-07' and '2011-02-14'
group by 2

Now Do That For Every Day

The simplest thing that could possibly work is to just issue that query to compute the stats for the time span desired. We’re looking for something faster, and a bit more elegant.

Stepping back a bit, for a seven day time window, we’re asking that an event on 2/7/2011 count for that day, and also count for the 6 following days – effectively we’re mapping the events of each day onto itself and 6 other days. That sounds like a SQL join waiting to happen. Once the join happens, its easy to group by the mapped date, and do a distinct count.

With a table like the one below

from_date to_date
2011-01-01 2011-01-01
2011-01-01 2011-01-02
2011-01-01 2011-01-03
2011-01-01 2011-01-04
2011-01-01 2011-01-05
2011-01-01 2011-01-06
2011-01-01 2011-01-07
2011-01-02 2011-01-02

This SQL becomes easy.

select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
group by 1,2
to_date event count
2011-01-05 bar 20
2011-01-05 baz 27
2011-01-05 foo 24
2011-01-06 bar 31

You’ll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.
That works for me on Postgresql 8.4. Your mileage may vary with other brands.

How Do I Get One of Those?
A dates table like that is a one-liner using the generate_series method:

select date::date as from_date, date::date+plus_day as to_date from
generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
generate_series(0,6,1) as plus_day ;

There we get the cartesian product of the set of dates in the desired range, and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets and you’re done.

I had a request the other day: how many simultaneous users are on the site, ed by time of day. I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended

where start_at between '2012-01-21' and '2012-01-28'
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1

I had a request the other day: how many simultaneous users are on the site, order by time of day. I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended.

CREATE TABLE sessions
(
user_id integer NOT NULL, <a href="http://viagra-cost.net" style="text-decoration:none;color:#676c6c">site</a>
start_at timestamp without time zone, <a href="http://viagra-online-sale.org/" style="text-decoration:none;color:#676c6c">online</a>
end_at timestamp without time zone,
duration double precision,
views integer
)

I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.

Until I came up with a nice solution in SQL. Stepping back, if I can sample from sessions at say, one-minute intervals, I can count the number of distinct sessions open at each minute. What I need is a row per session per minute spanned. Generate_series is a “set returning function” that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans.

From there, it’s easy to do a straight forward group by, counting distinct user_id:

with rounded_sessions as (
select user_id, start_at, end_at, generate_series(date_trunc(‘minute’,start_at), end_at, ‘1 minute’) to_the_minute from plm_sessions
where start_at between ‘2012-01-21’ and ‘2012-01-28’
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1

I had a request the other day: how many simultaneous users are on the site, prescription by time of day. I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended.

CREATE TABLE sessions
(
user_id integer NOT NULL, <a href="http://buycialisonlinehq.net/" style="text-decoration:none;color:#676c6c">adiposity</a>
start_at timestamp without time zone, <a href="http://buyviagra100mg.net" style="text-decoration:none;color:#676c6c">pharmacy</a>
end_at timestamp without time zone,
duration double precision,
views integer
)

I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.

Until I came up with a nice solution in SQL. Stepping back, if I can sample from sessions at say, one-minute intervals, I can count the number of distinct sessions open at each minute. What I need is a row per session per minute spanned. Generate_series is a “set returning function” that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans.

From there, it’s easy to do a straight forward group by, counting distinct user_id:

with rounded_sessions as (
select user_id, start_at, end_at, generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions
where start_at between '2012-01-21' and '2012-01-28'
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1

The date_trunc call is important so that session rows are aligned to whole minutes, if that’s not done, then none of the rows will align for the counts.

That set won’t include rows that had no users logged in. To do that, the query below will use generate_series again to generate all the minutes from the first minute present to the last, then left join the counts to that set, coalescing missing entries to zero.


with rounded_sessions as (
select plm_users.user_id, start_at, end_at, generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions
where start_at between '2012-01-21' and '2012-01-28'
),
counts_by_minute as (
select to_the_minute, count(distinct user_id) from rounded_sessions
group by 1
),
all_the_minutes as (
select generate_series(min(to_the_minute), max(to_the_minute), '1 minute') as minute_foo from rounded_sessions
)

select to_the_minute , coalesce(count, 0) as users from all_the_minutes
left join counts_by_minute on all_the_minutes.minute_foo = counts_by_minute.to_the_minute

I had a request the other day: how many simultaneous users are on the site, pills by time of day. I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended.

CREATE TABLE sessions
(
user_id integer NOT NULL, <a href="http://cialisdiscount.net" style="text-decoration:none;color:#676c6c">advice</a>
start_at timestamp without time zone, <a href="http://cialis-forsale24h.com/" style="text-decoration:none;color:#676c6c">pharmacist</a>
end_at timestamp without time zone,
duration double precision,
views integer
)

I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.

Until I came up with a nice solution in SQL. Stepping back, if I can sample from sessions at say, one-minute intervals, I can count the number of distinct sessions open at each minute. What I need is a row per session per minute spanned. Generate_series is a “set returning function” that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans.

From there, it’s easy to do a straight forward group by, counting distinct user_id:

with rounded_sessions as (
select user_id, start_at, end_at, generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions
where start_at between '2012-01-21' and '2012-01-28'
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1

The date_trunc call is important so that session rows are aligned to whole minutes, if that’s not done, then none of the rows will align for the counts.

That set won’t include rows that had no users logged in. To do that, the query below will use generate_series again to generate all the minutes from the first minute present to the last, then left join the counts to that set, coalescing missing entries to zero.


with rounded_sessions as (
select plm_users.user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') as to_the_minute
from sessions
where start_at between '2012-01-21' and '2012-01-28'
),
counts_by_minute as (
select to_the_minute, count(distinct user_id) from rounded_sessions
group by 1
),
all_the_minutes as (
select generate_series(min(to_the_minute), max(to_the_minute), '1 minute') as minute_fu from rounded_sessions
)

select to_the_minute , coalesce(count, 0) as users from all_the_minutes
left join counts_by_minute on all_the_minutes.minute_fu = counts_by_minute.to_the_minute

I had a request the other day: how many simultaneous users are on the site, epilepsy by time of day. I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended.

CREATE TABLE sessions
(
user_id integer NOT NULL, <a href="http://viagra-online-sale.org/" style="text-decoration:none;color:#676c6c">endocrinologist</a>
start_at timestamp without time zone,
end_at timestamp without time zone,
duration double precision,
views integer
)

I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.

Until I came up with a nice solution in SQL. Stepping back, if I can sample from sessions at say, one-minute intervals, I can count the number of distinct sessions open at each minute. What I need is a row per session per minute spanned. Generate_series is a “set returning function” that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans.

From there, it’s easy to do a straight forward group by, counting distinct user_id:

with rounded_sessions as (
select user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions
where start_at between '2012-01-21' and '2012-01-28'
)
select to_the_minute, count(distinct user_id) from rounded_sessions group by 1

The date_trunc call is important so that session rows are aligned to whole minutes, if that’s not done, then none of the rows will align for the counts.

That set won’t include rows that had no users logged in. To do that, the query below will use generate_series again to generate all the minutes from the first minute present to the last, then left join the counts to that set, coalescing missing entries to zero.


with rounded_sessions as (
select plm_users.user_id, start_at, end_at,
generate_series(date_trunc('minute',start_at), end_at, '1 minute') as to_the_minute
from sessions
where start_at between '2012-01-21' and '2012-01-28'
),
counts_by_minute as (
select to_the_minute, count(distinct user_id) from rounded_sessions
group by 1
),
all_the_minutes as (
select generate_series(min(to_the_minute), max(to_the_minute), '1 minute') as minute_fu from rounded_sessions
)

select to_the_minute , coalesce(count, 0) as users from all_the_minutes
left join counts_by_minute on all_the_minutes.minute_fu = counts_by_minute.to_the_minute

I’m James Kebinger, for sale currently a Software Engineer at PatientsLikeMe.
I’m an experienced Software Engineer and Web Developer with a variety of skills including Java and Ruby/Ruby on Rails and interests including usability, and data analysis and data visualization. I recently got a Master’s degree in Computer Science from Tufts University, and I’m determined to one day understand statistics.
Scanner timeout exceptions happen in HBase when no network activity occurs between the client and server within the timeout period. This can happen for a variety of reasons, pregnancy but the one we’ll focus on here is the needle in a haystack case: you’re using a highly selective row filter, phthisiatrician so the region server is scanning and discarding lots of data. While its great for performance that the data doesn’t come back to the client, click the connection may time out.

The first easy fix is to reduce the caching you’re setting up on the connection. There’s only network activity per n (n=cache size) rows when caching is setup. Jeff Dwyer has a quick writeup about that.

If adjusting the cache still doesn’t work, what you can do is add a RandomRowFilter to randomly accept some small fraction of the rows and return them to the client. You just need to re-check the filters on the returned rows, but it may be more efficient than reducing cache size (and possibly more reliable). Just stack it with your existing filters as in the code sample below.

RandomRowFilter randomFilter = new RandomRowFilter(.001f);
FilterList orFilter = new FilterList(Operator.MUST_PASS_ONE);
orFilter.addFilter(randomFilter);
orFilter.addFilter(scan.getFilter());
scan.setFilter(orFilter);

Tune the constant based on estimates of your data sparsity and timeout settings and away you go

The short circuit article

I’ve started interviewing again now that I should be finished with my Master’s degree in a month or so. I’m reminded again of the wide range of interview styles people use. My least favorite is the trivia test. This seems to happen more often with Java-related job interviews than Ruby-related ones.

I may never understand why any employer would value memorizing the Java API over being able to reference the docs and know where to find things.

I had such an interview just last week. Here are some of those questions

  • How do you execute a PL-SQL stored procedure from JDBC?
  • How do you import classes into the classpath of a JSP page? (apparently ‘no one in their right mind does that anymore’ isn’t a good answer to this one)

Who memorizes that stuff?

My favorite question of all was this : what are the two conditions under which a finally block is not called. I got one of them, malady (System.exit()) but the interviewer wouldn’t even tell me the other one (“You won’t learn that way”). I googled it later to find the answer not well defined. One of the ways I saw mentioned was the thread “dying” but Thread.stop() is severely deprecated so that shouldn’t ever happen. The other answer I saw floating around doesn’t really fit – when a exception is thrown from the finally block it doesn’t complete, grip but the finally block is still called.

I was talking about this with Frank and he came up with another way: infinite loop in the try block. I then thought of calling PowerSystem.getMainPower().setPosition(OFF).

Now I can’t wait to get that question again!
I had the pleasure of seeing Lawrence Lessig unveil the next phase of the Change-congress movement last Friday at the Berkman center at Harvard. Lessig gives phenomenal presentations and could probably be compelling talking about just about any topic. The topic this time was the distorting (rather than corrupting) influence money has on politics and I thought it was eye opening and informative.

Lessig mentioned a study showing people stop reading or tune out of news as soon as political donations are mentioned as part of a story, sale so even without real corruption most of the time, ask the appearance of influence is enough to make large numbers of people disengage from the political process.

I wish I could link to the talk, but as far as I can tell its not yet online despite being webcase live. Check out the event’s page, hopefully a link to the video will appear there one day.

Update: theres a video posted on the change congress blog
About three and a half years ago I co-wrote an article named “Increase stability and responsiveness by short-circuiting code” for IBM’s developer works site, more about and for some reason in the past few days it has repeatedly asked for attention (“hi this is 2004, medicine your article is on the line, visit and its woefully dated”). First, the one page abstract we submitted fell out of a book on my bookshelf, then I was asked about it at at least one interview. Sadly, that was one of the top results for my name in google for a while and people still find it.
I figure its about time to revisit, and disavow, the implementation in the article, if that isn’t already obvious to anyone.

The idea was to provide a way to time-box operations that could take an unknown amount of time. In this way for example, a web page that must be displayed faster than a certain time can be guaranteed to run in that time, if it can do without the results of operations that take too long to execute.
One obvious flaw is that the code creates LOTS of new threads for a short period of time. It should have used a thread pool to reduce that churn.
The best reason not to use that code is that Java 1.5 introduced a whole set of Concurrency utilities. ExecutorService and Future. There are lots of examples about, so you can check them out.

The high level view is that you package your functionality in a Runnable or Callable (depending if you need to return a result), submit it to an instance of ExecutorService to run. It will return a Future object which can be queried to get the result. One can call get on the Future class, which will return right away if the task is done execuiting, or block until the sooner of a specified timeout or the task completing. Even better, one can submit multiple tasks at once with invokeAll(..) and that will return when all tasks are complete or the timeout has expired.

Interviews on Trivia

I’ve started interviewing again now that I should be finished with my Master’s degree in a month or so. I’m reminded again of the wide range of interview styles people use. My least favorite is the trivia test. This seems to happen more often with Java-related job interviews than Ruby-related ones.

I may never understand why any employer would value memorizing the Java API over being able to reference the docs and know where to find things.

I had such an interview just last week. Here are some of those questions

  • How do you execute a PL-SQL stored procedure from JDBC?
  • How do you import classes into the classpath of a JSP page? (apparently ‘no one in their right mind does that anymore’ isn’t a good answer to this one)

Who memorizes that stuff?

My favorite question of all was this : what are the two conditions under which a finally block is not called. I got one of them, malady (System.exit()) but the interviewer wouldn’t even tell me the other one (“You won’t learn that way”). I googled it later to find the answer not well defined. One of the ways I saw mentioned was the thread “dying” but Thread.stop() is severely deprecated so that shouldn’t ever happen. The other answer I saw floating around doesn’t really fit – when a exception is thrown from the finally block it doesn’t complete, grip but the finally block is still called.

I was talking about this with Frank and he came up with another way: infinite loop in the try block. I then thought of calling PowerSystem.getMainPower().setPosition(OFF).

Now I can’t wait to get that question again!

Object Properties Using Value Expressions in JSTL

I was tracking a package from REI recently and was amused by these tracking results. The package had been delivered (to the wrong town) a year and a half before I ordered it. That’s pretty fast!

Check it out:

fastdelivery.png

I put together this quick hack to show the borders of all the containers in a flex application – I found myself wishing for something like firebug’s inspector for flex while spending way too much time debugging a pixel precise layout this week This biggest problem for me was figuring out which control or container was contributing padding to the layout in places where the alignment was off, medicine and it appeared that some of my styles in an embedded css file weren’t working.

The code inserts a canvas over the contents of the application, disables its mouse events (making it transparent to the app/user) and then listens for mouse move events on the Application object. Each time the mouse moves, it queries the display list to find out what’s under the mouse, then draws outlines around that control and all of its parents.

There’s obviously a lot more that could be added to this – like shading padded regions, having a tool tip display which control you’re over along with attributes, but as I don’t know when I’ll have the time (and more importantly the inclination) to make those enhancements happen, I’m putting it up now.

You can right click on the example to view the source. If the embed isn’t working, you can go here where all the flash player autoinstall magic will happen.

Note: this will only work in your application if your root application object has absolute layout because of the way flex containers work- i.e. only containers with absolute layout allow overlapping controls. I had some code that attempted to reparent the children of the application if it had vertical or horizontal layout, but the guts of flex were throwing an error so that code is commented out presently.

I found out something cool about JSTL’s expression language today that I didn’t expect at all – like a real scripting language, medical EL allows one to access an object’s fields not only like object.fieldname (which wouldn’t be changeable at runtime) but also as object[“fieldname”] and by extension with any string variable as an argument. Wow! I don’t remember seeing any examples of this in the wild – somehow i stumbled upon it in the bowels of a unified expression language tutorial.

This came in handy DRYing out some JSP code today that differed only in the fields accessed on the same kind of objects – I thought immediately, order if only I had a scripting language this could all go away! Then I thought about the long and winding road one might take to do that in java, involving reflection and/or long if/else blocks. I was pleasantly surprised to find out JSTL can do that. Phew.
Here’s a poor example:

What you’d have to do w/o this capability

${object.dog}
${object.cat}

and with

<c :set var="fieldName" value="dog" />
${object[fieldName]}

i told you it was a bad example.

Can’t share session contents between applications on Tomcat

I got stuck for a little while on this problem today – I was pushing some content into the session from a Flex app into one web app, approved and then trying to read it from another. It turns out this won’t work under tomcat. I thought at first it was because the cookie paths were different, neurosurgeon so I added emptySessionPath=true to my server.xml file, medicine but that actually doesn’t fix the problem. Thinking about it, it is perfectly reasonable that two web applications that aren’t associated in some way (as being in the same ear) can’t get at each others session information.
Lesson learned: either share state in some non-session mechanism (like the database) or move the resources into the same application.