I had a request the other day: how many simultaneous users are on the site, by time of day. I already have a session database that’s computed nightly from weblogs: it contains the times at which each session started and ended.
CREATE TABLE sessions ( user_id integer NOT NULL, start_at timestamp without time zone, end_at timestamp without time zone, duration double precision, views integer )
I thought for sure the next step would be to dump some data, then write some Ruby or R to scan through sessions and see how many sessions were open at a time.
Until I came up with a nice solution in SQL (Postgres). Stepping back, if I can sample from sessions at say, one-minute intervals, I can count the number of distinct sessions open at each minute. What I need is a row per session per minute spanned. Generate_series is a “set returning function” that can do just that. In the snippet below, I use generate_series to generate a set of (whole) minutes from the start of the session to the end of the session. That essentially multiplies the session row into n rows, one for each of the minutes the session spans.
From there, it’s easy to do a straight forward group by, counting distinct user_id:
with rounded_sessions as ( select user_id, start_at, end_at, generate_series(date_trunc('minute',start_at), end_at, '1 minute') to_the_minute from sessions where start_at between '2012-01-21' and '2012-01-28' ) select to_the_minute, count(distinct user_id) from rounded_sessions group by 1
The date_trunc call is important so that session rows are aligned to whole minutes, if that’s not done, then none of the rows will align for the counts.
That set won’t include rows that had no users logged in. To do that, the query below will use generate_series again to generate all the minutes from the first minute present to the last, then left join the counts to that set, coalescing missing entries to zero.
with rounded_sessions as ( select plm_users.user_id, start_at, end_at, generate_series(date_trunc('minute',start_at), end_at, '1 minute') as to_the_minute from sessions where start_at between '2012-01-21' and '2012-01-28' ), counts_by_minute as ( select to_the_minute, count(distinct user_id) from rounded_sessions group by 1 ), all_the_minutes as ( select generate_series(min(to_the_minute), max(to_the_minute), '1 minute') as minute_fu from rounded_sessions ) select to_the_minute , coalesce(count, 0) as users from all_the_minutes left join counts_by_minute on all_the_minutes.minute_fu = counts_by_minute.to_the_minute