1080p ViewSonic monitor and OS X

If you have a table with a column that appears as the first column in a multi-column index and again with its own single-column index, you may be over-indexing. Postgres will use the multi-column index for queries on the first column alone. First, a pointer to the Postgres docs that I can never find, and then some data on the performance of multi-column indexes vs. single-column ones.

From the docs

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.


Performance

If you click around that part of the docs, you’ll surely come across the discussion of multi-column indexes and performance, in particular this passage:

You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone.

Life is full of performance tradeoffs, so let’s explore just how much slower it is to use a multi-column index for single-column queries.

First, let’s create a dummy table:

CREATE TABLE foos_and_bars
(
  id serial NOT NULL,
  foo_id integer,
  bar_id integer,
  CONSTRAINT foos_and_bars_pkey PRIMARY KEY (id)
);

Then, using R, we’ll create 3 million rows of nicely distributed data:

rows <- 3000000
foo_ids <- seq(1, 250000)  # many distinct foo_id values (high cardinality)
bar_ids <- seq(1, 20)      # few distinct bar_id values (low cardinality)
data <- data.frame(foo_id = sample(foo_ids, rows, replace = TRUE),
                   bar_id = sample(bar_ids, rows, replace = TRUE))

Dump that to a text file, load it up with COPY, and we’re good to go.
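In case it’s useful, the load step might look something like this, assuming the data frame was written out with write.csv(data, "/tmp/foos_and_bars.csv", row.names = FALSE) – the file path is mine, and you’d use psql’s \copy instead if the server can’t read the file directly:

COPY foos_and_bars (foo_id, bar_id)
FROM '/tmp/foos_and_bars.csv'
WITH CSV HEADER;

The id column is left out of the column list so its serial default fills it in.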

Create the compound index:

CREATE INDEX foo_id_and_bar_id_index
ON foos_and_bars
USING btree
(foo_id, bar_id);

Run a simple query to make sure the index is used:

test_foo=# explain analyze select * from foos_and_bars where foo_id = 123;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foos_and_bars  (cost=4.68..55.74 rows=13 width=12) (actual time=0.026..0.038 rows=8 loops=1)
   Recheck Cond: (foo_id = 123)
   ->  Bitmap Index Scan on foo_id_and_bar_id_index  (cost=0.00..4.68 rows=13 width=0) (actual time=0.020..0.020 rows=8 loops=1)
         Index Cond: (foo_id = 123)
 Total runtime: 0.072 ms
(5 rows)

Now we’ll make 100 queries by foo_id with this index in place, and then repeat with only a single-column index installed (sketched just after the benchmark script below), using this code:

require 'rubygems'
require 'benchmark'
require 'pg'

TEST_IDS = [...] # 100 ids randomly selected in R

conn = PGconn.open(:dbname => 'test_foo')

# Time a single query by foo_id and return the elapsed seconds
def perform_test(conn, foo_id)
  Benchmark.realtime do
    res = conn.exec("select * from foos_and_bars where foo_id = #{foo_id}")
    res.clear
  end
end

TEST_IDS.map { |id| perform_test(conn, id) } # warm things up?
data = TEST_IDS.map { |id| perform_test(conn, id) }

data.each do |d|
  puts d
end
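The post doesn’t spell out the single-column index used for the second run, but it would presumably look something like this (the index name is my own), with the compound index dropped first so the planner has to use the new one:

CREATE INDEX foo_id_index
ON foos_and_bars
USING btree
(foo_id);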

How do things stack up? I’d say about evenly:
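Whichever way the timings land, the other half of the tradeoff the docs mention is size. Postgres can report that directly; here’s a quick check, assuming the hypothetical foo_id_index from the sketch above exists alongside the compound index:

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('foo_id_and_bar_id_index', 'foo_id_index');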


If you’re hooking up a Mac OS X machine to a 1080p monitor via a Mini DisplayPort to HDMI adapter, you may find your display settings don’t include a 1920×1080 option, and the 1080p setting produces an image with the edges cut off. Adjusting the overscan/underscan slider will make the image fit, but it turns fuzzy.

Solution: check the monitor’s settings. On my ViewSonic VX2453, the HDMI inputs have two settings, “AV” and “PC”. Switching it to PC solved the problem, and now the picture is exactly the right size and crisp.

I spent some time futzing around with SwitchRes and several fruitless reboots before discovering the setting, so I hope this saves someone time!

OSX VPN Problems: Kill the racoon

Occasionally my Mac will refuse to connect to work’s IPSec VPN with the error message:
“A configuration error occurred. Verify your settings and try reconnecting.”

This usually happens after a long time between reboots, and a reboot usually lets me connect successfully again. Rebooting when I’m in the middle of something can be a pain, so I did some research and found a better way. There’s a process called “racoon” – it performs the key exchange operations that set up IPSec tunnels. Kill it (using kill or Activity Monitor) and your VPN will start working again.

Works on OSX 10.6.5 and 10.6.6

My experience setting up a MOCA network at home

ING Direct put up a short manifesto titled “We, The Savers”. It’s a good read, and we could all do better by it.

Number 3 struck me especially:

We will take care of our money. It’s not enough to have money in a bank. We will put it where it will grow. We’ll keep track of it. And we’ll check every account we have every year to protect ourselves against fraud or escheatment.

“We will put it where it will grow” – well, where will it grow? It seems the first tool brought to bear on any stock market bump is lowering interest rates, which in effect punishes those of us who actually do have money in a savings account. We lament the low savings rate in America, but then we go and make it more appealing to borrow and less appealing to save.

Another item is this – not everyone has the internet access or savvy to move their money to a place like ING Direct. Those people have their money stuck in a savings account that probably pays well under one percent interest. I think it’s high time this country had a better program to get more people online, so people can get away from their no-interest-paying bank.

I bought a couple of coax-Ethernet bridges in the hopes of speeding media transfers to and from my TiVo HD. The devices work great, but it turns out my TiVo itself is the bottleneck – it just doesn’t serve media very fast, even over Ethernet. I recommend a MoCA (http://www.mocalliance.org) based ethernet-over-coax network if you’re in need of more speed than wireless will give you, but don’t expect miracles on the TiVo front.

Why go back to wires?

Sure, wireless is nice and easy and fast enough for many applications, but you can’t beat a wire for guaranteed bandwidth. I live in a densely populated area where I can see about 40 wireless networks, and about a third of those overlap my wireless band to one degree or another. I get just a fraction of the theoretical 54 Mbps of an 802.11g Wi-Fi network. Compare that to 100 Mbps point to point for coax (actually around 240 Mbps of total bandwidth if you’ve got a mesh network set up).

Taking the plunge

First you’ve got to get yourself a couple of coax bridges. The problem here is that no one sells them at retail right now. Fortunately, Verizon’s FiOS service made heavy use of the Motorola NIM-100 bridge and is now phasing them out, so you can get them cheap on eBay. I got a pair for $75, shipped.

Each bridge has an Ethernet port and two coax ports, one labeled “in” and the other labeled “out”. If you have cable internet, you’ll likely put one of these next to your cable modem. In that case, connect a wire from the wall to the coax “in” port, and another from the “out” port to the cable modem. Run an Ethernet wire to your router, and now you’ve got an Ethernet network running over your coaxial cable wiring. Plug another bridge in somewhere else in your house, wall to the “in” port and Ethernet to some device, and you’re in business. I got north of 80 Mbps between two laptops over the coax bridge.

This should work out of the box if your bridges came reset to their factory configuration. Unfortunately, that also means you can’t administer them, and they’re using a default encryption key (traffic over the coax is encrypted because it probably leaks a bit outside your house).

Taking control

I’d recommend spending a bit of time making your new bridges configurable – they have web interfaces, it’s just getting to them that’s tricky. I pieced together this information from several sources on the web.
The first problem is getting into the web interface. By default the bridge auto-assigns itself an IP address in the range 169.254.1.x, and it won’t accept admin connections from devices that aren’t on the same IP range, so here’s what you do:

  1. Take a computer and set your ethernet interface to have a static IP address of 169.254.1.100
  2. Connect the computer directly to the bridge over ethernet
  3. Go to http://169.254.1.1. If that doesn’t work, increment the last digit until it does
  4. When you see the web interface, the default password is “entropic” – they’re apparently the only people who make the chips for these devices

Once you’re in, the configuration works much like any other network device’s. You should definitely set a new password under “coax security” – you’ll have to repeat this for all your devices. Also, I’d recommend setting the device to use DHCP or a fixed IP in your usual IP range if you’d like to change anything in the future.

iPhone can’t keep time

Every now and then my iPhone has this issue where it can’t tell time properly. I wake it up, and it shows me a time several hours ago, then as if waking from a drunken stupor, slowly tries to catch up to reality, moving the clock forward by a small, random number of minutes. During these episodes the whole UI is sluggish, and it apparently doesn’t even accept phone calls. When “phone” is 5/6 of your name one would think at least that would work all of the time!

Check out this screenshot from the missed call sheet. It recorded 3 missed calls that arrived over the course of an afternoon all with the exact same arrival time, 9:40 AM. The phone never rang.

That was with v2.01, so I sure hope this is fixed in the future.

Update, re: Frank’s comment – this wasn’t a matter of the phone bouncing between time zones. The phone’s time wasn’t off by a whole number of hours.

Ben Fry Guest Lecture

A couple of weeks ago I had the good fortune of sitting in on a lecture of a scientific visualization class* at Tufts, at which Ben Fry, creator of many great visualizations that could fairly be called information art, as well as of the Processing visualization toolkit, was the guest speaker. The talk was great, spanning lots of work and interesting commentary.

Some notes:

  • Ben showed quite a bit of his previous work – some of it would be familiar to readers of his book, Visualizing Data.
  • Showed off some of his work that has appeared in movies, highlighting the fact that he is asked to add rows of standard grey computer buttons to his work because it doesn’t look “real” otherwise.
  • Talked about some experience teaching classes, particularly the challenges of classes with mixtures of cs students and artists. Making CS students do projects more artsy and artists do more interactive, technical work can be interesting. He showed off some examples of student work. (One cool student project asked a set of Nobel laureates what type of pets they had. Quite a few found time to respond and the results are here.)
  • The coolest demos were of some of the work he’d done for Oblong Industries (not a lot of information online right now – here’s one CNET article). They have a working Minority Report-style gesture interface that lets you control a computer with hand movements. Paired with the right interface, this looks to make light work of navigating vast amounts of multidimensional data. Ben showed some videos, along with a demo (running on his MacBook Pro without the fancy hardware, it was still really cool).

* I’d asked several times for a class like this to be offered while I was still working on my degree at Tufts, but to no avail. Of course it’s offered right after I graduate!

Ignite Boston Recap

I went to the third installment of Ignite Boston this evening. It’s a series of five-minute lightning talks on various technical topics, along with a couple of upsized keynotes. My (partial) recap:

  • The most illuminating talk for me was by Jonathan Zdziarski on security in the iPhone ecosystem. It turns out using the iPhone is a huge security risk: people are actively hacking on the iPhone but not disclosing their hacks, because Apple would fix them. This makes it really easy to get data off a stolen iPhone. Scarier still, if your iPhone breaks and you turn it in for a new one under warranty, the person who buys your old one refurbished has a pretty good chance of recovering your data. Pretty scary stuff, and downright outrageous that Apple doesn’t do a better job of wiping the memory under those circumstances. See Jonathan’s site for more information.
  • Jesse Vincent had a good rant on the parallel between sharecropping and Web 2.0 sites. It’s your data and your time, but their property, tools, and profit…
  • Juhan Sonin talked about tenets of beautiful design. He’s putting together an online collection of them on Wikia somewhere, but I don’t have the URL handy. There’s an earlier draft of the presentation on Flickr.
  • Alexander Wissner-Gross presented co2stats.com, which aims to (precisely) calculate the CO2 emissions of a website based on its location and the locations of its users (e.g., having lots of browsers from West Virginia = lots of coal burning). They’ll “automatically” buy carbon offsets for you so you can advertise your site as green, but I’m still not convinced carbon offsets mean anything. Lots of money is pouring in there, but not a lot of proof of what works, what doesn’t, and for how long.

Lessig at the Berkman center

I had the pleasure of seeing Lawrence Lessig unveil the next phase of the Change Congress movement last Friday at the Berkman Center at Harvard. Lessig gives phenomenal presentations and could probably be compelling talking about just about any topic. The topic this time was the distorting (rather than corrupting) influence money has on politics, and I thought it was eye-opening and informative.

Lessig mentioned a study showing people stop reading or tune out of news as soon as political donations are mentioned as part of a story, so even without real corruption most of the time, the appearance of influence is enough to make large numbers of people disengage from the political process.

I wish I could link to the talk, but as far as I can tell it’s not yet online despite being webcast live. Check out the event’s page; hopefully a link to the video will appear there one day.

Update: there’s a video posted on the Change Congress blog.

Google Finance’s new stock screener has sparklines

I noticed this morning that Google Finance has a new stock screener feature that lets you choose stocks with metrics in a certain range by way of an interactive sparkline. Sparklines are miniature graphs that sit inline with text. In this case the graph is a histogram indicating how much of the stock market falls into each part of the range – this gives you a quick preview of how inclusive your search parameters are.

[Screenshot: the Google Finance stock screener with sparkline histograms]

Not so much java for local web startups

There’s a local group of entrepreneurs and developers that meets every couple of months in Cambridge. I was curious about this month’s presenters’ choices of development platform, so I took a look at their headers, and here’s what I found.

Of the 7 presenters, the platform stats fall out thusly:
2 Ruby on Rails (plus one suspected, but not confirmed)
2 PHP
1 ASP.NET
1 Python (CherryPy)

By way of contrast, a quick and dirty survey of jobs in Boston/Cambridge/Brookline on Craigslist turned up the following stats:
232 jobs containing Java
113 jobs containing ASP.net
164 jobs containing PHP
46 jobs containing Python
34 jobs containing Ruby

Presumably the difference is that lots of folks in the area are working at medium-sized companies on older, established (I won’t say “legacy”) systems?

We’ll keep the carbon credits, thanks

I saw on the Globe’s website that the founder of Zipcar has started a new company, goloco.com, which aims to promote ride sharing by splitting up the costs of a trip, handling payments to the driver, and taking a 10% cut of the proceeds. I don’t know why, but I happened to skim the terms of service, which were all pretty standard stuff, until I found this:

13. Carbon Credits

You agree to assign the rights to any Carbon Credits resulting from any trips arranged using our service to GoLoco.

Pretty crafty – if they do well, and if we ever get some kind of cap-and-trade system for carbon (which is a lot of ifs), they could stand to make more money selling carbon credits than on their users’ tithe.