Archipelago Of Accounts – The Banks Always Win

August 8th, 2011

At work, our health insurance has been switched to a high-deductible PPO. Not to worry, we’ve also been granted Health Savings Accounts (HSA) in which to save money, tax-free, to pay bills before meeting the deductible.

That’s all well and good, but I can’t shake the feeling that every time legislation comes out to encourage some activity (retirement, saving for education, health care), the only winner is the financial services industry.

Here’s why: all of these activities require one to maroon a slice of money in an account designated for that purpose. What comes with accounts? That’s right, fees to the bank. The Wells Fargo HSA we’ve got is $4.25 a month (paid, for now, by work). That’s $51 a year to hold money. The interest rate is a paltry 0.1%, so with $2000 in that account (the minimum cash balance before we’re allowed to invest), I’d make about $2.00 a year (a net of -$49 if I were paying the fees, as I will one day). Thanks for nothing. Further, while some banks graciously waive fees for meeting minimum balances, it’s harder for many people to meet the balance since their money is split so many ways.
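For anyone who wants to plug in their own account's numbers, the arithmetic above can be sketched in a few lines of Ruby. The method name and parameters are my own invention for illustration, and it assumes simple (non-compounding) interest over one year:

```ruby
# Hypothetical helper: yearly interest earned minus yearly account fees.
# Assumes simple interest on a constant balance for one year.
def hsa_net_return(balance, annual_rate, monthly_fee)
  interest = balance * annual_rate
  fees     = monthly_fee * 12
  interest - fees
end

# Figures from the paragraph above: $2000 balance, 0.1% interest, $4.25/month.
net = hsa_net_return(2000.0, 0.001, 4.25)
puts format("net return for the year: $%.2f", net)
```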

These accounts limit my flexibility to spend as life events occur, limit the returns on my money, and cost me fees and headaches. More statements to read, cards to carry, and fine print to decode.

If costs are to be tax-deductible, why not fix the tax code so that all medical expenses, not just those over a certain amount, are deductible, rather than giving these shameless handouts to the banks? Let me deduct things come tax time.

Getting Wukong and Pig Working Together on Amazon Elastic Map Reduce

March 16th, 2011

Apache Pig is a great language for processing large amounts of data on a Hadoop cluster without delving into the minutiae of map reduce.

Wukong is a great library to write map/reduce jobs for Hadoop from ruby.

Together they can be really great, because problems unsolvable in Pig without resorting to writing a custom function in Java can be solved by streaming data through an external script, which Wukong nicely wraps. The Data Chef blog has a great example of using Pig to choreograph the data flow, and ruby/wukong to compute Jaccard Similarity of sets.

Working with Wukong on Elastic Map Reduce

Elastic map reduce is a great resource – it’s very easy to quickly have a small hadoop cluster at your disposal to process some data. Getting wukong working requires an extra step: installing the wukong gem on all the machines in the cluster.

Fortunately, elastic map reduce allows the use of bootstrap scripts located on S3, which run on boot for all the machines in the cluster. I used the following script (based on an example on stackoverflow):

sudo apt-get update
sudo apt-get -y install rubygems
sudo gem install wukong --no-rdoc --no-ri

Using Amazon’s command line utility, starting a cluster that is ready to use in Pig interactive mode looks like this:

elastic-mapreduce --create --bootstrap-action [S3 path to wukong-bootstrap.sh] --num-instances [a number] --slave-instance-type [ machine type ] --pig-interactive --ssh

The web tool for creating clusters has a space for specifying the path to a bootstrap script.

Next step: upload your pig script and its accompanying wukong script to the name node, and launch the job. (It’s also possible to do all of that when starting the cluster by passing more arguments to elastic-mapreduce, with the added advantage that the cluster will terminate with your job.)

1080p ViewSonic monitor and OS X

March 8th, 2011

If you’re hooking up a Mac OS X machine to a 1080p monitor via a mini displayport to HDMI adapter, you may find your display settings don’t have a 1920×1080 setting, and the 1080p setting produces an image with the edges cut off. Adjusting the overscan/underscan slider will make the image fit, but it turns fuzzy.

Solution: check the monitor’s settings. On my ViewSonic VX2453 the HDMI inputs have 2 settings: “AV” and “PC”. Switching it to PC solved the problem, and now the picture is exactly the right size and crisp.

I spent some time futzing around with SwitchRes and several fruitless reboots before discovering the setting, so I hope this saves someone time!

Redundant Indexing in PostgreSQL

February 8th, 2011

If you have a table where a column is the first column in a multi-column index and also has its own single-column index, you may be over-indexing. Postgres will use the multi-column index for queries on the first column.

From the docs

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
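To spot candidates for this kind of redundancy, here is a minimal sketch in Ruby. It assumes you have already pulled your index definitions into a hash of index name to ordered column list (the helper name and the sample index names are hypothetical), and it flags any index whose columns are a leading prefix of a wider index. Treat the output as a list of things to investigate, not to drop blindly: unique constraints and the size difference discussed below still matter.

```ruby
# Hypothetical helper: returns the names of indexes whose column list is a
# leading (leftmost) prefix of some other, wider index on the same table.
def redundant_prefix_indexes(indexes)
  indexes.select do |name, cols|
    indexes.any? do |other_name, other_cols|
      other_name != name &&
        other_cols.length > cols.length &&
        other_cols.first(cols.length) == cols
    end
  end.keys
end

# Illustrative index definitions for one table (names are made up).
indexes = {
  'foo_id_index'            => ['foo_id'],
  'bar_id_index'            => ['bar_id'],
  'foo_id_and_bar_id_index' => ['foo_id', 'bar_id']
}

puts redundant_prefix_indexes(indexes).inspect
```

Note that bar_id_index is not flagged: bar_id is not the leading column of the compound index, so a separate index on it can still earn its keep.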


Performance

If you click around that section of the docs, you’ll surely come across the section on multi-column indexing and performance, in particular this section (bold emphasis mine):

You could also create a multicolumn index on (x, y). This index would typically be more efficient than index combination for queries involving both columns, but as discussed in Section 11.3, it would be almost useless for queries involving only y, so it should not be the only index. A combination of the multicolumn index and a separate index on y would serve reasonably well. For queries involving only x, the multicolumn index could be used, though it would be larger and hence slower than an index on x alone

Life is full of performance tradeoffs, so we should explore just how much slower it is to use a multi-column index for single-column queries.

First, let’s create a dummy table:

CREATE TABLE foos_and_bars
(
  id serial NOT NULL,
  foo_id integer,
  bar_id integer,
  CONSTRAINT foos_and_bars_pkey PRIMARY KEY (id)
);

Then, using R, we’ll create 3 million rows of nicely distributed data:

rows = 3000000
foo_ids = seq(1,250000,1)
bar_ids = seq(1,20,1)
data = data.frame(foo_id = sample(foo_ids, rows,TRUE), bar_id= sample(bar_ids,rows,TRUE))

Dump that to a text file and load it up with \copy and we’re good to go.

Create the compound index:

CREATE INDEX foo_id_and_bar_id_index
  ON foos_and_bars
  USING btree
  (foo_id, bar_id);

Run a simple query to make sure the index is used:

test_foo=# explain analyze select * from foos_and_bars where foo_id = 123;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foos_and_bars  (cost=4.68..55.74 rows=13 width=12) (actual time=0.026..0.038 rows=8 loops=1)
   Recheck Cond: (foo_id = 123)
   ->  Bitmap Index Scan on foo_id_and_bar_id_index  (cost=0.00..4.68 rows=13 width=0) (actual time=0.020..0.020 rows=8 loops=1)
         Index Cond: (foo_id = 123)
 Total runtime: 0.072 ms
(5 rows)

Now we’ll run 100 queries by foo_id with this index in place, then repeat with only a single-column index on foo_id installed, using this code:

require 'rubygems'
require 'benchmark'
require 'pg'

TEST_IDS = [...] #randomly selected 100 ids in R

# PG.connect is the current name for PGconn.open in the pg gem.
conn = PG.connect(:dbname => 'test_foo')

# Returns the wall-clock time, in seconds, taken by one query on foo_id.
def perform_test(conn, foo_id)
  Benchmark.realtime do
    res = conn.exec("select * from foos_and_bars where foo_id = #{foo_id}")
    res.clear # release the result's memory
  end
end

TEST_IDS.each { |id| perform_test(conn, id) } # warm things up?
data = TEST_IDS.map { |id| perform_test(conn, id) }

# Print one timing per line.
data.each do |d|
  puts d
end
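The script above prints one timing per query. To compare the two runs as numbers, something like the following could boil each run down to a few summary statistics. This is a sketch rather than what I actually ran; the summarize helper and the sample timings are made up for illustration.

```ruby
# Hypothetical helper: mean, median and max of an array of timings (seconds).
def summarize(timings)
  sorted = timings.sort
  n      = sorted.length
  mid    = n / 2
  median = n.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  mean   = sorted.inject(0.0) { |sum, t| sum + t } / n
  { :mean => mean, :median => median, :max => sorted.last }
end

# Made-up timings, only to show the call; substitute the real `data` arrays.
compound_run = [0.0004, 0.0006, 0.0005, 0.0009]
puts summarize(compound_run).inspect
```

The median is worth reporting alongside the mean, since a single slow query (a cold cache, say) can drag the mean around in a run of only 100 queries.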

How do things stack up? I’d say about evenly:

Remember: Indexing isn’t free, and Postgres is pretty good at using (and reusing) your indexes, so you may not need to create as many as you think.

OSX VPN Problems: Kill the racoon

January 18th, 2011

Occasionally my Mac will refuse to connect to work’s IPSec VPN with the error message:
“A configuration error occurred. Verify your settings and try reconnecting”

This usually happens to me after a long time between reboots, and a reboot usually allows me to successfully connect again. Rebooting when I’m in the middle of something can be a pain, so I did some research and found a better way. There’s a process called “racoon” – it performs key exchange operations to set up IPSec tunnels. Kill it (using kill or activity monitor) and your VPN will start working again.

Works on OSX 10.6.5 and 10.6.6

(Ab)using memoize to quickly solve tricky n+1 problems

December 8th, 2010

Usually, discovering n+1 problems in your Rails application that can’t be fixed with an :include statement means lots of changes to your views. Here’s a workaround that skips the view changes; I discovered it while working with Rich to improve the performance of some Dribbble pages. It uses memoize to convince your n model instances that they already have all the information needed to render the page.

While simple belongs_to relationships are easy to fix with :include, let’s take a look at a concrete example where that won’t work:

class User < ActiveRecord::Base
  has_many :likes
end

class Item < ActiveRecord::Base
  has_many :likes
  def liked_by?(user)
    likes.by_user(user).present?
  end
end

class Like < ActiveRecord::Base
  belongs_to :user
  belongs_to :item
end

A view presenting a set of items that called Item#liked_by? would be an n+1 problem that wouldn’t be well solved by :include. Instead, we’d have to come up with a query to get the Likes for the set of items by this user:

Like.of_item(@items).by_user(user)

Then we’d have to store that in a controller instance variable, and change all the views that called item.liked_by?(user) to access the instance variable instead.

Active Support’s memoize functionality stores the results of function calls so they’re only evaluated once. What if we could trick the method into thinking it’s already been called? We can do just that by writing data into the instance variables that memoize uses to save results on each of the model instances. First, we memoize liked_by? in the Item class (which needs to extend ActiveSupport::Memoizable for memoize to be available):

  extend ActiveSupport::Memoizable
  memoize :liked_by?

Then bulk load the relevant likes and stash them into memoize’s internal state:

def precompute_data(items, user)
  likes = Like.of_item(items).by_user(user).index_by {|like| like.item_id}
  items.each do |item|
    item.write_memo(:liked_by?, likes[item.id].present?, user)
  end
end

The write_memo method is implemented as follows.

  def write_memo(method, return_value, args=nil)
    ivar = ActiveSupport::Memoizable.memoized_ivar_for(method)
    if args
      if hash = instance_variable_get(ivar)
        hash[Array(args)] = return_value
      else
        instance_variable_set(ivar, {Array(args) => return_value})
      end
    else
      instance_variable_set(ivar, [return_value])
    end
  end
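If you want to see the trick in isolation, here is a self-contained sketch that runs without Rails. TinyMemo is a stand-in I wrote for illustration, not ActiveSupport's implementation, and its instance variable naming is an assumption. The idea is the same: the memoized method keeps a hash keyed by its arguments, and write_memo fills that hash before the method is ever called.

```ruby
# Illustrative stand-in for ActiveSupport's memoize (NOT the real thing).
module TinyMemo
  # Instance variable used to cache results for a given method name.
  def self.ivar_for(name)
    "@_memo_#{name.to_s.sub(/\?\z/, '_query')}"
  end

  # Wraps an existing instance method so results are cached per argument list.
  def memoize(name)
    original = instance_method(name)
    ivar     = TinyMemo.ivar_for(name)
    define_method(name) do |*args|
      cache = instance_variable_get(ivar) || instance_variable_set(ivar, {})
      cache.fetch(args) { cache[args] = original.bind(self).call(*args) }
    end
  end
end

class Item
  extend TinyMemo
  attr_reader :lookups

  def initialize
    @lookups = 0
  end

  def liked_by?(user)
    @lookups += 1 # stands in for the per-item database query
    false
  end
  memoize :liked_by?

  # Primes the cache so the wrapped method never has to run for these args.
  def write_memo(method, return_value, *args)
    ivar  = TinyMemo.ivar_for(method)
    cache = instance_variable_get(ivar) || instance_variable_set(ivar, {})
    cache[args] = return_value
  end
end

item = Item.new
item.write_memo(:liked_by?, true, :alice)
puts item.liked_by?(:alice) # answered from the primed cache
puts item.lookups           # the "query" never ran
```

Because the cache is looked up with fetch rather than ||, a primed false is honored too, which matters here since most items on a page won't be liked by the current user.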

The problem described here could be solved with some crafty left joins added to the query that fetched the items in the first place, but when there are several different hard-to-prefetch properties, such a query would likely become unmanageable, if not terribly slow.

Idiot Calling on Twitter – Frequency of You’re vs Your

December 2nd, 2010

At the risk of being forever branded a grammar elitist, let’s take a quick look at use of the phrase “your an idiot” on Twitter.

Inspired by the tweet by @doctorzaius referencing a URL to Twitter’s search page for “your an idiot”, I used Twitter’s streaming API to download a sample of 6581 tweets containing the word “idiot” overnight, for about 12 hours.

Of these 6581 tweets, 65 contained our friend “your an idiot”. 161, two and a half times as many, contained “you’re an idiot”. Additionally, there were 2 tweets with “your such an idiot”, and just one “you’re such an idiot”. The forces of good grammar have won this round?
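The counting itself is simple once the tweets are collected. Here is a rough sketch of how it could be done, assuming the tweet texts are already in an array of strings; the helper name and sample tweets are made up, and this isn't necessarily the exact script I used. It lowercases each tweet and treats curly and straight apostrophes alike.

```ruby
PHRASES = [
  "your an idiot",
  "you're an idiot",
  "your such an idiot",
  "you're such an idiot"
]

# Hypothetical helper: counts how many tweets contain each phrase.
def phrase_counts(tweets, phrases = PHRASES)
  counts = Hash.new(0)
  tweets.each do |text|
    normalized = text.downcase.tr("’", "'") # fold curly apostrophes
    phrases.each { |phrase| counts[phrase] += 1 if normalized.include?(phrase) }
  end
  counts
end

# Made-up tweets, just to show the call.
sample = ["Your an idiot", "No, you’re an idiot!", "what an idiot"]
puts phrase_counts(sample).inspect
```

With the real counts, the ratio is 161 / 65.0, or roughly 2.5.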

Note: This is a very small sample. It may be interesting to compare Facebook status updates to see what the you’re/your ratio looks like there one day…

FluidSurveys Data Export Issue, Solved with iconv

November 24th, 2010

I recently ran a survey at work using FluidSurveys. Their survey building tools are excellent, and they have great support, but I ran into a time-consuming issue when it came time to process the responses, because the export files are double-byte Unicode, UTF-16LE to be specific. Turns out knowing that is 90% of the battle.

The files on first inspection are a bit strange, because although they spring from a csv export button, they’re tab-delimited, but with CSV-style quoting conventions. That’s easy enough to work around, but R and Ruby both barfed reading the files. I cottoned on to the fact that the files had some odd characters in them, so I recruited JRuby and ruby 1.9 to try to load them, due to better unicode support, but still couldn’t quite get the parameters right.

Then I thought of iconv, the character set converting utility. Since in this case the only special character was the ellipsis character, I was happy to strip it out, and the following command does the trick:

iconv -f UTF-16LE -t US-ASCII -c responses.csv > converted_responses.csv
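For the record, once you know the encoding, Ruby 1.9 and later can do the same conversion in-process with String#encode. This is a sketch of an equivalent, not something I ran against the original export; the helper name is made up. Like iconv's -c flag, it silently drops any character that has no ASCII equivalent.

```ruby
# Hypothetical helper: reinterpret raw bytes as UTF-16LE and transcode to
# US-ASCII, dropping anything that can't be represented.
def utf16le_to_ascii(bytes)
  bytes.dup.force_encoding("UTF-16LE")
       .encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "")
end

# Usage with the files from the command above:
#   raw = File.binread("responses.csv")
#   File.write("converted_responses.csv", utf16le_to_ascii(raw))

# Small in-memory demonstration: an ellipsis character gets stripped.
sample = "wait\u2026 what".encode("UTF-16LE").b
puts utf16le_to_ascii(sample)
```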

And, as they say, Bob’s your uncle.

The Sad Story of 8 Theriault Court, Cambridge Massachusetts

September 30th, 2010

If you’re buying a house, especially if you’re putting less than 20% down and/or the house is on a private way, you may want to read through our long and stressful ordeal that cost us many months and $20,000.

8 Theriault Court

In March 2010, we set about buying 8 Theriault Court, Cambridge, Massachusetts, owned by Catherine and Rafael Clemente, Jr. The house was listed at $449,000. We submitted an Offer to Purchase the house for $420,000, and after negotiations and an inspection, we agreed upon a purchase price of $434,000 and signed a Purchase & Sale (P&S) agreement.

We also made these concessions in the P&S at the sellers’ request:

  • We proposed to put down $10,000 as a good faith deposit, but the sellers required a 5% deposit. We acquiesced, and our total deposit held in escrow with Coldwell Banker (the sellers’ realtor) was $21,700.
  • We were willing to accommodate the new home search of the sellers by agreeing to a very flexible closing date up to June 30, 2010 (up to 86 days from the date of the signed P&S).

Mortgage loan attempt #1

After finalizing the P&S agreement, it was time to apply for a mortgage. During the mortgage approval process, the appraisal came back with a value of $417,000 on April 22, 2010. Based on this appraisal, we went back to the sellers to negotiate a selling price for which we could obtain a mortgage, based on the true value of the house. We were only able to renegotiate the selling price down to $425,000, with us bringing cash to closing to make up the difference between the appraised value we could seek a mortgage for and the amount the sellers wanted to get for the house.

As part of the price renegotiation, the sellers changed the realtors’ commission rate from 5% to 4% without first getting permission from our realtor. Our realtor was cornered into accepting this pay cut or risk our losing the opportunity to buy the house.

At this point, we also discovered that our first mortgage loan was turned down due to a private way issue (more about that below).

Mortgage loan attempt #2

We knew that we’d have to try another loan provider and would, thus, need to have another appraisal conducted. As part of our new price negotiations with the sellers, we also had to agree that if this second appraisal was,

“…higher than the purchase price [of $425,000], then the parties agree to negotiate in good faith relative to increasing the purchase price with a cap on the purchase price of $434,000.”

This provision was definitely in the sellers’ favor, as they explicitly ruled out lowering the price should the second appraisal affirm the low valuation of the first. Despite all of these provisions, we decided to press on.

The second appraisal came back at $415,000 on May 21, 2010, confirming the earlier appraisal’s assertion that the sellers, at $425,000, were getting more for the house than it was worth.

During this period of time, we had about a week of daily negotiating around P&S amendments, pushing the mortgage contingency date back day by day as we waited for mortgage approval.
When approval arrived, it was conditional upon securing private mortgage insurance (PMI). Because we were paying 10% down to close on the house, we needed to secure PMI. Our mortgage broker and lawyer seemed certain that PMI is almost always approved. We thought we were good to go for closing, so we let the mortgage contingency date slide by.

PMI Denied

After the mortgage contingency date, two different PMI company underwriters decided that, based on the comparable houses selling in that neighborhood, 8 Theriault Court was worth less than our mortgage amount. They wouldn’t underwrite the insurance to finalize our mortgage loan approval. They wouldn’t say how much they thought the house was worth, but our denial letter said the following:

“The property does not meet [insurance company’s] minimum underwriting standard due to nonsupport of value from comparables,” and, “The property does not meet [insurance company’s] minimum underwriting standards due to overall poor functional utility.”

And the other denial said, “Comps do not adequately support value.”

Despite these developments, the sellers were unwilling to come down on the selling price.

Mortgage loan attempt #3

Not being able to secure a conventional mortgage loan, we decided to instead pursue a government loan via FHA. FHA regulations require either the presence of an easement on the title documents of all properties on the private way, or a roadway maintenance agreement signed by all property owners abutting the way, stating that owners of the properties agree to share responsibility for the repair and maintenance of the road (excluding plowing by the City of Cambridge) and allowing them to access their own houses by driving over the roadway section in front of their neighbors’ houses. Neither of these exists among the six property owners on Theriault Court and the sellers were unwilling to make any effort to ask their neighbors to sign an agreement so that we could obtain an FHA loan and close on the sale. As a result, we were unable to get an FHA loan.

Backing out of the purchase

At this point, we could not get a mortgage loan to close on the house, due to the house’s low value and the private way issue, both of which were beyond our control. The P&S document specified that the deposit belonged to the sellers once all of the contingency dates had passed, but we hoped, as reasonable people, that we could reach an agreement with the sellers to compensate them for expenses related to canceling the sale, such as breaking the lease on an apartment, and still have a large amount of our deposit returned to us.

That was not to be, as the sellers made a quick offer of $2,000, backed by a notice to our lawyer that they had retained counsel to litigate over the matter, if necessary. In the end, after consulting our own legal counsel and determining we would be unlikely to come out fiscally ahead (due to the cost of retaining counsel) if we chose to litigate the matter, we negotiated with the sellers to give us back a mere $4,000 of our $21,700 deposit. They pocketed the rest. In addition to losing $17,700 of our deposit, we were also out several thousand dollars for the inspection, appraisal, and attorney costs, bringing our total loss on this real estate transaction to around $20,000. We also lost out on the opportunity to get the first-time homebuyer’s credit of $8,000.

Lessons learned

Overall, we learned a lot about how to better protect ourselves in situations beyond our control. We now seek to protect other potential buyers, of any home, from encountering the same problems. Unless you are able to put 20% down on a house and forgo needing PMI, make sure your P&S agreement includes a contingency for you to get out of the deal if PMI is denied after the mortgage contingency date. Also, if the property is on a private way and there is no written agreement in place, be prepared to go door to door asking the neighbors to sign a private roadway agreement, and know that there is no guarantee that they will sign a legal document proposed to them by a stranger.

The biggest lesson we learned is that if the sellers are unreasonable early on, as you negotiate on various things, or are unrealistic about the worth of the property in the face of mounting evidence to the contrary, walk away while you still can, because they’re not going to get any more reasonable as time passes. There are other houses out there and it’s not worth the stress and potential fiscal losses.

Postscript

The house at 8 Theriault Court is back on the market, this time at $439,000. Based on the two appraisals conducted just four and five months ago, and unless the market or the house has changed significantly since that time, the house is likely not worth even that amount.

Plotting Game by Game Winning Percentages

April 6th, 2010

Another baseball season is upon us, and fans are quick to project the results of their favorite team from the first few games. I wondered if many teams tend to arrive at a winning percentage near their whole-season results, and then oscillate around a little, versus having early results that differ substantially from the final winning percentage.

I created an interactive plot to look at the results for the 2009 season, team by team.
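The number being plotted is nothing fancy: after each game, it's wins so far divided by games played so far. Here is a small Ruby sketch of that calculation, assuming the season is given as an array of :W and :L symbols (the actual visualization was written in Javascript, so the helper below is only an illustration).

```ruby
# Hypothetical helper: winning percentage after each game of the season.
# results is an array of :W / :L symbols in the order the games were played.
def running_win_pct(results)
  wins = 0
  results.each_with_index.map do |result, i|
    wins += 1 if result == :W
    wins.to_f / (i + 1)
  end
end

# A made-up four-game stretch: lose, win, win, lose.
puts running_win_pct([:L, :W, :W, :L]).inspect
```

Early values swing wildly because the denominator is tiny, which is exactly why projecting a season from the first few games is a dicey business.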

Take Boston. Seen below, Boston started slow, but pretty quickly arrived at their ultimate winning level.

On the other hand, the Yankees started even slower, and in fact didn’t reach their ultimate winning level until very late in the season.

See the results for the other teams on the visualization page.

The visualization was created using Javascript and the Raphaël JS library.