Idiot Calling on Twitter – Frequency of You’re vs Your

At the risk of being forever branded a grammar elitist, let’s take a quick look at the use of the phrase “your an idiot” on Twitter.

Inspired by a tweet by @doctorzaius linking to Twitter’s search page for “your an idiot”, I used Twitter’s streaming API to download a sample of 6,581 tweets containing the word “idiot” overnight, over about 12 hours.

Of those 6,581 tweets, 65 contained our friend “your an idiot”. 161, roughly two and a half times as many, contained “you’re an idiot”. Additionally, there were 2 tweets with “your such an idiot” and just one “you’re such an idiot”. The forces of good grammar have won this round, it seems.
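For anyone who wants to reproduce the tally, the counting is trivial once the tweets are on disk. A minimal sketch, assuming the streamed tweets were saved one JSON object per line to a hypothetical file named idiot_tweets.json:

grep -ci "your an idiot" idiot_tweets.json
grep -ci "you're an idiot" idiot_tweets.json

Here grep -c counts matching lines and -i makes the match case-insensitive, so each command prints one count per phrase.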

Note: This is a very small sample. It may be interesting to compare Facebook status updates to see what the you’re/your ratio looks like there one day…

FluidSurveys Data Export Issue, Solved with iconv

I recently ran a survey at work using FluidSurveys. Their survey building tools are excellent, and they have great support, but I ran into a time-consuming issue when it came time to process the responses: the exported files are double-byte Unicode, UTF-16LE to be specific. It turns out that knowing that is 90% of the battle.

The files on first inspection are a bit strange: although they spring from a CSV export button, they’re tab-delimited, but with CSV-style quoting conventions. That’s easy enough to work around, but R and Ruby both barfed reading the files. I cottoned on to the fact that the files had some odd characters in them, so I recruited JRuby and Ruby 1.9, with their better Unicode support, to try to load them, but still couldn’t quite get the parameters right.

Then I thought of iconv, the character set conversion utility. Since in this case the only special character was the ellipsis, I was happy to strip those out, and the following command does the trick:

iconv -f UTF-16LE -t US-ASCII -c responses.csv > converted_responses.csv
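If you’d rather keep the special characters than strip them, converting to UTF-8 instead should preserve them while still producing a file most tools can read; this is a variant of the same command, not something I needed here:

iconv -f UTF-16LE -t UTF-8 responses.csv > converted_responses.csv

The -c flag in the original command is what tells iconv to silently drop any characters that can’t be represented in the target encoding.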

And, as they say, Bob’s your uncle.

The Sad Story of 8 Theriault Court, Cambridge Massachusetts

If you’re buying a house, especially if you’re putting less than 20% down and/or the house is on a private way, you may want to read through our long and stressful ordeal that cost us many months and $20,000.

8 Theriault Court

In March 2010, we set about buying 8 Theriault Court, Cambridge, Massachusetts, owned by Catherine and Rafael Clemente, Jr. The house was listed at $449,000. We submitted an Offer to Purchase the house for $420,000, and after negotiations and an inspection, we agreed upon a purchase price of $434,000 and signed a Purchase & Sale (P&S) agreement.

We also made these concessions in the P&S at the seller’s request:

  • We proposed to put down $10,000 as a good faith deposit, but the sellers required a 5% deposit. We acquiesced and our total down payment held in escrow with Coldwell Banker (seller’s realtor) was $21,700.
  • We were willing to accommodate the new home search of the sellers by agreeing to a very flexible closing date up to June 30, 2010 (up to 86 days from the date of the signed P&S).

Mortgage loan attempt #1

After finalizing the P&S agreement, it was time to apply for a mortgage. During the mortgage approval process, the appraisal came back with a value of $417,000 on April 22, 2010. Based on this appraisal, we went back to the sellers to negotiate a selling price for which we could obtain a mortgage, based on the true value of the house. We were only able to renegotiate the selling price down to $425,000, with us bringing cash to closing to make up the difference between the appraised value we could seek a mortgage for and the amount the sellers wanted to get for the house.

As part of the price renegotiation, the sellers changed the realtors’ commission rate from 5% to 4% without first getting permission from our realtor. Our realtor was cornered into accepting this pay cut or risk our losing the opportunity to buy the house.

At this point, we also discovered that our first mortgage loan had been turned down due to a private way issue (more about that below).

Mortgage loan attempt #2

We knew that we’d have to try another loan provider and would, thus, need to have another appraisal conducted. As part of our new price negotiations with the sellers, we also had to agree that if this second appraisal was,

“…higher than the purchase price [of $425,000], then the parties agree to negotiate in good faith relative to increasing the purchase price with a cap on the purchase price of $434,000.”

This provision was definitely in the seller’s favor, as they explicitly ruled out lowering the price should the second appraisal affirm the low valuation of the first. Despite all of these provisions, we decided to press on.

The second appraisal came back at $415,000 on May 21, 2010, confirming the earlier appraisal’s assertion that the sellers, at $425,000, were getting more for the house than it was worth.

During this period of time, we had about a week of daily negotiating around P&S amendments, pushing the mortgage contingency date back day by day as we waited for mortgage approval.
When approval arrived, it was conditional upon securing private mortgage insurance (PMI); because we were putting only 10% down, we needed PMI to close on the house. Our mortgage broker and lawyer seemed certain that PMI is almost always approved, so we thought we were good to go for closing and let the mortgage contingency date slide by.

PMI Denied

After the mortgage contingency date, two different PMI company underwriters decided that, based on comparable houses selling in that neighborhood, 8 Theriault Court was worth less than our mortgage amount. They wouldn’t underwrite the insurance to finalize our mortgage loan approval. They wouldn’t say how much they thought the house was worth, but our denial letter said the following:

“The property does not meet [insurance company’s] minimum underwriting standard due to nonsupport of value from comparables,” and, “The property does not meet [insurance company’s] minimum underwriting standards due to overall poor functional utility.”

And the other denial said, “Comps do not adequately support value.”

Despite these developments, the sellers were unwilling to come down in the selling price.

Mortgage loan attempt #3

Not being able to secure a conventional mortgage loan, we decided to instead pursue a government loan via FHA. FHA regulations require either the presence of an easement on the title documents of all properties on the private way, or a roadway maintenance agreement signed by all property owners abutting the way, stating that owners of the properties agree to share responsibility for the repair and maintenance of the road (excluding plowing by the City of Cambridge) and allowing them to access their own houses by driving over the roadway section in front of their neighbors’ houses. Neither of these exists among the six property owners on Theriault Court and the sellers were unwilling to make any effort to ask their neighbors to sign an agreement so that we could obtain an FHA loan and close on the sale. As a result, we were unable to get an FHA loan.

Backing out of the purchase

At this point, we could not get a mortgage loan to close on the house, due to the house’s low value and the private way issue, both of which were beyond our control. The P&S document specified that the deposit belonged to the sellers once all of the contingency dates had passed, but we hoped, as reasonable people, that we could reach an agreement with the sellers to compensate them for expenses related to canceling the sale, such as breaking the lease on an apartment, and still have a large amount of our deposit returned to us.

That was not to be, as the sellers made a quick offer of $2,000, backed by a notice to our lawyer that they had retained counsel to litigate over the matter, if necessary. In the end, after consulting our own legal counsel and determining we would be unlikely to come out fiscally ahead (due to the cost of retaining counsel) if we chose to litigate the matter, we negotiated with the sellers to give us back a mere $4,000 of our $21,700 deposit. They pocketed the rest. In addition to losing $17,700 of our deposit, we were also out several thousand dollars for the inspection, appraisal, and attorney costs, bringing our total loss on this real estate transaction to around $20,000. We also lost out on the opportunity to get the first-time homebuyer’s credit of $8,000.

Lessons learned

Overall, we learned a lot about how to better protect ourselves from situations beyond our control, and we hope to help other potential buyers, of any home, avoid the same problems. Unless you are able to put 20% down on a house and forgo PMI, make sure your P&S agreement includes a contingency letting you out of the deal if PMI is denied after the mortgage contingency date. Also, if the property is on a private way and there is no written agreement in place, be prepared to go door to door asking the neighbors to sign a private roadway agreement, and know that there is no guarantee that they will sign a legal document proposed to them by a stranger.

The biggest lesson we learned is that if the sellers are unreasonable early on, as you negotiate on various things, or are unrealistic about the worth of the property in the face of mounting evidence to the contrary, walk away while you still can, because they’re not going to get any more reasonable as time passes. There are other houses out there and it’s not worth the stress and potential fiscal losses.

Postscript

The house at 8 Theriault Court is back on the market, this time at $439,000. Based on the two appraisals conducted just four and five months ago, and unless the market or the house has changed significantly since that time, the house is likely not worth even that amount.

Plotting Game by Game Winning Percentages

Another baseball season is upon us, and fans are quick to project their favorite team’s results from the first few games. I wondered whether most teams arrive at a winning percentage near their whole-season result early and then oscillate around it, or whether early results often differ substantially from the final winning percentage.

I created an interactive plot to look at the results for the 2009 season, team by team.

Take Boston. As seen below, Boston started slow, but pretty quickly arrived at their ultimate winning level.

On the other hand, the Yankees started even slower, and in fact didn’t reach their ultimate winning level until very late in the season.

See the results for the other teams on the visualization page.

The visualization was created using JavaScript and the Raphaël JS library.

Multiple Phrase Search in PostgreSQL

Tsearch, the full text search engine in PostgreSQL, is great at rapidly searching for keywords (and combinations of keywords) in large bodies of text. It does not, however, excel at matching multi-word phrases. There are some techniques to work around that and let your application leverage tsearch to find phrases.

Before I go on, I’ll credit Paul Sephton’s Understanding Full Text Search for opening my eyes to some of the possibilities to enable phrase search on top of tsearch’s existing capabilities.

Tsearch operates on tsvectors and tsqueries. A tsvector is a bag-of-words-like structure: a list of the unique words appearing in a piece of text, along with their positions in the text. Searches are performed by constructing a tsquery, a boolean expression combining words with AND (&), OR (|), and NOT (!) operators, and comparing the tsquery against candidate tsvectors with the @@ operator.

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub';

will match articles where the body contains the word meatball and the word sub. If there’s an index on to_tsvector('english', articles.body), this query is a very efficient index lookup.
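For completeness, creating such an expression index might look like the following; GIN is the usual index type for full text search, and the index name is just illustrative:

CREATE INDEX articles_body_fts ON articles USING gin(to_tsvector('english', body));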

Single Phrase Search

Now how do we match articles with the phrase “meatball sub”, anywhere in the article’s body? Doing the naive query

select * from articles where body like '%meatball sub%'

will work, but it will be slow because the leading wildcard kills any chance of using an index on that column. What we can do to make this go fast is the following:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub' AND body like '%meatball sub%' 

This will use the full text index to find the set of articles where the body has both words, then that (presumably) smaller set of articles can be scanned for the words together.

Multi Phrase Search

It’s simple to extend the above query to match two phrases:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like '%meatball sub%' AND body like '%ham sandwich%';

That query can be tightened up using Postgres’s support for arrays:

select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like ALL('{"%meatball sub%","%ham sandwich%"}') 

Stepping back a bit, let’s create a table called “concepts” that lets users of an application store searches as lists of phrases, and also lets the user specify whether all phrases must match or just one of them.

CREATE TABLE concepts
(
   id serial, 
   match_all boolean, 
   phrases character varying[], 
   query tsquery
) 

Now we can specify and execute that previous search this way:

insert into concepts(match_all,phrases,query) VALUES(TRUE,'{"%meatball sub%","%ham sandwich%"}','meatball & sub & ham & sandwich');
select articles.* from articles join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)));

Where this approach really shines compared with external text search tools is in aggregate queries, like counting up matching articles by date.

select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)))
group by articles.date

The logic to combine lists of phrases into the appropriate query, based on whether any or all of the phrases must match, is easy to write at the application layer. It’s also desirable not to have to include the wildcards in the phrase array, and it’s easy to write a function to add them at runtime.

CREATE OR REPLACE FUNCTION wildcard_wrapper(list varchar[]) RETURNS varchar[] AS $$
DECLARE
  return_val varchar[] := '{}';
BEGIN
  -- wrap each phrase in % wildcards for use with LIKE
  for idx in 1 .. array_upper(list, 1)
  loop
    return_val[idx] := '%' || list[idx] || '%';
  end loop;
  return return_val;
END;
$$ LANGUAGE plpgsql;
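A quick sanity check of the function from psql (the cast is needed so the array literal matches the function’s varchar[] signature):

select wildcard_wrapper('{"meatball sub","ham sandwich"}'::varchar[]);
-- {"%meatball sub%","%ham sandwich%"}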

With that function good to go, we can make that long query just a little longer:

select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(wildcard_wrapper(phrases))) OR (not match_all AND body like ANY(wildcard_wrapper(phrases))))
group by articles.date

It’s straightforward to collapse most, if not all, of the SQL ON clause into a plpgsql function call without adversely affecting the query plan; it’s important that the tsvector index be involved in the query for adequate performance.

Further Work

This approach works well for lists of phrases. To support boolean logic on phrases, one approach might be to compile the request down to a tsquery as above, along with a regular expression to winnow down the matches to those containing the phrases.
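As a sketch of what that compilation might produce, a request like “meatball sub” OR “ham sandwich” could come out as the query below; the tsquery keeps the index in play, and the case-insensitive regular expression (~*) winnows the candidates to rows containing an actual phrase:

select * from articles
where to_tsvector('english', body) @@ to_tsquery('english', '(meatball & sub) | (ham & sandwich)')
and body ~* 'meatball sub|ham sandwich';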

Another MBTA visualization, this time with the commuter rail

Following up on the surprising success of my first MBTA visualization, I made a new version that adds the commuter rail lines. This does have the unfortunate effect of squishing the system’s rapid transit lines, because I’m not distorting the distances in any way. I also reduced the size of the markers, perhaps too small for viewing at YouTube resolution unless viewed at full screen.

Also, for Frank and Elias, I made the video shorter, enlarged the numbers, and added a small inset visualization showing active trips sampled at 5-minute intervals, with a “you are here” indicator to give viewers perspective on how busy the system is at that moment relative to the whole day.

Day in the life of the MBTA system

I put together an animation of all the rail traffic over the course of a day on the MBTA’s Red, Blue, Green, and Orange lines, including the Mattapan line. It’s a great way to see just how complicated the system that takes me to work every day really is, and perhaps be a little more patient the next time things go less than perfectly!

The current version of the animation assumes stops take no time (as does the scheduling data).

I’d thought about doing this before, but it would have required screen-scraping schedule information off the MBTA’s site. I learned recently, through a developer outreach effort the Massachusetts Department of Transportation is running, that the MBTA had released its schedule information in the Google Transit Feed Specification (GTFS) format. With the data in hand, I went to work using the ruby-processing wrapper of the excellent Processing graphics toolkit.
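GTFS feeds are just zipped CSV files, so getting at the schedule data is pleasantly mundane. A minimal sketch of reading scheduled arrivals, assuming the feed has been unzipped into a gtfs/ directory (the column names come from the GTFS spec):

require 'csv'

# stop_times.txt holds one row per scheduled stop on each trip
CSV.foreach('gtfs/stop_times.txt', headers: true) do |row|
  puts "trip #{row['trip_id']} reaches stop #{row['stop_id']} at #{row['arrival_time']}"
end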


See the video on YouTube.

Deferring Index costs for table to table copies in PostgreSQL

When bulk copying data to a table, it is much faster if the destination table is index- and constraint-free, because it is cheaper to build an index once than to maintain it over many inserts. For Postgres, the pg_restore and SQL COPY commands can take advantage of this, but they both require that data be copied from the filesystem rather than directly from another table.

For table-to-table copying (and transformations), the situation isn’t as straightforward. Recently I was working on a problem where we needed to perform some poor man’s ETL, copying and transforming data between tables in different schemas. Since some of the destination tables were heavily indexed (including a full text index), the task took quite a while. In talking with a colleague about the problem, we came up with the idea of dropping the indexes and constraints prior to the data load and restoring them afterwards.

First stop: how to get the DDL for the indices on a table in Postgres? Poking around the Postgres catalogs, I found the function pg_get_indexdef, which returns the DDL for an index. Combining that with a query I found in a forum somewhere and altered, I came up with a query to get the names and DDL of all the indices on a table (this one excludes the primary key index).
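A query along these lines does the job; the table name is just a placeholder:

select c2.relname as index_name, pg_get_indexdef(i.indexrelid) as ddl
from pg_index i
join pg_class c on c.oid = i.indrelid
join pg_class c2 on c2.oid = i.indexrelid
where c.relname = 'articles'
and not i.indisprimary;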

With that, and a similar query to do the same for constraints, it’s straightforward to build a helper function that gets the DDL for all indices and constraints, drops them, yields to evaluate a block, and then restores the indices and constraints.
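Here is a minimal sketch of such a helper in Ruby, assuming an ActiveRecord connection; the method name is mine, and I also skip indexes that exist to back a constraint, since those come back with the constraint itself:

def without_indexes_and_constraints(table)
  conn = ActiveRecord::Base.connection

  # secondary indexes: skip the primary key and constraint-backed indexes
  indexes = conn.select_rows(<<-SQL)
    select c2.relname, pg_get_indexdef(i.indexrelid)
    from pg_index i
    join pg_class c on c.oid = i.indrelid
    join pg_class c2 on c2.oid = i.indexrelid
    where c.relname = '#{table}'
      and not i.indisprimary
      and not exists (select 1 from pg_constraint con where con.conindid = i.indexrelid)
  SQL

  # non-primary-key constraints on the table
  constraints = conn.select_rows(<<-SQL)
    select conname, pg_get_constraintdef(oid)
    from pg_constraint
    where conrelid = '#{table}'::regclass and contype <> 'p'
  SQL

  indexes.each     { |name, _| conn.execute(%{drop index "#{name}"}) }
  constraints.each { |name, _| conn.execute(%{alter table #{table} drop constraint "#{name}"}) }

  yield

  constraints.each { |name, ddl| conn.execute(%{alter table #{table} add constraint "#{name}" #{ddl}}) }
  indexes.each     { |_, ddl| conn.execute(ddl) }
end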

Use of the function would look like the snippet below. This solution also allows for arbitrarily complex transformations in Ruby as well as in pure SQL.
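For instance, a hypothetical copy from a legacy schema (table and column names invented for illustration):

without_indexes_and_constraints('articles') do
  ActiveRecord::Base.connection.execute(<<-SQL)
    insert into articles (id, date, body)
    select id, published_on, raw_text from legacy.articles
  SQL
end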

For my task of loading and transforming data into about 20 tables, doing this reduced the execution time by two-thirds. Of course, your mileage may vary depending on how heavily indexed your destination tables are.

Here’s the whole module: