HBase: Avoid ScannerTimeoutException looking for needles in the haystack with RandomRowFilter

Scanner timeout exceptions happen in HBase when no network activity occurs between the client and server within the timeout period. This can happen for a variety of reasons, but the one we’ll focus on here is the needle in a haystack case: you’re using a highly selective row filter, so the region server is scanning and discarding lots of data. While it’s great for performance that the data doesn’t come back to the client, the connection may time out.

The first easy fix is to reduce the caching you’re setting up on the connection. When caching is set up, there’s network activity only once every n rows (n = cache size), so a smaller cache means the server reports back sooner. Jeff Dwyer has a quick writeup about that.

If adjusting the cache still doesn’t work, what you can do is add a RandomRowFilter to randomly accept some small fraction of the rows and return them to the client. Because the randomly accepted rows never passed your real filters, you need to re-check those filters on the returned rows, but it may be more efficient than reducing cache size (and possibly more reliable). Just stack it with your existing filters as in the code sample below.

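    RandomRowFilter randomFilter = new RandomRowFilter(.001f);  // accept roughly 0.1% of rows at random
    FilterList orFilter = new FilterList(FilterList.Operator.MUST_PASS_ONE);  // OR: a row passes if any filter accepts it
    orFilter.addFilter(existingFilter);  // existingFilter is a placeholder for your own selective filter
    orFilter.addFilter(randomFilter);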

Tune the constant based on estimates of your data sparsity and timeout settings, and away you go.
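As a rough way to pick that constant, here is a back-of-the-envelope sketch in plain Java (no HBase dependency). It assumes the server only answers once a full cache of rows has been collected, and that you have measured or guessed how many rows per second a region server can scan; that scan rate, the safety factor, and the method name are all assumptions made for illustration. It also ignores rows matched by your real filter, which only make things better.

```java
public class RandomFilterTuning {

    /**
     * Estimates the smallest acceptance chance for RandomRowFilter such that,
     * on average, a full cache of rows is collected before the timeout.
     *
     * Reasoning: with acceptance chance p, about caching / p rows are scanned
     * per batch, which takes caching / (p * rowsScannedPerSecond) seconds.
     * We want that to stay below timeoutSeconds * safetyFactor.
     */
    static double minAcceptChance(int caching,
                                  double rowsScannedPerSecond,
                                  double timeoutSeconds,
                                  double safetyFactor) {
        double chance = caching / (rowsScannedPerSecond * timeoutSeconds * safetyFactor);
        // A probability above 1 means the cache is too large for this timeout.
        return Math.min(1.0, chance);
    }

    public static void main(String[] args) {
        // Assumed numbers: cache of 10 rows, 100,000 rows scanned per second,
        // 60 second scanner timeout, aiming for half the timeout on average.
        double chance = minAcceptChance(10, 100_000, 60, 0.5);
        System.out.println("Use a RandomRowFilter chance of at least " + chance);
    }
}
```

Since this is only an average, leave yourself plenty of margin, and remember that a larger chance means more junk rows to re-check on the client.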

The short circuit article

About three and a half years ago I co-wrote an article named “Increase stability and responsiveness by short-circuiting code” for IBM’s developerWorks site, and for some reason in the past few days it has repeatedly asked for attention (“hi this is 2004, your article is on the line, and it’s woefully dated”). First, the one-page abstract we submitted fell out of a book on my bookshelf, then I was asked about it in at least one interview. Sadly, that was one of the top results for my name in Google for a while and people still find it.
I figure it’s about time to revisit, and disavow, the implementation in the article, if that isn’t already obvious to anyone.

The idea was to provide a way to time-box operations that could take an unknown amount of time. For example, a web page that must be displayed within a certain time can be guaranteed to meet that limit, provided it can do without the results of operations that take too long to execute.
One obvious flaw is that the code creates LOTS of new threads, each used for only a short period of time. It should have used a thread pool to reduce that churn.
The best reason not to use that code is that Java 1.5 introduced a whole set of concurrency utilities in java.util.concurrent, including ExecutorService and Future. There are lots of examples about, so you can check them out.

The high-level view is that you package your functionality in a Runnable or Callable (depending on whether you need to return a result) and submit it to an instance of ExecutorService to run. It returns a Future object, which can be queried to get the result. One can call get on the Future, which returns right away if the task is done executing; otherwise it blocks until the task completes or, if a timeout was specified, until the timeout expires (in which case it throws a TimeoutException), whichever comes first. Even better, one can submit multiple Callables at once with invokeAll(..), which returns when all tasks are complete or the timeout has expired.
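Here is a minimal sketch of that approach, not the code from the original article. The class name, the pool size of 4, and the fallback-value convention are my own choices for illustration, and the lambdas need Java 8 or later (on Java 1.5 you would write anonymous Callable classes instead).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeBox {

    // One shared pool instead of a new thread per operation.
    // Daemon threads so a stuck task cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, runnable -> {
        Thread thread = new Thread(runnable);
        thread.setDaemon(true);
        return thread;
    });

    /** Runs the task, returning its result or the fallback if it is too slow or fails. */
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            // Returns immediately if the task is done, otherwise waits up to the timeout.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the task; we no longer want its result
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "fresh result", 200, "fallback");
        String slow = callWithTimeout(() -> {
            Thread.sleep(2000); // simulates an operation that takes too long
            return "too late";
        }, 200, "fallback");
        System.out.println(fast); // prints "fresh result"
        System.out.println(slow); // prints "fallback"
    }
}
```

Note that cancel(true) only interrupts the worker thread; a task that ignores interrupts will keep occupying a pool thread until it finishes on its own.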

Interviews on Trivia

I’ve started interviewing again now that I should be finished with my Master’s degree in a month or so. I’m reminded again of the wide range of interview styles people use. My least favorite is the trivia test. This seems to happen more often with Java-related job interviews than Ruby-related ones.

I may never understand why any employer would value memorizing the Java API over being able to reference the docs and know where to find things.

I had such an interview just last week. Here are some of those questions:

  • How do you execute a PL/SQL stored procedure from JDBC?
  • How do you import classes into the classpath of a JSP page? (Apparently ‘no one in their right mind does that anymore’ isn’t a good answer to this one.)

Who memorizes that stuff?

My favorite question of all was this: what are the two conditions under which a finally block is not called? I got one of them (System.exit()), but the interviewer wouldn’t even tell me the other one (“You won’t learn that way”). I googled it later and found that the answer is not well defined. One of the ways I saw mentioned was the thread “dying”, but Thread.stop() is severely deprecated, so that shouldn’t ever happen. The other answer I saw floating around doesn’t really fit: when an exception is thrown from the finally block, the block doesn’t complete, but it is still called.
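To convince myself of that last point, here is a small sketch (class and method names are made up for the demo). It records each step in a list so you can see that the finally block does get entered, and that the exception thrown from it replaces the one thrown from the try block.

```java
import java.util.ArrayList;
import java.util.List;

public class FinallyDemo {

    // Helper so the compiler does not complain that the finally block
    // can never complete normally.
    static void failInFinally() {
        throw new IllegalArgumentException("from finally");
    }

    static List<String> throwFromFinally() {
        List<String> log = new ArrayList<>();
        try {
            try {
                log.add("try");
                throw new IllegalStateException("from try");
            } finally {
                // Still runs, even though the try block threw.
                log.add("finally entered");
                failInFinally(); // cuts the finally block short
            }
        } catch (RuntimeException e) {
            // The exception from finally wins; the one from try is discarded.
            log.add("caught: " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [try, finally entered, caught: from finally]
        System.out.println(throwFromFinally());
    }
}
```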

I was talking about this with Frank and he came up with another way: infinite loop in the try block. I then thought of calling PowerSystem.getMainPower().setPosition(OFF).

Now I can’t wait to get that question again!

Object Properties Using Value Expressions in JSTL

I found out something cool about JSTL’s expression language today that I didn’t expect at all: like a real scripting language, EL allows one to access an object’s fields not only like object.fieldname (which wouldn’t be changeable at runtime) but also as object["fieldname"], and by extension with any string variable as the argument. Wow! I don’t remember seeing any examples of this in the wild; somehow I stumbled upon it in the bowels of a unified expression language tutorial.

This came in handy DRYing out some JSP code today that differed only in the fields accessed on the same kind of objects. I thought immediately, if only I had a scripting language this could all go away! Then I thought about the long and winding road one might take to do that in Java, involving reflection and/or long if/else blocks. I was pleasantly surprised to find out JSTL can do that. Phew.
Here’s a poor example:

What you’d have to do w/o this capability


and with

    <c:set var="fieldName" value="dog" />

I told you it was a bad example.
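For comparison, the reflection route I was dreading might look roughly like this in plain Java. This is only a sketch: the Pet bean and its properties are hypothetical, and it handles simple getXxx() getters only, whereas EL's bracket syntax also works on maps, lists and arrays.

```java
import java.lang.reflect.Method;

public class PropertyReader {

    /** Hypothetical bean, used only for this illustration. */
    public static class Pet {
        private final String dog;
        private final String cat;

        public Pet(String dog, String cat) {
            this.dog = dog;
            this.cat = cat;
        }

        public String getDog() { return dog; }
        public String getCat() { return cat; }
    }

    /** Reads a bean property whose name is only known at runtime. */
    static Object readProperty(Object bean, String fieldName) throws ReflectiveOperationException {
        // "dog" -> "getDog"
        String getter = "get" + Character.toUpperCase(fieldName.charAt(0)) + fieldName.substring(1);
        Method method = bean.getClass().getMethod(getter);
        return method.invoke(bean);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Pet pets = new Pet("Rex", "Tom");
        String fieldName = "dog"; // could just as well come from a request parameter
        System.out.println(readProperty(pets, fieldName)); // prints "Rex"
    }
}
```

That is a fair amount of ceremony for what EL gives you with a pair of square brackets.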

Can’t share session contents between applications on Tomcat

I got stuck for a little while on this problem today: I was pushing some content into the session from a Flex app into one web app, and then trying to read it from another. It turns out this won’t work under Tomcat. I thought at first it was because the cookie paths were different, so I added emptySessionPath=true to my server.xml file, but that actually doesn’t fix the problem. Thinking about it, it is perfectly reasonable that two web applications that aren’t associated in some way (such as being in the same EAR) can’t get at each other’s session information.
Lesson learned: either share state through some non-session mechanism (like the database) or move the resources into the same application.