Tuesday, September 23, 2008

Rolling Restarts, Migrations, and Deployments


I was looking into a question on Stack Overflow (sweet site!) and it brought to mind the awesomeness and challenges of restarting a load-balanced array of servers. The awesome part is that you can do a deployment without any downtime by taking down a server, updating it, bringing it back up, and moving on to the next one. One issue with this is that you end up having multiple versions of the code live simultaneously. This can get a little weird if a user hits a new view page and then the next request ends up being load-balanced to a server that isn't up to date yet.

The solution we came up with (before we heard about SeeSaw) was to take half of the mongrels offline from the load balancer. Shut them down. Update them. Start them up. Put those mongrels back online in the load balancer and take the other half offline. Shut the second half down. Update the second half. Start them up. This greatly minimizes the window where you have two different versions of the application running simultaneously. I wrote a Windows bat file to do this. (Deploying on Windows is not recommended, btw.)
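The half-and-half dance is really just an ordering problem. Here is a minimal Ruby sketch of it, with the balancer and deployment actions injected as lambdas so the sequencing stands on its own — the helper names are hypothetical; in practice `take_offline`/`bring_online` would poke your load balancer (e.g. mod_proxy_balancer) and `update` would stop a mongrel, deploy the new code, and start it again.

```ruby
# Restart a set of app-server ports in two halves so that one half
# is always serving traffic. The three actions are injected; nothing
# here assumes a particular balancer or deployment tool.
def rolling_restart(ports, take_offline:, bring_online:, update:)
  ports.each_slice((ports.size + 1) / 2) do |half|
    half.each { |p| take_offline.call(p) } # pull this half out of the balancer
    half.each { |p| update.call(p) }       # stop, deploy new code, start
    half.each { |p| bring_online.call(p) } # put it back before touching the rest
  end
end
```

With four mongrels on ports 8000-8003, the first pass takes 8000 and 8001 offline, updates them, and brings them back before 8002 and 8003 are touched.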

A truly awesome solution to this would be a load balancer that is somehow aware of the version level of the balanced set and just makes the switch for you. Until that is invented, Apache mod_proxy_balancer is easy enough to control remotely.

It is very important to note that database migrations can make this whole approach a little dangerous. If you have only additive migrations, you can run them at any time before the deployment. If you are removing columns, you need to do it after the deployment. If you are renaming columns, it is better to split the change: a migration that creates the new column and copies the data into it, run before deployment, and a separate script that removes the old column, run after deployment. In fact, it may be dangerous to run your regular migrations against a production database at all if you don't make a specific effort to organize them. All of this points toward making more frequent deliveries so each update is lower risk and less complex, but that's a subject for another response.
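As a sketch of the rename-split, here is what the two halves might look like as old-style Rails migrations. The table and column names are made up for illustration; the point is that the first piece is purely additive (safe while old code is still live) and the destructive piece waits until no server reads the old column.

```ruby
# Before the deployment: additive only. Old code keeps using
# users.name and never sees the new column.
class AddFullNameToUsers < ActiveRecord::Migration
  def self.up
    add_column :users, :full_name, :string
    execute "UPDATE users SET full_name = name"
  end

  def self.down
    remove_column :users, :full_name
  end
end

# After the deployment, once every server is on code that reads
# users.full_name, the old column can go.
class RemoveNameFromUsers < ActiveRecord::Migration
  def self.up
    remove_column :users, :name
  end

  def self.down
    add_column :users, :name, :string
    execute "UPDATE users SET name = full_name"
  end
end
```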

What's a defect? What's a missing feature?

I was reading this about a nice idea some bloke had about developers fixing bugs in their spare time. Most of the serious developers I know are already working in their spare time- let's not make all defects their problem too.

Many software testers cause great frustration among software developers.  One of the biggest issues that arises on agile projects is that testers have a hard time distinguishing between defects and features that have not yet been implemented.

It's always a bit of a challenge to deal with bugs- even outside of agile.  What to some people is an obvious bug is a feature that was never requested to someone else.

For example, I was called into a meeting today about a bug that had been discovered during a user demo. The search was "not finding telephone numbers in documents". The search term entered was 5551212. Some documents contained 555 1212. Some contained 555-1212. Some contained 1(703)555-1212.

They tried 555*1212; still not working. Search is broken. A developer suggests: try "555 1212". Magic happens. The * matches any characters, not word boundaries...

It was an obvious problem...to the developers who understood that searching for "breakup" was not going to match documents containing "break up".  With text, it's obvious, but I can certainly see where people might not see the issue with phone numbers.

We'll add normalized versions of phone numbers to the search index and normalize search terms that look like phone numbers, or something like that, but...this is not an insignificant effort. (Even though there is some decent code out there to handle it.) There are trade-offs in performance that have to be considered.  Ask a tester, though- they'll say they found a bug; it's their job, and they want to be able to show how good they are at finding them.  If it's a bug, we can't even mark our existing search work as complete.  If it's a missing feature, we have to allocate it to the next release.  Fortunately, a tester didn't find this one, so we don't quite have that problem, but the users do want to be able to match many different formats of phone numbers.
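The core of the normalization idea is small, even if wiring it into the index and query pipeline isn't. A minimal sketch, assuming bare digits are a good enough canonical form and that a leading US country code should be dropped (both assumptions, not decisions anyone has actually made):

```ruby
# Hypothetical normalizer: reduce a phone-number-looking string to bare
# digits so "555 1212", "555-1212", and "1(703)555-1212" all index and
# query consistently.
def normalize_phone(text)
  digits = text.gsub(/\D/, '')          # strip everything but digits
  digits.sub(/\A1(?=\d{10}\z)/, '')     # drop a leading US "1" (assumption)
end
```

The same function would run over candidate tokens at index time and over phone-number-shaped search terms at query time, so 5551212 matches all three document forms above.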

It seems like a simple semantic difference, and it seems like the developers are being too sensitive, but it's actually a big deal.  Some "bugs" might end up costing a huge amount of money and not be worth fixing- particularly if they aren't really bugs.  I have seen teams show excessive deference to testers and spend as much time handling edge cases that would never really occur as they had spent on the basic features- cases that were better handled by error messages than by trying to do something useful. Meanwhile, the project sponsor is seldom asked to decide whether a bug should be fixed; few bugs are even estimated for cost.

I say- if you want to fix a bug, you have to pay up. If you are really smart, you do the five whys and find the root cause, but if it happens to be something that never occurred to anyone to ask for, try not to ask the developers to take care of it on their own time.

Tuesday, September 16, 2008

Reading about Hadoop

Tom White's book on Hadoop is up on Rough Cuts. If you aren't familiar with Apache Hadoop, it's Doug Cutting's effort to go beyond Lucene and build an open source implementation of much of the other Google infrastructure.

There is a two-day Hadoop Camp at the upcoming ApacheCon in New Orleans. Learning about Hadoop is a great way to become familiar with some of the innovations that Google has put forward in the last few years- and to see the technology behind Yahoo's big set of nodes. What follows is the beginning of Mr. White's book; I am looking forward to the chapter on HBase.

"Hadoop was created by Doug Cutting, the creator of Lucene, the widely-used text search library. Hadoop has its origins in Nutch, an open source web search engine, itself a part of the Lucene project.

Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, it is also a challenge to run without a dedicated operations team — there are lots of moving parts. It's expensive too — Cutting and Cafarella estimated a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a $30,000 monthly running cost. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratise search engine algorithms.

Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn't scale to the billions of pages on the web. Help was at hand with the publication of "The Google File System" (GFS) [ref] in 2003. This paper described the architecture of Google's distributed filesystem that was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004 they set about writing an open source implementation, the Nutch Distributed File System (NDFS) as it came to be known.

NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. Yahoo! hired Doug Cutting and, with a dedicated team, provided the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that their production search index was being generated by a 10,000-core Hadoop cluster.

Earlier, in January 2008, Hadoop was made a top-level project at Apache, confirming its success and dynamic community."

Tuesday, September 02, 2008

Another day, another blog

I'm also now infrequently blogging at Collision of Influences, my company blog, to add to my infrequent posting here on more arcane topics of interest to nearly 43 people!  Topics over there are going to be a little different, I just haven't figured out how yet.  I'll let it develop as is its wont.