Moving on from the previous post to the topic of transaction transparency. That post, which in hindsight was a bit dry, focused on the numbers behind being highly available. Sure, I had a colored graphic and all, but it was just math, and probably only one or two readers were inquisitive enough to actually check it. Transaction-level redirection is where meeting those numeric goals gets interesting. It may be the key distinguishing characteristic between a hosted web application and a system designed for web scale.
To recap, Transparent means transaction-level redirection without user knowledge. Any service within the infrastructure should fail over quietly and reliably to an alternate service with little to no disruption to the user. There should be no degradation when services go down.
Sessions and State leading to Efficient Workflow
Can you believe that it was almost 15 years ago that we started down the path of trying to force our session-based, stateful systems into the "World Wide Web" and the Mosaic browser? We slapped web skins on existing applications. Many continued to architect new systems this way because it was the knowledge of the time. Unfortunately, some of that legacy is still with us. Admittedly, OCLC has some remaining pockets of this model.
Those limitations led many to view statefulness and session-based designs as bad juju. In the years since, the information industry has meandered through various solutions to the problem: cookies, bad cookies; session URL tags, bad session URL tags; and so on. Fast forward to today and we find that most systems strive to implement good stateful models without the legacy of doing so with sessions. Think of "good stateful models" as those that support efficient workflows. Workflows might make a good future topic... but I will just throw out that "efficient" is usually not equal to "the way I always did it".
What does that have to do with Transaction Redirects?
Efficient workflows require that your services maintain some context for you. Why are you here? What did you do last? What are you likely to do next? These are relatively simple things to do in a single-host environment. They become a little harder across multiple machines. They become very difficult across data centers. Now introduce failures into the system: machines crashing, disks failing, applications failing, networks failing... all things that will happen. Your systems must be prepared, on every transaction, to infer some context even when the machine handling it has never seen your history.
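One common way to pull that off (a sketch of the general technique, not a description of any OCLC system) is to make each request self-describing: the workflow context rides along with the transaction in a signed token, so whichever node catches the redirect can verify it and pick up where the user left off. Everything here, from the shared secret to the field names, is illustrative:

```python
# Sketch: ship enough workflow context with every request, signed so
# any node can verify and reconstruct it without prior knowledge.
import base64, hashlib, hmac, json

SECRET = b"shared-cluster-secret"  # hypothetical key shared by all nodes

def encode_context(context: dict) -> str:
    """Serialize and sign workflow context to ride along with a request."""
    payload = base64.urlsafe_b64encode(json.dumps(context).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"

def decode_context(token: str) -> dict:
    """Any node, even one that has never seen this user, can rebuild the
    context -- or fall back to an empty one if the token is bad."""
    try:
        payload, sig = token.rsplit(".", 1)
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return {}                  # tampered: start a fresh context
        return json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:
        return {}                      # malformed: start a fresh context

token = encode_context({"step": "review-cart", "last_search": "dublin core"})
print(decode_context(token))  # works on any host that shares the secret
```

Because the token is verified rather than looked up, a node that has never seen this user can still answer sensibly, and a bad or missing token simply degrades to a fresh context instead of an error.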
Consistency and Availability
As you might imagine, transaction redirections introduce a balancing act between consistency and availability... something that has been difficult in the library industry. Up to a point, both can increase together with well-built software. At some point, however, higher consistency across larger and larger stores of data leads to lower availability (insert your favorite metasearch story here). There is not a single right answer. For some applications, you don't want an answer unless it is guaranteed right... at your doctor's office, for example. In other cases, having the service available is more important than getting the exact same answer on multiple attempts... shopping on Amazon. The key is finding the right balance for your environment.
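To make that dial concrete, here is a deliberately toy sketch of the two ends of the trade. The replica objects and their get method are invented; real quorum protocols are far more involved:

```python
# Toy illustration of the consistency/availability dial.
def read(replicas, key, mode="available"):
    """mode="available": first answer wins (Amazon-style shopping).
    mode="strict": only answer when a majority of replicas agree
    (doctor's-office rules: no answer beats a possibly wrong one)."""
    answers = []
    for replica in replicas:
        try:
            value = replica.get(key)   # hypothetical replica interface
        except ConnectionError:
            continue                   # a dead replica must not stop us
        if mode == "available":
            return value               # favor availability over agreement
        answers.append(value)
    # strict: demand majority agreement before answering at all
    if answers and answers.count(answers[0]) > len(replicas) // 2:
        return answers[0]
    raise RuntimeError("no acceptable answer under this policy")
```

In "available" mode a single live replica is enough; in "strict" mode the system refuses to answer rather than risk being wrong. Picking the default is the balance question above.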
Again, what does this have to do with Transaction Redirects?
I used the path above to demonstrate the issues of statefulness and sessions in human interaction with our systems. The same issues apply to the service components in service-oriented architectures. In modern service-oriented systems, the number of components can be daunting (70+ in worldcat.org). We must build the overall system anticipating failures. The value of SOA systems is that components can be scaled independently and fail over independently. For example, if cover art is becoming slow, we can add cover art virtual servers within minutes. If one fails, the calling applications can switch over silently to an alternative, as sketched below. Each component of the system must be prepared to take on transaction load in growth situations as well as failures, and it must do so without forcing the user back to the top of a workflow chain.
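From the calling application's side, that silent switch-over can be as simple as trying each known instance in turn and only surfacing an error when every one of them is down. The endpoints here are invented for illustration:

```python
# Sketch: silent failover in the caller. Try each instance of a service
# (cover art, say); the user only sees an error if all of them are down.
import urllib.request

COVER_ART_ENDPOINTS = [
    "http://coverart-1.example.org",   # hypothetical instance URLs
    "http://coverart-2.example.org",
    "http://coverart-3.example.org",
]

def fetch_cover_art(isbn: str, timeout: float = 0.5) -> bytes:
    last_error = None
    for endpoint in COVER_ART_ENDPOINTS:
        try:
            with urllib.request.urlopen(f"{endpoint}/covers/{isbn}",
                                        timeout=timeout) as resp:
                return resp.read()     # success: the user never saw the failure
        except OSError as err:         # refused connection, timeout, DNS...
            last_error = err           # remember it, try the next instance
    raise RuntimeError(f"all cover art instances failed: {last_error}")
```

Adding capacity is then just a matter of growing the endpoint list, which is the "add virtual servers within minutes" half of the story.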
But how do you do that?
The good news is that there are many blazing the trail and showing us what works and what doesn't. The not-so-good news is that there is no easy way to take an existing application and add this stuff after the fact. Much has been written about the architectures of the large internet services. The common point among all of them is that they are designing core infrastructures that support models of scale and availability. They are not simply hosting an application on the web.
An example component at OCLC is our internally developed text engine. All of the data is in memory for absurdly fast response times. It is spread across three clusters. Each transaction is sent to all three at the same time, and the first one to respond wins. Each partition within each cluster is also replicated. A transaction failure, outside of those we humans cause, is practically impossible. We have even tested pulling plugs from the wall and watched load tests continue to hum along without skipping a beat.
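Here is roughly what that send-to-all, first-answer-wins fan-out looks like in code. This is a sketch with a hypothetical cluster.search call; the engine's real interfaces are internal and certainly differ:

```python
# Sketch: fan a query out to every cluster at once and take the first
# clean answer, letting stragglers finish (or fail) in the background.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def first_to_respond(clusters, query, timeout=1.0):
    pool = ThreadPoolExecutor(max_workers=len(clusters))
    try:
        pending = {pool.submit(cluster.search, query) for cluster in clusters}
        while pending:
            done, pending = wait(pending, timeout=timeout,
                                 return_when=FIRST_COMPLETED)
            if not done:
                break                       # nobody answered in time
            for future in done:
                if future.exception() is None:
                    return future.result()  # first clean answer wins
            # everything finished so far failed; keep waiting on the rest
        raise RuntimeError("no cluster answered")
    finally:
        pool.shutdown(wait=False)           # don't block on the stragglers
```

A failed or unplugged cluster simply never wins the race, which is why yanking a power cord barely registers in the load test.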