Moving on from the previous post, this one takes up transaction transparency. The previous post, which in hindsight seems a bit dry, focused on the numbers behind being highly available. Sure, I had a colored graphic and all, but it was just math. There were probably only one or two people who were inquisitive enough to actually check the math. Transaction level redirection is where meeting the numeric goals gets interesting. It may be the key distinguishing characteristic between a hosted web application and a system that is designed for web scale.
To recap, Transparent: Transaction level redirection without user knowledge. Any service within the infrastructure should fail over quietly and reliably to an alternate service with little to no disruption to the user. There should be no degradation when services go down.
Sessions and State leading to Efficient Workflow
Can you believe that it was almost 15 years ago that we started down the path of trying to force our session based, stateful systems into the "World Wide Web" and the Mosaic browser? We slapped web skins on existing applications. Many even continued to architect this way because it was the knowledge that existed at the time. Unfortunately, some of that legacy is still with us. Admittedly, OCLC has some remaining pockets of this model.
The limitations led many to view statefulness and session based designs as bad juju. Over the following years, the information industry has meandered through various solutions to this problem: Cookies, bad cookies; Session URL tags, bad session URL tags; etc. Fast forward to today and we find that most systems strive to implement good stateful models without the legacy of doing so with sessions. Think of "good stateful models" as those that support efficient workflows. Workflows might be a good future topic... but I will just throw out that "efficient" is usually not equal to "the way I always did it".
What does that have to do with Transaction Redirects?
Efficient workflows require that your services maintain some context for you. Why are you here? What did you do last? What are you likely to do next? These are relatively simple things to do in a single host environment. They become a little harder across multiple machines. They become very difficult across data centers. Now introduce failures into the system: machines crashing, disks failing, applications failing, networks failing... all things that will happen. Your systems must be prepared on every transaction to infer some context even if that system has not previously seen your history.
Consistency and Availability
As you might imagine, transaction redirections introduce a balance between consistency and availability... something that has been difficult in the library industry. Up to a point, they both can increase together with well built software. At some point however, higher consistency across larger and larger stores of data leads to lower reliability (insert your favorite metasearch story here). There is not a single right answer. For some applications, you don't want an answer unless it is guaranteed right... at your doctor's office for example. In other cases, having the service available is more important than getting the exact same answer on multiple attempts... shopping on Amazon. The key is finding the right balance for your environment.
Again, what does this have to do with Transaction Redirects?
I used the above path to demonstrate the issues of statefulness and sessions in human interaction with our systems. These issues apply to service components in service oriented architectures. In modern service oriented systems, the number of components can be onerous (70+ in worldcat.org). We must build the overall system anticipating failures. The value of SOA systems is that components can be scaled independently and fail over independently. For example, if cover art is becoming slow, we can add cover art virtual servers within minutes. If one fails, the calling applications can switch over silently to an alternative. Each component of the system must be prepared for taking transaction loads in growth situations as well as failures. They must do this without forcing the user to back up to the top of a workflow chain.
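The silent switch-over described above can be sketched as a caller that simply tries the next replica when one fails. This is only an illustration under assumptions: the function names and the cover art stand-ins are hypothetical, not the actual worldcat.org code.

```python
def call_with_failover(request, replicas):
    """Try each replica in order and return the first successful reply.

    `replicas` is a list of callables; each one is a hypothetical stand-in
    for a client of one copy of the same service.
    """
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # this copy failed; move on quietly
            last_error = exc
    raise RuntimeError("no replica could serve the request") from last_error


# Hypothetical stand-ins for two copies of a cover art service.
def cover_art_primary(isbn):
    raise ConnectionError("primary cover art server is down")


def cover_art_backup(isbn):
    return f"cover for {isbn}"


print(call_with_failover("0140439447", [cover_art_primary, cover_art_backup]))
```

The caller never sees the failure of the first copy; it only sees an error if every copy is down, which is the point at which the user would otherwise be pushed back to the top of the workflow.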
But how do you do that?
The good news is that there are many blazing the trail and pointing us to what works and what doesn't work. The not-so-good news is there is not an easy way to take an existing application and add this stuff after the fact. There is much written about the architectures of the large internet services. The common point amongst all of them is that they are designing core infrastructures that support models of scale and availability. They are not simply hosting an application on the web.
An example component at OCLC is our internally developed text engine. All of the data is in memory for absurdly fast response time. It is spread across three clusters. Each transaction is sent to all three at the same time. The first one to respond wins. Each partition within each cluster is also replicated. A transaction failure, outside of those we humans cause, is practically impossible. We have even tested pulling plugs from the wall and watching load tests continue to hum along without skipping a beat.
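The "send to all three, first one to respond wins" pattern can be sketched in a few lines. This is a minimal illustration and not the text engine's implementation; the three cluster functions are hypothetical stand-ins and threads are just one convenient way to issue the requests at the same time.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_response(query, clusters):
    """Send `query` to every cluster at once; return the first successful reply."""
    executor = ThreadPoolExecutor(max_workers=len(clusters))
    try:
        futures = [executor.submit(cluster, query) for cluster in clusters]
        errors = []
        for future in as_completed(futures):
            try:
                return future.result()  # first cluster to answer wins
            except Exception as exc:    # that cluster failed; wait for another
                errors.append(exc)
        raise RuntimeError(f"all {len(clusters)} clusters failed: {errors}")
    finally:
        # Do not wait for the slower clusters; their replies are discarded.
        executor.shutdown(wait=False)


# Hypothetical stand-ins for three cluster clients.
def cluster_a(query):
    time.sleep(0.2)  # a slow cluster
    return f"A:{query}"


def cluster_b(query):
    return f"B:{query}"  # answers immediately


def cluster_c(query):
    raise ConnectionError("cluster C is unplugged")


print(first_response("dickens", [cluster_a, cluster_b, cluster_c]))
```

The trade-off is that every transaction costs three times the work, in exchange for response time set by the fastest cluster and no visible failure unless all three are down.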
This is the second post in the series on "What is Web-Scale". It has been a while since my first post so I had better get on with it. I took an informal poll on Twitter to select which area of the web-scale / cloud concepts to expand. Transparency was popular but the most popular was "just do them in order". That is what I will do. Transparency will be next.
Reiterating the bullet from the summary post: Available and Reliable: 99.9 or 99.99% Availability (24x7x365, not against an advertised availability) Always On: No down time, planned or otherwise. The site must always be available.
A common internet forum statement is "If there aren't pictures, it didn't happen"... so here is a picture. The first thing about availability at scale is that you cannot depend on opinions or feelings about whether it is good enough. You must measure it. It must be measured every second of every day. The data must be logged over long periods to determine individual service frailty. For massively scalable systems, this data must be reviewed daily with alarms going off anytime a system falls out of specification. The following is a high level dashboard of one of our system monitors at OCLC. Failures must be evaluated for corrective action. It's just not optional.
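As a rough sketch of what "measure it" can mean in practice, the following turns a log of probe results into a measured availability and flags a service that has fallen out of specification. The function names, the once-per-second probe and the 99.9% target are assumptions for illustration, not a description of OCLC's monitors.

```python
def measured_availability(probe_results):
    """Fraction of probes that succeeded.

    `probe_results` is an iterable of booleans, one per probe
    (for example one health check per second).
    """
    results = list(probe_results)
    if not results:
        raise ValueError("no probes recorded")
    return sum(results) / len(results)


def out_of_spec(probe_results, target=0.999):
    """True when the measured availability is below the target."""
    return measured_availability(probe_results) < target


# One hypothetical day of per-second probes with 90 seconds of failures.
day = [True] * (86400 - 90) + [False] * 90
print(f"{measured_availability(day):.5f}", out_of_spec(day, target=0.999))
```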
The numbers: What does it mean... system managers' slang is "Two nines" or "Four nines". 99% available is "2 nines", 99.99% is four. Simple enough, right? While it seems mathematically simple, this area tends to be often misunderstood. We all tend to relate statements on reliability to personal devices and machines that are very local and singular in nature. A single machine at 99.99% is down for 52 minutes a year.
Summary:
99% - 3.65 days outage per year
99.9% - 8.76 hours outage per year
99.99% - 52.56 minutes outage per year
99.999% - 5.256 minutes outage per year
99.9999% - 31.536 seconds outage per year
99.99999% - 3.1536 seconds outage per year
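For anyone who does want to check the math, the summary figures come from multiplying the unavailable fraction by the length of a 365-day year. A small sketch (the function name is just for illustration):

```python
# Minutes in a 365-day year, matching the figures in the summary above.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability):
    """Expected minutes of outage per year for an availability between 0 and 1."""
    return MINUTES_PER_YEAR * (1.0 - availability)


for pct in ("99", "99.9", "99.99", "99.999"):
    minutes = downtime_minutes_per_year(float(pct) / 100.0)
    print(f"{pct}% -> {minutes:.2f} minutes per year")
```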
Now the bad news: A service actually drops to 99.98% available when it is dependent on just two 99.99% lower level services (105 minutes per year). This can be called series availability. The more services you chain together the worse your reliability gets.
2 services in series: 365 * 24 * 60 * (1 - .9999 * .9999) = 105 minutes annually.
3 services in series: 365 * 24 * 60 * (1 - .9999 * .9999 * .9999) = 157 minutes annually.
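The series arithmetic can be checked with a short sketch: the availability of the chain is the product of the individual availabilities, and the annual downtime is whatever fraction of the year is left over. The function names are illustrative, and the calculation assumes the services fail independently.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def series_availability(availabilities):
    """Availability of a chain in which every service must be up."""
    return prod(availabilities)


def series_downtime_minutes(availabilities):
    """Expected minutes of outage per year for the whole chain."""
    return MINUTES_PER_YEAR * (1.0 - series_availability(availabilities))


for n in (1, 2, 3, 5):
    chain = [0.9999] * n
    print(f"{n} in series -> {series_downtime_minutes(chain):.1f} minutes per year")
```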
As you might guess, our current Web 2.0 mashup world is generating reliability issues as services are very typically series based... metasearch -> webui -> SRU -> database -> data just as a common example.
Don't despair, there is good news! This good news actually supports a service architecture environment instead of detracting from it. There are ways to improve availability with a SOA model. The first, and most expensive way is to buy and manage very highly reliable individual systems. This is the path the big iron of the 80's took... and it got very, very expensive. In modern highly available environments, this issue is addressed by parallelizing the workload. Simply double up each service so that both copies must be down for the entire system to be down and you are back under the 52 minutes of a single 99.99% machine.
Availability = 1 - (1-MachineAvail)**2
Given that, we have even better news: if you double 99% available machines you get 99.99% availability; triple them and you get 99.9999%! This is why the massively scalable architectures now can use commodity hardware instead of paying for reliability in the individual machines.
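The doubling and tripling claim follows directly from the formula above, generalized from 2 copies to any number. A small sketch, assuming the copies fail independently of each other (the function name is illustrative):

```python
def parallel_availability(machine_availability, copies):
    """Availability of a service that is up as long as any one copy is up."""
    return 1.0 - (1.0 - machine_availability) ** copies


for copies in (1, 2, 3):
    combined = parallel_availability(0.99, copies)
    print(f"{copies} x 99% machines -> {combined * 100:.4f}% available")
```

The independence assumption matters: two copies on the same power strip or behind the same switch do not fail independently, which is part of why the real-life cost moves into design rather than disappearing.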
In real life, however, it is never "simply double..." There are issues in software design, data integrity, transaction routing, load balancing, fail-over, etc. These all contribute a significant cost to obtaining highly reliable systems. In other words, we moved some expense from hardware to software. This is good news again since software copies scale less expensively than hardware.
Planned versus Unplanned: How many of our services have an outage notification page or warning page of pending outage or a current outage? Historically we have struggled over the words to use as I am sure everyone has. We carefully craft messages and explanations. But realize this... NOBODY READS THEM! We might feel a little better when we find the notice after we see a service has failed but the vast majority of users of our systems just see a failure and move on to an alternate.
Another false comfort is that somehow planned and unplanned outages are different. Outages for upgrades are really not tolerated by users. Major internet services figured this out from the beginning. The service must be on at all times. Software installs must be done on a rolling basis while user transactions are serviced. Hardware additions or replacements must be the same. Always on is now the default end-user expectation.
A positive byproduct of the scaling across commodity hardware for reliability is that there are now many options for rolling installs across an environment. It can be done in parallel data centers, farms within data centers, individual machines or even virtual machines on a single host. Again, it takes software design and configuration management design, but it is quite practical in today's environments.
OCLC: Focusing on just one service platform, worldcat.org is comprised of 150 servers. These servers are divided into farms by function... 65 database servers, 75 application servers, and 10 servers supporting harvesting and bots. We continually add hardware and rebalance the environment with demand. We have two data centers today and will likely have more in the future as we grow and balance load geographically.
The recent OCLC press release about Worldcat Local "Quick Start" and network management environment used the term "Web-Scale" 12 times (Press Release). Many assumptions can be made about what this means, but let me outline what this means to us in deploying technology. Our use of the term within OCLC describes both a technical architecture and the impact the services have within the community they serve. The following is a brief outline focused on the technology aspects. Over the next several weeks I will provide a detailed post on each of these topics.
What is Web Scale? A system which is Highly Available, Reliable, Transparent, High Performance, Scalable, Accessible, Secure, Usable, and Inexpensive.
There are alternate phrases, some of which are true alternatives and some of which have different meanings. These include: "utility computing", "web-scale computing", "on-demand infrastructure", "cloud computing", "Software as a Service (SaaS)" and "Platform as a Service (PaaS)".
Available and Reliable: 99.9 or 99.99% Availability (24x7x365, not against an advertised availability)
Always On: No down time, planned or otherwise. The site must always be available.
Transparent: Transaction level redirection without user knowledge. Any service within the infrastructure should fail over quietly and reliably to an alternate service with little to no disruption to the user. There should be no degradation when services go down.
High Performance: Fast response time. Sub-second response time on every transaction type at the internal service level. We should be clearly faster than anything implemented locally. Network latency should be the only performance concern.
Scalable: Three-dimensional scaling. The environment must allow us to add users with decreasing costs per user, add servers with decreasing cost per server, and add services with decreasing costs per service.
Accessible: Allow access to all that the web has to offer. It must allow external services to integrate in many different ways convenient to them, and must integrate to external services in commonly used standard ways. It must integrate into the web where the current and new users spend time.
Secure: Must provide identity management, service protection, and data protection while ensuring personal as well as institutional privacy.
Usable: The system must be usable by a massively diverse community. For us this means professionals as well as general internet users; public, academic and special libraries; novice system users as well as experienced. Decisions concerning UI design, workflow and usability must be based on sound evidence, not opinions.
Inexpensive: Must do ALL of the above at a lower price than the local environment can do SOME of the things above. The value proposition must be simple and clear.
Detail about these will come in later posts. Feel free to comment! I welcome your feedback.

