May 2009 Archives
This is the second post in the series on "What is Web-Scale". It has been a while since my first post so I had better get on with it. I took an informal poll on twitter to select which area of the web-scale / cloud concepts to expand. Transparency was popular but the most popular was "just do them in order". That is what I will do. Transparency will be next.
Reiterating the bullet from the summary post:
Available and Reliable: 99.9 or 99.99% Availability (24x7x365, not against an advertised availability)
Always On: No down time, planned or otherwise. The site must always be available.
A common internet forum statement is "If there aren't pictures, it didn't happen"... so here is a picture. The first thing about availability at scale is that you cannot depend on opinions or feelings about whether it is good enough. You must measure it. It must be measured every second of every day. The data must be logged over long periods to determine individual service frailty. For massively scalable systems, this data must be reviewed daily with alarms going off anytime a system falls out of specification. The following is a high level dashboard of one of our system monitors at OCLC. Failures must be evaluated for corrective action. It's just not optional.
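As a rough illustration of "measure it and alarm when it falls out of specification", here is a minimal Python sketch. It is not OCLC's monitoring code: the function names, the one-probe-per-second sampling, and the 99.99% target are all assumptions made for the example.

```python
def measured_availability(samples):
    """Fraction of probes that succeeded.

    `samples` is a sequence of booleans, one per probe
    (True = the service answered). Hypothetical input format.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to evaluate")
    return sum(samples) / len(samples)


def out_of_spec(samples, target=0.9999):
    """True when measured availability is below the target (assumed 99.99%)."""
    return measured_availability(samples) < target


# One day of once-per-second probes is 86,400 samples.
good_day = [True] * 86392 + [False] * 8    # 8 failed seconds
bad_day = [True] * 86390 + [False] * 10    # 10 failed seconds

for name, day in (("good_day", good_day), ("bad_day", bad_day)):
    status = "ALARM" if out_of_spec(day) else "ok"
    print(f"{name}: {measured_availability(day):.6f} {status}")
```

At a 99.99% target, a single day tolerates fewer than nine failed one-second probes, which is why the measurement has to be continuous rather than occasional.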
The numbers: What does it mean? System managers' slang is "two nines" or "four nines": 99% available is two nines, 99.99% is four. Simple enough, right? While it seems mathematically simple, this area is often misunderstood. We all tend to relate statements on reliability to personal devices and machines that are very local and singular in nature. A single machine at 99.99% is down for about 52 minutes a year.
99% - 3.65 days outage per year
99.9% - 8.76 hours outage per year
99.99% - 52.56 minutes outage per year
99.999% - 5.256 minutes outage per year
99.9999% - 31.536 seconds outage per year
99.99999% - 3.1536 seconds outage per year
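The table above is simple arithmetic: the unavailable fraction multiplied by the length of a year. A small Python sketch, assuming a 365-day year (the function name is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Expected annual downtime in minutes for an availability between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return MINUTES_PER_YEAR * (1.0 - availability)


for availability in (0.99, 0.999, 0.9999, 0.99999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability:.5%} available: {minutes:.2f} minutes down per year")
```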
Now the bad news: A service actually drops to 99.98% available when it depends on just two 99.99% lower-level services (about 105 minutes of outage per year). This can be called series availability: the availabilities multiply, so the more services you chain together, the worse your reliability gets.
2 services in series: 365 * 24 * 60 * (1 - .9999 * .9999) = about 105 minutes annually.
3 services in series: 365 * 24 * 60 * (1 - .9999 * .9999 * .9999) = about 158 minutes annually.
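The series calculation can be sketched in a few lines of Python. This assumes the services fail independently and that the chain is down whenever any one link is down; the function names are hypothetical.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def series_availability(availabilities):
    """Availability of a chain in which every service must be up."""
    return prod(availabilities)


def series_downtime_minutes(availabilities):
    """Expected annual downtime in minutes for the whole chain."""
    return MINUTES_PER_YEAR * (1.0 - series_availability(availabilities))


for count in (1, 2, 3, 5):
    chain = [0.9999] * count
    print(f"{count} service(s) in series: "
          f"{series_downtime_minutes(chain):.1f} minutes per year")
```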
As you might guess, our current Web 2.0 mashup world is generating reliability issues as services are very typically series based... metasearch -> webui -> SRU -> database -> data just as a
Don't despair, there is good news! This good news actually supports a service architecture environment rather than detracting from it. There are ways to improve availability with a SOA model. The first, and most expensive, way is to buy and manage very highly reliable individual systems. This is the path the big iron of the 1980s took... and it got very, very expensive. In modern highly available environments, this issue is addressed by parallelizing the workload. Simply double up each service so that both copies must be down for that service to be down. Assuming the copies fail independently, each doubled service is far more available than a single 99.99% machine, and the chain as a whole drops back well below the 52 minutes a year of one such machine.
Availability = 1 - (1-MachineAvail)**2
Given that, we have even better news: if you double 99% available machines you get 99.99% availability; triple them and you get 99.9999%! This is why massively scalable architectures can now use commodity hardware instead of paying for reliability in the individual machines.
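The doubling formula generalizes to any number of copies. The Python sketch below assumes the copies fail independently and that one working copy is enough to serve traffic; `parallel_availability` is a name chosen for this example.

```python
def parallel_availability(machine_availability, copies=2):
    """Availability of a service that is up while at least one copy is up.

    Assumes independent failures: the service is down only when
    every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - machine_availability) ** copies


for copies in (1, 2, 3):
    combined = parallel_availability(0.99, copies)
    print(f"{copies} x 99% machines: {combined:.6%} available")
```

In practice failures are rarely fully independent (shared power, network, or software bugs), so the computed figure is best read as an upper bound.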
In real-life examples, however, it is never "simply double..." There are issues in software design, data integrity, transaction routing, load balancing, fail-over, etc. These all contribute a significant cost to obtaining highly reliable systems. In other words, we moved some expense from hardware to software. This is good news again since software copies scale less expensively than hardware.
Planned versus Unplanned: How many of our services have an outage notification page or a warning page for a pending or current outage? Historically we have struggled over the words to use, as I am sure everyone has. We carefully craft messages and explanations. But realize this... NOBODY READS THEM! We might feel a little better when we find the notice after we see a service has failed, but the vast majority of users of our systems just see a failure and move on to an alternate.
Another false comfort is that somehow planned and unplanned outages are different. Outages for upgrades are really not tolerated by users. Major internet services figured this out from the beginning. The service must be on at all times. Software installs must be done on a rolling basis while user transactions are serviced. Hardware additions or replacements must be the same. Always on is now the default end-user expectation.
A positive byproduct of the scaling across commodity hardware for reliability is that there are now many options for rolling installs across an environment. It can be done in parallel data centers, farms within data centers, individual machines or even virtual machines on a single host. Again, it takes software design and configuration management design, but it is quite practical in today's environments.
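The idea of a rolling install can be sketched as a loop that takes a small batch out of rotation, upgrades it, and returns it before touching the next batch. This is only an illustration under simplifying assumptions: the server names, the `upgrade` callback, and the batch size are hypothetical, and real deployments also need health checks and rollback.

```python
def rolling_upgrade(servers, upgrade, batch_size=1):
    """Upgrade servers batch by batch; return the lowest in-service count seen."""
    if batch_size < 1 or batch_size >= len(servers):
        raise ValueError("batch_size must leave at least one server in service")
    in_service = set(servers)
    lowest = len(in_service)
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        in_service.difference_update(batch)  # drain: stop routing to the batch
        lowest = min(lowest, len(in_service))
        for server in batch:
            upgrade(server)                  # install while the others serve users
        in_service.update(batch)             # return the batch to rotation
    return lowest


upgraded = []
farm = ["app1", "app2", "app3", "app4"]     # hypothetical server names
lowest = rolling_upgrade(farm, upgraded.append, batch_size=1)
print(f"upgraded {len(upgraded)} servers, never fewer than {lowest} in service")
```

The same loop applies whether a "server" is a data center, a farm, a machine, or a virtual machine; only the batch granularity changes.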
OCLC: Focusing on just one service platform, worldcat.org is made up of 150 servers. These servers are divided into farms by function... 65 database servers, 75 application servers, and 10 servers supporting harvesting and bots. We continually add hardware and rebalance the environment with demand. We have two data centers today and will likely have more in the future as we grow and balance load geographically.
The recent OCLC press release about Worldcat Local "Quick Start" and network management environment used the term "Web-Scale" 12 times (Press Release). Many assumptions can be made about what this means, but let me outline what this means to us in deploying technology. Our use of the term within OCLC describes both a technical architecture and the impact the services have within the community they serve. The following is a brief outline focused on the technology aspects. Over the next several weeks I will provide a detailed post on each of these topics.
What is Web Scale? A system which is Highly Available, Reliable, Transparent, High Performance, Scalable, Accessible, Secure, Usable, and Inexpensive
There are alternate phrases, some of which are true alternatives and some of which have different meanings. These include: "utility computing", "web-scale computing", "on-demand infrastructure", "cloud computing", "Software as a Service (SaaS)" and "Platform as a Service (PaaS)".
Available and Reliable: 99.9 or 99.99% Availability (24x7x365, not against an advertised availability)
Always On: No down time, planned or otherwise. The site must always be available.
Transparent: Transaction level redirection without user knowledge. Any service within the infrastructure should fail over quietly and reliably to an alternate service with little to no disruption to the user. There should be no degradation when services go down.
High Performance: Fast response time. Sub-second response time on every transaction type at the internal service level. We should be clearly faster than anything implemented locally. Network latency should be the only performance concern.
Scalable: Three-dimensional scaling. The environment must allow us to add users with decreasing costs per user, add servers with decreasing cost per server, and add services with decreasing costs per service.
Accessible: Allow access to all that the web has to offer. It must allow external services to integrate in many different ways convenient to them, and must integrate to external services in commonly used standard ways. It must integrate into the web where the current and new users spend time.
Secure: Must provide identity management, service protection, and data protection while ensuring personal as well as institutional privacy.
Usable: The system must be usable by a massively diverse community. For us this means professionals as well as general internet users; public, academic and special libraries; novice system users as well as experienced. Decisions concerning UI design, workflow and usability must be based on sound evidence, not opinions.
Inexpensive: Must do ALL of the above at a lower price than the local environment can do SOME of the things above. The value proposition must be simple and clear.
Detail about these will come in later posts. Feel free to comment! I welcome your feedback.