What is a Service Level Agreement?
A key performance indicator (KPI) that many cloud service providers and cloud product vendors use is service availability. It is usually expressed as uptime per year and defines the expected service performance in a so-called service-level agreement (SLA), sometimes also called an external service agreement.
A service-level agreement describes the level of service that is delivered to a particular customer. Sometimes there are several levels of service for different pricing models. SLAs are not static documents and may change over time.
It's important to note that there are different types of SLAs; they are not restricted to software or services. Also, not all software- or service-related SLAs come with an uptime warranty.
|Availability|Downtime per year|Downtime per month|Downtime per week|
|---|---|---|---|
|95%|18 days 6 hours|1 day 12 hours|8 hours 24 minutes|
|99%|3 days 16 hours|7 hours 12 minutes|1 hour 41 minutes|
|99.9%|8 hours 46 minutes|43 minutes 12 seconds|10 minutes 6 seconds|
|99.97%|2 hours 38 minutes|13 minutes 8 seconds|3 minutes 2 seconds|
|99.99%|52 minutes 34 seconds|4 minutes 23 seconds|1 minute 1 second|
|99.999%|5 minutes 15 seconds|26.3 seconds|6.1 seconds|
|99.9999%|31.5 seconds|2.63 seconds|0.61 seconds|
Why availability is also called the "number of nines" should be clear after checking the table above.
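The downtime figures follow directly from the availability percentage: the allowed downtime is the unavailable fraction multiplied by the length of the period. Here is a minimal sketch in Python, assuming a 365-day year and a 30-day month (other conventions shift the results by a few seconds); the name `allowed_downtime` is made up for this illustration.

```python
from datetime import timedelta

# Assumed period lengths: a 365-day year and a 30-day month.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Return the maximum downtime within a period for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return PERIODS[period] * unavailable_fraction


# Print the yearly downtime budget for a few common availability levels.
for availability in (99.0, 99.9, 99.99, 99.999):
    print(availability, allowed_downtime(availability, "year"))
```

Every additional nine divides the downtime budget by ten, which is why the jump from 99.9% to 99.99% is far more demanding than it looks.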
These service agreements are not empty promises; they define authoritative service standards. An SLA is a binding contract, and a missed KPI is a contract breach. When service providers are not able to keep up their performance standards, they usually compensate customers with either direct refunds or service credits. How service credits are calculated is also part of the Service Level Agreement. Some vendors also reserve the right to earn back service credits when their service performs above the defined goals.
AWS, for example, states that the service credit percentage for EC2 is 10% for a monthly uptime between 99.98% and 99%, 30% between 99% and 95%, and 100% below 95%. Amazon Web Services will therefore refund 100% of your costs if the service was unavailable for more than 5% of a given month, which is about 1 day and 12 hours in a 30-day month.
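To make the tier logic concrete, here is a small sketch that maps a measured monthly uptime to a credit percentage. The thresholds are the ones quoted above; the function name and the handling of the exact boundary values are assumptions of this example, not taken from any vendor's SLA text.

```python
def service_credit_percent(monthly_uptime_percent: float) -> int:
    """Map a measured monthly uptime to a service credit percentage.

    Tiers follow the figures quoted in the text. Whether each boundary
    is inclusive or exclusive is an assumption of this sketch.
    """
    if monthly_uptime_percent < 95.0:
        return 100
    if monthly_uptime_percent < 99.0:
        return 30
    if monthly_uptime_percent <= 99.98:
        return 10
    return 0  # uptime goal met, no credit


# A month with 97% uptime falls into the 30% tier.
print(service_credit_percent(97.0))
```

Always check the current SLA of your vendor before relying on such numbers, since the tiers may change over time.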
Calculating the availability
There are certain things you should keep in mind when calculating the availability of your service:
- The availability of your system is calculated by multiplying the availabilities of all components it depends on
- Redundancy raises the failure rate to the power of the number of redundant instances: with two instances the failure rate is squared, which roughly doubles the number of nines
- The overall availability cannot exceed the availability of the least available component in the serial chain
- Redundancy increases the availability of specific components but cannot lift it above the availability of the wrapping service
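These rules can be sketched as two small helper functions. This is a simplified model that assumes components fail independently of each other; the names `serial` and `redundant` are made up for this illustration.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of components that must all be up at the same time."""
    return prod(availabilities)


def redundant(availability: float, copies: int = 2) -> float:
    """Availability of identical copies where one working copy is enough.

    Assumes the copies fail independently of each other.
    """
    failure_rate = 1 - availability
    return 1 - failure_rate ** copies


# Two 99% components in a chain are worse than either one alone.
print(serial(0.99, 0.99))

# Two redundant 99% components are better than either one alone.
print(redundant(0.99))
```

Chaining components always lowers availability, while redundancy raises it. Real failures are often correlated (same data center, same deployment), so treat the redundant figure as an upper bound.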
Example with AWS EC2
Let's look at a simple example to shed light on these rather abstract claims. Assume you have a service running on AWS EC2. AWS EC2 has an uptime warranty of 99.99%.
Assuming your service runs smoothly, does not encounter unexpected errors, and AWS keeps its promise, the uptime of your service is expected to be 99.99%.
If you want to deliver more than 99.99% availability, it is tempting to just spawn another EC2 instance with the same service and load balance the two. On paper, both instances would have to fail at the same time, which results in a failure rate of 0.01% × 0.01% = 0.000001%.
The problem with this calculation is that the wrapping service, such as Elastic Load Balancing, becomes the bottleneck. Elastic Load Balancing itself has an uptime warranty of 99.99%, so your service cannot exceed that value using only EC2 and Elastic Load Balancing.
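The numbers of this example can be checked with a few lines of Python. This is a sketch under the assumption that both instances fail independently and that both services deliver exactly their stated 99.99%.

```python
ec2 = 0.9999  # uptime warranty of a single EC2 instance (from the text)
elb = 0.9999  # uptime warranty of the load balancer (from the text)

# Two redundant instances are down only when both fail at the same time.
instance_pair = 1 - (1 - ec2) ** 2

# The load balancer sits in front of the pair, so the two multiply.
overall = elb * instance_pair

print(f"instance pair: {instance_pair:.8%}")
print(f"overall:       {overall:.8%}")
```

The redundant pair alone is extremely reliable, but the overall value stays just below the load balancer's 99.99%, because every request has to pass through it.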
Example with self-hosted components
Let's assume you have an application consisting of 4 microservices and a database, each running on different servers of some Infrastructure-as-a-Service (IaaS) vendor. Your software itself has an availability of 100%, and the servers offered by your vendor have an uptime warranty of 99.95%. All your services have simple redundancy and are wrapped by a load balancer that comes with 99.99% availability.
Your availability is then the product of the availabilities of the load balancer and of every component behind it.
Your system could therefore offer an uptime warranty of roughly 99.5%, which adds up to a maximum downtime of about 1 day and 20 hours per year.
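The result depends heavily on how redundancy is modeled, so it helps to spell the assumptions out. The sketch below computes two readings of this setup. The first treats all ten servers (five components, two copies each) as one serial chain, which is the pessimistic view and lands close to the 99.5% figure. The second treats each redundant pair as parallel, following the redundancy rule from the list above, and assumes independent failures. All variable names are made up for this illustration.

```python
SERVER = 0.9995         # uptime warranty per server (from the text)
LOAD_BALANCER = 0.9999  # load balancer availability (from the text)
COMPONENTS = 5          # 4 microservices + 1 database
COPIES = 2              # simple redundancy: two servers per component

# Reading 1: every server must be up (pessimistic serial chain).
serial_chain = LOAD_BALANCER * SERVER ** (COMPONENTS * COPIES)

# Reading 2: a component is down only if both of its servers are down.
component = 1 - (1 - SERVER) ** COPIES
parallel_pairs = LOAD_BALANCER * component ** COMPONENTS


def downtime_hours_per_year(availability: float) -> float:
    """Maximum downtime in hours, assuming a 365-day year."""
    return (1 - availability) * 365 * 24


for label, value in (("serial chain", serial_chain),
                     ("parallel pairs", parallel_pairs)):
    print(f"{label}: {value:.4%}, "
          f"{downtime_hours_per_year(value):.1f} hours down per year")
```

Under the second reading the load balancer dominates and the system ends up just below 99.99%. Which figure you can safely promise depends on whether your redundant servers really fail independently and whether failover works without interruption.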
I wrote a small calculator that computes the availability of your system based on your input. It may not fit every use case, but it makes it easy to get a rough understanding of how specific components affect the overall availability of your system.