High-Availability

December 2016

Introduction to Reliability

Whatever service a computer system provides, users must be able to trust how the system operates in order to use it effectively. The term "reliability" characterises how trustworthy a computer system is.

A failure occurs when a service does not function properly, i.e. when its state of operation is abnormal or, more precisely, not in accordance with its specifications. From the user's point of view, a service is in one of two states:

  • appropriate service, i.e. in accordance with expectations
  • inappropriate service, i.e. not in accordance with expectations

A failure is attributable to an error, i.e. a local dysfunction. Not all errors lead to service failure.

There are several ways to limit service failure:

  • Error prevention, which consists of avoiding errors by anticipating them
  • Fault tolerance, the goal of which is to provide a service that is in accordance with specification despite errors by introducing redundancy
  • Error elimination, aiming to reduce the number of errors through corrective actions
  • Error prediction, by anticipating errors and their impact on service

Introduction to High-Availability

"High-availability" refers to the set of measures that aim to guarantee service availability, i.e. to ensure around-the-clock operation of a service.

The term "availability" refers to the probability that a service is operating properly at a given time.

The term "reliability", which is also sometimes used, refers to the probability that a system operates normally throughout a given period of time. This is called "continuity of service".

Availability is most often expressed by the availability rate (a percentage), which is measured by dividing the time the service is available by the total time.

Availability Rate    Downtime per Year
97%                  11 days
98%                  7 days
99%                  3 days and 15 hours
99.9%                8 hours and 46 minutes
99.99%               53 minutes
99.999%              5 minutes
99.9999%             32 seconds
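
The relationship between an availability rate and the corresponding downtime can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming a non-leap year of continuous service; the function names `availability_rate` and `yearly_downtime_seconds` are illustrative and do not come from any particular library.

```python
# Assumption: the reference period is a non-leap year of continuous service.
SECONDS_PER_YEAR = 365 * 24 * 3600


def availability_rate(uptime_seconds: float, total_seconds: float) -> float:
    """Return availability as a percentage: available time / total time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100.0 * uptime_seconds / total_seconds


def yearly_downtime_seconds(rate_percent: float) -> float:
    """Return the downtime per year, in seconds, implied by a given rate."""
    if not 0.0 <= rate_percent <= 100.0:
        raise ValueError("rate_percent must be between 0 and 100")
    return (1.0 - rate_percent / 100.0) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for rate in (97, 98, 99, 99.9, 99.99, 99.999, 99.9999):
        seconds = yearly_downtime_seconds(rate)
        print(f"{rate}% -> {seconds / 3600:.2f} hours per year")
```

Each additional "nine" divides the yearly downtime by ten, which is why the figures drop from days to hours to minutes so quickly.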

Risk Evaluation

The failure of a computer system can cause losses in productivity and money and, in certain critical cases, even material and human losses. It is therefore necessary to evaluate the risks tied to the failure of each component of a computer system and to plan the means and measures needed to avoid incidents or to restore service within an acceptable amount of time.

A networked computer system can fail in numerous ways. The causes of failures can be broken down as follows:

  • Physical causes (these can be natural or criminal in nature):
    • Natural disaster (flood, earthquake, fire)
    • Environment (bad weather, humidity, temperature)
    • Hardware failure
    • Network failure
    • Power cut
  • Human causes (these can be intentional or accidental):
    • Design error (software bug, poor network provisioning)
  • Operational causes (these are linked to system status at a given moment):
    • Software bug
    • Software failure

All of these risks can have different causes such as the following:

  • Intentional maliciousness

Fault Tolerance

Since it is impossible to prevent breakdowns entirely, one solution is to set up redundancy mechanisms by duplicating critical resources.

The ability of a system to operate despite the failure of one of its components is called fault tolerance.
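
To illustrate why duplicating a critical resource helps, the sketch below estimates the availability of several redundant copies of the same component. It rests on a strong assumption that is not stated in the text: the copies fail independently of one another, so the service is down only when every copy is down at the same time. The function name `combined_availability` is hypothetical.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Estimate the availability of `copies` redundant components.

    Assumption: failures are independent, so the service is unavailable
    only when all copies are unavailable simultaneously.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    unavailability = 1.0 - component_availability
    return 1.0 - unavailability ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "copies:", combined_availability(0.99, n))
```

In practice, copies that share a power supply, a network link or a software bug do not fail independently, so the real gain is usually smaller than this estimate suggests.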

When one of the resources breaks down, the other resources take over, giving system administrators time to find a solution to the problem. This is called "Fail-Over Service" (FOS).
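
The takeover behaviour described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: each redundant resource is modelled as a plain Python callable, and `call_with_failover`, `primary` and `standby` are hypothetical names, not part of any real fail-over product.

```python
from typing import Callable, List, Sequence


class AllResourcesFailed(Exception):
    """Raised when every redundant resource has failed."""


def call_with_failover(resources: Sequence[Callable[[], str]]) -> str:
    """Try each redundant resource in order and return the first success."""
    errors: List[Exception] = []
    for resource in resources:
        try:
            return resource()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllResourcesFailed(f"{len(errors)} resource(s) failed: {errors}")


def primary() -> str:
    # Simulated breakdown of the main resource.
    raise ConnectionError("primary is down")


def standby() -> str:
    # The redundant resource that takes over.
    return "served by standby"


if __name__ == "__main__":
    print(call_with_failover([primary, standby]))
```

The caller still receives an answer while the primary is down, and the collected errors give administrators a starting point for diagnosing the breakdown.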

Ideally, in the case of hardware failures, the faulty hardware components should be hot swappable, i.e. capable of being removed and replaced without service interruption.

Backup

Setting up a redundant architecture ensures that system data will be available but does not protect the data against user-introduced errors or against natural disasters such as fires, floods or even earthquakes.

It is therefore necessary to set up backup mechanisms (ideally remote) in order to guarantee the long-term preservation of data.

Moreover, a backup mechanism can also be used for archival storage, i.e. saving data in a state that corresponds to a given date.
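
As an illustration of archival storage, the sketch below saves a directory as a compressed archive whose name carries the date and time of the backup, so that each archive corresponds to the state of the data at a given moment. It is a minimal example using only the Python standard library; the function name `archive_directory` and the directory names are hypothetical, and in a real setup `backup_root` would ideally point to a remote location.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def archive_directory(source: Path, backup_root: Path) -> Path:
    """Create a timestamped .tar.gz archive of `source` under `backup_root`."""
    backup_root.mkdir(parents=True, exist_ok=True)
    # UTC timestamp so that archive names sort chronologically.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base_name = backup_root / f"{source.name}-{stamp}"
    archive = shutil.make_archive(str(base_name), "gztar", root_dir=str(source))
    return Path(archive)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        (data / "report.txt").write_text("quarterly figures\n")
        print(archive_directory(data, Path(tmp) / "backups"))
```

Keeping several dated archives, rather than overwriting a single copy, is what makes it possible to recover from a user error that was only noticed after the fact.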


This document, entitled « High-Availability », from CCM (ccm.net) is made available under the Creative Commons license. You may copy and modify copies of this page under the conditions stipulated by the license, as long as this note appears clearly.