Cloud Computing and System Fault Tolerance

By | November 27, 2016

Cloud computing use has risen alongside the development of Web 2.0 technologies and the increased availability of high-speed communications infrastructure and internet access. Well-known advantages of cloud computing include cost savings, reliability, manageability, and competitive edge ("Advantages and Disadvantages of Cloud Computing," n.d.). One aspect of cloud computing related to reliability is system fault tolerance. Fault tolerance is a system's ability to keep functioning as intended when failures or faults occur.

Traditional methods of achieving fault tolerance in information systems require users to have in-depth knowledge of the underlying mechanisms. This presents a challenge in cloud computing systems, since their architectural details are often not known to customers. Additionally, cloud architecture is layered, consisting of four layers. The base layer is the physical hardware layer. The next layer, built on the hardware layer, is the Infrastructure as a Service (IaaS) layer. Built on the IaaS layer in succession are the Platform as a Service (PaaS) layer and the Software as a Service (SaaS) layer. In this model, a failure in one layer affects the services above it; for example, a failure at the IaaS layer affects both the PaaS and SaaS layers above it.
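The upward propagation of failures through the layers can be illustrated with a minimal Python sketch. The layer names come from the description above; the function name `affected_layers` is hypothetical and chosen only for this example.

```python
# Layers ordered from bottom to top, as described in the text.
LAYERS = ["Hardware", "IaaS", "PaaS", "SaaS"]


def affected_layers(failed_layer: str) -> list[str]:
    """Return the failed layer plus every layer stacked above it."""
    index = LAYERS.index(failed_layer)  # raises ValueError for unknown layers
    return LAYERS[index:]


if __name__ == "__main__":
    # A failure at the IaaS layer also affects PaaS and SaaS.
    print(affected_layers("IaaS"))
```

This models only the simple "everything above is affected" rule; a real cloud stack may isolate some failures within a layer.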

Servers in cloud data centers typically contain multiple processors, storage disks, memory modules, and network adapters. Failure statistics for this hardware show the need for robust fault tolerance to meet the high-availability standards required in cloud computing. Fault behavior can be analyzed and modeled to determine how the failure of various components affects a cloud computing system. Fault tolerance and redundancy within individual server components are achieved through mechanisms such as multiple hard disk arrays and multiple redundant Network Interface Cards (NICs) (Jhawar & Piuri, 2013, pp. 5-10).
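One common way to model the effect of component redundancy is the standard series/parallel availability calculation. The sketch below assumes component failures are independent, which is a simplification, and the availability figures are made-up illustrations rather than measured statistics from the cited source.

```python
from math import prod


def series_availability(availabilities: list[float]) -> float:
    """All components are required: the system is up only if every one is up."""
    return prod(availabilities)


def parallel_availability(availabilities: list[float]) -> float:
    """Redundant components: the system is up if at least one is up."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    nic = 0.99   # hypothetical availability of a single NIC
    disk = 0.98  # hypothetical availability of a single disk

    # One NIC and one disk, both required.
    single = series_availability([nic, disk])

    # Two redundant NICs and two redundant disks, each pair in parallel.
    redundant = series_availability(
        [parallel_availability([nic, nic]), parallel_availability([disk, disk])]
    )
    print(f"single: {single:.4f}, redundant: {redundant:.4f}")
```

Under these assumptions, duplicating each component raises the availability of the server as a whole, which is the motivation for redundant disks and NICs.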


Different Levels of Fault Tolerance in Cloud Computing

Based on cloud computing architecture, different levels of failure independence can be achieved through various strategies as noted below:

  • Multiple machines within server clusters. In this fault tolerance model, applications are hosted on two or more hosts connected by a device such as a Top of Rack (ToR) switch, which acts as a failover and load-balancing device between the servers. A single server failure does not make the service unavailable, but a power failure or a switch failure could take down the entire application (Jhawar & Piuri, 2013, p. 11).
  • Multiple clusters within a data center. In this model, replicas of an application are hosted on servers in different clusters within a data center, connected by ToR switches and Aggregation Switches (AggS). Failure independence in this model is moderate, since the failure of an application server in one cluster does not directly affect the application servers in other clusters (Jhawar & Piuri, 2013, p. 11).
  • Multiple data centers. Two or more replicas of data and applications are hosted in different data centers. This model provides the highest fault tolerance, but high network latency or low bandwidth between data centers can degrade overall application performance (Jhawar & Piuri, 2013, p. 12).
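The three placement strategies above can be compared with a small Python sketch that asks whether any replica survives the failure of a given component. The `Replica` class, the `survives` function, and all host, cluster, and data center names are hypothetical; the sketch assumes each name is unique across the whole deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    data_center: str
    cluster: str
    host: str


def survives(replicas: list[Replica], scope: str, failed_name: str) -> bool:
    """True if at least one replica sits outside the failed component.

    scope is one of "host", "cluster", or "data_center".
    """
    return any(getattr(r, scope) != failed_name for r in replicas)


# Strategy 1: two hosts in the same cluster.
same_cluster = [Replica("dc1", "dc1-c1", "h1"), Replica("dc1", "dc1-c1", "h2")]
# Strategy 2: two clusters in the same data center.
multi_cluster = [Replica("dc1", "dc1-c1", "h1"), Replica("dc1", "dc1-c2", "h3")]
# Strategy 3: two data centers.
multi_dc = [Replica("dc1", "dc1-c1", "h1"), Replica("dc2", "dc2-c1", "h4")]

if __name__ == "__main__":
    print(survives(same_cluster, "host", "h1"))          # one host fails
    print(survives(same_cluster, "cluster", "dc1-c1"))   # e.g. ToR switch or power failure
    print(survives(multi_cluster, "cluster", "dc1-c1"))
    print(survives(multi_cluster, "data_center", "dc1"))
    print(survives(multi_dc, "data_center", "dc1"))
```

The sketch captures only failure independence. It does not model the latency and bandwidth cost of replicating across data centers.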

High-demand, high-availability cloud computing systems require robust fault tolerance mechanisms that provide the failover and load scaling needed to host various cloud services. The best strategy for meeting this requirement is to apply several fault tolerance techniques in layers. At the base level, cloud servers need multiple redundant components, such as Redundant Array of Independent Disks (RAID) configurations and multiple NICs. Fault tolerance is further improved by deploying cloud applications and services on failover servers in different clusters. Maximum failover capability can be achieved by hosting replica application servers in multiple clusters that are, in turn, hosted in multiple data centers.
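At the application level, failover across replicas can be sketched as trying each replica in priority order until one responds. This is a minimal illustration, not a description of any particular cloud provider's mechanism: the function `call_with_failover` is hypothetical, and each replica is modeled as a plain Python callable that raises `ConnectionError` when it is unreachable.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to each replica in order; return the first success."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next replica
    raise RuntimeError(f"all {len(replicas)} replicas failed") from last_error


def failed_replica(request: str) -> str:
    # Stand-in for a replica in a cluster that is currently down.
    raise ConnectionError("replica unreachable")


def healthy_replica(request: str) -> str:
    # Stand-in for a replica in another cluster or data center.
    return f"handled: {request}"


if __name__ == "__main__":
    print(call_with_failover([failed_replica, healthy_replica], "GET /status"))
```

Ordering the replica list from nearest to farthest (same cluster, other cluster, other data center) would reflect the layered strategy described above, with the more distant replicas used only when closer ones fail.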




Jhawar, R., & Piuri, V. (2013). Fault Tolerance and Resilience in Cloud Computing Environments. Retrieved July 1, 2016, from

Advantages and Disadvantages of Cloud Computing. (n.d.). Retrieved June 28, 2016, from


