Monday, February 11, 2013

How services live forever in the Cloud

Self-service cluster-based services (CBSs) are becoming more and more popular. For CBS vendors it is therefore all the more important to have a comprehensive reliability concept that, in addition to preventive protection measures, also includes methods for backup and disaster recovery.

In this post I would like to concentrate on one aspect of the cloud-based infrastructure – cluster management.

Cluster management addresses the complexity of handling a dynamic, large-scale system with many servers. Such systems must handle software and hardware failures, setup tasks such as bootstrapping data, and operational issues such as data placement, load balancing, planned upgrades, and cluster expansion.



Frameworks like Apache Helix provide an abstraction for a system developer to separate coordination and management tasks from component functional tasks of a distributed system.



The term cluster management is a broad one. Therefore I will define a set of common tasks required to run and maintain such an infrastructure. These tasks are:


Resource management: The resource (database, index, etc.) that the cloud service provides must be divided among different nodes in the cluster.
Fault tolerance: The CBS must continue to function amid node failures, which includes not losing data and maintaining read and write availability.
Elasticity: As workloads grow, clusters must grow to meet increased demand by adding more nodes. The CBS resources must be redistributed appropriately across the new nodes.
Monitoring: The cluster must be monitored, both for component failures that affect fault tolerance and for health metrics, such as load imbalance and SLA misses, that affect performance. Monitoring requires follow-up action, such as re-replicating lost data or rebalancing data across nodes.



There are different system-specific models of behavior. Before we take a look at a formal language for defining finite state machine behavior, I would like to introduce the basic terminology.

Node: A single machine.
Cluster: A collection of nodes, usually within a single data center, that operate collectively and constitute the CBS.
Resource: A logical entity defined by the CBS and whose purpose is specific to it. (Examples are a database, an application server, or a communication system.)
Partition: A subset of the resource. Resources are often too large, or must support too high a request rate, to be maintained in their entirety on a single node, so they are broken into pieces. The manner in which the resource is broken up is system-specific; one common approach for a database is to partition it horizontally and assign records to partitions by hashing their keys.
Replica: For reliability and performance, CBSs usually maintain multiple copies of each partition, stored on different nodes. Copies of the same partition are known as replicas.
State: The status of a partition replica in a CBS. A finite state machine defines all possible states of the system and the transitions between them. We also consider a partition's state to be the set of states of all of its replicas.
Transition: A CBS-defined action, specified in a finite state machine, that lets a replica move from one state to another.
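The hash-based partitioning mentioned in the Partition entry can be sketched in a few lines of Python. This is only an illustration: the function name partition_for, the sample keys, and the choice of SHA-256 as the hash are assumptions made for this example, not something prescribed by any particular CBS.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index by hashing the key.

    A cryptographic hash is used only because it is stable across
    processes and machines; any stable hash function would do.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical record keys, assigned to 4 partitions.
records = ["user:1", "user:2", "user:3", "order:17"]
assignment = {key: partition_for(key, 4) for key in records}

for key, partition in assignment.items():
    print(f"{key} -> partition {partition}")
```

The same key always lands in the same partition, so any node can locate a record without a central lookup table. Note that changing the number of partitions with this simple modulo scheme remaps most keys.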



Example:

Formal language for defining a finite state machine:

Each resource has a set of partitions P. Each partition p_i ∈ P has a replica set R(p_i); R_X(p_i) is the subset of R(p_i) in state X, and R_T(p_i) is the subset of R(p_i) undergoing transition T.
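To make the notation concrete, here is a minimal Python sketch of the three sets. The replica table is hypothetical (only the naming scheme R1.2, R2.1 follows this post), and the function names and the "SLAVE->MASTER" label for a transition are assumptions chosen for this example.

```python
# Hypothetical placement: replica id -> (partition, node, state).
replicas = {
    "R1.1": ("p1", "Node1", "MASTER"),
    "R1.2": ("p1", "Node2", "SLAVE"),
    "R2.1": ("p2", "Node2", "MASTER"),
    "R2.2": ("p2", "Node3", "SLAVE"),
}

# Replicas that are currently in the middle of a transition T.
in_transition = {"R2.2": "SLAVE->MASTER"}


def replica_set(partition):
    """R(p_i): all replicas of the given partition."""
    return {rid for rid, (part, _, _) in replicas.items() if part == partition}


def replicas_in_state(partition, state):
    """R_X(p_i): the replicas of the partition that are in state X."""
    return {rid for rid in replica_set(partition) if replicas[rid][2] == state}


def replicas_in_transition(partition, transition):
    """R_T(p_i): the replicas of the partition undergoing transition T."""
    return {rid for rid in replica_set(partition)
            if in_transition.get(rid) == transition}


print(sorted(replica_set("p2")))
print(sorted(replicas_in_state("p2", "MASTER")))
print(sorted(replicas_in_transition("p2", "SLAVE->MASTER")))
```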



A finite state machine has sufficient expressiveness for a system to describe, at a partition granularity, all its valid states and all legal state transitions:
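As a rough sketch, such a state machine can be written down as a set of states plus a set of legal transitions. The master/slave states follow the example in this post; the OFFLINE state, the exact transition set, and the function name next_state are assumptions made for illustration and are not taken from any framework's API.

```python
# Valid states of a single replica (OFFLINE is an assumed extra state).
STATES = {"OFFLINE", "SLAVE", "MASTER"}

# Legal transitions as (from_state, to_state) pairs. Note that a replica
# cannot jump directly from OFFLINE to MASTER in this sketch.
LEGAL_TRANSITIONS = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
    ("SLAVE", "OFFLINE"),
}


def next_state(current, target):
    """Return the target state if the transition is legal, else raise."""
    if current not in STATES or target not in STATES:
        raise ValueError(f"unknown state: {current!r} or {target!r}")
    if (current, target) not in LEGAL_TRANSITIONS:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


# Bring a new replica online and promote it step by step.
state = "OFFLINE"
for target in ("SLAVE", "MASTER"):
    state = next_state(state, target)
    print("replica is now", state)
```

Keeping the transition table as plain data is the point of the approach: the management layer can validate and plan transitions without knowing what the CBS actually does when a replica becomes a master or a slave.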



The diagram shows a graphical example of cluster management in the event of the failure of Node 3.

The cluster consists of four nodes (Node 1 to Node 4). In this case just one resource is defined for the cluster, let's say a database. The resource is broken into several pieces, the so-called partitions.

Each partition has two replicas (master and slave). The color of a replica indicates which node it is running on. For example, replicas R1.2 and R2.1 are running on Node 2. Only one replica per partition is allowed on each node.

In the event of a node failure, the system continues working without interruption or data loss. In this example Node 2 crashed, which means that replicas R2.1 and R3.2 are no longer available.

The cluster management then applies the formal declaration of the automated behavior shown in the diagram.

First, the rule that every partition has at most one master is applied. This means that replica R2.2 changes its state from slave to master.

Second, the rule that every partition has at least one slave is applied. So the slave replicas R2.1 and R3.2 are created on other nodes.

Please note that replica R3.2 cannot be created on Node 2 because of the last rule: at most one replica per partition is allowed on each node!
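The three rules can be combined into a small failover routine. The following Python sketch uses a hypothetical four-node layout that does not reproduce the diagram; the function name handle_node_failure, the tie-breaking by node name, and the "least loaded node" choice are assumptions made for this example, not the algorithm of any particular framework.

```python
from copy import deepcopy

REPLICAS_PER_PARTITION = 2  # one master plus one slave, as in the example


def handle_node_failure(placement, nodes, failed_node):
    """Return a new placement after failed_node has left the cluster.

    placement maps partition -> {node: state}; the input is not modified.
    """
    new_placement = deepcopy(placement)
    live_nodes = [n for n in nodes if n != failed_node]

    # Replicas that lived on the failed node are gone.
    for replicas in new_placement.values():
        replicas.pop(failed_node, None)

    # Rule 1: a partition without a master promotes a surviving slave.
    for replicas in new_placement.values():
        if replicas and "MASTER" not in replicas.values():
            replicas[sorted(replicas)[0]] = "MASTER"

    def load(node):
        # Number of replicas currently hosted on the node.
        return sum(node in r for r in new_placement.values())

    # Rule 2: every partition gets a slave again.
    for replicas in new_placement.values():
        while len(replicas) < REPLICAS_PER_PARTITION:
            # Rule 3: at most one replica per partition on each node.
            candidates = [n for n in live_nodes if n not in replicas]
            if not candidates:
                break  # not enough live nodes to restore full replication
            target = min(candidates, key=lambda n: (load(n), n))
            replicas[target] = "SLAVE"

    return new_placement


# Hypothetical layout: four partitions, two replicas each.
nodes = ["Node1", "Node2", "Node3", "Node4"]
placement = {
    "P1": {"Node1": "MASTER", "Node2": "SLAVE"},
    "P2": {"Node2": "MASTER", "Node3": "SLAVE"},
    "P3": {"Node3": "MASTER", "Node4": "SLAVE"},
    "P4": {"Node4": "MASTER", "Node1": "SLAVE"},
}

result = handle_node_failure(placement, nodes, "Node2")
for partition, replicas in result.items():
    print(partition, replicas)
```

In this layout the failure of Node2 removes the slave of P1 and the master of P2. The surviving replica of P2 on Node3 is promoted, and new slaves are then placed on live nodes that do not already hold a replica of the same partition.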

The text and illustrations in this post are very simplified examples of complex cluster management. The post is meant as a first introduction to, and overview of, this complex topic.

References:

  • Untangling Cluster Management With Helix. In SoCC’12

  • LinkedIn Data Infrastructure Team. Data infrastructure at LinkedIn. In ICDE, 2012.

  • helix.incubator.apache.org/