Suppose you have a cluster


And suppose you have n services on this cluster... you have n! problems.

Each service is likely to depend on other services, forming some kind of dependency graph.

And you just broke a key service.

How do you solve this problem? How do I solve this problem? I’ve been thinking about this on about three separate levels lately.

  • Implementation - do you have some sort of heartbeat service? Are your failures discrete? Continuous? What do you do on failure? Do you have a database of running services and their pids? How are you polling for liveness?
  • Software/cluster - this is a very traditional problem in computer science. There is a lot of literature available on distributed systems, and there seems to be a loose consensus in the solutions pragmatically available online: master/slave (perhaps, as my old advisor posed, this should be captain/sergeant?), master/master, some fashion of failover, and gross redundancy via some kind of load balancer.
  • Abstract - distributed failure exists across a multitude of areas: crystals, transportation networks, power grids, software clusters, neural damage, etc. This seems to be an interesting problem. What do other - non-artificial - systems do to manage system-level change/damage? I’d like to draw from those models to really bring a novel twist. Anyway.

Let me formalize the problem as I understand it today, basing this formal description on the language and capabilities of software.

Let there be a directed graph G with vertices/nodes (V) and edges (E).

Each vertex is associated with a failure model.

Each edge is a triple (src, dest, failure_model). I’m not sure we need a failure model associated with an edge - possibly, if an edge can fail, it is better modeled by an intermediate vertex lying between src and dest.
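As a sketch, here is one way to write these definitions down in Python. All the names here (`FailureModel`, `Vertex`, `Edge`) are my own placeholders, and the failure model is deliberately the simplest one imaginable: fail independently with probability p.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureModel:
    """Simplest possible model: fail independently with probability p."""
    p: float

    def fails(self) -> bool:
        return random.random() < self.p

@dataclass
class Vertex:
    name: str
    failure_model: FailureModel
    failed: bool = False

@dataclass
class Edge:
    """The triple (src, dest, failure_model); the edge's own
    failure model is optional, per the caveat above."""
    src: str
    dest: str
    failure_model: Optional[FailureModel] = None
```

Richer failure models (correlated failures, failure rates over time) would slot in by replacing `FailureModel` without touching the graph shape.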

This can be viewed in at least two forms. Form one is a steady-state model giving a probability of failure: the network is considered to have a total probability of failure. Form two is a simulation view, where on each simulation step the failure model for each node is executed and failures are propagated along edges. This has a tremendous similarity to the McCulloch-Pitts model of neural networks, although I do not presuppose training the weights on the edges.
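A minimal sketch of the "form two" simulation view, using plain dicts and sets for brevity (vertices carry a per-step failure probability; the `step` function and the propagation rule are my own assumptions, not from any library):

```python
import random

def step(failed, vertices, edges, rng=random.random):
    """One simulation tick.

    failed:   set of currently-failed vertex names
    vertices: dict of name -> independent per-step failure probability
    edges:    list of (src, dest) pairs; a failed src takes dest down
    """
    # 1. Execute each still-healthy node's failure model.
    newly_failed = {v for v, p in vertices.items()
                    if v not in failed and rng() < p}
    # 2. Propagate existing failures along edges (one hop per tick).
    propagated = {dst for src, dst in edges if src in failed}
    return failed | newly_failed | propagated

vertices = {"db": 0.0, "api": 0.0, "web": 0.0}
edges = [("db", "api"), ("api", "web")]
state = {"db"}                         # the key service just broke
state = step(state, vertices, edges)   # {"db", "api"}
state = step(state, vertices, edges)   # {"db", "api", "web"}
```

Running `step` repeatedly over many random trials also gives a crude Monte Carlo estimate of form one, the steady-state total probability of failure.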

Let’s add a new spin to this: the supervisory node. The supervisory node can ‘unfail’ other (possibly supervisory) nodes, as well as add nodes. This of course models some kind of monitor spinning up a new VM instance or restarting the process. To add a node, we ought to indicate some kind of VM server M - M is not a ‘regular’ node (or should it be? just one with an unusual service? Hmmmm).

For the full complexity, supervisors can communicate with each other and make decisions. It is practically possible/probable to have a “control plane” communication network to handle these sorts of communications. For this post, let’s just assume that we can add an edge to the graph that models the control plane.

More formally, we have a tuple:

(V, E, M)
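A hedged sketch of what a supervisory node might do on each tick: it watches some subset of nodes and ‘unfails’ them, modelling a monitor that restarts a process or asks the VM server M for a fresh instance. The `supervise` function and its `capacity` parameter are my own assumptions (capacity models M only being able to spin up so many instances per tick):

```python
def supervise(failed, supervised, capacity):
    """Restart up to `capacity` failed supervised nodes per tick.

    failed:     set of currently-failed vertex names
    supervised: set of node names this supervisor watches
    capacity:   how many restarts M can perform this tick
    Returns (new failed set, list of nodes restarted).
    """
    # Sorted for determinism; a real supervisor would prioritize
    # by dependency order or blast radius instead.
    to_restart = sorted(failed & supervised)[:capacity]
    return failed - set(to_restart), to_restart

failed, restarted = supervise({"db", "api"}, {"db", "web"}, capacity=1)
# "db" is restarted; "api" stays failed because nothing supervises it
```

Interleaving `supervise` with the failure/propagation step above is where the interesting dynamics would show up: whether the cluster converges back to healthy, or whether failures propagate faster than M can restart.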

One idea that seems to be a useful framework is the Viable System Model. Suppose we make the crass and hard assumption that our cluster is a dynamic system. Then we have to introduce the systems required to maintain some kind of homeostasis.

Taking the VSM, we have 5 systems to consider:

  • Tactical - basic tasks
  • Communications
  • Policy/Process/Procedure
  • External monitoring feeding into other areas
  • Overall steering/guidance

The next post in this ‘series’ will be my thoughts on how to put together some kind of VSM model for configuring a software cluster.