Suppose you have a cluster.
And suppose you have many services on this cluster… you have n! problems.
Each service is likely to interrelate with other services through some kind of dependency.
And you just broke a key service.
How do you solve this problem? How do I solve this problem? I’ve been thinking about this on about three separate levels lately.
Let me formalize the problem as I understand it today, basing this formal description on the language and capabilities of software.
Let there be a directed graph G of vertices (nodes) V and edges E.
Each vertex is associated with a failure model.
Each edge is a triple (src, dest, failure_model). I’m not sure we need a failure model associated with an edge: if an edge can fail, it is possibly better modeled by an intermediate vertex lying between src and dest.
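Here is a minimal Python sketch of the structure so far, mostly to make the edge question concrete. Everything in it is my own assumption for illustration: a failure model is taken to be a function that says whether a component fails on a given step, `bernoulli`, `NEVER`, `Graph` and `split_edge` are hypothetical names, and the `"src->dest"` naming of the intermediate vertex is arbitrary.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumption: a failure model is a function that, given a random source,
# returns True if the component fails on this step.
FailureModel = Callable[[random.Random], bool]


def bernoulli(p: float) -> FailureModel:
    """Hypothetical failure model: fail independently with probability p."""
    return lambda rng: rng.random() < p


# A model that never fails (random() is always >= 0.0).
NEVER: FailureModel = bernoulli(0.0)


@dataclass
class Graph:
    # vertex name -> failure model
    vertices: Dict[str, FailureModel] = field(default_factory=dict)
    # each edge is the triple (src, dest, failure_model)
    edges: List[Tuple[str, str, FailureModel]] = field(default_factory=list)

    def add_vertex(self, name: str, model: FailureModel) -> None:
        self.vertices[name] = model

    def add_edge(self, src: str, dest: str, model: FailureModel = NEVER) -> None:
        self.edges.append((src, dest, model))

    def split_edge(self, index: int) -> str:
        """Replace a fallible edge with an intermediate vertex carrying its
        failure model, joined to src and dest by edges that never fail."""
        src, dest, model = self.edges.pop(index)
        mid = f"{src}->{dest}"
        self.add_vertex(mid, model)
        self.add_edge(src, mid)
        self.add_edge(mid, dest)
        return mid
```

With `split_edge`, the graph only ever needs failure models on vertices, which is the simplification I am leaning towards.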
This can be viewed in at least two forms (maybe more). Form one is a steady-state model giving a probability of failure: the network as a whole is considered to have a total probability of failure. Form two is a simulation view, where on each simulation step, the failure model for each node is executed and failures are propagated along edges. This has a tremendous similarity to the McCulloch-Pitts model of neural networks, although I do not presuppose training the weights on the edges.
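A rough sketch of the simulation view, and of how it could be used to approximate the steady-state number, might look like the following. The assumptions are mine: each node's failure model is an independent probability per step, an edge (src, dest) means a failure at src takes dest down with it, and the function names are made up for this post.

```python
import random
from collections import deque
from typing import Dict, List, Set, Tuple


def simulate_step(fail_prob: Dict[str, float],
                  edges: List[Tuple[str, str]],
                  rng: random.Random) -> Set[str]:
    """One simulation step: run each node's failure model, then propagate
    failures along edges until nothing new fails."""
    # Assumption: independent Bernoulli failure per node per step.
    failed = {v for v, p in fail_prob.items() if rng.random() < p}

    # Adjacency list: src -> nodes that fail when src fails.
    successors: Dict[str, List[str]] = {}
    for src, dest in edges:
        successors.setdefault(src, []).append(dest)

    # Breadth-first propagation of failures.
    queue = deque(failed)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in failed:
                failed.add(nxt)
                queue.append(nxt)
    return failed


def estimate_failure_probability(fail_prob: Dict[str, float],
                                 edges: List[Tuple[str, str]],
                                 target: str,
                                 trials: int = 10_000,
                                 seed: int = 0) -> float:
    """Monte Carlo estimate of the chance that `target` is down on a step."""
    rng = random.Random(seed)
    hits = sum(target in simulate_step(fail_prob, edges, rng)
               for _ in range(trials))
    return hits / trials
```

For a two-node chain where each node fails with probability 0.1, the downstream node is down when either one fails, so the estimate should land near 1 - 0.9² = 0.19. Running many independent steps like this is only one reading of "steady state"; it ignores recovery entirely.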
Let’s add a new spin to this: the supervisory node. The supervisory node can ‘unfail’ other (possibly supervisory) nodes, as well as add nodes. This of course models some kind of monitor spinning up a new VM instance or restarting a process. To add a node, we ought to indicate some kind of VM server M. M is not a ‘regular’ node (or should it be? just one with an unusual service? Hmmmm).
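To make the supervisory node a little less hand-wavy, here is a small Python sketch. It is only one way to read the idea: `Cluster`, `VMServer`, the `watches` mapping and the fixed `capacity` on M are all hypothetical details I am assuming for illustration, and M is kept outside the node set, as in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class VMServer:
    """The 'M' from the text. The capacity counter is an assumed detail."""
    capacity: int

    def allocate(self) -> bool:
        # Hand out one VM slot if any are left.
        if self.capacity <= 0:
            return False
        self.capacity -= 1
        return True


@dataclass
class Cluster:
    nodes: Set[str] = field(default_factory=set)
    failed: Set[str] = field(default_factory=set)
    # supervisor name -> names of nodes it watches (may include supervisors)
    watches: Dict[str, Set[str]] = field(default_factory=dict)

    def supervise(self) -> Set[str]:
        """One round: every healthy supervisor 'unfails' the failed nodes it
        watches. A supervisor revived in this round acts from the next one."""
        healthy = [s for s in self.watches if s not in self.failed]
        recovered: Set[str] = set()
        for sup in healthy:
            recovered |= self.watches[sup] & self.failed
        self.failed -= recovered
        return recovered

    def add_node(self, supervisor: str, name: str, vm: VMServer) -> bool:
        """A healthy supervisor adds a node, if M has room for it."""
        if supervisor not in self.watches or supervisor in self.failed:
            return False
        if not vm.allocate():
            return False
        self.nodes.add(name)
        return True
```

One thing this makes obvious: a failed supervisor can only come back if some other healthy supervisor watches it, so the `watches` relation needs at least one healthy root to recover anything.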
For the full complexity, supervisors can communicate with each other and take decisions. It is practically possible/probable to have a “control plane” communication network to handle these sorts of communications. For this post, let’s just assume that we can add an edge to the graph that models the control plane.
More formally, we have a tuple:
(V, E, M)
One idea that seems to be a useful framework is the Viable System Model (VSM). Suppose we make the crass and hard assumption that our cluster is a dynamic system. Then we have to introduce the systems required to maintain some kind of homeostasis.
Taking the VSM, we have 5 systems to consider:
The next post in this ‘series’ will be my thoughts on how to apply the VSM to configuring a software cluster.